Another thread for nasty UNREACHABLE-problem

  • nms_user
    Member
    • Feb 2009
    • 43

    #1

    Another thread for nasty UNREACHABLE-problem

    Hi all,

    I have searched the whole forum, and many people seem to have the "xxx is unreachable" problem even though the machine is alive and the agent is listening on its port (telnet check). In short: everything is fine.

    We have this issue too; it appears randomly on nearly every machine we monitor.

    - Sometimes it goes away after restarting the agent in one step
    - Sometimes I have to stop the agent, wait several minutes and then start it again
    - Sometimes I have to set the machine to "not monitored" and back to "monitored" in the Zabbix GUI
    - Sometimes none of the above works: the host remains unreachable for days, even if the whole machine is rebooted

    The suggested workaround, not using the unreachable trigger and relying on the ping check instead, gives me a headache: I would have to set up a great many dependencies from all the triggers to the ping trigger, which I'd like to avoid...

    Playing around with the timeout variables (unreachable/unavailable delays) in zabbix_server.conf doesn't help either.

    The problem has been present in our installation since v. 1.4.6, in all combinations of server and agent versions.

    How is the unreachable trigger determined (some internal calculation, but which)? What has to happen for it to fire, and for it to return to OK?

    Does anybody have a solution?
    Last edited by nms_user; 23-06-2009, 09:24.
  • Palmertree
    Senior Member
    • Sep 2005
    • 746

    #2
    Unreachable is determined by the zabbix_server daemon. Make sure you have the following settings in your zabbix_server.conf file. You might need to increase them if there is latency in the network. Also, make sure you route only out of one interface if you have a multi-homed box.

    zabbix_server.conf file:
    Code:
    # After how many seconds of unreachability treat a host as unavailable
    UnreachablePeriod=60
    
    # How often to check the host for availability during the unavailability period
    UnavailableDelay=60
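To confirm which of these values are actually active (and not commented out) in a given config file, something like the following could be used. This is only a rough sketch: the function name is made up for illustration, and the file path is whatever your installation uses.

```shell
#!/bin/bash
# Print the uncommented UnreachablePeriod / UnavailableDelay lines from a
# zabbix_server.conf-style file. The function name is hypothetical; the
# caller supplies the path to the config file.
show_unreachable_settings() {
    grep -E '^[[:space:]]*(UnreachablePeriod|UnavailableDelay)[[:space:]]*=' "$1"
}
```

Call it as `show_unreachable_settings /path/to/zabbix_server.conf`. Keep in mind that the server only reads this file at startup, so changed values need a restart of zabbix_server to take effect.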


    • nms_user
      Member
      • Feb 2009
      • 43

      #3
      "Unreachable is determined by the zabbix_server daemon"
      Yes, but how exactly? Since the .status item doesn't generate a constant data flow, how does the server calculate this?

      "Make sure you have the following settings in your zabbix_server.conf file"
      I had already played around with these variables, but have now set them exactly to the values you posted.

      "Also, make sure you route only out of one interface if you have a multi-homed box"
      Neither the Zabbix machine nor most of the servers have two or more interfaces/IPs.

      Thanks
      Last edited by nms_user; 23-06-2009, 14:29.


      • nms_user
        Member
        • Feb 2009
        • 43

        #4
        Hello,

        I can update this case a little bit.

        Changing the timeout variables didn't solve anything; the problem occurred again.

        Last night two servers from the same subsidiary went "unreachable", but I can ping them from the Zabbix server, and connecting to port 10050 also works fine. And yes, they have only one interface...
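For anyone who wants to script the same port check instead of using telnet, here is a minimal sketch using bash's built-in /dev/tcp redirection. The function name and the 3-second limit are arbitrary choices for illustration, and it assumes the coreutils `timeout` command is available.

```shell
#!/bin/bash
# Return 0 if a TCP connection to HOST PORT can be opened within 3 seconds,
# non-zero otherwise. Uses bash's /dev/tcp, so neither telnet nor nc is needed.
# The function name and the timeout value are illustrative assumptions.
port_open() {
    local host=$1 port=$2
    # Host and port are passed as $0 and $1 to the inner shell.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example: `port_open 192.168.x.1 10050 && echo "port answers"` (with the real address filled in). Note that a successful connect only shows that something is listening on the port, not that the agent actually answers item requests.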

        Here is an excerpt from the zabbix_server log file (anonymized). Maybe the WAN connection had a short outage, but why doesn't the Zabbix server then take the hosts out of unreachable as soon as the connection comes back up?

        ...lots of "Send list of active checks" all the time*...
        26152:20090626:021526 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        26153:20090626:021726 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        26121:20090626:021933 Item [Host_srv1:net.tcp.port[192.168.x.1,53]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
        26121:20090626:021933 Parameter [net.tcp.port[192.168.x.1,53]] will be checked after 120 seconds on host [Host_srv1]
        26122:20090626:021936 Item [Host_srv1:net.tcp.port[192.168.x.1,389]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
        26122:20090626:021936 Parameter [net.tcp.port[192.168.x.1,389]] will be checked after 120 seconds on host [Host_srv1]
        26156:20090626:022047 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        26121:20090626:022202 Item [Host_srv1:net.tcp.port[192.168.x.1,53]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
        26121:20090626:022202 Parameter [net.tcp.port[192.168.x.1,53]] will be checked after 120 seconds on host [Host_srv1]
        26122:20090626:022206 Item [Host_srv1:net.tcp.port[192.168.x.1,389]] error: Get value from agent failed: Cannot connect to [srv1.domain1:10050] [Interrupted system call]
        26122:20090626:022206 Parameter [net.tcp.port[192.168.x.1,389]] will be checked after 120 seconds on host [Host_srv1]
        26152:20090626:022933 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        26151:20090626:023233 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        26151:20090626:023433 Send list of active checks to [192.168.x.1] failed: host [srv1.domain1] not found
        ...lots of "Send list of active checks" all the time*...

        *We don't use active checks right now...
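To see at a glance which hosts produce the "Get value from agent failed" errors and how often, the log can be summarized per host. A rough sketch, assuming the log lines look like the excerpt above; the function name is made up, and the log path is whatever LogFile points to in your setup.

```shell
#!/bin/bash
# Count "Get value from agent failed" lines per host in a zabbix_server log.
# Assumes the line format shown in the excerpt: ... Item [HOST:key] error: ...
# Prints one "HOST COUNT" pair per line. The function name is hypothetical.
count_agent_failures() {
    grep 'Get value from agent failed' "$1" \
        | sed -n 's/^.* Item \[\([^]:]*\):.*$/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Usage would be `count_agent_failures /path/to/zabbix_server.log`. Comparing the timestamps of the last failure with the time the host status went back to Up should show whether polling really recovered.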

        Thanks


        • richlv
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Oct 2005
          • 3112

          #5
          1. if you do not use active checks, disable them in the agentd config (and restart agentd afterwards);

          2. could it be that some checks respond, but some don't ? if a host has gone into the unreachable state, zabbix polls one item until data is retrieved. if that item does not respond, zabbix does not attempt to gather data for other items
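The behaviour described in point 2 can be pictured with a small toy model. This is only an illustration of the description above, not Zabbix source code; the function name, the state names and the "ok"/"fail" inputs are all assumptions.

```shell
#!/bin/bash
# Toy model of the polling behaviour described above (NOT Zabbix source code).
# Usage: poll_cycle STATE RESULT...
#   STATE   "reachable" or "unreachable" (host state before this cycle)
#   RESULT  "ok" or "fail", one per item, in polling order
# Prints the host state after the cycle and how many items were polled.
poll_cycle() {
    local state=$1
    shift
    if [ "$state" = "unreachable" ]; then
        # Only one item is tried; the other items wait until it answers.
        if [ "${1:-fail}" = "ok" ]; then
            state=reachable
        fi
        echo "$state polled=1"
    else
        # A reachable host gets all of its items polled.
        echo "$state polled=$#"
    fi
}
```

In this model, `poll_cycle unreachable fail ok ok` leaves the host unreachable even though two of the three items would have answered, which is the situation point 2 asks about.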
          Zabbix 3.0 Network Monitoring book


          • nms_user
            Member
            • Feb 2009
            • 43

            #6
            Now I'm a little confused.

            The dashboard (system status panel) shows the servers as unreachable, by now for 9 hours and x minutes.

            If I look at the latest data for the two machines, every item's check time is current.

            Also, the last value of host.status shows the server as up:
            2009.Jun.26 02:28:46 Up (0)
            2009.Jun.26 02:23:25 Unreachable (2)

            So the problem lies with the trigger, which is the installation default:
            {srv1.domain1:status.last(0)}=2

            Why does the trigger stay in its fired state?
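As I read that default expression, the trigger should be in a problem state only while the most recent status value is 2. A minimal sketch of that check follows; the function name and the input format (lines copied from the latest-data view, newest first) are assumptions for illustration.

```shell
#!/bin/bash
# Return 0 if the trigger {host:status.last(0)}=2 should be firing, judging
# from history lines on stdin (newest first), e.g. "2009.Jun.26 02:28:46 Up (0)".
# The function name and input format are illustrative assumptions.
trigger_should_fire() {
    local value
    # The newest entry is the first line; the numeric value sits inside "(...)".
    value=$(head -n 1 | sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p')
    [ "$value" = "2" ]
}
```

Fed with the two history lines above, the check returns non-zero, so by this reading the trigger should already have gone back to OK.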


            • nms_user
              Member
              • Feb 2009
              • 43

              #7
              Hello,

              no ideas at all?

              Here is the same trigger mess on another machine:
              2009.Jun.29 11:39:40 Up (0) --> after starting the agent again the trigger went away
              2009.Jun.29 11:36:39 Unreachable (2) --> here I stopped the agent for a few minutes
              2009.Jun.27 22:55:01 Up (0) --> trigger didn't clear; the whole escalation with SMSes went through over the following hours
              2009.Jun.27 22:52:45 Unreachable (2) --> trigger went unreachable

              I think there are two separate problems:
              1) the server logic sets the host to 'Up', but the trigger doesn't go away
              2) the server logic marks the host as 'Down', but all the items are still being polled

              This trigger problem only occurs on the .status check. Apart from this, our Zabbix environment is running fine (no broken installation); only this UNREACHABLE issue gives us headaches, because of the many false positives sent as SMS escalations.
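To make the mismatch easier to spot, the status history can be turned into the list of trigger events one would expect, which can then be compared with the actual event history. A rough sketch, assuming input lines of the form "DATE TIME VALUE" in chronological order (oldest first); the function name and the PROBLEM/OK labels are illustrative.

```shell
#!/bin/bash
# Read "DATE TIME VALUE" lines (oldest first) on stdin and print one line for
# every point where a trigger of the form status.last(0)=2 should change state.
# The function name and the PROBLEM/OK labels are illustrative assumptions.
expected_events() {
    awk '
        {
            state = ($NF == 2) ? "PROBLEM" : "OK"
            # Print only when the state differs from the previous line.
            if (state != prev) print $1, $2, state
            prev = state
        }'
}
```

With the four values listed above this yields four expected events (PROBLEM, OK, PROBLEM, OK); any of them missing from the trigger's event history would point to the event not being generated.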


              • richlv
                Senior Member
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Oct 2005
                • 3112

                #8
                could you post screenshots of :

                1. latest values for .status item, showing it going to 0;
                2. event history for that trigger during the same period (with "Show unknown" option marked).
                Last edited by richlv; 29-06-2009, 12:21. Reason: "Show unknown" note


                • nms_user
                  Member
                  • Feb 2009
                  • 43

                  #9
                  Hi,

                  see the requested screenshots attached.
                  Last edited by nms_user; 23-07-2015, 11:21.


                  • richlv
                    Senior Member
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist
                    Zabbix Certified Professional
                    • Oct 2005
                    • 3112

                    #10
                    hmm. it indeed looks like two of the data changes (27/22:55:01 and 29/11:36:39) haven't generated events as i'd expect. which particular version is this ?
                    if you still see such a problem with 1.6.5, maybe it's worth reporting on the tracker. looks like a possible bug and i'm out of ideas

                    ps. one more idea, though - is that trigger depending on anything ?
                    Last edited by richlv; 29-06-2009, 14:09. Reason: dependency idea


                    • nms_user
                      Member
                      • Feb 2009
                      • 43

                      #11
                      "which particular version is this ?"
                      We are on 1.6.5 ...

                      "maybe it's worth reporting on the tracker"
                      Opened one now.

                      "one more idea, though - is that trigger depending on anything?"
                      Yes, most machines depend on their next gateway/infrastructure component. But the dependencies are correct; our terminal servers, for example, depend on the central core switches and have this problem frequently...


                      Thanks for your assistance!


                      • flo
                        Junior Member
                        • Jun 2009
                        • 2

                        #12
                        Hi,

                        we have the same issue, also running Zabbix 1.6.5; the OS is CentOS 5.3


                        • abix_adamj
                          Junior Member
                          • Jun 2008
                          • 3

                          #13
                          I want to report the same situation with agentd running on Debian Lenny 5.0.1 in an OpenVZ container, with a Zabbix 1.6.5 server. The same agentd running on a regular openSUSE 11.0 (the same machine as the Zabbix server) works perfectly well.

                          Maybe OpenVZ has something to do with it?

                          Adam


                          • bek99
                            Junior Member
                            • May 2009
                            • 9

                            #14
                            This might be an agent.ping issue. Specifically, if I run netcat in a looping shell script to listen on the agentd port, I never get any trigger alerts from the Zabbix server saying that the agent is down, nor do any of the associated default triggers for the various checks (disk space, etc.) fire when no response is received.

                            I am on version 1.6.5 as well.

                            Here's the script I used with netcat.

                            #!/bin/bash
                            # Keep something listening on the agent port: nc exits after
                            # each connection, so restart it in an endless loop.
                            while true
                            do
                                nc -l -p 10050
                            done

                            -b


                            • bek99
                              Junior Member
                              • May 2009
                              • 9

                              #15
                              Just to follow up on this: it does not occur on 1.6.4, where the behavior is correct and the unreachable trigger does alert.

                              Two Zabbix servers, both on Ubuntu Hardy 8.04 64-bit:

                              one on 1.6.5
                              one on 1.6.4

                              Zabbix agents: 1.6.4 and one 1:1.4.2-4ubuntu3

                              Zabbix server 1.6.5:

                              If the agent was 1.6.4 and is replaced with the loop script above, it does trigger an unreachable with agent.ping and/or the proc count of zabbix_agentd
                              If the agent was 1.4.2 and is replaced with the loop script above, it does NOT trigger an unreachable with agent.ping and/or the proc count of zabbix_agentd


                              Zabbix server 1.6.4:
                              Exhibited correct behavior. If the agent is replaced with the loop script, the unreachable trigger does fire with either agent version.
                              Last edited by bek99; 21-07-2009, 23:22. Reason: Adding more info
