Availability of one host breaks monitoring of the rest? 1.1beta5

  • cameronsto
    Senior Member
    • Oct 2005
    • 148

    #1

    Availability of one host breaks monitoring of the rest? 1.1beta5

    Just noticed an issue in our Zabbix box that seems to indicate a problem with retrieving data from hosts while one other host is unavailable.

    Starting at 12:39pm today one host went unavailable with errors in the server log of "Getting value of...failed". This continued until 12:50 when it started reporting "Timeout while connecting to [....]\nHost [...] will be checked after [60] seconds." The Zabbix agent log reported "Timeout while answering request" at 12:39 with no other entries after that.

    During that same time period Zabbix collected no other data from any other hosts. It looks like the availability of one host was blocking requests for items from other hosts.

    I'll be happy to provide logs and assist in troubleshooting as this seems like a pretty big issue in my opinion.

    -cameron
  • KarmaPolice
    Member
    • Oct 2005
    • 95

    #2
    I've had similar issues in the past, but have never been able to track down the problem. I had to disable the server it was hanging on (in the hosts screen) to get things going again.

    • Nate Bell
      Senior Member
      • Feb 2005
      • 141

      #3
      I've had this happen a few times with 1.1beta1 (strangely, always when I leave on vacation as well). My solution was the same: disable the offending host, let Zabbix catch up, and then reinstate the offending host once its error has been corrected. I haven't tried a version of Zabbix past beta1, but I do hope this is resolved before 1.1 is released, since Zabbix should be able to handle a host breaking down without failing itself.

      Nate

      • ad@kbc-clearing.com
        Member
        • Sep 2005
        • 77

        #4
        We are running the 1.1.1 release, but the problem is still there.
        Don't know how to solve it.....

        • edeus
          Senior Member
          • Aug 2005
          • 120

          #5
          I have had the same problem with 1.1.1.

          Also, if a host goes down, Zabbix doesn't start checking it again automatically, even after it comes back up.

          I have had to disable the host and/or restart the Zabbix server.

          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
            Zabbix Certified Specialist
            Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Originally posted by edeus
            Also, if a host goes down, Zabbix doesn't start checking it again automatically, even after it comes back up.
            It cannot be true!
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga

            • Nate Bell
              Senior Member
              • Feb 2005
              • 141

              #7
              Our phone server stopped responding (it ended up in a sort of zombie state) yesterday and Zabbix handled it very poorly. Instead of skipping past the host and checking the rest of the items in the queue, the queue got backed up and didn't resolve until we reset the phone server. The biggest problem is Zabbix didn't report the host as being down since the queue stopped, and we only realized our phone server was down when someone tried to make a call.

              The behavior I would expect is, if Zabbix runs into a situation where something isn't responding in a timely manner then throw a warning/flag/trigger and move on to the next item. I'm not sure what hardware state triggers Zabbix to fail like this, but since it makes Zabbix grind to a halt with no warning, and doesn't resolve until a user somehow notices and fixes it, this is a pretty serious bug. Unless there is a configuration issue that can resolve this, it makes Zabbix very unreliable for me.
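The behavior described above (flag the unresponsive host, then move on) can be sketched as follows. This is only an illustration of the expected logic, not Zabbix's actual implementation; `poll_once`, `fake_fetch`, and the host names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def poll_once(
    hosts: Iterable[str],
    fetch: Callable[[str, float], str],
    timeout: float = 3.0,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Poll every host once; a failing host is flagged and skipped."""
    values: Dict[str, str] = {}
    unavailable: Dict[str, str] = {}
    for host in hosts:
        try:
            # fetch() is expected to give up after `timeout` seconds.
            values[host] = fetch(host, timeout)
        except OSError as exc:
            # TimeoutError and socket errors are OSError subclasses.
            # Record the failure (this is where a trigger would fire)
            # and carry on with the next host instead of stalling.
            unavailable[host] = str(exc) or type(exc).__name__
    return values, unavailable


def fake_fetch(host: str, timeout: float) -> str:
    """Hypothetical stand-in for a real agent request."""
    if host == "phone-server":
        raise TimeoutError(f"no answer within {timeout} s")
    return "ok"


values, unavailable = poll_once(["web01", "phone-server", "db01"], fake_fetch)
print(values)       # hosts that answered
print(unavailable)  # hosts that were flagged and skipped
```

The point of the sketch is that one failing host costs at most one timeout per pass and never blocks the hosts queued behind it.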

              Alexei, have you encountered this when testing Zabbix, and is work being done to correct it? If you haven't encountered it: the last time it happened to me, the server that triggered the problem was locked up because a hard drive had a timeout error. If need be, those of us who have seen this could put together a list of the conditions under which it happens.

              Thanks,
              Nate

              • dantheman
                Senior Member
                • May 2006
                • 209

                #8
                I am running 1.1.1 right now and am not seeing this behavior. Zabbix works exactly as expected: alerts are being sent, triggers are changing, and it's working incredibly well.

                • Alexei
                  Founder, CEO
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist
                  Zabbix Certified Professional
                  • Sep 2004
                  • 5654

                  #9
                  Nate,

                  Thanks for the additional details. When it detects a timeout or a network error, ZABBIX immediately skips the host and continues processing the rest of the hosts. One separate ZABBIX server process takes care of the timed-out hosts.

                  Perhaps you have Timeout set too high in ZABBIX server's configuration file? Default 3 sec is OK and I wouldn't increase it.
                  Alexei Vladishev
                  Creator of Zabbix, Product manager
                  New York | Tokyo | Riga

                  • ad@kbc-clearing.com
                    Member
                    • Sep 2005
                    • 77

                    #10
                    We noticed that the "UnavailableDelay" parameter plays an important role in polling unavailable hosts.
                    By default this parameter is commented out in the server config (with value 45).
                    If you set it to a small value (e.g. 15), an unavailable host will be polled every 15 seconds. This can be very time-consuming if several hosts are unavailable, and the queue will become very long.
                    If you make UnavailableDelay a bit larger (e.g. 120), the queue no longer fills up.
                    It will then take up to 2 minutes before Zabbix finds out that an unavailable host is available again. We find this acceptable.
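A rough back-of-the-envelope model of this trade-off is sketched below. It assumes (a simplification, not a description of Zabbix internals) that a single process rechecks all unavailable hosts and that each recheck blocks for the full Timeout. The function name `busy_fraction` and the figure of 5 unavailable hosts are made up for illustration.

```python
def busy_fraction(unavailable_hosts: int, timeout_s: float,
                  unavailable_delay_s: float) -> float:
    """Share of one process's time spent waiting on unavailable hosts.

    Each host is rechecked once per `unavailable_delay_s` and each
    recheck is assumed to block for `timeout_s`. Capped at 1.0, which
    means the process can no longer keep up and a queue builds.
    """
    return min(1.0, unavailable_hosts * timeout_s / unavailable_delay_s)


# Hypothetical scenario: 5 unavailable hosts, Timeout of 3 seconds.
for delay in (15, 45, 120):
    print(f"UnavailableDelay={delay}: {busy_fraction(5, 3, delay):.1%} busy")
```

Under these assumptions a delay of 15 seconds keeps the process fully occupied by dead hosts, while 120 seconds leaves most of its time free, at the cost of slower detection of recovery.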

                    • Nate Bell
                      Senior Member
                      • Feb 2005
                      • 141

                      #11
                      Originally posted by Alexei
                      Perhaps you have Timeout set too high in ZABBIX server's configuration file? Default 3 sec is OK and I wouldn't increase it.
                      Hmm. Good idea. My server configuration was set to 5, but I'll lower it to 3. I'll also look at the agent timeouts, which may be set higher to accommodate slow UserParameters. Could those be a source of the problem?

                      Nate
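For anyone wanting to check the two settings discussed in this thread, here is a small sketch that lists which parameters are actually active in a server config. It assumes the usual `Key=Value` layout with `#` marking commented-out lines. The sample text is illustrative, not a real configuration, and `active_settings` is a hypothetical helper.

```python
from typing import Dict


def active_settings(conf_text: str) -> Dict[str, str]:
    """Return the Key=Value pairs that are not commented out."""
    settings: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out: the built-in default applies
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Illustrative contents only; the values are the ones quoted in the thread.
sample = """\
# UnavailableDelay=45
Timeout=5
"""

settings = active_settings(sample)
timeout = int(settings.get("Timeout", "3"))  # 3 is the default cited above
if timeout > 3:
    print(f"Timeout={timeout} is above the default of 3")
if "UnavailableDelay" not in settings:
    print("UnavailableDelay is commented out; the server default applies")
```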
