Zabbix server stops connecting to agent

  • xofer
    Junior Member
    • Jul 2010
    • 18

    #1

    Zabbix server stops connecting to agent

    From zabbix server log:

    2516:20100719:092222.631 Item [XXXXXXX:net.if.in[eth1,bytes]] error: Get value from agent failed: Cannot connect to [NN.NN.NN.NN:10050] [Interrupted system call]
    2516:20100719:092222.632 ZABBIX Host [XXXXXXX]: first network error, wait for 15 seconds
    2511:20100719:092223.278 Item [XXXXXXX:net.if.in[lo,bytes]] error: Get value from agent failed: Cannot connect to [NN.NN.NN.NN:10050] [Interrupted system call]
    2511:20100719:092223.280 ZABBIX Host [XXXXXXX]: another network error, wait for 15 seconds
    2513:20100719:092224.572 Item [XXXXXXX:system.cpu.load[,avg1]] error: Get value from agent failed: Cannot connect to [NN.NN.NN.NN:10050] [Interrupted system call]
    2513:20100719:092224.573 ZABBIX Host [XXXXXXX]: another network error, wait for 15 seconds
    That was over 2 hours ago.
    So it promises to check again in 15 seconds, but never does. The checks are in the queue.

    Absolutely no alarms are raised, the Z icon in host list stays soothingly green.

    Any ideas?
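    For anyone wanting to spot this condition from the log alone, here is a rough Python sketch. It assumes only the line layout visible in the excerpt above (`PID:YYYYMMDD:HHMMSS.mmm message`) and the two "network error" messages shown there; the function names and the 60-second gap are made up for illustration. It cannot tell whether polling quietly resumed, only how long ago the last network error was logged for each host.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3}) (.*)$")
# Only the two message variants seen in the excerpt are matched.
ERROR_RE = re.compile(r"ZABBIX Host \[([^\]]+)\]: (?:first|another) network error")


def last_network_errors(lines):
    """Return {host: datetime of the most recent network-error line}."""
    latest = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        _pid, day, clock, millis, message = m.groups()
        hit = ERROR_RE.search(message)
        if not hit:
            continue
        stamp = datetime.strptime(day + clock, "%Y%m%d%H%M%S").replace(
            microsecond=int(millis) * 1000
        )
        host = hit.group(1)
        if host not in latest or stamp > latest[host]:
            latest[host] = stamp
    return latest


def silent_hosts(lines, now, max_gap_seconds=60):
    """Hosts whose last logged network error is older than max_gap_seconds."""
    return sorted(
        host
        for host, stamp in last_network_errors(lines).items()
        if (now - stamp).total_seconds() > max_gap_seconds
    )
```

    Feeding it the server log and the current time gives a list of hosts worth checking by hand, for example with zabbix_get.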
  • xofer
    Junior Member
    • Jul 2010
    • 18

    #2
    After zabbix server restart the problem went away. Still, it looks like a bug to me. Zabbix agent was up on the host, but it wasn't polled by the server.

    Comment

    • kmradke
      Member
      • Aug 2009
      • 33

      #3
      Do you have any external scripts or web checks that may be hung? In my experience the zabbix timeout portion seems to never kill scripts that are locked up. I needed to add my own internal script timeouts to kill themselves when bad things happen.

      Next time this occurs, check the server for still running external scripts or web checks. I'll bet you will find one and killing it will get things going again.
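      A minimal sketch of the kind of self-imposed timeout described above, written in Python rather than whatever the original scripts used. The function name and the idea of returning `None` on timeout are assumptions for illustration; nothing here is part of Zabbix itself.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run argv; return (exit_code, stdout), or (None, "") if killed on timeout."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child and waits for it before
        # re-raising, so nothing is left hanging for the caller to clean up.
        return None, ""
    return done.returncode, done.stdout.strip()


if __name__ == "__main__":
    # Hypothetical usage: wrap the real check command given on the command line.
    code, output = run_with_timeout(sys.argv[1:], timeout_seconds=10)
    print(output)
    sys.exit(0 if code == 0 else 1)
```

      Note that only the direct child is killed; a check that spawns its own children may need process-group handling on top of this.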

      Comment

      • xofer
        Junior Member
        • Jul 2010
        • 18

        #4
        No. Just the stock Template_Linux with half of the items disabled. I am just starting to evaluate zabbix.

        I can understand that it hangs or fails to get the information for any number of reasons. What I cannot understand is why the host remains green with zero alarms raised for 6 hours. And by the way, because the server was not communicating with the agent, I of course did not get the real alarms either (I stopped sendmail for testing purposes, and that is how I discovered it).

        Anyway, IMHO the logic in monitoring should be that a lack of information is always an error. If an item should have new data every n seconds, then the server should consider raising an alarm when there is no data for 3*n seconds or so. If we do not want to activate triggers, the host should at least be marked as being in an error/offline state.
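        To make the proposed rule concrete, here is a small Python sketch of it. This is only an illustration of the "no data for 3*n seconds" idea from this post, not how Zabbix works internally; the function names and the tuple layout are invented.

```python
def is_stale(last_update, now, interval_seconds, factor=3):
    """True if no new value arrived within factor * interval_seconds."""
    return (now - last_update) > factor * interval_seconds


def host_state(items, now, factor=3):
    """items: iterable of (key, last_update_epoch, interval_seconds).

    Returns ("no data", [stale keys]) if any item is stale, else ("ok", []).
    """
    stale = [
        key
        for key, last_update, interval in items
        if is_stale(last_update, now, interval, factor)
    ]
    return ("no data", stale) if stale else ("ok", [])
```

        With a 30-second item last updated 100 seconds ago, `is_stale` returns True (100 > 90), so the host would be flagged instead of staying green.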

        Comment

        • xofer
          Junior Member
          • Jul 2010
          • 18

          #5
          Found this: https://support.zabbix.com/browse/ZBX-2091

          Seems to be the same bug.

          I am running Zabbix 1.8.2 on CentOS 5.5 x86_64, if that matters.

          Comment

          • kmradke
            Member
            • Aug 2009
            • 33

            #6
            It is easy to add a trigger that fires if no data is received for an item for a specified amount of time. Unfortunately, as you found, these are not set up by default. You most likely don't want one on every item, since that would fire lots of triggers if the whole agent went away. I usually trigger off .nodata on the agent status item.
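            As a rough illustration, a trigger of this kind boils down to one expression string. The Python helper below just assembles it in the 1.8-style `{host:key.function(param)}` form; the helper name is made up, and `agent.ping` with a 180-second window is an assumed example, not something taken from this thread.

```python
def nodata_trigger(host, item_key, seconds):
    """Build a trigger expression that is true when item_key has had no data
    for `seconds` seconds (assumed 1.8-style expression syntax)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return "{%s:%s.nodata(%d)}=1" % (host, item_key, seconds)


# Hypothetical example: one trigger per host on the agent availability item.
print(nodata_trigger("XXXXXXX", "agent.ping", 180))
```

            One such trigger per host, rather than per item, avoids the flood of alerts mentioned above when the whole agent goes away.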

            Comment

            • xofer
              Junior Member
              • Jul 2010
              • 18

              #7
              Well, in this case it would make more sense to create a trigger that fires when there are checks in the queue older than n minutes (except that support for this seems to arrive only in 1.8.3, according to the manual). But it is still a workaround, not a solution. It just alerts me that Zabbix is misbehaving.

              Now I have a single check (net.tcp.port from an agent) that has been in the queue for 10+ hours for a different agent. Both the server and the agent have been restarted repeatedly, and this check remains in the queue. I have verified that the check works using zabbix_get, and it worked in the past from Zabbix. Other checks from the same agent seem to go through.

              And while we are discussing workarounds: how do I zap the queue? Google didn't help here.
              Last edited by xofer; 21-07-2010, 11:21.

              Comment

              • xofer
                Junior Member
                • Jul 2010
                • 18

                #8
                I now have a check that has been in the queue for over 24 hours. Restarting the server or the agent does not make it go away.
                How can I clear the queue?

                Comment

                • xofer
                  Junior Member
                  • Jul 2010
                  • 18

                  #9
                  By now I have had the same check in the queue for over 5 days.

                  Is this normal for zabbix?

                  Comment

                  • xofer
                    Junior Member
                    • Jul 2010
                    • 18

                    #10
                    The same check has been in the queue for over 2 weeks now. The server and agent have been restarted several times in the meantime. Is there no way to clear the queue?

                    Comment

                    • bashman
                      Senior Member
                      • Dec 2009
                      • 432

                      #11
                      Originally posted by kmradke
                      Do you have any external scripts or web checks that may be hung? In my experience the zabbix timeout portion seems to never kill scripts that are locked up. I needed to add my own internal script timeouts to kill themselves when bad things happen.

                      Next time this occurs, check the server for still running external scripts or web checks. I'll bet you will find one and killing it will get things going again.

                      The Zabbix agent timeout seems to apply only to simple checks; it doesn't appear to work for external scripts or Zabbix agent items.

                      By the way, try "exit 0" at the end of all your scripts.
                      978 Hosts / 16.901 Items / 8.703 Triggers / 44 usr / 90,59 nvps / v1.8.15

                      Comment

                      • bashman
                        Senior Member
                        • Dec 2009
                        • 432

                        #12
                        Originally posted by xofer
                        Well, in this case it would make more sense to create a trigger that fires when there are checks in the queue older than n minutes (except that support for this seems to arrive only in 1.8.3, according to the manual). But it is still a workaround, not a solution. It just alerts me that Zabbix is misbehaving.

                        Now I have a single check (net.tcp.port from an agent) that has been in the queue for 10+ hours for a different agent. Both the server and the agent have been restarted repeatedly, and this check remains in the queue. I have verified that the check works using zabbix_get, and it worked in the past from Zabbix. Other checks from the same agent seem to go through.

                        And while we are discussing workarounds: how do I zap the queue? Google didn't help here.
                        Your "net.tcp.port" item could be stuck in the queue due to this problem:


                        I have the same problem.
                        978 Hosts / 16.901 Items / 8.703 Triggers / 44 usr / 90,59 nvps / v1.8.15

                        Comment
