Active agents stop sending data (1.1.6)

  • glut0r
    Member
    • Mar 2007
    • 38

    #1

    Active agents stop sending data (1.1.6)

    We have a few active agents running. From time to time, especially after longer network outages that prevent them from connecting to zabbix-server, they hang in a strange state: they stop sending active data but still answer agent.ping. The only way to make them talk again is to restart them, which is quite annoying.
    Below is a snippet from the log:

    Code:
    029195:20070413:103202 OK
    029195:20070413:103202 In send_value([1232076])
    029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>dmZzLmZzLnNpemVbL3RtcCxmcmVlXQ==</key><data>MTIzMjA3Ng==</data></req>]
    029195:20070413:103202 OK
    029195:20070413:103202 In send_value([1035296])
    029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>dmZzLmZzLnNpemVbL3ZhcixmcmVlXQ==</key><data>MTAzNTI5Ng==</data></req>]
    029195:20070413:103202 OK
    029195:20070413:103202 No sleeping
    029195:20070413:103202 In send_value([2426958])
    029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>c3lzdGVtLnVwdGltZQ==</key><data>MjQyNjk1OA==</data></req>]
    029195:20070413:103202 OK
    029195:20070413:103202 Sleeping for 1 seconds
    029195:20070413:103203 In send_value([0])
    029195:20070413:103203 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>bmV0LmlmLmluW2V0aDEsZXJyb3JzXQ==</key><data>MA==</data></req>]
    029194:20070413:104852 In check_security()
    029194:20070413:104852 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
    029194:20070413:104852 Before read()
    029194:20070413:104852 After read() 2 [11]
    029194:20070413:104852 Got line:agent.ping
    029194:20070413:104852 Sending back:1
    029191:20070413:104922 In check_security()
    029191:20070413:104922 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
    029191:20070413:104922 Before read()
    029191:20070413:104922 After read() 2 [11]
    029191:20070413:104922 Got line:agent.ping
    029191:20070413:104922 Sending back:1
    029192:20070413:105022 In check_security()
    029192:20070413:105022 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
    It just hangs there sending no data. The server notices, the trigger fires, and the appropriate alert is generated. As you might have noticed, the agent.ping trigger does not fire, because agent.ping is not one of the active checks.
    I guess it's the agent that hung.
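
    The host, key and data fields in the "XML before sending" lines above look like base64 text. A minimal Python sketch for decoding one such line, assuming the field layout seen in this log (the function name decode_request is made up for illustration and is not part of Zabbix):

```python
import base64
import re

# Matches <host>..</host>, <key>..</key> and <data>..</data> in a log line.
FIELD_RE = re.compile(r"<(host|key|data)>([^<]*)</\1>")


def decode_request(log_line):
    """Return the base64-decoded host/key/data fields found in one log line."""
    return {
        name: base64.b64decode(value).decode("utf-8", errors="replace")
        for name, value in FIELD_RE.findall(log_line)
    }


if __name__ == "__main__":
    sample = (
        "029195:20070413:103202 XML before sending "
        "[<req><key>c3lzdGVtLnVwdGltZQ==</key>"
        "<data>MjQyNjk1OA==</data></req>]"
    )
    fields = decode_request(sample)
    print(fields["key"], fields["data"])  # prints: system.uptime 2426958
```

    The decoded data value matches the number in the preceding send_value([2426958]) line, so the last item the agent tried to send before going quiet can be read straight from the log.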

    After the agent is restarted, the log shows an ordinary startup sequence:

    Code:
    029193:20070413:133123 Got line:agent.ping
    029193:20070413:133123 Sending back:1
    029190:20070413:133140 Got signal. Exiting ...
    029195:20070413:133140 Got signal. Exiting ...
    029194:20070413:133140 Got signal. Exiting ...
    029193:20070413:133140 Got signal. Exiting ...
    029192:20070413:133140 Got signal. Exiting ...
    029191:20070413:133140 Got signal. Exiting ...
    029190:20070413:133140 One child process died. Exiting ...
    029190:20070413:133140 Cannot remove STAT file [/tmp/zabbix_agentd.tmp]
    029190:20070413:133140 Cannot remove PID file [/var/run/zabbix-agent/zabbix_agentd.pid]
    011963:20070413:133145 zabbix_agentd started. ZABBIX 1.1.6.
    011964:20070413:133145 zabbix_agentd 11964 started
    011965:20070413:133145 zabbix_agentd 11965 started
    011966:20070413:133145 zabbix_agentd 11966 started
    011967:20070413:133145 zabbix_agentd 11967 started
    011968:20070413:133146 zabbix_agentd 11968 started
    011968:20070413:133146 In init_list()
    011968:20070413:133146 In refresh_metrics()
    011968:20070413:133146 get_active_checks: host[foo.bar.com] port[10051]
    011968:20070413:133146 Sending [ZBX_GET_ACTIVE_CHECKS
    Any idea what's wrong? Or any other ideas for debugging this problem?
  • glut0r
    Member
    • Mar 2007
    • 38

    #2
    Don't like answering my own posts, but...
    I seem to have hit Debian bug #374758.
    Sometimes zabbix-agent/server hangs after being restarted. In my case it happened after logrotate ran over logs that had grown too large.
    After removing the logrotate stuff, none of my agents stopped sending data until today.
    It seems this bug will not be fixed until 1.4, so I'll wait.

    Comment

    • glut0r
      Member
      • Mar 2007
      • 38

      #3
      Again, replying to myself, eh.

      One problem solved, another rediscovered...

      It seems agents stop sending data after a specific period of time when communication with the server has been disrupted (due to a network outage or the like).

      It seems I need to switch all my agents off active mode.

      Do you guys use active agents at all?

      Comment

      • bbrendon
        Senior Member
        • Sep 2005
        • 870

        #4
        I only use active agents.

        Yes, I have experienced this behavior as well. Actually, just a few hours ago, our Zabbix server filled up /var where mysql stores the innodb tables and zabbix wasn't happy. After resolving the problem and restarting zabbix, about half of the Windows Agents had to be restarted before they would send information to the zabbix server. Very annoying.

        I mostly see the problem with win32 agents, but I do see it with unix as well.
        Unofficial Zabbix Expert
        Blog, Corporate Site

        Comment

        • glut0r
          Member
          • Mar 2007
          • 38

          #5
          Originally posted by infinity005
          Very annoying.
          Yeah, a real pain in the ass. Is anyone experiencing this with 1.3 agents?

          Anyway, agents simply cannot die just like this. I'll have to track this issue down. I'm amazed that nobody else has hit this before...
          Basically, when the agent dies we lose all monitored data...

          I'll keep this thread running. Any info appreciated.

          Comment

          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
            Zabbix Certified Specialist
            Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Is it with Windows agent only?
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga
            My Twitter

            Comment

            • glut0r
              Member
              • Mar 2007
              • 38

              #7
              Originally posted by Alexei
              Is it with Windows agent only?
              Not only. I fight it on Linux (Debian Etch that is).

              Comment

              • glut0r
                Member
                • Mar 2007
                • 38

                #8
                Alexei, would you possibly help with debugging this?
                Any information you need, I'll supply. I have no idea WTF is going on.
                Turning debug on only fills the log with nothing worth further attention.
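
                One way to cut through a noisy debug log is to measure how long ago the last send_value entry was written compared with the newest entry overall. A rough Python sketch, assuming the "PID:YYYYMMDD:HHMMSS message" line format seen in the first post (parse_line and stall_seconds are hypothetical helper names, not Zabbix tools):

```python
import re
from datetime import datetime

# Assumed line layout, taken from the log excerpts in this thread:
#   PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}) (.*)$")


def parse_line(line):
    """Split one log line into (pid, timestamp, message), or return None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pid, date, clock, message = match.groups()
    return int(pid), datetime.strptime(date + clock, "%Y%m%d%H%M%S"), message


def stall_seconds(lines):
    """Seconds between the last send_value entry and the newest log entry.

    Returns None when the log holds no send_value entry at all.
    """
    last_send = None
    newest = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, stamp, message = parsed
        if newest is None or stamp > newest:
            newest = stamp
        if message.startswith("In send_value("):
            last_send = stamp
    if last_send is None:
        return None
    return (newest - last_send).total_seconds()
```

                Fed the excerpt from the first post, this would report the gap between the last send_value at 10:32:03 and the later agent.ping entries, which is the window in which the active side went silent while the passive side kept answering.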

                Comment

                • glut0r
                  Member
                  • Mar 2007
                  • 38

                  #9
                  It seems I have kind of 'solved' this problem. For a short time my agents were 1.1.6 while my server was 1.1.7. After upgrading all agents, all seems fine.
                  I really hate this stuff; I'd rather sleep than waste my time chasing shadows, eh.

                  Comment

                  • bbrendon
                    Senior Member
                    • Sep 2005
                    • 870

                    #10
                    I'm on 1.1.6 across the board and see it. Don't sleep too tightly.

                    Comment

                    • jfl
                      Junior Member
                      Zabbix Certified Specialist
                      • Aug 2006
                      • 2

                      #11
                      I also have this problem with server 1.1.7 on RHEL4 and agents 1.1.7 on Linux (many distributions).

                      However, it's not always related to a network outage. Sometimes one agent just stops sending data to the zabbix server.

                      To get the agent back, I just restart it.

                      If I can do anything to help, let me know!

                      Comment

                      • glut0r
                        Member
                        • Mar 2007
                        • 38

                        #12
                        Heck, I've hit it again, this time on only one host. But I've found this in the hung agent's log:

                        014120:20070429:140257 gethostbyname() failed [Host name lookup failure]
                        014120:20070429:140257 Getting list of active checks failed. Will retry after 60 seconds

                        This means it died when the DNS replies didn't make it.

                        I'll try editing /etc/hosts to make sure the zabbix server name always resolves, and we'll see then.
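
                        Before and after touching /etc/hosts, a quick check that the server name resolves on the agent host could look like the Python sketch below. It assumes nothing about Zabbix itself; can_resolve is a hypothetical helper, and the resolver argument exists only so the check can be exercised without a working DNS:

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if hostname resolves to an address, False otherwise."""
    try:
        resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    return True


if __name__ == "__main__":
    # foo.bar.com is the placeholder server name used in the logs above.
    print(can_resolve("foo.bar.com"))
```

                        If this still returns False after the /etc/hosts entry is added, the entry is not being picked up, and the same "Host name lookup failure" would be expected from the agent.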

                        Alexei help!

                        Comment

                        • glut0r
                          Member
                          • Mar 2007
                          • 38

                          #13
                          Originally posted by jfl
                          I also have this problem with server 1.1.7 on RHEL4 and agents 1.1.7 on Linux (many distributions).

                          However, it's not always related to a network outage. Sometimes one agent just stops sending data to the zabbix server.

                          To get the agent back, I just restart it.

                          If I can do anything to help, let me know!
                          Well, observe, set up debugging, and report. We'll try to gather more info.

                          Comment

                          • jfl
                            Junior Member
                            Zabbix Certified Specialist
                            • Aug 2006
                            • 2

                            #14
                            Originally posted by glut0r
                            Well, observe, set up debugging, and report. We'll try to gather more info.
                            Now I'm not able to restart the agent correctly. I start it, it sends data to the server, and then it's over: no more data. The zabbix_agentd processes are still running.

                            At debug level 4, nothing special in the logs.

                            However, I've run zabbix_agentd with strace -f for 10 minutes. The agent stopped sending data to server after 5 seconds. The strace output is available at http://step.polymtl.ca/~linux/zabbix_agentd.strace.gz

                            Comment

                            • glut0r
                              Member
                              • Mar 2007
                              • 38

                              #15
                              Another observation here: some of my agents keep hanging randomly due to a poor network connection, that is, packet loss and/or high latency, and sometimes no DNS response.

                              Comment
