We have a few active agents running. From time to time, especially after longer network outages that prevent them from connecting to zabbix-server, they hang in a strange state: they stop sending active data but still answer agent.ping. The only way to get them talking again is to restart them, which is quite annoying.
The agent just hangs there sending no data. The server notices this, the trigger fires, and the appropriate alert is generated. As you might have noticed, the agent.ping trigger does not fire, because agent.ping is not one of the active checks.
Any idea what's wrong? Or any other ideas for debugging this problem?
Below is a snippet from the log:
Code:
029195:20070413:103202 OK
029195:20070413:103202 In send_value([1232076])
029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>dmZzLmZzLnNpemVbL3RtcCxmcmVlXQ==</key><data>MTIzMjA3Ng==</data></req>]
029195:20070413:103202 OK
029195:20070413:103202 In send_value([1035296])
029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>dmZzLmZzLnNpemVbL3ZhcixmcmVlXQ==</key><data>MTAzNTI5Ng==</data></req>]
029195:20070413:103202 OK
029195:20070413:103202 No sleeping
029195:20070413:103202 In send_value([2426958])
029195:20070413:103202 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>c3lzdGVtLnVwdGltZQ==</key><data>MjQyNjk1OA==</data></req>]
029195:20070413:103202 OK
029195:20070413:103202 Sleeping for 1 seconds
029195:20070413:103203 In send_value([0])
029195:20070413:103203 XML before sending [<req><host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host><key>bmV0LmlmLmluW2V0aDEsZXJyb3JzXQ==</key><data>MA==</data></req>]
029194:20070413:104852 In check_security()
029194:20070413:104852 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
029194:20070413:104852 Before read()
029194:20070413:104852 After read() 2 [11]
029194:20070413:104852 Got line:agent.ping
029194:20070413:104852 Sending back:1
029191:20070413:104922 In check_security()
029191:20070413:104922 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
029191:20070413:104922 Before read()
029191:20070413:104922 After read() 2 [11]
029191:20070413:104922 Got line:agent.ping
029191:20070413:104922 Sending back:1
029192:20070413:105022 In check_security()
029192:20070413:105022 Connection from [87...]. Allowed servers [foo.bar.com,foo-2.bar.com,localhost]
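In case it helps anyone reading the log: the key and data fields in the "XML before sending" lines are base64-encoded. Here is a minimal Python sketch for decoding them. The function name decode_requests and the regular expression are my own and not part of Zabbix; the two sample lines are copied from the snippet above.

```python
import base64
import re

# Matches one request and captures its base64-encoded fields.
# The layout is taken from the log lines above; it is an assumption
# that every request logged by the agent looks like this.
REQ_RE = re.compile(
    r"<req><host>(?P<host>[^<]*)</host>"
    r"<key>(?P<key>[^<]*)</key>"
    r"<data>(?P<data>[^<]*)</data></req>"
)


def decode_requests(log_text):
    """Return (key, value) pairs decoded from 'XML before sending' entries."""
    pairs = []
    for match in REQ_RE.finditer(log_text):
        key = base64.b64decode(match.group("key")).decode("utf-8")
        value = base64.b64decode(match.group("data")).decode("utf-8")
        pairs.append((key, value))
    return pairs


# First and last requests from the snippet (process 029195).
sample = (
    "029195:20070413:103202 XML before sending [<req>"
    "<host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host>"
    "<key>dmZzLmZzLnNpemVbL3RtcCxmcmVlXQ==</key>"
    "<data>MTIzMjA3Ng==</data></req>]\n"
    "029195:20070413:103203 XML before sending [<req>"
    "<host>cnRyLTEwNS5jb3JlLmxhbmV0Lm5ldC5wbA==</host>"
    "<key>bmV0LmlmLmluW2V0aDEsZXJyb3JzXQ==</key>"
    "<data>MA==</data></req>]\n"
)

for key, value in decode_requests(sample):
    print(key, "=", value)
```

Decoded this way, the last item process 029195 tried to send at 10:32:03 is net.if.in[eth1,errors] with value 0, and unlike the earlier requests it is never followed by an OK line.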
I guess it's the agent that hung.
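One way to back up that guess is to compare the last log timestamp of each agent process. This is only a rough sketch: it assumes the log file has one entry per line in the PID:YYYYMMDD:HHMMSS layout seen in the snippet, and the function names last_activity and silent_pids and the 300 second threshold are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the snippet: PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^(\d+):(\d{8}):(\d{6}) (.*)$")


def last_activity(lines):
    """Map each PID to the (timestamp, message) of its most recent log line."""
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a regular log entry
        pid, day, clock, message = match.groups()
        stamp = datetime.strptime(day + clock, "%Y%m%d%H%M%S")
        if pid not in latest or stamp >= latest[pid][0]:
            latest[pid] = (stamp, message)
    return latest


def silent_pids(lines, max_gap_seconds):
    """Return PIDs whose last entry is older than the newest entry by more than the gap."""
    latest = last_activity(lines)
    if not latest:
        return []
    newest = max(stamp for stamp, _ in latest.values())
    return sorted(
        pid
        for pid, (stamp, _) in latest.items()
        if (newest - stamp).total_seconds() > max_gap_seconds
    )


# A few entries copied from the first log snippet.
sample_lines = [
    "029195:20070413:103203 In send_value([0])",
    "029194:20070413:104852 Sending back:1",
    "029191:20070413:104922 Sending back:1",
    "029192:20070413:105022 In check_security()",
]

print(silent_pids(sample_lines, max_gap_seconds=300))
```

On these entries only 029195, the process that was sending active data, is reported as silent, while the processes answering agent.ping keep logging.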
After the agent is restarted, the log shows the ordinary startup messages:
Code:
029193:20070413:133123 Got line:agent.ping
029193:20070413:133123 Sending back:1
029190:20070413:133140 Got signal. Exiting ...
029195:20070413:133140 Got signal. Exiting ...
029194:20070413:133140 Got signal. Exiting ...
029193:20070413:133140 Got signal. Exiting ...
029192:20070413:133140 Got signal. Exiting ...
029191:20070413:133140 Got signal. Exiting ...
029190:20070413:133140 One child process died. Exiting ...
029190:20070413:133140 Cannot remove STAT file [/tmp/zabbix_agentd.tmp]
029190:20070413:133140 Cannot remove PID file [/var/run/zabbix-agent/zabbix_agentd.pid]
011963:20070413:133145 zabbix_agentd started. ZABBIX 1.1.6.
011964:20070413:133145 zabbix_agentd 11964 started
011965:20070413:133145 zabbix_agentd 11965 started
011966:20070413:133145 zabbix_agentd 11966 started
011967:20070413:133145 zabbix_agentd 11967 started
011968:20070413:133146 zabbix_agentd 11968 started
011968:20070413:133146 In init_list()
011968:20070413:133146 In refresh_metrics()
011968:20070413:133146 get_active_checks: host[foo.bar.com] port[10051]
011968:20070413:133146 Sending [ZBX_GET_ACTIVE_CHECKS
