This is an intermittent problem, but it's gotten worse lately, and it's happening on multiple servers.
In the zabbix_server.log file, I'm seeing many of these messages:
> Zabbix agent item "system.cpu.load[percpu,avg15]" on host "wings.XXXXXX" failed: first network error, wait for 15 seconds
> resuming Zabbix agent checks on host "wings.XXXXXX": connection restored
It fails often, many times each minute, but not every time. And this is happening to many hosts.
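To put numbers on "many times each minute" and "many hosts", the server log could be tallied per host. This is only a minimal sketch: it assumes the failure lines have exactly the wording quoted above, and `count_network_errors` is a hypothetical helper name, not part of Zabbix.

```python
import re
import sys
from collections import Counter

# Matches the failure message quoted above and captures the host name.
# Assumption: the wording in the log is exactly as shown in this post.
FAILURE_RE = re.compile(r'on host "([^"]+)" failed: first network error')


def count_network_errors(lines):
    """Return a Counter mapping host name to number of 'first network error' lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python count_errors.py /path/to/zabbix_server.log
    with open(sys.argv[1], errors="replace") as log:
        for host, n in count_network_errors(log).most_common():
            print(f"{n:6d}  {host}")
```

Sorting by count makes it easier to see whether the failures are spread evenly or concentrated on a few hosts.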
While this was going on, I went to the agent and set DebugLevel=4 so I could watch, and I sent some baloney requests so I could observe responses, maybe see an error that the zabbix server isn't reporting. This is what I saw:
root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
Trying 70.32.115.64...
Connected to wings.XXXXX.
Escape character is '^]'.
ZBXD&ZBX_NOTSUPPORTEDUnsupported item key.Connection closed by foreign host.
root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
Trying 70.32.115.64...
Connected to wings.XXXXX..
Escape character is '^]'.
ZBXD&ZBX_NOTSUPPORTEDUnsupported item key.Connection closed by foreign host.
root@meter:~/tastetest# (echo "goofball"; sleep 1) | telnet wings.XXXXXX 6982
Trying 70.32.115.64...
Connected to wings.XXXXXX.
Escape character is '^]'.
Connection closed by foreign host.
That third time, the connection was closed before I got a response from the agent. I checked zabbix_agent.log and saw that it received all three of the "goofball" requests, so I know the problem isn't the Zabbix server failing to deliver the request; it's the agent failing to send a response before terminating the connection.
31259:20150518:154535.885 listener #1 [processing request]
31259:20150518:154535.885 Requested [goofball]
31259:20150518:154535.885 listener #1 [waiting for connection]
31257:20150518:154536.792 collector [processing data]
31257:20150518:154536.792 In update_cpustats()
31257:20150518:154536.792 End of update_cpustats()
31257:20150518:154536.793 collector [idle 1 sec]
31260:20150518:154537.245 listener #2 [processing request]
31260:20150518:154537.245 Requested [goofball]
31260:20150518:154537.245 listener #2 [waiting for connection]
31257:20150518:154537.793 collector [processing data]
31257:20150518:154537.793 In update_cpustats()
31257:20150518:154537.793 End of update_cpustats()
31257:20150518:154537.793 collector [idle 1 sec]
31261:20150518:154538.460 listener #3 [processing request]
31261:20150518:154538.462 Requested [goofball]
31261:20150518:154538.463 listener #3 [waiting for connection]
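To measure how often the agent closes the connection without answering, rather than catching it by hand, the telnet test above could be scripted. This is a rough sketch, not a definitive tool: `probe_agent`, `decode_reply` and `run_probes` are hypothetical names, and the decoder assumes the reply header is "ZBXD", one flags byte, then an 8-byte little-endian payload length. That layout is consistent with the telnet output above, where the "&" after ZBXD is byte 0x26 = 38, the length of the payload that follows.

```python
import socket
import struct


def probe_agent(host, port, key="goofball", timeout=5.0):
    """Send one plain-text request; return the raw bytes received before the peer closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(key.encode() + b"\n")
        chunks = []
        while True:
            try:
                data = sock.recv(4096)
            except ConnectionResetError:
                break  # peer reset the connection; keep whatever arrived
            if not data:
                break  # orderly close by the peer
            chunks.append(data)
    return b"".join(chunks)


def decode_reply(raw):
    """Decode a ZBXD-framed reply; return None if it is empty or not framed.

    Assumed layout: b"ZBXD", 1 flags byte, 8-byte little-endian length, payload.
    """
    if len(raw) < 13 or not raw.startswith(b"ZBXD"):
        return None
    (length,) = struct.unpack("<Q", raw[5:13])
    # The agent separates the status from the message with a NUL byte.
    return raw[13:13 + length].replace(b"\0", b" ").decode(errors="replace")


def run_probes(host, port, attempts=100, key="goofball"):
    """Probe repeatedly; return (answered, empty, errors) counts."""
    answered = empty = errors = 0
    for _ in range(attempts):
        try:
            raw = probe_agent(host, port, key)
        except OSError:
            errors += 1  # connect failure, timeout, or send error
            continue
        if raw:
            answered += 1
        else:
            empty += 1  # connection closed with no reply, as in the third test
    return answered, empty, errors
```

For example, `run_probes("wings.XXXXXX", 6982)` would return three counts; a non-zero middle value corresponds to the "Connection closed by foreign host" case with no reply.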
How can I get the Zabbix agent to always send an answer to the requests it receives, instead of this behaviour where it seems to terminate the connection before sending an answer?
EDIT: I couldn't post this because "Too many live links/images found in your post content." despite the fact that I had zero URLs in this text. Removed any mention of the string dot-cee-oh-em.