Hi,
I've got a copy of Zabbix Proxy v2.0.0 (revision 27675, build date 21 May 2012) running on a CentOS 6.2 OpenVZ host in an active setup. While the proxy works fine after restarting it, hours or, more likely, days later it gets into a state where it will accept inbound TCP connections on port 10051 only sporadically. This results in very limited monitoring data ending up on the Zabbix Server.
I tested this a little with netcat on the same system the proxy is running on (my_zabbix_proxy resolves to the public IP address of this proxy server):
Code:
[root@my_zabbix_proxy ~] # while true; do date; nc -vv -w3 my_zabbix_proxy 10051; echo; sleep 5; done
Di 18. Sep 17:25:47 CEST 2012
nc: connect to my_zabbix_proxy port 10051 (tcp) timed out: Operation now in progress
Di 18. Sep 17:25:55 CEST 2012
Connection to my_zabbix_proxy 10051 port [tcp/zabbix-trapper] succeeded!
Di 18. Sep 17:26:06 CEST 2012
nc: connect to my_zabbix_proxy port 10051 (tcp) timed out: Operation now in progress
Di 18. Sep 17:26:14 CEST 2012
nc: connect to my_zabbix_proxy port 10051 (tcp) timed out: Operation now in progress
Di 18. Sep 17:26:22 CEST 2012
nc: connect to my_zabbix_proxy port 10051 (tcp) timed out: Operation now in progress
Di 18. Sep 17:26:30 CEST 2012
Connection to my_zabbix_proxy 10051 port [tcp/zabbix-trapper] succeeded!
Di 18. Sep 17:26:41 CEST 2012
Connection to my_zabbix_proxy 10051 port [tcp/zabbix-trapper] succeeded!
Di 18. Sep 17:26:49 CEST 2012
nc: connect to my_zabbix_proxy port 10051 (tcp) timed out: Operation now in progress
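For anyone who wants a number instead of eyeballing the netcat output, the same test could be scripted roughly like this. This is only a sketch: the function names probe and success_rate are made up for illustration, and the defaults (3 second timeout, 5 second pause) simply mirror the nc loop above.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution errors.
        return False


def success_rate(host, port, attempts=10, pause=5.0, timeout=3.0):
    """Probe the port several times and return the fraction of successes."""
    ok = 0
    for i in range(attempts):
        if probe(host, port, timeout):
            ok += 1
        if i < attempts - 1:
            time.sleep(pause)  # same spacing as "sleep 5" in the shell loop
    return ok / attempts
```

Calling success_rate("my_zabbix_proxy", 10051) on the proxy host would then return a value between 0.0 and 1.0 that can be compared before and after a restart.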
Edit #1 (2012-09-20):
I would like to provide some additional information, some of which I've also provided on IRC today.
I think this actually consists of two separate issues, which should be examined individually:
(1) Sporadically (after days or weeks), a Zabbix Proxy server gets into a state where it will only accept two of three incoming TCP connections on the zabbix-trapper TCP port (10051). It should be determined what causes this, and whether the system is overloaded at the time it happens. The server which Zabbix Proxy runs on also acts as an OpenVZ host (i.e. it runs a kernel with the OpenVZ patches applied, which includes some degree of network virtualization, but it's unclear whether this directly affects the OpenVZ host itself). There are times when this system experiences high I/O load, driving the load average up to 60 (40 for the 15m average) and causing SSH to respond with much delay. These times are exceptions, not the rule.
(2) Once Zabbix Proxy has gotten into this state, it never recovers from it. Even now, when there is very little load on this system in terms of networking, disk I/O and CPU, and no shortage of unallocated physical RAM or entropy, Zabbix Proxy fails to accept many inbound connections on the zabbix-trapper port.
So these are two related but still separate issues. While I would by no means exclude the OpenVZ layer as a possible cause of the first issue, I would consider the second issue to be more likely caused by a software issue in Zabbix Proxy itself.
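To check whether failed connections line up with the load spikes mentioned in (1), one could log each connection attempt together with the load averages at that moment. The following is just a sketch under my own assumptions: snapshot and format_snapshot are hypothetical helper names, and os.getloadavg() is only available on Unix-like systems.

```python
import os
import socket
import time


def snapshot(host, port, timeout=3.0):
    """Record one connection attempt together with the current load averages."""
    started = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:
        connected = False
    elapsed = time.time() - started
    load1, load5, load15 = os.getloadavg()  # 1, 5 and 15 minute averages
    return {
        "time": started,
        "connected": connected,
        "connect_seconds": elapsed,
        "load1": load1,
        "load5": load5,
        "load15": load15,
    }


def format_snapshot(s):
    """Render a snapshot as one log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(s["time"]))
    status = "ok" if s["connected"] else "FAILED"
    return (f"{stamp} {status} connect={s['connect_seconds']:.2f}s "
            f"load={s['load1']:.2f}/{s['load5']:.2f}/{s['load15']:.2f}")
```

Printing format_snapshot(snapshot("my_zabbix_proxy", 10051)) every few seconds into a file would show whether failures keep occurring while the load is low, which is what issue (2) claims.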
Output of "netstat --statistics": http://pastebin.com/GYf8MJ0k
Output of "netstat -tn | grep 10051" (?): http://pastebin.com/aiznfuZ7
The above posts will expire in 29 days.
As nelsonab pointed out on IRC, the latter clearly doesn't look good: many connections are in SYN_RECV state, i.e. they have not yet received the ACK that completes the TCP handshake. It is unclear what causes this, and it was suggested that a tcpdump be recorded to get a better idea of what's going on here.
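To track how many connections are stuck in SYN_RECV over time, the "netstat -tn" output can be summarized per TCP state. A minimal sketch, assuming the usual Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); count_states is a name I made up:

```python
from collections import Counter


def count_states(netstat_output, port=10051):
    """Count TCP states of connections whose local port matches `port`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header lines and anything that is not a TCP entry.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        if local_addr.endswith(f":{port}"):
            states[state] += 1
    return states
```

Feeding it the text of "netstat -tn" (for example read from sys.stdin) gives a Counter keyed by state name, so a growing SYN_RECV count next to a flat ESTABLISHED count would be easy to spot.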
Edit #2 (2012-09-20):
Here's a tcpdump which was recorded in July: http://www.sendspace.com/file/kbi7y2
This tcpdump was limited to TCP port 10051 and was recorded on the Zabbix Proxy host. At the time of recording, this Zabbix Proxy was running within an OpenVZ container on the same physical server where it is running now. We later moved Zabbix Proxy onto the OpenVZ host (where it is still running now) to see whether this would help with the (same) issues we saw then, i.e. monitoring data not being recorded due to failing incoming connections.
This dump has been redacted: IP addresses were rewritten to A.B.C.* - these hosts belong to the zabbix_proxy network. Hostnames in the *_monitored format are on the same network, too. The zabbix_server is a remote host (on the Internet). This is not a full packet dump but just the shortened plain text output.
During the recording of this log, the Zabbix Proxy got into the same state it is in now (around time index 04:20): it receives incoming connections only sporadically. It remains in this state indefinitely, and, at the same time, all other services running on this system (both on the host system and the OpenVZ containers) work as expected.
Here's what the graphs of all systems monitored through this Zabbix Proxy (using an ActiveAgent configuration) look like as soon as this issue occurs:

As a side note, the network link on this system consists of two bonded NICs.
