Hi all,
I've got two clients (v1.4.4) that are on separate networks from the server (v1.4.4) and behind firewalls, which have tcp/10050 and tcp/10051 open accordingly. After a period of time, these two machines just lose communication with the server. I don't think it's a firewall issue. Once I restart the zabbix_agentd service, communication is restored.
The server can still telnet to tcp/10050 on the client side. The client still has zabbix_agentd running:
zabbix 24885 1 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
zabbix 24887 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
zabbix 24888 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
zabbix 24889 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
zabbix 24890 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
zabbix 24891 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
root 26167 24843 0 11:02 pts/0 00:00:00 grep zabbix
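A successful telnet only shows that the port accepts connections, not that the agent still answers item requests. A minimal Python sketch of a stricter check follows. It assumes the passive agent accepts a plain-text item key terminated by a newline and closes the connection after replying; probe_agent and the agent.ping key are illustrative choices, not something taken from my setup, so substitute any key your agent supports. The reply is returned as raw bytes because it may carry a protocol header in front of the value.

```python
import socket


def probe_agent(host, key="agent.ping", port=10050, timeout=5.0):
    """Send one item key to a passive agent and return the raw reply bytes.

    Assumption: the agent takes a newline-terminated key and closes the
    connection once it has answered. The function name and the default
    key are hypothetical and only serve this illustration.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(key.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)  # raises socket.timeout if the agent stays silent
            if not data:            # empty read: the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

If the connection is accepted but recv() times out, that would match the "Timeout while answering request" lines on the server side. Run it from the server itself, since the agent may drop connections from addresses it is not configured to accept.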
This is what's in my zabbix_server.log:
17154:20071228:180739 Parameter [proc.num[zabbix_server]] will be checked after 240 seconds on host [client1]
17156:20071228:180758 Timeout while answering request
17155:20071228:180800 Timeout while answering request
17156:20071228:180810 Get value from agent failed. Error: ZBX_TCP_READ() failed [Connection reset by peer]
17153:20071228:180835 Timeout while answering request
17157:20071228:180846 Get value from agent failed. Error: ZBX_TCP_READ() failed [Connection reset by peer]
17157:20071228:180846 Host [client2]: first network error, wait for 15 seconds
17157:20071228:180846 Parameter [vfs.fs.inode[/tmp,pfree]] will be checked after 120 seconds on host [client2]
17156:20071228:180900 Timeout while answering request
17154:20071228:180902 Timeout while answering request
17155:20071228:180904 Timeout while answering request
17155:20071228:180904 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
17155:20071228:180904 Host [client2]: first network error, wait for 15 seconds
17155:20071228:180904 Parameter [net.if.in[eth0,bytes]] will be checked after 20 seconds on host [client2]
17154:20071228:180916 Timeout while answering request
17154:20071228:180926 Timeout while answering request
17153:20071228:180942 Timeout while answering request
17155:20071228:180954 Timeout while answering request
17154:20071228:181020 Timeout while answering request
17154:20071228:181020 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
17154:20071228:181020 Host [client2]: first network error, wait for 15 seconds
17154:20071228:181020 Parameter [vfs.fs.inode[/opt,pfree]] will be checked after 120 seconds on host [client2]
17156:20071228:181024 Timeout while answering request
17156:20071228:181024 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
17156:20071228:181024 Host [client1]: first network error, wait for 15 seconds
17156:20071228:181024 Parameter [vfs.fs.size[/,pused]] will be checked after 120 seconds on host [client1]
17157:20071228:181058 Timeout while answering request
17157:20071228:181107 Timeout while answering request
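To see whether the failures cluster on particular poller processes or hosts, the log can be tallied with a short script. This is only a sketch: it assumes every line follows the PID:YYYYMMDD:HHMMSS layout seen in the excerpt above, and the summarize function and its result keys are names made up for the example.

```python
import re
import sys
from collections import Counter

# Layout assumed from the excerpt above: PID:YYYYMMDD:HHMMSS message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}) (.*)$")
HOST_RE = re.compile(r"Host \[([^\]]+)\]: first network error")
READ_RE = re.compile(r"ZBX_TCP_READ\(\) failed \[([^\]]+)\]")


def summarize(lines):
    """Count timeouts per PID, network errors per host, and read-failure reasons."""
    timeouts_by_pid = Counter()
    network_errors_by_host = Counter()
    read_failures = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that does not look like a log entry
        pid, _date, _time, message = match.groups()
        if message.startswith("Timeout while answering request"):
            timeouts_by_pid[pid] += 1
        host = HOST_RE.search(message)
        if host:
            network_errors_by_host[host.group(1)] += 1
        reason = READ_RE.search(message)
        if reason:
            read_failures[reason.group(1)] += 1
    return {
        "timeouts_by_pid": dict(timeouts_by_pid),
        "network_errors_by_host": dict(network_errors_by_host),
        "read_failures": dict(read_failures),
    }


if __name__ == "__main__":
    # Usage: python summarize_log.py /path/to/zabbix_server.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for name, counts in summarize(handle).items():
            print(name, counts)
```

Note that the "Timeout while answering request" lines carry no host name, so they can only be grouped by process ID; only the "first network error" lines identify the host.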
I was thinking of disabling all of my hosts except these two clients for a period of time, raising the server's debug level to 4 (DebugLevel=4), and seeing what gets logged.
Any recommendations?
Thanks,
Chris