**batchenr** · 21-02-2017, 17:00

Originally posted by Jason

We've got a CentOS 7 front-end Zabbix server running Zabbix 2.4.8 with a Postgres database running on a CentOS 6 server. The Zabbix database is ~330GB for 30 days' worth of data for ~800 hosts.

Periodically all of the active agents and proxies drop offline for anything from 15 minutes to over an hour. Attached is an uptime graph showing one outage from tonight.

Passive checks are still collected and there are no gaps in that data.

All of the storage is on fast 15K SAS disks and all the monitoring I have shows that latency is low.

I thought it could be network related, but some of the hosts that are dropping offline are on the local network with the zabbix server and connect by IP address. There is a firewall rule on the host allowing all traffic in without restriction on the zabbix server port.

There is nothing untoward showing up in the zabbix-server log during these outages.

Any thoughts on possible causes?

Can you post the Zabbix server and agent logs?
Try disabling all unsupported items on the host that is causing these issues.

**Jason** · 25-02-2017, 11:12

I missed the logs for the last outage, but it's just happened again this morning.

It's all proxies and hosts that use active checks that go offline.

Excerpt from a proxy on the same LAN as the Zabbix server:
2095:20170225:083939.951 sending heartbeat message to server failed: error:"no response: network error", info:""
2094:20170225:084039.624 Error while receiving answer from server [ZBX_TCP_READ() failed: [104] Connection reset by peer]
2095:20170225:084039.952 sending heartbeat message to server failed: error:"no response: network error", info:""
2095:20170225:084139.952 sending heartbeat message to server failed: error:"no response: network error", info:""
2095:20170225:084239.953 sending heartbeat message to server failed: error:"no response: network error", info:""
2095:20170225:084339.953 sending heartbeat message to server failed: error:"no response: network error", info:""
2094:20170225:084342.638 Unable to connect to the server [10.10.1.100]:10052 [cannot connect to [[10.10.1.100]:10052]: [110] Connection timed out]. Will retry every 120 second(s)
2095:20170225:084439.954 sending heartbeat message to server failed: error:"no response: network error", info:""
2094:20170225:084445.638 Still unable to connect...
2095:20170225:084539.954 sending heartbeat message to server failed: error:"no response: network error", info:""
2095:20170225:084639.955 sending heartbeat message to server failed: error:"no response: network error", info:""
2095:20170225:084739.955 sending heartbeat message to server failed: error:"no response: network error", info:""
2094:20170225:084748.638 Still unable to connect...
2094:20170225:084948.639 Connection restored.
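
To put a number on how long the proxy was cut off, the excerpt above can be scanned for the first failure line and the following "Connection restored" line. This is only a rough sketch: it assumes the `PID:YYYYMMDD:HHMMSS.mmm` prefix seen in these logs, and the marker strings and function names are illustrative choices, not anything provided by Zabbix.

```python
import re
from datetime import datetime

# Assumed log prefix, taken from the excerpt: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")

# Message fragments treated as "the proxy cannot reach the server" (illustrative list)
FAILURE_MARKERS = (
    "sending heartbeat message to server failed",
    "Unable to connect to the server",
    "Error while receiving answer from server",
)
RESTORED_MARKER = "Connection restored"


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    _pid, date, clock, millis, message = match.groups()
    stamp = datetime.strptime(f"{date} {clock}.{millis}", "%Y%m%d %H%M%S.%f")
    return stamp, message


def outage_windows(lines):
    """Pair each first failure line with the next 'Connection restored' line."""
    windows = []
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if start is None and any(marker in message for marker in FAILURE_MARKERS):
            start = stamp
        elif start is not None and RESTORED_MARKER in message:
            windows.append((start, stamp))
            start = None
    return windows
```

Feeding it the lines of a proxy log (for example `outage_windows(open(path))`, with `path` pointing at wherever the proxy log lives on that host) gives a list of start/end pairs that can be compared across proxies to see whether they all lose the server at the same moment.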

Lines from another proxy during the outage:

6648:20170225:084119.896 Unable to connect to the server [10.10.1.100]:10052 [cannot connect to [[10.10.1.100]:10052]: [4] Interrupted system call]
6647:20170225:084120.324 Error while receiving answer from server [ZBX_TCP_READ() failed: [104] Connection reset by peer]

From the server...

4355:20170225:083919.686 unmatched trap received from "192.168.24.10": 08:39:14 2017/02/25 .1.3.6.1.4.1.231.2.54.2.0.2502 Normal "General event" 192.168.24.10 - 78.33.226.210 JAH-SVR01 1488011954 memory 81 80
4280:20170225:083926.890 resuming SNMP agent checks on host "AirFibreRemote": connection restored
4273:20170225:084026.276 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
4277:20170225:084113.599 resuming SNMP agent checks on host "AirFibreRemote": connection restored
4305:20170225:084126.614 sending configuration data to proxy "-ZABBIX-Proxy", datalen 526868
4355:20170225:084134.707 unmatched trap received from "192.168.0.10": 08:41:32 2017/02/25 .1.3.6.1.4.1.318.0.35 Normal "General event" 192.168.0.10 - 176.35.188.187 AVR Trim No Longer Active
4279:20170225:084211.565 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
4279:20170225:084227.674 resuming SNMP agent checks on host "AirFibreRemote": connection restored
4292:20170225:084306.962 sending configuration data to proxy "XX-ZabbixProxy", datalen 441081
4292:20170225:084307.512 sending configuration data to proxy "XY Zabbix Proxy", datalen 286245
4303:20170225:084338.768 sending configuration data to proxy "XZ Zabbix Proxy", datalen 385282
4303:20170225:084338.859 sending configuration data to proxy "SJ proxy", datalen 311054
4272:20170225:084426.078 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
4271:20170225:084512.176 resuming SNMP agent checks on host "AirFibreRemote": connection restored
4278:20170225:084611.174 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
4276:20170225:084627.312 resuming SNMP agent checks on host "AirFibreRemote": connection restored
4325:20170225:084708.486 sending configuration data to proxy "Cloud01-proxy", datalen 556085
4351:20170225:084710.284 item "Zabbix Database Server:pgsql.get.pg.stat_database[{$PGSCRIPTDIR},{$PGSCRIPT_CONFDIR},{HOST.HOST},{$ZABBIX_AGENTD_CONF},zabbix]" became not supported: Timeout while executing a shell script.
4271:20170225:084826.265 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
4304:20170225:084834.856 sending configuration data to proxy "HA Proxy", datalen 1535022
4304:20170225:084836.126 sending configuration data to proxy "XX-ZabbixProxy", datalen 441081
4304:20170225:084837.250 sending configuration data to proxy "XY Zabbix Proxy", datalen 286245

**batchenr** · 26-02-2017, 10:54

I don't use a proxy so I can't say for sure, but whenever I got a message from an agent saying it can't connect, I changed the port 10052 in /etc/zabbix/zabbix_agent.conf:

#port=10050 - just uncomment it and restart.

**Jason** · 26-02-2017, 11:01

All of our agents/proxies are configured to connect on that port for various reasons... I don't fancy changing that on over 800 hosts... It's worked fine for years till recently so I'd be surprised if that were it...

**batchenr** · 26-02-2017, 11:06

I just told you what works for me; you don't have to do it.
Unfortunately I don't have a proxy and I'm not familiar with these errors.

Active checks and proxies periodically all go offline...
