We've got a CentOS 7 front-end Zabbix server running Zabbix 2.4.8, with a PostgreSQL database running on a separate CentOS 6 server. The zabbix database is ~330GB for 30 days' worth of data for ~800 hosts.
Periodically, the active agents and proxies drop offline for anything from 15 minutes to over an hour. Attached is an uptime graph showing one outage from tonight.
Passive checks are still collected and there are no gaps in that data.
All of the storage is on fast 15K SAS disks, and all the monitoring I have shows that latency is low.
I thought it could be network related, but some of the hosts that are dropping offline are on the local network with the zabbix server and connect by IP address. There is a firewall rule on the host allowing all traffic in without restriction on the zabbix server port.
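To rule the network in or out more firmly, a quick probe like the one below could be run from one of the affected agent hosts during an outage. This is only a rough sketch: the hostname `zabbix.example.com` is a placeholder, `probe_tcp` is a helper name made up for illustration, and 10051 is assumed because it is the default Zabbix server (trapper) port. It only checks that a TCP connection can be opened and how long that takes. It does not speak the Zabbix protocol.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None


if __name__ == "__main__":
    # Placeholder address: replace with the real Zabbix server.
    # 10051 is the default Zabbix server (trapper) port.
    elapsed = probe_tcp("zabbix.example.com", 10051)
    print("unreachable" if elapsed is None else f"connected in {elapsed:.3f}s")
```

If the connection opens quickly while active agents are still shown as offline, the problem is more likely on the server side (for example busy trapper processes) than in the network path.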
There is nothing untoward showing up in the zabbix-server log during these outages.
Any thoughts on possible causes?
gsql.get.pg.stat_database[{$PGSCRIPTDIR},{$PGSCRIPT_CONFDIR},{HOST.HOST},{$ZABBIX_AGENTD_CONF},zabbix]" became not supported: Timeout while executing a shell script.