A little bit stumped here.
Everything is updated to the latest released version: the Zabbix server and the agents on all the monitored servers. No proxy is being used. One of the monitored servers intermittently stops responding. It only seems to happen for a few minutes every day or two, so it's difficult to troubleshoot. The times are random, but so far always in the morning during business hours, Los Angeles time. I just get a notice that the server is offline, but it's not. When I look at the Zabbix graphs, all the other monitored parameters (CPU load etc.) are blank for that period as well. So basically the agent is not responding and not sending back any information. Then it comes back on its own, so I get the notice that the server is back online, and it usually doesn't happen again that day. Sometimes it does happen again within a few more minutes and then comes back again. Twice within a 20-minute period seems to be the most in any given day.
Today it stopped responding for an extended period for a change, so I had some time to try some things. As soon as I restarted the agent it started working again, then a few minutes later it stopped. I then stopped the agent and started it again, and this time it still seems to be working after about half an hour.
Before I restarted the agent I checked a few things. I ran netstat and tcpdump on port 10050 and didn't see anything unusual. No foreign IPs were trying to connect on that port. The Zabbix server is in a different data center than the monitored server. I have another monitored server in the same data center as the one that stops responding intermittently, but this other server never misses a beat. So it's not something with the routing or anything like that.
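To pin down exactly when the agent stops answering, something like the following could be run from the Zabbix server host (the agent only accepts connections from the IP in its Server= setting). This is only a rough sketch, not anything from the Zabbix tools: the function names are made up, and it assumes the agent speaks the usual passive-check framing of a "ZBXD" signature, a 0x01 flags byte and an 8-byte little-endian length in front of the item key.

```python
import socket
import struct
import sys
import time

# Assumed framing: "ZBXD" signature + flags byte 0x01, then an 8-byte
# little-endian data length, then the payload.
HEADER = b"ZBXD\x01"


def build_request(key: str) -> bytes:
    """Frame an item key (e.g. agent.ping) as a passive-check request."""
    data = key.encode("utf-8")
    return HEADER + struct.pack("<Q", len(data)) + data


def parse_response(packet: bytes) -> str:
    """Strip the header from an agent reply and return the value as text."""
    if len(packet) < 13 or not packet.startswith(HEADER):
        raise ValueError("not a ZBXD packet (empty or unexpected reply)")
    (length,) = struct.unpack("<Q", packet[5:13])
    return packet[13:13 + length].decode("utf-8", errors="replace")


def probe(host: str, port: int = 10050, key: str = "agent.ping",
          timeout: float = 3.0):
    """Send one passive check; return (value or failure text, seconds taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_request(key))
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # agent closed the connection
                    break
                chunks.append(chunk)
        result = parse_response(b"".join(chunks))
    except (OSError, ValueError) as exc:
        # Covers timeouts, refused connections and empty or garbled replies.
        result = f"FAILED: {exc!r}"
    return result, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical usage: python3 probe.py <agent-host>
    agent_host = sys.argv[1]
    while True:
        value, elapsed = probe(agent_host)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {value} ({elapsed:.2f}s)", flush=True)
        time.sleep(10)
```

Left running in a terminal, this prints one line every 10 seconds, so the failure lines can be lined up against the agent log and anything else happening on the monitored server at those times. Whether a failure shows up as a timeout, a refused connection or an empty reply would also hint at where the agent is getting stuck.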
I double-checked the agent config file and it's the same as on the other servers I am monitoring. The only thing I have set is the Zabbix server IP that is allowed to connect. Everything else is at defaults, so passive mode etc.
Anyone have any ideas what this could possibly be?