Hi,
For the last couple of days I've been having big issues with Zabbix.
Every 1-2 hours all of my agents are killed, and as a result all servers become unavailable.
In zabbix_server.log I see errors from time to time like the following:
Zabbix agent item [system.cpu.util[,user,avg1]] on host [SERVER01] failed: another network error, wait for 15 seconds
I used to see this error even before the problem started.
Since the problem started I also see:
temporarily disabling Zabbix agent checks on host [SERVER01]: host unavailable
I have no idea why it happens.
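To check whether these messages hit every host at the same time or only some of them, something like the following could tally them per host. This is only a minimal sketch: the two patterns are taken from the log lines quoted above, the function name `tally` is made up for illustration, and the log path has to be passed on the command line because its location depends on the installation.

```python
import re
import sys
from collections import Counter

# Patterns based on the two zabbix_server.log messages quoted above.
NETWORK_ERROR = re.compile(
    r"on host \[(?P<host>[^\]]+)\] failed: another network error"
)
DISABLED = re.compile(
    r"temporarily disabling Zabbix agent checks on host \[(?P<host>[^\]]+)\]"
)


def tally(lines):
    """Count both message types per host (hypothetical helper)."""
    errors, disabled = Counter(), Counter()
    for line in lines:
        match = NETWORK_ERROR.search(line)
        if match:
            errors[match.group("host")] += 1
            continue
        match = DISABLED.search(line)
        if match:
            disabled[match.group("host")] += 1
    return errors, disabled


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log file path is supplied by the caller; it is not assumed here.
    with open(sys.argv[1], errors="replace") as log:
        errors, disabled = tally(log)
    for host, count in errors.most_common():
        print(f"{host}: {count} network errors, {disabled[host]} times disabled")
```

If the counts are spread evenly over all hosts, the cause is more likely on the server or network side than on individual agents.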
I have 166 hosts monitored.
8054 active items.
2496 triggers.
Required server performance is 361 new values per second.
I'm using a Zabbix server with 2 vCPUs and 17 GB RAM that runs zabbix_server, mysql_server and the frontend.
I also have a Zabbix proxy with roughly 70 servers behind it.
Note that the servers are running in the Amazon cloud.
When this problem occurs all servers show as unavailable, and I have to start the agents again before the problem resolves itself.
I've tried various methods to fix the problem, such as removing unnecessary items/triggers/hosts. I also restored the server from a backup image because I thought it might be running on bad hardware.
Has anyone encountered such an issue?
Is there anything I can do?