A lot of messages like these are appearing in the Zabbix server log:
3297:20200429:235705.672 Zabbix agent item "system.cpu.util[,user]" on host "hostname-deleted" failed: first network error, wait for 15 seconds
3301:20200429:235720.692 resuming Zabbix agent checks on host "hostname-deleted": connection restored
This is generating a lot of flapping false alerts. Yet running zabbix_get by hand retrieves the same item without any problem, and very quickly:
$ time zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
0.133534
real 0m0.012s
user 0m0.003s
sys 0m0.002s
Running this test ten times in a row using a bash for-loop doesn't show any problems.
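For reference, the repeated test was along these lines. This is a minimal sketch rather than the exact command I ran: `probe_loop` is a made-up helper name, and it assumes zabbix_get exits non-zero when a check fails.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
probe_loop() {
  local count="$1"
  shift
  local failures=0
  local i
  for i in $(seq 1 "$count"); do
    # Discard the item value; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures"
}

# Example (same host and key as above):
# probe_loop 10 zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
```

A result of 0 means every run succeeded, which is what I see each time.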
The Zabbix server is running on a GCP g1-small VM (1 vCPU, 1.7 GB memory).
The database is PostgreSQL 9.6 on the Cloud SQL managed service, using an N1 machine type with 2 vCPUs, 7.5 GB of RAM, and SSD storage.
So far, I've ruled out network issues such as firewalls, and also SELinux.
More background: this is part of moving our infrastructure from on-premises to GCP. As part of this move, Zabbix is being upgraded from version 3 to version 4. We have three environments: Dev, Test, and Prod. The image is created with Packer and customised at run time using startup scripts. This has worked for Dev and Test. It is the final environment, Prod, that is experiencing the issue. The only difference of note is that Dev and Test both use database servers that are shared with other databases, whereas in Prod a new database instance was created specifically for Zabbix.
Zabbix version: 4.4.8
Number of hosts being monitored: 158
Number of Items: 5853
Number of triggers: 4119
Required server performance (new values per second): 58.35
load average: 1.10, 0.67, 0.53
CPU is about 20% busy.
Does anyone have any suggestions on what to check next?
What other information would help pinpoint the source of the problem?
Thanks in advance.