Version: Zabbix 2.4.2
I've been trying to debug this for a while and haven't had any luck. I've done packet captures and haven't seen anything glaringly obvious.
We have two different Zabbix masters. One runs in EC2 Classic (non-VPC); its zabbix_proxy handles about 725 hosts and 123,000 checks, and its poller busy percentage is about 40% with 60 pollers.
The other is the same size machine in an AWS VPC, monitoring across accounts in the same region using AWS peering connections. It can't seem to get over 300 hosts before the zabbix_proxy poller busy percentage hits 100% with 60 pollers, and I start to see connection errors in the zabbix_proxy logs. If I move 70 hosts onto another zabbix_proxy, they do stabilize somewhat. I've tried running in debug mode, and that wasn't enlightening.
I'm leaning towards it being a network connection issue, but nothing is standing out as a smoking gun.
Code:
23977:20170131:040020.101 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23977:20170131:040020.103 Zabbix agent item "mysql.Created_tmp_disk_tables" on host "dke4-dbtxbs01b.aue1p" failed: first network error, wait for 15 seconds
23980:20170131:040020.103 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23980:20170131:040020.104 Zabbix agent item "proc.num[,,run]" on host "dke4-roapp01e.aue1p" failed: first network error, wait for 15 seconds
23997:20170131:040020.130 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23997:20170131:040020.131 Zabbix agent item "net.if.out[eth0,bytes]" on host "dke4-dbrpss01a.aue1p" failed: first network error, wait for 15 seconds
23982:20170131:040020.131 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23994:20170131:040020.132 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
24026:20170131:040020.132 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23982:20170131:040020.132 Zabbix agent item "vfs.fs.size[/,free]" on host "dke4-smcn01c.aue1p" failed: first network error, wait for 15 seconds
24026:20170131:040020.133 Zabbix agent item "mailq.queue_size" on host "dke4-dbtxss01c.aue1p" failed: first network error, wait for 15 seconds
23994:20170131:040020.133 Zabbix agent item "net.if.in[eth0,bytes]" on host "dke4-esenm01b.aue1p" failed: first network error, wait for 15 seconds
24006:20170131:040020.139 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
24006:20170131:040020.141 Zabbix agent item "system.cpu.util[,softirq,avg1]" on host "dke4-lbmo02b.aue1p" failed: first network error, wait for 15 seconds
23987:20170131:040020.142 [Z3005] query failed: [2006] MySQL server has gone away [begin;]
23987:20170131:040020.144 Zabbix agent item "custom.vfs.dev.read.sectors[xvdb]" on host "dke4-dbenss01c.aue1p" failed: first network error, wait for 15 seconds
23967:20170131:040023.404 received configuration data from server, datalen 5490406
24038:20170131:040024.714 cannot send list of active checks to [10.21.123.108]: host [dke-crtr01a.aue1m] not monitored
24037:20170131:040027.061 cannot send list of active checks to [10.21.75.39]: host [dke5-monorpc01d.aue1m] not monitored
24039:20170131:040028.927 cannot send list of active checks to [10.21.75.18]: host [dke5-mossfs02d.aue1m] not monitored
23967:20170131:040029.804 received configuration data from server, datalen 5490406
24038:20170131:040031.703 cannot send list of active checks to [10.21.79.124]: host [dke5-mosspm01e.aue1m] not monitored
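For anyone who wants to tally an excerpt like the one above by message type, here is a rough Python sketch. It assumes only the entry layout visible in the paste (PID:YYYYMMDD:HHMMSS.mmm followed by the message). The category names and the classify/summarize helpers are made up for illustration and are not part of Zabbix.

```python
import re
from collections import Counter

# One entry = "PID:YYYYMMDD:HHMMSS.mmm message". The lookahead ends a message
# at the next entry header or at end of input, so this works whether the
# entries are one per line or run together on a single line.
ENTRY = re.compile(
    r"(\d+):(\d{8}):(\d{6}\.\d{3}) (.*?)"
    r"(?=\s+\d+:\d{8}:\d{6}\.\d{3} |\s*$)",
    re.S,
)


def classify(message):
    """Map a log message to a coarse, hypothetical category name."""
    if "MySQL server has gone away" in message:
        return "db_gone_away"
    if "first network error" in message:
        return "agent_network_error"
    if "not monitored" in message:
        return "host_not_monitored"
    return "other"


def summarize(log_text):
    """Count log entries per category in a chunk of proxy log text."""
    counts = Counter()
    for _pid, _date, _time, message in ENTRY.findall(log_text):
        counts[classify(message)] += 1
    return counts
```

Calling summarize() on the text of the proxy log returns a Counter keyed by category, which makes it easier to see how many of each kind of entry fall in a given excerpt.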