Hi there,
Could somebody please shed some light on an issue I'm trying to understand and solve?
I've got quite a large deployment where, just recently, monitoring items have stopped getting data on time, which results in huge data gaps.
The workload is spread across 3 Zabbix proxy servers.
Just after a fresh reboot of the entire environment there is a burst of work across those 3 Zabbix proxy servers and data comes through.
However, after some time the whole system starts to deteriorate and comes to an almost complete halt, and only occasional items manage to collect data.
Increasing the number of async agent pollers did not help (I increased it from 8 to 20); they all get clogged up over time.
Interestingly, server resources don't seem to be the constraint: CPU and RAM are underutilized and the Zabbix proxy servers seem idle.
1. How can the "awaiting" values of ~1000, maxed out almost all the time, be explained? Why are those checks not picked up and moved out to the queue?
2. Assuming these are problematic items that occupy the slots because they run close to the allotted timeout (Timeout=10 in my zabbix_proxy.conf, but I also tried bringing it down to 3s), why are they not moved out to the Unreachable pollers for later attempts?
3. How can I check what holds back those 1000 items on a poller and prevents them from being processed?
NOTE: When I randomly pick a host and an item (e.g. CPU) in the GUI and press the Test button, the return value appears instantaneously.
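To go beyond spot checks of single items, one thing that could be tried is a bulk TCP probe of the agents, run from the proxy itself. The following is only a rough Python sketch under my own assumptions: the agents listen on the default passive port 10050, the host list is supplied by hand, and a plain TCP connect stands in for a real agent check, so it says nothing about how the async pollers work internally. The names `probe` and `probe_all` are made up for this example.

```python
import asyncio
import contextlib
import time


async def probe(host, port=10050, timeout=3.0):
    """Try one TCP connect and return (host, status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except asyncio.TimeoutError:
        # No answer within the timeout: a candidate for a "stuck" slot.
        status = "timeout"
    except OSError as exc:
        # Refused, unreachable, DNS failure and similar errors.
        status = "error: " + exc.__class__.__name__
    else:
        status = "ok"
        writer.close()
        with contextlib.suppress(OSError):
            await writer.wait_closed()
    return host, status, time.monotonic() - start


async def probe_all(hosts, port=10050, timeout=3.0, concurrency=100):
    """Probe many hosts at once, at most `concurrency` connects in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def limited(host):
        async with sem:
            return await probe(host, port, timeout)

    return await asyncio.gather(*(limited(h) for h in hosts))
```

It could be run with something like `asyncio.run(probe_all(["host-a.example", "host-b.example"], timeout=3.0))`, where the host names are placeholders. Hosts that come back as "timeout" rather than "ok" or a fast "error" would be the ones I would look at first.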
Below is an example from one of the Zabbix proxy servers, with a summary of the Zabbix processes running on it (also see the screenshots that follow).







Code:
Parameter                                            Value     Details
=========                                            =====     =======
Zabbix server is running                             Yes       zabbix-srv:10051
Zabbix server version                                7.0.25    New update available
Zabbix frontend version                              7.0.25    New update available
Latest release                                       7.0.26    Release notes
Number of hosts (enabled/disabled)                   2756      2729 / 27
Number of templates                                  433
Number of items (enabled/disabled/not supported)     301001    282348 / 4351 / 14302
Number of triggers (enabled/disabled [problem/ok])   103761    91473 / 12288 [526 / 90947]
Required server performance, new values per second   2486.43
High availability cluster                            Disabled
Code:
cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
Code:
free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       9.1Gi       655Mi       129Mi       5.6Gi       5.8Gi
Swap:           9Gi        49Mi         9Gi
Code:
cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Code:
=== Load: load average: 1.35, 1.07, 1.11 CPU idle: 83.3 id ===

### Agent pollers ###
Active: 20 Idle: 1
agent poller #1 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #2 [got 7 values, queued 7 in 5 sec, awaiting 1000]
agent poller #3 [got 14 values, queued 10 in 5 sec, awaiting 996]
agent poller #4 [got 2 values, queued 0 in 5 sec, awaiting 998]
agent poller #5 [got 3 values, queued 0 in 5 sec, awaiting 79]
agent poller #6 [got 1 values, queued 1 in 5 sec, awaiting 666]
agent poller #7 [got 0 values, queued 3 in 5 sec, awaiting 562]
agent poller #8 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #9 [got 2 values, queued 4 in 5 sec, awaiting 1000]
agent poller #10 [got 6 values, queued 6 in 5 sec, awaiting 1000]
agent poller #11 [got 4 values, queued 6 in 5 sec, awaiting 1000]
agent poller #12 [got 3 values, queued 2 in 5 sec, awaiting 999]
agent poller #13 [got 1 values, queued 0 in 5 sec, awaiting 794]
agent poller #14 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #15 [got 7 values, queued 14 in 5 sec, awaiting 804]
agent poller #16 [got 1 values, queued 6 in 5 sec, awaiting 251]
agent poller #17 [got 1 values, queued 0 in 5 sec, awaiting 998]
agent poller #18 [got 11 values, queued 0 in 5 sec, awaiting 986]
agent poller #19 [got 2 values, queued 2 in 5 sec, awaiting 292]
agent poller #20 [got 1 values, queued 0 in 5 sec, awaiting 620]

### HTTP agent pollers ###
Active: 0 Idle: 1

### SNMP pollers ###
Active: 0 Idle: 1

### Classic pollers ###
Active: 1 Idle: 9
poller #25 [got 0 values in 0.000016 sec, getting values]

### Unreachable pollers ###
Active: 1 Idle: 9
unreachable poller #15 [got 0 values in 0.000033 sec, getting values]

### Trappers ###
Active: 0 Idle: 10

### Preprocessing manager ###
preprocessing manager #1 [queued 147, processed 168 values, idle 5.067374 sec during 5.080943 sec]

### TCP ###
ESTABLISHED: 11
TIME_WAIT: 7185
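To compare the agent pollers over time rather than eyeballing each line, the process titles above can be totalled with a small script. This is a minimal Python sketch that assumes the titles keep exactly the format shown in this dump. The function name `summarize_agent_pollers` and the `slot_limit` default of 1000 are my own choices for illustration; 1000 is simply the ceiling seen in the output, which as far as I can tell matches the default of MaxConcurrentChecksPerPoller.

```python
import re

# Matches titles such as:
#   agent poller #3 [got 14 values, queued 10 in 5 sec, awaiting 996]
# The lookbehind skips "http agent poller" lines in case they share the format.
TITLE_RE = re.compile(
    r"(?<!http )agent poller #(\d+) "
    r"\[got (\d+) values, queued (\d+) in (\d+) sec, awaiting (\d+)\]"
)


def summarize_agent_pollers(text, slot_limit=1000):
    """Total the got/queued/awaiting counters across all agent poller titles."""
    # Each row is (poller_id, got, queued, interval_sec, awaiting).
    rows = [tuple(int(g) for g in m.groups()) for m in TITLE_RE.finditer(text)]
    return {
        "pollers": len(rows),
        "got": sum(r[1] for r in rows),
        "queued": sum(r[2] for r in rows),
        "awaiting": sum(r[4] for r in rows),
        # Pollers whose awaiting count has reached the assumed ceiling.
        "at_limit": [r[0] for r in rows if r[4] >= slot_limit],
    }
```

Feeding it the saved summary text (for example `summarize_agent_pollers(open("summary.txt").read())`, where the file name is a placeholder) gives one set of totals per snapshot, so snapshots taken a few minutes apart can be compared directly.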