Hi there,
Could somebody please shed some light on an issue I'm trying to understand and solve?
I've got quite a large deployment where, just recently, monitoring items have stopped getting data on time, which results in huge data gaps.
The workload is spread across 3 Zabbix proxy servers.
Just after a fresh reboot of the entire environment there is a burst of activity across those 3 Zabbix proxy servers and data comes through.
However, after some time the whole system starts to deteriorate and comes to an almost full halt, with only occasional items managing to collect data.
Increasing the number of async agent pollers did not help (I increased it from 8 to 20); they all get clogged up over time.
Interestingly, server resources don't seem to be the constraint: CPU and RAM are underutilized and the Zabbix proxy servers seem idle.
1. How can the "awaiting" value of ~1000, maxed out almost all the time, be explained? Why are those items not picked up and moved out to the queue?
2. Assuming they are problematic items that occupy the slots because they get close to the allotted timeout (Timeout=10 in my zabbix_proxy.conf, though I also tried bringing it down to 3s), why are they not moved out to the Unreachable pollers for later attempts?
3. How can I check what holds back those 1000 items on a poller and prevents them from being processed?
NOTE: When I open one of the hosts in the GUI, pick an item that is missing recent data (e.g. CPU) and push the "Get value and test" button, the return value appears instantaneously.
Below is an example from one of the Zabbix proxy servers, with a summary of the Zabbix processes running on it (also see the screenshots that follow).
Code:
Parameter                                            Value     Details
=========                                            =====     =======
Zabbix server is running                             Yes       zabbix-srv:10051
Zabbix server version                                7.0.25    New update available
Zabbix frontend version                              7.0.25    New update available
Latest release                                       7.0.26    Release notes
Number of hosts (enabled/disabled)                   2756      2729 / 27
Number of templates                                  433
Number of items (enabled/disabled/not supported)     301001    282348 / 4351 / 14302
Number of triggers (enabled/disabled [problem/ok])   103761    91473 / 12288 [526 / 90947]
Required server performance, new values per second   2486.43
High availability cluster                            Disabled
Code:
cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
Code:
free -h
        total    used     free     shared   buff/cache   available
Mem:    15Gi     9.1Gi    655Mi    129Mi    5.6Gi        5.8Gi
Swap:   9Gi      49Mi     9Gi
Code:
cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Code:
=== Load: load average: 1.35, 1.07, 1.11 CPU idle: 83.3 id ===
### Agent pollers ### Active: 20 Idle: 1
agent poller #1 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #2 [got 18 values, queued 18 in 5 sec, awaiting 1000]
agent poller #3 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #4 [got 13 values, queued 13 in 5 sec, awaiting 1000]
agent poller #5 [got 2 values, queued 2 in 5 sec, awaiting 1000]
agent poller #6 [got 2 values, queued 4 in 5 sec, awaiting 1000]
agent poller #7 [got 6 values, queued 6 in 5 sec, awaiting 1000]
agent poller #8 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #9 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #10 [got 2 values, queued 2 in 5 sec, awaiting 1000]
agent poller #11 [got 2 values, queued 2 in 5 sec, awaiting 1000]
agent poller #12 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #13 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #14 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #15 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #16 [got 29 values, queued 29 in 5 sec, awaiting 1000]
agent poller #17 [got 5 values, queued 5 in 5 sec, awaiting 1000]
agent poller #18 [got 2 values, queued 2 in 5 sec, awaiting 1000]
agent poller #19 [got 1 values, queued 0 in 5 sec, awaiting 999]
agent poller #20 [got 28 values, queued 28 in 5 sec, awaiting 1000]
### HTTP agent pollers ### Active: 0 Idle: 1
### SNMP pollers ### Active: 0 Idle: 1
### Classic pollers ### Active: 1 Idle: 9
poller #25 [got 0 values in 0.000016 sec, getting values]
### Unreachable pollers ### Active: 1 Idle: 9
unreachable poller #15 [got 0 values in 0.000033 sec, getting values]
### Trappers ### Active: 0 Idle: 10
### Preprocessing manager ###
preprocessing manager #1 [queued 147, processed 168 values, idle 5.067374 sec during 5.080943 sec]
### TCP ###
ESTABLISHED: 11 TIME_WAIT: 7185
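In case it helps anyone reproduce the numbers, here is a minimal Python sketch that totals the agent poller process titles shown above. The regular expression is only my assumption about the title format, based on the lines in this post. The function name `summarize_pollers` and the three-line sample are illustrative and not part of Zabbix.

```python
import re

# Assumed title format, taken from the lines in this post, e.g.:
#   agent poller #6 [got 2 values, queued 4 in 5 sec, awaiting 1000]
# The lookbehind skips "http agent poller" titles in case they share this format.
POLLER_RE = re.compile(
    r"(?<!http )agent poller #(\d+) "
    r"\[got (\d+) values, queued (\d+) in (\d+) sec, awaiting (\d+)\]"
)


def summarize_pollers(text):
    """Total the 'got' and 'awaiting' counters of all agent poller titles in text."""
    rows = [tuple(int(g) for g in m.groups()) for m in POLLER_RE.finditer(text)]
    total_got = sum(r[1] for r in rows)
    total_awaiting = sum(r[4] for r in rows)
    # All titles report the same window length; take the largest to be safe.
    window = max((r[3] for r in rows), default=0)
    return {
        "pollers": len(rows),
        "got": total_got,
        "awaiting": total_awaiting,
        "values_per_sec": total_got / window if window else 0.0,
    }


# Three lines copied from the output above, used only as a smoke test.
sample = """
agent poller #1 [got 1 values, queued 1 in 5 sec, awaiting 1000]
agent poller #2 [got 18 values, queued 18 in 5 sec, awaiting 1000]
agent poller #19 [got 1 values, queued 0 in 5 sec, awaiting 999]
"""
summary = summarize_pollers(sample)
print(summary)
```

Fed all 20 agent poller lines from the output above, this adds up to 118 values collected in the 5-second window (about 23.6 values per second) with 19999 checks awaiting.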
Code:
zabbix_proxy -R diaginfo
== history cache diagnostic information ==
Items:0 values:941 time:0.000042
Memory.data:
size: free:536870528 used:0
chunks: free:1 used:0 min:536870528 max:536870528
buckets:
256+:1
Memory.index:
size: free:4145072 used:48736
chunks: free:3 used:4 min:59656 max:3981456
buckets:
256+:3
Top.values:
==
== preprocessing diagnostic information ==
Cached items:33111 pending tasks:0 finished tasks:0 task sequences:0 queued count:3262029 queued size:303204312 direct count:48959 direct size:782045102 history size:1518458 time:0.023078
Top.sequences:
Top.peak:
itemid:14265032 tasks:2
itemid:11402899 tasks:2
Top.values_num:
itemid:1398646 values_num:78
itemid:14433803 values_num:78
itemid:14433785 values_num:78
itemid:14392035 values_num:78
itemid:666550 values_num:78
itemid:5014657 values_num:78
itemid:4991563 values_num:78
itemid:8078952 values_num:78
itemid:1398794 values_num:78
itemid:973346 values_num:78
itemid:14392032 values_num:78
itemid:14433786 values_num:78
itemid:1398331 values_num:78
itemid:14433779 values_num:78
itemid:14433798 values_num:78
itemid:11407413 values_num:78
itemid:665912 values_num:78
itemid:14392047 values_num:78
itemid:14392048 values_num:78
itemid:665978 values_num:78
itemid:7442505 values_num:78
itemid:5679346 values_num:78
itemid:14392036 values_num:78
itemid:14392037 values_num:78
itemid:5679629 values_num:78
Top.values_sz:
itemid:5680793 values_sz:731492
itemid:14265032 values_sz:594896
itemid:7030418 values_sz:354208
itemid:6105421 values_sz:282324
itemid:14327602 values_sz:270972
itemid:6105353 values_sz:268400
itemid:13849001 values_sz:265888
itemid:1144283 values_sz:262171
itemid:5835317 values_sz:259812
itemid:6685189 values_sz:256476
itemid:14770706 values_sz:249500
itemid:14320359 values_sz:244790
itemid:13928290 values_sz:244784
itemid:14320648 values_sz:244784
itemid:14791985 values_sz:244784
itemid:5903930 values_sz:241515
itemid:9352624 values_sz:241094
itemid:5702469 values_sz:238616
itemid:9723496 values_sz:237712
itemid:14324374 values_sz:237287
itemid:11120242 values_sz:235750
itemid:11829377 values_sz:235192
itemid:6587899 values_sz:233548
itemid:8376820 values_sz:233328
itemid:6587830 values_sz:232674
Top.time_ms:
itemid:9259610 time_ms:10
itemid:14194720 time_ms:10
itemid:9272560 time_ms:10
itemid:13152724 time_ms:10
itemid:15042843 time_ms:10
itemid:583540 time_ms:10
itemid:10612986 time_ms:10
itemid:9279594 time_ms:10
itemid:12072248 time_ms:10
itemid:12072041 time_ms:10
itemid:9265000 time_ms:10
itemid:12073827 time_ms:10
itemid:14975884 time_ms:10
itemid:759194 time_ms:10
itemid:12074596 time_ms:10
itemid:14391438 time_ms:10
itemid:11171728 time_ms:10
itemid:10042786 time_ms:10
itemid:11829324 time_ms:10
itemid:10671016 time_ms:10
itemid:12899676 time_ms:10
itemid:9887823 time_ms:10
itemid:12074346 time_ms:10
itemid:15043067 time_ms:10
itemid:5376149 time_ms:10
Top.total_ms:
itemid:14265032 total_ms:30
itemid:14433504 total_ms:30
itemid:14433507 total_ms:30
itemid:6436141 total_ms:30
itemid:13332097 total_ms:20
itemid:14433505 total_ms:20
itemid:11829324 total_ms:20
itemid:14327552 total_ms:20
itemid:14433516 total_ms:20
itemid:666616 total_ms:20
itemid:14265239 total_ms:10
itemid:13599084 total_ms:10
itemid:6587709 total_ms:10
itemid:14194720 total_ms:10
itemid:9272560 total_ms:10
itemid:7742752 total_ms:10
itemid:13152724 total_ms:10
itemid:583540 total_ms:10
itemid:6763783 total_ms:10
itemid:666550 total_ms:10
itemid:759900 total_ms:10
itemid:14490555 total_ms:10
itemid:15566402 total_ms:10
itemid:14433786 total_ms:10
itemid:8865437 total_ms:10
==
== locks diagnostic information ==
Locks:
ZBX_MUTEX_LOG:0x7facec1d0000
ZBX_MUTEX_CACHE:0x7facec1d0028
ZBX_MUTEX_TRENDS:0x7facec1d0050
ZBX_MUTEX_CACHE_IDS:0x7facec1d0078
ZBX_MUTEX_SELFMON:0x7facec1d00a0
ZBX_MUTEX_CPUSTATS:0x7facec1d00c8
ZBX_MUTEX_DISKSTATS:0x7facec1d00f0
ZBX_MUTEX_VALUECACHE:0x7facec1d0118
ZBX_MUTEX_VMWARE:0x7facec1d0140
ZBX_MUTEX_SQLITE3:0x7facec1d0168
ZBX_MUTEX_PROCSTAT:0x7facec1d0190
ZBX_MUTEX_PROXY_HISTORY:0x7facec1d01b8
ZBX_MUTEX_MODBUS:0x7facec1d01e0
ZBX_MUTEX_TREND_FUNC:0x7facec1d0208
ZBX_MUTEX_REMOTE_COMMANDS:0x7facec1d0230
ZBX_MUTEX_PROXY_BUFFER:0x7facec1d0258
ZBX_MUTEX_VPS_MONITOR:0x7facec1d0280
ZBX_RWLOCK_CONFIG:0x7facec1d02a8
ZBX_RWLOCK_CONFIG_HISTORY:0x7facec1d02e0
ZBX_RWLOCK_VALUECACHE:0x7facec1d0318
==
== proxy buffer diagnostic information ==
Memory:
size: free:536840904 used:21128
chunks: free:1 used:530 min:536840904 max:536840904
buckets:
256+:1
==