Hi!
I use Zabbix 6.0 with PostgreSQL, and Elasticsearch 7 as the history storage.
My config is:
3 Zabbix servers with HA manager
20 Zabbix proxies
PostgreSQL 13.4
Elasticsearch 7.10.2
OS: Oracle Linux 9.1
Over 9000 monitored servers, 1.4M items, 800k triggers.
With this many servers, triggers, and items, it is not feasible to keep configuration data and history in a single database.
I see strange Zabbix server behaviour when I restart the currently active server.
Zabbix seems to stop all activity with Elasticsearch data: graphs become empty, item value history is not updated, and indexing and search rates in ELK drop close to zero.
There are no errors in the logs and no resource pressure (CPU, RAM, network). Most history syncers stay idle, and several of them take 30 seconds or more to process about 100-1000 items/triggers.
This lasts about 10 minutes. After that, search rates in ELK become very high (200000/s), while indexing rates stay at about 1000/s (the normal rate is about 50000/s).
The history syncers are syncing history too slowly:
systemctl status zabbix-server.service | grep 'history sync'
├─241529 "/usr/sbin/zabbix_server: history syncer #1 [processed 35 values, 420 triggers in 93.986136 sec, syncing history]"
├─241530 "/usr/sbin/zabbix_server: history syncer #2 [processed 182 values, 692 triggers in 161.063722 sec, syncing history]"
├─241531 "/usr/sbin/zabbix_server: history syncer #3 [processed 4 values, 429 triggers in 103.734200 sec, syncing history]"
├─241532 "/usr/sbin/zabbix_server: history syncer #4 [processed 69 values, 484 triggers in 115.615771 sec, syncing history]"
├─241533 "/usr/sbin/zabbix_server: history syncer #5 [processed 23 values, 345 triggers in 53.282760 sec, syncing history]"
├─241534 "/usr/sbin/zabbix_server: history syncer #6 [processed 8 values, 423 triggers in 91.310238 sec, syncing history]"
├─241535 "/usr/sbin/zabbix_server: history syncer #7 [processed 538 values, 742 triggers in 141.932219 sec, syncing history]"
├─241536 "/usr/sbin/zabbix_server: history syncer #8 [processed 20 values, 379 triggers in 85.259552 sec, syncing history]"
├─241537 "/usr/sbin/zabbix_server: history syncer #9 [processed 68 values, 479 triggers in 106.694512 sec, syncing history]"
├─241538 "/usr/sbin/zabbix_server: history syncer #10 [processed 11 values, 426 triggers in 98.531984 sec, syncing history]"
├─241539 "/usr/sbin/zabbix_server: history syncer #11 [processed 194 values, 601 triggers in 173.374454 sec, syncing history]"
├─241540 "/usr/sbin/zabbix_server: history syncer #12 [processed 160 values, 567 triggers in 162.175727 sec, syncing history]"
├─241541 "/usr/sbin/zabbix_server: history syncer #13 [processed 1 values, 413 triggers in 101.494687 sec, syncing history]"
├─241542 "/usr/sbin/zabbix_server: history syncer #14 [processed 28 values, 367 triggers in 92.811734 sec, syncing history]"
├─241543 "/usr/sbin/zabbix_server: history syncer #15 [processed 111 values, 503 triggers in 130.510268 sec, syncing history]"
..........
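To put a number on how slow this is, the syncer status lines above can be parsed and turned into a values-per-second figure. This is only a minimal sketch: the regular expression is written against the process titles exactly as they appear in this output, and the function names (`parse_syncers`, `values_per_second`) are my own, not part of Zabbix.

```python
import re

# Matches the history syncer process titles as shown in the systemctl output.
SYNCER_RE = re.compile(
    r"history syncer #(?P<num>\d+) \[processed (?P<values>\d+) values, "
    r"(?P<triggers>\d+) triggers in (?P<secs>[\d.]+) sec"
)

# Two lines copied from the output above, used as sample input.
SAMPLE = '''
├─241529 "/usr/sbin/zabbix_server: history syncer #1 [processed 35 values, 420 triggers in 93.986136 sec, syncing history]"
├─241530 "/usr/sbin/zabbix_server: history syncer #2 [processed 182 values, 692 triggers in 161.063722 sec, syncing history]"
'''


def parse_syncers(text):
    """Return a list of (syncer number, values, triggers, seconds) tuples."""
    rows = []
    for m in SYNCER_RE.finditer(text):
        rows.append(
            (int(m["num"]), int(m["values"]), int(m["triggers"]), float(m["secs"]))
        )
    return rows


def values_per_second(rows):
    """Total values divided by total busy seconds across the given syncers."""
    total_values = sum(r[1] for r in rows)
    total_secs = sum(r[3] for r in rows)
    return total_values / total_secs if total_secs else 0.0


if __name__ == "__main__":
    rows = parse_syncers(SAMPLE)
    print(f"{len(rows)} syncers, {values_per_second(rows):.2f} values/s per syncer")
```

In practice the input would be the full output of `systemctl status zabbix-server.service | grep 'history sync'` rather than the two sample lines.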
Elasticsearch search and indexing latency is about 0.68 ms - 0.74 ms.
Zabbix server reports history syncer utilization of about 100%.
Zabbix server resources:
CPU about 5% - 20%
RAM 25%
Network in/out - 2% - 25% (20M - 600M)
The database and Elasticsearch hosts show similar resource usage.
History data in the Zabbix web interface stays empty. This causes all triggers to go to the True value, and Zabbix starts sending thousands of false-positive alerts.
When I try to restart the server again, I see in the logs that it is syncing history:
241554:20240116:145542.584 syncing history data... 90.624293%
241554:20240116:145557.256 syncing history data... 90.779234%
241554:20240116:145617.928 syncing history data... 90.934176%
241554:20240116:145631.608 syncing history data... 91.089118%
241554:20240116:145646.871 syncing history data... 91.244059%
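The progress lines above are enough for a rough estimate of how long the shutdown sync still needs. The sketch below assumes the log format shown here (`PID:YYYYMMDD:HHMMSS.mmm syncing history data... N%`) and a constant sync rate, which may not hold in practice; the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches the "syncing history data" progress lines from the server log.
LINE_RE = re.compile(
    r"^\s*\d+:(\d{8}:\d{6}\.\d{3}) syncing history data\.\.\. ([\d.]+)%",
    re.MULTILINE,
)

# The five log lines quoted above.
SAMPLE_LOG = """\
241554:20240116:145542.584 syncing history data... 90.624293%
241554:20240116:145557.256 syncing history data... 90.779234%
241554:20240116:145617.928 syncing history data... 90.934176%
241554:20240116:145631.608 syncing history data... 91.089118%
241554:20240116:145646.871 syncing history data... 91.244059%
"""


def parse_progress(log_text):
    """Return a list of (timestamp, percent done) pairs in log order."""
    points = []
    for ts, pct in LINE_RE.findall(log_text):
        points.append((datetime.strptime(ts, "%Y%m%d:%H%M%S.%f"), float(pct)))
    return points


def eta_seconds(points):
    """Linear estimate of the seconds left, or None if it cannot be computed."""
    if len(points) < 2:
        return None
    (t0, p0), (t1, p1) = points[0], points[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or p1 <= p0:
        return None
    rate = (p1 - p0) / elapsed  # percent per second
    return (100.0 - p1) / rate


if __name__ == "__main__":
    eta = eta_seconds(parse_progress(SAMPLE_LOG))
    print("estimated seconds remaining:", None if eta is None else round(eta))
```

For the five lines quoted here the estimate comes out at roughly 15 minutes for the remaining 9%, assuming the rate does not change.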
After this process finishes, the server restart completes, but there is still no history data in the web interface for my items and server graphs. After that the active server changes (or not), and the whole process described above starts from the beginning.
This state lasts for 1-4 hours. During the last 20-40 minutes, Zabbix starts indexing, search rates return to normal, and values in the Zabbix web interface start filling in.
My Zabbix server cache and process configuration:
# performance options
StartDBSyncers=100
StartPollers=100
StartPreprocessors=100
StartPollersUnreachable=200
StartHistoryPollers=200
StartTrappers=10
StartPingers=100
StartDiscoverers=20
StartHTTPPollers=5
StartTimers=20
StartEscalators=50
StartAlerters=20
SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
StartSNMPTrapper=1
MaxHousekeeperDelete=10000
CacheSize=6G
HistoryCacheSize=2G
HistoryIndexCacheSize=2G
TrendCacheSize=2G
TrendFunctionCacheSize=1G
ValueCacheSize=2G
Timeout=30
UnreachablePeriod=70
UnavailableDelay=120
UnreachableDelay=10
AlertScriptsPath=/etc/zabbix/alertscripts
ExternalScripts=/etc/zabbix/externalscripts
LogSlowQueries=3000
StartProxyPollers=30
ProxyConfigFrequency=600
ProxyDataFrequency=30
StartLLDProcessors=50
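For reference, the cache settings above can be totalled with a short script to see how much shared memory the server is asked to allocate. This is a sketch under assumptions: it treats the K/M/G suffixes as multiples of 1024, and the set of keys counted as caches (`CACHE_KEYS`) is my own selection from the config shown here.

```python
import re

# Parameters from the config above that size shared-memory caches (my selection).
CACHE_KEYS = {
    "CacheSize",
    "HistoryCacheSize",
    "HistoryIndexCacheSize",
    "TrendCacheSize",
    "TrendFunctionCacheSize",
    "ValueCacheSize",
}

# Assumption: size suffixes are binary multiples.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

# Cache lines copied from the config above.
SAMPLE_CONF = """\
# performance options
StartDBSyncers=100
CacheSize=6G
HistoryCacheSize=2G
HistoryIndexCacheSize=2G
TrendCacheSize=2G
TrendFunctionCacheSize=1G
ValueCacheSize=2G
Timeout=30
"""


def parse_size(value):
    """Convert a value such as '6G' or '512M' to bytes."""
    m = re.fullmatch(r"(\d+)([KMG]?)", value.strip())
    if not m:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(m.group(1)) * UNITS[m.group(2)]


def total_cache_bytes(conf_text):
    """Sum the sizes of all cache parameters found in the config text."""
    total = 0
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in CACHE_KEYS:
            total += parse_size(value)
    return total


if __name__ == "__main__":
    print(f"total cache size: {total_cache_bytes(SAMPLE_CONF) / 1024**3:.0f} GiB")
```

With the values from my config this adds up to 15 GiB across the six caches.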
I tried decreasing the caches to their default values, but the server crashes with that config. Then I tried decreasing only some of the values, based on which ones cause the server to fail, but that hasn't solved my problem.
Is there a config variable or another method to fix this problem?