Zabbix server 6.0 syncing history from Elasticsearch too slow

  • artem.kh
    Junior Member
    • Jan 2024
    • 2

    #1

    Hi!
    I use Zabbix 6.0 with PostgreSQL and Elasticsearch 7 as a history storage.
    My config is:
    3 Zabbix servers with HA manager
    20 Zabbix proxies
    PostgreSQL 13.4
    Elasticsearch 7.10.2
    OS: Oracle Linux 9.1
    Over 9000 monitored servers, 1.4M items, 800k triggers.

    With this many servers, triggers, and items, a single database cannot hold both data and history.

    I see strange Zabbix server behaviour when I restart the currently active server.
    Zabbix seems to stop all activity with Elasticsearch data: graphs become empty, item value history doesn't update, and indexing and search rates in ELK drop close to zero.
    There are no errors in the logs and no resource exhaustion (CPU, RAM, network). Most history syncers stay idle, and several of them take 30 seconds or more to process about 100-1000 items/triggers.
    This lasts about 10 minutes. After that, the search rate in ELK becomes very high (200000/s), while the indexing rate stays at about 1000/s (the normal rate is about 50000/s).
    History syncers are syncing history too slowly:

    systemctl status zabbix-server.service | grep 'history sync'
    ├─241529 "/usr/sbin/zabbix_server: history syncer #1 [processed 35 values, 420 triggers in 93.986136 sec, syncing history]"
    ├─241530 "/usr/sbin/zabbix_server: history syncer #2 [processed 182 values, 692 triggers in 161.063722 sec, syncing history]"
    ├─241531 "/usr/sbin/zabbix_server: history syncer #3 [processed 4 values, 429 triggers in 103.734200 sec, syncing history]"
    ├─241532 "/usr/sbin/zabbix_server: history syncer #4 [processed 69 values, 484 triggers in 115.615771 sec, syncing history]"
    ├─241533 "/usr/sbin/zabbix_server: history syncer #5 [processed 23 values, 345 triggers in 53.282760 sec, syncing history]"
    ├─241534 "/usr/sbin/zabbix_server: history syncer #6 [processed 8 values, 423 triggers in 91.310238 sec, syncing history]"
    ├─241535 "/usr/sbin/zabbix_server: history syncer #7 [processed 538 values, 742 triggers in 141.932219 sec, syncing history]"
    ├─241536 "/usr/sbin/zabbix_server: history syncer #8 [processed 20 values, 379 triggers in 85.259552 sec, syncing history]"
    ├─241537 "/usr/sbin/zabbix_server: history syncer #9 [processed 68 values, 479 triggers in 106.694512 sec, syncing history]"
    ├─241538 "/usr/sbin/zabbix_server: history syncer #10 [processed 11 values, 426 triggers in 98.531984 sec, syncing history]"
    ├─241539 "/usr/sbin/zabbix_server: history syncer #11 [processed 194 values, 601 triggers in 173.374454 sec, syncing history]"
    ├─241540 "/usr/sbin/zabbix_server: history syncer #12 [processed 160 values, 567 triggers in 162.175727 sec, syncing history]"
    ├─241541 "/usr/sbin/zabbix_server: history syncer #13 [processed 1 values, 413 triggers in 101.494687 sec, syncing history]"
    ├─241542 "/usr/sbin/zabbix_server: history syncer #14 [processed 28 values, 367 triggers in 92.811734 sec, syncing history]"
    ├─241543 "/usr/sbin/zabbix_server: history syncer #15 [processed 111 values, 503 triggers in 130.510268 sec, syncing history]"
    ..........
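To put numbers on the listing above, here is a minimal Python sketch that parses history syncer process titles and computes values per second for each syncer. It is not part of Zabbix: the regular expression, the function name `parse_syncers`, and the field names are assumptions based only on the sample lines in this post (three of them are copied into `SAMPLE`, without the tree glyphs).

```python
import re

# Pattern derived from the process titles shown above (an assumption,
# not an official Zabbix format specification).
SYNCER_RE = re.compile(
    r"history syncer #(\d+) \[processed (\d+) values, (\d+) triggers "
    r"in (\d+(?:\.\d+)?) sec"
)


def parse_syncers(text):
    """Return one dict per history syncer line found in text."""
    rows = []
    for num, values, triggers, secs in SYNCER_RE.findall(text):
        seconds = float(secs)
        rows.append({
            "syncer": int(num),
            "values": int(values),
            "triggers": int(triggers),
            "seconds": seconds,
            # Guard against a zero-length interval.
            "values_per_sec": int(values) / seconds if seconds else 0.0,
        })
    return rows


# Three lines copied from the output above.
SAMPLE = '''
241529 "/usr/sbin/zabbix_server: history syncer #1 [processed 35 values, 420 triggers in 93.986136 sec, syncing history]"
241531 "/usr/sbin/zabbix_server: history syncer #3 [processed 4 values, 429 triggers in 103.734200 sec, syncing history]"
241541 "/usr/sbin/zabbix_server: history syncer #13 [processed 1 values, 413 triggers in 101.494687 sec, syncing history]"
'''

rows = parse_syncers(SAMPLE)
total_values = sum(r["values"] for r in rows)
total_triggers = sum(r["triggers"] for r in rows)

# Slowest syncers first.
for r in sorted(rows, key=lambda r: r["values_per_sec"]):
    print(f"syncer #{r['syncer']}: {r['values_per_sec']:.3f} values/s, "
          f"{r['triggers']} triggers in {r['seconds']:.1f} s")
print(f"total: {total_values} values, {total_triggers} triggers")
```

The same function could be fed the full `systemctl status` output instead of `SAMPLE`. Every syncer in the sample handles well under one value per second, which is the slowdown described here.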

    Elasticsearch search and indexing latency is about 0.68ms - 0.74ms.
    Zabbix server reports history syncer utilization of about 100%.
    Zabbix server resources:
    CPU about 5% - 20%
    RAM 25%
    Network in/out - 2% - 25% (20M - 600M)

    The database and Elasticsearch hosts show similar resource usage.

    History data in the Zabbix web interface stays empty. This causes all triggers to evaluate to True, and Zabbix starts sending thousands of false-positive alerts.
    When I restart the server again, I see in the logs that it is syncing history:

    241554:20240116:145542.584 syncing history data... 90.624293%
    241554:20240116:145557.256 syncing history data... 90.779234%
    241554:20240116:145617.928 syncing history data... 90.934176%
    241554:20240116:145631.608 syncing history data... 91.089118%
    241554:20240116:145646.871 syncing history data... 91.244059%
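For a rough idea of how long such a sync will take, the progress lines above can be extrapolated linearly. The Python sketch below is only an illustration: the log format is inferred from the five lines in this post, the names `parse_progress` and `estimate` are hypothetical, and it assumes that progress keeps the same pace, which may not hold.

```python
import re
from datetime import datetime

# "PID:YYYYMMDD:HHMMSS.mmm syncing history data... NN.NNNNNN%"
# (format inferred from the log lines above).
LOG_RE = re.compile(
    r"^\s*\d+:(\d{8}:\d{6}\.\d{3}) syncing history data\.\.\. "
    r"(\d+(?:\.\d+)?)%",
    re.MULTILINE,
)


def parse_progress(text):
    """Return a list of (timestamp, percent) tuples from the log text."""
    points = []
    for stamp, pct in LOG_RE.findall(text):
        when = datetime.strptime(stamp, "%Y%m%d:%H%M%S.%f")
        points.append((when, float(pct)))
    return points


def estimate(points):
    """Linear extrapolation between the first and last progress lines.

    Returns (elapsed seconds, percent per second, seconds left to 100%).
    """
    if len(points) < 2:
        raise ValueError("need at least two progress lines")
    (t0, p0), (t1, p1) = points[0], points[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or p1 <= p0:
        raise ValueError("no measurable progress between samples")
    rate = (p1 - p0) / elapsed
    return elapsed, rate, (100.0 - p1) / rate


# The five log lines from this post.
SAMPLE = """
241554:20240116:145542.584 syncing history data... 90.624293%
241554:20240116:145557.256 syncing history data... 90.779234%
241554:20240116:145617.928 syncing history data... 90.934176%
241554:20240116:145631.608 syncing history data... 91.089118%
241554:20240116:145646.871 syncing history data... 91.244059%
"""

points = parse_progress(SAMPLE)
elapsed, rate, eta = estimate(points)
print(f"{rate:.5f} %/s over {elapsed:.1f} s, about {eta / 60:.1f} min left")
```

For the five lines above this gives roughly 15 minutes for the remaining 9%, assuming the pace stays constant.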

    After this process finishes, the server restart completes, but there is still no history data in the web interface for my items and server graphs. After that, the active server changes (or not) and the whole process starts again from the beginning.

    This state lasts for 1-4 hours. During the last 20-40 minutes, indexing and search rates return to normal, and values in the Zabbix web interface start filling in.

    My Zabbix server cache and process configuration:

    # performance options
    StartDBSyncers=100
    StartPollers=100
    StartPreprocessors=100
    StartPollersUnreachable=200
    StartHistoryPollers=200
    StartTrappers=10
    StartPingers=100
    StartDiscoverers=20
    StartHTTPPollers=5
    StartTimers=20
    StartEscalators=50
    StartAlerters=20
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    StartSNMPTrapper=1
    MaxHousekeeperDelete=10000
    CacheSize=6G
    HistoryCacheSize=2G
    HistoryIndexCacheSize=2G
    TrendCacheSize=2G
    TrendFunctionCacheSize=1G
    ValueCacheSize=2G
    Timeout=30
    UnreachablePeriod=70
    UnavailableDelay=120
    UnreachableDelay=10
    AlertScriptsPath=/etc/zabbix/alertscripts
    ExternalScripts=/etc/zabbix/externalscripts
    LogSlowQueries=3000
    StartProxyPollers=30
    ProxyConfigFrequency=600
    ProxyDataFrequency=30
    StartLLDProcessors=50

    I've tried decreasing the caches to their default values, but the server crashes with that config. Then I tried decreasing only those values that make the server fail, but that hasn't solved my problem.

    Is there a config variable or another way to fix this problem?
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    First of all, ELK stack support is experimental...

    Originally posted by artem.kh
    Hi!
    I use Zabbix 6.0 with PostgreSQL and Elasticsearch 7 as a history storage.
    My config is:
    3 Zabbix servers with HA manager
    20 Zabbix proxies
    PostgreSQL 13.4
    Elasticsearch 7.10.2
    OS: Oracle Linux 9.1
    Over 9000 monitored servers, 1.4M items, 800k triggers.

    Such amount of servers, triggers, items does not allow using of one database with data and history.


    And this does not hold true either. I have very similar numbers: a few more hosts, somewhat fewer items and triggers. PG with Timescale handles that easily.

    I doubt there is anything to set in your Zabbix config. My gut feeling says your ELK stack lags (but it may as well be an empty stomach rumbling).
    Both your PG and ELK versions are quite far behind. Updating them may bring some improvements.


    • artem.kh
      Junior Member
      • Jan 2024
      • 2

      #3
      Thank you for your response.

      Yes, I understand that ELK support is experimental.
      I chose ELK because TimescaleDB is a database extension with its own restrictions, such as replication problems, locks, and scaling complexity.
      I have some doubts about its performance and scaling without sharding.

      I've collected a lot of statistics and run many experiments. I found no metric or error that could be the cause of the lag.
      I have system resource graphs for the Zabbix server, PG, and ELK, ELK performance graphs, and graphs of Zabbix errors on sending to ELK based on the Zabbix logs.
      None of these helped me find the cause of this problem.

      What's more, when Zabbix starts actively downloading history, its performance becomes very high: triggers and values in the history syncer stats are about 20000-30000 per 10 sec. ELK latency stays good and doesn't differ from the usual values. This means that Zabbix can get history very fast, but for unknown reasons it doesn't. It seems like Zabbix is simply waiting for something...
      I've also tried querying data from ELK the way Zabbix does: I got results very fast.

      My last thought is a wrong cache setting. Maybe that is the cause.

      Could you please share your approximate performance with TSDB and its size? Do you use replication and HA?
      Maybe it will work for me too.


      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4807

        #4
        Performance as NVPS? 4800+. DB size is currently ~650G, but we do not keep very long history, just 14 days + 1 year of trends. PG14.5 + TS2.7.2 + pg_auto_failover (which manages failovers and replication). Using compression in TS reduced DB size from ~1.4T to current size... hosts themselves are 16cpu-s and 128G memory... Maybe a bit oversized, but there were "reasons"..
        I am pretty sure someone with a bit more PG knowledge can squeeze out some hidden performance there...


        • Jun.Liu
          Member
          • Apr 2007
          • 91

          #5
          Originally posted by cyber
          Performance as NVPS? 4800+. DB size is currently ~650G, but we do not keep very long history, just 14 days + 1 year of trends. PG14.5 + TS2.7.2 + pg_auto_failover (which manages failovers and replication). Using compression in TS reduced DB size from ~1.4T to current size... hosts themselves are 16cpu-s and 128G memory... Maybe a bit oversized, but there were "reasons"..
          I am pretty sure someone with a bit more PG knowledge can squeeze out some hidden performance there...
          Just wondering, is it a standalone server or one with many proxies?


          • cyber
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Dec 2006
            • 4807

            #6
            Originally posted by Jun.Liu

            Just wondering, is it a standalone server or one with many proxies?
            There are ~20 proxies involved.


            • vso
              Zabbix developer
              • Aug 2016
              • 190

              #7
              How is it possible that 429 triggers were calculated for 4 values? Is it possible that those are time-based triggers? Maybe some delay should be introduced after restart so that they are calculated later, when the load is smaller and there is no actual cache warmup?
              ├─241531 "/usr/sbin/zabbix_server: history syncer #3 [processed 4 values, 429 triggers in 103.734200 sec, syncing history]"


              • cyber
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional
                • Dec 2006
                • 4807

                #8
                Originally posted by vso
                How is it possible that 429 triggers were calculated for 4 values?
                This one is even better:
                ├─241541 "/usr/sbin/zabbix_server: history syncer #13 [processed 1 values, 413 triggers in 101.494687 sec, syncing history]"
                I never thought that the text there means that those triggers and items are related... :P
                So how do I interpret this one? No new values, but a bunch of time-based triggers recalculated?
                Code:
                [processed 0 values, 844 triggers in 0.014566 sec, idle 1 sec]
                And this one? A load of data came in, almost all of which is related to some trigger, and it caused recalculation?
                Code:
                [processed 31363 values, 30691 triggers in 8.070808 sec, idle 1 sec]


                • vso
                  Zabbix developer
                  • Aug 2016
                  • 190

                  #9
                  There should be some improvements under ZBX-24549. Yes, triggers are usually recalculated due to new values, but there are also time-based triggers that are recalculated every 30 seconds. This can be a problem, especially after a restart; delaying their calculation or adding a way to control when they are calculated could help.
