Zabbix 3.0 StartTrappers are getting stuck processing data

  • ewestdal
    Junior Member
    • Jul 2016
    • 8

    #1

    Zabbix 3.0 StartTrappers are getting stuck processing data

    We currently handle about 7000 nvps in our environment, and that doesn't include the large number of zabbix trapper items we have on top of that. We have recently started having an issue where the zabbix trappers on our primary server get stuck processing data. As more of them get stuck, our busy trapper process monitor goes from around 20% used to 100% used. Once we hit 100% we have to recycle the zabbix primary server to fix the issue.

    [Attached screenshot: Capture.JPG]

    We currently have StartTrappers=50 and I'm wondering what other folks have theirs set to. Has anyone else had this problem, and do you have any suggestions?
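
    For reference, the trapper count is just the StartTrappers line in zabbix_server.conf, and a busy-trapper graph like ours normally comes from the standard internal item (the conf path below is the usual default and may differ per install):

        # current setting on the primary server
        grep -i '^StartTrappers' /etc/zabbix/zabbix_server.conf
        # the busy-trapper percentage is the stock internal item key:
        #   zabbix[process,trapper,avg,busy]
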
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    I think I may know what kind of bottleneck you are hitting. May I ask:
    - how many poller processes do you have started on the server?
    - how many items are monitored by the server? (and how many of them are passive, SNMP, or active checks?)
    - how many proxies do you have? (passive and active)
    - how many items and how many triggers? (in this case the ratio between those numbers is important)
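
    If it is easier, those counts can be pulled straight from the backend DB; a rough sketch, assuming a MySQL backend with the stock Zabbix schema (adjust credentials and DB name to your setup):

        # enabled items, enabled triggers and defined proxies
        mysql -u zabbix -p zabbix -e "
          SELECT COUNT(*) AS enabled_items    FROM items    WHERE status=0;
          SELECT COUNT(*) AS enabled_triggers FROM triggers WHERE status=0;
          SELECT COUNT(*) AS proxies          FROM hosts    WHERE status IN (5,6);  -- 5 = active proxy, 6 = passive proxy
        "
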
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • ewestdal
      Junior Member
      • Jul 2016
      • 8

      #3
      Kloczek,

      Here are the answers to your questions:
      • how many poller processes do you have started on the server?
        • we have 224 zabbix-related processes started on the server. More specifically we have:
          • 50 start pollers
          • 100 trappers
          • 50 history syncers
          • 1 escalator
          • 1 proxy poller
          • 1 self-monitoring
          • 2 vmware
          • 5 unreachable
          • 1 icmp poller
          • 1 http poller
          • 5 timers
          • and then a few other miscellaneous zabbix-related ones
      • how many items are monitored by the server? (and how many of them are passive, SNMP, or active checks?)
        • we currently have 1808477 items being monitored
        • we do not have any SNMP going into Zabbix
      • how many proxies do you have? (passive and active)
        • 27 proxies and they are all active
      • how many items and how many triggers? (in this case the ratio between those numbers is important)
        • 1808594 items
        • 874426 triggers

      Essentially, what we've noticed so far through debug logging and tcpdumps is that the trapper processes on the Zabbix primary are not actually stuck; they are just taking forever to process the data. We've been working on reducing that load. Let me know if there is any additional data that might help.
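
      For anyone hitting the same thing, this is roughly how we looked at it (a sketch; port 10051 is just the default trapper port and may differ in your setup):

          # raise the log level for the trapper processes only, no restart needed
          zabbix_server -R log_level_increase=trapper
          # ...reproduce the slowdown, then put it back
          zabbix_server -R log_level_decrease=trapper

          # watch proxy traffic arriving on the trapper port
          tcpdump -nn -i any tcp port 10051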


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        I'm assuming that with 1.8 million items you do not have any monitoring done directly by the server itself, apart from the zabbix server monitoring its own internals.
        1) The number of trappers should be lower than the number of proxies. Usually a ratio somewhere between 1:2 and 1:1 (trappers:proxies) is enough. This is especially the case with active proxies, where the proxy decides when it will push the next batch of monitoring data to the server.
        2) The number of history syncers should be no more than 2 * the number of CPU cores on the DB server when using MySQL 5.6 and below. Which type and version of DB backend are you using? More syncers may also cause congestion, which is easy to catch by looking at the number of table locks/s. Monitoring of the DB engine should show whether this is the case with your zabbix stack. You should also have a look at what the zabbix[preprocessing_queue] internal server metric shows.
        3) The number of other processes like icmp pingers and http pollers could be slashed to 0, as none of the monitoring handled by the server itself should be using those processes.
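
        As a sketch only (the values are illustrative and the DB core count is an assumption, not a recommendation), points 1-3 translate into zabbix_server.conf roughly like this; the Start* counts only take effect after a server restart:

            # zabbix_server.conf sketch, illustrative values only
            # trappers roughly 1:1 with the 27 active proxies
            StartTrappers=30
            # history syncers: 2 x DB CPU cores, assuming a 4-core DB host on MySQL <= 5.6
            StartDBSyncers=8
            # nothing run by the server itself should need icmp pingers or http pollers
            StartPingers=0
            StartHTTPPollers=0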

        The points above are not directly related to your main issue, but they may be part of the problem.

        OK, so now the main part. It is not documented anywhere in the zabbix documentation that poller processes are also responsible for processing triggers.
        I hit this while experimenting on my laptop (https://support.zabbix.com/browse/ZBX-14394), trying to minimise the memory footprint used by the zabbix server.
        In other words, to process data that has to be evaluated against trigger definitions fast enough (and you have a relatively high ratio of triggers to items), you may need to increase the number of pollers above 50. However, I would recommend really checking what happens on the DB side first, because congestion in trigger processing may have its root cause on the DB side (mainly, or as well).
        Before you try increasing pollers, check how many of those processes are busy. If they are saturated above roughly 80%, it is possible that this is your main bottleneck.
        Can you tell what your current poller utilisation is?
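
        If you want a rough shell-side check before touching the config (the authoritative number is the zabbix[process,poller,avg,busy] internal item, which the stock zabbix server template already graphs; the ps sampling below only looks at process titles, so treat it as an approximation):

            # list poller process titles, then count the ones not currently idle
            ps -C zabbix_server -o cmd= | grep ': poller'
            ps -C zabbix_server -o cmd= | grep ': poller' | grep -vc 'idle'
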
        Another possible cause of your issue is a DB backend that is not powerful enough. Questions in this area can be answered by looking at DB engine monitoring data. The main factor will be the ratio between read and write IOs on the storage layer. A well-tuned and well-architected DB engine should have no more than about one read IO per 20 write IOs (I usually try to keep this around 1:50).
        Slow trigger processing may be caused by too-high latency of write operations (like inserts and updates); however, it is not obvious to many people that the key to really low latency for those queries is to first get read IO latency down, which is only possible by having enough memory to cache most of the MRU/MFU data in memory without touching storage.
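
        If the backend is MySQL/InnoDB, a quick way to get that picture is from the server status counters (counter names below are InnoDB-specific; other engines expose different ones):

            # data file IO and buffer pool hit behaviour
            mysql -u zabbix -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
              ('Innodb_data_reads','Innodb_data_writes',
               'Innodb_buffer_pool_reads','Innodb_buffer_pool_read_requests');"
            # Innodb_data_reads vs Innodb_data_writes gives the read:write IO ratio;
            # Innodb_buffer_pool_reads staying tiny next to Innodb_buffer_pool_read_requests
            # means reads are served from memory, which is the goal described above.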

        PS. If you don't have good enough DB engine monitoring in place and you are using MySQL >= 5.7, you may try my Service MySQL template, which provides everything needed to diagnose your issue if it sits on the DB engine side.
        For diagnosing zabbix server bottlenecks you may also try my Service zabbix server template, which has a few more things than the standard OOTB zabbix template.
        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
        https://kloczek.wordpress.com/
        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
        My zabbix templates https://github.com/kloczek/zabbix-templates


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Yet another thought: you are using Zabbix 3.0. Many scalability issues have been solved after this major version. You should try to upgrade to at least 3.4.
          http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
          https://kloczek.wordpress.com/
          zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
          My zabbix templates https://github.com/kloczek/zabbix-templates


          • vso
            Zabbix developer
            • Aug 2016
            • 190

            #6
            How many low-level discovery items do you have?


            • ggmojki
              Junior Member
              • Jan 2019
              • 1

              #7
              How many proxies do you have?
