Graph Zabbix internal process busy% shows some processes working on 100%

  • jonathan.lopv
    Junior Member
    • Jan 2022
    • 6

    #1

    Hi.
    We're having some problems with our Zabbix server.
    Let me explain a little.
    We're running the front end on a VM with Debian 10, and our PostgreSQL database on a separate VM.
    We're monitoring around 5000 hosts via SNMP; our current NVPS is 564.65.
    The problem is that some internal processes go to 100% busy.
    The first is the housekeeper: it goes to 100% and keeps running for days.
    The history syncer, LLD workers and availability manager also go to 100%.

    These are the changes we've made to the config file:
    - StartPollers=200
    - StartPollersUnreachable=300
    - StartPingers=20
    - StartTimers=10
    - StartDBSyncers=5
    - HistoryCacheSize=128M
    - HistoryIndexCacheSize=128M
    - TrendCacheSize=128M
    - ValueCacheSize=128M
    - Timeout=10


    We notice that the server sometimes runs slowly, but neither the front end nor the database is running out of resources.
    We'd like to know whether this behavior is caused by these processes running at 100%.



    I'll share some screenshots
    [Screenshot: System Info.png]
    [Screenshot: Zabbix Server Performance.png]
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    Why so many pollers? More unreachable pollers than normal ones? No proxies? Why? 5k SNMP hosts... that's work for 5+ proxies...
    Do you see housekeeper-related messages in the server log? Do you see slow queries? Anything abnormal in the PG logs?


    • jonathan.lopv
      Junior Member
      • Jan 2022
      • 6

      #3
      Hi Cyber

      1.- We're seeing these entries in the server log:
      28678:20220104:085205.634 slow query: 4.776096 sec, "commit;"
      28708:20220104:085205.634 slow query: 3.089144 sec, "commit;"
      29248:20220104:085205.634 slow query: 4.465507 sec, "commit;"
      28709:20220104:085205.634 slow query: 3.000932 sec, "commit;"
      28707:20220104:085205.634 slow query: 4.776214 sec, "commit;"
      28689:20220104:085205.634 slow query: 3.192286 sec, "commit;"
      28673:20220104:085205.634 slow query: 4.320673 sec, "commit;"

      1.1 We see the housekeeper start, but it never finishes:
      28692:20220103:121946.798 server #31 started [housekeeper #1]
      28692:20220103:124947.024 executing housekeeper

      1.1.1 We see this in the front end:
      Zabbix server: Utilization of housekeeper processes over 75% - 21h 2m 11s


      2.- And we see this in the PG logs:
      2022-01-02 01:12:17.641 CST [23776] ERROR: canceling autovacuum task.
      2022-01-02 01:12:17.641 CST [23776] CONTEXT: automatic vacuum of table «zabbix.public.escalations»

      3.- About StartPollersUnreachable: the config file says "At least one poller for unreachable hosts must be running if regular".
      We're monitoring wireless devices and we often lose communication with hosts.


      4.- About normal pollers: we don't know how many are too many or too few, and we're not sure whether there's a formula for setting these values based on NVPS or the number of monitored hosts.


      5.- We haven't looked at the proxies option yet, but we're going to investigate it.
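
For anyone who wants to quantify the slow-query entries quoted in point 1, here is a minimal Python sketch. It assumes the line layout seen in this post (`PID:YYYYMMDD:HHMMSS.mmm slow query: <seconds> sec, "<sql>"`); the function name `summarize_slow_queries` is made up for illustration, and the sample is three lines copied from the log above. It only groups durations by SQL text, which may be enough to show that the time is going into `commit;`.

```python
import re
from statistics import mean

# Layout assumed from the log excerpt in this post:
#   PID:YYYYMMDD:HHMMSS.mmm slow query: <seconds> sec, "<sql>"
SLOW_QUERY_RE = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) '
    r'slow query: (?P<seconds>\d+\.\d+) sec, "(?P<sql>.*)"\s*$'
)


def summarize_slow_queries(lines):
    """Group slow-query durations by SQL text; return count, mean and max."""
    durations = {}
    for line in lines:
        match = SLOW_QUERY_RE.match(line)
        if match is None:
            continue  # not a slow-query line, ignore it
        sql = match.group("sql")
        durations.setdefault(sql, []).append(float(match.group("seconds")))
    return {
        sql: {"count": len(values), "mean": mean(values), "max": max(values)}
        for sql, values in durations.items()
    }


# Three lines copied from the server log quoted above.
sample = [
    '28678:20220104:085205.634 slow query: 4.776096 sec, "commit;"',
    '28708:20220104:085205.634 slow query: 3.089144 sec, "commit;"',
    '28707:20220104:085205.634 slow query: 4.776214 sec, "commit;"',
]

summary = summarize_slow_queries(sample)
for sql, stats in summary.items():
    print(f"{sql!r}: n={stats['count']} "
          f"mean={stats['mean']:.3f}s max={stats['max']:.3f}s")
```

To run it against a real log, pass the open file object instead of `sample`; the function only needs an iterable of lines.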


      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4807

        #4
        I have a gut feeling that you should investigate what is going on in your DB. I am not a DB person, so I am not going to give you any advice there... It seems to have trouble cleaning up the escalations table; maybe it needs some manual intervention by a DBA...

        Of course you have to have unreachable pollers, but they are used only after normal pollers can no longer reach a host. Usually there are fewer unreachable pollers than normal ones. In my environment, the biggest proxies, which each handle about the same NVPS as your whole system, manage with 25 pollers + 12 unreachable ones... But they don't have to poll 5k hosts, just ~200...
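
The rule of thumb above (fewer unreachable pollers than normal ones) can be checked mechanically. The following is only a sketch: it assumes plain `Key=Value` lines as in `zabbix_server.conf`, the helper names `parse_conf` and `poller_warnings` are hypothetical, and the comparison is the heuristic from this thread, not an official Zabbix sizing formula.

```python
def parse_conf(text):
    """Parse 'Key=Value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def poller_warnings(settings):
    """Apply the thread's rule of thumb; this is not an official limit."""
    warnings = []
    pollers = settings.get("StartPollers")
    unreachable = settings.get("StartPollersUnreachable")
    # Only compare when both values are set explicitly in the file.
    if pollers is not None and unreachable is not None:
        if int(unreachable) > int(pollers):
            warnings.append(
                f"StartPollersUnreachable ({unreachable}) is higher than "
                f"StartPollers ({pollers})"
            )
    return warnings


# Values taken from post #1 of this thread.
conf_text = """
StartPollers=200
StartPollersUnreachable=300
StartPingers=20
"""

settings = parse_conf(conf_text)
for warning in poller_warnings(settings):
    print("warning:", warning)
```

With the values from post #1 this prints one warning; with the 25 + 12 split mentioned above it prints nothing.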

        I strongly suggest adding proxies to your environment. They take a huge load off your server, which can then deal only with trigger calculations etc.
