Zabbix proxy: Utilization of trapper data collector processes is 100

  • gonmd
    Junior Member
    • Aug 2024
    • 9

    #1

    Zabbix proxy: Utilization of trapper data collector processes is 100

    Hi community,

    Our infrastructure:
    Server: Zabbix 6.2.9
    Zabbix proxy: 6.2.9
    Zabbix agent2: 6.2.7 | 6.2.9

    About a month ago our infrastructure had a storage problem in which all VMs became inaccessible for a week. When the problem was solved and the VMs became available again, the agents naturally started communicating with the proxy, and ever since, the trapper data collector utilization has been at 100%, something that had never happened before (but understandable given the circumstances).
    We then left it like this over the weekend (approx. 72 hours) for it to stabilize, which it didn't.
    Some of the troubleshooting we did:
    • All hosts were disabled on the frontend and the agents were stopped, in order to let the trapper utilization normalize.
    • A new proxy VM was created and installed, to make sure the problem wasn't damage to the original proxy or its DB. It wasn't.
    • We re-enabled the hosts one by one (waiting hours, or even a whole day, between them) and saw utilization rise by up to 6% per host for about 30 minutes, then drop to 0% for only 15 minutes, then rise again (flapping).
    • We also unlinked and cleared all templates and re-added them; still no effect.
    • Agent2 was reinstalled, on both the previous and the current version; that didn't work either.
    • A new interface (eth2) was even added to the proxy, to test whether it was an interface issue, but it wasn't.
    • Values in zabbix_proxy.conf were adjusted to see whether things would stabilize, which made the percentage drop to the usual 0.(…)%, but the up-and-down flapping continued.
    Also, this is the message the agent log (tail) shows when the timeout is at the default (3 s); if changed to 15, 20, even 25 s it shows the same message, and it only disappears at 30 s:
    [Screenshot: agent log error (Captura de ecrã 2024-08-27 100822.png)]
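    Given that the error above only clears at a 30-second timeout, the corresponding agent-side setting would be the following. This is just a sketch: 30 is the maximum value the Timeout parameter accepts, so this masks the slowness rather than explaining it.

```ini
# zabbix_agent2.conf (illustrative; the valid Timeout range is 1-30 seconds)
Timeout=30
```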
    Zabbix Server: Utilization of trapper data collector processes:

    [Screenshot: Captura de ecrã 2024-08-26 111323.png]
    Zabbix Proxy: Utilization of trapper data collector processes:
    [Screenshot: Captura de ecrã 2024-08-26 111349.png]
    These are the values we have in our configuration files:

    zabbix_proxy.conf:
    StartPollersUnreachable=5
    StartPingers=5
    CacheSize=64M
    Timeout=5
    Everything else is default

    agent2.conf:
    Timeout=3

    Before the storage issue we'd never had any problems with the agent.
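    Since the alarm is about trapper processes specifically (on a proxy, trappers service incoming active-agent connections), a first round of proxy-side tuning might look like this. This is a sketch; the values are illustrative, not recommendations:

```ini
# zabbix_proxy.conf (illustrative additions)
StartTrappers=10       # default is 5; trappers handle active-agent traffic
LogSlowQueries=3000    # log DB queries taking longer than 3000 ms,
                       # to rule out a slow database after the storage incident
```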

    Can you please share some insight into what the issue could be here?

    Thank you

    Last edited by gonmd; 27-08-2024, 11:09.
  • gonmd
    Junior Member
    • Aug 2024
    • 9

    #2
    Here's some testing:

    In Latest data, this host shows no data with an active agent template and the default timeout:

    [Screenshot: activeagentemplate.png]


    Once changed to a passive agent template, data begins to be collected:

    [Screenshot: passiveagenttemplate.png]

    This is the trapper data collector graph after the template was changed to the passive agent one (still the same behavior):

    [Screenshot: trapperafterpassivetemplate.png]


    And the same error in the agent log shown in the original post is still being displayed.


    • gonmd
      Junior Member
      • Aug 2024
      • 9

      #3
      Could anyone help, please?
      Thanks


      • gonmd
        Junior Member
        • Aug 2024
        • 9

        #4
        Hi Community,

        Does anyone have information to share about this behavior that could help us? We are stuck, and we don't understand this behavior with active monitoring after the problem with the physical storage.

        Regards,


        • mrportatoes
          Junior Member
          • Mar 2023
          • 6

          #5
          We're having a very similar issue. Currently on 7.0.8 with two proxy servers, one dedicated to active agents. We had a power outage, and once the servers came back online we saw the trapper utilization (for active agents) hovering between 90 and 100%, where before it was in the ~1% range. I can also see that CPU utilization has gone way down on the proxy server, almost as if it's trying to process everything one at a time instead of asynchronously, if that makes sense. Furthermore, running "ss -ntl" shows that the Recv-Q is pegged at 128. I tried adding "ListenBacklog" to the proxy conf, which essentially increases the Recv-Q limit; however, even after bumping it up to 4096, it remains pegged at the higher value.
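          For reference, the backlog change described above is a single proxy parameter (value illustrative). Note that the kernel caps the effective listen backlog at net.core.somaxconn, so that sysctl may need raising as well:

```ini
# zabbix_proxy.conf
ListenBacklog=4096
```

          After a proxy restart, the Send-Q column of "ss -ntl" for the LISTEN socket on port 10051 should reflect the effective backlog.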

          In the past we had run into some file descriptor limits, but we have /etc/systemd/system/zabbix-proxy.service.d/filelimit.conf defined with some rather large values, and, more importantly, they haven't changed since the power outage.

          I also bumped our "StartTrappers" value in the proxy conf by a significant amount (300 to 400), and it only dropped utilization by ~5%. Given it was working before (and arguably should have been lowered, given the ~1% utilization), cranking this up to 1000 isn't the answer.

          I also made some adjustments to the TCP parameters, shortening the keepalive and timeout settings, but this hasn't made any difference either:

          sysctl -w net.ipv4.tcp_max_syn_backlog=4096
          sysctl -w net.ipv4.tcp_keepalive_time=600
          sysctl -w net.ipv4.tcp_fin_timeout=30
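          The same tunings in persistent form, applied with "sysctl --system". This is a sketch; the drop-in file name is arbitrary:

```ini
# /etc/sysctl.d/99-zabbix-tcp.conf (file name illustrative)
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_fin_timeout = 30
```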

          If we get this sorted, I'll post the fix here. Likewise, if you happen to have fixed it in your instance, please share anything you can.
          Cheers


          • gonmd
            Junior Member
            • Aug 2024
            • 9

            #6
            Hi,

             Thanks for sharing that info. On our side, the workaround was to convert from active monitoring to passive. We ran several tests but were not able to find a fix.
            If you get this sorted, please post the fix.

            Cheers


            • mrportatoes
              Junior Member
              • Mar 2023
              • 6

              #7
               Just wanted to check in and note we never got a clear fix either. Since we are running two proxies, we grouped them and enabled load balancing. This has brought up some new challenges around host redirects, but in general the TCP queues are clear and our "Available" host numbers are as high as they have ever been. I'm certain this is a capacity/configuration issue, but I can't say definitively what it was.

               I also took a page from your playbook and flipped a few of our PowerShell-based items from Active to Passive. This dramatically dropped the queue count for us, but I can't say it had an impact on agent communication with the proxies.


              • cyber
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional
                • Dec 2006
                • 4807

                #8
                First 2 posts... I would say: issues with the connection from agent to proxy (agent->proxy:10051). The agent log says it cannot get active items from the proxy, so the agent is not able to talk to the proxy (firewall restrictions?). When you changed the items from active to passive, you changed the direction of the queries: now the agent only responds to queries from the proxy, and since you are getting data, your proxy->agent:10050 connection is okay.


                • gonmd
                  Junior Member
                  • Aug 2024
                  • 9

                  #9
                  Originally posted by cyber
                  First 2 posts... I would say: issues with the connection from agent to proxy (agent->proxy:10051). The agent log says it cannot get active items from the proxy, so the agent is not able to talk to the proxy (firewall restrictions?). When you changed the items from active to passive, you changed the direction of the queries: now the agent only responds to queries from the proxy, and since you are getting data, your proxy->agent:10050 connection is okay.
                  Hi,

                  If it's a firewall issue, that's strange: before the storage incident there were no problems with active monitoring. Also, some of the clients are on the same network as the proxy, so no firewall is involved there.

                  [root@somehost ~]# nc -z -v X.X.X.X 10051
                  Ncat: Version 7.70 ( https://nmap.org/ncat )
                  Ncat: Connected to X.X.X.X:10051.
                  Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
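                  A TCP connect like the one above only proves the port is open; the trapper can still fail to answer a real request. Below is a minimal sketch of building an "active checks" request in the Zabbix protocol ("ZBXD" signature, 0x01 flag, 8-byte little-endian body length, then the JSON body). The hostname "testhost" and the output path are illustrative:

```shell
# Build a minimal "active checks" request in the Zabbix protocol.
body='{"request":"active checks","host":"testhost"}'
len=$(printf %s "$body" | wc -c)
{
  printf 'ZBXD\001'                          # 4-byte signature + protocol flag
  i=0
  while [ "$i" -lt 8 ]; do                   # 8-byte little-endian length
    printf "\\$(printf '%03o' $(( (len >> (8 * i)) & 255 )))"
    i=$((i + 1))
  done
  printf %s "$body"
} > /tmp/active_checks.req
# Send it to the proxy and show the JSON part of the reply (header is 13 bytes):
# nc -w 5 X.X.X.X 10051 < /tmp/active_checks.req | tail -c +14
```

                  If the trapper is healthy, the reply should be a JSON object with "response":"success" (or "failed" with an explanatory "info" field for an unknown host), which tells you more than a bare connect does.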
