Sudden peaks in inbound flows to Zabbix from Agents, halting Zabbix

  • AspenKle
    Junior Member
    • Feb 2024
    • 12

    #1

    Zabbix 6.0.14. © 2001–2023, Zabbix SIA
    Linux (ubuntu 20.04)
    Standard E4as v4 (4 vcpus, 32 GiB memory)

    Zabbix agent version running on the monitored hosts (Windows Server 2019 Datacenter):
    zabbix_agent2-6.0.23-windows-amd64-openssl.msi

    Monitored hosts: 68
    Required server performance, new values per second: 56.16

    We have had this system running for a long time, and it has worked great.
    We have frequently tuned it when adding new hosts, and we monitor housekeeping and other processes to keep the environment running smoothly.
    But over the last month, some monitored hosts started to behave differently.

    We noticed it in the frontend first:
    Connection to Zabbix server "localhost" timed out: Possible reasons:

    1. Incorrect server IP/DNS in the "zabbix.conf.php".
    2. Firewall is blocking TCP connection.
    - Connection timed out

    After checking the inbound flows to the Zabbix server, we saw that the graph was sky high. Further investigation showed that 2-3 monitored hosts were sending too much data to the Zabbix server.

    (example)

    sudo tail -f zabbix_server.log

    1357:20240130:133326.485 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected.


    Our fix was to use network tools to find which monitored hosts were sending too much data, and then stop Agent 2 on those hosts.
    In some cases that was enough; in other cases we also had to stop the Zabbix server and the Zabbix agent on the server and start them again.
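
    One way to find the noisy hosts could be to count open connections to the server port per remote address. The sketch below is only an illustration: it assumes connection listings in the style of `ss -tan` output, and the IP addresses in `sample` are made up for the example, not taken from this environment.

```python
from collections import Counter

def count_peers(ss_output: str, server_port: int = 10051) -> Counter:
    """Count TCP connections per remote IP whose local port is server_port."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or not fields[1].isdigit():
            continue  # skip the header and malformed lines
        local, peer = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(server_port):
            continue  # not a connection to the Zabbix server port
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts

# Hypothetical listing; all addresses are invented for illustration.
sample = """\
State    Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB    0      0      10.0.0.4:10051      10.0.0.21:52344
SYN-RECV 0      0      10.0.0.4:10051      10.0.0.21:52345
ESTAB    0      0      10.0.0.4:10051      10.0.0.22:49811
ESTAB    0      0      10.0.0.4:22         10.0.0.9:60122
"""

for ip, n in count_peers(sample).most_common():
    print(ip, n)
```

    A host that sits far above the others in this count would be a candidate for the kind of agent described here.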

    Agent 2 logs (example)

    # Host logs: the errors were piling up and the agent was doing too much.

    2022/10/13 09:23:13.119956 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout
    2022/10/13 09:23:13.119956 [101] active check configuration update from host [MONITORED-HOST] started to fail
    [ ..the same logs were just rolling every second ]
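
    To put a number on how often an agent retries, the "cannot connect" lines could be counted per minute. This is a rough sketch that assumes the timestamp layout shown in the excerpt above; the function name `failures_per_minute` and the third and fourth sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the prefix seen in the Agent 2 excerpt, e.g.
# "2022/10/13 09:23:13.119956 [101] cannot connect to ..."
# and captures the timestamp down to the minute.
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}\.\d+ \[\d+\] cannot connect to "
)

def failures_per_minute(lines):
    """Return a Counter mapping 'YYYY/MM/DD HH:MM' to failed connection attempts."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "2022/10/13 09:23:13.119956 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout",
    "2022/10/13 09:23:13.119956 [101] active check configuration update from host [MONITORED-HOST] started to fail",
    # The two lines below are invented to show the grouping.
    "2022/10/13 09:23:14.120100 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout",
    "2022/10/13 09:24:01.000000 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout",
]

for minute, n in sorted(failures_per_minute(sample).items()):
    print(minute, n)
```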

    It looks almost as if the ports were exhausted.

    # Telnet check against localhost
    telnet localhost 10050 always succeeded.
    telnet localhost 10051 was very slow and sometimes did not respond.
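
    The telnet check could also be scripted so that "very slow" becomes a measured connect time. The following is a minimal sketch using only the Python standard library; `time_connect` is a hypothetical helper, and the 3-second timeout is an arbitrary assumption.

```python
import socket
import time

def time_connect(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None

if __name__ == "__main__":
    # 10050 is the agent port and 10051 the server port, as in the telnet check.
    for port in (10050, 10051):
        elapsed = time_connect("localhost", port)
        if elapsed is None:
            print(f"port {port}: no connection")
        else:
            print(f"port {port}: connected in {elapsed:.3f} s")
```

    Running this in a loop during a spike would show whether connects to 10051 slow down before the frontend starts reporting timeouts.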

    Our next step was to upgrade the agents to a newer version:
    zabbix_agent2-6.0.26-windows-amd64-openssl.msi

    But we are still seeing the issue from time to time.
    The Zabbix dashboards (System performance, Zabbix server health and Zabbix server processes) look normal when this happens.
    Housekeeping is also normal.
    We also tuned:
    StartPollers
    Timeout
    We have searched through many links about this error and behavior, but have not found a fix yet.

    I hope someone can point me in the right direction.
    This may not be enough information about the configuration and environment, so let me know if more is needed.
    Regards





  • AspenKle
    Junior Member
    • Feb 2024
    • 12

    #2
    Update:
    It looks almost as if the ports were exhausted on the Zabbix server.

    # Telnet check against localhost
    telnet localhost 10050 always succeeded.
    telnet localhost 10051 was very slow and sometimes did not respond.


    • cyber
      Senior Member
      Zabbix Certified SpecialistZabbix Certified Professional
      • Dec 2006
      • 4807

      #3
      It is such a tiny load; it should work like a charm on those 4 CPUs and 32 GB of memory... Is the DB on the same host? Look around at what it is doing at those times... I would suspect that it is not really able to perform: either it is not given enough resources, or it is on very slow disks, or something similar...



      • AspenKle
        AspenKle commented
        Hi Cyber, thanks for the reply. Yes, this is not a big load, and we did not see any spikes in RAM, CPU or disk on the Zabbix server.

        What we saw looked almost like a DDoS attack from the agents: 3 of the 68 were spamming the Zabbix server with reconnect attempts over and over again:
        2022/10/13 09:23:13.119956 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout
        2022/10/13 09:23:13.119956 [101] active check configuration update from host [MONITORED-HOST] started to fail
        [ ... the same log lines repeating every second ]

        Stopping the 3 agents was enough 3 out of 5 times.
        The other 2 out of 5 times, we also had to stop the Zabbix server and the Zabbix agent on the server (sudo service zabbix-server stop, and likewise for the agent) and start them again.


        The DB is on a remote host, and all DB stats look OK when it happens.
    • AspenKle
      Junior Member
      • Feb 2024
      • 12

      #4
      Here is an example of inbound flows, with the sudden peak that appears when the agents spam the Zabbix server. The graph is for the Zabbix server.
      [Attached image: image.png]
      Last edited by AspenKle; 27-02-2024, 11:18.


      • cyber
        Senior Member
        Zabbix Certified SpecialistZabbix Certified Professional
        • Dec 2006
        • 4807

        #5
        Try upgrading the server to the latest version. 6.0.14 is pretty old by now. Many bugs have been fixed since, maybe including something related to this kind of issue. There are just too many to read through.



        • AspenKle
          AspenKle commented
          Yes, we have that planned. Hehe, I get you. Thanks for the help so far. For future me or anyone else: if you land on this page and have found the reason or a fix, please add it here.
      • AspenKle
        Junior Member
        • Feb 2024
        • 12

        #6
        Updated to a new agent, Agent 2 v6.0.27, on two hosts that had the issue. The two agents ran for 1 day and then started to produce errors again.
        After stopping both agents, the Zabbix server was OK after 10 min and inbound flows were normal again.
        [Attached image: image.png]

