Dockerised Zabbix 7.0.4 HTTP pollers stopped working.

  • Wolfsbane2k

    #1


    Evening all.
    I've got a Zabbix 7.0.4 instance running via the official Docker images on an Ubuntu 22 VM host with 16 GB RAM and 8 cores of an EPYC system, monitoring circa 30 hosts.

    We've now had 2 instances where Zabbix suddenly stops polling some of the HTTP hosts (across a series of network addresses) all at the same time, but not others in the same network ranges. Initially I thought this was a network-level problem or a host problem rather than a Zabbix problem, because of the apparent randomness of it. In this case it resulted in an outage of over 4 days.

    Using the 'Test' button on the item, the source item worked fine while the pollers didn't, with results coming back quickly; but hitting the 'Execute now' button resulted in no action being taken (I was watching the network traffic in Wireshark), even if I waited 15 minutes.

    In the end I just rebooted the full Docker stack and it all came back OK.

    I've not been able to get into the Zabbix logs today and, tbh, can't recall whether they will have survived the stack restart.

    I can't find any reported issues against this: has anyone else seen similar?

    PS: is there any way to monitor the state of the HTTP pollers within a Zabbix dashboard, so that if they die I can get an alert?

    Ta
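
    On the PS: the server's internal self-monitoring items (e.g. zabbix[process,http poller,avg,busy], which the stock "Zabbix server health" template collects) can be put on a dashboard, and a trigger along the lines of avg(/Zabbix server/zabbix[process,http poller,avg,busy],10m)>75 should alert when the pollers saturate or wedge. As a rough sketch (the API token handling and endpoint below are assumptions, not anything from this install), the same values can also be pulled via the API:

# Sketch: read the HTTP poller utilisation from the Zabbix API.
# Assumptions: Zabbix >= 7.0, an API token exported as ZABBIX_TOKEN, and the
# stock internal item zabbix[process,http poller,avg,busy] collected on the
# "Zabbix server" host (it ships with the Zabbix server health template).
import os
import requests

ZABBIX_URL = os.environ.get("ZABBIX_URL", "http://localhost/api_jsonrpc.php")
ZABBIX_TOKEN = os.environ["ZABBIX_TOKEN"]


def api_call(method, params):
    """Minimal JSON-RPC helper for the Zabbix API."""
    resp = requests.post(
        ZABBIX_URL,
        json={"jsonrpc": "2.0", "method": method, "params": params, "id": 1},
        headers={"Authorization": f"Bearer {ZABBIX_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


# Find the internal items tracking the http poller / http agent poller busy %.
items = api_call("item.get", {
    "output": ["name", "key_", "lastvalue"],
    "search": {"key_": "zabbix[process,http"},
})

for item in items:
    busy = float(item["lastvalue"] or 0)
    print(f"{item['key_']}: {busy:.1f}% busy")
    if busy > 75:
        print("  -> HTTP pollers look saturated (or wedged)")
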
  • Wolfsbane2k

    #2
    This has now happened 4, possibly 5 times.

    I've spent the last 3 hours trying to dig through the logs and investigate this issue; the logs themselves (at debug level 3) are of no use - there is nothing in them that mentions the HTTP pollers at all.

    The logs do show 2 items monitored by an HTTP poller flip-flopping state, though: timing out after 2000 ms and then becoming available again 1 second later.

    I did push the DebugLevel up to 4 and restart the Docker container (just the server container), and the moment it restarted everything cleared again; the issue didn't recur during the 3 hours or so at DebugLevel 4, but the logs rapidly ate all the hard drive space, meaning I had to return to DebugLevel 3 and restart the container again, only for the issue to occur 15 minutes later.

    Looking at the network traffic in/out of the containers, it's clear that traffic drops to next to nothing about 1 minute 45 seconds before the HTTP request alarms kick off.

    I've set up a series of monitoring points for the weekend at debug level 3 and will see what, if anything, happens while awaiting delivery of an extra hard drive.
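
    One way to avoid both the restart (which clears the fault) and the disk blow-up may be to raise verbosity for just the poller processes at runtime with the server's log_level_increase runtime control, rather than setting DebugLevel=4 globally. A rough sketch, assuming the server container is called zabbix-server (a placeholder - use whatever name docker ps shows for this stack):

# Sketch: raise log verbosity for only the "http poller" processes inside the
# running server container, then drop it back once the fault has been captured.
# Assumption: "zabbix-server" is a placeholder container name.
import subprocess

CONTAINER = "zabbix-server"


def runtime_control(option):
    """Send a zabbix_server runtime-control command inside the container."""
    subprocess.run(
        ["docker", "exec", CONTAINER, "zabbix_server", "-R", option],
        check=True,
    )


# Each call raises the level one step (3 -> 4) for one process type only, so
# the rest of the server stays at DebugLevel 3 and the log volume stays sane.
runtime_control("log_level_increase=http poller")
runtime_control("log_level_increase=http agent poller")  # 7.0 async pollers, if accepted as a target

# After the drop-out has been captured:
# runtime_control("log_level_decrease=http poller")
# runtime_control("log_level_decrease=http agent poller")
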
    Last edited by Wolfsbane2k; 25-10-2024, 18:45.

    • Wolfsbane2k

      #3
      Still investigating this with a whole bunch of Wireshark captures recording network traffic through the system, and debug level 4 on the server container (via Portainer).

      From initial review, it's not giving me any smoking guns as to what's causing this - CPU load is less than 5%, network load is less than 400 kb/s, the HTTP pollers are less than 2% busy, the preprocessing queue is 0, and the preprocessor internal processes are less than 5% busy. Queue size over 10 minutes is 22.

      Is there any particular thing I should be looking for in the logs?

      We're seeing another side effect in the lead-up to these total outages, where the system declares failures at staggered times only for them all to "suddenly" clear at the same moment. The items are all set to be sampled every 1 minute and (edit 2) to report nodata after 2 minutes.

      For example, this is an extract from the problem report:

      Time      Resolved  Host
      16:05:17  16:05:17  Host 1
      16:05:15  16:05:17  Host 2
      16:05:07  16:05:17  Host 3
      16:05:04  16:05:17  Host 5
      16:05:01  16:05:17  Host 4
      16:05:00  16:05:17  Host 9
      Thanks!
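
      As a cross-check on that pattern, a rough sketch of pulling the same problem/recovery timestamps out of the API (the endpoint, token handling and time window below are assumptions, not this installation's actual values):

# Sketch: list PROBLEM events in a window and pair them with their recovery
# events, to confirm the staggered-start / simultaneous-clear pattern.
# Assumptions: Zabbix >= 7.0 and an API token exported as ZABBIX_TOKEN.
import os
import time
import requests

ZABBIX_URL = os.environ.get("ZABBIX_URL", "http://localhost/api_jsonrpc.php")
ZABBIX_TOKEN = os.environ["ZABBIX_TOKEN"]


def api_call(method, params):
    resp = requests.post(
        ZABBIX_URL,
        json={"jsonrpc": "2.0", "method": method, "params": params, "id": 1},
        headers={"Authorization": f"Bearer {ZABBIX_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]


def fmt(ts):
    return time.strftime("%H:%M:%S", time.localtime(ts))


now = int(time.time())
problems = api_call("event.get", {
    "output": ["eventid", "clock", "name", "r_eventid"],
    "selectHosts": ["host"],
    "value": 1,               # PROBLEM events only
    "time_from": now - 3600,  # last hour; widen around the outage as needed
    "time_till": now,
    "sortfield": ["clock"],
})

# Look up the matching recovery (OK) events to get the resolve timestamps.
recovery_ids = [p["r_eventid"] for p in problems if p["r_eventid"] != "0"]
recoveries = {}
if recovery_ids:
    for e in api_call("event.get", {"output": ["eventid", "clock"],
                                    "eventids": recovery_ids}):
        recoveries[e["eventid"]] = int(e["clock"])

for p in problems:
    host = p["hosts"][0]["host"] if p.get("hosts") else "?"
    started = fmt(int(p["clock"]))
    resolved_ts = recoveries.get(p["r_eventid"])
    resolved = fmt(resolved_ts) if resolved_ts else "-"
    print(f"{started}  {resolved}  {host}  {p['name']}")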

      Edited (2) to add:

      I'm using a local HTTP proxy, and in the Wireshark traces between Zabbix and the HTTP proxy I can see lots of similar requests being carried on the same TCP stream, and all the hosts declare the failure when that TCP stream is torn down. There are a number of other hosts that use the same proxy, but their HTTP requests are handled via a different stream - I assume because they are being serviced by a separate HTTP poller, which is why they're not affected.

      It also appears that something in the Docker instance ends up trying to talk to the HTTP proxy from an interface that is only meant to be part of the internal Docker networking address range. Because I'd set the Wireshark capture filter on the Docker host to capture traffic between the HTTP proxy and the Zabbix Docker instance's external IP address, but left the capture filter on the HTTP proxy more open, these packets are only seen in the capture on the HTTP proxy, so I'm not sure what triggered them - they appear to have come from the web-nginx-myssql-1 container.

      After a complete drop-out, I suddenly get a flurry of HTTP requests to the HTTP proxy for 13 HTTP hosts, all within 0.001 seconds of each other and all on different source ports; we've got 20 HTTP pollers in operation, so this roughly makes sense, but it looks like "something" just kicks them all off at once, which given the intended async nature of the requests is also a little weird...

      Any help would be greatly appreciated!

      Edit 3: I have bumped StartHTTPPollers and StartHTTPAgentPollers both to 75 to rule this out as a contributing factor, as we're only looking at 20 hosts, and many of our items are dependent (pre-processed) items hanging off 1 HTTP agent request to each host. We've got StartPreprocessors set to 200, all for 85 values per second.
      Last edited by Wolfsbane2k; 01-11-2024, 21:10.

      • Wolfsbane2k

        #4
        Sadly, despite edit 3 upping the number of pollers and all the other settings, this is still ongoing.
        Last edited by Wolfsbane2k; 14-11-2024, 11:29.

        • Wolfsbane2k

          #5
          After a bunch of investigation that got nowhere, we jumped to the Zabbix 7.2 containers and, ta-da, issue gone.

          Having reviewed the changelogs, I can't work out what drove the change, but consider it closed.

          • Wolfsbane2k

            #6
            Urgh - went to add a new host using a dependent item from another host, and we've got the old issue back.
            Disabling the host didn't stop the issue from occurring, so we're now back to something flapping states every 2 minutes.

            Hmm.

            • Wolfsbane2k

              #7
              Following a full recompose of the Docker build at 7.2.6 we're no longer seeing this issue - really not sure what's been causing it.
