Dockerised Zabbix 7.0.4 HTTP pollers stopped working.

  • Wolfsbane2k

    #1


    Evening all.
    I've got a Zabbix 7.0.4 instance running via the official Docker images on an Ubuntu 22 VM host with 16 GB RAM and 8 cores of an EPYC system, monitoring circa 30 hosts.

    We've now had 2 instances where Zabbix suddenly stops polling some of the HTTP hosts (across a series of network addresses) all at the same time, but not others in the same network ranges. Initially I thought this was a network-level problem or a host problem rather than a Zabbix problem, because of the apparent randomness of it. In this case it resulted in an outage of over 4 days.

    Using the 'Test' button on the item, the source item worked fine while the pollers didn't, with results coming back quickly; but hitting the 'Execute now' button resulted in no action being taken (I was watching the network traffic in Wireshark), even if I waited 15 minutes.

    In the end I just rebooted the full Docker stack and it all came back OK.

    I've not been able to get into the Zabbix logs today and, tbh, can't recall whether they will have survived the stack restart.

    I can't find any reported issues against this: has anyone else seen similar?

    PS: is there any way to monitor the state of the HTTP pollers within a Zabbix dashboard, so that if they die I can get an alert?

    Ta
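
    On the PS: the server's internal self-monitoring items (e.g. zabbix[process,http poller,avg,busy], which the stock "Zabbix server health" template collects) can be put on a dashboard, and a trigger along the lines of avg(/Zabbix server/zabbix[process,http poller,avg,busy],10m)>75 should alert when the pollers saturate or wedge. As a rough sketch (the API token handling and endpoint below are assumptions, not anything from this install), the same values can also be pulled via the API:

# Sketch: read the HTTP poller utilisation from the Zabbix API.
# Assumptions: Zabbix >= 7.0, an API token exported as ZABBIX_TOKEN, and the
# stock internal item zabbix[process,http poller,avg,busy] collected on the
# "Zabbix server" host (it ships with the Zabbix server health template).
import os
import requests

ZABBIX_URL = os.environ.get("ZABBIX_URL", "http://localhost/api_jsonrpc.php")
ZABBIX_TOKEN = os.environ["ZABBIX_TOKEN"]


def api_call(method, params):
    """Minimal JSON-RPC helper for the Zabbix API."""
    resp = requests.post(
        ZABBIX_URL,
        json={"jsonrpc": "2.0", "method": method, "params": params, "id": 1},
        headers={"Authorization": f"Bearer {ZABBIX_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


# Find the internal items tracking the http poller / http agent poller busy %.
items = api_call("item.get", {
    "output": ["name", "key_", "lastvalue"],
    "search": {"key_": "zabbix[process,http"},
})

for item in items:
    busy = float(item["lastvalue"] or 0)
    print(f"{item['key_']}: {busy:.1f}% busy")
    if busy > 75:
        print("  -> HTTP pollers look saturated (or wedged)")
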
  • Wolfsbane2k

    #2
    This has now happened 4, possibly 5 times.

    I've spent the last 3 hours trying to dig through the logs and investigate this issue; the logs themselves (at debug level 3) are of no use - there is nothing in them that mentions the HTTP pollers at all.

    The logs do show 2 items monitored by an HTTP poller flip-flopping state, though: timing out after 2000 ms and then becoming available again 1 second later.

    I did push the DebugLevel up to 4 and restart the Docker container (just the server container), and the moment it restarted everything cleared again; the issue didn't recur during the 3 hours or so at DebugLevel 4, but the logs rapidly ate all the hard drive space, meaning I had to return to DebugLevel 3 and restart the container again, only for the issue to occur 15 minutes later.

    Looking at the network traffic in/out of the containers, it's clear that traffic drops to next to nothing about 1 minute 45 seconds before the HTTP request alarms kick off.

    I've set up a series of monitoring points for the weekend at debug level 3 and will see what, if anything, happens while awaiting delivery of an extra hard drive.
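
    One way to avoid both the restart (which clears the fault) and the disk blow-up may be to raise verbosity for just the poller processes at runtime with the server's log_level_increase runtime control, rather than setting DebugLevel=4 globally. A rough sketch, assuming the server container is called zabbix-server (a placeholder - use whatever name docker ps shows for this stack):

# Sketch: raise log verbosity for only the "http poller" processes inside the
# running server container, then drop it back once the fault has been captured.
# Assumption: "zabbix-server" is a placeholder container name.
import subprocess

CONTAINER = "zabbix-server"


def runtime_control(option):
    """Send a zabbix_server runtime-control command inside the container."""
    subprocess.run(
        ["docker", "exec", CONTAINER, "zabbix_server", "-R", option],
        check=True,
    )


# Each call raises the level one step (3 -> 4) for one process type only, so
# the rest of the server stays at DebugLevel 3 and the log volume stays sane.
runtime_control("log_level_increase=http poller")
runtime_control("log_level_increase=http agent poller")  # 7.0 async pollers, if accepted as a target

# After the drop-out has been captured:
# runtime_control("log_level_decrease=http poller")
# runtime_control("log_level_decrease=http agent poller")
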
    Last edited by Wolfsbane2k; 25-10-2024, 18:45.

    • Wolfsbane2k

      #3
      Still investigating this with a whole bunch of Wireshark captures recording network traffic through the system, and debug level 4 on the server container (via Portainer).

      From initial review, it's not giving me any smoking guns as to what's causing this - CPU load is less than 5%, network load is less than 400 kb/s, the HTTP pollers are less than 2% busy, the preprocessing queue is 0, and the preprocessor internal processes are less than 5% busy. Queue size over 10 minutes is 22.

      Is there any particular thing I should be looking for in the logs?

      We're seeing another side effect in the lead-up to these total outages, where the system declares failures at staggered times only for them all to "suddenly" clear at the same moment. The items are all set to be sampled every 1 minute and (edit 2) to report nodata after 2 minutes.

      For example, this is an extract from the problem report:

      Time      Resolved  Host
      16:05:17  16:05:17  Host 1
      16:05:15  16:05:17  Host 2
      16:05:07  16:05:17  Host 3
      16:05:04  16:05:17  Host 5
      16:05:01  16:05:17  Host 4
      16:05:00  16:05:17  Host 9
      Thanks!
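
      As a cross-check on that pattern, a rough sketch of pulling the same problem/recovery timestamps out of the API (the endpoint, token handling and time window below are assumptions, not this installation's actual values):

# Sketch: list PROBLEM events in a window and pair them with their recovery
# events, to confirm the staggered-start / simultaneous-clear pattern.
# Assumptions: Zabbix >= 7.0 and an API token exported as ZABBIX_TOKEN.
import os
import time
import requests

ZABBIX_URL = os.environ.get("ZABBIX_URL", "http://localhost/api_jsonrpc.php")
ZABBIX_TOKEN = os.environ["ZABBIX_TOKEN"]


def api_call(method, params):
    resp = requests.post(
        ZABBIX_URL,
        json={"jsonrpc": "2.0", "method": method, "params": params, "id": 1},
        headers={"Authorization": f"Bearer {ZABBIX_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]


def fmt(ts):
    return time.strftime("%H:%M:%S", time.localtime(ts))


now = int(time.time())
problems = api_call("event.get", {
    "output": ["eventid", "clock", "name", "r_eventid"],
    "selectHosts": ["host"],
    "value": 1,               # PROBLEM events only
    "time_from": now - 3600,  # last hour; widen around the outage as needed
    "time_till": now,
    "sortfield": ["clock"],
})

# Look up the matching recovery (OK) events to get the resolve timestamps.
recovery_ids = [p["r_eventid"] for p in problems if p["r_eventid"] != "0"]
recoveries = {}
if recovery_ids:
    for e in api_call("event.get", {"output": ["eventid", "clock"],
                                    "eventids": recovery_ids}):
        recoveries[e["eventid"]] = int(e["clock"])

for p in problems:
    host = p["hosts"][0]["host"] if p.get("hosts") else "?"
    started = fmt(int(p["clock"]))
    resolved_ts = recoveries.get(p["r_eventid"])
    resolved = fmt(resolved_ts) if resolved_ts else "-"
    print(f"{started}  {resolved}  {host}  {p['name']}")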

      Edited (2) to add:

      I'm using a local HTTP proxy, and in the Wireshark traces between Zabbix and the HTTP proxy I can see lots of similar requests being carried on the same TCP stream, and all the hosts declare the failure when that TCP stream is torn down. There are a number of other hosts that use the same proxy, but their HTTP requests are handled via a different stream - I assume because they are being serviced by a separate HTTP poller, which is why they're not affected.

      It also appears that something in the Docker instance ends up trying to talk to the HTTP proxy from an interface that is only meant to be part of the internal Docker networking address range. Because I'd set the Wireshark capture filter on the Docker host to capture traffic between the HTTP proxy and the Zabbix Docker instance's external IP address, but left the capture filter on the HTTP proxy more open, these packets are only seen in the capture on the HTTP proxy, so I'm not sure what triggered them - they appear to have come from the web-nginx-myssql-1 container.

      After a complete drop-out, I suddenly get a flurry of HTTP requests to the HTTP proxy for 13 HTTP hosts, all within 0.001 seconds of each other and all on different source ports; we've got 20 HTTP pollers in operation, so this roughly makes sense, but it looks like "something" just kicks them all off at once, which given the intended async nature of the requests is also a little weird...

      Any help would be greatly appreciated!

      Edit 3: I have bumped StartHTTPPollers and StartHTTPAgentPollers both to 75 to rule this out as a contributing factor, as we're only looking at 20 hosts, and many of our items are dependent (pre-processed) items hanging off 1 HTTP agent request to each host. We've got StartPreprocessors set to 200, all for 85 values per second.
      Last edited by Wolfsbane2k; 01-11-2024, 21:10.

      • Wolfsbane2k

        #4
        Sadly, despite edit 3 upping the number of pollers and all the other settings, this is still ongoing.
        Last edited by Wolfsbane2k; 14-11-2024, 11:29.

        • Wolfsbane2k

          #5
          After a bunch of investigation that got nowhere, we jumped to the Zabbix 7.2 containers and, ta-da, issue gone.

          Having reviewed the changelogs, I can't work out what drove the change, but consider it closed.

          • Wolfsbane2k

            #6
            Urgh - went to add a new host using a dependent item from another host, and we've got the old issue back.
            Disabling the host didn't stop the issue from occurring, so we're now back to something flapping states every 2 minutes.

            Hmm.

            • Wolfsbane2k

              #7
              Following a full recompose of the Docker build at 7.2.6 we're no longer seeing this issue - really not sure what's been causing it.
