Weird period of no items processed and queue growth - cause?

  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #1

    I have Zabbix 7.0.6 installed on Ubuntu 24.04.1 on HyperV. It's a new-ish install (12 days old) and very small (19 hosts), so the processing load is negligible. It uses Postgresql 16.

    In the wee hours it seemed to stop processing items for roughly 20 minutes, which caused a flurry of unavailable alerts. I think this queue issue is the cause of the unavailable (by ping) errors, but I am baffled as to the cause of the processing issue.

    I've checked the syslog, zabbix log, postgresql log and see nothing unusual in them, and no discontinuities.
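
    A rough way to double-check for discontinuities is to scan the server log for unusually long silences between consecutive timestamps. The sketch below assumes the usual PID:YYYYMMDD:HHMMSS.mmm prefix on Zabbix server log lines; the function names and the 2-minute threshold are arbitrary choices for illustration. A quiet log is not proof of a stall, so treat any hit only as a place to look closer.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumed line prefix: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> message"
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.(\d{3})\s")


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if there is none."""
    m = LINE_RE.match(line)
    if not m:
        return None
    date_part, time_part, millis = m.groups()
    ts = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return ts + timedelta(milliseconds=int(millis))


def find_gaps(lines, threshold=timedelta(minutes=2)):
    """Return (previous, current) timestamp pairs that are further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # continuation lines or foreign formats are skipped
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        for before, after in find_gaps(fh):
            print(f"gap of {after - before}: {before} -> {after}")
```

    Saved under a name of your choosing, it takes the log file as its only argument, for example /var/log/zabbix/zabbix_server.log if LogFile has not been changed (that path is an assumption about the install).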

    I've checked the hyperv logs and the windows logs for the hypervisor. Nothing unusual there. My initial thought was some kind of freeze, e.g. from a backup snapshot, but no sign of that. Time server logs show no jump or adjustment in time on either the guest or hypervisor, though my gut tells me this is some kind of time jump. There was no reboot, no restart of postgresql or zabbix server.
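
    If it recurs, one way to tell a guest freeze from a clock step is to leave a tiny watcher running that compares wall-clock time with the monotonic clock on every tick. This is only a sketch: the names, the 1-second interval and the 2-second tolerance are made up for illustration, and how a hypervisor pause shows up inside the guest depends on the clock source, so either verdict is a hint rather than a diagnosis.

```python
import time


def classify_tick(wall_delta, mono_delta, interval, tolerance=2.0):
    """Classify one sampling tick.

    wall_delta and mono_delta are the seconds elapsed since the previous
    tick on the wall clock and on the monotonic clock respectively.
    """
    if abs(wall_delta - mono_delta) > tolerance:
        return "clock_step"  # wall clock moved relative to monotonic time
    if mono_delta - interval > tolerance:
        return "stall"  # both clocks agree, but the tick came far too late
    return "ok"


def watch(interval=1.0, tolerance=2.0, ticks=None):
    """Sample both clocks every `interval` seconds and print anomalies.

    Runs forever when ticks is None, otherwise for that many ticks.
    """
    prev_wall, prev_mono = time.time(), time.monotonic()
    count = 0
    while ticks is None or count < ticks:
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        verdict = classify_tick(wall - prev_wall, mono - prev_mono, interval, tolerance)
        if verdict != "ok":
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(wall))
            print(f"{stamp} {verdict}: wall +{wall - prev_wall:.1f}s, "
                  f"monotonic +{mono - prev_mono:.1f}s")
        prev_wall, prev_mono = wall, mono
        count += 1
```

    Calling watch() with its output redirected to a file would leave a record to compare against the time of the next queue spike.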

    The Postgresql server is on the same guest, and only used for zabbix and netdisco, so it's not like an external database hung. And for bureaucracy reasons I haven't started backups of zabbix or postgresql so not related to anything like that. No zabbix proxy or external aspects at all. The hypervisor is a dedicated management server at the moment running nothing else but a domain controller.

    No action was taken to resolve the issue (everyone was asleep), it just fixed itself. It only happened the once. I've got zabbix running in a half dozen clients in similar configurations and never seen this (though this is the only production 7.x version so far, other than my home).

    Does anyone have suggestions of things to check? What could be the cause?

    Linwood


    [Attached image: HangPeriod.jpg]

  • Brambo
    Senior Member
    • Jul 2023
    • 245

    #2
    The one thing I can think of is that you have active agent items which aren't being received at the expected interval. So the issue may not be your server but the source hosts, which 'fail' to report.
    Maybe an update was running at that time? Besides updates, a backup routine can cause this as well if your hosts are suspended, etc.

    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      There's only one item that uses active checks, and it's on only 4 hosts, not the dozen or more that gave an error. So I don't think so. I do have a large number of external checks (I have even replaced icmp with an external version to collect other info). But I can't see anything that would stall those either, and they certainly didn't time out (that would show in the log).

      • Brambo
        Senior Member
        • Jul 2023
        • 245

        #4
        Are you sure that discovery rules / item prototypes don't use these items as the master for dependent items?
        Without more details it's hard to help, e.g. is it every night or just a one-off?

        • Linwood
          Senior Member
          • Dec 2013
          • 398

          #5
          Yes. I have historically built most templates for SNMP, added a few passive zabbix agent items, and only added an active log file item some time ago. Not for any good reason, I just never did much with active checks. The one active item has a trigger but nothing dependent on it. There are a LOT of dependent items on external checks though. Hmmm... let me go look .... data collector processes did peak at that time. It didn't even hit 80%, but who knows if that's accurate if it wasn't processing. But I don't know what would do that either. External checks run in poller processes, right?

          But I don't think these timed out, as I don't see a single external check that timed out in the log at that time.

          PS. It hasn't happened again.

          [Attached image: image.png]

          • Linwood
            Senior Member
            • Dec 2013
            • 398

            #6
            Happened again last night, at roughly (not precisely) the same time of day, but not on the same day of the week (as it might be for some scheduled job). I suspect someone is doing something to the hypervisor that stalls the guest. Off to do more detective work. I can't find anything on the zabbix server that is tied to that time of day.

            • Linwood
              Senior Member
              • Dec 2013
              • 398

              #7

              To put a bit of closure to this: though I do not quite understand how, it appears that tunnels between locations are going down every day at about the same time (due to a rekey event with a lifetime of 1 day). Why the tunnels go down for about 5 minutes is in the hands of someone else, who doesn't seem to care and who wants to fix it themselves rather than have me try. The odd part is that the outages drive up the queue as above, but that may be because I'm using so many external checks and they are timing out, and the accumulation of timing-out item checks is what I am actually seeing.
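
              A back-of-the-envelope model makes that accumulation plausible. It assumes each poller runs one check at a time and that every check against an unreachable host blocks for the full timeout. All numbers below are invented for illustration and are not taken from this install.

```python
import math


def sweep_time(items, pollers, timeout_s):
    """Seconds for the pollers to attempt every item once when each
    attempt blocks for timeout_s and each poller runs one check at a time."""
    rounds = math.ceil(items / pollers)
    return rounds * timeout_s


def expected_delay(items, pollers, timeout_s, interval_s):
    """How far behind schedule items fall when one sweep takes longer
    than the update interval (0 when the pollers keep up)."""
    return max(0.0, sweep_time(items, pollers, timeout_s) - interval_s)


# Hypothetical figures: 200 external checks on a 60 s interval, 5 pollers.
healthy = expected_delay(items=200, pollers=5, timeout_s=0.2, interval_s=60)
outage = expected_delay(items=200, pollers=5, timeout_s=10, interval_s=60)
print(f"healthy: {healthy:.0f} s behind, tunnel down: {outage:.0f} s behind")
```

              Under these made-up figures a sweep takes about 8 seconds when checks return quickly, but 400 seconds once every check waits out a 10-second timeout, so items end up more than 5 minutes late. Because the blocked pollers are shared, checks for hosts that are still reachable would be delayed too, which fits a queue that grows across the board.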

              At any rate, until we can get rid of the rekey issues and see if it solves the problem, this is in limbo - but I do not think it is a Zabbix mystery after all.
