Zabbix Alert Storm Issues

  • Mayhem
    Junior Member
    • Feb 2025
    • 9

    #1

    Hello,

    I'm running into an issue where all of our production servers irregularly fire "Zabbix agent is not available (or nodata for 30m)" alerts at night. Thing is, none of the servers or agents appear to be down, and the active Zabbix server isn't really experiencing any issues. Looking at the graphs, I do see drops when this happens (usually brief; the last one was an hour, since it looks like it happened twice), which makes sense because the agents can't talk to the server.

    I took a look at the Zabbix server logs around that period, but nothing stands out as an issue. The Zabbix agent logs show some sporadic failed-to-connect errors, but these don't always line up. I have checked our database backups and VM snapshots. The database backups start roughly 90 minutes before the issue and generally finish an hour after it occurs, while the VM snapshots take longer but start around 9 PM, so they don't seem to be responsible. I'm also not seeing any memory, CPU, or similar spikes on the database or the servers before it happens. There's one when they start responding again, of course.

    This issue only happens at night, usually around 1:00 AM my time. We're not in the cloud, and there are two servers and DBs running in high availability mode.
    I am running Zabbix Server 7.0.5 (same for the agents). We are running the agents in active mode.

    Sample errors from the Zabbix Agent log:
    1177:20251211:005818.695 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
    1177:20251211:005818.695 Active check configuration update started to fail
    1177:20251211:005821.699 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
    1177:20251211:005821.699 Active check data upload started to fail
    1177:20251211:010100.879 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
    1177:20251211:010100.879 Unable to send heartbeat message to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
    1177:20251211:020411.919 active check data upload to [127.0.0.1:10051] is working again
    1177:20251211:020412.364 Active check configuration update from [127.0.0.1:10051] is working again

    2025/12/03 00:58:07.217066 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
    2025/12/03 00:58:07.217066 [102] history upload to [x.x.x.x:10051] [servername] started to fail
    2025/12/03 00:58:07.246611 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
    2025/12/03 00:58:07.246611 [102] active check configuration update from host [servername] started to fail
    2025/12/03 00:58:58.245309 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
    2025/12/03 00:58:58.245309 [102] sending of heartbeat message for [servername] started to fail
    2025/12/03 00:59:22.495833 plugin 'Cpu': time spent in collector task 1.281398 s exceeds collecting interval 1 s
    2025/12/03 00:59:22.495833 plugin 'WindowsPerfMon': time spent in collector task 1.281398 s exceeds collecting interval 1 s
    2025/12/03 01:03:33.224217 [102] history upload to [x.x.x.x:10051] [servername] is working again
    2025/12/03 01:03:40.234425 [102] active check configuration update from [x.x.x.x:10051] is working again
    2025/12/03 01:04:14.201125 [102] sending of heartbeat message to [x.x.x.x:10051] is working again
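To compare these windows against the backup and snapshot schedules, the agent logs can be reduced to start/end pairs. The following is a minimal sketch, assuming the two timestamp layouts shown in the samples above and assuming that "started to fail" and "is working again" are the only markers that matter; the names `parse_line` and `outage_windows` are hypothetical helpers, not part of Zabbix.

```python
import re
from datetime import datetime

# Classic agent lines look like "PID:YYYYMMDD:HHMMSS.mmm message".
CLASSIC = re.compile(
    r"^\s*\d+:(\d{4})(\d{2})(\d{2}):(\d{2})(\d{2})(\d{2})\.(\d{3})\s+(.*)$"
)
# Agent 2 lines look like "YYYY/MM/DD HH:MM:SS.ffffff message".
AGENT2 = re.compile(r"^\s*(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def parse_line(line):
    """Return (timestamp, message) for a recognised log line, else None."""
    m = CLASSIC.match(line)
    if m:
        year, month, day, hour, minute, second, millis = map(int, m.groups()[:7])
        ts = datetime(year, month, day, hour, minute, second, millis * 1000)
        return ts, m.group(8)
    m = AGENT2.match(line)
    if m:
        return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f"), m.group(2)
    return None


def outage_windows(lines):
    """Pair the first 'started to fail' line with the next 'is working again' line.

    Assumption: any failing active-check function opens a window, and the
    first recovery message of any kind closes it.
    """
    windows = []
    started = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that match neither timestamp layout
        ts, msg = parsed
        if "started to fail" in msg and started is None:
            started = ts
        elif "is working again" in msg and started is not None:
            windows.append((started, ts, ts - started))
            started = None
    return windows


# Two lines taken from the classic agent sample above.
sample = [
    "1177:20251211:005818.695 Active check configuration update started to fail",
    "1177:20251211:020411.919 active check data upload to [127.0.0.1:10051] is working again",
]
for start, end, length in outage_windows(sample):
    print(start, "->", end, "lasted", length)
```

Run over the full samples, this gives one window of just under 66 minutes for the classic agent log and one of roughly five and a half minutes for the agent 2 log, both starting at about 00:58.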
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    When everything is stable and working well and there have been no security vulnerabilities that impact your environment, it's fine to stick with an old version.

    When you're having trouble, though, one of the first things most vendors will tell you to do is make sure you're running the latest version (in your series).

    Are all the client systems that are having problems on VM hosts? Or are the problems impacting both physical hosts and VMs?

    Have you left a 'tcpdump' or 'wireshark' network packet trace active during the time when the issues happen? I'm just wondering if that will give you any insight into where the problem may be.
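Where a full packet trace is hard to arrange, a lighter complement is a periodic TCP connect probe from an affected client to the server's port 10051. This is not a substitute for tcpdump: it only records whether a connection could be opened and how long that took. A minimal sketch, assuming Python is available on the client; `probe` and `run` are hypothetical names, and the timeout and interval values are arbitrary.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals, unreachable hosts
        return False, time.monotonic() - start, str(exc)


def run(host, port, interval=5.0, count=None):
    """Probe repeatedly and print one timestamped line per attempt.

    count=None loops until interrupted; pass a number for a bounded run.
    """
    attempts = 0
    while count is None or attempts < count:
        ok, elapsed, err = probe(host, port)
        status = "ok" if ok else "FAIL " + err
        print(f"{datetime.now().isoformat()} {host}:{port} {elapsed:.3f}s {status}")
        attempts += 1
        if count is None or attempts < count:
            time.sleep(interval)
```

Calling `run(server_address, 10051)` with the real server address and leaving it running overnight would show whether connects fail from that client at the same moments the agent logs report timeouts, which helps separate a network or listener problem from an agent-side one.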

    • Mayhem
      Junior Member
      • Feb 2025
      • 9

      #3
      Originally posted by tim.mooney
      When everything is stable and working well and there have been no security vulnerabilities that impact your environment, it's fine to stick with an old version.

      When you're having trouble, though, one of the first things most vendors will tell you to do is make sure you're running the latest version (in your series).

      Are all the client systems that are having problems on VM hosts? Or are the problems impacting both physical hosts and VMs?

      Have you left a 'tcpdump' or 'wireshark' network packet trace active during the time when the issues happen? I'm just wondering if that will give you any insight into where the problem may be.
      Hey,

      Unfortunately, corporate won't allow me to upgrade right now. They are trying to standardize and consolidate some programs and companies they've acquired into fewer instances, and we are already the most modern. But I will say it was working OK for a year before we ran into the issue. While we have continued to ingest additional servers, it hasn't been in large numbers since the initial spin-up of the app.

      Everything we have ingested right now is a VM. There were only a few bare metal servers remaining and we've decommissioned them. I did check with our virtualisation team but they're not seeing anything on their side.

      As for the tcpdump and Wireshark idea, I haven't done so. Again, everything is very restricted and siloed in my company, so I can't run those myself on any of the servers. I will need to reach out to my Unix team to see if they can set that up. I've also asked them to pull the message log from the server.

      A couple of things I've noticed since I posted:

      One, this issue is happening more frequently than expected. We see the error message about not being able to connect several times a day, but only very briefly, so those occurrences don't fire alerts like the ones at night do. As well, we do have some outstanding queue items pretty consistently under Zabbix Agent Active, so I was thinking that maybe there is a constant low-level issue that gets exacerbated when the system backs up and additional load is placed on it. Though I still haven't really seen anything new in the Zabbix server logs.

      I was thinking of increasing some of the capacity options a little bit in the config to see if that helps.
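Before changing anything, it may help to record which capacity-related settings are explicitly set in zabbix_server.conf today. The sketch below is an assumption-laden helper, not a recommendation: the thread does not say which options are meant, so the `CAPACITY_KEYS` list is a guess (with active agents, incoming connections are normally handled by trapper processes, which makes `StartTrappers` a plausible candidate), and the values in the example text are placeholders rather than suggested settings.

```python
# Hypothetical selection of capacity-related zabbix_server.conf parameters.
CAPACITY_KEYS = (
    "StartTrappers",
    "StartPollers",
    "StartPollersUnreachable",
    "CacheSize",
    "HistoryCacheSize",
    "HistoryIndexCacheSize",
    "ValueCacheSize",
    "TrendCacheSize",
)


def read_settings(text):
    """Parse 'Key=Value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def capacity_report(text):
    """Map each capacity key to its configured value, or None if not set."""
    settings = read_settings(text)
    return {key: settings.get(key) for key in CAPACITY_KEYS}


# Placeholder config text for illustration only; not real or recommended values.
example = """
# StartTrappers=5
StartTrappers=10
CacheSize=256M
"""
for key, value in capacity_report(example).items():
    print(key, "=", value if value is not None else "(not set, built-in default applies)")
```

Keeping a before/after copy of this report makes it easier to change one option at a time and tie any improvement in the queue or the nightly alerts back to a specific setting.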
