Hello,
I'm running into an issue where all of our production servers fire the alert "Zabbix agent is not available (or nodata for 30m)" at irregular intervals at night. The thing is, none of the servers or agents appear to be down, and the active Zabbix server isn't showing any obvious problems. Looking at the graphs, I do see drops when this happens (usually brief; the last one was about an hour, since it looks like it happened twice), which makes sense because the agents can't talk to the server.
I took a look at the Zabbix server logs around that period, but nothing stands out as an issue. The Zabbix agent logs show some sporadic failed-to-connect errors, but these don't always line up. I have checked our database backups and VM snapshots. The database backups start roughly 90 minutes before the issue and generally finish an hour after it occurs. The VM snapshots take longer but start around 9 PM, so they don't seem to be responsible. I'm also not seeing any memory, CPU, or other spikes on the database or the servers before it happens. There is one when the agents start responding again, of course.
This issue only happens at night, usually around 1:00 AM my time. We're not in the cloud, and there are two servers and two databases running in high availability mode.
I am running Zabbix Server 7.0.5 (same version for the agents). The agents run in active mode.
Sample errors from the Zabbix Agent log:
1177:20251211:005818.695 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
1177:20251211:005818.695 Active check configuration update started to fail
1177:20251211:005821.699 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
1177:20251211:005821.699 Active check data upload started to fail
1177:20251211:010100.879 Unable to connect to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
1177:20251211:010100.879 Unable to send heartbeat message to [127.0.0.1]:10051 [cannot connect to [[127.0.0.1]:10051]: connection timed out]
1177:20251211:020411.919 active check data upload to [127.0.0.1:10051] is working again
1177:20251211:020412.364 Active check configuration update from [127.0.0.1:10051] is working again
2025/12/03 00:58:07.217066 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
2025/12/03 00:58:07.217066 [102] history upload to [x.x.x.x:10051] [servername] started to fail
2025/12/03 00:58:07.246611 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
2025/12/03 00:58:07.246611 [102] active check configuration update from host [servername] started to fail
2025/12/03 00:58:58.245309 [102] cannot connect to [x.x.x.x:10051]: dial tcp :0->x.x.x.x:10051: i/o timeout
2025/12/03 00:58:58.245309 [102] sending of heartbeat message for [servername] started to fail
2025/12/03 00:59:22.495833 plugin 'Cpu': time spent in collector task 1.281398 s exceeds collecting interval 1 s
2025/12/03 00:59:22.495833 plugin 'WindowsPerfMon': time spent in collector task 1.281398 s exceeds collecting interval 1 s
2025/12/03 01:03:33.224217 [102] history upload to [x.x.x.x:10051] [servername] is working again
2025/12/03 01:03:40.234425 [102] active check configuration update from [x.x.x.x:10051] is working again
2025/12/03 01:04:14.201125 [102] sending of heartbeat message to [x.x.x.x:10051] is working again
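To check whether the failures really line up across hosts, the outage windows could be pulled out of each agent log and compared. The following is only a minimal sketch: it assumes the two timestamp formats seen in the samples above and the "started to fail" / "working again" wording, and the function names (parse_line, outage_windows) are made up for illustration, not part of Zabbix.

```python
import re
import sys
from datetime import datetime

# First sample format: PID:YYYYMMDD:HHMMSS.mmm message
CLASSIC = re.compile(r"^\d+:(\d{8}):(\d{6})\.(\d{3}) (.*)$")
# Second sample format: YYYY/MM/DD HH:MM:SS.ffffff message
SLASHED = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.(\d{6}) (.*)$")


def parse_line(line):
    """Return (timestamp, message) for a recognised log line, else None."""
    line = line.strip()
    m = CLASSIC.match(line)
    if m:
        ts = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")
        return ts.replace(microsecond=int(m.group(3)) * 1000), m.group(4)
    m = SLASHED.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
        return ts.replace(microsecond=int(m.group(2))), m.group(3)
    return None


def outage_windows(lines):
    """Pair the first 'started to fail' with the next 'working again'."""
    windows = []
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if "started to fail" in msg and start is None:
            start = ts  # open a window on the first failure only
        elif "working again" in msg and start is not None:
            windows.append((start, ts))  # close on the first recovery
            start = None
    return windows


if __name__ == "__main__":
    # Usage (hypothetical paths): python windows.py agent1.log agent2.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for begin, end in outage_windows(fh):
                print(path, begin, end, end - begin)
```

Running this against the logs of several hosts and sorting the output by start time would show whether every agent loses the server at the same moment (which would point at the server, HA failover, or the network) or whether the windows are scattered.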