Zabbix 6 Intermittent 6–7m stalls: proxies unreachable, negative events, syncer spike

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • ricardo_neves
    Junior Member
    • Aug 2022
    • 5

    #1

    Zabbix 6 Intermittent 6–7m stalls: proxies unreachable, negative events, syncer spike

    Hi all,
    We’re seeing intermittent “stall” periods in our Zabbix environment and could use some guidance on where to look next.

    Environment/architecture (Zabbix 6.0.33):
    2x HAProxy with Keepalived (VIP) in front of:
    2x Zabbix frontends
    2x Zabbix servers (Zabbix HA) - proxies connect directly to the Zabbix servers, with only network elements (firewalls) between them

    Each frontend and server has its own PGPool instance, all pointing to a single active PostgreSQL/TimescaleDB instance (migrated from MariaDB).
    20+ Zabbix proxies with MariaDB.
    6500+ Hosts.
    This setup has been stable for ~7 months.


    Symptom:
    We experience bursts where the proxies (and consequently the hosts they monitor) are reported as unreachable for a little over 6 minutes. After recovery, all resulting events are very short (under 1 minute) or even have a negative duration (likely due to delayed data arriving).
    Trigger expression: max(/*Hostname*/icmpping[*IP*],360s)<>1 and nodata(/*Hostname*/agent.ping,360s)=1
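The trigger logic can be modelled roughly as follows (a hypothetical Python sketch of the expression's semantics, not Zabbix internals; an empty window is treated as "not equal to 1", matching the behaviour we observe):

```python
def trigger_fires(ping_values, agent_ping_times, now, window=360):
    """Rough model of: max(icmpping,360s)<>1 and nodata(agent.ping,360s)=1.

    ping_values: list of (timestamp, value) tuples for icmpping (1 = reachable)
    agent_ping_times: timestamps at which agent.ping values were received
    """
    recent = [v for t, v in ping_values if now - t <= window]
    # max(...,360s)<>1: no successful ping seen inside the window
    max_not_ok = (max(recent) != 1) if recent else True
    # nodata(...,360s)=1: no agent.ping value received inside the window
    no_data = not any(now - t <= window for t in agent_ping_times)
    return max_not_ok and no_data

# Healthy host: recent successful ping and recent agent.ping -> no alert.
print(trigger_fires([(900, 1)], [950], now=1000))  # False
# Stall: proxy data older than the window on both items -> alert fires.
print(trigger_fires([(500, 1)], [500], now=1000))  # True
```

This is why a server-side write stall alone, with the proxies still collecting normally, is enough to fire this trigger for every affected host.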


    No corresponding errors in any logs during the outage window:
    Zabbix agents, proxies, servers: no lost-connection or timeout messages.
    Database and PGPool: no failover or error messages. The only exception is one specific slow-query warning (max time ~4 s) that has been present since setup (it occurred even when the environment had 0 active hosts).
    Networking/infrastructure: no packet loss, flaps, or failovers observed.
    Immediately after the stall ends, we see:
    Spikes in zabbix_server history syncer activity.
    CPU spike on the Zabbix server.
    Rapid flush of data cached on the proxies and automatic resolution of the triggered alerts.


    The first occurrence was about a month ago: three times in four days.
    It has recurred three times in the last two weeks (most recently yesterday), with the same pattern and duration.


    What we’ve checked so far:
    Networking: no evidence of drops or VIP failover during the time windows.
    OS/hypervisor: no sustained CPU/memory pressure, no disk saturation noted during the time window.
    Zabbix logs (current DebugLevel 3): no connection or timeout errors.
    DB/PGPool: no failover, no obvious errors in logs.
    We do see clear evidence of recovery (history syncer and CPU spikes) immediately after each window.
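    For the next occurrence, the cache/syncer behaviour could be captured continuously with Zabbix internal items, for example (item keys from Zabbix 6.0; which host they are assigned to is up to the deployment):

    ```
    zabbix[wcache,history,pfree]             # free space in the history write cache
    zabbix[process,history syncer,avg,busy]  # history syncer utilisation, %
    zabbix[queue,10m]                        # items overdue by more than 10 minutes
    zabbix[proxy,<proxy name>,lastaccess]    # last contact from a given proxy
    ```

    A dip in wcache pfree or a jump in the queue during the stall window would point at the server/DB write path rather than the network.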


    Are there known scenarios where delayed proxy data leads to negative event durations after recovery? Any specific bug IDs or versions to be aware of?
    For PostgreSQL/TimescaleDB behind PGPool, are there recommended settings or known pitfalls with Zabbix write-heavy workloads that could cause short write stalls?
    Sizing/tuning advice to avoid backlogs cascading into “proxies unreachable”: StartHistorySyncers, CacheSize, ValueCacheSize, StartPreprocessors, preprocessing workers, etc.
    We can provide further details (or are prepared to collect them on the next occurrence).
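    For context, the parameters in question are zabbix_server.conf directives; an illustrative snippet (values are hypothetical starting points for a deployment of this size, not our actual configuration):

    ```
    # zabbix_server.conf (illustrative values only)
    StartHistorySyncers=8      # history syncer processes (default is 4)
    CacheSize=512M             # configuration cache
    HistoryCacheSize=256M      # history write cache
    HistoryIndexCacheSize=64M  # index for the history write cache
    ValueCacheSize=1G          # value cache used for trigger evaluation
    StartPreprocessors=16      # preprocessing worker processes
    ```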


    Thank you in advance for any insights.
    Best regards,
    Ricardo Neves
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    Are there known scenarios where delayed proxy data leads to negative event durations after recovery?
    It is even described in the docs... not considered to be a bug...


    I don't have anything to say on the topic... but my gut feeling still points to network issues...


    • ricardo_neves
      Junior Member
      • Aug 2022
      • 5

      #3
      Hi,

      Thank you for your response.

      We know the reason behind the negative trigger durations.

      For some reason, we are experiencing short windows during which no data is written to the Zabbix server, although the proxies continue collecting it. This causes triggers to fire due to the apparent lack of data.

      When the issue resolves and data resumes being written to the database, the Zabbix server retroactively detects that data was actually collected during the alarm period. As a result, the alarm is resolved with a negative duration.
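A minimal sketch of why the duration goes negative (assuming, as described in the Zabbix docs, that events are stamped with the collection timestamp of the value that caused the state change, not with its arrival time at the server):

```python
def event_duration(problem_clock: int, recovery_clock: int) -> int:
    """Duration as displayed: recovery event clock minus problem event clock."""
    return recovery_clock - problem_clock

# Stall begins at t=1000: the proxy keeps collecting, the server receives nothing.
# nodata(...,360s) fires the problem at t=1360 using the server's own clock.
problem_clock = 1360
# At t=1420 the proxy flushes its backlog; a value collected at t=1100
# (during the stall) flips the trigger to OK, and the recovery event
# is stamped with the value's collection time, t=1100.
recovery_clock = 1100
print(event_duration(problem_clock, recovery_clock))  # prints -260
```

So the recovery event can legitimately predate the problem event once the backlog arrives, which matches what we see.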

      Our current challenge is identifying what’s preventing the data from reaching the server or being written to the database.

      Our first thought was a network issue, but after checking with our networking team, there’s no evidence of dropped packets or any kind of blockage.

      At this point, we suspect the root cause may be something at the application level within the Zabbix infrastructure.

      Any help or suggestions would be greatly appreciated.

      Best regards,
      Ricardo Neves
