Ad Widget

Collapse

Zabbix 6 Intermittent 6–7m stalls: proxies unreachable, negative events, syncer spike

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • ricardo_neves
    Junior Member
    • Aug 2022
    • 5

    #1

    Zabbix 6 Intermittent 6–7m stalls: proxies unreachable, negative events, syncer spike

    Hi all,
    We’re seeing intermittent “stall” periods in our Zabbix environment and could use some guidance on where to look next.

    Environment/architecture (Zabbix 6.0.33):
    2x HAProxy with Keepalived (VIP) in front of:
    2x Zabbix frontends
    2x Zabbix servers (Zabbix HA) - Proxies connect directly to Zabbix servers, with only network elements between them (Firewalls)

    Each frontend and server has its own PGPool instance, all pointing to a single active PostgreSQL/TimescaleDB instance (Migrated from MariaDB).
    20+ Zabbix proxies with MariaDB.
    6500+ Hosts.
    This setup has been stable for ~7 months.


    Symptom:
    We experience bursts where the proxies (and consequently, the hosts that they monitor) are reported as unreachable for a little over 6 minutes. After recovery, all events are very short (less than 1min) or even negative duration (likely due to delayed data arriving).
    Trigger expression: max(/*Hostname*/icmpping[*IP*],360s)<>1 and nodata(/*Hostname*/agent.ping,360s)=1


    No corresponding errors in any logs during the outage window:
    Zabbix agents, proxies, servers: no lost-connection or timeout messages.
    Database and PGPool: no failover or error messages. Only 1 specific slow query warning with max time ~4sec that is present since setup (happening even when environment had 0 active hosts).
    Networking/infrastructure: no packet loss, flaps, or failovers observed.
    Immediately after the stall ends, we see:
    Spikes in zabbix_server history syncer activity.
    CPU spike on the Zabbix server.
    Rapid flush of data cached on the proxies and automatic resolution of the triggered alerts.


    First occurrence was about a month ago, 3 times in 4 days.
    Recurred three times in this last 2 weeks (last time was yesterday), with the same pattern and duration.


    What we’ve checked so far:
    Networking: no evidence of drops or VIP failover during the time windows.
    OS/hypervisor: no sustained CPU/memory pressure, no disk saturation noted during the time window.
    Zabbix logs (current DebugLevel 3): no connection or timeout errors.
    DB/PGPool: no failover, no obvious errors in logs.
    We do see clear evidence of recovery (history syncer and CPU spikes) immediately after each window.


    Are there known scenarios where delayed proxy data leads to negative event durations after recovery? Any specific bug IDs or versions to be aware of?
    For PostgreSQL/TimescaleDB behind PGPool, are there recommended settings or known pitfalls with Zabbix write-heavy workloads that could cause short write stalls?
    Sizing/tuning advice to avoid backlogs cascading into “proxies unreachable”: StartHistorySyncers, CacheSize, ValueCacheSize, StartPreprocessors, preprocessing workers, etc.
    Details we can provide (or are prepared to collect on the next occurrence)


    Thank you in advance for any insights.
    Best regards,
    Ricardo Neves
  • ricardo_neves
    Junior Member
    • Aug 2022
    • 5

    #2
    Created a new topic for this problem in the

    Comment

    Working...