Best practice for LXC Load Monitoring to avoid "False Positives" during Backups?

  • Ipod86
    Junior Member
    • Dec 2025
    • 8

    #1


    Hi everyone, I'm struggling with a well-known issue regarding LXC containers on Proxmox and Zabbix monitoring.
    The Situation:
    I have about 11 LXC containers (Debian/Ubuntu) running on a Proxmox host (ZFS on NVMe). Since LXC containers share the host kernel, they all report the host's Load Average.
The Problem:
Whenever the Proxmox host performs a backup or has a brief I/O spike, the Load Average on all 11 containers spikes simultaneously. This triggers a "storm" of alerts in Zabbix, even though the individual containers are idling.
My Approach:
I'm thinking about modifying the default "Load average is too high" trigger to include CPU utilization as a secondary condition, to filter out host-induced noise.
My Questions:
Is combining Load + CPU utilization a solid approach for LXC? Or am I risking missing I/O wait issues (where Load is high but CPU utilization is low)?
If this is a good path, how should the expression look? I'm currently thinking of something like:
Code:
avg(/Linux by Zabbix agent/system.cpu.load[all,avg1],5m) > X
and avg(/Linux by Zabbix agent/system.cpu.util,5m) > 30
How do you handle this in production? Do you use specific UserParameters (like counting processes in cgroups), or do you simply silence Load alerts for LXCs and only monitor the Proxmox host's Load?
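For illustration, I imagine a UserParameter would look something like this (a rough sketch, untested; "custom.proc.dstate" is just a name I made up, and it counts D-state processes visible in the container's PID namespace rather than reading cgroups directly):
Code:
# zabbix_agentd.conf inside the container -- counts processes in
# uninterruptible sleep (state D) as seen from this PID namespace:
UserParameter=custom.proc.dstate,ps -eo stat= | awk '/^D/{c++} END{print c+0}'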
I would appreciate any advice or shared "best practice" templates for Proxmox/LXC environments.
Thanks in advance!

Please excuse my English; I used an AI assistant to help me phrase this post, as my English is not very good.

EDIT
For testing I'm trying:
Code:
(
  min(/Linux by Zabbix agent mod/system.cpu.load[all,avg1],5m)
  /
  last(/Linux by Zabbix agent mod/system.cpu.num)
) > {$LOAD_AVG_PER_CPU.MAX.WARN}
and
(
  last(/Linux by Zabbix agent mod/system.cpu.util[,user])
  + last(/Linux by Zabbix agent mod/system.cpu.util[,system]) > 30
  or last(/Linux by Zabbix agent mod/system.cpu.util[,iowait]) > 20
)
and last(/Linux by Zabbix agent mod/system.cpu.load[all,avg5]) > 0
and last(/Linux by Zabbix agent mod/system.cpu.load[all,avg15]) > 0
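For context: {$LOAD_AVG_PER_CPU.MAX.WARN} is the macro from the stock Linux template (default 1.5 per CPU core, if I remember correctly). For noisier containers it could be overridden at host level, e.g.:
Code:
# Host-level macro override (the value 2 is only an example, not tuned):
{$LOAD_AVG_PER_CPU.MAX.WARN} = 2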
    Last edited by Ipod86; 04-01-2026, 13:47.
  • troffasky
    Senior Member
    • Jul 2008
    • 565

    #2
    Originally posted by Ipod86
    Since LXC containers share the host kernel, they all report the host's Load Average.
    If that is true then I don't think there is any point collecting this metric at container-level.


    • troffasky
      Senior Member
      • Jul 2008
      • 565

      #3
      Having said that, if you still want to collect this metric and trigger on it, you might want to look at Event Correlation:

      It is possible to correlate events created by completely different triggers and apply the same operations to them all. By creating intelligent correlation rules it is actually possible to save yourself from thousands of repetitive notifications and focus on root causes of a problem!


      • Ipod86
        Junior Member
        • Dec 2025
        • 8

        #4
        Originally posted by troffasky

        If that is true then I don't think there is any point collecting this metric at container-level.
Regarding the question of why we should monitor Load Average in LXC at all if it's dependent on the host: you are right that the value is "inherited" from the host kernel. However, Load is still the only metric that shows processes stuck in an "Uninterruptible Sleep" state (I/O wait).

If I disable Load monitoring entirely, I might miss situations where a specific container is freezing due to disk latency issues on the ZFS pool, even if the CPU utilization is near 0%. My goal is to keep this visibility but filter out the "noise" caused by host-level events like backups.
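An alternative I'm considering for exactly this visibility: the agent's built-in proc.num item apparently supports a "disk" state selector on recent versions, which counts uninterruptible-sleep processes per container without relying on the shared load average. A minimal sketch (the item would need to be added to the template, and the threshold is a placeholder):
Code:
# Item key: number of D-state processes in this container's PID namespace
proc.num[,,disk]

# Possible trigger on top of it (threshold 5 is a guess, tune per container):
last(/Linux by Zabbix agent mod/proc.num[,,disk]) > 5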


        • Ipod86
          Junior Member
          • Dec 2025
          • 8

          #5
          Originally posted by troffasky
          Having said that, if you still want to collect this metric and trigger on it, you might want to look at Event Correlation:
I looked into the documentation, but I think Global Event Correlation might be "overkill" and perhaps too reactive for this specific issue. As far as I understand, it would allow Zabbix to close or suppress events after they have been triggered, based on certain conditions.

However, I prefer a proactive approach within the trigger logic itself. By using a combined expression like (Load > X and (CPU Util > 30% or I/O Wait > 20%)), I can prevent the alert storm from happening in the first place. This way, the container "knows" that high load without actual CPU or I/O activity inside the namespace is just host noise and remains silent. Global correlation seems more suited for suppressing alerts when a core switch or a host goes down entirely, rather than filtering high-frequency performance spikes.

Does anyone have experience with this specific "logic-based filtering" versus "event correlation" for Proxmox environments?
