Hi everyone, I'm struggling with a well-known issue regarding LXC containers on Proxmox and Zabbix monitoring.
The Situation:
I have about 11 LXC containers (Debian/Ubuntu) running on a Proxmox host (ZFS on NVMe). Since LXC containers share the host kernel, they all report the host's Load Average.
The Problem:
Whenever the Proxmox host performs a backup or has a brief I/O spike, the Load Average on all 11 containers spikes simultaneously. This triggers a "storm" of alerts in Zabbix, even though the individual containers are idling.
My Approach:
I'm thinking about modifying the default "Load average is too high" trigger to include CPU Utilization as a secondary condition to filter out host-induced noise.
My Questions:
Is combining Load + CPU Utilization a solid approach for LXC? Or am I risking missing "I/O Wait" issues (where Load is high but CPU Util is low)?
If this is a good path, how should the expression look? I'm currently thinking of something like:
avg(/Linux by Zabbix agent/system.cpu.load[all,avg1],5m) > X and avg(/Linux by Zabbix agent/system.cpu.util,5m) > 30
How do you handle this in production? Do you use specific UserParameters (like counting processes in cgroups) or do you simply silence Load alerts for LXCs and only monitor the Proxmox Host's Load?
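A minimal sketch of the process-counting idea, assuming that /proc inside the container lists only the container's own processes (which is the usual result of the PID namespace, but worth verifying on your setup). It counts processes whose main thread is runnable (R) or in uninterruptible sleep (D), the two states that feed the kernel's load average. The script name and item key in the usage note are hypothetical.

```python
#!/usr/bin/env python3
"""Count processes in state R or D that are visible under /proc.

Sketch only: it looks at each process's main thread, while the kernel's
load average counts individual tasks (threads), so this undercounts
multi-threaded workloads.
"""
import os


def parse_state(stat_line: str) -> str:
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' to reach the
    # state field that follows it.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_active(proc_root: str = "/proc") -> int:
    active = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                state = parse_state(fh.read())
        except (OSError, IndexError):
            continue  # process exited or stat was unreadable
        if state in ("R", "D"):
            active += 1
    return active


if __name__ == "__main__":
    # The script itself is running while it scans, so expect at least 1.
    print(count_active())
```

It could be exposed to Zabbix with a line such as `UserParameter=lxc.tasks.active,/usr/local/bin/lxc_active_tasks.py` (key and path are made up for illustration). A single sample is noisy, so a trigger would probably want avg() or min() over a few minutes rather than last().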
I would appreciate any advice or shared "Best Practice" templates for Proxmox/LXC environments.
Thanks in advance!
Please excuse my English, I used an AI assistant to help me phrase this post as my English is not very good.
EDIT
For testing, I am currently trying this expression:
Code:
(
    min(/Linux by Zabbix agent mod/system.cpu.load[all,avg1],5m)
    / last(/Linux by Zabbix agent mod/system.cpu.num)
) > {$LOAD_AVG_PER_CPU.MAX.WARN}
and (
    (
        last(/Linux by Zabbix agent mod/system.cpu.util[,user])
        + last(/Linux by Zabbix agent mod/system.cpu.util[,system])
    ) > 30
    or last(/Linux by Zabbix agent mod/system.cpu.util[,iowait]) > 20
)
and last(/Linux by Zabbix agent mod/system.cpu.load[all,avg5]) > 0
and last(/Linux by Zabbix agent mod/system.cpu.load[all,avg15]) > 0
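To sanity-check the logic before deploying it, the expression can be modelled as a plain Python function and fed a few hand-made scenarios. This is only a sketch: the function name, the sample values, and the 1.5 threshold standing in for {$LOAD_AVG_PER_CPU.MAX.WARN} are assumptions, not measurements from a real host.

```python
def load_trigger_fires(
    load_avg1_min_5m: float,  # min(system.cpu.load[all,avg1],5m)
    cpu_num: int,             # last(system.cpu.num)
    util_user: float,         # last(system.cpu.util[,user]), percent
    util_system: float,       # last(system.cpu.util[,system]), percent
    util_iowait: float,       # last(system.cpu.util[,iowait]), percent
    load_avg5: float,         # last(system.cpu.load[all,avg5])
    load_avg15: float,        # last(system.cpu.load[all,avg15])
    load_per_cpu_warn: float = 1.5,  # assumed macro value
) -> bool:
    """Mirror the trigger expression clause by clause."""
    load_high = (load_avg1_min_5m / cpu_num) > load_per_cpu_warn
    cpu_busy = (util_user + util_system) > 30
    io_stalled = util_iowait > 20
    return (
        load_high
        and (cpu_busy or io_stalled)
        and load_avg5 > 0
        and load_avg15 > 0
    )


# Hypothetical scenarios on a 4-CPU container.
scenarios = {
    "host backup, container idle": (12.0, 4, 2.0, 1.0, 0.5, 10.0, 8.0),
    "container CPU-bound": (12.0, 4, 40.0, 5.0, 0.5, 10.0, 8.0),
    "container stuck on I/O": (12.0, 4, 3.0, 2.0, 35.0, 10.0, 8.0),
    "busy CPU but low load": (1.0, 4, 40.0, 5.0, 0.0, 1.0, 1.0),
}

for name, values in scenarios.items():
    print(f"{name}: {load_trigger_fires(*values)}")
```

With these inputs only the CPU-bound and I/O-stalled cases fire, which is the intended filtering. The model does not settle whether the iowait value seen inside a container reflects only that container or also host activity, so that part still needs checking against real data.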