Hello everybody,
I maintain a server that keeps sending me alerts that its CPU load is too high. Which isn't good, of course, so I tried to figure out what is causing the CPU load. So far, I have been unsuccessful. I have written a little script that stores the process list, the processes doing anything on disk, and the actual load of the system in a logfile. This is the output of ps aux --sort=%cpu, iotop -b -n 1 and uptime. This script now runs every second (yes, I know that is quite often).
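For reference, this is a minimal sketch of what the script does (illustrative only; the real script may differ, and /var/log/load-debug.log is just a placeholder path):

#!/bin/bash
# Append a timestamped snapshot of processes, disk I/O and load every second.
LOGFILE=/var/log/load-debug.log    # placeholder path
while true; do
    {
        echo "------------------------< $(date) >------------------------------"
        uptime
        ps aux --sort=%cpu
        iotop -b -n 1    # batch mode, single iteration; needs root
    } >> "$LOGFILE"
    sleep 1
done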
But I noticed a difference between the load reported by Zabbix and the load seen by the server itself. Take the last alert e-mail I received, for example. The Zabbix interface reports that the load on 2014 Jan 01 15:12:36 was a whopping 5.14. When I dive into my own logs, I find the following:
------------------------< Wed Jan 1 15:12:36 CET 2014 >------------------------------
15:12:36 up 40 days, 15:30, 0 users, load average: 0.29, 0.40, 0.45
In other words: Zabbix thinks the CPU load is 5.14, uptime thinks the load is 0.29, in the very same second. I find such a big difference rather peculiar. What could explain this?
What I also find noteworthy is that the system.cpu.util values tell me there isn't any load at all. The machine is as good as idle, all day. Which is the situation I expect, and what my own log is telling me too.
Some extra information:
Item definition: system.cpu.load[all,avg1]
Trigger definition: {<system_hostname>:system.cpu.load[all,avg1].last(0)}>5
The server having problems runs Debian 7.3 (64-bit) with Zabbix agent 2.0.9 from backports.
The Zabbix server runs Zabbix 2.2.1 on CentOS 6.5 x86_64.
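I suppose one way to narrow this down would be to query the agent directly from the Zabbix server with zabbix_get and compare against /proc/loadavg on the host at the same moment (sketch below; <system_hostname> and port 10050 are just the values from my own setup):

# On the Zabbix server: ask the agent for the exact item key used above
zabbix_get -s <system_hostname> -p 10050 -k "system.cpu.load[all,avg1]"

# On the monitored host, at the same moment, for comparison
cat /proc/loadavg

If zabbix_get agrees with /proc/loadavg, the odd values would have to be introduced somewhere between agent and server (for example, data attributed to the wrong host); if it already returns the high value, the agent itself would be the place to look.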
What is causing this difference, and how do I get reliable monitoring for the CPU load?