Hi,
We're running an HPE Synergy 480 Gen10 server with Intel Xeon Gold 6244 CPU. In the BIOS, NUMA clustering and Sub-NUMA clustering are enabled. Two NUMA clusters are formed, each with half the memory and 8 cores assigned (there are 16 cores in total in the system).
We see that the Zabbix agent is allowed to run on only one NUMA node at a time and gets assigned by Windows to processor group 0. The Zabbix agent reports almost zero CPU load, which is true for the NUMA node that the agent is assigned to. The other NUMA node, however, is about 80% utilized (by Oracle), and Windows reports about 45% CPU load for the entire system. When the load on the NUMA node where Oracle runs reaches 100%, Oracle becomes slow, yet Zabbix still reports 0% CPU load because only the cores in NUMA node 1 are busy and the cores in NUMA node 0 are idle.
We can move the Zabbix agent to NUMA node 1 by changing its processor affinity from processor group 0 to processor group 1, and it will then report 100% CPU load for that NUMA node.
When running a CPU discovery against the node from the Zabbix server, we see the following output, in which there are indeed 16 cores, but 8 of these are reported as 'offline':
[root@rma02 ~]$ zabbix_get -s 10.25.254.53 -p 10050 -k system.cpu.discovery
{"data":[{"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":6,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":7,"{#CPU.STATUS}":"online"},{"{#CPU.NUMBER}":8,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":9,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":10,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":11,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":12,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":13,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":14,"{#CPU.STATUS}":"offline"},{"{#CPU.NUMBER}":15,"{#CPU.STATUS}":"offline"}]}
[root@rma02 ~]$
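To make the split easier to see, here is a minimal Python sketch that groups the CPU numbers in a system.cpu.discovery reply by their reported status. The payload is rebuilt in code to mirror the output above (CPUs 0-7 online, 8-15 offline), and the helper name cpus_by_status is only an illustrative choice, not part of Zabbix.

```python
import json

# Rebuild the discovery payload shown above: CPUs 0-7 online, CPUs 8-15 offline.
raw = json.dumps({"data": [
    {"{#CPU.NUMBER}": n, "{#CPU.STATUS}": "online" if n < 8 else "offline"}
    for n in range(16)
]})


def cpus_by_status(payload):
    """Group CPU numbers from a system.cpu.discovery reply by reported status.

    Hypothetical helper for illustration only; it just parses the JSON.
    """
    groups = {}
    for entry in json.loads(payload)["data"]:
        groups.setdefault(entry["{#CPU.STATUS}"], []).append(entry["{#CPU.NUMBER}"])
    return groups


groups = cpus_by_status(raw)
for status, numbers in sorted(groups.items()):
    print(f"{status}: {len(numbers)} CPUs -> {numbers}")
```

In a real check, the string passed to cpus_by_status would be the reply returned by zabbix_get rather than the rebuilt payload.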
By chance (we think), the Oracle processes all run on NUMA node 1, which holds all the 'offline' CPUs in the above output, and they are therefore not monitored by Zabbix.
Is this the intended way of operation for the Zabbix agent? (I can imagine it is, since reporting 50% overall load when a single NUMA node is at 100% also makes no sense.)
How should we treat such a setup monitoring-wise? What happens now is that Zabbix reports 0% CPU load (which is correct from the Zabbix perspective, which shows only NUMA node 0), Oracle uses 100% CPU and is slow (which is correct from the Oracle perspective), and Windows reports 50% CPU load in Task Manager (which is correct from the perspective of the entire system).
So basically I have 3 different values that are all correct .. :-(
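The arithmetic behind these three values can be sketched in a few lines of Python. This assumes the system-wide figure is simply the core-weighted mean of the per-node loads, with node 0 idle and node 1 saturated. The function name system_load is illustrative, and the "busiest node" figure is only shown to illustrate why the average hides the saturated node.

```python
def system_load(node_loads, cores_per_node):
    """Core-weighted mean of per-NUMA-node CPU loads, in percent."""
    total_cores = sum(cores_per_node)
    weighted = sum(load * cores for load, cores in zip(node_loads, cores_per_node))
    return weighted / total_cores


# Assumed scenario from this post: node 0 idle, node 1 fully busy, 8 cores each.
node_loads = [0.0, 100.0]
cores_per_node = [8, 8]

overall = system_load(node_loads, cores_per_node)  # roughly what Task Manager shows
agent_view = node_loads[0]                         # what an agent bound to node 0 sees
busiest = max(node_loads)                          # what Oracle experiences on node 1

print(f"agent on node 0: {agent_view:.0f}%")
print(f"whole system:    {overall:.0f}%")
print(f"busiest node:    {busiest:.0f}%")
```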
Advice appreciated!
KR,
Rob.