Hi
I would like to discuss the best trigger thresholds for hardware monitoring.
Here is what I've come up with:
I)Memory
1)\Memory\Available Bytes
{t: perf_counter["\Memory\Available Bytes"].avg(300)}<10000000
If the average is below 10 megabytes over the last 5 minutes - fire trigger: Low memory
2) \Memory\Pages/sec
{t: perf_counter["\Memory\Pages/sec"].avg(1800)}>50
If the average is higher than 50 over the last 30 minutes - fire trigger:
a) If Trigger 1 is also fired: Low memory
b) If Trigger 1 is not fired: High Pages
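To make the a)/b) logic concrete, here is a rough Python sketch of how the two memory triggers could be combined. It assumes the samples for each window are already collected into plain lists; the function name and constants are made up for illustration and are not part of any monitoring tool.

```python
from statistics import mean

LOW_MEMORY_BYTES = 10_000_000  # threshold from trigger 1 (Available Bytes)
HIGH_PAGES_PER_SEC = 50        # threshold from trigger 2 (Pages/sec)

def classify_memory(available_bytes_5min, pages_per_sec_30min):
    """Return the alert name for the memory triggers, or None if nothing fires.

    Both arguments are hypothetical lists of samples covering the
    5-minute and 30-minute windows respectively.
    """
    low_memory = mean(available_bytes_5min) < LOW_MEMORY_BYTES
    high_pages = mean(pages_per_sec_30min) > HIGH_PAGES_PER_SEC
    if low_memory:
        # Trigger 1 fired; high paging (if any) is reported as the same problem.
        return "Low memory"
    if high_pages:
        # Trigger 2 fired on its own.
        return "High Pages"
    return None
```

With this shape, high paging never produces a second alert while the low memory trigger is already active.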
II)CPUs
1)\Processor(_Total)\% Processor Time
{t: perf_counter["\Processor(_Total)\% Processor Time"].avg(1800)}>80
If the average is more than 80% over the last 30 minutes - fire trigger: High CPU utilization
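In all of these expressions I read avg(N) as "the mean of the samples from the last N seconds". A minimal Python sketch of that reading, assuming samples arrive as (timestamp in seconds, value) pairs; the class name and the sample data are invented for illustration:

```python
from collections import deque

class WindowAverage:
    """Mean of the samples whose timestamp lies within the last window_sec seconds."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_sec:
            self.samples.popleft()

    def avg(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)

# Hypothetical CPU samples taken every 10 minutes.
cpu = WindowAverage(1800)
for timestamp, percent in [(0, 95), (600, 90), (1200, 70), (1800, 85)]:
    cpu.add(timestamp, percent)

cpu_trigger_fired = cpu.avg() > 80
```

A short spike barely moves a 30-minute average, which is why I use long windows to cut down on false alerts.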
III)Physical disks
Use _Total if there is only one disk (or a hardware RAID)
1)\PhysicalDisk(_Total)\% Disk Time
{t: perf_counter["\PhysicalDisk(_Total)\% Disk Time"].avg(1800)}>80
If the average is more than 80% over the last 30 minutes - fire trigger: HDD is slow
2)\PhysicalDisk(_Total)\Avg. Disk Queue Length
{t: perf_counter["\PhysicalDisk(_Total)\Avg. Disk Queue Length"].avg(1800)}>2
If the average is more than 2 over the last 30 minutes - fire trigger: HDD is slow
IV)Network
Switched Fast Ethernet network
1)\Network Interface(NIC LAN)\Bytes Sent/sec
{t: perf_counter["\Network Interface(NIC LAN)\Bytes Sent/sec"].avg(1800)}>10000000
If the average is more than 80 Mbit/s (10,000,000 bytes/sec, since the counter is in bytes) over the last 30 minutes - fire trigger: Network bottleneck
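Because the counter reports bytes per second while link speeds are quoted in bits per second, the threshold needs a unit conversion. A small Python sketch of that arithmetic, assuming decimal units (1 Mbit = 1,000,000 bits); the helper names are made up:

```python
BITS_PER_BYTE = 8

def mbit_to_bytes_per_sec(mbit_per_sec):
    """Convert a rate in Mbit/s to bytes/s (decimal megabits)."""
    return mbit_per_sec * 1_000_000 // BITS_PER_BYTE

def bytes_per_sec_to_mbit(bytes_per_sec):
    """Convert a rate in bytes/s to Mbit/s (decimal megabits)."""
    return bytes_per_sec * BITS_PER_BYTE / 1_000_000

fast_ethernet_limit = mbit_to_bytes_per_sec(100)  # full line rate in bytes/s
threshold_80_mbit = mbit_to_bytes_per_sec(80)     # 80% of the link in bytes/s
```

A value of 80,000,000 on a bytes/sec counter would correspond to 640 Mbit/s, which a Fast Ethernet link can never reach.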
This is my 'beta' version of the triggers, and I'm wondering how accurate they actually are. I want every real bottleneck and poor-performance event to fire a trigger, while keeping the number of false alerts to a minimum.
Questions:
1)Which triggers and counters do you use for hardware monitoring, and how accurate do you think mine are?
2)Pages/sec is above 50 but Available Bytes is OK. What does that mean?
3)High % Disk Time. Does it really mean bad performance?
4)What else am I missing for the full picture?
P.S. The Windows Performance Monitor counters here are only examples. They may not be written exactly right.