**Linwood** · 07-07-2018, 03:15

Zabbix doesn't do well combining data onto single graphs, at least not in terms of different hosts. It's possible of course, just arcane and tedious.

As to the first part, what I think you want to look at are the "trigger expression" functions, like min, max, average, etc. Then define what "spike" means. It's possible to find things like "if the average over the last 5 intervals is more than twice the average over the last 50", but one person's "unusual" is another person's "normal". Look in the trigger expression portion of the manual to see what kind of functions are available.

**JSantram** · 12-07-2018, 17:20

Originally posted by Linwood

Zabbix doesn't do well combining data onto single graphs, at least not in terms of different hosts. It's possible of course, just arcane and tedious.

As to the first part, what I think you want to look at are the "trigger expression" functions, like min, max, average, etc. Then define what "spike" means. It's possible to find things like "if the average over the last 5 intervals is more than twice the average over the last 50", but one person's "unusual" is another person's "normal". Look in the trigger expression portion of the manual to see what kind of functions are available.

I am still trying to figure out these trigger expressions.

Could you show me what the expression below would look like? Sorry for being a pain.

"if the average over the last 5 intervals is more than twice the average over the last 50 intervals"

Also, how would I set up the item for this trigger?

**Linwood** · 12-07-2018, 18:14

JSantram, as an example here is one I use for ping checks.

( {TRIGGER.VALUE}=0 and {Template Ping Checks:PingLoss.avg(30m)}> 10) or
( {TRIGGER.VALUE}=1 and {Template Ping Checks:PingLoss.avg(60m)}> 1 )

That says if there is no current alert, and the average ping loss over 30 minutes exceeds 10%, then alert. Continue alerting until the average over 60 minutes is below 1%. This pre-dates the recovery expression, so the recovery is built into the trigger.
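For anyone who finds the two-branch logic easier to read as code, here is a rough Python sketch of the same hysteresis idea. It is only an illustration, not anything Zabbix runs: the function name `next_state` and the sample numbers are made up, and the two averages are assumed to be computed elsewhere.

```python
def next_state(in_problem, avg_30m, avg_60m, raise_above=10.0, clear_below=1.0):
    """Return True if the trigger should be in PROBLEM after this evaluation.

    Hypothetical helper mirroring the two branches of the expression above.
    """
    if not in_problem:
        # {TRIGGER.VALUE}=0 branch: fire only when the 30m average exceeds 10
        return avg_30m > raise_above
    # {TRIGGER.VALUE}=1 branch: stay in PROBLEM while the 60m average is above 1
    return avg_60m > clear_below


# Made-up (avg_30m, avg_60m) pairs, one per evaluation
samples = [(0.5, 0.3), (12.0, 6.0), (4.0, 8.0), (0.0, 2.0), (0.0, 0.8)]

state = False
history = []
for a30, a60 in samples:
    state = next_state(state, a30, a60)
    history.append(state)

print(history)  # [False, True, True, True, False]
```

Note how the third and fourth samples keep the alert active even though the 30m average is already back under 10. That gap between the raise and clear thresholds is what stops the trigger from flapping.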

So you can mix and match different averaging windows for the same item in one expression. Something like

{Template Ping Checks:PingLoss.avg(30m)} > 2 * {Template Ping Checks:PingLoss.avg(60m)}

That would alert if the average over the last 30 minutes is more than twice the average over the last 60 minutes. You can use "#" for count of polls as opposed to a time period for polls if you prefer.
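As a sketch of the count-based version asked about earlier ("the last 5 intervals more than twice the last 50"), which in trigger terms would be along the lines of avg(#5) versus avg(#50), here is the comparison in plain Python. The name `is_spike` and the sample data are invented for illustration. As with the trigger, the larger window includes the recent values.

```python
from statistics import mean


def is_spike(values, recent=5, baseline=50, factor=2.0):
    """True if the mean of the last `recent` values exceeds
    `factor` times the mean of the last `baseline` values.

    `values` is ordered oldest first. Hypothetical helper for illustration.
    """
    if len(values) < baseline:
        return False  # not enough history to compare yet
    recent_avg = mean(values[-recent:])
    baseline_avg = mean(values[-baseline:])  # note: overlaps the recent window
    return recent_avg > factor * baseline_avg


flat = [10.0] * 50
jump = [10.0] * 45 + [40.0] * 5

print(is_spike(flat))  # False
print(is_spike(jump))  # True: recent mean 40 vs. baseline mean 13
```

Because the baseline overlaps the recent window, a jump has to be noticeably larger than the factor suggests before it fires. Whether that is acceptable depends on what you decide "spike" means.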

**JSantram** · 16-07-2018, 16:21

Originally posted by Linwood

JSantram, as an example here is one I use for ping checks.

( {TRIGGER.VALUE}=0 and {Template Ping Checks:PingLoss.avg(30m)}> 10) or
( {TRIGGER.VALUE}=1 and {Template Ping Checks:PingLoss.avg(60m)}> 1 )

That says if there is no current alert, and the average ping loss over 30 minutes exceeds 10%, then alert. Continue alerting until the average over 60 minutes is below 1%. This pre-dates the recovery expression, so the recovery is built into the trigger.

So you can mix and match different averaging windows for the same item in one expression. Something like

{Template Ping Checks:PingLoss.avg(30m)} > 2 * {Template Ping Checks:PingLoss.avg(60m)}

That would alert if the average over the last 30 minutes is more than twice the average over the last 60 minutes. You can use "#" for count of polls as opposed to a time period for polls if you prefer.

Well yes, this works for ping loss, which I already have set up. I could use help setting up a rule that fires when, over the last 30 minutes, network throughput has risen to, let's say, 3 times its normal average.

How would this expression look? I am still somewhat confused about how it would be done. I am getting a grasp of the basics but still working through growing pains.

**Linwood** · 16-07-2018, 16:40

The issue is to define the "normal average", especially if that average includes the most recent period, which it generally will.

The straightforward implementation is to use the time shift feature, e.g. avg(30m) / avg(180m,30m) to compare the most recent 30 minutes against the 180 minutes before them. Let's say that entire 30 minutes was 5 times higher. The issue is that, minute by minute, those higher values are now fed into the prior 180m window, raising it. If traffic stays 5 times higher for 180 minutes, then that higher value becomes "normal" (actually much sooner if you are looking for a large jump).
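To make the drift concrete, here is a small Python simulation, not a Zabbix expression. It assumes one sample per minute, a flat level of 100 that steps to 500 and stays there, and a hypothetical helper `shifted_ratio` that stands in for avg(30m) / avg(180m,30m).

```python
from statistics import mean


def shifted_ratio(values, recent=30, baseline=180):
    """Mean of the last `recent` samples divided by the mean of the
    `baseline` samples immediately before them (oldest first)."""
    recent_avg = mean(values[-recent:])
    baseline_avg = mean(values[-(recent + baseline):-recent])
    return recent_avg / baseline_avg


# Assumed data: 210 minutes at a "normal" level of 100
series = [100.0] * 210

# Traffic then jumps to 500 and stays there for 300 minutes
ratios = []
for minute in range(1, 301):
    series.append(500.0)
    ratios.append(shifted_ratio(series))

print(max(ratios))   # 5.0, reached once the recent window is all high values
print(ratios[89])    # 90 minutes in: already well under 3
print(ratios[-1])    # 1.0, the sustained level has become the baseline
```

With a threshold of 3x, this kind of check would go quiet roughly an hour after the jump, even though traffic never came back down.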

You can lengthen the period of course to reduce short-term influences, but here is a real-world example I've hit several times at new clients: someone starts replicating over their WAN and saturates the circuit for literally days. Any time-dependent test like this will reset and start saying "OK" before long. Too long a period may also become a performance issue if these are high-volume triggers, e.g. if Zabbix has to average every value over days.

You can also look at percentile, but I think you will find the same issue, that over any given period, exceptional values that persist become the new normal when they probably shouldn't. Forecast is interesting, as you may be able to use it to notice a sudden rise, but again resetting is likely to occur.

Now if all you want is to detect spikes, and NOT notice persistent large increases because they are not spikes per se, then I think these can work. But as a network management need, persistent large increases are often worse than brief spikes.

Creating Network Utilization Graph that alerts with unusual spiking
