Creating Network Utilization Graph that alerts with unusual spiking

  • JSantram
    Junior Member
    • Jul 2018
    • 4

    #1

    Hey All,

    Sorry if this has been covered but I am a complete newbie to Zabbix and need some help with this one.

    I was wondering if someone could assist me in creating a Network Utilization Graph that alerts with unusual spiking that lasts for XX minutes or more.

    This will monitor all Windows and Linux machines on ONE graph, if that is possible. I'm sure the alerting is possible too; I just need guidance.

    I am still trying to grasp the concept of all the formulas in Zabbix, so if someone could give me a complete step-by-step guide on this, that would be great.

    Or, if someone already has a template like this, could you forward it to me?

    Thanks
  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #2
    Zabbix doesn't do well combining data onto single graphs, at least not in terms of different hosts. It's possible of course, just arcane and tedious.

    As to the first part, what I think you want to look at are the "trigger expression" functions, like min, max, average, etc. Then define what "spike" means. It's possible to find things like "if the average over the last 5 intervals is more than twice the average over the last 50", but one person's "unusual" is another person's "normal". Look in the trigger expression portion of the manual to see what kind of functions are available.
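
    To make the "last 5 versus last 50" idea concrete, here is a minimal sketch in plain Python rather than Zabbix trigger syntax. The function name `is_spike`, its parameters, and the sample values are hypothetical and chosen only for illustration; the real check would be written with Zabbix's own trigger functions.

```python
from statistics import mean

def is_spike(samples, short=5, long=50, factor=2.0):
    """True if the mean of the last `short` samples exceeds
    `factor` times the mean of the last `long` samples."""
    if len(samples) < long:
        return False  # not enough history to judge
    recent = mean(samples[-short:])
    baseline = mean(samples[-long:])  # note: includes the recent samples too
    return recent > factor * baseline

# Hypothetical throughput readings, one per polling interval.
steady = [10.0] * 50
spiky = [10.0] * 45 + [100.0] * 5

print(is_spike(steady))  # False: recent mean equals the baseline
print(is_spike(spiky))   # True: recent mean 100 vs. baseline mean 19
```

    Note that the longer window contains the recent samples, so a spike raises its own baseline; what counts as "unusual" still has to be decided per environment.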


    • JSantram
      Junior Member
      • Jul 2018
      • 4

      #3
      I am still trying to figure out these trigger expressions.

      Could you show me what the expression below would look like? Sorry for being a pain.

      "if the average over the last 5 intervals is more than twice the average over the last 50 intervals"

      Also, how would I set up the item for this trigger?


      • Linwood
        Senior Member
        • Dec 2013
        • 398

        #4
        JSantram, as an example, here is one I use for ping checks.

        ( {TRIGGER.VALUE}=0 and {Template Ping Checks:PingLoss.avg(30m)}> 10) or
        ( {TRIGGER.VALUE}=1 and {Template Ping Checks:PingLoss.avg(60m)}> 1 )

        That says if there is no current alert, and the average ping loss over 30 minutes exceeds 10%, then alert. Continue alerting until the average over 60 minutes is below 1%. This pre-dates the recovery expression, so the recovery is built into the trigger.

        So you can mix and match different intervals for the same item. Something like:

        {Template Ping Checks:PingLoss.avg(30m)} > 2 * {Template Ping Checks:PingLoss.avg(60m)}

        That would alert if the average over the last 30 minutes is more than twice the average over the last 60 minutes. You can use "#" for count of polls as opposed to a time period for polls if you prefer.
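
        The two-branch ping-loss trigger above is a form of hysteresis: one threshold raises the alert and a different, stricter one clears it. A minimal sketch of that logic in plain Python follows; the function `next_state` and the sample averages are hypothetical, and the thresholds are simply the 10% and 1% from the expression.

```python
def next_state(in_problem, avg_30m, avg_60m, raise_above=10.0, clear_below=1.0):
    """Mirror of the trigger: when OK, fire if the 30-minute average exceeds
    `raise_above`; when already in problem, stay there while the 60-minute
    average is still above `clear_below`."""
    if not in_problem:
        return avg_30m > raise_above
    return avg_60m > clear_below

# Hypothetical (avg over 30m, avg over 60m) ping-loss percentages over time.
readings = [(2.0, 1.5), (12.0, 6.0), (4.0, 5.0), (0.5, 2.0), (0.2, 0.8)]

state = False
history = []
for avg_30m, avg_60m in readings:
    state = next_state(state, avg_30m, avg_60m)
    history.append(state)

print(history)  # [False, True, True, True, False]
```

        The alert stays raised through the third and fourth readings even though the 30-minute average has dropped well below 10%, which is what keeps the trigger from flapping.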


        • JSantram
          Junior Member
          • Jul 2018
          • 4

          #5
          Well yes, this works for ping loss, which I already have set up. I could use some help setting up a rule that fires when, over the last 30 minutes, network throughput has risen to, let's say, 3 times its normal average.

          How would this expression look? I am still somewhat confused about how it would be done. I am getting a grasp of the basics but still working through growing pains.


          • Linwood
            Senior Member
            • Dec 2013
            • 398

            #6
            The issue is to define the "normal average", especially if that average includes the most recent period, which it generally will.

            The straightforward implementation is to use the time shift feature, e.g. avg(30m) / avg(180m,30m), to compare the most recent 30 minutes against the 180 minutes before that. Let's say that entire 30 minutes was 5 times higher. The issue is that, minute by minute, the higher values are now fed into the prior 180m window, raising it. If traffic stays 5 times higher for 180 minutes, then that higher value becomes "normal" (actually much sooner if you are looking for a large jump).
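
            A small simulation in plain Python illustrates how quickly the shifted baseline absorbs a sustained increase. Everything here is an assumption made for illustration: one sample per minute, traffic that jumps from 10 to 50 and stays there, a 3x alert threshold, and the helper name `ratio`, which only mimics avg(30m) / avg(180m,30m).

```python
from statistics import mean

def ratio(series, t, recent=30, baseline=180):
    """Mean of the last `recent` samples before time t, divided by the mean
    of the `baseline` samples immediately preceding that recent window."""
    recent_avg = mean(series[t - recent:t])
    baseline_avg = mean(series[t - recent - baseline:t - recent])
    return recent_avg / baseline_avg

# Hypothetical per-minute throughput: 300 minutes at 10, then 5x higher for good.
jump_at = 300
series = [10.0] * jump_at + [50.0] * 600

# Minutes (sample counts) at which "recent > 3 x baseline" holds.
alerting = [t for t in range(210, len(series) + 1) if ratio(series, t) > 3.0]

print("first alert:", alerting[0] - jump_at, "minutes after the jump")
print("last alert: ", alerting[-1] - jump_at, "minutes after the jump")
```

            In this synthetic run the condition first holds 16 minutes after the jump and stops holding about an hour after it, even though throughput never comes back down.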

            You can lengthen the period, of course, to try to reduce short-term influences. But as a real-world example that I've hit several times at a new client: someone starts replicating over their WAN and saturates the circuit for literally days. So any time-dependent test like this will reset and start saying "OK" before long. Also, too long a period may prove a performance issue if these are high-volume triggers, e.g. if it has to average every value over days.

            You can also look at percentile, but I think you will find the same issue, that over any given period, exceptional values that persist become the new normal when they probably shouldn't. Forecast is interesting, as you may be able to use it to notice a sudden rise, but again resetting is likely to occur.

            Now if all you want is to detect spikes, and NOT notice persistent large increases because they are not spikes per se, then I think these can work. But as a network management need, persistent large increases are often worse than brief spikes.
