Help needed (or feature request if it can't be done)

  • alj
    Senior Member
    • Aug 2006
    • 188

    #1

    Help needed (or feature request if it can't be done)

    I have a problem with one of the monitors.
    Our DC temperature stayed near the alarm threshold for about a day, occasionally going 1 degree higher or lower. This created a big page storm, with the temperature alarm going on and off all day.

    Is there any way to implement threshold hysteresis in the trigger? For example, let's say it goes on at 30C and goes off at 28C.

    Or maybe there might be some other way to avoid this situation?
  • pdwalker
    Senior Member
    • Dec 2005
    • 166

    #2
    There are a couple of ways you can do this, depending on your needs.

    Assuming you had a trigger called temperature, you could do this: (I may have the syntax wrong, so please check the docs)

    Code:
    (({host:temperature.prev(0)} > 30) & ({host:temperature.last(0)} > 30))
    This says that the trigger only fires if the last two values were over your threshold.

    This is still prone to false positives if the values flap a lot as it only checks two values.

    The other way I prefer is to do this:

    Code:
    {host:temperature.max(<time interval>)} > 30
    This version takes the max temperature for that period of time, so a single check covers several polling intervals. E.g. if I monitor at 60 second intervals and set my time interval to 300, I am checking 5 values.

    However, that might still not be good enough. The above one should "flap" less often, but if the temperature is constantly changing between, say, 29 and 31 degrees, the trigger will remain on.

    To avoid this, use this trigger instead:

    Code:
    {host:temperature.min(<time interval>)} > 30
    The difference here is that now we are checking whether the minimum temperature recorded during that time interval is above the threshold, so any "flapping" to values just below the threshold will prevent the trigger from being set off during that interval.

    Of course, this trigger will not detect a large, quick rise in temperature (or at least will not for the period of <time interval>).

    As a further improvement, I might alter it to include this:

    Code:
    ( ({host:temperature.min(<time interval>)} > 30) |
      ({host:temperature.last(0)} > 38) )
    So here, I am actually setting two thresholds. The first is where a sustained temperature change is noteworthy, the second (and perhaps more serious one) is when the temperature passes a higher temperature and you want to be notified as soon as it happens (help! my server is on fire!)


    Hope that helps.

    - Paul


    • alj
      Senior Member
      • Aug 2006
      • 188

      #3
      Thanks for the great ideas, but that really will not resolve the problem. None of those workarounds will work if the temperature is flapping around the threshold by 1 degree randomly.

      There should be 2 conditions in a trigger IMO: a trigger ON condition and an optional trigger OFF condition to resolve this.

      In the case of temperature I would configure the ON condition as temperature going above 31C (need to send a page right away with no delay) and the OFF condition as temperature staying below 28C for 5 min (gotta be completely sure before clearing the alarm condition). That would completely avoid page storms.

      Sometimes the OFF condition should be manual (i.e. after a syslog error the operator fixes the problem and resets the trigger).


      • alj
        Senior Member
        • Aug 2006
        • 188

        #4
        BTW to implement hysteresis all we need is a variable which shows the current trigger condition.

        ({this:trigger.condition.last(0)}=1 && {host:temp.last(0)} >28C) || ({this:trigger.condition.last(0)}=0 && {host:temp.last(0)} >31C)

        Something like that. This would be a priceless feature for minimal changes in the sources.
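
The idea in that expression can be sketched outside Zabbix as a small state machine. The Python below is only an illustration of two-threshold hysteresis under the 31C / 28C values mentioned in this thread; the function names and sample readings are invented, and it does not model the "stays below 28C for 5 min" delay from the earlier post.

```python
def hysteresis_states(readings, on_above=31, off_at_or_below=28):
    """Trigger state after each reading, remembering the previous state."""
    state = False  # trigger starts OFF
    states = []
    for temp in readings:
        if state:
            # Already ON: stay ON until the reading drops to the lower threshold.
            state = temp > off_at_or_below
        else:
            # Currently OFF: only switch ON above the higher threshold.
            state = temp > on_above
        states.append(state)
    return states

def count_changes(states):
    """Number of ON/OFF transitions in a list of states."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)

# Invented readings that wander around the ON threshold.
readings = [30, 31, 32, 31, 30, 32, 29, 31, 28, 30, 32]

simple = [temp > 31 for temp in readings]
with_hysteresis = hysteresis_states(readings)

print("simple threshold changes:", count_changes(simple))
print("hysteresis changes:", count_changes(with_hysteresis))
```

Because the OFF decision uses a different threshold from the ON decision, small wobbles around 31 no longer toggle the state; only a drop to 28 clears it.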


        • James Wells
          Senior Member
          • Jun 2005
          • 664

          #5
          Greetings,
          Originally posted by alj
          In the case of temperature I would configure the ON condition as temperature going above 31C (need to send a page right away with no delay) and the OFF condition as temperature staying below 28C for 5 min (gotta be completely sure before clearing the alarm condition). That would completely avoid page storms.
          While Paul's suggestions will do some of what you are looking for, I suspect a better solution would be something like the following:
          Code:
          ({host:temperature.avg(#5)} > 30) | ({host:temperature.last(0)} > 33)
          This will set the trigger to ON only if the average temperature from the last 5 checks is above 30C, or if the last temperature was greater than 33.

          From there, you could actually build a graduated model if you chose. This would give you the advantage of watching it in the triggers but only getting paged / messaged if the temperature increased beyond a set threshold. I actually do something like this for a couple of monitors. In this case, it would look something like the following:
          Code:
          This is the informational trigger, does not send alerts, but shows up on trigger status page.
          ({host:temperature.avg(#5)} => 28)  & ({host:temperature.avg(#5)} < 30)
          Code:
          This is the High trigger, sends a single alert, and shows up on trigger status page.
          ({host:temperature.avg(#5)} => 30)  & ({host:temperature.avg(#5)} < 32)
          Code:
          This is the Critical trigger, sends alerts every 10 minutes until issue is resolved, and shows up on trigger status page.
          {host:temperature.avg(#5)} => 32
          These will drastically reduce false positives, as well as allow for flapping.
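
As a rough illustration of the graduated model, the Python sketch below maps the average of the last 5 readings onto the three bands described above. It assumes `=>` in the expressions is meant as "greater than or equal to"; the function name `severity` and the sample readings are made up, and the alerting behaviour per band (no alert, single alert, repeated alerts) is configured in Zabbix actions and is not modelled here.

```python
from statistics import mean

def severity(readings, count=5):
    """Classify the average of the newest `count` readings into a band."""
    avg = mean(readings[-count:])
    if avg >= 32:
        return "critical"       # repeated alerts until resolved
    if avg >= 30:
        return "high"           # single alert
    if avg >= 28:
        return "informational"  # status page only, no alert
    return "ok"

for sample in ([25, 25, 26, 25, 25],
               [28, 28, 29, 29, 29],
               [30, 31, 30, 31, 30],
               [33, 32, 32, 33, 32]):
    print(sample, "->", severity(sample))
```

Since the bands do not overlap, exactly one of the three triggers is active for any given average.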

          A note on the #5. This is a trick that I haven't seen documented very often. Instead of triggering based on a time period, this will trigger based on the number of checks. With this, I could change the polling interval from 30 seconds to 90 seconds without needing to change the trigger. The trigger will still look at the last 5 checks; where before they covered a 2 minute 30 second period, now they will cover a 7 minute 30 second period.
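
The arithmetic behind that note can be checked with a few lines of Python. This is only a back-of-the-envelope sketch with invented helper names; it also shows, as an approximation, how many checks a fixed 150 second time window would contain at each polling interval.

```python
def window_span_seconds(checks, interval_seconds):
    """Time covered by a count-based window such as #5."""
    return checks * interval_seconds

def checks_in_time_window(window_seconds, interval_seconds):
    """Approximate number of checks that fall into a fixed time window."""
    return window_seconds // interval_seconds

def format_span(seconds):
    """Render a number of seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"

for interval in (30, 90):
    span = window_span_seconds(5, interval)
    fixed = checks_in_time_window(150, interval)
    print(f"interval {interval}s: #5 covers {format_span(span)}; "
          f"a 150s window holds about {fixed} check(s)")
```

With a count-based window the number of samples stays at 5 and the covered time stretches; with a fixed time window the covered time stays the same and the number of samples shrinks.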
          Unofficial Zabbix Developer


          • alj
            Senior Member
            • Aug 2006
            • 188

            #6
            Hi James, thanks for the interesting suggestions; however, this will not resolve the problem if your alarm threshold is 30 and the temperature is bouncing between 30 and 31 for a long time.

            The average of the last 5 values will also flap. No matter how complicated we make the checks, it will still flap. Maybe a little bit less, but the problem won't go away.

            The temperature probe in our datacenter does not return decimal points; it's an integer. It only varies within 5-7 degrees in total, and the alarm condition is about 6 degrees away, so every single time it gets close to the alarm condition it changes by +-1 degree and stays within that narrow range for a long time, until it moves to another narrow range and fluctuates there (people open and close doors, which changes the temperature by just 1 degree, etc.).

            I ran this scenario through all these rules and the result is pretty much the same: if you come close to the alarm threshold you start to flap with a random +-1 degree temperature change, until you go further up and the alarm becomes solid.
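
A small Python replay illustrates the point for an integer probe. The readings below are invented (mostly 30 with occasional 31) and the function names are hypothetical; the averaging rule mirrors an "average of the last 5 checks above 30" condition rather than any exact Zabbix expression.

```python
from collections import deque
from statistics import mean

def avg_rule_states(readings, count=5, threshold=30):
    """Trigger state after each reading for 'average of last N > threshold'."""
    recent = deque(maxlen=count)
    states = []
    for value in readings:
        recent.append(value)
        states.append(mean(recent) > threshold)
    return states

def count_changes(states):
    """Number of ON/OFF transitions in a list of states."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)

# Invented integer readings bouncing between 30 and 31.
readings = [30] * 5 + [31] + [30] * 5 + [31, 30, 31] + [30] * 5 + [31]

raw_states = [value > 30 for value in readings]
avg_states = avg_rule_states(readings)

print("last value rule changes:", count_changes(raw_states))
print("average of 5 rule changes:", count_changes(avg_states))
```

On this series averaging reduces the number of state changes but does not eliminate them, because a single 31 among five integer readings is enough to push the average above 30.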

            In order to avoid flapping you need some info about the current trigger condition, to make the OFF threshold different from the ON threshold. The easiest way of doing so is by creating a variable which represents the current trigger condition (to implement trigger hysteresis; please read my previous message).

            Unfortunately I tried different rules (including averaging) and the only solution I found is to disable the trigger temporarily after it sends the alarm for the first time. Otherwise people get pissed off getting the same page over and over.


            • James Wells
              Senior Member
              • Jun 2005
              • 664

              #7
              Originally posted by alj
              Hi James, thanks for the interesting suggestions; however, this will not resolve the problem if your alarm threshold is 30 and the temperature is bouncing between 30 and 31 for a long time.
              Agreed; however, if you use the graduated checks, between 30 and 31 you could have it simply show on the trigger status page and not alert until the average exceeds 32. As I show above, where I talk about the graduated alerts, using them allows you to control when it pages and when you just see it on the status page.

              The easiest way of doing so is by creating a variable which represents the current trigger condition (to implement trigger hysteresis; please read my previous message).
              Agreed, and I do recall someone, about 12 months ago, working on better predictive modelling that would provide some of this functionality. Not sure, but I seem to recall this was something that Cameron was working on.
              Unofficial Zabbix Developer
