CPU Utilisation Trigger

  • vlam
    Senior Member
    Zabbix Certified Specialist
    • Jun 2009
    • 166

    #1

    Hi All

    I need some help setting up a trigger that fires when CPU utilisation stays above 75% for more than 30 minutes.

    Currently I have
    perf_counter[\Processor(_Total)\%Processor Time].last(,1800)}>75

    My problem is that this fires whenever the CPU goes above 75%, not only when it stays there for 30 minutes or more.

    Can anyone help me get this right?

    Thanks
    4 Zabbix Frontend Servers (Load balanced)
    2 Zabbix App Servers (HA)
    2 Zabbix Database Servers (HA)
    18 Zabbix Proxy Servers (HA)
    3897 Deployed Zabbix Agents
    6161 Values per second
    X-Layer Integration
    Jaspersoft report Servers (HA)
  • Sugarman
    Junior Member
    • Oct 2014
    • 22

    #2
    I'd say something like this:

    perf_counter[\Processor(_Total)\%Processor Time].min(30m)}>75
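To see why a min() over the window behaves differently from checking a single value, here is a minimal Python sketch. It assumes one CPU sample per minute and approximates the 30-minute window by the last 30 samples; the function names and sample data are hypothetical, and this models the idea rather than Zabbix's own evaluation code. Note that Zabbix reads a bare number in min() as seconds, so a 30-minute window is written 1800 or 30m.

```python
from collections import deque


def fires_single_sample(samples, threshold=75.0):
    """Flag every sample above the threshold, like a check on one value."""
    return [s > threshold for s in samples]


def fires_min_window(samples, window=30, threshold=75.0):
    """Flag a sample only when all of the last `window` samples are above
    the threshold, i.e. min(window) > threshold.

    A full window is required, so this rule cannot fire before `window`
    samples have been collected (a simplification of the real behaviour).
    """
    recent = deque(maxlen=window)
    flags = []
    for s in samples:
        recent.append(s)
        flags.append(len(recent) == window and min(recent) > threshold)
    return flags


if __name__ == "__main__":
    # Hypothetical data: a 5-minute spike, then a sustained 40-minute load.
    spike = [20] * 10 + [95] * 5 + [20] * 30
    sustained = [90] * 40
    print(any(fires_single_sample(spike)))          # True
    print(any(fires_min_window(spike)))             # False
    print(fires_min_window(sustained).index(True))  # 29, the 30th sample
```

The five-minute spike trips the single-sample check but never the window check, which needs 30 consecutive samples above 75.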

    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      Bear in mind that what you are asking for will reset the trigger if there is a single interval at 74%, and it will then not fire again for another 30 minutes, even if the very next and all subsequent samples go back to 100%. You might not want to reset the trigger until the CPU is steadily under some threshold.

      For example, you might say "fire if steadily over 75% for 30 minutes, and reset only when steadily under 75% for 30 minutes". That is a very different condition from "reset if not steadily over", and it requires a sustained drop in CPU.
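The difference between "reset if not steadily over" and "reset only when steadily under" can be sketched the same way: a min-only rule compared with a rule that fires on the window minimum and clears only once the window maximum has dropped below 75. This is again a simplified Python model assuming one sample per minute; the function names and data are hypothetical.

```python
from collections import deque


def min_only(samples, window=30, threshold=75.0):
    """PROBLEM while all of the last `window` samples exceed the threshold;
    a single lower sample clears it ("reset if not steadily over")."""
    recent = deque(maxlen=window)
    states = []
    for s in samples:
        recent.append(s)
        states.append(len(recent) == window and min(recent) > threshold)
    return states


def steady_over_steady_under(samples, window=30, threshold=75.0):
    """Fire when steadily over the threshold for `window` samples and clear
    only when steadily under it ("reset only when steadily under")."""
    recent = deque(maxlen=window)
    problem = False
    states = []
    for s in samples:
        recent.append(s)
        if not problem:
            problem = len(recent) == window and min(recent) > threshold
        else:
            # Stay in PROBLEM while any sample in the window is still >= threshold.
            problem = max(recent) >= threshold
        states.append(problem)
    return states


if __name__ == "__main__":
    # Hypothetical data: 40 minutes at 90%, one sample at 74%, then 100%.
    series = [90] * 40 + [74] + [100] * 20
    print(any(min_only(series)[40:]))                  # False: cleared by one dip
    print(all(steady_over_steady_under(series)[29:]))  # True: stays in PROBLEM
```

With one 74% sample in a run of 90 to 100%, the min-only rule clears and stays clear even though the CPU is back at 100%, while the second rule stays in PROBLEM until a full window of samples is below 75.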

      Here's a discussion of flapping that might help:

      No More Flapping

      • vlam
        Senior Member
        Zabbix Certified Specialist
        • Jun 2009
        • 166

        #4
        Linwood

        Thanks for the link

        But I think my understanding of the expression might be incorrect somewhere.

        It states it should look something like this:

        ({TRIGGER.VALUE}=0 & {Oracle DB1:system.cpu.load.last()} > 2)
        |
        ({TRIGGER.VALUE}=1 & {Oracle DB1:system.cpu.load.last()} > 1)

        Mine currently looks like this:

        ({TRIGGER.VALUE}=0 & {Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].last()}>75)

        ({TRIGGER.VALUE}=1 & {Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].last()}>70)

        I do get a trigger expression error starting from & {Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].last()}>75)
        Last edited by vlam; 12-05-2016, 17:02.

        • Linwood
          Senior Member
          • Dec 2013
          • 398

          #5
          Yes, I think, maybe.

          The example you have looks only at the very last value, but it does provide some hysteresis. The second expression, in English, is:

          "Alert if CPU goes over 75 for any one sample, and do not reset until it falls below 70 for one sample."

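That reading can be checked with a short state-machine sketch in which a variable stands in for {TRIGGER.VALUE} (0 = OK, 1 = PROBLEM) and only the latest sample is compared, as with last(). The function names and the erratic series below are hypothetical.

```python
def last_hysteresis(samples, fire_above=75.0, stay_above=70.0):
    """Model ({TRIGGER.VALUE}=0 and last()>75) or ({TRIGGER.VALUE}=1 and last()>70).

    `value` stands in for {TRIGGER.VALUE}: 0 = OK, 1 = PROBLEM.
    """
    value = 0
    states = []
    for s in samples:
        # Use the higher threshold while OK and the lower one while in PROBLEM.
        limit = fire_above if value == 0 else stay_above
        value = 1 if s > limit else 0
        states.append(value)
    return states


def count_alerts(states):
    """Count OK -> PROBLEM transitions; each one would be a new alert."""
    alerts = 0
    previous = 0
    for v in states:
        if previous == 0 and v == 1:
            alerts += 1
        previous = v
    return alerts


if __name__ == "__main__":
    # Hypothetical erratic load: around 90%, dropping to 60% every third sample.
    erratic = [90, 92, 60] * 10
    print(count_alerts(last_hysteresis(erratic)))  # 10
```

Every dip to 60 clears the trigger and the next high sample raises it again, so 30 samples produce 10 separate alerts.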
          The problem with those is that they are instantaneous. So if the processor is a bit erratic, hovering around 90 but occasionally falling to 60, it will fire, reset, fire, reset... and drive you nuts. What I have is something similar; here is an example (the {#KEY} yields total CPU):

          ({Template SNMP Windows Server:CPU[{#KEY}].avg(20m)}>95 and {TRIGGER.VALUE}=0) or
          ({Template SNMP Windows Server:CPU[{#KEY}].avg(20m)}>75 and {TRIGGER.VALUE}=1)

          This says "if the 20-minute average CPU is above 95%, alert, and do not reset until the 20-minute average falls below 75%". Short-term spikes in CPU are normal even on a well-tuned and capable system; I really don't want alerts because some CPU-intensive program kicked in for 3 seconds.
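A similar sketch for the avg()-based version, assuming one sample per minute so that avg(20m) is approximated by the mean of the last 20 samples (a simplification; the real function averages whatever values fall in the time window). The function name and data are hypothetical.

```python
from collections import deque


def avg_hysteresis(samples, window=20, fire_above=95.0, stay_above=75.0):
    """Model ({TRIGGER.VALUE}=0 and avg(20m)>95) or ({TRIGGER.VALUE}=1 and avg(20m)>75),
    approximating avg(20m) by the mean of the last `window` samples."""
    recent = deque(maxlen=window)
    value = 0  # stands in for {TRIGGER.VALUE}: 0 = OK, 1 = PROBLEM
    states = []
    for s in samples:
        recent.append(s)
        if len(recent) == window:  # wait for a full window before evaluating
            limit = fire_above if value == 0 else stay_above
            value = 1 if sum(recent) / window > limit else 0
        states.append(value)
    return states


if __name__ == "__main__":
    # Hypothetical data: a sustained 99% period followed by erratic load ...
    erratic_tail = [99] * 25 + [90, 92, 60] * 20
    # ... and the same 99% period followed by a sustained drop to 40%.
    quiet_tail = [99] * 25 + [40] * 25
    print(all(avg_hysteresis(erratic_tail)[19:]))   # True: no flapping
    print(avg_hysteresis(quiet_tail).index(0, 19))  # 33: first OK sample after firing
```

The erratic tail (averaging about 80%) keeps the trigger in PROBLEM without flapping, and a sustained drop to 40% clears it once the 20-sample average reaches 75 or below.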

          You may also find cases where min and max work out better, especially for services, e.g.:

          ( {TRIGGER.VALUE}=0 and {Template App HTTP Service:net.tcp.service[http].max(10m)}=0) or
          ( {TRIGGER.VALUE}=1 and {Template App HTTP Service:net.tcp.service[http].min(10m)}=0)

          That says to alert only if the service is down continuously for 10 minutes (if it is not, at least one value would be 1), and to clear the alert only if the service is continuously up for 10 minutes (i.e. continue alerting if there is even one 0 in the 10-minute period). The same approach could also be used to make sure there are no spikes during the recovery period for something like ping loss (not sure I'd use it for CPU).
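The min/max service rule can be modelled the same way with 0/1 check results, assuming one check per minute so that 10m corresponds to the last 10 samples. The function name and data are hypothetical.

```python
from collections import deque


def service_trigger(samples, window=10):
    """Model ({TRIGGER.VALUE}=0 and max(10m)=0) or ({TRIGGER.VALUE}=1 and min(10m)=0)
    for 0/1 service checks, one per minute (1 = up, 0 = down)."""
    recent = deque(maxlen=window)
    value = 0  # stands in for {TRIGGER.VALUE}: 0 = OK, 1 = PROBLEM
    states = []
    for s in samples:
        recent.append(s)
        if len(recent) == window:
            if value == 0:
                value = 1 if max(recent) == 0 else 0  # down for the whole window
            else:
                value = 1 if min(recent) == 0 else 0  # any down check keeps it open
        states.append(value)
    return states


if __name__ == "__main__":
    blip = [1] * 10 + [0] * 5 + [1] * 10    # 5-minute outage: no alert
    outage = [1] * 5 + [0] * 12 + [1] * 15  # 12-minute outage
    print(any(service_trigger(blip)))       # False
    states = service_trigger(outage)
    print(states.index(1), states.index(0, 14))  # 14 26
```

The short outage never fills a whole window with zeros, so it never alerts; the longer one alerts after ten down checks and clears only after ten consecutive up checks.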

          • vlam
            Senior Member
            Zabbix Certified Specialist
            • Jun 2009
            • 166

            #6
            I have changed it; it currently looks like this:

            ({Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].avg(20min)}>75 {TRIGGER.VALUE}=0)
            ({Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].avg(20min)}>70 {TRIGGER.VALUE}=1)

            But it still complains starting from {TRIGGER.VALUE}=0)
            and about the second part of the trigger thereafter.

            If I remove those parts then it works, but how do I get it all to work together?

            • vlam
              Senior Member
              Zabbix Certified Specialist
              • Jun 2009
              • 166

              #7
              I have rebuilt it with the expression constructor.

              So it currently looks as follows:

              (({Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].avg(1200)}>70) and {TRIGGER.VALUE}=0
              or
              {Template OS Windows:perf_counter[\Processor(_Total)\% Processor Time].avg(1200)}>75 and {TRIGGER.VALUE}=1)

              I will see if this works, as it did accept it as valid.

              • vlam
                Senior Member
                Zabbix Certified Specialist
                • Jun 2009
                • 166

                #8
                I had to make a small change, as 0 is OK and 1 is problem.
                For some reason it had swapped the 70 and 75.


                Thanks for the help, Linwood.
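For reference, here is a small hypothetical Python helper that assembles the two-threshold expression with the 75 and 70 thresholds in the positions the fix above ends up with, in the {host:key.function()} syntax used throughout this thread. It shows the two pieces missing from the attempt in post #6: an explicit "and" between each comparison and its {TRIGGER.VALUE} check, and an "or" joining the two clauses. The helper name and parameters are made up for illustration, and newer Zabbix releases use a different expression syntax.

```python
def hysteresis_expression(host, key, func, fire_above, stay_above):
    """Build a two-threshold trigger expression (hypothetical helper).

    {TRIGGER.VALUE}=0 means the trigger is currently OK, 1 means PROBLEM:
    fire when the value exceeds fire_above, and stay in PROBLEM while it
    still exceeds stay_above.
    """
    ref = "{" + host + ":" + key + "." + func + "}"
    fire = "({TRIGGER.VALUE}=0 and " + ref + ">" + str(fire_above) + ")"
    stay = "({TRIGGER.VALUE}=1 and " + ref + ">" + str(stay_above) + ")"
    return fire + " or " + stay


if __name__ == "__main__":
    print(hysteresis_expression(
        "Template OS Windows",
        r"perf_counter[\Processor(_Total)\% Processor Time]",
        "avg(1200)",
        75,
        70,
    ))
```

Keeping the higher threshold on the {TRIGGER.VALUE}=0 clause is what gives the hysteresis: the trigger needs the 20-minute average above 75 to fire, but only clears once it is no longer above 70.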