reducing alerts

  • Hawky
    Junior Member
    • Apr 2009
    • 15

    #1

    reducing alerts

    Hi,

    is it possible to create a trigger like this:

    {TEST_TEMPLATE:textparm.str(OK)}<1

    which only becomes true if the last three checks are different from "OK"?

    I've tried something like this:
    {TEST_TEMPLATE:textparm.str(OK).count(#3,0,"eq")}
    {TEST_TEMPLATE:textparm.str(OK).last(#3)}<1

    I think it's totally wrong, but I have no idea how to get it working. Any ideas?

    Update:
    I see that in zabbix 1.8.2 it is possible to use str(...) and count(...) together, but
    in 1.8.3 I get an error, if I use both together in a trigger.

    Is there any possibility of using any other function in a trigger together with str()?

    The documentation is very inconsistent on this point, and so is the Packt Publishing
    book Zabbix 1.8 Network Monitoring, which I bought.

    Thanks,
    Hawky
    Last edited by Hawky; 26-09-2010, 12:29. Reason: Update - str() and count() in 1.8.3
  • richlv
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Oct 2005
    • 3112

    #2
    hmm. where have you seen the notation that allows you to chain trigger functions ?

    nevertheless, did you try something like this ?

    Code:
    count(#3,OK,ne)
    i'm also quite puzzled as to what "Hallo" is doing in your attempts
    Zabbix 3.0 Network Monitoring book


    • Hawky
      Junior Member
      • Apr 2009
      • 15

      #3
      @richlv

      thanks for your tip, it works great if the value is only "OK". But I monitor HP servers and I get back this value:

      OK - System: 'proliant dl320 g6', S/N: 'CZ19470423', ROM: 'W07 10/02/2009', hardware working fine, da: 1 logical drives, 2 physical drives

      I thought I could use str() or regexp() to filter on "OK - System", and that works perfectly. Your example with count(#3,OK - System,ne) only works if the value contains nothing but "OK - System", so I think it's necessary to use str() or regexp()?!

      Any idea?

      "Hallo" was only a test value; I changed it to OK in my first post.


      • richlv
        Senior Member
        Zabbix Certified Trainer
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Oct 2005
        • 3112

        #4
        try operator "like" for count() function (see zabbix manual for more detail)
        Zabbix 3.0 Network Monitoring book
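To make the suggestion concrete, here is a rough Python sketch of the logic that a count() over the last three values with a substring ("like") match would express. It is not Zabbix code: the function names and the failure string are made up for illustration, and the sketch assumes "like" means "contains the pattern". In trigger terms the idea would be something along the lines of count(#3,OK - System,like) compared against 0, but check the manual for the exact syntax your version accepts.

```python
def count_like(history, pattern, n):
    """Count how many of the last n values contain pattern (substring match)."""
    recent = list(history)[-n:]
    return sum(1 for value in recent if pattern in value)


def should_fire(history, pattern="OK - System", n=3):
    """Fire only when n values exist and none of them contains the pattern."""
    recent = list(history)[-n:]
    return len(recent) == n and count_like(history, pattern, n) == 0


ok = "OK - System: 'proliant dl320 g6', hardware working fine"
bad = "CRITICAL - physical drive 1:1 is failed"  # hypothetical failure text

history = [ok, bad, bad]
print(should_fire(history))  # False: one of the last three checks was OK

history.append(bad)
print(should_fire(history))  # True: none of the last three contains "OK - System"
```

A single bad check, or even two in a row, stays quiet here; only three consecutive results without the "OK - System" prefix would raise the alert.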


        • bashman
          Senior Member
          • Dec 2009
          • 432

          #5
          You can also reduce alerts using dependencies and you can enable escalations to delay notifications:

          More trigger dependencies info:
          http://www.zabbix.com/documentation/...r_dependencies

          More delay notifications info:
          http://www.zabbix.com/documentation/..._notifications
          978 Hosts / 16.901 Items / 8.703 Triggers / 44 usr / 90,59 nvps / v1.8.15


          • danrog
            Senior Member
            • Sep 2009
            • 164

            #6
            This is more of a general statement on how to reduce alerts (it expands on what bashman said about escalations). What we do (and it works very well for us) is set up different actions (29 in total) for the different types of triggers. Then in each action we have multiple steps with different step values.

            We use this as a general rule:
            Logs: email/page immediately
            Traps: email/page immediately
            Network/SAN: email/page immediately
            Disk: email/page hourly UNLESS it's below 5% free and change rate is high
            Servers: delay 10mins,20mins,30mins,then every 5mins (until ack'd or fixed)
            Website: delay 4mins,8mins,16mins,24mins,then every 2mins (until ack'd or fixed)

            For example:
            Code:
            --------------------------
            Trigger = "Service down ({ITEM.LASTVALUE})"
            --------------------------
            Action: UNIX Admins
            Trigger Description != "Log:"
            Trigger Description != "Disk:"
            Group = Linux Servers
            Escalation: 600 seconds
            Step 2-2 Email/Page UNIX Admins (10min delay to cut down on false positives)
            Step 4-4 Email/Page UNIX Admins (30min delay to cut down on number of emails and usually its been acknowledged by this point)
            Step 6-0 Email/Page UNIX Admins (40min delay and just keep alerting until its either fixed or acknowledged)
            Step 8-0 Email/Page Manager (60min delay until the boss finds out :-)  )
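
As a sanity check on the step numbers, here is a small Python sketch of how step start times fall out of the escalation period. It assumes that step 1 runs immediately and every later step starts one escalation period after the previous one (my understanding of the default behaviour, with no custom step durations); the function names and the `horizon_steps` cut-off are made up for illustration. Under that assumption steps 6 and 8 would start at 50 and 70 minutes rather than the 40 and 60 minutes noted above, so it is worth verifying against your own setup.

```python
def step_start_minutes(step, period_seconds=600):
    """Minutes after the problem event at which a step starts (step 1 = 0)."""
    return (step - 1) * period_seconds // 60


def notification_minutes(first_step, last_step, period_seconds=600, horizon_steps=10):
    """Start times for steps first_step..last_step; last_step 0 means open-ended.

    Open-ended ranges are cut off at horizon_steps so the list stays finite.
    """
    end = horizon_steps if last_step == 0 else last_step
    return [step_start_minutes(s, period_seconds) for s in range(first_step, end + 1)]


print(notification_minutes(2, 2))  # [10]
print(notification_minutes(4, 4))  # [30]
print(notification_minutes(6, 0))  # [50, 60, 70, 80, 90]
print(notification_minutes(8, 0))  # [70, 80, 90]
```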
            For logfiles and traps without much (if any) state:
            Code:
            -----------------------
            Trigger = "Log: Windows Cluster Event"
            -----------------------
            Action: Windows Log
            Trigger Description = "Log:"
            Group = Windows Servers
            NO ESCALATION 
            Send email/page to Windows Admins immediately
            Whether a notification is an email or a page is set by the media each user has defined in their profile. Pages are set up for Disaster/High severities, while email is set up for all trigger severities (except Not Classified).

            There are MANY ways you can expand on this, and I've only covered our way of managing alerts. For instance, we have an action set up, with no escalations, that restarts a Windows service and emails the Windows admins that a restart was attempted on service XYZ. If the restart succeeds, the other "catchall" Windows action never gets hit, which saves unneeded pages.

            (Here is the remote command and it uses BASE windows commands so nothing more is needed):
            Code:
            {HOSTNAME}:for /F "skip=1" %i in ('wmic service where "State='Stopped' and StartMode='Auto'" get Name') do net start "%i" >>c:\restart.txt
            
            This command restarts ALL services that are set to automatic but are stopped. If you don't use the Maintenance feature of Zabbix, you will run into services starting up that you actually want stopped.

            Of course this does come with some drawbacks, specifically the X-minute delay before we are notified of an issue (in certain cases), but in our environment (~2000 devices globally) it was better for us to cut down on the number of alerts our admins got. You can get around that with separate actions for those triggers you know won't alert frequently. We do this for our network and SAN infrastructure, as those generally don't create false positives for us.

            Then there is the management overhead too, but if you plan all your trigger "types" ahead of time and, most importantly, stick to them, you generally won't need to modify any actions. This was our 3rd monitoring system in ~6 years (out of the last 7), so we already knew what we wanted and what our alerts should look like. We've been using Zabbix for almost a year in production and we are VERY happy with it (so much, in fact, that our developers are starting to integrate our apps directly into Zabbix).


            • bashman
              Senior Member
              • Dec 2009
              • 432

              #7
              The polling interval and the trigger expressions also affect the number of alerts you get.

              Try to avoid the trigger function "last" in trigger expressions when your polling interval is short, and use "max", "avg", "min", "sum" or "count" instead, because they evaluate several recent values rather than only the last one.

              For example: {hostname:icmpping.max(#3)}=0 means that the last 3 values all have to be "0" (down) for the trigger to fire.

              More info about trigger functions:
              http://www.zabbix.com/documentation/...gger_functions
              978 Hosts / 16.901 Items / 8.703 Triggers / 44 usr / 90,59 nvps / v1.8.15
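
To illustrate the difference, the Python sketch below counts how often a trigger would go from OK to PROBLEM on a flapping ping series, once with a "last value is 0" rule and once with a "max of the last 3 values is 0" rule. It is only a simulation; the helper names and the sample series are invented for the example.

```python
def count_problem_events(values, fires):
    """Count OK -> PROBLEM transitions when `fires` is evaluated after each new value."""
    events, in_problem = 0, False
    for i in range(1, len(values) + 1):
        now = fires(values[:i])
        if now and not in_problem:
            events += 1
        in_problem = now
    return events


def last_is_down(seen):
    # Mirrors a last()=0 style rule: only the newest value matters.
    return seen[-1] == 0


def max3_is_down(seen):
    # Mirrors a max(#3)=0 style rule: all of the last three values must be 0.
    return len(seen) >= 3 and max(seen[-3:]) == 0


pings = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]  # 1 = up, 0 = down (made-up data)

print(count_problem_events(pings, last_is_down))  # 3
print(count_problem_events(pings, max3_is_down))  # 1
```

The two isolated drops never trigger the max-based rule; only the sustained outage does.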


              • Hawky
                Junior Member
                • Apr 2009
                • 15

                #8
                Hi,

                thanks for the replies, I tried a workaround:

                My script:

                Code:
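                #!/bin/bash
                # Run check_hpasm against the host given as $1 and reduce its output to 0 (OK) or 1 (problem).
                # '<snmpsecret>' is a placeholder for the SNMP community; it is quoted so bash does not treat < > as redirections.
                HPASMRES=$(/usr/sbin/check_hpasm --hostname="$1" --community='<snmpsecret>')
                echo "$HPASMRES" > "$(pwd)/results/$1-hpasmd" # keep the raw result for testing
                # Test grep's exit status directly; wrapping the pipeline in $(...) is not needed.
                if echo "$HPASMRES" | grep -Eq '^OK - System'; then
                        echo "0"
                else
                        echo "1"
                fi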
                This script uses the check_hpasm script from http://labs.consol.de/lang/de/nagios/check_hpasm/ to get all relevant information from my HP servers. If everything is fine, I get "0"; if there is a problem, I get back "1".

                Now I use this Trigger:

                Code:
                (({TRIGGER.VALUE}=0)&{<server>:chckhpasm[{HOST.CONN}].max(#3)}=1)|(({TRIGGER.VALUE}=1)&({<server>:chckhpasm[{HOST.CONN}].min(#2)}=0))
                I think:

                State = OK and the last 3 values are "1" ==> PROBLEM!
                State = PROBLEM and the last 2 values are "0" ==> OK!

                But it doesn't work. If the last value is not "0", I immediately get a "PROBLEM" mail ... it's annoying.

                Since I upgraded to v1.8.3, most of my triggers no longer work the way they did before the upgrade.


                • richlv
                  Senior Member
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist, Zabbix Certified Professional
                  • Oct 2005
                  • 3112

                  #9
                  Originally posted by Hawky

                  I think:

                  State = OK and the last 3 values are "1" ==> PROBLEM!
                  0 is ok. 1 is problem. your comparison is "max(#3)}=1", which means "if at least one of the last 3 values is 1 (problem), fire an alarm". if you want all 3 values to be problem before this trigger fires, you need min(#3).

                  same problem with the trigger resolution part - it will resolve if at least one of the last 2 values is zero, so you would use max instead of min there.
                  Zabbix 3.0 Network Monitoring book
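
The corrected logic can be sketched as a small state machine in Python: go to PROBLEM only when the minimum of the last three values is 1, and back to OK only when the maximum of the last two values is 0. This is just a simulation of the intended behaviour, not how Zabbix evaluates triggers internally; the function names and sample values are made up.

```python
def next_state(state, history):
    """state: 0 = OK, 1 = PROBLEM. history: list of 0/1 check results, newest last."""
    if state == 0:
        # min(#3) = 1: every one of the last three checks reported a problem.
        last3 = history[-3:]
        return 1 if len(last3) == 3 and min(last3) == 1 else 0
    # max(#2) = 0: both of the last two checks were fine.
    last2 = history[-2:]
    return 0 if len(last2) == 2 and max(last2) == 0 else 1


def run(values):
    """Feed check results one at a time and record the trigger state after each."""
    state, history, states = 0, [], []
    for value in values:
        history.append(value)
        state = next_state(state, history)
        states.append(state)
    return states


print(run([0, 1, 0, 1, 1, 1, 0, 1, 0, 0]))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```

With max(#3)=1 in place of the min, the very first "1" in this series would already have flipped the state to PROBLEM, which matches the behaviour described in post #8.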
