Suspension of alerting in Zabbix

  • davyyd
    Junior Member
    • Dec 2010
    • 3

    #1

    Suspension of alerting in Zabbix

    My name is David Summers. I am the technical lead for a project here at my University to evaluate Zabbix as our monitoring system.

    Currently you can “Acknowledge” a trigger that is active. The trigger will remain inactive only until the monitored item goes in and then out of threshold again. However, we would like to “Suspend” one or multiple triggers for a defined time period (e.g., “24 hours” or “15 minutes”) during which the trigger remains inactive regardless of the changing values for the monitored item.

    We envision this being similar to acknowledging a trigger: at the time of suspension, a dialog box would let you enter a set number of minutes or hours. When that time has passed, the suspended trigger would automatically be enabled again without any user intervention.

    This is different from a maintenance window for the entire host, because we need to be able to isolate and suspend individual triggers, not all the triggers for a given host. For example, we may want to suspend a "20% disk space remaining" trigger for 24 hours and have it resume after that period. Disabling triggers (instead of suspending them) has proven unreliable in real life because people forget to re-enable them, and it is difficult to tell temporarily suspended triggers apart from those you want to remain disabled.

    Suspension functionality has been critical in our environment and we consider it “required” in a monitoring solution. Does Zabbix currently provide this functionality? If not, are there plans to add this functionality in the future?

    Thank you!

    David Summers
  • fmrapid
    Member
    • Aug 2010
    • 43

    #2
    From what I understand, host and hostgroup maintenance can be configured. Someone more proficient in the latest versions can correct me if I am wrong.

    I also agree that individual data points should be available to be put in suspended/maintenance mode.

    Cheers,

    fmrapid

    • zabbix_zen
      Senior Member
      • Jul 2009
      • 426

      #3
      I agree this would be an improvement.
      HostGroup and Host maintenance windows don't offer enough granularity when interventions are scheduled.
      It is quite common for an intervention on some hosts to affect only a couple of services;
      when that happens, we shouldn't be blind to whatever else happens to other metrics on those hosts.

      At least a maintenance check at the Host's Application level should be provided. This way we could attach the intended items to an 'in Maintenance' Application when needed.

      This could be automated using the API, but it wouldn't be as foolproof as specifying a timer as suggested.
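
To give a rough idea of what the API route could look like, here is a minimal sketch, not a tested integration. It assumes a Zabbix version whose JSON-RPC API accepts `trigger.update` with a `status` field (0 = enabled, 1 = disabled) and an `auth` token in the request body. `ZABBIX_URL`, `AUTH_TOKEN` and all function names are placeholders invented for this example.

```shell
#!/bin/bash
# Sketch only: disable one trigger via the Zabbix JSON-RPC API and
# re-enable it after a delay. URL and token are assumed placeholders.

ZABBIX_URL="${ZABBIX_URL:-http://zabbix.example.com/api_jsonrpc.php}"
AUTH_TOKEN="${AUTH_TOKEN:-changeme}"

# build_trigger_update TRIGGERID STATUS
# Prints the JSON-RPC request body; STATUS is 0 (enabled) or 1 (disabled).
build_trigger_update ()
{
   printf '{"jsonrpc":"2.0","method":"trigger.update","params":{"triggerid":"%s","status":%s},"auth":"%s","id":1}' \
      "$1" "$2" "$AUTH_TOKEN"
}

# to_seconds DURATION
# Converts "15m" or "24h" to seconds; a bare number is taken as seconds.
to_seconds ()
{
   case "$1" in
      *m) echo $(( ${1%m} * 60 )) ;;
      *h) echo $(( ${1%h} * 3600 )) ;;
      *)  echo "$1" ;;
   esac
}

# send_request PAYLOAD
# Posts the request body to the API endpoint (requires curl).
send_request ()
{
   curl -s -H 'Content-Type: application/json-rpc' -d "$1" "$ZABBIX_URL"
}

# suspend_trigger TRIGGERID DURATION
# Disables the trigger now and re-enables it from a background job later.
suspend_trigger ()
{
   send_request "$(build_trigger_update "$1" 1)"
   ( sleep "$(to_seconds "$2")"; send_request "$(build_trigger_update "$1" 0)" ) &
}
```

Something like `suspend_trigger 13938 24h` would then cover the disk-space example. The background `sleep` does not survive a reboot or a killed shell, which is exactly why this is less foolproof than a built-in timer; scheduling the re-enable step with `at` or cron would be sturdier.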

      • harpo
        Junior Member
        • Mar 2010
        • 2

        #4
        Absolutely critical feature

        We also need this functionality -- we need to be able to acknowledge a single trigger (not the entire host) on Saturday afternoon but have it start re-alerting on Monday morning when staff are around and can actually fix it.

        If we "acknowledge" it, it goes away forever. And if we schedule the host for maintenance, if something else breaks on that machine before Monday morning we won't find out.

        • untergeek
          Senior Member
          Zabbix Certified Specialist
          • Jun 2009
          • 512

          #5
          As far as "disabling" triggers goes, I totally agree with you. As a general rule, we disallow disabling because people forget.

          However, we created a daily report email that queries the database directly and sends a list of disabled triggers and hosts so you can see what's been disabled. This helps to prevent things from getting out of hand.

          • untergeek
            Senior Member
            Zabbix Certified Specialist
            • Jun 2009
            • 512

            #6
            Why not include my script here? It's a working copy for our Oracle database. You could probably adapt it to work with any other database.

            Code:
            #!/bin/bash
            
            SID=ORACLESID
            USERNAME=ORACLE_USER
            PASSWORD=ORACLE_PASSWD
            
            . /opt/oracle/product/11.2/oracle.env
            
            export ORACLE_SID="${SID}"
            
            query_get ()
            #  Pass query body
            #  run query and parse
            #  spit out results
            {   
               # Spool the query output to a private temporary file
               local SPOOL
               SPOOL=$(mktemp /var/tmp/query_get.XXXXXX)
               # The heredoc terminator (EOF) must sit alone at the start of its line
               local getResult=$(sqlplus -S "${USERNAME}/${PASSWORD}@${ORACLE_SID}" << EOF
               set pagesize 0
               set heading off
               set feedback off
               set tab off
               set define off
               spool $SPOOL
               select ${1};
               spool off
               quit
            EOF
               )
               local LINE
               if [ -n "$getResult" ]; then
                  # Keep only "host,description" rows and print them as two columns
                  grep ',' "$SPOOL" |
                  while read -r LINE; do
                   printf "%31s    %s \n" "$(echo "$LINE" | awk -F, '{print $1}')" "$(echo "$LINE" | awk -F, '{print $2}' | sed -e 's/disabled//')"
                  done
               else
                  printf "%31s \n" "NONE"
               fi
               rm -f "$SPOOL"
               return
            }
            
            passit ()
            {
            echo $1
            }
            
            hosts_disabled="host||',disabled' from hosts where status=1"
            items_unsupported="h.host||','||i.description from items i,hosts h where i.status=3 and i.hostid=h.hostid and h.status=0"
            items_disabled="h.host||','||i.description from items i,hosts h where i.status=1 and i.hostid=h.hostid and h.status=0"
            triggers_disabled="DISTINCT h.host||','||t.description from hosts h, functions f, items i, triggers t where i.itemid=f.itemid AND h.hostid=i.hostid AND f.triggerid=t.triggerid AND t.status=1 and h.status=0"
            triggers_acked="DISTINCT h.host||','||t.description from hosts h, functions f, items i, triggers t, events v, escalations e where i.itemid=f.itemid AND h.hostid=i.hostid AND f.triggerid=t.triggerid AND v.eventid=e.eventid and v.acknowledged=1 AND e.triggerid=t.triggerid and h.status=0"
            triggers_unacked="DISTINCT h.host||','||t.description from hosts h, functions f, items i, triggers t, events v, escalations e where i.itemid=f.itemid AND h.hostid=i.hostid AND f.triggerid=t.triggerid AND v.eventid=e.eventid and v.acknowledged=0 AND e.triggerid=t.triggerid and h.status=0"
            triggers_unknown="DISTINCT h.host||','||t.description from hosts h, functions f, items i, triggers t where i.itemid=f.itemid AND h.hostid=i.hostid AND f.triggerid=t.triggerid AND t.templateid>0 and t.status=0 and t.value=2 and h.status=0"
            
            printf "%31s \n" "Disabled Hosts:"
            printf "%31s \n" "---------------"
            query_get "$hosts_disabled"
            echo
            printf "%31s \n" "Disabled Items:"
            printf "%31s \n" "---------------"
            query_get "$items_disabled"
            echo
            printf "%31s \n" "Disabled Triggers:"
            printf "%31s \n" "------------------"
            query_get "$triggers_disabled"
            echo
            printf "%31s \n" "Unacknowledged Triggers:"
            printf "%31s \n" "------------------------"
            query_get "$triggers_unacked"
            echo
            printf "%31s \n" "Acknowledged Triggers:"
            printf "%31s \n" "----------------------"
            query_get "$triggers_acked"
            echo
            printf "%31s \n" "Triggers with status \"unknown\":"
            printf "%31s \n" "-------------------------------"
            query_get "$triggers_unknown"
            echo
            printf "%31s \n" "Unsupported Items:"
            printf "%31s \n" "------------------"
            query_get "$items_unsupported"
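
For the daily email itself, a small wrapper along these lines might do; it is a sketch under assumptions, not the wrapper actually used with the script above. The report path, the recipient address, the `--send` switch and the presence of a `mail` command that accepts `-s SUBJECT RECIPIENT` are all assumptions for this example.

```shell
#!/bin/bash
# Hypothetical wrapper: run the report script and mail its output.
# Path, recipient and mail command below are placeholders.

REPORT_CMD="${REPORT_CMD:-/usr/local/bin/zabbix_status_report.sh}"
RECIPIENT="${RECIPIENT:-zabbix-admins@example.com}"
MAILER="${MAILER:-mail}"

# report_subject
# Prints a subject line stamped with today's date.
report_subject ()
{
   printf 'Zabbix status report for %s' "$(date +%Y-%m-%d)"
}

# send_report
# Pipes the report output into the mail command.
send_report ()
{
   $REPORT_CMD | "$MAILER" -s "$(report_subject)" "$RECIPIENT"
}

# Only send when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "--send" ]; then
   send_report
fi
```

Saved as, say, `/usr/local/bin/send_zabbix_report.sh` (again a made-up path), a crontab entry such as `0 7 * * * /usr/local/bin/send_zabbix_report.sh --send` would deliver the list every morning at 07:00.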

            • davyyd
              Junior Member
              • Dec 2010
              • 3

              #7
              untergeek

              Thank you for sharing that script.

              If I understand it correctly, it generates a report of hosts, items and triggers and their current status (disabled, etc.).

              I will definitely keep this filed away for our deployment.

              Thanks again!

              David

              • untergeek
                Senior Member
                Zabbix Certified Specialist
                • Jun 2009
                • 512

                #8
                You're welcome, and good luck!
