Escalations and Notifications between nodes

  • chrisf
    Junior Member
    • Apr 2009
    • 25

    #1

    Escalations and Notifications between nodes

    I have some questions regarding how I need to handle notifications and escalations between nodes.
    I have a master node and a slave node, both running 1.6.3.

    The master node has actions to handle triggers for both systems and they have been working just fine.

    Recently the master rebooted and Zabbix did not restart. Unfortunately, because the master handles notifications, I had no idea this had happened until I went to check Zabbix.

    So here is what I am attempting to do.
    I created a slave host on the master and a master host on the slave.
    I had to recreate all the users and media types on the slave, as they are not kept in sync. Mind you, I could not use the unified master web UI for this; I had to use the slave web UI, because the master complained about duplicate users and media types. On the slave I created an action with a single condition (the host is the master host) and a single operation (email unix ops).

    Now here's where a bug, IMHO, comes into play. I had set up the action operations from step 1 to step 0 with a period of 600 seconds, with recovery messages on.

    I shut off the zabbix server on the master and waited.
    Success! An email was sent out. I started zabbix on master.
    Received an OK email. Ten minutes later, I received another.
    So on the master UI I acknowledged the OK event. No luck, I still received OK emails. I checked the events table on the slave: the event is not listed as acknowledged, nor is my acknowledge message found in the acknowledgement table on the slave. It is found on the master, though.
    I am confused, as I would assume these tables would be synced up.
    I feel that the master and slave databases are out of sync, and I was still getting emails from the event after doing the following on the slave:
    I changed the action's steps to run from 1 to 1.
    I unchecked recovery messages.
    Still no luck.

    Finally as a last ditch effort I ran "delete from escalations" on the slave.
    See here:
    mysql> select * from escalations;
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
    | escalationid    | actionid        | triggerid       | eventid         | r_eventid | nextcheck | esc_step | status |
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
    | 200000000000002 | 200200000000004 | 200200000016032 | 200200000071274 |         0 |         0 |       27 |      2 |
    | 200000000000004 | 200200000000004 | 200200000016026 | 200200000071275 |         0 |         0 |       13 |      2 |
    | 200000000000006 | 200200000000004 | 200200000016026 | 200200000071276 |         0 |         0 |       12 |      2 |
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
    3 rows in set (0.00 sec)

    mysql> delete from escalations;
    Query OK, 3 rows affected (0.03 sec)

    Now when I look at the master web UI I see the event listed as "in progress", but on the slave UI I see "OK".

    So I go to the master escalations table. Here we have 6 entries and none of the IDs match up:

    mysql> select * from escalations;
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
    | escalationid    | actionid        | triggerid       | eventid         | r_eventid | nextcheck | esc_step | status |
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
    | 100000000000012 | 100100000000004 | 200100000000740 | 200200000071235 |         0 |         0 |        2 |      2 |
    | 100000000000017 | 100100000000004 | 100100000013094 | 100100000000091 |         0 |         0 |        2 |      2 |
    | 100000000000018 | 100100000000004 | 100100000013100 | 100100000000095 |         0 |         0 |        2 |      2 |
    | 100000000000019 | 100100000000004 | 100100000013095 | 100100000000105 |         0 |         0 |        2 |      2 |
    | 100000000000020 | 100100000000004 | 100100000013091 | 100100000000115 |         0 |         0 |        2 |      2 |
    | 100000000000021 | 100100000000004 | 100100000013093 | 100100000000119 |         0 |         0 |        2 |      2 |
    +-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+

    So here are the issues:

    * How do I clean up events on the master node so it doesn't say "In Progress"?
    * How can I send a notification every 10 minutes until recovery, but only send ONE recovery message?
    * If these "nodes" are supposed to be separate, i.e. not share users and media types, why can't I create the same user or media type on different nodes from the master UI without it complaining? Instead I have to log into the slave node directly to accomplish this.
    * It seems event acknowledgement might not be syncing properly between nodes. In my experience, when you acknowledge an event the notifications should stop, or at least the slave DB should have the acknowledgement column set to "1" for the row I acknowledged through the master UI. Does this sound correct?
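To make the second question concrete, here is a minimal sketch of the behaviour being asked for: a problem message every period until recovery, then exactly one OK message. It is a plain model of the desired schedule, not Zabbix's escalation code; the function name `notification_times` and its parameters are hypothetical.

```python
def notification_times(problem_at, recovered_at, period=600):
    """Model the desired policy: repeat PROBLEM every `period` seconds
    until recovery, then send a single OK message.

    Times are plain seconds; this is a sketch, not Zabbix's implementation.
    """
    messages = []
    t = problem_at
    while t < recovered_at:
        messages.append((t, "PROBLEM"))
        t += period
    # Exactly one recovery message, sent at the moment of recovery.
    messages.append((recovered_at, "OK"))
    return messages


if __name__ == "__main__":
    # Problem starts at t=0 and clears 25 minutes later.
    for when, kind in notification_times(0, 1500):
        print(when, kind)
```

With a problem at 0 seconds and recovery at 1500 seconds, the model yields PROBLEM at 0, 600 and 1200, followed by a single OK at 1500. The repeated OK emails described above are what departs from this model.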

    Thanks

    Chris
  • RobertS
    Member
    • Aug 2006
    • 57

    #2
    "In progress"

    I also experience "In progress" on the master node while the message was sent on the slave node and is reported there correctly. Any solution to this?

    • chrisf
      Junior Member
      • Apr 2009
      • 25

      #3
      I gave up. No one offered any insight and the logs provided no information. I'm hoping this feature becomes more stable in later versions of Zabbix, but at the moment I would NOT advise using it in a production environment.
      I've separated the nodes; each handles its own hosts and they monitor each other for failure.

      -Chris

      • bisbell
        Junior Member
        • May 2008
        • 14

        #4
        I'm having the same problem: getting constant e-mails that the problem has been resolved. I've tried all the solutions offered in the forum posts I've found, and section 10 in the Zabbix 1.6 guide is a bit lacking.
        My "solution" is to disable the action, then stop and restart the zabbix server processes, then enable the action. This seems to stop the e-mails. At least, that's the only way I've found to stop the e-mails.

        There are a bunch of forum threads on this same issue. If there is a Zabbix moderator reading through this then please let it be known to your engineering staff that this is a pretty major issue that needs to be addressed.
        From a user perspective it should be as simple as checking a box that says "Repeat Notification" and then setting how often to repeat.
        Having to configure an escalation to get repeat e-mails is not really all that intuitive, and more importantly, it seems that it doesn't work.
        Last edited by bisbell; 19-11-2009, 21:36.

        • chrisf
          Junior Member
          • Apr 2009
          • 25

          #5
          I'm in the process of adding the feature I require.
          Basically, Zabbix will send out notifications until a recovery message is received.
          Once it is received, notifications stop being sent.

          I will provide the patches when I get it working.

          I'm hoping the developers merge this in with the code base as this functionality is essential to a network monitoring application.

          -Chris

          • sean
            Junior Member
            • Mar 2008
            • 28

            #6
            Did you follow this through?

            I too would like to be able to configure "reminder" alerts, so that if a trigger is neither fixed nor acknowledged, a reminder email alert is sent every XX hours.

            I have:
            Step 9 to 0, period=3600
            Operation: send message, choose the user or group
            Subject: REMINDER: {TRIGGER.NAME}: {STATUS}
            Message: {TRIGGER.NAME}: {STATUS}
            Conditions: Event acknowledged = "Not Ack"

            Escalation is also enabled, period 3600.
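A rough model of what this configuration is meant to do, as a sketch only: it assumes step 1 fires at the time of the event and each later step starts one period after the previous one, so step 9 with a period of 3600 would begin 8 hours in. The function names `step_start` and `reminders` are hypothetical, and the "Not Ack" condition is modelled simply as "stop once acknowledged".

```python
def step_start(step, period):
    """Seconds after the event at which a step begins.

    Assumes step 1 starts at 0 and every step lasts one period.
    """
    return (step - 1) * period


def reminders(step_from, period, acknowledged_at, recovered_at):
    """Times at which a 'step_from to 0' (open-ended) operation with a
    'Not Ack' condition would fire in this simplified model.

    acknowledged_at may be None if the event is never acknowledged.
    """
    times = []
    step = step_from
    while True:
        t = step_start(step, period)
        if t >= recovered_at:
            break  # problem is over, no more reminders
        if acknowledged_at is not None and t >= acknowledged_at:
            break  # condition 'Not Ack' no longer holds
        times.append(t)
        step += 1
    return times


if __name__ == "__main__":
    # Never acknowledged, recovered after 12 hours.
    print(reminders(9, 3600, None, 12 * 3600))
    # Acknowledged after 9.5 hours.
    print(reminders(9, 3600, 9.5 * 3600, 12 * 3600))
```

In this model the first reminder arrives 8 hours after the problem starts, which may be later than intended if the goal is an hourly reminder from the beginning.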
