I have some questions regarding how I need to handle notifications and escalations between nodes.
I have a master and a slave node, both running 1.6.3.
The master node has actions to handle triggers for both systems and they have been working just fine.
Recently the master rebooted and Zabbix did not restart. Because the master handles all notifications, I had no idea this had happened until I went to check Zabbix.
So here is what I am attempting to do.
I created a slave host on the master and a master host on the slave.
I had to recreate all the users and media types on the slave, as they are not kept in sync. Mind you, I could not do this from the unified master web UI, because the master complained about duplicate users and media types, so I used the slave web UI instead. On the slave I created an action with a single condition (the host is the master host) and a single operation (email unix ops).
Now here's where a bug, IMHO, comes into play. I had set up the action operation to run from step 1 to step 0 with a period of 600 seconds and recovery messages on.
I shut off the zabbix server on the master and waited.
Success! An email was sent out. I started Zabbix on the master.
I received an OK email. Ten minutes later, I received another.
So on the master UI I acknowledged the OK event. No luck, I still receive OK emails. I checked the events table on the slave. The event is not listed as acknowledged, nor is my acknowledge message found in the acknowledgement table on the slave. It is found on the master, though.
I am confused as I would assume these tables would be sync'd up.
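For reference, this is roughly the check I mean, run in the slave database. It is only a sketch: the eventid below is a placeholder taken from the escalations output further down, not necessarily the OK event in question, and I am assuming the stock `events.acknowledged` column and `acknowledges` table.

```sql
-- Placeholder eventid: substitute the event that was acknowledged in the master UI.
-- acknowledged = 1 would mean the slave knows about the acknowledgement.
select eventid, clock, value, acknowledged
  from events
 where eventid = 200200000071274;

-- The acknowledge message itself, if it was ever written on this node.
select acknowledgeid, userid, eventid, clock, message
  from acknowledges
 where eventid = 200200000071274;
```

Running the same two queries on the master and comparing the results would show whether the acknowledgement exists only on one side.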
I feel that the master and slave databases are out of sync, and I am still getting emails for the event even after doing the following:
On the slave I changed the action so the operation runs from step 1 to step 1,
and unchecked recovery messages.
Still no luck.
Finally, as a last-ditch effort, I ran "delete from escalations" on the slave.
See here:
mysql> select * from escalations;
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
| escalationid | actionid | triggerid | eventid | r_eventid | nextcheck | esc_step | status |
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
| 200000000000002 | 200200000000004 | 200200000016032 | 200200000071274 | 0 | 0 | 27 | 2 |
| 200000000000004 | 200200000000004 | 200200000016026 | 200200000071275 | 0 | 0 | 13 | 2 |
| 200000000000006 | 200200000000004 | 200200000016026 | 200200000071276 | 0 | 0 | 12 | 2 |
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
3 rows in set (0.00 sec)
mysql> delete from escalations;
Query OK, 3 rows affected (0.03 sec)
Now when I look at the master web UI I see the event listed as "In progress", but on the slave UI I see "OK".
So I went to the master escalations table. Here there are 6 entries, and none of the IDs match up with the slave's:
mysql> select * from escalations;
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
| escalationid | actionid | triggerid | eventid | r_eventid | nextcheck | esc_step | status |
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
| 100000000000012 | 100100000000004 | 200100000000740 | 200200000071235 | 0 | 0 | 2 | 2 |
| 100000000000017 | 100100000000004 | 100100000013094 | 100100000000091 | 0 | 0 | 2 | 2 |
| 100000000000018 | 100100000000004 | 100100000013100 | 100100000000095 | 0 | 0 | 2 | 2 |
| 100000000000019 | 100100000000004 | 100100000013095 | 100100000000105 | 0 | 0 | 2 | 2 |
| 100000000000020 | 100100000000004 | 100100000013091 | 100100000000115 | 0 | 0 | 2 | 2 |
| 100000000000021 | 100100000000004 | 100100000013093 | 100100000000119 | 0 | 0 | 2 | 2 |
+-----------------+-----------------+-----------------+-----------------+-----------+-----------+----------+--------+
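To make the IDs in the two tables easier to compare, here is a query that splits them into parts. This is a sketch under an assumption I have not verified against the 1.6.3 source: that IDs in a distributed setup are built as node * 10^14 + source node * 10^11 + local id. The column aliases are just names I picked.

```sql
-- Assumed ID layout: id = node * 10^14 + source_node * 10^11 + local_id
select escalationid,
       eventid,
       eventid div 100000000000000                    as node,
       (eventid mod 100000000000000) div 100000000000 as source_node,
       eventid mod 100000000000                       as local_id
  from escalations;
```

If that layout is right, the first master row (eventid 200200000071235) belongs to the slave node, while the other five rows are the master's own events.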
So here are the issues:
* How do I clean up events on the master node so that they no longer say "In progress"?
* How can I send a notification every 10 minutes until recovery, but send only ONE recovery message?
* If these "nodes" are supposed to be separate, i.e. not share users and media types, why can't I create the same user or media type on different nodes from the master UI without it complaining? Instead I have to log into the slave node directly to do this.
* It seems event acknowledgement might not be syncing properly between nodes. From my experience with acknowledgements, when you acknowledge an event the notifications should stop, or at least the slave DB should have the acknowledged column set to "1" for the row I acknowledged through the master UI. Does this sound correct?
Thanks
Chris