Hello,
I have a program running on a system that I am trying to monitor. It is a critical program and should not be down for more than 10 minutes. So I set up a trigger:
({host: proc.num[program].last(0)}<1)
This works fine: when the program stops I get an email saying it's down, and when the program is running again I get an email saying everything is OK. However, the first night my inbox was flooded with emails. It turns out that if a certain action occurs, this program resets itself. On every reset, I got an email saying the program was down and then another email a minute later saying it was back up. A little research tells me that if the program is down for more than 20-30 seconds, then chances are really high that it isn't coming back up on its own.
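To make the behaviour I'm after concrete, here is a rough sketch in Python. It is not Zabbix code, just the logic I would like the trigger to express: only alert once the process count has stayed below 1 for at least 30 seconds. The function names, the sample format, and the 30-second threshold are my own assumptions for illustration.

```python
def down_duration(samples, now):
    """Return how many seconds the program has been continuously down.

    samples: list of (timestamp, process_count) tuples, oldest first
             (a hypothetical stand-in for collected proc.num values).
    Returns 0 if the most recent sample shows the program running,
    or if there are no samples at all.
    """
    down_since = None
    for ts, count in samples:
        if count < 1:
            # Remember when the current outage started.
            if down_since is None:
                down_since = ts
        else:
            # Program is up again, so any earlier outage is over.
            down_since = None
    return 0 if down_since is None else now - down_since


def should_alert(samples, now, threshold=30):
    """Alert only if the outage has lasted at least `threshold` seconds."""
    return down_duration(samples, now) >= threshold
```

With samples every 10 seconds, a quick reset such as `[(0, 1), (10, 0), (20, 1)]` never alerts, while `[(0, 1), (10, 0), (20, 0), (30, 0), (40, 0)]` checked at `now=40` does.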
So I dug through the forums for a couple of hours, found a couple of good ideas, and I changed my trigger accordingly:
({host: proc.num[program].last(0)}<1)&({host: proc.num[program].nodata(30)})
Now I get no emails at all regardless of the state. Can someone please help me figure out what I have done wrong?
One thing to note: I am running mostly Debian boxes with a few CentOS systems, and I use whatever Zabbix version happens to be in the repository for each system. We have a mixture of Debian Etch, Debian Lenny, and CentOS 4.4. Therefore I have a mixture of 1.1 (Zabbix agents) and 1.4 (Zabbix server and a few agents). From my observation, this problem seems to occur regardless of the version, though.
Thanks, I appreciate it!
[update edit] I try to test out updates before I apply them. I forgot that I had run updates to zabbix on my test systems. Debian Lenny (testing) is at 1.1.7-1 and Debian Etch (stable) is at 1.1.4-10. The CentOS machine is still at 1.1. I do not know if this helps at all, but I just wanted to clarify because I know that there are a lot more supported features in the newer versions and I have a crazy mixture.