Greetings,
I have seen a couple of threads and private messages about various triggers, so I thought I would list a couple of the ones I use, in case someone wants to use them.
First off, all of my trigger descriptions take the form '{HOSTNAME} -- <Some Trigger Detail>', e.g. '{HOSTNAME} -- System Uptime > 90 days'. By doing this, I can use the compact trigger display and see not only which servers are having issues, but what the issue is. I do the same thing for actions.
One of the first groups of triggers I created was for monitoring the agent. There has been a lot of talk lately about the initial value of the status key, which I manually set to 1 when I created the items. This way, the value is always in a known state. As a reminder, a status of 2 means the server was unable to talk to the agent, while a 1 means the check was successful.
Due to network latency and server performance, it is possible for this check to return false positives, so I used the following for my '{HOSTNAME} -- Overall Availability' check:
Code:
({__Templated_Linux_SVR:status.last(0)}=2)&({__Templated_Linux_SVR:status.prev(0)}=2)
A very simple trigger, but it will only go active if the server fails to communicate with the agent two iterations in a row.
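To make the idea concrete, here is a minimal sketch (Python, purely illustrative, with made-up sample values) of the debounce logic the expression encodes:
Code:
# Alert only when the two most recent status samples are both 2
# (2 = server could not talk to the agent, 1 = check succeeded).
samples = [1, 1, 2, 2]                 # hypothetical status history, newest last
last, prev = samples[-1], samples[-2]
if last == 2 and prev == 2:            # a single failed poll is ignored
    print("{HOSTNAME} -- Overall Availability: agent down twice in a row")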
The next trigger I created was in response to a series of issues we were having with our VMware servers. Every VMware server we had that went over 120 days of uptime started losing the ability to talk to its own internal database. When this happened, we lost all control of the VMware guests from the VMware console: we could not stop running guests, could not start guests that were stopped or had crashed, and could not make configuration changes. Because of this, we implemented a policy of rebooting our VMware servers roughly every 100 days. To monitor this, I created the following triggers (the threshold arithmetic is explained in the sketch after the list):
- '{HOSTNAME} -- System Uptime > 75 Days'. This trigger is set as a simple warning, so we can start scheduling downtime with our customers:
Code:
({__Template_Infra_VMWare:system[uptime].last(0)}>6479999)&({__Template_Infra_VMWare:system[uptime].last(0)}<7776000)
- '{HOSTNAME} -- System Uptime > 90 Days'. This trigger is set as an average, just to keep us apprised of the current uptime:
Code:
({__Template_Infra_VMWare:system[uptime].last(0)}>7775999)&({__Template_Infra_VMWare:system[uptime].last(0)}<9072000)
- '{HOSTNAME} -- System Uptime > 105 Days'. This trigger is set as a high. If it ever goes active, we let the customers know that if something happens to one of their VMware guests, we may be unable to resolve the issue without taking their entire site down:
Code:
({__Template_Infra_VMWare:system[uptime].last(0)}>9071999)&({__Template_Infra_VMWare:system[uptime].last(0)}<10368000)
- '{HOSTNAME} -- System Uptime > 120 Days'. This trigger is set as a disaster. If it ever goes active, we do not negotiate; we reboot the server at the next maintenance window:
Code:
{__Template_Infra_VMWare:system[uptime].last(0)}>10367999
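The long numbers in these expressions are just day counts converted to seconds of uptime. A quick sketch (Python, purely illustrative) of where they come from:
Code:
# Each threshold is days * 86400 seconds; the triggers use > (threshold - 1),
# so e.g. >6479999 means uptime >= 75 full days.
DAY = 86400
for days in (75, 90, 105, 120):
    print(days, "days =", days * DAY, "seconds")
# 75 days = 6480000, 90 days = 7776000, 105 days = 9072000, 120 days = 10368000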
Another thing we use at our site is cfEngine to manage the configuration of our machines. With cfEngine, we can make a single change and have it propagate to all of our Unix servers. This is especially helpful if you do a lot of customization on top of a base install.

One of the first things we noticed was that cfEngine occasionally would not run. We found that someone was accidentally starting it as a server instead of as an agent. Rather than creating a monitor to watch for the server process, we decided to watch for the agent's log being updated. This let us kill two birds with one stone: we could see whether the log file was updated, and also see whether there were errors in the agent run. We configured the agent to run once a day, and set up the '{HOSTNAME} -- CF Agent Failed Last Run' trigger:
Code:
{__Template_Linux_SVR:vfs.file.mtime[/var/cfengine/lastrun.log].abschange(0)}<82800
This trigger simply checks whether the modification time of the cf Agent log file is the same as it was on the last check. If so, the agent failed to even start correctly. You may notice that we are not using 86400 seconds; instead we are using 82800. We do this to help cover for things like the remote server doing a complete re-install. Since a complete re-install can take up to 45 minutes, a full 86400 would return a false positive the day after a re-install.
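Here is a minimal sketch (Python, with made-up timestamps) of what the expression evaluates:
Code:
# abschange(0) is the absolute difference between the two most recent samples
# of the log file's mtime; 82800 s = 23 h, which leaves an hour of slack.
prev_mtime = 1178000000          # hypothetical mtime seen at yesterday's check
last_mtime = prev_mtime          # unchanged: the agent never updated its log
if abs(last_mtime - prev_mtime) < 82800:
    print("{HOSTNAME} -- CF Agent Failed Last Run")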
If you find any of these useful, please feel free to use them for your own checks.