Immortal zabbix_server


jarek
19-12-2008, 22:47
Hello developers!
It looks like I'm not the only one experiencing crashes of the zabbix_server process. Could you consider changing the server's behavior so that it respawns automatically if something crashes?
Of course it is a good idea to have perfect, bug-free code, but in reality that is very difficult to achieve.
In my apps I do it this way: a master process has only fork and waitpid in its main loop. If something crashes, it is simply restarted.

Alexei
22-12-2008, 19:49
> In my apps I do it this way: a master process has only fork and waitpid in its main loop. If something crashes, it is simply restarted.
Who restarts the master process? ;)

jarek
27-02-2009, 20:13
> Who restarts the master process? ;)

The master process is quite simple, so there is very little risk of it crashing.
Let's see how it could look:

int main(int argc, char **argv)
{
    zbx_task_t task = ZBX_TASK_START;
    char ch = '\0';

    int nodeid = 0;
    pid_t pid;

    progname = argv[0];

    while (1)
    {
        pid = fork();
        if (pid == 0)
            break;                  // We are the child, go on with normal startup
        if (pid > 0)
            waitpid(pid, NULL, 0);  // Wait for the child to finish
        sleep(5);                   // Prevent too-fast respawning (also after a failed fork)
    }

    /* Parse the command-line. */
    while ((ch = (char)zbx_getopt_long(argc, argv, shortopts, longopts, NULL)) != (char)EOF)
        switch (ch) {


Of course, for a production solution, while(1) can be replaced with a loop on some variable that can be changed, e.g. by an interrupt signal (SIGINT). Some additional signal handling would also be helpful.
If you like this idea, I can write a more efficient solution.

Best regards
Jarek

nelsonab
02-03-2009, 04:14
Automagic respawning is a bad idea to me. If the server is stopping for a reason, such as an error, the program should stop. Now, if it's stopping due to aberrant behavior then OK, maybe there is a point to a restart. However, there is the risk that the previous crash left the application in an unstable state upon restart, i.e. there is bad data in the DB which will cause subsequent restarts to fail. Yes, Zabbix does like to quit on rare occasions, and yes, this is not very good, but there are other ways to restart the app than having the master thread fork itself.

A better solution might be a heartbeat script tied to a cron job. Every 5 minutes you check whether you have a zabbix process; if you don't, fire off an email. This way you can check whether something is truly gorked before you restart the server.
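
The check such a heartbeat job performs could be sketched like this, assuming the server writes its PID to a pid file whose path you know. The function name pidfile_process_alive() is hypothetical, and the approach is only one possibility: it cannot tell whether the PID has been reused by an unrelated process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if the process whose PID is stored in
 * pidfile exists, 0 if it does not, and -1 if the file cannot be read
 * or does not hold a valid PID.
 */
int pidfile_process_alive(const char *pidfile)
{
    long pid = 0;
    int fields;
    FILE *f;

    assert(pidfile != NULL);

    f = fopen(pidfile, "r");
    if (f == NULL)
        return -1;
    fields = fscanf(f, "%ld", &pid);
    fclose(f);
    if (fields != 1 || pid <= 0)
        return -1;

    /* Signal 0 delivers nothing; it only checks whether the PID exists. */
    if (kill((pid_t)pid, 0) == 0)
        return 1;
    return errno == EPERM ? 1 : 0;  /* EPERM: exists, but owned by another user */
}
```

A small program run from cron every 5 minutes could call this and print a message whenever the result is not 1. Cron can usually be set up to mail a job's output to its owner, which would cover the "fire an email" part without restarting anything automatically.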

Though if you really want to get crazy, I did read about a solution from one of the early timesharing systems of the 60s/70s called Robin Hood and the Sheriff. Both processes would run concurrently and look for each other. If you killed one process, say Robin Hood, the Sheriff would restart Robin Hood. If you killed the Sheriff process, Robin Hood would restart the Sheriff. Extrapolate that to Zabbix and you have an answer to Alexei's question... There is no one watcher... so a watcher watches a watcher who watches zabbix... while watching the watcher.....

OK... I'll step away from the computer now...

pesadilla
02-03-2009, 09:42
> Automagic respawning is a bad idea to me. If the server is stopping for a reason, such as an error, the program should stop. Now, if it's stopping due to aberrant behavior then OK, maybe there is a point to a restart. However, there is the risk that the previous crash left the application in an unstable state upon restart, i.e. there is bad data in the DB which will cause subsequent restarts to fail. Yes, Zabbix does like to quit on rare occasions, and yes, this is not very good, but there are other ways to restart the app than having the master thread fork itself.
>
> A better solution might be a heartbeat script tied to a cron job. Every 5 minutes you check whether you have a zabbix process; if you don't, fire off an email. This way you can check whether something is truly gorked before you restart the server.
> ..

I agree with this idea.

Tenzer
02-03-2009, 09:58
You can also set up Monit (http://mmonit.com/monit/) to monitor the Zabbix server. It can be configured to e-mail you when the server goes down, it can automatically restart the Zabbix server, and you can specify thresholds for how many times it may restart the Zabbix server before it gives up.
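
The "give up after too many restarts" threshold can also be sketched independently of Monit. The code below is not Monit's implementation or configuration syntax; restart_budget_t and its functions are hypothetical names for a simple sliding-window counter that allows at most max_restarts restarts within window seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAX_TRACKED 16

/* Hypothetical restart budget: at most max_restarts within window seconds. */
typedef struct {
    time_t stamps[MAX_TRACKED];  /* times of the most recent allowed restarts */
    size_t count;                /* number of valid entries in stamps */
    size_t next;                 /* slot that holds the oldest entry */
    size_t max_restarts;
    time_t window;
} restart_budget_t;

void restart_budget_init(restart_budget_t *b, size_t max_restarts, time_t window)
{
    assert(b != NULL && max_restarts >= 1 && max_restarts <= MAX_TRACKED);
    b->count = 0;
    b->next = 0;
    b->max_restarts = max_restarts;
    b->window = window;
}

/*
 * Returns 1 and records the restart if one is still allowed at time now,
 * or 0 if the budget for the current window is used up.
 * Assumes now never goes backwards between calls.
 */
int restart_budget_allow(restart_budget_t *b, time_t now)
{
    size_t i, recent = 0;

    for (i = 0; i < b->count; i++)
        if (now - b->stamps[i] < b->window)
            recent++;
    if (recent >= b->max_restarts)
        return 0;                            /* give up: too many restarts */

    b->stamps[b->next] = now;                /* overwrite the oldest entry */
    b->next = (b->next + 1) % b->max_restarts;
    if (b->count < b->max_restarts)
        b->count++;
    return 1;
}
```

A watchdog would call restart_budget_allow() with the current time before each restart and stop restarting (and alert someone instead) once it returns 0.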

jarek
02-03-2009, 13:47
I agree that autorespawning is not a cure for all bugs, but it is a simple solution which decreases the risk of data loss.
External watchdogs can be more flexible, but in general they do exactly the same thing.
Regarding the database: the respawning should be done before the database connection is initialized; in that case there is no risk of any data corruption.
If you don't like the idea, it is quite simple to add a configuration parameter which disables the feature.
Most of you are using Apache, which has the same autorespawning. Is that bad?

Saftnase
05-03-2009, 16:30
> You can also set up Monit (http://mmonit.com/monit/) to monitor the Zabbix server. It can be configured to e-mail you when the server goes down, it can automatically restart the Zabbix server, and you can specify thresholds for how many times it may restart the Zabbix server before it gives up.

I agree with you, Tenzer, but I did it with the Zabbix server itself.

When I had the problem with the net-snmp memory leak, I monitored the free swap space, and when it dropped below the trigger threshold I just restarted the whole machine. (It takes only 5 minutes, so it's no problem for the monitored hosts.)