Watchdog doesn't work in v1.4.2


gustav
11-09-2007, 17:56
I shut down my MySQL server and zabbix_server went down more or less immediately (< 2 sec).

I was expecting zabbix_server to send me an e-mail and then try to reconnect to MySQL as soon as it came up again. As I read it, that is what the Zabbix watchdog should do for me.

[root@spik zabbix]# zabbix_server --version
ZABBIX Server (daemon) v1.4.2 (20 August 2007)
Compilation time: Sep 5 2007 14:38:19

Extract from the server log:
7373:20070911:173538 Query failed:MySQL server has gone away [2006]
7391:20070911:173543 Query::select druleid,iprange,delay,nextcheck,name,status from drules where status=0 and nextcheck<=1189524943 and mod(druleid,1)=0 and druleid>=100000000000000*0 and druleid<=(100000000000000*0+99999999999999)
7391:20070911:173543 Query failed:MySQL server has gone away [2006]
7385:20070911:173543 Failed to connect to database: Error: Can't connect to MySQL server on 'bi.sthlm.se.eds.com' (111) [2003]
7369:20070911:173543 One child process died. Exiting ...
7369:20070911:173545 ZABBIX Server stopped

Our MySQL server is actually very stable and even resides on an HA setup. Nevertheless, it would be a very nice feature if this worked. It would make it easier for us to maintain our MySQL server, which by the way doesn't reside on the same machine as the Zabbix server.

What do you guys say, any suggestions?

/Gustav Karlman

alj
11-09-2007, 18:17
It would take a lot of time to clean up all the code to avoid crashes and leaks. What would make sense is that Zabbix should not die if one of the children exits. In fact, as in Apache, every child could be configured to exit by itself after 100k requests or so, to avoid memory fragmentation and the consequences of memory leaks.

The process nanny should restart children as they die.

The easiest way is just to copy that piece of code from Apache's prefork model. It has neat features like avoiding fork storms (it creates only 1 process per second).
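
To make the nanny idea concrete, here is a minimal sketch in C of one supervision pass: reap any children that have exited, then refill at most one empty slot, and only if at least a second has passed since the last fork. It is not taken from Apache or from the Zabbix source; the names may_fork, spawn_worker and nanny_pass are hypothetical, and the throttle is simply the "1 process per second" rule described above.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Throttle (assumed rule): allow a new fork only if at least one
 * second has passed since the previous one. */
static int may_fork(time_t now, time_t last_fork)
{
    return now - last_fork >= 1;
}

/* Fork one worker. The child runs work() and exits; the parent gets
 * the child's pid back, or -1 if fork() failed. */
static pid_t spawn_worker(void (*work)(void))
{
    pid_t pid = fork();

    if (pid == 0) {
        work();
        _exit(0);
    }
    return pid;
}

/* One pass of the nanny. slots[i] == 0 means "no worker in this slot".
 * Reaps every child that has exited, then starts at most one
 * replacement. Returns the number of workers started (0 or 1). */
static int nanny_pass(pid_t slots[], int n, void (*work)(void),
                      time_t *last_fork)
{
    int status;
    pid_t dead;
    time_t now;

    /* Collect exited children without blocking and free their slots. */
    while ((dead = waitpid(-1, &status, WNOHANG)) > 0) {
        for (int i = 0; i < n; i++) {
            if (slots[i] == dead)
                slots[i] = 0;
        }
    }

    /* Refill the first empty slot, subject to the throttle. */
    now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (slots[i] == 0 && may_fork(now, *last_fork)) {
            pid_t pid = spawn_worker(work);

            if (pid > 0) {
                slots[i] = pid;
                *last_fork = now;
                return 1;
            }
            return 0; /* fork failed; try again on the next pass */
        }
    }
    return 0;
}
```

A parent process would call nanny_pass in a loop with a short sleep between passes, instead of exiting when the first child dies.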


The next step would be to implement smart children management. I.e., the config would have only one option, the max number of children (so as not to trip over the database connection limit), and Zabbix would then dynamically decide how many pollers/trappers/http pollers to fork or not to fork (after they exit) based on recent statistics.
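
One possible reading of "based on recent statistics" is a proportional split: give every worker type one child, then divide the rest of the budget according to how busy each type has recently been. The sketch below assumes that policy; the function name allocate_children, the three-type layout and the load counters are all hypothetical and not part of Zabbix.

```c
#define NUM_TYPES 3 /* assumed order: pollers, trappers, http pollers */

/* Split max_children among the worker types in proportion to their
 * recent load (e.g. items processed in the last interval). Every type
 * gets at least one child. Returns 0 on success, or -1 if the budget
 * is too small to give each type one child. */
static int allocate_children(int max_children,
                             const unsigned long load[NUM_TYPES],
                             int out[NUM_TYPES])
{
    unsigned long total = 0;
    int assigned = 0;
    int busiest = 0;
    int spare;

    if (max_children < NUM_TYPES)
        return -1;

    for (int i = 0; i < NUM_TYPES; i++) {
        total += load[i];
        if (load[i] > load[busiest])
            busiest = i;
    }

    /* One guaranteed child per type, the rest shared by load. */
    spare = max_children - NUM_TYPES;
    for (int i = 0; i < NUM_TYPES; i++) {
        out[i] = 1;
        if (total > 0)
            out[i] += (int)((unsigned long)spare * load[i] / total);
        assigned += out[i];
    }

    /* Integer division rounds down; give the remainder to the
     * busiest type so the full budget is used. */
    out[busiest] += max_children - assigned;
    return 0;
}
```

With a budget of 10 children and recent loads of 60, 30 and 10, this gives 6 pollers, 3 trappers and 1 http poller.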

gustav
11-09-2007, 21:18
As I understand it, it is supposed to work... It is listed as a new feature in this release, so I assume either I did something wrong or it is a bug?

/Gustav

gustav
12-09-2007, 18:19
I found and corrected the problem in the source code. At least I assume so, since I haven't studied it in detail.

The problem was in db_connect: the errno returned by the MySQL connect call wasn't handled.

The errno was 2003.
#define CR_CONN_HOST_ERROR 2003

I just added it in the switch and set ret to ZBX_DB_DOWN.
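
For anyone who wants to see the shape of that change, here is a rough sketch. It is not the actual Zabbix source: the helper name classify_mysql_errno is hypothetical, the numeric values of the ZBX_DB_* codes are assumed, and the thread does not say which other error codes the real switch already handled, so the extra cases are only an assumption. The CR_* values are the MySQL client error codes from errmsg.h; 2003 and 2006 are the two seen in the log above.

```c
/* MySQL client error codes (values as in the client library's errmsg.h). */
#define CR_CONNECTION_ERROR  2002 /* can't connect through local socket */
#define CR_CONN_HOST_ERROR   2003 /* can't connect to server on host */
#define CR_SERVER_GONE_ERROR 2006 /* MySQL server has gone away */
#define CR_SERVER_LOST       2013 /* lost connection during query */

/* Return codes: the names follow the thread, the values are assumed. */
#define ZBX_DB_OK    0
#define ZBX_DB_FAIL -1
#define ZBX_DB_DOWN -2

/* Hypothetical helper: decide whether a MySQL errno means "database is
 * down, keep retrying" or "real failure, give up". */
static int classify_mysql_errno(unsigned int err)
{
    int ret;

    switch (err) {
    case 0:
        ret = ZBX_DB_OK;
        break;
    case CR_CONNECTION_ERROR:
    case CR_CONN_HOST_ERROR: /* the case that was missing */
    case CR_SERVER_GONE_ERROR:
    case CR_SERVER_LOST:
        ret = ZBX_DB_DOWN;
        break;
    default:
        ret = ZBX_DB_FAIL;
        break;
    }
    return ret;
}
```

The point is only that a connection-level error should map to ZBX_DB_DOWN, so the caller can wait and retry rather than let the child process die.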

Alexei, can you tell me whether this is correct?

Anyway, now it works. Whether I lose any transactions, I don't know; I didn't analyze it that carefully...