1.4 Node Watcher Thread


agehring
08-06-2007, 18:10
Are there any patches for the "node watcher" thread for server?

I'm getting segment violations (SIGSEGV SIG_DFL) on zabbix server running on OS X, and it always happens with the node watcher thread.

Is there a way to turn off the distributed attribute at run time and/or compile time?

Thanks,
Andrew

P.S. Working on debugging it "deeper"...

tronite
08-06-2007, 18:14
Have you checked what they say here? http://sciss.de/jcollider/doc/api/de/sciss/jcollider/NodeWatcher.html

agehring
08-06-2007, 18:19
What does a Java class have to do with zabbix?

agehring
08-06-2007, 19:14
I'm getting "deeper".

The node watcher server SIGSEGVs at the 103942nd call of

In send_to_master_and_slave(node:0)

of the node watcher...

I've repeated this 10 times, and it crashes at the same point every time...


memory leak?

Andrew

Alexei
08-06-2007, 19:36
Please could you apply this patch and let me know if it helped.

agehring
08-06-2007, 21:55
I applied the patch, and the only change is that now it processes 103941 calls, vs. 103942.

This is the log output just before the crash dump...

19307:20070608:134636 In get_master_node(0)
19307:20070608:134636 Query [select masterid from nodes where nodeid=0]
19307:20070608:134636 In get_slave_node(0)
19307:20070608:134636 Query [select masterid from nodes where nodeid=0]
19307:20070608:134636 Query [select nodeid from nodes where masterid=0]
19307:20070608:134636 In process_node(node:0)
19307:20070608:134636 In send_to_master_and_slave(node:0)
19292:20070608:134637 One child process died. Exiting ...
19293:20070608:134637 Got signal. Exiting ...
19294:20070608:134637 Got signal. Exiting ...
19295:20070608:134637 Got signal. Exiting ...
19296:20070608:134637 Got signal. Exiting ...
19297:20070608:134637 Got signal. Exiting ...
19306:20070608:134637 Got signal. Exiting ...
19308:20070608:134637 Got signal. Exiting ...
19299:20070608:134637 Got signal. Exiting ...
19298:20070608:134637 Got signal. Exiting ...
19302:20070608:134637 Got signal. Exiting ...
19304:20070608:134637 Got signal. Exiting ...
19301:20070608:134637 Got signal. Exiting ...
19300:20070608:134637 Got signal. Exiting ...
19305:20070608:134637 Got signal. Exiting ...
19303:20070608:134637 Got signal. Exiting ...
19292:20070608:134639 ZABBIX Server stopped

I also compiled the code with -g (CCFLAGS="-g"), and ran it under gdb.

What is interesting in that scenario (to me anyway) is that it DOES NOT crash while either it or its parent is running under gdb, but the minute I exit gdb, it immediately crashes.

Thanks,
Andrew

agehring
08-06-2007, 22:19
GDB reported the following...

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xbf7fffec
0x0001b1dc in zabbix_log (level=4, fmt=0x3979c "Query [%s]") at log.c:180
180 {

agehring
08-06-2007, 22:36
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xbf7fffdc
0x0001b1dc in zabbix_log (level=4, fmt=0x3979c "Query [%s]") at log.c:180
180 {

(gdb) x 0x0001b1dc
0x1b1dc <zabbix_log+12>: 0x01ec3be8
(gdb)


And from here I'm lost...

Thanks,
Andrew

agehring
08-06-2007, 22:52
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xbf7fffec
0x9000319e in szone_malloc ()

0x9000319e <szone_malloc+12>: 0x16786de8

Program starts at approx 161M in size, and grows to 161G in about 46 seconds...
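
A side note on the fault addresses in the gdb output above, offered as a rough sketch rather than a diagnosis. It assumes (this is not stated anywhere in the thread) that a 32-bit Mac OS X process has its main-thread stack topping out at 0xc0000000 with the default 8 MB stack limit. Under that assumption the lowest valid stack address is 0xbf800000, and both reported addresses (0xbf7fffec and 0xbf7fffdc) fall just below it, which is what a stack overflow would look like. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTIONS, not taken from the thread: 32-bit Mac OS X layout with the
 * main-thread stack ending at 0xc0000000 and the default 8 MB stack limit. */
#define ASSUMED_STACK_TOP   UINT32_C(0xc0000000)
#define ASSUMED_STACK_LIMIT UINT32_C(0x00800000)

/* Lowest address that still belongs to the assumed stack region. */
static uint32_t stack_floor(void)
{
    assert(ASSUMED_STACK_TOP > ASSUMED_STACK_LIMIT);
    return ASSUMED_STACK_TOP - ASSUMED_STACK_LIMIT;
}

/* Number of bytes by which addr lies below the assumed stack floor,
 * or 0 if addr is not below it. */
static uint32_t bytes_below_stack(uint32_t addr)
{
    uint32_t lowest = stack_floor();
    return addr < lowest ? lowest - addr : 0;
}
```

Calling bytes_below_stack() with the two addresses from gdb gives a distance of a few dozen bytes in each case, i.e. the faulting access is the first push past the end of the stack, whichever function happens to be entered at that moment (zabbix_log in one run, szone_malloc in another).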

Alexei
09-06-2007, 09:25
I am pretty sure that the configuration of nodes is messed up on your system. Could you please send me the result of:

select * from nodes
NodeID from zabbix_server.log


Thanks.

agehring
11-06-2007, 14:42
mysql> select * from nodes;
+--------+------------+----------+-----------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
| nodeid | name       | timezone | ip        | port  | slave_history | slave_trends | event_lastid | history_lastid | history_str_lastid | history_uint_lastid | nodetype | masterid |
+--------+------------+----------+-----------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
|      0 | Local node |        0 | 127.0.0.1 | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        1 |        0 |
+--------+------------+----------+-----------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
1 row in set (0.00 sec)


zabbix:~ root# cat /etc/zabbix/zabbix_server.conf
# This is config file for ZABBIX server process
# To get more information about ZABBIX,
# go http://www.zabbix.com

############ GENERAL PARAMETERS #################

# This defines unique NodeID in distributed setup,
# Default value 0 (standalone server)
# This parameter must be between 0 and 999
#NodeID=0

agehring
11-06-2007, 17:56
I enabled libMallocDebug, and am getting the following...

libMallocDebug[zabbix_server-719]: frame pointer goes from bffee098 to bfffe758 -- assuming invalid.


relevant?

Thanks,
Andrew

Alexei
11-06-2007, 19:16
Could you please try the latest code from svn://svn.zabbix.com/branches/1.4.1 ? We fixed several memory-related issues, which could affect the reported problem.

agehring
11-06-2007, 23:03
It still crashes...

I'm going to compile with -g again...


Andrew

Alexei
12-06-2007, 06:57
I'm very very interested in getting more details about this crash! Can you get a backtrace of executed functions? Thank you.

NOB
12-06-2007, 09:53
Hi,

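If you have just one master node with the ID 0, as you seem to have,
there is an endless loop in the code: process_node() selects every node
whose masterid is 0, finds node 0 itself, and calls process_node(0) again.
The following patch fixes it - applied to 1.3.7 and 1.3.8.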

--- zabbix-1.4/src/zabbix_server/nodewatcher/nodesender.c Tue May 29 14:52:49 2007
+++ zabbix-1.3.8/src/zabbix_server/nodewatcher/nodesender.c Tue May 22 15:02:23 2007
@@ -388,8 +388,8 @@

     send_to_master_and_slave(nodeid);

-    result = DBselect("select nodeid from nodes where masterid=%d",
-        nodeid);
+    result = DBselect("select nodeid from nodes where masterid=%d and nodeid != %d",
+        nodeid, nodeid);
     while((row=DBfetch(result)))
     {
         process_node(atoi(row[0]));

No surprise, that the memory gets allocated that fast.
It's "just" the stack that grows.
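
To make the loop easier to see, here is a minimal sketch of it, not the actual nodesender.c code. It replaces the database with a hypothetical in-memory table holding the single row (nodeid 0, masterid 0), and it adds a depth cap so the unpatched variant stops instead of exhausting the stack. The names node_row, walk_nodes and MAX_DEPTH are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 'nodes' table: one (nodeid, masterid) pair per row. */
struct node_row { int nodeid; int masterid; };

static const struct node_row nodes[] = {
    { 0, 0 },   /* the single "Local node" row: it is its own master */
};
static const size_t node_count = sizeof(nodes) / sizeof(nodes[0]);

/* Depth cap for the sketch only; the real server has no such limit. */
#define MAX_DEPTH 1000

/*
 * Visit every node whose masterid equals nodeid, recursing into each one,
 * and return the deepest recursion level reached.
 * skip_self == 0 mimics "where masterid=%d";
 * skip_self != 0 mimics "where masterid=%d and nodeid != %d".
 */
static int walk_nodes(int nodeid, int skip_self, int depth)
{
    int deepest = depth;
    size_t i;

    assert(depth >= 0);
    if (depth >= MAX_DEPTH)
        return depth;               /* stop before the stack runs out */

    for (i = 0; i < node_count; i++) {
        int d;
        if (nodes[i].masterid != nodeid)
            continue;               /* not a child of this node */
        if (skip_self && nodes[i].nodeid == nodeid)
            continue;               /* patched query filters out the node itself */
        d = walk_nodes(nodes[i].nodeid, skip_self, depth + 1);
        if (d > deepest)
            deepest = d;
    }
    return deepest;
}
```

With this table, walk_nodes(0, 0, 0) keeps recursing until it hits the cap, while walk_nodes(0, 1, 0) returns at depth 0 because node 0 is no longer treated as its own child.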

We are not running 1.4 server for now. Just waiting for 1.4.1 and
testing the agents of 1.4.

HTH,

Norbert.

agehring
12-06-2007, 15:01
That patch fixed the problem.

I had set up zabbix to run with libgmalloc, but after doing that it no longer failed.

All of the backtraces I had run against the failed instances had pointed to nodesender.

As an added benefit, the system isn't running at such a breakneck pace anymore either.

Thanks Norbert!
Andrew

Alexei
12-06-2007, 15:12
Note that installation of ZABBIX 1.4 does not create any records in table 'nodes'! The record with NodeId=0 seems suspicious to me.

agehring
12-06-2007, 15:17
Do you believe removing the DB entry would alleviate the issue?

Thanks,
Andrew

Alexei
12-06-2007, 15:42
Yes, it is absolutely safe to remove this entry!

NOB
13-06-2007, 12:19
Hi,

OK, here is why I have this entry in the "nodes" table:

As soon as you start zabbix_server -n 0 this node is created in the
DB.
After that I was immediately hit by the above problem.
That's why I fixed it.

So, what is a good idea for the "real" master node, i.e. the root
of the tree ?
0 obviously isn't :-)

Regards,

Norbert.