**xs-** · 28-03-2008, 17:15

Update

All nodes are running 1.4.5

The problem above still exists.
The slave node that stops is running RHEL 4 ES with MySQL 4.1.20 (the standard MySQL version for RHEL 4).
Apparently it sometimes *thinks* 'MySQL server has gone away' during a history update from the slave (node 3) to the master (node 1).
This in itself is bad, I think, BUT the master also goes into a problematic state. It:
- Doesn't receive data from active agents
- Doesn't do agent checks
- Doesn't do SNMP traps
- Still does trigger evaluation
- Still does trigger action processing

-------------8<--------------- Slave node log directly after crash
7848:20080328:105617 NODE 3: Sending new history_uint of node 3 to node 1 datalen 817
7848:20080328:105626 NODE 3: Sending new history of node 3 to node 1 datalen 191
7848:20080328:105627 NODE 3: Sending new history_uint of node 3 to node 1 datalen 197
7848:20080328:105636 NODE 3: Sending new history of node 3 to node 1 datalen 47
7848:20080328:105815 Error while receiving answer from Node [1]
7848:20080328:105815 Query::select id,itemid,clock,value from history_uint_sync where nodeid=3 order by id limit 10000
7848:20080328:105815 Query failed:MySQL server has gone away [2006]
7826:20080328:105815 One child process died. Exiting ...
7826:20080328:105817 ZABBIX Server stopped
-------------8<--------------- Slave node log directly after crash

------------8<--------------- Master node log directly after crash
105478 9450:20080328:155631 Active parameter [windows.ssm[Disk.SmartUsed,G,Percent]] is not supported by agent on host [nlyehvgdc1ms171]
105479 9451:20080328:155631 NODE 1: Received history from node 2 for node 2 datalen 414
105480 9452:20080328:155631 Active parameter [system.cpu.util[,system,avg1]] is not supported by agent on host [nlvocl6]
105481 9452:20080328:155636 Active parameter [system.cpu.util[,idle,avg1]] is not supported by agent on host [nlvu009]
105482 9451:20080328:155641 Active parameter [net.if.in[eth3,bytes]] is not supported by agent on host [nlvud138]
105483 9456:20080328:155654 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
105484 9456:20080328:155654 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
105485 9456:20080328:155654 Expression [({100100000030425}#1)|({100100000030424}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030425]]
105486 9456:20080328:155654 Expression [({100100000030309}#1)|({100100000030308}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030309]]
105487 9456:20080328:155654 Expression [({100100000030313}#1)|({100100000030312}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030313]]
105488 9456:20080328:155654 Expression [({100100000030315}#1)|({100100000030314}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030315]]
105489 9456:20080328:155655 Expression [({100100000030641}#1)|({100100000030640}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030641]]
105490 9456:20080328:155655 Expression [({100100000030415}#1)|({100100000030414}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030415]]
105491 9456:20080328:155655 Expression [({100100000030463}#1)|({100100000030462}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030463]]
105492 9457:20080328:155655 Timeout while answering request
105493 9457:20080328:155656 Timeout while connecting to [nlvg153:161]
105494 9457:20080328:155656 Host [nlvg153] will be checked after 60 seconds
105495 9452:20080328:155656 Active parameter [system.cpu.util[,nice,avg1]] is not supported by agent on host [nlxcips36]
105496 9456:20080328:155657 Expression [({100100000035985}#1)|({100100000035984}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000035985]]
105497 9456:20080328:155728 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
105498 9456:20080328:155728 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
------------8<--------------- Master node log directly after crash

Apart from the bad function id relation messages, it does nothing.

I *think* the following log line on the master is the one where the connection is lost:
105492 9457:20080328:155655 Timeout while answering request
The timestamps in the two logs differ because the nodes are located in different timezones.
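
To line up the two logs, the timestamps can be compared directly. The sketch below assumes that each log line starts with a `PID:YYYYMMDD:HHMMSS` prefix, which is inferred from the excerpts above and not taken from any Zabbix documentation; `parse_log_prefix` is a hypothetical helper name.

```python
from datetime import datetime

def parse_log_prefix(prefix):
    """Split an assumed 'PID:YYYYMMDD:HHMMSS' prefix into (pid, naive datetime)."""
    pid, date_part, time_part = prefix.split(":")
    return int(pid), datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")

# Prefixes copied from the slave and master log excerpts in this post.
_, slave_ts = parse_log_prefix("7848:20080328:105815")
_, master_ts = parse_log_prefix("9457:20080328:155655")

# Raw gap between the two lines, in local log time.
gap = master_ts - slave_ts
print(gap)  # 4:58:40
```

Subtracting the known timezone offset between the two nodes from this raw gap shows how far apart the two events really are.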

@developers
The fact the slave node stops, i can accept (although i am 90% certain the mysql server does not 'go away'). But the fact that the master node also goes blank is a big problem.
Any ideas?

--- Update
I suspect the connection between the master and the slave node has dropped or is unstable. I can imagine that if the connection drops during a history update, MySQL gets the blame (since the data is read from MySQL, sent through a parser and off to the master node).

**Alexei** · 28-03-2008, 21:19

Originally posted by xs-

@developers
The fact the slave node stops, i can accept (although i am 90% certain the mysql server does not 'go away'). But the fact that the master node also goes blank is a big problem.

Slave node stopped due to MySQL unavailability. Please check MySQL logs to see what happened.

I do not see any critical problems on the master node side. I am not sure what you mean by saying "the master node goes blank".

**xs-** · 31-03-2008, 09:04

As said, the master node:
- Doesn't receive data from active agents
- Doesn't do agent checks
- Doesn't do SNMP polls
- Still does trigger evaluation
- Still does trigger action processing

Symptoms look similar to http://www.zabbix.com/forum/showthread.php?t=8769

And this happens, as far as i can see, because of the slave node crashing / stopping

**xs-** · 01-04-2008, 14:41

After further testing, I get the feeling that it is a MySQL issue which pops up only when the child is sending history data to the master (during the send). Probably a MySQL 4 thing; I need to enable binlogging for this.
But as long as the master node keeps hanging when this happens, I am somewhat reluctant to turn the child node back on (I disabled it for now so the master node keeps working).

I am not posting here because the child node dies (I'll make another post when I find out why this happens).
I am posting because when the child node dies during a history update to the master node, the master node partially freezes, displaying the same symptoms presented in the following threads (but the cause is different):

http://www.zabbix.com/forum/showthread.php?t=8769

http://www.zabbix.com/forum/showthread.php?t=8874

**xs-** · 03-04-2008, 12:38

Update:

We upgraded the child node to MySQL 5 and the problem seems to be gone.

But my main issue still stands: if a fatal query error occurs during a history / config update between nodes, the sending end will stop and, because of that, the receiving node will 'hang'.
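
To illustrate the kind of behaviour I would expect on the receiving side, here is a minimal sketch in Python. It is not how Zabbix is implemented (the server is not written in Python, and I have not checked its node code); `recv_exact` and `receive_history` are hypothetical names, and `datalen` simply mirrors the value shown in the log lines. The idea is only that a receiver with a read timeout gives up on a dead sender instead of waiting forever.

```python
import socket

def recv_exact(sock, nbytes, timeout_s):
    """Read exactly nbytes from sock; raise if the peer stalls or disconnects."""
    sock.settimeout(timeout_s)
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)  # raises socket.timeout if idle too long
        if not chunk:
            # Empty read: the sender closed the connection mid-transfer.
            raise ConnectionError("peer closed the connection mid-transfer")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def receive_history(sock, datalen, timeout_s=5.0):
    """Return the payload, or None if the sender died or stalled (hypothetical)."""
    try:
        return recv_exact(sock, datalen, timeout_s)
    except OSError:
        # socket.timeout and ConnectionError both derive from OSError.
        # The caller can drop this transfer and keep serving other work.
        return None
```

With something like this, a sender that stops halfway through a transfer costs the receiver at most `timeout_s` seconds per read, after which it can discard the partial data and carry on with agent checks.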

Distributed master hangs when slave node crashes
