Distributed master hangs when slave node crashes

  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #1

    Distributed master hangs when slave node crashes

    Hi,

    Situation:
    Distributed setup, 1 master 2 slave nodes.
    Master node has 1.4.5-pre running with a PostgreSQL backend.
    Slave nodes have 1.4.4 running with a MySQL backend.

    Problem:
    When one of the slave nodes crashes or stops (I haven't had time to check for the errors yet; the logfiles have already been rotated away), the master node keeps running but no longer polls or receives data. Trigger evaluation and action handling still work (the symptoms look like those of the infinite loop in the trapper processes, but there is no load increase).
    Restarting the zabbix_server daemon restores functionality.

    Question:
    Is this specific issue known / fixed in 1.4.5? I don't really want to go through trial and error while the fix is unknown.
    I expect the slave and master node were exchanging data at the point of the crash, leaving the master node in some wait state because the announced amount of data hasn't been received yet.
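
The wait state suspected above can be illustrated with a short sketch. It is not Zabbix code: the 4-byte length prefix and the names `recv_exact`, `recv_message` and `PeerStalled` are assumptions made purely for illustration. It shows how a receiver that has been told to expect N bytes can give up once the sender goes silent or disconnects, instead of blocking forever.

```python
import socket
import struct


class PeerStalled(Exception):
    """The peer stopped sending before the announced length arrived."""


def recv_exact(sock, nbytes, timeout_s):
    """Read exactly nbytes from sock, or raise PeerStalled.

    timeout_s bounds each individual recv() call, so a silent peer
    cannot block the caller indefinitely.
    """
    sock.settimeout(timeout_s)
    chunks = []
    remaining = nbytes
    while remaining > 0:
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            raise PeerStalled(
                f"timed out with {remaining} of {nbytes} bytes missing")
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise PeerStalled(
                f"peer closed with {remaining} of {nbytes} bytes missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock, timeout_s=5.0):
    """Read one message: 4-byte big-endian length, then the payload.

    The framing is hypothetical and only stands in for "announced
    amount of data".
    """
    (length,) = struct.unpack("!I", recv_exact(sock, 4, timeout_s))
    return recv_exact(sock, length, timeout_s)
```

A caller would catch `PeerStalled`, drop the connection, and carry on with its other work. Whether the 1.4.x node-to-node code path lacks such a bound is exactly the open question here.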

    Kind regards,
  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #2
    Update

    All nodes are running 1.4.5

    The problem above still exists.
    Slave node which stops is running RHEL 4ES with mysql 4.1.20 (std mysql version for rhel4).
    Apparently it sometimes *thinks* 'mysql has gone away' during a history update from slave (node 3) to master (node 1).
    This in itself is bad, I think, BUT the master also goes into a problematic state. It:
    - Doesn't receive data from active agents
    - Doesn't do agent checks
    - Doesn't do snmp traps
    - Does trigger evaluation
    - Does trigger action processing.

    -------------8<--------------- Slave node log directly after crash
    7848:20080328:105617 NODE 3: Sending new history_uint of node 3 to node 1 datalen 817
    7848:20080328:105626 NODE 3: Sending new history of node 3 to node 1 datalen 191
    7848:20080328:105627 NODE 3: Sending new history_uint of node 3 to node 1 datalen 197
    7848:20080328:105636 NODE 3: Sending new history of node 3 to node 1 datalen 47
    7848:20080328:105815 Error while receiving answer from Node [1]
    7848:20080328:105815 Query::select id,itemid,clock,value from history_uint_sync where nodeid=3 order by id limit 10000
    7848:20080328:105815 Query failed:MySQL server has gone away [2006]
    7826:20080328:105815 One child process died. Exiting ...
    7826:20080328:105817 ZABBIX Server stopped
    -------------8<--------------- Slave node log directly after crash
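
As a reading aid for the slave log above, here is a minimal sketch that measures the silences between consecutive log lines. It assumes the `pid:YYYYMMDD:HHMMSS message` layout seen in the excerpt; the helper names `parse_ts` and `largest_gap` are made up for this example.

```python
from datetime import datetime

# Lines copied from the slave node log excerpt.
SLAVE_LOG = [
    "7848:20080328:105617 NODE 3: Sending new history_uint of node 3 to node 1 datalen 817",
    "7848:20080328:105626 NODE 3: Sending new history of node 3 to node 1 datalen 191",
    "7848:20080328:105627 NODE 3: Sending new history_uint of node 3 to node 1 datalen 197",
    "7848:20080328:105636 NODE 3: Sending new history of node 3 to node 1 datalen 47",
    "7848:20080328:105815 Error while receiving answer from Node [1]",
    "7848:20080328:105815 Query failed:MySQL server has gone away [2006]",
    "7826:20080328:105815 One child process died. Exiting ...",
    "7826:20080328:105817 ZABBIX Server stopped",
]


def parse_ts(line):
    """Return the timestamp of a 'pid:YYYYMMDD:HHMMSS message' line."""
    _pid, date, rest = line.strip().split(":", 2)
    return datetime.strptime(date + rest[:6], "%Y%m%d%H%M%S")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence."""
    best = (0.0, None, None)
    for prev, cur in zip(lines, lines[1:]):
        gap = (parse_ts(cur) - parse_ts(prev)).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


gap, before, after = largest_gap(SLAVE_LOG)
print(gap)    # 99.0
print(after)  # 7848:20080328:105815 Error while receiving answer from Node [1]
```

The 99-second silence falls between the last "Sending new history" line and the "Error while receiving answer from Node [1]" line. That is consistent with the slave waiting on the master before the MySQL error was reported, but on its own it does not show which side caused the stall.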


    ------------8<--------------- Master node log directly after crash
    105478 9450:20080328:155631 Active parameter [windows.ssm[Disk.SmartUsed,G,Percent]] is not supported by agent on host [nlyehvgdc1ms171]
    105479 9451:20080328:155631 NODE 1: Received history from node 2 for node 2 datalen 414
    105480 9452:20080328:155631 Active parameter [system.cpu.util[,system,avg1]] is not supported by agent on host [nlvocl6]
    105481 9452:20080328:155636 Active parameter [system.cpu.util[,idle,avg1]] is not supported by agent on host [nlvu009]
    105482 9451:20080328:155641 Active parameter [net.if.in[eth3,bytes]] is not supported by agent on host [nlvud138]
    105483 9456:20080328:155654 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
    105484 9456:20080328:155654 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
    105485 9456:20080328:155654 Expression [({100100000030425}#1)|({100100000030424}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030425]]
    105486 9456:20080328:155654 Expression [({100100000030309}#1)|({100100000030308}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030309]]
    105487 9456:20080328:155654 Expression [({100100000030313}#1)|({100100000030312}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030313]]
    105488 9456:20080328:155654 Expression [({100100000030315}#1)|({100100000030314}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030315]]
    105489 9456:20080328:155655 Expression [({100100000030641}#1)|({100100000030640}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030641]]
    105490 9456:20080328:155655 Expression [({100100000030415}#1)|({100100000030414}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030415]]
    105491 9456:20080328:155655 Expression [({100100000030463}#1)|({100100000030462}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030463]]
    105492 9457:20080328:155655 Timeout while answering request
    105493 9457:20080328:155656 Timeout while connecting to [nlvg153:161]
    105494 9457:20080328:155656 Host [nlvg153] will be checked after 60 seconds
    105495 9452:20080328:155656 Active parameter [system.cpu.util[,nice,avg1]] is not supported by agent on host [nlxcips36]
    105496 9456:20080328:155657 Expression [({100100000035985}#1)|({100100000035984}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000035985]]
    105497 9456:20080328:155728 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
    105498 9456:20080328:155728 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
    ------------8<--------------- Master node log directly after crash

    Apart from the bad function ID relation messages, it does nothing.

    I *think* the following log line is the one where the connection is lost:
    105492 9457:20080328:155655 Timeout while answering request
    The date/time values differ between the two logs because the node is located in a different timezone.

    @developers
    The fact that the slave node stops, I can accept (although I am 90% certain the mysql server does not 'go away'). But the fact that the master node also goes blank is a big problem.
    Any ideas?

    --- Update
    I suspect the connection between the master and slave node dropped or is unstable. I can imagine that if the connection is dropped during a history update, mysql gets the blame (since the data is read from mysql, sent through a parser and off to the master node).
    Last edited by xs-; 28-03-2008, 17:33.


    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      Originally posted by xs-
      @developers
      The fact that the slave node stops, I can accept (although I am 90% certain the mysql server does not 'go away'). But the fact that the master node also goes blank is a big problem.
      Slave node stopped due to MySQL unavailability. Please check MySQL logs to see what happened.

      I do not see any critical problems on the master node side. I am not sure what you mean by saying "the master node goes blank".
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga


      • xs-
        Senior Member
        Zabbix Certified Specialist
        • Dec 2007
        • 393

        #4
        As said, the master node
        - Doesn't receive data from active agents
        - Doesn't do agent checks
        - Doesn't do snmp polls
        - Does trigger evaluation
        - Does trigger action processing.

        Symptoms look similar to http://www.zabbix.com/forum/showthread.php?t=8769

        And this happens, as far as I can see, because the slave node crashes / stops.
        Last edited by xs-; 31-03-2008, 09:19.


        • xs-
          Senior Member
          Zabbix Certified Specialist
          • Dec 2007
          • 393

          #5
          After further testing, I get the feeling that it is a mysql issue which pops up only when the child is sending history data to the master (during the send). Probably a mysql4 thing; I need to enable binlogging to look into this.
          But as long as the master node keeps hanging when this happens, I am somewhat reluctant to turn the child node back on (I disabled it for now so the master node keeps working).

          I am not posting here because the child node dies (I'll make another post when I find out why this happens).
          I am posting because when the child node dies during a history update to the master node, the master node partially freezes, displaying the same symptoms presented in the following threads (but the cause is different).


          • xs-
            Senior Member
            Zabbix Certified Specialist
            • Dec 2007
            • 393

            #6
            Update:

            We upgraded the child node to mysql5, and the problem seems to be gone.

            But my main issue still stands: if a fatal query error occurs during a history / config update between nodes, the sending end will stop, and because of that the receiving node will 'hang'.
