Weird Zabbix Agent freezing...

  • gleepwurp
    Senior Member
    • Mar 2014
    • 119

    #1

    Weird Zabbix Agent freezing...

    (Sorry, originally posted this in the wrong forum...)

    Hi,

    I'm encountering a weird issue where the Zabbix Agent stops sending active data to the Zabbix Server. The weird thing is, when I run "log_level_increase" to see what's going on, it unfreezes and starts sending the data again...

    Here is a part of the Zabbix Agent log that's relevant:

    Code:
    13808:20150316:101539.102 log level has been increased to 4 (debug)
     13808:20150316:101539.102 collector [processing data]
     13808:20150316:101539.102 In update_cpustats()
     13808:20150316:101539.102 End of update_cpustats()
     13808:20150316:101539.102 collector [idle 1 sec]
     13810:20150316:101539.102 log level has been increased to 4 (debug)
     13810:20150316:101539.102 listener #1 [waiting for connection]
     13811:20150316:101539.102 log level has been increased to 4 (debug)
     13812:20150316:101539.103 log level has been increased to 4 (debug)
     13813:20150316:101539.103 log level has been increased to 4 (debug)
     13814:20150316:101539.103 log level has been increased to 4 (debug)
     13812:20150316:101539.103 listener #3 [waiting for connection]
     13813:20150316:101539.103 active check data upload to [142.101.252.201:10051] is working again
     13814:20150316:101539.103 In send_buffer() host:'142.101.253.106' port:10051 values:0/100
     13813:20150316:101539.104 End of send_buffer():SUCCEED
     13814:20150316:101539.104 End of send_buffer():SUCCEED
     13813:20150316:101539.104 buffer: new element 0
     13814:20150316:101539.104 active checks #2 [processing active checks]
     13813:20150316:101539.104 End of process_value():SUCCEED
     13814:20150316:101539.109 In process_active_checks() server:'142.101.253.106' port:10051)
     13814:20150316:101539.113 for key [proc.mem[rscd]] received value [314322944]
     13814:20150316:101539.113 In process_value() key:'whml33985:proc.mem[rscd]' value:'314322944'
     13814:20150316:101539.113 In send_buffer() host:'142.101.253.106' port:10051 values:0/100
     13814:20150316:101539.113 End of send_buffer():SUCCEED
     13814:20150316:101539.113 buffer: new element 0
     13814:20150316:101539.113 End of process_value():SUCCEED
     13814:20150316:101539.113 End of process_active_checks()
     13814:20150316:101539.113 In get_min_nextcheck()
     13814:20150316:101539.113 active checks #2 [idle 1 sec]
    Have any of you encountered a similar issue? It only occurs randomly on a couple of Red Hat machines, and not always the same ones; all of them report to the same Zabbix Server...

    Thanks!

    Gleepwurp.
  • filipp.sudanov
    Senior Member
    Zabbix Certified Specialist
    • May 2014
    • 137

    #2
    Try something like
    Code:
    strace  -s 256 -p <PID of "active checks" agent's process> -tdt
    to understand what the agent is doing.

    Also get a tcpdump of the agent's exchange with the server. It's possible that some TCP packets are getting lost at a firewall; in that case the agent waits through quite a long timeout before it tries to reconnect.
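
A capture along these lines might show whether packets to the server go missing. It is only a sketch: the port 10051 and the server address come from the log in post #1, while the helper name `capture_cmd`, the output path and the choice of `-i any` are assumptions. The helper prints the command instead of running it, so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: build a tcpdump command line for agent <-> server traffic.
# $1 = Zabbix server IP, $2 = server port (defaults to 10051, as seen in the agent log).
capture_cmd() {
    server_ip="$1"
    server_port="${2:-10051}"
    # -i any : capture on all interfaces
    # -nn    : do not resolve host names or port names
    # -s 0   : capture full packets
    # -w     : write raw packets to a file (the path is an assumption)
    printf 'tcpdump -i any -nn -s 0 -w /tmp/zabbix_active.pcap host %s and tcp port %s\n' \
        "$server_ip" "$server_port"
}

# Print the command for the server address that appears in the log.
capture_cmd 142.101.253.106
```

Starting the capture before the hang occurs and leaving it running is what makes it useful, since the interesting packets are the ones exchanged (or lost) at the moment the agent stops sending.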

    • gleepwurp
      Senior Member
      • Mar 2014
      • 119

      #3
      Hi Filipp,

      Thanks for the suggestions... strace is something I've never used before, and I didn't even think of checking with tcpdump...

      I'll try those the next time I get the problem and we'll see what's going on...

      Thanks again!

      Gleepwurp.

      • gleepwurp
        Senior Member
        • Mar 2014
        • 119

        #4
        OK, it happened again this morning and I ran strace on the zabbix_agentd "active checks" process.

        Indeed, the active checks process seems to have hung:

        Code:
        sudo strace -s 256 -p 28327 -tdt
        Password: 
        Process 28327 attached - interrupt to quit
         [wait(0x137f) = 28327]
        pid 28327 stopped, [SIGSTOP]
         [wait(0x57f) = 28327]
        pid 28327 stopped, [SIGTRAP]
        09:39:56.009157 read(5,
        As soon as I increase the log level (-R log_level_increase), the whole thing "unlocks" and resumes normal processing.

        Please note that this issue always seems to occur after a long period in which the Zabbix Server queue holds a large number of items (around 80,000) waiting more than 10 minutes and the server seems to stop accepting network connections... When the Zabbix Server comes back online, some agents are stuck in this "active check hung" state.
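
Since the strace output stops inside `read(5,`, one possible next step is to look at what file descriptor 5 of the hung process actually is. The sketch below assumes a Linux `/proc` filesystem; the helper name `fd_target` is made up for illustration, and the demonstration uses the shell's own descriptor so that it runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: show what a file descriptor of a process points to.
# Relies on the Linux /proc filesystem. $1 = PID, $2 = fd number.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Demonstration on this shell itself: open fd 3 on /dev/null, inspect it, close it.
exec 3</dev/null
fd_target "$$" 3
exec 3<&-
```

For the hung agent the call would be `fd_target 28327 5`, run as root. A result of the form `socket:[...]` would suggest the process is blocked reading from a network socket, and `ss -tnp` could then be used to see which peer that socket is connected to.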

        G.

        • filipp.sudanov
          Senior Member
          Zabbix Certified Specialist
          • May 2014
          • 137

          #5
          It would be useful to log the agent's behaviour starting from the moment _before_ the problem happens, by running the agent at log level 4 from the beginning (plus strace and tcpdump).
          But as I understand it, this happens randomly? Is there any way to reproduce it?

          • gleepwurp
            Senior Member
            • Mar 2014
            • 119

            #6
            Indeed!

            No... this can work fine for a couple of days and then happen again...

            I've left LogLevel = 4 for the Zabbix active checks processes only (-R log_level_increase=<active pid>), so maybe I'll see what happens in the logs next time...
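
Raising the log level for just the active checks processes could look roughly like this. The helper name `loglevel_cmds` is hypothetical, and the PIDs in the example are the two processes that appear to be the active checks workers in the log from post #1. The helper only prints the runtime-control commands so they can be checked before being run.

```shell
#!/bin/sh
# Hypothetical helper: print one runtime-control command per PID.
# Arguments: PIDs of the agent's "active checks" processes.
loglevel_cmds() {
    for pid in "$@"; do
        # Each invocation raises the log level of that one process by one step.
        printf 'zabbix_agentd -R log_level_increase=%s\n' "$pid"
    done
}

# PIDs taken from the log in post #1 (assumed to be the active checks processes).
loglevel_cmds 13813 13814
```

On a live system the PIDs change with every agent restart; something like `pgrep -f 'zabbix_agentd: active checks'` might be used to look them up, assuming the process titles match that pattern.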

            Thanks for the feedback!

            G.
