Very unstable 1.4.3 zabbix agent

  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #1

    Simply dies with:

    4039:20071213:150218 One child process died. Exiting ...
    4041:20071213:150218 Got signal. Exiting ...
    4042:20071213:150218 Got signal. Exiting ...
    4043:20071213:150218 Got signal. Exiting ...
    4044:20071213:150218 Got signal. Exiting ...
    4039:20071213:150218 zbx_on_exit() called.

    After upgrading to the final 1.4.3 two hours ago, I am seeing this on ~10% of computers.
    These computers previously ran zabbix agent 1.4.2 with the svn r5126 patches from the 1.4 branch added, and everything was rock solid, so the bug was introduced after that revision.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    What platform are the agents running on? Could you provide a pre-crash extract of the agent log file with DebugLevel set to 4? Thank you.
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga
    My Twitter

    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by Alexei
      What platform are the agents running on? Could you provide a pre-crash extract of the agent log file with DebugLevel set to 4? Thank you.
      $ uname -ps
      Linux i686

      BTW, about the log file: in the last few days I found a bug in how the server handles it.
      Setting LogFileSize=0 in the zabbix server configuration to disable log rotation causes the zabbix server to freeze when the file reaches 2GB.
      Last edited by kloczek; 13-12-2007, 18:46.

      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        After changing to DebugLevel=4 on all agents I don't see anything new in the log output, and all agents are still very unstable (in one group of 30 computers the agent died on ~40% of them in the last 5 hours). Example of the end of a log with DebugLevel=4:

        13461:20071213:222235 Requested [vfs.fs.size[/var,pfree]]
        13461:20071213:222235 Sending back [22.960690]
        13458:20071213:222236 One child process died. Exiting ...
        13461:20071213:222236 Got signal. Exiting ...
        13459:20071213:222236 Got signal. Exiting ...
        13462:20071213:222236 Got signal. Exiting ...
        13463:20071213:222236 Got signal. Exiting ...
        13458:20071213:222236 zbx_on_exit() called.
        13458:20071213:222238 ZABBIX Agent stopped

        I have to go back to my last stable version.

        • Alexei
          Founder, CEO
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Sep 2004
          • 5654

          #5
          Please post at least 50-100 lines of the log file.

          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            27984:20071213:234650 Requested [system.cpu.util[,irq,]]
            27984:20071213:234650 Sending back [0.201804]
            27983:20071213:234650 Processing request.
            27983:20071213:234650 In check_security()
            27983:20071213:234650 Requested [vfs.dev.read[sda,operations]]
            27983:20071213:234650 Sending back [31847162]
            27982:20071213:234650 Processing request.
            27982:20071213:234650 In check_security()
            27982:20071213:234650 Requested [net.if.out[lo,bytes]]
            27982:20071213:234650 Sending back [96433]
            27984:20071213:234650 Processing request.
            27984:20071213:234650 In check_security()
            27984:20071213:234650 Requested [net.if.out[eth0,bytes]]
            27984:20071213:234650 Sending back [514254888]
            27983:20071213:234651 Processing request.
            27983:20071213:234651 In check_security()
            27983:20071213:234651 Requested [vfs.fs.size[/var/log,pfree]]
            27983:20071213:234651 Sending back [82.347821]
            27982:20071213:234652 Processing request.
            27982:20071213:234652 In check_security()
            27982:20071213:234652 Requested [system.cpu.load[,avg1]]
            27982:20071213:234652 Sending back [15.930000]
            27984:20071213:234652 Processing request.
            27984:20071213:234652 In check_security()
            27984:20071213:234652 Requested [net.if.out[eth1,bytes]]
            27984:20071213:234652 Sending back [192]
            27983:20071213:234652 Processing request.
            27983:20071213:234652 In check_security()
            27983:20071213:234652 Requested [system.cpu.util[,user,]]
            27983:20071213:234652 Sending back [1.554832]
            27982:20071213:234652 Processing request.
            27982:20071213:234652 In check_security()
            27982:20071213:234652 Requested [system.cpu.util[,wait,]]
            27982:20071213:234652 Sending back [80.141873]
            27984:20071213:234652 Processing request.
            27984:20071213:234652 In check_security()
            27984:20071213:234652 Requested [net.if.in[eth0,bytes]]
            27984:20071213:234652 Sending back [3188859023]
            27983:20071213:234653 Processing request.
            27983:20071213:234653 In check_security()
            27983:20071213:234653 Requested [vfs.dev.write[sda,sectors]]
            27983:20071213:234653 Sending back [2070953790]
            27982:20071213:234653 Processing request.
            27982:20071213:234653 In check_security()
            27982:20071213:234653 Requested [net.if.in[eth1,bytes]]
            27982:20071213:234653 Sending back [320]
            27984:20071213:234654 Processing request.
            27984:20071213:234654 In check_security()
            27984:20071213:234654 Requested [system.cpu.util[,idle,]]
            27984:20071213:234654 Sending back [14.572864]
            27983:20071213:234654 Processing request.
            27983:20071213:234654 In check_security()
            27983:20071213:234654 Requested [net.if.in[lo,bytes]]
            27983:20071213:234654 Sending back [96433]
            27982:20071213:234656 Processing request.
            27982:20071213:234656 In check_security()
            27982:20071213:234656 Requested [net.if.out[lo,bytes]]
            27982:20071213:234656 Sending back [96433]
            27984:20071213:234656 Processing request.
            27984:20071213:234656 In check_security()
            27984:20071213:234656 Requested [net.if.out[eth0,bytes]]
            27984:20071213:234656 Sending back [516267100]
            27983:20071213:234656 Processing request.
            27983:20071213:234656 In check_security()
            27983:20071213:234656 Requested [system.cpu.util[,system,]]
            27983:20071213:234656 Sending back [2.420319]
            27982:20071213:234656 Processing request.
            27982:20071213:234656 In check_security()
            27982:20071213:234656 Requested [vfs.dev.write[sda,operations]]
            27982:20071213:234656 Sending back [22340888]
            27984:20071213:234656 Processing request.
            27984:20071213:234656 In check_security()
            27984:20071213:234656 Requested [system.cpu.load[,avg1]]
            27984:20071213:234656 Sending back [15.450000]
            27983:20071213:234656 Processing request.
            27983:20071213:234656 In check_security()
            27983:20071213:234656 Requested [net.if.out[eth1,bytes]]
            27983:20071213:234656 Sending back [192]
            27982:20071213:234657 Processing request.
            27982:20071213:234657 In check_security()
            27982:20071213:234657 Requested [system.swap.in[all,pages]]
            27982:20071213:234657 Sending back [0]
            27984:20071213:234657 Processing request.
            27984:20071213:234657 In check_security()
            27984:20071213:234657 Requested [system.cpu.load[,avg5]]
            27984:20071213:234657 Sending back [18.340000]
            27983:20071213:234657 Processing request.
            27983:20071213:234657 In check_security()
            27983:20071213:234657 Requested [net.if.in[eth0,bytes]]
            27983:20071213:234657 Sending back [3194608139]
            27982:20071213:234658 Processing request.
            27982:20071213:234658 In check_security()
            27982:20071213:234658 Requested [system.cpu.util[,softirq,]]
            27982:20071213:234658 Sending back [0.827842]
            27984:20071213:234658 Processing request.
            27984:20071213:234658 In check_security()
            27984:20071213:234658 Requested [net.if.in[eth1,bytes]]
            27984:20071213:234658 Sending back [320]
            27985:20071213:234659 get_active_checks('192.168.1.106',10051)
            27983:20071213:234701 Processing request.
            27983:20071213:234701 In check_security()
            27983:20071213:234701 Requested [vfs.fs.inode[/,pfree]]
            27983:20071213:234701 Sending back [99.707275]
            27982:20071213:234702 Processing request.
            27982:20071213:234702 In check_security()
            27982:20071213:234702 Requested [system.swap.out[all,pages]]
            27982:20071213:234702 Sending back [0]
            27979:20071213:234702 One child process died. Exiting ...
            27982:20071213:234702 Got signal. Exiting ...
            27984:20071213:234702 Got signal. Exiting ...
            27983:20071213:234702 Got signal. Exiting ...
            27980:20071213:234702 Got signal. Exiting ...
            27979:20071213:234702 zbx_on_exit() called.
            27979:20071213:234704 ZABBIX Agent stopped

            • steria
              Junior Member
              • Jun 2007
              • 17

              #7
              Same here for Solaris 8 and HP-UX 11.11 pre-compiled binaries!

              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                I can confirm this problem. It happens only when the ZABBIX agent times out while getting the list of active checks for the first time. Setting the correct Server or enabling DisableActive should help.

                The problem is quite critical, so I think we will release an unscheduled 1.4.4 next Monday. Pre-compiled binaries will be updated as well.
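
For anyone who wants to check their own agent log against this explanation, here is a minimal sketch (not part of the Zabbix tooling) that lists the PIDs which appear in a log but never print "Got signal. Exiting ...". It assumes the `PID:YYYYMMDD:HHMMSS message` line format seen in the logs above; the function name `find_silent_children` and the shortened sample are only illustrative. On the tail of the log from post #6 it leaves the PID that called get_active_checks(), which is consistent with the diagnosis but does not prove it.

```python
import re

# One agent log line: "PID:YYYYMMDD:HHMMSS message" (format assumed from the thread).
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")


def find_silent_children(log_text):
    """Return PIDs seen in the log that are neither the parent process
    nor one of the children that logged 'Got signal' during shutdown."""
    seen = []          # PIDs in order of first appearance
    parent = None      # PID that reported the dead child
    signalled = set()  # children that shut down on a signal
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue
        pid, message = int(match.group(1)), match.group(4)
        if pid not in seen:
            seen.append(pid)
        if message.startswith("One child process died"):
            parent = pid
        elif message.startswith("Got signal"):
            signalled.add(pid)
    return [pid for pid in seen if pid != parent and pid not in signalled]


# Shortened tail of the log posted in #6.
SAMPLE = """\
27985:20071213:234659 get_active_checks('192.168.1.106',10051)
27982:20071213:234702 Requested [system.swap.out[all,pages]]
27982:20071213:234702 Sending back [0]
27979:20071213:234702 One child process died. Exiting ...
27982:20071213:234702 Got signal. Exiting ...
27984:20071213:234702 Got signal. Exiting ...
27983:20071213:234702 Got signal. Exiting ...
27980:20071213:234702 Got signal. Exiting ...
27979:20071213:234702 zbx_on_exit() called.
27979:20071213:234704 ZABBIX Agent stopped
"""

print(find_silent_children(SAMPLE))  # [27985]
```

A PID that shows up here is only a candidate for the child that died; a child that logged nothing at all before exiting would not be listed.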

                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by Alexei
                  I can confirm this problem. It happens only when the ZABBIX agent times out while getting the list of active checks for the first time. Setting the correct Server or enabling DisableActive should help.

                  The problem is quite critical, so I think we will release an unscheduled 1.4.4 next Monday. Pre-compiled binaries will be updated as well.
                  Wouldn't it be better to ask why the timeout happens in the first place?

                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist
                    Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #10
                    Originally posted by kloczek
                    Wouldn't it be better to ask why the timeout happens in the first place?
                    Because your agent is unable to connect to the ZABBIX server within 3 seconds, so it times out. Check the agent's configuration.
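
As a rough illustration of the settings mentioned in this thread, an agent configuration fragment might look like the sketch below. The parameter names Server, DisableActive and DebugLevel come from the posts above; the address is the one the agent tried to contact in the log from post #6 and is only a placeholder, and the value `1` for DisableActive is an assumption about how the option is switched on.

```ini
# zabbix_agentd.conf fragment, illustrative values only

# Address of the ZABBIX server the agent should talk to.
# Placeholder taken from the log in post #6; use your real server address.
Server=192.168.1.106

# Workaround until 1.4.4: turn off active checks so the agent never
# requests the active check list (value 1 assumed to mean "enabled").
DisableActive=1

# Verbose logging, as requested earlier in the thread for debugging.
DebugLevel=4
```

Only one of the two workarounds should be needed: either a Server value the agent can actually reach, or DisableActive.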

                    • bbrendon
                      Senior Member
                      • Sep 2005
                      • 870

                      #11
                      Originally posted by Alexei
                      I can confirm this problem. It happens only when the ZABBIX agent times out while getting the list of active checks for the first time. Setting the correct Server or enabling DisableActive should help.

                      The problem is quite critical, so I think we will release an unscheduled 1.4.4 next Monday. Pre-compiled binaries will be updated as well.
                      Sounds good. I haven't seen any show stoppers with 1.4.3 on the server side listed in the forums, so I'll probably give it a shot this weekend.
                      Unofficial Zabbix Expert
                      Blog, Corporate Site