zabbix agent frequent crash

  • sunfire999
    Junior Member
    • Jan 2014
    • 2

    #1

    zabbix agent frequent crash

    Hi experts,
    I have a Zabbix agent 2.0.10 on AIX 6.1. Recently this agent has been crashing frequently. Below is the error log:

    20513712:20140117:151203.715 Got signal [signal:11(SIGSEGV),reason:51,refaddr:2f40000]. Crashing ...
    20513712:20140117:151203.715 ====== Fatal information: ======
    20513712:20140117:151203.715 program counter not available for this architecture
    20513712:20140117:151203.715 === Registers: ===
    20513712:20140117:151203.715 register dump not available for this architecture
    20513712:20140117:151203.715 === Backtrace: ===
    20513712:20140117:151203.715 backtrace not available for this platform
    20513712:20140117:151203.715 === Memory map: ===
    20513712:20140117:151203.715 memory map not available for this platform
    20513712:20140117:151203.715 ================================
    27263806:20140117:151203.717 One child process died (PID:20513712,exitcode/signal:255). Exiting ...
    27263806:20140117:151205.718 Zabbix Agent stopped. Zabbix 2.0.10 (revision 40809).

    I've also tried agent 2.2, but it has the same issue.

    Thanks for your help!
    Jason
  • jan.garaj
    Senior Member
    Zabbix Certified Specialist
    • Jan 2010
    • 506

    #2
    Do you know which child process of agent has a problem (listener, collector, active checks)?
    Check it after agent start by PID, which child process was terminated with exitcode/signal:255.
    Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
    My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant
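
The check suggested in #2 can be done mechanically when the agent log is long. Below is a minimal sketch, assuming the log lines look like the excerpts in this thread (`PID:date:time message`) and that each child announces its role in a `started[...]` line. The function name `crashed_child_roles` and both regular expressions are illustrative, not part of Zabbix.

```python
import re

# Assumed line shape, taken from the excerpts in this thread:
#   PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):\d{8}:\d{6}\.\d{3}\s+(.*)$")
# Matches both "started[listener]" and "started [listener #1]" spellings.
STARTED_RE = re.compile(r"started\s*\[([^\]]+)\]")
DIED_RE = re.compile(r"One child process died \(PID:(\d+),")


def crashed_child_roles(log_lines):
    """Return (pid, role) pairs for every child the main process reported as dead."""
    roles = {}    # pid -> role announced when the child started
    crashed = []
    for raw in log_lines:
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip lines that do not follow the assumed shape
        pid, message = m.groups()
        started = STARTED_RE.search(message)
        if started:
            roles[pid] = started.group(1).strip()
            continue
        died = DIED_RE.search(message)
        if died:
            dead_pid = died.group(1)
            crashed.append((dead_pid, roles.get(dead_pid, "unknown")))
    return crashed


if __name__ == "__main__":
    sample = [
        "20513712:20140117:142132.281 agent #1 started[listener]",
        "27263806:20140117:151203.717 One child process died "
        "(PID:20513712,exitcode/signal:255). Exiting ...",
    ]
    for pid, role in crashed_child_roles(sample):
        print(pid, role)
```

A role of `unknown` means the start line for that PID was not in the lines supplied, for example because the log was rotated.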

    • sunfire999
      Junior Member
      • Jan 2014
      • 2

      #3
      Thanks for your reply!

      20513712:20140117:142132.281 agent #1 started[listener]

      Originally posted by jan.garaj
      Do you know which child process of agent has a problem (listener, collector, active checks)?
      Check it after agent start by PID, which child process was terminated with exitcode/signal:255.

      • germanomm
        Junior Member
        • Jun 2014
        • 4

        #4
        Problem solved?

        Mr. sunfire999,

        Was this problem solved?

        I have the same problem with Zabbix Agent 2.2.1 on AIX 6.1.

        Can you help me?


        4718952:20140606:101809.347 Got signal [signal:11(SIGSEGV),reason:51,refaddr:2710000]. Crashing ...
        4718952:20140606:101809.347 ====== Fatal information: ======
        4718952:20140606:101809.347 program counter not available for this architecture
        4718952:20140606:101809.347 === Registers: ===
        4718952:20140606:101809.347 register dump not available for this architecture
        4718952:20140606:101809.347 === Backtrace: ===
        4718952:20140606:101809.347 backtrace not available for this platform
        4718952:20140606:101809.347 === Memory map: ===
        4718952:20140606:101809.347 memory map not available for this platform
        4718952:20140606:101809.347 ================================
        459200:20140606:101809.348 One child process died (PID:4718952,exitcode/signal:255). Exiting ...
        459200:20140606:101811.349 Zabbix Agent stopped. Zabbix 2.0.0 (revision 27675)

        The process with the problem is the "listener".

        • shadow
          Junior Member
          • Aug 2014
          • 3

          #5
          Same Issue here

          I have the same issue as well: the AIX zabbix_agentd crashes. It appears to happen only on systems that have Oracle ASM disks, and the crash occurs when the agent runs vfs checks.


          66715830:20140806:114839.754 listener #3 [processing request]
          66715830:20140806:114839.754 Requested [vfs.fs.discovery]
          48103484:20140806:114839.754 In send_buffer() host:'10.251.199.18' port:10051 values:0/100
          66715830:20140806:114839.754 Got signal [signal:11(SIGSEGV),reason:51,refaddr:1680000]. Crashing ...
          48103484:20140806:114839.754 End of send_buffer():SUCCEED
          66715830:20140806:114839.754 ====== Fatal information: ======
          48103484:20140806:114839.754 active checks #1 [idle 1 sec]
          66715830:20140806:114839.754 program counter not available for this architecture
          66715830:20140806:114839.755 === Registers: ===
          66715830:20140806:114839.755 register dump not available for this architecture
          66715830:20140806:114839.755 === Backtrace: ===
          66715830:20140806:114839.755 backtrace not available for this platform
          66715830:20140806:114839.755 === Memory map: ===
          66715830:20140806:114839.755 memory map not available for this platform
          66715830:20140806:114839.755 ================================
          41812044:20140806:114839.762 One child process died (PID:66715830,exitcode/signal:255). Exiting ...
          41812044:20140806:114839.762 zbx_on_exit() called
          48562258:20140806:114839.765 Got signal [signal:15(SIGTERM),sender_pid:41812044,sender_uid: 705,reason:0]. Exiting ...
          65142810:20140806:114839.765 Got signal [signal:15(SIGTERM),sender_pid:41812044,sender_uid: 705,reason:0]. Exiting ...
          48103484:20140806:114839.767 Got signal [signal:15(SIGTERM),sender_pid:41812044,sender_uid: 705,reason:0]. Exiting ...
          6095320:20140806:114839.772 Got signal [signal:15(SIGTERM),sender_pid:41812044,sender_uid: 705,reason:0]. Exiting ...
          41812044:20140806:114839.774 Zabbix Agent stopped. Zabbix 2.2.5 (revision 47411).
          41812044:20140806:114839.775 In unload_modules()


          Any thoughts as to what to check? I have tried numerous versions of the agent with the same issue.
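
Because several agent processes write to the same log, the `Requested [...]` line has to be matched to the crashing PID rather than read off the line just above the SIGSEGV. The following is a rough sketch under the assumption that the log has the shape shown above and that `Requested [...]` lines are present (they appear here alongside other debug-level output). The helper name `last_request_before_segv` is made up for illustration.

```python
import re

# Assumed line shape, as in the excerpts above: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):\d{8}:\d{6}\.\d{3}\s+(.*)$")
# Greedy match so keys with their own brackets, e.g. vfs.fs.size[/,free], stay whole.
REQUESTED_RE = re.compile(r"^Requested \[(.+)\]$")


def last_request_before_segv(log_lines):
    """Map each PID that logged SIGSEGV to the last item key that PID logged as requested."""
    last_key = {}   # pid -> most recent requested key
    result = {}     # pid -> key in flight when SIGSEGV was logged (None if not seen)
    for raw in log_lines:
        m = LINE_RE.match(raw)
        if not m:
            continue
        pid, message = m.groups()
        req = REQUESTED_RE.match(message.strip())
        if req:
            last_key[pid] = req.group(1)
        elif "Got signal [signal:11(SIGSEGV)" in message:
            result[pid] = last_key.get(pid)
    return result


if __name__ == "__main__":
    sample = [
        "66715830:20140806:114839.754 listener #3 [processing request]",
        "66715830:20140806:114839.754 Requested [vfs.fs.discovery]",
        "48103484:20140806:114839.754 In send_buffer() host:'10.251.199.18' port:10051 values:0/100",
        "66715830:20140806:114839.754 Got signal [signal:11(SIGSEGV),reason:51,refaddr:1680000]. Crashing ...",
    ]
    print(last_request_before_segv(sample))
```

Run over a longer log, a result that always names the same key would support the suspicion that one specific check triggers the crash.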

          • germanomm
            Junior Member
            • Jun 2014
            • 4

            #6
            Hello!

            I changed the StartAgents value.

            StartAgents=9

            That solved it for me.
            You can try increasing this value.


            bye

            • shadow
              Junior Member
              • Aug 2014
              • 3

              #7
              No Luck

              Thanks for the response. Unfortunately that did not resolve the crashing issue.


              22151224:20140807:093320.311 Requested [vfs.fs.discovery]
              22151224:20140807:093320.312 Got signal [signal:11(SIGSEGV),reason:51,refaddr:1680000]. Crashing ...
              22151224:20140807:093320.312 ====== Fatal information: ======
              22151224:20140807:093320.312 program counter not available for this architecture
              22151224:20140807:093320.312 === Registers: ===
              22151224:20140807:093320.312 register dump not available for this architecture
              22151224:20140807:093320.312 === Backtrace: ===
              22151224:20140807:093320.312 backtrace not available for this platform
              22151224:20140807:093320.312 === Memory map: ===
              22151224:20140807:093320.312 memory map not available for this platform
              22151224:20140807:093320.312 ================================
              11338210:20140807:093320.313 One child process died (PID:22151224,exitcode/signal:255). Exiting ...
              11338210:20140807:093320.313 zbx_on_exit() called
              59441174:20140807:093320.313 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              13369672:20140807:093320.313 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              2359742:20140807:093320.314 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              20512822:20140807:093320.315 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              6488544:20140807:093320.316 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              4719006:20140807:093320.316 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              36503664:20140807:093320.317 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              38797522:20140807:093320.317 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              64553206:20140807:093320.318 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              66519086:20140807:093320.318 Got signal [signal:15(SIGTERM),sender_pid:11338210,sender_uid: 705,reason:0]. Exiting ...
              11338210:20140807:093320.322 Zabbix Agent stopped. Zabbix 2.2.5 (revision 47411).
              11338210:20140807:093320.322 In unload_modules()

              • germanomm
                Junior Member
                • Jun 2014
                • 4

                #8
                Try it!

                Hello again!

                Sorry... my memory isn't good...

                Another thing I did was disable the "Discovery Rules" on the AIX host from the web interface.

                Try it!
                If that solves the problem, you can enable them once a week to update the discovered items and triggers, and then disable them again.

                Try it and let us know whether it worked, please!
                Last edited by germanomm; 25-08-2014, 15:30.
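
The weekly enable/disable routine described in #8 could also be scripted against the Zabbix JSON-RPC API instead of clicking through the web interface. This is only a sketch of the request body: it assumes the `discoveryrule.update` method with `status` set to 1 (disabled) or 0 (enabled), and a session token passed in the `auth` field as the API of that era accepted it; newer releases may expect the token elsewhere. The function name, the token, and the rule ID below are placeholders, not values from this thread.

```python
import json


def lld_rule_status_request(auth_token, rule_itemid, disabled=True, request_id=1):
    """Build a JSON-RPC body that enables or disables one low-level discovery rule."""
    return {
        "jsonrpc": "2.0",
        "method": "discoveryrule.update",
        "params": {
            "itemid": str(rule_itemid),
            "status": 1 if disabled else 0,  # 1 = disabled, 0 = enabled
        },
        "auth": auth_token,  # assumption: token obtained earlier via user.login
        "id": request_id,
    }


if __name__ == "__main__":
    # Placeholder token and rule ID; nothing is sent over the network here.
    body = lld_rule_status_request("PLACEHOLDER_TOKEN", 12345, disabled=True)
    print(json.dumps(body, indent=2))
```

The body would be POSTed to the frontend's `api_jsonrpc.php` endpoint; sending is left out so the sketch stays independent of any particular server.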

                • shadow
                  Junior Member
                  • Aug 2014
                  • 3

                  #9
                  That worked.

                  So it keeps that agent running... so far. Thanks for the advice. I guess the real question is whether someone is looking into why this discovery check causes the agent to crash on some AIX systems. This is just a workaround and should not be considered a long-term solution. Should I be submitting this issue elsewhere?

                  Thanks again. I can at least move forward with my AIX deployments.

                  • germanomm
                    Junior Member
                    • Jun 2014
                    • 4

                    #10
                    Of course, it isn't a solution to the problem =/

                    I'm not a developer; I'm just a Zabbix user too.

                    • jad.baz
                      Junior Member
                      • Dec 2019
                      • 13

                      #11
                      6 years later, I'm having the same issue with Zabbix agent 4.4.7.
                      I'm running 500 Zabbix agent Docker containers on each of 4 different machines.
                      The servers crashed within a few hours of each other, completely independently.
                      On each of these servers, all containers stopped at the same time.
                      I get something like this:

                      Code:
                      [root@zabbix-load-client /]# for i in `seq 1 5`; do echo "*** $i ***"; docker logs -t --tail 3 load_test_agent_$i; echo; done
                      *** 1 ***
                      2020-04-29T07:48:49.806430000Z     71:20200429:074849.754 active check data upload to [104.239.234.21:10051] is working again
                      2020-04-29T09:47:44.707973000Z      8:20200429:094744.048 Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...
                      2020-04-29T09:47:44.708355000Z      8:20200429:094744.053 Zabbix Agent stopped. Zabbix 4.4.7 (revision 77fb8c7).
                      
                      *** 2 ***
                      2020-04-29T07:49:48.764472000Z     71:20200429:074948.751 active check data upload to [104.239.234.21:10051] is working again
                      2020-04-29T09:47:43.428541000Z      8:20200429:094743.366 Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...
                      2020-04-29T09:47:43.431914000Z      8:20200429:094743.372 Zabbix Agent stopped. Zabbix 4.4.7 (revision 77fb8c7).
                      
                      *** 3 ***
                      2020-04-29T07:48:49.003615000Z     71:20200429:074849.002 active check data upload to [104.239.234.21:10051] is working again
                      2020-04-29T09:47:42.381715000Z      8:20200429:094742.308 Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...
                      2020-04-29T09:47:42.382115000Z      8:20200429:094742.310 Zabbix Agent stopped. Zabbix 4.4.7 (revision 77fb8c7).
                      
                      *** 4 ***
                      2020-04-29T07:48:48.781917000Z     71:20200429:074848.780 active check data upload to [104.239.234.21:10051] is working again
                      2020-04-29T09:47:44.725097000Z      8:20200429:094744.190 Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...
                      2020-04-29T09:47:44.725819000Z      8:20200429:094744.193 Zabbix Agent stopped. Zabbix 4.4.7 (revision 77fb8c7).
                      
                      *** 5 ***
                      2020-04-29T07:48:49.767545000Z     71:20200429:074849.717 active check data upload to [104.239.234.21:10051] is working again
                      2020-04-29T09:47:43.429339000Z      8:20200429:094743.365 Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...
                      2020-04-29T09:47:43.431570000Z      8:20200429:094743.371 Zabbix Agent stopped. Zabbix 4.4.7 (revision 77fb8c7).
                      There's nothing peculiar in either the server logs or the client logs. Here, for example, the log says the upload "is working again", then 2 hours later the agent crashes (completely unrelated).

                      I suspect this is related to a reboot I did on the server, because the agents had been running for 2 weeks with no issues. Suddenly, the containers on 4 machines stopped within hours of each other.
                      However, the agents crashed 1-3 hours after the reboot, so it doesn't seem to be directly related, but it must be related somehow.
                      Last edited by jad.baz; 29-04-2020, 14:21.
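
One detail worth separating out: the AIX logs earlier in the thread report signal 11 (SIGSEGV) raised inside a child process, while the container logs in #11 report signal 15 (SIGTERM) with `sender_pid:1`, which reads more like the agent being asked to stop by PID 1 in its namespace than a segfault. A small sketch for pulling those fields out of `Got signal` lines follows; the function name `parse_signal` and the regular expression are illustrative and based only on the line shapes shown in this thread.

```python
import re

# Based on the two shapes seen in this thread:
#   Got signal [signal:11(SIGSEGV),reason:51,refaddr:...]
#   Got signal [signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]
SIGNAL_RE = re.compile(r"Got signal \[signal:(\d+)\((\w+)\)(?:,sender_pid:(\d+))?")


def parse_signal(line):
    """Return signal number, name and sender PID (if logged) from a 'Got signal' line."""
    m = SIGNAL_RE.search(line)
    if not m:
        return None
    number, name, sender = m.groups()
    return {
        "number": int(number),
        "name": name,
        "sender_pid": int(sender) if sender is not None else None,
    }


if __name__ == "__main__":
    lines = [
        "20513712:20140117:151203.715 Got signal "
        "[signal:11(SIGSEGV),reason:51,refaddr:2f40000]. Crashing ...",
        "2020-04-29T09:47:44.707973000Z      8:20200429:094744.048 Got signal "
        "[signal:15(SIGTERM),sender_pid:1,sender_uid:999,reason:0]. Exiting ...",
    ]
    for line in lines:
        print(parse_signal(line))
```

If every container reports SIGTERM from the same sender at nearly the same time, the next place to look is probably whatever stopped the containers on the host rather than the agent itself.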
