agent death

  • alj
    Senior Member
    • Aug 2006
    • 188

    #1

    agent death

    Can the Zabbix agent be made more bulletproof? For example, the death of one child process should not kill the whole monitoring agent; the agent could just restart that child. The same goes for the server: exhaustion of MySQL connections or a simple MySQL server restart kills zabbix-server instantly.

    Otherwise monitoring becomes the least stable software on the server, when it should be the opposite.
    Last edited by alj; 22-12-2006, 00:49.
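
The behaviour asked for above (restart a dead child instead of taking the whole daemon down) can be sketched as a small supervisor loop. This is only an illustration in Python, not how zabbix_agentd is implemented: `CHILD_CODE`, `spawn_child` and `supervise` are hypothetical names, and the child is a stand-in process that simply exits with a chosen status.

```python
import subprocess
import sys

# Hypothetical stand-in for one agent child: exits with the status passed on argv.
CHILD_CODE = "import sys; sys.exit(int(sys.argv[1]))"


def spawn_child(exit_code):
    """Start one simulated child process that will exit with exit_code."""
    return subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(exit_code)])


def supervise(exit_codes, max_restarts=3):
    """Keep one child slot alive, restarting it when it exits abnormally.

    exit_codes lists the statuses successive incarnations of the child will
    exit with (simulation only). Returns the number of restarts performed.
    """
    codes = iter(exit_codes)
    restarts = 0
    child = spawn_child(next(codes, 0))
    while True:
        status = child.wait()  # wait() reaps the child, so no <defunct> entry remains
        if status == 0:
            return restarts  # clean exit: nothing to restart
        if restarts >= max_restarts:
            raise RuntimeError("child keeps dying; giving up after %d restarts" % restarts)
        restarts += 1
        child = spawn_child(next(codes, 0))


if __name__ == "__main__":
    print("restarts needed:", supervise([1, 1, 0]))
```

The restart cap is a design choice: without it, a child that crashes immediately on start would be respawned in a tight loop.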
  • alj
    Senior Member
    • Aug 2006
    • 188

    #2
    More agent death-o-rama:

    Code:
    ------------------------
    24128 ?        SN     0:00 /usr/sbin/zabbix_agentd
    24129 ?        ZN     0:12 [zabbix_agentd] <defunct>
    24130 ?        ZN     0:13 [zabbix_agentd] <defunct>
    24132 ?        ZN     0:12 [zabbix_agentd] <defunct>
    24133 ?        ZN     0:13 [zabbix_agentd] <defunct>
    24134 ?        ZN     0:00 [zabbix_agentd] <defunct>
    
    
    # kill -9 24128 ; /etc/init.d/zabbix-agent restart
    Stopping Zabbix agent: zabbix_agentd
    No /usr/sbin/zabbix_agentd found running; none killed.
    Starting Zabbix agent: zabbix_agentd
    
    # tail /var/log/zabbix-agent/zabbix_agentd.log
    005042:20061225:062503 zabbix_agentd started. ZABBIX 1.1.4.
    005042:20061225:062503 Cannot bind to port 10050. Error [Address already in use] . Another zabbix_agentd already running ?
    009962:20061225:161643 zabbix_agentd started. ZABBIX 1.1.4.
    009963:20061225:161643 zabbix_agentd 9963 started
    009964:20061225:161643 zabbix_agentd 9964 started
    009965:20061225:161643 zabbix_agentd 9965 started
    009966:20061225:161643 zabbix_agentd 9966 started
    009967:20061225:161643 zabbix_agentd 9967 started
    
    ----------------------
    025038:20061226:062503 Got signal. Exiting ...
    030876:20061226:062503 zabbix_agentd started. ZABBIX 1.1.4.
    025042:20061226:062503 Got signal. Exiting ...
    025039:20061226:062503 Got signal. Exiting ...
    030876:20061226:062503 Cannot bind to port 10050. Error [Address already in use]. Another zabbix_agentd already running ?
    025040:20061226:062503 Got signal. Exiting ...
    025044:20061226:062503 Got signal. Exiting ...
    025041:20061226:062503 Got signal. Exiting ...
    ----------------------
    002503:20061221:144313 Got signal. Exiting ...
    002498:20061221:144313 One child process died. Exiting ...
    002500:20061221:144313 Got signal. Exiting ...
    002501:20061221:144313 Got signal. Exiting ...
    002502:20061221:144313 Got signal. Exiting ...
    002499:20061221:144313 Got signal. Exiting ...
    003276:20061221:150528 zabbix_agentd started. ZABBIX 1.1.4.
    003277:20061221:150528 zabbix_agentd 3277 started
    003278:20061221:150528 zabbix_agentd 3278 started
    003279:20061221:150528 zabbix_agentd 3279 started
    003280:20061221:150528 zabbix_agentd 3280 started
    003281:20061221:150528 zabbix_agentd 3281 started
    003301:20061221:150528 zabbix_agentd started. ZABBIX 1.1.4.
    003301:20061221:150528 Cannot bind to port 10050. Error [Address already in use]. Another zabbix_agentd already running ?
    003281:20061221:150528 Got signal. Exiting ...
    003280:20061221:150528 Got signal. Exiting ...
    003279:20061221:150528 Got signal. Exiting ...
    003278:20061221:150528 Got signal. Exiting ...
    003277:20061221:150528 Got signal. Exiting ...
    003276:20061221:150528 One child process died. Exiting ...
    ----------------------
    032426:20061221:151159 zabbix_agentd 32426 started
    032427:20061221:151159 zabbix_agentd 32427 started
    032447:20061221:151159 zabbix_agentd started. ZABBIX 1.1.4.
    032447:20061221:151159 Cannot bind to port 10050. Error [Address already in use]. Another zabbix_agentd already running ?
    032427:20061221:151159 Got signal. Exiting ...
    032426:20061221:151159 Got signal. Exiting ...
    032425:20061221:151159 Got signal. Exiting ...
    032424:20061221:151159 Got signal. Exiting ...
    032423:20061221:151159 Got signal. Exiting ...
    032422:20061221:151159 One child process died. Exiting ...
    Please make an attempt to fix this, because it makes the whole monitoring system unusable. People redirect the monitoring alias to /dev/null because of the false positives.

    This is the latest 1.1.4 that comes with Debian etch. No custom compiling, simple config, etc.

    The agent dies every night during log rotation.

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      Hmm, fix what?
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga

      • crayons
        Junior Member
        • Oct 2006
        • 21

        #4
        Monitoring alias? What are you doing? Have you got some crazy stuff restarting things from crontab or something? I do not think this is a Zabbix problem. I do not have this issue. I have never had a problem with the Zabbix agent or the Zabbix server dying.

        • kassec
          Junior Member
          • Dec 2007
          • 13

          #5
          1.5 RC zabbix_server dies on mysql problems

          At the very least, zabbix_server (1.5 RC) definitely dies on some MySQL problems (such as a "too many connections" error). I saw this behavior an hour ago.
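
One common way for a daemon to ride out a transient database error such as "too many connections" is to retry the connection with exponential backoff instead of exiting. The sketch below is a generic Python illustration, not Zabbix code: `TransientDBError`, `connect_with_backoff` and the injected `connect` callable are all hypothetical names.

```python
import time


class TransientDBError(Exception):
    """Hypothetical marker for recoverable errors such as 'too many connections'."""


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0, max_delay=30.0,
                         sleep=time.sleep):
    """Call connect() until it succeeds, doubling the pause after each failure.

    connect is any zero-argument callable that returns a connection or raises
    TransientDBError. The final failure is re-raised so the caller can decide
    whether to exit.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TransientDBError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle it
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the pause between attempts
```

The `sleep` parameter exists only so the pause can be replaced in tests; a real daemon would also want to log each failed attempt.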

          • noxis
            Senior Member
            • Aug 2007
            • 145

            #6
            Rule of Thumb: Do not install Zabbix from the packaging systems. Especially those of Debian or Ubuntu which are maintained by utter 'tards.

            I suspect what is happening is that "logrotate" is rotating the Zabbix logs and executing "/etc/init.d/zabbix-agentd restart". This will break your Zabbix agent unless you have hacked a sleep in there, because the script attempts to start the agent instantly after stopping it, giving the old agent no time to shut itself down.

            In fact, you do not need to restart the Zabbix agent when its logs are rotated, as Alexei+Krew made it clever enough not to crap itself should its log disappear.
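
Instead of a fixed sleep, a restart wrapper could wait until the old agent has actually released its listening port before starting the new one. A minimal Python sketch, assuming the agent listens on TCP port 10050 as in the log above; `port_is_free` and `wait_for_port` are hypothetical helper names, not part of any Zabbix tooling.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "Address already in use"
        return True


def wait_for_port(port, timeout=10.0, interval=0.2, host="127.0.0.1"):
    """Poll until the port can be bound or the timeout expires.

    Returns True if the port became free, False if it was still busy.
    """
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # A restart script would start the new agent only when this reports True.
    print("port 10050 free:", wait_for_port(10050, timeout=1.0))
```

The check is inherently racy (another process could grab the port between the probe and the real start), so it narrows the window rather than closing it.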

            • Alexei
              Founder, CEO
              Zabbix Certified Trainer
              Zabbix Certified Specialist
              Zabbix Certified Professional
              • Sep 2004
              • 5654

              #7
              Originally posted by kassec
              At the very least, zabbix_server (1.5 RC) definitely dies on some MySQL problems (such as a "too many connections" error). I saw this behavior an hour ago.
              Which is fine I guess? I just do not see how ZABBIX may recover nicely from this.

              • noxis
                Senior Member
                • Aug 2007
                • 145

                #8
                Originally posted by Alexei
                Which is fine I guess? I just do not see how ZABBIX may recover nicely from this.
                Built in "action" of "sudo /etc/init.d/mysql-server restart"

                • Alexei
                  Founder, CEO
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist
                  Zabbix Certified Professional
                  • Sep 2004
                  • 5654

                  #9
                  Originally posted by noxis
                  Built in "action" of "sudo /etc/init.d/mysql-server restart"
                  ...and what if there are still no MySQL connections available?

                  • noxis
                    Senior Member
                    • Aug 2007
                    • 145

                    #10
                    Originally posted by Alexei
                    ...and what if there are still no MySQL connections available?
                    kill -9 damn it!

                    Seriously, what I'd suggest is the ability for the Zabbix server to fall back to another database server. If master<->master replication were implemented and the Zabbix daemon were aware of the other MySQL server, it would move Zabbix towards HA with very little work.
                    Last edited by noxis; 19-03-2008, 12:41.
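
The fallback idea in this post amounts to trying a list of database hosts in order and using the first one that accepts a connection. A minimal Python sketch of that client-side logic only: `connect_first_available`, the host names and the injected `connect` callable are hypothetical, and nothing here addresses keeping the two databases consistent.

```python
def connect_first_available(hosts, connect):
    """Try each database host in order and return (host, connection).

    connect is any callable taking a host name and either returning a
    connection or raising ConnectionError. If every host fails, a single
    ConnectionError naming the hosts that were tried is raised.
    """
    failed = []
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectionError:
            failed.append(host)  # remember the failure and try the next host
    raise ConnectionError("no database host reachable: " + ", ".join(failed))
```

Whether this belongs in the application or in the database layer is exactly the disagreement in the replies that follow; the sketch shows the first option only.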

                    • Alexei
                      Founder, CEO
                      Zabbix Certified Trainer
                      Zabbix Certified Specialist
                      Zabbix Certified Professional
                      • Sep 2004
                      • 5654

                      #11
                      Originally posted by noxis
                      kill -9 damn it!

                      Seriously, what I'd suggest is the ability for the Zabbix server to fall back to another database server. If master<->master replication were implemented and the Zabbix daemon were aware of the other MySQL server, it would move Zabbix towards HA with very little work.
                      I really do believe that the ZABBIX server should not worry about switching from one database to another. HA must be implemented at the MySQL level.

                      • noxis
                        Senior Member
                        • Aug 2007
                        • 145

                        #12
                        Originally posted by Alexei
                        I really do believe that the ZABBIX server should not worry about switching from one database to another. HA must be implemented at the MySQL level.
                        Well, the official MySQL clustering is not very good and is definitely beyond the reach of most people (cost of hardware, etc.). HA can be achieved cheaply with MySQL Proxy and master-master replication, or with DRBD.

                        But the best optimisation would be achieved within Zabbix itself. Even small things, such as splitting out read and write transactions so that slaves can be used, would be a massive head start and, again, pretty trivial to implement.
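
Splitting reads from writes, as suggested here, comes down to a routing decision per statement. The Python sketch below shows one simplistic way to make that decision; `READ_ONLY_VERBS` and `pick_target` are hypothetical names, and the "primary"/"replica" labels stand in for real connections.

```python
# Statements starting with these verbs are treated as read-only (assumption).
READ_ONLY_VERBS = ("select", "show")


def pick_target(sql, in_transaction=False):
    """Return "replica" for plain reads and "primary" for everything else.

    Anything inside a transaction, any locking read (SELECT ... FOR UPDATE)
    and any statement that is not clearly read-only goes to the primary.
    """
    statement = sql.strip().lower()
    if in_transaction or not statement:
        return "primary"
    if statement.split(None, 1)[0] not in READ_ONLY_VERBS:
        return "primary"
    if statement.rstrip(";").rstrip().endswith("for update"):
        return "primary"
    return "replica"
```

Routing by the first keyword is deliberately conservative; it ignores replication lag, so a read issued immediately after a write may not see that write on the replica.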
