1.4.5 process dies

  • bbrendon
    Senior Member
    • Sep 2005
    • 870

    #1

    1.4.5 process dies

    Died last night at about 1 AM. Disk IOwait didn't start spiking until about 2 AM.

    My cell phone wouldn't stop beeping, so I logged in from the phone and restarted the server. That lost the logs, but at least I was able to sleep. I'll have more info next time...

    Anyone else?
    Unofficial Zabbix Expert
    Blog, Corporate Site
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    I cannot believe that!

    May I ask you for one thing? Please run:

    cd src/zabbix_server/trapper
    grep sigaction trapper.c

    It should return two lines.

    Just after the release of 1.4.5, a wrong archive was uploaded and was available for 10-15 minutes. There is a chance (though it is highly unlikely) that you are using the wrong one.
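
For anyone who wants to repeat this check non-interactively, here is a minimal sketch. It only wraps the `grep` from the post above; the function name `check_trapper_source` and the wording of its output are made up for illustration.

```shell
#!/bin/sh
# check_trapper_source FILE
# Prints "ok" when exactly two lines of FILE mention sigaction (the count the
# post above says a correct 1.4.5 trapper.c should give), otherwise "suspect".
# The function name and messages are illustrative, not part of Zabbix.
check_trapper_source() {
    # grep -c prints the number of matching lines; it exits non-zero on
    # zero matches, so the status is ignored and the output is checked.
    n=$(grep -c sigaction "$1" 2>/dev/null) || true
    if [ -z "$n" ]; then
        echo "cannot read $1"
        return 2
    fi
    if [ "$n" -eq 2 ]; then
        echo "ok"
    else
        echo "suspect ($n matching lines)"
    fi
}
```

Usage, from the source tree: `check_trapper_source src/zabbix_server/trapper/trapper.c`.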
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga
    My Twitter


    • bbrendon
      Senior Member
      • Sep 2005
      • 870

      #3
      Code:
      $ grep -i  sigaction trapper.c
      Returns 0 lines

      I'm investigating...

      • bbrendon
        Senior Member
        • Sep 2005
        • 870

        #4
        I just re-downloaded the source and re-ran the grep. I got two lines!

        I didn't even download 1.4.5 right away... I ran 1.4.4 with patches for about 12 hours first. Odd... Maybe because the files are replicated?

        You should have released it as 1.4.5.1

        • Alexei
          Founder, CEO
          Zabbix Certified Trainer
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Sep 2004
          • 5654

          #5
          For your information, md5sum of:

          correct zabbix-1.4.5.tar.gz: f87d73852fdab33f99beebfd16c21c63
          faulty one: 44de68151dc103f4eeda69764a072bdc

          I bet you were the first one who downloaded 1.4.5!
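
A quick way to tell which archive you have is to compare its MD5 against the two sums quoted above. This is a minimal sketch, assuming GNU `md5sum` is installed; the function names are illustrative only.

```shell
#!/bin/sh
# MD5 sums quoted in the post above.
GOOD_MD5="f87d73852fdab33f99beebfd16c21c63"
BAD_MD5="44de68151dc103f4eeda69764a072bdc"

# classify_md5 HASH: prints "correct", "faulty", or "unknown".
classify_md5() {
    case "$1" in
        "$GOOD_MD5") echo "correct" ;;
        "$BAD_MD5")  echo "faulty" ;;
        *)           echo "unknown" ;;
    esac
}

# check_archive FILE: hashes FILE with md5sum and classifies the result.
# Returns 2 if md5sum cannot read the file.
check_archive() {
    sum=$(md5sum "$1") || return 2
    # md5sum prints "<hash>  <filename>"; keep only the hash.
    classify_md5 "${sum%% *}"
}
```

Usage: `check_archive zabbix-1.4.5.tar.gz`. An "unknown" result means the file matches neither sum, for example a truncated download.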

          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Originally posted by infinity005
            You should have released it as 1.4.5.1
            That's what I usually do in case of any pre-release mistakes. In this case I was quite confident no one downloaded the archive.

            • bbrendon
              Senior Member
              • Sep 2005
              • 870

              #7
              Originally posted by Alexei
              That's what I usually do in case of any pre-release mistakes. In this case I was quite confident no one downloaded the archive.
              Ummm... you know I'm on the ball right? I'm all over you like flies on shit. Lucky for you we're thousands of miles apart and it's only virtual stalking.

              Never underestimate your users/bugtrackers!

              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified Specialist, Zabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                Originally posted by infinity005
                Ummm... you know I'm on the ball right? I'm all over you like flies on shit. Lucky for you we're thousands of miles apart and it's only virtual stalking.
                I will send you a nice postcard next time ZABBIX crashes, if this ever happens

                • bbrendon
                  Senior Member
                  • Sep 2005
                  • 870

                  #9
                  Malfunctioned again!

                  This time no zabbix process died; the old problem is back. It seems the server stopped receiving data from all the agents.

                  Interestingly, it didn't happen at night. It happened right before noon today. I don't see any high disk IO either.

                  • Alexei
                    Founder, CEO
                    Zabbix Certified Trainer
                    Zabbix Certified Specialist, Zabbix Certified Professional
                    • Sep 2004
                    • 5654

                    #10
                    Can you telnet to ZABBIX server's 10051/TCP port? If so, then ZABBIX is up and accepting connections from the agents.
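
The telnet test can also be scripted so it runs at the moment the problem occurs. This is a rough sketch, assuming a netcat (`nc`) that supports `-z` (connect without sending data) and `-w` (timeout in seconds), as the OpenBSD and traditional variants do. The function name `trapper_status` and the 5-second default are arbitrary choices.

```shell
#!/bin/sh
# trapper_status HOST [PORT] [TIMEOUT_SECONDS]
# Prints "open" if a TCP connection to HOST:PORT succeeds, else "unreachable".
# PORT defaults to 10051, the server port named in the post above.
trapper_status() {
    host=$1
    port=${2:-10051}
    timeout=${3:-5}
    if nc -z -w "$timeout" "$host" "$port" >/dev/null 2>&1; then
        echo "open"
    else
        echo "unreachable"
    fi
}
```

Usage: `trapper_status zabbix.example.com` (the hostname is a placeholder). Note that "open" only shows that something accepts connections on the port; it does not prove the server is processing agent data.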

                    • xs-
                      Senior Member
                      Zabbix Certified Specialist
                      • Dec 2007
                      • 393

                      #11
                      @infinity005
                      Are you running a distributed setup? And if so, did any of the slave nodes show problems?


                      • bbrendon
                        Senior Member
                        • Sep 2005
                        • 870

                        #12
                        No. I don't run distributed. I still have not caught a copy of the logs. In some respect v1.4.4 was better because when it stopped working, it was always at night during high disk IO. Now it breaks at random times.

                        • Alexei
                          Founder, CEO
                          Zabbix Certified Trainer
                          Zabbix Certified Specialist, Zabbix Certified Professional
                          • Sep 2004
                          • 5654

                          #13
                          Originally posted by infinity005
                          No. I don't run distributed. I still have not caught a copy of the logs. In some respect v1.4.4 was better because when it stopped working, it was always at night during high disk IO. Now it breaks at random times.
                          I need actual evidence of this! This is the only report of instability of 1.4.5 so far. Do you run vanilla 1.4.5?

                          • bbrendon
                            Senior Member
                            • Sep 2005
                            • 870

                            #14
                            I'm running vanilla 1.4.5. The last time it broke, I didn't telnet to 10051 like you asked. I only have logs from before the break because they get overwritten by the healthy zabbix processes. I'm currently writing scripts to detect and work around the problem.

                            Since zabbix runs actions in the broken state, I'm going to disable actions and restart zabbix automatically, then re-enable actions. I should only have about 20 minutes of downtime every few days. I can also script in something to properly save logfiles.

                            • bbrendon
                              Senior Member
                              • Sep 2005
                              • 870

                              #15
                              Okay. I have an update! (Alexei, I'm also emailing you the logs)

                              When zabbix malfunctioned, nmap reported:
                              Code:
                              Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2008-04-02 05:15 PDT
                              Interesting ports on server:
                              PORT      STATE    SERVICE
                              10051/tcp filtered unknown
                              When it should have said:
                              Code:
                              PORT      STATE SERVICE
                              10051/tcp open  unknown
                              It malfunctioned at almost exactly 5 AM.
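
To log the port state automatically, the STATE column can be pulled out of the nmap output shown above. This is a small sketch that assumes the three-column `PORT STATE SERVICE` layout from those listings; `port_state` is an illustrative name.

```shell
#!/bin/sh
# port_state PORT
# Reads nmap output on stdin and prints the STATE column for PORT/tcp
# (for example "open" or "filtered"). Exits non-zero if the port is not listed.
port_state() {
    awk -v p="$1/tcp" '
        $1 == p { print $2; found = 1 }
        END     { exit !found }
    '
}
```

Usage: `nmap -p 10051 server | port_state 10051`, where `server` stands for the Zabbix server's hostname.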

                              Sar info shows IO wait at 43 %:
                              Code:
                              12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
                              04:55:01 AM       all      1.89      2.20      4.44      4.65      0.00     86.82
                              05:05:01 AM       all      2.17      4.15      8.85     43.28      0.00     41.54
                              05:15:01 AM       all      1.39      4.98      4.57      4.91      0.00     84.15
                              05:25:01 AM       all      3.17      2.96      5.20      5.17      0.00     83.50
                              My auto-fix script didn't work. Apparently command actions don't work when zabbix breaks, so I'll need a cron job that queries the database directly to detect it.
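
A cron-driven check along these lines might look like the sketch below. Everything specific in it is an assumption rather than something stated in this thread: a MySQL backend, a database named `zabbix`, a `history` table with a Unix-time `clock` column, credentials supplied through `~/.my.cnf`, and a 600-second staleness threshold.

```shell
#!/bin/sh
# Stall detector sketch. Assumed, not taken from the thread: MySQL backend,
# database "zabbix" (override with ZBX_DB), history.clock holding Unix time.

# is_stale LAST_CLOCK NOW MAX_AGE
# Succeeds when LAST_CLOCK is more than MAX_AGE seconds before NOW.
is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# latest_clock: prints the newest timestamp in the history table.
latest_clock() {
    mysql -N -B -e 'SELECT MAX(clock) FROM history' "${ZBX_DB:-zabbix}"
}

# check_once: prints "stale" or "fresh".
# Returns 2 when the query fails or yields no number (e.g. NULL).
check_once() {
    last=$(latest_clock) || return 2
    case "$last" in
        ''|*[!0-9]*) return 2 ;;
    esac
    if is_stale "$last" "$(date +%s)" "${MAX_AGE:-600}"; then
        echo "stale"
    else
        echo "fresh"
    fi
}
```

Run `check_once` from cron every few minutes and trigger the restart when it prints "stale". On a large `history` table the `MAX(clock)` query may be slow, so a narrower query may be preferable in practice.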

                              Agent log from the agent running on the zabbix server:
                              Code:
                               31944:20080402:045520 Active check [vfs.dev.write[sdc,operations]] is not supported. Disabled.
                               31944:20080402:050136 Timeout while answering request
                               31944:20080402:050139 Timeout while answering request
                               31944:20080402:050142 Timeout while answering request
                               31944:20080402:050145 Timeout while answering request
                               31944:20080402:050148 Timeout while answering request
                               31944:20080402:050151 Timeout while answering request
                               31944:20080402:050154 Timeout while answering request
                               31944:20080402:050157 Timeout while answering request
                               31944:20080402:050200 Timeout while answering request
                               31944:20080402:050203 Timeout while answering request
                               31944:20080402:050208 Timeout while answering request
                               31944:20080402:050211 Timeout while answering request
                               31944:20080402:050214 Timeout while answering request
                               31944:20080402:050217 Timeout while answering request
                               31944:20080402:050220 Timeout while answering request
                               31944:20080402:050223 Timeout while answering request
                               31944:20080402:050226 Timeout while answering request
                               31944:20080402:050226 Getting list of active checks failed. Will retry after 60 seconds
                               31944:20080402:050329 Timeout while answering request
                               31944:20080402:050447 Getting list of active checks failed. Will retry after 60 seconds
                               31944:20080402:050550 Timeout while answering request
                              I hope this helps! I'm going back to sleep!
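
To see when the timeouts cluster, the agent log above can be summarized per minute. This sketch assumes each line has the `PID:YYYYMMDD:HHMMSS message` shape seen in the excerpt; the function name is illustrative.

```shell
#!/bin/sh
# timeouts_per_minute [FILE...]
# Counts "Timeout while answering request" lines per minute.
# Reads the named files, or stdin when none are given.
# Output lines look like: YYYYMMDD HH:MM COUNT
timeouts_per_minute() {
    awk -F: '
        /Timeout while answering request/ {
            # $2 is the date; $3 starts with HHMMSS followed by the message.
            minute = $2 " " substr($3, 1, 2) ":" substr($3, 3, 2)
            count[minute]++
        }
        END { for (m in count) print m, count[m] }
    ' "$@" | sort
}
```

Usage: `timeouts_per_minute /path/to/zabbix_agentd.log` (the path is a placeholder; use whatever `LogFile` points to in your agent config).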