Zabbix 1.4.4 malfunctions/dies, nothing found in logs yet

  • bbrendon
    Senior Member
    • Sep 2005
    • 870

    #1

    Recently my Zabbix server has been 'freaking out' about once a week. Suddenly all of my nodata triggers go off, and I mean all of them. Everyone gets about 30 text messages before someone runs over to the Zabbix server and purges the postfix email queue. Once that is done, we disable all the actions and run "/etc/init.d/zabbix-server restart". It's fine again for about another week, and then the story repeats.

    I've looked through the logs and haven't found anything. Any ideas as to what I might look for to resolve this?
    Last edited by bbrendon; 06-02-2008, 20:25.
    Unofficial Zabbix Expert
    Blog, Corporate Site
  • bbrendon
    Senior Member
    • Sep 2005
    • 870

    #2
    This may be related.


    It used to happen about once a week. It has now happened for the past two nights. It always happens late at night, but not at the same time.

    • xs-
      Senior Member
      Zabbix Certified Specialist
      • Dec 2007
      • 393

      #3
      I have the same problem. The zabbix_server processes keep running (some at 80%-100% CPU usage), all nodata triggers fire, and it seems almost no data is received.
      I am using the PostgreSQL backend. I have searched the log files for problems but have found nothing so far.

      Setup:
      Master node with 3 remote (distributed)
      DB server: 2x dual core xeon, 4G ram
      Zabbix server: 1x dualcore xeon, 3G ram (also runs webfrontend)
      The master server has about 550 monitored hosts; the nodes send data for roughly 150 hosts.
      Our main problem is disk utilization on the database server, caused by a lack of disk spindles. We only have 2 disks in mirror mode.

      One thing we noticed is that the database server had a load of 8+.
      We've been busy trying to increase performance on the database server, with some success: the load is now 2.5-4, with much better response times and memory usage.
      So far (fingers crossed) it seems to have helped, but I'll wait until Monday before I start to cheer.

      For those who are interested, these are the PostgreSQL tweaks.
      postgresql.conf:
      - work_mem = 4MB # don't know if this helped
      - sync = off # don't know if this helped
      - checkpoint_segments=6 # this helped!
      - enable_seqscan = off # I think this helped!

      We also remounted PostgreSQL's database partition with the noatime option:
      mount -o remount,noatime <mountpoint>

      The last thing we did was lower the number of zabbix_server processes:
      StartPollers=3
      StartPollersUnreachable=1
      StartTrappers=3
      StartPingers=1
      StartDiscoverers=0 # we don't use discovery
      StartHTTPPollers=1
      This way you will have fewer concurrent connections and thus fewer concurrent queries, which should shorten the queues.


      The following is also interesting to look at.
      Modify or uncomment this line in postgresql.conf:
      log_min_duration_statement = 2000
      All queries with an execution time over 2 seconds will then show up in the log file. The changes above should lower the number of queries that show up.
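To make that log easier to scan, here is a minimal sketch that pulls the slow statements out of a PostgreSQL log. It assumes the usual "duration: N ms  statement: ..." message shape; the text in front of it depends on log_line_prefix and is ignored. The function name slow_statements is made up for this example.

```python
import re

# Assumed message shape: "... LOG:  duration: 2345.678 ms  statement: SELECT ..."
# Whatever precedes "duration:" depends on log_line_prefix and is ignored here.
SLOW_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+statement: (.*)")


def slow_statements(lines, min_ms=2000.0):
    """Return (duration_ms, statement) pairs at or above min_ms, slowest first."""
    found = []
    for line in lines:
        m = SLOW_RE.search(line)
        if m and float(m.group(1)) >= min_ms:
            found.append((float(m.group(1)), m.group(2).strip()))
    return sorted(found, reverse=True)
```

It takes any iterable of lines, so an open log file can be passed in directly, for example slow_statements(open("postgresql.log")), where the file name is only a placeholder.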


      Hope this helps


      • bbrendon
        Senior Member
        • Sep 2005
        • 870

        #4
        I'm running MySQL. If we're having the same problem, at least we know it's not the database portion.

        • xs-
          Senior Member
          Zabbix Certified Specialist
          • Dec 2007
          • 393

          #5
          OK, so the changes above, plus reindexing the database, did lower the DB server load considerably, but the problem still remains, although it happens less often.
          So it's indeed not DB related.

          To sum things up:
          - Not related to database type or speed (so far)
          - All nodata triggers fire during the problem, which tells us:
          --- The trapper doesn't receive data from active checks
          --- Trigger evaluation and alert script execution still work
          - Normal polls, e.g. SNMP checks, still work (need to confirm this though)
          - Between 1 and 4 processes (different each time) show up in top using 50%-100% CPU. 0.5% is normal.

          So it seems the issue lies with the trapper portion of the zabbix_server.

          @developers
          Any known issues, does the above ring a bell?
          Last edited by xs-; 10-03-2008, 11:26.


          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
            Zabbix Certified Specialist
            Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Originally posted by xs-
            - between 1 and 4 processes (different each time) will show up in top using 50%-100% cpu usage. 0.5% is normal.
            It looks very much like a problem we fixed in pre-1.4.5. Under some circumstances, on connection loss, a ZABBIX trapper process may go into an endless loop calling the accept() system call.

            You may run strace -p <process PID> to see what the 100% CPU process is actually doing.
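For readers unfamiliar with this failure mode, the sketch below shows in Python how an accept loop can avoid spinning once its listening descriptor has gone bad. It is an illustration only and not the Zabbix source or the actual 1.4.5 fix; accept_or_stop and serve are names made up for this example.

```python
import errno
import socket

# Errors that mean the listening socket itself is unusable; retrying
# accept() on it would fail the same way every time and burn CPU.
FATAL_ACCEPT_ERRNOS = (errno.EBADF, errno.ENOTSOCK, errno.EINVAL)


def accept_or_stop(listener):
    """Return (conn, addr), or None when the listening socket is unusable."""
    try:
        return listener.accept()
    except OSError as exc:
        if exc.errno in FATAL_ACCEPT_ERRNOS:
            return None
        raise


def serve(listener, handle):
    """Accept connections until the listener becomes unusable."""
    while True:
        accepted = accept_or_stop(listener)
        if accepted is None:
            break  # stop instead of retrying accept() on a dead descriptor
        conn, _addr = accepted
        with conn:
            handle(conn)
```

A loop that simply retries after every failed accept() never blocks again once the descriptor is invalid, which is consistent with a process sitting at 100% CPU.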
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga

            • xs-
              Senior Member
              Zabbix Certified Specialist
              • Dec 2007
              • 393

              #7
              Hi,

              OK, so I waited for the next occurrence of the problem (of course it then wouldn't trigger for a long time).

              -----------------------8<-----------------------------------
              accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
              read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
              accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
              read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
              accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
              read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
              accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
              read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
              accept(4, 0x7fff3b811770, [6129680576417890320]) = -1 EBADF (Bad file descriptor)
              read(4, 0x7fff3b8117d8, 5) = -1 EBADF (Bad file descriptor)
              -----------------------8<-----------------------------------
              You get the idea . . . .
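When a trace like this runs to thousands of lines, a quick tally makes the pattern obvious. The sketch below counts (syscall, errno) pairs in saved strace output; summarize_strace is a name made up for this example, and it assumes the plain one-call-per-line format shown above.

```python
import re
from collections import Counter

# Matches strace lines such as:
#   accept(4, 0x7fff..., [612...]) = -1 EBADF (Bad file descriptor)
LINE_RE = re.compile(r"^\s*(\w+)\((.*)\)\s+=\s+(-?\d+)(?:\s+([A-Z]+)\b)?")


def summarize_strace(lines):
    """Count (syscall, errno_name) pairs; errno_name is None on success."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group(1), m.group(4))] += 1
    return counts
```

If nearly every entry is the same pair, for instance ("accept", "EBADF"), the process is most likely stuck retrying a call that can never succeed.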

              Alexei, is this the issue you are talking about?

              And the most important question: how production-ready is 1.4.5-pre? (Yes, yes, the -pre kind of answers it, but still . . .)

              Thanks!


              • bbrendon
                Senior Member
                • Sep 2005
                • 870

                #8
                You should have searched the forums



                I didn't do anything but lead you to water... I'll have beer instead. heh.

                • xs-
                  Senior Member
                  Zabbix Certified Specialist
                  • Dec 2007
                  • 393

                  #9
                  Yeah, I tried of course, but apparently didn't search for the correct words.
                  The problem is fixed with 1.4.5-pre (I used the nightly release from website->dev).

                  Thanks!


                  • bbrendon
                    Senior Member
                    • Sep 2005
                    • 870

                    #10
                    Originally posted by xs-
                    Yeah, I tried of course, but apparently didn't search for the correct words.
                    The problem is fixed with 1.4.5-pre (I used the nightly release from website->dev).

                    Thanks!
                    Use Google to search. I find it works much better.

                    • bbrendon
                      Senior Member
                      • Sep 2005
                      • 870

                      #11
                      This issue seems to occur when the database is very busy.

                      The times when data is no longer collected are close to the times of the slow queries listed in MySQL's slow query log!
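A rough way to check this kind of correlation is to pair each collection gap with the slow queries logged near it. The sketch below assumes both sets of timestamps have already been parsed into datetime objects; near_matches and the 5-minute window are choices made for this example, not anything Zabbix or MySQL provides.

```python
from datetime import datetime, timedelta


def near_matches(gap_times, slow_query_times, window=timedelta(minutes=5)):
    """Map each data-gap time to the slow-query times within `window` of it."""
    matches = {}
    for gap in gap_times:
        matches[gap] = [t for t in slow_query_times if abs(t - gap) <= window]
    return matches
```

Gaps that map to an empty list had no slow query nearby, which would count against the database being the trigger for those incidents.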

                      • bbrendon
                        Senior Member
                        • Sep 2005
                        • 870

                        #12
                        The title of this thread isn't accurate; please move the discussion to this thread:
