[1.4.4] zabbix_server doesn't crash, but no longer collects data

  • bbrendon
    Senior Member
    • Sep 2005
    • 870

    #16
    I think I have successfully distributed the load on my server to make zabbix happy. It hasn't malfunctioned in a few days, but that doesn't mean it won't again.
    Unofficial Zabbix Expert
    Blog, Corporate Site


    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #17
      I think that there could be a problem in processing of situations when MySQL server is unavailable. Possibly ZABBIX does not recover nicely under some unknown circumstances. This is just a guess, I cannot confirm it.
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga
      My Twitter


      • sdwilders
        Member
        • Feb 2008
        • 33

        #18
        I've had the same problem with 1.4.4 on Ubuntu. I recently purchased a high-spec Red Hat Enterprise server and experienced the same problem there. I have now upgraded to 1.5 and the problem still exists.

        I run an IT support company, so we have been trying for some time to set up a monitoring system that receives data from client computers distributed nationally; for this reason we can only use active agents. I originally set up Nagios, which ran fine but was a nightmare to configure. Zabbix is much better for our needs, but because it stops collecting data from active agents after 2 - 3 hours, we won't be able to continue using it unless this is fixed. I have only added about 100 hosts so far and will need to add a lot more. The only way I'm able to get it working again is by doing a full reboot; restarting the Zabbix services doesn't seem to get it going again (but I will confirm this after it next stops).

        I'm currently comparing things like running services before and after the problem to see if I can pinpoint what is causing it. One odd thing I noticed is that when it happens I can still telnet to 10051 on localhost but cannot from any other machine. Can anyone replicate this?


        • Alexei
          Founder, CEO
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Sep 2004
          • 5654

          #19
          Originally posted by sdwilders
          One odd thing I noticed is that when it happens I can still telnet to 10051 on localhost but cannot from any other machine. Can anyone replicate this?
          That is strange. Because of this, it doesn't look like a ZABBIX problem to me. Is there a firewall or something in between? What OS is the ZABBIX server running on?

          • sdwilders
            Member
            • Feb 2008
            • 33

            #20
            I am running RedHat Enterprise Linux 5.

            I am still testing but I have found a few interesting things. Firstly, it appears I can telnet to port 10051 but it is really slow - sometimes timing out and other times connecting after a while. This explains why active checks don't get collected as the agents have a default timeout of 5 seconds.
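For anyone who wants to reproduce the slow-connect symptom without telnet, here is a minimal sketch that times a TCP connection to the server port. It only opens and closes the socket and does not speak the Zabbix protocol. The function name `time_tcp_connect` is made up for illustration, and the 5-second default simply mirrors the agent timeout mentioned above.

```python
import socket
import time


def time_tcp_connect(host, port=10051, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    A refused connection and a timeout are both reported as None.
    """
    start = time.monotonic()
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling `time_tcp_connect("your-zabbix-server")` (placeholder host name) from a remote machine every few seconds, before and after the problem starts, should show whether connect times creep towards the timeout or fail outright.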

            It has been suggested that the cause could be a busy MySQL server, but I don't see this: the MySQL server uses about 10% CPU while data is being collected, and once the problem starts its CPU usage drops to between 2% and 3%. The MySQL server is still running fine when data is being collected; even restarting it doesn't help.

            I have no firewall running - this was one thing I had to check because I wasn't sure if the problem was to do with too many connections within a period of time. I can now confirm this isn't the case because there is no firewall running and I can still telnet (as above) just very slowly.

            The problem seems to start after about 2 - 3 hours of the server running and can only be rectified by rebooting. I am trying to work out what procedures may be running at this frequency which is why I have currently disabled log rotation to see if this may be a cause.

            I will keep testing and post my results. If anyone has any suggestions in the meantime, I would be grateful to hear them.
            Last edited by sdwilders; 23-03-2008, 01:46. Reason: clarification.


            • sdwilders
              Member
              • Feb 2008
              • 33

              #21
              OK, I still have not worked this out. The data stopped again after 2 hours and 20 minutes (I can tell because on the queue screen all the ZABBIX agent (active) checks go to not having been heard from for 'More than 5 minutes').

              This time I got it going again by simply stopping all zabbix_server processes and then starting the server again. So I guess it's nothing to do with log rotation. I know housekeeping isn't the cause because it ran a few times while the system was running fine.

              Any other suggestions? Something is stopping it from processing ZABBIX agent (active) checks; all the other types continue to run. What is it that's different about the way these checks are processed compared with the other checks?

              The zabbix_server.log contained the following when the problem started:

              4197:20080323:000659 Timeout while answering request
              4193:20080323:000700 Timeout while answering request
              4196:20080323:000713 Timeout while answering request
              4211:20080323:000715 Error while sending list of active checks
              4211:20080323:000715 Error while sending list of active checks
              4211:20080323:000715 Error while sending list of active checks
              4211:20080323:000715 Error while sending list of active checks
              4193:20080323:000715 Timeout while answering request
              4194:20080323:000716 Timeout while answering request
              4194:20080323:000751 Timeout while answering request
              4196:20080323:000753 Timeout while answering request
              4197:20080323:000754 Timeout while answering request
              4193:20080323:000755 Timeout while answering request
              4197:20080323:000759 Timeout while answering request
              4193:20080323:000800 Timeout while answering request
              4196:20080323:000813 Timeout while answering request
              4193:20080323:000815 Timeout while answering request
              4194:20080323:000817 Timeout while answering request
              4194:20080323:000851 Timeout while answering request
              4196:20080323:000853 Timeout while answering request
              4197:20080323:000854 Timeout while answering request
              4193:20080323:000855 Timeout while answering request
              4197:20080323:000859 Timeout while answering request
              4193:20080323:000900 Timeout while answering request
              4196:20080323:000914 Timeout while answering request
              4193:20080323:000915 Timeout while answering request
              4194:20080323:000917 Timeout while answering request
              4194:20080323:000951 Timeout while answering request
              4196:20080323:000953 Timeout while answering request
              4197:20080323:000954 Timeout while answering request
              4193:20080323:000955 Timeout while answering request
              4197:20080323:000959 Timeout while answering request
              4193:20080323:001000 Timeout while answering request

              The only other thing I have in my log is about CPU checks. The following was being reported before the problem:

              4193:20080322:235135 Timeout while answering request
              4211:20080322:235136 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4211:20080322:235138 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
              4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
              4211:20080322:235141 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4211:20080322:235143 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
              4211:20080322:235145 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235146 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4217:20080322:235146 Executing housekeeper
              4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
              4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
              4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4194:20080322:235152 Timeout while answering request
              4196:20080322:235153 Timeout while answering request
              4217:20080322:235154 Deleted 11207 records from history and trends
              4197:20080322:235154 Timeout while answering request
              4193:20080322:235155 Timeout while answering request
              4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
              4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235156 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4211:20080322:235157 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
              4197:20080322:235159 Timeout while answering request
              4193:20080322:235200 Timeout while answering request
              4211:20080322:235201 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
              4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4211:20080322:235204 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
              4211:20080322:235205 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235208 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
              4211:20080322:235210 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
              4211:20080322:235211 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
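To see when the timeouts start and which server processes report them, the log lines above can be tallied. This is a rough sketch that assumes every line has the `PID:YYYYMMDD:HHMMSS message` layout shown in the excerpts; the function name `summarize_timeouts` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "4197:20080323:000659 Timeout while answering request".
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")


def summarize_timeouts(lines):
    """Count 'Timeout while answering request' entries per minute and per PID."""
    per_minute = Counter()
    per_pid = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not in the expected layout
        pid, date, clock, message = match.groups()
        if message.startswith("Timeout while answering request"):
            per_minute[f"{date} {clock[:2]}:{clock[2:4]}"] += 1
            per_pid[int(pid)] += 1
    return per_minute, per_pid
```

Feeding it `open("zabbix_server.log")` (the path depends on the installation) gives two counters; a jump in the per-minute counts should mark the moment the active checks stop being answered.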


              • sdwilders
                Member
                • Feb 2008
                • 33

                #22
                Error message in /var/log/messages

                I have found this in /var/log/messages:

                Mar 23 04:02:06 zabbix syslogd 1.4.1: restart.
                Mar 23 04:02:06 zabbix logrotate: ALERT exited abnormally with [1]

                I have disabled log rotation (by setting LogFileSize=0 in /etc/zabbix/zabbix_server.conf) so if the above is the cause of the problem, why is log rotation still happening while disabled?


                • Alexei
                  Founder, CEO
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist
                  Zabbix Certified Professional
                  • Sep 2004
                  • 5654

                  #23
                  Originally posted by sdwilders
                  I have found this in /var/log/messages:

                  Mar 23 04:02:06 zabbix syslogd 1.4.1: restart.
                  Mar 23 04:02:06 zabbix logrotate: ALERT exited abnormally with [1]
                  This has nothing to do with ZABBIX settings. Check configuration of the logrotate.

                  • sdwilders
                    Member
                    • Feb 2008
                    • 33

                    #24
                    Yes, you're correct. Fixed the logrotate issue - just threw me because it mentioned zabbix.

                    Still investigating...


                    • sdwilders
                      Member
                      • Feb 2008
                      • 33

                      #25
                      I have installed webmin on the server to assist with troubleshooting and something interesting has shown up...

                      The last time the problem occurred, I had a look at the running processes and zabbix_server was still running. However, when I looked at the open files and connections I noticed the following:
                      3w Regular file 5 1966084 /var/tmp/zabbix_server.pid (deleted)

                      And actually checking in /var/tmp I could see that zabbix_server.pid was missing. 1) Could this cause the service to stop collecting active agent data but still process other types of checks? and 2) What would delete this file?
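The same check can be done without webmin. The sketch below is Linux-only and relies on the kernel appending " (deleted)" to the link target of a descriptor whose file has been unlinked; it reads `/proc/<pid>/fd`, so it needs to run as root or as the owner of the process. The function name `deleted_open_files` is made up for illustration.

```python
import os


def deleted_open_files(pid):
    """Return sorted (fd, target) pairs for open files that have been unlinked.

    Linux-only: inspects the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    results = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.endswith(" (deleted)"):
            results.append((int(name), target))
    return sorted(results)
```

Running it against the PID of a zabbix_server process would list entries like the PID file above if that file is removed while the server still holds it open.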

                      I have also disabled actions for discovery items. I don't use discovery but there were 2 enabled actions and after reading an earlier post about a change to actions fixing the issue I thought it wouldn't hurt to disable these. In fact both of these actions were reporting errors about hosts and templates that it referred to being missing.

                      I'm just waiting for the problem to occur again so I can see if the pid file disappears again or if the change to actions has made any difference. I'll report back on the outcome.


                      • Alexei
                        Founder, CEO
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist
                        Zabbix Certified Professional
                        • Sep 2004
                        • 5654

                        #26
                        I don't think the missing (removed by someone else) PID file can make any difference. Yet I would like to understand what's going on before release of 1.4.5.

                        • sdwilders
                          Member
                          • Feb 2008
                          • 33

                          #27
                          Been running for 4 hours now - not going to start shouting about it yet but this is the longest I have managed so far.

                          The only thing I have really changed is the configuration in Zabbix frontend of discovery actions: Configuration > Actions > Event Source: Discovery. When I went into this screen I received a warning that there was 1 missing host and 2 missing templates. I don't have the exact message to hand but maybe someone else experiencing this problem can check if they too have the same kind of message?

                          It wasn't a screen I had been into before because I don't use discovery. I know why the errors were appearing - because I removed all the default templates and created my own. After installing Zabbix I imported the Schema and Data SQL files into MySQL as a starting point. I then re-organised everything into templates and host groups that were more useful to me. My templates are:
                          • Antivirus - AVG
                          • Antivirus - Symantec
                          • External Service - FTP
                          • External Service - HTTP
                          • External Service - HTTPS
                          • External Service - IMAP
                          • External Service - POP
                          • External Service - RDP
                          • External Service - RPC
                          • External Service - SMTP
                          • External Service - Webadmin
                          • External Service - Webmin
                          • OS - Linux
                          • OS - Windows
                          • PING
                          • Server - Backup
                          • Server - Domain Controller
                          • Server - Exchange
                          • Server - Terminal


                          I find this is much easier for us to work with. It looks like having removed the default templates caused errors in the default discovery actions. I don't use discovery but since removing the actions Zabbix has been running.

                          Maybe I'm being premature here, but I will see how long the server keeps running and post back. It seems strange that another user mentioned actions as the cause in a previous post. Is it possible that broken actions can cause the server to stop accepting active agent checks?


                          • sdwilders
                            Member
                            • Feb 2008
                            • 33

                            #28
                            GUTTED! It ran for just over 10 hours and has just died. The PID file is still where it should be, though, so this doesn't appear to be the cause.

                            Fixing the invalid actions has certainly extended the time it runs for, but I currently have only 109 monitored hosts and need to add several thousand. As it is unable to keep running while monitoring these few, I doubt it will work with many more.

                            Looks like I'm either going back to Nagios or looking at the alternatives, which is a shame as Zabbix is perfect apart from this problem.


                            • Alexei
                              Founder, CEO
                              Zabbix Certified Trainer
                              Zabbix Certified Specialist
                              Zabbix Certified Professional
                              • Sep 2004
                              • 5654

                              #29
                              I would appreciate it if you could set DebugLevel=4 and send the FULL after-crash log file to a l e x @ z a b b i x . c o m.

                              • bbrendon
                                Senior Member
                                • Sep 2005
                                • 870

                                #30
                                Interesting. Usually the iowait is high for a longer period of time when things break. Last night it died at about 5:17.

                                sar output from that time:
                                Code:
                                12:35:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
                                05:05:01 AM       all      1.86      6.60      4.96      4.40      0.00     82.18
                                05:15:02 AM       all      1.85      1.53      3.88      3.54      0.00     89.21
                                05:25:01 AM       all      1.69      1.64      6.36     24.24      0.00     66.07
                                05:35:01 AM       all      1.23      1.17      3.32      3.18      0.00     91.10
                                05:45:01 AM       all      1.25      2.57      3.56      3.56      0.00     89.06
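To pick out intervals like the 05:25 one automatically, the sar rows can be filtered on the %iowait column. This is only a sketch: it assumes the nine-column layout shown above (time, AM/PM, CPU, then six percentages), and both the function name `iowait_spikes` and the 10% threshold are arbitrary choices for illustration.

```python
def iowait_spikes(sar_text, threshold=10.0):
    """Return (timestamp, iowait) pairs for 'all'-CPU rows at or above threshold."""
    spikes = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows: time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle.
        # The header row has "CPU" in the third column and is skipped here.
        if len(fields) != 9 or fields[2] != "all":
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue
        if iowait >= threshold:
            spikes.append((f"{fields[0]} {fields[1]}", iowait))
    return spikes
```

Applied to the output above, this flags only the 05:25:01 AM interval, which matches the time the server stopped.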
                                Last edited by bbrendon; 23-03-2008, 22:46.