I think I have successfully distributed the load on my server to make zabbix happy. It hasn't malfunctioned in a few days, but that doesn't mean it won't again.
[1.4.4] zabbix_server doesn't crash, but no longer collects data
-
I think there could be a problem in how ZABBIX handles situations when the MySQL server is unavailable. Possibly ZABBIX does not recover nicely under some unknown circumstances. This is just a guess; I cannot confirm it.
-
I've had the same problem with 1.4.4 on Ubuntu. I recently purchased a high-spec Red Hat Enterprise Linux server and experienced the same problem. I have now upgraded to 1.5 and the problem still exists.
I run an IT support company, so we've been trying for some time to set up a monitoring system that will receive data from client computers distributed nationally; for this reason we can only use active agents. I originally set up Nagios, which ran fine but was a nightmare to configure. Zabbix is much better for our needs, but as it stops collecting data from active agents after 2 - 3 hours we won't be able to continue using it unless this is fixed. I have only added about 100 hosts so far and will need to add a lot more. The only way I'm able to get it working again is by doing a full reboot - restarting the Zabbix services doesn't seem to get it going again (but I will confirm this after it next stops).
I'm currently comparing things like running services before and after the problem to see if I can pinpoint what is causing it. One odd thing I noticed is that when it happens I can still telnet to 10051 on localhost but not from any other machine. Can anyone replicate this?
-
That is strange. Because of this, it doesn't look like a ZABBIX problem to me. Is there a firewall or something in between? What OS is the ZABBIX server running on?
-
I am running RedHat Enterprise Linux 5.
I am still testing but I have found a few interesting things. Firstly, it appears I can telnet to port 10051 but it is really slow - sometimes timing out and other times connecting after a while. This explains why active checks don't get collected as the agents have a default timeout of 5 seconds.
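For anyone who wants to reproduce the slow-telnet observation with numbers, here is a minimal Python sketch that times a plain TCP connect to the server port. It only measures how long the connection takes to open; it does not speak the Zabbix protocol. The host name in the usage part is a placeholder, and the 5-second timeout simply mirrors the agent default mentioned above.

```python
import socket
import time


def time_connect(host, port=10051, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections and timeouts
        return None


if __name__ == "__main__":
    # "zabbix.example.com" is a placeholder; substitute the real server address.
    for attempt in range(5):
        elapsed = time_connect("zabbix.example.com")
        if elapsed is None:
            print(f"attempt {attempt}: no connection within 5 s")
        else:
            print(f"attempt {attempt}: connected in {elapsed:.3f} s")
        time.sleep(1)
```

Running it from both localhost and a remote machine while the problem is occurring would show whether the delay differs between the two.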
It has been suggested that the cause could be a busy MySQL server but I don't see this because the MySQL server is using about 10% CPU while data is being collected but once the problem starts, the CPU usage lowers to between 2% and 3%. The MySQL server is still running fine when data is being collected; even restarting it doesn't help.
I have no firewall running - this was one thing I had to check because I wasn't sure if the problem was to do with too many connections within a period of time. I can now confirm this isn't the case because there is no firewall running and I can still telnet (as above) just very slowly.
The problem seems to start after about 2 - 3 hours of the server running and can only be rectified by rebooting. I am trying to work out what procedures may be running at this frequency which is why I have currently disabled log rotation to see if this may be a cause.
I will keep testing and post my results. If anyone has any suggestions in the meantime, I would be grateful to hear them.
-
OK, I still haven't worked this out. The data stopped again after 2 hours and 20 minutes (I can tell because on the queue screen all the ZABBIX agent (active) checks go to not having been heard from for 'More than 5 minutes').
This time I got it going again by simply stopping all zabbix_server processes and then starting the server again, so I guess it's nothing to do with log rotation. I know housekeeping isn't the cause because it ran a few times while the system was running fine.
Any other suggestions? Something is stopping it from processing ZABBIX agent (active) checks; all the other types continue to run. What is it that's different about the way these checks are processed compared with the other checks?
The zabbix_server.log contained the following when the problem started:
4197:20080323:000659 Timeout while answering request
4193:20080323:000700 Timeout while answering request
4196:20080323:000713 Timeout while answering request
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4193:20080323:000715 Timeout while answering request
4194:20080323:000716 Timeout while answering request
4194:20080323:000751 Timeout while answering request
4196:20080323:000753 Timeout while answering request
4197:20080323:000754 Timeout while answering request
4193:20080323:000755 Timeout while answering request
4197:20080323:000759 Timeout while answering request
4193:20080323:000800 Timeout while answering request
4196:20080323:000813 Timeout while answering request
4193:20080323:000815 Timeout while answering request
4194:20080323:000817 Timeout while answering request
4194:20080323:000851 Timeout while answering request
4196:20080323:000853 Timeout while answering request
4197:20080323:000854 Timeout while answering request
4193:20080323:000855 Timeout while answering request
4197:20080323:000859 Timeout while answering request
4193:20080323:000900 Timeout while answering request
4196:20080323:000914 Timeout while answering request
4193:20080323:000915 Timeout while answering request
4194:20080323:000917 Timeout while answering request
4194:20080323:000951 Timeout while answering request
4196:20080323:000953 Timeout while answering request
4197:20080323:000954 Timeout while answering request
4193:20080323:000955 Timeout while answering request
4197:20080323:000959 Timeout while answering request
4193:20080323:001000 Timeout while answering request
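To see at a glance which server processes are logging which messages, the log lines above can be tallied with a short script. This is only a sketch: it assumes every line follows the `PID:YYYYMMDD:HHMMSS message` layout seen in the excerpt, and the sample is just the first few lines of that excerpt.

```python
import re
from collections import Counter

# pid:date:time message, as seen in the zabbix_server.log excerpt
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")


def summarize(lines):
    """Count occurrences of each (pid, message) pair; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            pid, _date, _time, message = match.groups()
            counts[(int(pid), message)] += 1
    return counts


SAMPLE = """\
4197:20080323:000659 Timeout while answering request
4193:20080323:000700 Timeout while answering request
4196:20080323:000713 Timeout while answering request
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
""".splitlines()

if __name__ == "__main__":
    for (pid, message), n in summarize(SAMPLE).most_common():
        print(f"{n:4d}  pid {pid}  {message}")
```

On the full log this makes it easier to see that the "Error while sending list of active checks" lines come from a single PID while the timeouts are spread over several others.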
The only other thing I have in my log is about CPU checks. The following was being reported before the problem:
4193:20080322:235135 Timeout while answering request
4211:20080322:235136 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235138 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235141 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235143 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235145 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235146 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4217:20080322:235146 Executing housekeeper
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4194:20080322:235152 Timeout while answering request
4196:20080322:235153 Timeout while answering request
4217:20080322:235154 Deleted 11207 records from history and trends
4197:20080322:235154 Timeout while answering request
4193:20080322:235155 Timeout while answering request
4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235156 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235157 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
4197:20080322:235159 Timeout while answering request
4193:20080322:235200 Timeout while answering request
4211:20080322:235201 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235204 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235205 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235208 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235210 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235211 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
-
Error message in /var/log/messages
I have found this in /var/log/messages:
Mar 23 04:02:06 zabbix syslogd 1.4.1: restart.
Mar 23 04:02:06 zabbix logrotate: ALERT exited abnormally with [1]
I have disabled log rotation (by setting LogFileSize=0 in /etc/zabbix/zabbix_server.conf) so if the above is the cause of the problem, why is log rotation still happening while disabled?
-
I have installed Webmin on the server to assist with troubleshooting and something interesting has shown up...
The last time the problem occurred, I had a look at the running processes and zabbix_server was still running. However, when I looked at the open files and connections I noticed the following:
3w Regular file 5 1966084 /var/tmp/zabbix_server.pid (deleted)
And actually checking in /var/tmp I could see that zabbix_server.pid was missing. 1) Could this cause the service to stop collecting active agent data but still process other types of checks? and 2) What would delete this file?
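The same check can be done without Webmin. On Linux, each entry under `/proc/<pid>/fd` is a symlink to the open file, and the kernel appends " (deleted)" to the target once the file has been unlinked. The sketch below lists such entries for one process; it is Linux-only, needs permission to read that process's `/proc` entry, and the PID in the usage part is just one taken from the log above as an example.

```python
import os

DELETED_SUFFIX = " (deleted)"


def strip_deleted(target):
    """Return the original path if a /proc fd link target is marked deleted, else None."""
    if target.endswith(DELETED_SUFFIX):
        return target[: -len(DELETED_SUFFIX)]
    return None


def deleted_open_files(pid):
    """Return paths that process `pid` still holds open although they were unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        path = strip_deleted(target)
        if path is not None:
            found.append(path)
    return found


if __name__ == "__main__":
    # 4211 is only an example PID; use the PID of a running zabbix_server process.
    print(deleted_open_files(4211))
```

A process keeps working with a file it has open even after the directory entry is removed, so a deleted PID file showing up here does not by itself mean the server is affected.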
I have also disabled actions for discovery items. I don't use discovery but there were 2 enabled actions and after reading an earlier post about a change to actions fixing the issue I thought it wouldn't hurt to disable these. In fact both of these actions were reporting errors about hosts and templates that it referred to being missing.
I'm just waiting for the problem to occur again so I can see if the pid file disappears again or if the change to actions has made any difference. I'll report back on the outcome.
-
I don't think the missing PID file (removed by someone else) can make any difference. Still, I would like to understand what's going on before the release of 1.4.5.
-
Been running for 4 hours now - not going to start shouting about it yet but this is the longest I have managed so far.
The only thing I have really changed is the configuration in Zabbix frontend of discovery actions: Configuration > Actions > Event Source: Discovery. When I went into this screen I received a warning that there was 1 missing host and 2 missing templates. I don't have the exact message to hand but maybe someone else experiencing this problem can check if they too have the same kind of message?
It wasn't a screen I had been into before because I don't use discovery. I know why the errors were appearing - because I removed all the default templates and created my own. After installing Zabbix I imported the Schema and Data SQL files into MySQL as a starting point. I then re-organised everything into templates and host groups that were more useful to me. My templates are:
- Antivirus - AVG
- Antivirus - Symantec
- External Service - FTP
- External Service - HTTP
- External Service - HTTPS
- External Service - IMAP
- External Service - POP
- External Service - RDP
- External Service - RPC
- External Service - SMTP
- External Service - Webadmin
- External Service - Webmin
- OS - Linux
- OS - Windows
- PING
- Server - Backup
- Server - Domain Controller
- Server - Exchange
- Server - Terminal
I find this is much easier for us to work with. It looks like having removed the default templates caused errors in the default discovery actions. I don't use discovery but since removing the actions Zabbix has been running.
Maybe I'm being premature here, but I will see how long the server keeps running and post back. It seems strange that another user mentioned actions as the cause in a previous post. Is it possible that broken actions can cause the server to stop accepting active agent checks?
-
GUTTED! It ran for just over 10 hours and has just died. The PID file is still where it should be, though, so this doesn't appear to be the cause.
Solving the invalid actions has certainly extended the time it runs for. However, I currently have only 109 monitored hosts and need to add several thousand; as it is unable to keep running while monitoring these few, I doubt it will work with many more.
Looks like I'm either going back to Nagios or looking at the alternatives, which is a shame as Zabbix is perfect apart from this problem.
-
I would appreciate it if you could set DebugLevel=4 and send the FULL after-crash log file to a l e x @ z a b b i x . c o m.
-
Interesting. Usually the iowait is high for a longer period of time when things break. Last night it died at about 5:17.
sar output from that time:
Code:
12:35:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
05:05:01 AM  all   1.86   6.60     4.96     4.40    0.00  82.18
05:15:02 AM  all   1.85   1.53     3.88     3.54    0.00  89.21
05:25:01 AM  all   1.69   1.64     6.36    24.24    0.00  66.07
05:35:01 AM  all   1.23   1.17     3.32     3.18    0.00  91.10
05:45:01 AM  all   1.25   2.57     3.56     3.56    0.00  89.06
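For longer sar logs, the intervals with high iowait can be picked out automatically. The sketch below assumes the nine-column layout shown in the sar output above (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle); the 10% threshold is an arbitrary choice for illustration, not a recommended value.

```python
SAR_SAMPLE = """\
12:35:01 PM CPU %user %nice %system %iowait %steal %idle
05:05:01 AM all 1.86 6.60 4.96 4.40 0.00 82.18
05:15:02 AM all 1.85 1.53 3.88 3.54 0.00 89.21
05:25:01 AM all 1.69 1.64 6.36 24.24 0.00 66.07
05:35:01 AM all 1.23 1.17 3.32 3.18 0.00 91.10
05:45:01 AM all 1.25 2.57 3.56 3.56 0.00 89.06
"""


def iowait_spikes(text, threshold=10.0):
    """Return (timestamp, iowait) for 'all'-CPU rows whose %iowait exceeds threshold."""
    spikes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a nine-column 'all' row.
        if len(fields) != 9 or fields[2] != "all":
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue
        if iowait > threshold:
            spikes.append((f"{fields[0]} {fields[1]}", iowait))
    return spikes


if __name__ == "__main__":
    for timestamp, iowait in iowait_spikes(SAR_SAMPLE):
        print(f"{timestamp}: %iowait {iowait}")
```

With the sample above, only the 05:25:01 AM interval is flagged, which is the 10-minute window that covers the time the server died.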
Last edited by bbrendon; 23-03-2008, 22:46.