[1.4.4] zabbix_server doesn't crash, but no longer collects data


pascalp
23-01-2008, 15:53
Hello,

I have searched the existing threads, but so far didn't find a match. So I opened a new thread. If I'm mistaken and there is a thread already handling this, please do not hit me :)

My problem: Sometimes, a few minutes after I start zabbix_server, the server stops collecting data. When I look at a graph, e.g. the processor load of a client, it simply stops at a certain moment, but the zabbix_server process continues running. At the moment the server stops collecting data, the load of the machine running zabbix_server drops to about 0.40, which is lower than the value when zabbix_server is collecting data.
All my clients are running zabbix_agentd with active checks. There are about 15 clients, each with about 15 items. When zabbix_server is collecting data, the average load never goes higher than 2.
The server that runs zabbix_server is a virtual Debian machine: 1500 MHz (Xeon) guaranteed, 800MB RAM. The virtualization software is Virtuozzo.

Has anyone seen the same problem, or is it probably a mistake of mine?

Regards,
Pascal

N.B. I forgot to say that I recently applied the fix for the bug described in http://www.zabbix.com/forum/showthread.php?t=8703, though I have no idea if this has anything to do with the current problem.

Petya
23-01-2008, 17:13
Do you mean "Zabbix Agent (active)" when saying
"zabbix_agentd with active checks"?

If yes then I'm another one who has a similar problem
(I don't have this problem when items are of type "Zabbix Agent").

Try changing item types (you can use the "Mass update" button) --
this works well when you don't have many hosts (and it's the default, actually).

Also there's similar issue here:
http://www.zabbix.com/forum/showthread.php?t=8718

pascalp
24-01-2008, 15:11
Do you mean "Zabbix Agent (active)" when saying
"zabbix_agentd with active checks"?

If yes then I'm another one who has a similar problem
(I don't have this problem when items are of type "Zabbix Agent").

Try changing item types (you can use the "Mass update" button) --
this works well when you don't have many hosts (and it's the default, actually).

Exactly, I can't use passive checks because all my servers are running behind routers that I'm not responsible for maintaining.

Also there's similar issue here:
http://www.zabbix.com/forum/showthread.php?t=8718

but in fact, my zabbix_server.log in /tmp is filled with the statement
"(...) Error while sending list of active checks"
This message comes from the place in the source code where the patch described in the thread http://www.zabbix.com/forum/showthread.php?t=8703 (this thread is mentioned in your link) is applied. Could the patch that fixes the average load problem cause this problem? I'm really not sure, because my server already logs these messages while it's still collecting data.

Regards,
Pascal

torti-
17-03-2008, 11:30
this is exactly the situation I have - did someone already solve this?

xs-
17-03-2008, 12:37
I believe this is fixed in 1.4.5-pre (1.4.4 nightly build, on website -> developer)

Alexei
17-03-2008, 20:30
It is fixed in pre 1.4.5.

torti-
18-03-2008, 08:54
Well if the mentioned archive is http://www.zabbix.com/downloads/nightly/pre-zabbix-1.4.tar.gz then it is not fixed :(

zabbix_server still stops responding (and collecting data) without an error. The only thing I can see in the logs is:

I guess this is how it should look when the server process is still OK:

2016:20080228:223009 In process_httptests()
2016:20080228:223009 Query [select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1204234209 and mod(httptestid,5)=2 and httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999) ]
2016:20080228:223009 End process_httptests()
2016:20080228:223009 Spent 0 seconds while processing HTTP tests
2016:20080228:223009 Query [select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=2 and t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999) ]
2016:20080228:223009 Nextcheck:1204234259 Time:1204234209
2016:20080228:223009 Sleeping for 5 seconds


and this is what I get when the server process hangs:

2015:20080228:223009 In process_httptests()
2015:20080228:223009 Query [select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1204234209 and mod(httptestid,5)=1 and httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999) ]
2015:20080228:223009 End process_httptests()
2015:20080228:223009 Spent 0 seconds while processing HTTP tests
2015:20080228:223009 Query [select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=1 and t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999) ]
2015:20080228:223009 No httptests to process in get_minnextcheck.
2015:20080228:223009 Nextcheck:-1 Time:1204234209
2015:20080228:223009 Sleeping for 5 seconds
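
A small sketch for reading the `Nextcheck:`/`Time:` pairs in the two excerpts above. It assumes both numbers are Unix timestamps (the `Time:` value matches the log's wall-clock time if the server is at UTC+1) and treats `-1` as "nothing scheduled", since it appears right after "No httptests to process". Both readings are guesses from the log, not taken from the ZABBIX source, and the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the "Nextcheck:<n> Time:<n>" part of a debug log line.
NEXTCHECK_RE = re.compile(r"Nextcheck:(-?\d+) Time:(\d+)")


def seconds_until_next_check(line):
    """Return Nextcheck - Time in seconds.

    Returns None if the line has no Nextcheck/Time pair, or if
    Nextcheck is negative (assumed to mean "nothing scheduled").
    """
    match = NEXTCHECK_RE.search(line)
    if match is None:
        return None
    nextcheck, now = int(match.group(1)), int(match.group(2))
    if nextcheck < 0:
        return None
    return nextcheck - now


def epoch_to_utc(epoch):
    """Convert a Unix timestamp from the log to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)
```

For the first excerpt this gives 50 (seconds until the next web check); for the second it gives None, which under the assumption above would only mean that this poller has no web scenarios assigned.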

xs-
18-03-2008, 11:14
Heh, well yesterday we had a similar thing again.
It very much looked like the problems we had before (trapper not receiving data) but this time the load was 0, no zabbix threads going haywire.

After not finding anything to blame, we restarted zabbix_server (master node in a distributed setup) and all was well again.
Shortly after that we saw that one of the distributed nodes had its zabbix_server stopped (connection to db lost; the database is local and didn't stop). After inspection we saw it had stopped around the same time the master node stopped receiving data.

Maybe this is related, maybe not. Worth looking into, though.
It might be possible the trapper part of zabbix can experience problems when another server node dies during a send or action (or vice versa).

-- Edit
We are running 1.4.5-pre on the main node, 1.4.4 on the child nodes

torti-
18-03-2008, 11:38
hm, you might be right that the problem is in the db-connection part of zabbix.

I am currently not running a distributed setup of zabbix_server, so I don't think it is a problem related to multiple servers.

bbrendon
19-03-2008, 05:39
http://www.zabbix.com/forum/showthread.php?p=31732#post31732

Seems to be related to the mysql server being very busy, which sometimes seems to be caused by the web monitoring. I don't use that in production, so I deleted all web monitoring.

We'll see if things improve. My zabbix has been down for the past week because of this.

torti-
19-03-2008, 13:52
I have thought about that too and disabling web monitoring didn't help at all. I tried various 1.4.* versions including developer pre-1.4.5 from Monday :(

actually the problem arose when I started using active agents.

This is a major issue for me because at this point zabbix isn't useful at all if you need to use active agents and the zabbix_server process has stability issues :(

PLEASE fix this as soon as possible alexei

bbrendon
19-03-2008, 17:41
I have thought about that too and disabling web monitoring didn't help at all. I tried various 1.4.* versions including developer pre-1.4.5 from Monday :(

actually the problem arose when I started using active agents.

This is a major issue for me because at this point zabbix isn't useful at all if you need to use active agents and the zabbix_server process has stability issues :(

PLEASE fix this as soon as possible alexei
FYI:
- My zabbix seems to die between 3:50 AM and 4:10 AM (almost every night, but not quite)
- I only use active agents.
- I'm running 1.4.4 with the load patch
- I disabled web monitoring last night
- I looked at the mysql-slow logs and it seems that the problem is related to a busy mysql server
- Non-active agent related items appear to get data, while active agent items don't. 90% of my system is active agent items though.
- I updated SNMP to 5.4.1 hoping it was SNMP lib related, recompiled, and no change
- server didn't stop recording data last night. We'll see how long it lasts...

That's about it here.

bbrendon
19-03-2008, 19:01
Okay. I have a fix! ...You'll love it, I swear!

# tail -2 crontab
# disable zabbix actions before zabbix_server breaks at 4 AM
22 1 * * * root mysql --user=zabbix --password=mypass zabbix -e "update actions set status = 1"

torti-
20-03-2008, 10:13
well that is not my definition of a 'fix' :(
last night it broke at 22:05 or so. Restarting the server process works fine but this is no solution for serious use of a program.

I have attached the server logfile with debuglevel 4. maybe someone more familiar with zabbix might look over it?

I'm not really sure that the server breaks at the same time every time...

thanks,
torti-

ps:
please increase the maximal size of the zip attachment - Your file of 262.5 KB bytes exceeds the forum's limit of 97.7 KB for this filetype.
I have renamed the archive for now to .c

bbrendon
20-03-2008, 17:18
I have narrowed it down to a plain old busy server. It doesn't appear to have anything to do with mysql. Mysql just has long queries because the server gets very busy, causing zabbix to malfunction.

bbrendon
21-03-2008, 17:59
I think I have successfully distributed the load on my server to make zabbix happy. It hasn't malfunctioned in a few days, but that doesn't mean it won't again.

Alexei
22-03-2008, 11:22
I think that there could be a problem in processing of situations when MySQL server is unavailable. Possibly ZABBIX does not recover nicely under some unknown circumstances. This is just a guess, I cannot confirm it.

sdwilders
22-03-2008, 19:48
I've had the same problem with 1.4.4 on Ubuntu. I recently purchased a high spec Red Hat Enterprise server and experienced the same problem. I have now upgraded to 1.5 and this problem still exists.

I run an IT Support company so we've been trying to set up a monitoring system for some time that will receive data from client computers distributed nationally - for this reason we can only use active agents. I originally set up Nagios which ran fine but was a nightmare to configure. Zabbix is much better for our needs but as it stops collecting data from active agents after 2 - 3 hours we won't be able to continue using it unless this is fixed. I have only added about 100 hosts so far and will need to add a lot more. The only way I'm able to get it working again is by doing a full reboot - restarting the Zabbix services doesn't seem to get it going again (but I will confirm this after it next stops).

I'm currently comparing things like running services before and after the problem to see if I can pinpoint what is causing it. One odd thing I noticed is that when it happens I can still telnet to 10051 on localhost but cannot from any other machine. Can anyone replicate this?

Alexei
22-03-2008, 21:52
One odd thing I noticed is that when it happens I can still telnet to 10051 on localhost but cannot from any other machine. Can anyone replicate this?
That is strange. Because of this, it doesn't look like a ZABBIX problem to me. Is there a firewall or something in between? What OS is the ZABBIX server running on?

sdwilders
22-03-2008, 23:54
I am running RedHat Enterprise Linux 5.

I am still testing but I have found a few interesting things. Firstly, it appears I can telnet to port 10051 but it is really slow - sometimes timing out and other times connecting after a while. This explains why active checks don't get collected as the agents have a default timeout of 5 seconds.

It has been suggested that the cause could be a busy MySQL server but I don't see this because the MySQL server is using about 10% CPU while data is being collected but once the problem starts, the CPU usage lowers to between 2% and 3%. The MySQL server is still running fine when data is being collected; even restarting it doesn't help.

I have no firewall running - this was one thing I had to check because I wasn't sure if the problem was to do with too many connections within a period of time. I can now confirm this isn't the case because there is no firewall running and I can still telnet (as above) just very slowly.

The problem seems to start after about 2 - 3 hours of the server running and can only be rectified by rebooting. I am trying to work out what procedures may be running at this frequency which is why I have currently disabled log rotation to see if this may be a cause.

I will keep testing and post my results, if anyone has any suggestions in the meantime I would be grateful to hear them.
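
The slow-connect observation above (port 10051, agents giving up after 5 seconds) can be measured rather than eyeballed with telnet. This is only a sketch: the function name is hypothetical, it only times the TCP handshake, and it does not speak the ZABBIX protocol.

```python
import socket
import time


def time_tcp_connect(host, port=10051, timeout=5.0):
    """Return the seconds needed to open a TCP connection.

    Returns None if the connection is refused, unreachable, or does
    not complete within `timeout` seconds. The 5-second default mirrors
    the agent timeout mentioned in the post.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers timeouts, refusals, and resolution errors
        return None
```

Running it from another machine every few seconds, and logging the result, would show whether connect times creep up before the active checks stop arriving.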

sdwilders
23-03-2008, 01:24
OK, still not worked this out. The data stopped again after 2 hours and 20 minutes (I can tell because on the queue screen all the ZABBIX agent (active) checks go to not having been heard from for 'More than 5 minutes').

This time I got it going again by simply stopping all zabbix_server processes and then starting it again. So I guess it's nothing to do with log rotation. I know housekeeping isn't the cause because this ran a few times during the time the system was running fine.

Any other suggestions? Something is stopping it from processing ZABBIX agent (active) checks; all the other types continue to run. What is it that's different about the way these checks are processed compared with the other checks?

The zabbix_server.log contained the following when the problem started:

4197:20080323:000659 Timeout while answering request
4193:20080323:000700 Timeout while answering request
4196:20080323:000713 Timeout while answering request
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4211:20080323:000715 Error while sending list of active checks
4193:20080323:000715 Timeout while answering request
4194:20080323:000716 Timeout while answering request
4194:20080323:000751 Timeout while answering request
4196:20080323:000753 Timeout while answering request
4197:20080323:000754 Timeout while answering request
4193:20080323:000755 Timeout while answering request
4197:20080323:000759 Timeout while answering request
4193:20080323:000800 Timeout while answering request
4196:20080323:000813 Timeout while answering request
4193:20080323:000815 Timeout while answering request
4194:20080323:000817 Timeout while answering request
4194:20080323:000851 Timeout while answering request
4196:20080323:000853 Timeout while answering request
4197:20080323:000854 Timeout while answering request
4193:20080323:000855 Timeout while answering request
4197:20080323:000859 Timeout while answering request
4193:20080323:000900 Timeout while answering request
4196:20080323:000914 Timeout while answering request
4193:20080323:000915 Timeout while answering request
4194:20080323:000917 Timeout while answering request
4194:20080323:000951 Timeout while answering request
4196:20080323:000953 Timeout while answering request
4197:20080323:000954 Timeout while answering request
4193:20080323:000955 Timeout while answering request
4197:20080323:000959 Timeout while answering request
4193:20080323:001000 Timeout while answering request

The only other thing I have in my log is about CPU checks. The following was being reported before the problem:

4193:20080322:235135 Timeout while answering request
4211:20080322:235136 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235138 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235140 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235141 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235143 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235145 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235146 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4217:20080322:235146 Executing housekeeper
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235151 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4194:20080322:235152 Timeout while answering request
4196:20080322:235153 Timeout while answering request
4217:20080322:235154 Deleted 11207 records from history and trends
4197:20080322:235154 Timeout while answering request
4193:20080322:235155 Timeout while answering request
4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235155 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235156 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235157 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server2]
4197:20080322:235159 Timeout while answering request
4193:20080322:235200 Timeout while answering request
4211:20080322:235201 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server2]
4211:20080322:235202 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235204 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg5]@server1]
4211:20080322:235205 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235208 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server1]
4211:20080322:235210 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg1]@server2]
4211:20080322:235211 Type of received value [Collector is not started!] is not suitable for [system.cpu.load[,avg15]@server1]
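
To see which server process emits which kind of message in excerpts like the two above, the lines can be tallied per PID. This is a rough sketch that assumes every line has the `PID:YYYYMMDD:HHMMSS message` shape seen in this thread; cutting the message at the first ` [` so that per-item details collapse into one bucket is my own simplification.

```python
import re
from collections import Counter

# PID:YYYYMMDD:HHMMSS followed by the message text.
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}) (.*)$")


def tally_by_process(lines):
    """Count log messages per (pid, message kind).

    The message kind is the text up to the first " [", so lines such as
    "Type of received value [...] is not suitable for [...]" are grouped.
    Lines that do not match the expected layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        pid = int(match.group(1))
        kind = match.group(4).split(" [")[0].strip()
        counts[(pid, kind)] += 1
    return counts
```

On the excerpts above this would put all "Error while sending list of active checks" lines under PID 4211 and the timeouts under PIDs 4193 to 4197.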

sdwilders
23-03-2008, 11:09
I have found this in /var/log/messages:

Mar 23 04:02:06 zabbix syslogd 1.4.1: restart.
Mar 23 04:02:06 zabbix logrotate: ALERT exited abnormally with [1]

I have disabled log rotation (by setting LogFileSize=0 in /etc/zabbix/zabbix_server.conf) so if the above is the cause of the problem, why is log rotation still happening while disabled?

Alexei
23-03-2008, 11:20
I have found this in /var/log/messages:

Mar 23 04:02:06 zabbix syslogd 1.4.1: restart.
Mar 23 04:02:06 zabbix logrotate: ALERT exited abnormally with [1]

This has nothing to do with ZABBIX settings. Check configuration of the logrotate.

sdwilders
23-03-2008, 11:51
Yes, you're correct. Fixed the logrotate issue - just threw me because it mentioned zabbix.

Still investigating...

sdwilders
23-03-2008, 13:02
I have installed webmin on the server to assist with troubleshooting and something interesting has shown up...

The last time the problem occurred, I had a look at the running processes and zabbix_server was still running. However, when I looked at the open files and connections I noticed the following:
3w Regular file 5 1966084 /var/tmp/zabbix_server.pid (deleted)

And actually checking in /var/tmp I could see that zabbix_server.pid was missing. 1) Could this cause the service to stop collecting active agent data but still process other types of checks? and 2) What would delete this file?

I have also disabled actions for discovery items. I don't use discovery but there were 2 enabled actions and after reading an earlier post about a change to actions fixing the issue I thought it wouldn't hurt to disable these. In fact both of these actions were reporting errors about hosts and templates that it referred to being missing.

I'm just waiting for the problem to occur again so I can see if the pid file disappears again or if the change to actions has made any difference. I'll report back on the outcome.
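
The "(deleted)" entry for the PID file can also be spotted without webmin. A minimal sketch, assuming a Linux system with /proc mounted and permission to read the target process's fd directory (root, or the same user as zabbix_server); the function name is hypothetical.

```python
import os


def deleted_open_files(pid):
    """List (fd, target) pairs for files a process still holds open
    although they have been unlinked. Linux-only: it reads the symlinks
    under /proc/<pid>/fd and keeps those ending in " (deleted)".
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return sorted(found)
```

Calling it with the PID of the main zabbix_server process, before and after the data stops, would show whether the PID file disappears at the same moment.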

Alexei
23-03-2008, 13:48
I don't think the missing (removed by someone else) PID file can make any difference. Yet I would like to understand what's going on before release of 1.4.5.

sdwilders
23-03-2008, 15:29
Been running for 4 hours now - not going to start shouting about it yet but this is the longest I have managed so far.

The only thing I have really changed is the configuration in Zabbix frontend of discovery actions: Configuration > Actions > Event Source: Discovery. When I went into this screen I received a warning that there was 1 missing host and 2 missing templates. I don't have the exact message to hand but maybe someone else experiencing this problem can check if they too have the same kind of message?

It wasn't a screen I had been into before because I don't use discovery. I know why the errors were appearing - because I removed all the default templates and created my own. After installing Zabbix I imported the Schema and Data SQL files into MySQL as a starting point. I then re-organised everything into templates and host groups that were more useful to me. My templates are:

Antivirus - AVG
Antivirus - Symantec
External Service - FTP
External Service - HTTP
External Service - HTTPS
External Service - IMAP
External Service - POP
External Service - RDP
External Service - RPC
External Service - SMTP
External Service - Webadmin
External Service - Webmin
OS - Linux
OS - Windows
PING
Server - Backup
Server - Domain Controller
Server - Exchange
Server - Terminal


I find this is much easier for us to work with. It looks like having removed the default templates caused errors in the default discovery actions. I don't use discovery but since removing the actions Zabbix has been running.

Maybe I'm being premature here but I will see how long the server keeps running for and post back. It seems strange that another user mentioned actions as the cause in a previous post, is it possible that broken actions can cause the server to stop accepting active agent checks?

sdwilders
23-03-2008, 21:11
GUTTED! It ran for just over 10 hours and has just died, the PID file is still where it should be though so this doesn't appear to be the cause.

Solving the invalid actions has certainly extended the time it runs for but I currently only have 109 monitored hosts and need to add several thousand but as it is unable to keep running while monitoring these few I doubt it will work with many more.

Looks like I'm either going back to Nagios or looking at the alternatives which is a shame as Zabbix is perfect apart from this problem.

Alexei
23-03-2008, 21:30
I would appreciate if you could set Debug=4, and send FULL after-crash log file to a l e x @ z a b b i x . c o m.

bbrendon
23-03-2008, 21:40
Interesting. Usually the iowait is high for a longer period of time when things break. Last night it died at about 5:17

sar output from that time:
12:35:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
05:05:01 AM  all   1.86   6.60     4.96     4.40    0.00  82.18
05:15:02 AM  all   1.85   1.53     3.88     3.54    0.00  89.21
05:25:01 AM  all   1.69   1.64     6.36    24.24    0.00  66.07
05:35:01 AM  all   1.23   1.17     3.32     3.18    0.00  91.10
05:45:01 AM  all   1.25   2.57     3.56     3.56    0.00  89.06
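
Picking the iowait spike out of sar output like the above can be scripted. This is a sketch only: it assumes the whitespace-separated layout shown here (time, AM/PM, CPU, then the percentage columns), and the 20% threshold is an arbitrary value chosen for illustration.

```python
def iowait_spikes(sar_text, threshold=20.0):
    """Return (time, %iowait) for every sample above `threshold`.

    The first non-empty line is taken as the header; the %iowait column
    is located by name, so extra columns do not matter as long as each
    data row has as many fields as the header.
    """
    rows = [line.split() for line in sar_text.splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    column = header.index("%iowait")
    spikes = []
    for fields in samples:
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        value = float(fields[column])
        if value > threshold:
            spikes.append((" ".join(fields[:2]), value))
    return spikes
```

For the five samples in this post, only the 05:25:01 AM interval (24.24% iowait) exceeds the threshold, which covers the time the server was reported to have died.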

sdwilders
24-03-2008, 19:58
I currently have both the server and agent running with logging set to debug level 4. As soon as it dies I will email these over to you Alexei.

Thanks for your help with this.

bbrendon
25-03-2008, 06:55
Any updates on this? This is my favorite thread on the internet ...

sdwilders
25-03-2008, 08:32
Server stopped collecting sometime after midnight - looking at the latest data, it appears to have died sometime between 00:10 and 00:20.

Took a little longer to get a log file than expected because the first time round it used up all the space in my /tmp partition! The raw log file is 2GB! I have tarred the file but obviously can't email it because it's still 125MB, so I've emailed you a link to download it, Alexei. Don't envy you looking through a log so large.

Hopefully we can now work out what is wrong as reading the forums there seems to be quite a few people with a similar problem.

sdwilders
25-03-2008, 12:47
Alexei replied:

Thanks for the log files.

Unfortunately I do not see anything wrong in the log. It seems that ZABBIX was killed, not crashed.

It was stopped exactly at 07:20:00 am. It is very suspicious to me!
Please check your system, it seems that some periodic (?) process killed ZABBIX.

Problem is that the zabbix server is still running (I can see several zabbix processes by doing ps aux at the command line). Any other ideas?

sdwilders
25-03-2008, 12:48
Sorry, I just had a thought. 7:20 is when I restarted the service myself to get it going again. It stopped collecting data around midnight.

Alexei
25-03-2008, 14:56
Sorry, I just had a thought. 7:20 is when I restarted the service myself to get it going again. It stopped collecting data around midnight.
Argh, good to know... I think I know what's going on. A patch will be created soon and release of 1.4.5 and 1.5.1 is on the way.

sdwilders
25-03-2008, 15:04
I'm assuming this means you found something in the logs? If so, I for one and infinity005 will be very happy :)

Alexei
25-03-2008, 15:11
I'm assuming this means you found something in the logs? If so, I for one and infinity005 will be very happy :)
Yes, I found something! Actually this was a known problem, which suddenly came up. The problem affects all ZABBIX systems, especially those with heavy use of active checks for monitoring of unreliable networks and remote locations.

bbrendon
25-03-2008, 17:27
Yes, I found something! Actually this was a known problem, which suddenly came up. The problem affects all ZABBIX systems, especially those with heavy use of active checks for monitoring of unreliable networks and remote locations.
Sounds like it applies to me 100%!! Please attach the patch to this thread, I need it ASAP, I don't even care if it hasn't been tested!! :)

Alexei
25-03-2008, 17:49
The patch is attached for your convenience. It has been tested!

bbrendon
25-03-2008, 17:50
Check out ZBX-343 in svn. I think it's the patch.

Installed it on my 1.4.4 setup. FINGERS CROSSED!!

sdwilders
25-03-2008, 20:28
I was looking at ZBX-323.

Can you tell me how to install a patch?

sdwilders
25-03-2008, 20:59
I've managed to install the patch.

I'll let you know how it goes - but I too will sit with my fingers crossed. If it makes it through the night I'll be happy, probably won't be convinced until it's run for a month though :D

Alexei
25-03-2008, 22:14
No need to wait a month. Feel free to keep me updated every new week of ZABBIX uptime :)

bbrendon
26-03-2008, 00:02
My zabbix server just crashed.

It's 1.4.4 + load patch and trapper patch. It was fine with the load patch, except that it would suddenly hang. The trapper patch causes it to crash.

I'm upgrading to 1.4.5 and bumping debugging back up to 4. I have the 1.4.4 + patch, debug=3 log if you're interested with summary below:

# tail zabbix_server.log.crash1.144_patched
] is not suitable for [ProcTotInt@kbl.server1]
32331:20080325:150003 Expression [{21451}>50] cannot be evaluated [Unable to get value for functionid [21451]]
32331:20080325:150003 Expression [{21456}>50] cannot be evaluated [Unable to get value for functionid [21456]]
32331:20080325:150003 Expression [{21457}>50] cannot be evaluated [Unable to get value for functionid [21457]]
32322:20080325:150020 Active parameter [system.run[mysqladmin --defaults-file=/etc/zabbix/agent.mycnf status|cut -f4 -d":"|cut -f1 -d"S"]] is not supported by agent on host [arts.web2]
32319:20080325:150021 One child process died. Exiting ...
32319:20080325:150023 ZABBIX Server stopped

# grep 32319 zabbix_server.log.crash1.144_patched
32319:20080325:150021 One child process died. Exiting ...
32319:20080325:150023 ZABBIX Server stopped
#

sdwilders
26-03-2008, 08:28
I'm still up and running! Made it through the night - a first.

I'm actually running 1.5 beta if that makes any difference to you infinity005? I know I shouldn't really run the beta but I love the new dashboard and a few other things so can't really take a step back now!

I'll be adding quite a lot of new hosts over the next few weeks so it'll certainly be tested thoroughly, will keep you updated.

sdwilders
26-03-2008, 09:12
Sorry, should have said - I'm not sure if the patch is certified for 1.5. I manually made the changes to the source code and recompiled.

Still running though! :D (though a phrase regarding chickens and eggs springs to mind...)

Alexei
26-03-2008, 09:23
Please ignore the original patch. It was incorrectly created. I fixed it in the original message, it is attached here as well.

Apologies for this.

bbrendon
26-03-2008, 16:55
Alexei- I used the patch from SVN. I didn't see your message until after I brought out the svn browser.

sdwilders- I'm assuming you're running 1.5 from svn and not the official beta?

Maybe if I was creating a new zabbix install I would start with 1.5, but I'm afraid I'll waste too much time running after bugs and monitoring may go down because of bugs. I already had enough downtime in the past month with the last issue.

Re- 1.4.5 trapper patch:
It seems to be working since I've switched to 1.4.5

sdwilders
26-03-2008, 22:11
I'm using the nightly build of 1.5 that was available when I downloaded it. At the time it was simply that I had nothing to lose in doing this because 1.4.4 was useless to me with the active check problem. As it is, 1.5 is very good - there are a few issues (such as not being able to create or modify templates, and a javascript error when hovering over a count on the dashboard) but nothing that stops me using it.

Just to keep you updated - ever since I manually made the changes detailed in the patch my system has been up and running. 26 hours of continuous operation looks good to me.

sdwilders
01-04-2008, 01:52
Just wanted to report back that since the patch my server has been running continuously without problem for the last 7 days. I am now monitoring 5169 items on 212 hosts - all being received from active agents (except zabbix agent on localhost).

Everything looks good but will let you know how it copes with the few thousand hosts we'll be adding over the next few weeks.

morabito
24-04-2008, 19:12
Just wanted to report back that since the patch my server has been running continuously without problem for the last 7 days. I am now monitoring 5169 items on 212 hosts - all being received from active agents (except zabbix agent on localhost).

Everything looks good but will let you know how it copes with the few thousand hosts we'll be adding over the next few weeks.

How Nice !!!
I'm still on 1.4.4 and a little scared to upgrade to 1.4.5; I'm a noob with zabbix.
How did you back up your existing zabbix database and the configuration files (PHP and Zabbix binaries)?
Any help is appreciated

This is my first time, since everything is running good (Im monitoring 15 machines so far).

Please let me know.

Thank You!

ulukay
11-07-2008, 11:04
hm, i'm running ZABBIX 1.4.5 and i got the same problem
after some time the zabbix server stops collecting some data and eats 100% cpu time.
the queue of items goes up (last time i had ~400 "ZABBIX agent" items outstanding for more than 5 minutes)
so i killed the zabbix process and restarted it. the outstanding 400 Zabbix agent items immediately began to decrease and now remain at 0.
had this issue multiple times, i think it had something to do with wrong DNS entries and hosts being unavailable for some time.

i'm using zabbix 1.4.5 on a fully up to date debian etch with a mysql database