Zabbix Queues
Alexei,
I screwed up entering the monitoring port for a number of new servers I recently added to Zabbix. Between entering the new servers and discovering my typo, the Zabbix server tried to poll them on the port I had entered by mistake, and now I have 298 tasks in the queue scheduled for 01/01/1970.
I have since corrected the problem and Zabbix is monitoring the servers on the correct port; however, I would like to remove the backlog from the queue.
How do I clear these?
Regards,
Ash
Hi Ash,
Are you sure zabbix_suckerd is running? Are all elements of the queue ICMP-ping related?
There is no such thing as a backlog in ZABBIX. The queue is just a list of items of monitored hosts that have to be updated immediately.
In the normal case, when no performance problems exist, the queue stays empty or nearly empty. The only exception could be ICMP-related (icmpping, icmppingsec) items. Such items may stay in the queue longer, up to 30 seconds, if you ping your hosts every 30 seconds.
Alexei,
Yes, confirmed zabbix_suckerd is running
No, not all elements of the queue are ICMP-ping related; some are disk space checks, some are CPU checks (ALL are ones that usually work on other hosts).
I know the ICMP ping problem you are referring to, as I suffered from it earlier when fping wasn't SETUID. This has a similar symptom (for example, all items in the queue have the same date as with the fping problem, 01/01/1970).
Normally the queue on this server is empty, and it performs all the same tests on other servers that it performs on these servers. It's just like I said: when I created this new lot of hosts, I screwed up the port for Zabbix to poll, and that appears to have caused the queue entries. As to why it defaults to 01/01/1970, I have no idea.
All of my other servers are working as desired but these new ones aren't, even after fixing the agent port number problem that I created earlier. For example, if I go into Latest values of any operational host (other than these newly created ones) I get the current values of any monitored parameter, but on these newly created hosts all fields are blank.
I've scheduled a bounce of the Zabbix server this evening but, failing that, is there anything else I could check?
Regards,
Ash
Hmm... I cannot understand how initially setting an incorrect port could screw something up.
Please do the following:
- select an item in question from the queue
- check status of corresponding host
Status of the host must be Monitored.
Also, I would suggest checking zabbix_suckerd's LogFile.
That's all I can help with for now. No more ideas so far.
Yep. For some reason the status had changed to 'not monitored' on those hosts (probably because I originally screwed up the port to check them on).
Is there any way to configure the rules used to decide when to stop trying to monitor a host?
thanks
charles
Hi, Charles,
Is there any way to configure the rules used to decide when to stop trying to monitor a host?
Currently, ZABBIX stops monitoring a host for 60 seconds if three (3) network errors occur. For example, the host is unreachable, the host name cannot be resolved (if no IP is used), etc.
Every 60 seconds, ZABBIX will try to restart monitoring of the unreachable host.
When a host is not monitored, the status of all triggers (except triggers related to the status of the host) is UNKNOWN.
The logic is hardcoded.
What would you like to change in the logic? Why? What could be configurable?
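The rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour (three network errors, then a 60-second pause, then a retry), not ZABBIX's actual code; the names HostState, should_poll and record_result are made up for the example.

```python
from dataclasses import dataclass

MAX_NETWORK_ERRORS = 3   # from the post: three network errors
RETRY_DELAY = 60         # from the post: seconds to wait before retrying


@dataclass
class HostState:
    """Hypothetical per-host bookkeeping, not a real ZABBIX structure."""
    errors: int = 0              # consecutive network errors seen so far
    disabled_until: float = 0.0  # time before which the host is skipped


def should_poll(state: HostState, now: float) -> bool:
    """Return True if the host may be polled at time `now`."""
    return now >= state.disabled_until


def record_result(state: HostState, ok: bool, now: float) -> None:
    """Update the host state after one poll attempt."""
    if ok:
        # A successful poll clears the error count and any pause.
        state.errors = 0
        state.disabled_until = 0.0
        return
    state.errors += 1
    if state.errors >= MAX_NETWORK_ERRORS:
        # Pause polling; the next attempt happens RETRY_DELAY seconds later.
        state.disabled_until = now + RETRY_DELAY
```

With this sketch, a host that fails at t=0, 1 and 2 seconds is skipped until t=62, and every further failure pushes the next attempt out by another 60 seconds. Making the two constants configurable is the kind of change being asked about.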
Hi Alexei
Thanks for the explanation. In my case I have had monitoring stop for two reasons.
1. There was a problem with a slow user parameter.
2. The box (being monitored) crashed and because zabbix_agentd was killed suddenly and the pid file existed, it never started at boot.
So, I think there needs to be a nicer solution to #2, but in either case it would be nice if I could configure it to try once a day or something in case it can start monitoring again without me waiting a week to realize it has stopped :) Because there have been times where I realized zabbix_agentd was down and started it, but didn't put the host back into monitored state again - so never collected data for a while longer.
Am I missing something?
charles
p.s. I am starting to wonder if I am not remembering the missing parameter issue properly - it may just stop monitoring that one user parameter, but I am not sure anymore.
p.p.s. The site has been faster for me all week, but right now it is unbearably slow - about 10 seconds to load this page.
2. The box (being monitored) crashed and because zabbix_agentd was killed suddenly and the pid file existed, it never started at boot.
....
Am I missing something?
I would suggest monitoring the availability of all hosts using a trigger expression similar to:
#Server {HOSTNAME} is unreachable
{host:status.last(0)}=2
In this case, if for some reason an agent is not running, you'll get a message.
p.p.s. The site has been faster for me all week, but right now it is unbearably slow - about 10 seconds to load this page.
This is because of network speed. www.zabbix.com (http://www.zabbix.com) itself is working perfectly, at least for me :(
Yes, I think the problem is that our agent-based template must have a different check that is either not configured properly or not set up to notify. A lot of our machines are standalone and those are fine.
Looking forward to 1.1! I need to set up dependencies properly when it comes out :)
This is because of network speed. The www.zabbix.com (http://www.zabbix.com) itself is working perfectly, at least for me:(
I know :)
Hi Ash,
Are you sure zabbix_suckerd is running? Are all elements of the queue ICMP-ping related?
There is no such thing as a backlog in ZABBIX. The queue is just a list of items of monitored hosts that have to be updated immediately.
In the normal case, when no performance problems exist, the queue stays empty or nearly empty. The only exception could be ICMP-related (icmpping, icmppingsec) items. Such items may stay in the queue longer, up to 30 seconds, if you ping your hosts every 30 seconds.
Hi Alexei
I have opened this thread up since I have this exact problem and I can't resolve it (although in my case I have 4682 in my queue).
Suckerd is running and checking a few hosts that have entries in the queue with today's date.
The queue items are a mixture of different checks, not just ICMP-related ones.
Some have the date 12.31.1969 19:00:00, some on other days, but the vast majority are from the 11th of this month.
I have spot checked hosts, and they are all in state "Monitored" and the checks are "Active".
How can I clear the queue and get zabbix monitoring again?
thanks
charles
p.s, this is v1.0beta14 and the status check took a very long time to run..
Is zabbix_suckerd running? Yes
Is zabbix_trapperd running? Yes
Number of values stored: 114580174
Number of trends stored: 24283583
Number of alarms: 21469
Number of alerts: 769
Number of triggers (enabled/disabled): 3502 (3501/1)
Number of items (active/trapper/not active/not supported): 6434 (5699/0/85/650)
Number of users: 16
Number of hosts (monitored/not monitored): 297 (270/2)
Hi Charles,
Some have the date 12.31.1969 19:00:00, some on other days, but the vast majority are from the 11th of this month.
I have spot checked hosts, and they are all in state "Monitored" and the checks are "Active".
How can I clear the queue and get zabbix monitoring again?
...
p.s, this is v1.0beta14 and the status check took a very long time to run..
The queue lists all items whose next check date is in the past. Could you please check whether the items with the date 12.31.1969 (the default date) have type 'ZABBIX agent'? Is there anything special about the items?
The status check took a very long time because InnoDB is used. In this case, MySQL does a sequential scan of all data in a table to get the result of "select count(*) from <table>".
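A minimal Python sketch of that selection rule, which also shows where the odd 1969/1970 dates come from. It assumes the next check time is kept as a Unix timestamp and that 0 means "never scheduled"; the rows, the host name future-host and the helper names are invented for illustration. A timestamp of 0 is 01.01.1970 00:00:00 in UTC, which a server five hours behind UTC displays as 12.31.1969 19:00:00.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical queue rows: (next check as Unix time, host, item description).
items = [
    (0, "VZD2", "ICMP Ping Seconds"),               # assumed default: never scheduled
    (1108099742, "kt4c", "SSH server is running"),  # 02.11.2005 00:29:02 at UTC-5
    (4102444800, "future-host", "Processor load"),  # year 2100, not yet due
]


def overdue(rows, now):
    """Return the rows whose next check time is already in the past."""
    return [row for row in rows if row[0] < now]


def fmt(ts, utc_offset_hours):
    """Format a Unix timestamp as MM.DD.YYYY HH:MM:SS in a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = (EPOCH + timedelta(seconds=ts)).astimezone(tz)
    return local.strftime("%m.%d.%Y %H:%M:%S")


print(fmt(0, 0))    # 01.01.1970 00:00:00
print(fmt(0, -5))   # 12.31.1969 19:00:00
```

Under these assumptions, an item that has never been checked successfully always counts as overdue, so it stays in the queue until its first check completes.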
All the 12.31.1969 ones are ICMP. A sample...
12.31.1969 19:00:00 VZD2 ICMP Ping Seconds
12.31.1969 19:00:00 Power_D3-11 ICMP Ping
12.31.1969 19:00:00 VZ4 ICMP Ping Seconds
12.31.1969 19:00:00 ALER-MIKROTIK ICMP Ping Seconds
I then have some oddballs
05.10.2004 10:49:36 DTG162 ICMP Ping Seconds
05.10.2004 17:08:03 DTG162 ICMP Ping
09.03.2004 03:41:58 DTG102 ICMP Ping Seconds
09.07.2004 18:52:15 DTG97 ICMP Ping Seconds
09.07.2004 18:52:15 DTG100 ICMP Ping Seconds
09.07.2004 18:52:16 DTG99 ICMP Ping Seconds
09.07.2004 18:52:16 DTG98 ICMP Ping Seconds
11.04.2004 10:48:23 DTG133 ICMP Ping Seconds
11.04.2004 10:48:24 DTG31 ICMP Ping Seconds
11.04.2004 10:51:15 DTG197 ICMP Ping Seconds
12.06.2004 12:19:05 NLAY-EQCHI-1 ICMP Ping Seconds
12.30.2004 16:22:35 DTG185 ICMP Ping Seconds
01.20.2005 21:46:34 DTG195 ICMP Ping Seconds
02.03.2005 10:55:27 VZArray1 ICMP Ping Seconds
But then I have the rest on 02.11.2005, and they appear to be type 'ZABBIX agent' but cover all kinds of checks.
02.11.2005 00:29:02 kt4c SSH server is running
02.11.2005 00:29:02 kt4c Free number of inodes on /usr
02.11.2005 00:29:41 kt4c Incoming traffic on interface eth0 (1min)
02.11.2005 00:29:41 kt4c Outgoing traffic on interface eth1 (1min)
02.11.2005 00:29:41 kt4c Processor load
02.11.2005 00:29:41 kt4c Incoming traffic on interface eth1 (1min)
02.11.2005 00:29:41 kt4c Outgoing traffic on interface lo (1min)
02.11.2005 00:29:41 kt4c Incoming traffic on interface lo (1min)
02.11.2005 00:29:41 kt4c Outgoing traffic on interface eth0 (1min)
.....
It appears Zabbix is not collecting agent data for most hosts, and up/down alerts etc. are not working either. This is why I want to get the queue cleared, so I can see what's getting done. Only a very few hosts appear to be getting checked right now, and they all seem to be ICMP.
The status check took a very long time because InnoDB is used. In this case, MySQL does a sequential scan of all data in a table to get the result of "select count(*) from <table>".
mine are all MyISAM though :)
charles
Alexei, do you have any suggestions on how to recover from this? Zabbix is effectively down for me and I need to get it working again. Nothing is getting monitored and nothing is graphing.
Short of blowing away all my data and starting clean, what can I do to fix this?
thanks
charles
I really have no idea what happened. If it worked before and then suddenly stopped working, then something has obviously changed. I'd check:
1. Available disk space for ZABBIX database (who knows?)
2. LogFile of both zabbix_suckerd and zabbix_trapperd
3. If you see "insert ... failed" in a LogFile, it means your database is corrupted. Do 'repair table ...'.
Let me know if it helped.
I checked all these already :(
Plenty of disk space and no errors in the logs. Even at debug level 4, I don't see anything failing other than expected things like timeouts contacting a host.
I have had database corruption in the past, which is why I am trying to find out where the queue info comes from, so I can clean up or see which tables need repairing. It literally took days to repair the history table last time and dogged the box (which we rely on heavily), so I don't want to take the shotgun approach and repair all of them. But I guess I'm out of options.
thanks anyway
charles