We've been using Zabbix for a while to monitor all sorts of things, and generally it works really well. Lately, though, about once a day it starts falling way behind: the values in the Administration -> Queue screen climb into the hundreds, and all of our hosts start showing up as dead.
We are monitoring ~150 hosts and ~10k items, and the required server performance value on the dashboard is ~130 new values per second. Most of the time everything works great.
I do not believe it is a hardware issue, as I have two reasonably specced boxes in the mix: the Zabbix server (v1.8.5) runs on one and the database (PostgreSQL 9.0) on the other. Each is a bare-metal box with 8 cores, 12 GB RAM, and a 4-disk 15k RPM RAID 10, running Ubuntu 10.04 LTS.
Generally these boxes run a load average of only 1-2, use a couple of GB of RAM, and are certainly not taxing the disks (going by vmstat and iostat -dx output). I rarely see iowait above 3-4% and it is usually 0.
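In case it matters, this is roughly how I'm sampling disk activity during an incident (just a couple of terminals left running, nothing fancy):

iostat -dx 5
vmstat 5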
When these queue backlogs happen, nothing looks different on the servers: no jump in CPU or memory, and the DB looks perfectly healthy, with no long-running queries present. The backlog does eventually clear on its own, but in the meantime we go 20-30 minutes with no monitoring or, even worse, with lots of false alarms going off.
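For what it's worth, this is roughly the check I run against pg_stat_activity when a backlog hits (column names per the PostgreSQL 9.0 catalog, where idle backends show current_query = '<IDLE>'):

SELECT procpid, now() - query_start AS runtime, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY runtime DESC;

Nothing long-running ever turns up while the queue is backed up.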
On the server running zabbix-server I have these settings in zabbix_server.conf:
CacheSize=1024M
HistoryCacheSize=1024M
HistoryTextCacheSize=1024M
TrendCacheSize=1024M
StartPollers=100
StartTrappers=100
Right now all of our items are of the passive variety, and I have no proxies or zabbix_sender setups in place. Is this simply too many items for a single server to collect? The fact that the hardware itself isn't being taxed makes me hope this isn't the case. Is there some configuration I could look into adjusting that might help?
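The only other knobs I've been eyeing are the history syncers and the timeout; if I'm reading the 1.8 documentation right, something along these lines is what I'd try next (not applied yet, so treat this as a guess on my part rather than anything I know works):

StartDBSyncers=8
Timeout=10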
I've tried to keep our configs very clean, and the common templates shared by many hosts are fairly lean, with update intervals generally of 60s or more. The only messages I see in the Zabbix server logs look harmless - occasional curl timeouts from web scripts, unsupported items that I haven't yet cleaned up, etc.
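One sanity check on the numbers: if my math is right, ~10k items at mostly 60-second intervals works out to at most ~167 new values per second, which lines up with the ~130 figure on the dashboard, so the raw volume doesn't strike me as outrageous for a single server.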
Any help is greatly appreciated.