Hello,
I have a problem with our Zabbix server. We first ran Zabbix 1.6 on a machine with a dying disk controller, so we decided to move to another machine and upgrade to 1.8.1.
The new machine runs CentOS 5.3 on a P4 3 GHz with 2 GB of RAM. Since it only has to monitor around 15 hosts, I'd say this machine should be fast enough. Also, the item intervals are around 5 minutes.
But the machine is awfully slow! The load varies from 1 to 10, caused by Postgres, and the web frontend's dashboard doesn't even load. I've searched this forum and tried a few things, but nothing helps.

I've tried to optimize some Postgres parameters (and also turned on autovacuum).
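For reference, this is the kind of postgresql.conf tuning I mean. These values are only illustrative guesses for a 2 GB machine, not measured settings; in 8.1, shared_buffers and effective_cache_size are given in 8 kB pages, and autovacuum does nothing unless the stats collector options are also enabled:

```
# postgresql.conf (PostgreSQL 8.1) -- illustrative starting points, not tuned values
shared_buffers = 32000            # ~250 MB in 8 kB pages; 8.1 default is tiny
effective_cache_size = 131072     # ~1 GB; hint about how much the OS caches
autovacuum = on
stats_start_collector = on        # both of these are required
stats_row_level = on              # for autovacuum to work in 8.1
```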
Could some indexes be missing?
Or what else could it be?
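One way to sanity-check the missing-index theory (a diagnostic sketch, assuming the stats collector is running) is to see which tables are mostly sequentially scanned, and what indexes Zabbix actually created:

```
-- Tables read mostly by sequential scan are candidates for a missing index
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

-- List the indexes that exist on the items table
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'items';
```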
Zabbix 1.8.1
Postgres 8.1
Software RAID capable of 80 MB/s
Top:
Code:
top - 14:24:15 up 7 days, 45 min, 2 users, load average: 7.94, 8.28, 6.32
Tasks: 156 total, 7 running, 149 sleeping, 0 stopped, 0 zombie
Cpu(s): 91.4%us, 8.1%sy, 0.0%ni, 0.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2075560k total, 1928992k used, 146568k free, 193984k buffers
Swap: 2096376k total, 144k used, 2096232k free, 1479164k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
18373 postgres 18   0 22176  11m   9m R 74.4  0.6  1:23.34 postmaster
18397 postgres 17   0 22752  11m  10m R 72.4  0.6  2:14.27 postmaster
19305 postgres 18   0 23252  11m   9m R 49.8  0.6  0:32.31 postmaster
18376 postgres 16   0 22520  11m  10m S  1.0  0.6  0:09.67 postmaster
19358 root     15   0  2324 1048  796 R  0.7  0.1  0:00.05 top
  404 root     10  -5     0    0    0 S  0.3  0.0  0:23.17 md1_raid1
After checking some queries, the problem seems to lie in the items table: counting it is, relative to the other tables, very slow for its size.
Code:
zabbix=> select count(*) from items;
2044
Time: 155.076 ms
zabbix=> select count(*) from history;
2310746
Time: 1632.498 ms
zabbix=> select count(*) from history_uint;
5012448
Time: 3421.886 ms
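155 ms to count 2,044 rows hints at table bloat: if the table was updated heavily but never vacuumed, a count has to scan many dead tuples. A rough check (a sketch; relpages/reltuples are only as fresh as the last ANALYZE) is to compare pages on disk against live rows:

```
-- Many pages for few rows suggests bloat from dead tuples
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('items', 'history', 'history_uint');

-- Mark dead space reusable and refresh planner statistics in one pass
VACUUM ANALYZE items;
```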
Explain plan:
Code:
zabbix=> explain analyze
SELECT DISTINCT g.* FROM groups g, hosts_groups hg, hosts h
WHERE ((g.groupid BETWEEN 000000000000000 AND 099999999999999))
  AND hg.groupid=g.groupid
  AND h.hostid=hg.hostid
  AND h.status=0
  AND EXISTS(
    SELECT t.triggerid FROM items i, functions f, triggers t
    WHERE i.hostid=hg.hostid AND i.status=0
      AND i.itemid=f.itemid AND f.triggerid=t.triggerid
      AND t.status=0);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=2668.39..2668.41 rows=1 width=158) (actual time=19601.806..19601.852 rows=6 loops=1)
-> Sort (cost=2668.39..2668.40 rows=1 width=158) (actual time=19601.800..19601.820 rows=10 loops=1)
Sort Key: g.groupid, g.name, g.internal
-> Nested Loop (cost=0.00..2668.38 rows=1 width=158) (actual time=373.403..19601.703 rows=10 loops=1)
Join Filter: ("inner".hostid = "outer".hostid)
-> Nested Loop (cost=0.00..2665.88 rows=1 width=166) (actual time=373.320..19598.309 rows=48 loops=1)
-> Seq Scan on groups g (cost=0.00..1.18 rows=1 width=158) (actual time=0.013..0.053 rows=12 loops=1)
Filter: ((groupid >= 0) AND (groupid <= 99999999999999::bigint))
-> Index Scan using hosts_groups_2 on hosts_groups hg (cost=0.00..2664.69 rows=1 width=16) (actual time=317.854..1633.172 rows=4 loops=12)
Index Cond: (hg.groupid = "outer".groupid)
Filter: (subplan)
SubPlan
-> Nested Loop (cost=170.49..2660.01 rows=1 width=8) (actual time=369.762..369.762 rows=1 loops=53)
-> Hash Join (cost=170.49..2647.99 rows=2 width=8) (actual time=369.666..369.726 rows=1 loops=53)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on functions f (cost=0.00..1994.32 rows=96632 width=16) (actual time=0.005..1.039 rows=453 loops=52)
-> Hash (cost=170.44..170.44 rows=23 width=8) (actual time=368.193..368.193 rows=40 loops=53)
-> Bitmap Heap Scan on items i (cost=79.94..170.44 rows=23 width=8) (actual time=367.894..368.040 rows=40 loops=53)
Recheck Cond: ((status = 0) AND (hostid = $0))
-> BitmapAnd (cost=79.94..79.94 rows=23 width=0) (actual time=367.618..367.618 rows=0 loops=53)
-> Bitmap Index Scan on items_status_index (cost=0.00..29.84 rows=4527 width=0) (actual time=359.337..359.337 rows=796458 loops=53)
Index Cond: (status = 0)
-> Bitmap Index Scan on items_1 (cost=0.00..49.84 rows=4527 width=0) (actual time=1.715..1.715 rows=122 loops=53)
Index Cond: (hostid = $0)
-> Index Scan using triggers_pkey on triggers t (cost=0.00..6.00 rows=1 width=8) (actual time=0.018..0.018 rows=1 loops=53)
Index Cond: ("outer".triggerid = t.triggerid)
Filter: (status = 0)
-> Seq Scan on hosts h (cost=0.00..2.49 rows=1 width=8) (actual time=0.017..0.046 rows=11 loops=48)
Filter: (status = 0)
Total runtime: 19602.220 ms
(30 rows)
After running ANALYZE on the database, the problem seems to be gone. Does anyone have an idea how that can happen?
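A plausible explanation, going by the plan above: the estimates are badly off (e.g. the seq scan on groups estimates rows=1 but returns 12, and functions is estimated at 96,632 rows), which points to stale planner statistics. ANALYZE rebuilds those statistics, so the planner can pick better plans afterwards. On 8.1, autovacuum is supposed to do this automatically, but it silently does nothing unless the stats collector is enabled, which can be checked with:

```
-- ANALYZE refreshes the per-column statistics behind the planner's row estimates
ANALYZE;

-- On 8.1, autovacuum requires both of these to be on
SHOW stats_start_collector;
SHOW stats_row_level;
```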