Hi Zabbix developers,
I just wanted to report an additional performance problem with PostgreSQL 8.4.2 and Zabbix 1.8. Like others, we are experiencing the slow history clean-up query problem that is being discussed in another thread. But when we went to version 1.8 of Zabbix, some pages started taking much, much longer to load. In particular, one user was experiencing >30s load times for their screens. We turned on slow query logging on the database backend, and here was the problem query:
SELECT g.graphid FROM graphs g,graphs_items gi,items i,hosts_groups hg,rights r,users_groups ug WHERE (g.graphid/100000000000000) in (0) AND gi.graphid=g.graphid AND i.itemid=gi.itemid AND hg.hostid=i.hostid AND r.id=hg.groupid AND r.groupid=ug.usrgrpid AND ug.userid=20 AND r.permission>=2 AND NOT EXISTS( SELECT gii.graphid FROM graphs_items gii, items ii WHERE gii.graphid=g.graphid AND gii.itemid=ii.itemid AND EXISTS( SELECT hgg.groupid FROM hosts_groups hgg, rights rr, users_groups ugg WHERE ii.hostid=hgg.hostid AND rr.id=hgg.groupid AND rr.groupid=ugg.usrgrpid AND ugg.userid=20 AND rr.permission<2)) AND (g.graphid IN (2,3,4,5,386,387,969,389,971,972,973,446,447,448,449,450,451,471,456,470,473,477,472,474,475,476,478,479,480,481,482,483,484,459,614,655,658,645,490,492,489,493,496,495,498,497,499,501,500,502,974,558,559,562,566,563,564,565,567,568,569,570,571,535,572,573,534,536,538,539,540,541,542,543,544,545,537,546,547,548,552,553,554,555,556,549,550,551,557,577,578,579,580,574,576,581,835,587,588,589,590,560,561,836,591,592,593,594,595,827,389,495,498,497,597,598,599,975,978,999,1004,604,605,606,679,616,634,635,636,637,638,618,629,630,631,632,633,671,682,669,670,678,679,680,674,672,676,673,675,677,681,682,683,683,644,652,829,681,687,698,685,686,705,706,707,708,830,945,946,710,716,712,714,713,709,718,721,720,719,723,724,747,749,750,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,772,774,775,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,777,776,977,824,823,826,825,829,832,833,835,581,836,842,852,854,839,840,838,853,855,847,848,944,846,859,850,899,901,902,903,864,865,866,867,976,979,939,941,942,943,906,907,908,909,910,868,969,991,950,955,964,966,952,953,962,965,967,959,961,968,1001,1002,1003,986,987,988,994,995,996,1008,1006,1007,1009,1010));
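For reference, the slow query logging mentioned above can be enabled per database from SQL (run as a superuser); the database name "zabbix" and the 1000 ms threshold below are only illustrative assumptions, not anything Zabbix requires:
-- log every statement that runs longer than one second to the PostgreSQL log
ALTER DATABASE zabbix SET log_min_duration_statement = 1000;
-- the setting applies to new sessions; existing connections must reconnect to pick it up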
I logged a report to the PostgreSQL Performance mailing list with the following EXPLAIN ANALYZE of the slow query:
QUERY PLAN
...deleted explain to meet posting requirement
...
Total runtime: 78166.581 ms
(50 rows)
Here are the comments from one of the main PostgreSQL developers:
Kenneth Marshall <[email protected]> writes:
> We have just upgraded our monitoring server software and
> now the following query for graphing the data performs
> abysmally with the default settings. Here is the results
> of the EXPLAIN ANALYZE run with nestloops enabled:
I poked at this a bit more and now think I see where the problem is.
The thing that would be easiest for you to do something about is
the misestimation here:
> -> Nested Loop Anti Join (cost=94.07..10294.64 rows=1 width=8) (actual time=98.049..27907.702 rows=281 loops=1)
> Join Filter: (gii.graphid = g.graphid)
> -> Bitmap Heap Scan on graphs g (cost=94.07..233.17 rows=1 width=8) (actual time=0.529..1.772 rows=281 loops=1)
> Recheck Cond: (graphid = ANY ('{2,3,4,5,386,387,969,389,971,972,973,446,447,448,449,450,451,471,456,470,473,477,472,474,475,476,478,479,480,481,482,483,484,459,614,655,658,645,490,492,489,493,496,495,498,497,499,501,500,502,974,558,559,562,566,563,564,565,567,568,569,570,571,535,572,573,534,536,538,539,540,541,542,543,544,545,537,546,547,548,552,553,554,555,556,549,550,551,557,577,578,579,580,574,576,581,835,587,588,589,590,560,561,836,591,592,593,594,595,827,389,495,498,497,597,598,599,975,978,999,1004,604,605,606,679,616,634,635,636,637,638,618,629,630,631,632,633,671,682,669,670,678,679,680,674,672,676,673,675,677,681,682,683,683,644,652,829,681,687,698,685,686,705,706,707,708,830,945,946,710,716,712,714,713,709,718,721,720,719,723,724,747,749,750,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,772,774,775,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,777,776,977,824,823,826,825,829,832,833,835,581,836,842,852,854,839,840,838,853,855,847,848,944,846,859,850,899,901,902,903,864,865,866,867,976,979,939,941,942,943,906,907,908,909,910,868,969,991,950,955,964,966,952,953,962,965,967,959,961,968,1001,1002,1003,986,987,988,994,995,996,1008,1006,1007,1009,1010}'::bigint[]))
> Filter: ((graphid / 100000000000000::bigint) = 0)
> -> Bitmap Index Scan on graphs_pkey (cost=0.00..94.07 rows=246 width=0) (actual time=0.507..0.507 rows=294 loops=1)
> Index Cond: (graphid = ANY ('{2,3,4,5,386,387,969,389,971,972,973,446,447,448,449,450,451,471,456,470,473,477,472,474,475,476,478,479,480,481,482,483,484,459,614,655,658,645,490,492,489,493,496,495,498,497,499,501,500,502,974,558,559,562,566,563,564,565,567,568,569,570,571,535,572,573,534,536,538,539,540,541,542,543,544,545,537,546,547,548,552,553,554,555,556,549,550,551,557,577,578,579,580,574,576,581,835,587,588,589,590,560,561,836,591,592,593,594,595,827,389,495,498,497,597,598,599,975,978,999,1004,604,605,606,679,616,634,635,636,637,638,618,629,630,631,632,633,671,682,669,670,678,679,680,674,672,676,673,675,677,681,682,683,683,644,652,829,681,687,698,685,686,705,706,707,708,830,945,946,710,716,712,714,713,709,718,721,720,719,723,724,747,749,750,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,772,774,775,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,777,776,977,824,823,826,825,829,832,833,835,581,836,842,852,854,839,840,838,853,855,847,848,944,846,859,850,899,901,902,903,864,865,866,867,976,979,939,941,942,943,906,907,908,909,910,868,969,991,950,955,964,966,952,953,962,965,967,959,961,968,1001,1002,1003,986,987,988,994,995,996,1008,1006,1007,1009,1010}'::bigint[]))
The estimate of the ANY condition is not too bad (246 vs 294 actual). But it
hasn't got any ability to deal with the "(graphid / 100000000000000::bigint) = 0"
filter condition, and is falling back to a default selectivity estimate for
that, which IIRC is just 0.005 --- but actually, that condition doesn't
eliminate any rows at all. Do you need that condition in the first
place? Can you persuade your client-side software to eliminate it when
it's impossible based on the ANY list? Or at least recast it to
something more easily estimatable, like "graphid < 100000000000000"?
If you really have to have the condition just like that, I'd advise
creating an index on "(graphid / 100000000000000::bigint)". That would
cause ANALYZE to accumulate statistics on that expression, which'd
result in a far better estimate.
The reason that this misestimate hammers it so hard is that the
inside of the nestloop looks like
> -> Nested Loop (cost=0.00..17449.43 rows=1954 width=8) (actual time=99.304..99.304 rows=0 loops=281)
> -> Index Scan using graphs_items_2 on graphs_items gii (cost=0.00..69.83 rows=1954 width=16) (actual time=0.013..3.399 rows=1954 loops=281)
> -> Index Scan using items_pkey on items ii (cost=0.00..8.88 rows=1 width=8) (actual time=0.046..0.046 rows=0 loops=549074)
> Index Cond: (ii.itemid = gii.itemid)
> Filter: (alternatives: SubPlan 1 or hashed SubPlan 2)
> SubPlan 1
> -> Nested Loop (cost=0.00..7.83 rows=1 width=0) (actual time=0.040..0.040 rows=0 loops=549074)
> Join Filter: (rr.groupid = ugg.usrgrpid)
> -> Nested Loop (cost=0.00..6.67 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=549074)
> Join Filter: (hgg.groupid = rr.id)
> -> Index Scan using hosts_groups_1 on hosts_groups hgg (cost=0.00..4.27 rows=1 width=8) (actual time=0.003..0.005 rows=1 loops=549074)
> Index Cond: ($0 = hostid)
> -> Seq Scan on rights rr (cost=0.00..2.39 rows=1 width=16) (actual time=0.027..0.027 rows=0 loops=532214)
> Filter: (rr.permission < 2)
> -> Seq Scan on users_groups ugg (cost=0.00..1.15 rows=1 width=8) (never executed)
> Filter: (ugg.userid = 20)
> SubPlan 2
> -> Nested Loop (cost=2.34..8.13 rows=1 width=8) (never executed)
> -> Nested Loop (cost=0.00..3.55 rows=1 width=8) (never executed)
> Join Filter: (rr.groupid = ugg.usrgrpid)
> -> Seq Scan on rights rr (cost=0.00..2.39 rows=1 width=16) (never executed)
> Filter: (permission < 2)
> -> Seq Scan on users_groups ugg (cost=0.00..1.15 rows=1 width=8) (never executed)
> Filter: (ugg.userid = 20)
> -> Bitmap Heap Scan on hosts_groups hgg (cost=2.34..4.45 rows=11 width=16) (never executed)
> Recheck Cond: (hgg.groupid = rr.id)
> -> Bitmap Index Scan on hosts_groups_2 (cost=0.00..2.33 rows=11 width=0) (never executed)
> Index Cond: (hgg.groupid = rr.id)
The alternative subplans are variant implementations of the inner EXISTS
test. Unfortunately, it's choosing to go with the "retail" lookup, not
knowing that that's going to wind up being executed 549074 times.
If it had gone to the "wholesale" hashtable implementation, this would
probably have run a whole lot faster. (We have it in mind to allow the
executor to switch to the other implementation on-the-fly when it
becomes clear that the planner misjudged the rowcount, but it's not done
yet.)
However, I'm curious about your statement that this used to perform
better. Used to perform better on what, and what was the plan back
then? I don't believe that pre-8.4 PG would have done better than
8.4 on either of these points. It certainly wouldn't have won on the
EXISTS subplan, because before 8.4 that would always have been done in
the "retail" style.
regards, tom lane
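To make Tom's suggestion about recasting the condition concrete, here is the kind of rewrite he means, shown on a trimmed-down version of the graph query (a sketch only; the real change would have to go into the Zabbix frontend SQL, which we did not attempt):
-- current form: an expression the planner cannot estimate, so it falls back to a default selectivity
SELECT g.graphid FROM graphs g WHERE (g.graphid/100000000000000) IN (0) AND g.graphid IN (2,3,4,5);
-- recast form per Tom's suggestion: a plain range condition with usable column statistics
SELECT g.graphid FROM graphs g WHERE g.graphid < 100000000000000 AND g.graphid IN (2,3,4,5);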
As a short-term work-around, I created the recommended index on (graphid/100000000000000::bigint) with the following command:
CREATE INDEX fix_bad_stat_estimate ON graphs ((graphid / 100000000000000::bigint));
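Note that, as Tom points out, the planner only benefits once ANALYZE has gathered statistics on the indexed expression, so the table needs to be analyzed (or picked up by autovacuum) after the index is created:
ANALYZE graphs;  -- gather statistics on (graphid / 100000000000000::bigint) for the new index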
Afterwards, the query time dropped from 78 s to 217 ms, roughly 350 times faster. The same problem occurred with the following trigger-related query:
SELECT t.* FROM triggers t,functions f,items i,hosts_groups hg,rights r,users_groups ug,hosts h WHERE (t.triggerid/100000000000000) in (0) AND f.triggerid=t.triggerid AND f.itemid=i.itemid AND hg.hostid=i.hostid AND r.id=hg.groupid AND r.groupid=ug.usrgrpid AND ug.userid=2 AND r.permission>=2 AND NOT EXISTS( SELECT ff.triggerid FROM functions ff, items ii WHERE ff.triggerid=t.triggerid AND ff.itemid=ii.itemid AND EXISTS( SELECT hgg.groupid FROM hosts_groups hgg, rights rr, users_groups gg WHERE hgg.hostid=ii.hostid AND rr.id=hgg.groupid AND rr.groupid=gg.usrgrpid AND gg.userid=2 AND rr.permission<2)) AND i.hostid=h.hostid AND i.status=0 AND t.status=0 AND h.status=0 AND t.value=1 ORDER BY t.lastchange DESC LIMIT 20 OFFSET 0;
And again, creating the following work-around index sped up the query:
CREATE INDEX fix_bad_stat_estimate2 ON triggers ((triggerid / 100000000000000::bigint));
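As with the graphs table, the planner needs fresh statistics on the new expression before the estimate improves:
ANALYZE triggers;  -- gather statistics on (triggerid / 100000000000000::bigint)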
The final issue is an error message that we have not seen reported so far:
2010-01-18 21:04:03 CST ERROR: operator does not exist: character varying = integer at character 104
2010-01-18 21:04:03 CST HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
2010-01-18 21:04:03 CST STATEMENT: SELECT DISTINCT * FROM hosts WHERE (hostid/100000000000000) in (0) AND status IN (0,1,3) AND host IN (0)
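For what it's worth, the statement parses once the value compared against the host column (character varying) is quoted or cast; a sketch of the kind of change the frontend would need (whether '0' is really the intended value is for you to judge):
SELECT DISTINCT * FROM hosts WHERE (hostid/100000000000000) IN (0) AND status IN (0,1,3) AND host IN ('0');  -- quote the literal so it matches the varchar column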
Thank you for considering fixes to these issues in your next release. Please let me know if you need any help testing fixes.
Regards,
Ken