OS: Rocky Linux 8
Zabbix: 7.0.4
PostgreSQL: 15.3-2
Patroni: 3.0.2
TimescaleDB: 2.16.1
Python: 3.6.8
Hello everyone.
Last September I upgraded my Zabbix system from version 6 to 7.0.3. The upgrade went smoothly and everything worked just fine until last Friday, when IOPS, IOWAIT and history syncer usage skyrocketed (almost 4900 IOPS, 39% IOWAIT and 88% syncer utilization).
I've had minor IOPS troubles before, but this case in particular left the system unusable; once the load dropped back to normal, everything worked fine again as usual.
At the time I thought this was a bug caused by a version mismatch between Zabbix and TimescaleDB, so on Monday I updated Zabbix to version 7.0.4, assuming that was the cause of the incident.
When it happened again yesterday, I traced the problem to a query that had been running for almost 3 hours for no apparent reason. The specific query is this one:
SELECT h.itemid, h.value, h.clock, s.tick
FROM history_uint h
JOIN (
    SELECT h2.itemid, h2.clock, MAX(h2.ns) AS ns, s2.tick
    FROM history_uint h2
    JOIN (
        SELECT itemid,
               CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800) AS tick,
               MAX(clock) AS clock
        FROM history_uint
        WHERE itemid=533393 AND clock>='1728532602' AND clock<='1728554202'
        GROUP BY itemid, CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800)
    ) s2 ON h2.itemid=s2.itemid AND h2.clock=s2.clock
    GROUP BY h2.itemid, h2.clock, s2.tick
) s ON h.itemid=s.itemid AND h.clock=s.clock AND h.ns=s.ns
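In case it's useful to anyone hitting the same issue, a check along these lines is enough to spot (and, if necessary, cancel) such a long-running statement. This is only a sketch against PostgreSQL's standard pg_stat_activity view; the 30-minute threshold is arbitrary:

-- List statements that have been running for more than 30 minutes
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120) AS query_text
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '30 minutes'
ORDER BY runtime DESC;

-- Cancel or terminate a runaway statement (replace <pid> with the offending pid):
-- SELECT pg_cancel_backend(<pid>);      -- polite cancel
-- SELECT pg_terminate_backend(<pid>);   -- forceful, drops the connection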
My first thought was that there was a problem with TimescaleDB chunk compression, since I ran EXPLAIN on the query and saw an abnormal number of DecompressChunk nodes in the plan, for example:
(The full EXPLAIN output is almost 540 rows long; should anybody need it, I'll post it separately.)
Hash Join (cost=1255985308.38..1510110786.09 rows=42887811365 width=24)
Hash Cond: ((s.itemid = h_1.itemid) AND (s.clock = h_1.clock) AND (s.ns = h_1.ns))
-> Subquery Scan on s (cost=9492229.80..12165187.50 rows=8000000 width=24)
-> HashAggregate (cost=9492229.80..12085187.50 rows=8000000 width=24)
Group Key: h2_1.itemid, h2_1.clock, (((_hyper_2_5191_chunk.clock)::bigint - mod((_hyper_2_5191_chunk.clock)::bigint, '1800'::bigint)))
Planned Partitions: 32
-> Nested Loop (cost=59.29..294804.62 rows=214439057 width=24)
-> HashAggregate (cost=56.72..60.72 rows=200 width=20)
Group Key: _hyper_2_5191_chunk.itemid, ((_hyper_2_5191_chunk.clock)::bigint - mod((_hyper_2_5191_chunk.clock)::bigint, '1800'::bigint))
-> Result (cost=0.57..55.06 rows=221 width=20)
-> Index Only Scan using "5191_2640_history_uint_pkey" on _hyper_2_5191_chunk (cost=0.57..50.64 rows=221 width=12)
Index Cond: ((itemid = 533393) AND (clock >= 1728532602) AND (clock <= 1728554202))
-> Append (cost=2.57..643.63 rows=83008 width=16)
-> Custom Scan (DecompressChunk) on _hyper_2_4132_chunk h2_1 (cost=2.57..2.57 rows=1000 width=16)
Filter: ((max(_hyper_2_5191_chunk.clock)) = clock)
-> Index Scan using compress_hyper_9_4219_chunk__compressed_hypertable_9_itemid__ts on compress_hyper_9_4219_chunk compress_hyper_9_4219_chunk_1 (cost=0.42..2.57 rows=1 width=96)
Index Cond: (itemid = _hyper_2_5191_chunk.itemid)
-> Custom Scan (DecompressChunk) on _hyper_2_4142_chunk h2_2 (cost=2.57..2.57 rows=1000 width=16)
Filter: ((max(_hyper_2_5191_chunk.clock)) = clock)
-> Index Scan using compress_hyper_9_4229_chunk__compressed_hypertable_9_itemid__ts on compress_hyper_9_4229_chunk compress_hyper_9_4229_chunk_
But it makes no sense to me, since the time range the query filters on matches the chunk that was active at that time for the history_uint table, and that chunk was therefore not compressed. After terminating the query, the system went back to normal.
Does anybody have any idea why this query behaves this way?
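For what it's worth, the compression state of the chunks covering that interval can be double-checked directly. This is a sketch against timescaledb_information.chunks, assuming Zabbix's default integer-based clock partitioning (for time-based hypertables the range_*_integer columns would be NULL):

-- List the history_uint chunks overlapping the query's clock range and whether they are compressed
SELECT chunk_name,
       range_start_integer,
       range_end_integer,
       is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'history_uint'
  AND range_start_integer <= 1728554202
  AND range_end_integer   >= 1728532602
ORDER BY range_start_integer;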
Thanks in advance.