DB Query makes Zabbix system unusable

  • Emilio Gil
    Junior Member
    • Oct 2024
    • 1

    #1

    OS: Rocky Linux 8
    Zabbix: 7.0.4
    PostgreSQL: 15.3-2
    Patroni: 3.0.2
    TimescaleDB: 2.16.1
    Python: 3.6.8

    Hello everyone.

    Last September I updated my Zabbix system from version 6 to 7.0.3. The upgrade went smoothly and everything worked just fine until last Friday, when IOPS, IOWAIT and history syncer usage skyrocketed (almost 4,900 IOPS, 39% IOWAIT and 88% syncer utilization).
    I've had minor trouble with IOPS before, but this case in particular left the system unusable; once system usage went back to normal, everything worked OK again as usual.

    I thought this was a bug caused by a version mismatch between Zabbix and TimescaleDB at the time, so on Monday I updated Zabbix to version 7.0.4, assuming the mismatch was the cause of the incident.

    When it happened again yesterday, I traced the problem to a query that had been executing for almost 3 hours for no apparent reason. The specific query is this one:

    SELECT h.itemid, h.value, h.clock, s.tick
    FROM history_uint h
    JOIN (
        SELECT h2.itemid, h2.clock, MAX(h2.ns) AS ns, s2.tick
        FROM history_uint h2
        JOIN (
            SELECT itemid,
                   CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800) AS tick,
                   MAX(clock) AS clock
            FROM history_uint
            WHERE itemid=533393 AND clock>='1728532602' AND clock<='1728554202'
            GROUP BY itemid, CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800)
        ) s2 ON h2.itemid=s2.itemid AND h2.clock=s2.clock
        GROUP BY h2.itemid, h2.clock, s2.tick
    ) s ON h.itemid=s.itemid AND h.clock=s.clock AND h.ns=s.ns
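
    In case it helps anyone reproduce the diagnosis, a query along these lines against pg_stat_activity is enough to spot such long-running statements (the 5-minute threshold is just an arbitrary example):

    -- List active statements that have been running longer than 5 minutes.
    -- The threshold is an arbitrary example; adjust it to your environment.
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 120) AS query_snippet
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC;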

    My first thought was that there was a problem with TimescaleDB chunk compression, since an EXPLAIN of the query showed an abnormal number of DecompressChunk phases, for example:

    (The full EXPLAIN output is almost 540 rows long; should anybody need it, I'll post it separately.)

    Hash Join  (cost=1255985308.38..1510110786.09 rows=42887811365 width=24)
      Hash Cond: ((s.itemid = h_1.itemid) AND (s.clock = h_1.clock) AND (s.ns = h_1.ns))
      ->  Subquery Scan on s  (cost=9492229.80..12165187.50 rows=8000000 width=24)
            ->  HashAggregate  (cost=9492229.80..12085187.50 rows=8000000 width=24)
                  Group Key: h2_1.itemid, h2_1.clock, (((_hyper_2_5191_chunk.clock)::bigint - mod((_hyper_2_5191_chunk.clock)::bigint, '1800'::bigint)))
                  Planned Partitions: 32
                  ->  Nested Loop  (cost=59.29..294804.62 rows=214439057 width=24)
                        ->  HashAggregate  (cost=56.72..60.72 rows=200 width=20)
                              Group Key: _hyper_2_5191_chunk.itemid, ((_hyper_2_5191_chunk.clock)::bigint - mod((_hyper_2_5191_chunk.clock)::bigint, '1800'::bigint))
                              ->  Result  (cost=0.57..55.06 rows=221 width=20)
                                    ->  Index Only Scan using "5191_2640_history_uint_pkey" on _hyper_2_5191_chunk  (cost=0.57..50.64 rows=221 width=12)
                                          Index Cond: ((itemid = 533393) AND (clock >= 1728532602) AND (clock <= 1728554202))
                        ->  Append  (cost=2.57..643.63 rows=83008 width=16)
                              ->  Custom Scan (DecompressChunk) on _hyper_2_4132_chunk h2_1  (cost=2.57..2.57 rows=1000 width=16)
                                    Filter: ((max(_hyper_2_5191_chunk.clock)) = clock)
                                    ->  Index Scan using compress_hyper_9_4219_chunk__compressed_hypertable_9_itemid__ts on compress_hyper_9_4219_chunk compress_hyper_9_4219_chunk_1  (cost=0.42..2.57 rows=1 width=96)
                                          Index Cond: (itemid = _hyper_2_5191_chunk.itemid)
                              ->  Custom Scan (DecompressChunk) on _hyper_2_4142_chunk h2_2  (cost=2.57..2.57 rows=1000 width=16)
                                    Filter: ((max(_hyper_2_5191_chunk.clock)) = clock)
                                    ->  Index Scan using compress_hyper_9_4229_chunk__compressed_hypertable_9_itemid__ts on compress_hyper_9_4229_chunk compress_hyper_9_4229_chunk_


    But it makes no sense to me, since the time range the query filters on falls within the chunk that was active at that time for the history_uint table, which therefore was not compressed. After I terminated the query, the system went back to normal.
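
    For reference, this is roughly how I checked the compression status of the chunks covering the queried range and then killed the backend. It is only a sketch against the TimescaleDB 2.x timescaledb_information.chunks view (history_uint is partitioned on the integer clock column, so the integer range columns apply, and 12345 is a placeholder pid):

    -- Which chunks overlap the queried clock range, and are they compressed?
    SELECT chunk_name, range_start_integer, range_end_integer, is_compressed
    FROM timescaledb_information.chunks
    WHERE hypertable_name = 'history_uint'
      AND range_end_integer >= 1728532602
      AND range_start_integer <= 1728554202;

    -- Terminate the offending backend (pid taken from pg_stat_activity;
    -- 12345 is a placeholder).
    SELECT pg_terminate_backend(12345);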

    Does anybody have any idea why this query behaves this way?

    Thanks in advance.
  • szydlowski
    Junior Member
    • Dec 2024
    • 3

    #2
    I had the same issue (Zabbix 7.0.6, TimescaleDB 2.17.2).

    First of all, to prevent such queries from immediately overloading the database, it’s worth tuning the temp_file_limit parameter in PostgreSQL. The problematic query came from the graph widget (svggraph) and resulted from using the Aggregation function (last or first) in combination with an overlay.
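
    For example, something like this (the 10GB cap is only an illustration; size it to your own disk budget):

    -- Cap per-process temporary file usage so a runaway spill fails fast
    -- instead of exhausting the disk. 10GB is only an example value.
    ALTER SYSTEM SET temp_file_limit = '10GB';
    SELECT pg_reload_conf();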

    In my opinion, adding a constraint in the final join, such as h.clock >= 'start' AND h.clock <= 'end', would help the optimizer avoid scanning all records. This would narrow the time range and increase the chances of the query executing without overloading the database.
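
    Concretely, the generated query would then end with something like this (a sketch only, reusing the item and time range from post #1; I have not verified this against the Zabbix source):

    SELECT h.itemid, h.value, h.clock, s.tick
    FROM history_uint h
    JOIN (
        SELECT h2.itemid, h2.clock, MAX(h2.ns) AS ns, s2.tick
        FROM history_uint h2
        JOIN (
            SELECT itemid,
                   CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800) AS tick,
                   MAX(clock) AS clock
            FROM history_uint
            WHERE itemid=533393 AND clock>='1728532602' AND clock<='1728554202'
            GROUP BY itemid, CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800)
        ) s2 ON h2.itemid=s2.itemid AND h2.clock=s2.clock
        GROUP BY h2.itemid, h2.clock, s2.tick
    ) s ON h.itemid=s.itemid AND h.clock=s.clock AND h.ns=s.ns
    -- Added constraint: repeat the time range on the outer scan so the
    -- planner can prune chunks instead of touching every partition.
    WHERE h.clock >= '1728532602' AND h.clock <= '1728554202';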

  • szydlowski
    Junior Member
    • Dec 2024
    • 3

    #3

    That killed my database

  • szydlowski
    Junior Member
    • Dec 2024
    • 3

    #4
    I checked what the query generated for this widget looked like in version 6.4. There is a difference: the 6.4 query has no join on ns, and as a result it does not scan all the partitions. A rough reconstruction is below.
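
    This is my reconstruction of the 6.4-style shape rather than the literal generated SQL (same item and time range as in post #1):

    -- Sketch of the 6.4-style query shape: the outer join is keyed on
    -- itemid and clock only, with no join on ns (reconstruction, not the
    -- literal SQL Zabbix 6.4 generates).
    SELECT h.itemid, h.value, h.clock, s.tick
    FROM history_uint h
    JOIN (
        SELECT itemid,
               CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800) AS tick,
               MAX(clock) AS clock
        FROM history_uint
        WHERE itemid=533393 AND clock>='1728532602' AND clock<='1728554202'
        GROUP BY itemid, CAST(clock AS BIGINT) - MOD(CAST(clock AS BIGINT), 1800)
    ) s ON h.itemid=s.itemid AND h.clock=s.clock;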
