Slow query on history_uint after primary key update

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • brice.lopez
    Junior Member
    • Mar 2023
    • 4

    #1

    Slow query on history_uint after primary key update

    Hello,

    We're managing a medium-sized Zabbix instance monitoring ~3000 servers, storing its data in PostgreSQL. As usual, history and history_uint are the big tables, with history_uint reaching a few tens of gigabytes with 10 days of history on most metrics.

    We recently upgraded the history* tables to add a primary key, following the procedure in the Zabbix documentation. Since then, we've been observing a very long-running query on history_uint, so CPU-intensive that we have to regularly delete the oldest rows to keep the monitoring working.

    I first tried to optimize the PostgreSQL server configuration, among other things tuning work_mem to avoid spilling to a temporary file on disk while processing this huge query. However, I wasn't able to lower the load, and each time the history_uint table fills back up, we start to see the same query exhausting the resources again.

    The query, which asks for the MAX timestamp of each configured web scenario (full query here):

    Code:
    SELECT h.itemid, MAX(h.clock) AS clock FROM history_uint h WHERE (h.itemid IN (132523,132811,...) OR h.itemid IN (...) ...)  AND h.clock > 1678921992 GROUP BY h.itemid
    If I split this huge query into five smaller queries, one per "OR" occurrence, it runs fine. Am I missing something obvious? Is it normal for the Zabbix server to issue such a big query, requiring so much memory and CPU? Is there a way to manage it, maybe by partitioning the tables with TimescaleDB? I was also thinking about an index on itemid and clock DESC to get the latest timestamp without killing the CPU; a sketch of that index is below.
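
    The index I have in mind would look roughly like this (just a sketch, not tested yet, and the index name is arbitrary):

    Code:
    -- hypothetical index to serve "latest clock per itemid" lookups
    CREATE INDEX CONCURRENTLY history_uint_itemid_clock_desc
        ON history_uint (itemid, clock DESC);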

    The schema of the problematic table:

    Code:
    zabbix=# \d history_uint
                     Table "zabbix.history_uint"
     Column |     Type      | Collation | Nullable |   Default    
    --------+---------------+-----------+----------+--------------
     itemid | bigint        |           | not null |
     clock  | integer       |           | not null | 0
     value  | numeric(20,0) |           | not null | '0'::numeric
     ns     | integer       |           | not null | 0
    Indexes:
        "history_uint_pkey" PRIMARY KEY, btree (itemid, clock, ns)
  • brice.lopez
    Junior Member
    • Mar 2023
    • 4

    #2
    Update: the index doesn't help, and I had to manually remove the oldest data again. Does anybody know whether this single query asking for MAX(clock) of more than 4400 itemids at the same time is expected? I don't get why this kind of huge query happens only for web scenarios. It seems like a bug, but I'd like to make sure it's not a configuration mistake before opening a bug report.

    Comment

    • tim.mooney
      Senior Member
      • Dec 2012
      • 1427

      #3
      I'm not trying to be a jerk, but you've told us a lot about the problem you're observing and absolutely nothing about your environment. That may be relevant.

      What version of Zabbix are you running on the server?

      You've said how many clients you have, but it would perhaps be more useful to know your NVPS (new values per second).

      What OS (or distro) and version?

      What version of PostgreSQL?

      Dedicated database server, or combined Zabbix server + DB on one host?

      What PostgreSQL tuning has been done, if any?

      Comment

      • brice.lopez
        Junior Member
        • Mar 2023
        • 4

        #4
        Hi Tim,

        We're running the official Zabbix Docker image 6.0.11 on a Debian 11.3 server, handling on average 993 NVPS. The database is hosted on a dedicated Debian 11.3 VM with 64 GB of memory, SSD storage, and 8 Xeon 4208 CPU cores, running PostgreSQL 13.5.

        Here is the full postgresql.conf : https://paste.chapril.org/?3c58d5cd5...EYFDfM2PZaQfuz

        Comment

        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4807

          #5

          For that version I would go as high on the DB version as possible (15.x)... you are barely over the lower limit... But as the DB is on a dedicated box, that should still be able to serve your workload... Config looks ok...

          Is housekeeping enabled? It should delete old data for you...

          I am no DBA, but do you see vacuuming happening in the DB logs? All this DB tuning is tricky business...
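
          If it helps, something like this should show when autovacuum last touched the history tables (just a sketch using the standard pg_stat_user_tables view, adjust the filter to your schema):

          Code:
          SELECT relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
          FROM pg_stat_user_tables
          WHERE relname LIKE 'history%'
          ORDER BY n_dead_tup DESC;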

          Comment

          • tim.mooney
            Senior Member
            • Dec 2012
            • 1427

            #6
            Thanks for providing that additional info about your environment.

            A dedicated database server with resources like you've outlined should (I think) be able to handle your config. I do think it's weird that the problem only began after you added the new primary key. It sort of implies that PostgreSQL is somehow mis-optimizing the query, but that's just a wild guess.

            I'm far more experienced with MySQL/MariaDB as a Zabbix backend. We don't have much Postgres in my environment, and what we do have serves small applications, so I've only done the smallest bit of optimization. I tried a couple of different PostgreSQL "tuners" sized for your system, and they produced config very similar to what you have, including for shared_buffers and work_mem, so if it's a DB tuning issue, I'm not experienced enough to spot it.

            Since you're experienced enough to capture the query and try alternate versions that perform better, have you tried using the query analyzer on the original query, to see if you can spot where things are going wrong?

            As you know, primary keys on those tables are new in 6.0.x. It's possible that there's some inefficiency with that specific query that other sites haven't run into yet, either because they don't have as many web scenario items, they're not using the PostgreSQL backend, or something similar. You could try reporting the issue via the free bug tracker (support.zabbix.com). Without a support contract you may not get a response. Still, it seems like you've done your homework here and this isn't some simple misconfiguration. If other people start experiencing the problem, they might find your reported issue.

            I wish I had a better suggestion as to what the problem is. I do think an EXPLAIN of the original query might be required.
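
            Something along these lines, with the full itemid lists from your first post pasted in (a sketch; note that ANALYZE actually executes the query, so run it when the load allows):

            Code:
            EXPLAIN (ANALYZE, BUFFERS)
            SELECT h.itemid, MAX(h.clock) AS clock
            FROM history_uint h
            WHERE (h.itemid IN (132523,132811) /* ...rest of the itemid lists... */)
              AND h.clock > 1678921992
            GROUP BY h.itemid;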

            As you make progress with the issue, please follow up here and let us know where the debugging takes you. It's an interesting (and complicated) problem.

            Comment

            • brice.lopez
              Junior Member
              • Mar 2023
              • 4

              #7
              Hello,

              Thank you both for your suggestions. A quick follow-up.

              First, I did a few optimizations on the DB to maximize performance. I increased shared_buffers to 20 GB to keep most of the hot data in memory. I then realized big queries were still spilling to temporary files on disk, so I increased work_mem to just below the point where "out of memory" entries appear in the logs, which solved that issue. The sketch below shows roughly how the settings were applied.
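
              For reference (a sketch, not our literal commands; shared_buffers needs a restart, and the work_mem value here is only illustrative since it applies per sort/hash operation):

              Code:
              ALTER SYSTEM SET shared_buffers = '20GB';   -- requires a server restart
              ALTER SYSTEM SET work_mem = '256MB';        -- illustrative value, tune per workload
              SELECT pg_reload_conf();

              -- check whether big queries are still spilling to disk
              SELECT datname, temp_files, temp_bytes
              FROM pg_stat_database
              WHERE datname = 'zabbix';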

              We were still seeing performance issues with Postgres. Digging into the logs and the graphs, I identified a few more bottlenecks (the pg_stat_bgwriter sketch after this list is roughly how I spotted the first two):
              - quite often, the background writer was not fast enough to handle all the dirty buffers, and some backend processes had to do the writes themselves
              - max_wal_size was not big enough, triggering checkpoints more often than the default 5 min interval
              - checkpoint time was high during load peaks
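
              A sketch of the check (column names from the PostgreSQL 13 pg_stat_bgwriter view):

              Code:
              -- buffers_backend high relative to buffers_clean     -> backends writing instead of the background writer
              -- checkpoints_req high relative to checkpoints_timed -> checkpoints forced by max_wal_size
              SELECT checkpoints_timed, checkpoints_req,
                     buffers_checkpoint, buffers_clean, maxwritten_clean,
                     buffers_backend
              FROM pg_stat_bgwriter;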

              We improved performance by (a sketch of the corresponding settings follows this list):
              - Moving the WAL files to SSD. They were on a dedicated HDD partition, which was obviously a mistake.
              - Increasing max_wal_size to 4 GB, so that checkpoints effectively happen once every 5 min
              - Increasing bgwriter_lru_maxpages to 150 and bgwriter_lru_multiplier to 3.0 to let the background writer take care of the writes
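
              Roughly (a sketch; these parameters only need a reload, not a restart):

              Code:
              ALTER SYSTEM SET max_wal_size = '4GB';
              ALTER SYSTEM SET bgwriter_lru_maxpages = 150;
              ALTER SYSTEM SET bgwriter_lru_multiplier = 3.0;
              SELECT pg_reload_conf();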

              It helped, but we still had performance issues. I then noticed two things:
              - The load spike which triggered the performance collapse was happening at 9 am
              - The number of requests decreased when DB performance improved

              It looked like a script running on an employee computer was making high-frequency requests: fewer requests when performance improved could point to a script that doesn't handle timeouts or errors well. I started to log the queries to the Zabbix web frontend, including the POST content, and we identified a few scripts doing heavy or high-frequency queries. But fixing those didn't decrease the load on the DB.

              Once a failing script was ruled out, most of the remaining requests were related to the dashboards. I realized each custom dashboard has a "web monitoring" widget with a refresh interval of 1 min. Most of them, built by the operations teams, were filtered to display only their own customers. But the main dashboard (dashboardid = 1) was displaying the global metrics without any filter, which triggered a huge query over thousands of web monitoring checks each time the page was displayed or refreshed (every minute).
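
              To see which dashboards carried that widget, I queried the dashboard tables directly; a sketch, assuming the Zabbix 6.0 schema (dashboard_page, widget) and that the web monitoring widget is stored with type = 'web', so double-check against your own database:

              Code:
              -- list web monitoring widgets per dashboard (table and type names assumed)
              SELECT dp.dashboardid, w.widgetid, w.name
              FROM widget w
              JOIN dashboard_page dp ON dp.dashboard_pageid = w.dashboard_pageid
              WHERE w.type = 'web';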

              Removing this widget and expiring the sessions to force everybody to refresh the page decreased the load massively. We are now planning to partition the tables using TimescaleDB.
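
              The partitioning would be along these lines (a sketch of the TimescaleDB create_hypertable() call for an integer epoch column; Zabbix ships its own timescaledb.sql migration script, so we'll follow the official procedure rather than this exact statement):

              Code:
              -- one chunk per day on the clock column; migrate_data moves the existing rows
              SELECT create_hypertable('history_uint', 'clock',
                                       chunk_time_interval => 86400,
                                       migrate_data => true);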

              Comment


              • tim.mooney
                tim.mooney commented
                Really great post! Thanks for following up and sharing what you found. I'm glad you were able to identify the root cause.