Ad Widget

**brice.lopez** · 24-03-2023, 13:37

Update : the index doesn't help, I had to manually remove the oldest data again. Does anybody knows if this single request asking for MAX(clock) of more than 4400 itemid at the same time is expected? I don't get why this kind of huge request is happening only on web scenario. It seems like a bug, but I'd like to ensure it's not a configuration mistake before opening a bug report.

**tim.mooney** · 27-03-2023, 20:16

I'm not trying to be a jerk, but you've told us a lot about what problem you're observing but absolutely nothing about your environment. That may be relevant.

What version of Zabbix are you running on the server?

You've said how many clients you have, but it would be perhaps more useful to know what your NVPS is?

What OS (or distro) and version?

What version of PostgresSQL?

Dedicated database server, or is it combined zabbix server + DB on one?

What PostgreSQL tuning has been done, if any?

**brice.lopez** · 29-03-2023, 15:06

Hi Tim,

We're running the official zabbix docker image 6.0.11 on a Debian 11.3 server, handling on average 993NVPS. The database is a hosted on a dedicated Debian 11.3 VM with 64GB of memory, SSD storage, and 8 Xeon 4208 CPU cores, running postgresql 13.5.

Here is the full postgresql.conf : https://paste.chapril.org/?3c58d5cd5...EYFDfM2PZaQfuz

**cyber** · 29-03-2023, 15:48

For that version I would go as high on DB version as possible (15.X) .. you are barely over lower limit... But as DB is on dedicated box, that should still be able to server your workload... Config looks ok...

housekeeping is enabled? It should delete old data for you...

I am no DBA, but do you see vacuuming happening in db logs? All this DB tuning is tricky business..

**tim.mooney** · 29-03-2023, 22:50

Thanks for providing that additional info about your environment.

Having a dedicated database server with great resources like you've outlined should (I think) be able to handle your config. I do think it's weird that the problem only began after you added the indexes. It sort of implies that PostgreSQL is somehow mis-optimizing the query, but that's just a wild guess.

I'm far more experienced with MySQL/MariaDB as a Zabbix backend. We don't have much postgres in my environment, and what we do have serves small applications, so I've only done the smallest bit of optimization. I tried a couple of different PostgreSQL "tuners" sized for your system, and they output config very similar to what you have, including for shared_buffers and work_mem, so if it's a DB tuning issue, I'm not experienced enough to spot it.

Since you're experienced enough to capture the query and try alternate versions that perform better, have you tried using the query analyzer on the original query, to see if you can spot where things are going wrong?

As you know, indexes on those tables are new at 6.0.x. It's possible that there's some inefficiency with that specific query that other sites haven't run into yet, either because they don't have as many web scenario items, they're not using the PostgreSQL backend, or something similar. You could try reporting the issue via the free bugtracker (support.zabbix.com). Without a support contract you may not get a response. Still, it seems like you've done your homework here and this isn't some simple misconfiguration. If other people start experiencing the problem, they might find your reported issue.

Wish I had a better suggestion as to what the problem is. I do think the EXPLAIN option with the query analyzer might be required.

As you make progress with the issue, please follow up here and let us know where the debugging takes you. It's an interesting (and complicated) problem.

**brice.lopez** · 11-04-2023, 12:00

Hello,

Thank you both for you suggestions. A quick followup.

First I did a few optimization on the DB to maximize the performance. I increased the shared_buffers size to 20GB to keep most of the hot data into memory. I then realized we had temporary file on disk for big queries, so I increased the workmem to be just below the "out of memory" logs, which solved that issue.

We were still seeing performances issues with postgres. Digging the logs and the graphes, I identified a few more bottle necks :
- quite often, the backend writer was not fast enough to manage all the dirty buffers, and some backend process had to manage directly the writing
- the max_wal_size value was not big enough, triggering checkpoints more often than the default 5min interval
- the checkpoint time was high during the peaks of loads

We increased the performances by :
- Moving the WAL files to SSD. They were on a dedicated HDD partition, which was obviously a mistake.
- Increasing the max_wal_size to 4GB, to effectively do a checkpoint once every 5min
- Increasing bgwriter_lru_maxpages to 150 and bgwriter_lru_multiplier to 3.0 to let the background writer take care of the writing process

It helped, but we still had performances issues. I then noticed two things :
- The load spike which triggered a performance collapse was happening at 9am
- The number of requests was decreasing when the DB performance increased

It looked like a script running on an employee computer was doing high frequency requests. Less requests when performance increased, it could be a script which didn't manage well timeout or errors? I started to log the queries to zabbix web, and the POST content. We identified a few script doing heavy or high frequency queries. But fixing that didn't decrease the load on the DB.

Once a failing script was ruled out, most of the remaining requests were related to the dashboards. I realized each custom dashboard has a "web monitoring" widget, with a refresh interval of 1min. For most of them, made by the exploitation teams, it was filtered to display only their customers. But the main dashboard (dashboardid = 1) was displaying the global metrics, without any filter, which triggered a huge request for thousands of web monitoring checks each time the page was displayed or refreshed (every 1min).

Removing this widget and expiring the sessions to force everybody to refresh the page decreased massively the load, we are now planning to partition the tables using timescaledb.

Ad Widget

Slow query on history_uint after primary key update

Slow query on history_uint after primary key update

Comment

Comment

Comment

Comment

Comment

Comment