    Zabbix events housekeeping with huge backlog

    Hi everyone!

    I know that housekeeper-related questions have been discussed many times, but unfortunately I couldn't find a solution for our case, so any help will be much appreciated.

    The situation is the following:
    - Partitioning is configured for the trends and history tables. No issues here.
    - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
    - At the moment we have a huge amount of data in the events, event_recovery and problem tables (700,000,000+ records and 100GB+ of data for these 3 tables in total).

    After some investigation we found out that most of these records (more than 99%) are internal trigger-related events caused by an initial trigger misconfiguration.
    This has been fixed and the number of daily added events and problems has decreased significantly, but there are still a lot of old events left in the database.

    We tested the internal housekeeper on a test server by running it manually from the command line, with the events category selected in the housekeeper GUI section, and ran into the following issues:
    - On the first run it cleaned up all records from the problem table that had been resolved more than 1 day earlier, ignoring the threshold set in the GUI for events. The first execution ran for more than 10 hours. We'd like to have more control over it and somehow split this process into several batches when we run it in production, to minimize the impact.
    - Only a few hundred/thousand records are cleaned up from the events table on each housekeeper execution, and it runs quite long. Changing the MaxHousekeeperDelete parameter in the Zabbix server config doesn't affect this at all because, as far as I know, that parameter applies only to history data of items that have been removed.

    Our goals are:
    1. Perform the cleanup of all the internal events backlog from Zabbix DB.
    2. Configure regular housekeeping of events.

    I did some testing with the following query:
    delete from events where source=3 and object=0 and clock < [some_timestamp]
    In this case events are cleaned up much faster than by running the internal housekeeper, and the process can be controlled by running the query multiple times with different timestamp intervals.
    Records from the event_recovery and problem tables are also removed in this case via the foreign keys.
    This also shouldn't affect events related to real problems raised by triggers, which we need to keep for 1 year.
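
    To double-check the scope before deleting anything, a breakdown by event source and object confirms that nearly everything is internal (in the Zabbix schema, source=3 is EVENT_SOURCE_INTERNAL and object=0 means the event relates to a trigger); a minimal sanity-check query:

    SELECT source, object, COUNT(*) AS cnt
    FROM events
    GROUP BY source, object
    ORDER BY cnt DESC;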

    So, the main questions are:
    - What is the best way to perform such a cleanup in both cases (a one-time cleanup of the backlog and regular housekeeping)? Using the built-in housekeeper, or a scheduled script with custom SQL queries?
    In case of using the built-in housekeeper:
    1. How can we make it clean up more events on each execution and run faster?
    2. How can we configure it with more granularity when cleaning up the problem table, to avoid it running for 10+ hours?
    In case of running custom queries:
    1. Is my query above fine for performing the events cleanup, without causing any problems in the future?
    2. If not, which queries need to be executed for the events cleanup and are safe to run, so that we can be sure we will not get any data inconsistency?
    3. Does it make sense to keep internal events in the database for longer than 1 day?

    Thanks everyone in advance!

    #2
    Hi, I don't have an exact answer to all your questions, but I would do the one-shot cleanup job in whatever way works best (as this is just a one-time problem), and then use the usual housekeeper to do the cleaning regularly (apart from the partitioned tables). In some three years of running the same Zabbix servers I haven't seen a need to arrange special handling for events, problems, etc., while using partitioning for the history and trend tables at the same time.

    Hopefully someone else will chime in and comment on the best way to do this one-time cleanup.

    Markku



      #3
      Originally posted by RomanT
      The situation is the following:
      - Partitioning is configured for the trends and history tables. No issues here.
      - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
      - At the moment we have a huge amount of data in the events, event_recovery and problem tables (700,000,000+ records and 100GB+ of data for these 3 tables in total).
      It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
      Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
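
      Assuming the events table were range-partitioned on clock, dropping the oldest data becomes a single cheap statement (the partition name here is illustrative):

      ALTER TABLE events DROP PARTITION p2019_09;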

      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates



        #4
        Originally posted by kloczek
        It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
        Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
        Hi kloczek,
        Thanks for your reply!
        It's clear to us why we have so many events in the DB; as I mentioned, they are all internal events (source=3) caused by misconfigured triggers.

        Most of the events look like this:
        Cannot evaluate expression: "Cannot evaluate function "someswitch:net.if.in.errors[ifInErrors.1000].avg(5m)": not enough data.".
        We already fixed this, and the number of new events per hour is now 100 times lower than before. All we need to do now is safely clean up all the old ones.

        I'm not sure partitioning is a good idea, as the events table has several foreign keys, so dropping such a partition might cause data inconsistency (if it's even possible to partition a table with foreign keys). It might also take a lot of time to partition an existing table with this much data. It's fine for us to delete all these events using the query from my post in several steps, as it runs fast enough, but I'd like to be sure this won't impact anything, because manually deleting data from the DB always carries risks.

        Thanks.



          #5
          If you have a single database engine without a slave, you are in trouble.
          Normally, for something like changing the database schema or table layout to introduce partitions, the best approach is to do it on a slave and then promote that slave as the new master.

          If you don't have a slave, start thinking about rearchitecting the DB backend.
          If you have no snapshotable filesystem beneath your database, you've reached trouble lvl. 2, because without that it is much harder to create such a slave.

          Without a slave and/or a snapshotable FS underneath, you can still start heading in that direction by moving all monitoring off the server to proxies.
          With that you can schedule Zabbix server and DB backend downtime while your monitoring keeps working (collecting monitoring data), just without the alarming layer.
          With all monitoring done over proxies, only the server itself (using the Zabbix server template) should be monitored without a proxy.
          Then, with monitoring data still being collected, you can copy the database to another box to create a slave database. With the slave attached to the master you can run even one big delete query, or partition the events table, for as long as necessary.
          When the delete/partitioning is finished on the slave, and the changes accumulated in the meantime have been applied, you can promote the slave as the new master within a few seconds.
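
          A minimal sketch of that final promotion step, assuming classic MySQL replication (the statements and the DBHost parameter are standard; nothing here is specific to this thread):

          -- on the slave, once the delete/partitioning is done and it has caught up:
          STOP SLAVE;
          RESET SLAVE ALL;   -- detach from the old master, making this a standalone server
          -- then point DBHost in zabbix_server.conf at the promoted box and restart zabbix_server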



            #6
            Originally posted by kloczek
            If you have a single database engine without a slave, you are in trouble.
            ...
            Hi kloczek,

            Thanks for all your recommendations!

            Actually, we have 2 DB servers with replication configured, but if we are talking about cleaning up the DB with custom queries, the performance impact of running delete queries is the least of my worries. What I'd like to understand is whether I'm on the right track and can use the query from my initial post without any impact on database consistency (as opposed to performance and availability).
            By consistency I mean not ending up with a broken DB: issues with writing new events into the DB, errors in the web interface, problems stuck in some state, issues in Zabbix processes related to event processing, etc.

            As for the internal Zabbix housekeeper: as I understand it, it's not possible to run it against a slave DB anyway, because it runs against the database configured for the Zabbix server.

            Thanks.



              #7
              The housekeeper table holds data about what needs to be deleted from the history and trends tables. If items with certain itemids have been deleted, events related to those items need to be deleted as well.
              In other words, you can use the content of the housekeeper table to clean up the events table a bit. However, with a master and slave already in place, I would go for: partition the events table on the slave -> drop the oldest partitions -> promote the slave as the new master -> recreate the slave, and then clean that table by other methods.
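
              For reference, the housekeeper table is small and easy to inspect; per the Zabbix schema it has tablename, field and value columns (value is typically an itemid). A quick look at what is still queued:

              SELECT tablename, field, COUNT(*) AS pending
              FROM housekeeper
              GROUP BY tablename, field;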



                #8
                Originally posted by kloczek
                It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
                Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                Do you know if there is any guidance anywhere on how events partitioning would be done? I do plan on redefining triggers, but I've never seen an answer on how to deal with the events, event_recovery, and problem tables growing faster than the housekeeper can keep up. With a large enough installation, it does seem this would naturally be the fix.



                  #9
                  Partitioning is completely transparent to all SQL queries.
                  Using partitioned tables only changes the method of removing the oldest data, which by default has to be removed with a delete query.
                  Instead of deleting individual table rows, the whole oldest partition is dropped (effectively deleting all of it with only a few IOs).



                    #10
                    Originally posted by kloczek
                    Partitioning is completely transparent to all SQL queries.
                    Using partitioned tables only changes the method of removing the oldest data, which by default has to be removed with a delete query.
                    Instead of deleting individual table rows, the whole oldest partition is dropped (effectively deleting all of it with only a few IOs).
                    I may be misunderstanding, but I was looking for information on how to partition the events, event_recovery, and problem tables so I can add them to my partition-dropping script. I haven't seen anyone describe how they partitioned those tables, just that it is possible.



                      #11
                      Just use the same SQL procedure which you are using to partition the trends* and history* tables.
                      Only specify events as the table which needs to be partitioned.
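
                      For the record, the usual shape of this on MySQL, sketched by hand rather than taken from any particular script (partition names and epoch boundaries are illustrative). Note RomanT's caveat above: InnoDB does not support foreign keys on partitioned tables, and the partitioning column must be part of every unique key, so the foreign keys referencing events would have to be dropped and the primary key extended to (eventid, clock) first:

                      ALTER TABLE events
                          PARTITION BY RANGE (clock) (
                              PARTITION p2019_09 VALUES LESS THAN (1569888000),  -- 2019-10-01 00:00 UTC
                              PARTITION p2019_10 VALUES LESS THAN (1572566400)   -- 2019-11-01 00:00 UTC
                          );
                      -- the same script then adds new partitions ahead of time and drops the oldest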



                        #12
                        Hello,
                        I may have misunderstood - sorry if so.

                        Originally posted by RomanT
                        The situation is the following:
                        - Partitioning is configured for the trends and history tables. No issues here.
                        - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
                        The partitioning you have is fine, but keep the internal housekeeper turned on -- except for trends and history. You can override the trend and history housekeeping individually.

                        Our one-shot deletes took hours, so not much help for you there. I think we ran them over the weekend.



                          #13
                          RomanT, did you end up making progress with this?

                          I'm in a similar position where our events table is much larger than our history tables:

                          $ ls -Slh zabbix_prod/
                          total 625G
                          256G events.ibd
                          180G history_uint.ibd
                          104G event_recovery.ibd
                          42G history.ibd
                          34G trends_uint.ibd
                          3.7G auditlog.ibd
                          2.2G sessions.ibd
                          2.0G trends.ibd


                          There are many rows in the events table whose clock timestamp is earlier than the date at which housekeeping should already have deleted them, so I imagine housekeeping isn't keeping up with flushing out all these events, as you and others have described above.
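
                          A quick way to quantify the backlog (a sketch, assuming a 365-day event retention period):

                          SELECT COUNT(*)
                          FROM events
                          WHERE clock < UNIX_TIMESTAMP(NOW() - INTERVAL 365 DAY);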

                          I'm a little concerned that simply using SQL to delete them might leave remnants in other tables that then never get cleaned up, because they're not handled by either the usual housekeeping tasks or some other internal method.

                          Furthermore, I'm interested to know what is meant by this comment:
                          Originally posted by kloczek
                          Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                          If the events are internal, then what recourse do we have other than trying to make sure that checks are being performed correctly? The lack of a UI to see a summary of these internal events in the Zabbix frontend makes it difficult to discover issues, and setting up alerts for new events doesn't solve the problem we're discussing here, because we're looking at historical issues rather than present ones.



                            #14
                            Originally posted by jameskirsop
                            RomanT, did you end up making progress with this?
                            ...
                            Hi James,

                            Yes, we successfully performed the cleanup.
                            The total running time of the delete queries was about 40+ hours. We divided all the internal events by timestamp into batches that each take about 1 minute to delete (150,000 events per batch) and executed them with a script that runs a delete query and then waits for 1 minute, to let the replication finish and to process the remaining queries stuck in the DB queue. We considered the LIMIT option for the delete queries, but it can cause replication issues when used without an ORDER BY clause, so we decided to avoid it.
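
                            To make the shape of this concrete, a minimal sketch of what the script executed (the epoch values here are placeholders; the real ones came from the timestamp-list query below, one per 150,000 events):

                            DELETE FROM events WHERE source=3 AND object=0 AND clock < 1546300800;
                            DO SLEEP(60);  -- give replication and queued queries a minute to catch up
                            DELETE FROM events WHERE source=3 AND object=0 AND clock < 1546387200;
                            DO SLEEP(60);
                            -- ...and so on through the generated list of timestamps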

                            Initially we ran longer queries but hit the following issue: a related table (event_suppress or event_recovery, I don't remember exactly) that is connected to the events table by a foreign key was locked while the delete query was executing against the events table. This caused a storm of nodata-trigger-based problems, such as Zabbix agent unavailability. That's why we made the batches smaller.

                            After that we executed OPTIMIZE TABLE on the events, problem and event_recovery tables to free up some disk space.

                            The query we used is the one from my initial post.
                            The list of timestamps for the delete queries was produced with the following query:

                            SELECT * FROM (
                                SELECT @rownum := @rownum + 1 AS rank, e.clock
                                FROM events e, (SELECT @rownum := 0) r
                                WHERE e.source = 3 AND e.object = 0
                                ORDER BY e.clock ASC
                            ) AS t
                            WHERE t.rank % 150000 = 0;
                            This query took around 50 minutes to execute.
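
                            As an aside, on MySQL 8.0+ or MariaDB 10.2+ the same list can be produced with a window function instead of the @rownum trick (an equivalent rewrite, not the query we actually ran; the rn alias avoids RANK, which is a reserved word in MySQL 8):

                            SELECT rn, clock
                            FROM (
                                SELECT ROW_NUMBER() OVER (ORDER BY clock) AS rn, clock
                                FROM events
                                WHERE source = 3 AND object = 0
                            ) AS t
                            WHERE rn % 150000 = 0;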

                            Regarding the remnants you are concerned about: I don't think anything will be left over in the database, because a lot of tables are connected to the events table by foreign keys with the ON DELETE CASCADE option. That means all records in those other tables related to a deleted event will be deleted as well. Actually, the Zabbix built-in housekeeper does almost the same thing: it executes delete queries against the events and problem tables only, but it also performs some additional checks for each event that is going to be deleted.

                            You can check the housekeeper code here:
                            https://github.com/zabbix/zabbix/blo.../housekeeper.c

                            You can check the list of events related tables here:
                            https://zabbix.org/wiki/Docs/DB_schema/4.0/events

                            Hope this helps.

                            Thanks.




                              #15
                              Originally posted by kloczek
                              Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                              jameskirsop

                              This gets brought up a lot, and while it is probably most users' problem, it is not a solution. The thinking here is that we have more alarming and event generation than we need, so we should disable these alarms or raise their thresholds so the housekeeper can keep up. That is not sustainable, because as monitoring systems grow and more is added, official templates are often used, and those come with exactly the over-defined alarming layers in question. Even without templates, a large enough monitoring system will hit an alarm volume where the housekeeper cannot keep up.

                              In my opinion, an official solution is needed here, or at least a default Zabbix alert for when the housekeeper is not keeping up.

                              If I can find a good automatic solution, I might check whether someone else has already requested this as a feature so I can add onto that request.


                              As for how I have been dealing with it manually: I create EVENTS_NEW like EVENTS, INSERT the data newer than a given clock, rename EVENTS to EVENTS_OLD, rename EVENTS_NEW to EVENTS, double-check, then drop EVENTS_OLD.
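
                              In SQL terms the swap looks roughly like this (a sketch with an illustrative cutoff epoch; note that foreign keys referencing events follow the rename to events_old, so they would have to be dropped and recreated before the final DROP succeeds):

                              CREATE TABLE events_new LIKE events;
                              INSERT INTO events_new SELECT * FROM events WHERE clock > 1569888000;
                              RENAME TABLE events TO events_old, events_new TO events;
                              -- double-check row counts and frontend behaviour, then:
                              DROP TABLE events_old;
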
                              Last edited by rankinc; 10-10-2019, 16:05.
