I'm having problems understanding why deleting events is so slow.
Zabbix: 4.2 (never mind the name of the db - it is 4.2)
Postgresql: 10.8
nvps: ~400
hosts: ~600
items: ~45000
event table: ~25 million rows
event_recovery table: ~12 million rows
alerts table: ~600000 rows
We've already partitioned all history* and trends* tables - which has worked fine and without problems for a year. I don't think I've seen a definitive guide on partitioning of events, and since the db schema is a bit complex I haven't tried to figure out how to transition to partitioned events. I would be happy for any pointers on this.
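For reference, this is roughly what I imagine the history*/trends* pattern would look like applied to events (names and columns are illustrative, not a working migration). The blocker, as far as I can tell, is that in PostgreSQL 10 a partitioned table cannot have a primary key and cannot be the target of foreign keys, and event_recovery, alerts, acknowledges etc. all reference events.eventid:

CREATE TABLE events_part (
    eventid bigint  NOT NULL,
    clock   integer NOT NULL DEFAULT 0
    -- ... remaining columns as in the Zabbix 4.2 events schema
) PARTITION BY RANGE (clock);

-- Monthly partitions; PostgreSQL 10 only accepts literal bounds, so these are
-- precomputed epochs (1561939200 = 2019-07-01 UTC, 1564617600 = 2019-08-01 UTC).
CREATE TABLE events_2019_07 PARTITION OF events_part
    FOR VALUES FROM (1561939200) TO (1564617600);

-- Indexes are not inherited from the parent in PostgreSQL 10,
-- so each partition needs its own.
CREATE INDEX ON events_2019_07 (eventid);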
Now we've seen that housekeeping of events is very slow: a single (hourly) run can take more than a day to finish, so events keep stacking up. A typical slow delete statement from the postgres log:
postgresql-10-20190717-031404.log:2019-07-17 03:37:43 CEST [80965]: [4-1] user=zabbix,db=zabbix_34,app=[unknown],client=[local]: LOG: duration: 27298798.930 ms statement: delete from events where (eventid between 5580621 and 5580681 or eventid between 5580689 and 5580762 or eventid between 5580769 and 5580844 or eventid between 5580851 and 5580867 or eventid between 5580869 and 5580926 or eventid between 5580933 and 5580949 or eventid between 5580963 and 5581024
--- 8< --- a lot of similar eventids snipped away -----
or eventid between 5586799 and 5586839 or eventid in (5581385,5581389,5581561,5581563,5581564,5581580,5581582,5581584,5581585,5581635))
I've analyzed the deletion of a single row in events:
zabbix_34=> begin;
BEGIN
zabbix_34=> explain (analyze,buffers,timing)
delete from events where eventid = 5600001;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Delete on events (cost=0.44..8.46 rows=1 width=6) (actual time=0.260..0.260 rows=0 loops=1)
Buffers: shared hit=6 dirtied=1
-> Index Scan using idx_29337_primary on events (cost=0.44..8.46 rows=1 width=6) (actual time=0.043..0.044 rows=1 loops=1)
Index Cond: (eventid = '5600001'::numeric)
Buffers: shared hit=4 dirtied=1
Planning time: 0.060 ms
Trigger for constraint c_alerts_5: time=330.938 calls=1
Trigger for constraint c_event_suppress_1: time=0.102 calls=1
Trigger for constraint c_acknowledges_2: time=0.081 calls=1
Trigger for constraint c_alerts_2: time=0.080 calls=1
Trigger for constraint c_event_recovery_1: time=1821.648 calls=1
Trigger for constraint c_event_recovery_2: time=1866.104 calls=1
Trigger for constraint c_event_recovery_3: time=762.721 calls=1
Trigger for constraint c_event_tag_1: time=3.526 calls=1
Trigger for constraint c_problem_1: time=2.016 calls=1
Trigger for constraint c_problem_2: time=1.162 calls=1
Execution time: 4788.672 ms
(17 rows)
zabbix_34=> rollback;
5 seconds to delete a single row, wow! This shows that it is the foreign key constraints on event_recovery and alerts that take most of the time. But why? In this case there is no row with eventid=5600001 in either event_recovery or alerts. There are indexes or primary keys (as per the Zabbix 4.2 schema) on all the referencing columns, and the database is analyzed and vacuumed nightly. Querying event_recovery for this eventid is fast:
zabbix_34=> explain (analyze,timing) select * from event_recovery where eventid = 5600001;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using event_recovery_rb_test_100 on event_recovery (cost=0.43..8.45 rows=1 width=40) (actual time=0.055..0.055 rows=0 loops=1)
Index Cond: (eventid = 5600001)
Planning time: 0.157 ms
Execution time: 0.068 ms
(4 rows)
zabbix_34=> select * from event_recovery where eventid = 5600001;
eventid | r_eventid | c_eventid | correlationid | userid
---------+-----------+-----------+---------------+--------
(0 rows)
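For completeness, here is what I'm planning to check next (a sketch only: the ::numeric cast in the delete plan above makes me wonder whether the columns ended up as numeric instead of bigint somewhere along the way, and the column behind c_event_recovery_1 in step 3 is an assumption on my part):

-- 1. How are the FK constraints pointing at events actually defined?
SELECT conname,
       conrelid::regclass AS referencing_table,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE confrelid = 'events'::regclass
  AND contype = 'f';

-- 2. Do all referencing columns have the same type as events.eventid?
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name IN ('events', 'event_recovery', 'alerts', 'acknowledges',
                     'event_suppress', 'event_tag', 'problem')
  AND column_name LIKE '%eventid%'
ORDER BY table_name, column_name;

-- 3. Mimic the lookup the RI trigger runs on delete (assuming
--    c_event_recovery_1 is the constraint on event_recovery.eventid):
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1 FROM ONLY event_recovery x
WHERE x.eventid = 5600001 FOR KEY SHARE OF x;

If step 3 comes back with a seq scan, or step 2 shows mismatched types, that would at least explain where the trigger time goes.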
Any ideas?