Hi!
We have an issue with audit logs on Zabbix 6.0.30 with PostgreSQL + TimescaleDB: the housekeeping process takes over 10 minutes to run. This, in turn, causes 100% disk utilisation, which slows down the whole system.
These are the sizes of the auditlog table and the other largest tables:
Code:
zabbix=# select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))), pg_total_relation_size(quote_ident(table_name)) from information_schema.tables where table_schema = 'public' order by 3 desc;
         table_name         | pg_size_pretty | pg_total_relation_size
----------------------------+----------------+------------------------
 auditlog                   | 130 GB         |           139443576832
 events                     | 70 GB          |            75190296576
 event_recovery             | 10 GB          |            11040464896
 items                      | 2640 MB        |             2768052224
 item_discovery             | 1540 MB        |             1614446592
 host_inventory             | 733 MB         |              768819200
 alerts                     | 492 MB         |              515424256
 item_tag                   | 315 MB         |              330399744
 item_preproc               | 287 MB         |              301400064
 trigger_discovery          | 238 MB         |              249200640
 event_tag                  | 234 MB         |              245088256
 interface                  | 233 MB         |              244572160
 triggers                   | 220 MB         |              230727680
 item_rtdata                | 218 MB         |              228704256
 hosts                      | 214 MB         |              223920128
 functions                  | 204 MB         |              214056960
.....
And this is what the housekeeper logs on each run:
Code:
2615:20240903:015340.567 housekeeper [deleted 0 hist/trends, 0 items/triggers, 781 events, 42 problems, 2 sessions, 0 alarms, 55842 audit, 0 autoreg_host, 0 records in 785.880511 sec, idle for 1 hour(s)]
2615:20240903:025340.681 executing housekeeper
2615:20240903:030653.002 housekeeper [deleted 0 hist/trends, 0 items/triggers, 947 events, 22 problems, 2 sessions, 0 alarms, 54113 audit, 0 autoreg_host, 0 records in 792.310625 sec, idle for 1 hour(s)]
2615:20240903:040653.113 executing housekeeper
2615:20240903:042015.917 housekeeper [deleted 0 hist/trends, 0 items/triggers, 732 events, 37 problems, 1 sessions, 0 alarms, 57534 audit, 0 autoreg_host, 0 records in 802.791813 sec, idle for 1 hour(s)]
2615:20240903:052016.041 executing housekeeper
2615:20240903:053344.770 housekeeper [deleted 0 hist/trends, 0 items/triggers, 629 events, 20 problems, 0 sessions, 0 alarms, 55350 audit, 0 autoreg_host, 0 records in 808.718917 sec, idle for 1 hour(s)]
2615:20240903:063344.868 executing housekeeper
2615:20240903:064719.545 housekeeper [deleted 0 hist/trends, 0 items/triggers, 490 events, 18 problems, 5 sessions, 0 alarms, 59333 audit, 0 autoreg_host, 0 records in 814.665098 sec, idle for 1 hour(s)]
2615:20240903:074719.657 executing housekeeper
2615:20240903:080102.838 housekeeper [deleted 0 hist/trends, 0 items/triggers, 440 events, 44 problems, 73 sessions, 0 alarms, 62293 audit, 0 autoreg_host, 0 records in 823.170509 sec, idle for 1 hour(s)]
2615:20240903:090102.957 executing housekeeper
2615:20240903:091451.138 housekeeper [deleted 0 hist/trends, 0 items/triggers, 355 events, 44 problems, 24 sessions, 0 alarms, 61843 audit, 0 autoreg_host, 0 records in 828.170993 sec, idle for 1 hour(s)]
2615:20240903:101451.257 executing housekeeper
2615:20240903:102842.305 housekeeper [deleted 0 hist/trends, 0 items/triggers, 499 events, 106 problems, 74 sessions, 0 alarms, 61406 audit, 0 autoreg_host, 0 records in 831.036257 sec, idle for 1 hour(s)]
2615:20240903:112842.421 executing housekeeper
2615:20240903:114238.144 housekeeper [deleted 0 hist/trends, 0 items/triggers, 450 events, 83 problems, 221 sessions, 0 alarms, 63427 audit, 0 autoreg_host, 0 records in 835.709272 sec, idle for 1 hour(s)]
2615:20240903:124238.261 executing housekeeper
2615:20240903:125959.730 housekeeper [deleted 0 hist/trends, 0 items/triggers, 1254 events, 83 problems, 140 sessions, 0 alarms, 65545 audit, 0 autoreg_host, 0 records in 1041.457073 sec, idle for 1 hour(s)]
2615:20240903:135959.845 executing housekeeper
2615:20240903:141321.513 housekeeper [deleted 0 hist/trends, 0 items/triggers, 833 events, 103 problems, 366 sessions, 0 alarms, 60789 audit, 0 autoreg_host, 0 records in 801.655640 sec, idle for 1 hour(s)]
What is odd is that the housekeeper sometimes completes in a few seconds, but within a day or so it goes back to taking 5/10/15 minutes. The time taken doesn't seem to be linked to the number of auditlog records deleted. This is the log entry from one of the quick runs:
Code:
2615:20240805:104551.263 housekeeper [deleted 0 hist/trends, 0 items/triggers, 624 events, 30 problems, 4 sessions, 0 alarms, 55496 audit, 0 autoreg_host, 0 records in 1.898408 sec, idle for 1 hour(s)]
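To make the comparison between runs easier, here is a minimal Python sketch that pulls the audit row count and the elapsed time out of the housekeeper summary lines. The regular expression and the `summarise` helper are my own and only assume the log format shown above; the two sample lines are copied from this post.

```python
import re

# Matches the summary line the housekeeper writes when a run finishes.
# Only the timestamp, the audit count and the elapsed seconds are captured.
HK_LINE = re.compile(
    r"(?P<stamp>\d{8}:\d{6})\.\d+ housekeeper \[deleted .*?"
    r"(?P<audit>\d+) audit, .*?"
    r"records in (?P<secs>\d+\.\d+) sec"
)

def summarise(log_text):
    """Return (timestamp, audit_rows_deleted, seconds) for each finished run."""
    runs = []
    for m in HK_LINE.finditer(log_text):
        runs.append((m.group("stamp"), int(m.group("audit")), float(m.group("secs"))))
    return runs

# One quick run and one slow run, copied from the logs in this post.
sample = """\
2615:20240805:104551.263 housekeeper [deleted 0 hist/trends, 0 items/triggers, 624 events, 30 problems, 4 sessions, 0 alarms, 55496 audit, 0 autoreg_host, 0 records in 1.898408 sec, idle for 1 hour(s)]
2615:20240903:015340.567 housekeeper [deleted 0 hist/trends, 0 items/triggers, 781 events, 42 problems, 2 sessions, 0 alarms, 55842 audit, 0 autoreg_host, 0 records in 785.880511 sec, idle for 1 hour(s)]
"""

for stamp, audit, secs in summarise(sample):
    # Rows per second makes the gap between quick and slow runs obvious.
    print(f"{stamp}  audit={audit:>6}  {secs:9.1f} s  {audit / secs:10.1f} rows/s")
```

Both runs delete about 55,000 audit rows, yet one takes under 2 seconds and the other over 13 minutes, which is why I don't think the row count alone explains the duration.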
To try and alleviate the issue, we lowered the auditlog retention from 365 days to 335 days. The cleanup took more than a week to complete, because each housekeeping run only deletes a limited number of records. The number of records has gone down and the oldest remaining records match the new retention, but we have seen no change in behaviour and no reduction in table size (we were expecting a ~8-10% reduction).
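For reference, this is roughly where that expectation comes from. It is only a back-of-the-envelope sketch, and it assumes audit records arrive at a steady rate and are of similar size, which may not hold on our system.

```python
# Figures from this post: retention went from 365 to 335 days and the
# auditlog relation was 139443576832 bytes in total.
old_days, new_days = 365, 335
table_bytes = 139443576832

# Assumption: rows arrive at a steady rate and are of similar size,
# so the share of rows removed equals the share of days removed.
fraction = (old_days - new_days) / old_days
expected_freed = table_bytes * fraction

print(f"share of rows removed: {fraction:.1%}")
print(f"space we expected to get back: {expected_freed / 1024**3:.1f} GiB")
```

That puts the low end of our estimate at about 8% of the table.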
Is this expected? Is there any way to understand why the housekeeper takes so long to delete auditlog records?
Lastly, since v6 Zabbix also records its own (System) actions in the audit log, and we are not interested in those. I understand this cannot currently be changed, but is it possible to delete such records to reduce the table size? I was thinking of a quick
Code:
delete from auditlog where username = 'System';
Thanks for reading!
Stefano