How to clean up database disk space from an oversized auditlog database table

  • jjeff123
    Member
    • May 2022
    • 33

    #1

How to clean up database disk space from an oversized auditlog database table


    This is based on my experience with:
    Zabbix 6.04
PostgreSQL
    Debian Linux
The Zabbix system has 3 proxies, 75,000 items and 900 hosts, with a single server running both the web frontend and the database, as a VM under VMware backed by SSD disks.

Everything about this process uses a lot of disk I/O. If you're running your box on spinning disks, I can't even guess how long any of this will take.

Many commands will be similar if you're running some other arrangement, but not identical. Some of this might be helpful for other database size and disk space utilization problems as well. Commands and important responses are in bold. Some of the data might look a little goofy, since I'm writing this up after the fact and have had to mock up some of the command responses. Also, I'm not an expert; there are better ways to do some of these things, and some of them might be ill-advised. I don't think I broke anything, but YMMV.

Starting in version 6, Zabbix keeps much more audit log data. But a bug not fixed until version 6.06 meant your audit log might grow enormously and never get cleaned up. I ran 6.04 for over a year, so I definitely had that problem (pro tip: upgrading is EASY). The housekeeper service, which runs every other hour for up to 1 hour, was also useless, pegging the CPU at 100% and never cleaning up the auditlog by any significant amount. Every time you discover a device, the auditlog gets an entry for every single monitored item, so a single 24-port switch might generate a couple hundred (or more) auditlog entries. Those I don't need.

    I shrunk my database from 350GB down to 38GB, without losing any monitoring data, just (for me) useless auditlog data. My housekeeper now runs in a few seconds. This is how...

Note: when my database first began to grow, I created a 2nd disk and moved my postgres database to it. Your system might only have one disk.

First, let's gather some data. Use the command df -h to look at your disks and the space taken up. On my system, that 2nd disk, /dev/sdb1, was 80% full.

    j@server:~$ df -h
    Filesystem Size Used Avail Use% Mounted on
    udev 9.8G 0 9.8G 0% /dev
    tmpfs 2.0G 1.2M 2.0G 1% /run
    /dev/sda1 62G 3.5G 56G 6% /
    tmpfs 9.8G 412K 9.8G 1% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    /dev/sdb1 393G 333G 60G 80% /data
    tmpfs 2.0G 0 2.0G 0% /run/user/1000

    Now let's connect to the zabbix database and see what tables are taking all this space.

    j@server:~$ sudo -u postgres psql zabbix
    [sudo] password for j:
    psql (13.11 (Debian 13.11-0+deb11u1))
    Type "help" for help.

zabbix=# select table_name, pg_relation_size(quote_ident(table_name)), pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) from information_schema.tables where table_schema = 'public' order by 2 desc limit 10;

table_name | pg_relation_size | pg_size_pretty
----------------+------------------+----------------
    auditlog | 226735716352 | 227 GB
    items | 128974848 | 158 MB
    event_tag | 126861312 | 184 MB
    events | 38002688 | 116 MB
    item_tag | 24608768 | 41 MB
    triggers | 24199168 | 26 MB
    item_discovery | 20602880 | 31 MB
    item_preproc | 7151616 | 12 MB
    functions | 6709248 | 17 MB
    event_recovery | 6651904 | 14 MB
    (10 rows)

227GB is a lot. Let's see how many auditlog entries that is, and how far back they go...
Note: this next command is going to take a long time, 10 minutes or more, and your Zabbix might stop responding and/or start spitting out a lot of slow database query warnings in the log.

    zabbix=# select count(*) from auditlog;
    count
    ----------
    371158200
    (1 row)

    zabbix=# select to_timestamp(min(clock)) as date from auditlog;
    date
    ------------------------
    2022-03-19 08:00:02-04
(1 row)

    So we've got 371 million entries in the auditlog and they go back 14 months. So I've got plenty of data to clean up. Type quit to exit the database interface and go back to shell.


    If your system is not at least 6.06, you need to upgrade, or the problem will just come back. So go do that first. Maybe I'll write a cookbook article on it....

Anyway, yup, the audit log is huge. Housekeeper should clean up old logs; look in your gui at Administration -> Housekeeping and see how long the audit log is set to be kept for. Mine was a year (so why did I have 14 months of data? Right, there was a bug); I changed that to 21 days. Housekeeper runs by default every other hour, so 12 times a day. Let's see how much the housekeeper is helping me out here...

    j@server:~$ sudo more /var/log/zabbix/zabbix_server.log | grep house
    924:20230528:001427.668 housekeeper [deleted 1532947 hist/trends, 0 items/triggers, 243 events, 26 problems, 0 sessions, 0 alarms, 37555 audit, 0 records in 1048.823645 sec, idle for 1 hour(s)]
    924:20230528:011427.783 executing housekeeper

OK, we're cleaning up audit logs, but at 37K entries per run, this will take till the heat death of the universe, or maybe 835 days. Let's improve that.
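As a sanity check on that estimate, just shell arithmetic on the numbers from the log line above (37,555 audit rows per run, 12 runs a day):

```shell
# 371M rows divided by (37,555 rows per run * 12 runs per day) ~= days to finish
echo $((371158200 / (37555 * 12)))   # prints 823, i.e. well over two years
```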

    j@server:~$ sudo nano /etc/zabbix/zabbix_server.conf
    [sudo] password for j:

    use cursor keys, or be fancy and use CTRL+W to search, and find this entry in the config file:

    ### Option: MaxHousekeeperDelete
    # The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
    # [housekeeperid], [tablename], [field], [value].
    # No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
    # will be deleted per one task in one housekeeping cycle.
    # If set to 0 then no limit is used at all. In this case you must know what you are doing!
    #
    # Mandatory: no
    # Range: 0-1000000
    # Default:
    MaxHousekeeperDelete=100000

So housekeeper should clean up to 100K entries per run, but even when I increased that value (I tried up to 700K), it still cleaned far less; the most I ever saw on my system was 170K, and that took an hour. Hit CTRL+X to exit, answering yes to saving the file if you actually changed something.

An odd note here, maybe for developers: when I was watching queries, what I saw was the housekeeper doing something like this:
select auditID from auditlog where clock < 1649483014 limit 100000;
delete from auditlog where auditID in ('fdhsfhdafhafhdasfh', 'fdhsa8f9hads8f9ha8f9', ....... ); repeated for 100,000 auditIDs.
Wouldn't something like this be faster?
delete from auditlog where clock < 1649483014 limit 100000;
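For what it's worth, stock Postgres doesn't actually allow LIMIT on a DELETE statement; the closest equivalent I know of is a ctid subquery. A sketch (not something I ran against a Zabbix schema), using the same timestamp as above:

```sql
-- Postgres has no "DELETE ... LIMIT", but a ctid subquery batches similarly
delete from auditlog
 where ctid in (select ctid from auditlog
                 where clock < 1649483014 limit 100000);
```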


    Anyway, back to our cleanup.
    First, make sure you have good backups and know how to recover from them. My zabbix server is a VM, so I just took a snapshot.

Next, we're going to delete old audit logs. This is going to take a long time, long enough for my Zabbix web interface to stop responding. Deleting smaller batches of logs seemed to eliminate that problem. If you can afford a chunk of downtime, I'd stop Zabbix and do it all at once; but maybe you're watching TV with a laptop next to you, typing commands during commercials.... Now, the database doesn't understand dates, not exactly; it only knows timestamps. So, using https://www.unixtimestamp.com/ , pick a date maybe 1 week later than your oldest data and convert it to a timestamp. I'll delete anything older than May 15, 2022, which is 1652631297. Get back into the database just like above, then do
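If you'd rather not use the website, GNU date (assuming Linux with coreutils) does the same conversion. The exact result shifts with the timezone and time of day you pick, which doesn't matter for this purpose:

```shell
# Convert a calendar date to a unix timestamp
date -d '2022-05-15 00:00:00 UTC' +%s   # prints 1652572800
```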

    zabbix=# delete from auditlog where clock < 1652631297 ;

Now wait, and after a while it'll tell you how many rows it deleted. Do this a bunch of times, choosing newer and newer timestamps (I did about 1 month of logs at a time), until you've deleted all the logs up to however old you want to keep. Or, if you can afford the downtime, stop Zabbix and issue that same command with a timestamp from just 21 days ago (or however long you want to keep your auditlog data), like this:

    j@server:~$ sudo /etc/init.d/zabbix-server stop
    zabbix=# delete from auditlog where clock < 1685598214 ;
    j@server:~$ sudo /etc/init.d/zabbix-server start

After doing all this, the database will still take up a lot of disk space. My housekeeper was running much quicker, but still had a lot of entries to clean up; I think just deleting from the auditlog doesn't quite do the same as the housekeeping.

    j@server:~$ sudo more /var/log/zabbix/zabbix_server.log | grep house
    924:20230528:001427.668 housekeeper [deleted 1532947 hist/trends, 0 items/triggers, 243 events, 26 problems, 0 sessions, 0 alarms, 167188 audit, 0 records in 9.206266 sec, idle for 1 hour(s)]
    924:20230528:011427.783 executing housekeeper

I'm sure there are better ways to do this, but first I edited the zabbix_server.conf file using nano and set that MaxHousekeeperDelete value to 500000, then restarted the Zabbix server, then opened 2 ssh sessions to my server, one to continuously monitor the Zabbix log...
    j@server:~$ sudo tail -f /var/log/zabbix/zabbix_server.log
    863934:20230608:195722.021 executing housekeeper
    863934:20230608:195723.539 housekeeper [deleted 0 hist/trends, 0 items/triggers, 247 events, 30 problems, 0 sessions, 0 alarms, 131365 audit, 0 records in 7.491153 sec, idle for 1 hour(s)]
    863934:20230608:205723.637 executing housekeeper
    863934:20230608:205725.437 housekeeper [deleted 0 hist/trends, 0 items/triggers, 251 events, 36 problems, 0 sessions, 0 alarms, 132238 audit, 0 records in 21.774204 sec, idle for 1 hour(s)]
    863934:20230608:215725.536 executing housekeeper
    863934:20230608:215727.202 housekeeper [deleted 0 hist/trends, 0 items/triggers, 184 events, 32 problems, 0 sessions, 0 alarms, 132536 audit, 0 records in 5.637364 sec, idle for 1 hour(s)]

and in the other I manually ran the housekeeper:
j@server:~$ sudo /usr/sbin/zabbix_server -R housekeeper_execute

Wait for the 1st ssh session showing the log to tell you the housekeeper ran and is now idle for another hour, then hit the up cursor key and enter to re-run the housekeeper.

    and eventually, you'll get one of these:
    644069:20230607:203208.737 forced execution of the housekeeper
    644069:20230607:203208.737 executing housekeeper
    644069:20230607:203208.830 housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 0 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 0.065684 sec, idle for 1 hour(s)]

OK, now edit the Zabbix config file again and put MaxHousekeeperDelete back to the original setting.
We've cleaned up the database table, but postgres doesn't shrink the files on your disk the moment you delete some rows. First, let's check how empty our database is.

    zabbix=# select schemaname,relname,n_live_tup, n_dead_tup from pg_stat_all_tables where relname = 'auditlog';
    schemaname | relname | n_live_tup | n_dead_tup
    ------------+----------+------------+------------
    public | auditlog | 10966205 | 43242773986
    (1 row)

Notice my n_dead_tup is enormous; those are all rows which are deleted and available to be overwritten, but still taking up a huge amount of space on my disk. So let's reclaim that space. We're going to copy all the live data out of that table into a whole new table, then delete the original bloated table. Postgres will do it all for us, but there are 2 important notes. First, you MUST stop Zabbix while doing this next step; the cleanup has to lock that table exclusively. Second, since it's a copy, you'll temporarily take up however much space the live data needs, PLUS the existing bloated table size. Remember that 3rd command we ran to show disk space used by each table, this one?
table_name | pg_relation_size | pg_size_pretty
----------------+------------------+----------------
auditlog | 226735716352 | 227 GB


I guessed that if 14 months of my data took 227GB, then 21 days should take about 11GB. It was actually a bit less than that, but be aware this command will cause a world of problems if you don't have enough space. Databases really don't like filling a disk to 100%, especially if you have everything on one disk. Take a snapshot before you start.
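That guess is just proportion arithmetic:

```shell
# 227 GB covering ~14 months (~420 days), scaled down to 21 days
echo $((227 * 21 / (14 * 30)))   # prints 11 (GB, roughly)
```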

    So, take your snapshot, backup, whatever, just make sure you'll be able to recover. Then stop zabbix like we did above. Get back into your postgres shell, and use the vacuum command...

    j@server:~$ sudo /etc/init.d/zabbix-server stop
    j@server:~$ sudo -u postgres psql zabbix
    [sudo] password for j:
    psql (13.11 (Debian 13.11-0+deb11u1))
    Type "help" for help.

    zabbix=# vacuum full auditlog;

Now we wait; it's going to do a lot of reading and not too much writing. My system took about an hour. Once finished, quit the psql shell and check out your disk space. If you're a nervous nelly like me and want to watch it work, you can install iotop and run it in your 2nd ssh session:

j@server:~$ sudo apt-get install iotop
j@server:~$ sudo /usr/sbin/iotop

At the top of the screen it'll show the total read bytes/second and write bytes/second. Hit q to quit once the reading and writing is done and the psql prompt returns. Then check your disk space:

    j@server:~$ df -h

    Filesystem Size Used Avail Use% Mounted on
    udev 9.8G 0 9.8G 0% /dev
    tmpfs 2.0G 1.2M 2.0G 1% /run
    /dev/sda1 62G 3.5G 56G 6% /
    tmpfs 9.8G 412K 9.8G 1% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    /dev/sdb1 393G 38G 335G 11% /data
    tmpfs 2.0G 0 2.0G 0% /run/user/1000

Success!! Happy hunting.
    Last edited by jjeff123; 09-06-2023, 14:21.
  • cyber
    Senior Member
Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4806

    #2
    Originally posted by jjeff123
    zabbix=# select min(clock) from auditlog;
    min
    ------------
    1649483014
(1 row)

    If I was better at SQL I'd write a statement to convert that timestamp into an actual date
    select to_timestamp(min(clock)) as date from auditlog;


    • jjeff123
      Member
      • May 2022
      • 33

      #3
      Originally posted by cyber
      select to_timestamp(min(clock)) as date from auditlog;
I kept trying select to_char(min(clock)) and it gave me assorted errors. Thanks.


      • frank108
        Junior Member
        • Mar 2022
        • 15

        #4
I had a problem with a large TimescaleDB database; Zabbix had been unable to clean large chunks of the history and trend tables.
This was visible in the zabbix server log:

        Code:
        grep housekeeper /var/log/zabbix/zabbix_server.log
        3245914:20241211:160120.725 executing housekeeper
        3245914:20241211:160205.960 housekeeper [deleted 0 hist/trends, 0 items/triggers, 7597 events, 28 problems, 0 sessions, 0 alarms, 1414 audit, 0 autoreg_host, 0 records in 44.969769 sec, idle
        for 1 hour(s)]
TimescaleDB has hypertable management with a retention feature that can take care of such cleanup.
Check the tables in the postgresql console to see if they are already used as hypertables:
        Code:
        SELECT * FROM timescaledb_information.dimensions;
        hypertable_schema | hypertable_name | dimension_number | column_name | column_type | dimension_type | time_interval | integer_interval | integer_now_func | num_partitions
        -------------------+-----------------+------------------+-------------+-------------+----------------+---------------+------------------+------------------+----------------
         public            | history         |                1 | clock       | integer     | Time           |               |            86400 | unix_now         |
         public            | history_uint    |                1 | clock       | integer     | Time           |               |            86400 | unix_now         |
         public            | history_log     |                1 | clock       | integer     | Time           |               |            86400 | unix_now         |
         public            | history_text    |                1 | clock       | integer     | Time           |               |            86400 | unix_now         |
 public            | history_str     |                1 | clock       | integer     | Time           |               |            86400 | unix_now         |
        Check if there's an existing retention policy for history:
        Code:
        SELECT * FROM timescaledb_information.jobs  WHERE hypertable_name = 'history';
        job_id |     application_name      | schedule_interval | max_runtime | max_retries | retry_period |      proc_schema      |     proc_name      |  owner   | scheduled |                     c
        onfig                     |          next_start           | hypertable_schema | hypertable_name
        --------+---------------------------+-------------------+-------------+-------------+--------------+-----------------------+--------------------+----------+-----------+----------------------
        --------------------------+-------------------------------+-------------------+-----------------
           1003 | Compression Policy [1003] | 1 day             | 00:00:00    |          -1 | 01:00:00     | _timescaledb_internal | policy_compression | postgres | t         | {"hypertable_id": 2,
        "compress_after": 604800} | 2024-12-12 12:19:32.996748+00 | public            | history_uint
        I had only a compression policy (which is also doing a great job).
Adding a retention policy to clean values older than 40 days for history:
        Code:
        SELECT add_retention_policy('history', 40 * 24 * 60 * 60);
Do the same for the other hypertables, with whatever retention period you need.
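For example (a sketch; these hypertable names come from the dimensions query above, and whether each one needs a policy depends on your setup):

```sql
-- same 40-day retention for the remaining history hypertables
SELECT add_retention_policy('history_uint', 40 * 24 * 60 * 60);
SELECT add_retention_policy('history_log',  40 * 24 * 60 * 60);
SELECT add_retention_policy('history_text', 40 * 24 * 60 * 60);
SELECT add_retention_policy('history_str',  40 * 24 * 60 * 60);
```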

Eventually, a postgresql restart might be needed if `next_start` in the schedule isn't set up; the restart will add the next scheduled run.

Cleaning was fast, the load wasn't bad for the server, and it released about 3/4 of my database's space (2TB).

