    Zabbix events housekeeping with huge backlog

    Hi everyone!

    I know that housekeeper-related questions have been discussed many times, but unfortunately I couldn't find a solution for our case, so any help will be much appreciated.

    The situation is the following:
    - Partitioning is configured for the trends and history tables. No issues here.
    - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
    - At the moment we have a huge amount of data in the events, event_recovery and problem tables (700,000,000+ records and 100GB+ of data for these 3 tables in total).

    After some investigation we found out that most of these records (more than 99%) are internal trigger-related events caused by an initial trigger misconfiguration.
    This has been fixed and the number of daily added events and problems has decreased significantly, but there are still a lot of old events left in the database.

    We tested the internal housekeeper on a test server by running it manually from the command line, with the events category selected in the housekeeper GUI section, and ran into the following issues:
    - On the first run it cleaned up all records from the problem table that had been resolved more than 1 day earlier, ignoring the threshold set in the GUI for events. The first execution ran for more than 10 hours. We'd like to have more control over it and somehow split this process into several batches when we run it in production, to minimize the impact.
    - Only a few hundred/thousand records are cleaned up from the events table on each housekeeper execution, and it runs quite long. Changing the MaxHousekeeperDelete parameter in the Zabbix server config doesn't affect this at all because, as far as I know, that parameter applies only to history data of items that have been removed.

    Our goals are:
    1. Perform the cleanup of all the internal events backlog from Zabbix DB.
    2. Configure regular housekeeping of events.

    I did some testing with the following query:
    delete from events where source=3 and object=0 and clock < [some_timestamp]
    In this case events are cleaned up much faster than by running the internal housekeeper, and the process can be controlled by running the query multiple times with different timestamp intervals.
    Records from the event_recovery and problem tables are also removed in this case via the foreign keys.
    This also shouldn't affect events related to real problems raised by triggers, which we need to keep for 1 year.
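
    To double-check the scope before deleting anything, a breakdown by event source and object confirms that nearly everything is internal (in the Zabbix schema, source=3 is EVENT_SOURCE_INTERNAL and object=0 means the event relates to a trigger); a minimal sanity-check query:

    SELECT source, object, COUNT(*) AS cnt
    FROM events
    GROUP BY source, object
    ORDER BY cnt DESC;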

    So, the main questions are:
    - What is the best way to perform such a cleanup in both cases (a one-time cleanup of the backlog and regular housekeeping)? Using the built-in housekeeper, or a scheduled script with custom SQL queries?
    In case of using the built-in housekeeper:
    1. How can we make it clean up more events on each execution and run faster?
    2. How can we configure it with more granularity when cleaning up the problem table, to avoid it running for 10+ hours?
    In case of running custom queries:
    1. Is my query above fine for performing the events cleanup, without causing any problems in the future?
    2. If not, which queries need to be executed for the events cleanup and are safe to run, so that we can be sure we will not get any data inconsistency?
    3. Does it make sense to keep internal events in the database for longer than 1 day?

    Thanks everyone in advance!

    #2
    Hi, I don't have an exact answer to all your questions, but I would do the one-shot cleanup job in whatever way works best (as this is just a one-time problem), and then use the usual housekeeper to do the cleaning regularly (apart from the partitioned tables). In some three years of running the same Zabbix servers I haven't seen a need to arrange special handling for events, problems, etc., while using partitioning for the history and trend tables at the same time.

    Hopefully someone else will chime in and comment on the best way to do this one-time cleanup.

    Markku



      #3
      Originally posted by RomanT
      The situation is the following:
      - Partitioning is configured for the trends and history tables. No issues here.
      - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
      - At the moment we have a huge amount of data in the events, event_recovery and problem tables (700,000,000+ records and 100GB+ of data for these 3 tables in total).
      It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
      Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
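
      Assuming the events table were range-partitioned on clock, dropping the oldest data becomes a single cheap statement (the partition name here is illustrative):

      ALTER TABLE events DROP PARTITION p2019_09;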

      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates



        #4
        Originally posted by kloczek
        It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
        Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
        Hi kloczek,
        Thanks for your reply!
        It's clear to us why we have so many events in the DB; as I mentioned, they are all internal events (source=3) caused by misconfigured triggers.

        Most of the events look like this:
        Cannot evaluate expression: "Cannot evaluate function "someswitch:net.if.in.errors[ifInErrors.1000].avg(5m)": not enough data.".
        We already fixed this, and the number of new events per hour is now 100 times lower than before. All we need to do now is safely clean up all the old ones.

        I'm not sure partitioning is a good idea, as the events table has several foreign keys, so dropping such a partition might cause data inconsistency (if it's even possible to partition a table with foreign keys). It might also take a lot of time to partition an existing table with this much data. It's fine for us to delete all these events using the query from my post in several steps, as it runs fast enough, but I'd like to be sure this won't impact anything, because manually deleting data from the DB always carries risks.

        Thanks.



          #5
          If you have a single database engine without a slave, you are in trouble.
          Normally, for something like changing the database schema or table layout to introduce partitions, the best approach is to do it on a slave and then promote that slave as the new master.

          If you don't have a slave, start thinking about rearchitecting the DB backend.
          If you have no snapshotable filesystem beneath your database, you've reached trouble lvl. 2, because without that it is much harder to create such a slave.

          Without a slave and/or a snapshotable FS underneath, you can still start heading in that direction by moving all monitoring off the server to proxies.
          With that you can schedule Zabbix server and DB backend downtime while your monitoring keeps working (collecting monitoring data), just without the alarming layer.
          With all monitoring done over proxies, only the server itself (using the Zabbix server template) should be monitored without a proxy.
          Then, with monitoring data still being collected, you can copy the database to another box to create a slave database. With the slave attached to the master you can run even one big delete query, or partition the events table, for as long as necessary.
          When the delete/partitioning is finished on the slave, and the changes accumulated in the meantime have been applied, you can promote the slave as the new master within a few seconds.
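
          A minimal sketch of that final promotion step, assuming classic MySQL replication (the statements and the DBHost parameter are standard; nothing here is specific to this thread):

          -- on the slave, once the delete/partitioning is done and it has caught up:
          STOP SLAVE;
          RESET SLAVE ALL;   -- detach from the old master, making this a standalone server
          -- then point DBHost in zabbix_server.conf at the promoted box and restart zabbix_server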



            #6
            Originally posted by kloczek
            If you have a single database engine without a slave, you are in trouble.
            ...
            Hi kloczek,

            Thanks for all your recommendations!

            Actually, we have 2 DB servers with replication configured, but if we are talking about cleaning up the DB with custom queries, the performance impact of running delete queries is the least of my worries. What I'd like to understand is whether I'm on the right track and can use the query from my initial post without any impact on database consistency (as opposed to performance and availability).
            By consistency I mean not ending up with a broken DB: issues with writing new events into the DB, errors in the web interface, problems stuck in some state, issues in Zabbix processes related to event processing, etc.

            As for the internal Zabbix housekeeper: as I understand it, it's not possible to run it against a slave DB anyway, because it runs against the database configured for the Zabbix server.

            Thanks.



              #7
              The housekeeper table holds data about what needs to be deleted from the history and trends tables. If items with certain itemids have been deleted, events related to those items need to be deleted as well.
              In other words, you can use the content of the housekeeper table to clean up the events table a bit. However, with a master and slave already in place, I would go for: partition the events table on the slave -> drop the oldest partitions -> promote the slave as the new master -> recreate the slave, and then clean that table by other methods.
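
              For reference, the housekeeper table is small and easy to inspect; per the Zabbix schema it has tablename, field and value columns (value is typically an itemid). A quick look at what is still queued:

              SELECT tablename, field, COUNT(*) AS pending
              FROM housekeeper
              GROUP BY tablename, field;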



                #8
                Originally posted by kloczek
                It is possible to partition the events table as well and drop the oldest events by dropping the oldest partition.
                Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                Do you know if there is any guidance anywhere on how events partitioning would be done? I do plan on redefining triggers, but I've never seen an answer on how to deal with the events, event_recovery, and problem tables growing faster than the housekeeper can keep up. With a large enough installation, it does seem this would naturally be the fix.



                  #9
                  Partitioning is completely transparent to all SQL queries.
                  Using partitioned tables only changes the method of removing the oldest data, which by default has to be removed with a delete query.
                  Instead of deleting individual table rows, the whole oldest partition is dropped (effectively deleting all of it with only a few IOs).



                    #10
                    Originally posted by kloczek
                    Partitioning is completely transparent to all SQL queries.
                    Using partitioned tables only changes the method of removing the oldest data, which by default has to be removed with a delete query.
                    Instead of deleting individual table rows, the whole oldest partition is dropped (effectively deleting all of it with only a few IOs).
                    I may be misunderstanding, but I was looking for information on how to partition the events, event_recovery, and problem tables so I can add them to my partition-dropping script. I haven't seen anyone describe how they partitioned those tables, just that it is possible.



                      #11
                      Just use the same SQL procedure which you are using to partition the trends* and history* tables.
                      Only specify events as the table which needs to be partitioned.
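
                      For the record, the usual shape of this on MySQL, sketched by hand rather than taken from any particular script (partition names and epoch boundaries are illustrative). Note RomanT's caveat above: InnoDB does not support foreign keys on partitioned tables, and the partitioning column must be part of every unique key, so the foreign keys referencing events would have to be dropped and the primary key extended to (eventid, clock) first:

                      ALTER TABLE events
                          PARTITION BY RANGE (clock) (
                              PARTITION p2019_09 VALUES LESS THAN (1569888000),  -- 2019-10-01 00:00 UTC
                              PARTITION p2019_10 VALUES LESS THAN (1572566400)   -- 2019-11-01 00:00 UTC
                          );
                      -- the same script then adds new partitions ahead of time and drops the oldest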



                        #12
                        Hello,
                        I may have misunderstood - sorry if so.

                        Originally posted by RomanT
                        The situation is the following:
                        - Partitioning is configured for the trends and history tables. No issues here.
                        - The internal housekeeper is disabled by setting HousekeepingFrequency=0 in the config.
                        The partitioning you have is fine, but keep the internal housekeeper turned on -- except for trends and history. You can override the trend and history housekeeping individually.

                        Our one-shot deletes took hours, so not much help for you there. I think we ran them over the weekend.



                          #13
                          RomanT, did you end up making progress with this?

                          I'm in a similar position where our events table is much larger than our history tables:

                          $ ls -Slh zabbix_prod/
                          total 625G
                          256G events.ibd
                          180G history_uint.ibd
                          104G event_recovery.ibd
                          42G history.ibd
                          34G trends_uint.ibd
                          3.7G auditlog.ibd
                          2.2G sessions.ibd
                          2.0G trends.ibd


                          There are many rows in the events table whose clock timestamp is earlier than the date at which housekeeping should already have deleted them, so I imagine housekeeping isn't keeping up with flushing out all these events, as you and others have described above.
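
                          A quick way to quantify the backlog (a sketch, assuming a 365-day event retention period):

                          SELECT COUNT(*)
                          FROM events
                          WHERE clock < UNIX_TIMESTAMP(NOW() - INTERVAL 365 DAY);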

                          I'm a little concerned that simply using SQL to delete them might leave remnants in other tables that then never get cleaned up, because they're not handled by either the usual housekeeping tasks or some other internal method.

                          Furthermore, I'm interested to know what is meant by this comment:
                          Originally posted by kloczek
                          Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                          If the events are internal, then what recourse do we have other than trying to make sure that checks are being performed correctly? The lack of a UI to see a summary of these internal events in the Zabbix frontend makes it difficult to discover issues, and setting up alerts for new events doesn't solve the problem we're discussing here, because we're looking at historical issues rather than present ones.



                            #14
                            Originally posted by jameskirsop
                            RomanT, did you end up making progress with this?
                            ...
                            Hi James,

                            Yes, we successfully performed the cleanup.
                            The total running time of the delete queries was about 40+ hours. We divided all the internal events by timestamp into batches that each take about 1 minute to delete (150,000 events per batch) and executed them with a script that runs a delete query and then waits for 1 minute, to let the replication finish and to process the remaining queries stuck in the DB queue. We considered the LIMIT option for the delete queries, but it can cause replication issues when used without an ORDER BY clause, so we decided to avoid it.
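
                            To make the shape of this concrete, a minimal sketch of what the script executed (the epoch values here are placeholders; the real ones came from the timestamp-list query below, one per 150,000 events):

                            DELETE FROM events WHERE source=3 AND object=0 AND clock < 1546300800;
                            DO SLEEP(60);  -- give replication and queued queries a minute to catch up
                            DELETE FROM events WHERE source=3 AND object=0 AND clock < 1546387200;
                            DO SLEEP(60);
                            -- ...and so on through the generated list of timestamps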

                            Initially we ran longer queries but hit the following issue: a related table (event_suppress or event_recovery, I don't remember exactly) that is connected to the events table by a foreign key was locked while the delete query was executing against the events table. This caused a storm of nodata-trigger-based problems, such as Zabbix agent unavailability. That's why we made the batches smaller.

                            After that we executed OPTIMIZE TABLE on the events, problem and event_recovery tables to free up some disk space.

                            The query we used is the one from my initial post.
                            The list of timestamps for the delete queries was produced with the following query:

                            SELECT * FROM (
                                SELECT @rownum := @rownum + 1 AS rank, e.clock
                                FROM events e, (SELECT @rownum := 0) r
                                WHERE e.source = 3 AND e.object = 0
                                ORDER BY e.clock ASC
                            ) AS t
                            WHERE t.rank % 150000 = 0;
                            This query took around 50 minutes to execute.
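
                            As an aside, on MySQL 8.0+ or MariaDB 10.2+ the same list can be produced with a window function instead of the @rownum trick (an equivalent rewrite, not the query we actually ran; the rn alias avoids RANK, which is a reserved word in MySQL 8):

                            SELECT rn, clock
                            FROM (
                                SELECT ROW_NUMBER() OVER (ORDER BY clock) AS rn, clock
                                FROM events
                                WHERE source = 3 AND object = 0
                            ) AS t
                            WHERE rn % 150000 = 0;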

                            Regarding the remnants you are concerned about: I don't think anything will be left over in the database, because a lot of tables are connected to the events table by foreign keys with the ON DELETE CASCADE option. That means all records in those other tables related to a deleted event will be deleted as well. Actually, the Zabbix built-in housekeeper does almost the same thing: it executes delete queries against the events and problem tables only, but it also performs some additional checks for each event that is going to be deleted.

                            You can check the housekeeper code here:
                            https://github.com/zabbix/zabbix/blo.../housekeeper.c

                            You can check the list of events related tables here:
                            https://zabbix.org/wiki/Docs/DB_schema/4.0/events

                            Hope this helps.

                            Thanks.




                              #15
                              Originally posted by kloczek
                              Nevertheless, a huge number of events is usually caused by a not-so-well-defined alarming layer.
                              jameskirsop

                              This gets brought up a lot, and while it is probably most users' problem, it is not a solution. The thinking here is that we have more alarming and event generation than we need, so we should disable these alarms or raise their thresholds so the housekeeper can keep up. That is not sustainable, because as monitoring systems grow and more is added, official templates are often used, and those come with exactly the over-defined alarming layers in question. Even without templates, a large enough monitoring system will hit an alarm volume where the housekeeper cannot keep up.

                              In my opinion, an official solution is needed here, or at least a default Zabbix alert for when the housekeeper is not keeping up.

                              If I can find a good automatic solution, I might check whether someone else has already requested this as a feature so I can add onto that request.


                              As for how I have been dealing with it manually: I create EVENTS_NEW like EVENTS, INSERT the data newer than a given clock, rename EVENTS to EVENTS_OLD, rename EVENTS_NEW to EVENTS, double-check, then drop EVENTS_OLD.
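
                              In SQL terms the swap looks roughly like this (a sketch with an illustrative cutoff epoch; note that foreign keys referencing events follow the rename to events_old, so they would have to be dropped and recreated before the final DROP succeeds):

                              CREATE TABLE events_new LIKE events;
                              INSERT INTO events_new SELECT * FROM events WHERE clock > 1569888000;
                              RENAME TABLE events TO events_old, events_new TO events;
                              -- double-check row counts and frontend behaviour, then:
                              DROP TABLE events_old;
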
                              Last edited by rankinc; 10-10-2019, 16:05.
