Housekeeper Rows ever increasing

  • CrashOverride
    Junior Member
    • Jul 2019
    • 12

    #1

    Housekeeper Rows ever increasing

    Hi,

    A while back we had problems with Housekeeping being too heavy. We were able to resolve the issue by moving the database to faster storage.
    However, since that move we have seen an ever-increasing number of "Housekeeper Rows". I am not quite sure what this is, but I guess it is the number of rows in the housekeeper table, which tells the housekeeping process what should be removed when it runs?
    We are currently at around 5 million rows. Please see the attached screenshot. You can see that the number of busy housekeeper processes decreased when we moved to faster storage (that's my conclusion at least).

    The high number and the ever-increasing nature worries me. What could be the problem here?


    [Attached screenshot: 2020-01-09 13_39_20-Window.png]

    Log rows from zabbix_server.log related to Housekeeper look like this:
    100517:20200109:084554.375 executing housekeeper
    100517:20200109:084828.966 housekeeper [deleted 4780797 hist/trends, 0 items/triggers, 211 events, 172 problems, 1615 sessions, 0 alarms, 1674 audit, 0 records in 154.590080 sec, idle for 1 hour(s)]
    100517:20200109:094829.267 executing housekeeper
    100517:20200109:095119.514 housekeeper [deleted 4925729 hist/trends, 0 items/triggers, 200 events, 148 problems, 1759 sessions, 0 alarms, 1819 audit, 0 records in 170.246319 sec, idle for 1 hour(s)]
    100517:20200109:105119.838 executing housekeeper
    100517:20200109:105400.186 housekeeper [deleted 4947061 hist/trends, 0 items/triggers, 251 events, 274 problems, 1525 sessions, 0 alarms, 1587 audit, 0 records in 160.345824 sec, idle for 1 hour(s)]
    100517:20200109:115400.451 executing housekeeper
    100517:20200109:115648.750 housekeeper [deleted 5009664 hist/trends, 0 items/triggers, 309 events, 2214 problems, 1139 sessions, 0 alarms, 1199 audit, 0 records in 168.298244 sec, idle for 1 hour(s)]
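
The summary lines above follow a regular pattern, so the counters can be pulled out and trended. A minimal Python sketch, assuming the log format is exactly as quoted in this post; the names `SUMMARY_RE` and `parse_housekeeper_summary` are made up for this illustration and are not part of Zabbix:

```python
import re

# Matches the summary line the housekeeper writes after each run,
# in the format quoted in this thread (an assumption for other versions).
SUMMARY_RE = re.compile(
    r"housekeeper \[deleted (?P<counters>.+?) in (?P<seconds>\d+(?:\.\d+)?) ?sec"
)


def parse_housekeeper_summary(line):
    """Return (counters, seconds) for a summary line, or None for any other line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    counters = {}
    # Each part looks like "4780797 hist/trends" or "0 records".
    for part in match.group("counters").split(", "):
        count, name = part.split(" ", 1)
        counters[name] = int(count)
    return counters, float(match.group("seconds"))


SAMPLE = (
    "100517:20200109:084828.966 housekeeper [deleted 4780797 hist/trends, "
    "0 items/triggers, 211 events, 172 problems, 1615 sessions, 0 alarms, "
    "1674 audit, 0 records in 154.590080 sec, idle for 1 hour(s)]"
)

if __name__ == "__main__":
    counters, seconds = parse_housekeeper_summary(SAMPLE)
    print(counters["hist/trends"], "hist/trends rows deleted in", seconds, "seconds")
```

Feeding every line of zabbix_server.log through this function and keeping the non-None results would give one data point per housekeeper run.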
    Last edited by CrashOverride; 09-01-2020, 14:51.
  • gofree
    Senior Member
    Zabbix Certified SpecialistZabbix Certified Professional
    • Dec 2017
    • 400

    #2
    It's kind of a Zabbix evergreen. Try searching for Zabbix database partitioning on the forum, the web, etc.

    A bit on the topic from the Zabbix guys:

    https://www.youtube.com/watch?v=Y3qtLQyT8DM


    • CrashOverride
      Junior Member
      • Jul 2019
      • 12

      #3
      Originally posted by gofree
      It's kind of a Zabbix evergreen. Try searching for Zabbix database partitioning on the forum, the web, etc.

      A bit on the topic from the Zabbix guys:

      https://www.youtube.com/watch?v=Y3qtLQyT8DM
      Hey gofree thanks for replying. As I understand it, database partitioning is an alternative to Housekeeping altogether. We have discussed going down that route but would like to avoid it at the moment.
      Is it really the case that this is the intended behavior unless database partitioning is used? That doesn't seem right, especially since this was not the case for us before June. Unless it changed in a newer version (sorry, I don't have the history of our version upgrades, so I can't say whether this coincides with one).


      • gofree
        Senior Member
        Zabbix Certified SpecialistZabbix Certified Professional
        • Dec 2017
        • 400

        #4
        Problems with the housekeeper are not intended; they emerge from the nature of it. Especially in big environments with a lot of checks and short intervals, it is inevitable to use partitioning. I guess you have hit that size and time.

        As you have fast storage, try to increase the housekeeper interval and the MaxHousekeeperDelete setting. This way you can speed up the removal of unneeded data:



        ### Option: HousekeepingFrequency
        # How often Zabbix will perform housekeeping procedure (in hours).
        # Housekeeping is removing unnecessary information from history, alert, and alarms tables.
        #
        # Mandatory: no
        # Range: 1-24
        # Default:
        # HousekeepingFrequency=1
        HousekeepingFrequency=4

        ### Option: MaxHousekeeperDelete
        # The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
        # [housekeeperid], [tablename], [field], [value].
        # No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
        # will be deleted per one task in one housekeeping cycle.
        # SQLite3 does not use this parameter, deletes all corresponding rows without a limit.
        # If set to 0 then no limit is used at all. In this case you must know what you are doing!
        #
        # Mandatory: no
        # Range: 0-1000000
        # Default:
        # MaxHousekeeperDelete=500
        MaxHousekeeperDelete=10000

        ### Option: DisableHousekeeping
        # If set to 1, disables housekeeping.
        #
        # Mandatory: no
        # Range: 0-1
        # Default:
        # DisableHousekeeping=0


        • CrashOverride
          Junior Member
          • Jul 2019
          • 12

          #5
          We run housekeeping every hour and MaxHousekeeperDelete is set to 0 (unlimited).
          Housekeeping seems to run without issue, at least no errors are logged. Takes a few minutes each time (see initial post).
          I don't really think that this is unfixable without partitioning. If that really is the case, it would be nice to have some official word on it.
          As far as I understand partitioning is mainly a solution to the problem with housekeeping not being able to keep up. I currently don't see that problem in our setup.


          • gofree
            Senior Member
            Zabbix Certified SpecialistZabbix Certified Professional
            • Dec 2017
            • 400

            #6
            It seems that every run deletes approx. 5M history/trends rows, so that looks like some kind of limit being reached for now.

            I can't find the "Housekeeper Rows" metric in the docs. Try to find out what it is (something custom?). It might just be that the housekeeper is deleting 5M rows every hour and you're good.


            • CrashOverride
              Junior Member
              • Jul 2019
              • 12

              #7
              I can trigger the housekeeper manually by running zabbix_server -c /path/to/zabbix_server.conf -R housekeeper_execute
              If I do that, far fewer hist/trends are deleted, so your conclusion that 5M per run is some kind of limit is incorrect. It simply seems to be the approximate number of hist/trends that accumulate over 1h, which is our housekeeper interval.

              Here is one normal scheduled run followed by two forced runs:

              100517:20200110:074345.487 executing housekeeper
              100517:20200110:074636.186 housekeeper [deleted 4701134 hist/trends, 1916330 items/triggers, 275 events, 7874 problems, 1501 sessions, 0 alarms, 1560 audit, 0 records in 170.697747 sec, idle for 1 hour(s)]
              100517:20200110:075622.036 forced execution of the housekeeper
              100517:20200110:075622.036 executing housekeeper
              100517:20200110:075733.496 housekeeper [deleted 977257 hist/trends, 0 items/triggers, 42 events, 22 problems, 442 sessions, 0 alarms, 454 audit, 0 records in 71.458398 sec, idle for 1 hour(s)]
              100517:20200110:075741.191 forced execution of the housekeeper
              100517:20200110:075741.191 executing housekeeper
              100517:20200110:075817.657 housekeeper [deleted 101709 hist/trends, 0 items/triggers, 9 events, 2 problems, 46 sessions, 0 alarms, 47 audit, 0 records in 36.465147 sec, idle for 1 hour(s)]
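
The arithmetic behind that reading can be checked from the three runs above. A rough Python sketch: it divides each run's hist/trends count by the time elapsed since the previous run started. The 3600-second interval for the scheduled run is an assumption (the log only says "idle for 1 hour(s)"), and the function names are hypothetical:

```python
from datetime import datetime

# (start timestamp from the "executing housekeeper" line, hist/trends deleted by that run)
RUNS = [
    ("20200110:074345.487", 4701134),  # scheduled run
    ("20200110:075622.036", 977257),   # first forced run
    ("20200110:075741.191", 101709),   # second forced run
]

# Assumption: the run before the scheduled one started roughly one hour earlier.
ASSUMED_SCHEDULED_INTERVAL_S = 3600.0


def parse_start(stamp):
    """Parse the 'YYYYMMDD:HHMMSS.mmm' part of a zabbix_server.log line."""
    return datetime.strptime(stamp, "%Y%m%d:%H%M%S.%f")


def deletion_rates(runs, first_interval_s):
    """Return deleted rows per second elapsed since the previous run started."""
    rates = []
    previous_start = None
    for stamp, deleted in runs:
        start = parse_start(stamp)
        if previous_start is None:
            elapsed = first_interval_s  # no earlier run in the excerpt
        else:
            elapsed = (start - previous_start).total_seconds()
        rates.append(deleted / elapsed)
        previous_start = start
    return rates


if __name__ == "__main__":
    for rate in deletion_rates(RUNS, ASSUMED_SCHEDULED_INTERVAL_S):
        print(f"{rate:.0f} rows/s")
```

With these numbers all three runs land at about 1,300 rows per second, which fits the interpretation that the count reflects data that became eligible since the previous run rather than a fixed limit.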

              Housekeeper Rows is a built-in item in Zabbix's monitoring of itself. The graph from my initial post is also built-in. It's called "Housekeeping" and can be found on the Zabbix host.


              • gofree
                Senior Member
                Zabbix Certified SpecialistZabbix Certified Professional
                • Dec 2017
                • 400

                #8
                What I meant by limit is that every time (every hour) about 5M hist/trends are deleted. If you run it manually earlier, of course it will remove fewer. So I think it might be that it just works fine and your number of rows represents the number of values deleted.

                As for the item Housekeeper Rows, I can't find it in my newly installed 4.4 environment. Can you elaborate on which template it is in, which version of Zabbix you're using, and your NVPS value, and share the item configuration (maybe I'm just blind and can't see it)? Maybe your DB is performing well and the item is simply counting how many rows have been deleted over time (and the value of ~5M hist/trends is not related). I'm curious about the item: where and how does it get its values?

                I can't find it among the internal Zabbix server items:

                https://www.zabbix.com/documentation...types/internal

                Last edited by gofree; 10-01-2020, 10:06.


                • CrashOverride
                  Junior Member
                  • Jul 2019
                  • 12

                  #9
                  Interesting. Maybe it's something custom that was added by us in the past. Or something that was present in an older version that we have upgraded from.
                  Will check.

                  Edit: I checked the item. It's in our "Template_Zabbix_Server", which I believe is a default template. However, this specific item (Housekeeper rows) looks like it could have been added by us at some point. It's basically a "system.run[]" item that runs a MySQL query for the number of rows in the housekeeper table.

                  That would explain the lack of documentation regarding the item.

                  My worries remain, though. It's generally not a good thing to see something increasing forever like this, especially when, as in this case, it was seemingly cleared out on every housekeeper run in the past.
                  Last edited by CrashOverride; 10-01-2020, 10:18.


                  • gofree
                    Senior Member
                    Zabbix Certified SpecialistZabbix Certified Professional
                    • Dec 2017
                    • 400

                    #10
                    It could be that when you had those performance issues it simply didn't grow that much because the housekeeper struggled, or you didn't have that much data incoming. Now it would be interesting to see the query itself.

                    Edit: https://www.zabbix.com/forum/zabbix-...e-in-zabbix-db has a bit of info on the topic. The question would be how big (in disk space) the table is and whether you need to worry.

                    Edit2: I don't have access right now to the DB of a bigger Zabbix installation (customers'), but my test installs have the housekeeper table empty (they are just small test instances, not running all the time). Even with Timescale partitioning the table is not partitioned, so I guess you need to find out what is written into the table (or the purpose of the table) and whether there were any custom modifications.
                    Last edited by gofree; 10-01-2020, 10:56.


                    • CrashOverride
                      Junior Member
                      • Jul 2019
                      • 12

                      #11
                      Here is a zoomed in screenshot of the graph. Before June 7th, the housekeeper rows were removed when the housekeeper ran, but that has stopped happening.
                      If the housekeeper struggles, to me that would be a reason for an increasing number of rows in the housekeeper table. But it doesn't struggle, so I'm still looking for an explanation for the never-ending increase.
                      Back when the housekeeper did struggle (before June 7th), it was regularly emptied when the housekeeper ran, even if it ran poorly/slowly.

                      This is what I think the housekeeper-table is: A list of stuff to clean up that grows over time until housekeeping occurs and then the cleaned-up things are removed from the list.
                      Maybe that's not what it is?

                      [Attached screenshot: 2020-01-10 09_28_52-Window.png]
                      The query is a simple "select count(*) from housekeeper". I have verified the value by manually querying the mysql-server directly.


                      • gofree
                        Senior Member
                        Zabbix Certified SpecialistZabbix Certified Professional
                        • Dec 2017
                        • 400

                        #12
                        My understanding is that the housekeeper removes data from the history and trend tables. I'm not sure what the housekeeper table itself is for (the database schema and functions are kind of internal magic; maybe it is used in some way to delete the hist data). Can you check what the rows in the table look like and share your housekeeper settings from the GUI under Administration/General/Housekeeping?

                        In general, if something like this happens, something was changed. Maybe for some items you have a long history and they're not being removed by the housekeeper, which can be solved with the override option in the housekeeper settings.

                        Not much on how it works: https://zabbix.org/wiki/Docs/DB_schema/4.0/housekeeper


                        Edit: see here kloczek post



                        In housekeeper table are data about what needs to be deleted from history and trends tables. If some itemid items have been deleted events related to those items needs to be deleted as well.
                        In other words you can use content of the housekeeper table to clean a bit events table however with already working master and slave I would go for partition events table on slave -> drop oldest partitions -> promote slave as new master -> recreate slave and than clean that table other methods.
                        Last edited by gofree; 10-01-2020, 11:32.


                        • CrashOverride
                          Junior Member
                          • Jul 2019
                          • 12

                          #13
                          Originally posted by gofree
                          My understanding is that the housekeeper removes data from the history and trend tables
                          Yeah this is correct as far as I know. My assumption is that the housekeeper table is the todo-list for that.

                          The table looks like this:
                          Code:
                          mysql> select * from housekeeper limit 100;
                          +---------------+--------------+--------+---------+
                          | housekeeperid | tablename    | field  | value   |
                          +---------------+--------------+--------+---------+
                          |      46874176 | history      | itemid | 4489808 |
                          |      46874177 | history_str  | itemid | 4489808 |
                          |      46874178 | history_uint | itemid | 4489808 |
                          |      46874179 | history_log  | itemid | 4489808 |
                          |      46874180 | history_text | itemid | 4489808 |
                          |      46874183 | history      | itemid | 4489809 |
                          |      46874184 | history_str  | itemid | 4489809 |
                          Housekeeping-settings:

                          [Attached screenshot: 2020-01-10 10_24_49-Window.png]


                          • gofree
                            Senior Member
                            Zabbix Certified SpecialistZabbix Certified Professional
                            • Dec 2017
                            • 400

                            #14
                            According to info from the forum (not officially documented), the housekeeper table might be this to-do list for the housekeeper.

                            Two things could be interesting: how many rows are marked for deletion from the history* tables (7d) and how many from the trend* tables. Maybe you have long trends that just keep adding up until it is time to delete them (in the far future), but who knows. This is more of a Zabbix support case, or a deeper investigation for somebody who has had a similar experience.
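
The per-table breakdown suggested here comes down to a GROUP BY on the housekeeper table. A minimal Python sketch, assuming the four columns shown in the table output earlier in the thread. It uses an in-memory SQLite database filled with those seven sample rows as a stand-in for the real MySQL database, and the helper names are hypothetical:

```python
import sqlite3

# The seven rows posted earlier in the thread: (housekeeperid, tablename, field, value)
SAMPLE_ROWS = [
    (46874176, "history", "itemid", 4489808),
    (46874177, "history_str", "itemid", 4489808),
    (46874178, "history_uint", "itemid", 4489808),
    (46874179, "history_log", "itemid", 4489808),
    (46874180, "history_text", "itemid", 4489808),
    (46874183, "history", "itemid", 4489809),
    (46874184, "history_str", "itemid", 4489809),
]


def build_sample_db():
    """Create an in-memory stand-in for the housekeeper table (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE housekeeper ("
        "housekeeperid INTEGER PRIMARY KEY, tablename TEXT, field TEXT, value INTEGER)"
    )
    conn.executemany("INSERT INTO housekeeper VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def count_tasks_by_table(conn):
    """Return {tablename: number of pending housekeeper tasks}."""
    cursor = conn.execute(
        "SELECT tablename, COUNT(*) FROM housekeeper GROUP BY tablename ORDER BY tablename"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    for table, tasks in count_tasks_by_table(build_sample_db()).items():
        print(table, tasks)
```

The SELECT statement is plain SQL, so the same query should also run directly in the mysql client against the real table; on a table with millions of rows it may take a while.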

                            On the other hand, if the table size is not big and not growing exponentially, I wouldn't be worried (provided there are no other effects on Zabbix performance).

                            One thing you could try, at the risk of losing trends, is to set up an override period for trends, for example 1 year, 6 months, or 3 months, and execute the housekeeper manually. If that's the case, you should see a decrease in the number of rows. Of course, better to try it in a test environment (or in prod).


                            • Markku
                              Senior Member
                              Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
                              • Sep 2018
                              • 1781

                              #15
                              Here is some discussion about housekeeper table: https://www.zabbix.com/forum/zabbix-...e-in-zabbix-db

                              There is a pointer to https://zabbix.org/wiki/Docs/howto/mysql_partitioning which says:

                              1. Even with the housekeeping for items disabled, Zabbix server and web interface will keep writing housekeeping information for future use into the housekeeper table. To avoid this, you can set ENGINE = Blackhole for this table:

                              ALTER TABLE housekeeper ENGINE = BLACKHOLE;
                              In one of my Zabbix installations, currently 4.0, originally 3.2 and 3.4, running for three years now, partitioned some year ago:

                              Code:
                              MariaDB [zabbix]> select count(*) from housekeeper;
                              +----------+
                              | count(*) |
                              +----------+
                              |    74466 |
                              +----------+
                              The entries are all history and trend, as the housekeeper is disabled for those due to partitioning.

                              What is interesting is that zabbix_server.log contains entries like this:

                              Code:
                               10110:20200113:142330.026 housekeeper [deleted 0 hist/trends, 214 items/triggers, 3484 events, 1978 problems, 0 sessions, 0 alarms, 0 audit items in 0.620801 sec, idle for 1 hour(s)]
                               10110:20200113:152330.691 housekeeper [deleted 0 hist/trends, 267 items/triggers, 3516 events, 2171 problems, 0 sessions, 0 alarms, 0 audit items in 0.350498 sec, idle for 1 hour(s)]
                               10110:20200113:162331.224 housekeeper [deleted 0 hist/trends, 36 items/triggers, 3466 events, 2213 problems, 0 sessions, 0 alarms, 0 audit items in 0.230449sec, idle for 1 hour(s)]
                              The number of events and problems deleted every hour is very surprising. If I run the housekeeper manually (zabbix_server -R housekeeper_execute), it looks like this:

                              Code:
                               10110:20200113:164211.962 housekeeper [deleted 0 hist/trends, 0 items/triggers, 952 events, 690 problems, 0 sessions, 0 alarms, 0 audit items in 0.117959 sec, idle for 1 hour(s)]
                               10110:20200113:164325.963 housekeeper [deleted 0 hist/trends, 0 items/triggers, 6 events, 2 problems, 0 sessions, 0 alarms, 0 audit items in 0.035829 sec, idle for 1 hour(s)]
                              So, it looks like we had 3000/2000 events/problems happening every hour one year ago, constantly, but that is not what we experienced in monitoring. So I don't know what the numbers really mean. They seem to be time-related anyway, as the number is smaller if the housekeeper is run more frequently.

                              OK, I now ran this:

                              Code:
                              MariaDB [zabbix]> select * from events order by clock limit 10;
                              +---------+--------+--------+----------+------------+-------+--------------+-----------+--------------------------------------+----------+
                              | eventid | source | object | objectid | clock      | value | acknowledged | ns        | name                                 | severity |
                              +---------+--------+--------+----------+------------+-------+--------------+-----------+--------------------------------------+----------+
                              |       4 |      3 |      4 |    23262 | 1481666562 |     1 |            0 | 265573348 | Cannot obtain item value.            |        0 |
                              |       5 |      3 |      0 |    13477 | 1481666562 |     1 |            0 | 265573348 | Cannot calculate trigger expression. |        0 |
                              |       6 |      3 |      4 |    23267 | 1481666567 |     1 |            0 | 266232164 | Cannot obtain item value.            |        0 |
                              |       7 |      3 |      0 |    13482 | 1481666567 |     1 |            0 | 266232164 | Cannot calculate trigger expression. |        0 |
                              |    1549 |      3 |      0 |    14810 | 1483631442 |     1 |            0 | 371446409 | Cannot calculate trigger expression. |        0 |
                              |    1550 |      3 |      0 |    14811 | 1483631442 |     1 |            0 | 357187617 | Cannot calculate trigger expression. |        0 |
                              |    1551 |      3 |      0 |    14812 | 1483631442 |     1 |            0 | 357187617 | Cannot calculate trigger expression. |        0 |
                              The timestamps point to Dec 2016 - Jan 2017. There is also a huge number of "Cannot obtain item value." messages in the events table. So, I guess, they kind of explain the number of deleted entries each hour, and for some reason those three-year-old entries are still not deleted. Maybe I will inspect this later again, or maybe not... I don't have any actual problem, just observing.
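
The clock column holds Unix epoch seconds, so the dates can be verified with a short conversion. A small Python sketch using two of the clock values from the query output above; the function name `clock_to_utc` is made up for this example, and UTC is assumed for display:

```python
from datetime import datetime, timezone


def clock_to_utc(clock):
    """Convert a Unix epoch value (as stored in events.clock) to a UTC timestamp string."""
    return datetime.fromtimestamp(clock, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # First and last distinct clock values from the rows shown above.
    for clock in (1481666562, 1483631442):
        print(clock, "->", clock_to_utc(clock))
```

The two values convert to 13 December 2016 and 5 January 2017 (UTC), matching the range stated above.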

                              Markku
