Zabbix housekeeper processes more than 75% busy

  • sph919
    Member
    • Jan 2019
    • 38

    #1

    Hi all,

    I keep getting the above warning. I've gone through the forum, but I don't understand what the process is used for and why it matters that it's busy.
    I've looked at zabbix_server.conf and read the notes, including the warning "you must know what you are doing!" at the end of them.

    So I just added HousekeepingFrequency=1 and left the MaxHousekeeperDelete alone.

    Does anyone have any recommended settings for this?

    My current setup:

    Zabbix 4.0.1 with MariaDB 10.1.34

    260 hosts
    4534 items
    2392 triggers

    Thanks


  • Andrzej PAwlik
    Junior Member
    • Feb 2019
    • 27

    #2
    The housekeeper is a process that keeps the database from growing too large.

    It deletes old history values and trends, along with the data of deleted items.

    HousekeepingFrequency=1 starts this process every hour.

    MaxHousekeeperDelete is the maximum number of rows deleted from the database per housekeeping task in one housekeeper cycle.

    In my environment the housekeeper runs for 20 minutes at 100% busy.

    I have:

    HousekeepingFrequency=1

    MaxHousekeeperDelete=50000
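    To check which values the server will actually use, a small helper can read them from zabbix_server.conf. This is a minimal sketch, not something shipped with Zabbix: the function name zbx_conf_get is hypothetical, it assumes plain Key=Value lines, and the default you pass in should match the comments in your own zabbix_server.conf (for 4.0 these are 1 for HousekeepingFrequency and 5000 for MaxHousekeeperDelete).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Zabbix): print the value of a
# zabbix_server.conf parameter, or a default if it is absent/commented out.
# Assumes plain "Key=Value" lines; if a key appears more than once, the last
# occurrence is printed.
zbx_conf_get() {
    conf_file=$1   # path to zabbix_server.conf (location varies by install)
    key=$2         # e.g. HousekeepingFrequency
    default=$3     # value to print when the key is not set
    [ -r "$conf_file" ] || { echo "cannot read $conf_file" >&2; return 1; }
    # The pattern is anchored at line start, so "# Key=..." comment lines never match.
    value=$(sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*//p" "$conf_file" | tail -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$default"
    fi
}
```

    For example, assuming the file lives at /etc/zabbix/zabbix_server.conf, zbx_conf_get /etc/zabbix/zabbix_server.conf MaxHousekeeperDelete 5000 would print 50000 with the settings above, and 5000 if the line were commented out.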

    • sph919
      Member
      • Jan 2019
      • 38

      #3
      Originally posted by Andrzej PAwlik
      Thanks Andrzej, that makes more sense now. I will change MaxHousekeeperDelete. As for the alert from Zabbix, is it a problem, or is it just saying that the process is working?

      • Andrzej PAwlik
        Junior Member
        • Feb 2019
        • 27

        #4
        You can disable this alert and monitor the housekeeper in a different way.

        The Zabbix Server template contains the item

        zabbix[process,housekeeper,avg,busy] (type Zabbix internal). I deleted the trigger "Zabbix housekeeper processes more than 75% busy".

        For my own use I created the item
        log[yourdir/zabbix_server.log,housekeeper]
        Type Zabbix agent (active), type of information Log, update interval 10 minutes.

        In zabbix_server.conf:

        DebugLevel=3
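        With a log item like this, the duration of each housekeeping run can also be pulled out of zabbix_server.log by hand. This is a minimal sketch that assumes the housekeeper's summary lines look like "housekeeper [deleted ... in 1234.567890 sec, idle for 1 hour(s)]"; the exact wording differs between Zabbix versions, so compare against a line from your own log first. The function names hk_durations and hk_summary are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: extract housekeeper run times from a Zabbix server log.
# Assumes summary lines of the form
#   "... housekeeper [deleted ... in <seconds> sec, idle for ...]".

# Print the duration (seconds) of every housekeeper run found in the log file.
hk_durations() {
    sed -n 's/.*housekeeper \[deleted .* in \([0-9][0-9.]*\) sec.*/\1/p' "$1"
}

# Print the number of runs, the longest run and the average run time.
hk_summary() {
    hk_durations "$1" | awk '
        { n++; sum += $1; if (n == 1 || $1 > max) max = $1 }
        END { if (n > 0) printf "runs=%d max=%.1fs avg=%.1fs\n", n, max, sum / n }'
}
```

        For example, hk_summary yourdir/zabbix_server.log shows at a glance whether runs regularly take as long as the 20 minutes mentioned above.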

        • sph919
          Member
          • Jan 2019
          • 38

          #5
          Brill, I'll give it a go now. Thanks for your help, Andrzej, very much appreciated.

          • LMYpz7rD
            Junior Member
            • Feb 2019
            • 1

            #6
            I was getting the same message every hour for the past couple of days after deleting some hosts.

            Is there a better way to delete hosts that won't cause this issue?

            Should I be worried about my sizing, or is there anything I should be looking at with regard to performance?

            My Zabbix server: 2 CPUs, 8 GB RAM
            My PostgreSQL DB: 2 CPUs, 16 GB RAM

            I currently have only 9 hosts, with a total of only about 9000 items.
            My items are polled only once a minute and history is kept for 90 days.

            I was hoping to scale this up to many, many more hosts.

            This seemed to be running smoothly until I deleted some hosts (and presumably their 90 days of history).

            • acejohns2
              Junior Member
              • May 2019
              • 1

              #7
              Originally posted by LMYpz7rD
              I am also wondering about this. I recently disabled the Windows service discovery in the Zabbix Agent template. But before I did, I added a regular expression to its preprocessing that would cause all the data related to services to time out and be deleted. This seems to have created a lot of work for my housekeeper process. It has been running for hours.

              I don't know if I should just let it keep running until it finishes, or if there is a better way to clear out this data.

              • scuba
                Junior Member
                • Jun 2018
                • 16

                #8
                When your NVPS (new values per second) reaches 500 or more, you should consider DB partitioning and disabling housekeeping for the history and trend tables.

                Otherwise you will keep getting these busy messages, and eventually housekeeping becomes a huge job that locks up the server for a while.

                • janderson
                  janderson commented
                  I liked your comment and your idea, but if you disable housekeeping, how will the partitioned database be cleaned up? Would it be done manually?

                  Sorry for my weak English.

                • scuba
                  scuba commented
                  After you partition your DB, you have to clean the history and trend tables yourself, every day or every month, with a script run from crontab.
              • Linwood
                Senior Member
                • Dec 2013
                • 398

                #9
                The part that has never made sense to me is that it appears to be "a" process, as in one. So it is either 100% busy or idle, except in fractional periods (i.e. poll intervals in which it ran for only part of the period). Each poll is 1 minute, and the test is more than 75% busy over 30 minutes. Ignoring these fractional minutes, that means it raises an alert if housekeeping runs for more than about 22 minutes (75% of 30 minutes is 22.5).
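                The threshold arithmetic is easy to check. A minimal sketch with a hypothetical function name: for a single process sampled as either 100% or 0% busy, the window average exceeds the threshold only if the process was busy for more than window * threshold / 100 minutes.

```shell
#!/bin/sh
# Hypothetical helper: for ONE process sampled as 100% (busy) or 0% (idle),
# the window average exceeds the threshold only when the busy time exceeds
# window * threshold / 100. Prints that minimum busy time in minutes.
min_busy_minutes() {
    window_min=$1    # trigger evaluation window, in minutes (e.g. 30)
    threshold=$2     # busy threshold, in percent (e.g. 75)
    awk -v w="$window_min" -v t="$threshold" 'BEGIN { printf "%g\n", w * t / 100 }'
}
```

                min_busy_minutes 30 75 prints 22.5, so one housekeeper run lasting longer than that inside the 30-minute window is enough to fire the trigger.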

                So you hit some limit of time, or you clear some items, and in a very NORMAL course of business housekeeping can run more than 22 minutes. Right?

                To me this trigger never made any sense.

                To me, housekeeping is "normal" as long as it finishes, most of the time, within the interval you run it at. Others may have different ideas, especially if they run it on different schedules.

                This thread reminded me that I wanted to adjust this. I'm going to try the following:

                Change the item to poll slowly, say every 1h (because really, every 1 minute?).

                Change the trigger to look at what the process has done over a couple of days; being busy for that whole time most likely means it ran amok:

                {Template App Zabbix Server:zabbix[process,housekeeper,avg,busy].min(48h)}>=99

                (And change it so it doesn't need a recovery expression.) Basically, this gives a warning only if the housekeeper has been running continuously for 48 hours (the 99 is there in case of a rounding issue somewhere).
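                The key difference from the stock average-based test is that with min() a single sample below the threshold keeps the trigger quiet. A rough sketch of that evaluation, assuming hourly samples so that 48h corresponds to the last 48 values (Zabbix itself evaluates min() over a time window, not a sample count); the function name trigger_min_ge is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: read busy% samples (one per line, oldest first)
# on stdin and evaluate "min of the last N samples >= threshold", which is
# roughly what min(48h)>=99 does with one sample per hour (N=48).
trigger_min_ge() {
    n=$1          # number of most recent samples to consider
    threshold=$2  # e.g. 99
    tail -n "$n" | awk -v t="$threshold" '
        NR == 1 || $1 < m { m = $1 }
        END {
            if (NR == 0) exit 2              # no data: nothing to evaluate
            state = (m >= t) ? "PROBLEM" : "OK"
            print state
        }'
}
```

                For instance, a single idle hour (a 0 among the samples) is enough to return OK, which is what keeps normal, finite housekeeping runs from alerting.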

                Good idea? Bad idea? Not sure, but I think the answer for most of us is to adjust so what you consider normal doesn't trigger an alert, otherwise you will just always ignore it, like a car alarm that goes off all the time.

                • doctorbal82
                  Member
                  • Oct 2016
                  • 39

                  #10
                  As scuba mentioned, once you go over 500 NVPS, strongly consider partitioning.

                  There are various methods of performing partitioning in PostgreSQL and some can be found on the Zabbix Forums, Zabbix Share and in the Zabbix Wiki (https://zabbix.org/wiki/Main_Page).

                  I use pg_partman (https://github.com/pgpartman/pg_partman) with great performance improvements and low database resource utilization on PostgreSQL v11 with 1000+ NVPS on Zabbix 4.0.

                  I have written about implementing pg_partman in the following guide here (with a bonus Ansible role to use) - https://github.com/Doctorbal/zabbix-...s-partitioning.

                  • rnalrd
                    Junior Member
                    • Jun 2015
                    • 9

                    #11
                    Partitioning is the definitive answer when you have a large number of hosts. When I was at about 500 NVPS, I was able to avoid partitioning on PostgreSQL by running this script every hour:

                    Code:
                    #!/bin/sh
                    # Purge expired rows from the largest history/trend tables in batches.
                    # flock -n on fd 9 makes this run exit at once if the previous run is still going.
                    ( flock -n 9 || exit 1
                    # Each DELETE removes at most LIMIT rows older than the item's trends/history
                    # retention (e.g. '90d'::interval). ctid (the PostgreSQL row locator) is used
                    # because PostgreSQL's DELETE has no LIMIT clause. The ::interval cast may
                    # fail if an item's retention is a user macro.
                    psql -U postgres zabbix -c 'DELETE FROM trends_uint t WHERE ctid IN ( SELECT t.ctid FROM trends_uint t LEFT JOIN items i ON i.itemid = t.itemid WHERE to_timestamp(t.clock) < (current_date - ((i.trends)::interval)) LIMIT 10000);'
                    psql -U postgres zabbix -c 'DELETE FROM history_str h WHERE ctid IN ( SELECT h.ctid FROM history_str h LEFT JOIN items i ON i.itemid = h.itemid WHERE to_timestamp(h.clock) < (current_date - ((i.history)::interval)) LIMIT 10000);'
                    psql -U postgres zabbix -c 'DELETE FROM history_text h WHERE ctid IN ( SELECT h.ctid FROM history_text h LEFT JOIN items i ON i.itemid = h.itemid WHERE to_timestamp(h.clock) < (current_date - ((i.history)::interval)) LIMIT 100000);'
                    psql -U postgres zabbix -c 'DELETE FROM history h WHERE ctid IN ( SELECT h.ctid FROM history h LEFT JOIN items i ON i.itemid = h.itemid WHERE to_timestamp(h.clock) < (current_date - ((i.history)::interval)) LIMIT 300000);'
                    psql -U postgres zabbix -c 'DELETE FROM history_uint h WHERE ctid IN ( SELECT h.ctid FROM history_uint h LEFT JOIN items i ON i.itemid = h.itemid WHERE to_timestamp(h.clock) < (current_date - ((i.history)::interval)) LIMIT 500000);'
                    ) 9>/var/lock/00-zabbix-history-purge.lock
                    It purges data from the largest tables that has expired according to each item's history (or, for trends_uint, trends) setting.
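                    If you adopt this approach, the five near-identical psql lines can be generated from one table list, which also makes a dry run easy. This is a hypothetical refactor (the names purge_sql, purge_all, PURGE_SPECS and DRY_RUN are illustrative, not from the original script); it issues the same DELETE statements and should still be run inside the same flock wrapper.

```shell
#!/bin/sh
# Hypothetical refactor of the hourly purge script: same DELETEs, one loop.
# Each spec is table:retention_column:batch_limit, as in the original script.
PURGE_SPECS="trends_uint:trends:10000 history_str:history:10000
history_text:history:100000 history:history:300000 history_uint:history:500000"

# Build the DELETE for one table (t = history/trend table, i = items).
purge_sql() {
    table=$1; retention=$2; limit=$3
    printf 'DELETE FROM %s t WHERE ctid IN ( SELECT t.ctid FROM %s t LEFT JOIN items i ON i.itemid = t.itemid WHERE to_timestamp(t.clock) < (current_date - ((i.%s)::interval)) LIMIT %s);\n' \
        "$table" "$table" "$retention" "$limit"
}

# Run (or, with DRY_RUN=1, only print) the DELETE for every table in turn.
purge_all() {
    for spec in $PURGE_SPECS; do
        table=${spec%%:*}; rest=${spec#*:}
        retention=${rest%%:*}; limit=${rest#*:}
        sql=$(purge_sql "$table" "$retention" "$limit")
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "$sql"
        else
            psql -U postgres zabbix -c "$sql" || return 1
        fi
    done
}
```

                    For example, run DRY_RUN=1 purge_all once to review the SQL, then have cron call ( flock -n 9 || exit 1; purge_all ) 9>/var/lock/00-zabbix-history-purge.lock as before.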

                    • SofCave
                      SofCave commented
                      Hi rnalrd,

                      Is this script still relevant for Zabbix 5?
                      Last edited by SofCave; 08-04-2022, 14:47.

                    • jhboricua
                      jhboricua commented
                      rnalrd, I assume you then have the housekeeper tasks disabled in Zabbix? My database is in an AWS Aurora PostgreSQL instance, so I don't know whether partitioning with pg_partman is going to be an option.

                    • rnalrd
                      rnalrd commented
                      jhboricua, yes, the housekeeper is disabled. I'm going to migrate to TimescaleDB soon, so I'll be dropping this script.
                  • rnalrd
                    Junior Member
                    • Jun 2015
                    • 9

                    #12
                    Yes, SofCave, the tables haven't changed, even in version 6.

                    • guille.rodriguez
                      Senior Member
                      • Jun 2022
                      • 114

                      #13
                      Originally posted by rnalrd
                      Do you have the same for MariaDB/MySQL? There's no ctid field there (Zabbix 6.2). Maybe ctid == itemid?

                      Edit:

                      I made this MariaDB function to translate Zabbix time intervals with unit suffixes (such as 90d) to seconds:

                      Code:
                      DROP FUNCTION IF EXISTS zbx_time_format_to_sec;

                      DELIMITER $$
                      # Convert a Zabbix time value ("90", "30s", "5m", "1h", "7d", "2w") to seconds.
                      # https://www.zabbix.com/documentation/current/en/manual/appendix/suffixes
                      # Anything else (e.g. a user macro such as {$HISTORY}) yields NULL, so any
                      # comparison against it is NULL and the row is not matched.
                      # Declared DETERMINISTIC so it can be created when binary logging is enabled
                      # without setting log_bin_trust_function_creators.
                      CREATE FUNCTION zbx_time_format_to_sec(zbxt VARCHAR(255))
                      RETURNS BIGINT
                      DETERMINISTIC
                      BEGIN
                          # Plain number: already in seconds
                          IF zbxt REGEXP '^[0-9]+$' THEN
                              RETURN CAST(zbxt AS UNSIGNED);
                          END IF;

                          # Number followed by a single time suffix
                          IF zbxt REGEXP '^[0-9]+[smhdw]$' THEN
                              RETURN CAST(LEFT(zbxt, CHAR_LENGTH(zbxt) - 1) AS UNSIGNED) *
                                  CASE RIGHT(zbxt, 1)
                                      WHEN 's' THEN 1
                                      WHEN 'm' THEN 60
                                      WHEN 'h' THEN 3600
                                      WHEN 'd' THEN 86400
                                      WHEN 'w' THEN 604800
                                  END;
                          END IF;

                          RETURN NULL;
                      END$$
                      DELIMITER ;


                      # Test translate
                      select itemid, delay, zbx_time_format_to_sec(delay) from items;
                      Result

                      Code:
                      MariaDB [zabbix]> select unique  delay, zbx_time_format_to_sec(delay) from items where delay != 0;
                      +-------+-------------------------------+
                      | delay | zbx_time_format_to_sec(delay) |
                      +-------+-------------------------------+
                      | 1m    | 60                            |
                      | 1h    | 3600                          |
                      | 10m   | 600                           |
                      | 10s   | 10                            |
                      | 3m    | 180                           |
                      | 5m    | 300                           |
                      | 15m   | 900                           |
                      | 30s   | 30                            |
                      | 1d    | 86400                         |
                      | 30m   | 1800                          |
                      | 12h   | 43200                         |
                      | 3600  | 3600                          |
                      | 300   | 300                           |
                      | 8h    | 28800                         |
                      | 4h    | 14400                         |
                      | 600   | 600                           |
                      | 20m   | 1200                          |
                      +-------+-------------------------------+
                      17 rows in set, 6968 warnings (0.220 sec)
                      Now it's time to translate the PostgreSQL queries to MariaDB. Could this be the translation?

                      Code:
                      # Delete only each item's expired rows, at most LIMIT rows per statement.
                      # Filtering with "itemid IN (...)" alone would delete ALL rows of any item
                      # that has at least one expired row. Rows whose item no longer exists, or
                      # whose retention cannot be parsed, get NULL and are kept.
                      DELETE FROM trends_uint WHERE clock < UNIX_TIMESTAMP() - (SELECT zbx_time_format_to_sec(i.trends) FROM items i WHERE i.itemid = trends_uint.itemid) LIMIT 10000;
                      DELETE FROM history_str WHERE clock < UNIX_TIMESTAMP() - (SELECT zbx_time_format_to_sec(i.history) FROM items i WHERE i.itemid = history_str.itemid) LIMIT 10000;
                      DELETE FROM history_text WHERE clock < UNIX_TIMESTAMP() - (SELECT zbx_time_format_to_sec(i.history) FROM items i WHERE i.itemid = history_text.itemid) LIMIT 100000;
                      DELETE FROM history WHERE clock < UNIX_TIMESTAMP() - (SELECT zbx_time_format_to_sec(i.history) FROM items i WHERE i.itemid = history.itemid) LIMIT 300000;
                      DELETE FROM history_uint WHERE clock < UNIX_TIMESTAMP() - (SELECT zbx_time_format_to_sec(i.history) FROM items i WHERE i.itemid = history_uint.itemid) LIMIT 500000;


                      Last edited by guille.rodriguez; 15-10-2022, 21:03.

                    • guille.rodriguez
                      Senior Member
                      • Jun 2022
                      • 114

                      #14
                      jhboricua, another question: do you normally partition the database on Zabbix proxies too?

                      • jhboricua
                        Senior Member
                        • Dec 2021
                        • 113

                        #15
                        guille.rodriguez, we don't have proxies in our deployment... yet.
