odd line in housekeeper.c?

  • Tom Hutter
    Junior Member
    • Feb 2012
    • 7

    #1

    odd line in housekeeper.c?

    Hi everybody,

    this is my first post, so please be kind to me :-) I promise I have searched the forum for similar topics and haven't found one.

    I am also experiencing heavy load on my server due to the Zabbix housekeeper. I have read a lot of advice along the lines of "disable the housekeeper entirely" or "partition your tables", but first I wanted to know whether there is any way to improve the housekeeper's algorithm. I am using Debian squeeze, so I took the source from the Debian package. While inspecting the code I stumbled over the following lines in housekeeper.c:

    Code:
    274         result = DBselect("select min(clock) from %s where itemid=" ZBX_FS_UI64, table, itemid);
    275
    276         if (NULL == (row = DBfetch(result)) || SUCCEED == DBis_null(row[0]))
    277         {
    278                 DBfree_result(result);
    279                 return 0;
    280         }
    281
    282         min_clock = atoi(row[0]);
    and

    Code:
    283         min_clock = MIN(now - keep_history * SEC_PER_DAY, min_clock + 4 * CONFIG_HOUSEKEEPING_FREQUENCY * SEC_PER_HOUR);
    284         DBfree_result(result);
    285
    286         deleted = DBexecute("delete from %s where itemid=" ZBX_FS_UI64 " and clock<%d", table, itemid, min_clock);
    The first lines are clear: fetch the oldest available clock value for the itemid. But what is the statement in line 283 for? My only guess is that every item should keep its values for at least four housekeeper cycles, even if keep_history is set to 0.

    But what does the statement do?

    Case 1:

    We create an item with default values (history 90 days). The housekeeper runs every hour. The item values start to drop in (let's say one every minute).

    Let's say the first housekeeper run is after half an hour. We will get min_clock = "1/2 hour ago" (the time at which the first value arrived). So our statement in line 283 evaluates to:

    min_clock = MIN(now - 90 days , "1/2 hour ago" + 4 hours);

    => "now - 90 days" wins: Nothing will be deleted, as we have no values < now - 90 days.

    Case 2:

    We have now monitored our item for quite a while and reach the magic day 90. Let's assume our first value is 90 days and half an hour old when the housekeeper runs. What does the statement in line 283 evaluate to?

    min_clock = MIN(now - 90 days , now - 90 days - 1/2 hour + 4 hours);

    => "now - 90 days" wins: All values older than 90 days will be deleted.

    So far so good!

    Case 3:

    Let's say: I don't need history of the last 90 days, 7 days is enough. I set history in configuration to 7 days. What happens? Our oldest value is 90 days old but we only want to keep history for the last 7 days from now on. What does our statement evaluate to:

    min_clock = MIN(now - 7 days , now - 90 days + 4 hours);

    => "now - 90 days + 4 hours" wins: Only the values older than "now - 90 days + 4 hours" will be deleted, i.e. just the oldest four hours of data. Or am I completely wrong?

    Case 4:

    I want to get rid of history completely and set history to 0. It doesn't matter. It's the same behaviour as in Case 3.

    Case 5:

    I create an item and set its history to 0 from the beginning. What does our statement do? Once again, we assume the housekeeper runs half an hour after we got our first value and the housekeeping cycle is one hour.

    min_clock = MIN(now - 0 days , now - 1/2 hour + 4 hours);

    => "now - 0 days" is smaller than 3 1/2 hours in the future, so "now" wins: All values older than now will be deleted, which means ALL values will be deleted. Am I wrong again?
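The five cases can be checked mechanically. Below is a minimal C sketch, not Zabbix code: `line283_cutoff` is a hypothetical helper that reproduces only the MIN() expression from line 283, with the macros redefined locally and the housekeeping frequency passed in as a parameter instead of being read from CONFIG_HOUSEKEEPING_FREQUENCY.

```c
#include <assert.h>

#define SEC_PER_HOUR 3600L
#define SEC_PER_DAY  86400L
#define MIN(a, b)    ((a) < (b) ? (a) : (b))

/*
 * Hypothetical helper mirroring the expression on line 283.
 * Rows with clock < (return value) would be deleted by line 286.
 *   now               - current epoch time in seconds
 *   oldest_clock      - result of "select min(clock) ..." for the item
 *   keep_history_days - the item's history setting in days
 *   freq_hours        - housekeeping frequency in hours (assumed > 0)
 */
long line283_cutoff(long now, long oldest_clock, int keep_history_days, int freq_hours)
{
    assert(freq_hours > 0);

    return MIN(now - keep_history_days * SEC_PER_DAY,
               oldest_clock + 4 * freq_hours * SEC_PER_HOUR);
}
```

With an hourly housekeeper and an oldest value 90 days back, `line283_cutoff(now, now - 90 * SEC_PER_DAY, 7, 1)` gives `now - 90 days + 4 hours` (Case 3), while an oldest value half an hour old with a history of 0 gives `now` (Case 5).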

    Patch:

    What about: I don't care how old the values in the database are. All item values older than history should be deleted. No matter what history is set to, I want to keep at least the values for 4 housekeeper cycles so that my triggers have something to evaluate. Replace lines 274 - 283 with:

    Code:
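    /* Never delete values newer than four housekeeper cycles, whatever keep_history is set to. */
    if (keep_history * SEC_PER_DAY < 4 * CONFIG_HOUSEKEEPING_FREQUENCY * SEC_PER_HOUR) {
        min_clock = now - 4 * CONFIG_HOUSEKEEPING_FREQUENCY * SEC_PER_HOUR;
    } else {
        /* Otherwise delete everything older than the configured history period. */
        min_clock = now - keep_history * SEC_PER_DAY;
    }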
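For comparison, the proposed replacement depends only on the current time, not on the oldest stored value. The following is a sketch under assumptions: `patched_cutoff` is a hypothetical name, the macros are redefined locally, and the housekeeping frequency is a parameter rather than CONFIG_HOUSEKEEPING_FREQUENCY.

```c
#include <assert.h>

#define SEC_PER_HOUR 3600L
#define SEC_PER_DAY  86400L

/*
 * Hypothetical helper for the proposed patch: delete everything older than
 * the history period, but always keep at least four housekeeper cycles.
 * Rows with clock < (return value) would be deleted.
 */
long patched_cutoff(long now, int keep_history_days, int freq_hours)
{
    long keep_sec = keep_history_days * SEC_PER_DAY;
    long floor_sec = 4 * freq_hours * SEC_PER_HOUR;

    assert(freq_hours > 0);

    /* use whichever retention window is longer */
    return now - (keep_sec < floor_sec ? floor_sec : keep_sec);
}
```

With an hourly housekeeper, a history of 7 days yields `now - 7 days` no matter how old the oldest row is, and a history of 0 yields `now - 4 hours`. That is exactly why the first run after a history reduction removes the whole backlog in a single delete.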
    Please enlighten me :-)
    Last edited by Tom Hutter; 06-02-2012, 10:40.
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    The housekeeper does everything in a very correct way. There is no way to remove even one month of history in one go from the database without affecting the performance of Zabbix. Your algorithm would run for a few days and the historical tables would be locked most of the time.
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga


    • Tom Hutter
      Junior Member
      • Feb 2012
      • 7

      #3
      Hi Alexei,

      Phew, so you're saying that the cases I mentioned above are all wrong, that line 283 does exactly what it is expected to do, and that my patch is total nonsense. And all this within three sentences, after I needed almost a page to explain myself. I am impressed :-)

      See, I don't want to blame anybody. Let's say I am trying to find a possible error, or at least to understand why things happen as they do. I thought this forum was exactly for issues like this: asking, explaining, understanding, improving. If you tell me I am wrong, I would at least appreciate it if you told me why.

      Would you be so kind as to tell me whether my assumptions are correct? Have I understood the statement in line 283 correctly? Do you want to keep the values of every item for at least 4 housekeeping cycles, so that the triggers have some values to evaluate? Or what else does it do?

      I still claim:

      The housekeeper doesn't delete historical data if you reduce the amount of history to keep in the configuration.

      You wrote: "There is no way to remove even one month of history in one go from the database without affecting performance of Zabbix." Does this mean that even if I reduce an item's history in the configuration, the housekeeper will not drop all obsolete historical data of this item, for performance reasons?

      If this is the final statement - okay. Then we can stop at this point. Then I really have misunderstood what the housekeeper is for.

      If you agree, that the housekeeper should drop all historical data which is older than the configured keep_history, let's try this:

      I would like to ask you to reproduce the following steps on one of your systems. Do you have a Zabbix server with several items whose history value has been reduced from its original setting? If not, you could reduce the history of one or two items to a smaller value (e.g. 7 days) and wait for one housekeeping cycle. After the cycle there shouldn't be any historical data older than 7 days for these items. Am I right? Or am I already on the wrong path?

      I have a lot of items for which I reduced the history from 90 days (the default) to 7 days some months ago. If I now run a statement on my database like:

      Code:
      SELECT EXTRACT(EPOCH FROM TIMESTAMP WITH TIME ZONE 'now' ) - 8 * 86400;
      it returns the epoch time of eight days ago (currently around 1327943462). Do you agree that the statement:

      Code:
      select count(*) from history where itemid in ( select itemid from items where history = 7 ) and clock < 1327943462;
      should return a count of 0? There should be no 8-day-old history values for any item which has a keep_history of 7.

      So tell me why, before the patch, I get something like:

      Code:
      zabbix=# select count(*) from history where itemid in ( select itemid from items where history = 7 ) and clock < 1327857072;
        count   
      ----------
       26474496
      (1 row)
      If you agree there should be a 0 instead of any number, you may want to have another look at line 283 and the cases I mentioned in my posting :-)

      And yes, you are right:
      Actually, I ran my patch on a test system. The system took a very heavy initial hit, and the whole Zabbix database shrank to half its size. To be honest, that's what I expected, because some months ago I had reduced the history of many items from the default (90 days) to 7 days to save disk space. But from my point of view the housekeeper didn't remove the obsolete data at that time, because there is an error in line 283 of housekeeper.c. With my patch the history table alone shrank from about 3.8 GByte to about 2.2 GByte.

      Afterwards I got this:
      Code:
      zabbix=# select count(*) from history where itemid in ( select itemid from items where history = 7 ) and clock < 1327857072;
       count 
      -------
           0
      (1 row)


      I would really appreciate it if you found the time to reproduce this in one of your environments before answering me in three sentences ;-)

      cheers

      Tom


      • dimir
        Zabbix developer
        • Apr 2011
        • 1080

        #4
        I wonder why can't we just use MaxHousekeeperDelete in here?


        • Alexei
          Founder, CEO
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Sep 2004
          • 5654

          #5
          I didn't say you are wrong, not at all. All your cases are absolutely correct.

          What I am saying is that an attempt to delete all outdated information immediately would fail because of the huge impact on performance. Try to remove 10M records from a 1000M-record table to see what I mean. Your "delete from ..." statement would run for hours, blocking the whole table.

          That's exactly why Zabbix does it in a more intelligent way, i.e. it removes the older data gradually: instead of removing the data in one go, it spreads the removal process over days, weeks or months. In any case, the algorithm guarantees that the data will eventually be removed.

          What you are suggesting reminds me of the very first implementation of the Zabbix housekeeper, which is OK for smaller setups but does not work well for larger data sets.
          Alexei Vladishev
          Creator of Zabbix, Product manager
          New York | Tokyo | Riga


          • Tom Hutter
            Junior Member
            • Feb 2012
            • 7

            #6
            Hey Alexei,

            thanks for your reply. Now that you have explained it, I have a feeling for what the statement in line 283 is meant to do:

            You look for the oldest clock value of an item. If the oldest clock is older than keep_history, you remove at most 4 housekeeping cycles' worth of data. As you remove 4 cycles' worth in every cycle, the backlog should shrink over time.

            I agree. I have a small installation and my patch kept the server busy for almost 3 hours. I am still wondering why I had 26 million rows of old data before I removed everything older than keep_history in one go.
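How long the gradual approach takes can be estimated with a small simulation. This is a rough sketch under stated assumptions, not a measurement: `runs_until_caught_up` is a hypothetical helper, the item is assumed to have values at every timestamp (so after each run the oldest remaining value sits exactly at the previous cutoff), and the housekeeper is assumed to start exactly once per cycle.

```c
#include <assert.h>

#define SEC_PER_HOUR 3600L
#define SEC_PER_DAY  86400L

/*
 * Hypothetical helper: counts housekeeper runs until the retention bound
 * (now - keep_history) becomes the effective cutoff, i.e. until the backlog
 * left over from a history reduction is gone. Each run moves the cutoff
 * forward by at most 4 cycles from the oldest value, as on line 283, while
 * the clock itself advances by one cycle between runs.
 */
long runs_until_caught_up(long now, long oldest_clock, int keep_history_days, int freq_hours)
{
    long runs = 0;

    assert(freq_hours > 0);

    for (;;) {
        long by_age = now - keep_history_days * SEC_PER_DAY;
        long by_step = oldest_clock + 4 * freq_hours * SEC_PER_HOUR;

        runs++;

        if (by_step >= by_age)
            return runs;                    /* the retention bound wins: caught up */

        oldest_clock = by_step;             /* dense data assumed */
        now += freq_hours * SEC_PER_HOUR;   /* next housekeeper run */
    }
}
```

For an item reduced from 90 to 7 days with an hourly housekeeper, `runs_until_caught_up(now, now - 90 * SEC_PER_DAY, 7, 1)` returns 664: the backlog shrinks by a net three hours per run, so clearing 83 days of it takes roughly four weeks. Sparse data or skipped runs would change that figure.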


            • richlv
              Senior Member
              Zabbix Certified Trainer
              Zabbix Certified Specialist
              Zabbix Certified Professional
              • Oct 2005
              • 3112

              #7
              but how does that relate to MaxHousekeeperDelete ? don't they do the same thing, functionally ?
              Zabbix 3.0 Network Monitoring book


              • Tom Hutter
                Junior Member
                • Feb 2012
                • 7

                #8
                Good question. I haven't seen any part of the code in housekeeper.c (version 1.8) where a constant or variable is evaluated in order to stop deleting once a limit is reached.

                The only two variables which are evaluated are:

                keep_history

                and

                CONFIG_HOUSEKEEPING_FREQUENCY


                • richlv
                  Senior Member
                  Zabbix Certified Trainer
                  Zabbix Certified Specialist
                  Zabbix Certified Professional
                  • Oct 2005
                  • 3112

                  #9
                  Originally posted by Tom Hutter
                  Good question. I haven't seen any part of the code in housekeeper.c (version 1.8) where a constant or variable is evaluated in order to stop deleting once a limit is reached.

                  The only two variables which are evaluated are:

                  keep_history

                  and

                  CONFIG_HOUSEKEEPING_FREQUENCY
                  see the part which checks CONFIG_MAX_HOUSEKEEPER_DELETE and if it's non-zero, does database-specific magic (starts at line 70 in 1.8 branch svn head)


                  • Tom Hutter
                    Junior Member
                    • Feb 2012
                    • 7

                    #10
                    Okay, so CONFIG_MAX_HOUSEKEEPER_DELETE is only related to the deletes in the housekeeper table in function housekeeping_cleanup(). history and trend tables are not affected by CONFIG_MAX_HOUSEKEEPER_DELETE. They are deleted in function housekeeping_history_and_trends() which runs before housekeeping_cleanup().
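To illustrate what a per-run cap on the history deletes could look like, here is a hypothetical sketch. It is not how Zabbix implements MaxHousekeeperDelete: `format_capped_delete`, the two-entry table whitelist, and the trailing `limit` clause (which MySQL accepts on DELETE but PostgreSQL does not) are all assumptions made for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical whitelist: the table name is pasted into SQL, so accept known names only. */
static int is_history_table(const char *table)
{
    return 0 == strcmp(table, "history") || 0 == strcmp(table, "trends");
}

/*
 * Builds the delete statement from line 286 with an optional row cap.
 * max_rows <= 0 means "no cap". Returns the snprintf() result, or -1 for
 * an unknown table. The "limit" clause is MySQL-specific.
 */
int format_capped_delete(char *buf, size_t size, const char *table,
                         uint64_t itemid, long cutoff, int max_rows)
{
    assert(NULL != buf && NULL != table);

    if (!is_history_table(table))
        return -1;

    if (max_rows > 0)
        return snprintf(buf, size,
                "delete from %s where itemid=%" PRIu64 " and clock<%ld limit %d",
                table, itemid, cutoff, max_rows);

    return snprintf(buf, size,
            "delete from %s where itemid=%" PRIu64 " and clock<%ld",
            table, itemid, cutoff);
}
```

A cap like this bounds the work per statement rather than the time range, so it could be combined with either cutoff calculation discussed earlier in the thread.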


                    • richlv
                      Senior Member
                      Zabbix Certified Trainer
                      Zabbix Certified Specialist
                      Zabbix Certified Professional
                      • Oct 2005
                      • 3112

                      #11
                      Originally posted by Tom Hutter
                      Okay, so CONFIG_MAX_HOUSEKEEPER_DELETE is only related to the deletes in the housekeeper table in function housekeeping_cleanup(). history and trend tables are not affected by CONFIG_MAX_HOUSEKEEPER_DELETE. They are deleted in function housekeeping_history_and_trends() which runs before housekeeping_cleanup().
                      argh. good catch. i guess unifying all that might be a good idea...


                      • richlv
                        Senior Member
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist
                        Zabbix Certified Professional
                        • Oct 2005
                        • 3112

                        #12
                        issue about this at https://support.zabbix.com/browse/ZBX-5887


                        • dougbee
                          Member
                          • Apr 2011
                          • 68

                          #13
                          I seem to be getting bursts of Zabbix Agent Availability triggers (currently set to agent.ping.nodata(300)}=1) that are timed with the housekeeper executions:

                          1437:20121206:083732.100 executing housekeeper
                          1437:20121206:084716.012 housekeeper deleted: 648552 records from history and trends, 141000 records of deleted items, 0 events, 0 alerts, 0 sessions
                          1437:20121206:094716.322 executing housekeeper
                          1437:20121206:095637.446 housekeeper deleted: 613913 records from history and trends, 141000 records of deleted items, 0 events, 0 alerts, 0 sessions
                          And then the agent queue > 10m spikes. Is the housekeeper the likely cause, and would the patch resolve the issue? Interestingly the 141000 deleted items is repeated consistently.

                          MaxHouseKeeperDelete is set to 500, DB is MySQL.


                          • tchjts1
                            Senior Member
                            • May 2008
                            • 1605

                            #14
                            Originally posted by dougbee
                            And then the agent queue > 10m spikes. Is the housekeeper the likely cause, and would the patch resolve the issue? Interestingly the 141000 deleted items is repeated consistently.

                            MaxHouseKeeperDelete is set to 500, DB is MySQL.
                            I have similar symptoms in 2.0.3 with MySQL. I skipped DB partitioning because MySQL does not support foreign keys on partitioned tables, but I am using InnoDB with file-per-table.

                            I have housekeeper at the default of running every hour. Each run lasts around 10 - 15 minutes, during which I get the trigger of IO overload on the DB server. DB resides on the SAN.

                            If you are trying to navigate screens/graphs during this time, it is generally a 30 - 60 second wait for the graph to redraw, which is horrible.

                            DB is on a VM with 4vCPU and 8GB memory.
