Looking for suggestions on making triggers a little smarter.

  • jhboricua
    Senior Member
    • Dec 2021
    • 113

    #1


Trying to understand how to tackle different scenarios and reduce trigger noise. Take this example:

    This is a database server that gets some heavy usage on its disk at 10 AM every day. Let's say this is a backup activity or some batch job doing lots of reads. This triggers the associated disk response time trigger daily during this period. Eventually people will simply ignore it. Modifying the triggers with time exclusions or their evaluation periods doesn't scale too well. On this server the offending activity is at 10 AM and takes 30-40 minutes. On another server, it could happen at a different time and take more or less time.

    But I do have the collected history, so how can I leverage it so that Zabbix says, 'hey, I'm seeing the disk is really busy but that's normal for this time period so I'm not going to alert'?
  • ISiroshtan
    Senior Member
    • Nov 2019
    • 324

    #2
I personally would go with:
- slap a unique tag on the triggers you're fighting with
- create maintenance window(s) (one per host, or in groups based on when you expect the jobs to run)
- with data collection
- active for 1-2 years
- recurring period (daily/weekly/monthly, based on need)
- time matching the known jobs, plus 10 min on top of that
- tag match so only the specific triggers are under maintenance

Should cover your need, I think.
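For what it's worth, those steps map onto Zabbix roughly like this (a sketch only; the tag name, dates and times are placeholders for your own jobs):

Code:
Trigger tag:        maint: batch-job
Maintenance:        "Daily batch jobs", with data collection, active for ~2 years
Period:             recurring daily, start 10:00, duration 50m
Problem tag match:  maint Equals batch-job

With a "with data collection" maintenance the items keep collecting and the problems still happen; matching problems are just suppressed during the window.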


    • markfree
      Senior Member
      • Apr 2019
      • 868

      #3
      You may be looking for a less sensitive trigger function.
      Trend and baseline functions can help you with that.

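For instance, a less sensitive disk trigger could compare the last full hour's trend against the same hour a week before, something along these lines (a sketch only; /host/key and the factor of 2 are placeholders to tune):

Code:
trendavg(/host/key,1h:now/h) > 2 * trendavg(/host/key,1h:now/h-1w)

Since trends are written hourly this only reacts once the hour has closed, but it is far less noisy than raw-value thresholds.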
      Last edited by markfree; 02-01-2025, 03:39.


      • jhboricua
        Senior Member
        • Dec 2021
        • 113

        #4
        Originally posted by ISiroshtan
I personally would go with:
- slap a unique tag on the triggers you're fighting with
- create maintenance window(s) (one per host, or in groups based on when you expect the jobs to run)
- with data collection
- active for 1-2 years
- recurring period (daily/weekly/monthly, based on need)
- time matching the known jobs, plus 10 min on top of that
- tag match so only the specific triggers are under maintenance

Should cover your need, I think.
        This becomes unmanageable very quickly and doesn't scale when there are hundreds of servers.


markfree I've been looking at the trend and baseline functions but have yet to see how they can be used effectively for this. Take the baseline functions, for example, from the documentation:

baselinedev(/host/key,data period:time shift,season unit,num seasons):
baselinedev(/host/key,1h:now/h,"d",10) #calculating the number of standard deviations (population) between the previous hour and the same hours over the period of ten days before yesterday

baselinewma(/host/key,data period:time shift,season unit,num seasons):
baselinewma(/host/key,1h:now/h,"d",3) #calculating the baseline based on the last full hour within a 3-day period that ended yesterday. If "now" is Monday 13:30, the data for 12:00-12:59 on Friday, Saturday, and Sunday will be analyzed

What I gather from this is that the baseline and trend functions are always looking at previous data to perform their calculations. Which makes sense, because trends data is written hourly, and trends for the current hour of activity are unavailable for the functions to use. Hence, if my activity spike always happens between 10 and 11 PM, the function above is going to look at the period between 9 and 10 PM. And that's a totally different activity profile. So I'm not sure how I can utilize these functions for what I'm trying to achieve, which is for Zabbix to not alert if the activity spike is normal for that time period, given the data for the same time period during the last week(s) or month. It doesn't seem these functions are aimed at that. Even watching some of the videos from Zabbix on the subject you could tell they were struggling a little bit to explain them, lol.

Or maybe I'm just misunderstanding the use of the baseline/trend functions, which is why I'm asking here for smarter people to point me in the right direction.


        • Brambo
          Senior Member
          • Jul 2023
          • 245

          #5
I think the trendavg function combined with a duration in one trigger should cover most of your needs.
Create the trigger with macros for the trend time and duration, so that the template defaults match the expected scenarios, but then you can make it a host macro when a specific host needs its own "special setting".
In other words (example; macro names are placeholders):
trendavg(/host/key,1h:now/h) > trendavg(/host/key,6h:now/h-6h) + {$EXPECTED_INCREASE}
and
last(/host/key) > {$MIN_VALUE}
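Folded into a single expression, that idea would read roughly like this (still a sketch; /host/key and the macro names are placeholders you would define on the template or host):

Code:
last(/host/key) > {$MIN_VALUE} and trendavg(/host/key,1h:now/h) > trendavg(/host/key,6h:now/h-6h) + {$EXPECTED_INCREASE}

The last() guard keeps the trigger quiet when the absolute value is too small to matter, however far above its trend it is.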


          • jhboricua
            Senior Member
            • Dec 2021
            • 113

            #6
Looking at baselinewma again and it just hit me: is the solution as simple as using a time shift for the 'current hour period' instead of the last hour? Unlike baselinedev, it seems that baselinewma is not using that last-hour value in the baseline calculation/analysis. If I'm reading it correctly, it is simply using the time shift as a reference for which values to use in the calculation, based on the season unit and num seasons parameters. So if I set up the baselinewma function as:

"baselinewma(/host/key,1h:now/h+1h,"d",7)" or "calculate the baseline based on the current hour within a 7-day period that ended yesterday. If "now" is Monday 13:30, the data for 13:00-13:59 for the previous 7 days will be analyzed".

            I could then use that in a trigger to compare the current value of say, cpu utilization, against the calculated baseline for that same 1h time period on the last X amount of days and define my threshold in which to trigger at. For example:

            Code:
min(/host/system.cpu.util,15m)>90 and baselinewma(/host/system.cpu.util,1h:now/h+1h,"d",7)*2 < avg(/host/system.cpu.util,15m)
            Trigger if the min value of CPU utilization exceeds 90% over a 15-minute period and the 15m average CPU utilization is 2 times higher than the calculated baseline for the current 1h period of the last 7 days.

            Thoughts?​


            • markfree
              Senior Member
              • Apr 2019
              • 868

              #7
You know... I see very few people actually interested in improving their metrics. Cheers mate.

              Your scenario is actually very common, but can lead to some misinterpretations.
              Keep in mind that when you use "baselinewma" like that you are actually evaluating the current hour, not the last 60m. So, the beginning of each hour will have less data to compare with previous seasons and this can lead to false triggers.

When combining functions, it is best to approximate the data/time scales in them. Something like "avg(1h) > baselinewma(1h) * 2".

              You could try other ways to infer abnormal behavior for your data. There are other more statistical ways to measure the data behavior, but baseline seems more straightforward to me.

              Still, you might want to try something a little simpler.​ How about this?
              Code:
avg(//key,1h) / avg(//key,1h:now-1d) > 1.5
              Last edited by markfree; 18-01-2025, 04:10.


              • jhboricua
                Senior Member
                • Dec 2021
                • 113

                #8
                Originally posted by markfree
                Keep in mind that when you use "baselinewma" like that you are actually evaluating the current hour, not the last 60m. So, the beginning of each hour will have less data to compare with previous seasons and this can lead to false triggers.
                Code:
avg(//key,1h) / avg(//key,1h:now-1d) > 1.5
                Is it actually evaluating the current hour values (or last hour if I were to use 1h:now/h) as part of the baselinewma calculation though? The language and examples in the documentation on baselinewma state:

                Code:
baselinewma(/host/key,1h:now/h,"d",3) #calculating the baseline based on the last full hour within a 3-day period that ended yesterday. If "now" is Monday 13:30, the data for 12:00-12:59 on Friday, Saturday, and Sunday will be analyzed
It says "calculating the baseline based on the last full hour within a 3-day period that ended yesterday". It doesn't seem to me that it is actually using the last-hour values in that calculation; it is simply using the time shift to know which past periods/seasons to evaluate.

                This is different from the language for baselinedev:

                Code:
baselinedev(/host/key,1h:now/h,"d",10) #calculating the number of standard deviations (population) between the previous hour and the same hours over the period of ten days before yesterday
where it clearly states that it is calculating the number of std deviations between the previous hour and the same hours...



                • markfree
                  markfree commented
                  Editing a comment
                  Sorry about that. I meant a time shift like "1h:now/h+1h", which is the current hour.
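Putting #6 and #7 together, the final shape of the trigger would then be roughly (a sketch; the item key and the factor of 2 are placeholders to tune):

Code:
avg(/host/system.cpu.util,1h) > baselinewma(/host/system.cpu.util,1h:now/h+1h,"d",7) * 2

i.e. compare a full hour of current data against twice the weighted baseline of the same hour over the previous 7 days, keeping the data/time scales matched as suggested in #7.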