Web scenario trigger - 2 checks before alert

  • stevenh1901
    Junior Member
    • Jul 2019
    • 4

    #1

    Web scenario trigger - 2 checks before alert

    I have a Web Scenario set up to check the response code of a site (200 is required). That works perfectly, but I can't figure out how to get the trigger to alert only if both of the last 2 checks failed.

    Trigger
    Expression: {Template Web Status:web.test.fail[HTTP code response].count(#2,200)}>1

    Web Scenario
    Update interval: 1m
    Attempts: 2
    Step 1 of Scenario
    URL: {HOST.DNS}
    Follow redirects: true
    Retrieve mode: body
    Timeout: 15s
    Required status codes: 200

    Server Info
    OS: Ubuntu 18.04
    Zabbix: 5.0.2
  • visiandy
    Junior Member
    • Jul 2020
    • 8

    #2
    Hey!
    In my opinion you're using the wrong expression and/or checking the wrong thing. As far as I can see, you're checking the failed step of the scenario "HTTP code response" (i.e. "web.test.fail"), which returns the number of the failed step, not a status code. You didn't show your full scenario, so I assume it contains only one step. I can't say I quite understand how your expression would work with that. If your check fails at step 1, the item "web.test.fail" will return "1". If it's working as intended, you'll get "0" for that item (check "Latest data"). Either way the value is never "200", so counting values equal to 200 can't match.
    So I think you'll need a different item, probably "web.test.rspcode". You'd then check how many of the last few returned values are not "200".
    It should (probably) look like that:
    Code:
    {Template Web Status:web.test.rspcode[HTTP code response,NAME_OF_YOUR_STEP_HERE].count(#2,200,ne)}>1
    I'll try to explain what's going on here (I may be wrong): count(#2,200,ne) looks at the last 2 returned values and counts those not equal (ne) to "200", which we treat as an error condition. The trigger fires when that count is more than 1, i.e. when both of the last two checks returned something other than "200".
    At least that's what I think. Check the manual here (especially the function "count()"): https://www.zabbix.com/documentation...gers/functions
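
To make the count(#2,200,ne)>1 logic above concrete, here is a minimal Python sketch that mimics it on a plain list of response codes. It is only a model of the semantics as described in this post, not Zabbix code; the names count_last and should_alert are made up for illustration.

```python
def count_last(values, n, pattern, op="eq"):
    """Count how many of the last n values are equal ("eq") or not equal ("ne") to pattern."""
    if n <= 0:
        raise ValueError("n must be positive")
    window = values[-n:]  # the last n collected values (fewer if the history is short)
    if op == "eq":
        return sum(1 for v in window if v == pattern)
    if op == "ne":
        return sum(1 for v in window if v != pattern)
    raise ValueError(f"unsupported operator: {op}")


def should_alert(codes):
    """Model of count(#2,200,ne)>1: both of the last two codes differ from 200."""
    return count_last(codes, 2, 200, "ne") > 1


print(should_alert([200, 500, 200]))  # a single blip followed by a good check
print(should_alert([200, 500, 503]))  # two failed checks in a row
```

A single non-200 value between good checks leaves the count at 1, so only two consecutive bad checks push it above the threshold.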

    • stevenh1901
      Junior Member
      • Jul 2019
      • 4

      #3
      Hi Visiandy!

      Thank you! You are correct, there is only one step in my web scenario. HTTP Response Code is the name of the step.

      What you said makes sense! I will give it a try and see if it works. Unfortunately it's hard to test. Typically what happens is that there's a small blip somewhere between the web server and the monitoring server, so I get an alert saying the site is down.

      • visiandy
        Junior Member
        • Jul 2020
        • 8

        #4
        No need to wait for that - you could make up some nonexistent page on the site you're checking, for example {HOST.DNS}/aintnostuffhere. That should return an HTTP error (typically 404 for a missing page), which is not the 200 you need. So after some time (2m) your trigger should fire.
        By the way, I don't know your environment, but I have some web checks too (checking whether the API of a web-based containerized app is responding). I have alerts configured to report to our MS Teams group (team). Before that I had them configured for Rocket.Chat/Telegram. It's much easier now.
        So what I'm saying is that you probably don't need such a high check frequency (1m). Also take into account that if the check has intermittent trouble reaching the destination (unstable or bad network, equipment, etc.) you will get a lot of alerts (I'd call it flapping). You could raise the number of attempts, but that would not help if, for example, the Zabbix server has a recurring problem reaching that page for seconds at a time.
        What I did was use an escalation to postpone sending alerts. The problem still appears in the problem view, but if it's just a "glitch" I won't be notified; only if it persists for 10m do I get a notification in Teams.
        I think I already mentioned this, but web checks are initiated from the Zabbix server, so you could get a false positive if the server can't reach a page while the page itself is online. For example, suppose you're checking your site on the Internet and Zabbix gets an error, but the site is online and the error is only because your connection to the ISP stopped working. That's where you could add another step to your web check that tests some well-known resource (google/amazon/whatever). That way you can at least be sure the checks Zabbix sends are reaching the Internet. It won't cover all cases, of course.
        Heh, and here I just wanted to mention the made-up check to speed up your testing. I should become a novelist.
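
The idea of postponing notifications until a problem has persisted for 10 minutes can be sketched roughly like this in Python. It only models the decision; it is not how Zabbix actions or escalations are configured, and the names should_notify and NOTIFY_AFTER are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed delay, taken from the "10m" mentioned in the post.
NOTIFY_AFTER = timedelta(minutes=10)


def should_notify(problem_started, now, still_in_problem, delay=NOTIFY_AFTER):
    """Return True only if the problem is still open and has lasted at least `delay`."""
    if not still_in_problem:
        return False  # a short glitch that already recovered never notifies
    return now - problem_started >= delay


start = datetime(2020, 7, 1, 12, 0)
print(should_notify(start, start + timedelta(minutes=2), True))    # problem is still young
print(should_notify(start, start + timedelta(minutes=10), True))   # persisted for the full delay
print(should_notify(start, start + timedelta(minutes=15), False))  # recovered in the meantime
```

The problem is still recorded from its first minute; only the message is held back, which is what keeps a flapping check from flooding the chat.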

        • visiandy
          Junior Member
          • Jul 2020
          • 8

          #5
          I just took a bit more time to read your first sentence again. You said "HTTP Response Code" is the name of your first (and only) step. That's almost the same as your web check name (HTTP code response), or did I get it wrong?
          Because the expression I gave you needs the name of your web check first and then the name of the step in that check (because each step could return a different code).

          • stevenh1901
            Junior Member
            • Jul 2019
            • 4

            #6
            Fantastic, that worked perfectly! It took about 2 minutes before it triggered a failure state!! I don't have it set up yet, but it will be going to a text message. The reason I wanted it to check every minute is that one of our hosting contracts requires it, but I was getting a lot of flapping, so I wanted a 2-minute delay before it triggered. That way I can technically say it checks every minute while still allowing for potential flapping.

            Yeah the naming isn't great for my steps / scenarios since they all only have 1 step to them.

            Thanks for your help Visiandy!!

            • Sorcermon
              Junior Member
              • Sep 2023
              • 4

              #7
              I was checking this, and with version 6.4 it didn't work for me.

              I managed to create an expression that I think fits well.

              count(/Zabbix server/web.test.fail[(name of web scenario)],#5,"eq",1)>=3

              With this, the trigger fires if at least 3 of the last 5 values equal 1, i.e. step 1 failed in at least 3 of the last 5 checks.

              This also works for me:

              count(/Zabbix server/web.test.rspcode[(name of web scenario),(name of step)],5m,"eq","200")<3

              This one fires when fewer than 3 responses with code 200 were seen in the last 5 minutes. (I have the web scenario set to test the site once per minute.)

              Hope this helps!
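
The difference between the two expressions above (a window of the last 5 values versus a window of the last 5 minutes) can be illustrated with a small Python sketch. This is only a model under assumptions: values are plain numbers, timestamps are seconds, the window is taken as (now - 300, now], and all function names are made up.

```python
def count_last_n(values, n, pattern):
    """Count values equal to pattern among the last n values (like a #n window)."""
    return sum(1 for v in values[-n:] if v == pattern)


def count_in_window(samples, now, seconds, pattern):
    """Count (timestamp, value) samples equal to pattern within the last `seconds`."""
    return sum(1 for ts, v in samples if now - seconds < ts <= now and v == pattern)


def fail_trigger(fail_values):
    """Model of count(...web.test.fail...,#5,"eq",1)>=3."""
    return count_last_n(fail_values, 5, 1) >= 3


def rspcode_trigger(samples, now):
    """Model of count(...web.test.rspcode...,5m,"eq","200")<3."""
    return count_in_window(samples, now, 300, 200) < 3


# One check per minute; timestamps in seconds.
codes = [(60, 200), (120, 500), (180, 500), (240, 200), (300, 500)]
print(fail_trigger([0, 1, 0, 1, 1]))
print(rspcode_trigger(codes, now=300))
```

With a time window the number of values considered depends on the check interval, so the threshold of 3 only means "3 out of 5" while the scenario really runs once per minute.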

              • wessel21
                Junior Member
                • Nov 2022
                • 4

                #8
                I think my post fits this thread well, as it is specifically about web scenarios.
                I was getting infrequent false positives: this or that web site was reported as unavailable, even though it loaded immediately in any browser when checked manually. These false positives kept occurring over days. We are using version 6.4.2 on Ubuntu 20.04, no proxy involved. And we are very happy to have this solution.
                In our case I have a bunch of JSP web pages (about ten) served by Apache Tomcat.
                This JSP server gets rebooted before midnight. The first person or mechanism requesting any of the pages after that triggers a recompilation before the result can be delivered.
                That causes the first spike in resource demand.
                If someone (as we did) uses the standard interval of 5 minutes between web scenario checks, all checks happen at the same time.
                That is the second factor, which multiplies the demand.
                For one of the web pages I set up a log to get a view from inside the page itself of what happens. The log is fed by a script that collects all the necessary data, and by the JSP logic that presents these details.
                The JSP adds to the log which user agent called the page and at what time. That confirmed the page is requested by Zabbix every 5 minutes. And how it is requested left me flabbergasted:
                For the 13 details presented by this page (so its web scenario contains 13 steps), Zabbix requests the same page 22 times (!). Wow.
                That is the third point: how Zabbix may drown itself in information.
                Here is an excerpt of that log, showing that it takes the Tomcat/Zabbix duo some time to work through what is requested:

                ...
                Fri 01 Sep 2023 03:10:05 PM UTC
                by XY check script (started every 5 min by cron):
                result.txt size: 8350 error.txt size: 0
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:10:25
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                XYcheck.jsp called by Zabbix at 15:12:26
                Fri 01 Sep 2023 03:15:04 PM UTC
                by XY check script (started every 5 min by cron):
                result.txt size: 8345 error.txt size: 0
                XYcheck.jsp called by Mozilla/Macintosh (user) at 15:13:03
                ...


                Well, we have web pages providing information about 50 details, which obviously need 50 steps in a scenario.
                A simple IF construct could check whether the next step requests the same web page, to prevent repeated requests for the same information.

                To alleviate the issue I set the delay between checks to a different, mostly prime, number of minutes for each scenario: 3, 5, 7, 9, 11, 13...
                Even with a recompilation the pages no longer run into a timeout, and over time those checks will rarely coincide in their resource demands.
                Without that log accidentally uncovering this behaviour I would never have had an idea.
                Maybe this helps other admins, and maybe it is a hint on how to improve Zabbix itself.
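
The effect of staggering the intervals can be checked with a few lines of Python. This is a rough sketch under a strong assumption: every scenario starts at minute 0 and fires exactly on multiples of its interval, which a real scheduler will not do precisely. The function names are made up.

```python
from itertools import combinations
from math import gcd


def lcm(a, b):
    """Least common multiple: how often two intervals line up, in minutes."""
    return a * b // gcd(a, b)


def pairwise_coincidence(intervals):
    """Map each pair of intervals to the number of minutes between coincidences."""
    return {(a, b): lcm(a, b) for a, b in combinations(intervals, 2)}


def max_simultaneous(intervals, horizon):
    """Largest number of scenarios firing in the same minute within `horizon` minutes."""
    return max(
        sum(1 for i in intervals if t % i == 0)
        for t in range(1, horizon + 1)
    )


print(max_simultaneous([5] * 10, 60))               # ten scenarios, all on a 5-minute interval
print(max_simultaneous([3, 5, 7, 9, 11, 13], 60))   # the staggered intervals from the post
print(pairwise_coincidence([3, 9]))                 # 9 shares a factor with 3
```

In this model, intervals without common factors coincide only at their product, while 3 and 9 still line up every 9 minutes, so swapping the 9 for another prime would spread the load a little further.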
