Recovery expression not working as expected

  • comboloid
    Junior Member
    • Dec 2022
    • 15

    #1


    Hello!!

    I am trying to monitor our machines and their downtime more effectively, so I created the following trigger, which checks whether the agent is not responding or the load is over 300. The problem is that the recovery expression I am using very often takes 1-2 hours before Zabbix recognizes the problem as resolved.
    [Attached image: Screenshot 2024-03-31 at 17.05.29.png]

    For example, take the following machine. It had a CPU spike for a few minutes, and the load dropped back down within 10 minutes.

    [Attached image: Screenshot 2024-03-31 at 17.09.54.png]

    But the trigger only recovered on its own after 2 hours, which is totally misleading for our stats.

    [Attached image: Screenshot 2024-03-31 at 17.10.57.png]

    Do you have any idea what might be causing this, or how we can debug it further?
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    OK, let's debug your expressions. How long is your {$AGENT.TIMEOUT}?
    Problem: "agent is not available for X amount of time" and "second to last value of CPU load is greater than or equal to 300".
    Recovery: "3rd value from now should be less than or equal to 200".

    First, you probably missed that the "#N" parameter of last() works differently from other functions: it is not the N most recent values but the Nth most recent value.
    Second, I think your host availability just keeps it open. Even if your load values are already back to normal, and even the values 2 or 3 back would show OK, maybe availability is still not OK? Look at both items' values while the trigger is active.
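
To make the "#N" distinction concrete, here is a minimal Python sketch of the two behaviours. It is not Zabbix code: the helper names `last_nth` and `avg_of_last` and the sample load values are hypothetical, chosen only to mimic how last(#N) picks a single value while a function such as avg(#N) aggregates over N values.

```python
from statistics import mean

def last_nth(history, n):
    """Return the Nth most recent value (1 = newest), mimicking last(#N)."""
    return history[-n]

def avg_of_last(history, n):
    """Return the average of the N most recent values, mimicking avg(#N)."""
    return mean(history[-n:])

# Hypothetical 1-minute load samples, oldest first.
load = [350, 320, 180, 150, 120]

second_newest = last_nth(load, 2)   # a single value: the one before the newest
third_newest = last_nth(load, 3)    # a single value: two collections back
avg_of_three = avg_of_last(load, 3) # an aggregate over the newest three values

print(second_newest, third_newest, avg_of_three)
```

With a 1-minute item interval, a check on the 3rd most recent value is always looking at data that is roughly two collection cycles old, never at the newest sample.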

    • comboloid
      Junior Member
      • Dec 2022
      • 15

      #3
      Hey cyber, {$AGENT.TIMEOUT} is set to 1m.

      The weird part is that when this setup is used to check only the load, the recovery expression works as expected.
      [Attached image: Screenshot 2024-04-01 at 18.19.00.png]


      Originally posted by cyber
      then maybe availability is still not OK? Look for both items values when trigger is active....

      The Zabbix agent is available in all these cases. I am just wondering whether Zabbix somehow gets confused or buggy when combining two triggers and using a recovery expression for only one of them?

      • markfree
        Senior Member
        • Apr 2019
        • 868

        #4
        To me, it doesn't make much sense to mix these two items in the same trigger.
        As cyber already mentioned, when you use the "#num" parameter with last(), you are actually evaluating the Nth most recent value, not the latest one.
        This means that your trigger will always evaluate values from previous interval cycles, not the most recent item cycle.

        Anyway, how long is your item interval?
        Last edited by markfree; 02-04-2024, 03:28.

        • comboloid
          Junior Member
          • Dec 2022
          • 15

          #5
          hey Mark!!

          The reasoning behind putting these two items in a single trigger is that I want to calculate the real downtime of our machines; treating a machine as down when the agent is not available and/or the load is >300 seems quite accurate. So the idea is to have a single trigger for it, which auto-resolves when the load is less than 200, and then to track that trigger's results. If you have a better idea of how to achieve this, I am still looking for alternatives.

          The interval for system.cpu.load[all,avg1] is 1 min. The reason for using the "#num" parameter is to let 1-2 minutes pass before deciding that there really is a problem, to avoid false positives.

          thanks!

          • cyber
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Dec 2006
            • 4807

            #6
            Originally posted by comboloid

            The Zabbix agent is available in all these cases. I am just wondering whether Zabbix somehow gets confused or buggy when combining two triggers and using a recovery expression for only one of them?
            A recovery expression is an additional expression that has to be TRUE after your initial trigger expression has already turned FALSE. I guess if your expression works OK without the availability item, then that item is the culprit here: it keeps the problem expression from turning FALSE in time.
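
The rule described above can be sketched as a small state function. This is only an illustration under simplifying assumptions, not how Zabbix is implemented: the function name `next_state` is hypothetical, and the example uses load alone with the 300/200 thresholds from this thread, leaving the availability item out.

```python
def next_state(in_problem, problem_expr, recovery_expr):
    """Return True if the trigger is in PROBLEM state after this evaluation.

    The problem closes only when the problem expression is FALSE
    and the recovery expression is TRUE.
    """
    if problem_expr:
        return True
    if in_problem and not recovery_expr:
        return True  # between the thresholds: stay in PROBLEM
    return False

# Hypothetical load samples: spike, partial drop, full recovery.
loads = [100, 310, 250, 190]

states = []
in_problem = False
for load in loads:
    in_problem = next_state(
        in_problem,
        problem_expr=load >= 300,   # problem threshold from the thread
        recovery_expr=load <= 200,  # recovery threshold from the thread
    )
    states.append(in_problem)

print(states)
```

The third sample (250) is below the problem threshold but above the recovery threshold, so the problem stays open until the load reaches 200 or less.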

            • comboloid
              Junior Member
              • Dec 2022
              • 15

              #7
              Hey cyber, thanks for the help. I managed to sort this out with your help.

              The problem was that, even though only the load had to be OK for the recovery expression to turn true, Zabbix was still waiting for data on agent availability before proceeding. I had been using the "Discard unchanged with heartbeat" option with a 3-hour heartbeat for the agent item, meaning that while the value was unchanged, Zabbix had to wait up to 3 hours for a new value before it could decide whether to recover. I removed the heartbeat option entirely, and since then it has worked as expected.
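
As a rough illustration of the throttling effect described above, the following Python sketch simulates a "discard unchanged with heartbeat" step. It is an approximation written for this thread, not Zabbix's implementation: the function name, the exact boundary comparison, and the sample data (an availability value of 1 collected every 60 seconds for 4 hours) are all assumptions.

```python
def discard_unchanged_with_heartbeat(samples, heartbeat):
    """Keep a sample only if its value changed or the heartbeat has elapsed.

    samples: list of (timestamp_seconds, value) tuples, oldest first.
    heartbeat: seconds after which an unchanged value is let through anyway.
    """
    kept = []
    for ts, value in samples:
        if not kept:
            kept.append((ts, value))  # the first value always passes
            continue
        last_ts, last_value = kept[-1]
        if value != last_value or ts - last_ts >= heartbeat:
            kept.append((ts, value))
    return kept

# Assumed data: availability stays at 1, collected every minute for 4 hours.
samples = [(ts, 1) for ts in range(0, 4 * 3600 + 1, 60)]

kept = discard_unchanged_with_heartbeat(samples, heartbeat=3 * 3600)
print(len(samples), "collected,", len(kept), "stored:", kept)
```

In this simulation only the first sample and the one arriving three hours later are stored, so anything that depends on a fresh value of that item can wait up to the full heartbeat, which matches the delay reported in this thread.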
