Ad Widget

Collapse

Trigger firing when I don't think it should?

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • cdecarlo
    Junior Member
    • Nov 2019
    • 23

    #1

    Trigger firing when I don't think it should?

    I have an HTTP Agent item monitoring a URL with pre-processing configured. The item is returning the header and body as character data. The pre-processing is using a regular expression to evaluate the response down to OK with a custom on fail value set to ERROR. This is working as expected. When I look at the history for this item it contains mostly OK values with an occasional ERROR value. There is some flakiness in the back-end systems which cause an occasional error. Since I know that there will be a single random failure, I have a trigger that I've configured to count the number of error responses over the last three collections and evaluate to true if there have been 2 or more values equal to ERROR over those last three collections.

    This is what my trigger expression looks like : {zabbix.sys.courtnet.org:vm-esig-appp2_HSM_Pool_Health.count(#3,"OK",ne)}>1
    I have also tried : {zabbix.sys.courtnet.org:vm-esig-appp2_HSM_Pool_Health.count(#3,"ERROR",eq)}>1

    What happens is that this trigger seems to fire every time the item gets an ERROR value.

    Here is the data values collected from this item (from the text view of the latest data for this item).

    2020-04-21 11:45:18 1587483918 "OK"
    2020-04-21 11:44:18 1587483858 "OK"
    2020-04-21 11:43:18 1587483798 "ERROR"
    2020-04-21 11:42:18 1587483738 "OK"
    2020-04-21 11:41:18 1587483678 "OK"
    2020-04-21 11:40:18 1587483618 "OK"

    I received the alert emails at 11:44 within a couple seconds of each other....
    Problem started at 11:44:14 on 2020.04.21 Problem has been resolved at 11:44:16 on 2020.04.21 If I'm reading the documentation correct, then I would have assumed that trigger would not have fired because there was only every one ERROR status within any three collections of the item data.

    What am I missing here?

    Thanks,

    Cliff
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    Originally posted by cdecarlo
    This is what my trigger expression looks like : {zabbix.sys.courtnet.org:vm-esig-appp2_HSM_Pool_Health.count(#3,"OK",ne)}>1
    I have also tried : {zabbix.sys.courtnet.org:vm-esig-appp2_HSM_Pool_Health.count(#3,"ERROR",eq)}>1

    What happens is that this trigger seems to fire every time the item gets an ERROR value.

    Here is the data values collected from this item (from the text view of the latest data for this item).

    2020-04-21 11:45:18 1587483918 "OK"
    2020-04-21 11:44:18 1587483858 "OK"
    2020-04-21 11:43:18 1587483798 "ERROR"
    2020-04-21 11:42:18 1587483738 "OK"
    2020-04-21 11:41:18 1587483678 "OK"
    2020-04-21 11:40:18 1587483618 "OK"
    I'm not using count() in any of my triggers, but based on my reading of the trigger functions documentation what you have looks correct, especially the first one.

    I will say that I tend to avoid using text item types, even for OK/NOT_OK items. I prefer to write my custom items so that they return an integer, and then I create a value mapping so that it's obvious for any item that e.g. 0=OK, 1=ERROR, or whatever numbers make sense for your item.

    My reasons for doing this are
    1. graphing works by default when the item returns integers
    2. more trigger functions support integer types than text types, so I have a wider range of functions I can use.
    Still, what you're doing looks like it should work, at least according to the docs.

    What happens if you don't use #3, and instead try using something like 3m (since your checks appear to be every minute). It shouldn't make any difference based on the docs, but I'm wondering if it behaves differently.

    If you're comfortable reading function call traces from debug logs, you could enable debug logging on your server and then just wait for the problem to happen and then review the debug logs, to see if that gives any clues.

    Comment

    • cdecarlo
      Junior Member
      • Nov 2019
      • 23

      #3
      Tim,

      Thanks for the suggestions. I could certainly change the return type of the pre-processing rules to be an integer value...I'm not sure if that would make any difference to my triggering problem, but I understand where you are coming from with the graphing and function support.

      I forgot the mention, that the first type of trigger I tried was using the evaluation period (of 3 minutes) and it was behaving exactly the same, that's when I switched the the number of collections syntax, which actually made the trigger definition a little more clear in what it was trying to do.

      I will try cranking up the logging level on the server and see if I see anything interesting when the problem happens next...it usually fires a few times a day, so I should be able to catch it.

      Cliff

      Comment

      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        Originally posted by cdecarlo
        I forgot the mention, that the first type of trigger I tried was using the evaluation period (of 3 minutes) and it was behaving exactly the same, that's when I switched the the number of collections syntax, which actually made the trigger definition a little more clear in what it was trying to do.
        Honestly using #3 should be better than using 3m in this situation, I just wanted to see if it behaved any differently. Since it doesn't "fix" the issue, sticking with #3 is what I would do too.

        Good luck and please follow-up with whatever you find. It's not clear to me what the problem is, so I'll be interested to learn what you find. If you need help interpreting the debug logs you can try posting them here, but be sure you alter or obfuscate any sensitive stuff that may appear in them. Using the runtime control feature (see zabbix_server(8) man page) it might be possible to figure out which process(es) is/are doing the trigger calculation and just increase log level for them. That would greatly reduce the amount of unrelated logs you need to wade through. Not sure how easy that's going to be to do, though.

        Comment

        • cdecarlo
          Junior Member
          • Nov 2019
          • 23

          #5
          Thanks. I've cranked up the logging...now just waiting until the trigger fires again.

          I will update with anything I find.

          Comment

          • cdecarlo
            Junior Member
            • Nov 2019
            • 23

            #6
            O.K. I had another trigger alert fire today at 14:54:11 The itemid for this resource is 32775

            2020-04-21 14:54:15 1587495255 "OK"
            2020-04-21 14:53:15 1587495195 "ERROR"
            2020-04-21 14:52:15 1587495135 "OK"
            2020-04-21 14:51:15 1587495075 "OK"

            I'm not exactly sure what I'm looking at in the logs....here is what I found that might be relevant...these were the entries for this resource from 14:53 and 14:54

            54628:20200421:145315.980 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
            54628:20200421:145315.980 In evaluate_COUNT()
            54628:20200421:145315.980 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
            54628:20200421:145315.980 In substitute_simple_macros() data:'#3'
            54628:20200421:145315.980 get_function_parameter_int() type:num value:3
            54628:20200421:145315.980 End of get_function_parameter_int():SUCCEED
            54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
            54628:20200421:145315.980 In substitute_simple_macros() data:'OK'
            54628:20200421:145315.980 get_function_parameter_str() value:'OK'
            54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
            54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
            54628:20200421:145315.980 In substitute_simple_macros() data:'ne'
            54628:20200421:145315.980 get_function_parameter_str() value:'ne'
            54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
            54628:20200421:145315.980 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495195 ns:959473792
            54628:20200421:145315.980 End of zbx_vc_get_values():SUCCEED count:3 cached:1
            54628:20200421:145315.980 End of evaluate_COUNT():SUCCEED
            54628:20200421:145315.980 End of evaluate_function():SUCCEED value:'1'

            54626:20200421:145415.772 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
            54626:20200421:145415.772 In evaluate_COUNT()
            54626:20200421:145415.772 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
            54626:20200421:145415.772 In substitute_simple_macros() data:'#3'
            54626:20200421:145415.772 get_function_parameter_int() type:num value:3
            54626:20200421:145415.772 End of get_function_parameter_int():SUCCEED
            54626:20200421:145415.772 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
            54626:20200421:145415.772 In substitute_simple_macros() data:'OK'
            54626:20200421:145415.772 get_function_parameter_str() value:'OK'
            54626:20200421:145415.772 End of get_function_parameter_str():SUCCEED
            54626:20200421:145415.772 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
            54626:20200421:145415.772 In substitute_simple_macros() data:'ne'
            54626:20200421:145415.772 get_function_parameter_str() value:'ne'
            54626:20200421:145415.772 End of get_function_parameter_str():SUCCEED
            54626:20200421:145415.773 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495255 ns:425014894
            54626:20200421:145415.773 End of zbx_vc_get_values():SUCCEED count:3 cached:1
            54626:20200421:145415.773 End of evaluate_COUNT():SUCCEED
            54626:20200421:145415.773 End of evaluate_function():SUCCEED value:'1'

            At 14:53 when the data was collected it was in error status (this is the raw data coming back from the request...it has a HTTP status of 500 which caused the pre-processor to use the value of ERROR):

            54662:20200421:145315.961 worker_preprocess_value(): HTTP/1.1 500
            Cache-Control: private
            Expires: Thu, 01 Jan 1970 00:00:00 GMT
            Content-Type: text/html
            Content-Length: 1100
            Date: Tue, 21 Apr 2020 18:53:15 GMT
            Connection: close
            Server: http-server/1.0
            <HTML><head></head><body>
            <h1>One or More Clusters Unhealthy</h1>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>1 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 46sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: UNHEALTHY / 46sec ago<br> Active: 0<br> Idle: 0<br></dd>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>2 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 42sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 42sec ago<br> Active: 0<br> Idle: 1<br></dd>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>2 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 28sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 27sec ago<br> Active: 0<br> Idle: 1<br></dd>
            </body></HTML>

            And this is the response at 14:54 when everything was O.K. again

            54662:20200421:145415.426 worker_preprocess_value(): HTTP/1.1 200
            Cache-Control: private
            Expires: Thu, 01 Jan 1970 00:00:00 GMT
            Content-Type: text/html
            Content-Length: 1088
            Date: Tue, 21 Apr 2020 18:54:15 GMT
            Server: http-server/1.0
            <HTML><head></head><body>
            <h1>All Clusters Healthy</h1>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>2 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 45sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 45sec ago<br> Active: 0<br> Idle: 1<br></dd>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>2 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 41sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 41sec ago<br> Active: 0<br> Idle: 1<br></dd>
            <dt>ucs.common.esignature.server.ejb.internals.h sm .EsigSafeNetHsmHaExecutor</dt>
            <dd>2 of 2 are healthy.<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 27sec ago<br> Active: 0<br> Idle: 1<br> SafeNetHSM::[email protected]::EsigHsmOutOfProcWork er: Healthy / 26sec ago<br> Active: 0<br> Idle: 1<br></dd>
            </body></HTML>

            Here are some more entries that might be relevant (the id for the trigger that is firing is 19613)

            54617:20200421:145309.723 triggerid:19613 description:'vm-esig-appp1 HSM Pool Health - DOWN' type:0 status:0 priority:3
            54617:20200421:145309.723 expression:'{23121}>1' recovery_expression:''
            54617:20200421:145309.723 value:0 state:0 error:'' lastchange:1587155595
            54617:20200421:145309.723 correlation_tag:'' recovery_mode:'0' correlation_mode:'0'
            54617:20200421:145309.723 topoindex:1 functional:0 locked:0
            54617:20200421:145309.723 opdata:''
            54617:20200421:145309.723 tags:
            54617:20200421:145309.723 tagid:8386 tag:'zabbix-admins' value:''
            54617:20200421:145309.723 tagid:8387 tag:'Linux_Zabbix_Notification' value:''
            54617:20200421:145309.723 tagid:8388 tag:'NOC Staff' value:''
            54617:20200421:145309.723 tagid:8389 tag:'Esign_Zabbix_Notification' value:''

            54617:20200421:145410.760 triggerid:19613 description:'vm-esig-appp1 HSM Pool Health - DOWN' type:0 status:0 priority:3
            54617:20200421:145410.760 expression:'{23121}>1' recovery_expression:''
            54617:20200421:145410.760 value:0 state:0 error:'' lastchange:1587155595
            54617:20200421:145410.760 correlation_tag:'' recovery_mode:'0' correlation_mode:'0'
            54617:20200421:145410.760 topoindex:1 functional:0 locked:0
            54617:20200421:145410.760 opdata:''
            54617:20200421:145410.760 tags:
            54617:20200421:145410.760 tagid:8386 tag:'zabbix-admins' value:''
            54617:20200421:145410.760 tagid:8387 tag:'Linux_Zabbix_Notification' value:''
            54617:20200421:145410.760 tagid:8388 tag:'NOC Staff' value:''
            54617:20200421:145410.760 tagid:8389 tag:'Esign_Zabbix_Notification' value:''
            Last edited by cdecarlo; 21-04-2020, 22:05.

            Comment

            • tim.mooney
              Senior Member
              • Dec 2012
              • 1427

              #7
              Originally posted by cdecarlo
              54628:20200421:145315.980 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
              54628:20200421:145315.980 In evaluate_COUNT()
              54628:20200421:145315.980 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
              54628:20200421:145315.980 In substitute_simple_macros() data:'#3'
              54628:20200421:145315.980 get_function_parameter_int() type:num value:3
              54628:20200421:145315.980 End of get_function_parameter_int():SUCCEED
              54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
              54628:20200421:145315.980 In substitute_simple_macros() data:'OK'
              54628:20200421:145315.980 get_function_parameter_str() value:'OK'
              54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
              54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
              54628:20200421:145315.980 In substitute_simple_macros() data:'ne'
              54628:20200421:145315.980 get_function_parameter_str() value:'ne'
              54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
              54628:20200421:145315.980 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495195 ns:959473792
              54628:20200421:145315.980 End of zbx_vc_get_values():SUCCEED count:3 cached:1
              54628:20200421:145315.980 End of evaluate_COUNT():SUCCEED
              54628:20200421:145315.980 End of evaluate_function():SUCCEED value:'1'
              Most of that is Zabbix examining each of the 3 parameters to count() in turn and doing macro substitution on each of the parameters (via its substitute_simple_macros() function) to replace any macros that may be used with their actual values. You're not using macros in any of the 3 arguments, so the substitute_simple_macros() is basically a no-op, returning the original value.

              The zbx_vc_get_values() is where Zabbix is looking up data from its value cache (_vc_ -> value cache).

              Not much useful here, unfortunately.

              However, after this section from 14:53:15 there should be a section in the logs where Zabbix is probably calling something like "zbx_substitute_functions_results()" a bunch, and then eventually it's probably doing something like this:

              Code:
              query [txnlev:1] [insert into problem (eventid,source,object,objectid,clock,ns,name,severity) values (675241,0,0,17359,1586982965,626859764,'somehost.whatever: your trigger here',2);]
              Can you obfuscate anything you need to in that region of the logs, and post it too?



              Comment

              • cdecarlo
                Junior Member
                • Nov 2019
                • 23

                #8
                Tim,

                There are no entries in the log like "insert into problem....". The only "insert into" entries in the log file are insert into history type of insert statements like the following (this one is for the item in question item id 32775)

                27962,169: 54626:20200421:145216.042 query [txnlev:1] [insert into history_str (itemid,clock,ns,value) values (32655,1587495135,142616386,'OK'),(32715,158749513 5,143339205,'OK'),(32775,1587495135,225753371,'OK' );
                113562,101: 54628:20200421:145315.973 query [txnlev:1] [insert into history_str (itemid,clock,ns,value) values (32775,1587495195,959473792,'ERROR'),(32655,158749 5195,961626590,'OK');
                197016,169: 54626:20200421:145415.760 query [txnlev:1] [insert into history_str (itemid,clock,ns,value) values (32715,1587495255,320250950,'OK'),(32655,158749525 5,367090539,'OK'),(32775,1587495255,425014894,'OK' );
                279600,305: 12,1587495316,714947348,'OK'),(32713,1587495316,74 9301042,'OK'),(32653,1587495316,838554711,'OK'),(3 2714,1587495316,928755508,'OK'),(32774,1587495316, 980916331,'OK'),(32654,1587495316,983370269,'OK'), (32775,1587495317,26278456,'OK'),(32655,1587495317 ,30779337,'OK'),(32715,1587495317,66131147,'OK'),( 32716,1587495317,70071265,'OK'),(32777,1587495317, 111085689,'OK'),(32776,1587495317,115349152,'OK'), (32717,1587495317,142201823,'OK');
                365906,169: 54628:20200421:145616.569 query [txnlev:1] [insert into history_str (itemid,clock,ns,value) values (32715,1587495375,741732966,'OK'),(32655,158749537 5,742908883,'OK'),(32775,1587495375,802677464,'OK' );
                450288,135: 54628:20200421:145716.363 query [txnlev:1] [insert into history_str (itemid,clock,ns,value) values (32655,1587495435,729513775,'OK'),(32775,158749543 5,790846142,'OK'),(32715,1587495435,803357997,'OK' );

                Which are all of the insert into entries for that item id in the log file.

                These are all of the zbx_vc_get_values entries for that item id:

                28095,31: 54626:20200421:145216.075 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495135 ns:225753371
                28197,31: 54626:20200421:145216.096 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495136 ns:96380857
                36629,31: 54628:20200421:145224.246 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495144 ns:246282682
                62364,31: 54625:20200421:145254.644 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495174 ns:644977469
                113628,31: 54628:20200421:145315.980 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495195 ns:959473792
                113638,31: 54628:20200421:145315.981 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495195 ns:981275058
                122058,31: 54627:20200421:145324.101 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495204 ns:101548997
                147761,31: 54626:20200421:145354.461 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495234 ns:461676900
                197194,31: 54626:20200421:145415.771 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495255 ns:771957334
                197223,31: 54626:20200421:145415.773 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495255 ns:425014894
                206271,31: 54628:20200421:145424.974 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495264 ns:974166430
                229763,31: 54625:20200421:145454.356 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495294 ns:356055821
                280030,31: 54626:20200421:145517.655 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495317 ns:655170921
                280068,31: 54626:20200421:145517.655 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495317 ns:26278456
                288173,31: 54627:20200421:145524.772 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495324 ns:772249747
                314937,31: 54626:20200421:145554.236 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495354 ns:236898860
                366267,31: 54628:20200421:145616.579 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495376 ns:579861965
                366367,31: 54628:20200421:145616.582 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495375 ns:802677464
                374824,31: 54625:20200421:145624.703 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495384 ns:703504033
                399601,31: 54625:20200421:145654.079 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495414 ns:79572750
                450539,31: 54628:20200421:145716.389 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495435 ns:790846142
                450658,31: 54628:20200421:145716.397 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495436 ns:397759989
                459017,31: 54627:20200421:145724.541 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495444 ns:541143664
                483897,31: 54626:20200421:145754.847 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:120 count:1 sec:1587495474 ns:847689742

                As for the zbx_substitute_functions_results values...I'm having trouble determining which ones are for the item/trigger in question....there is a LOT of entries in the log (the total log file size is 37 Meg). Might this be the section you are looking for?

                54637:20200421:145315.885 In substitute_key_macros() data:'vm-esig-appp1_HSM_Pool_Health'
                54637:20200421:145315.885 End of substitute_key_macros():SUCCEED data:'vm-esig-appp1_HSM_Pool_Health'
                54628:20200421:145315.885 End of zbx_evaluate_item_functions()
                54628:20200421:145315.886 In zbx_substitute_functions_results() ifuncs_num:22 tr_num:10
                54628:20200421:145315.886 zbx_substitute_functions_results() expression[0]:'{23036}=0' => '1=0'
                54628:20200421:145315.886 zbx_substitute_functions_results() expression[1]:'{22980}=1' => '0=1'
                54628:20200421:145315.886 zbx_substitute_functions_results() expression[2]:'{23053}>90' => '0.237837>90'
                54628:20200421:145315.886 zbx_substitute_functions_results() expression[3]:'{23054}>90' => '24.109569>90'
                54628:20200421:145315.886 zbx_substitute_functions_results() expression[4]:'{23069}/{23070}>1.5
                and {23071}>0
                and {23072}>0' => '0/4>1.5
                and 0.03>0
                and 0>0'
                54637:20200421:145315.886 In substitute_simple_macros() data:'10s'
                54637:20200421:145315.886 In substitute_simple_macros() data:'https://10.1.20.190:8443/common/hsm_status?require_healthy=true'
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:EMPTY
                54637:20200421:145315.886 In substitute_simple_macros() data:'XXXXXXX'
                54637:20200421:145315.886 In substitute_simple_macros() data:'XXXXXXXXXX'
                54637:20200421:145315.886 In get_value() key:'vm-esig-appp1_HSM_Pool_Health'
                54637:20200421:145315.886 In get_value_http() request method 'GET' URL 'https://10.1.20.190:8443/common/hsm_status?require_healthy=true' headers '' message body ''
                54628:20200421:145315.887 zbx_substitute_functions_results() expression[5]:'{23099}>90 and
                (({23100}-{23101})<5G or {23102}<1d)' => '25.470582>90 and
                ((21072244736-5123526656)<5G or 4293727.121444<1d)'
                54628:20200421:145315.887 zbx_substitute_functions_results() expression[6]:'{23103}>80 and
                (({23104}-{23105})<10G or {23106}<1d)' => '25.470582>80 and
                ((21072244736-5123526656)<10G or 4293727.121444<1d)'
                54628:20200421:145315.887 zbx_substitute_functions_results() expression[7]:'{23107}<10' => '88.216889<10'
                54627:20200421:145315.887 zbx_setproctitle() title:'history syncer #3 [processed 0 values, 0 triggers in 0.009349 sec, idle 1 sec]'
                54637:20200421:145315.887 setting HTTPAUTH [1]
                54628:20200421:145315.887 zbx_substitute_functions_results() expression[8]:'{23108}<20' => '88.216889<20'
                54628:20200421:145315.887 zbx_substitute_functions_results() expression[9]:'{23109}>2
                or {23110}>2' => '0>2
                or 0>2'
                54628:20200421:145315.887 zbx_substitute_functions_results() recovery_expression[9]:'{23111}<2*0.8
                and {23112}<2*0.8' => '0<2*0.8
                and 0<2*0.8'
                54628:20200421:145315.887 End of zbx_substitute_functions_results()


                Last edited by cdecarlo; 21-04-2020, 22:32.

                Comment

                • tim.mooney
                  Senior Member
                  • Dec 2012
                  • 1427

                  #9
                  Ok, if you haven't already done so, you can return your log DebugLevel to normal. I thought maybe something would stand out from the debug logs that would be a "smoking gun", but I'm not seeing it.

                  Also, if I had looked more closely at the times you provided in post #6, I would have realized that because the problem didn't alert until 14:54:11, the log results I was looking for would be much closer to that time.

                  BTW, did you say what version of Zabbix you're using?

                  Comment

                  • cdecarlo
                    Junior Member
                    • Nov 2019
                    • 23

                    #10
                    We are currently running version 5.0.0 alpha 2 (we ran into the length limitation on the external check scripts and had to upgrade to this version). I turned to logging back down to a sane level right after the alerts fired! :>

                    It is interesting that the ERROR condition occurs at 14:53 but the trigger didn't send the alert until 14:54 when it went back to OK status. That being said, I didn't really notice anything to indicate why the trigger fired in the log (at either the 14:53 or 14:54 times). From the log entries below the function only ever returned 1 or 0 (neither of which should have caused the trigger to fire since the condition is > 1)

                    I suppose it's also possible that this is a bug in the 5.0 alpha version?

                    If I'm reading this sequence of log events correctly, it looks like the count from the trigger function returned 1 at the 14:53, 14:54, 14:55 collections and then returned 0 at the 14:56 collection (it also returned 0 at the 14:52 collection)? That makes sense to me since I'm looking for the last 3 collection intervals so 14:53,14:54 and 14:55 would and should have returned 1 from the function. So why then did the trigger fire at 14:54??

                    54628:20200421:145315.980 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
                    54628:20200421:145315.980 In evaluate_COUNT()
                    54628:20200421:145315.980 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
                    54628:20200421:145315.980 In substitute_simple_macros() data:'#3'
                    54628:20200421:145315.980 get_function_parameter_int() type:num value:3
                    54628:20200421:145315.980 End of get_function_parameter_int():SUCCEED
                    54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
                    54628:20200421:145315.980 In substitute_simple_macros() data:'OK'
                    54628:20200421:145315.980 get_function_parameter_str() value:'OK'
                    54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
                    54628:20200421:145315.980 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
                    54628:20200421:145315.980 In substitute_simple_macros() data:'ne'
                    54628:20200421:145315.980 get_function_parameter_str() value:'ne'
                    54628:20200421:145315.980 End of get_function_parameter_str():SUCCEED
                    54628:20200421:145315.980 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495195 ns:959473792
                    54628:20200421:145315.980 End of zbx_vc_get_values():SUCCEED count:3 cached:1
                    54628:20200421:145315.980 End of evaluate_COUNT():SUCCEED
                    54628:20200421:145315.980 End of evaluate_function():SUCCEED value:'1'

                    54626:20200421:145415.772 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
                    54626:20200421:145415.772 In evaluate_COUNT()
                    54626:20200421:145415.772 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
                    54626:20200421:145415.772 In substitute_simple_macros() data:'#3'
                    54626:20200421:145415.772 get_function_parameter_int() type:num value:3
                    54626:20200421:145415.772 End of get_function_parameter_int():SUCCEED
                    54626:20200421:145415.772 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
                    54626:20200421:145415.772 In substitute_simple_macros() data:'OK'
                    54626:20200421:145415.772 get_function_parameter_str() value:'OK'
                    54626:20200421:145415.772 End of get_function_parameter_str():SUCCEED
                    54626:20200421:145415.772 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
                    54626:20200421:145415.772 In substitute_simple_macros() data:'ne'
                    54626:20200421:145415.772 get_function_parameter_str() value:'ne'
                    54626:20200421:145415.772 End of get_function_parameter_str():SUCCEED
                    54626:20200421:145415.773 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495255 ns:425014894
                    54626:20200421:145415.773 End of zbx_vc_get_values():SUCCEED count:3 cached:1
                    54626:20200421:145415.773 End of evaluate_COUNT():SUCCEED
                    54626:20200421:145415.773 End of evaluate_function():SUCCEED value:'1'

                    54626:20200421:145517.655 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
                    54626:20200421:145517.655 In evaluate_COUNT()
                    54626:20200421:145517.655 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
                    54626:20200421:145517.655 In substitute_simple_macros() data:'#3'

                    Also, i just noticed that 5.0 Beta1 has been released...we will be upgrading to that version today.
                    54626:20200421:145517.655 get_function_parameter_int() type:num value:3
                    54626:20200421:145517.655 End of get_function_parameter_int():SUCCEED
                    54626:20200421:145517.655 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
                    54626:20200421:145517.655 In substitute_simple_macros() data:'OK'
                    54626:20200421:145517.655 get_function_parameter_str() value:'OK'
                    54626:20200421:145517.655 End of get_function_parameter_str():SUCCEED
                    54626:20200421:145517.655 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
                    54626:20200421:145517.655 In substitute_simple_macros() data:'ne'
                    54626:20200421:145517.655 get_function_parameter_str() value:'ne'
                    54626:20200421:145517.655 End of get_function_parameter_str():SUCCEED
                    54626:20200421:145517.655 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495317 ns:26278456
                    54626:20200421:145517.655 End of zbx_vc_get_values():SUCCEED count:3 cached:1
                    54626:20200421:145517.655 End of evaluate_COUNT():SUCCEED
                    54626:20200421:145517.655 End of evaluate_function():SUCCEED value:'1'

                    54628:20200421:145616.582 In evaluate_function() function:'zabbix.sys.courtnet.org:vm-esig-appp1_HSM_Pool_Health.count(#3, "OK", ne)'
                    54628:20200421:145616.582 In evaluate_COUNT()
                    54628:20200421:145616.582 In get_function_parameter_int() parameters:'#3, "OK", ne' Nparam:1
                    54628:20200421:145616.582 In substitute_simple_macros() data:'#3'
                    54628:20200421:145616.582 get_function_parameter_int() type:num value:3
                    54628:20200421:145616.582 End of get_function_parameter_int():SUCCEED
                    54628:20200421:145616.582 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:2
                    54628:20200421:145616.582 In substitute_simple_macros() data:'OK'
                    54628:20200421:145616.582 get_function_parameter_str() value:'OK'
                    54628:20200421:145616.582 End of get_function_parameter_str():SUCCEED
                    54625:20200421:145616.582 End of recalculate_triggers()
                    54625:20200421:145616.582 In zbx_process_events() events_num:0
                    54628:20200421:145616.582 In get_function_parameter_str() parameters:'#3, "OK", ne' Nparam:3
                    54628:20200421:145616.582 In substitute_simple_macros() data:'ne'
                    54628:20200421:145616.582 get_function_parameter_str() value:'ne'
                    54628:20200421:145616.582 End of get_function_parameter_str():SUCCEED
                    54628:20200421:145616.582 In zbx_vc_get_values() itemid:32775 value_type:1 seconds:0 count:3 sec:1587495375 ns:802677464
                    54625:20200421:145616.582 End of zbx_process_events() processed:0
                    54625:20200421:145616.582 query [txnlev:1] [commit;]
                    54628:20200421:145616.583 End of zbx_vc_get_values():SUCCEED count:3 cached:1
                    54628:20200421:145616.583 End of evaluate_COUNT():SUCCEED
                    54628:20200421:145616.583 End of evaluate_function():SUCCEED value:'0'

                    I also just noticed that 5.0 Beta1 has been released. I will get that version running today and see if that makes any difference.
                    Last edited by cdecarlo; 22-04-2020, 16:22.

                    Comment

                    • cdecarlo
                      Junior Member
                      • Nov 2019
                      • 23

                      #11
                      This is still happening with 5.0 beta 1.

                      Comment

                      • tim.mooney
                        Senior Member
                        • Dec 2012
                        • 1427

                        #12
                        I'm not sure what to tell you. I guess I'm a little suspicious of the "count:" returned in a few of the zbx_vc_get_values() calls you show in comment #8, but I also haven't spent any time looking at the source for that function to have a good idea of what the log messages are telling us.

                        If I were you, I would probably report it as a bug in the Zabbix bug tracker and see if you get any follow-up. Many things that get reported as "bugs" there are actually not bugs but configuration issues on the zabbix server (e.g. wrong logic in a trigger, etc.), but in your case if it's a configuration issue, it's something at least a little subtle. I'm certainly not spotting it.

                        Alternately, I would probably try what I suggested in comment #2: create a new item that returns an integer and then use max or min in your trigger function (depending on which value you make OK and which value you make ERROR) to determine if you've had 3 successive bad readings. You could make this an additional item and trigger, along side the one with count() that is not working, and have them run at the same frequency to see if they agree. You could change the severity of this trigger so that your actions don't generate alerts about it, to avoid double alerts just while testing them in parallel.

                        I wish I had a better answer for you.

                        If you do make headway on this, please follow up and let us know what you find.

                        Comment

                        • cdecarlo
                          Junior Member
                          • Nov 2019
                          • 23

                          #13
                          Thanks for at least attempting to help! I will try your suggestion of making a new item that returns an integer value and see if that behaves the same or differently. If that one behaves differently, then I will post a bug submission to Zabbix as I will have some more definitive information for the ticket.

                          Interestingly enough, debug logging level 5 in 5.0 beta 1 does not seem to log the calls to evaluate_function (unless those are still only done on the server)? Since my post yesterday, I've re-created this item on the correct proxy that should be monitoring this item and didn't see the evaluate_function calls in the proxy logs with logging turned up to 5, but I did not turn up the logging level on the server at the same time.

                          I'll update this post when I see some results from the new items returning integers...

                          Comment

                          • tim.mooney
                            Senior Member
                            • Dec 2012
                            • 1427

                            #14
                            I believe triggers would only be evaluated on the server.

                            The proxy collects the data (items) and sends it on to the server, but the server is responsible for all trigger evaluation and actions.

                            Comment

                            • cdecarlo
                              Junior Member
                              • Nov 2019
                              • 23

                              #15
                              Ah, O.K. that makes sense...I kind of had a feeling that is why I wasn't seeing those evaluation messages. I've created the two new items returning integer values...now I just need to wait for one of the events to happen....shouldn't be too long...there are usually several a day!

                              Comment

                              Working...