Trigger dependencies work only on Zabbix dashboard, then I get email storm!

  • diskmandata4
    Junior Member
    • Nov 2018
    • 5

    #1

    Hello All!

    Shortly:
    Let's say I have XXX hosts outside my office that I monitor.

    I cut internet cable and wait for 15 mins.

    On Zabbix dashboard I see only 1 problem (as it should be) for my "Internet check" that has failed (this is due to dependencies I've already set), BUT
    after I restore internet connection I get a lot of emails with "ZABBIX: PROBLEM: Zabbix agent on XXXX was unavailable for 5 mins"
    and immediately another email with "ZABBIX: OK: Zabbix agent on XXXX is unreachable for 5 minutes".

    I've spent a month trying different configurations, tests, etc., but have had no success in stopping the email notification storm that follows the outage!

    I've used Zabbix 3.2, then following official instructions, upgraded to 4.0.1 - it's the same issue.
    I use passive checks and follow the recommendation to use pyramid-style check intervals along the dependency chain, e.g.:

    "LEVEL0 Item - Internet check -- 30s" -> "LEVEL1 Item -- 1m" -> "LEVEL2 Items -- 2m" etc ....
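
    The "pyramid" rule above can be checked mechanically. The following is only a minimal sketch: `parse_interval` and `is_pyramid` are hypothetical helper names, not part of Zabbix, and the chain simply mirrors the levels quoted above.

```python
# Hypothetical helpers (not a Zabbix API): check that polling intervals
# grow along a dependency chain, as in the LEVEL0 -> LEVEL1 -> LEVEL2 example.

def parse_interval(text: str) -> int:
    """Convert a simple interval such as '30s' or '2m' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def is_pyramid(chain: list[tuple[str, str]]) -> bool:
    """Return True if every level is polled strictly less often than its parent."""
    seconds = [parse_interval(interval) for _, interval in chain]
    return all(parent < child for parent, child in zip(seconds, seconds[1:]))


chain = [("LEVEL0 Internet check", "30s"), ("LEVEL1", "1m"), ("LEVEL2", "2m")]
print(is_pyramid(chain))
```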

    Is this by design in Zabbix, or can I somehow stop these emails, which should not be sent at all?

    I also have some Icinga and Nagios experience, but I cannot remember ever having such an issue with an email storm after an outage.

    Partial cut of Zabbix server log right after the cable cut:

    " 938:20181113:084312.326 executing housekeeper
    938:20181113:084437.084 housekeeper [deleted 54545 hist/trends, 0 items/triggers, 33 events, 0 problems, 0 sessions, 0 alarms, 0 audit items in 84.757251 sec, idle for 1 hour(s)]
    958:20181113:091519.432 Zabbix agent item "proc.num[,,run]" on host "XXXX" failed: first network error, wait for 15 seconds
    960:20181113:091520.033 Zabbix agent item "system.cpu.intr" on host "XXXX" failed: first network error, wait for 15 seconds
    961:20181113:091520.066 Zabbix agent item "service.info[Power,state]" on host "XXXX" failed: first network error, wait for 15 seconds
    957:20181113:091524.078 Zabbix agent item "agent.ping" on host "XXXX" failed: first network error, wait for 15 seconds
    962:20181113:091619.351 temporarily disabling Zabbix agent checks on host "XXXX1": host unavailable
    962:20181113:091623.402 temporarily disabling Zabbix agent checks on host "XXXX2": host unavailable
    962:20181113:091627.461 temporarily disabling Zabbix agent checks on host "XXXX3": host unavailable
    962:20181113:091631.524 temporarily disabling Zabbix agent checks on host "XXXX4": host unavailable
    935:20181113:091645.083 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds
    936:20181113:091725.134 failed to send email: Timeout was reached: Connection timed out after 40001 milliseconds
    937:20181113:091805.184 failed to send email: Timeout was reached: Connection timed out after 40001 milliseconds
    935:20181113:091845.225 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds
    962:20181113:091928.024 enabling Zabbix agent checks on host "XXXX1": host became available
    962:20181113:091931.118 enabling Zabbix agent checks on host "XXXX2": host became available
    962:20181113:091935.175 enabling Zabbix agent checks on host "XXXX3": host became available
    962:20181113:091939.280 enabling Zabbix agent checks on host "XXXX4": host became available"
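
    To see how long each host was marked unavailable in a log like the one above, the timestamps can be parsed directly. This is a rough sketch that assumes every line has the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in the excerpt; `parse_line` and `unavailable_seconds` are hypothetical names used only for illustration.

```python
from datetime import datetime


def parse_line(line: str):
    """Split 'PID:YYYYMMDD:HHMMSS.mmm message' into (timestamp, message)."""
    head, message = line.strip().split(" ", 1)
    _pid, date, clock = head.split(":")
    return datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f"), message


def unavailable_seconds(lines):
    """Map each host to the seconds between its 'disabling' and 'enabling' entries."""
    down, result = {}, {}
    for line in lines:
        stamp, message = parse_line(line)
        if 'on host "' not in message:
            continue  # e.g. housekeeper or email lines
        host = message.split('on host "')[1].split('"')[0]
        if message.startswith("temporarily disabling"):
            down[host] = stamp
        elif message.startswith("enabling") and host in down:
            result[host] = (stamp - down.pop(host)).total_seconds()
    return result


sample = [
    '962:20181113:091619.351 temporarily disabling Zabbix agent checks on host "XXXX1": host unavailable',
    '935:20181113:091645.083 failed to send email: Timeout was reached: Connection timed out after 40000 milliseconds',
    '962:20181113:091928.024 enabling Zabbix agent checks on host "XXXX1": host became available',
]
print(unavailable_seconds(sample))
```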

    In my opinion, the logs should contain something like "INET is down ... wait for xx secs" and that's it.
    And after the connection is restored, just log "Connection is OK now", or something like that.

    Help, ideas, suggestions are welcome!
    Thanks in advance!

    --
    Valentin
  • zux
    Member
    • Sep 2018
    • 93

    #2
    this should be set up in actions, include a condition "Problem is not suppressed"


    • diskmandata4
      Junior Member
      • Nov 2018
      • 5

      #3
      Hi!
      Thanks for sharing your advice!
      I spent some time digging into Actions, but I still cannot figure out WHY I had to set up Dependencies at all if they ... just don't do anything useful?!

      The reason I think this is, as I already said, that right after my LEVEL0 (Internet check) item recovers and its trigger returns to OK,
      XX problems immediately appear on the Dashboard ... and after a short time they are RESOLVED. That is where all these unnecessary email notifications come from!

      Could somebody understanding Zabbix design try to explain that and clear my confusion?

      This is right from Zabbix user manual:

      "Sometimes the availability of one host depends on another. A server that is behind some router will become unreachable if the router goes down. With triggers configured for both, you might get notifications about two hosts down - while only the router was the guilty party.
      This is where some dependency between hosts might be useful. With dependency set notifications of the dependants could be withheld and only the notification for the root problem sent."


      So, according to the statement above, something is really not working correctly? Am I wrong, or is it just a lack of knowledge on my part?!

      Thank you in advance!
      --
      Valentin


      • zux
        Member
        • Sep 2018
        • 93

        #4
        This depends on which problem happens first. If you have trigger B that depends on trigger A, and trigger A is active, you should not receive notifications about trigger B. But if trigger B fires first, you will still get it. Could this be the explanation of your problem?
        This should be considered when creating items and triggers. If both hosts go down at the same moment (probably only A goes down, but B can't be reached because of that) and the item update intervals and trigger expressions are equal, it is possible that Zabbix first notices that B is down and only then that A is down. So a good practice is to update the items for A more frequently and make its triggers a bit more sensitive.
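
        The ordering race described here can be illustrated with a small timing model. It is a deliberate simplification and not how Zabbix schedules checks internally: it assumes each item is polled at a fixed offset plus multiples of its interval, and that the first poll after the outage starts is the moment the failure is noticed. All names are hypothetical.

```python
import math


def first_poll_at_or_after(t: float, interval: float, offset: float = 0.0) -> float:
    """Time of the first poll at or after t, for polls at offset + k * interval."""
    k = max(0, math.ceil((t - offset) / interval))
    return offset + k * interval


def child_fires_first(outage_start: float,
                      parent_interval: float, child_interval: float,
                      parent_offset: float = 0.0, child_offset: float = 0.0) -> bool:
    """True if the dependent check (B) notices the outage before its parent (A)."""
    parent_seen = first_poll_at_or_after(outage_start, parent_interval, parent_offset)
    child_seen = first_poll_at_or_after(outage_start, child_interval, child_offset)
    return child_seen < parent_seen


# Equal 60 s intervals: the order depends only on where each poll happens to fall.
print(child_fires_first(5, 60, 60, parent_offset=50, child_offset=10))
# Parent polled every 30 s, child every 60 s, same offset: the parent is seen first.
print(child_fires_first(5, 30, 60))
```

        In this model, equal intervals leave the order to the polling offsets, while a shorter interval on A makes it more likely, though not certain, that A is noticed first.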
        Hope this helps


        • diskmandata4
          Junior Member
          • Nov 2018
          • 5

          #5
          zux, please reread my first message. There I explain thoroughly that I've spent a lot of time trying different tests and reading forums, posts, suggestions, etc.,
          and yes, my router is LEVEL0 with a 30s item check, next is LEVEL1 with a 1m item check, then 2m, etc. As I already said, dependencies WORK, BUT ONLY ON THE DASHBOARD.
          In your example, if trigger B had fired first, I'd also see it on the Dashboard, but I don't. I just cut the INET cable and watch what happens.
          Before setting up any dependencies I saw many problems on the Dashboard, but after setting the correct dependencies I see only "INTERNET CHECK Item - Failed" and no other problems.

          Having to dig into Actions and add extra logic there devalues Dependencies!

          Is there anybody else who also experiences this?!


          • zux
            Member
            • Sep 2018
            • 93

            #6
            Oh sorry, I misunderstood you there. But yes, the logic in actions is exactly that: if you put that condition there ("Problem is not suppressed"), you won't receive the notifications. If you don't, you will. On the dashboard, the default is to hide suppressed problems, but you can enable showing them. This gives you the flexibility to define actions for special problems even if they are suppressed.
            Consider a situation where you must notify a specific manager that a service is down. He only cares about that service and doesn't care why it's down. It is very important for the technical people to understand as fast as possible that the problem may be the router in front of the service. But the manager might not care about routers; you only need to notify him that the service is down. So you create a specific action that only works on that specific trigger and doesn't care whether the problem is suppressed or not.


            • diskmandata4
              Junior Member
              • Nov 2018
              • 5

              #7
              OK, let's forget about dependencies and talk about what you suggest.
              zux, take a look at the attached file. Is that what you recommend to suppress this email storm?
              If so, it doesn't work.

              Also, I'm not able to fully understand your suggestion and what's written in the manual about this condition:

              "Problem is suppressed:
              yes - Specify if the problem is suppressed (not shown) because of host maintenance.
              no - problem is not suppressed.
              yes - problem is suppressed."


              My hosts aren't in maintenance, though ... I've just gotten lost in the logic...

              Any comments?
              Attached Files


              • zux
                Member
                • Sep 2018
                • 93

                #8
                Hmm, sorry, I had really missed the concept here. Dependent triggers really should not even get triggered. Sorry for leading you the wrong way.
                But anyway, I just tested the setup and it works as expected for me. If trigger A is in the problem state, trigger B does not go into the problem state and no email is sent, even if the action does not have the mentioned condition ("Problem is not suppressed").


                • diskmandata4
                  Junior Member
                  • Nov 2018
                  • 5

                  #9
                  Strangely enough, Zabbix doesn't trigger email notifications for any of my custom TCP port XX checks. All the post-outage problems and email notifications (which must not be generated) are related to the "NoData received in the last 5 mins" checks, i.e. to the Zabbix agents.
                  In my view, this is a kind of Zabbix design, and I really cannot figure out how to prevent it without breaking the correct logic.

                  Of course I can try different logic, e.g.:
                  - Change the NoData item to 30 mins instead of 5 mins. This will help only for short INET interruptions (< 30 mins).
                  - Disable the NoData item and replace it with a TCP port 10050 check. But if the Zabbix Agent service gets stuck somehow, you will not detect it.
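
                  One possible reading of this behaviour, which the thread does not confirm, is a timing gap on recovery: the parent trigger returns to OK before the first fresh agent value arrives, so a "no data in the last N minutes" condition is still true for a short while with nothing left to hold it back. The sketch below models only that assumption; `exposed_window` and its parameters are hypothetical and all times are in seconds.

```python
def exposed_window(last_child_data: float, parent_ok: float,
                   next_child_data: float, nodata_window: float) -> float:
    """Seconds during which a 'no data' condition holds while the parent is already OK.

    Assumptions (not taken from Zabbix documentation): the dependency stops
    holding back the child as soon as the parent is OK, and the 'no data'
    condition is true whenever the newest value is older than nodata_window.
    """
    problem_from = last_child_data + nodata_window  # condition becomes true here
    start = max(problem_from, parent_ok)            # parent masks it until parent_ok
    return max(0.0, next_child_data - start)


# 15-minute outage: last agent value at t=0, parent OK at t=930, next value at t=960.
print(exposed_window(0, 930, 960, nodata_window=300))   # 5-minute window
print(exposed_window(0, 930, 960, nodata_window=1800))  # 30-minute window
```

                  Under these assumptions a longer window removes the gap only while the outage stays shorter than the window, which matches the limitation noted in the first option above.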

                  As I said ... I've spent a lot of time and tests to figure it out but with no luck.

                  Is it possible to pay Zabbix support for a specific problem?! I don't need a subscription at the moment...

                  Thanks!
                  --
                  Valentin


                  • zux
                    Member
                    • Sep 2018
                    • 93

                    #10
                    I think that consultations by the hour are also available:
                    Rely on expert advice and get all your technical questions answered with Zabbix Consulting services.
