
Prevent nodata() trigger flood after zabbix_server maintenance break

  • hama
    Junior Member
    • Aug 2008
    • 4

    #1

    Prevent nodata() trigger flood after zabbix_server maintenance break

    Hello.

    This post is pretty much the same as Sjeik's "Prevent Zabbix from hitting triggers after a restart of zabbix_server daemon" six months ago.

    It would be good if there were a "grace period" after a Zabbix server restart. During the grace period Zabbix would ignore all actions. The grace period could be defined in zabbix_server.conf, for example as "GracePeriod=300" to set it to five minutes.

    With a grace period, items with nodata() triggers would have time to receive data instead of Zabbix going mad about not having any data for them. Yesterday evening our Zabbix installation went crazy and sent over 400 SMS messages after a maintenance period. That's because we have nodata() triggers linked to agent.ping items. When the Zabbix server (not any agent) is down for over three minutes, all of those triggers automatically change to true when the server is started again. A short grace period after server start would give Zabbix time to receive data for the items linked to nodata() triggers.
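
A minimal sketch of the proposed grace period, assuming a hypothetical "GracePeriod" setting that Zabbix does not actually provide. The function names and the default value are illustrative only and are not taken from the Zabbix source.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical default, as if "GracePeriod=300" had been read from
 * zabbix_server.conf. No such parameter exists in Zabbix; it is the
 * setting proposed in this post. */
#define GRACE_PERIOD_DEFAULT 300

/* Returns 1 while the server is still inside its grace period, else 0. */
int in_grace_period(time_t server_start, time_t now, int grace_period)
{
    assert(grace_period >= 0);
    return (long)(now - server_start) < (long)grace_period;
}

/* Returns 1 if an action (e-mail, SMS) may be sent right now, else 0. */
int action_allowed(time_t server_start, time_t now, int grace_period)
{
    return !in_grace_period(server_start, now, grace_period);
}
```

With a server started at t=1000 and a 300 second grace period, action_allowed() stays 0 until t=1300, so alerts raised by stale nodata() triggers during that window would not be sent.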

    An alternative solution would be to add a condition on the Zabbix server's state to the nodata() triggers. That would add another level of complexity to the triggers and items, and it would require a separate item monitoring Zabbix's state. It would probably also affect the performance of the whole trigger and action mechanism. This kind of solution doesn't seem good.

    For now we'll probably just work around this "maintenance madness" problem by disabling the SMS send script while bringing Zabbix back into action after maintenance. That way Zabbix will still send hundreds of false email alerts, but at least we don't have to set the on-duty phone to silent.
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    I believe this has already been fixed, so that nodata(N) does not report missing data after a server restart. The logic is very simple:

    - if no data for N seconds and ZABBIX uptime is more than N seconds, then report NO DATA
    - report that there is some data otherwise
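
The two rules above can be sketched as a small C function. This is an illustration of the described logic only, not the actual Zabbix implementation; the function and parameter names are assumptions.

```c
#include <assert.h>
#include <time.h>

/* Sketch of the nodata(N) rule described above.
 * Returns 1 ("NO DATA") only when the item has been silent for more than
 * n seconds AND the server has been up for more than n seconds.
 * Returns 0 ("there is some data") in every other case. */
int nodata_value(time_t server_start, time_t last_clock, time_t now, int n)
{
    int silent_too_long;
    int up_long_enough;

    assert(n >= 0);
    silent_too_long = (last_clock + n < now);
    up_long_enough = (server_start + n < now);
    return silent_too_long && up_long_enough;
}
```

For nodata(180) on a server started at t=1000, an item last seen at t=500 evaluates to 0 at t=1100 (uptime is only 100 seconds) and to 1 at t=1200.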
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga


    • hama
      Junior Member
      • Aug 2008
      • 4

      #3
      Great. Apparently it's fixed in 1.4.6, because we have 1.4.5 and this happened.

      Code:
      zabbix_server --version
      ZABBIX Server (daemon) v1.4.5 (25 March 2008)
      Compilation time:  May  2 2008 16:59:47
      Most likely we won't update yet, because we're quite anxiously waiting for the 1.6 release too. Updating all the agents is a somewhat lengthy process and there's always a chance of messing something up - unless we update only the server, which might create incompatibility issues(?). We'll probably update directly to 1.6 when it's released. For now it's good to know that we won't have to worry about this after the update.

      Thanks.


      • richlv
        Senior Member
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Oct 2005
        • 3112

        #4
        Originally posted by hama
        Great. Apparently it's fixed in 1.4.6, because we have 1.4.5 and this happened.
        well, actually it was supposed to be fixed in 1.4.3...
        [ZBX-1] fixed wrong status of function "nodata" after server restart (Sasha)

        the thing is, i also have seen the problem after that version, but - not always. sometimes triggers fire after server restart, sometimes they don't.
        Zabbix 3.0 Network Monitoring book


        • NOB
          Senior Member
          Zabbix Certified Specialist
          • Mar 2007
          • 469

          #5
          Originally posted by hama
          Great. Apparently it's fixed in 1.4.6, because we have 1.4.5 and this happened.

          Code:
          zabbix_server --version
          ZABBIX Server (daemon) v1.4.5 (25 March 2008)
          Compilation time:  May  2 2008 16:59:47
          Most likely we won't update yet, because we're quite anxiously waiting for the 1.6 release too. Updating all the agents is a somewhat lengthy process and there's always a chance of messing something up - unless we update only the server, which might create incompatibility issues(?). We'll probably update directly to 1.6 when it's released. For now it's good to know that we won't have to worry about this after the update.

          Thanks.
          There should be no problem updating just the server to 1.4.6.
          The DB is the same (just did a diff on both schema versions), so you just have to
          replace the "zabbix_server" executable, I guess.
          The agents are not involved, at all.

          I didn't try that, yet and I am not sure whether the issue is really
          fixed in 1.4.6 (didn't check the source code, yet).
          Looking at the source, there is no difference in the implementation
          of the function evaluate_NODATA between 1.4.5 and 1.4.6.
          So the post which says "fixed in 1.4.3" might be right.

          But the source code might still be wrong ?
          Code:
          if ((CONFIG_SERVER_STARTUP_TIME + parameter > now) ||
              (item->lastclock + parameter > now))
          {
                  strcpy(value, "0");
          }
          else
          {
                  strcpy(value, "1");
          }
          I think it might be better to use:
          Code:
          /* server up for more than N seconds and no data for N seconds */
          if ((CONFIG_SERVER_STARTUP_TIME + parameter < now) &&
              (item->lastclock + parameter < now))
          {
                  strcpy(value, "1");
          }
          else
          {
                  strcpy(value, "0");
          }
          if the result should be exactly the value you use in the trigger,
          like nodata(600) = 1.
          This would be set only when the server has been running for more than
          600 seconds and there has been no data for the same amount of time
          (parameter).
          Otherwise, the nodata period has not yet expired.
          If I find time, I'll try it.

          OK, This is my interpretation of the source code. I could be completely wrong,
          but that doesn't happen that often

          Regards

          Norbert.
          Last edited by NOB; 29-08-2008, 11:31. Reason: Admitted that my interpretation of the source might be completely wrong.


          • Emir Imamagic
            Member
            • Mar 2008
            • 67

            #6
            Originally posted by Alexei
            I believe this has already been fixed, so that nodata(N) does not report missing data after a server restart. The logic is very simple:

            - if no data for N seconds and ZABBIX uptime is more than N seconds, then report NO DATA
            - report that there is some data otherwise
            I see that this behavior is kept in version 1.6. With this approach, half of the items with nodata (e.g. Zabbix server down) suddenly turn green and then after N seconds switch back to red again, no matter how long the server maintenance was.

            Isn't it more logical to keep the old trigger status, wait for N seconds and then start calculating the new value of nodata? If something was broken on the monitored servers before Zabbix maintenance, it sounds a bit too optimistic to assume that it was fixed in the meantime.

            Alternatively, setting nodata to unknown also sounds more logical, since Zabbix doesn't really know whether any data was coming in during the meantime.
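
The two alternatives suggested here can be sketched as follows. This is only an illustration under stated assumptions: the enum, the function names and the idea of passing in the previous state are hypothetical and do not come from the Zabbix source. The comparisons mirror the ones in the evaluate_NODATA snippet quoted earlier in the thread.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical three-valued result for nodata(). */
enum nodata_state { NODATA_FALSE = 0, NODATA_TRUE = 1, NODATA_UNKNOWN = 2 };

/* Variant A: during the first n seconds after server start, keep whatever
 * state the trigger function had before the restart; afterwards evaluate
 * normally from the item's last timestamp. */
int nodata_keep_previous(time_t server_start, time_t last_clock,
                         time_t now, int n, int previous)
{
    assert(n >= 0);
    if (server_start + n > now)
        return previous;
    return (last_clock + n > now) ? NODATA_FALSE : NODATA_TRUE;
}

/* Variant B: during the first n seconds after server start, report
 * "unknown" instead of guessing; afterwards evaluate normally. */
int nodata_or_unknown(time_t server_start, time_t last_clock,
                      time_t now, int n)
{
    assert(n >= 0);
    if (server_start + n > now)
        return NODATA_UNKNOWN;
    return (last_clock + n > now) ? NODATA_FALSE : NODATA_TRUE;
}
```

With either variant, a trigger that was red before maintenance would not briefly turn green after the restart: variant A keeps it red, variant B marks it unknown until n seconds of uptime have passed.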

            Cheers,
            emir


            • teferi
              Member
              • Jul 2008
              • 93

              #7
              And yet again I'll remind everyone about the server's nodata() problem with proxies when there's no connectivity for some period of time.

              Hope I'm not too annoying; it just seems to me that the problem is serious and this is a good opportunity to bring it up again =)


              • halik
                Junior Member
                • Jun 2007
                • 7

                #8
                Originally posted by Alexei
                I believe this has already been fixed, so that nodata(N) does not report missing data after a server restart. The logic is very simple:

                - if no data for N seconds and ZABBIX uptime is more than N seconds, then report NO DATA
                - report that there is some data otherwise
                imho the following code works like you described:

                Code:
                        if((CONFIG_SERVER_STARTUP_TIME + parameter > now)||(item->lastclock + parameter < now))
                        {
                                strcpy(value,"1");
                        }
                        else
                        {
                                strcpy(value,"0");
                        }
                but for me this one works better:
                Code:
                        if(item->lastclock + parameter < now)
                        {
                                strcpy(value,"1");
                        }
                        else
                        {
                                strcpy(value,"0");
                        }
                michal koslinski


                • Justin Freeman
                  Junior Member
                  • Jan 2009
                  • 18

                  #9
                  Another alternative solution

                  Another alternative solution is posted here: http://www.zabbix.com/forum/showthread.php?p=42674
