How to avoid generating transient errors that go away quickly

  • zabb3r
    Junior Member
    • Aug 2016
    • 3

    #1

    How to avoid generating transient errors that go away quickly

    I have a system that sputters several times a day. It's an LDAP-like system (FreeIPA, to be exact) that loses its connection to a few of the servers it replicates with.

    Almost every time (99.9999%) a server loses its connection to its replication servers, it regains connectivity within a few seconds. The system is just flaky, and it causes too many false alert pages.

    Basically, Zabbix watches the log file for this error:

    [XX/XXX/2016:17:31:28 -0X00] NSMMReplicationPlugin - agmt="cn=meToreplicationserver1.example.com" (replicationserver1:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Ticket expired))

    When it reconnects, it looks like this:

    [XX/XXX/2016:17:31:36 -0X00] NSMMReplicationPlugin - agmt="cn=meToreplicationserver1.example.com" (replicationserver1:389): Replication bind with GSSAPI auth resumed

    Is there a way to define a trigger that does not go off if it finds a "Replication bind with GSSAPI auth resumed" entry for the server that had the issue, e.g. replicationserver1.example.com? I should mention that when a server loses connectivity to its replication servers, there are multiple entries, so for each server I need to match X1.example.com /failed|failure/ against X1.example.com /auth resumed/.

    The trigger should wait at least 2-3 minutes before firing, because the connection normally recovers within that time.

    Please help. I need sleep, and I don't want to be woken up for nothing.
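
    To make the matching I'm after concrete, here is a minimal Python sketch of the logic, outside Zabbix. It is only an illustration, not a Zabbix feature: the function name, the 3-minute grace period, and the timestamp format (day/month-abbreviation/year, as in "12/Aug/2016:17:31:28 -0400") are assumptions, since the timestamps above are redacted.

```python
import re
from datetime import datetime, timedelta

# Pulls the timestamp, the replication partner and the outcome from a log line.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] NSMMReplicationPlugin - agmt="cn=meTo(?P<server>[^"]+)".*'
    r'Replication bind with GSSAPI auth (?P<state>failed|resumed)'
)
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"   # assumed format; the real log may differ
GRACE = timedelta(minutes=3)         # hypothetical wait before alerting


def unresolved_failures(lines, now, grace=GRACE):
    """Return servers that failed at least `grace` ago and have not resumed."""
    open_failures = {}  # server -> time of the first failure not yet resumed
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # unrelated log entry
        ts = datetime.strptime(m["ts"], TS_FORMAT)
        if m["state"] == "failed":
            # Keep the earliest failure; repeated entries do not reset the clock.
            open_failures.setdefault(m["server"], ts)
        else:
            # A "resumed" entry clears the pending failure for that server only.
            open_failures.pop(m["server"], None)
    return sorted(s for s, ts in open_failures.items() if now - ts >= grace)
```

    With this logic, a server that logs "failed" and then "resumed" a few seconds later is never reported, while one that stays failed past the grace period is.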

    Thanks!
  • guzzijason
    Senior Member
    • Dec 2015
    • 106

    #2
    Can you post the trigger expression that you are using currently? I'm thinking it can probably be altered to make it less sensitive, but it would help to see how it looks now.

    __Jason

    • zabb3r
      Junior Member
      • Aug 2016
      • 3

      #3
      Sure thing, here it is.

      Trigger:

      (({Template IPA Services:log[/var/log/dirsrv/slapd-EXAMPLE-COM/errors].regexp(NSMMReplicationPlugin - agmt=.* Replication bind with GSSAPI auth failed)})<>0)

      Multiple PROBLEM events generation: ON

      Let me know if you need more information.

      • guzzijason
        Senior Member
        • Dec 2015
        • 106

        #4
        Ah, yes... that does seem like a tricky problem.

        I think your best course may be to set up a new action specific to this trigger, and then use the escalation feature to "delay" the action. Check out "Example 2" on this page:



        In that example, the operation is doing nothing on the first step, and waiting for step #2 before executing the action. If the trigger resets before step #2 happens, then no action is performed.

        You could also make this the standard behavior of all your actions, rather than setting up a specific action for this trigger. That way, you would quash all transient alerts.
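
        As a rough illustration of why the delay suppresses short blips, here is a small Python model of that two-step escalation. It is a sketch of the idea only, not Zabbix code: the ProblemEvent class, the function name, and the 180-second step duration are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemEvent:
    started: int                      # when the trigger went to PROBLEM (seconds)
    recovered: Optional[int] = None   # when it went back to OK, None if still open


def notification_sent_by(event: ProblemEvent, now: int, step_duration: int = 180) -> bool:
    """True if step 2 was reached while the problem was still open.

    Step 1 has no operations, so nothing is sent when the problem starts.
    Step 2 runs `step_duration` seconds later, and only if the trigger
    has not recovered before then.
    """
    step2_at = event.started + step_duration
    if now < step2_at:
        return False  # still waiting in the empty first step
    return event.recovered is None or event.recovered > step2_at
```

        A failure that recovers after 8 seconds never reaches step 2, so nobody is paged; a failure that is still open after the step duration is notified as usual.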

        __Jason

        • zabb3r
          Junior Member
          • Aug 2016
          • 3

          #5
          Originally posted by guzzijason
          Ah, yes... that does seem like a tricky problem.
          You could also make this the standard behavior of all your actions, rather than setting up a specific action for this trigger. That way, you would quash all transient alerts.

          __Jason
          __Jason, I'm working on this right now. I'll keep you posted on how I make out with it.

          Thanks for the help.
