I have a system that sputters several times a day. It's an LDAP-like system (FreeIPA, to be exact) that loses its connection to a few of the servers it replicates with.
99.9999% of the time, when a given server loses its connection to its replication servers, it regains connectivity within a few seconds. The system is just flaky, and it causes too many false-alert pages.
Basically, Zabbix watches the log file for this error:
[XX/XXX/2016:17:31:28 -0X00] NSMMReplicationPlugin - agmt="cn=meToreplicationserver1.example.com" (replicationserver1:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Ticket expired))
When it reconnects, it looks like this:
[XX/XXX/2016:17:31:36 -0X00] NSMMReplicationPlugin - agmt="cn=meToreplicationserver1.example.com" (replicationserver1:389): Replication bind with GSSAPI auth resumed
Is there a way to define a trigger that does not fire if it finds a "Replication bind with GSSAPI auth resumed" entry for the same server that had the issue, e.g. replicationserver1.example.com? I should mention that when a server loses connectivity to its replication servers, there are multiple entries, so I need to match X1.example.com /failed|failure/ against X1.example.com /auth resume/ for each server.
The trigger should wait at least 2-3 minutes before firing, because the connection normally recovers within that time.
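To make the matching rule concrete, here is a rough sketch of the logic I have in mind, written in Python rather than Zabbix trigger syntax. It is only an illustration under a few assumptions: the names `LINE_RE`, `GRACE`, `update_state` and `servers_to_alert` are made up for this example, the 3-minute grace period is just the upper end of the window mentioned above, the failure is detected from the "auth failed" wording in the sample line, and the time a line was read is used instead of parsing the log timestamp.

```python
import re
from datetime import datetime, timedelta

# Pull the replication partner and the outcome out of lines shaped like the
# two samples above. The "cn=meTo" prefix is taken from those samples.
LINE_RE = re.compile(
    r'agmt="cn=meTo(?P<server>[^"]+)".*?'
    r'Replication bind with GSSAPI auth (?P<state>failed|resumed)'
)

# Assumed grace period before a failure is worth a page.
GRACE = timedelta(minutes=3)


def update_state(pending, line, seen_at):
    """Record a failure, or clear it when the same server resumes."""
    match = LINE_RE.search(line)
    if match is None:
        return  # not a replication bind message
    server = match.group("server")
    if match.group("state") == "failed":
        # Keep the time of the first failure; repeated entries do not reset it.
        pending.setdefault(server, seen_at)
    else:
        pending.pop(server, None)


def servers_to_alert(pending, now):
    """Servers that failed at least GRACE ago and have not resumed since."""
    return sorted(s for s, since in pending.items() if now - since >= GRACE)
```

A caller would keep one `pending` dict, feed every new log line to `update_state`, and periodically page on whatever `servers_to_alert(pending, datetime.now())` returns. What I am after is the equivalent behaviour expressed as a Zabbix trigger.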
Please help. I need sleep, and I don't want to be woken up for nothing.
Thanks!