Proper Zabbix trigger function(s) to check multiple boolean(ish) values ?

  • BNC
    Junior Member
    • Dec 2019
    • 3

    #1

    Hello,

    I posted a question on Stack Overflow last week, and since it didn't get much love, I'm trying my luck where the knowledge is.

    I'm reposting it here:

    I have two production networks (let's call them ZoneA and ZoneB) that are linked by a strongSwan tunnel through dedicated servers (ipsecA1, ipsecA2 and ipsecB1, ipsecB2), with keepalived managing the floating public and private IPs for each zone.
    All of this is monitored by Zabbix servers (as you probably have guessed already: ZabbixA and ZabbixB).
    Code:
        ZabbixA                  ZabbixB
        ╔═════════════╗         ╔═════════════╗
        ║  ┌───────┐  ║         ║  ┌───────┐  ║
        ║┌─┤ipsecA1│──╫─────────╫──┤ipsecB1├─┐║
        ║│ └───┬───┘  ║         ║  └───┬───┘ │║
        ║│ keepalived ║         ║ keepalived │║
        ║│ ┌───┴───┐  ║         ║  ┌───┴───┐ │║
        ║└─┤ipsecA2│  ║         ║  │ipsecB2├─┘║
        ║  └───────┘  ║         ║  └───────┘  ║
        ╚═════════════╝         ╚═════════════╝
    Also, ipsecA1 is the "master" (for lack of a better description): it's the one initiating the reauthentications.


    Goal

    We want to have an alarm popping up (on each zone), when the following points are not met:
    • For the primary servers (ipsecx1):
      • The IPsec tunnel is up AND Private IP is present AND Public IP is present
    • For the secondary servers (ipsecx2):
      • The IPsec tunnel is down AND Private IP is missing AND Public IP is missing

    To sum it up: if the 3 checks on a given machine don't all have the same value, then there's something wrong.
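    To make that rule concrete, here is a minimal Python sketch of the "all three checks agree" condition. The function name needs_alarm is made up for illustration, and it assumes each check only ever yields 0 or 1.

```python
from itertools import product


def needs_alarm(tunnel_up: int, private_ip: int, public_ip: int) -> bool:
    """Return True when the three 0/1 checks of one server disagree."""
    return not (tunnel_up == private_ip == public_ip)


# Walk through all 8 combinations of the three checks.
for combo in product((0, 1), repeat=3):
    print(combo, "ALARM" if needs_alarm(*combo) else "ok")
```

    Only (0, 0, 0) and (1, 1, 1) come out as "ok". Note that this rule on its own cannot tell a healthy secondary from a primary that has lost everything, since both read 0, 0, 0.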


    Configuration (servers-side)

    The IPsec tunnel is checked by a script that returns 1 if the tunnel is up (grep for ESTABLISHED on ipsec status), 0 if it's not.
    Public and private floating IPs are checked in the same fashion.
    Scripts and .conf files are configured properly, and the relevant template/application/items have been created on both Zabbix servers: items do show the proper statuses.
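    For reference, the tunnel check boils down to something like the sketch below. This is not the actual script, just a Python approximation of "grep for ESTABLISHED on ipsec status"; the function names are hypothetical.

```python
import subprocess


def tunnel_state(status_output: str) -> int:
    """Return 1 if any line of the status text mentions ESTABLISHED, else 0."""
    return int(any("ESTABLISHED" in line for line in status_output.splitlines()))


def read_ipsec_status() -> str:
    """Run `ipsec status` and return its output ('' if it cannot be run)."""
    try:
        result = subprocess.run(
            ["ipsec", "status"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


if __name__ == "__main__":
    # The item value handed to Zabbix is the single digit printed here.
    print(tunnel_state(read_ipsec_status()))
```

    Keeping the parsing in its own function makes it possible to test the 0/1 decision without a live tunnel.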


    Configuration (Zabbix-side)

    The triggers are configured as follows on both zones:
    Code:
    ({ipsec_server:ipsec.status.last(3m)}<>{ipsec_server:keepalived.vip.private.last(3m)})
    or ({ipsec_server:ipsec.status.last(3m)}<>{ipsec_server:keepalived.vip.public.last(3m)})
    or ({ipsec_server:keepalived.vip.private.last(3m)}<>{ipsec_server:keepalived.vip.public.last(3m)})
    On ipsecA1, there's also a trigger solely for checking the tunnel status ipsec.status.last(3m), but it's there only until the main issue is solved.

    PS: I'm not sure it's relevant here, but ZabbixA is v2.4.7, and ZabbixB is v3.4.11.


    Main issue

    Every now and then, on ipsecA1, both triggers will fire alarms, with the recovery being issued within seconds. Nothing is triggered on ZabbixB.
    Most of the time, there's nothing in the logs to distinguish reauthentications that triggered an alarm from the ones that didn't.

    The loglevels have been changed to hopefully find out what's going on, to no avail:
    Code:
    /var/log/charon.log {
        time_format = %b %e %T
        append = yes
        default = 1
    }
    stderr {
        ike = 2
        knl = 3
        net = 2
        dmn = 2
        mgr = 2
        job = 2
        ike_name = yes
    }
    The reason each trigger function is .last(3m) is that I thought/hoped Zabbix would check all three items' statuses (ipsec.status, keepalived.vip.public and keepalived.vip.private), and that if any of them deviated within the last 3 minutes, the trigger would go off.
    Turns out, it was not the best idea...
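    To illustrate the gap between what I expected and what apparently happens: as far as I can tell from the Zabbix documentation for these versions, a time period passed to last() is ignored and only the newest value is used. The Python sketch below contrasts that with a real "did anything deviate in the window" check. The sample history and the helper names are invented for the example.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (timestamp in seconds, value 0 or 1)


def last(samples: List[Sample]) -> int:
    """Newest value only; no time window is involved."""
    return max(samples, key=lambda s: s[0])[1]


def deviated_within(samples: List[Sample], now: int, seconds: int, expected: int) -> bool:
    """True if any value in the last `seconds` differs from `expected`."""
    return any(v != expected for t, v in samples if now - seconds < t <= now)


# Hypothetical history: one value every 30 s, with a single 0 blip at t=90.
history = [(0, 1), (30, 1), (60, 1), (90, 0), (120, 1), (150, 1)]
now = 150

print(last(history))                                    # newest value, blip invisible
print(deviated_within(history, now, 180, expected=1))   # blip still in the window
```

    With this history, last() reports 1 while the window check still sees the blip, which would explain alarms that recover as soon as the next value arrives.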

    There are already a lot of questions around about .last(x) being misused and then replaced by .avg(), .min() or some such, but all of the examples I've found were about analog (continuous) values.
    I couldn't find much about binary/boolean results, so I'm not sure the answers given for analog values apply here...


    Possible improvement

    Even if said main issue is solved, my trigger is not the best, and it can most likely be improved.
    I'm thinking about adding a ping to the other side of the tunnel as an additional condition to the trigger, to make sure that even if strongSwan says there's something wrong, we try to reach the other side of the tunnel to make sure.
    Once again, I'm not sure how that can be achieved.

    I'm open to any smart idea.
    Last edited by BNC; 09-12-2019, 17:20. Reason: Added tags
  • BNC
    Junior Member
    • Dec 2019
    • 3

    #2
    Thanks for your (valuable) input.

    You raise a lot of good points, but please allow me to review them to see if I understand them correctly...
    Also, for what it's worth, the update interval is the default: 30 seconds.
    • Adding the results is so simple I'm flabbergasted I didn't think about it (let's call that the learning curve !), yet it might not apply to what I need (see below)
    • For your proposed trigger to work, it looks like I'd have to use the servers' names with the keys, instead of the template's (else on each server the trigger will compare the addition of all results to both 3 and 0, which will always return 0). That would give us:
      • Code:
        (({ipsecA1:ipsec.status.last()} + {ipsecA1:keepalived.private.last()}
            + {ipsecA1:keepalived.public.last()}) <> 3)
        and (({ipsecA2:ipsec.status.last()} + {ipsecA2:keepalived.private.last()}
            + {ipsecA2:keepalived.public.last()}) <> 0)
        But that means I'll have to create items and triggers on each machine (as I can't call a server item directly from a template)...
        Doesn't that defeat the idea of having templates deploy items, triggers and such on servers ?
        Or am I missing something ?
    • Last but not least, I'm not sure I understand the part about nodata:
      • On one line you propose to use nodata(3m) on triggers, and the next line (and the example after that), you use last()...
        I'm a bit lost here
      • Also, if I understand nodata correctly (and I quote its wiki definition):
        • Code:
          Returns:
              1 - if no data received during the defined period of time
              0 - otherwise
        • Thing is, the item is getting data every 30 seconds, be it 0 or 1, so nodata(3m) will always return 0 (unless the server is down, I suppose)
          According to the wiki, it checks for any data, not new data
      • What I first understood from your sentence "So we add triggers with nodata(3m) on that items to get that info" was that it was possible to add trigger functions (such as nodata(3m), min(3m), max(3m), avg(3m)...) to items, and then call last() on the triggers to get something akin to what I expected last(3m) to do...
        Is that actually possible, or even desirable ?
        So far, just about all of my items are merely referencing the keys (like ipsec.status), and that's it...
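    For what it's worth, here is a quick brute-force check of the addition idea when it is applied to a single host (that is, with the template keys rather than per-server names). It is only a Python sketch and assumes every item returns strictly 0 or 1; the function names are made up.

```python
from itertools import product


def pairwise_mismatch(a: int, b: int, c: int) -> bool:
    """The original three-way comparison: any two values differ."""
    return a != b or a != c or b != c


def sum_mismatch(a: int, b: int, c: int) -> bool:
    """The addition variant on one host: the sum is neither 3 nor 0."""
    total = a + b + c
    return total != 3 and total != 0


# Compare both forms over every combination of three 0/1 values.
results = {
    combo: (pairwise_mismatch(*combo), sum_mismatch(*combo))
    for combo in product((0, 1), repeat=3)
}
for combo, (pairwise, summed) in results.items():
    print(combo, pairwise, summed)
```

    Under that 0/1 assumption the two forms agree on all eight combinations, so the single-host sum does not seem to be stuck at 0: it is true exactly for the mixed states (sum of 1 or 2).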
    Since the time I posted on SO, I've been trying a different, more targeted approach, for my trigger.
    At first I used min(3m) instead of the deceptive last(3m).
    I tested it on the master/primary, where items return 1 when OK, and it seemed fine... but obviously it wouldn't work on the secondary.
    I've then tried with delta(3m), but I don't know if it's the best function to use, and I didn't get much action on the servers to have a real-life test of my trigger.
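    To show why min(3m) only suits the primary while delta(3m) behaves the same on both servers, here is a small Python sketch. It assumes delta is simply max minus min over the window, and the four series are invented: six values each, i.e. 3 minutes at the 30-second interval.

```python
from typing import List


def window_min(values: List[int]) -> int:
    """Smallest value seen in the window."""
    return min(values)


def window_delta(values: List[int]) -> int:
    """Difference between the largest and smallest value in the window."""
    return max(values) - min(values)


# Hypothetical 3-minute windows.
series = {
    "primary steady":   [1, 1, 1, 1, 1, 1],
    "primary blip":     [1, 1, 0, 1, 1, 1],
    "secondary steady": [0, 0, 0, 0, 0, 0],
    "secondary blip":   [0, 0, 1, 0, 0, 0],
}

for name, values in series.items():
    min_alarm = window_min(values) != 1      # the min(3m)<>1 style of check
    delta_alarm = window_delta(values) != 0  # the delta(3m)<>0 style of check
    print(f"{name:17} min-based={min_alarm!s:5} delta-based={delta_alarm}")
```

    The min-based check fires on a perfectly healthy secondary, whereas the delta-based one only fires when a value flipped inside the window. The flip side is that delta drops back to 0 once the whole window holds the new state, even if that state is wrong.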

    I also added ping checks to both the IPs managed by keepalived, but I'm still wondering if it wouldn't be better to have them issued by the Zabbix server, rather than the clients themselves.

    Lastly, once the values that had caused an alarm aged out of the 3-minute window, the results settled on the faulty state and the alarm effectively disappeared, so I added the {TRIGGER.VALUE} combo (my IPsec master is on the Zabbix server 2.4.7) to keep it relevant.

    It was of course before you showed me the addition possibility, and it gave something like this (I exploded the expression to make it readable):
    Code:
    (
        {TRIGGER.VALUE}=0 and (
            (
                {Template IPsec:ipsec.status.delta(3m)}<>{Template IPsec:keepalived.private.delta(3m)}
            ) or (
                {Template IPsec:ipsec.status.delta(3m)}<>{Template IPsec:keepalived.public.delta(3m)}
            ) or (
                {Template IPsec:keepalived.private.delta(3m)}<>{Template IPsec:keepalived.public.delta(3m)}
            )
        ) and (
            (
                {Template IPsec:icmpping[pr.iv.ate.IP].min(3m)}<>1
            ) or (
                {Template IPsec:icmpping[pu.bl.ic.IP].min(3m)}<>1
            )
        )
    ) or (
        {TRIGGER.VALUE}=1 and (
            (
                {Template IPsec:icmpping[pr.iv.ate.IP].min(3m)}<>1
            ) or (
                {Template IPsec:icmpping[pu.bl.ic.IP].min(3m)}<>1
            )
        )
    )
    The issue with this trigger is that it won't go off on problems with the tunnel or the IPs (sometimes keepalived will detect something it doesn't like and enable the IPs on the secondary server, even if they are still present on the primary) unless the IPs are also down.
    I might have to make another trigger to specifically check if keepalived hasn't gone crazy.
    And probably even another to make sure the tunnel is OK on both sides (even if one side should be enough).
    I was hoping to have all of this in just one trigger, but is it even possible ?
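    To make the behaviour of that expression easier to see, here is its boolean skeleton as a Python sketch. The names are made up: mismatch stands for the three delta comparisons OR-ed together, and ping_failed for either icmpping min(3m) being different from 1.

```python
def next_trigger_value(current: int, mismatch: bool, ping_failed: bool) -> int:
    """Evaluate the two branches of the {TRIGGER.VALUE} expression.

    current     -- present trigger state (0 = OK, 1 = PROBLEM)
    mismatch    -- the three delta(3m) results are not all equal
    ping_failed -- at least one of the floating IPs did not answer
    """
    if current == 0:
        # Branch used to raise the alarm: both conditions are required.
        problem = mismatch and ping_failed
    else:
        # Branch used to keep the alarm: only the pings matter.
        problem = ping_failed
    return int(problem)


# Enumerate every input combination.
for current in (0, 1):
    for mismatch in (False, True):
        for ping_failed in (False, True):
            print(current, mismatch, ping_failed,
                  "->", next_trigger_value(current, mismatch, ping_failed))
```

    The row with current=0, mismatch=True, ping_failed=False stays at 0, which is exactly the blind spot described above: a keepalived or tunnel inconsistency never raises the alarm while both IPs still answer pings.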

    NB: the 3 minutes I use in my trigger functions is a side-effect of strongSwan sometimes going AWOL for the whole duration of the retransmission timeout, which is 165 secs.
    Since 98% of the time it seems to have no effect on the tunnel (I'll see with the team at strongSwan to clear that up), I set the 3 minutes to avoid unnecessary alarms.
    There is also the occasional blip that doesn't have an impact on the tunnel, but will trigger an alarm nonetheless, and the idea is to ignore them.
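    As a side note on where the 165 secs come from: the Python sketch below adds up the waits of an exponential retransmission schedule. The parameter values (4 s timeout, base 1.8, 5 tries) are my assumption of the strongSwan defaults, they are not taken from my configuration, and the function name is made up.

```python
def total_retransmit_time(timeout: float = 4.0, base: float = 1.8, tries: int = 5) -> float:
    """Sum of the waits before giving up: timeout * base**i for i = 0..tries."""
    return sum(timeout * base ** i for i in range(tries + 1))


total = total_retransmit_time()
print(round(total))  # 165

# Number of 30-second item updates that fit in the 3-minute trigger window.
print(180 // 30)  # 6
```

    With those assumed values the waits add up to roughly 165 seconds, which is why a 3-minute (180 s) window just covers one full retransmission cycle.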
    Last edited by BNC; 10-12-2019, 13:31. Reason: More stuff
