Hello,
I posted a question on Stack Overflow last week, and since it didn't get much love, I'm trying my luck where the knowledge is.

I'm reposting it here:
I have two production networks (let's call them ZoneA and ZoneB) that are linked by a strongSwan tunnel through dedicated servers (ipsecA1, ipsecA2 and ipsecB1, ipsecB2), with keepalived managing the floating public and private IPs for each zone.
All of this is monitored by Zabbix servers (as you probably have guessed already: ZabbixA and ZabbixB).
Code:
    ZabbixA                 ZabbixB
╔═════════════╗         ╔═════════════╗
║  ┌───────┐  ║         ║  ┌───────┐  ║
║┌─┤ipsecA1│──╫─────────╫──┤ipsecB1├─┐║
║│ └───┬───┘  ║         ║  └───┬───┘ │║
║│ keepalived ║         ║ keepalived │║
║│ ┌───┴───┐  ║         ║  ┌───┴───┐ │║
║└─┤ipsecA2│  ║         ║  │ipsecB2├─┘║
║  └───────┘  ║         ║  └───────┘  ║
╚═════════════╝         ╚═════════════╝
Goal
We want an alarm to pop up (in each zone) when the following conditions are not met:
- For the primary servers (ipsecx1):
  - The IPsec tunnel is up AND the private IP is present AND the public IP is present
- For the secondary servers (ipsecx2):
  - The IPsec tunnel is down AND the private IP is missing AND the public IP is missing
To sum it up: if the 3 checks on a machine don't all return the same value, then there's something wrong.
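To make the rule concrete, here is a minimal Python sketch of the "all three checks agree" condition. The function names are made up for illustration; the real logic lives in the Zabbix trigger, not in a script.

```python
def checks_agree(tunnel_up: int, private_ip: int, public_ip: int) -> bool:
    """Return True when the three 0/1 checks all have the same value."""
    return tunnel_up == private_ip == public_ip


def needs_alarm(tunnel_up: int, private_ip: int, public_ip: int) -> bool:
    """Alarm whenever the three checks on one machine disagree."""
    return not checks_agree(tunnel_up, private_ip, public_ip)


# Active node (all 1) and standby node (all 0) are both healthy states.
print(needs_alarm(1, 1, 1))  # False
print(needs_alarm(0, 0, 0))  # False
# Tunnel up but private IP missing: inconsistent, so alarm.
print(needs_alarm(1, 0, 1))  # True
```

Note that this rule does not care which node is currently active, so a keepalived failover to ipsecx2 is still a healthy state as long as the three values move together.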
Configuration (server-side)
The IPsec tunnel is checked by a script that returns 1 if the tunnel is up (grep for ESTABLISHED on ipsec status), 0 if it's not.
Public and private floating IPs are checked in the same fashion.
Scripts and .conf files are configured properly, and the relevant template/application/items have been created on both Zabbix servers: items do show the proper statuses.
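For reference, the tunnel check described above could look roughly like this in Python. This is only a sketch of the idea (look for ESTABLISHED in the output of ipsec status), not the actual script in use; the function names and the 10-second timeout are assumptions.

```python
import subprocess


def tunnel_is_up(status_output: str) -> int:
    """Return 1 if any line of the status output contains ESTABLISHED, else 0."""
    return int(any("ESTABLISHED" in line for line in status_output.splitlines()))


def read_ipsec_status() -> str:
    """Run `ipsec status` and return its stdout (empty string if it cannot run)."""
    try:
        result = subprocess.run(
            ["ipsec", "status"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


if __name__ == "__main__":
    # Zabbix reads the single 0/1 value printed on stdout.
    print(tunnel_is_up(read_ipsec_status()))
```

Keeping the parsing separate from the command call makes it possible to test the parsing on captured output without a running strongSwan.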
Configuration (Zabbix-side)
The triggers are configured as follows in both zones:
Code:
({ipsec_server:ipsec.status.last(3m)}<>{ipsec_server:keepalived.vip.private.last(3m)})
or ({ipsec_server:ipsec.status.last(3m)}<>{ipsec_server:keepalived.vip.public.last(3m)})
or ({ipsec_server:keepalived.vip.private.last(3m)}<>{ipsec_server:keepalived.vip.public.last(3m)})
On ipsecA1, there's also a trigger that only checks the tunnel status, ipsec.status.last(3m), but it's there only until the main issue is solved.
PS: I'm not sure it's relevant here, but ZabbixA is v2.4.7, and ZabbixB is v3.4.11.
Main issue
Every now and then, on ipsecA1, both triggers will fire alarms, with the recovery being issued within seconds. Nothing is triggered on ZabbixB.
Most of the time, there's nothing in the logs to distinguish reauthentications that triggered an alarm from the ones that didn't.
The loglevels have been changed to hopefully find out what's going on, to no avail:
Code:
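/var/log/charon.log {
    time_format = %b %e %T
    append = yes
    default = 1
}
stderr {
    ike = 2
    knl = 3
    net = 2
    dmn = 2
    mgr = 2
    job = 2
    ike_name = yes
}
The reason each trigger function is .last(3m) is that I thought/hoped Zabbix would check all three items' statuses (ipsec.status, keepalived.vip.public and keepalived.vip.private), and that if any of them deviated within the last 3 minutes, the trigger would go off.
Turns out, it was not the best idea...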
There are already a lot of questions around about .last(x) being misused and then replaced by .avg(), .min() or some such, but all of the examples I've found dealt with analog (continuous) values.
I couldn't find much about binary/boolean results, and thus I'm not sure the answers for analog values apply here...
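For what it's worth, the usual aggregates keep a simple meaning on 0/1 values: the minimum over a window is 1 only if every sample was 1, the maximum is 0 only if every sample was 0, and the average is the fraction of samples that were 1. The Python sketch below only illustrates that arithmetic on made-up sample windows; it is not a Zabbix expression, and mapping it onto .min()/.max() trigger functions is an assumption that would need checking against the Zabbix documentation for each server version.

```python
def window_summary(samples):
    """Summarize a window of 0/1 samples (oldest first)."""
    return {
        "last": samples[-1],                 # only the most recent value
        "min": min(samples),                 # 1 only if every sample was 1
        "max": max(samples),                 # 0 only if every sample was 0
        "avg": sum(samples) / len(samples),  # fraction of samples equal to 1
    }


def steadily_different(a, b):
    """True if one item was 0 for the whole window while the other was 1."""
    return (max(a) == 0 and min(b) == 1) or (min(a) == 1 and max(b) == 0)


# Hypothetical 3-minute windows, one sample every 30 seconds.
tunnel_blip = [1, 1, 1, 0, 1, 1]  # a single bad sample, e.g. during a reauthentication
tunnel_down = [0, 0, 0, 0, 0, 0]  # tunnel down for the whole window
vip_present = [1, 1, 1, 1, 1, 1]  # floating IP held throughout

print(steadily_different(tunnel_blip, vip_present))  # False
print(steadily_different(tunnel_down, vip_present))  # True
```

The trade-off is latency: a comparison based only on the latest values reacts to any momentary disagreement between items, while a window-based one stays quiet until the disagreement has lasted the whole window.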
Possible improvement
Even if said main issue is solved, my trigger is not the best, and it can most likely be improved.
I'm thinking about adding a ping to the other side of the tunnel as an additional condition to the trigger, to make sure that even if strongSwan says there's something wrong, we try to reach the other side of the tunnel to make sure.
Once again, I'm not sure how that can be achieved.
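As a rough illustration of the idea, the sketch below combines the tunnel status with a single ping to the far side and only raises an alarm when both say something is wrong. Everything here is hypothetical: the function names, the 2-second timeout, and the use of the Linux ping flags -c and -W are assumptions, and in practice the ping could just as well be a separate Zabbix item added to the trigger condition.

```python
import subprocess


def peer_reachable(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request with the system ping (Linux iputils flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tunnel_alarm(ipsec_status: int, ping_ok: bool) -> bool:
    """Alarm only if strongSwan reports the tunnel down AND the peer is unreachable."""
    return ipsec_status == 0 and not ping_ok


# strongSwan says "down" but the other side still answers: no alarm.
print(tunnel_alarm(0, True))   # False
# strongSwan says "down" and the other side is silent: alarm.
print(tunnel_alarm(0, False))  # True
```

The ping target would have to be an address that is only reachable through the tunnel, otherwise the check says nothing about the tunnel itself.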
I'm open to any smart idea.