This is a rather lengthy post but I'm trying to provide as much info as possible.
We use systemd mounts on several of our Linux AWS instances, targeting either EFS or CIFS network shares. We have run into some instances where a server fails to mount the network share for various reasons. The problem is that we currently don't have a way to get alerts when these network shares are down on the clients, so we are often reacting only after a consumer brings it to our attention.

We currently use the 'Systemd by Zabbix agent 2' template on our Linux hosts to monitor systemd services, so I set out to see how to leverage that template for systemd mounts. The template, in its current incarnation, discovers systemd service and socket units, so the first order of business was to discover mount units. I was able to figure that one out fairly quickly and modeled it after the service unit discovery, so I now have a discovery rule for mount units. The first item prototype I set up in this discovery targets the following key in order to get the Zabbix raw items:
Code:
systemd.unit.get["{#UNIT.NAME}",Mount]
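(For anyone trying to reproduce this: the discovery rule itself is just the agent 2 systemd plugin's discovery key with the unit type switched from services to mounts, something like the line below. Double-check the exact parameter value against the Zabbix agent 2 documentation for your version.)
Code:
systemd.unit.discovery[mount]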
Zabbix discovers the mount units and gets the following raw data (formatted for easy viewing):
Code:
{
"AmbientCapabilities": 0,
"AppArmorProfile": [false, ""],
"BlockIOAccounting": false,
"BlockIODeviceWeight": [],
"BlockIOReadBandwidth": [],
"BlockIOWeight": 18446744073709551615,
"BlockIOWriteBandwidth": [],
"CPUAccounting": false,
"CPUAffinity": "",
"CPUQuotaPerSecUSec": 18446744073709551615,
"CPUSchedulingPolicy": 0,
"CPUSchedulingPriority": 0,
"CPUSchedulingResetOnFork": false,
"CPUShares": 18446744073709551615,
"Capabilities": "",
"CapabilityBoundingSet": 18446744073709551615,
"ControlGroup": "/system.slice/mnt-efs.mount",
"ControlPID": 0,
"Delegate": false,
"DeviceAllow": [],
"DevicePolicy": "auto",
"DirectoryMode": 493,
"Environment": [],
"EnvironmentFiles": [],
"ExecMount": [
[
"/bin/mount",
[
"/bin/mount",
"<filesystemID>.efs.us-east-1.amazonaws.com:/",
"/mnt/efs",
"-t",
"efs",
"-o",
"rw,user"
],
false,
1657913026954499,
72937867165,
1657913028376711,
72939289376,
29696,
1,
0
]
],
"ExecRemount": [],
"ExecUnmount": [],
"Group": "",
"IOScheduling": 0,
"IgnoreSIGPIPE": true,
"InaccessibleDirectories": [],
"KillMode": "control-group",
"KillSignal": 15,
"LazyUnmount": false,
"LimitAS": 18446744073709551615,
"LimitCORE": 18446744073709551615,
"LimitCPU": 18446744073709551615,
"LimitDATA": 18446744073709551615,
"LimitFSIZE": 18446744073709551615,
"LimitLOCKS": 18446744073709551615,
"LimitMEMLOCK": 65536,
"LimitMSGQUEUE": 819200,
"LimitNICE": 0,
"LimitNOFILE": 4096,
"LimitNPROC": 31448,
"LimitRSS": 18446744073709551615,
"LimitRTPRIO": 0,
"LimitRTTIME": 18446744073709551615,
"LimitSIGPENDING": 31448,
"LimitSTACK": 18446744073709551615,
"MemoryAccounting": false,
"MemoryCurrent": 18446744073709551615,
"MemoryLimit": 18446744073709551615,
"MountFlags": 0,
"Nice": 0,
"NoNewPrivileges": false,
"NonBlocking": false,
"OOMScoreAdjust": 0,
"Options": "rw,nosuid,nodev,noexec,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.134.37.172,local_lock=none,addr=10.134.37.105,user",
"PAMName": "",
"PassEnvironment": [],
"Personality": "",
"PrivateDevices": false,
"PrivateNetwork": false,
"PrivateTmp": false,
"ProtectHome": "no",
"ProtectSystem": "no",
"ReadOnlyDirectories": [],
"ReadWriteDirectories": [],
"RestrictAddressFamilies": [false, []],
"Result": "success",
"RootDirectory": "",
"RuntimeDirectory": [],
"RuntimeDirectoryMode": 493,
"SELinuxContext": [false, ""],
"SameProcessGroup": true,
"SecureBits": 0,
"SendSIGHUP": false,
"SendSIGKILL": true,
"Slice": "system.slice",
"SloppyOptions": false,
"SmackProcessLabel": [false, ""],
"StandardError": "inherit",
"StandardInput": "null",
"StandardOutput": "journal",
"StartupBlockIOWeight": 18446744073709551615,
"StartupCPUShares": 18446744073709551615,
"SupplementaryGroups": [],
"SyslogIdentifier": "",
"SyslogLevelPrefix": true,
"SyslogPriority": 30,
"SystemCallArchitectures": [],
"SystemCallErrorNumber": 0,
"SystemCallFilter": [false, []],
"TTYPath": "",
"TTYReset": false,
"TTYVHangup": false,
"TTYVTDisallocate": false,
"TasksAccounting": false,
"TasksCurrent": 18446744073709551615,
"TasksMax": 18446744073709551615,
"TimeoutUSec": 90000000,
"TimerSlackNSec": 50000,
"Type": "nfs4",
"UMask": 18,
"User": "",
"UtmpIdentifier": "",
"What": "<filesystemID>.efs.us-east-1.amazonaws.com:/",
"Where": "/mnt/efs",
"WorkingDirectory": ""
}
From there I created a dependent item prototype to grab the value of the mount point from 'Where' in the JSON output above, and that correctly gets discovered and populated in the Zabbix client, in the case above as /mnt/efs.
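(For anyone following along, that dependent item is nothing fancy: it is just a JSONPath preprocessing step applied to the systemd.unit.get master item above. The dependent item key name below is made up; only the preprocessing expression matters.)
Code:
Master item:     systemd.unit.get["{#UNIT.NAME}",Mount]
Dependent item:  systemd.mount.where["{#UNIT.NAME}"]    (key name is arbitrary)
Preprocessing:   JSONPath: $.Where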
The thing I'm struggling with right now is the trigger to alert on. Unlike systemd service units, the raw output returned by systemd mount units doesn't include values for the active state of the unit, which could tell me whether the unit is running or not. I was hoping I could do something like running a stat command such as:
Code:
stat -f --format="%T" <path from mountpoint value in discovered dependent item>
and trigger an alert if the value returned is anything but cifs or nfs. However, I'm having a hard time understanding how to accomplish this in Zabbix even after extensive searching, as this needs to happen on the monitored instance. Hence my post.
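To make the above a bit more concrete, here is the rough shape of what I have in mind, purely as a sketch with made-up key names, a placeholder host name, and a made-up LLD macro for the mount point; it is not something I have working:
Code:
# On the monitored host, a custom UserParameter wrapping the stat call
# (e.g. in /etc/zabbix/zabbix_agent2.d/mounts.conf):
UserParameter=custom.mount.fstype[*],stat -f --format="%T" "$1"

# Item prototype on the Zabbix side, passing in the discovered mount point
# ({#MOUNT.WHERE} is hypothetical; it would have to be exposed by the discovery):
custom.mount.fstype["{#MOUNT.WHERE}"]

# Trigger prototype expression, firing when the returned type is neither nfs nor cifs:
last(/My host/custom.mount.fstype["{#MOUNT.WHERE}"])<>"nfs" and last(/My host/custom.mount.fstype["{#MOUNT.WHERE}"])<>"cifs"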