systemd units monitoring

  • StCyr
    Junior Member
    • Jan 2023
    • 6

    #1

    systemd units monitoring

    Hello,

    I'm not the only one who is dubious about the current systemd integration: it generates a lot of false positives (see e.g. https://www.zabbix.com/forum/zabbix-...false-positive).

    I have the impression that these false positives come from the fact that the trigger is based on the "active" attribute of a systemd unit (i.e. when a systemd unit is "inactive", Zabbix will trigger), while it's perfectly normal to have inactive systemd units (IINM, a lot of systemd units just run once at boot time and then become inactive).

    So, my question and suggestion is: why not make the integration trigger on the "failed" attribute of systemd units? Is there a particular reason why you don't use this information, or can it be used to improve this monitoring?

    Best regards,

    Cyrille

    Additional info:
    ==========

    On the system where I'm testing the template, I have 1 failed systemd unit and 114 inactive systemd units:

    root@deus:/etc# systemctl list-units --state=failed | tail -n 1
    1 loaded units listed.
    root@deus:/etc# systemctl list-units --state=inactive | tail -n 2
    114 loaded units listed.
    To show all installed unit files use 'systemctl list-unit-files'.

    Strangely enough, Zabbix triggers on "only" 19 inactive systemd units.
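
To make the failed vs. inactive distinction above concrete, here is a minimal Python sketch that tallies units by their ACTIVE column. It assumes the whitespace-separated column layout (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) printed by `systemctl list-units --all --plain --no-legend`. The sample rows and the helper name `count_active_states` are made up for illustration and are not taken from the system above.

```python
from collections import Counter

# Hypothetical sample, shaped like the output of
# `systemctl list-units --all --plain --no-legend`.
# Columns: UNIT LOAD ACTIVE SUB DESCRIPTION
SAMPLE = """\
demo-daemon.service loaded active running Hypothetical long-running daemon
demo-boot-task.service loaded inactive dead Hypothetical one-time boot task
demo-cleanup.service loaded inactive dead Hypothetical cleanup task
demo-broken.service loaded failed failed Hypothetical failing unit
"""


def count_active_states(listing: str) -> Counter:
    """Count units per value of the ACTIVE column (third field)."""
    counts = Counter()
    for line in listing.splitlines():
        # Split into at most 5 fields so the description stays in one piece.
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(count_active_states(SAMPLE).items()):
        print(f"{state}: {count}")
```

On a real host the listing could be captured with `subprocess.run([...], capture_output=True, text=True)` and passed to the same function.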
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    If you look at the discovery filter, a unit has to be "active" and "enabled". https://www.zabbix.com/integrations/systemd#systemd
    I am not using it myself, but if you try to match those conditions, would the results be more realistic?


    • StCyr
      Junior Member
      • Jan 2023
      • 6

      #3
      I've studied systemd a little bit and I'm quite sure the issue is that services can be enabled and inactive (services of Type=oneshot, for example).

      From what I understand of https://github.com/zabbix/zabbix/blo...emd/systemd.go, this attribute (the service Type) is not taken into account when deciding whether a unit should be active or not (at line https://github.com/zabbix/zabbix/blo...ystemd.go#L250 IINM).

      Regarding my original question ("Why not make the integration trigger on the "failed" attribute of systemd units?"): this information is not good enough, as it only covers issues during the service's startup, not issues occurring during the life of the service (e.g. if you kill the service's process, the unit will become inactive, not failed).

      I'd like to provide a patch for the plugin, but I have no idea how Zabbix plugin development/debugging works...

      Best regards,

      Cyrille


      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4807

        #4
        Well, maybe I should have been more precise with my wording... "has to be enabled and active at the time of discovery". So if a unit is enabled but inactive at the time of discovery, it should not even be discovered.
        But I think you don't need to change the plugin; it just reports the statuses of discovered units. You can change the triggers from the GUI according to your own needs. The out-of-the-box ones are "best practice" and "these are examples" types of things anyway; they don't have to match everyone's needs.
        You can also tweak the discovery conditions as you need without changing the plugin. It returns everything anyway; it's the filtering in the template that either passes a unit or throws it away...
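
The "plugin returns everything, template filters" split described above can be sketched roughly as follows. This is not the template's actual code. The LLD macro names (`{#UNIT.NAME}`, `{#UNIT.ACTIVESTATE}`, `{#UNIT.UNITFILESTATE}`) are assumed to be the ones the template filters on, and the rows and function names are hypothetical. One thing worth knowing when writing such regular-expression conditions: an unanchored pattern such as `active` also matches the string `inactive`, so the sketch anchors its patterns. This thread does not establish whether the shipped filter values are anchored.

```python
import re

# Hypothetical discovery rows; macro names are an assumption, see above.
ROWS = [
    {"{#UNIT.NAME}": "demo-running.service",
     "{#UNIT.ACTIVESTATE}": "active", "{#UNIT.UNITFILESTATE}": "enabled"},
    {"{#UNIT.NAME}": "demo-oneshot.service",
     "{#UNIT.ACTIVESTATE}": "inactive", "{#UNIT.UNITFILESTATE}": "enabled"},
    {"{#UNIT.NAME}": "demo-disabled.service",
     "{#UNIT.ACTIVESTATE}": "active", "{#UNIT.UNITFILESTATE}": "disabled"},
]

# Anchored conditions: the value must be exactly "active" / "enabled".
ANCHORED = {
    "{#UNIT.ACTIVESTATE}": r"^active$",
    "{#UNIT.UNITFILESTATE}": r"^enabled$",
}

# Unanchored variant, shown only to illustrate the substring pitfall.
UNANCHORED = {
    "{#UNIT.ACTIVESTATE}": r"active",
    "{#UNIT.UNITFILESTATE}": r"enabled",
}


def passes_filter(row: dict, conditions: dict) -> bool:
    """True if every macro in `conditions` matches its regex (AND logic)."""
    return all(
        re.search(pattern, row.get(macro, "")) is not None
        for macro, pattern in conditions.items()
    )


def discovered_names(rows: list, conditions: dict) -> list:
    """Names of the units that survive the filter, in input order."""
    return [row["{#UNIT.NAME}"] for row in rows if passes_filter(row, conditions)]


if __name__ == "__main__":
    print("anchored:  ", discovered_names(ROWS, ANCHORED))
    print("unanchored:", discovered_names(ROWS, UNANCHORED))
```

With the anchored conditions only the running, enabled unit passes; with the unanchored ones the inactive unit passes as well.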


        • StCyr
          Junior Member
          • Jan 2023
          • 6

          #5
          >Well, maybe I should have been more precise with wording... "has to be enabled and active at the time of discovery".. So if it is enabled but inactive at the time of discovery, it should not even be discovered

          It doesn't look like it's working this way: on all my servers, about 20 services are discovered and create problems because they are not running:

          [Image: systemd.png]

          If I look at these units' types, most of them are of type "oneshot":

          root@deus:/usr/lib/systemd/system# grep Type systemd-pstore.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type e2scrub_reap.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type lxd-agent.service
          Type=simple
          root@deus:/usr/lib/systemd/system# grep Type lxd-agent-9p.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type dmesg.service
          Type=idle
          root@deus:/usr/lib/systemd/system# grep Type ondemand.service
          Type=idle
          root@deus:/usr/lib/systemd/system# grep Type snapd.aa-prompt-listener.service
          Type=simple
          root@deus:/usr/lib/systemd/system# grep Type thermald.service
          Type=dbus
          root@deus:/usr/lib/systemd/system# grep Type ubuntu-advantage.service
          Type=notify
          root@deus:/usr/lib/systemd/system# grep Type grub-common.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type ua-reboot-cmds.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type rsync.service
          root@deus:/usr/lib/systemd/system# grep Type snapd.recovery-chooser-trigger.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system# grep Type grub-initrd-fallback.service
          Type=oneshot
          root@deus:/usr/lib/systemd/system#

          I understand that I can blacklist these services, and that's certainly what I'll do for the time being, but I'm fairly convinced there's room for improvement in the plugin itself.
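
The `grep Type` checks above can be approximated in Python. This is a minimal sketch: it reads only the `[Service]` section of a single unit file's text and ignores drop-ins and overrides. When no `Type=` line is present (as with rsync.service above) it falls back to `simple`, which is systemd's documented default when `ExecStart=` is set and neither `Type=` nor `BusName=` is given. The function name and both sample unit texts are hypothetical.

```python
def unit_service_type(unit_text: str) -> str:
    """Return the Type= value from the [Service] section of a unit file.

    Simplification: falls back to "simple" when Type= is absent, which is
    only systemd's default if ExecStart= is set and BusName= is not.
    """
    section = None
    service_type = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "Type":
                service_type = value.strip()  # a later assignment wins
    return service_type or "simple"


# Hypothetical unit file contents, for illustration only.
ONESHOT_UNIT = """\
[Unit]
Description=Hypothetical boot-time task

[Service]
Type=oneshot
ExecStart=/usr/bin/true
"""

NO_TYPE_UNIT = """\
[Unit]
Description=Hypothetical daemon without an explicit Type

[Service]
ExecStart=/usr/bin/sleep infinity
"""

if __name__ == "__main__":
    print(unit_service_type(ONESHOT_UNIT))
    print(unit_service_type(NO_TYPE_UNIT))
```

A `oneshot` unit without `RemainAfterExit=yes` returns to the inactive state once its command has finished, which is why such units show up as "not running" even when nothing is wrong.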


          • cyber
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Dec 2006
            • 4807

            #6
            yea.. that looks weird... I guess I need to play a bit with that agent2...


            • galoxucro
              Junior Member
              • May 2018
              • 5

              #7
              Zabbix systemd.unit.discovery should support an LLD macro {#UNIT.TYPE}; then it would be possible to filter out units of types 'idle', 'oneshot' and 'dbus'.

              Please vote here: ZBXNEXT-8571
              Last edited by galoxucro; 14-07-2023, 19:11.


              • Flow
                Junior Member
                • Nov 2021
                • 3

                #8
                I was wondering the same as OP: why does Zabbix only report services that are not running, but not services that are in the failed state?

                There seem to be very valid reasons why a service is not running. Hence the current behavior causes a lot of false positives. Instead, I would expect a monitoring solution to report as soon as a systemd unit enters the failed state.

                Originally posted by StCyr
                I've studied systemd a little bit and I'm quite sure the issue is that services can be enabled and inactive (services of Type=oneshot for example).
                Regarding my original question ("Why not making the integration trigger on the "failed" attribute of systemd units?"). This information is not good enough as it only covers issues during the service's startup, not issues occurring during the life of the service (eg: if you kill the service's process the unit will become inactive, not failed).
                Cyrille
                You are right that a systemd service does not enter the failed state if you kill its processes. However, that is sensible: it only happens if a term/kill signal is sent to a process by an entity which is authorized to do so.

                I do not see how the "systemd unit failed" "information is not good enough". It is exactly what you want to see reported.

                So why do we have a systemd trigger prototype for non-running units, but no trigger prototype for failed units? I have created https://support.zabbix.com/browse/ZBXNEXT-8767 for this.
                Last edited by Flow; 18-10-2023, 21:45.
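
The two alerting policies debated in this thread ("unit is not active" vs. "unit is failed") can be compared with a small Python sketch. It is only an illustration of the logic, not a Zabbix trigger expression. The unit names, the state snapshot and the policy labels `not_active` and `failed` are all hypothetical.

```python
# Hypothetical snapshot: unit name -> systemd ActiveState.
STATES = {
    "demo-web.service": "active",
    "demo-boot-task.service": "inactive",  # e.g. a oneshot that has finished
    "demo-broken.service": "failed",
}


def units_in_problem(states: dict, policy: str) -> list:
    """Return the sorted unit names that would raise a problem.

    policy "not_active": alert on every unit whose state is not "active".
    policy "failed":     alert only on units whose state is "failed".
    """
    if policy == "not_active":
        return sorted(name for name, state in states.items() if state != "active")
    if policy == "failed":
        return sorted(name for name, state in states.items() if state == "failed")
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for policy in ("not_active", "failed"):
        print(policy, units_in_problem(STATES, policy))
```

Under the `not_active` policy the finished boot task is reported alongside the broken unit; under the `failed` policy only the broken unit is.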
