Future of Zabbix - service-oriented items
  • otheus
    Member
    • Mar 2009
    • 53

    #1

    Future of Zabbix - service-oriented items

    In the 3.0 Release notes, I saw an interesting item (paraphrasing):

    zabbix_agent support will be dropped because apparently no one was using inetd.
    I find this ironic given that zabbix_agent was the only solution to a peculiar problem that has evolved within the last two years, and is coming down the collective pipes of sysadmins everywhere: the Docker phenomenon.

    To be more succinct, we are seeing an increase in the demand to host and monitor services (or service sets) which are completely unattached to their physical -- or even virtual -- servers. This presents a serious challenge to the Zabbix monitoring design -- and to most other monitoring service designs.

    I ran into this a few years ago while managing a RedHat 5 cluster engine cluster. The services in particular were NFS mounts which would dynamically migrate from one host to another. Each service was tied to a service IP, and multiple services could be on the same host. There was no sensible way to monitor these services as tied to the physical hosts, so I had to create Zabbix "hosts" which were at the service IP and contained only the items relevant for the service being monitored. But this led to another problem: the zabbix agent was not too keen on being used to monitor IP addresses it didn't know about (maybe that was an old problem or my memory is faulty here). Certainly, active checks on such a service could not work.

    With Docker, the situation is much worse: miniature "containers" hold the service that is running, and said service is typically behind a NAT'd, dynamically-created IP address. The typical Docker setup is a single process or process tree and does not include services such as sshd, init, or inetd. However, the typical Docker setup is not very usable in production, monitored environments. In such environments, the container needs to hold these other things. But you're still left with a dynamic IP, limiting the utility of running a zabbix agent daemon. By contrast, inetd would make more sense here. (It's also the mechanism of choice for the check-mk monitoring system.)

    This monitoring complexity is exacerbated by things like Kubernetes, Swarm, and other container-clustering technologies. Services will not only have random IPs, but random ports! A special discovery agent figures out where these services are running and is queried to redirect traffic accordingly. (The right thing to do would be to finally properly extend DNS to announce services, like RPC did aeons ago. But I digress.) Any decent monitoring system will need to adapt to this scenario, especially w.r.t. automatic service discovery.

    I hope Zabbix can remain at the forefront and adapt to this new tech gracefully and intelligently.
  • otheus
    Member
    • Mar 2009
    • 53

    #2
    Modest proposal

    OK, so from the above, what I think Zabbix needs is a new architecture (a rough sketch follows the list below). This new architecture introduces the concept of Services that are akin to Hosts.

    * An item may belong to either a Service or a Host.

    * A Service is not permanently associated with a Host. It may be dynamically associated with 0 or more hosts.

    * If associated with 0 hosts, the Service is said to be down, offline, or unavailable.

    * Each Service, similar to hosts, may have one or more items. Such items would logically be related to the service, such as (for an HTTP service) the number of worker threads, the response time of a URL fetch operation, and log entries. The items should not know much about the host it's running on, but may include things such as IP address, MAC address, hostname; but not things like OS load or CPU utilization -- these latter are the domain of the Host items.

    * If a Service migrates from one Host to another, its metadata will be updated accordingly.

    * For Service-autodiscovery, Zabbix will need to rely on external entities such as container registries.

    * A special zabbix-agent may be needed to deal with such service-oriented monitoring. It might, for instance, detect when a service has moved onto its host and notify the Zbx server.
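
    To make the shape of this more concrete, here is a rough, purely illustrative sketch of what such a Service object might carry alongside today's Host object. None of these fields or item keys exist in Zabbix; they only show the idea.

    Code:
    # Purely illustrative sketch of the proposed "Service" entity -- nothing
    # here is an existing Zabbix object or item key.
    service = {
        "name": "webshop-frontend",
        "items": [                          # service-scoped items only
            "http.worker.threads",
            "http.response.time[/healthz]",
            "log.count[error]",
        ],
        "hosts": [],                        # 0 hosts => service is down/offline
        "metadata": {                       # updated when the service migrates
            "ip": "10.0.3.17",
            "hostname": "node-b.example.com",
        },
        "discovery": "container-registry",  # external source of truth
    }

    # A service bound to no host is considered unavailable, per the proposal.
    available = len(service["hosts"]) > 0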


    • jan.garaj
      Senior Member
      Zabbix Certified Specialist
      • Jan 2010
      • 506

      #3
      Hard-coded service orientation can reach its limits in the future. I prefer metric/item tagging. For example:

      Code:
      - response.time [host=elasticsearch]
      - response.time [service=elasticsearch]
      - response.time [container=elasticsearch]
      - response.time [host=elasticsearch,service=elasticsearch,container=elasticsearch]
      I don't understand how the inetd zabbix agent version can be used for Docker monitoring. Can you explain it, please?
      Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
      My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant


      • otheus
        Member
        • Mar 2009
        • 53

        #4
        Originally posted by jan.garaj
        Hard-coded service orientation can reach its limits in the future. I prefer metric/item tagging. For example:
        I don't understand the example you provide at all. Response.time? What does that mean? What are these tags? Is this an example of how Zabbix might be configured in the future?

        Originally posted by jan.garaj
        I don't understand how the inetd zabbix agent version can be used for Docker monitoring. Can you explain it, please?
        It's an ugly hack. The idea isn't specific to Docker.

        Context: let's say that you have a cluster of (possibly virtual) hosts (as opposed to containers), each with its own fixed IP and each with a zabbix daemon running. Additionally, you'd have one or more service IPs that float within the cluster depending on where the clustering service places each service. Now, we want to monitor such a service, but it's pointless to monitor it via the (for example) 4 fixed-IP hosts; it should be monitored on the one host that currently holds the floating IP address. So we configure Zabbix to have a "host" identified by its service IP and which holds all items related to that particular service. (URL monitoring is an example of such an item.)

        We can do this with the standard daemon agent, provided we configure it to listen on 0.0.0.0. But what if, for some reason (security, isolation, etc.), we don't want it to listen on 0.0.0.0 but on the fixed IP address of the server? In general that's not a problem, but it is a problem for the floating service IP -- now we need to configure a daemon to listen on an address that does not exist most of the time, because it's on another server. The work-around is to use inetd/xinetd and to configure its filtering rules to hand a request arriving on the given service IP off to the zabbix-agent.
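
        Roughly, the inetd-style exchange is just "key in, value out" per connection. A toy stand-in for the old inetd-mode zabbix_agent (ignoring the real protocol framing, with made-up item keys, purely to show the shape) could look like this:

        Code:
        #!/usr/bin/env python3
        # Toy stand-in for the inetd-mode zabbix_agent: xinetd, bound to the
        # floating service IP, runs this once per connection, so the responder
        # only reads one item key from stdin and prints one value.
        import sys
        import time

        def handle(key):
            if key == "service.ping":       # made-up key, for illustration
                return "1"
            if key == "service.uptime":     # made-up key, for illustration
                return str(int(time.monotonic()))
            return "ZBX_NOTSUPPORTED"

        if __name__ == "__main__":
            key = sys.stdin.readline().strip()
            sys.stdout.write(handle(key) + "\n")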

        With Docker, this may become an obvious solution: instead of building up a container that runs init and which includes a dedicated zabbix monitoring service (because, you know, Docker claims this is not the Docker way of doing things, and because, you know, systemd doesn't like to run inside a container, and because, you know, it's increasingly difficult to install services outside of systemd), the monitoring service can be handled nicely by xinetd. Every time the Zbx server makes a request to the agent via the service IP, xinetd steps in and launches the zabbix agent inside the container.

        Caveat: I haven't actually tried this. I'm not even sure how the various Docker clustering wares really handle service IPs. This particular coalescence of problem and solution came about as I was playing around with Docker, trying to figure out its suitability for running certain applications within our datacenter. One of the models was to use LVS and keepalived to assign a service IP to one of several servers, each running a Docker container of the service, with the additional question of how to run the monitoring service within the container as well. The problem is that the containers themselves cannot assign (or even know about) the service IP: that must be handled by the containing OS.


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Originally posted by otheus
          OK, so from the above, what I think Zabbix needs is a new architecture. This new architecture introduces the concept of Services that are akin to Hosts.

          * An item may belong to either a Service or a Host.

          * A Service is not permanently associated with a Host. It may be dynamically associated with 0 or more hosts.

          * If associated with 0 hosts, the Service is said to be down, offline, or unavailable.

          * Each Service, similar to hosts, may have one or more items. Such items would logically be related to the service, such as (for an HTTP service) the number of worker threads, the response time of a URL fetch operation, and log entries. The items should not know much about the host it's running on, but may include things such as IP address, MAC address, hostname; but not things like OS load or CPU utilization -- these latter are the domain of the Host items.

          * If a Service migrates from one Host to another, its metadata will be updated accordingly.

          * For Service-autodiscovery, Zabbix will need to rely on external entities such as container registries.

          * A special zabbix-agent may be needed to deal with such service-oriented monitoring. It might, for instance, detect when a service has moved onto its host and notify the Zbx server.
          Nihil novi sub sole (Latin: nothing new under the Sun)

          Your scenario is the typical scenario of monitoring a multi-node cluster with N>2.
          What you should do is just set up a dummy host and put the metrics for monitoring your service on that host.

          I'm really surprised how many people think that things like containerisation, job processing or async processing are something discovered only in the last few years
          No .. most of those things have been around for more years than some admins have been on this Planet ..
          Instead of reinventing the wheel, more people should try asking older SAs/SEs how to deal with such dilemmas
          http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
          https://kloczek.wordpress.com/
          zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
          My zabbix templates https://github.com/kloczek/zabbix-templates


          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
            Zabbix Certified Specialist
            Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            I think that the notion of 'host' (computational unit, container, whatever) will always be present. Zabbix architecture is flexible enough to adapt to different use cases; I wouldn't expect any major paradigm shift in this space.

            Zabbix 3.2 is introducing problem (event) tags and an event correlation module that will eventually bring a top-level view of problems and services, along with a much more flexible way of managing actions and top-level dependencies.

            I believe we still miss a good and well-understood way of defining applications (services); it will come in the future.
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga
            My Twitter


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Alex, as author of Zabbix you know that everything is hooked on the definition of some kind of new keys/monitoring.

              A perfect example here is the web check (which may be similar to a service). Each web check adds more than one item with a metrics definition. To be honest, at the moment I don't see how to define such a thing as a high-level service in a similar way as it is done with web checks.
              The only thing which I see could be defined is a generic system.service[<service>] key which, depending on the OS, may deliver the status of the <service>.
              AFAIK such a key is at the moment provided on Windows, and I see some possibilities of extending the definition of this key to other OSes or distros.
              However, even across different flavours of the Unices it is possible to add some more sophisticated variations; on Solaris, for example, it is possible to add sampling of start/stop service timeouts or the number of restarts made automatically. In the case of SMF (Service Management Facility) on Solaris it is possible to do this over trapper items hooked into some core SMF infrastructure.
              The theoretically similar systemd on Linux is very immature compared to SMF, even from the point of view of the status of services or instances of services.
              http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
              https://kloczek.wordpress.com/
              zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
              My zabbix templates https://github.com/kloczek/zabbix-templates


              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                I don't think services should be hard-linked to underlying resources as implemented for WEB checks.

                The way I see it is to have low-level resources somehow loosely connected to business-level services. It can be achieved by engaging tagging (coming in 3.2) that would allow a more flexible relationship between IT Services and events. Well, let's see where it goes.
                Alexei Vladishev
                Creator of Zabbix, Product manager
                New York | Tokyo | Riga
                My Twitter


                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by Alexei
                  I don't think services should be hard-linked to underlying resources as implemented for WEB checks.
                  I was only mentioning that, the same as with web checks, some other classes of resources may in the future have some special support to monitor/present state, and IMO one of the candidates may be high-level services monitoring, to provide a kind of abstraction.

                  Example: on Linux, let's say we have a service like zabbix-agent, and it would be good to know whether this service is in a state that guarantees it will be started automatically after a reboot.

                  On RHEL6 you can check this using "/sbin/chkconfig --list zabbix-agent | cut -f5", checking whether you get "3:on". On RHEL7 you can check whether someone enabled this service using systemd commands. On other types of distributions such a check can be done using another method. IMO a key like system.service[] could be used to hide such details, making templates more portable over time and/or when moving between distributions.
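
                  As a rough sketch of what could sit behind such a key (or behind a UserParameter today) -- purely illustrative, the fallback logic and the key name are assumptions, not an existing Zabbix item:

                  Code:
                  #!/usr/bin/env python3
                  # Rough sketch of a cross-distro "enabled at boot?" check that a
                  # hypothetical system.service[<service>,enabled] key could wrap.
                  import subprocess
                  import sys

                  def service_enabled(name):
                      # Prefer systemd (RHEL7 and most modern distros).
                      try:
                          r = subprocess.run(["systemctl", "is-enabled", name],
                                             capture_output=True, text=True)
                          return 1 if r.stdout.strip() == "enabled" else 0
                      except FileNotFoundError:
                          pass
                      # Fall back to SysV chkconfig (RHEL6 and friends).
                      try:
                          r = subprocess.run(["/sbin/chkconfig", "--list", name],
                                             capture_output=True, text=True)
                          # The runlevel 3 field reads "3:on" when enabled.
                          return 1 if "3:on" in r.stdout else 0
                      except FileNotFoundError:
                          return 0

                  if __name__ == "__main__":
                      # e.g. UserParameter=service.enabled[*],/usr/local/bin/service_enabled.py $1
                      print(service_enabled(sys.argv[1] if len(sys.argv) > 1 else "zabbix-agent"))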

                  In other words, services monitoring is not only something which presents the current state of the running processes, but also history, like "has service A been automatically restarted in the last 1h by systemd on Linux or SMF on Solaris because it crashed?", or future state, answering questions like "will service A be started automatically after a reboot or not?"
                  Other examples of service-related metrics:
                  - how long did it take systemd/SMF to start the Oracle listener before it reported that it is in a fully initialized state?
                  - messages sent on stderr/stdout.
                  http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                  https://kloczek.wordpress.com/
                  zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                  My zabbix templates https://github.com/kloczek/zabbix-templates


                  • otheus
                    Member
                    • Mar 2009
                    • 53

                    #10
                    Originally posted by Alexei
                    Zabbix architecture is flexible enough to adapt to different use cases; I wouldn't expect any major paradigm shift in this space.
                    @Alexei, more precisely, Zabbix's configuration can be hacked heroically to adapt to this use case. But what I'm strongly suggesting here is that the hacks are so ugly and inelegant that it begs the question: why Zabbix? One of Z's key benefits is its configurability in the Web GUI and its powerful templating system. Secondly, this is a problem that I think will envelop all monitoring systems -- and system admins -- and I'd personally like to see Zabbix ahead of the curve.

                    Originally posted by kloczek
                    Your scenario is the typical scenario of monitoring a multi-node cluster with N>2. What you should do is just set up a dummy host and put the metrics for monitoring your service on that host.
                    @kloczek, this is a hack. It's also one that doesn't scale very well. It also rests on some assumptions that are not always valid.

                    I did discuss the multi-node cluster problem. Your suggested solution requires that services be reachable on the routable network. It also requires that the zabbix agent be able to listen on that IP address. It also assumes that services are managed statically. In a Docker world, these assumptions are not necessarily valid.

                    As I mentioned before: the Docker community resists running init within containers. What they think is ideal is this scenario for one host running in a Kubernetes environment:

                    Code:
                     |-- pid 1822 docker container 1 running mysql "alpha" port 3306
                     |-- pid 1823 docker container 2 running httpd "Aaron" port 80,443
                     |-- pid 2010 docker container 3 running mysql "beta" port 3306
                     |-- pid 2011 docker container 4 running httpd "Bob" port 80,443
                     |-- pid 3430 docker container 5 running mysql "delta" port 3306 
                     |-- pid 3431 docker container 6 running httpd "David" port 80,443
                    Each service is running in its own container. Each service can have the same port number because internally, they listen on different network interfaces. Somehow (undefined, because multiple possibilities exist) the services are multiplexed on the public network IP address (usually via different port numbers and a discovery service of some sort). Each service may suddenly move between host A and host B. When it moves to host B, it may have (1) a different internal IP, (2) a different external IP/port, (3) a different container id. The service IP address is one managed by Kubernetes, Swarm, or whatever. The standard way is to set up a reverse NAT path for the service IP to reach the inside container; docker itself does this. I see great difficulty in ensuring that docker redirects incoming connections to the zabbix agent.

                    To monitor HTTP, I want the following items:
                    • Connection time to service ip/port
                    • Number of processes running of that service
                    • Number of lines in log file for that service
                    • Custom item which extracts /extended-status (for Apache) from localhost
                    • memory usage of all HTTP services


                    To monitor MySQL, I might have 50 or so items that correlate to the various values from mysqladmin variables and related commands. One way to do this while avoiding firewall difficulties is with zabbix trapper items and zabbix-sender. Obviously, these are wrapped up in a script which will be run within the related container (a rough sketch of such a script follows the list below). However, there are also standard items:
                    • Connection time to service ip/port
                    • Number of kernel threads for mysql process
                    • Memory usage of mysql process
                    • Number of lines in slowqueries log
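
                    As a sketch of that in-container collector idea (the server name, the Zabbix "host" name, the item keys and the chosen variables are made-up placeholders, not existing template items):

                    Code:
                    #!/usr/bin/env python3
                    # Rough sketch: read a few values from "mysqladmin extended-status"
                    # and push them to the Zabbix server as trapper items via zabbix_sender.
                    import subprocess

                    ZABBIX_SERVER = "zbx.example.com"   # assumption: reachable from the container
                    SERVICE_HOST = "mysql-alpha"        # the Zabbix "host" representing this service
                    WANTED = {"Threads_connected", "Slow_queries", "Questions"}

                    def mysql_status():
                        out = subprocess.run(["mysqladmin", "extended-status"],
                                             capture_output=True, text=True, check=True).stdout
                        values = {}
                        for line in out.splitlines():
                            parts = [p.strip() for p in line.strip("| \n").split("|")]
                            if len(parts) >= 2 and parts[0] in WANTED:
                                values[parts[0]] = parts[1]
                        return values

                    def send(key, value):
                        # zabbix_sender -z <server> -s <host> -k <key> -o <value>
                        subprocess.run(["zabbix_sender", "-z", ZABBIX_SERVER, "-s", SERVICE_HOST,
                                        "-k", "mysql.status[%s]" % key, "-o", value], check=True)

                    if __name__ == "__main__":
                        for k, v in mysql_status().items():
                            send(k, v)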


                    If the MySQL service suddenly dies, the related HTTP service(s) will also trigger alerts. Finally, service pairs might be created dynamically in response to load (dynamic resource allocation in clouds is part of the point of all this docker stuff).

                    Now, how do we configure zabbix (or anything, really) to monitor this?

                    Option 1

                    As @kloczek suggested: each service gets an IP and a host entry within the Zabbix config. To do this, the service must have a routable IP. Further: the zabbix agent must be listening on that IP and in the same container as the service. Without a sysinit, the admin must make sure (somehow) that when docker container 1 is launched for mysql "alpha", it also launches a zabbix agent within that container. So our pseudo process table looks like:


                    Code:
                     |-- pid 1822 docker container 1 running mysql "alpha" port 3306
                     |-- pid 1823 docker container 2 running httpd "Aaron" port 80,443
                     |-- pid 1824 docker container 1 running zabbix-agentd "Aragorn" port 10050
                     |-- pid 1825 docker container 2 running zabbix-agentd "Aardvark" port 10050
                     |-- pid 2010 docker container 3 running mysql "beta" port 3306
                     |-- pid 2011 docker container 4 running httpd "Bob" port 80,443
                     |-- pid 2012 docker container 3 running zabbix-agentd "Balin" port 10050
                     |-- pid 2013 docker container 4 running zabbix-agentd "Beatle" port 10050
                     ...
                    As long as the IPs are routable and static (or discoverable via dynamic DNS or something), and the sysadmin has a really good knowledge of Docker, this won't be too hard to configure: For each service, the Zabbix admin creates a new host based on the relevant template. This solves most of the problems. However:
                    • There is no (obvious) way to trace a problem VM to its "physical" host.
                    • There is no way to create a dependency on the service's actual "physical" host.
                    • How to handle dynamically created service-host sets? (Can this be done currently with Discovery? Even so, we'd have to set up host discovery at a relatively high frequency; a rough discovery sketch follows this list.)
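
                    By "discovery sketch" I mean something like the following: ask the local Docker engine what is running and emit Zabbix low-level discovery JSON for a discovery rule to consume. The LLD macro names and the ports handling are illustrative assumptions, not an existing template:

                    Code:
                    #!/usr/bin/env python3
                    # Rough sketch of a per-host "special discovery agent": list running
                    # containers and print Zabbix low-level discovery JSON.
                    import json
                    import subprocess

                    def discover_containers():
                        fmt = "{{.Names}}\t{{.Image}}\t{{.Ports}}"
                        out = subprocess.run(["docker", "ps", "--format", fmt],
                                             capture_output=True, text=True, check=True).stdout
                        entries = []
                        for line in out.splitlines():
                            name, image, ports = (line.split("\t") + ["", ""])[:3]
                            entries.append({
                                "{#SVC_NAME}": name,    # hypothetical LLD macros
                                "{#SVC_IMAGE}": image,
                                "{#SVC_PORTS}": ports,
                            })
                        return {"data": entries}

                    if __name__ == "__main__":
                        # Could be wired up as a UserParameter or pushed with zabbix_sender.
                        print(json.dumps(discover_containers(), indent=2))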


                    Option 2:

                    Templates with dynamic parameters so that multiple template sets can be added to a host? Items that can "move" from one "host" to another on demand? All of that sounds rather ugly.

                    Option 3:

                    A separate "service" hierarchy under which items can be assigned. Obviously this would need to be designed so it solves the problems/weaknesses above.


                    Originally posted by Alexei
                    Zabbix 3.2 is introducing problem (event) tags and an event correlation module that will eventually bring a top-level view of problems and services, along with a much more flexible way of managing actions and top-level dependencies.

                    I believe we still miss a good and well-understood way of defining applications (services); it will come in the future.
                    I don't see how tags help with the problem of configuration. Can the tags be used to help the admin figure out which VM/host is the problem when a group of pseudo-service-hosts present problems? Regardless, I'm glad you see a future direction here, and yes, you're right: we need to have it better defined.

                    Thanks for listening.


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Originally posted by otheus
                      I did discuss the multi-node cluster problem. Your suggested solution requires that services be reachable on the routable network. It also requires that the zabbix agent be able to listen on that IP address. It also assumes that services are managed statically. In a Docker world, these assumptions are not necessarily valid.
                      The above assumption is valid only in the case of passive monitoring, which is known not to scale well, so sooner or later you will stop using passive monitoring.
                      In the case of active monitoring none of the above applies.
                      Even on the same IP (it does not need to be a single host .. it can even be one IP behind a NAT gateway) you can have as many agents as you want, because when querying the server/proxy the agent says "give me the monitoring cfg for <host_A>".
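
                      Roughly, that active-agent exchange looks like this (a simplified sketch; header/framing details vary by Zabbix version, so treat it as an illustration rather than a drop-in client):

                      Code:
                      #!/usr/bin/env python3
                      # Simplified sketch of the active-agent handshake: the agent connects
                      # to the server/proxy and asks for the item list of a host *name*,
                      # so the agent's own IP does not matter.
                      import json
                      import socket
                      import struct

                      def request_active_checks(server, host, port=10051):
                          payload = json.dumps({"request": "active checks", "host": host}).encode()
                          frame = b"ZBXD\x01" + struct.pack("<Q", len(payload)) + payload
                          with socket.create_connection((server, port), timeout=5) as s:
                              s.sendall(frame)
                              reply = s.makefile("rb").read()
                          # Skip the 5-byte signature and 8-byte length in the reply.
                          return json.loads(reply[13:])

                      if __name__ == "__main__":
                          # "host_A" only has to match a host configured in the frontend.
                          print(request_active_checks("zbx.example.com", "host_A"))
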
                      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                      https://kloczek.wordpress.com/
                      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                      My zabbix templates https://github.com/kloczek/zabbix-templates


                      • SS*
                        Junior Member
                        • Jul 2015
                        • 5

                        #12
                        "The docker phenomenon."

                        "really surprised in how many cases people are thinking that things like containerisation"

                        From one viewpoint it is simple yet surprisingly effective: cognitive ease. The word Docker itself is easy to say and read, and as a word is repeated a positive association is formed. As I recall, Larry Ellison remarked that he was surprised by how much the name of a piece of software affects the likelihood of its success.

                        How was it done before? I take it by this - http://blog.aquasec.com/a-brief-hist...to-docker-2016

                        1979.. chroot. Yep wasn't even born then.

