Zabbix approach to monitoring. Resource vs service
Before we can start talking about what a good template is, let’s briefly discuss monitoring in general. Monitoring is a common and ambiguous term, and there is no RFC-style definition of what it is. Some may say that monitoring is constantly looking for mentions of your company on social networks, while others may say it is collecting sensor readings about the nutrients and moisture in your plants’ soil. As Zabbix is a universal monitoring solution, we provide a toolset for any kind of monitoring project. In this document, however, we will focus on our views on monitoring services provided by modern distributed systems.
Everything can be seen as a service or as a resource.
A service is something that your organization provides to the outside world, like an online retail store or a public email service. If your store is online and customers can successfully purchase something from it, your service is available. A service can also be something that your department provides to the rest of the company, like computation resources that you provide to other departments on demand. Such an internal service can be a dependency of your company’s online store service. As you can see, service issues clearly affect the real world. Monitoring service availability and performance is what you should always do first. As Google suggests in its SRE book, service unavailability should be considered a red-hot situation, and the responsible person must be paged immediately.
A resource is a component that helps to provide services. It can be a server, a virtual machine, a container, a database, a middleware app, a custom microservice, a hardware controller, a network, or anything else. You can break resources down into even smaller parts, like splitting a server into CPU, RAM, and I/O subsystem, or splitting a network link into layer 3 and layer 2 connectivity and the physical link present on both sides. A modern distributed system might be a complex, dynamic set of different resources that are added and removed on demand based on the service load, as in a Kubernetes cluster or AWS. Resources require your monitoring attention too, but in a different way. Because resource unavailability doesn’t necessarily affect the real world, keep paging people about resource failures to a minimum. Instead, create tickets that can be resolved during working hours.
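The paging policy described above can be sketched as a small routing function. This is only an illustration in Python, not a Zabbix feature: the `Problem` class, its `source` labels, and the `route` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """A detected problem; the field names are hypothetical, not Zabbix objects."""
    source: str        # assumed label: "service" or "resource"
    description: str


def route(problem: Problem) -> str:
    """Page a person for service problems, open a ticket for resource problems."""
    if problem.source == "service":
        return "page"      # service unavailability is a red-hot situation
    return "ticket"        # resource failures can wait for working hours


print(route(Problem("service", "online store checkout is failing")))
print(route(Problem("resource", "one container restarted")))
```

A real setup would probably allow exceptions for resources whose failure is known to take a service down; the sketch keeps only the default rule.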
Service monitoring is considered project-level monitoring: it is not something you can get out of the box or download as a template from the Zabbix template repository. It is something you need to create yourself using Zabbix features. That is because all services are different: they have different SLOs, different architectures, and so on, so it’s hard to prepare a common blueprint for service monitoring. But consider the following approaches:
Most likely, situations such as a zero request rate (or a sudden drop in the rate) or a high error ratio (HTTP 500 everywhere) indicate serious service problems.
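As a rough illustration of these symptoms, the following Python sketch classifies one measurement window. The function name, its parameters, and the default thresholds (a drop below 50% of the baseline, an error ratio above 5%) are assumptions made for this example; suitable values depend on the service and its SLOs.

```python
def service_symptoms(requests_now, requests_baseline, errors_now,
                     drop_ratio=0.5, max_error_ratio=0.05):
    """Return the list of symptoms seen in one measurement window.

    requests_now      -- requests counted in the current window
    requests_baseline -- requests normally seen in a comparable window
    errors_now        -- failed requests (e.g. HTTP 500) in the current window
    The thresholds are illustrative defaults, not recommended values.
    """
    symptoms = []
    if requests_now == 0:
        symptoms.append("zero rate")
    elif requests_baseline > 0 and requests_now < requests_baseline * drop_ratio:
        symptoms.append("sudden drop of rate")
    # The error ratio is only defined when there was at least one request.
    if requests_now > 0 and errors_now / requests_now > max_error_ratio:
        symptoms.append("high error ratio")
    return symptoms


print(service_symptoms(requests_now=100, requests_baseline=100, errors_now=30))
```

In Zabbix itself, such conditions would normally be expressed as trigger expressions over collected items rather than in external code.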
But why do you need to monitor resources if service monitoring is set up? There are multiple reasons, but the most important one is this: once you know your service is down (the symptom), you need to isolate the root cause of the problem.
Resources are common and generic across many projects and architectures. That’s where templates can thrive. Seriously, do you really need to spend time creating your own monitoring solution for Linux? Or for a MySQL database, a Cisco router, or a Docker host? Maybe you can spend your time more efficiently by preparing service-level monitoring instead.
Let’s try to define some properties of what a good resource template is, some key principles we follow in Zabbix when building templates:
In Zabbix, a template is the monitoring solution for some specific object. It’s a sort of container used to transfer configuration, that is, a monitoring solution, between Zabbix server instances. A good-quality template is something that Zabbix users create, use for their own good, and then share with the Zabbix community, so that the next person can download the template, reuse it, and update it with newer ideas and approaches, contributing to the common cause. So, the first thing that comes to mind when you try to answer the question of what makes a good template in Zabbix is how flexible and reusable it is. If other Zabbix users can download it and use it without changing half of it, that’s a really good sign.
Here are a few rules of thumb on how to achieve it:
We also think that a good template is not just a set of metrics (items), thresholds (triggers), and dashboards bundled together. The most important ingredient of a good template is how much expertise and knowledge about the monitored object it contains. By expertise and knowledge we mean not the number of metrics someone knows how to collect, but knowing which metrics are useful and important and which are useless, and which thresholds to use so that you are notified only about problems that matter, without too much noise.
While a very minimalistic template may not provide all the information you need, bloated, oversized templates are bad as well: users lose focus on the most important metrics and get overwhelmed by problem noise. So:
The last very important thing is the template scope. We have already talked about services and resources, so, generally, a good template is one whose scope is a single resource:
If you keep a template’s scope within a single resource, it will be much easier to share such templates, and they will be useful to people who have the same building blocks in their architecture. Also, avoid merging resources of different layers: do not add metrics for Linux OS and PostgreSQL to a single template.
But what about the ‘inner’ scope, the scope of metrics? What types and classes of metrics should you be collecting? You can certainly do monitoring for various reasons, including collecting business indicators or looking for security breaches, but when creating a generic Zabbix resource template, try to adopt the following approach:
3.1 Always start with fault monitoring, or availability monitoring. The most popular and most important question people want monitoring to answer is: is my system up and running? So, try to address that in your template first.
That is, prefer a black-box monitoring approach here. Health checks, simple or not so simple, are essential and are the first thing you or any user will want to know about. Add items and triggers to your template that help you make sure that the thing you are monitoring is accessible and is up and running. Use ICMP ping, check that a TCP port is open, check that an API returns HTTP 200 OK, and so on.
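To make the idea of a black-box health check concrete, here is a minimal Python sketch of two of the checks mentioned above, using only the standard library. It is not how Zabbix implements its own checks; the function names `tcp_port_open` and `http_ok` and the default timeouts are assumptions made for this example.

```python
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError and is treated as a failure.
        return False
```

For example, `tcp_port_open("db.example.com", 5432)` would report whether a (hypothetical) database host accepts connections, without knowing anything about the internals of the database.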
The second problem that is addressed by fault monitoring is an imminent failure. For example,
Add items and triggers that will help you to intervene and prevent such a drastic outcome.
Also, if your monitored object can detect faults on its own, use that! Many systems can report faults directly through logs, SNMP traps, and so on. That is the kind of expertise we talked about earlier, provided to you by the developers, vendors, and authors of the system you want to monitor, and nobody knows it better than they do. So just make sure that your Zabbix template can relay the faults detected by the system itself.
3.2 Once your template can check the health of your system, proceed with performance monitoring. This is where you will need to open the box (white-box monitoring). There are really nice methods out there to help you choose which metrics to collect first: the USE method, the four golden signals from Google, or the RED method for request-driven services. Just make sure you extend the template with items and triggers that help solve the following use cases:
3.3 Inventory and state control
While Zabbix is not an inventory system, it can still collect lots of information about the resource and, most importantly, detect changes, such as the system being restarted outside of a maintenance window, or the version being updated or becoming outdated. So, make it part of your template checklist.
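The kind of change detection described here can be sketched by comparing two consecutive snapshots of a resource. The Python example below is an illustration only: the snapshot keys `uptime_seconds` and `version` and the function `detect_changes` are hypothetical and do not correspond to Zabbix item keys.

```python
def detect_changes(previous, current, in_maintenance=False):
    """Compare two snapshots of a resource and describe what changed.

    Each snapshot is a dict with the assumed keys 'uptime_seconds' and
    'version'. An uptime lower than in the previous snapshot means the
    system was restarted between the two readings.
    """
    events = []
    if current["uptime_seconds"] < previous["uptime_seconds"]:
        if in_maintenance:
            events.append("restarted during maintenance window")
        else:
            events.append("restarted outside of maintenance window")
    if current["version"] != previous["version"]:
        events.append(
            "version changed from %s to %s" % (previous["version"], current["version"])
        )
    return events


before = {"uptime_seconds": 864000, "version": "5.0.1"}
after = {"uptime_seconds": 120, "version": "5.0.2"}
print(detect_changes(before, after))
```

Detecting an outdated version would additionally require a trusted source for the latest available version, which this sketch leaves out.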
If you know how to properly detect security issues with the resource, e.g.:
Then consider adding such items and triggers to your template as well.
Finally, there is the style of the template. How should you name your items? Templates? Triggers? If we all followed the same style when creating Zabbix templates, it wouldn’t really matter who made a template (you, Zabbix, or another community member from the other side of the globe), as its contents and layout would be predictable.
Following the style guidelines and core template principles means that we can reuse each other’s templates as building blocks for our monitoring projects, saving time and gaining someone else’s knowledge of the monitored object.
That concludes the introduction to the Zabbix template guidelines, a comprehensive set of rules for how we build templates in Zabbix.
I recommend reading the guide if: