Key principles of a good template

Introduction

Zabbix approach to monitoring. Resource vs service

Before we can start talking about what a good template is, let’s briefly discuss monitoring in general. Monitoring is a common yet ambiguous term, and there is no RFC-style definition of what it is. Some may say that monitoring is constantly looking for mentions of your company on social networks, while others may say it is collecting sensor readings about the nutrition and moisture of your plant soil. As Zabbix is a universal monitoring solution, we provide a toolset for any kind of monitoring project. In this document, however, we will focus on our views on monitoring services provided by modern distributed systems.

Everything can be seen as a service or as a resource.

A service is something that your organization provides to the outside world, like an online retail store or a public email service. If your store is online and customers can successfully purchase something from it, your service is available. A service can also be something that your department provides to the rest of the company, like computation resources provided to other departments on demand. Such an internal service can be a dependency of your company’s online store service. As you can see, service issues clearly affect the real world. Monitoring service availability and performance is what you should always do first. As Google suggests in the SRE book, service unavailability should be considered a red-hot situation, and the responsible person must be paged immediately.

A resource is a component that helps to provide services. It can be a server, a virtual machine, a container, a database, a middleware app, a custom microservice, a hardware controller, a network, or anything else. You can break resources down into even smaller parts, like splitting a server into CPU, RAM, and the I/O subsystem, or splitting a network link into layer 3 connectivity, layer 2 connectivity, and the physical link present on both sides. A modern distributed system might be a complex, dynamic set of different resources that are added and removed on demand based on the service load, just like in a Kubernetes cluster or AWS. Resources require your monitoring attention too, but in a different way. Because resource unavailability doesn’t necessarily affect the real world, keep paging people about resource failures to a minimum. Create tickets instead that can be solved during working hours.

Service monitoring is considered project-level monitoring. It is not something you can get out of the box or as a template from the Zabbix template repository; it is something you need to create yourself using Zabbix features. That is because all services are different: they have different SLOs, different architectures, and so on, so it’s hard to prepare a common blueprint for service monitoring. But consider the following approaches:

  • Try synthetic monitoring: regularly emulate the activity of a user (a person or another application that interacts with your service) and check for symptoms with the black-box approach:
    • Use Web monitoring in Zabbix and go through common user scenarios, for example, try to log in and purchase a test item from the store.
    • A simpler HTTP check will also work: just make sure your website URL returns HTTP 200 OK.
    • If your service provides a REST API, write a script that will emulate some common service activity.
  • Try the real user monitoring approach: gather and read transactions about your real users from logs, the network, or the database.
    • Calculate the ratio of successful to failed requests.
    • Collect the request rate per second or per minute to your service.
    • Calculate min/max/average response times to your service or create a histogram.

Situations such as a zero request rate (or a sudden drop in rate) or a high error ratio (HTTP 500 everywhere) most likely indicate serious service problems.
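As a rough illustration of the real user monitoring indicators listed above, here is a minimal Python sketch that summarizes one observation window of request records. The `Request` record, the `summarize` function, and the sample values are assumptions made for this example; in a real project the records would come from your logs, network captures, or database, and the resulting numbers would be sent to Zabbix as item values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    timestamp: float     # seconds since the start of the window (hypothetical)
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def summarize(requests, window_seconds):
    """Summarize one observation window of request records."""
    total = len(requests)
    if total == 0:
        # A zero request rate is itself a strong symptom of a service problem.
        return {"rate_per_s": 0.0, "error_ratio": None,
                "min_ms": None, "max_ms": None, "avg_ms": None}
    # Treat HTTP 5xx responses as failures; adjust to your own definition.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "min_ms": min(durations),
        "max_ms": max(durations),
        "avg_ms": mean(durations),
    }


# Hypothetical sample: four requests observed in a 60-second window.
sample = [
    Request(0.0, 200, 120.0),
    Request(10.0, 200, 80.0),
    Request(20.0, 500, 400.0),
    Request(30.0, 200, 200.0),
]
summary = summarize(sample, window_seconds=60)
```

Triggers on such values would typically watch for a rate that drops to zero and for an error ratio above a threshold that fits your SLO.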

But why do you need to monitor resources if service monitoring is set up? There are multiple reasons, but the most important one is this: once you know your service is down (the symptom), you need to isolate the root cause of the problem.

Resources are common and generic across many projects and architectures. That’s where templates can thrive. Seriously, do you really need to spend time creating your own monitoring solution for Linux? Or for a MySQL database, a Cisco router, or a Docker host? Maybe you can spend your time more efficiently by preparing service-level monitoring instead.

What does it take to become a good resource template?

Let’s try to define the properties of a good resource template, the key principles we follow in Zabbix when building templates:

1. Flexible and reusable

In Zabbix, a template is a monitoring solution for some specific object. It’s a sort of container used to transfer a monitoring configuration between Zabbix server instances. A good-quality template is something that Zabbix users create, use for their own good, and then share with the Zabbix community, so the next person can download the template, reuse it, and update it with newer ideas and approaches, contributing to the common cause. So, the first thing that comes to mind when answering the question of what makes a good Zabbix template is how flexible and reusable it is. If other Zabbix users can download it and use it without changing half of it, that’s a really good sign.

Here are a few rules of thumb on how to achieve it:

  • Use low-level discovery as much as possible to avoid unsupported items or triggers. Just because a metric exists in your environment doesn’t mean it will be available in someone else’s: the hardware, software version, or configuration could be different.
  • Use user macros in triggers and items. For example, use {$NGINX.URL} for the Nginx stub status URL, or {$TEMP.MAX.CRIT} in a temperature trigger. This allows users to configure and fine-tune the template on linked hosts instead of changing the template itself and breaking compatibility with future versions.
  • Avoid adding rare metrics or triggers required only for your project or service to the resource template. Move project- or service-level metrics that you feel are very specific to another template and link it to the generic template.
  • Avoid external dependencies where possible. Use the built-in Zabbix data collection and processing capabilities first: item types such as HTTP agent and JMX, and powerful preprocessing steps such as JavaScript and JSONPath. This ensures that the template is easy to install and that all its configuration and processing are defined within the template. Resort to external scripts only if there is no alternative available.
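To make the user macro rule more concrete, here is a small conceptual Python sketch of why macros keep a template reusable: the template ships default values, and a host can override them without touching the template. This is not how Zabbix is implemented. The `resolve_macros` function, the macro values, and the simplified threshold text are assumptions for illustration, and features such as macros with context are ignored.

```python
import re

# Matches simple user macros such as {$NGINX.URL}; macro context is not handled.
MACRO_PATTERN = re.compile(r"\{\$[A-Z0-9_.]+\}")


def resolve_macros(text, template_macros, host_macros):
    """Replace {$MACRO} references, preferring host-level values."""
    def lookup(match):
        name = match.group(0)
        if name in host_macros:
            return host_macros[name]
        if name in template_macros:
            return template_macros[name]
        return name  # leave unknown macros untouched

    return MACRO_PATTERN.sub(lookup, text)


# Hypothetical defaults shipped with the template.
template_macros = {
    "{$NGINX.URL}": "http://localhost/basic_status",
    "{$TEMP.MAX.CRIT}": "60",
}
# Hypothetical override set on one particular host.
host_macros = {"{$TEMP.MAX.CRIT}": "75"}

url = resolve_macros("{$NGINX.URL}", template_macros, host_macros)
threshold = resolve_macros("temperature > {$TEMP.MAX.CRIT}", template_macros, host_macros)
```

Here `url` keeps the template default, while `threshold` picks up the host-level value, so the template itself stays unchanged and remains compatible with future versions.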

2. Knowledge and expertise

We also think that a good template is not just a set of metrics (items), thresholds (triggers), and dashboards bundled together. The most important ingredient of a good template is how much expertise and knowledge about the monitored object it contains. By expertise and knowledge we mean not the number of metrics someone knows how to collect, but knowing which metrics are useful and important and which are useless, and which thresholds to use so that you are notified only about problems that matter, without too much noise.

While a very minimalistic template may not provide all the information you need, bloated, oversized templates are bad as well: users lose focus on the most important metrics and get overwhelmed with problem noise. So:

  • Avoid adding too many metrics. Keep it simple. Don’t try to do benchmarking or profiling, or to collect deep debugging-level metrics. This creates unnecessary load on Zabbix and on the monitored object. Let Zabbix isolate the problem, then do profiling and debugging using specialized tools on the hosts that really require it.
  • Avoid creating too much problem noise with triggers in the template. Make sure that every problem raised by a trigger requires either immediate (page) or postponed (ticket) action. Avoid ‘for your information’ and ‘this looks weird’ level triggers.

3. Modularity and scope

The last very important thing is the template scope. We already talked about services and resources, so, generally, a good template is one that is scoped to a single resource:

  • Vendor-specific hardware server
  • A temperature sensor
  • An operating system such as Linux or Windows
  • Applications like Nginx, Apache, Tomcat, RabbitMQ
  • DBs like MySQL, PostgreSQL, Oracle, DB2, Redis, Mongo, and so on
  • Cloud providers such as AWS, Azure, or others
  • Virtualization providers such as VMware clusters, Hyper-V
  • Container orchestration systems such as Kubernetes
  • A network device or network controller
  • Some custom applications

If you keep the template scope within a single resource, it will be much easier to share such templates, and they will be useful to people who have the same building blocks in their architecture. Also, avoid merging resources of different layers: do not add metrics for Linux and PostgreSQL to a single template.

But what about the ‘inner’ scope, the metrics themselves? What types and classes of metrics should you be collecting? Surely you can do monitoring for various reasons, including collecting business indicators or looking for security breaches, but when creating a generic Zabbix resource template, try to adopt the following approach:

3.1 Always start with fault monitoring or availability monitoring. The most common and most important question people want monitoring to answer is: is my system up and running? So, try to address that in your template first.

That is, prefer the black-box monitoring approach here. Health checks, simple or not so simple, are essential and are the first thing you or any user will want to know about. Add items and triggers to your template that help you be sure that the thing you are monitoring is accessible and is up and running. Use ICMP ping, check that a TCP port is open, check that an API returns HTTP 200 OK, and so on.
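As an illustration of such a black-box health check, here is a minimal Python sketch of a TCP port check. Zabbix provides its own built-in checks for this purpose; the `tcp_port_is_open` helper below is only an assumed name used to show the idea, and the local listener exists purely so the example runs without any external service.

```python
import socket


def tcp_port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable = tcp_port_is_open("127.0.0.1", port)

listener.close()
unreachable = tcp_port_is_open("127.0.0.1", port, timeout=0.5)
```

The check answers only one question, whether something accepts connections on the port, which is exactly the kind of simple availability signal to start a template with.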

The second problem addressed by fault monitoring is imminent failure. For example:

  • The hardware is overheated and is about to shut down.
  • You are running very low on disk space, and very soon your DB will refuse to write new data.

Add items and triggers that will help you to intervene and prevent such a drastic outcome.
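To show what detecting an imminent failure can look like, here is a small Python sketch that estimates how long it will take for a disk to fill up, using a simple two-point linear extrapolation. Zabbix offers built-in predictive trigger functions for this kind of question; the `seconds_until_full` function and the sample numbers below are assumptions that only illustrate the underlying idea.

```python
def seconds_until_full(samples, capacity_bytes):
    """Estimate seconds until used space reaches capacity.

    samples: list of (timestamp_seconds, used_bytes) tuples, oldest first.
    Returns None if there are too few samples or usage is not growing.
    """
    if len(samples) < 2:
        return None
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    growth_per_second = (used1 - used0) / (t1 - t0)
    if growth_per_second <= 0:
        return None  # usage is flat or shrinking, nothing to predict
    return max(0.0, (capacity_bytes - used1) / growth_per_second)


# Hypothetical readings: 1 GB of growth in 1000 seconds on a 100 GB disk.
readings = [(0, 70_000_000_000), (1000, 71_000_000_000)]
remaining = seconds_until_full(readings, capacity_bytes=100_000_000_000)
```

A trigger built on such an estimate can fire while there is still time to intervene, instead of firing only once the disk is already full.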

Also, if your monitored object can detect faults on its own, use it! Many systems can report faults directly using logs, SNMP traps, and so on. That’s the kind of expertise we talked about earlier, provided to you by the developers, vendors, and authors of the system you want to monitor, and nobody knows it better than they do. So just make sure your Zabbix template can relay the faults detected by the system itself.

3.2 Once your template can check the health of your system, proceed with performance monitoring. This is where you will need to open the box wide (white-box monitoring). There are really nice methods out there to help you choose which metrics to collect first: the USE method, the four golden signals from Google, or the RED method for request-driven services. Just make sure you extend the template with items and triggers to help solve the following use cases:

  • My system is slow. It is up but the response time is unsatisfactory. Performance has degraded.
  • We just had a big outage. We need to investigate and do retrospective analysis to find out what happened to make sure it never happens again. To do such analysis we need helpful metrics collected beforehand.

3.3 Inventory and state control

While Zabbix is not an inventory system, it can still collect lots of information about the resource and, most importantly, detect changes, such as the system being restarted outside of a maintenance window, the version being updated or becoming outdated, and so on. So, make it part of your template checklist.
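State control boils down to comparing the latest collected value with the previous one. The following Python sketch shows the idea for restarts and version changes. The `detect_changes` function, the sample dictionary keys, and the values are hypothetical; in a Zabbix template the same comparison would be expressed with triggers on the corresponding items.

```python
def detect_changes(previous, current):
    """Compare two consecutive samples and list noteworthy state changes.

    Each sample is a dict with 'uptime_s' (int) and 'version' (str).
    """
    changes = []
    # Uptime going backwards means the system restarted between samples.
    if current["uptime_s"] < previous["uptime_s"]:
        changes.append("restarted")
    if current["version"] != previous["version"]:
        changes.append(
            f"version changed: {previous['version']} -> {current['version']}"
        )
    return changes


# Hypothetical consecutive samples from the same host.
before = {"uptime_s": 86400, "version": "1.24.0"}
after = {"uptime_s": 120, "version": "1.26.1"}
changes = detect_changes(before, after)
```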

3.4 Security

If you know how to properly detect security issues with the resource, for example:

  • The resource version in use is subject to a CVE, and an update should be considered.
  • A misconfiguration makes the system publicly available without proper authentication when it should not be.

then consider adding such items and triggers to your template as well.
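For the first case, the core of the check is comparing the collected version with the first version known to contain the fix. Below is a minimal Python sketch of that comparison. The function names and version numbers are hypothetical, and the sketch handles only plain dotted numeric versions; real version strings with suffixes would need more careful parsing.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.18.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed, minimum_patched):
    """Return True if the installed version is older than the patched one."""
    return parse_version(installed) < parse_version(minimum_patched)


# Hypothetical versions: comparing as numbers, not as strings, matters here,
# because the string '1.9.3' sorts after '1.10.0'.
needs_update = is_outdated("1.9.3", "1.10.0")
up_to_date = is_outdated("1.10.0", "1.10.0")
```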

4. Follow guidelines

Finally, there is the style of the template. How should you name your items? Templates? Triggers? If we all followed the same style when creating Zabbix templates, it wouldn’t really matter who made a template (you, Zabbix, or another community member from the other side of the globe), as the template contents and layout would be predictable and familiar.

Conclusion

Following the style guidelines and core template principles means that we can reuse each other’s templates as building blocks for our monitoring projects, saving time and gaining someone else’s knowledge about the monitored object.

That concludes the introduction to the Zabbix template guidelines, a comprehensive set of rules for how we build templates in Zabbix.

I recommend reading the guide if:

  • You want to share your template with the rest of the world
  • You want to avoid common mistakes when creating a template
  • You are a hardware or software vendor and want to provide a Zabbix template for your solution