**mwtzzz_2021** · 20-09-2021, 19:45

"Should" is different than "actual". Has anyone actually run Zabbix at this scale?

**mwtzzz_2021** · 21-09-2021, 17:51

Thanks for those links, I will check them out.

The frequency of our checks runs the gamut - we've got some that run once a minute, others every five minutes, others every 10 minutes, etc.

**Spectator** · 23-09-2021, 16:34

Hi everyone!

First of all, sorry for my bad English.

I have a question which is very similar to mwtzzz_2021's question.
Our company has about 35k hosts.
These hosts are mainly:
- storages,
- LTO tape drives,
- servers (iLO, IPMI,...),
- ethernet and SAN switches,
- routers,
- UPSes,
- VMware and Hyper-V clusters,
- OSes, like Windows and Linux,
- MySQL, PostgreSQL, Oracle databases,
- and so on

Each host has about 50 checks.
Some checks run every minute and others every 5 minutes.
History data must be kept for 3 months.
Trend data must be kept for 3 years.

Does anyone have practical experience monitoring a system of this size?

What type of database is worth using for such a large system?
MySQL? MariaDB? PostgreSQL? PostgreSQL time-series? Or something else?
And approximately what size database can I expect for a system of this size?
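
A rough way to approach the size question is the back-of-the-envelope arithmetic below. It is only a sketch: the 50/50 split between 1-minute and 5-minute checks, the assumption that every item is numeric, and the ~90 bytes per stored value (a commonly quoted rule of thumb; the real figure depends on the database engine and value types) are all assumptions, not measurements. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 3600

def nvps(items: int, interval_s: float) -> float:
    """New values per second for `items` checks polled every `interval_s` seconds."""
    return items / interval_s

def history_bytes(values_per_second: float, days: int, bytes_per_value: int = 90) -> float:
    """Raw history size: every collected value is stored for `days` days."""
    return values_per_second * SECONDS_PER_DAY * days * bytes_per_value

def trend_bytes(items: int, days: int, bytes_per_trend: int = 90) -> float:
    """Trend size: one hourly aggregate row per item, kept for `days` days."""
    return items * 24 * days * bytes_per_trend

hosts = 35_000
checks_per_host = 50
items = hosts * checks_per_host

# Assumption: half of the checks run every minute, half every 5 minutes.
fast_items = items // 2
slow_items = items - fast_items
total_nvps = nvps(fast_items, 60) + nvps(slow_items, 300)

print(f"items:   {items:,}")
print(f"NVPS:    {total_nvps:,.0f}")
print(f"history: {history_bytes(total_nvps, 90) / 1e12:.1f} TB for 3 months")
print(f"trends:  {trend_bytes(items, 3 * 365) / 1e12:.1f} TB for 3 years")
```

Under these assumptions the sketch lands at about 17,500 new values per second, roughly 12 TB of history and roughly 4 TB of trends before any compression. Changing the interval mix or the bytes-per-value guess moves those numbers a lot, so treat them as an order of magnitude only.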

**Spectator** · 24-09-2021, 13:48

Originally posted by cyber

Our previously mentioned setup has PG +timescale and has ~1T DB... I guess compression would save some...14d history+365d trends.

Thanks for your answer.

How often do checks run in your system (1 min, 5 min, ...)?
Which version of Zabbix are you running?
May I ask what hardware your Zabbix system is running on? How many servers, how many proxies? How much CPU, RAM, HDD?
Are the Zabbix web front-end and Zabbix server running on the same host?
Sorry for all the questions. As you surely know, every piece of information matters when designing such a large system. If you have any more ideas or advice, please share them with me.

**mwtzzz_2021** · 24-09-2021, 17:27

Originally posted by Spectator

Hi everyone!

Some checks run every minute and others every 5 minutes.
History data must be kept for 3 months.
Trend data must be kept for 3 years.

Does anyone have practical experience monitoring a system of this size?

You typically cannot use a traditional monitoring system to store trend data. These systems normally only keep the last 10-30 check results. I don't know about Zabbix, though. But for example Icinga2 and Sensu only keep a very brief history.

For longer periods, you must use metrics instead - graphite, wavefront, etc. Indeed, a compelling argument these days is to replace traditional alerting/monitoring with metrics. Gather everything using metrics, then process it.
Metrics systems don't use a traditional database. Instead, they use their own storage for time series data.
My personal experience is with Graphite, which scales well - I had it processing on the order of a million metrics a minute at my last company.
At my current company we use wavefront on a larger scale.

There are only a couple of reasons left to keep using a traditional monitoring system (Zabbix, Icinga2):
1. running Nagios-style checks and custom scripts.
2. schedules, escalations, repeated attempts, etc.

These two things are still not handled well by modern metrics based replacement systems like Prometheus, Splunk, Wavefront. Such systems have rudimentary alerting but don't have the type of fine-grained control over the alerting that the traditional platforms do. Nor do they have the capability of running the nagios-style check scripts.

This is the reason I am on this forum asking about Zabbix. We still need to support the nagios-style checks. And we need to do it at a scale similar to yours. And we need to do it for both our on-prem and our cloud hosts. Icinga2 can handle this but it's not clear whether it's still under active development and support. Nagios cannot handle the scale.

**mwtzzz_2021** · 27-09-2021, 18:46

Sorry, I was off by an order of magnitude in my last post about Graphite. We were piping 25 million metrics a minute through it.

**mwtzzz_2021** · 28-09-2021, 21:35

Originally posted by cyber

At the 2019 Summit there was at least one presentation about consolidating multiple Z servers to one, resulting in over 1M items and 250k NVPS in some situations...
https://assets.zabbix.com/files/zabb..._the_cloud.pdf

This says Max Tested Processed Values /s 250k during a "burst". What is a "burst" and what does values mean?
It also says Max Tested processed alerts up to 30k. What does this mean - 30k at one time? or 30k total defined alerts?
It's not clear from this document what the size of their environment was.

Another one mentions 65k+ devices https://assets.zabbix.com/files/zabb...nvironment.pdf

A bit of false advertising. This one says "will reach 65000+" devices in the future. It doesn't say how many they actually were running at the time they wrote the paper.

So we're back to the original question: has anyone actually run Zabbix on 30,000+ devices in production?

I'm beginning to suspect nobody has done it on this type of scale.

**LenR** · 29-09-2021, 20:41

We have 8250 hosts, 1,355,000 items, 5200 values per second. Almost all data is gathered by proxies; 60% of hosts are network devices polled by SNMP, and the rest are split between Linux and Windows with mostly active agent items. The database is partitioned MySQL on a VM with good spinning disk. We tune MySQL with buffers and huge pages to avoid reads, and partitioning deletes old history and trend data, so we avoid housekeeping. We try to update to the current LTS version twice a year. We run MySQL and Zabbix server on the same VM, the console on another VM, and multiple proxies, some for load, some for access.

SSD would be faster, but we are avoiding physical hardware now.
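
For readers who have not seen partition-based cleanup, here is a minimal sketch of the idea: create the upcoming daily partitions ahead of time and drop the one that has aged out, so no row-by-row housekeeping is needed. It only generates SQL strings. The daily granularity, the `pYYYY_MM_DD` naming, the 90-day retention, and the assumption that the table is already RANGE-partitioned on its integer `clock` column are all illustrative, not a description of the setup above.

```python
from datetime import date, datetime, timedelta, timezone

def partition_name(day: date) -> str:
    """Hypothetical naming scheme: one partition per day, e.g. p2021_09_30."""
    return day.strftime("p%Y_%m_%d")

def add_partition_sql(table: str, day: date) -> str:
    """DDL for a partition holding rows with clock < midnight UTC after `day`."""
    upper = datetime(day.year, day.month, day.day, tzinfo=timezone.utc) + timedelta(days=1)
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"(PARTITION {partition_name(day)} VALUES LESS THAN ({int(upper.timestamp())}));"
    )

def drop_partition_sql(table: str, day: date) -> str:
    """Dropping a partition discards a whole day of rows in one cheap operation."""
    return f"ALTER TABLE {table} DROP PARTITION {partition_name(day)};"

def maintenance_sql(table: str, today: date, keep_days: int, ahead_days: int = 3) -> list:
    """Statements for one daily run: pre-create future days, drop the expired day."""
    statements = [
        add_partition_sql(table, today + timedelta(days=offset))
        for offset in range(1, ahead_days + 1)
    ]
    statements.append(drop_partition_sql(table, today - timedelta(days=keep_days + 1)))
    return statements

for statement in maintenance_sql("history", date(2021, 9, 29), keep_days=90):
    print(statement)
```

MySQL rejects an `ADD PARTITION` whose name already exists or whose bound is not higher than the last one, so a real job would first look at the existing partitions before running anything like this.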

**logix88** · 01-10-2021, 14:27

We have recently started using AWS Aurora RDS... In my experience, large-scale Zabbix always starts great and issues happen after it's been in service for a while... A really performant DB is key; don't do historical pruning from Zabbix, offload it to the DB using stored procedures. Use active items instead of passive... obviously use proxies and offload as much onto them as you can... I think if you follow the best practices, this shouldn't cause any trouble... though with any large-scale deployment, get a Zabbix support contract! It's not too expensive, and if things go wrong, they can help like no one else can. Personally, I have had some critical issues resolved.

**mwtzzz_2021** · 01-10-2021, 22:47

Thanks for that information, guys. Very helpful. Good tips.
It still sounds like nobody has run it on a scale of 30K+ hosts. I'd still be interested in hearing from someone who has actual real-world experience running it in production.

**mwtzzz_2021** · 01-10-2021, 22:54

Another question for you all: for a public cloud environment (auto-scaling, immutable images, etc.), is Zabbix a good option?
The dynamic nature of auto-scaling / short-lived instances always brings up the issue of programmatic cleanup / deregistering the hosts. Is this something Zabbix can handle gracefully?
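
To make the cleanup question concrete, this is the kind of job it implies, sketched against the Zabbix JSON-RPC API methods `host.get` and `host.delete`. It only builds request bodies and picks out stale hosts; authentication and the HTTP call are left out because they differ between versions. The assumptions that the Zabbix host name equals the cloud instance ID, and that auto-scaled hosts live in a single host group with the made-up ID `"42"`, are purely illustrative.

```python
import json
from itertools import count

_request_ids = count(1)

def rpc(method: str, params) -> dict:
    """Build a JSON-RPC 2.0 request body (auth and transport intentionally omitted)."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": next(_request_ids)}

def list_hosts_request(group_id: str) -> dict:
    """Ask for the ID and technical name of every host in one host group."""
    return rpc("host.get", {"output": ["hostid", "host"], "groupids": [group_id]})

def stale_host_ids(zabbix_hosts: list, live_instances: set) -> list:
    """Hosts still registered in Zabbix whose instance no longer exists."""
    return [h["hostid"] for h in zabbix_hosts if h["host"] not in live_instances]

def delete_hosts_request(host_ids: list) -> dict:
    """host.delete takes a plain list of host IDs as its params."""
    return rpc("host.delete", list(host_ids))

# Made-up data standing in for a host.get result and a cloud inventory.
zabbix_hosts = [
    {"hostid": "101", "host": "i-0aaa"},
    {"hostid": "102", "host": "i-0bbb"},
]
live_instances = {"i-0bbb"}

print(json.dumps(list_hosts_request("42")))
print(json.dumps(delete_hosts_request(stale_host_ids(zabbix_hosts, live_instances))))
```

Whether a periodic job like this counts as "graceful" at 30K+ short-lived hosts is exactly the open question.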

**tobankeisha** · 14-12-2021, 13:05

not sure if anyone has run Zabbix on this scale

**cyber** · 23-12-2021, 22:33

These numbers should not be an issue....

**cyber** · 23-12-2021, 22:41

Well... I have ~10k hosts and an average of 50 items per host... several DCs... I would not hesitate to double the number of hosts, it just needs some extra proxies...
The number of hosts and items is one side, but how often you check is another. With the same number of things you can get a different NVPS depending on whether your check interval is 1m or 10m.

The work your servers/proxies have to do is very different.
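
The interval effect is easy to put numbers on. A tiny sketch using the ~10k hosts and 50 items per host mentioned above; the assumption that every item shares one interval is a simplification, and the helper name is made up.

```python
def nvps(hosts: int, items_per_host: int, interval_s: int) -> float:
    """New values per second if every item is collected every `interval_s` seconds."""
    return hosts * items_per_host / interval_s

# Same 500k items, three different check intervals.
for interval_s in (60, 300, 600):
    print(f"{interval_s:>4}s interval: {nvps(10_000, 50, interval_s):,.0f} NVPS")
```

Going from a 10-minute to a 1-minute interval multiplies the load by ten without adding a single host or item.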
At the 2019 Summit there was at least one presentation about consolidating multiple Z servers to one, resulting in over 1M items and 250k NVPS in some situations...

https://assets.zabbix.com/files/zabbix_summit_2019/Olivier_Harand-From_10_standalone_Zabbix_platforms_to_a_major_one_in_the_cloud.pdf

Another one mentions 65k+ devices

https://assets.zabbix.com/files/zabbix_summit_2019/Andy_Zhou-Zabbix_integration_with_big_data_system_in_a_large_scale_environment.pdf

Real world, honest assessment
