Server hit the wall, I think - configuration steps to grow?

  • jjeff123
    Member
    • May 2022
    • 33

    #1

    Server hit the wall, I think - configuration steps to grow?


    We have on the order of 30 active proxies monitoring 725K items; nearly all are network switches doing SNMP polling. Our server is a single box running Postgres/TimescaleDB and Zabbix 7.0.5, handling 4700 VPS.
    I now have to add a hundred or so more switches to monitor, and I'm worried I'll push the box over the edge to where it'll never be able to keep up.
    My queue spikes every 2 minutes, from a low of 4K to a high of 15K. One of my proxies always seems to have data queued by 5 or 10 seconds. I'm unclear whether the server queue graph looks like this because that specific proxy is slow/broken, or because of the server itself.

    I need to migrate to newer hardware, but what configuration changes should I be making?

    A server restart takes on the order of 30 minutes, as the history sync takes a while and all the data from proxies has to be processed.

    The biggest performance issue I see is disk writes. Zabbix says disk utilization is low, but iotop shows the box writing 30-100 MB/s constantly.

    Throwing more CPU and memory at it doesn't seem to make any difference.
    I have 8 cores, and even during those stressful restarts my load doesn't get over 6. I have 48 GB of memory with 34 GB sitting in buffer/cache, so neither Zabbix nor Postgres is asking for more.



    [Screenshot: Disk load]

    [Screenshot: Server queue during restart]

    [Screenshot: Normal queue]
  • mrnobody
    Member
    • Oct 2024
    • 61

    #2
    Hi Jeff (this is the kind of thread I live for, heh).

    I've never seen an installation this huge: 4.7k VPS and almost a million hosts, wow.
    In my certification class I heard about Patroni for the first time (a clustering solution for PostgreSQL databases). I trust the experience of the teacher and others who said it's a good option, but I've never tested it myself. I would migrate the software first, to a more robust solution, wait until it's stable and available enough, and only after that migrate the hardware (given the 30 minutes it takes to restart the server, it looks like something is wrong).

    Good luck


    • jjeff123
      Member
      • May 2022
      • 33

      #3
      Originally posted by mrnobody
      Hi Jeff (this is the kind of thread I live for, heh).

      I've never seen an installation this huge: 4.7k VPS and almost a million hosts, wow.
      In my certification class I heard about Patroni for the first time (a clustering solution for PostgreSQL databases). I trust the experience of the teacher and others who said it's a good option, but I've never tested it myself. I would migrate the software first, to a more robust solution, wait until it's stable and available enough, and only after that migrate the hardware (given the 30 minutes it takes to restart the server, it looks like something is wrong).

      Good luck
      No, not millions of hosts. Around 1600 hosts, and 4700 VPS, not 4700K. But some of those hosts are stacks of six or eight 48-port switches; I've got a couple of hosts with 5000+ items, mostly because we used the default templates. I think we'll be going through an exercise to significantly reduce monitoring on edge switches.

      This project kind of grew organically, from more or less a lab/test bed to production. The hardware migration will be to much newer hardware, which should help a lot.


      • mrnobody
        Member
        • Oct 2024
        • 61

        #4
        Originally posted by jjeff123

        No, not millions of hosts. Around 1600 hosts, and 4700 VPS, not 4700K. But some of those hosts are stacks of six or eight 48-port switches; I've got a couple of hosts with 5000+ items, mostly because we used the default templates. I think we'll be going through an exercise to significantly reduce monitoring on edge switches.

        This project kind of grew organically, from more or less a lab/test bed to production. The hardware migration will be to much newer hardware, which should help a lot.
        Items*, not hosts; my beloved dyslexia.
        That's a shorter path: reduce the number of items, keeping only what is necessary. You can use mass update to do this right in the frontend.


        • guille.rodriguez
          Senior Member
          • Jun 2022
          • 114

          #5
          Maybe a good option is to disable items that you don't need. For example, on a switch, set the discovery rule to only add ports with admin status = 1 (active); if admin status = 0 (disabled), you don't need to monitor that port.

          Another option is to increase the monitoring interval. For example, if you increase the SNMP polling interval on a 48-port switch from 1 minute to 2 minutes, you halve the polling load for those items.
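          As a rough sketch of both ideas (the {#IFADMINSTATUS} macro is what the stock Zabbix interface-discovery templates expose; check which LLD macros your own templates actually provide):

          Discovery rule:   Network interfaces discovery
          Filter condition: {#IFADMINSTATUS} matches ^1$
          Item prototypes:  Update interval 1m -> 2m on non-critical ports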


          • packetdust
            Junior Member
            • Jan 2025
            • 1

            #6
            For what it’s worth, SNMP polling has always seemed to be a bottleneck in Zabbix for as long as I can remember. The release notes for version 7 mention a change to asynchronous SNMP polling, but I haven’t noticed any significant improvement in reducing the delays in proxies obtaining and delivering SNMP data to the Zabbix server.

            Here are some general considerations:
            1. PostgreSQL Tuning: How much tuning have you done on your PostgreSQL database? Proper optimization here can make a big difference (see the sketch after this list).
            2. Zabbix Server Configuration: Similarly, how much have you optimized the Zabbix server configuration? There are many settings that can support significant scaling, but any changes should be made incrementally and monitored closely to evaluate their impact.
            3. Proxies and Checks: It sounds like you’re running a lot of proxies. How many checks is the Zabbix server itself performing, versus those handled by proxies? In my experience with larger infrastructures, offloading everything to proxies (we use about 10) significantly reduced the server load and improved its responsiveness.
            4. Polling Frequencies: Reassess the polling intervals for your items. Are they set too frequently for certain use cases?
            5. Scope of Monitoring: Are you monitoring everything by default? Consider whether all the monitored items are necessary.
            6. Item History and Trends: Review your history and trends retention periods. For example, do you really need to keep 365 days of history for switch port utilization? Reducing retention for less critical data can alleviate storage and performance pressure.
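            On point 1, purely as an illustration of the kind of postgresql.conf parameters worth reviewing on a box like the one described (these are standard parameters, but every value below is a placeholder, not a recommendation):

            shared_buffers = 8GB                  # often sized around 25% of RAM on a dedicated DB host
            effective_cache_size = 24GB           # roughly what the OS page cache can hold
            max_wal_size = 8GB                    # larger values spread out checkpoint I/O
            checkpoint_completion_target = 0.9    # smooth checkpoint writes over the interval
            random_page_cost = 1.1                # SSD-class storage
            work_mem = 32MB                       # per-sort/hash memory; be careful with many connections

            Tools like PGTune can give you a starting point, but measure before and after each change.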


            • markfree
              Senior Member
              • Apr 2019
              • 868

              #7
              I would argue that this is not such a large environment, but it is just as relevant as any other.
              I handle some cases where each host can easily reach 11k+ items.
              So, the first thing I did when I started monitoring these types of devices was to remap all the relevant metrics and recreate the legacy template.

              Do your SNMP templates already use the newer SNMP walk method of data polling?
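              For anyone who hasn't seen it, the walk-based approach uses one master SNMP item per table and dependent items that pull individual values out of the result, roughly like this (the OIDs are just the standard IF-MIB octet counters, used here as an example):

              Master item key:              walk[1.3.6.1.2.1.2.2.1.10,1.3.6.1.2.1.2.2.1.16]
              Dependent item preprocessing: SNMP walk value -> 1.3.6.1.2.1.2.2.1.10.{#SNMPINDEX}

              One request fetches the whole table, instead of one get per port per metric.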

              Also, as guille.rodriguez pointed out, using default templates without any filter can lead to a bunch of unnecessary metrics filling up your DB.
              Usually, OotB templates provide some handy discovery filters, especially for switches and routers. If possible, configuring these filters can greatly reduce the load from hosts. Don't overlook overrides either.

              Adding more hardware is not always a solution to performance issues.
              It seems to me that your server and proxies may need some process and cache tuning.

              Organizing the environment for different roles may prove beneficial. For example, dedicating proxies to specific regions, data centers, device types, data collection types (passive or active), etc.

              Keep in mind that the Zabbix DB can be the main point of latency. Isolating and optimizing it is very important.
              Many DBMS provide load-balancing solutions...

              You can find some performance tuning tips in the forum.
              Last edited by markfree; 02-01-2025, 03:25.


              • Jason
                Senior Member
                • Nov 2007
                • 430

                #8
                As others have suggested, I'd start by looking long and hard at the templates and the setup on your hosts. I'd make sure bulk monitoring is enabled, as this makes a massive difference to proxy efficiency.
                Secondly, disable any items you don't need on the templates, and be quite ruthless about this. Unless you need an item for stats or reporting, disable it.
                For everything that's left, consider adding "discard unchanged with heartbeat" preprocessing, and set the heartbeat as long as you can, up to about a day. Anything over that and items will occasionally disappear from Latest data.
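                For anyone who hasn't used it, it's a per-item (or template-level) preprocessing step; roughly:

                Preprocessing step: Discard unchanged with heartbeat
                Parameter:          23h    # value is stored only when it changes, or at most once per heartbeat

                The step name and behaviour are the standard Zabbix ones; 23h is just an example that stays under the one-day limit mentioned above.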
                Split out your server functions and have a dedicated database server, Zabbix server and web frontend. Database-wise I've been really impressed with Postgres, especially when coupled with TimescaleDB. On each server, take time to tune for its specific function. The database will need as much RAM as you can throw at it to help with caching, along with fast disks; look at SSDs in RAID 10, possibly even a cluster if you can afford it.
                SNMP does seem to take up more resources on proxies than anything else, and especially when some large hosts go offline it can cause issues if it hasn't been carefully configured. I've yet to try 7 on our biggest setups, but I'm upgrading to it soon and looking forward to the improvements.


                • jjeff123
                  Member
                  • May 2022
                  • 33

                  #9
                  OP here.
                  I'm mostly doing the back end stuff, database and server setup. Templates are mostly other people's job.
                  But yes, we mostly have the default templates, which are gathering entirely too much data, and that needs to stop. I don't need stats on thousands of switch ports that have end user PCs attached.

                  I was cheap and just have 1 box for DB, web and server.
                  The original post was prior to us moving from a 9-year-old on-site server to Azure. Under Azure I've got better performance, which is great, but then I added enough hosts to push my VPS to 6200.

                  The biggest thing I did was DB tuning: I finally noticed today that I was getting WAL-forced checkpoints every 2-3 minutes. Changing max_wal_size from 4GB to 10GB eliminated that.
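                  For anyone checking for the same thing, a simple way to confirm checkpoints are being forced by WAL volume rather than by the timer is to enable checkpoint logging in postgresql.conf (a standard parameter; recent PostgreSQL versions log the trigger reason for each checkpoint):

                  log_checkpoints = on     # log lines then show "checkpoint starting: wal" (size-forced) vs "checkpoint starting: time"
                  max_wal_size = 10GB      # the change that stopped the size-forced checkpoints here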
                  Now my performance is reasonable, though I appreciate the tip about proxies.
                  I built a proxy image and deployed it, but discovered that a couple of my remote sites are large enough that I exceeded the configuration cache, and those sites always have thousands of items in the 5/10/30 second queue.
                  I thought that after fixing the configuration cache the queue issue would resolve itself, but no such luck. I've got 30K queued items from one proxy right now, even though that proxy is using only a tiny amount of CPU/memory.
                  I'll have to look at that next while I prod my folks to fix our templates.



                  • cyber
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional
                    • Dec 2006
                    • 4806

                    #10
                    How many pollers do you have on that proxy...? Default values will not work.


                    • jjeff123
                      Member
                      • May 2022
                      • 33

                      #11
                      For the proxy that's falling behind?

                      It has 1000 required VPS. Monitoring 94 devices, most of which are switch stacks, so on the order of 140K items.

                      Looking at the Zabbix proxy monitoring, there are no alerts on this proxy. I did bump up the CacheSize; originally it was at 128MB. But that was over a week ago and didn't really make any difference.
                      I'm also running an upgrade from 6.0, and the template is a 6.0 template without bulk SNMP queries.

                      CacheSize=256M
                      HistoryCacheSize=64M
                      HistoryIndexCacheSize=32M
                      ProxyMemoryBufferSize=128M
                      StartVMwareCollectors=1
                      VMwareFrequency=60
                      VMwarePerfFrequency=60
                      VMwareCacheSize=16M
                      ProxyOfflineBuffer=48
                      StartDiscoverers=3
                      StartPollers=10
                      StartSNMPPollers=12
                      StartPingers=5
                      StartPreprocessors=5

                      Last edited by jjeff123; 31-01-2025, 15:58.


                      • cyber
                        Senior Member
                        Zabbix Certified Specialist, Zabbix Certified Professional
                        • Dec 2006
                        • 4806

                        #12
                        OK... in v7 we have asynchronous pollers for SNMP, so those 12 might be fine. But I don't have a v7 with a big load, so I don't have a comparison. I have a network proxy with 240 hosts, 220k items and ~700 NVPS, so probably polling a bit less. You can look over the polling times there as well.

                        If you don't do any VMware monitoring through that proxy, you can always switch off the VMware collectors. Same with discoverers: if you're not doing network discoveries, don't start them up.
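                        In zabbix_proxy.conf terms that's just (both are standard proxy parameters; 0 means those processes are never started):

                        StartVMwareCollectors=0    # no VMware monitoring through this proxy
                        StartDiscoverers=0         # no network discovery rules run from this proxy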


                        • jjeff123
                          Member
                          • May 2022
                          • 33

                          #13
                          Responding to my own post so future people know how this worked out.

                          Spent considerable time tuning the database and Zabbix, mostly the database. Increasing max_wal_size, so that checkpoints were triggered by time instead of by WAL size, was a major factor.

                          The proxy with the high queue was just something goofy on that one box. Yes, it was my most heavily used proxy, but we fixed three things on it:
                          - apt-get update/upgrade: both the kernel and Zabbix to the latest 7.0.x release
                          - The NIC had both a static and a DHCP address on it, not sure how.
                          - Rebooted to fix the NIC issue and let the new kernel take effect.
                          After that all my queue problems on this one proxy went away.

                          The server in Azure worked great, much better than the on-site box. But eventually it also hit a bottleneck, and the issue was disk I/O.
                          I had built the server with the default standard SSD, assuming a modern Azure SSD would have far better performance than my on-site hardware. But the Azure standard SSD is limited to 500 IOPS and 100 MB/s.
                          And there's the issue: they drastically rate-limit the IOPS. A spinning disk will do 100-150 IOPS, but even a cheap, old SSD will be in the tens of thousands.
                          So the "standard SSD" in Azure should really be sold as a premium HDD.
                          Upgrading to premium SSD with 7500 IOPS has worked great.
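                          If anyone wants to sanity-check their own storage before and after a move like this, a quick random-write test with fio will expose that kind of IOPS cap (the path and sizes below are placeholders; 8k roughly matches the Postgres page size):

                          fio --name=dbdisk --filename=/var/lib/postgresql/fio.test --size=1G --rw=randwrite --bs=8k --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based

                          Just remember to delete the test file afterwards.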

                          System has 6400 VPS and performance is fine.
