**steeladept** · 08-04-2019, 22:12

All the documentation you are asking for is right here, between the official Zabbix documentation, the forum posts, and the wiki. A note of caution: the number of machines is largely immaterial. Sizing is really based on the number of monitored items, or values per second, you need. The documentation explains how to ballpark that based on what you want to monitor, so I recommend you start there.
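
As a rough illustration of that ballpark, each item contributes one value per update interval, so the required new values per second (NVPS) is approximately the sum of item count divided by interval. This is a minimal sketch; the host counts, item counts, and intervals below are made-up illustration values, not figures from this thread.

```python
# Rough NVPS estimate: each item yields one value per update interval,
# so it contributes 1/interval values per second.

def estimate_nvps(item_groups):
    """item_groups: iterable of (item_count, update_interval_seconds)."""
    return sum(count / interval for count, interval in item_groups)

# Hypothetical workload, for illustration only.
example_groups = [
    (1000 * 2, 60),    # 1000 hosts, 2 ICMP items each, polled every 60 s
    (1000 * 10, 300),  # 1000 hosts, 10 SNMP items each, polled every 300 s
]

nvps = estimate_nvps(example_groups)
print(f"{nvps:.2f} new values per second")
```

Lengthening an interval cuts the load in direct proportion, which is why trimming items and polling frequency matters more than the raw host count.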

**Mocho** · 10-04-2019, 17:24

Hi,

I'm looking at replicating on Zabbix my current monitoring system, which handles >150000 nodes.
I'm running a stress test now, importing all the online nodes into a Zabbix VM to see how it performs.
I've been tweaking as I go and I'm almost capping my VM resources, so I will probably migrate to a proper server, but so far I'm impressed with Zabbix's behavior.
There's a lot of item trimming and grooming that I still need to do, but as a starting point I've got device/node type grouping, with specific templates per type, plus a few cloned templates where I need different types of SNMP data.

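This is my current dashboard, and it is growing fast. It will eventually grow to >150000 hosts (scary).

System information

| Parameter | Value | Details |
|---|---|---|
| Zabbix server is running | Yes | 192.168.56.10:10051 |
| Number of hosts (enabled/disabled/templates) | 35328 | 35231 / 0 / 97 |
| Number of items (enabled/disabled/not supported) | 3341156 | 3339811 / 0 / 1345 |
| Number of triggers (enabled/disabled [problem/ok]) | 1487115 | 1487115 / 0 [7357 / 1479758] |
| Number of users (online) | 3 | 2 |
| Required server performance, new values per second | 17294.41 | |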

I had to increase the PHP memory_limit (for obvious reasons) in zabbix.conf.
I also increased several settings in zabbix_server.conf step by step, 512MB at a time, whenever it cried for more.
I really need to trim down the amount of #### I'm polling (SNMP-wise).

I'll try to follow up on this whenever I can, I'm basically trying to find a proper cooking recipe for my monitoring demands.

Current user custom settings in zabbix_server.conf are:

StartPollers=100
StartPollersUnreachable=5
StartPingers=100
StartDiscoverers=10
CacheSize=4G //Might have gone overboard with this one ^_^' first time it screamed for more I gave it 4GB
StartDBSyncers=6
HistoryCacheSize=512M
HistoryIndexCacheSize=512M
TrendCacheSize=512M
ValueCacheSize=4G
UnreachablePeriod=60
UnavailableDelay=120
UnreachableDelay=30

My VM has 10GB of RAM and 4 cores and is in constant meltdown, but I'm pushing it.
I have also developed a 2-way ticketing plugin that binds ticketid and eventid both ways (operation/action with ack, plus recovery action), along with autonomous host import driven by provisioning details and parameter changes.
Keep in mind, though, that this is not a proper architecture, since I've got everything on the same machine (zabbix + mysql on 1x CentOS server).
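
To see why this VM is under memory pressure, the cache settings listed above can be totalled against the 10GB of RAM. The sketch below only does that arithmetic. It assumes the suffixes mean binary units and treats the configured sizes as upper bounds, since actual resident usage may be lower; it also ignores the memory used by MySQL, the pollers, and the frontend on the same box.

```python
# Total the cache sizes quoted in the post and compare with the VM's RAM.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(value):
    """Convert a size such as '512M' or '4G' to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)

# Cache-related lines copied from the settings in the post.
config_text = """
CacheSize=4G
HistoryCacheSize=512M
HistoryIndexCacheSize=512M
TrendCacheSize=512M
ValueCacheSize=4G
"""

caches = {}
for line in config_text.split():
    key, _, value = line.partition("=")
    caches[key] = parse_size(value)

total_gib = sum(caches.values()) / 1024**3
vm_ram_gib = 10  # from the post; treated as GiB here (assumption)
print(f"configured caches: {total_gib:.1f} GiB of {vm_ram_gib} GiB RAM")
```

With the caches alone configured at close to the whole VM, there is very little headroom left for the database and the server processes.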

Let us know what and how you're cooking please!
Thanks
Best regards

**steeladept** · 11-04-2019, 15:00

Mocho - that is quite impressive that you got to that vps number without more issues. I am curious if you have been having any database issues or did you partition it as suggested elsewhere? I have got to assume you partitioned it, but just looking to confirm. Also, how many proxies are you running? I run 10 of them and am only pushing 2200 vps (though to be fair, they are more to continue monitoring in the event of a site to site network outage far more than to offload work from the application server). I am also curious if that would run into far more issues if you started doing more active monitoring using Zabbix Agents.

As for my configuration, it is much more modest:

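| Parameter | Value | Details |
|---|---|---|
| Zabbix server is running | Yes | localhost:10051 |
| Number of hosts (enabled/disabled/templates) | 1030 | 946 / 0 / 84 |
| Number of items (enabled/disabled/not supported) | 137538 | 118771 / 37 / 18730 |
| Number of triggers (enabled/disabled [problem/ok]) | 68555 | 63919 / 4636 [2054 / 61865] |
| Number of users (online) | 23 | 2 |
| Required server performance, new values per second | 2147.13 | |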

This is a production environment using 100% Zabbix Agents for server monitoring, though we will be bringing in SNMP devices eventually for network monitoring as well (mostly for event correlation).

I have this broken out across 10 proxies, as I already stated, with a separate MySQL server and the application and front-end servers both running on the same box. These boxes are intentionally small and distributed, to take advantage of the VMware performance profile. However, even at that small size, the only issues I have run across in my environment came from the cache sizes not being big enough; the box configurations were more than fine. Once I configured my cache sizes and start processes, things started running smoothly.

**Mocho** · 11-04-2019, 16:54

Hi steeladept - Thanks for your feedback, much appreciated.
As expected, even after increasing the memory_limit (and adding hardware memory as well) and a couple of extra cores, I hit the ceiling, and swap is killing performance and everything else.
But again, this was a stress test, my first time installing and testing Zabbix.
I have too many items (SNMP), so I'm downsizing to just enough to cover each exotic monitoring demand (which will still include both SNMP and ICMP), but it was fun to see it grow and start to melt down.
I'm also trying to figure out what will be the best combination of templates/applications/items for our purposes.
When I'm able to replicate a small-scale setup that covers all the needs, I'll migrate this to proper staging hardware virtualization and also start looking into partitioning the monitoring, maybe by country (not sure yet), and start importing more nodes.
I still need to go through the agent and proxy documentation.
I'm curious whether I'll be able to have a single type of agent for all flavours without extra config needs per site.
The main issue I see with the proxies is that it will require a lot of hardware and instances to cover all the nodes.
I have to cover a load of snmp data for around 155K hosts and growing.
I'll try to keep this thread running as I move forward and whenever possible.
Thanks again!
Best regards

**warp10** · 29-06-2019, 15:50

Hi Mocho, how many templates do you use for each server you monitor?

**Jason** · 01-07-2019, 09:38

You will almost certainly need to design your own templates, or at the very least heavily customise copies of the ones provided.

It pays to put the time into examining carefully what you're monitoring from each template and how frequently you monitor those items. If you don't need to monitor something, then don't. For the items that you do monitor, ask how quickly you need to know about any problems.

**steeladept** · 01-07-2019, 14:08

I would agree with Jason. I started with basic templates but I then recreated them, heavily modified to meet our needs, and use those. Currently I use only 2 or 3 templates per machine, but they are nested, so a template contains a template type of thing. This has caused me minor troubles in the past, as sometimes I don't want to monitor the included template on a specific machine, but breaking it out is a pain, so I don't suggest doing it the way I did. You can add all your templates separately to each machine, if you want, and I learned that is usually the better way to go.

**Mocho** · 04-07-2019, 17:26

Hi warp10, steeladept, Jason,

That was a stress test (my first Zabbix setup test).

I've now moved away from the previous VM and migrated to a proper PVS environment, separating DB and APP, with no Zabbix proxies.
I also cleaned out the load of items that were overhead for our purposes.

Right now I've got grouping per device type, as well as 1 template per device type with 2 applications (ICMP and SNMP). The template with the most items is for access controllers, with 11 items; all the others are downsized to 1 or 2 metrics. Plus the Zabbix self-monitoring ones.
That comes down to 171854 items for a total of 149747 nodes/hosts.
It is very stable right now, even with per-node actions, including recovery operations and trigger correlations. Automation is working fine both ways, populating the eventid in JSD and the ticketID in Zabbix.
I also managed to automate the node imports (create, update, delete) through the Zabbix API, adding crucial info in the tags (licensing, devtype, etc.), and integrated it with the licensing and provisioning APIs.
Very happy with the results so far. I also ran a stress test on this env, blocking all access to simulate a major outage, and Zabbix didn't crash, despite sending a huge load of tickets to a Jira staging env.
All custom automations and external scripting were made with nodejs (maybe not the best option, but the one in which I could build a proof of concept fastest, including integrating with all the different APIs plus the Zabbix API).
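
For readers wondering what an automated node import with tags might look like, here is a minimal sketch of a `host.create` JSON-RPC request body. It is written in Python rather than the nodejs used in the post, and it only builds the payload without sending it. The field layout follows the Zabbix 4.x-era API as I understand it (newer versions expect extra SNMP interface details and handle the auth token differently), so check it against the API reference for your version. The host name, IP, IDs, tag values, and token are all placeholders.

```python
import json

def build_host_create(host, ip, group_id, template_id, tags, auth_token, request_id=1):
    """Build a JSON-RPC body for the Zabbix API method host.create."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": host,
            "interfaces": [{
                "type": 2,       # 2 = SNMP interface
                "main": 1,       # default interface of this type
                "useip": 1,      # connect by IP rather than DNS name
                "ip": ip,
                "dns": "",
                "port": "161",
            }],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
            # Host tags carry the provisioning info (devtype, licensing, ...).
            "tags": [{"tag": name, "value": value} for name, value in tags.items()],
        },
        "auth": auth_token,
        "id": request_id,
    }

# Placeholder values, for illustration only.
payload = build_host_create(
    host="node-0001",
    ip="192.0.2.10",
    group_id="2",
    template_id="10001",
    tags={"devtype": "access-controller", "licensing": "standard"},
    auth_token="<token from user.login>",
)
print(json.dumps(payload, indent=2))
```

The same builder can be reused for update and delete flows by swapping the method name and params, which keeps the import script driven entirely by the provisioning data.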

The approach was as steeladept and Jason mentioned. I started with Zabbix out of the box templates, then started downsizing them which was a good exercise.
In the end I made my own custom templates, which for now are more than enough to cover all the current business rules and alerting needs.

Many thanks
Best regards

**kloczek** · 04-07-2019, 19:16

170k items is not large-scale monitoring today. Currently it is at the bottom of the mid scale.
With that number combined with only 500-600 NVPS, I would say that it is even at the bottom of the small scale.

Additionally, the graph shows housekeeper activity (yellow line), which means that you are not using partitioned history* and trends* tables. If you want to improve your monitoring stack, start from that point.
However, with 500-600 NVPS the housekeeper should still be fine.

**[email protected]** · 15-07-2024, 13:06

Interesting post to follow. Wondering if you have had any experience with API performance yet?

Large scale setup documentation