**elkor** · 28-10-2005, 22:47

wow. now THAT's a large deploy.

the following is pure conjecture:

Obviously, you're going to have to scale the hardware appropriately. Your database will be VERY busy. You are almost certainly going to have to do at least some active checks (which currently have an issue with win32 systems) in order to cut down on tcp connections.

Templates, which you will absolutely need for managing this many servers, have some issues right now. They are all frontend PHP issues though, and I wouldn't let that stop you; you could always implement most of that functionality with shell scripts if it isn't repaired to your satisfaction.

as a point of reference: I'm looking at ~200 machines right now and I'm running just fine on a Dell 2850 dual Xeon box w/ a single 300 GB disk for the db and a separate spindle for the OS (well, mirrors of course, but you get the idea). this system is well within my specs for maintaining trends for 365 days for all these systems... maybe 15 items each, give or take 10.
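To put rough numbers on that retention figure, here is a minimal sketch of the row count it implies. It assumes every item is numeric and that trends are stored as one aggregated row per item per hour; the function name `trend_rows` is made up for illustration, and the 5/15/25 range is just "15 give or take 10".

```python
HOURS_PER_DAY = 24  # assumption: one trend row per item per hour


def trend_rows(hosts: int, items_per_host: int, retention_days: int) -> int:
    """Estimate how many trend rows accumulate over the retention period."""
    return hosts * items_per_host * HOURS_PER_DAY * retention_days


if __name__ == "__main__":
    # ~200 machines, 365 days of trends, 15 items each give or take 10
    for items in (5, 15, 25):
        print(f"{items:>2} items/host: {trend_rows(200, items, 365):,} rows")
```

This counts rows only; bytes per row depend on the database and its indexes, so no disk size is estimated here.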

another point of contention you may run into with the current frontend: in several sections it loads all available hosts (and sometimes items by host) into dropdown boxes. With an installation this large it may make some page loads really, really slow. Again, it's only php so it can be worked around, and if templates work correctly you can minimize this impact, as most of these screens have to do with per-host configuration of one sort or the other.

**keitch** · 20-06-2008, 18:13

large scale monitoring with Zabbix?

Hi

I was wondering if someone has more answers and experience with this original question from egironda. The original question is also more than 1 year old.

How large an installation can I monitor with Zabbix 1.4? And by monitor I mean more than just ping availability, but detailed monitoring of processes and CPU, memory, disk.

Is 200 large? Is 500/1000/2000 servers too large? Are there users here with that experience?

Thanks
Keitch

**Alexei** · 20-06-2008, 18:40

Originally posted by keitch

Is 200 large? Is 500/1000/2000 servers too large? Are there users here with that experience?

Number of servers is not that important, number of checks per second is!

**keitch** · 23-06-2008, 17:51

Originally posted by Alexei

Number of servers is not that important, number of checks per second is!

Understood - hence I mentioned that I mean more than just simple ping monitoring, but more robust performance monitoring and health checks.

So let's say I have 1000 servers (many of them virtual machines), all of which need the following every 5 min:

server uptime
Disk usage in Total (MB), Free (MB), Used%.
Memory usage and memory paging
CPU% broken down by user/system/wait/idle
process state, process memory usage, process CPU usage
monitor services expected running state and automatically start, stop, restart (for example 3 services per server)

On 300 of those servers, additionally monitor the application log files every 5 min (for example 3 logfiles per server).

Plus setting thresholds for alarms and storing historical data for reporting and service level calculations.

Does anyone have experience with Zabbix for larger installations (not in theory, but real)?

**xs-** · 23-06-2008, 23:58

Ok, i am working on 2 zabbix environments, one is live and has about 700 hosts + fair amount of items per hosts, the other environment is in design phase and will have a lot of hosts (i expect 2-5k, distributed). Below are some of the things we have thought of so far, nothing conclusive yet but i'm pretty sure it can work.

Check my post in the troubleshooting forum (http://www.zabbix.com/forum/showthread.php?p=34849). It has some generic 'tips and tricks' for larger environments.

For really large environments:

It's mostly a good pick of database with the correct hardware (and a knowledgeable dba to config it).
Your main problems are the indexes on the history and trends tables (and events if you have lots of those). Search the forums for this issue; zabbix is really heavy on indexes in large environments, prepare for this!
- With Oracle you can do some tricks with automated index rebuilds on separate rotating tablespaces (or something, i was discussing this with one of our oracle dba's).
- You could also use a clustered mysql5 for the zabbix_server (would probably take some patching of zabbix_server to work with a clustered mysql backend, would be worth exploring i think) and a r/o replica for stuff like sla viewing, trigger viewing, history, graphs, etc.
You would also want to use 1.5.x/1.6 with the proxy setup. Have all servers communicate with the proxies and the proxies with either 1 main node or set of distributed nodes. Make a layered (pyramid) setup to divide the net connection load (agent->server) so your main node only receives history updates (bulk updates, less heavy).
Another good thing about this is that you can create maintenance windows on your main server because of the proxies/dm.
Search forum for methods to lessen the load on your server like:
- Use active agents, period!
- zabbix_server.conf tuning: not too few worker threads, not too many (each worker thread is 1 concurrent connection to your db, think heavy LAMP server!)
- use the active agent where possible (i.e. simple check for remote port status or agent for local port status). This has multiple advantages.
Again be realistic with what you collect and your intervals, there is no way a 5 minute interval is useful for disk monitoring. Some examples:
- Disk space and/or inode usage: 15min-30min
- Current time (for time sync check) 1hour
- Do you really need host based network statistics in your monitoring environment? (instead of switch based).
- You don't want cpu states (nice,sys,usr,idle,io)! you want runqueue size!
- Don't collect double info!
  Either pick disk free or used, in real or percent value.
- Always keep in mind that if something happens, you don't need the monitoring tool to tell you all the details. You want it to tell you which component has failed where; you will go and look for yourself anyway!
Use templates, make concessions!
Don't go all out at once, make templates for your generic servers/service deployments. If the items / triggers are too specific they are irritating, the item/trigger does not matter to begin with.
Server/service specific information can be added afterwards, which will leave you with a nice clean setup with useful triggers/info.
Make templates specific enough!
If zabbix sees an item is not supported (especially with snmp) it will not show the item, but it will keep trying (i.e. hw monitoring). These 'misses' in info collection can add up to 10%+, which is 10% useless system utilization.

Zabbix uses a relational database; as Alexei states, it's all about the db operations per second!

**mdouhan** · 08-07-2008, 13:05

How does zabbix scale

Hey

We are monitoring close to 2000 sites

We monitor servers, and network equipment

at the moment we have way over 2000 hosts defined, each with several checks.

we have over 2 million events

things are going just fine, our final aim is 30.000 hosts with at least 200.000 items in the final installation.

Our biggest holdback atm is IT Services not working correctly and the performance of log file parsing ( cannot wait for 1.6 )

**disgruntleddutch** · 10-07-2008, 07:44

Originally posted by mdouhan

Hey

We are monitoring close to 2000 sites

We monitor servers, and network equipment

at the moment we have way over 2000 hosts defined, each with several checks.

we have over 2 million events

things are going just fine, our final aim is 30.000 hosts with at least 200.000 items in the final installation.

Our biggest holdback atm is IT Services not working correctly and the performance of log file parsing ( cannot wait for 1.6 )

I'd like to know more. What hardware are you using? Are these multiple physical environments? Are you using distributed monitoring?

**lamont** · 13-07-2008, 22:34

I'm working on solving the 2,000-host problem.

As an example of scale, right now I've got 351 hosts on one zabbix server with 8940 total items with 60 second sampling intervals. I've got this setup on a server with 4-cores (i think dual-socket dual-core AMD) with 6 15k 2.5" SAS drives (HP DL365). This runs well, but CPU for the mysqld process is starting to get aggressive (>=50% on a single proc pretty constantly), but I/O is perfectly beautiful.
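As a back-of-the-envelope check on those figures, the load can be expressed as new values per second, which is the number that matters for the database. This is only a sketch: `values_per_second` is a hypothetical helper, and the 2,000-host projection assumes the same item density and the same 60 second interval as the 351-host setup, which may not hold in practice.

```python
def values_per_second(items: int, interval_seconds: float) -> float:
    """Average number of new values the server has to store each second."""
    return items / interval_seconds


if __name__ == "__main__":
    hosts, items, interval = 351, 8940, 60

    current = values_per_second(items, interval)
    print(f"current load: {current:.1f} values/s")

    # Assumption: 2,000 hosts with the same items-per-host ratio and interval.
    projected_items = items / hosts * 2000
    projected = values_per_second(projected_items, interval)
    print(f"projected for 2000 hosts: {projected:.1f} values/s")
```

Each stored value also touches the history indexes, so the real write load on mysqld is higher than the raw rate printed here.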

We have basically abandoned the idea of doing a zabbix-style distributed model and getting all the datapoints into a single mysql database. The current way we are pushing towards is to slice the hosts up into segments (we've got multiple different business units with different lines of business which are fairly separated, so this should work well) and then to consolidate the IT services into a single consolidated display. I wrote a quick script to remotely return the status of an IT service on a zabbix server so that I can (on paper anyway) remotely poll a server and display its IT services on a centralized display. Right now the idea is to just take a centralized zabbix server and mirror all the IT services on all the "leaf" zabbix servers, with a web walkthrough on the centralized server to hit the URL on the leaf server to track its status changes. If the centralized zabbix server can actually stand up to that, and if we can do something about the rendering page weight of the IT services monitoring page, that would be a solution. I expect that we're going to design our own centralized dashboard, however.

We tried the zabbix distributed model and it failed badly. We haven't tried the 1.5.x distributed model, but in general the zabbix "distributed" model is way too centralized. We don't want a single database with all the hosts and all the items in it. Doing that real-time is just way too much I/O and the mysql database inevitably gets overloaded and sooner or later corrupts itself after it has been abused enough. And having a non-real-time database with all the hosts/items in it doesn't solve any problems that we have.

What would be better would just be a consolidated dashboard pooling all the it services from all the leaf servers in one display, and then tools for ensuring that templates/triggers/actions/etc are synchronized and standardized amongst all the leaf servers. Also, it would be nice to be able to customize the IT services display to make it have multiple columns and only display red/green status for top-level IT services. We have tried putting the IT services display up on a big widescreen in the NOC and it winds up trailing off the bottom edge of the display.

Also, there really needs to be a REST-ish/XML-ey webservices interface in addition to the php interface. The ability to download and upload XML content is a step in the right direction, but having all the actions you can take from the php/html interface mirrored in an XML web services API would be extremely powerful.

A lot of the management features like templates and the XML downloads and uploads are good, but they're inherently limited. If there was a programmatic API I could just remotely query whatever I was interested in, process it, and then upload configuration deltas. Setting up hundreds of web services with associated triggers and IT services, for example, is just painful through the php interface. Directly inserting the SQL can be a little bit dangerous (I just added a bunch of discovery rules today via SQL and now that I think about it I forgot to update the ids table... have to fix that now...)
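The "forgot to update the ids table" hazard can be illustrated with a toy example. The sketch below does not use the real Zabbix schema: the `toy_rules` and `ids` tables, their columns, and the convention that `nextid` holds the last id handed out are all assumptions made for illustration, using an in-memory SQLite database. The point is only that the row insert and the counter update belong in one transaction, so one cannot happen without the other.

```python
import sqlite3


def make_toy_db() -> sqlite3.Connection:
    """Build an in-memory database with a made-up rules table and id counter."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE toy_rules (ruleid INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE ids (
            table_name TEXT NOT NULL,
            field_name TEXT NOT NULL,
            nextid     INTEGER NOT NULL,  -- toy convention: last id handed out
            PRIMARY KEY (table_name, field_name)
        );
        INSERT INTO ids VALUES ('toy_rules', 'ruleid', 0);
        """
    )
    return conn


def add_rule(conn: sqlite3.Connection, name: str) -> int:
    """Insert a rule and bump the counter atomically; return the new id."""
    key = ("toy_rules", "ruleid")
    with conn:  # commits both statements together, or rolls both back
        (last_id,) = conn.execute(
            "SELECT nextid FROM ids WHERE table_name = ? AND field_name = ?", key
        ).fetchone()
        new_id = last_id + 1
        conn.execute(
            "INSERT INTO toy_rules (ruleid, name) VALUES (?, ?)", (new_id, name)
        )
        conn.execute(
            "UPDATE ids SET nextid = ? WHERE table_name = ? AND field_name = ?",
            (new_id, *key),
        )
    return new_id


if __name__ == "__main__":
    db = make_toy_db()
    print(add_rule(db, "scan office subnet"))
    print(add_rule(db, "scan dmz subnet"))
```

Anything scripted against a live Zabbix database should be checked against the actual schema of the installed version first.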

**r3dn3ck** · 04-08-2008, 18:34

I'm bringing this back from the dead.. forgive it.

I'm currently doing a very large installation and any links to examples/instructions/discussions of a federated install would be groovy.

I currently have 1650+ hosts, 500 SNMP devices, 173,000+ items, 70+ screens, ~1500 graphs and more than 150 maps. I have up to this point been purely working out the organizational layout for the hosts/items/groups/templates/actions.

Now comes the need to work out the actual installation. I have 3 datacenters with high speed links between them. I'm trying to see if we can have a local slave server which collects data at each site and would update a central master server or otherwise allow a unified view. I would like to get any pointers on how the zabbix-server component needs to be set up to support this. The manual is, as usual, a bit light on the details.

**mdouhan** · 06-08-2008, 15:53

large installs

it will depend on how often you configure your checks to poll for data.

We have now grown a lot already, and the type of solution that you are talking about should be possible with a single server. We are not quite there yet but we will be very soon. We are very successful with a suse enterprise server running 2 quad cpus and 16 GB RAM; we are looking to upgrade to 4 quad cpus, but the memory is rather fine. Also, in 1.6 there is some talk about zabbix proxy, which we find very interesting.

**r3dn3ck** · 06-08-2008, 16:50

Originally posted by mdouhan

it will depend on how often you configure your checks to poll for data.

We have now grown a lot already, and the type of solution that you are talking about should be possible with a single server. We are not quite there yet but we will be very soon. We are very successful with a suse enterprise server running 2 quad cpus and 16 GB RAM; we are looking to upgrade to 4 quad cpus, but the memory is rather fine. Also, in 1.6 there is some talk about zabbix proxy, which we find very interesting.

I've got it on a dual quad core with 4GB RAM. Could possibly use more RAM. I'll look into that.

Probably 70% of the items are set to report every 60 seconds. Another 20%-ish are every 5 minutes. The balance are roughly split between 30 seconds and 1 hour.
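For a sense of what that mix means in database terms, here is a rough sketch that turns it into values per second. It takes the 173,000 items mentioned earlier and assumes the "balance" is an even 5%/5% split between 30 second and 1 hour intervals; the names `INTERVAL_SHARES` and `load_by_interval` are made up for illustration.

```python
TOTAL_ITEMS = 173_000

# interval in seconds -> share of all items (assumed even split of the balance)
INTERVAL_SHARES = {
    30: 0.05,
    60: 0.70,
    300: 0.20,
    3600: 0.05,
}


def load_by_interval(total_items: int, shares: dict[int, float]) -> dict[int, float]:
    """Return values per second contributed by each polling interval."""
    return {
        interval: total_items * share / interval
        for interval, share in shares.items()
    }


if __name__ == "__main__":
    loads = load_by_interval(TOTAL_ITEMS, INTERVAL_SHARES)
    for interval, rate in loads.items():
        print(f"{interval:>5}s interval: {rate:8.1f} values/s")
    print(f"total: {sum(loads.values()):.1f} values/s")
```

Under these assumptions the 60 second items account for the large majority of the write load, so lengthening even part of them would matter far more than tuning the hourly checks.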

A big concern is the MySQL DB performance with all those transactions especially as it directly affects the performance of the web interface. We'll have an average of ~20 people logged in pretty much around the clock using the app after go-live. As it sits when I update a template the whole thing crawls for a while (mostly because the DB is running off a single DAS spindle while I'm fiddling with the bits. We'll move it to faster storage before go-live). Once we figure out which way we're going with the final DB storage that should hopefully get better.

On that note: Has anyone tried NFS mounting the file-system that holds the DB? What kind of performance effects did you see? I'm talking about NFS from an appliance like a NetApp or similar, not a random linux box.

Thoughts?

**nelsonab** · 07-08-2008, 23:34

Originally posted by r3dn3ck

On that note: Has anyone tried NFS mounting the file-system that holds the DB? What kind of performance effects did you see? I'm talking about NFS from an appliance like a NetApp or similar, not a random linux box.

Thoughts?

I chatted with one of the guys at work with some storage experience, he said you'll need to do your research on tuning. This includes the client to the NAS. You'll also want to look at tuning MySQL for this as well. He did say it's very common and works quite well if you tune everything correctly.

Also for those *really* concerned with IO performance I was at Linux World yesterday and chatted with the CTO from FusionIO http://www.fusionio.com/. They have a NAND Flash based PCIe storage card with 80 or 300GB which can deliver about 120,000 IO operations a second. He was also telling me they can quite easily saturate the North Bridge on Intel architecture. If anyone remembers Black Dog Linux, this is the technology that came out of that. COOL STUFF! They were streaming I believe 1000 DVD clips from their storage device at the same time on a few widescreens in real time. Their product only runs on Linux, Windows is expected "soon".

Another one to consider for the performance minded: Violin Memory http://www.violin-memory.com/.

The violin is a little more pricey than FusionIO, and likely WAY overkill for Zabbix, but that does not mean I can't figure out a need to "performance test Zabbix". ;-)

**Alexei** · 09-08-2008, 20:39

Very interesting thread!

By the way, we are building large ZABBIX setup for pre-1.6 testing purposes (>10K of hosts, 20 Proxies, >100K of items). I am looking forward to nailing down performance-related problems prior to release of 1.6.

**xs-** · 11-08-2008, 12:18

@Alexei
I hope you will be writing some sort of report / story / recommendations on these tests. Best practices, why certain choices were made, etc.

Monitoring large installations?