How to monitor 1000 servers in three datacenters with Zabbix?

  • Mirza
    Junior Member
    • Mar 2005
    • 7

    #1

    How to monitor 1000 servers in three datacenters with Zabbix?

    Hi,


    I would like to replace our Nagios-, MRTG- and OTRS-based solutions with Zabbix.

    At the moment we have three datacenters with a gigabit backbone and 350 servers each, with a 25% increase per year. The pre-production environment is connected with a gigabit and a 155 Mbit leased line.

    To provide 24x7 monitoring without any interruptions I need to size the solution properly.

    First I will need a Linux cluster. This may be Heartbeat, Red Hat Cluster Server, SteelEye's LifeKeeper or HP MC/ServiceGuard. Two possible strategies:
    • Place the cluster in the pre-production environment - so I may use shared storage (SCSI- or FC-based) and store the database on it.
    • Distribute the cluster across the datacenters and keep local copies of the database in each center.


    The second solution would be the best one to avoid monitoring problems during network outages. But I would need a Zabbix frontend which is able to collect data from three or more different SQL databases and aggregate the information.

    Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system. I thought about tuning the filesystem parameters and mounting the filesystem in "sync" mode - but that is not reliable enough. So would PostgreSQL or Oracle, using raw devices and logs, be the better choice...?

    How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?

    Does anyone have experience with such big environments?

    Is Zabbix able to schedule and run all items and triggers within 30 seconds?

    Thanks, Oliver
  • LEM
    Senior Member
    Zabbix Certified Specialist
    • Sep 2004
    • 112

    #2
    Originally posted by Mirza
    Hi,
    Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system.
    Have you envisaged the MySQL Cluster architecture?


    How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?
    You should consider using this (wannabe) sizer (for the 1.0 release): Zabbix (1.0) sizer.

    Does anyone have experience with such big environments?
    For us, for now, we run only one central monitoring point for about 250 servers and 200 network elements (switches/routers) in one datacenter and about 100 remote locations spread over a private WAN.

    The planned infrastructure (because of the IT landscape's growth) is a 'simple' Linux cluster using the MySQL clustering architecture, but for now (and for the past 5 months) we have been using a single old computer (2x P3 1 GHz, 2 GB RAM, Debian) running all parts (www, zabbix, mysql) pretty well (load average 2.5 most of the time, response time OK for now, database size about 4 GB) with about 30 triggers per server, 3 triggers per network element, and a retention time of 90 days.

    Is Zabbix able to schedule and run all items and triggers within 30 seconds?
    YMMV: it depends on your number of triggers per host (and ultimately, the number of checks per 30 seconds).
    --
    LEM

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      Originally posted by Mirza
      First I will need a Linux cluster. This may be Heartbeat, Red Hat Cluster Server, SteelEye's LifeKeeper or HP MC/ServiceGuard. Two possible strategies:


      • Place the cluster in the pre-production environment - so I may use shared storage (SCSI- or FC-based) and store the database on it.
      • Distribute the cluster across the datacenters and keep local copies of the database in each center.
      I think that the first option is the best alternative currently. I say 'currently' because ZABBIX doesn't support distributed monitoring yet, so having three servers reporting to one with a customised(?) front-end could be difficult from a maintenance point of view. The reliability of the connections between the datacenters is the key argument for the decision.

      Alternatively, you may have three independent ZABBIX servers, one per location.

      Originally posted by Mirza

      Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system. I thought about tuning the filesystem parameters and mounting the filesystem in "sync" mode - but that is not reliable enough. So would PostgreSQL or Oracle, using raw devices and logs, be the better choice...?
      Yes, filesystem buffers will be lost. But is the loss of hundreds of metrics really business critical? I don't think so.

      Crashes are not something that happens often. I'd be more concerned about database integrity (short recovery time) and fast switch-over. So a solution with shared storage (HP ServiceGuard, for example) seems preferable in this case.

      Both PostgreSQL and Oracle are not fast enough. I'd suggest using MySQL for your environment.

      Originally posted by Mirza
      How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?
      It all depends on the number of checks per second and the retention period for detailed history and trend data. I'm confident that 300 GB is more than enough for three months.

      I'd start with a 2-CPU setup with at least 1 GB of RAM for database usage. ZABBIX doesn't use much CPU power and uses virtually no RAM; the database does. Fast disk I/O is very important.

      Again, take it with a grain of salt; it really depends on the number of checks. It could even turn out that a 1-CPU server with 512 MB is enough for you if you plan to run only a few (up to 10) checks per server every 30 seconds.
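
      A rough back-of-envelope sketch of how such an estimate works out, using the lighter case mentioned above (about 10 checks per server every 30 seconds). The ~50 bytes per stored value is an assumed average row size for illustration, not an official ZABBIX figure; the result scales linearly with the number of items, the check interval and the retention period.

      # history_estimate.py - rough sizing sketch, all figures approximate
      def history_size_gb(servers, items_per_server, interval_s,
                          retention_days, bytes_per_value=50):
          """Estimate raw history storage for a given check rate and retention."""
          checks_per_second = servers * items_per_server / interval_s
          values = checks_per_second * 86400 * retention_days
          return checks_per_second, values * bytes_per_value / 1024**3

      # 1000 servers, 10 checks each, every 30 seconds, kept for three months
      rate, gb = history_size_gb(1000, 10, 30, 90)
      print(f"{rate:.0f} checks/second, about {gb:.0f} GB of history")   # ~333 checks/second, ~120 GB
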
      Originally posted by Mirza
      Does anyone have experience with such big environments?
      Check this: http://www.zabbix.com/forum/showthread.php?t=77
      Originally posted by Mirza

      Is Zabbix able to schedule and run all items and triggers within 30 seconds?
      Yes, it is!
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga
      My Twitter

      • Mirza
        Junior Member
        • Mar 2005
        • 7

        #4
        Originally posted by LEM
        Have you envisaged the MySQL Cluster architecture?
        I did - in fact we have the largest MySQL cluster (Master-Slave) in Europe.

        Well, I know about EAC, MySQL Cluster and MySQL master-slave replication. EAC has problems concerning an up-to-date MySQL version, MySQL Cluster is not very common or widely tested, and MySQL master-slave setups - well, ugly replication drift is one thing to be aware of, and even if the master dies, data may disappear.

        Thanks, I will check the sizer too.

        Mirza

        • Mirza
          Junior Member
          • Mar 2005
          • 7

          #5
          Originally posted by Alexei
           I think that the first option is the best alternative currently. I say 'currently' because ZABBIX doesn't support distributed monitoring yet, so having three servers reporting to one with a customised(?) front-end could be difficult from a maintenance point of view. The reliability of the connections between the datacenters is the key argument for the decision.
           Well, another approach: I would have a single clustered MySQL database, but three Zabbix servers writing data into it. A fourth Zabbix server would be configured for read-only access only, to display data; configuration would be done on the remote Zabbix servers.

           What happens when the Zabbix server loses the connection to the database? Will data be stored locally (I assume not)?

           Well, I could add a local database in each datacenter and do a replication to the central database...

           Alternatively, you may have three independent ZABBIX servers, one per location.
           In this case I lose my overview. As in any clustered environment, only the SLAs of the overall solution are interesting - if one server fails, it does not matter.

           Yes, filesystem buffers will be lost. But is the loss of hundreds of metrics really business critical? I don't think so.
           That depends on the business you are working in. If I have to pay a large amount of money to my customers whenever I miss my SLAs, I will definitely be interested in every single data point for at least 9 months.

           If it's my private server, I may lose data for a couple of hours - no problem.

           Implementing a monitoring solution in a datacenter is more than the technical solution - customers, fees, penalty fees etc. take more resources than the solution itself.

           Crashes are not something that happens often. I'd be more concerned about database integrity (short recovery time) and fast switch-over. So a solution with shared storage (HP ServiceGuard, for example) seems preferable in this case.
           Right. The recovery time is the most important. We have had a couple of MySQL data losses after a crash of the Linux OS. That's why I am asking for a database which is able to roll back its logs and recover data after a crash.

           Both PostgreSQL and Oracle are not fast enough. I'd suggest using MySQL for your environment.
           What about a 2- to 4-node Oracle RAC cluster with 2x 3.0 GHz CPUs each and load balancing?

           Again, take it with a grain of salt; it really depends on the number of checks. It could even turn out that a 1-CPU server with 512 MB is enough for you if you plan to run only a few (up to 10) checks per server every 30 seconds.
           1000 servers with a 25% increase per year, at least 60 items per server, keeping the data for 9 months.

          Thanks for your help...

          Mirza

          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
             Zabbix Certified Specialist, Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Originally posted by Mirza
             What happens when the Zabbix server loses the connection to the database? Will data be stored locally (I assume not)?
             The data will not be stored locally. The Zabbix server will just wait until the database is up again.
            Originally posted by Mirza
             Implementing a monitoring solution in a datacenter is more than the technical solution - customers, fees, penalty fees etc. take more resources than the solution itself.

             Right. The recovery time is the most important. We have had a couple of MySQL data losses after a crash of the Linux OS. That's why I am asking for a database which is able to roll back its logs and recover data after a crash.
             Is the InnoDB MySQL database structure supposed to be immune to power failures and OS crashes?
            Originally posted by Mirza
             What about a 2- to 4-node Oracle RAC cluster with 2x 3.0 GHz CPUs each and load balancing?
             Oracle support is not finished yet, so I cannot comment on the performance of that database.
            Originally posted by Mirza
             1000 servers with a 25% increase per year, at least 60 items per server, keeping the data for 9 months.
             That translates to 2000 checks per second, provided the refresh rate is 30 seconds. I'm not sure the current 1.0 can handle this nicely, because timeout processing is not perfect and 1.0 is not optimised for such a heavy load.
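
             As a quick check of the arithmetic, and of what the stated 25% yearly growth does to it (a simple projection, nothing more):

             # check_rate.py - how the 2000 checks/second figure falls out of the numbers above
             servers = 1000
             items_per_server = 60
             interval_s = 30
             growth_per_year = 0.25

             rate = servers * items_per_server / interval_s
             print(rate)                                   # 2000.0 checks/second today
             print(rate * (1 + growth_per_year) ** 3)      # ~3906 checks/second after three years of growth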

             Fortunately, 1.1 will be here to address all these problems. Parallelism will be greatly improved (I already understand how to achieve this), so it will be possible to run tens or even hundreds of server processes simultaneously.

             I'm very interested in making Zabbix scale to big environments.
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga
            My Twitter

            • Mirza
              Junior Member
              • Mar 2005
              • 7

              #7
              Originally posted by Alexei
               Is the InnoDB MySQL database structure supposed to be immune to power failures and OS crashes?
              Not immune - as far as I understand.

               It's much better than MyISAM, as it has the doublewrite buffer. But data may still be lost: in detail, the amount of dirty buffers (90% of the buffer pool by default, depending on your my.cnf configuration) could be anything between 1 MB and 512 MB...
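
               A small illustration of that bound: the data sitting in un-flushed (dirty) InnoDB pages at any moment is limited by the buffer pool size times the dirty-page percentage, so the exposure is entirely a my.cnf question. The pool sizes below are just examples:

               # dirty_buffer_bound.py - upper bound on un-flushed page data, in MB
               def max_dirty_mb(buffer_pool_mb, dirty_pages_pct=90):
                   return buffer_pool_mb * dirty_pages_pct / 100

               print(max_dirty_mb(8))     # tiny buffer pool:  at most ~7 MB un-flushed
               print(max_dirty_mb(512))   # large buffer pool: at most ~461 MB un-flushed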

               But the underlying filesystem still causes trouble. I learned from http://www.ussg.iu.edu/hypermail/lin...03.2/0629.html that with ext3 in ordered mode the Linux kernel tries to do real fsyncs. According to that page, this does not work with IDE drives and is damned slow even with SCSI drives.

               I'm very interested in making Zabbix scale to big environments.
               Me too, as Zabbix is a wonderful tool.

              • Mirza
                Junior Member
                • Mar 2005
                • 7

                #8
                Originally posted by Mirza
                 But the underlying filesystem still causes trouble. I learned from http://www.ussg.iu.edu/hypermail/lin...03.2/0629.html
                 Quoting http://www.issociate.de/board/post/1...mentioned.html:
                 fsync is buggy even in Linux 2.6.

                Mirza

                • Alexei
                  Founder, CEO
                  Zabbix Certified Trainer
                   Zabbix Certified Specialist, Zabbix Certified Professional
                  • Sep 2004
                  • 5654

                  #9
                  All,

                   We have recently finished development of active checks (i.e. checks initiated by the ZABBIX agent). Using active checks will greatly decrease the hardware requirements of the ZABBIX server and eliminate polling.

                   Recent benchmarks performed on a 1-CPU Athlon 2800+ with 2 GB of RAM show that ZABBIX is able to process about 570 checks per second. In fact, this means that a 1-CPU ZABBIX server can handle monitoring of 1000 servers with 34 metrics each, monitored once per minute.
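
                   A quick sanity check of that claim, comparing the required check rate with the measured throughput:

                   # capacity_check.py - does the benchmarked rate cover the scenario above?
                   servers = 1000
                   metrics_per_server = 34
                   interval_s = 60            # each metric collected once per minute
                   measured_capacity = 570    # checks/second from the benchmark above

                   required = servers * metrics_per_server / interval_s
                   print(f"required: {required:.0f} checks/second")                         # ~567
                   print(f"headroom: {measured_capacity - required:.0f} checks/second")     # ~3 to spare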

                   I would still recommend a 2-CPU system, as some more power is required for the GUI and reports.

                  The functionality will be released as part of 1.1alpha8, next Monday (hopefully).
                  Alexei Vladishev
                  Creator of Zabbix, Product manager
                  New York | Tokyo | Riga
                  My Twitter
