Hello,
I work for a small company that is hosted entirely on EC2.
Currently we monitor:
- 360 hosts
- 52936 items
- 11387 triggers
- 500 vps (new values per second)
For that, we use:
- 1 zabbix-server instance (Zabbix 1.8.3, 8GB RAM)
- 2 MySQL 5.1.41 instances in master/slave replication (70 GB RAM)
- 6 zabbix-proxy instances
From time to time, we get floods of false-positive alerts about zabbix-agent not answering on various hosts.
Initial investigation showed that the false positives occur when too many SQL queries get locked.
The fact is that the 'history' table has around 200M rows and 'history_uint' about 300M :-/
I already changed the history setting for all items (most were at 90 days, now 7 days).
I also decreased the housekeeper history for actions and events from 365 days (the default value) to 60.
But I still have too many rows in my tables (and 60 days of history).
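As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes the 500 vps figure is a sustained average and that every value ends up in 'history' or 'history_uint' (in practice some go to the string, text and log history tables), so treat the results as orders of magnitude only. The function names are made up for this illustration.

```python
SECONDS_PER_DAY = 86_400


def expected_history_rows(values_per_second, retention_days):
    """Steady-state row count if every new value is kept for retention_days."""
    return int(values_per_second * SECONDS_PER_DAY * retention_days)


def implied_retention_days(total_rows, values_per_second):
    """How many days of data a given row count represents at this rate."""
    return total_rows / (values_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Figures from this post: 500 vps, 200M + 300M rows in the two tables.
    print(expected_history_rows(500, 7))    # rows expected with 7 days kept
    print(expected_history_rows(500, 90))   # rows expected with 90 days kept
    print(round(implied_retention_days(500_000_000, 500), 1))
```

Under these assumptions, 7 days at 500 vps is already about 300M rows, and the current 500M rows correspond to roughly 11 to 12 days of data, so the tables would stay large even once the housekeeper has fully caught up.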
First question: what is the difference between the item history setting and the housekeeper one? If item history is 7 days, should I decrease the housekeeper's setting to the same value?
Some ideas I have to improve performance:
- Disable the housekeeper and use SQL partitions. What would be the impact on query performance, indexes, ...?
- Use master/master replication, with zabbix-server using one SQL instance and the zabbix frontend the other. What would be the impact? Does zabbix support odd and even id increments?
- If the previous improvements do not solve my problem, use distributed monitoring, replacing the zabbix-proxy instances with other zabbix-servers.
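To make the first idea concrete, here is a minimal sketch of what daily range partitioning on the history tables could look like, written as a small Python helper that only builds the SQL text and never touches the database. It assumes the tables have an integer 'clock' column holding a Unix timestamp and no unique key that excludes 'clock' (MySQL requires the partitioning column to be part of every unique key); both points should be checked against the actual schema. The partition naming (pYYYYMMDD) and the helper names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def partition_clauses(first_day, days):
    """One RANGE partition per day, bounded by the next midnight (UTC)."""
    clauses = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # Upper bound: Unix timestamp of 00:00 UTC on the following day.
        upper = calendar.timegm((day + timedelta(days=1)).timetuple())
        clauses.append(f"PARTITION p{day:%Y%m%d} VALUES LESS THAN ({upper})")
    return clauses


def add_partitioning_sql(table, first_day, days):
    """ALTER TABLE statement that partitions `table` by its clock column."""
    body = ",\n  ".join(partition_clauses(first_day, days))
    return f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n  {body}\n);"


def drop_partition_sql(table, day):
    """Statement that discards one whole day of data at once."""
    return f"ALTER TABLE {table} DROP PARTITION p{day:%Y%m%d};"


if __name__ == "__main__":
    print(add_partitioning_sql("history_uint", date(2011, 1, 1), 3))
    print(drop_partition_sql("history_uint", date(2011, 1, 1)))
```

The point of this layout is that expiring old data becomes a DROP PARTITION on one day instead of the row-by-row DELETEs the housekeeper issues, which is where the locking would be expected to ease. Repartitioning tables of this size is itself a heavy operation, so it would need to be tried on the slave or a copy first.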
The main bottleneck here is disk I/O performance. Because of EC2, the SQL instance has rather poor disk performance (compared to physical hosting). We try to compensate with memory, but MySQL InnoDB seems to be saturated.
Any ideas welcome,
Regards,
JB