Hello all:
We have a fairly standard-sized installation. I have been working with Zabbix for years and have never seen the problems we are currently experiencing. This server was running 2.0.2 in a VMware virtual machine and was doing fine, monitoring around 120 hosts and 24,000 items. We decided to move it to dedicated hardware because Zabbix was beginning to consume too much of our high-performance (expensive) iSCSI storage. So a physical machine was provisioned with:
dual quad core 2.13GHz CPU
1 300GB 15K SAS boot
3 600GB 15K SAS software RAID 5
16GB Memory
This should be more than adequate for our size of installation, I think, but after porting everything over from the VM to this new 2.0.7 machine, things quickly went to pot. I regularly get "Housekeeping 75% busy" triggers, as well as similar ones for the SNMP pollers. I looked at the machine and I/O wait averages 11-15%. I then reduced the number of monitored items to 9,500 by eliminating a large number of SNMP checks, but we still have the problem: Zabbix regularly thinks the server is not running when it is, and during that period we are flooded with hundreds of emails about systems being down that really are not. So I started poking around the database and discovered that the history_log table has over 160 million rows. This is probably due to a template we use to gather Windows event logs with a 90-day retention. My first thought was to just reduce the retention to 45 days, but that did not seem to solve the problem. I am now considering unlinking that template and clearing its data, but I have no idea whether the server is going to choke on trying to delete that many rows.
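To make the deletion worry concrete: if all 160 million rows really do span the 90-day retention, that is roughly 1.8 million rows a day, or about 20 per second. Rather than one huge DELETE, what I have in mind is removing old rows in small batches with a commit after each one. Below is only a rough sketch of that idea, not something I have run against the production database. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the `id` and `clock` column names are my assumption about the history_log schema, and `purge_old_rows`, the batch size, and the pause are made up for illustration. Against MySQL the driver and the placeholder style would differ.

```python
import sqlite3
import time


def purge_old_rows(conn, cutoff, batch_size=500, pause=0.0):
    """Delete history_log rows with clock < cutoff in small batches.

    Returns the total number of rows deleted. Column names (id, clock)
    are assumed, not verified against the real schema.
    """
    total = 0
    while True:
        cur = conn.cursor()
        # Pick the next small batch of old rows by primary key.
        cur.execute(
            "SELECT id FROM history_log WHERE clock < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        cur.execute(
            "DELETE FROM history_log WHERE id IN (%s)" % placeholders, ids
        )
        conn.commit()  # short transactions, so other writers are not blocked for long
        total += len(ids)
        time.sleep(pause)  # optional breather between batches


def demo():
    """Build a tiny in-memory table and purge it; returns (deleted, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history_log (id INTEGER PRIMARY KEY, clock INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO history_log (clock, value) VALUES (?, ?)",
        [(clock, "event") for clock in range(25)],
    )
    conn.commit()
    deleted = purge_old_rows(conn, cutoff=10, batch_size=4)
    remaining = conn.execute("SELECT COUNT(*) FROM history_log").fetchone()[0]
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

Whether something like this would actually be gentler on the disks than letting the housekeeper do the work is exactly what I am unsure about.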
What is the best way to approach this problem? Dump all tables with mysqldump excluding that table, create a fresh schema, and reimport? Is the software RAID 5 with ext4 the problem? Unfortunately we do not have enough slots in this machine to add a hardware RAID card, which really sucks, but I am stuck with what I have.
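For the dump-and-reimport idea, this is roughly the command I would build. It is a sketch only: the database name `zabbix` is an assumption, and `dump_command` is just a helper name I made up. As far as I know, `--ignore-table` leaves the table out of the dump entirely, structure included, so I assume I would have to recreate an empty history_log from the schema file before starting the server again.

```python
import shlex


def dump_command(database="zabbix", skip_tables=("history_log",)):
    """Build a mysqldump argument list that leaves out the given tables."""
    args = ["mysqldump", "--single-transaction"]
    for table in skip_tables:
        # mysqldump expects the table qualified with its database name.
        args.append("--ignore-table=%s.%s" % (database, table))
    args.append(database)
    return args


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(shlex.quote(arg) for arg in dump_command()))
```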
Any advice or other things to look for? I have already made many "high volume" tweaks to zabbix_server.conf, to no effect. The installation can hum along without issues for a day or two, but then Zabbix (wrongly) thinks the server process has died and shortly thereafter starts sending out erroneous emails about down hosts, which is very frustrating. I can post my config files if it helps.
Thanks in advance,
/Christian