Hi,
I was asked to move a few-years-old Zabbix server instance from a VM to a dedicated server box.
However, I am seeing major performance issues now that I have moved over all the data and config.
I have gone through many steps and tried many different things, but I cannot get my head around where the issue is.
The source system:
CentOS 6.5, Zabbix 2.2.2, MySQL server version 5.1.69
The VM has 4 CPU cores and 16 GB RAM assigned; storage is on a dedicated SAN.
It monitors about 650 hosts with a bit more than 45,000 items, 90-95% of them SNMP-based.
CPU and memory are 100% maxed out, but there is no swapping.
The installation works about fine at the moment, but when we wanted to add more items (not just port status but traffic metrics on all ports of the switches, which would mean an extra 10,000 items), the system became almost unusably slow on the web interface. Collection still works, just with some increase in the queue waits, but the web interface takes 5-10 seconds to load on each click, and auto-refresh of the pages is delayed.
The first target system:
CentOS 7.4, Zabbix 2.2.23, MariaDB server version 5.5.56
2x4-core CPU + HT, 16 GB RAM installed, RAID 1 SSD storage
Another config I have tried, with the very same result:
CentOS 7.4, Zabbix 2.2.23, MariaDB server version 5.5.56
2x8-core CPU + HT, 192 GB RAM installed, RAID 5 enterprise HDD
Scenario 1:
I perform a clean install on the target system: install the packages from the standard CentOS repos and install Zabbix from the Zabbix repo, exactly as described on the website.
Then I dump the zabbix database on the source (90 GB), copy over all the config files (zabbix server/agent, external scripts, my.cnf), change only the values referring to the server address (rename), import the SQL data, and start up the Zabbix server on the target system, roughly as sketched below.
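For clarity, the procedure was along these lines (a rough sketch; host names and paths here are placeholders, not the real ones):

```
# on the source VM: dump the zabbix DB (about 90 GB)
mysqldump --single-transaction zabbix | gzip > /tmp/zabbix.sql.gz

# copy the dump and the configs to the target box ("newbox" is a placeholder)
scp /tmp/zabbix.sql.gz /etc/zabbix/zabbix_server.conf /etc/my.cnf newbox:/tmp/

# on the target box: create the DB and import the dump
mysql -e "CREATE DATABASE zabbix CHARACTER SET utf8 COLLATE utf8_bin;"
zcat /tmp/zabbix.sql.gz | mysql zabbix

# put the configs in place, adjust only the server-address related values, then start
systemctl start zabbix-server
```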
The transfer is seemingly successful: everything works, I can log in with a user created on the source, and I can see all the hosts and templates.
However, on both HW configs, the 10-minute queue constantly has about 5,000-7,000 items in it, which is not the case on the VM source. Most of the items eventually update, but on many important graphs that should have a 30-second, or at least sub-minute, collection interval I see only intermittent values (once every half hour, or even once every 4 hours).
At the same time, I also see a lot of unreachable/timeout entries in the logs for many hosts, which is definitely not accurate, as the same hosts work fine from the VM source and also when checked by other means of monitoring.
Scenario 2:
The target machine is installed with Zabbix 3.4 or the 4.0.1 LTS version. I move the monitoring over via exports from the source Zabbix web interface (templates, hosts, etc.), with no SQL-level dumping or anything like that.
Again, I have tried this on both HW configs, and as soon as the number of monitored items reaches the level of about 1,000, I start to see items in the 10-minute queue.
Scenario 3:
I install the big machine as the Zabbix server and SQL database server, and the small machine becomes a proxy plus web interface, with a local proxy data store caching 24 hours of data (see the excerpt below).
The approach and the result are the same as in scenario 2...
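For reference, the proxy-side settings I mean are along these lines (a minimal sketch; the parameter names are the stock zabbix_proxy.conf ones, the host names are placeholders):

```
# /etc/zabbix/zabbix_proxy.conf (excerpt)
Server=bigbox.example.local    # placeholder for the Zabbix server address
Hostname=small-proxy           # placeholder proxy name, as registered on the server
ProxyLocalBuffer=24            # keep collected data locally for 24 hours
ConfigFrequency=300            # how often to pull configuration from the server, in seconds
```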
One thing that is obviously similar in all cases: the HW boxes use minimal CPU (none of the cores goes above 10% utilisation) and memory usage does not really go above 10 GB, as opposed to the VM, where all 4 cores and the memory are constantly at 100%.
In the VM's process list I can see the pollers and the other Zabbix processes, and almost all of them always have something to do, but on the physical boxes I see only intermittent activity on them, regardless of how many poller instances I configure (I have tried 20 to 100; the VM runs with 50 - see the excerpt below).
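These are the kind of zabbix_server.conf values I have been experimenting with on the physical boxes (a sketch only; the exact numbers varied between attempts):

```
# /etc/zabbix/zabbix_server.conf (excerpt)
StartPollers=100               # tried anything between 20 and 100
StartPollersUnreachable=10     # separate pollers for hosts flagged as unreachable
Timeout=10                     # poller timeout in seconds
UnreachablePeriod=45           # seconds of failures before a host is treated as unreachable
CacheSize=256M                 # configuration cache size
```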
I have spent a vast amount of time looking at the logs, but apart from the "became unreachable" / "became reachable" messages I don't see any reason for it - just timeouts, which are obviously not real.
I wrote a small test script that polls a host which very commonly shows up as unreachable on the target system: I issued 100 snmpwalk requests at 1-second intervals, and only 3 of them took about 3 seconds to come back; the rest came back in under 1 second, even though the walk was fairly big, covering about 8 different metrics for 36 HDDs in a NAS. (Just for clarity, the test was run on both physical boxes; the script was along the lines of the sketch below.)
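The test was roughly this (the host name, community and OID here are placeholders, not the real ones):

```
#!/bin/bash
# fire 100 snmpwalks at 1-second intervals against the "unreachable" host and time each one
HOST=nas01.example.local       # placeholder for the host that keeps showing as unreachable
OID=1.3.6.1.4.1                # placeholder subtree covering the disk metrics

for i in $(seq 1 100); do
    start=$(date +%s.%N)
    snmpwalk -v2c -c public -t 3 "$HOST" "$OID" > /dev/null
    end=$(date +%s.%N)
    echo "run $i: $(echo "$end - $start" | bc) s"
    sleep 1
done
```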
Scenario 4:
I have just reinstalled the full system from scratch on the bigger box. I installed Zabbix 4 again and started to manually re-create the templates for 2 routers and a NAS system with 5 nodes. By the time I had finished the templates and added all the hosts, I already started to see some items in the 10-minute queue, even though there are only 7 hosts and 1,300 items to monitor. And 2 NAS hosts still report timeouts and have not created all the discovery items after about 24 hours.
CPU usage on the machine is basically zero, and after about 36 hours of running, memory usage is just below 3 GB.
I have spent about 2 days trying to optimise the SQL side, in case that were the problem, but that cannot be it: with such a low amount of data, there is no machine on which an out-of-the-box SQL install wouldn't perform (the kind of settings I played with are sketched below).
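For completeness, the InnoDB knobs I experimented with were along these lines (a sketch, not the exact values from my boxes):

```
# /etc/my.cnf (excerpt) - InnoDB tuning attempts
[mysqld]
innodb_buffer_pool_size = 8G       # tried considerably larger values on the 192 GB box
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table = 1
max_connections = 300
```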
Also, just as an addition: disk I/O on the server in scenario 4 averages about 50 KB/s read and around 100 KB/s write, and peak traffic on the network interface is around 400 Kb/s, including the terminal traffic of the session I'm working in...
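(For anyone who wants to compare, I read those numbers with the standard tools, roughly like this, assuming the sysstat package is installed:)

```
# disk throughput per device, 5-second samples (kB/s read and written)
iostat -dxk 5

# per-interface network throughput, 5-second samples
sar -n DEV 5
```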
So at this point I have no more ideas about where the bottleneck in the system is, or why a 10-times-stronger physical box cannot even reach the performance of a low-end VM...
Could anyone help me, or just give me some ideas on where I should look?
Thanks