Help on 2.4 to 3.2/3.4 upgrade

  • clarkritchie
    Member
    • Aug 2013
    • 46

    #1

    Help on 2.4 to 3.2/3.4 upgrade

    Hi all,

    We currently run Zabbix 2.4.8 with 7 or 8 proxies:
    - Approximate NVPS per proxy: 700-1,000
    - Number of enabled hosts monitored: 4,485
    - Number of enabled items monitored: 637,889

    We've been trying to upgrade to Zabbix 3.2 or 3.4 and it seems that no matter which path we take, performance is absolutely horrible.

    We have:
    - Upgraded a cloned instance of our server + db to 3.2/3.4, using .deb files to do the upgrade
    - Built brand new instances of 3.2/3.4 from .deb files, then run the upgrade on a clone of our db
    - Custom compiled our own .deb files for 3.2/3.4 with a higher value for ZBX_MAX_HRECORDS, then run the upgrade on a clone of our db
    - Custom compiled our own .deb files for 3.2/3.4 with no modifications, then run the upgrade on a clone of our db
    - Used stock .conf files
    - Used our tuned .conf files
    - Made no fancy MySQL tweaks

    In short:
    - 2.4.8 performance, today, is generally acceptable
    - 3.x performance, with just one of our proxies pointed at the server, is atrocious -- we are seeing numerous slow queries in the db logs (we even see slow queries with no proxies pointed at the server), and as a result Zabbix just bogs down...

    Database:
    - Amazon RDS db.m4.xlarge
    - MySQL 5.6.34
    - Db is currently 507,180 MB

    Server:
    - Amazon EC2 m4.xlarge
    - We use AWS enhanced networking, however we've not seen any difference on 3.x with or without this mod
    - We typically tune the kernel's shmmax value, which, as I understand it, relates to shared memory available to processes and affects various cache sizes (HistoryCacheSize, TrendCacheSize, etc) -- but we've seen no difference on 3.x with this mod

    Any suggestions on how to migrate to 3.x branch?
    Last edited by clarkritchie; 27-11-2017, 19:42.
  • LenR
    Senior Member
    • Sep 2009
    • 1005

    #2
    On the zabbix server graphs, what processes are busy? Housekeeping?


    • clarkritchie
      Member
      • Aug 2013
      • 46

      #3
      Thanks for the reply.

      I don't interact with Zabbix on a day-to-day basis, so I may ask a colleague to chime in for more detail here. But I think the answer to your question, at least for our existing 2.4.8 installation, is:
      • history syncer is frequently 100% busy
      • housekeeper is frequently 100% busy
      • poller processes are intermittently 100% busy


      I can't seem to get this for 3.x as those machines are down at the moment.


      • LenR
        Senior Member
        • Sep 2009
        • 1005

        #4
        As a test, disable housekeeping for history and trends and restart the zabbix server to make sure housekeeping stops for those two. If your poller and syncer usage goes down, then they were probably being blocked by housekeeping.

        Look at database partitioning schemes; those can replace the need for housekeeping on history and trends.
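
        For reference, a minimal sketch of the daily range-partitioning approach commonly used for this, shown for the history table only (the partition names and dates are just examples, and the same idea would be applied to history_uint, trends, trends_uint, etc.):

        Code:
        ALTER TABLE history
            PARTITION BY RANGE (clock) (
                PARTITION p2017_11_29 VALUES LESS THAN (UNIX_TIMESTAMP('2017-11-30 00:00:00')),
                PARTITION p2017_11_30 VALUES LESS THAN (UNIX_TIMESTAMP('2017-12-01 00:00:00')),
                PARTITION p2017_12_01 VALUES LESS THAN (UNIX_TIMESTAMP('2017-12-02 00:00:00'))
            );
        -- old data is then removed by dropping whole partitions instead of row-by-row housekeeper deletes
        ALTER TABLE history DROP PARTITION p2017_11_29;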


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Originally posted by clarkritchie
          Hi all,

          We currently run Zabbix 2.4.8 with 7 or 8 proxies:
          - Approximate NVPS per proxy: 700-1,000
          - Number of enabled hosts monitored: 4,485
          - Number of enabled items monitored: 637,889

          We've been trying to upgrade to Zabbix 3.2 or 3.4 and it seems that no matter which path we take, performance is absolutely horrible.
          Generally, zabbix server or proxy performance depends 98%+ on DB backend performance.
          The main limitation of AWS MySQL RDS is that you cannot use transaction-isolation=READ-COMMITTED. Without it, if you have a lot of hosts appearing and disappearing via autoscaling, each host add/delete blocks the whole DB until that add/delete transaction finishes.
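
          For what it's worth, a quick sketch of how to check which isolation level the backend is actually running (variable names as in MySQL 5.6; the SET GLOBAL form would only apply to a self-managed server, given the RDS limitation described above):

          Code:
          -- current isolation level (MySQL 5.6 exposes it as tx_isolation)
          SELECT @@global.tx_isolation, @@session.tx_isolation;
          -- on a self-managed MySQL this could be changed at runtime; on RDS it would
          -- have to go through the DB parameter group instead
          SET GLOBAL tx_isolation = 'READ-COMMITTED';
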
          Second thing: you should have enough memory to cache almost all of the last 24h of data in memory. If that data is not cached in the InnoDB buffer pool, the DB backend will always be slow.
          The first symptom of a memory-starved DB backend is more physical storage read IOs than writes. Remember that even a purely insert-only workload generates almost as many read IOs as write IOs. If all necessary data is not well cached in RAM, you cannot push insert performance any higher, because insert bandwidth will be limited by the latency of the storage read IOs.
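
          A quick way to look for that symptom on the MySQL side is to compare the InnoDB read/write counters and the buffer-pool miss rate (a sketch, assuming a plain MySQL client session against the backend):

          Code:
          -- physical data reads vs writes done by InnoDB since startup
          SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
          SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';
          -- buffer pool misses (reads that had to hit disk) vs logical read requests
          SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
          SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';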

          - Custom compiled our own .deb files for 3.2/3.4 with a higher value for ZBX_MAX_HRECORDS, then run the upgrade on a clone of our db
          Are you sure the problem is with ZBX_MAX_HRECORDS and not with insufficient insert speed on the main zabbix server, which may be blocking smooth data transport between the proxies and the server?

          Database:
          - Amazon RDS db.m4.xlarge
          - MySQL 5.6.34
          - Db is currently 507,180 MB

          Server:
          - Amazon EC2 m4.xlarge
          - We use AWS enhanced networking, however we've not seen any difference on 3.x with or without this mod
          - We typically tune the kernel's shmmax value, which, as I understand it, relates to shared memory available to processes and affects various cache sizes (HistoryCacheSize, TrendCacheSize, etc) -- but we've seen no difference on 3.x with this mod

          Any suggestions on how to migrate to 3.x branch?
          1) shmmax does not matter.
          2) If the wcache server internal metric drops below 95%, it means the DB backend is limited by insert speed (see above).

          Generally, with the same amount of memory and CPU cores, a physical box (not AWS RDS) allows many more optimizations on the DB backend side than an RDS instance, and the bigger the instance you end up allocating, the more sense it makes to buy a colocated physical host somewhere instead.
          AWS RDS is very good at scaling DB workloads that are almost all selects with not too many inserts and updates. The problem is that this is not the case for the zabbix DB backend.

          PS. I'm assuming that you have at least partitioned the history and trends tables.
          http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
          https://kloczek.wordpress.com/
          zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
          My zabbix templates https://github.com/kloczek/zabbix-templates


          • abjornson
            Member
            • Oct 2013
            • 34

            #6
            Hello, I'm @clarkritchie's colleague, also working on Zabbix.

            Many thanks for the helpful insights.

            We will try disabling housekeeping for history and trends and see the impact. If positive, we will try partitioning history and trends tables and will report how that goes. We are not currently partitioning anything.

            We will also consider your comments on the shortcomings of AWS RDS for the zabbix backend and consider testing on non-RDS MySQL.

            Here are 1d screenshots of zabbix server performance stats:

            "Zabbix Cache Usage, % free" - I believe because we see frequent drops of "history write cache" to zero this means our "DB backend is limited by inserts speed" correct?


            "zabbix internal processes busy" - I think you are saying that the saturated history syncer processes are probably due to DB insert limitations, and that saturated housekeeper is probably things worse?


            "Zabbix data gathering processes busy" - I think you are saying that the saturated pinger and poller processes are likely limited by the backend inserts?


            Here is the "internal processes busy" graph from one of our proxies. Does it make sense that data senders would be saturated if the server db inserts cannot keep up?


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              1) High poller utilization means that you are still using passive agents.
              That does not scale well. You need to move all agent monitoring to an active setup.
              Best is to switch all proxies to an active setup as well.

              2) Move all monitoring away from the server and add a separate proxy for the hosts so far monitored by the server.
              If you want server performance to really scale well, the only metrics monitored by the server itself should be the internal server metrics. As a side effect you will have the maximum possible level of HA: the server can be down and all data except the server's internal metrics will still be collected and buffered on the proxies.

              3) Every time the write cache drops below 98% free, it means the DB backend latency on insert and update queries is too big and not-yet-written data is piling up in this cache. When it hits 0% free you are effectively losing any data that is not monitored via proxies. Having too many eggs in one basket (central processing plus some host monitoring) on the server is not healthy.

              4) With all proxies active, the number of pollers can be lowered to the lowest possible number and the number of trappers only needs to be around the number of proxies.
              With this it will be possible to move to a smaller EC2 instance as well.


              • abjornson
                Member
                • Oct 2013
                • 34

                #8
                1) High poller utilization means that you are still using passive agents.

                Most of our monitored items (98% or more) are SNMP because they are network devices (radios, routers, etc). We are using proxies in active proxy mode. Our agents are in passive mode. We can make this change, but I expect the impact will be small due to the small number of hosts monitored by zabbix agent.

                2) Move all monitoring away from the server and add a separate proxy for the hosts so far monitored by the server.

                Basically the same answer. 98% or more of our hosts are network devices monitored by our proxies. The server itself monitors a very small number (~10) of hosts. I can move them onto a proxy - but with such a small number of hosts, do you think it will help?

                3) Every time the write cache drops below 98% free, it means the DB backend latency on insert and update queries is too big and not-yet-written data is piling up in this cache. When it hits 0% free you are effectively losing any data that is not monitored via proxies. Having too many eggs in one basket (central processing plus some host monitoring) on the server is not healthy.

                Yes - this aligns with what we're seeing (data loss when the write cache drops to 0%). I do also see the condition you mentioned above, where disk reads by the db backend sometimes exceed writes.

                4) With all proxies active, the number of pollers can be lowered to the lowest possible number and the number of trappers only needs to be around the number of proxies.
                StartPollers is currently still at the default 5
                StartTrappers is also currently still at the default 5


                Changes so far based on your recommendations:
                * innodb_buffer_pool_size on the RDS database is 16GB
                * I increased all of the cacheSize parameters greatly in zabbix server. I saw this greatly reduce disk reads/writes by the db backend
                * I disabled trends/history housekeeping on the server, have not yet implemented partitioning, but will soon
                * I also disabled housekeeper on the proxy temporarily. Will probably move the proxy to mysql partitioning as well.

                I think I'm seeing improvement. I'm no longer seeing wcache drop below 98%. I'm no longer seeing gaps in data on the server.

                My test environment is:
                * a single proxy with 600 VPS
                * a single upgraded 3.2 server with the above mentioned configuration changes

                One thing that's puzzling is that even with these changes, data from that single proxy is still lagging about 1.5 hours behind realtime. The gaps in data have gone away, but the delay has not. I noted that the data sender on the proxy is perpetually maxed at 100%.

                This small load - just one of my proxies - should be trivial for the server. I don't really understand why the sender is maxed out. I tested with iperf and there is ample bandwidth (60-80Mbps) from proxy to server.


                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by abjornson
                  1) High poller utilization means that you are still using passive agents.

                  Most of our monitored items (98% or more) are SNMP because they are network devices (radios, routers, etc). We are using proxies in active proxy mode. Our agents are in passive mode. We can make this change, but I expect the impact will be small due to the small number of hosts monitored by zabbix agent.

                  2) Move all monitoring away from the server and add a separate proxy for the hosts so far monitored by the server.

                  Basically the same answer. 98% or more of our hosts are network devices monitored by our proxies. The server itself monitors a very small number (~10) of hosts. I can move them onto a proxy - but with such a small number of hosts, do you think it will help?
                  So again: move monitoring of everything except the zabbix server internal metrics off the server.
                  In other words, all your SNMP devices should be monitored via a proxy.
                  Such a change is necessary so that all SNMP devices can keep being monitored continuously while changes are made on the server. Then even a simple upgrade of the server to the latest version will not cause loss of data that needs to be collected.

                  [..]
                  Changes so far based on your recommendations:
                  * innodb_buffer_pool_size on the RDS database is 16GB
                  If that still has not made the number of read IOs drop, it means that you still do not have enough memory.
                  When you have, for example, the history tables partitioned into daily partitions, usually a good estimate of how much memory is needed to radically decrease read IOs is simply the previous day's size of the history table partitions; use that as the estimate for innodb_buffer_pool_size (or, on Solaris, for the ZFS ARC divided by the ZFS compression ratio of the volume holding the MySQL data). On Solaris I found MySQL's own caching less effective than the ZFS ARC, so there I give MySQL only enough innodb_buffer_pool memory to keep the last day's indexes in memory.
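
                  A sketch of how that previous-day figure could be pulled from information_schema once the tables have daily partitions (the schema name 'zabbix' and partition name 'p2017_11_29' are just examples):

                  Code:
                  -- data + index size of one day's partitions across the history/trends tables
                  SELECT table_name, partition_name,
                         (data_length + index_length) / 1024 / 1024 AS size_mb
                  FROM information_schema.partitions
                  WHERE table_schema = 'zabbix'
                    AND partition_name = 'p2017_11_29'
                    AND table_name IN ('history', 'history_uint', 'trends', 'trends_uint');
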
                  To estimate how much memory needs to be given to MySQL, I'm using metrics generated by LLD in my "Service MySQL" template, which produces monitoring data about the size of the data and indexes per database (https://github.com/kloczek/zabbix-te...ervice%20MySQL). This template can be set up (via a macro) with the remote address of the monitored MySQL, so it could be used to monitor an AWS MySQL RDS instance as well.
                  Screenshot of my laptop's MySQL DBs screen:

                  * I increased all of the cacheSize parameters greatly in zabbix server. I saw this greatly reduce disk reads/writes by the db backend
                  If you don't need more caches, why are you increasing all of them?
                  The standard zabbix server templates provide most of those cache metrics.

                  * I disabled trends/history housekeeping on the server, have not yet implemented partitioning, but will soon
                  When adding partitioning to the main zabbix DB you will need to stop all write/update operations (or stop the zabbix server). This is why it is good to have everything monitored via proxies: you can schedule such downtime without stopping the collection of monitoring data.

                  * I also disabled housekeeper on the proxy temporarily. Will probably move the proxy to mysql partitioning as well.
                  Why? You have a performance problem on the main database, not on the proxy databases.

                  [..]
                  One thing that's puzzling is that even with these changes, data from that single proxy is still lagging about 1.5 hours behind realtime. The gaps in data have gone away, but the delay has not. I noted that the data sender on the proxy is perpetually maxed at 100%.

                  This small load - just one of my proxies - should be trivial for the server. I don't really understand why the sender is maxed out. I tested with iperf and there is ample bandwidth (60-80Mbps) from proxy to server.
                  I'm planning to add a proxy internal metric which will allow monitoring the number of queued items not yet pushed to the server, to give better observability of such things.

                  Second thing: you must know that, depending on the EC2 instance size, you have limited in/out bandwidth. If you are using the smallest possible instance size you may be hitting a 100Mb/s limit. That is the stupidity of AWS: sometimes, to increase network bandwidth, you need to allocate a bigger instance even if you don't need the other resources.
                  Last edited by kloczek; 30-11-2017, 06:06.


                  • abjornson
                    Member
                    • Oct 2013
                    • 34

                    #10
                    "If you don't need more caches why you are increasing all of them?"

                    Since you said earlier that "if the wcache server internal metric drops below 95%, it means the DB backend is limited by insert speed", I thought I understood that even zabbix caches which are mostly free can be an indication of problems (before, I had assumed that zabbix cache sizes only needed to be increased if they were getting down to around 20% free). Is this wrong? Unless I'm imagining things, the overall responsiveness of the zabbix UI when browsing data did seem to go up with these much larger caches.

                    "I'm planning to add proxy internal metric which will allow monitor size of the queued items still not pushed to the server to have better observability such things."

                    This would be extremely helpful!

                    Thanks also for the mysql template...will definitely set that up.


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Originally posted by abjornson
                      Thanks also for the mysql template...will definitely set that up.
                      As all my templates are now in a git repo, feel free to send patches or pull requests, even for minor changes or typo fixes.
                      And please let me know if you have any issues using the MySQL template against AWS MySQL or MariaDB RDS instances. I'm not sure which MySQL version is currently provided on AWS RDS; nevertheless, my template is mainly written for MySQL 5.7+.


                      • abjornson
                        Member
                        • Oct 2013
                        • 34

                        #12
                        Well, I feel like I'm making some progress toward understanding the problem, but it still feels far from solved.

                        Here is my current test setup:
                        * zabbix server 3.2.10 - housekeeping on trends and history disabled, innodb_buffer_pool_size is at 16 GB, zabbix CacheSize parameters are increased as discussed above
                        * zabbix proxy 3.2.10 - brand new proxy installation - proxy is monitoring hosts totaling only 700 VPS, which should be no problem for the server

                        This setup ran great for about 12 hours - no gaps and no delays. Then I started to see the following pattern:

                        * proxy performance graph shows periodic gaps for 15-20 minutes
                        * at those corresponding times, server internal process graph shows history syncer gets very busy
                        * however, at those same times, server caches do not drop below 98%
                        * at those corresponding times, mysql db monitoring shows read IOPS and read throughput go way up
                        * at those corresponding times, the zabbix server logs show a much increased volume of slow queries. The bulk of the slow queries seem to be of the following form, and there are tons of them... is there any way to tell why these queries are running and what they are doing?:

                        Code:
                        select distinct itemid from trends_uint where clock>=1512144000 and (itemid between 513894 and 513910 or itemid between 513928 and 513961 or itemid between 514030 and 514046 or itemid between 514064 and 514097 or itemid between 1279288 and 1279293 or itemid between 1279325 and 1279330 or....
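
                        In case it helps while digging into those, an EXPLAIN of a shortened version of the statement would at least show whether it can use the (itemid,clock) primary key on trends_uint or ends up scanning large ranges (a sketch; the itemid ranges are just copied from the example above):

                        Code:
                        EXPLAIN SELECT DISTINCT itemid
                        FROM trends_uint
                        WHERE clock >= 1512144000
                          AND (itemid BETWEEN 513894 AND 513910 OR itemid BETWEEN 513928 AND 513961);
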
                        graph for proxy


                        graph for server internal processes


                        graph for server caches


                        database read/write throughput and read/write IOPS
