Zabbix configuration syncer processes more than 75% busy
  • natalia
    Senior Member
    • Apr 2013
    • 159

    #1

    Zabbix configuration syncer processes more than 75% busy

    Hi,

    I am getting alerts from all my proxy servers: "Zabbix configuration syncer processes more than 75% busy".

    My configuration:

    Zabbix server, MySQL DB (on SSD) and the web frontend installed on separate servers, plus 7 (active) proxies.

    vps (server): 1045.04 (+ many trappers)
    vps per proxy: 336.68, 50.35, 180.84, 178.35, 58.88, 224.74, 6.93

    Zabbix server conf :
    CacheSize=2G
    CacheUpdateFrequency=5
    HistoryCacheSize=1G
    TrendCacheSize=1G
    ValueCacheSize=2G
    StartDBSyncers=4
    HistoryTextCacheSize=1G

    Proxy conf :

    ConfigFrequency=5
    CacheSize=1G
    StartDBSyncers=4
    HistoryCacheSize=512M
    HistoryTextCacheSize=512M


    What could be the problem, and how can I fix it?

    Thanks for your help!
    Natalia
    Attached Files
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by natalia
    My configuration :

    Zabbix server, MySQL DB (on SSD) and the web frontend installed on separate servers, plus 7 (active) proxies.

    vps (server) - 1045.04 (+ many trappers)
    vps per proxy - 336.68, 50.35, 180.84, 178.35, 58.88, 224.74, 6.93

    Zabbix server conf :
    CacheSize=2G
    CacheUpdateFrequency=5
    HistoryCacheSize=1G
    TrendCacheSize=1G
    ValueCacheSize=2G
    StartDBSyncers=4
    HistoryTextCacheSize=1G
    You have 7 proxies and only 4 DB syncers.
    This is the cause.

    BTW: your caches are way too big.
    I have 2.5k nvps and 250k items, and on the server, even with 64MB for the HistoryCacheSize and TrendCacheSize caches, almost all of them are constantly 100% free.

    CacheSize=2G means you are reserving configuration cache for something like 2 million items. I have 386MB, but I have quite a big host in/out flow, so I must keep CacheSize about 150% bigger than I strictly need (the configuration of unmonitored hosts stays in the config cache for escalations and other operations).
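    As a rough illustration (these numbers are my guesses based on the reasoning above, not tested values for this installation), a reduced server config might look more like:

    ```
    # zabbix_server.conf - illustrative reduced values, not a drop-in recommendation
    CacheSize=512M          # ~286k items fit with headroom for host churn
    HistoryCacheSize=64M    # as noted above, even 64M stays almost 100% free
    TrendCacheSize=64M
    ValueCacheSize=512M
    ```

    Watch the internal cache-usage items (e.g. the "free %" graphs) after reducing, and grow any cache that drops low.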

    Proxy conf :

    ConfigFrequency=5
    CacheSize=1G
    StartDBSyncers=4
    HistoryCacheSize=512M
    HistoryTextCacheSize=512M
    The same applies here. Everything can be reduced by a factor of 10, if not more. My biggest proxy, with 89k items, has CacheSize=80M.
    If you have active agents, you may want to increase StartDBSyncers.

    If you are using MySQL >= 5.5, you should consider setting innodb_buffer_pool_instances=N, where N is no bigger than the number of CPU cores*2 and higher than the number of DB syncers. It noticeably improves the latency of concurrent selects and, to a lesser degree, writes (which are the biggest problem in the DB workload generated by Zabbix).
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • natalia
      Senior Member
      • Apr 2013
      • 159

      #3
      Many thanks for the reply and very useful information/suggestion!

      Originally posted by kloczek
      Everything can be reduced by a factor of 10, if not more. My biggest proxy, with 89k items, has CacheSize=80M
      Where should I reduce: on the proxy, the server, or both?

      Should I reduce all Cache* ?
      CacheSize=2G
      HistoryCacheSize=1G
      TrendCacheSize=1G
      ValueCacheSize=2G
      HistoryTextCacheSize=1G


      I have the following details in dashboard:

      Number of hosts: 4639
      Number of items: 285906
      Number of triggers: 259122
      Number of users (online): 126 (8)
      Required server performance, new values per second: 1057.44

      Originally posted by kloczek
      If you have active agents, you may want to increase StartDBSyncers.
      I have ~4600 active agents; each proxy has 300-2500 hosts.
      Where should I increase StartDBSyncers: on the proxy, the server, or both?
      The proxies are active, so what is the purpose of StartDBSyncers on the server side?

      Originally posted by kloczek
      If you are using MySQL >= 5.5, you should consider setting innodb_buffer_pool_instances=N, where N is no bigger than the number of CPU cores*2 and higher than the number of DB syncers. It noticeably improves the latency of concurrent selects and, to a lesser degree, writes (which are the biggest problem in the DB workload generated by Zabbix).
      I have MySQL 5.6 with partitioning on all *history and *trend tables.
      Should I increase innodb_buffer_pool_size as well? By how much?

      Thanks a lot for the help!
      Last edited by natalia; 29-06-2015, 20:39.


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by natalia
        Many thanks for the reply and very useful information/suggestion!

        Where should I reduce: on the proxy, the server, or both?

        Should I reduce all Cache*?
        Sometimes "more" means "too much".
        A few GB of RAM allocated and never used is effectively wasted memory.
        You probably run some external scripts on the proxies or the server. If so, that memory may do more good as page cache than sitting unused in Zabbix caches.

        In optimization and testing there is a phrase: "death by a thousand cuts". If you cut an elephant's skin once, such a big animal may not even notice; a thousand such cuts may kill even an elephant.
        This is why it is so important to keep complicated systems as close as possible to their "sweet spot".
        If you do not care about the details here and there, at some moment the system will start behaving randomly, and no single or simple modification will improve or fix it. Why? Because the system will be suffering from just one issue: a lack of care for many details.

        I have the following details in dashboard:

        Number of hosts: 4639
        Number of items :285906
        Number of triggers: 259122
        Number of users (online) 126 8
        Required server performance, new values per second 1057.44
        So your Zabbix is almost the same size as mine. I have fewer hosts, but for Zabbix the number of monitored items matters more.

        I have ~4600 active agents; each proxy has 300-2500 hosts.
        Where should I increase StartDBSyncers: on the proxy, the server, or both?
        The proxies are active, so what is the purpose of StartDBSyncers on the server side?
        On both. I'm assuming that you are not monitoring any hosts directly from the server and that only (active) proxies connect to it. In that case, the number of syncers on the server should be no lower than the number of active proxies. Why? Because there is some probability that all active proxies will push their data to the server in exactly the same period of time. Again: this is true for active proxies. With passive proxies, the server reads data from all proxies sequentially, one by one.
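        For the setup discussed here (7 active proxies), that rule of thumb would translate into something like this on the server (illustrative value):

        ```
        # zabbix_server.conf - at least one syncer per active proxy
        StartDBSyncers=8
        ```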

        I have MySQL 5.6 with partitioning on all *history and *trend tables.
        Should I increase innodb_buffer_pool_size as well? By how much?
        That depends on the OS running on the DB backend.
        I use Solaris, and I found empirically that ZFS caching is better than the MySQL InnoDB cache, so I have only 16GB for the InnoDB pool and 32GB for the ZFS ARC.
        A funny consequence of this architecture is that a restart of the DB backend is quite lightweight: mysqld only needs enough memory to hold the indexes, and the data is served from the ZFS ARC.
        With more memory used by the ARC and its combination of MRU/MFU algorithms, the ZFS ARC hit/miss ratio is something like 4-19k/0-20 per second (averaged over a whole day, 5.5k/5.5 per second).
        In the attachment at the bottom you can take a peek at my daily IO graph at the physical disk layer (below the ZFS pool).

        On Linux (as long as you are not using ZFS) it may be different. I have never had time to compare which is more effective: the page cache or the InnoDB pool. Probably the InnoDB pool, but with more memory dedicated to it, warming up the DB caches may take longer, so it may be a kind of double-edged sword.

        You have almost the same number of items as I have, so the daily volume of data written to the DB should be close to the total memory used by the various caches in and underneath the DB engine. Why? Because most people look at graphs on a one-day scale or less. With all of the last 24h of data cached, it is very likely that the data needed for someone's graph(s) will come from memory instead of storage.
        As long as you have partitioned history* tables, it is very easy to calculate how much memory to spend on caching: something like the average size of a daily partition should be enough.
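        On MySQL, the partition sizes can be read straight from information_schema (the schema name 'zabbix' and daily partitioning of the history table are assumptions; adjust both to your setup):

        ```sql
        -- Approximate on-disk size of each partition of the history table
        SELECT partition_name,
               ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
        FROM   information_schema.partitions
        WHERE  table_schema = 'zabbix'
          AND  table_name   = 'history'
        ORDER  BY partition_name;
        ```

        The average of the daily size_mb values is then the ballpark figure for cache sizing described above.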

        PS. If anyone in the London area is interested in the details of my Zabbix server setup, I'll be giving a presentation at the Solaris SIG meeting in September at the Oracle office. The meetings are held every month on the second Wednesday, usually starting at 20:00 (SIG meetings are free... you only need to register).
        Attached Files
        Last edited by kloczek; 29-06-2015, 23:24.

        • natalia
          Senior Member
          • Apr 2013
          • 159

          #5
          Originally posted by kloczek

          So your Zabbix is almost the same size as mine. I have fewer hosts, but for Zabbix the number of monitored items matters more.
          Could you post your server and proxy configs?

          Originally posted by kloczek
          On both. I'm assuming that you are not monitoring any hosts directly from the server and that only (active) proxies connect to it. In that case, the number of syncers on the server should be no lower than the number of active proxies. Why? Because there is some probability that all active proxies will push their data to the server in exactly the same period of time.
          Why increase the DB syncers on a proxy? Is 4 not enough?
          Regarding the server, I understand :-)
          What is the limit on a proxy? (How many items or hosts?)

          One more question: I have partitioning only for the history and trends tables, so I still need housekeeping to run... How often do you run it?
          Should it run only on the server side, or on the proxies as well?
          Why do I need it on a proxy?

          Thanks a lot!


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Why increase the DB syncers on a proxy? Is 4 not enough?
            Regarding the server, I understand :-)
            What is the limit on a proxy? (How many items or hosts?)
            With active agents, in the worst-case scenario you may have all agents connecting to the proxy and pushing their data into the proxy DB at the same time.
            An active-agent configuration scales better than the passive variant, but you must have enough connection channels from the proxy to the DB backend to push the data over more than a single connection.

            I have no idea where the real limits are.
            So far, 100k items at about 1k nvps is no big deal when using MySQL as the DB backend. The SQLite backend works well enough for up to a few tens of monitored hosts. The problem with SQLite is that every insert or update rewrites the whole file, so at some point it becomes an IO bottleneck.
            IMO, even for small proxies, a MySQL DB backend is the absolute minimum. A 1GB InnoDB pool is enough for good speed with 100k items and 1k nvps.

            Remember that the proxy DB is not used in the same way as the server DB. A proxy generally only stores data in the DB, and a few tens of MB of cache is enough to hold all the data before it is sent to the server, so it does not have to be read back from the DB. The DB content is used in a second scenario: when the proxy loses sync with the server, it pushes all the buffered data once connectivity is restored.
            Maybe when the proxy's buffered data queue grows big enough, it would be better to drop the oldest data, for example by dropping the oldest hourly partition. However, so far I have not been able to push a big enough data flow through a proxy to justify investing time in such experiments.
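            For what it's worth, dropping the oldest partition on a MySQL-backed proxy would be a single DDL statement along these lines (the partition name is hypothetical and depends entirely on your partitioning scheme):

            ```sql
            -- Drop the oldest hourly partition of proxy_history (partition name illustrative)
            ALTER TABLE proxy_history DROP PARTITION p2015062900;
            ```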

            One more question: I have partitioning only for the history and trends tables, so I still need housekeeping to run... How often do you run it?
            Should it run only on the server side, or on the proxies as well?
            Why do I need it on a proxy?
            In my setup, all proxies hold only the last 4h of data. That is enough for the "typical" disconnections which we have from time to time in our environment, or for planned server downtime, for example during a major upgrade (in my case, a minor Zabbix upgrade is so rock solid that it is performed as a BAU change).

            The story is different for HK on the server.
            A few weeks ago it seems we hit some server HK limits.

            The problem, generally, is that even if you have disabled HK on the trends and history tables, deleting a host still writes housekeeper tasks to the housekeeper table. In my environment, we have a relatively big flow of hosts going in and out (caused by auto scaling in AWS).
            After a few months of working with HK disabled on trends and history, we had more than 50 million housekeeper entries (at the moment, almost 60 million).
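            Anyone wondering whether they are in the same situation can check the backlog directly against the standard housekeeper table:

            ```sql
            -- Count pending housekeeper tasks, grouped by the table they target
            SELECT tablename, COUNT(*) AS pending
            FROM   housekeeper
            GROUP  BY tablename;
            ```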

            The first problem I hit when trying to flush the HK records for the trends table was that the HK process, after finishing a cycle, deletes all completed HK entries with a single DELETE query (with a few million rows, that is not going to happen).
            Additionally, a failure of that query causes HK to retry the delete in an infinite loop. That is the second bug.
            The third, and biggest, bug in HK is in how the process prepares the list of data before it starts issuing DELETE queries on the trends and history tables.
            Preparing that data executes a very long-running SELECT doing a full scan of the history or trends table.
            The trends table is usually much smaller than history, and in my case a query like "select itemid,min(clock) from trends group by itemid" takes about 3h. Doing the same on history is far worse: after the query "select itemid,min(clock) from history group by itemid" had been running for 22h on almost 2 billion rows, I stopped HK.

            The housekeeper in its current implementation has serious scalability issues. No matter how many HK records need to be flushed, the HK cycle time scales not with the size of the HK queue but with the size of the history and trends tables.

            The problem with long-running selects like the above, in an environment where housekeeping is done by creating a few daily partitions in advance and dropping the oldest one, is that with such a query running for more than 20h you have a very high chance of a collision with creating new partitions or dropping old data.
            Those selects hold a backlog of not-fully-committed transactions in the DB, and as long as those transactions are not fully committed you cannot run ALTER queries.
            This causes another bad consequence: the ALTER query waits in an infinite loop trying to obtain a lock on the table, which blocks all read and write queries.
            I'm not sure, but ALTER queries should give up after some timeout if they cannot obtain the lock, so it is even possible that we are also talking about a MySQL bug here (I'm using MySQL 5.5 and will probably discuss this with Oracle support as well).

            To unlock this, either the ALTER query must be killed, or the long-running select must be stopped/killed.
            This cascade of bad steps happens with MySQL, but I'm pretty sure something very similar happens with PostgreSQL or other SQL engines.

            However, I have good news: Zabbix support has already identified all of these issues.
            Anyone interested in a solution to the HK issues should monitor the publicly available case https://support.zabbix.com/browse/ZBXNEXT-2860

            General advice to everyone using partitioned trends/history tables: disable HK on tables whose content is maintained by dropping partitions, because it will freeze the whole DB engine when partition maintenance overlaps with the HK preparation select query. Or, at the very least, do not try to enable HK!!!

            IMO, the immediate fix that should be applied to the item-deletion code is to stop writing HK records to the housekeeper table.
            However, the Zabbix maintainers may have a different opinion on how to handle this.

            PS. At the end, I must say a big Thank You to the whole Zabbix Support Team. These guys IMO are doing a ReallyGoodJob(tm)!!!
            The money spent on paid Zabbix support is IMO well worth what is provided.
            Last edited by kloczek; 30-06-2015, 23:38.

            • natalia
              Senior Member
              • Apr 2013
              • 159

              #7
              Originally posted by kloczek
              In my setup, all proxies hold only the last 4h of data. That is enough for the "typical" disconnections which we have from time to time in our environment, or for planned server downtime, for example during a major upgrade (in my case, a minor Zabbix upgrade is so rock solid that it is performed as a BAU change).
              I'm still using SQLite on the proxies and am thinking about moving to MySQL.
              So HK is disabled on your proxies? If they are configured to keep data for only 4h in case of disconnection from the server, they will automatically clean up old data, right?

              How did you fix web (PHP) performance? Are you using Apache or nginx?
              What is your PHP config?

              We have 2 web servers with Apache behind an F5, defined active-active, with ~20 connected users.

              Thanks again for help !


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by natalia
                So HK is disabled on your proxies?
                No. On the proxies HK is enabled, and everything seems to be OK there.

                If they are configured to keep data for only 4h in case of disconnection from the server, they will automatically clean up old data, right?
                Yes.
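                For reference, the proxy-side parameter that controls this window is ProxyOfflineBuffer; a 4h buffer as described above would look like this (sketch):

                ```
                # zabbix_proxy.conf
                ProxyOfflineBuffer=4   # hours of unsent data kept when the server is unreachable
                ProxyLocalBuffer=0     # hours data is kept locally even after being sent (0 = none)
                ```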

                How did you fix web (PHP) performance? Are you using Apache or nginx?
                What is your PHP config?
                I'm using Apache.
                As for the PHP configuration... I'm not overriding anything in /etc/php.ini; all settings come with the Zabbix web frontend rpm package. The configuration looks like this:
                Code:
                # cat /etc/httpd/conf.d/zabbix.conf
                ## Zabbix Frontend
                Listen                  zabbix.web:80
                NameVirtualHost         zabbix.web:80
                
                <VirtualHost zabbix.web:80>
                        ServerName      zabbix.web
                        DocumentRoot    "/var/www/zabbix"
                
                        CustomLog logs/zabbix-access_log combined
                        ErrorLog logs/zabbix-error_log
                </VirtualHost>
                
                <Directory "/var/www/zabbix/">
                        Options         Indexes FollowSymLinks
                        AllowOverride   AuthConfig
                        Order           allow,deny
                        Allow           from all
                
                        php_value max_execution_time    300
                        php_value max_input_time        300
                        php_value memory_limit          512M
                        php_value post_max_size         32M
                        php_value upload_max_filesize   5M
                        php_value date.timezone         GMT
                </Directory>
                Additionally, I'm using the php-pecl-zendopcache 7.0.5 module, which decreases CPU usage on the web frontends by about 10-20%.

                We have 2 web servers with Apache behind an F5, defined active-active, with ~20 connected users.
                That is the beauty of using Solaris: it comes with something called ILB (Integrated Load Balancer), a classic DSR load balancer. Solaris also supports VRRP out of the box.
                We use F5 as well, but ILB is way cheaper.
                See "System Administration Guide: IP Services" in the Oracle Solaris documentation.


                I must find time to publish some of my resources: my set of MIB-based templates (on top of which I have assembled something like an F5 template, plus templates for a few other SNMP-based devices), and my rpm packages, which have a fully integrated build procedure for generating Solaris IPS packages from the same spec file using pkgbuild. All with a few useful patches (which will be integrated into the Zabbix trunk shortly). I already have a full SMF manifest with a backend script which makes it easy to create service instances on Solaris; it fully maps the Zabbix service setup to SMF properties, so configuration files are generated at service start and only svcprop is used to change server, proxy or agent settings.
                At the moment I'm quite busy, but I'll try to prepare everything for publication before my September presentation.
                Last edited by kloczek; 01-07-2015, 12:19.

                • natalia
                  Senior Member
                  • Apr 2013
                  • 159

                  #9
                  Originally posted by kloczek
                  Additionally, I'm using the php-pecl-zendopcache 7.0.5 module, which decreases CPU usage on the web frontends by about 10-20%.
                  I will check it.

                  Thanks a lot for all your help!


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #10
                    Please find attached my src.rpm with this pecl extension (the src.rpm is inside a zip... it seems the Zabbix forum frontend does not accept rpm files as attachments).

                    To rebuild it and produce a binary, arch-dependent rpm, run "rpmbuild --rebuild php-pecl-zendopcache-7.0.5-1.el6.src.rpm" (the rpmbuild command is in the rpm-build package).

                    After generating the rpm package, all you need to do is install it with "rpm -Uvh php-pecl-zendopcache-7.0.5-1*rpm". The package post-install script automatically takes care of restarting Apache (if it is already running) to load this PHP pecl extension.
                    The post-install scripts are compatible with CentOS 5/6, Amazon Linux 1, Oracle Linux 5/6, and Red Hat Enterprise Linux 5/6 (I'm using Oracle Linux).

                    The default /etc/php.d/zendopcache.ini settings are enough for the Zabbix web frontend.
                    Last edited by kloczek; 03-07-2015, 02:04.

                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Ehhh... the zip file exceeded the maximum allowed attachment size. I've uploaded it to my Google Drive.

