Zabbix server taking very long to restart ~ 1.5 to 2 hrs

  • prakhar
    Junior Member
    • Jan 2013
    • 8

    #1

    Zabbix server taking very long to restart ~ 1.5 to 2 hrs

    Hi,
    I am monitoring around 450 nodes. At the start everything was running fine, but the Zabbix UI gradually (over two weeks) became very slow.
    I tried to restart the Zabbix server after changing some configuration parameters (increased trappers and pollers), but Zabbix took a long time to come up.
    The Zabbix API responses also take very long.

    I have load tested Zabbix at 2700 to 3000 nvps, but there I was generating load from only 10 to 15 servers. Now nvps is around 700 but there are 450 hosts, and Zabbix is not able to handle it.

    Setup details:
    OS: Linux 2.6.32-358.2.1.el6
    Database: PostgreSQL
    Zabbix: Zabbix server v2.0.9 (revision 39085)
    Both DB and server on the same machine.

    CPU cores: 24
    RAM: 96 GB

    **************************************
    top output

    top - 10:28:58 up 8 days, 23:03, 1 user, load average: 512.98, 607.09, 649.69
    Tasks: 3440 total, 59 running, 3381 sleeping, 0 stopped, 0 zombie
    Cpu(s): 93.8%us, 2.1%sy, 0.0%ni, 4.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 99022656k total, 97327092k used, 1695564k free, 426468k buffers
    Swap: 20972848k total, 78320k used, 20894528k free, 67705676k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    14744 postgres 20 0 10.0g 537m 530m S 2.9 0.6 22:08.43 postmaster
    14973 postgres 20 0 10.0g 541m 532m S 2.7 0.6 21:57.05 postmaster
    54173 admin 20 0 17692 3988 1004 R 2.7 0.0 0:00.52 top
    14059 postgres 20 0 10.0g 540m 532m S 2.5 0.6 21:59.06 postmaster
    14175 postgres 20 0 10.0g 539m 531m S 2.5 0.6 21:57.19 postmaster
    14248 postgres 20 0 10.0g 540m 533m S 2.5 0.6 22:06.66 postmaster
    14251 postgres 20 0 10.0g 309m 304m S 2.5 0.3 21:51.41 postmaster
    14296 postgres 20 0 10.0g 321m 315m S 2.5 0.3 21:50.79 postmaster
    14316 postgres 20 0 10.0g 539m 530m S 2.5 0.6 22:00.16 postmaster
    14510 postgres 20 0 10.0g 537m 530m S 2.5 0.6 22:19.80 postmaster


    **************************************
    These are from zabbix_server.log on restart.

    30302:20150120:083102.798 query [txnlev:0] [select alert_history,event_history,refresh_unsupported,discovery_groupid,snmptrap_logging,severity_name_0,severity_name_1,severity_name_2,severity_name_3,severity_name_4,severity_name_5 from config where 1=1 and configid between 0 and 99999999999999]
    30302:20150120:083102.798 query [txnlev:0] [select i.itemid,i.hostid,h.proxy_hostid,i.type,i.data_type,i.value_type,i.key_,i.snmp_community,i.snmp_oid,i.port,i.snmpv3_securityname,i.snmpv3_securitylevel,i.snmpv3_authpassphrase,i.snmpv3_privpassphrase,i.ipmi_sensor,i.delay,i.delay_flex,i.trapper_hosts,i.logtimefmt,i.params,i.status,i.authtype,i.username,i.password,i.publickey,i.privatekey,i.flags,i.interfaceid,i.lastclock from items i,hosts h where i.hostid=h.hostid and h.status in (0) and i.status in (0,3) and i.itemid between 0 and 99999999999999]
    30302:20150120:083114.923 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.error,t.priority,t.type,t.value,t.value_flags from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0) and i.status in (0,3) and t.status in (0) and t.flags not in (2) and h.hostid between 0 and 99999999999999]

    The Zabbix server runs these queries each time it restarts, and this one in particular takes very long to execute:
    30302:20150120:083114.923 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.error,t.priority,t.type,t.value,t.value_flags from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and i.itemid=f.itemid and f.triggerid=t.triggerid and h.status in (0) and i.status in (0,3) and t.status in (0) and t.flags not in (2) and h.hostid between 0 and 99999999999999]



    Why have the DB queries become so slow, and how can I avoid such a situation?
  • Colttt
    Senior Member
    Zabbix Certified Specialist
    • Mar 2009
    • 878

    #2
    Maybe it's the housekeeper process?

    Did you tune your zabbix_server.conf and the postgresql config?
    Debian-User

    Sorry for my bad english


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by prakhar
      top - 10:28:58 up 8 days, 23:03, 1 user, load average: 512.98, 607.09, 649.69
      Tasks: 3440 total, 59 running, 3381 sleeping, 0 stopped, 0 zombie
      Cpu(s): 93.8%us, 2.1%sy, 0.0%ni, 4.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Mem: 99022656k total, 97327092k used, 1695564k free, 426468k buffers
      Swap: 20972848k total, 78320k used, 20894528k free, 67705676k cached
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      14744 postgres 20 0 10.0g 537m 530m S 2.9 0.6 22:08.43 postmaster
      14973 postgres 20 0 10.0g 541m 532m S 2.7 0.6 21:57.05 postmaster
      54173 admin 20 0 17692 3988 1004 R 2.7 0.0 0:00.52 top
      14059 postgres 20 0 10.0g 540m 532m S 2.5 0.6 21:59.06 postmaster
      14175 postgres 20 0 10.0g 539m 531m S 2.5 0.6 21:57.19 postmaster
      14248 postgres 20 0 10.0g 540m 533m S 2.5 0.6 22:06.66 postmaster
      14251 postgres 20 0 10.0g 309m 304m S 2.5 0.3 21:51.41 postmaster
      14296 postgres 20 0 10.0g 321m 315m S 2.5 0.3 21:50.79 postmaster
      14316 postgres 20 0 10.0g 539m 530m S 2.5 0.6 22:00.16 postmaster
      14510 postgres 20 0 10.0g 537m 530m S 2.5 0.6 22:19.80 postmaster
      96GB RAM, 10GB for postgresql, only 6-7GB left for the buffer cache, and 500-600 processes/threads in the running queue. Are you sure that only postgresql is running on this host?
      Is the Zabbix server on the same host?
      If yes, probably most of your items are passive items (which is the first warning bell that you should start moving away from passive monitoring), and such a long running queue is caused by 500-600 pollers waiting to receive data from the monitored hosts. Isn't it? What is your StartPollers value?
      Are you using partitioned history*/trends* tables?

      With about 450 nvps the daily volume of new data should be only around 2-6 GB .. 96 GB of RAM is overkill in this case (but that is not an issue).

      You must have a few overlapping configuration issues causing such pathological results.
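
      A quick way to answer the partitioning question from the DB side is to list the history*/trends* relations and their on-disk sizes; a minimal sketch, assuming the stock Zabbix schema (any daily partitions would show up as extra history_*/trends_* tables):
      Code:
      -- list all history*/trends* tables with their total on-disk size
      SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
      FROM pg_class
      WHERE relkind = 'r'
        AND (relname LIKE 'history%' OR relname LIKE 'trends%')
      ORDER BY pg_total_relation_size(oid) DESC;
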
      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • prakhar
        Junior Member
        • Jan 2013
        • 8

        #4
        Yes, the setup is running both postgres and zabbix_server.
        10.0g is the VIRT memory for each postmaster process, which includes
        /*postgresql.conf*/
        shared_buffers = 9832MB
        The actual RES memory per process is ~450 MB.

        Most of the items are Zabbix active in my case.
        The differences between my load-test setup (running at ~3000 nvps) and this setup are:
        1. This setup has ~120000 monitored items, of which ~70000 are Zabbix trapper items.
        2. In the 3000 nvps setup I had only 15 hosts, but here I have around 450 hosts.
        3. In the 3000 nvps setup I had no Zabbix trapper items, but in this setup I have around 70k of them.

        Question: Though Zabbix trapper items are not polled, is there a possibility that they increase Zabbix table sizes, resulting in slower queries?
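
        For reference, the item-type split can be checked straight from the DB; a minimal sketch, assuming the standard Zabbix 2.0 schema (item type 2 = Zabbix trapper, type 7 = Zabbix agent (active)):
        Code:
        -- count items per type to confirm how many trapper items exist
        SELECT type, count(*) AS item_count
        FROM items
        GROUP BY type
        ORDER BY item_count DESC;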

        I have not partitioned history*/trends* tables.

        /*zabbix_server.conf*/
        StartTrappers=40
        StartPollers=40
        All the cache settings are at the maximum allowed.


        @Colttt: I have disabled the housekeeper process.
        Yes, I have tuned Zabbix and Postgres for performance. I can provide other configuration details of postgresql and zabbix_server if you want.


        • Colttt
          Senior Member
          Zabbix Certified Specialist
          • Mar 2009
          • 878

          #5
          How many DB syncers do you have?
          Debian-User

          Sorry for my bad english


          • prakhar
            Junior Member
            • Jan 2013
            • 8

            #6
            StartDBSyncers=16


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Originally posted by prakhar
              Yes, the setup is running both postgres and zabbix_server.
              10.0g is the VIRT memory for each postmaster process, which includes
              /*postgresql.conf*/
              shared_buffers = 9832MB
              The actual RES memory per process is ~450 MB.
              That seems like a huge waste of memory.
              My DB backend for Zabbix now runs on Solaris on zfs, handling 1.5k nvps on a host with only 32 GB of memory, where at the moment the ARC (zfs Adaptive Replacement Cache) uses only
              Code:
              $ kstat zfs:0:arcstats:size | grep size | awk '{printf "%2dMB\n",  $2/1024/1024+0.5}'
              10102MB
              and I still have about 10 GB of unused memory.
              On this host all volumes use the maximum recordsize (1 MB) and lzjb compression, and I now have far fewer IOs than I had on Linux.
              Code:
              $ zpool iostat 2
                             capacity     operations    bandwidth
              pool        alloc   free   read  write   read  write
              ----------  -----  -----  -----  -----  -----  -----
              rpool        102G  83.7G      2    143   624K  24.6M
              rpool        102G  83.7G      7    121  60.7K  17.4M
              rpool        102G  83.8G      0    216    767  27.2M
              rpool        102G  83.8G      0    156      0  21.4M
              rpool        102G  83.8G      2    476   642K   111M
              rpool        102G  83.8G      0    188      0  18.3M
              rpool        102G  83.8G      0    128  16.7K  18.2M
              rpool        102G  83.7G      0    179    255  19.5M
              rpool        102G  83.7G      0    328  16.7K  51.8M
              rpool        102G  83.7G      0    412      0  78.7M
              rpool        102G  83.7G      0    162  3.50K  21.7M
              rpool        102G  83.7G      2    245  1.47M  36.5M
              ^C
              The zpool used by mysql contains only one pair of SSDs. On Linux the same mysql 5.5 was doing about 1.2-1.7k IO/s, which is why it was necessary to move to SSDs. Effectively, after migrating to Solaris it should be possible to go back to working on the old spindles :P

              Code:
              # zfs get compression,recordsize,compressratio,referenced rpool/VARSHARE/mysql
              NAME                  PROPERTY       VALUE  SOURCE
              rpool/VARSHARE/mysql  compression    lzjb   local
              rpool/VARSHARE/mysql  recordsize     1M     local
              rpool/VARSHARE/mysql  compressratio  2.68x  -
              rpool/VARSHARE/mysql  referenced     89.1G  -
              Giving the DB backend more cache than the amount of data you store daily is usually wrong (with partitioned history* tables rotated every day it is easy to find out how much is actually needed here).
              I found that the zfs ARC works better than the mysql innodb cache, so mysql has only innodb_buffer_pool_size=5GB.
              Even doing on-the-fly compression/decompression, this host, running only mysql, has just 8-14% CPU time usage (on the next promotion of the slave DB to master I'm going to start experimenting with gzip compression), which is (strangely) about 10% lower than the same hardware running mysql 5.5 on Linux.

              Most of the items are Zabbix active in my case.
              The differences between my load-test setup (running at ~3000 nvps) and this setup are:
              1. This setup has ~120000 monitored items, of which ~70000 are Zabbix trapper items.
              2. In the 3000 nvps setup I had only 15 hosts, but here I have around 450 hosts.
              3. In the 3000 nvps setup I had no Zabbix trapper items, but in this setup I have around 70k of them.

              Question: Though Zabbix trapper items are not polled, is there a possibility that they increase Zabbix table sizes, resulting in slower queries?
              You can treat these items almost like active items. Why? Because with passive items a poller thread connects to the monitored host and waits until the requested monitoring data is sampled and sent back in the reply (which may sometimes take a couple of seconds).
              With active item monitoring and trapper monitoring, the proxy or server threads move into the running queue only to establish connectivity and instantly receive the monitoring data, then move back to the pool of threads waiting to be used again.

              I have not partitioned history*/trends* tables.
              So it seems this is now your biggest problem, and it should be the top priority on your ToDo list.
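
              For PostgreSQL 9.x there is no declarative partitioning yet, so the usual approach is table inheritance plus a routing trigger. A minimal, illustrative sketch for one daily history partition (the partitions schema, the child table name and the rotation scripting are assumptions, not a drop-in solution):
              Code:
              -- create a schema to hold the daily child tables
              CREATE SCHEMA IF NOT EXISTS partitions;

              -- one child table per day, constrained on the clock column
              -- (1423094400 / 1423180800 are the UTC epoch bounds of 2015-02-05)
              CREATE TABLE partitions.history_2015_02_05 (
                  CHECK (clock >= 1423094400 AND clock < 1423180800)
              ) INHERITS (public.history);

              CREATE INDEX ON partitions.history_2015_02_05 (itemid, clock);

              -- trigger function that routes new rows into the matching daily child
              CREATE OR REPLACE FUNCTION public.history_insert_trigger() RETURNS trigger AS $$
              DECLARE
                  child text;
              BEGIN
                  child := 'partitions.history_' || to_char(to_timestamp(NEW.clock), 'YYYY_MM_DD');
                  EXECUTE 'INSERT INTO ' || child || ' SELECT ($1).*' USING NEW;
                  RETURN NULL;  -- the row is stored only in the child table
              END;
              $$ LANGUAGE plpgsql;

              CREATE TRIGGER history_partition_insert
                  BEFORE INSERT ON public.history
                  FOR EACH ROW EXECUTE PROCEDURE public.history_insert_trigger();
              With constraint_exclusion = partition set in postgresql.conf, the planner can skip the days a query does not touch, and removing old data becomes a cheap DROP TABLE of the oldest child instead of huge DELETEs.
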

              /*zabbix_server.conf*/
              StartTrappers=40
              StartPollers=40
              All the cache settings are at the maximum allowed.
              I have no passive items and more than 99% of all items are monitored over proxies, so my server settings are:

              Code:
              # cat /etc/zabbix/zabbix_server/Start*
              StartDBSyncers=10
              StartDiscoverers=1
              StartHTTPPollers=1
              StartPingers=1
              StartPollers=1
              StartProxyPollers=15
              StartTrappers=1
              For example, the settings of my biggest proxy (it monitors almost half of my hosts) are:
              Code:
              # cat /etc/zabbix/zabbix_proxy/Start*
              StartHTTPPollers=5
              StartPingers=30
              StartPollers=10
              StartTrappers=100
              With almost all items monitored over proxies (except a couple of internal checks and a few other items) I have no stress on server restart, even if it is necessary to schedule a slightly longer server downtime. In such an architecture all monitoring data is still collected by the proxies. I also found that even running the server and a proxy on the same host, with all items monitored over the proxy, reduces IO pressure .. simply because the server digests and stores monitoring data in bigger batches than when the same hosts are monitored by the server directly.
              The relevant settings of my biggest proxy (which monitors almost half of the hosts) and of the server:

              Code:
              # cat /etc/zabbix/zabbix_server/ProxyDataFrequency; cat /etc/zabbix/zabbix_proxy/DataSenderFrequency
              ProxyDataFrequency=10
              DataSenderFrequency=10
              (I have a mixture of passive and active proxies)

              A proxy and the server can work on one physical host, but I'm using two hosts. Usually the server runs on the first and a proxy runs on the second, with its own small mysql DB backend (holding the last 6h of monitoring data). Neither uses the host IPs; each has its own dedicated per-service address. A manual failover takes a few seconds (below the sync period between server and proxy), so I can easily, for example, schedule a reboot of one of these hosts without affecting monitoring.

              At the moment everything is prepared to put the above under a cluster hood (it will be Oracle cluster on Solaris), so after that all operations will be even easier and more predictable.

              With the above architecture, on a server restart (still 2.2.8) the initial check of all ~50k triggers takes less than 2-5 s (I have 116k monitored items atm), so a restart of the server still stays below the sync period between server and proxies.

              @Colttt: I have disabled the housekeeper process.
              Yes, I have tuned Zabbix and Postgres for performance. I can provide other configuration details of postgresql and zabbix_server if you want.
              To be honest with you .. in my private opinion, using postgresql as the DB backend for a typical warehouse DB like the Zabbix DB is a little overkill.
              Mysql, as the simpler engine, will IMO theoretically always more or less win against postgresql under such a workload.
              However, if you know postgresql better, do not move to another engine.
              Last edited by kloczek; 23-01-2015, 12:41.
              http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
              https://kloczek.wordpress.com/
              zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
              My zabbix templates https://github.com/kloczek/zabbix-templates


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                I've been asked to share some data about CPU usage in my case on Linux and Solaris. Here is a graph with 3 months of data: on the left side is the CPU usage on Linux. After it there is a gap, when the master DB role was on the promoted slave (and I had time to reinstall everything and run a couple of tests), and on the right side is the current CPU usage on Solaris. As I wrote, I'm using zfs lzjb compression, so theoretically the CPU usage on Solaris should now be higher .. but it isn't.
                Most of the %sys time on Solaris is consumed by the compression/decompression threads.
                Attached Files
                Last edited by kloczek; 23-01-2015, 14:53.
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates


                • Colttt
                  Senior Member
                  Zabbix Certified Specialist
                  • Mar 2009
                  • 878

                  #9
                  Originally posted by prakhar
                  StartDBSyncers=16
                  Please decrease your syncers to 8.

                  Zabbix Server has a configuration setting called StartDBSyncers. By default, this value is set to 4. This may seem like a conservative setting, but increasing the value can do more harm than good – I’ll try to explain why.
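
                  One way to check whether extra syncers are actually doing work or just piling up contention is to look at the connection states on the DB side; a small sketch for PostgreSQL 9.3 (which still has the boolean waiting column in pg_stat_activity; the Zabbix database is assumed to be named zabbix):
                  Code:
                  -- how many Zabbix connections are active vs. idle, and how many wait on locks
                  SELECT state, waiting, count(*)
                  FROM pg_stat_activity
                  WHERE datname = 'zabbix'
                  GROUP BY state, waiting
                  ORDER BY count(*) DESC;
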
                  Debian-User

                  Sorry for my bad english


                  • prakhar
                    Junior Member
                    • Jan 2013
                    • 8

                    #10
                    Colttt: thanks for the DBSyncers input. It did help.

                    But the overall performance is still not as expected.

                    kloczek: So the priority now is to partition the tables; here are the on-disk sizes of some of the tables.

                    postgres=# SELECT pg_size_pretty(pg_database_size('zabbix'));
                    pg_size_pretty
                    ----------------
                    57 GB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('items'));
                    pg_size_pretty
                    ----------------
                    34 GB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('history'));
                    pg_size_pretty
                    ----------------
                    8000 MB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('trends'));
                    pg_size_pretty
                    ----------------
                    173 MB
                    (1 row)

                    zabbix=# select pg_size_pretty(pg_total_relation_size('hosts'));
                    pg_size_pretty
                    ----------------
                    52 MB
                    (1 row)

                    As I can see, the history table is around 8 GB, but my main concern is the "items" table: it is 34 GB. Any query involving the items table takes a lot of time.

                    zabbix=# explain analyze select * from items;
                    QUERY PLAN
                    -----------------------------------------------------------------------------------------------------------------------
                    Seq Scan on items (cost=0.00..3431875.54 rows=9995754 width=2497) (actual time=4.007..136778.522 rows=40714 loops=1)
                    Total runtime: 136781.698 ms


                    1. Why has my items table bloated to such a huge size? Is this a normal size for an items table with around 50k items and 25 days of data?

                    2. What approach can I take to partition the items table?

                    3. Can anyone help me with Zabbix query optimizations? (DB: postgresql 9.3)
                    Last edited by prakhar; 05-02-2015, 05:48.


                    • jan.garaj
                      Senior Member
                      Zabbix Certified Specialist
                      • Jan 2010
                      • 506

                      #11
                      Your questions should be:

                      1.) Why does my items table have such a huge size?
                      Did you run the vacuum command? (see the sketch after this list)

                      Did you use the LLD feature before? (maybe old discovered items are still there)

                      2.) Why is the load 500+?
                      Please post:
                      - all last week's graphs of the Zabbix server (Zabbix performance graphs, CPU load/util, IOPs, network, Postgresql stats, ...)
                      - full output from the commands:
                      ps -ef
                      iostat -xk 10 10
                      mpstat -P ALL 10 2
                      - zabbix server config
                      - zabbix server log (errors)
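
                      Regarding the vacuum question above, a minimal sketch of how one could first check the items table for dead rows and then reclaim the space (VACUUM FULL rewrites the table and takes an exclusive lock, so it belongs in a maintenance window with the Zabbix server stopped):
                      Code:
                      -- dead vs. live tuples and the last (auto)vacuum runs for the items table
                      SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
                      FROM pg_stat_user_tables
                      WHERE relname = 'items';

                      -- rewrite the table to shrink it, then refresh planner statistics
                      VACUUM FULL VERBOSE items;
                      ANALYZE items;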
                      Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
                      My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant

