Performance. Yes. Performance issue again.

  • akbar415
    Senior Member
    • May 2015
    • 119

    #1

    Performance. Yes. Performance issue again.

    Before starting, I want to say that I have already read these two links:
    In the past, quite often Zabbix users have been puzzled regarding some server tuning parameters – for example, how many pollers do they need? It was usually determined based on experience, testing and a bit of guesstimating. No more fuzzy attempts – get hard facts with Zabbix 1.8.5. UPDATED 2011.11.02: new downloadable template version v2 […]


    But I still have the trigger:
    Zabbix poller processes more than 75% busy
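
    For reference, this trigger comes from the stock Zabbix server template and is, as far as I know, built on the internal poller-busyness item, roughly as below; the exact averaging period may differ between versions.

    Code:
    Item key:  zabbix[process,poller,avg,busy]
    Trigger:   {Template App Zabbix Server:zabbix[process,poller,avg,busy].avg(10m)}>75   (approximate)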


    Zabbix Configuration.

    Code:
    StartPollers=420
    StartPollersUnreachable=40
    StartTrappers=40
    StartPingers=10
    StartDiscoverers=10
    StartHTTPPollers=15
    StartJavaPollers=7
    CacheSize=1G
    StartDBSyncers=25
    HistoryCacheSize=256M
    TrendCacheSize=16M
    HistoryTextCacheSize=64M
    ValueCacheSize=16M
    Timeout=7
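
    For completeness, the internal items that can be graphed to check whether any of these caches is actually running full (item keys as I understand them; check the documentation for your exact version):

    Code:
    zabbix[rcache,buffer,pfree]    (configuration cache, CacheSize)
    zabbix[wcache,history,pfree]   (history cache, HistoryCacheSize)
    zabbix[wcache,trend,pfree]     (trend cache, TrendCacheSize)
    zabbix[vcache,buffer,pfree]    (value cache, ValueCacheSize)
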
    Zabbix Status
    Code:
    Number of hosts (enabled/disabled/templates)            191      129 / 14 / 48
    Number of items (enabled/disabled/not supported)        23045    21949 / 513 / 583
    Number of triggers (enabled/disabled [problem/ok])      6780     6555 / 225 [17 / 6538]
    Required server performance, new values per second      270.15   -
    MySQL
    Code:
    /etc/mysql/my.cnf


    When the problem started, I raised the value from StartPollers=256 to StartPollers=420, without success.
    The busy percentages of the other poller types (HTTP, Java, ping, etc.) remain the same in the "Zabbix data gathering process busy %" graph.


    Last edited by akbar415; 15-12-2016, 19:08.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Very low trapper load and high poller load means that you are mostly using "Zabbix agent" items instead of "Zabbix agent (active)" items.

    Passive monitoring does not scale beyond a certain NVPS and requires much more server/proxy threads than active monitoring. At some scale, switching to active monitoring (and active agents) is the only option.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • akbar415
      Senior Member
      • May 2015
      • 119

      #3
      Originally posted by kloczek
      Very low trapper load and high poller load means that you are mostly using "Zabbix agent" items instead of "Zabbix agent (active)" items.

      Passive monitoring does not scale beyond a certain NVPS and requires much more server/proxy threads than active monitoring. At some scale, switching to active monitoring (and active agents) is the only option.
      First, thanks for helping me.

      I changed some passive check items (30% of all items) to active checks. This helped a little: from an average of 90% busy pollers down to about 80%.

      But this caused another problem: the Zabbix queue for active checks grew very fast, +250 items in the "over 10 minutes" queue.
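
      The growth can also be tracked with an internal item; to my knowledge the key below returns the number of monitored items delayed by more than 10 minutes:

      Code:
      zabbix[queue,10m]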


      Sorry for my bad English.


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by akbar415
        First, thanks for helping me.

        I changed some passive check items (30% of all items) to active checks. This helped a little: from an average of 90% busy pollers down to about 80%.

        But this caused another problem: the Zabbix queue for active checks grew very fast, +250 items in the "over 10 minutes" queue.
        So now you need to go over all agents and replace:
        Server=<your_prx_or_srv_addr>
        with:
        ServerActive=<your_prx_or_srv_addr>
        StartAgents=0

        This will make the agents decide themselves when to push batches of collected monitoring data to the server/proxy.
        Without an active-agent setup, a new bottleneck is created by querying the agents serially, one by one, to read the monitoring data.
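
        A fuller zabbix_agentd.conf sketch for a purely active setup might look like the following (placeholders kept from above; Hostname must match the host name configured in the frontend, otherwise the active items just sit in the queue):

        Code:
        # example zabbix_agentd.conf fragment for active-only checks (illustrative values)
        ServerActive=<your_prx_or_srv_addr>
        StartAgents=0
        Hostname=<host name exactly as configured in the frontend>
        RefreshActiveChecks=120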
        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
        https://kloczek.wordpress.com/
        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
        My zabbix templates https://github.com/kloczek/zabbix-templates


        • akbar415
          Senior Member
          • May 2015
          • 119

          #5
          Originally posted by kloczek
          So now you need to go over all agents and replace:
          Server=<your_prx_or_srv_addr>
          with:
          ServerActive=<your_prx_or_srv_addr>
          StartAgents=0

          This will make the agents decide themselves when to push batches of collected monitoring data to the server/proxy.
          Without an active-agent setup, a new bottleneck is created by querying the agents serially, one by one, to read the monitoring data.
          I configured 10 servers to use only active checks. No success.
          But thanks for your help.


          • akbar415
            Senior Member
            • May 2015
            • 119

            #6
            Some more information that might help you to help me.

            I added 4 servers to be monitored by the Zabbix server; after that, the server fired the "Zabbix poller processes more than 75% busy" trigger.


            I have had this problem before, so I knew that all I had to do was raise the StartPollers number (then 256).
            First I set StartPollers=280 (without success), then 300 (without success) ...
            Now the value is 420 and I still have the problem.

            I tried to disable 4 hosts monitored by the server (without success).

            I changed 10 hosts to be monitored only by active checks (the Zabbix queue grew very fast and caused another issue).

            Nothing appears in the log (DebugLevel=4).

            Performance of the Zabbix server machine is fine (memory, CPU).

            I don't know what else I can check.
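
            (Aside, not from the thread: one more thing that can be checked is timing a few of the passive agent items by hand with zabbix_get, which ships with Zabbix; keys that are consistently slow keep a poller occupied for the whole call. For example:)

            Code:
            $ time zabbix_get -s <monitored host> -k agent.ping
            $ time zabbix_get -s <monitored host> -k vfs.fs.discovery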


            Sorry for the bad English.


            • abevern
              Junior Member
              • Apr 2015
              • 10

              #7
              Grave Dig!

              Did you get any resolution on this?

              I'm looking to reduce my "poller busy" numbers too.

              We currently have 200 pollers at around 70% busy, with a load of ~350 NVPS.

              Increasing pollers works to an extent, but you really need to identify the things that tie the pollers up for longer periods, i.e. checks that are particularly expensive.

              I've had some luck optimising some external checks, specifically reducing the number of modules and the "prettiness" in Perl scripts to maximise speed. I have also re-implemented some Perl scripts in bash, as bash starts much faster.

              I test optimisations by running them 500 times from the command line, e.g.:

              Code:
              $ time for i in {1..500} ;do 
              /usr/lib/zabbix/externalscripts/check.pl SERVER >/dev/null 
              done 
              
              real	0m8.372s
              user	0m3.879s
              sys	0m1.516s
              You can get an idea of which external checks take longer by looking at the output of ps. Specifically, pollers that have child processes are running external checks; if there are a lot running a particular type of check, it could be worth looking into.

              The sample below led me to look more closely at check.pl.

              Code:
              $ ps -fuzabbix --forest 
              ....
              zabbix   19947 19765  0 08:30 ?        00:00:06  \_ zabbix_server: poller #177 [got 0 values in 0.000003 sec, getting values]
              zabbix   19948 19765  0 08:30 ?        00:00:07  \_ zabbix_server: poller #178 [got 0 values in 0.000002 sec, getting values]
              zabbix   56279 19948  0 13:30 ?        00:00:00  |   \_ /usr/bin/perl -w /usr/lib/zabbix/externalscripts/check.pl vXXXXXXD087
              zabbix   56290 56279  0 13:30 ?        00:00:00  |       \_ sh -c ping -c 1 -s 56 -w 1 vXXXXXXD087  1>/dev/null 2>/dev/null
              zabbix   56295 56290  0 13:30 ?        00:00:00  |           \_ ping -c 1 -s 56 -w 1 vXXXXXXD087
              zabbix   19949 19765  0 08:30 ?        00:00:06  \_ zabbix_server: poller #179 [got 0 values in 0.000002 sec, idle 1 sec]
              zabbix   19950 19765  0 08:30 ?        00:00:06  \_ zabbix_server: poller #180 [got 0 values in 0.000003 sec, idle 1 sec]
              zabbix   19951 19765  0 08:30 ?        00:00:06  \_ zabbix_server: poller #181 [got 0 values in 0.000002 sec, idle 1 sec]
              If anyone else has tips on working out what is taking the pollers' time, I'd be interested to hear them.


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Why are you using your own custom pinging method when you have the built-in pinger?
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates


                • abevern
                  Junior Member
                  • Apr 2015
                  • 10

                  #9
                  Originally posted by kloczek
                  Why are you using your own custom pinging method when you have the built-in pinger?
                  We have 6 domains, so we wrap fping to try all 6 permutations at the same time. The FQDN isn't known because the host is only monitored by an active agent.
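
                  Roughly the kind of wrapper meant here, as a sketch only (the domain names are made up and the real script certainly differs):

                  Code:
                  #!/bin/bash
                  # sketch: try the short hostname against each of our domains in parallel (example domains)
                  host="$1"
                  for dom in corp.example dmz.example lab.example; do
                      fping -c1 -t500 "${host}.${dom}" >/dev/null 2>&1 && echo "${host}.${dom}" &
                  done
                  wait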

                  I just used it as an example; the pinging itself isn't a problem, but there are other external checks that are.


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #10
                    Originally posted by abevern
                    We have 6 domains, so we wrap fping to try all 6 permutations at the same time. The FQDN isn't known because the host is only monitored by an active agent.

                    I just used it as an example; the pinging itself isn't a problem, but there are other external checks that are.
                    ICMP-related keys are not "Zabbix agent (active)" keys but "simple check" keys. You can add pinging to a dummy host, with any hostname or IP you want registered in its interface.
                    As you are using an external check for every pinged address, obtaining each such metric value has to be done by a separate sub-process of the Zabbix server. With the internal pinger over simple checks, that processing is parallelized by definition.
                    So I think you may be on the wrong track if you are doing a large enough number of such checks.
                    Apart from the above, you should not be monitoring any metrics through the server itself, except Zabbix server self-monitoring via internal checks. Everything else should be moved to a proxy or proxies. Why? To reduce the workload on the server and to keep collecting monitoring data even when the Zabbix server is temporarily down (maintenance).
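
                    For the dummy-host approach, the simple-check key is, as far as I know, just icmpping against the host's interface address; packet count, interval, size and timeout are optional parameters, e.g.:

                    Code:
                    icmpping[,4,,56,500]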
                    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                    https://kloczek.wordpress.com/
                    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                    My zabbix templates https://github.com/kloczek/zabbix-templates


                    • akbar415
                      Senior Member
                      • May 2015
                      • 119

                      #11
                      Database upgrade

                      Originally posted by abevern
                      Did you get any resolution on this?

                      [...]

                      We solved the problem with a hardware upgrade of the MySQL database server (from 8 GB to 16 GB of RAM).
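
                      The my.cnf contents were never posted, so purely as a guess at what made the extra RAM matter: on a dedicated MySQL box most of it usually goes to the InnoDB buffer pool, along the lines of:

                      Code:
                      # hypothetical /etc/mysql/my.cnf fragment -- values are illustrative only
                      [mysqld]
                      innodb_buffer_pool_size = 12G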

