**nobody** · 12-01-2017, 19:44

Hi Moderators,

Please move this post into the "Zabbix Help" forum. I didn't see that "Large environments" are 2K+ hosts (and at least 1K+ values per second).

We definitely have only half of that.

Sorry for my mistake!

I hope someone has some insight as to how to reduce items in the queue.

Regards,
Nobody

**nobody** · 18-01-2017, 18:31

Hi Zabbix Forums,

Does no one have any insight into this issue? Load on the server is not a problem: disk, CPU, and RAM are all fine.

I can't post any attachments because the forums don't let me. The Zabbix busy poller is 80% busy all of the time looking for SNMP values that don't exist on devices. Removing these items from the templates is not an option, unfortunately. How can I reduce the busy poller % so that I stop losing data in my graphs?

Regards,
Nobody

**Pada** · 19-01-2017, 00:20

Hi,

Is there no way that you can use LLD (low-level discovery) to add only the necessary/available items instead?

Ideally the "unsupported" items should be handled better by Zabbix and not affect the monitoring of available items as much...

**nobody** · 19-01-2017, 00:44

Hi Pada,

I appreciate your response. I haven't used discovery or LLD within Zabbix before. I manually create all of my hosts. We have quite a vast network with lots of different devices. I only anticipate this issue becoming much worse as our switches become increasingly modular.

I'll start looking into LLD to see if I can better optimize Zabbix, but in the meantime do you have any other suggestions (aside from disabling the host items)? Perhaps I can change some server options or something to better manage the unreachable poller?

Code:

### Option: StartPollersUnreachable
#       Number of pre-forked instances of pollers for unreachable hosts (including IPMI).
# 
# Mandatory: no
# Range: 0-1000
# Default:
 StartPollersUnreachable=500

### Option: UnreachablePeriod
#       After how many seconds of unreachability treat a host as unavailable.
# 
# Mandatory: no
# Range: 1-3600
# Default:
UnreachablePeriod=15

### Option: UnavailableDelay
#       How often host is checked for availability during the unavailability period, in seconds.
# 
# Mandatory: no
# Range: 1-3600
# Default:
# UnavailableDelay=60

### Option: UnreachableDelay
#       How often host is checked for availability during the unreachability period, in seconds.
#
# Mandatory: no
# Range: 1-3600
# Default:
UnreachableDelay=15

Zabbix Internal process busy %: All values averaging less than 1% (usually 0.025% to 0.500%)

Zabbix data gathering process busy%: Most items about 5% or lower, with the exception of "Zabbix busy unreachable poller process, in%" which is averaging 71%, hitting about 90% busy on occasion.

Swap is not an issue, server swappiness is set to "last ditch effort", and will only swap out if system is out of ram.

**Pada** · 19-01-2017, 01:07

What is your "StartPollers" value? It defaults to "5", I believe, which could be the reason why you're having issues.

You should only bump up "StartPollersUnreachable" when you have lots of hosts that are considered unreachable, and not when you have items that became "unsupported" - or at least that is my understanding of it.
In our environment we bumped it from the default value of 1 to 5.

Another thing that you can try is to monitor the hosts from several Zabbix proxy servers, so that they can split the load.
For example, have a Zabbix proxy server per /24 subnet, or use any other kind of logical/physical grouping that makes sense.
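
To make the per-subnet grouping concrete, here is a minimal Python sketch that buckets host IPs by /24, so that each bucket could be assigned to one proxy. The function name and the addresses are hypothetical, and the script only plans the grouping; it does not talk to Zabbix.

```python
import ipaddress
from collections import defaultdict


def group_hosts_by_subnet(host_ips, prefix_len=24):
    """Group host IPs by their containing subnet (hypothetical helper)."""
    groups = defaultdict(list)
    for ip in host_ips:
        # strict=False accepts a host address and returns its network
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(net)].append(ip)
    return dict(groups)


# Made-up addresses, for illustration only
hosts = ["10.0.1.10", "10.0.1.20", "10.0.2.5"]
groups = group_hosts_by_subnet(hosts)
for subnet, members in groups.items():
    print(subnet, "->", members)
```

Each resulting group is a candidate for one proxy; very small groups could be merged by hand.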

For instance, our Zabbix server is currently not doing any fetching of data, since our proxy servers are doing that and then pushing the results to the server.
For example, our "StartPollers" is set to "70" on our big subnet that has lots of SNMP or (passive) Zabbix agents.

Update/Edit:
Are you perhaps monitoring your devices over a high-latency network connection (e.g. cross-continent)? Zabbix proxies are very helpful in cross-continent monitoring scenarios, especially when the links flap or go down.
I'm not sure whether increasing "StartPollers" would result in fewer hosts being regarded as unreachable, and therefore fewer unreachable pollers being required...

**nobody** · 19-01-2017, 18:40

Hi Pada,

Thanks again for your attention to this matter.

These are the core settings that I have:

Code:

### Option: StartPollers
#	Number of pre-forked instances of pollers.
#
# Mandatory: no
# Range: 0-1000
# Default:
 StartPollers=550

### Option: StartPingers
#	Number of pre-forked instances of ICMP pingers.
#
# Mandatory: no
# Range: 0-1000
# Default:
 StartPingers=750

After increasing StartPollersUnreachable from 400 to 500 yesterday I saw ~20% decrease in "Zabbix busy unreachable poller process in%" so I think I'm on the right track.

My StartPollers value hasn't changed much since the upgrade I did from 2.4.5 (I think it was 500 before and I changed it to 550); I didn't see any noticeable improvement with the 50 more pollers. I don't think there's an issue with MySQL anymore: there have been no database crashes since I trashed my old Zabbix database (just renamed it). I'm using the same MySQL settings, including the smaller innodb_buffer_pool_size (I changed it from 12GB to 8GB prior to this, and it seems to have helped overall system performance).

The "Zabbix busy poller process, in %" is never more than 20% busy in spikes, and usually averages about 5%. I don't think it's an issue with the number of StartPollers that I have running.

Our network is relatively low latency: most of our equipment is less than ~20ms away from our primary data center (where the Zabbix server is located), and the vast majority of it is about 5ms away, with bursts up to maybe 10-15ms during peak times. All of my hosts are monitored with ICMP (as well as advanced SNMP-based triggers). So far, nothing is outside of the province/state that we operate in.

I have always been interested in using proxies, but have not yet had a chance to deploy them (it was removed from one release some time ago). This would certainly help reduce network inconsistencies, but at this point the number of datacenters that we have is small (4), with no more than about 50 dedicated servers (plus maybe around 200 VMs, which are set up as active and handled by the Zabbix trappers, so they put little load on the dedicated Zabbix server). The main components of our infrastructure are locked-up appliances that support SNMP only. Bulk requests aren't supported on most of our appliances and cause lots of gaps in graphs. Our IP infrastructure is large and all-encompassing, with a vast mix of deployed networking equipment from different vendors (servers, telecom radios, backhaul microwave links, LTE equipment, etc.).

One major improvement we made was to move Zabbix onto its own dedicated StupidMicro server and give it 16GB of RAM. We found with the VMs that the disk queuing was causing serious issues with the database.

I appreciate your insight, Pada. If I have any more improvements I will be sure to post them on here.

Regards,
James

**Pada** · 19-01-2017, 21:45

Perhaps changing the "Refresh unsupported items (in sec)" setting from the default 600s to something like 3600s (or more, or perhaps even 0) may improve your situation - see https://www.zabbix.com/documentation...ration/general
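
A rough back-of-the-envelope sketch of why a longer refresh interval could help, assuming that every re-check of an unsupported item occupies a poller for a fixed amount of time. The item count (6000) and the per-check cost (3 s) are made-up illustration values, not measurements from this thread, and treating 0 as "never re-check" is likewise an assumption.

```python
def poller_load(unsupported_items, refresh_seconds, cost_seconds):
    """Poller-seconds consumed per second by re-checking unsupported items.

    Simplified model: each unsupported item is retried once per refresh
    interval and each retry ties up one poller for cost_seconds.
    """
    if refresh_seconds == 0:
        # Assumption: a refresh interval of 0 disables re-checking
        return 0.0
    return unsupported_items * cost_seconds / refresh_seconds


ITEMS = 6000   # hypothetical number of unsupported items
COST = 3.0     # hypothetical seconds a poller spends per failed check

for refresh in (600, 3600, 0):
    load = poller_load(ITEMS, refresh, COST)
    print(f"refresh={refresh:>4}s -> about {load:.1f} pollers kept busy")
```

Under this model, going from 600s to 3600s cuts the load from these retries by a factor of six; how much that matters in practice depends on what each failed check really costs.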

If the majority of your items are in the "unsupported" state because the SNMP items do not exist (and never will exist) on that particular host, say because you applied a template meant for a 32-port switch to an 8-port switch, then I would reckon that you'll benefit a lot from having LLD detect and monitor only the items that actually exist on each host.
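
The 32-port versus 8-port example can be sketched in a few lines of Python. This is only a model of the idea, not how LLD is configured (that is done with discovery rules and item prototypes in the Zabbix frontend); the function names and the item key format are hypothetical.

```python
def static_template_items(port_count):
    """Items a fixed template would create: one per port, whether it exists or not."""
    return [f"ifInOctets.{i}" for i in range(1, port_count + 1)]


def discovered_items(port_indexes):
    """Items created only for the port indexes the device actually reports."""
    return [f"ifInOctets.{i}" for i in port_indexes]


# Hypothetical case: 32-port template applied to an 8-port switch
template = static_template_items(32)
real_ports = list(range(1, 9))
existing = set(discovered_items(real_ports))

# Everything the template polls that the device cannot answer
unsupported = [key for key in template if key not in existing]
print(len(template), "template items,", len(unsupported), "would be unsupported")
```

With discovery-style item creation, only the 8 existing items would be polled, and the other 24 would never end up in the queue.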

Our network is vast, so if for some reason a portion of our network goes down, I don't want the items to all become disabled on all of the affected hosts and have to re-enable them manually.

If it goes down due to the link between your Zabbix server and that particular data center, then having a Zabbix proxy at that data center to cache/buffer the results would also help. When you have a local Zabbix proxy in that data center, then the hosts should all still remain in a reachable state, which implies that you'll not have the need for such a high StartPollersUnreachable any more. Also, the proxy has its own unreachable poller processes.

For example, our Zabbix proxy caches results for 1h when the link between our on-premises Zabbix server and Amazon EC2 goes down, and when it comes back online the Zabbix server gets all of the data back, except for calculated items (or at least this is the case with Zabbix 1.8).

There is just one big caveat with the proxy: if you are sending alerts when hosts become unreachable, you'll now have to make the triggers of the hosts behind the proxy depend on a trigger that indicates whether the proxy server is reachable. If you already have trigger dependencies like that built into multiple templates, then I suppose this wouldn't be too difficult to apply either.
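
The dependency logic described above boils down to one condition, sketched here in Python. The function and host names are hypothetical; in Zabbix itself this is set up as a trigger dependency, not as code.

```python
def should_alert(host_unreachable, proxy_unreachable):
    """Alert on a host only if its proxy is still reachable.

    Models a dependent trigger: while the proxy trigger is in a problem
    state, the host triggers that depend on it stay quiet.
    """
    return host_unreachable and not proxy_unreachable


# Hypothetical scenario: the link to the proxy drops, so every host
# behind it looks unreachable from the server's point of view.
hosts_behind_proxy = {"switch-a": True, "switch-b": True}
alerts = [name for name, down in hosts_behind_proxy.items()
          if should_alert(down, proxy_unreachable=True)]
print(alerts)  # no host alerts; only the proxy trigger itself would fire
```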

Zabbix Queue - Host Item Rechecking
