Zabbix queue fills up - SNMP items are not recalculated/requeued

  • rglade
    Junior Member
    • Jan 2017
    • 13

    #1

    Zabbix queue fills up - SNMP items are not recalculated/requeued

    Hi all,

    In many cases the problem of queues filling up has already been discussed. First of all, I have already tested all the performance presets from other articles.

    For me, the problem occurs only in connection with SNMP queries. Above all, items that are not queried frequently (for example, items that are polled only once per hour) are apparently never queried again if the query itself once failed.

    Therefore, I logged the queries over a longer period of time and found that Zabbix does not even try to retrieve these values again!

    I wonder if there is a design flaw in Zabbix here. The Zabbix proxy involved has enough resources, and no caches or similar appear to be overflowing.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by rglade
    Hi all,

    In many cases the problem of queues filling up has already been discussed. First of all, I have already tested all the performance presets from other articles.
    Which queue?
    What exactly happens in your case?
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • rglade
      Junior Member
      • Jan 2017
      • 13

      #3
      We use Zabbix for monitoring servers with the Zabbix agent / IPMI and switches and routers with SNMP.

      We currently monitor more than 84000 items, and in general all IPMI and Zabbix agent items are less troublesome. We use 3 Zabbix proxies with one Zabbix server.
      For SNMP items this looks different. Items that apparently could not be checked are not checked again and then end up among the items that could not be checked for more than 10 minutes.

      I've just watched this a little bit more and checked whether the server ever tries to repeat these previously aborted queries. And to my surprise: it does not.

      This means that items whose query was aborted will never be queried again; they are not put back into the queue.
      In my case, the warning eventually comes up: "More than 100 items having missing data for more than 60 minutes".

      We use Debian with the Zabbix repository - currently version 4.0.4. My impression is that the problems did not yet exist with Zabbix 2.x, and since 3.x/4.x the problem has become more pronounced.
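For context, the queue view in the frontend buckets items by how long they are overdue relative to their scheduled nextcheck (over 5 s, 10 s, 30 s, 1 m, 5 m, 10 m). A minimal sketch of that bucketing logic with hypothetical timestamps (this is an illustration, not the Zabbix source):

```python
# Sketch of how the frontend queue view buckets overdue items:
# an item sits "in the queue" once now exceeds its scheduled nextcheck.
BUCKETS = [
    (5, "over 5 seconds"),
    (10, "over 10 seconds"),
    (30, "over 30 seconds"),
    (60, "over 1 minute"),
    (300, "over 5 minutes"),
    (600, "over 10 minutes"),
]

def queue_bucket(now, nextcheck):
    """Return the queue bucket label for an overdue item, or None if on time."""
    overdue = now - nextcheck
    if overdue <= 5:
        return None
    label = None
    for threshold, name in BUCKETS:
        if overdue > threshold:
            label = name  # keep the deepest bucket whose threshold is exceeded
    return label

# An item whose failed check is never rescheduled keeps drifting and
# eventually parks permanently in the last bucket:
print(queue_bucket(now=1550040000, nextcheck=1550039000))  # over 10 minutes
```

This also explains the symptom: an item that is never put back into the queue with a new nextcheck accumulates in the "over 10 minutes" bucket forever.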


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        1) Which Zabbix queue are you talking about?
        2) What exactly happens?
        3) How many NVPS do you have on the Zabbix server and on each proxy?
        4) How many SNMP and IPMI devices are monitored per proxy?
        5) How many pollers do you have on each proxy? What is the poller utilisation? Do you monitor each proxy itself? (a dummy host on 127.0.0.1/localhost monitored through that exact proxy, using only the standard "Template App Zabbix Proxy" template)

        - used OS doesn't matter
        - please just answer my questions, as precisely and briefly as you can.


        • rglade
          Junior Member
          • Jan 2017
          • 13

          #5
          1) Which Zabbix queue are you talking about?
          The SNMP queue on the respective Zabbix proxy is affected.


          2) What exactly happens?
          Maybe an example. A switch is monitored, and several items are defined for its interfaces. The per-minute queries (in / out / err) are usually not a problem. However, there are items that are queried only once an hour. If these fail, they are not polled again after the defined interval. Apparently even the aborted queries are never retried.

          3) How many NVPS do you have on the Zabbix server and on each proxy?

          [Image: zabbix_perf.png]

          4) How many SNMP and IPMI devices are monitored per proxy?
          IPMI: No requests to the affected zabbix proxies.
          SNMP: 20 to 30 switches.

          5) How many pollers do you have on each proxy? What is the poller utilisation? Do you monitor each proxy itself? (a dummy host on 127.0.0.1/localhost monitored through that exact proxy, using only the standard "Template App Zabbix Proxy" template)

          I have tested many parameters here, from smaller values to larger ones - even utopian ones. As already mentioned, up to version 2.0 we had no problems of this kind. It makes no big difference how many pollers we set. Currently we use these parameters:

          Code:
          ConfigFrequency=1200
          StartPollers=60
          StartIPMIPollers=20
          StartPollersUnreachable=84
          StartTrappers=20
          StartPingers=30
          StartHTTPPollers=5
          StartVMwareCollectors=5
          CacheSize=1024M
          HistoryCacheSize=2048M
          HistoryIndexCacheSize=128M
          ExternalScripts=/usr/lib/zabbix/externalscripts
          StartPingers=50
          Timeout=15
          TrapperTimeout=80
          HousekeepingFrequency=1
          UnavailableDelay=20
          UnreachableDelay=10
          UnreachablePeriod=30
          Because of these described problems, we have outsourced all SNMP requests to two proxies that are also in the same VLAN as the switches themselves.
          Thank you for your support - we are really at a loss. :-) Robert
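As a back-of-the-envelope check on poller sizing with settings like the above: each poller process can sustain roughly one check per average check duration, so the number of busy pollers is approximately NVPS times the average SNMP response time. A sketch of that arithmetic (the 0.25 s average response time is an assumption for illustration, not a measured value):

```python
import math

def pollers_needed(nvps, avg_check_seconds):
    """Estimate busy poller processes: each poller completes ~1/avg_check_seconds checks per second."""
    return math.ceil(nvps * avg_check_seconds)

# e.g. 200 new values per second at ~0.25 s per SNMP get -> ~50 busy pollers,
# comfortably under StartPollers=60.
print(pollers_needed(200, 0.25))  # 50

# But if each check runs into the full Timeout=15 s, the same 200 NVPS would
# need 3000 pollers -- i.e. timeouts, not the poller count, become the bottleneck.
print(pollers_needed(200, 15))    # 3000
```

The point of the sketch: raising StartPollers cannot compensate for checks that consistently time out.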


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by rglade
            1)
            2) What exactly happens?
            Maybe an example. A switch is monitored, and several items are defined for its interfaces. The per-minute queries (in / out / err) are usually not a problem. However, there are items that are queried only once an hour. If these fail, they are not polled again after the defined interval. Apparently even the aborted queries are never retried.
            What do you see in proxy logs about those OIDs?

            3) How many NVPS do you have on the Zabbix server and on each proxy?
            You are below the out-of-the-box limit on monitoring data points, which is defined in the source code in include/proxy.h:

            Code:
            #define ZBX_MAX_HRECORDS       1000
            #define ZBX_MAX_HRECORDS_TOTAL 10000
            so this is not a case where you may have reached such a limit.
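As a rough illustration of what those constants bound (a sketch; the ZBX_MAX_HRECORDS values are quoted from include/proxy.h above, and the 1 s DataSenderFrequency is the proxy's documented default, not something from this thread):

```python
# Per-sync upload limits on a Zabbix proxy, from include/proxy.h:
ZBX_MAX_HRECORDS = 1000         # history records per single data request
ZBX_MAX_HRECORDS_TOTAL = 10000  # records per data-sender cycle

# Proxy config default (assumed here): one sender cycle per second.
DATA_SENDER_FREQUENCY = 1       # seconds

# Upper bound on sustained values/second a proxy can upload to the server:
max_nvps = ZBX_MAX_HRECORDS_TOTAL / DATA_SENDER_FREQUENCY
print(max_nvps)  # 10000.0
```

With 20-30 switches per proxy, the actual NVPS is orders of magnitude below that ceiling, which supports kloczek's point that this limit is not the problem.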

            Check what is logged in the proxies' logs.


            • rglade
              Junior Member
              • Jan 2017
              • 13

              #7
              Ok, I'm not sure whether more than 1000 data points are being monitored. Some switches have 96 ports, each with at least 10 items. If we monitor 10 such switches, then 9600 items are monitored - but not all every minute; some have an update interval of 30m. Is this limit per unit of time?
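The limit quoted above is per upload batch, not a cap on the item count; what matters for load is new values per second (NVPS), i.e. items divided by their update interval. A quick sketch of that arithmetic (the half-and-half split of the 9600 items between 1m and 30m intervals is hypothetical, chosen only to illustrate):

```python
def nvps(item_groups):
    """Sum new-values-per-second over (item_count, interval_seconds) groups."""
    return sum(count / interval for count, interval in item_groups)

# 10 switches x 96 ports x 10 items = 9600 items; suppose half poll every
# 60 s and half every 1800 s (illustrative split, not measured):
rate = nvps([(4800, 60), (4800, 1800)])
print(round(rate, 2))  # 82.67
```

Note how little the 30m items contribute: the hourly/30m items that are getting stuck are a tiny fraction of the proxy's polling load.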

              At DebugLevel 3 nothing is logged.
              In tcpdump, however, I can no longer see any query that polls the affected items again.

              Definitely, only items that have a longer update interval defined are affected. Maybe a detailed example helps. Here are the items that have the problems:


              [Image: Zabbix_Queue_hanging_items.png]


              You can retrieve the entries manually at any time via snmpwalk:
              [Image: Zabbix_Queue_hanging_items_manualquery.png]



              [Image: Zabbix_Queue_hanging_items_definitions.png]

              There is an important correlation: only items with a longer update interval are affected. Other items from the same host with a shorter update interval work without any problems!

              [Image: Zabbix_Queue_hanging_items_correct.png]


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Again... what did you find in the proxy logs about those items?


                • rglade
                  Junior Member
                  • Jan 2017
                  • 13

                  #9

                  Unfortunately I find no log entries for the SAN05 or the OID, neither at DebugLevel 3 nor 4. But in tcpdump I could see the requests.


                  • rglade
                    Junior Member
                    • Jan 2017
                    • 13

                    #10
                    Are these too many items?
                    [Image: Zabbix_Proxies.png]


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Originally posted by rglade
                      Unfortunately I find neither in DebugLevel3 nor 4 any log entries to the SAN05 or the OID. But in TCPDump I could see the requests.
                      Again... what did you find in the logs?
                      Did you find any errors/warnings related to monitoring those hosts with SNMP metrics at the default debug level?
                      Please... I have not been asking you to increase the debug level on the proxies or to fiddle with tcpdump.
                      It is really hard to help when, instead of focusing on what I've asked you to do, you do what you think I'm asking you to do.


                      • rglade
                        Junior Member
                        • Jan 2017
                        • 13

                        #12
                        The problem is that the log does not really contain anything useful. For example, for the disk ID of a SAN:

                        OID: 3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19

                        Code:
                        7033:20190213:065047.619 snmp:[oid:'1.3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19' community:'{$SNMP_COMMUNITY}' oid_type:0]
                        7033:20190213:065047.619 snmpv3:[securityname:'' authpassphrase:'' privpassphrase:'']
                        7033:20190213:065047.619 snmpv3:[contextname:'' securitylevel:0 authprotocol:0 privprotocol:0]
                        7033:20190213:065047.619 itemid:100100000079866 hostid:100100000010357 key:'scDiskID[20]'
                        7033:20190213:065047.619 type:4 value_type:3
                        7033:20190213:065047.619 interfaceid:100100000000200 port:''
                        7033:20190213:065047.619 state:0 error:''
                        7033:20190213:065047.619 flags:4 status:0
                        7033:20190213:065047.619 valuemapid:0
                        7033:20190213:065047.619 lastlogsize:0 mtime:0
                        7033:20190213:065047.619 delay:'1h' nextcheck:1550039800 lastclock:0
                        7033:20190213:065047.619 data_expected_from:1550037047
                        7033:20190213:065047.619 history:1
                        7033:20190213:065047.619 poller_type:0 location:1
                        7033:20190213:065047.619 inventory_link:0
                        7033:20190213:065047.619 priority:1 schedulable:1
                        7033:20190213:065047.619 units:'' trends:1
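The delay, nextcheck and data_expected_from fields in debug output like the above are plain Unix timestamps; a small sketch for decoding them when reading such logs:

```python
from datetime import datetime, timezone

def decode(ts):
    """Render a Zabbix epoch field as a human-readable UTC time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

# Fields taken from the log lines above:
print(decode(1550039800))  # nextcheck:          2019-02-13 06:36:40 UTC
print(decode(1550037047))  # data_expected_from: 2019-02-13 05:50:47 UTC
```

Decoding like this makes it easy to compare the scheduled nextcheck against what (if anything) the poller actually did at that time.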


                        Nevertheless, the item continues to be displayed as not available. What should I look for in the logfile? At LogLevel 5 a great deal is logged.


                        There are really no abnormalities in it. With this SAN in particular it also happens only very rarely that it temporarily does not respond.


                        • rglade
                          Junior Member
                          • Jan 2017
                          • 13

                          #13
                          I have now again spent a lot of time analyzing the log data and researching the cause. I can say the following:
                          • Only items that have a longer update interval (for example 1h) are affected
                          • No query for the affected data can be found in the log data
                          • All servers have the same time and time zone, synchronized by NTP
                          • All affected devices answer other SNMP queries without any problems (for example the queries at the 1m interval)
                          • It is also interesting that Zabbix correctly queries the discovery rules, which partly use the same OIDs. It seems to me that Zabbix simply ignores the queries with longer intervals.
                          Others report something similar - that the queue fills up. Maybe there is a design error?

                          However, the log file itself is difficult to interpret, as the correlations are hard to recognize. Maybe you have an idea?


                          • rglade
                            Junior Member
                            • Jan 2017
                            • 13

                            #14

                            I could find one thing after all. It seems Zabbix sometimes cuts off the first character of the SNMP OID?
                            [Image: zabbix2_missentries.png]

                            In the item configuration, the correct OID is configured!
                            Last edited by rglade; 18-03-2019, 14:12.
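If the first character of the OID really is being dropped somewhere, a cheap way to spot affected items is to sanity-check every configured OID against the usual iso(1) root. A standalone sketch (the validity criterion is an assumption for this situation: SNMP item OIDs in practice start with 1.3.6.1, although other OID roots technically exist):

```python
def oid_is_valid(oid):
    """Accept numeric OIDs rooted at 1.3.6.1 (iso.org.dod.internet),
    with or without a leading dot. Anything else is suspicious here."""
    return oid.lstrip(".").startswith("1.3.6.1.")

# The truncated OID quoted in this thread fails, the configured one passes:
print(oid_is_valid("3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19"))    # False
print(oid_is_valid("1.3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19"))  # True
```

Running such a check over an item export would quickly show whether the truncation lives in the configuration or only in what the poller sends.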


                            • rglade
                              Junior Member
                              • Jan 2017
                              • 13

                              #15

                              I did the work to understand the problem more deeply and to reproduce it in a test.

                              - Installation 2.4 -> no problems, even without a proxy
                              - Update 3.2 / 3.4 -> Problems, but with a proxy solvable
                              - Update 4.0 -> Problems, but with two proxies solvable

                              In fact, version 2.x was apparently not prone to this problem, and only with version 3.2 does it seem to become a problem when many SNMP devices need to be polled. The problem grew with the increasing number of data points. It also seems that with version 4 the problem is even more extensive. Unfortunately, this statement is very daring, since I already found larger problems with the queue filling up in version 3.4. Contrary to all assumptions, the performance of one Zabbix proxy was no longer enough. We could only fix the problem by installing another proxy for the switches.

