Gaps on the graphs, SNMPv3

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • colohost
    Junior Member
    • May 2018
    • 19

    #31
    Have you tried any of what is being discussed in this thread on an actual piece of network gear? I have, on equipment from Cisco, Arista and Brocade. In all cases, there is a specific number of OID's the given devices can return in 30 seconds when using SNMPv3 with AES. It is easy to measure, you just keep adding OID's to snmpbulkget until you find the number. I've already pointed out how to do this. If you need to query more OID's than what the device will send back in 30 seconds, then you have no choice but to switch to SNMPv2 or not use Zabbix, because it is not currently possible to monitor those reliably.
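    For anyone who wants to repeat that measurement, here is a rough sketch of the idea (host, credentials and interface indexes are placeholders, not from my setup): keep adding OID's to a single snmpbulkget until the reply no longer comes back within the 30 second maximum.

    Code:
    #!/bin/bash
    # Grow the OID list one entry at a time and time each request.
    HOST=192.0.2.1
    OIDS=()
    for n in $(seq 1 500); do
        OIDS+=("IF-MIB::ifInOctets.$n")
        start=$(date +%s)
        if snmpbulkget -v3 -l authPriv -u monitor -a SHA -A authpass \
                -x AES -X privpass -t 30 -r 0 "$HOST" "${OIDS[@]}" >/dev/null 2>&1; then
            echo "${#OIDS[@]} OID's answered in $(( $(date +%s) - start ))s"
        else
            echo "no reply within 30s at ${#OIDS[@]} OID's"
            break
        fi
    done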

    The cause of the failures is Zabbix forcing an artificial timeout onto the response from the device. Traditional network hardware cannot send AES-encrypted OID data back for 1000+ items in less than the maximum timeout Zabbix is capable of using, currently 30 seconds. This is not a net-snmp timeout; this is Zabbix telling it to stop at 30 seconds. Zabbix forces this timeout onto net-snmp, as your own code quote shows, so if it is still receiving data when the Zabbix-imposed timeout is reached, collection is halted at that point, no more data is collected, and your items have gaps.

    The same issue does not happen if using net-snmp directly via snmpbulkget; it only happens when using Zabbix and the target device can't send back all of the data requested in less than the timeout Zabbix has forced onto the query.

    So again, the solution is to put code changes in place on the Zabbix side so that a device is never queried for more SNMPv3 AES-priv OID's than it can possibly return within the current CONFIG_TIMEOUT value Zabbix has been set with. If a given device has more items than can be returned in that amount of time, then Zabbix needs to break them up into buckets of items that are queried at separate, non-overlapping intervals so they can all be retrieved. And if you add more items, at a given query interval, than the target device can possibly ever keep up with, then Zabbix should warn you.

    I submitted all of this as a bug.


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #32
      Originally posted by colohost
      Have you tried any of what is being discussed in this thread on an actual piece of network gear?
      No, I cannot debug anything on those devices' agent side. Can you do anything like this?
      The cause of the failures is Zabbix forcing an artificial timeout onto the response from the device.
      How did you get to this conclusion, or is it only your guess?
      Traditional network hardware cannot send AES encrypted OID data back for 1000+ items in less than the maximum timeout Zabbix is capable of using; currently 30 seconds.
      So what does Zabbix have to do with what happens on the agent side?
      This is not a net-snmp timeout; this is Zabbix telling it to stop at 30 seconds.
      OK. How did you get to this conclusion? Any experiment proving/disproving this?
      Can you show some zabbix debug output showing this?
      The same issue does not happen if using net-snmp directly via snmpbulkget; it only happens when using Zabbix and the target device can't send back all of the data requested in less than the timeout Zabbix has forced onto the query.
      The default net-snmp client-side timeout applied to https://linux.die.net/man/2/select is 1s. If the data from the agent is returned within this 1s, there will be no error.
      If a 1s timeout works for you with snmpbulkwalk, then use Timeout=1 in the Zabbix settings.
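      Something along these lines would check it (host and credentials are only placeholders):

      Code:
      # 1) check that a full walk comes back within a 1s client-side timeout, no retries
      snmpbulkwalk -v3 -l authPriv -u monitor -a SHA -A authpass -x AES -X privpass \
          -t 1 -r 0 192.0.2.1 IF-MIB::ifInOctets
      # 2) if it does, lower the server-side poller timeout in zabbix_server.conf: Timeout=1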
      I submitted all of this as a bug.
      A bug, but where?

      BTW, AES.
      Are you aware of the fact that AES is used only for authentication and not for session communication?

      If using AES for auth were an issue, it would add some fixed latency to both single-OID and multiple-OID queries.
      Last edited by kloczek; 22-05-2018, 08:37.
      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • sfl
        Junior Member
        • Jun 2016
        • 26

        #33
        Hi colohost and kloczek,

        I'm following this thread with great interest because I have several devices using AuthPriv SNMPv3 credentials with random unsupported items on the same host.
        I noticed that switching back from AES to DES or 3DES solves my issue with unsupported items.

        Meanwhile I agree with colohost's point of view: Zabbix should not depend on the SNMP cipher in order to take the agent's answer into account.

        So when our clients ask us what SNMP configuration they should apply, we tell them to use SNMPv2 despite the security concerns :-(

        Regards,
        Sfl


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #34
          Originally posted by sfl
          Meanwhile I agree with colohost's point of view: Zabbix should not depend on the SNMP cipher in order to take the agent's answer into account.
          Nothing in Zabbix forces you to use AES.
          If someone controls the physical layer of the communication between server/proxy and SNMP agents (separate physical paths, VPN with encryption, or VLAN separation), then using SNMPv3 is kind of over-architecting as well.
          If the device's embedded hardware running the SNMP agent provides more sophisticated encryption but is too weak to use that encryption smoothly, nothing can be changed on the SNMP client side.
          http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
          https://kloczek.wordpress.com/
          zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
          My zabbix templates https://github.com/kloczek/zabbix-templates


          • colohost
            Junior Member
            • May 2018
            • 19

            #35
            Originally posted by kloczek
            No, I cannot debug anything on those devices' agent side. Can you do anything like this?
            Yes, I have access to a linux shell on my Arista gear. In any case, the reason for my inquiry is because you do not seem to have any experience with the type of equipment being discussed, so I was just curious.


            Originally posted by kloczek

            How did you get to this conclusion, or is it only your guess?
            Real world experience. I set up the equipment in question, I poll it using SNMPv3 with any other tool at my disposal, I never lose data. I do the same exact thing with Zabbix, I lose data. One of the tools in question is snmpbulkget, which is part of net-snmp, and therefore the same code base Zabbix relies on. Other tools using net-snmp do not fail, Zabbix does.


            Originally posted by kloczek
            BTW, AES.
            Are you aware of the fact that AES is used only for authentication and not for session communication?

            If using AES for auth were an issue, it would add some fixed latency to both single-OID and multiple-OID queries.
            Wow. You clearly do not understand SNMPv3.

            From the abstract of RFC 3414 (https://www.ietf.org/rfc/rfc3414.txt): This document describes the User-based Security Model (USM) for Simple Network Management Protocol (SNMP) version 3 for use in the SNMP architecture. It defines the Elements of Procedure for providing SNMP message level security. This document also includes a Management Information Base (MIB) for remotely monitoring/managing the configuration parameters for this Security Model. This document obsoletes RFC 2574. [STANDARDS-TRACK]


            I'd recommend you read it and learn the difference between authentication and privacy. The auth protocol is MD5 or SHA; that is used for authentication. The data coming back, if encrypted, is protected by the privacy protocol, which is either DES or AES. AES was added to SNMPv3 for data encryption 14 years ago: https://www.ietf.org/rfc/rfc3826.txt
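            To make the distinction concrete in net-snmp terms (hypothetical host and credentials): -a/-A select the authentication protocol and passphrase, while -x/-X select the privacy protocol and passphrase that actually encrypts the returned data.

            Code:
            snmpget -v3 -l authPriv -u monitor -a SHA -A authpass -x AES -X privpass \
                192.0.2.1 SNMPv2-MIB::sysDescr.0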



            • colohost
              Junior Member
              • May 2018
              • 19

              #36
              Originally posted by kloczek
              Nothing in Zabbix forces you to use AES.
              If someone controls the physical layer of the communication between server/proxy and SNMP agents (separate physical paths, VPN with encryption, or VLAN separation), then using SNMPv3 is kind of over-architecting as well.
              Again, you are showing your lack of experience with network equipment. Not all network devices have dedicated management ports, or the ability to use VRF's to isolate management traffic to a management VLAN. If a device cannot use either of those methods to isolate management traffic, then you're forced to interact with the management plane through a normal data port and ACL it, but that wouldn't stop others on the same VLAN from seeing the traffic. If you're forced to do that, and you care at all about security, then you must use SNMPv3 with both an authentication protocol and a privacy (encryption) protocol.

              Let me guess, you suggest using telnet for device config too?

              Originally posted by kloczek

              If the device's embedded hardware running the SNMP agent provides more sophisticated encryption but is too weak to use that encryption smoothly, nothing can be changed on the SNMP client side.
              Umm, actually things can be changed on the SNMP client side, which was the whole point of my original post. The code necessary to solve this issue can be incorporated into Zabbix's SNMPv3 poller. It would let a Zabbix admin define how many SNMPv3 responses can be delivered by a given device per second. Then Zabbix would not query a given host for a number of OID's greater than (CONFIG_TIMEOUT * (responses per second)), since going above that number would fail. The only hard part is that Zabbix would need to track which items are polled for a host in specific groups, so they always get polled together, to ensure each group is polled at the required frequency.
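              As a back-of-the-envelope sketch of that math (all numbers made up for illustration):

              Code:
              TIMEOUT=30     # CONFIG_TIMEOUT / Timeout= in zabbix_server.conf
              RATE=45        # admin-measured OID responses per second over SNMPv3 AES
              ITEMS=2200     # items defined on the host
              MAX_PER_PASS=$(( TIMEOUT * RATE ))                        # 1350 OID's per polling pass
              BUCKETS=$(( (ITEMS + MAX_PER_PASS - 1) / MAX_PER_PASS ))  # 2 non-overlapping buckets
              echo "poll at most $MAX_PER_PASS OID's per pass; split $ITEMS items into $BUCKETS buckets"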


              • colohost
                Junior Member
                • May 2018
                • 19

                #37
                Originally posted by sfl
                Hi colohost and kloczek,

                I'm following this thread with great interest because I have several devices using AuthPriv SNMPv3 credentials with random unsupported items on the same host.
                I noticed that switching back from AES to DES or 3DES solves my issue with unsupported items.

                Meanwhile I agree with colohost's point of view: Zabbix should not depend on the SNMP cipher in order to take the agent's answer into account.

                So when our clients ask us what SNMP configuration they should apply, we tell them to use SNMPv2 despite the security concerns :-(

                Regards,
                Sfl
                Thanks for posting this. The fact that DES works contributes to my theory that the issue is devices that cannot send 100% of the OID's back before Zabbix's Timeout is reached. DES responses come far faster than AES responses due to the much lower computational requirements.

                I'm digging into the Zabbix poller code now to find what is forcing the artificial timeout on receiving the returning data.
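                A quick way to see the DES vs AES gap is to time the same walk twice (hypothetical host; this assumes the device has one user configured with DES priv and one with AES priv):

                Code:
                time snmpbulkwalk -v3 -l authPriv -u mon-des -a SHA -A authpass -x DES -X privpass 192.0.2.1 IF-MIB::ifInOctets >/dev/null
                time snmpbulkwalk -v3 -l authPriv -u mon-aes -a SHA -A authpass -x AES -X privpass 192.0.2.1 IF-MIB::ifInOctets >/dev/null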


                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #38
                  Originally posted by colohost
                  Yes, I have access to a linux shell on my Arista gear. In any case, the reason for my inquiry is because you do not seem to have any experience with the type of equipment being discussed, so I was just curious.
                  [..]
                  Real world experience. I set up the equipment in question, I poll it using SNMPv3 with any other tool at my disposal, I never lose data. I do the same exact thing with Zabbix, I lose data. One of the tools in question is snmpbulkget, which is part of net-snmp, and therefore the same code base Zabbix relies on. Other tools using net-snmp do not fail, Zabbix does.
                  And did you debug snmpd running on this device using gdb, or at least something like ltrace or strace/truss, to show that this snmpd process is not able to send more data when AES is used to encrypt the communication channel?

                  So one more time .. you wrote:
                  The cause of the failures is Zabbix forcing an artificial timeout onto the response from the device.
                  Please explain or provide a detailed description of what exactly you've done, so that other people can reproduce what you've observed (which I'm still not sure you did, because so far you are giving only a murky description, without any trace data, the commands used, or other details about the procedure).

                  Theoretically, AES (IIRC) requires only 3-4 times more computing power compared to DES. If I'm right, then even if you can observe CPU saturation due to AES, it should still be no issue given the 1:30 ratio (1s vs the 30s max timeout) between the time necessary to send the same data encrypted with DES and with AES. Ergo: it may still not be AES but something else, and it is only a coincidence that the issue can be observed when AES is used.

                  I'd recommend you read it and learn the difference between authentication and privacy. The auth protocol is MD5 or SHA, that is used for authentication. The data coming back, if encrypted, is the privacy protocol, which is either DES or AES. AES was added to SNMPv3 for data encryption 14 years ago: https://www.ietf.org/rfc/rfc3826.txt
                  Indeed, my mistake about the algorithm names used for auth and for communication encryption (on the page I pointed to, the phrase "AES" is not used).
                  Last edited by kloczek; 22-05-2018, 17:50.
                  http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                  https://kloczek.wordpress.com/
                  zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                  My zabbix templates https://github.com/kloczek/zabbix-templates


                  • colohost
                    Junior Member
                    • May 2018
                    • 19

                    #39
                    Originally posted by kloczek
                    Theoretically, AES (IIRC) requires only 3-4 times more computing power compared to DES. If I'm right, then even if you can observe CPU saturation due to AES, it should still be no issue given the 1:30 ratio (1s vs the 30s max timeout) between the time necessary to send the same data encrypted with DES and with AES. Ergo: it may still not be AES but something else, and it is only a coincidence that the issue can be observed when AES is used.
                    DES is not secure, so going down that path to avoid this issue doesn't gain one much, but I guess if you didn't care about the security of the data being returned, you could always do v3 with SHA auth and DES privacy so at least your credentials are protected, just not the response. I'd prefer to focus on solving the problem when AES is being used.

                    Digging through the source code and looking more at the data from the debugging output and tcpdump, I believe I have a better idea of why there is an issue, and my recommended solution (for Zabbix developers) remains the same because there won't be a solution users can implement.

                    In poller/checks_snmp.c, zbx_snmp_get_values() is called with subsets of OID's from a particular host's Items list. Unfortunately Zabbix has a feature dating to SNMPv1 and early SNMPv2 implementations that creates a problem for devices using AES SNMPv3. This is the section:

                    Code:
                            else if (1 < mapping_num &&
                                            ((STAT_SUCCESS == status && SNMP_ERR_TOOBIG == response->errstat) || STAT_TIMEOUT == status ||
                                            (STAT_ERROR == status && SNMPERR_TOO_LONG == ss->s_snmp_errno)))
                    and it has an explanation for why the code that follows is present:

                    Code:
                                    /* Since we are trying to obtain multiple values from the SNMP agent, the response that it has to  */
                                    /* generate might be too big. It seems to be required by the SNMP standard that in such cases the  */
                                    /* error status should be set to "tooBig(1)". However, some devices simply do not respond to such  */
                                    /* queries and we get a timeout. Moreover, some devices exhibit both behaviors - they either send  */
                                    /* "tooBig(1)" or do not respond at all. So what we do is halve the number of variables to query - */
                                    /* it should work in the vast majority of cases, because, since we are now querying "num" values,  */
                                    /* we know that querying "num/2" values succeeded previously. The case where it can still fail due */
                                    /* to exceeded maximum response size is if we are now querying values that are unusually large. So */
                                    /* if querying with half the number of the last values does not work either, we resort to querying */
                                    /* values one by one, and the next time configuration cache gives us items to query, it will give  */
                                    /* us less. */
                    
                                    /* The explanation above is for the first two conditions. The third condition comes from SNMPv3, */
                                    /* where the size of the request that we are trying to send exceeds device's "msgMaxSize" limit. */
                    halve:
                                    if (*min_fail > mapping_num)
                                            *min_fail = mapping_num;
                    In SNMPv1 and early SNMPv2 days, a query for too many OID's in one request could cause a device to simply not send a response because the response grew too large. In later years, devices would send an actual 'tooBig' response indicating that specific issue. I don't know of anything that doesn't send that at this point. However, you'll notice that a timeout is treated as a reason to execute this block of code. When this section executes, it takes the previously attempted number of OID's and queries for half of them instead of the whole batch, the idea being that the response is now smaller and has a higher likelihood of succeeding.

                    That one block of code itself is not the total issue, but it becomes an issue when combined with both SNMPv3 AES and how Zabbix spreads out polling across multiple poller processes.

                    I just tested a Cisco ASR9000 with 2200 Zabbix Items to monitor at 60 second intervals with SNMPv3 AES authPriv. I deleted the device, added it back in, and attached a discovery rule template to it which uses ifDescr to get the indexes; it then built the relevant items. Once the items were all created and monitoring began, Zabbix split the ~2200 OID's up into groups of 61 items each to distribute to pollers for get_values_snmp(). I cannot find in the code how it picked 61 items as the starting point. You can see the item count per poller with debug level 4 in your logs, where it will mention being in the get_values_snmp() function and give the number of items being polled as "num:#".
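                    If you want to pull those numbers out of your own logs, something along these lines works (it assumes DebugLevel=4 and a typical log path; adjust both to your install):

                    Code:
                    grep "In get_values_snmp()" /var/log/zabbix/zabbix_server.log | grep "addr:'192.0.2.1'"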

                    The problem is that since it parallelizes the work out to the pollers, those pollers send the bulk query for their 61 items each at nearly the same time. In my case they were logged as sending them all over the course of about 3/4 of a second. So instead of the target device getting hit with one bulk get for 2200 OID's, the device gets hit with 36 nearly parallel requests for 61 OID's each. In the case of this particular Cisco device, receiving back all 2000 OID's with AES priv takes about 45 seconds if I snmpbulkget them all in one shot; however, it should be noted that making such a request produces a data flow that begins almost immediately. If they're split up and sent as numerous parallel requests, the device appears to attempt to process them in parallel, so the responses do not always BEGIN before the Zabbix config Timeout value is reached, which Zabbix considers a non-response timeout. If Zabbix had sent them all as one huge bulk get, the response would have begun immediately, it just would have taken a while, and that may or may not be compatible with all devices.

                    Once that issue occurs and a given poller has its response missed, Zabbix triggers the code block that begins splitting the number of items into smaller chunks because it thinks it's dealing with a device which sent no response because the response was too big. Now that this has occurred, Zabbix starts chopping the OID groups into smaller chunks (31 in my case after some started to fail, and further down as more fail), which results in it sending even more parallel queries for what are now smaller numbers of OID's per poller.

                    Depending on your Zabbix config and the number of pollers you run, this may or may not make things appear to get worse or better. If you have a low number of pollers, you ultimately have enough polls missed that this particular network device ends up with a huge number of polls for small numbers of OID's, and between that and other things Zabbix is doing for other devices, pollers may be busy with other tasks, so your problem device gets some artificial help from the pollers being too busy to send all the OID's in parallel. Then you end up with more items being polled than not, but some are regularly missed anyway. If you have a lot of pollers, you may never get anything even close to reliable, because Zabbix keeps sending all 2200 OID's mostly in parallel while it splits them into ever smaller polling jobs, but those always hit the timeout because the device can't begin a response to nearly any of them, so you miss data, or your discovery rules fail, etc. This is what is happening to me because the Zabbix host watching it has a large number of pollers set up; most of my Items will go unsupported until the Item count gets down enough that they begin to pop back in, then the count grows until they fail again, and it just oscillates in a frustrating mess with no specific items remaining reliable.

                    So, my recommended solution remains what I proposed 12 days ago:

                    Long term, only thing I can think of as a way to get around this is for zabbix to analyze a given host's items, and if the quantity of snmpv3 items exceeds a certain user defined number (which you could tune to your hardware), break them up into sequentially polled batches of that defined quantity, and adjust when they poll to keep the polling interval the same for each batch.
                    Since that time we've learned this is specific to items using AES encryption, because only that setup incurs enough of a performance penalty in the sending of responses to trigger the issue. The only solutions I see as an end user are to switch to SNMPv2, or to SNMPv3 with DES, neither of which secures your data, but at least SNMPv3 can still safely use SHA auth to secure your logins.


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #40
                      You know .. again you've written nothing about what exactly you tested on the agent side that pushed you to these conclusions.
                      Really, it is pretty boring to read again and again what you believe. Engineering does not work that way.
                      You wrote that you've been digging around with tcpdump .. so just show what you've found. Interpretation of what you see or saw can only be done with that data, or with a description of what exactly you've done so far.
                      Don't get me wrong. I'm not trying to say that you are wrong. I only want to say that I don't see any raw evidence. Only this and nothing more ..

                      And yet another thing.
                      I can reproduce the SNMP timeout issue even on my home broadband router, which has only 16 entries in each vector of IF-MIB::interfaces. All this using not SNMPv3 items but SNMPv2.
                      The firmware on this router is using snmpd from net-snmp 5.3.
                      I can reproduce similar timeouts over SNMPv2 even on my laptop with snmpd from the latest net-snmp 5.7.3.
                      Based on this and all that you've already written, I don't see why AES needs to be blamed for those issues.

                      Specific MIB entries on Cisco and other vendors provide CPU utilization. Please just check those CPU stats when you are using SNMPv2 and v3.
                      The SNMPv2-MIB::snmp subtree provides snmpd's internal metrics, which can be used for diagnostics.
                      For example, SNMPv2-MIB::snmpInTooBigs.0 and SNMPv2-MIB::snmpOutTooBigs.0 provide counts of those "too big" queries.
                      In my case both counters are zero.
                      Code:
                      $ snmpbulkwalk -v2c -c public 192.168.1.1 SNMPv2-MIB::snmp
                      SNMPv2-MIB::snmpInPkts.0 = Counter32: 35732
                      SNMPv2-MIB::snmpOutPkts.0 = Counter32: 38397
                      SNMPv2-MIB::snmpInBadVersions.0 = Counter32: 0
                      SNMPv2-MIB::snmpInBadCommunityNames.0 = Counter32: 1342
                      SNMPv2-MIB::snmpInBadCommunityUses.0 = Counter32: 0
                      SNMPv2-MIB::snmpInASNParseErrs.0 = Counter32: 0
                      SNMPv2-MIB::snmpInTooBigs.0 = Counter32: 0
                      SNMPv2-MIB::snmpInNoSuchNames.0 = Counter32: 0
                      SNMPv2-MIB::snmpInBadValues.0 = Counter32: 0
                      SNMPv2-MIB::snmpInReadOnlys.0 = Counter32: 0
                      SNMPv2-MIB::snmpInGenErrs.0 = Counter32: 0
                      SNMPv2-MIB::snmpInTotalReqVars.0 = Counter32: 145128
                      SNMPv2-MIB::snmpInTotalSetVars.0 = Counter32: 0
                      SNMPv2-MIB::snmpInGetRequests.0 = Counter32: 7424
                      SNMPv2-MIB::snmpInGetNexts.0 = Counter32: 24477
                      SNMPv2-MIB::snmpInSetRequests.0 = Counter32: 0
                      SNMPv2-MIB::snmpInGetResponses.0 = Counter32: 0
                      SNMPv2-MIB::snmpInTraps.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutTooBigs.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutNoSuchNames.0 = Counter32: 1
                      SNMPv2-MIB::snmpOutBadValues.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutGenErrs.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutGetRequests.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutGetNexts.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutSetRequests.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutGetResponses.0 = Counter32: 34389
                      SNMPv2-MIB::snmpOutTraps.0 = Counter32: 4010
                      SNMPv2-MIB::snmpEnableAuthenTraps.0 = INTEGER: enabled(1)
                      SNMPv2-MIB::snmpSilentDrops.0 = Counter32: 0
                      SNMPv2-MIB::snmpProxyDrops.0 = Counter32: 0
                      
                      $ snmpbulkwalk -v2c -c public 192.168.1.1 SNMPv2-MIB::snmp | grep -i err
                      SNMPv2-MIB::snmpInASNParseErrs.0 = Counter32: 0
                      SNMPv2-MIB::snmpInGenErrs.0 = Counter32: 0
                      SNMPv2-MIB::snmpOutGenErrs.0 = Counter32: 0
                      All I have to do to block snmpd on my router is start querying from more than one snmpwalk/snmpbulkwalk at the same time.
                      As soon as I see one timeout, the other sessions almost instantly finish with a timeout as well.
                      Here it is with three such one-liners running in separate terminals:

                      Code:
                      $ for i in {1..1000}; do (snmpbulkwalk -v2c -c public 192.168.1.1 IF-MIB::ifInOctets) | grep -v IF-MIB; date; done
                      Tue 22 May 22:51:20 BST 2018
                      Timeout: No Response from 192.168.1.1
                      Tue 22 May 22:51:26 BST 2018
                      Timeout: No Response from 192.168.1.1
                      Tue 22 May 22:51:32 BST 2018
                      Timeout: No Response from 192.168.1.1
                      Tue 22 May 22:51:38 BST 2018
                      Timeout: No Response from 192.168.1.1
                      Tue 22 May 22:51:44 BST 2018
                      ^C
                      And the same with a 30s timeout:
                      Code:
                      $ for i in {1..1000}; do (snmpbulkwalk -v2c -c public -t30 192.168.1.1 IF-MIB::ifInOctets) | grep -v IF-MIB; date; done
                      Tue 22 May 22:53:58 BST 2018
                      Tue 22 May 22:54:09 BST 2018
                      Tue 22 May 22:54:20 BST 2018
                      Tue 22 May 22:54:31 BST 2018
                      Tue 22 May 22:54:43 BST 2018
                      Tue 22 May 22:54:54 BST 2018
                      Tue 22 May 22:55:05 BST 2018
                      And the same one-liner, but with only one instance running:
                      Code:
                      $ for i in {1..1000}; do (snmpbulkwalk -v2c -c public -t30 192.168.1.1 IF-MIB::ifInOctets) | grep -v IF-MIB; date; done
                      Tue 22 May 22:58:10 BST 2018
                      Tue 22 May 22:58:13 BST 2018
                      Tue 22 May 22:58:17 BST 2018
                      Tue 22 May 22:58:21 BST 2018
                      Tue 22 May 22:58:24 BST 2018
                      Tue 22 May 22:58:28 BST 2018
                      Tue 22 May 22:58:31 BST 2018
                      Tue 22 May 22:58:35 BST 2018
                      Tue 22 May 22:58:40 BST 2018
                      ^C
                      In all those cases snmpd should be reading only one file, /proc/net/dev (if snmpd is running on Linux), which can be opened multiple times and read by as many parallel processes as you can execute.
                      I just did a small test on this router after logging in over ssh:
                      Code:
                      # date; for i in {1..1000}; do cat /proc/net/dev >/dev/null; done; date; /userfs/bin/snmpd -v; uname -a
                      Tue May 22 22:22:45 UTC 2018
                      Tue May 22 22:22:45 UTC 2018
                      
                      NET-SNMP version:  5.3.1
                      Web:               http://www.net-snmp.org/
                      Email:             [email protected]
                      
                      Linux tc 2.6.36 #15 SMP Thu Dec 1 11:20:04 CST 2016 mips unknown
                      As you can see, in a shell loop I'm able to read the /proc/net/dev content 1000 times in under 1s!!!!

                      Another interesting thing is that when querying IF-MIB::ifInOctets on my laptop, I don't see /proc/net/dev being read on each query. No matter what kind of OID tree I'm asking for over SNMP, I see only something like the below:

                      Code:
                      # strace -tf -e trace=read,openat,close -p $(pidof snmpd)
                      strace: Process 22812 attached
                      23:20:17 openat(AT_FDCWD, "/proc/net/dev", O_RDONLY) = 9
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/net/if_inet6", O_RDONLY) = 11
                      23:20:17 read(11, "fe800000000000001110fda18d0c88c5"..., 1024) = 108
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 close(12)                      = 0
                      23:20:17 read(11, "", 1024)             = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 read(9, "Inter-|   Receive               "..., 1024) = 580
                      23:20:17 close(11)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv4/neigh/lo/retrans_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "1000\n", 1024)       = 5
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/lo/retrans_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "1000\n", 1024)       = 5
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/conf/lo/forwarding", O_RDONLY) = 11
                      23:20:17 read(11, "0\n", 1024)          = 2
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "30000\n", 1024)      = 6
                      23:20:17 close(11)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv4/neigh/wlp2s0/retrans_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "1000\n", 1024)       = 5
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/wlp2s0/retrans_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "1000\n", 1024)       = 5
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/conf/wlp2s0/forwarding", O_RDONLY) = 11
                      23:20:17 read(11, "0\n", 1024)          = 2
                      23:20:17 close(11)                      = 0
                      23:20:17 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/wlp2s0/base_reachable_time_ms", O_RDONLY) = 11
                      23:20:17 read(11, "30000\n", 1024)      = 6
                      23:20:17 close(11)                      = 0
                      23:20:17 read(9, "", 1024)              = 0
                      23:20:17 close(9)                       = 0
                      23:20:17 close(10)                      = 0
                      23:20:19 openat(AT_FDCWD, "/proc/diskstats", O_RDONLY) = 9
                      23:20:19 read(9, "   8       0 sda 408669 1850 265"..., 1024) = 720
                      23:20:19 read(9, "", 1024)              = 0
                      23:20:19 close(9)                       = 0
                      23:20:19 openat(AT_FDCWD, "/proc/stat", O_RDONLY) = 9
                      23:20:19 read(9, "cpu  6144846 12024 5359796 34418"..., 4095) = 1507
                      23:20:19 close(9)                       = 0
                      23:20:19 openat(AT_FDCWD, "/proc/vmstat", O_RDONLY) = 9
                      23:20:19 read(9, "nr_free_pages 637920\nnr_zone_ina"..., 4095) = 2878
                      23:20:19 close(9)                       = 0
                      23:20:20 openat(AT_FDCWD, "/proc/net/dev", O_RDONLY) = 9
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/net/if_inet6", O_RDONLY) = 11
                      23:20:20 read(11, "fe800000000000001110fda18d0c88c5"..., 1024) = 108
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 close(12)                      = 0
                      23:20:20 read(11, "", 1024)             = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 read(9, "Inter-|   Receive               "..., 1024) = 580
                      23:20:20 close(11)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv4/neigh/lo/retrans_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "1000\n", 1024)       = 5
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/lo/retrans_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "1000\n", 1024)       = 5
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/conf/lo/forwarding", O_RDONLY) = 11
                      23:20:20 read(11, "0\n", 1024)          = 2
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "30000\n", 1024)      = 6
                      23:20:20 close(11)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv4/neigh/wlp2s0/retrans_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "1000\n", 1024)       = 5
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/wlp2s0/retrans_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "1000\n", 1024)       = 5
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/conf/wlp2s0/forwarding", O_RDONLY) = 11
                      23:20:20 read(11, "0\n", 1024)          = 2
                      23:20:20 close(11)                      = 0
                      23:20:20 openat(AT_FDCWD, "/proc/sys/net/ipv6/neigh/wlp2s0/base_reachable_time_ms", O_RDONLY) = 11
                      23:20:20 read(11, "30000\n", 1024)      = 6
                      23:20:20 close(11)                      = 0
                      23:20:20 read(9, "", 1024)              = 0
                      23:20:20 close(9)                       = 0
                      23:20:20 close(10)                      = 0
                      23:20:23 openat(AT_FDCWD, "/proc/net/dev", O_RDONLY) = 9
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 close(11)                      = 0
                      23:20:23 openat(AT_FDCWD, "/proc/net/if_inet6", O_RDONLY) = 11
                      23:20:23 read(11, "fe800000000000001110fda18d0c88c5"..., 1024) = 108
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 close(12)                      = 0
                      23:20:23 read(11, "", 1024)             = 0
                      23:20:23 close(11)                      = 0
                      23:20:23 read(9, "Inter-|   Receive               "..., 1024) = 580
                      23:20:23 close(11)                      = 0
                      23:20:23 close(11)                      = 0
                      23:20:23 close(11)                      = 0
                      23:20:23 close(11)                      = 0
                      ^C
                      It looks like snmpd is reading a few files in a fixed loop, and whatever it serves is served not from freshly read data but from what snmpd has cached. It also looks like something is seriously screwed up in the snmpd code, because serving that already-cached data is a few orders of magnitude slower than just reading the procfs files directly.
                      I've not looked at the net-snmp code yet, but from the point of view of how it behaves, IMO it really looks very odd ..
                      I would not be surprised if the main part of the time snmpd spends on serving any set or subset of OIDs is spent in that same (seemingly) screwed-up part of the code, which does not depend on freshly read data from HW counters but on already-cached data.
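                      One way to put rough numbers on that comparison, run on the box where snmpd lives (community string and address are placeholders):

                      Code:
                      time snmpbulkwalk -v2c -c public 127.0.0.1 IF-MIB::ifInOctets >/dev/null
                      time cat /proc/net/dev >/dev/null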

                      If you can trace anything on the SNMP agent side, just try the above ..
                      Last edited by kloczek; 23-05-2018, 01:00.
                      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                      https://kloczek.wordpress.com/
                      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                      My zabbix templates https://github.com/kloczek/zabbix-templates


                      • colohost
                        Junior Member
                        • May 2018
                        • 19

                        #41
                        My use of tcpdump was simply to confirm that what Zabbix was logging to its log file in debug mode was accurate; i.e. to confirm that it really was starting out by sending multiple batches of no more than 61 OID's per request, across multiple pollers, separated by a few hundredths of a second to a tenth of a second at most. There is not much point in posting packet captures if Zabbix's own log is accurate about it. The log itself is also accurate about how many OID's per get_values_snmp call it is sending, so again, not much is gained by posting thousands of lines of log showing the same exact thing over and over. My 2200-OID device starts out with groupings of 61 OID's per request, and continuously halves them from there as it finds it can't get responses, causing the number of calls to get_values_snmp() to increase while the num:# value decreases.
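                        For anyone who wants to repeat that check, a capture along these lines is enough to see the near-simultaneous requests (interface name and address are placeholders):

                        Code:
                        tcpdump -ni eth0 -ttt 'udp port 161 and host 192.0.2.1'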

                        Here's what the first second of the first polling for the new items looked like in the logs:

                        1710:20180522:162044.000 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1709:20180522:162044.010 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1708:20180522:162044.011 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1712:20180522:162044.085 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1711:20180522:162044.086 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1723:20180522:162044.130 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1719:20180522:162044.214 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1718:20180522:162044.220 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1722:20180522:162044.234 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1721:20180522:162044.298 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1720:20180522:162044.318 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1724:20180522:162044.388 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1728:20180522:162044.389 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1729:20180522:162044.397 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1702:20180522:162044.440 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61
                        1701:20180522:162044.470 In get_values_snmp() host:'bar1...' addr:'192.0.2.1' num:61

                        It continued to ramp up to the whole 2200 over the following second or two, and then in subsequent minutes those numbers started to drop from 61 to 31 as the polling concurrency increased, and so on.


                        • sfl
                          Junior Member
                          • Jun 2016
                          • 26

                          #42
                          Hi colohost and kloczek,

                          I just want to add my own experience and comments regarding this issue.
                          SNMP polling is done by a Zabbix proxy, now on version 3.4.10, assuming the snmpEngineIDs are all different.

                          with SNMPv3 AES/SHA => I have unsupported items in zabbix_server.log and therefore gaps in graphs, due to the 10-minute recheck of unsupported items.
                          with SNMPv3 DES/SHA => no more unsupported items in zabbix_server.log, but many SNMP timeouts in zabbix_proxy.log

                          Code:
                           14964:20180611:110821.696 resuming SNMP agent checks on host "c29-sw-004": connection restored
                           14867:20180611:110912.297 SNMP agent item "ifInErrors[Te1/0/1]" on host "c29-sw-004" failed: first network error, wait for 5 seconds
                           14931:20180611:110917.190 resuming SNMP agent checks on host "c29-sw-004": connection restored
                           14908:20180611:110918.111 SNMP agent item "ifOutOctets[Te1/0/1]" on host "c29-sw-004" failed: first network error, wait for 5 seconds
                           14993:20180611:110923.212 SNMP agent item "ifOutOctets[Te1/0/2]" on host "c29-sw-004" failed: another network error, wait for 5 seconds
                           14958:20180611:110928.256 SNMP agent item "ifOutOctets[Te1/0/2]" on host "c29-sw-004" failed: another network error, wait for 5 seconds
                           14960:20180611:110933.256 SNMP agent item "ifOutOctets[Te1/0/2]" on host "c29-sw-004" failed: another network error, wait for 5 seconds
                           14931:20180611:110938.300 resuming SNMP agent checks on host "c29-sw-004": connection restored
                           14948:20180611:110953.374 SNMP agent item "ifOutOctets[Te1/0/1]" on host "c29-sw-004" failed: first network error, wait for 5 seconds
                           14967:20180611:110958.540 SNMP agent item "ifOutOctets[Te1/0/1]" on host "c29-sw-004" failed: another network error, wait for 5 seconds
                           14930:20180611:111003.162 SNMP agent item "ifOutOctets[Te1/0/1]" on host "c29-sw-004" failed: another network error, wait for 5 seconds
                           14957:20180611:111108.128 temporarily disabling SNMP agent checks on host "c29-sw-004": host unavailable
                          So with DES/SHA it's better, but it is not solved in my case.

                          Regards,
                          Sfl


                          • kloczek
                            Senior Member
                            • Jun 2006
                            • 1771

                            #43
                            Originally posted by steveroebuck
                            We are experiencing exactly the same issues with SNMPv3 using authPriv; some interfaces will come back fine, others will have massive gaps in the time-series data for switch throughput.
                            You can reproduce this issue using snmp{,bulk}walk commands.
                            Open an HPE case about the issue.
                            The Zabbix developers have done really a lot to mitigate all those issues. The bug sits on the SNMP agent side (not on the SNMP client side, which is where Zabbix sits).
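                            To reproduce it with plain snmp{,bulk}walk, running a few walks in parallel against the same agent (placeholder host and community string) is usually enough to start seeing "Timeout: No Response":

                            Code:
                            for i in 1 2 3; do
                                snmpbulkwalk -v2c -c public -t 30 192.168.1.1 IF-MIB::ifInOctets >/dev/null &
                            done
                            wait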
                            http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                            https://kloczek.wordpress.com/
                            zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                            My zabbix templates https://github.com/kloczek/zabbix-templates


                            • steveroebuck
                              Junior Member
                              • Jan 2018
                              • 19

                              #44
                              Originally posted by kloczek
                              You can reproduce this issue using snmp{,bulk}walk commands.
                              Open an HPE case about the issue.
                              The Zabbix developers have done really a lot to mitigate all those issues. The bug sits on the SNMP agent side (not on the SNMP client side, which is where Zabbix sits).
                              Why would the issue be fixed by HPE? We have evidence of the same behavior (and gappy graphs) from SNMPv3 polls on F5 BIG-IP devices, Trend TippingPoint devices, and Check Point firewall appliances, all when using SNMPv3 with authPriv AES/SHA.

                              We have previously monitored all of these devices fully, using the same SNMPv3 security levels, with SolarWinds and LibreNMS... this to me points to an issue with Zabbix, one that they seemingly refuse to acknowledge.


                              • kloczek
                                Senior Member
                                • Jun 2006
                                • 1771

                                #45
                                Originally posted by steveroebuck

                                Why would the issue be fixed by HPE? We have evidence of the same behavior (and gappy graphs) from SNMPv3 polls on F5 BIG-IP devices, Trend TippingPoint devices, and Check Point firewall appliances, all when using SNMPv3 with authPriv AES/SHA.
                                If you look closer, you will find that those devices are using the net-snmp SNMP agent code.
                                We have previously monitored all of these devices fully, using the same SNMPv3 security levels, with SolarWinds and LibreNMS... this to me points to an issue with Zabbix, one that they seemingly refuse to acknowledge.
                                Are you 100% sure that with both monitoring tools you've been sending SNMP queries for exactly the same OIDs and at the same rate?

                                And again .. you can reproduce the timeout effect using snmp{,bulk}walk commands.
                                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                                https://kloczek.wordpress.com/
                                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                                My zabbix templates https://github.com/kloczek/zabbix-templates
