**kloczek** · 17-05-2018, 09:03

So what is the avg rate of SNMP queries to those devices?
Do you have bulk queries enabled?

**colohost** · 17-05-2018, 10:09

Again, the "average rate" has *absolutely nothing* to do with the issue. I could have 2000 items which get queried once per day and the problem will occur once per day. I could have them queried once per minute and the issue will occur once per minute. The issue is simply the fact that when using SNMPv3, with authPriv in my case, there is a finite number of OID's which can be queried for, whether using bulk or not (and I'm using bulk) before the response takes longer than 30 seconds, which is the highest amount of time you can configure Zabbix's Timeout value to be. Any response data received after the Timeout mark is ignored and missed because Zabbix is no longer listening for the remaining data. This number is based entirely on the device being queried since some hardware returns data faster than others. I can guarantee you the threshold for failure is a minimum of an order of magnitude lower than when using SNMPv2. Zabbix is not designed to handle this scenario; i.e. a high number of SNMPv3 authPriv items per device. I have Cisco 2960XR devices where failures will begin to occur with just a few hundred OID's per polling cycle. I have Arista 7280 devices where the threshold where issues occur is perhaps 1000 items. In all cases, switching to SNMPv2 eliminates the issue because those responses are delivered much quicker.

**kloczek** · 18-05-2018, 00:41

I wasn't asking about how many Zabbix items you are querying, but about the average rate of SNMP packets as measured by the SNMPv2-MIB metrics.
Did you try using the SNMPv2-MIB template?

**colohost** · 19-05-2018, 00:26

I checked one of my devices which has about 667 OID's queried on a five minute basis. It does not produce a significant number of packets given the amount of data is only about 55kb when sent in the clear with SNMPv2; it was <50 packets. Zabbix seemed to split the queries into bulk requests of ~60 OID's per.

**kloczek** · 19-05-2018, 08:25

Originally posted by steveroebuck

We are experiencing exactly the same issues with SNMPv3 using authpriv, some interfaces will come back fine others will have massive gaps in time series data for switch throughput.

If you switch (temporarily) to SNMP v2 items, you will find that the issue is still present.
Ergo, this issue has nothing to do with the SNMP protocol version in use.

**colohost** · 19-05-2018, 15:04

What are you talking about? I just stated in this thread, as have several others, that when experiencing the exact same problem, going to SNMPv2 resolved it. Are you suggesting I'm imagining this behavior? Any network device with a few hundred OID's being monitored simultaneously, interval doesn't matter, should be all you need to reproduce this.

**kloczek** · 19-05-2018, 15:49

What I wrote is not a suggestion.
I say (because I've tested this several times on many different devices) that this issue has NOTHING to do with the SNMP protocol version used.
It doesn't matter whether you use SNMP v2 or v3.
At the protocol layer, the only difference between v2 and v3 is authentication and encryption of the communication channel. What happens in the SNMP session after connecting to the agent to query OIDs is exactly the same.

The snmpd used in all network device firmware is built from the net-snmp code.
net-snmp has well-known issues: the code is not reentrant.
To reproduce the issue, you can run something like the following in more than one terminal session:

Code:

for i in {1..1000}; do (snmpbulkwalk -v2c -c public 192.168.1.1 IF-MIB::ifInOctets) | grep -v IF-MIB; echo done; done

As long as you have only one such loop, everything will usually be OK.
Add more parallel loops and at some point you will see only timeouts.
As soon as at least one of those parallel loops starts reporting "Timeout: No Response from 192.168.1.1", all the other loops will start failing as well.
Kill all the loops, wait 10-15s, start a single loop, and you will see SNMP query results again.
A workaround of sorts for the above test is to use snmpwalk instead of snmpbulkwalk. Usually, even with 2-3 parallel loops, you will not see timeouts, so the equivalent in Zabbix is to change the host interface settings to disable bulk queries (which are enabled by default when creating a new host with SNMP items).
However, this is nothing more than a workaround, because as the number of OIDs that need to be queried grows, at some point non-bulk queries will not be fast enough to query all the necessary OIDs.
Anything beyond that can only be changed on the SNMP agent side.
Some of those non-reentrancy issues can be solved by building net-snmp with --enable-reentrant at configure time, and only the hardware vendor can do that, by delivering fixed device firmware.
In other words: all SNMP timeout issues need to be reported to the hardware vendors.

**colohost** · 19-05-2018, 16:05

Do the devices you've tested this on include actual network equipment, where there was both a high OID count combined with low cpu power? You seem to be completely ignoring the fact that most network devices return SNMPv3 data at a dramatically slower rate than SNMPv2, and Zabbix cannot have a timeout of longer than 30 seconds. I've reproduced this problem countless times with Cisco 2960XR, ASR9000, ASR1000 and NCS5500 platforms. I've reproduced this with Arista 7010 and 7280 platforms, but the OID count needs to be much higher than the Cisco devices because the Arista devices have much faster CPU's and return the data faster, but there is still a threshold where it cannot output the data in 30 seconds and Zabbix starts dropping data at that point.

Please explain how you can successfully, reliably, query a switch for more OID's than what the switch is capable of returning in 30 seconds. We're talking a very simple math equation: if device X is incapable of returning 300 OID's in 30 seconds when authPriv SNMPv3 is used, then it will not be possible for Zabbix to monitor those 300 OID's reliably.
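
The arithmetic behind that limit can be sketched in a few lines of shell. The response rates used below are made-up figures for illustration only, not measurements from any of the devices mentioned, and `seconds_needed` and `fits_in_timeout` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: the OIDs-per-second rates passed in below are assumed values,
# not measurements from any real device.

# Whole seconds needed to return OID_COUNT values at OIDS_PER_SECOND
# (integer ceiling division).
seconds_needed() {  # usage: seconds_needed OID_COUNT OIDS_PER_SECOND
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Succeeds only if the full response arrives within TIMEOUT_SECONDS.
fits_in_timeout() {  # usage: fits_in_timeout OID_COUNT OIDS_PER_SECOND TIMEOUT_SECONDS
    [ "$(seconds_needed "$1" "$2")" -le "$3" ]
}

# 300 OIDs at an assumed 8 OIDs/s need 38 s, which exceeds a 30 s timeout.
if fits_in_timeout 300 8 30; then echo "fits"; else echo "too slow"; fi
```

With a faster assumed rate, say 100 OIDs/s, the same 300 OIDs need about 3 seconds and fit comfortably, which matches the v2 versus v3 contrast described in this thread.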

**kloczek** · 19-05-2018, 16:30

Originally posted by colohost

Do the devices you've tested this on include actual network equipment, where there was both a high OID count combined with low cpu power? You seem to be completely ignoring the fact that most network devices return SNMPv3 data at a dramatically slower rate than SNMPv2[..],

.. and because of this, snmpd running on the monitored device hangs longer while querying some data -> the source of that information on the device is locked longer, which increases the probability of locking up snmpd on the monitored device.
Again: generally this issue has nothing to do with SNMP protocol version or zabbix because it is possible to reproduce it without zabbix and without communicating over SNMP v3.
This issue is ONLY related to net-snmp code which is used to build all firmware SNMP agents.

Please explain how you can successfully, reliably, query a switch for more OID's than what the switch is capable of returning in 30 seconds. We're talking a very simple math equation: if device X is incapable of returning 300 OID's in 30 seconds when authPriv SNMPv3 is used, then it will not be possible for Zabbix to monitor those 300 OID's reliably.

Please ask your network device vendor about this. I have no control over what is embedded in those devices' firmware, so you are asking the wrong person.
Again: the issue can be reproduced without Zabbix, using just the snmpwalk/snmpbulkwalk commands.
All you need to do is open a ticket for the issue, if you have support, with a full description of how to reproduce it using only the snmpwalk/snmpbulkwalk commands.
Monitoring is a fundamental piece of functionality, so raising such a ticket as a critical issue would be well justified.

**colohost** · 19-05-2018, 17:59

Originally posted by kloczek

.. and because of this, snmpd running on the monitored device hangs longer while querying some data -> the source of that information on the device is locked longer, which increases the probability of locking up snmpd on the monitored device.

Incorrect, and not representative of what anyone has reported previously in this thread. There's never been any suggestion that there is a delay in responses arriving from the queried device; the issue is when there is a high quantity of responses, and you're using SNMPv3, with what is likely AES privacy, it can take longer than Zabbix's maximum allowable "Timeout" value to receive them all.

Whatever software or daemon exists inside the network device, whether that be snmpd or similar, does not "hang" at all. If testing with snmpwalk/snmpbulkwalk, data begins flowing back from the device immediately, and continues flowing steadily until all the OID's have been returned. There is no pause, no timeout, no ramp up time before data begins being returned, and it will complete successfully with ALL values returned 100% of the time.

Originally posted by kloczek

Again: generally this issue has nothing to do with SNMP protocol version or zabbix because it is possible to reproduce it without zabbix and without communicating over SNMP v3.
This issue is ONLY related to net-snmp code which is used to build all firmware SNMP agents.

Inaccurate and has nothing to do with net-snmp code. As pointed out above, there is no hanging issue. If you take snmpbulkwalk and request all the same OID's Zabbix is trying to request in one run, it will work, you will get all the data back whether using SNMPv2 or SNMPv3. It will work 100% of the time. I pull most of the same data using mrtg and never encounter this issue.

Originally posted by kloczek

Please ask your network device vendor about this. I have no control over what is embedded in those devices' firmware, so you are asking the wrong person.
Again: the issue can be reproduced without Zabbix, using just the snmpwalk/snmpbulkwalk commands.
All you need to do is open a ticket for the issue, if you have support, with a full description of how to reproduce it using only the snmpwalk/snmpbulkwalk commands.
Monitoring is a fundamental piece of functionality, so raising such a ticket as a critical issue would be well justified.

What issue is it that you think can and should be reproduced for the network vendors? Their devices are working fine and returning the OID's asked for, 100% of the time. From their perspective, the devices are working exactly as intended; you query them, they give you back the data, 100% of the time. I doubt Cisco is going to consider it a bug that one specific network monitoring platform isn't capable of receiving data for longer than 30 seconds, or splitting the queries up to accommodate its own limitations.

Based on all of the above, I have to assume you are still not understanding the issue. I'll go through it again and add additional information on why none of the above is relevant to the problem:

1) Most network devices rely on low power CPU's because packet forwarding is done in ASIC and the CPU is just for management tasks. Going back to the example that has already been provided; a Cisco 2960XR current generation switch uses a 600MHz APM86392 processor. Obviously that is not a particularly fast device, and likely doesn't have AES offloading. It should not be too great a leap of faith to theorize that this device will most likely be slower at sending SNMPv3 authPriv (SHA/AES) data back compared to plain text SNMPv2 data. In fact, that is exactly what occurs; if I query a stack of 2960's where there are hundreds of ports to collect data from, it is delivered at a significantly slower rate with SNMPv3 AES-priv than with SNMPv2, but the responses are delivered successfully, 100% of the time.
2) If you're doing a bulk request for hundreds or thousands of OID's, using authPriv (SHA/AES) SNMPv3, against a typical network device with a lower speed CPU and/or no AES offload as described above, it is highly likely that the device in question will take much longer to deliver 100% of the responses than it would if you were using SNMPv2.
3) If you're using snmpbulkwalk to make an SNMPv3 AES-priv query against a device as described above, it will pull the data successfully 100% of the time.
4) If you're using snmpbulkwalk to make an SNMPv3 AES-priv query against a device as described above, and you are pulling a large number of OID's, such as hundreds or more, it may take longer than 30 seconds for all the data to be returned.
5) If you're using snmpbulkwalk to make an SNMPv2 query against the same device as described above, and you are pulling a large number of OID's, such as hundreds or more, it will most likely not take longer than 30 seconds for all the data to be returned.
6) Zabbix ties maximum SNMP response time to its "Timeout" configuration value. Data still being returned after Timeout has been reached will not be seen because Zabbix has stopped listening for it.
7) Zabbix does not permit a Timeout configuration value of greater than 30 seconds.
8) Items in Zabbix tied to OID's whose responses came after Zabbix reached the Timeout configuration value will show as having gaps in the data.
9) Items in Zabbix tied to discovery rules whose responses came after Zabbix reached the Timeout configuration value will flip flop between discovered and not discovered in the server log.

Based on all of the above facts, it is not possible for Zabbix to reliably monitor a network device if both of the following conditions are met:

1) Monitoring a device using SNMPv3 with AES privacy (have not tested DES or clear text authNoPriv)
2) Monitoring for a quantity of items where testing with snmpbulkwalk demonstrates the TOTAL response takes longer than 30 seconds to be received, because Zabbix does not support a timeout value greater than that

Would it be nice if Cisco would stick a chip in their switches that can do AES in hardware so SNMPv3 AES-priv responses are delivered faster? Sure. Is that Cisco's problem? No. I have the exact same problem with Arista gear, but it requires a much greater number of OID's because the Arista devices I use are based on AMD GX-424CC chips which are 2400MHz and have AES offload. I can get hundreds more OID's out in 30 seconds than Cisco switches, but there's still a threshold where it will take more than 30 seconds and then Zabbix begins failing again. The same devices don't fail monitoring with other systems, including snmpbulkwalk, because those systems do not impose a maximum response time.

There are three possible solutions to this issue:

1) Switch to SNMPv2. I have yet to encounter any network device that cannot return thousands of OID's in a few seconds when using SNMPv2, let alone 30 seconds. You could probably query for tens of thousands of OID's in 30 seconds using SNMPv2.
2) Get every network vendor you use to alter their hardware so they can return AES-encrypted SNMPv3 responses in less than 30 seconds, regardless of how many OID's are involved, just to keep Zabbix happy.
3) Have Zabbix write code so that, when SNMPv3 AES-priv is being used, it groups items in a manner that keeps it from querying for more OID's than the given device can successfully return within whatever its Timeout value is set to (at most 30 seconds).
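
As a rough illustration of the third option, done with the net-snmp command-line tools rather than inside Zabbix: split the OID list into batches small enough that each batch returns within the timeout, then issue one request per batch. This is only a sketch and not how Zabbix behaves; the batch size, the helper names `batch_oids` and `poll_in_batches`, the credentials and the address are all placeholder assumptions, and `oidlist.txt` stands for a whitespace-separated list of OID's.

```shell
#!/bin/sh
# Sketch only: batch size, credentials and address are placeholder assumptions.

# Read whitespace-separated OIDs on stdin; print one batch of at most
# BATCH_SIZE OIDs per output line (xargs runs echo by default).
batch_oids() {  # usage: batch_oids BATCH_SIZE < oidlist.txt
    xargs -n "$1"
}

# Issue one SNMPv3 authPriv GET per batch, so that no single request has to
# return more values than the device can deliver before the timeout.
poll_in_batches() {  # usage: poll_in_batches BATCH_SIZE HOST < oidlist.txt
    batch_oids "$1" | while read -r batch; do
        # $batch is deliberately unquoted so each OID becomes its own argument.
        snmpget -v 3 -l authPriv -u v3user -a SHA -A authPass \
                -x AES -X privPass "$2" $batch </dev/null
    done
}

# Example (not run here): poll_in_batches 50 192.0.2.1 < oidlist.txt
```

A workable batch size would have to be found per device, for example by timing progressively larger batches, since the thread's own numbers suggest the threshold differs a lot between platforms.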

**kloczek** · 19-05-2018, 21:38

Do you know that Zabbix does not use its own SNMP protocol implementation, and that all SNMP queries go through the net-snmp client API/ABI provided by the libnetsnmp library?

Code:

$ objdump -x /usr/sbin/zabbix_server | grep NEEDED
  NEEDED               libmysqlclient.so.20
  NEEDED               libpthread.so.0
  NEEDED               librt.so.1
  NEEDED               libxml2.so.2
  NEEDED               libodbc.so.2
  NEEDED               libnetsnmp.so.30   <<==  here
  NEEDED               libOpenIPMI.so.0
  NEEDED               libOpenIPMIposix.so.0
  NEEDED               libevent-2.1.so.6
  NEEDED               libssl.so.1.1
  NEEDED               libcrypto.so.1.1
  NEEDED               libldap-2.4.so.2
  NEEDED               liblber-2.4.so.2
  NEEDED               libcurl.so.4
  NEEDED               libm.so.6
  NEEDED               libdl.so.2
  NEEDED               libresolv.so.2
  NEEDED               libpcreposix.so.0
  NEEDED               libc.so.6

Those timeouts reported by Zabbix are reported by libnetsnmp routines.
Do you know that both the libnetsnmp client library and snmpd on the agent side use OpenSSL for AES?
If you are lucky, that OpenSSL is compiled with the HW acceleration that most of the ARM/MIPS/PPC SoCs used in network device BMCs now provide, in which case the difference between v2c and v3 query times will in most cases be hard to measure.

Just try to execute something like:

Code:

$ (echo "scale=3; ("; (for i in {1..5}; do /usr/bin/time -f "%e+" snmpwalk -v2c -c public 192.168.1.1 >/dev/null; done) 2>&1;  echo "0)/5") |xargs|bc

Then repeat the same one-liner with a -v3 query.
Please share results of those tests.

**colohost** · 19-05-2018, 22:10

What timeout is it that you think is being reported? There are no timeouts being reported by zabbix, it just shows items going from monitored to unmonitored because it incorrectly assumes the matching OID was not returned when in reality the response came back after Zabbix gave up listening.

There is NO timeout occurring on the net-snmp side; the data simply comes back from the device past the 30 second mark, but Zabbix is no longer receiving it because it hit its own internal configured Timeout value and the poller handling it stopped receiving. Zabbix cannot be set to have a Timeout value larger than 30 seconds, so that's as high as I can set it. If Zabbix did permit a longer timeout, I could set it up and this problem would go away, but that would not be a very elegant solution.

In any case, I wrote specifically that the devices in question, measuring all the OID's in question, take a few seconds via SNMPv2 and take over 30 seconds for the same OID's queried via SNMPv3 AES. Why do you keep posting about shell scripts when none of that is needed? Here's how you do it:

1) Build up my list of 600+ OID's. For convenience, I wrote them to a space separated list in oidlist.txt
2) time snmpbulkget -v 2c -c public 192.0.2.1 `cat oidlist.txt`
3) time snmpbulkget -v 3 -l authPriv -u v3user -a SHA -A authPass -x AES -X privPass 192.0.2.1 `cat oidlist.txt`

#2 takes a few seconds, #3 takes over 30 seconds. Zabbix constantly misses data from these same items if SNMPv3 is used. If I alter my OID list to a quantity that 'time snmpbulkget -v 3....' can return in less than 30 seconds, and then trim Zabbix's item count for the given device down to only the same list, no data is missed.

This is not a complicated matter; snmpbulkget never fails regardless of how big the OID list is, it just might take longer than 30 seconds for all the data to come back if the sending device is not very powerful. Zabbix with a big SNMPv3 item list fails if the same list takes snmpbulkget longer than 30 seconds to finish returning. Zabbix with an SNMPv3 item list that snmpbulkget returns in less than 30 seconds NEVER fails.
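
That comparison can be automated with a small wrapper that times any command and reports whether it ran past a given limit. It is a sketch; `elapsed_seconds` and `exceeds_timeout` are hypothetical helper names, timing has one-second resolution, and the 30-second limit mirrors the maximum Zabbix Timeout value discussed above.

```shell
#!/bin/sh
# Sketch only: helper names are made up for this example.

# Run a command, discard its output, and print the wall-clock seconds it took
# (one-second resolution, via date +%s).
elapsed_seconds() {  # usage: elapsed_seconds COMMAND [ARGS...]
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Succeeds if the command took longer than LIMIT seconds.
exceeds_timeout() {  # usage: exceeds_timeout LIMIT COMMAND [ARGS...]
    limit=$1
    shift
    [ "$(elapsed_seconds "$@")" -gt "$limit" ]
}

# Example (not run here), using the v2c command from step 2:
#   exceeds_timeout 30 snmpbulkget -v 2c -c public 192.0.2.1 $(cat oidlist.txt) \
#       && echo "response ran past 30 seconds"
```

Running the same check against the SNMPv3 command from step 3 would show whether a given OID list stays inside the window for a particular device.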

**kloczek** · 19-05-2018, 23:12

Originally posted by colohost

There is NO timeout occurring on the net-snmp side; the data simply comes back from the device past the 30 second mark, but Zabbix is no longer receiving it because it hit its own internal configured Timeout value and the poller handling it stopped receiving. Zabbix cannot be set to have a Timeout value larger than 30 seconds, so that's as high as I can set it. If Zabbix did permit a longer timeout, I could set it up and this problem would go away, but that would not be a very elegant solution.

There is no such thing like internal zabbix configured Timeout.
Timeout config param is passed to snmp_open(). This is not zabbix timeout but libnetsnmp timeout.

From src/zabbix_server/poller/checks_snmp.c:

Code:

        snmp_sess_init(&session);
[..]
        session.timeout = CONFIG_TIMEOUT * 1000 * 1000; /* timeout of one attempt in microseconds */
                                                        /* (net-snmp default = 1 second) */
[..]
        if (NULL == (ss = snmp_open(&session)))
        {
                SOCK_CLEANUP;

                zbx_strlcpy(error, "Cannot open SNMP session", max_error_len);
        }

See https://linux.die.net/man/3/snmp_open for what it does.

**colohost** · 21-05-2018, 16:27

Thanks for posting that code block. It seems to back up my point that the problem is with Zabbix. I have no idea what you mean by this:

There is no such thing like internal zabbix configured Timeout.
Timeout config param is passed to snmp_open(). This is not zabbix timeout but libnetsnmp timeout.

Your statements contradict one another. You say there is no such thing as an "internal zabbix configured Timeout.", but then the following sentence admits to there being a "Timeout config param". They are one and the same; the Timeout config param is what I'm referring to as the internally configured Zabbix timeout value. That value is then imposed upon the netsnmp code (CONFIG_TIMEOUT * 1000 * 1000 = 30 seconds in my case), forcing a failure if the response is still coming at the 30 second mark, which in turn forces data collection to fail. This is not a device issue, or a netsnmp issue, it's Zabbix's code being inadequate to handle SNMPv3 polling of large numbers of OID's where the target device cannot send the response back before the Timeout occurs.

### Option: Timeout
# Specifies how long we wait for agent, SNMP device or external check (in seconds).
#
# Mandatory: no
# Range: 1-30
# Default:
Timeout=30

People have been having these issues for years and years; here's some random person's blog where commenters talk about having to alter the source code to allow for independently configured snmp timeouts and retries, to decouple it from the zabbix Timeout value:

http://blog.zabbix.com/zabbix-2-2-fe...ovements/2551/

If Zabbix was trying to monitor whether or not my stove could boil a pot of water, but turned the switch off every 30 seconds, would you tell me my stove is the problem because it can't boil water in less than 30 seconds? Am I supposed to reach out to Cisco and tell them their equipment is broken because it takes longer to send back 1000 OID's than Zabbix is willing to wait?

This is a flawed design on Zabbix's part. If a device can return all the desired values, then Zabbix needs to be able to collect all the responses. Cisco is not going to start including 3GHz Xeon's and GPU's in their switches to make Zabbix happy, so regardless of whether you think I'm wrong on this, the fact of the matter remains that Zabbix is not going to be able to reliably function with SNMPv3 AES-priv and large numbers of OID's against many vendors' hardware products. This gives you two choices: downgrade your security to SNMPv2 to make Zabbix happy, or don't use Zabbix. If you're comfortable blaming everyone else, while having a network management system that can't work reliably with one of the largest network product vendors when monitoring it securely, that's your choice.

**kloczek** · 21-05-2018, 17:52

Originally posted by colohost

Your statements contradict one another. You say there is no such thing as an "internal zabbix configured Timeout.", but then the following sentence admits to there being a "Timeout config param". They are one and the same; the Timeout config param is what I'm referring to as the internally configured Zabbix timeout value. That value is then imposed upon the netsnmp code (CONFIG_TIMEOUT * 1000 * 1000 = 30 seconds in my case), forcing a failure if the response is still coming at the 30 second mark, which in turn forces data collection to fail. This is not a device issue, or a netsnmp issue, it's Zabbix's code being inadequate to handle SNMPv3 polling of large numbers of OID's where the target device cannot send the response back before the Timeout occurs.

Those are not my statements. These are facts, completely independent of me.

More facts.
From snmpcmd(1) man page:

-t timeout
Specifies the timeout in seconds between retries. The default is 1. Floating point numbers can be used to specify fractions of seconds.

And this is why, by default, Zabbix server/proxy applies a 1s session timeout.
Please do not try to convince me that I'm wrong or present your guesses. If you cannot agree with facts .. that is not my problem.
Just show the code which is wrong or show exact test cases .. really (again), I have nothing to do with how the libnetsnmp client code works.
The SNMP client code is isolated, and Zabbix server/proxy only reports session timeout errors that occur in libnetsnmp functions.

Yes, there is quite a high probability that something is wrong in the net-snmp client or agent side code. If you want, you can install net-snmp on your computer and start the snmpd service so that you can start playing with that code. If you have the necessary skills, you should be able to at least narrow down where some of the issues are and/or show how to reproduce them. Yes, it is possible to reproduce those issues when snmpd is running on way more powerful HW than some devices' BMCs.
I can guarantee you that many people will thank you .. not only the Zabbix guys but Nagios and other SNMP tool users as well, because those session timeout issues are not Zabbix-specific.

People have been having these issues for years and years; here's some random person's blog where commenters talk about having to alter the source code to allow for independently configured snmp timeouts and retries, to decouple it from the zabbix Timeout value:

http://blog.zabbix.com/zabbix-2-2-fe...ovements/2551/

I remember those issues occurring a long time before 2.2.x.
Do you know why nothing has changed about those issues in the meantime? Because nothing in net-snmp has changed since Zabbix started collecting data over SNMP.
If you don't believe me, just have a look at https://trends.google.co.uk/trends/e...snmp%20timeout

Gaps on the graphs, SNMPv3
