Have you tried any of what is being discussed in this thread on an actual piece of network gear? I have, on equipment from Cisco, Arista and Brocade. In every case, there is a specific number of OIDs a given device can return in 30 seconds when using SNMPv3 with AES. It is easy to measure: keep adding OIDs to snmpbulkget until you find the number. I've already pointed out how to do this. If you need to query more OIDs than the device will send back in 30 seconds, you have no choice but to switch to SNMPv2 or stop using Zabbix, because it is not currently possible to monitor that many OIDs reliably.
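For anyone who wants to script the measurement instead of doing it by hand, here is a rough sketch of the idea. It is not anything Zabbix or netsnmp ships. `probe` is a hypothetical callable you would supply yourself (for example, something that runs one snmpbulkget request for the given OIDs and times it); it only has to return the elapsed seconds. The function name, the step size and the default of 30 seconds are my own assumptions for illustration.

```python
from typing import Callable, Sequence


def max_oids_within(
    probe: Callable[[Sequence[str]], float],
    oids: Sequence[str],
    timeout: float = 30.0,
    step: int = 50,
) -> int:
    """Return the largest tested OID count that was answered within `timeout`.

    `probe` is a hypothetical callable: it queries the device for the given
    OIDs once and returns how many seconds the full response took.
    Counts are tested in increments of `step`, so the result is only
    accurate to within one step.
    """
    if step <= 0:
        raise ValueError("step must be positive")

    best = 0
    for count in range(step, len(oids) + 1, step):
        elapsed = probe(oids[:count])
        if elapsed > timeout:
            # The device could not return this many OIDs in time; stop here.
            break
        best = count
    return best
```

To try it without a device, pass a fake probe such as `lambda batch: len(batch) / 40`, which pretends the device returns 40 OIDs per second.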
The cause of the failures is Zabbix forcing an artificial timeout onto the response from the device. Traditional network hardware cannot send AES-encrypted OID data back for 1000+ items in less than the maximum timeout Zabbix is capable of using, which is currently 30 seconds. This is not a netsnmp timeout; this is Zabbix telling it to stop at 30 seconds. Zabbix forces this timeout onto netsnmp, as your own code quote shows, so if it is still receiving data when the Zabbix-imposed timeout is reached, collection is halted at that point, no more data is collected, and your items have gaps.
The same issue does not happen when using netsnmp directly via snmpbulkget. It only happens when using Zabbix and the target device can't send back all of the requested data in less than the timeout Zabbix has forced onto the query.
So again, the solution is a code change on the Zabbix side so that a device is never queried for more SNMPv3 AES-priv OIDs than it can possibly return within the CONFIG_TIMEOUT value Zabbix has been set with. If a given device has more items than can be returned in that amount of time, then Zabbix needs to break them up into buckets of items that are queried at separate, non-overlapping intervals so they can all be retrieved. If you add more items, at a short enough query interval, than the target device can possibly ever keep up with, then Zabbix should warn you.
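To make the bucketing idea concrete, here is a minimal sketch of the scheduling logic I have in mind. This is not Zabbix code and none of these names exist in Zabbix; `plan_buckets`, `BucketPlan` and the parameters are made up for illustration. It assumes you already know how many OIDs the device can return within the timeout (`max_per_query`) and that each bucket gets a full timeout-sized window to itself.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class BucketPlan(NamedTuple):
    # (start offset in seconds within the polling interval, items in the bucket)
    schedule: List[Tuple[float, List[str]]]
    # Set when the device cannot keep up with the requested interval.
    warning: Optional[str]


def plan_buckets(
    items: Sequence[str],
    max_per_query: int,
    interval: float,
    timeout: float = 30.0,
) -> BucketPlan:
    """Split items into buckets the device can answer within `timeout`,
    and give each bucket its own non-overlapping window."""
    if max_per_query <= 0:
        raise ValueError("max_per_query must be positive")

    buckets = [
        list(items[i:i + max_per_query])
        for i in range(0, len(items), max_per_query)
    ]
    # Each bucket reserves one full timeout window, so windows never overlap.
    schedule = [(index * timeout, bucket) for index, bucket in enumerate(buckets)]

    warning = None
    needed = len(buckets) * timeout
    if needed > interval:
        warning = (
            f"{len(items)} items need {needed:.0f}s per cycle, "
            f"but the interval is only {interval:.0f}s"
        )
    return BucketPlan(schedule, warning)
```

With 2500 items, a measured limit of 1000 OIDs per query and a 120 second interval, this yields three buckets starting at 0, 30 and 60 seconds. Drop the interval to 60 seconds and the plan comes back with a warning, which is the point where Zabbix should be telling the user the device can't keep up.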
I submitted all of this as a bug.