Hi all,
I'm currently testing Zabbix for monitoring mainly network devices in remote locations, so all of my configuration is SNMPv3-based. My setup is pretty simple:
- 1 Zabbix server (2 vCPU, 4GB RAM) in AWS
- PostgreSQL backend for the server
- 1 Zabbix proxy (1 vCPU, 2GB RAM) with local PostgreSQL database as a VM on VMware ESX
This configuration has been running for a few weeks, pulling quite a few SNMP metrics (mainly interface data and system information) off the following devices:
- 4 x Cisco switch
- 1 x Cisco router
- 2 x Cisco ASA (in cluster)
- 1 x pfSense
During my testing, I noticed that the single CPU core on the proxy couldn't handle the number of pinger processes (StartPingers) I had configured, so I shut down the proxy VM, added a CPU core and relaunched it.
Since that modification of the proxy VM, all of my SNMPv3 monitoring has been broken.
My zabbix_proxy.log file was full of these messages:
1640:20200409:150627.566 SNMP agent item "system.uptime[sysUpTime.0]" on host "host1" failed: first network error, wait for 15 seconds
1640:20200409:150638.598 SNMP agent item "system.uptime[sysUpTime.0]" on host "host2" failed: first network error, wait for 15 seconds
1645:20200409:150642.603 resuming SNMP agent checks on host "host1": connection restored
1645:20200409:150700.254 resuming SNMP agent checks on host "host2": connection restored
1640:20200409:150712.654 SNMP agent item "net.if.in[ifHCInOctets.527305088]" on host "host2" failed: first network error, wait for 15 seconds
1640:20200409:150727.693 SNMP agent item "system.uptime[sysUpTime.0]" on host "host1" failed: first network error, wait for 15 seconds
1645:20200409:150733.616 resuming SNMP agent checks on host "host2": connection restored
1645:20200409:150742.109 resuming SNMP agent checks on host "host1": connection restored
1640:20200409:150757.731 SNMP agent item "snmp.engineid" on host "host1" failed: first network error, wait for 15 seconds
1641:20200409:150808.050 SNMP agent item "snmp.engineid" on host "host2" failed: first network error, wait for 15 seconds
1645:20200409:150812.794 resuming SNMP agent checks on host "host1": connection restored
Of my 8 hosts, 6 had a red SNMP status with a timeout message. Only 2 of them were still green at some point (I believe that's where the logs above come from).
I did some further digging and troubleshooting, and the closest match I could find was the SNMPv3 duplicate engineID issue described in https://support.zabbix.com/browse/ZBX-8385.
I did have the ASA cluster in my list of hosts, and apparently both ASAs share the same engineID when polled through SNMP. So I tried to resolve my issue by throwing out the secondary ASA, restarting services, even rebooting the Zabbix server and proxy, disabling and re-enabling hosts, disabling and re-enabling SNMP checks, ... Nothing helped.
(As a sidenote, I replicated the issue described in ZBX-8385 in my home setup by duplicating an engineID across 2 hosts; I got a completely different error, and that error resolved automatically once I re-adjusted the engineID on the monitored device.)
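For reference, duplicate engineIDs can also be checked outside Zabbix. A rough sketch with pysnmp (4.x, synchronous hlapi) along these lines should do it; the hostnames, user and keys are placeholders, not my real ones:

# compare SNMPv3 engineIDs across hosts; a fresh SnmpEngine per host
# sidesteps the engineID cache that ZBX-8385 is about
from pysnmp.hlapi import (SnmpEngine, UsmUserData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd,
                          usmHMACSHAAuthProtocol, usmAesCfb128Protocol)

HOSTS = ['asa-primary.example.net', 'asa-secondary.example.net']  # placeholders
ENGINE_ID = '1.3.6.1.6.3.10.2.1.1.0'  # SNMP-FRAMEWORK-MIB::snmpEngineID.0

seen = {}
for host in HOSTS:
    err, status, index, var_binds = next(getCmd(
        SnmpEngine(),
        UsmUserData('monuser', authKey='authpass', privKey='privpass',
                    authProtocol=usmHMACSHAAuthProtocol,
                    privProtocol=usmAesCfb128Protocol),
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(ENGINE_ID))))
    if err:
        print('%s: %s' % (host, err))
        continue
    engine_id = var_binds[0][1].prettyPrint()
    dup = ' (DUPLICATE of %s)' % seen[engine_id] if engine_id in seen else ''
    print('%s: %s%s' % (host, engine_id, dup))
    seen.setdefault(engine_id, host)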
So I went a bit deeper and looked at a packet capture of my SNMPv3 traffic: whenever the errors above occurred, it looked like Zabbix was sending garbage encrypted scopedPDUs. In the example below, the first request failed (with a retransmit after a few seconds), while the second one got through and got a normal response:
[packet capture screenshot: failed request with retransmit, followed by a successful request/response]
The last thing I did was spin up a new proxy and move the 4 Cisco switches (3 of which had a red SNMP status) to it. They have been fine SNMP-wise since then, while the Cisco router, the remaining ASA and the pfSense are still throwing fits on the old proxy ...
I have no idea where to look for this issue anymore, since I'm quite sure it's not related to ZBX-8385: the proxy was running fine for weeks with the ASAs having duplicate engineIDs (which is surprising, in retrospect). I'm mostly flabbergasted by the fact that one SNMP request can fail while the next one to the same device (with the same auth/encryption keys) can succeed.
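To take Zabbix out of the equation, something like the following loop (same pysnmp assumptions and placeholder credentials as above) should show whether back-to-back requests with identical keys really do fail intermittently at the SNMP level:

# fire the same SNMPv3 GET at one device 100 times and count failures;
# retries=0 so every single timeout is visible
import time
from pysnmp.hlapi import (SnmpEngine, UsmUserData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd,
                          usmHMACSHAAuthProtocol, usmAesCfb128Protocol)

SYS_UPTIME = '1.3.6.1.2.1.1.3.0'  # the same sysUpTime.0 the failing items poll
engine = SnmpEngine()  # one long-lived engine, like a poller process would use
user = UsmUserData('monuser', authKey='authpass', privKey='privpass',
                   authProtocol=usmHMACSHAAuthProtocol,
                   privProtocol=usmAesCfb128Protocol)
target = UdpTransportTarget(('router.example.net', 161), timeout=2, retries=0)

failures = 0
for i in range(100):
    err, status, index, var_binds = next(getCmd(
        engine, user, target, ContextData(),
        ObjectType(ObjectIdentity(SYS_UPTIME))))
    if err:
        failures += 1
        print('request %3d: FAILED (%s)' % (i, err))
    else:
        print('request %3d: %s' % (i, var_binds[0][1].prettyPrint()))
    time.sleep(1)
print('%d/100 requests failed' % failures)

If a loop like this stays clean while Zabbix keeps failing from the same box, that would point at Zabbix/net-snmp rather than at the devices or the network.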
Does changing the hardware specification of a Zabbix proxy (or server, when used for monitoring) have an impact on SNMPv3 encryption, or on Zabbix functionality in general? Is this something I should avoid doing, instead spinning up a new proxy with the different hardware and moving the hosts over?
Or could it still be an issue with SNMP itself, or with the net-snmp (libnetsnmp) implementation Zabbix uses?
Thanks!
Edit: well, it's getting worse:
- moved the Cisco router to the new proxy: same issue, but now only for the Cisco router (the 4 switches already on the new proxy are monitored fine, without errors)
- moved the 4 Cisco switches back to the old proxy: immediately the same errors for all 4 of them; moved them back to the new proxy and all was well
It must be something in the configuration, but I have no clue ... I've been looking for incorrect macros defined in templates and the like, but the templates are the same for both the switches and the router. Things seem to go fine for a number of SNMP requests, and then scopedPDU encryption starts going wrong.
Edit2:
- disabled all hosts on the new proxy (4 switches, 1 router) and re-enabled just the router after a few minutes: SNMP breaks after some time, so it clearly has nothing to do with engineIDs
- created copies of my templates that use SNMPv3 at noAuth level instead of authPriv and applied them to the router: now it's perfectly fine ...
So it's config-related, an SNMPv3 encryption issue, or a combination of both.
My authPriv templates are unchanged, work fine on the switches, and worked fine on the router for some weeks ... Gotta dig a bit deeper to find the probable config whoopsie.
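For what it's worth, the USM security levels differ only in which keys are supplied. In pysnmp terms (placeholder credentials again) the variants I'm juggling look like this, and only the last one encrypts the scopedPDU:

# the three SNMPv3/USM security levels, from no protection to full encryption
from pysnmp.hlapi import (UsmUserData, usmHMACSHAAuthProtocol,
                          usmAesCfb128Protocol)

# noAuthNoPriv: what my "noAuth" template copies use; currently stable
user_noauth = UsmUserData('monuser')

# authNoPriv: authenticated, but the scopedPDU still goes out in cleartext
user_authnopriv = UsmUserData('monuser', authKey='authpass',
                              authProtocol=usmHMACSHAAuthProtocol)

# authPriv: authenticated and encrypted; the level that keeps breaking for me
user_authpriv = UsmUserData('monuser', authKey='authpass', privKey='privpass',
                            authProtocol=usmHMACSHAAuthProtocol,
                            privProtocol=usmAesCfb128Protocol)

Since the scopedPDU is only encrypted at authPriv, noAuth being stable at least narrows it down to the priv side (key localization against the engineID, or the privacy parameters) rather than to transport or authentication.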