SNMPv3 failure after Zabbix proxy hardware reconfiguration

  • frcre
    Junior Member
    • Apr 2020
    • 10

    #1


    Hi all,

I'm currently testing Zabbix, mainly for monitoring network devices in remote locations, so all of my configuration is currently SNMPv3-based. My setup is pretty simple:

    - 1 Zabbix server (2 vCPU, 4GB RAM) in AWS
    - PostgreSQL backend for the server
    - 1 Zabbix proxy (1 vCPU, 2GB RAM) with local PostgreSQL database as a VM on VMware ESX

This configuration has been running for a few weeks, pulling a fair amount of SNMP metrics from the following devices (mainly interface data and system information):

    - 4 x Cisco switch
    - 1 x Cisco router
    - 2 x Cisco ASA (in cluster)
    - 1 x pfSense

During my testing, I noticed that the single CPU core on the proxy couldn't handle the number of StartPingers processes I had configured. I shut down the proxy VM, added a CPU core and relaunched it.

Since that modification of the proxy VM, all of my SNMPv3 monitoring has been broken.

    My zabbix_proxy.log file was full of these messages:

    1640:20200409:150627.566 SNMP agent item "system.uptime[sysUpTime.0]" on host "host1" failed: first network error, wait for 15 seconds
    1640:20200409:150638.598 SNMP agent item "system.uptime[sysUpTime.0]" on host "host2" failed: first network error, wait for 15 seconds
    1645:20200409:150642.603 resuming SNMP agent checks on host "host1": connection restored
    1645:20200409:150700.254 resuming SNMP agent checks on host "host2": connection restored
    1640:20200409:150712.654 SNMP agent item "net.if.in[ifHCInOctets.527305088]" on host "host2" failed: first network error, wait for 15 seconds
    1640:20200409:150727.693 SNMP agent item "system.uptime[sysUpTime.0]" on host "host1" failed: first network error, wait for 15 seconds
    1645:20200409:150733.616 resuming SNMP agent checks on host "host2": connection restored
    1645:20200409:150742.109 resuming SNMP agent checks on host "host1": connection restored
    1640:20200409:150757.731 SNMP agent item "snmp.engineid" on host "host1" failed: first network error, wait for 15 seconds
    1641:20200409:150808.050 SNMP agent item "snmp.engineid" on host "host2" failed: first network error, wait for 15 seconds
    1645:20200409:150812.794 resuming SNMP agent checks on host "host1": connection restored
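To get an overview of which hosts were flapping, something like this tallies the errors per host from the log (the log path is just an example; adjust it to your LogFile setting):

```shell
# Count "first network error" events per host in the proxy log
grep 'failed: first network error' zabbix_proxy.log \
  | sed -n 's/.*on host "\([^"]*\)".*/\1/p' \
  | sort | uniq -c | sort -rn
```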

Of my 8 hosts, 6 had a red SNMP status with a timeout message. Only 2 were still green at some point (I think that's where the logs above come from).

I did some further digging and troubleshooting, and the closest match I could find for this problem was the SNMPv3 engineID issue described in https://support.zabbix.com/browse/ZBX-8385

I did have the ASA cluster in my list of hosts, and apparently both ASAs share the same engineID when polled through SNMP. So I tried to resolve my issue by removing the secondary ASA, restarting services, even rebooting the Zabbix server and proxy, disabling and re-enabling hosts, disabling and re-enabling SNMP checks, ... Nothing helped.

    (As a sidenote, I replicated the issue as described in ZBX-8385 in my home setup by duplicating an engineID over 2 hosts, got a completely different error, and that error was resolved automatically by re-adjusting the engineID on the monitored device.)

So I went a bit deeper and looked at a packet capture of my SNMPv3 traffic: whenever the errors above occurred, it looks like Zabbix was sending garbage encrypted scopedPDUs. In the example below, the first request failed (with a retransmit after a few seconds), while the second one got through and received a normal response:

[Screenshot: zabbix_pcap.png - packet capture showing the failed request with garbage scopedPDU, its retransmit, and the subsequent successful request/response]
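If anyone wants to take a similar capture themselves, something like this on the proxy should do it (interface name and output path are just examples):

```
# Capture SNMP traffic (UDP port 161) on the proxy for later analysis in Wireshark
tcpdump -i eth0 -s 0 udp port 161 -w /tmp/snmpv3.pcap
```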

    The last thing I did was spin up a new proxy and move the 4 Cisco switches (out of which 3 had a red SNMP status) to that new proxy. They have been fine SNMP-wise since then, while the Cisco router, the single ASA and the pfSense are still throwing fits on the old proxy ...

I have no idea where to look next for this issue, since I'm quite sure it's not related to ZBX-8385. The proxy was also running fine for weeks with the ASAs having duplicate engineIDs (which is surprising, in retrospect). I'm mostly flabbergasted by the fact that one SNMP request can fail, while the next one to the same device (with the same auth/encryption key) can succeed.

Does changing the hardware specification of a Zabbix proxy (or server, when used for monitoring) have an impact on SNMPv3 encryption, or on Zabbix functionality in general? Is this something I should avoid, and should I just spin up a new proxy with different hardware and move the hosts to it?
    Or could it still be an issue with SNMP itself, or the implementation of SNMP in libnetsnmp?

    Thanks!

    Edit: well, it's getting worse:

    - moved Cisco router to new proxy: same issue, but now only for the Cisco router (the 4 switches already on the new proxy get monitored fine, without errors)
    - moved the 4 Cisco switches back to the old proxy: immediately the same errors for all 4 of them - moved them back to the new proxy, and all is well

    It must be something in the configuration, but I have no clue ... I have been looking for incorrect macros defined in templates or whatnot, but the templates are the same for both the switches and the router. Polling seems to go fine for a number of SNMP requests, and then the scopedPDU encryption starts going wrong.

    Edit2:

    - disabled all hosts on new proxy (4 switches, 1 router), re-enabled just the router after a few minutes: SNMP breaks after some time, so it clearly has nothing to do with engineIDs
    - created copies of my templates to use SNMPv3 with noAuth level instead of authPriv, applied it to the single router, now it's perfectly fine ...

    So it's config-related, or SNMPv3 encryption, or combination of both.

    My authPriv templates are the same and work fine on the switches, and they worked fine on the router for some weeks ... Gotta dig a bit deeper to find the probable config whoopsie.
    Last edited by frcre; 14-04-2020, 12:01.
  • frcre
    Junior Member
    • Apr 2020
    • 10

    #2
    I think I found something, so posting separately: I increased StartPollers, StartPingers, StartPollersUnreachable and StartDiscoverers to 8, and all 5 hosts on my new proxy were suddenly monitored properly. I applied the same changes to my old proxy and moved the hosts back: all 5 hosts are monitored properly on the old proxy as well ...
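For completeness, this is the relevant excerpt of my zabbix_proxy.conf after the change:

```
### zabbix_proxy.conf (excerpt)
StartPollers=8
StartPollersUnreachable=8
StartPingers=8
StartDiscoverers=8
```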

    I'm guessing there is some relation between the number of CPU cores and the number of e.g. poller processes then? Does it have to be divisible?

    Performance was never an issue, the only parameter I tweaked previously was StartPingers to keep it under the 80% threshold.


    • tim.mooney
      Senior Member
      • Dec 2012
      • 1427

      #3
      That's a bizarre problem and I'm really glad you stuck with it and followed-up with the "fix".

      I'm not aware of any Start*-to-cores requirement, though that doesn't mean it's not possible. Typically, if you have too many or too few, you can tell from general performance. If there's any ratio you need to keep between the various subsystems, it should be documented in the comments in zabbix_server.conf (or zabbix_proxy.conf).

      Since your original problem appeared to be intermittent encryption issues: by any chance were you monitoring the entropy pool on the proxy itself? Running out of entropy came up in another thread on these forums recently, and especially in a VM environment doing a lot of encryption, it's something I would be suspicious enough to check. If you still have the original proxy and care to test it, you could set up a monitor on your cloud server for the proxy's entropy (see something like https://major.io/2007/07/01/check-av...ropy-in-linux/ for an idea of the proc file to monitor), set the Start* settings back to their original values, and wait to see if it fails.


      • frcre
        Junior Member
        • Apr 2020
        • 10

        #4
        The entropy is an interesting suggestion. I went on to check it, but it doesn't look like it's related:

        Situation this morning (7 hosts monitored fine):

        Code:
        :~$ cat /proc/sys/kernel/random/entropy_avail
        3583
        Previous zabbix_proxy.conf with default Start* settings (output taken a few minutes after a reboot, 5 out of 7 hosts in red SNMP status):

        Code:
        :~$ cat /proc/sys/kernel/random/entropy_avail
        948
        Working settings restored (proxy service restart, no reboot, once all SNMP checks were green again):

        Code:
        :~$ cat /proc/sys/kernel/random/entropy_avail
        865
        Entropy does ramp up faster with more processes active, but SNMP was all good again when entropy was even lower than with the default settings.


        • tim.mooney
          Senior Member
          • Dec 2012
          • 1427

          #5
          I should have mentioned this in my previous post: to accurately monitor it, it would be better to do it via a Zabbix item. The reason is that by accessing the box interactively and running commands, you're actually contributing to the entropy pool. So in true Heisenberg fashion, by (manually) looking at entropy_avail, you're changing the results of the reading. Even the Zabbix item may contribute to the entropy pool, but it will likely be a much smaller contribution than you were making.

          Using a Zabbix item and not logging in to the system will give you a more accurate reading of how it's going to respond under normal use, and you'll also be able to watch it over time to see if it ever looks like it's getting depleted.
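As a sketch, assuming the Zabbix agent is running on the proxy, a standard passive item with a key like this would read the same proc file without anyone logging in:

```
vfs.file.contents[/proc/sys/kernel/random/entropy_avail]
```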

          Still, I think you're probably right that it's not the problem here.


          • frcre
            Junior Member
            • Apr 2020
            • 10

            #6
            I added an item like you said, and it's showing roughly the same values.
            Anyhow, I have been constantly connected to my proxy for the past few days for troubleshooting, so my input probably generated quite some entropy then as well


            • frcre
              Junior Member
              • Apr 2020
              • 10

              #7
              Well, I'm running into the same issue again, although now no hosts are disabled for SNMP. I'm just constantly getting the "first network error" + "connection restored" spam in zabbix_proxy.log

              I'm investigating whether this has to do with our underlying hypervisor setup by moving the VM to a separate physical machine, but any other ideas are still more than welcome!


              • frcre
                Junior Member
                • Apr 2020
                • 10

                #8
                After 1 more day of testing, I'm quite sure it's not the VM itself, but some configuration issue.

                I removed all 7 hosts and started building them from scratch again:

                - adding the 4 switches with their templates: all OK
                - adding the router with its templates: all OK
                - adding both ASA firewalls with their templates: all OK

                I then restarted the zabbix-proxy service on my VM, and the SNMP timeouts and subsequent errors reappeared all over again ...

                I'm completely lost at the moment, went back to just the 4 switches with 16 basic SNMP items on each, with SNMP credentials hardcoded on each of them instead of using macros: still the same issue.

                It seems that I have to completely remove all hosts, restart the proxy, re-add them manually, and then avoid restarting the proxy service or the VM to keep things running ...


                • frcre
                  Junior Member
                  • Apr 2020
                  • 10

                  #9
                  OK, I guess I have solved my issue, so I'm posting one last (hopefully) update about this, should others ever run into the same behaviour ...

                  The root cause seems to be very stupid: I had a few SNMP items in a default template that I copied, and in my mass update something went wrong when adjusting the encryption method to AES. Some items still had DES applied, so naturally querying those items failed big time.

                  I suspect that having multiple hundreds of SNMPv3 items on a host, with some of them having incorrect encryption settings, messes up some internal caching mechanism that handles the encryption of PDUs. This seems to have led to incorrect encryption even for items with the correct AES setting applied. I replicated the issue with ALL items incorrectly configured, and in that case Zabbix spams the log files with errors about it. When the majority of SNMP items is working fine, those errors are not present.
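For anyone who wants to verify a device's SNMPv3 privacy settings outside of Zabbix, something like this with net-snmp shows the symptom (host, user and passphrases below are placeholders):

```
# Privacy protocol matches the device (AES): normal reply
snmpget -v3 -l authPriv -u zbxuser -a SHA -A 'authpass' -x AES -X 'privpass' 192.0.2.1 1.3.6.1.2.1.1.3.0

# Privacy protocol mismatch (DES against an AES-configured user):
# typically a timeout or a usmStats/decryption error instead of sysUpTime
snmpget -v3 -l authPriv -u zbxuser -a SHA -A 'authpass' -x DES -X 'privpass' 192.0.2.1 1.3.6.1.2.1.1.3.0
```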

                  Long story short: PEBKAC, and the "SNMP settings on host level" feature from 5.0 is long overdue


                  • tim.mooney
                    Senior Member
                    • Dec 2012
                    • 1427

                    #10
                    Thanks for following up with the solution. That may indeed be a help to others in the future!

