SNMP Walk connection keeps failing

  • markfree
    Senior Member
    • Apr 2019
    • 868

    #1

    SNMP Walk connection keeps failing

    With Zabbix 7.0, I'm trying to retrieve some SNMP data from a host using the SNMP Walk request, but the connection keeps failing.
    The host has 5 SNMP Walk items and all of them have a timeout of 10s, but none of them work.
    Zabbix shows the host as unavailable with the following error message.
    [Image: unavailable_error.png]
    cannot retrieve OID: '.1.3.6.1.2.1.2.2.1' from [[host.domain]:161]: timed out
    When I test the item, it immediately fails with a simple "Cannot connect to host.domain:161" error message.
    [Image: test_error.png]

    However, trying a few more times, the test eventually works and the correct data is shown.

    When I try to get the same data using the SNMPWalk command, it works every time and very quickly.

    Code:
    snmpbulkwalk -t 5 -r 1 -v 3 -l authPriv -u [USER] -a SHA -A [AUTH] -x AES -X [CRYPT] -Oe -Ot -On host.domain:161 .1.3.6.1.2.1.2.2.1
    .1.3.6.1.2.1.2.2.1.1.1 = INTEGER: 1
    .1.3.6.1.2.1.2.2.1.1.2 = INTEGER: 2
    .1.3.6.1.2.1.2.2.1.1.3 = INTEGER: 3
    .1.3.6.1.2.1.2.2.1.1.4 = INTEGER: 4
    .1.3.6.1.2.1.2.2.1.2.1 = STRING: lo
    .1.3.6.1.2.1.2.2.1.2.2 = STRING: eth0
    .1.3.6.1.2.1.2.2.1.2.3 = STRING: teql0
    .1.3.6.1.2.1.2.2.1.2.4 = STRING: eth1
    (...)
    There are no error messages in the Zabbix logs either.
    I'm not sure why Zabbix is timing out. Any thoughts on this?
  • Answer selected by Markku at 24-06-2024, 07:53.
    markfree
    Senior Member
    • Apr 2019
    • 868

    I managed to find the issue.
    It was an IPv6 default-route issue on the remote end.

    The remote host uses a dual-stack DNS name, and its network router had a problem with the IPv6 default route. So the host was receiving SNMP requests, but some replies were left in limbo.
    After fixing the default-route, it is now connected with no more issues.

    Btw, it is a Mikrotik router, but I wasn't monitoring its routes.
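
For anyone hitting a similar dual-stack problem, one way to narrow it down might be to walk the same OID once per resolved address, so that IPv4 and IPv6 are tested separately. This is only a sketch: `snmp_target` and `walk_each_address` are hypothetical helper names, and it assumes `getent` and the Net-SNMP tools are installed on the Zabbix server.

```shell
#!/usr/bin/env bash
# Turn a literal IP address into a Net-SNMP transport specifier.
snmp_target() {
  local addr="$1" port="${2:-161}"
  case "$addr" in
    *:*) printf 'udp6:[%s]:%s\n' "$addr" "$port" ;;  # IPv6 literal
    *)   printf 'udp:%s:%s\n' "$addr" "$port" ;;      # IPv4 literal
  esac
}

# Walk one OID against every address the name resolves to and report
# success or failure per address. Extra arguments (SNMP version,
# credentials) are passed through to snmpbulkwalk unchanged.
walk_each_address() {
  local host="$1" oid="$2" addr
  shift 2
  for addr in $(getent ahosts "$host" | awk '{ print $1 }' | sort -u); do
    if snmpbulkwalk -t 5 -r 1 "$@" "$(snmp_target "$addr")" "$oid" > /dev/null; then
      printf '%s: ok\n' "$addr"
    else
      printf '%s: FAILED\n' "$addr"
    fi
  done
}
```

Usage would look like `walk_each_address host.domain .1.3.6.1.2.1.2.2.1 -v 3 -l authPriv -u [USER] -a SHA -A [AUTH] -x AES -X [CRYPT]`. If only the IPv6 address fails, the routing on that path is worth a look.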


    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1784

      #2
      Does this help you:



      Changes in Authentication protocol, Authentication passphrase, Privacy protocol or Privacy passphrase, made without changing the Security name, will take effect only after the cache on a server/proxy is manually cleared (by using -R snmp_cache_reload) or the server/proxy is restarted. In cases, where Security name is also changed, all parameters will be updated immediately.
      (Maybe not, as your requests eventually succeed)

      Markku


      • markfree
        Senior Member
        • Apr 2019
        • 868

        #3
        I reloaded the SNMP cache, but it didn't work. However, I guess I skipped the rest of the warning and did not restart the server at first.
        Now, after restarting the server, the host is finally available again.
        I will continue to monitor the host and hope it doesn't fail again.

        Thanks a lot Markku


        • markfree
          Senior Member
          • Apr 2019
          • 868

          #4
          This is all quite odd.
          I did a quick server restart today, and now SNMP checks are failing in Zabbix.
          The server log shows general network errors. This makes the host interface unavailable.

          Code:
          75603:20240615:180847.905 SNMP agent item "general.walk" on host "[HOST]" failed: first network error, wait for 15 seconds
          75603:20240615:180914.913 SNMP agent item "general.walk" on host "[HOST]" failed: another network error, wait for 15 seconds
          75603:20240615:180924.909 SNMP agent item "enterprises.ifEntry.walk" on host "[HOST]" failed: another network error, wait for 15 seconds
          75603:20240615:180928.911 SNMP agent item "enterprises.walk" on host "[HOST]" failed: another network error, wait for 15 seconds
          75603:20240615:180951.202 resuming SNMP agent checks on host "[HOST]": connection restored
          75603:20240615:181001.909 SNMP agent item "general.walk" on host "[HOST]" failed: first network error, wait for 15 seconds
          75603:20240615:181038.907 SNMP agent item "enterprises.walk" on host "[HOST]" failed: another network error, wait for 15 seconds
          75603:20240615:181042.910 SNMP agent item "general.walk" on host "[HOST]" failed: another network error, wait for 15 seconds
          75603:20240615:181119.905 temporarily disabling SNMP agent checks on host "[HOST]": interface unavailable
          I've already increased the SNMP walk items timeout to 15s, which is more than enough, but Zabbix keeps showing these network errors.
          So, in general, the host availability is oscillating.
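
To see which walk items fail most often, the server log can be summarised per host and item. The following is a minimal sketch that relies only on the log line format shown above; `count_snmp_network_errors` is a hypothetical helper name, and the log path in the usage note is an assumption (the usual location on package installs).

```shell
#!/usr/bin/env bash
# Count "network error" log lines per host and item.
# Reads a Zabbix server log on stdin and prints "<count> <host> <item>",
# most frequent first.
count_snmp_network_errors() {
  awk -F'"' '
    # With a double quote as field separator, $2 is the item key
    # and $4 is the host name.
    /SNMP agent item ".*" on host ".*" failed: .*network error/ {
      count[$4 " " $2]++
    }
    END {
      for (key in count) print count[key], key
    }
  ' | sort -k1,1nr -k2
}
```

It could be run as `count_snmp_network_errors < /var/log/zabbix/zabbix_server.log`. Lines such as "resuming SNMP agent checks" are ignored, so only the failures are counted.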

          [Image: image.png]

          For testing, I did 100 queries in sequence and they all succeeded.
          Code:
          $ for i in {1..100}; do time (snmpbulkwalk -t 10 -r 1 -v 3 -l authPriv -u [USER] -a SHA -A "[AUTH]" -x AES -X "[CRIPT]" -Oe -Ot -On [HOST]:[PORT] .1.3.6.1.4.1 &> /dev/null); done
          
          real 0m7.927s
          user 0m0.401s
          sys 0m0.051s
          
          real 0m7.966s
          user 0m0.405s
          sys 0m0.058s
          
          real 0m7.636s
          user 0m0.398s
          sys 0m0.061s
          
          real 0m9.243s
          user 0m0.412s
          sys 0m0.051s
          
          real 0m7.838s
          user 0m0.416s
          sys 0m0.036s

          I'm not sure why Zabbix is failing so much.


          • Markku
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1784

            #5
            Maybe you could capture the SNMP traffic on the Zabbix server and check in Wireshark what it looks like. It might give you more insight into what's really happening.
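
A capture along those lines could be taken with tcpdump. This is a sketch under assumptions: `capture_filter` and `capture_snmp` are hypothetical helper names, the output path is arbitrary, and `-i any` assumes a Linux server. With SNMPv3 authPriv the payload is encrypted, but the timing of requests and responses is still visible.

```shell
#!/usr/bin/env bash
# Build a pcap filter for one UDP port, optionally limited to one peer address.
capture_filter() {
  local port="$1" peer="${2:-}"
  if [ -n "$peer" ]; then
    printf 'udp port %s and host %s\n' "$port" "$peer"
  else
    printf 'udp port %s\n' "$port"
  fi
}

# Capture SNMP traffic into a file that Wireshark can open (needs root).
# Stop with Ctrl-C once the failure has been reproduced.
capture_snmp() {
  local peer="${1:-}" out="${2:-/tmp/snmp.pcap}"
  tcpdump -n -i any -w "$out" "$(capture_filter 161 "$peer")"
}
```

Running `capture_snmp` without arguments records all SNMP traffic on port 161. The same filter helper with port 53 would cover DNS lookups.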

            Markku


            • troffasky
              Senior Member
              • Jul 2008
              • 587

              #6
              There seems to be something specifically about the walk items that causes this. I didn't have any issues with 3 specific hosts after upgrading to Zabbix 7, until I applied a new template with walk items in them. Now the SNMP availability flaps constantly on these hosts. They are all SNMP v2.


              • Markku
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                • Sep 2018
                • 1784

                #7
                This is interesting, I haven't yet tested the new async SNMP items (walk[], get[]).

                If/when you are able to reproduce the problem, definitely take a capture of the traffic and see what happens: is the device unable to provide the requested data adequately, or is the new async SNMP poller unable to use the data.

                Markku


                • troffasky
                  Senior Member
                  • Jul 2008
                  • 587

                  #8
                  I have been trying to troubleshoot this but as yet cannot fathom what Zabbix is doing.

                  Template is "Mikrotik by SNMP". Hosts are added by hostname and not IP.


                  net.if.wireless.walk is:

                  walk[1.3.6.1.4.1.14988.1.1.14.1.1.2,1.3.6.1.2.1.31.1.1.1.18,1.3.6.1.2.1.2.2.1.3,1.3.6.1.2.1.2.2.1.7,1.3.6.1.4.1.14988.1.1.16.1.1.2,1.3.6.1.4.1.14988.1.1.16.1.1.4,1.3.6.1.4.1.14988.1.1.16.1.1.3,1.3.6.1.4.1.14988.1.1.16.1.1.7,1.3.6.1.4.1.14988.1.1.1.3.1.4,1.3.6.1.4.1.14988.1.1.1.3.1.8,1.3.6.1.4.1.14988.1.1.1.3.1.9,1.3.6.1.4.1.14988.1.1.1.3.1.6,1.3.6.1.4.1.14988.1.1.1.3.1.11,1.3.6.1.4.1.14988.1.1.1.7.1.5,1.3.6.1.4.1.14988.1.1.1.7.1.4,1.3.6.1.4.1.14988.1.1.1.7.1.2,1.3.6.1.4.1.14988.1.1.1.7.1.3]

                  When SNMP availability icon goes red, hover over it:

                  Code:
                      
                  cannot resolve address [[miaap]:161]: timed out: nodename nor servname provided, or not known
                  It's complaining that it can't resolve it, I think? The DNS server is running on the same host as the Zabbix server. Nslookup *always* returns the correct answer, instantly. Nothing is ever logged about DNS in the Zabbix server log.

                  It didn't have any problem with DNS until around 2200 yesterday, which is when I uploaded the Zabbix 7 version of "Mikrotik by SNMP", used by this host [this updated the template from the factory 6.0 version]. I am quite sure that this is not a DNS problem.

                  [Image: miaap.png]


                  Server log:

                  Code:
                  1945097:20240623:095124.030 enabling SNMP agent checks on host "miaap": interface became available
                  1945135:20240623:095124.185 SNMP agent item "sensor.temp.walk" on host "miaap" failed: first network error, wait for 15 seconds
                  1945135:20240623:095139.192 SNMP agent item "net.if.wireless.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945097:20240623:095154.069 resuming SNMP agent checks on host "miaap": connection restored
                  1945135:20240623:095154.184 SNMP agent item "net.if.wireless.walk" on host "miaap" failed: first network error, wait for 15 seconds
                  1945135:20240623:095209.193 SNMP agent item "sensor.temp.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945135:20240623:095224.197 SNMP agent item "net.if.wireless.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945135:20240623:095236.204 SNMP agent item "net.if.wireless.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945097:20240623:095251.141 resuming SNMP agent checks on host "miaap": connection restored
                  1945135:20240623:095251.188 SNMP agent item "system.cpu.walk" on host "miaap" failed: first network error, wait for 15 seconds
                  1945135:20240623:095306.197 SNMP agent item "net.if.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945135:20240623:095321.196 SNMP agent item "net.if.wireless.walk" on host "miaap" failed: another network error, wait for 15 seconds
                  1945135:20240623:095336.195 temporarily disabling SNMP agent checks on host "miaap": interface unavailable


                  • troffasky
                    Senior Member
                    • Jul 2008
                    • 587

                    #9
                    Whilst in an "unresponsive" state, I can force a check on a specific item, it will work and store the retrieved value, but that doesn't actually clear the "unresponsive" flag on the host in Zabbix. Interesting.


                    • Markku
                      Senior Member
                      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                      • Sep 2018
                      • 1784

                      #10
                      You haven't provided any details of your DNS setup, so I cannot tell anything yet. But if Zabbix says it cannot resolve the name, then I'm pretty sure that's what is happening.

                      Some questions:
                      - Is the host name actually "miaap" = not FQDN?
                      - How are your domain/search settings configured (if not using FQDN)?
                      - What is your DNS server app and how should it resolve the name?
                      - How is your local DNS resolver configured? (Running a DNS server locally does not affect the local DNS resolver unless specifically configured)

                      I'm not sure nslookup is guaranteed to resolve names the same way Zabbix does. Probably not, as ZBXNEXT-8620 ("Add async DNS resolver for HTTP, SNMP and Zabbix agent") changed name resolution in Zabbix 7. You can try capturing the local DNS traffic to see what's actually happening during name resolution.

                      Markku


                      • Markku
                        Senior Member
                        Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                        • Sep 2018
                        • 1784

                        #11
                        Notable detail is that the new "async" features need "async" usage, AFAIK. Thus using the new Mikrotik template (that uses the new async SNMP items) changed the underlying behavior for sure.

                        Markku


                        • troffasky
                          Senior Member
                          • Jul 2008
                          • 587

                          #12
                          Yes, miaap is the name. It has only an AAAA record, no A record. It resolves just fine with either the bare host name or the FQDN.
                          Disabling all of the walk[] items stopped the SNMP unreachable alerts.
                          DNS server is bind. The configuration is a bunch of text files.
                          OS resolver configuration is

                          nameserver 127.0.0.1
                          nameserver 8.8.8.8
                          search <my domain>




                          "The library keeps track of the state of nameservers and will avoid them when they go down. Otherwise it will round robin between them."

                          In this case, having Google public DNS listed second doesn't actually mean that it will only be used if 127.0.0.1 isn't working.
                          I removed it from /etc/resolv.conf, restarted zabbix-server and now I am no longer getting SNMP unreachable messages with these walk[] items enabled!
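
A quick way to check whether every listed nameserver can actually answer for an internal name might be to query each one directly. This is a sketch only: `list_nameservers` and `query_each_nameserver` are hypothetical helper names, it assumes `dig` is installed, and the name passed in should be the FQDN because `dig` does not apply the search list by default.

```shell
#!/usr/bin/env bash
# Print every nameserver address listed in a resolv.conf-style file.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask each listed nameserver directly for the AAAA record of a name.
# One try and a 2 second timeout per server keeps a dead server from stalling the loop.
query_each_nameserver() {
  local name="$1" conf="${2:-/etc/resolv.conf}" ns answer
  for ns in $(list_nameservers "$conf"); do
    answer=$(dig +short +time=2 +tries=1 AAAA "$name" "@$ns")
    printf '%s -> %s\n' "$ns" "${answer:-<no answer>}"
  done
}
```

Calling `query_each_nameserver` with the FQDN of the monitored host prints one line per nameserver. A public resolver returning no answer for an internal-only name would be consistent with the intermittent resolution failures described above.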


                          • Markku
                            Senior Member
                            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                            • Sep 2018
                            • 1784

                            #13
                            Yeah, it's expected that any of the listed servers will be used for name resolution (unless some specific mapping is done in systemd-resolved or similar components).

                            But this DNS side track was not related to markfree problems, right?

                            Markku


                            • troffasky
                              Senior Member
                              • Jul 2008
                              • 587

                              #14
                              Markku wrote: "Yeah, it's expected that any of the listed servers will be used for name resolution (unless some specific mapping is done in systemd-resolved or similar components)."
                              No. man 5 resolv.conf says it will only use the next server if the first one is down. That's why I had Google Public DNS in there second, so if bind breaks then I will still have internet.
                              I am not really sure what the distribution of things using libc for DNS resolution and things using libevent for DNS resolution is.

                              Markku wrote: "But this DNS side track was not related to markfree problems, right?"
                              Symptoms were very similar and he hasn't come back. So I don't know. A bit odd that the server does not log about failure to resolve hostnames.


                              • Markku
                                Senior Member
                                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                                • Sep 2018
                                • 1784

                                #15
                                As your case proved, there are other things than just resolv.conf and its documentation. Systemd-resolved is often used by default, and it makes its own decisions about the resolving order.

                                Markku
