Intermittent SNMP failures with a host

  • mdw
    Junior Member
    • Sep 2021
    • 18

    #1

    Intermittent SNMP failures with a host

    I am slowly and carefully implementing Zabbix 5.0 LTS. Everything looked OK in the server log until I added our central DC switch (a pair of Cisco Catalyst 6800 in a VSS stack). Suddenly I am starting to get:

    Code:
    486979:20210922:084611.402 resuming SNMP agent checks on host "strS00": connection restored
    486977:20210922:084656.930 SNMP agent item "net.if.status[ifOperStatus.345]" on host "strS00" failed: first network error, wait for 15 seconds
    486979:20210922:084712.022 resuming SNMP agent checks on host "strS00": connection restored
    486974:20210922:084756.171 SNMP agent item "net.if.status[ifOperStatus.269]" on host "strS00" failed: first network error, wait for 15 seconds
    486979:20210922:084811.393 resuming SNMP agent checks on host "strS00": connection restored
    486978:20210922:084856.319 SNMP agent item "net.if.in[ifHCInOctets.356]" on host "strS00" failed: first network error, wait for 15 seconds
    486979:20210922:084911.838 resuming SNMP agent checks on host "strS00": connection restored
    486975:20210922:084956.462 SNMP agent item "net.if.status[ifOperStatus.833]" on host "strS00" failed: first network error, wait for 15 seconds
    486979:20210922:085011.220 resuming SNMP agent checks on host "strS00": connection restored
    486976:20210922:085056.730 SNMP agent item "net.if.speed[ifHighSpeed.732]" on host "strS00" failed: first network error, wait for 15 seconds
    486979:20210922:085111.592 resuming SNMP agent checks on host "strS00": connection restored
    I am using bulk requests, and I increased the SNMP timeout to 10 seconds, with no change. If I run snmpbulkwalk on the command line, it never hiccups on anything. My own tool, which fetches thousands of OIDs from the switch, runs fast and without any trouble whatsoever.

    This isn't an issue with the host, it's something with the Zabbix server. Any ideas what that might be?
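To quantify how often the checks fail and how long each pause lasts, the log excerpt above can be summarized with a short script. This is a minimal sketch, assuming the line layout shown in the excerpt (`pid:YYYYMMDD:HHMMSS.mmm message`); the function names `parse_line` and `summarize` are made up for illustration and are not part of Zabbix.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the log prefix seen in the excerpt: "pid:YYYYMMDD:HHMMSS.mmm message".
LINE_RE = re.compile(r'^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$')
FAIL_RE = re.compile(
    r'SNMP agent item "([^"]+)" on host "([^"]+)" failed: first network error')
RESUME_RE = re.compile(r'resuming SNMP agent checks on host "([^"]+)"')


def parse_line(line):
    """Return (pid, timestamp, message), or None if the line has no log prefix."""
    m = LINE_RE.match(line)
    if not m:
        return None
    pid, day, clock, msg = m.groups()
    ts = datetime.strptime(day + clock, "%Y%m%d%H%M%S.%f")
    return int(pid), ts, msg


def summarize(lines):
    """Count failures per host and time each failure until checks resume."""
    failures = Counter()
    pending = {}   # host -> (failure timestamp, item key) not yet resumed
    outages = []   # (host, item key, seconds until checks resumed)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _pid, ts, msg = parsed
        fail = FAIL_RE.search(msg)
        if fail:
            item, host = fail.groups()
            failures[host] += 1
            pending[host] = (ts, item)
            continue
        resume = RESUME_RE.search(msg)
        if resume and resume.group(1) in pending:
            start, item = pending.pop(resume.group(1))
            outages.append((resume.group(1), item, (ts - start).total_seconds()))
    return failures, outages
```

Feeding it the server log (for example `summarize(open(path))`, where `path` is wherever your server log lives) gives a per-host failure count and the gap between each "first network error" and the following "connection restored" line.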
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    There are ways to dial up Zabbix server logging for just certain subprocesses. Unfortunately, I think the process involved here is just the "poller", so raising its log level will increase logging for large parts of what Zabbix does. If you don't mind wading through a lot of debug info, it might provide a clue to what's going on.

    Zabbix is built to use the SNMP libraries on your system, so it's probably using the same net-snmp libraries that your snmpbulkwalk is linked against.

    When you created the "hosts" for these switches, did you use IP address in the host entry, or does it rely on DNS?

    Are you comfortable using tcpdump or wireshark? I'm just thinking that with tcpdump the filter language would allow you to only capture traffic to the specific port on your specific switches, which might help you identify what's happening when you get one of these intermittent "network error" issues. It should at least show you whether Zabbix is attempting to start up a connection.

    I personally would probably use 'strace' to attach to the pollers and some combination of options to record everything they're doing until catching a "network error". strace isn't for everyone, though. It requires that you have an understanding of C and the system calls that are getting executed. It's a fantastic tool for figuring out what is going wrong with a running process, but it requires a lot of systems knowledge to correctly interpret the results.


    • mdw
      Junior Member
      • Sep 2021
      • 18

      #3
      Originally posted by tim.mooney
      When you created the "hosts" for these switches, did you use IP address in the host entry, or does it rely on DNS?
      I entered the host's IP address.

      Are you comfortable using tcpdump or wireshark? I'm just thinking that with tcpdump the filter language would allow you to only capture traffic to the specific port on your specific switches, which might help you identify what's happening when you get one of these intermittent "network error" issues. It should at least show you whether Zabbix is attempting to start up a connection.
      I already looked at the communication with tcpdump. Zabbix communicates with the host normally, and then the communication suddenly seems to stop (after all, the problem always fixes itself 15 seconds later). I haven't looked too deeply yet, but I might revisit the case. At any rate, when I switch this particular host from SNMPv3 to SNMPv2, the problems go away. Still, I need to find a solution, as SNMPv3 is mandated in our network. I also added a few more large switches and only this one exhibits problems, so I can't rule out an issue with its software. But then again, command-line snmpbulkwalk runs fine...

      I personally would probably use 'strace' to attach to the pollers and some combination of options to record everything they're doing until catching a "network error". strace isn't for everyone, though. It requires that you have an understanding of C and the system calls that are getting executed. It's a fantastic tool for figuring out what is going wrong with a running process, but it requires a lot of systems knowledge to correctly interpret the results.
      I'm not sure I have deep enough knowledge to be able to interpret strace output. Also, how do I know which poller to attach to?


      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        Originally posted by mdw
        I'm not sure I have deep enough knowledge to be able to interpret strace output. Also, how do I know which poller to attach to?
        You probably wouldn't be able to choose just one, but with strace you can attach to all of them and with the right combination of options you can have strace write the trace output to different files based on pid. It would be a lot of information to review.
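As a rough sketch of the "attach to all of them, one file per pid" approach, the strace command line can be assembled like this. The helper name `build_strace_command` and the output prefix `/tmp/zbx-poller` are assumptions made for illustration; the PIDs are the ones that logged failures in post #1 and will differ on a live system.

```python
import shlex


def build_strace_command(pids, output_prefix="/tmp/zbx-poller"):
    """Build an strace argv that attaches to every given PID.

    -ff together with -o writes one file per traced process (<prefix>.<pid>),
    -tt prefixes each line with a wall-clock timestamp including microseconds,
    -e trace=network limits the output to network-related system calls.
    """
    if not pids:
        raise ValueError("at least one PID is required")
    argv = ["strace", "-ff", "-tt", "-e", "trace=network", "-o", output_prefix]
    for pid in pids:
        argv += ["-p", str(int(pid))]
    return argv


# PIDs copied from the failing log lines in post #1 (hypothetical on any other system).
poller_pids = [486974, 486975, 486976, 486977, 486978]
print(shlex.join(build_strace_command(poller_pids)))
```

The printed command has to be run as root or as the user that owns the Zabbix processes. Restricting the trace to network calls keeps the per-pid files small enough to line up against the timestamps in the server log.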


        • mdw
          Junior Member
          • Sep 2021
          • 18

          #5
          I got some more troubleshooting time with this issue (which meanwhile started appearing with another host). I analyzed a new packet capture, and I can say that the timeout happens when there's a Report PDU that contains "SNMP-USER-BASED-SM-MIB::usmStatsNotInTimeWindows.0". I am not sure what to make of it. Both the Zabbix server and the switch are synchronized to the same NTP servers. Any idea why this would be happening? Also, any idea why this would never happen when doing snmpwalk?


          • mdw
            Junior Member
            • Sep 2021
            • 18

            #6
            [Image: 01.png]

            OK, I am digging deeper into this issue, which has actually started appearing on more hosts. Above is a wire capture of a failed SNMPv3 request from Zabbix on 172.20.113.121 to a Cisco switch. Let's go through the packet dissections.

            [Image: 02.png]

            This is the initial request. All is fine here: it is an unencrypted, unauthenticated packet meant to discover the parameters required for the subsequent request (EngineID, EngineBoots, EngineTime).

            [Image: 03.png]

            This is the switch response... plaintext Report packet with required parameters (highlighted). Everything is going well, the response OID is usmStatsUnknownEngineIDs, which is as expected.

            [Image: 04.png]

            This is the subsequent request from the Zabbix server. It is now proper AuthPriv, as the required parameters are known by this time. But here is the problem: the EngineBoots and EngineTime parameters are wrong. They don't match what we got from the switch; in fact, they are wildly off (the standard requires that EngineTime be within ±150 seconds).

            [Image: 05.png]

            And of course, the switch responds with another Report PDU, this time indicating the problem. The highlighted OID is the usmStatsNotInTimeWindows error response.
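The timeliness check that produces this Report can be sketched as follows. It is a simplified reading of the rule in RFC 3414 for the authoritative engine (here, the switch), not the switch's actual code; the function name and the numbers in the usage note are made up for illustration, since the real values are only visible in the screenshots.

```python
TIME_WINDOW_SECONDS = 150        # window size defined by RFC 3414
MAX_ENGINE_BOOTS = 2147483647    # boots counter latched at its maximum value


def in_time_window(local_boots, local_time, msg_boots, msg_time):
    """Timeliness check as the authoritative engine applies it to a request.

    local_* are the engine's own snmpEngineBoots / snmpEngineTime values,
    msg_* are the values carried in the incoming authenticated message.
    """
    if local_boots == MAX_ENGINE_BOOTS:
        return False                      # engine can no longer vouch for time
    if msg_boots != local_boots:
        return False                      # boots must match exactly
    return abs(msg_time - local_time) <= TIME_WINDOW_SECONDS
```

With hypothetical values, `in_time_window(12, 500000, 12, 500100)` passes, while `in_time_window(12, 500000, 7, 91000)` fails on the boots comparison alone. A failed check is what the switch reports as usmStatsNotInTimeWindows.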

            OK, so now I have conclusive evidence that Zabbix is generating incorrect SNMPv3 requests, which are correctly rejected by the network devices. The question is, what do I do about this? At this point Zabbix is basically inoperable for me.


            • mdw
              Junior Member
              • Sep 2021
              • 18

              #7
              Originally posted by cyber
              Support case? If you have support contract...
              BUG report, with evidence... maybe receives enough interest....
              No support contract. Where can I log a bug report?


              • tim.mooney
                Senior Member
                • Dec 2012
                • 1427

                #8
                You've done a very good job digging into the problem and have already narrowed the problem down considerably. That's the kind of bug reporting that developers appreciate. I would very much hope that the Zabbix developers wouldn't ignore this.


                • mdw
                  Junior Member
                  • Sep 2021
                  • 18

                  #9
                  Originally posted by tim.mooney
                  You've done a very good job digging into the problem and have already narrowed the problem down considerably. That's the kind of bug reporting that developers appreciate. I would very much hope that the Zabbix developers wouldn't ignore this.
                  I think I have found the cause: some of the Cisco switches had duplicate (i.e. non-unique) SNMP engineIDs. I will wait a little longer before I consider this solved, but it's been half a day since I fixed the engineIDs, and so far everything is running smoothly.
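One plausible reading, which the thread itself does not confirm, is that a long-running poller caches EngineBoots and EngineTime per engineID, so two devices sharing an engineID overwrite each other's timing values, while a short-lived snmpwalk process only ever talks to one device. Under that assumption, a quick inventory check for duplicates could look like the sketch below. The engineIDs would have to be collected separately (for example with `show snmp engineID` on each switch); the function name, host names other than strS00, and the IDs in the usage note are made up.

```python
from collections import defaultdict


def find_duplicate_engine_ids(engine_ids):
    """Map each engineID used by more than one host to the sorted host list.

    engine_ids is a dict of host name -> engineID as a hex string.
    """
    hosts_by_id = defaultdict(list)
    for host, engine_id in engine_ids.items():
        # Normalize so "80:00:00:09..." and "80000009..." compare equal.
        key = engine_id.replace(":", "").replace(" ", "").lower()
        hosts_by_id[key].append(host)
    return {eid: sorted(hosts)
            for eid, hosts in hosts_by_id.items()
            if len(hosts) > 1}
```

For instance, `find_duplicate_engine_ids({"strS00": "800000090300AABBCCDDEE01", "swB": "80:00:00:09:03:00:aa:bb:cc:dd:ee:01", "swC": "800000090300AABBCCDDEE02"})` reports that strS00 and swB share one engineID and leaves swC out.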


                  • tim.mooney
                    Senior Member
                    • Dec 2012
                    • 1427

                    #10
                    My SNMP knowledge isn't bad, but you're clearly more knowledgeable about it than I am. How do duplicate engineIDs happen? Is it a firmware-level issue, a configuration issue (perhaps specific to SNMPv3?), or something else?


                    • mdw
                      Junior Member
                      • Sep 2021
                      • 18

                      #11
                      Originally posted by tim.mooney
                      My SNMP knowledge isn't bad, but you're clearly more knowledgeable about it than I am. How do duplicate engineIDs happen? Is it a firmware-level issue, a configuration issue (perhaps specific to SNMPv3?), or something else?
                      One would expect the default EngineID to be generated in a way that makes duplicates unlikely (from a system MAC address, serial number, etc.). Apparently the way Cisco IOS generates it is flawed. So from now on I'll be generating a random EngineID for all new devices.
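Generating such a random EngineID could be sketched as below, following the SnmpEngineID layout from RFC 3411: four octets holding the enterprise number with the top bit set, one format octet (5 means administratively assigned octets), then the random part. The function name and the default of 7 random octets (12 octets in total) are assumptions made for illustration; check which length your platform accepts before using the result.

```python
import secrets

CISCO_ENTERPRISE_NUMBER = 9   # IANA private enterprise number for Cisco
FORMAT_OCTETS = 5             # RFC 3411: administratively assigned octets


def random_engine_id(enterprise=CISCO_ENTERPRISE_NUMBER, random_octets=7):
    """Return a hex engineID string.

    Layout: 4-byte enterprise field with the most significant bit set,
    1 format octet, then random octets (RFC 3411 allows at most 27).
    """
    if not 1 <= random_octets <= 27:
        raise ValueError("random_octets must be between 1 and 27")
    header = (0x80000000 | enterprise).to_bytes(4, "big") + bytes([FORMAT_OCTETS])
    return (header + secrets.token_bytes(random_octets)).hex()


print(random_engine_id())
```

Every generated value starts with `8000000905` when the Cisco enterprise number is used. Keep in mind that SNMPv3 user keys are localized to the engineID, so on many devices the SNMPv3 users have to be configured again after the engineID changes.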


                      • cyber
                        Senior Member
                        Zabbix Certified SpecialistZabbix Certified Professional
                        • Dec 2006
                        • 4807

                        #12
                        This document describes the User-based Security Model (USM) for Simple Network Management Protocol (SNMP) version 3 for use in the SNMP architecture. It defines the Elements of Procedure for providing SNMP message level security. This document also includes a Management Information Base (MIB) for remotely monitoring/managing the configuration parameters for this Security Model. This document obsoletes RFC 2574. [STANDARDS-TRACK]

                        You can find the reasons for usmStatsNotInTimeWindows there...
                        But TBH, I still think there is something in Zabbix that does not work properly. If every other tool can poll the same devices without issues and Zabbix is the one that fails... well, do the math. I have read many pages where Zabbix people claim that vendors do not implement SNMP properly etc. (and it might be true in some cases). I still have a hard time believing it.
                        I see a lot of these things in our environment as well. I have just settled on ignoring them, since it works most of the time, but the logs are full of that SNMP crap...


                        • cyber
                          Senior Member
                          Zabbix Certified SpecialistZabbix Certified Professional
                          • Dec 2006
                          • 4807

                          #13
                          Support case? If you have support contract...
                          BUG report, with evidence... maybe receives enough interest.......


                          • cyber
                            Senior Member
                            Zabbix Certified SpecialistZabbix Certified Professional
                            • Dec 2006
                            • 4807

                            #14

                            You need to create an account there, and then you can submit reports.
