Zabbix 5.4.8 False High Bandwidth Monitoring

  • iconicnetworks
    Member
    • Feb 2022
    • 34

    #16
    tim.mooney Thank you for the feedback. This is an issue we fear is impacting others too, so we're trying to get to the bottom of it for us all.

    NVPS is very low on this environment currently, around 500, and for context we're writing into 3PAR all-flash storage, so the I/O is fine disk-wise. We've done half of the above previously but will absolutely follow through today on the whole process. Early discovery showed that Zabbix was collecting the correct data in raw format, but the processing of it then led to some very off results. We've already turned on full debug logging mapped to an external source, so we have the data values, but we will query SQL today as well. Overnight we've apparently hit 1997.33 Tbps on a 400G link - impressive!
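
    For anyone following along, this is roughly the sort of query we'll run against the backend to eyeball the raw stored values (MySQL here; the user, database name and itemid are placeholders - look the real itemid up in the items table first, and on PostgreSQL swap FROM_UNIXTIME() for to_timestamp()):

    mysql -u zabbix -p zabbix -e "
        SELECT FROM_UNIXTIME(clock) AS ts, value
        FROM history_uint
        WHERE itemid = 12345       -- placeholder itemid
        ORDER BY clock DESC
        LIMIT 20;"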


    • hhaiderzad
      Junior Member
      • Feb 2022
      • 1

      #17
      Hello Everyone,
      I'm new to Zabbix with zero Linux skills. The Zabbix server is already installed in our environment and we want to monitor the uplinks based on alias or description only. When I plug in the IP it pulls data for all the switch ports, which we don't want to monitor. I need help with the regular expression so I can filter on a specific alias such as: Internet ISP 1, Trunk link to The router. I would really appreciate any help, thank you in advance.



      • cyber commented
        This is not a thread for this, you should start your own...
    • tikkami
      Member
      • May 2018
      • 71

      #18
      I saw problems in a relatively small system (~100 NVPS) running Zabbix versions 5.0.10 and 5.0.19.

      However, the problematic switch has 700...800 items, most of them polled at a 1-minute interval (I tried a 5-minute interval and it didn't change the situation).

      My guess is that the Zabbix server mangles values internally when a large number of OIDs are read with bulk mode.
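
      One way to sanity-check that outside Zabbix would be to walk the same counter column with GETBULK and with plain GETNEXT and compare what the switch actually hands back (community string, address and the -Cr repetition count are placeholders):

      # GETBULK walk (roughly what Zabbix does with bulk enabled) vs. a plain GETNEXT walk
      snmpbulkwalk -v2c -c public -Cr50 192.0.2.10 IF-MIB::ifHCOutOctets
      snmpwalk -v2c -c public 192.0.2.10 IF-MIB::ifHCOutOctets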


      • iconicnetworks
        Member
        • Feb 2022
        • 34

        #19
        A short update from us. We spun up a second Zabbix instance and mirrored some traffic from two switches into both instances. It confirmed what we already suspected but wanted to be sure of: bulk requests do make this issue much worse, and the frequency of false positives greatly increases.


        • iconicnetworks
          Member
          • Feb 2022
          • 34

          #20
          We've gone through and event-correlated this as far back as we can in Excel. Of the 3k+ alerts we've had for this, almost every single one is on the 'out' counter, not 'in'. This is the same for us across devices and manufacturers. Operational data for the 'in' side is by and large stable and accurate; it just seems to be specific to the 'out' counters for now. tikkami, is this a similar pattern for you?


          • tikkami
            Member
            • May 2018
            • 71

            #21
            iconicnetworks, yes, this pattern happens in my system too.

            I guess Zabbix polls from the top of the MIB tree, so it will poll the 'in' values first and then the 'out' values. Perhaps some internal buffer has an overflow issue?

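            The MIB ordering does back that up: the 'in' octet counter column sits just ahead of the 'out' one, e.g. checked with net-snmp (OIDs from memory, worth re-verifying):

            snmptranslate -On IF-MIB::ifHCInOctets    # .1.3.6.1.2.1.31.1.1.1.6
            snmptranslate -On IF-MIB::ifHCOutOctets   # .1.3.6.1.2.1.31.1.1.1.10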

            • iconicnetworks
              Member
              • Feb 2022
              • 34

              #22
              Potentially - but this is really weird. We know this impacts multiple vendors now as well; we've seen it across a number of devices. We've removed bulk requests, which somewhat stabilises things, but we still see the false positives. I agree with your position that it's potentially a buffer or caching issue.


              • tikkami
                Member
                • May 2018
                • 71

                #23
                A bug report would be the next step.
                Before that, it should be verified that the problem still exists in the latest release.


                • iconicnetworks
                  Member
                  • Feb 2022
                  • 34

                  #24
                  Yep, it does - we've already updated, as mentioned a few posts ago. I'll look into a bug report shortly.


                  • iconicnetworks
                    Member
                    • Feb 2022
                    • 34

                    #25
                    A quick update from us: we have found a sort of workaround which has been tested for 5 days and is currently working. We have changed the trigger expression to drop the averaging window from 15m to 1m, as follows:

                    From this:

                    (avg(/Interfaces SNMP/net.if.in[ifHCInOctets.{#SNMPINDEX}],15m)>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}]) or
                    avg(/Interfaces SNMP/net.if.out[ifHCOutOctets.{#SNMPINDEX}],15m)>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}])) and
                    last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}])>0

                    To this:

                    (avg(/Interfaces SNMP/net.if.in[ifHCInOctets.{#SNMPINDEX}],1m)>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}]) or
                    avg(/Interfaces SNMP/net.if.out[ifHCOutOctets.{#SNMPINDEX}],1m)>({$IF.UTIL.MAX:"{#IFNAME}"}/100)*last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}])) and
                    last(/Interfaces SNMP/net.if.speed[ifHighSpeed.{#SNMPINDEX}])>0

                    Although not perfect, this has worked so far and we're seeing no more erroneous high-bandwidth alerts.

                    Thanks all


                    • tikkami
                      Member
                      • May 2018
                      • 71

                      #26
                      I have used data validation preprocessing to get rid of faulty enormous values.

                      This is just a dirty workaround.
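
                      Something like the built-in 'In range' validation step with 'Custom on fail: Discard value' does the trick. The maximum below is just an example ceiling for a 400G link with the item stored in bits per second - adjust it to your fastest link and to whatever unit the item actually ends up in after the multiplier step:

                      Preprocessing step: Validation -> In range
                          Minimum: 0
                          Maximum: 500000000000
                      Custom on fail: Discard value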

                      I'm going to upgrade one of my systems to the latest version and file a bug report if the problem can be reproduced.


                      • tikkami
                        Member
                        • May 2018
                        • 71

                        #27
                        I made a really interesting observation today when looking at pcap data.

                        In this case I have two items reading the ifInUcastPkts OID:
                        one to calculate packets-per-second values and one for raw values.


                        This set seems to be related to the item with raw values. The counter value increases at a constant rate.

                        "frame.time": "Mar 16, 2022 12:26:22.496782000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4079198949"

                        "frame.time": "Mar 16, 2022 12:31:22.439778000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4079859516" (diff to previous 660 567)

                        "frame.time": "Mar 16, 2022 12:36:22.514265000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4080520255" (diff to previous 660 739)

                        "frame.time": "Mar 16, 2022 12:41:22.425662000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4081180176" (diff to previous 659 921)


                        This set seems to be related to pps item.

                        "frame.time": "Mar 16, 2022 12:26:22.522068000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4079198949"

                        "frame.time": "Mar 16, 2022 12:31:22.497455000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4082365523" (diff to previous 3 166 574)

                        "frame.time": "Mar 16, 2022 12:36:22.556665000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4080520255" (diff to previous -1 845 268)

                        "frame.time": "Mar 16, 2022 12:41:22.455305000 FLE Standard Time",
                        "1.3.6.1.2.1.2.2.1.11.10277: 4081180176" (diff to previous 659 921)


                        Negative counter deltas?!?!
                        It seems that the Cisco IE-5000 switch sends faulty counter values when SNMP bulk requests are used.
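
                        For anyone who wants to repeat this: I pulled the values out of the capture with roughly the command below (file name is an example and the display-filter syntax is from memory, so double-check it), then diffed the Counter32 values by hand.

                        # show only the packets carrying this ifInUcastPkts instance, with the SNMP layer fully decoded
                        tshark -r switch.pcap -Y 'snmp.name == 1.3.6.1.2.1.2.2.1.11.10277' -O snmp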



                        • iconicnetworks
                          Member
                          • Feb 2022
                          • 34

                          #28
                          This is a really good spot. I have set up the same pcap capture on a Juniper QFX 5120 as well to see if this is the case there too. That said, I have the same issue with bulk requests turned on or off; if anything it's worse without bulk enabled.


                          • iconicnetworks
                            Member
                            • Feb 2022
                            • 34

                            #29
                            Just keeping this alive - this is still very much a current issue. We've updated to 6.0.2 and the problem persists.


                            • tikkami
                              Member
                              • May 2018
                              • 71

                              #30
                              Would there be any way to add some delay between SNMP requests for specific hosts?

                              I was just wondering whether the switch can't properly handle SNMP requests sent at too high a frequency.

                              How do the poller processes do their work? Would one host be polled in parallel by multiple pollers, or would one poller handle one host with serialized polls?
