Zabbix 5.4.8 False High Bandwidth Monitoring

  • iconicnetworks
    Member
    • Feb 2022
    • 34

    #1

    Hi all. We are about at the end of how far we can take this, so asking the forum for some help.

    We have an instance updated to 5.4.8, and we monitor a range of devices, mostly Cisco and Juniper. However, we have a random High Bandwidth alert issue. The raw data from the devices has been checked and the bandwidth is definitely stable; it does ramp up and down, but not to these reported levels. This seems to impact both the Cisco and Juniper default templates, and we are monitoring at 1m intervals.

    This example is a 20G port channel (2 x 10G ports) sometimes showing traffic bursting to 100G, which is impossible! During these periods we are seeing traffic increasing to maybe 8-9G from the raw data.

    This seems to impact physical ports as well as LAGs.

    We've seen some people talking about changing the pre-processing order, which we've tried with no luck. We've also switched between 32-bit and 64-bit counters a few times, still with no change. This is using the default Cisco and Juniper templates, apart from changing the monitoring interval.

    Any help greatly appreciated!
  • iconicnetworks
    Member
    • Feb 2022
    • 34

    #2
    Hi all. Just to update further. We have upgraded the whole estate to 6.0 LTS. Problem persists.

    • pgatty
      Junior Member
      • Feb 2022
      • 14

      #3
      Have you checked the Zabbix template git repo to see if there are updated versions of the templates you're using?

      • tikkami
        Member
        • May 2018
        • 71

        #4
        I have exactly the same problem.
        I haven't identified the root cause yet. Database performance could be one reason.

        If you look at the numerical values in history, are there any gaps in the data before or after the peak values?

        Is there anything weird in the Zabbix server log?

        • LenR
          Senior Member
          • Sep 2009
          • 1007

          #5
          I've seen this. I don't remember the exact cause, but the device returned out-of-bounds values after patching, a fail-over or something similar. There are post-processing range validation rules that might fix this.
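
To make the range-validation idea concrete: it amounts to dropping any computed rate that the link could not physically carry. Below is a minimal Python sketch of that logic, not Zabbix code. The 20G ceiling comes from the port channel in the first post; the function name and the sample rates are made up for illustration. In Zabbix itself this would presumably be an "In range" validation step placed after "Change per second", with a custom on-fail action that discards the value.

```python
# Ceiling taken from the first post: 2 x 10G port channel.
LINK_SPEED_BPS = 20 * 10**9


def in_range(rate_bps, low=0, high=LINK_SPEED_BPS):
    """Return the rate if it is physically possible, else None (discard).

    Hypothetical helper for illustration only.
    """
    if low <= rate_bps <= high:
        return rate_bps
    return None


# Invented per-minute rates in bits per second, with one impossible spike.
samples = [8.5e9, 9.1e9, 100e9, 8.8e9]
kept = [r for r in samples if in_range(r) is not None]
print(kept)
```

Note that this only hides the symptom: the bad value is still being produced somewhere, it just never reaches the trigger.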

          • iconicnetworks
            Member
            • Feb 2022
            • 34

            #6
            Initially I thought this might be because we had modified the stock Juniper templates to improve polling frequency; however, I have since removed these templates and restored the originals, and the problem is the same. We also now know of someone else having the same issue on a different database setup (AWS).

            Nothing of any interest in the Zabbix server logs at all.

            Regarding numerical values: no, there are no gaps. Raw data from the device dumped to a log file shows the correct integers being generated by the device as well.

            • tikkami
              Member
              • May 2018
              • 71

              #7
              Some time ago, I ran snmpwalk to read the same values from the switch. All were OK there.
              If this is not about database performance, could the "Change per second" preprocessing be messing something up?
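
For anyone following along, Zabbix documents "Change per second" as (value - prev_value) / (time - prev_time). The Python sketch below uses invented counter values to show how sensitive that formula is to timestamps: if the previous sample is stamped later than the moment the counter was actually read, a full minute of growth gets divided by a few seconds. That skew is a hypothetical scenario for illustration only, not a confirmed cause of the spikes in this thread.

```python
def change_per_second(prev, curr):
    """prev and curr are (unix_time_seconds, counter_value) tuples."""
    (t0, v0), (t1, v1) = prev, curr
    # Zabbix documents that a decreasing counter value is discarded.
    # The t1 <= t0 guard is this sketch's own safety check.
    if t1 <= t0 or v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)


# Invented octet counter growing at a steady 9 Gbit/s (1.125e9 bytes/s).
BYTES_PER_SEC = 1_125_000_000


def counter_at(t):
    return BYTES_PER_SEC * t


# Well-behaved polling: both samples stamped when they were read.
steady = change_per_second((0, counter_at(0)), (60, counter_at(60)))

# Hypothetical skew: the previous value was read at t=0 but stamped t=54.
skewed = change_per_second((54, counter_at(0)), (60, counter_at(60)))

print(steady * 8 / 1e9, "Gbit/s")
print(skewed * 8 / 1e9, "Gbit/s")
```

With these numbers the steady case comes out at 9 Gbit/s and the skewed case at 90 Gbit/s, which is the same order of magnitude as the impossible 100G readings on a 20G port channel.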

              • iconicnetworks
                Member
                • Feb 2022
                • 34

                #8
                tikkami, I think this might be what it is, or at least something along these lines. Traffic is bursty (if that's a word) on this port. It will go from 200Mb to 10Gb in a few seconds. I wonder if Zabbix is mis-calculating that burst and forecasting or projecting the data? I now know of at least 3 other people having the same issue, all with the same symptoms. Interestingly though, at the moment this 'seems' limited to Juniper devices; it's not impacting Cisco devices in the same way as yet. Any thoughts on how to solve this?

                • tikkami
                  Member
                  • May 2018
                  • 71

                  #9
                  I have seen this with Cisco Catalyst and IE series switches.

                  • tikkami
                    Member
                    • May 2018
                    • 71

                    #10
                    I collected some data today from a Cisco switch.

                    Item OID: IF-MIB::ifOutUcastPkts.
                    This item uses only the "Change per second" preprocessing step.

                    [Image: false_values.png]

                    Maybe the next step is to add items that collect the raw data into Zabbix...

                    • iconicnetworks
                      Member
                      • Feb 2022
                      • 34

                      #11
                      I've been doing a bit of digging into patterns around this. I'm currently looking into why the resolution message for this issue is sent 15 mins after the event actually 'clears' in Zabbix. I think we know why this is, but we've had another sequence of events this evening causing more of the same high bandwidth alerts. This time, though, traffic on this Juniper switch was stable throughout and wasn't ramping up.

                      • tikkami
                        Member
                        • May 2018
                        • 71

                        #12
                        I added a new item to collect the raw counter value from the switch.

                        The Zabbix server/database is definitely messing up the collected data somehow.
                        The counter value should only increase (snmpwalk shows steady growth).

                        Here is a graph of the ifOutUcastPkts counter.

                        [Image: pps.png]
                        Last edited by tikkami; 22-02-2022, 13:05.
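
The "counter should only increase" check can also be run over exported history instead of being read off a graph. Here is a minimal Python sketch, assuming the samples are (clock, value) pairs exported from the raw-counter item. The function name and the numbers are invented for illustration.

```python
def find_counter_regressions(samples):
    """Return (prev, curr) pairs where the stored counter went backwards.

    samples: list of (clock, value) tuples, ordered by clock.
    """
    regressions = []
    for prev, curr in zip(samples, samples[1:]):
        if curr[1] < prev[1]:
            regressions.append((prev, curr))
    return regressions


# Invented history for a packet counter polled every 60 seconds.
history = [
    (1645530000, 1_000_000),
    (1645530060, 1_060_000),
    (1645530120, 1_030_000),  # lower than the previous stored value
    (1645530180, 1_180_000),
]

bad = find_counter_regressions(history)
print(bad)
```

A genuine counter wrap or a device reboot would show up as a regression too, so any hit still needs comparing against what snmpwalk reports for the same period.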

                        • tikkami
                          Member
                          • May 2018
                          • 71

                          #13
                          The collected values seem to be better when "Use bulk requests" is NOT enabled.

                          • iconicnetworks
                            Member
                            • Feb 2022
                            • 34

                            #14
                            We've now got similar results to you. We had previously tried different test conditions with bulk requests, accepting the CPU difference, and had similar results. Ultimately, there is definitely something not right with data processing in Zabbix here. A bit more searching shows a few more people having similar issues now too.

                            • tim.mooney
                              Senior Member
                              • Dec 2012
                              • 1427

                              #15
                              I've been following this thread with interest -- not because it's an issue that's impacting my site, but because you both have been doing a good job of debugging, which is a nice change of pace for questions on these forums. :-)

                              Depending on how high your new values per second (NVPS) is for your Zabbix server, what you might want to consider doing is increasing the debug level for your SNMP collectors, so that they log a lot more of what they're doing. It will eat a lot of log space on your Zabbix server (and cause some additional disk I/O, but hopefully you have that to spare), but it might be valuable in shedding light on the problem. If you can use graphs to find an incorrect peak and use SQL to verify that an incorrect value got inserted into your database, with sufficient debug logs you could trace it back and hopefully see what Zabbix read and what got inserted. Look at the '-R' (runtime control) option for zabbix_server, to dynamically increase debugging for just certain subprocesses, if this is something you are interested in pursuing.
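
As a rough illustration of the runtime control suggestion, the Python sketch below only builds the command line and prints it, so it can be reviewed before being run on the server. It assumes the binary is called zabbix_server and is on the PATH, and that the "poller" process type is the one doing the SNMP collection on these versions; both are assumptions to verify against your own install. The helper name is hypothetical.

```python
import shlex

ALLOWED_ACTIONS = ("log_level_increase", "log_level_decrease")


def runtime_control_args(action, target="poller", binary="zabbix_server"):
    """Build the argument list for a zabbix_server -R log level change.

    target is a process type such as "poller", or "poller,3" to address
    a single poller process. Hypothetical helper for illustration only.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return [binary, "-R", f"{action}={target}"]


# Raise the log level for all pollers by one step.
cmd = runtime_control_args("log_level_increase")
print(shlex.join(cmd))
```

Each increase raises the level by one step, so it may need repeating to reach full debug, and the matching log_level_decrease should be run afterwards to bring the logging back down.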

                              Good luck and please update this thread as you make progress on the issue.
