Linux Network Traffic Spikes (net.if.in)
  • jonnjonzzn
    Junior Member
    • Sep 2007
    • 18

    #1

    On my Linux boxes I occasionally see spikes in the values collected for net.if.in and net.if.out. The spike values are beyond what the physical transport hardware can even carry: for example, I will see a spike of around 350 Gbps, while the usual load is around 3-5 Mbps. The problem is that these spikes make the graphs useless whenever the displayed time period includes one of them. I was wondering whether anyone else has seen something similar and what they did to correct or work around the problem. These are all RHEL boxes, including the Zabbix server. What am I doing wrong? The only pattern I notice is that the "bad" value is consistently collected about 10 seconds after the previous sample, instead of at the normal 5-second interval.

    Item descriptions and exports:

    Description       Key                      Interval (s)  History (days)  Trends (days)  Type          Status
    Network-eth0-IN   net.if.in[eth0,bytes]    5             7               365            ZABBIX agent  Active
    Network-eth0-OUT  net.if.out[eth0,bytes]   5             7               365            ZABBIX agent  Active

    <item type="0" key="net.if.in[eth0,bytes]" value_type="0">
    <description>Network-$1-IN</description>
    <delay>5</delay>
    <history>7</history>
    <trends>365</trends>
    <units>bps</units>
    <multiplier>1</multiplier>
    <delta>1</delta>
    <formula>8</formula>
    <snmp_port>161</snmp_port>
    </item>

    <item type="0" key="net.if.out[eth0,bytes]" value_type="0">
    <description>Network-$1-OUT</description>
    <delay>5</delay>
    <history>7</history>
    <trends>365</trends>
    <units>bps</units>
    <multiplier>1</multiplier>
    <delta>1</delta>
    <formula>8</formula>
    <snmp_port>161</snmp_port>
    </item>
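For reference: with delta = 1 ("Delta (speed per second)") and a custom multiplier (formula) of 8, each stored value is roughly 8 * (current byte counter - previous byte counter) / (seconds between the two samples). Below is a minimal POSIX shell sketch of that arithmetic, not Zabbix's own code; the function name rate_bps is hypothetical. Because the whole counter difference is divided by the elapsed time, a bogus jump in the raw counter lands in the graph directly as an enormous bps value.

```shell
#!/bin/sh
# rate_bps PREV_BYTES PREV_TS CUR_BYTES CUR_TS   (hypothetical helper)
# Mimics "Delta (speed per second)" followed by a custom multiplier of 8.
rate_bps() {
    awk -v p="$1" -v pt="$2" -v c="$3" -v ct="$4" 'BEGIN {
        dt = ct - pt
        if (dt <= 0) exit 1                  # timestamps must increase
        printf "%.4f\n", (c - p) / dt * 8    # bytes/s * 8 = bits/s
    }'
}
```

For example, rate_bps 1000 100 2000 105 prints 1600.0000: 1000 bytes in 5 seconds is 1600 bits per second.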


    Here are example values collected around the spikes (newest first):

    2007.Oct.26 09:09:24 2061299.2000
    2007.Oct.26 09:09:19 752688.0000
    2007.Oct.26 09:09:14 125880.0000
    2007.Oct.26 09:09:10 155951132954.6667
    2007.Oct.26 09:08:59 415083.2000
    2007.Oct.26 09:08:54 797884.8000
    2007.Oct.26 09:08:49 611404.0000

    2007.Oct.26 05:30:39 73660.8000
    2007.Oct.26 05:30:34 37859.2000
    2007.Oct.26 05:30:29 589180.8000
    2007.Oct.26 05:30:24 186098273800.0000
    2007.Oct.26 05:30:14 160400.0000
    2007.Oct.26 05:30:09 82140.8000
    2007.Oct.26 05:30:04 1401803.2000

    2007.Oct.25 11:46:39 280332.0000
    2007.Oct.25 11:46:35 700073.3333
    2007.Oct.25 11:46:29 220990983652.0000
    2007.Oct.25 11:46:19 749732.8000
    2007.Oct.25 11:46:14 468668.8000
    2007.Oct.25 11:46:09 337068.8000
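The posted samples can be checked mechanically for the pattern described in the question (a spike value following a roughly 10-second gap). This is a minimal sketch, assuming the history is pasted as "date time value" lines, newest first, with blank lines between groups; the function name flag_spikes and the 1 Gbps ceiling used below are hypothetical, and gaps that cross midnight are not handled.

```shell
#!/bin/sh
# flag_spikes CEILING_BPS   (hypothetical helper)
# Reads "date time value" lines, newest first, blank line between groups,
# and prints each value above CEILING_BPS together with the gap (seconds)
# back to the previous, older sample. Midnight roll-over is not handled.
flag_spikes() {
    awk -v ceil="$1" '
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        function report(gap) {
            if (have && pval > ceil + 0) print pline, "gap=" gap "s"
        }
        NF < 3 { report("?"); have = 0; next }   # blank line: close the group
        {
            # The current row is older than the pending one, so it fixes the pending gap.
            if (have) report(pts - secs($2))
            have = 1; pline = $0; pts = secs($2); pval = $3 + 0
        }
        END { report("?") }'
}
```

Fed the three groups above with a ceiling of 1000000000 (for example flag_spikes 1000000000 < history.txt, where history.txt is a hypothetical file), it should report the three spike rows with gaps of 11, 10 and 10 seconds.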

    Graph examples (images not preserved): an hour view with spikes, a week view with spikes, and a "normal" hour view.

    -Mike
  • rts
    Member
    • May 2007
    • 54

    #2
    I've got the same issue. I'd love to hear how Zabbix recommends we solve this.


    • Tarick
      Junior Member
      • Oct 2007
      • 5

      #3
      This is not Zabbix's fault.



      "The bnx2 ethernet driver occasionally produces pseudorandom statistics
      which can be seen in /proc/net/dev (such as rx byte counters
      mysteriously jumping by ~ 2^60). This is making things hard on our
      network statistics graphing applications, as this causes them to report
      incredibly enormous spikes of traffic that dwarf everything else in the
      graphs by many orders of magnitude.

      A patch to fix this has already gone into vanilla 2.6.22, and it's a
      fairly small, trivial patch that looks like it would backport easily."

      I guess fixing this bug requires upgrading to RHEL Update 6, which ships bnx2 1.5.11; I haven't seen the error on that system. The error was spotted on RHEL U5 and FC5 with older drivers.
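To see which bnx2 version a box is actually running, ethtool -i IFACE prints the driver name and version (among other fields). A small sketch that extracts just those two fields; eth0 is an example interface, the function name driver_info is hypothetical, and it assumes ethtool is installed.

```shell
#!/bin/sh
# driver_info   (hypothetical helper)
# Reads `ethtool -i IFACE` output on stdin and prints only the driver name
# and version, e.g. to confirm whether a box runs an older bnx2 build.
# Usage: ethtool -i eth0 | driver_info
driver_info() {
    awk -F': ' '$1 == "driver" || $1 == "version" { print $1 "=" $2 }'
}
```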


      • fableman
        Member
        • Oct 2007
        • 78

        #4
        Many network routers and similar devices give ping a low priority, so sending a single ping can produce peaks.

        I made a small script that pings my network devices several times to get a more accurate value.


        This script pings 10 times (-c10) at 0.25-second intervals (-i0.25) and then takes the average, so you won't get occasional peak values as answers.

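        #!/bin/sh
        # Ping host $1 ten times, 0.25 s apart, and print the average RTT in ms.
        # The last line of ping's output is "rtt min/avg/max/mdev = MIN/AVG/MAX/MDEV ms";
        # awk picks the value list and cut takes the 2nd (avg) field. cut -s prints
        # nothing when that line is missing (e.g. 100% packet loss), so we echo 0.
        tal=$(ping -c10 -i0.25 "$1" | tail -n 1 | awk '{print $4}' | cut -s -d/ -f2)
        if [ -z "$tal" ]
        then
            echo 0
        else
            echo "$tal"
        fi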

        [sms@netscan ~]$ /etc/zabbix/externalscripts/icmp_ping 127.0.0.1
        0.012


        Good luck. By the way, fping sucks!


        • Tarick
          Junior Member
          • Oct 2007
          • 5

          #5
          Originally posted by fableman
          Many network routers and similar devices give ping a low priority, so sending a single ping can produce peaks.
          -- skipped ----
          Good luck. By the way, fping sucks!
          How is that related? We measure traffic utilization on the interface, not ping response.

          For this I made custom UserParameters:
          UserParameter=usr.net.if.eth0.in, awk 'sub(/eth0:/,"") {print $1}' /proc/net/dev
          UserParameter=usr.net.if.eth0.out, awk 'sub(/eth0:/,"") {print $9}' /proc/net/dev

          and then realized that it was a driver error.
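The custom UserParameters above read raw byte counters straight from /proc/net/dev (field 1 after the colon is received bytes, field 9 is transmitted bytes). A slightly stricter variant, sketched below, compares the whole interface name instead of using sub(/eth0:/,""), which would also match an interface such as "peth0". The function name if_counters is hypothetical; it would be called as, for example, if_counters eth0 < /proc/net/dev, and wiring it into a UserParameter is left as an assumption.

```shell
#!/bin/sh
# if_counters IFACE   (hypothetical helper)
# Prints "rx_bytes tx_bytes" for IFACE from /proc/net/dev-format text on stdin.
if_counters() {
    awk -v ifc="$1" '
        {
            i = index($0, ":")            # header lines contain no colon
            if (i == 0) next
            name = substr($0, 1, i - 1)
            gsub(/[ \t]/, "", name)       # drop the padding before the name
            if (name != ifc) next         # exact match, so "peth0" != "eth0"
            split(substr($0, i + 1), f, " ")
            print f[1], f[9]              # field 1 = rx bytes, field 9 = tx bytes
        }'
}
```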


          • fableman
            Member
            • Oct 2007
            • 78

            #6
            Originally posted by Tarick
            How is that related? We measure traffic utilization on the interface, not ping response.
            It's not related at all. I mixed them up.


            • bbrendon
              Senior Member
              • Sep 2005
              • 870

              #7
              FYI: even though you say this is a bug in SNMP, I think Zabbix should support maximum values for recorded SNMP data. Nearly every SNMP device you encounter misbehaves when its 32-bit counters wrap, when it power-cycles, and so on, and you get spikes unless the monitoring software helps a little.
              Unofficial Zabbix Expert
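A sketch of the kind of "help from the monitoring software" described above: treat a counter that went backwards as a single 32-bit wrap (2^32 = 4294967296), then drop any rate above a configured maximum instead of recording it. This is a hypothetical helper (safe_rate), not a Zabbix feature; it assumes 32-bit counters, and since a counter reset after a reboot looks the same as a wrap, the maximum check is what ultimately filters out the bogus value.

```shell
#!/bin/sh
# safe_rate PREV_BYTES CUR_BYTES SECONDS MAX_BPS   (hypothetical helper)
# Prints bits per second, or nothing (exit status 1) if the sample is implausible.
safe_rate() {
    awk -v p="$1" -v c="$2" -v dt="$3" -v cap="$4" 'BEGIN {
        if (dt <= 0) exit 1              # no elapsed time, no rate
        d = c - p
        if (d < 0) d += 4294967296       # counter went backwards: assume one 32-bit wrap
        bps = d * 8 / dt                 # bytes -> bits, per second
        if (bps > cap) exit 1            # above the cap: drop instead of graphing a spike
        printf "%.4f\n", bps
    }'
}
```

For example, safe_rate 4294967000 200 5 1000000000 prints 793.6000 (496 bytes over 5 seconds after the wrap), while a jump from 0 to 1000000000000 bytes in 10 seconds exceeds the 1 Gbps cap and produces no output.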

