Zabbix 2.2.3 + NetApp SNMP problem

  • roby
    Junior Member
    • Feb 2013
    • 13

    #1

    Zabbix 2.2.3 + NetApp SNMP problem

    This was first discussed here: https://support.zabbix.com/browse/ZBX-8145
    and here:


    Let's create another thread, because as it turns out this is a different issue from the one discussed above.

    We have found that only the CPU utilization values are displayed wrong, for both of our NetApp storage systems. Other values that I checked at random are being collected correctly.
    As you can see in the screenshot I attached to the other forum thread, this started after the Zabbix upgrade from 2.2.1 to 2.2.3.


    edit:

    If I query manually with snmpget, I get normal values:
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 17
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 10
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 25
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 13
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 9
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 10
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 17
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 18
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 30
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 33
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 34
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 48
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 34
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 31
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 26
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 24
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 26
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 24
    root@zabbix:~# snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0
    .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 25
    root@zabbix:~#


    At the same time, Zabbix gets these values:
    2014.Apr.29 17:23:55 0
    2014.Apr.29 17:22:54 98
    2014.Apr.29 17:21:55 75
    2014.Apr.29 17:20:54 0
    Last edited by roby; 29-04-2014, 16:24.
  • asaveljevs
    Zabbix developer
    • Feb 2010
    • 36

    #2
    What happens when you query multiple values together?

    For instance, based on netapp.tcpdump.txt in the other thread:

    # snmpget -v 2c -c public -On 10.14.0.2 .1.3.6.1.4.1.789.1.2.1.3.0 .1.3.6.1.4.1.789.1.2.4.4.0 .1.3.6.1.4.1.789.1.2.4.1.0 .1.3.6.1.4.1.789.1.7.3.1.1.2.0 .1.3.6.1.4.1.789.1.7.1.1.0 .1.3.6.1.4.1.789.1.2.5.1.0 .1.3.6.1.4.1.789.1.7.2.12.0 .1.3.6.1.4.1.789.1.2.3.2.0 .1.3.6.1.4.1.789.1.2.2.4.0 .1.3.6.1.4.1.789.1.7.2.13.0 .1.3.6.1.4.1.789.1.6.12.0 .1.3.6.1.4.1.789.1.7.3.1.1.4.0 .1.3.6.1.4.1.789.1.7.3.1.1.1.0 .1.3.6.1.4.1.789.1.2.3.8.0 .1.3.6.1.4.1.789.1.7.2.9.0 .1.3.6.1.4.1.789.1.2.3.1.0 .1.3.6.1.4.1.789.1.7.3.1.1.9.0 .1.3.6.1.4.1.789.1.3.1.2.1.0 .1.3.6.1.4.1.789.1.7.3.1.1.6.0 .1.3.6.1.4.1.789.1.2.2.25.0 .1.3.6.1.4.1.789.1.7.3.1.1.8.0 .1.3.6.1.4.1.789.1.6.4.8.0 .1.3.6.1.4.1.789.1.2.3.4.0 .1.3.6.1.4.1.789.1.6.4.11.0 .1.3.6.1.4.1.789.1.6.4.7.0 .1.3.6.1.4.1.789.1.7.3.1.1.10.0 .1.3.6.1.4.1.789.1.5.7.1.0 .1.3.6.1.4.1.789.1.7.3.1.1.5.0 .1.3.6.1.4.1.789.1.2.4.2.0

    Here, the OID for CPU load will be the first value returned.
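To compare a multi-OID reply against the single-OID replies, the numeric output can be collected into a dictionary. This is a minimal Python sketch that assumes the `OID = TYPE: value` line format shown in the snmpget transcripts in this thread; the function name is made up for illustration.

```python
import re

# Matches lines such as ".1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 17"
_LINE = re.compile(r"^\s*(\.[0-9.]+)\s*=\s*([\w-]+):\s*(.*)$")

CPU_BUSY_PERCENT_OID = ".1.3.6.1.4.1.789.1.2.1.3.0"


def parse_snmpget_output(text):
    """Return {oid: value} for every 'OID = TYPE: value' line in text."""
    values = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # skip prompts, blank lines and anything unexpected
        oid, _type_name, raw = match.groups()
        try:
            values[oid] = int(raw)
        except ValueError:
            values[oid] = raw.strip()  # keep non-numeric values as text
    return values


sample = """\
.1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 17
.1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 10
"""
print(parse_snmpget_output(sample)[CPU_BUSY_PERCENT_OID])
```

If the same OID appears more than once, the last value wins, so parse each snmpget run separately.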

    • asaveljevs
      Zabbix developer
      • Feb 2010
      • 36

      #3
      In general, my first idea was the same as richlv's in the other thread:

      as for netapp device, would be interesting to gather data for an hour with each version and compare the average.
      wild guess - preparing an answer for getbulk makes cpu very busy for some period of time, but overall busy rates might be lower - the load is more concentrated, but the device also spends more time doing nothing

      • roby
        Junior Member
        • Feb 2013
        • 13

        #4
        I get very inconsistent results. Zabbix polls the NetApp via SNMP once per minute, at the 54th second of each minute, so I ran the same multi-OID query manually with snmpget at the same moment.

        Here are a few results:
        Zabbix:
        2014.Apr.30 11:11:55 50
        2014.Apr.30 11:08:55 0
        2014.Apr.30 10:38:54 0
        2014.Apr.30 10:37:54 0
        2014.Apr.30 10:35:54 97
        2014.Apr.30 10:34:54 0

        snmpget:
        Wed Apr 30 10:34:54 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 25
        Wed Apr 30 10:35:54 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 72
        Wed Apr 30 10:37:54 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 55
        Wed Apr 30 10:38:54 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 97
        Wed Apr 30 11:08:55 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 23
        Wed Apr 30 11:08:55 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 36
        Wed Apr 30 11:11:55 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 30
        Wed Apr 30 11:11:55 EEST 2014
        .1.3.6.1.4.1.789.1.2.1.3.0 = INTEGER: 34


        The snmpget results seem more realistic; at least we do not get 0.
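Following the earlier suggestion to compare averages, here is a quick Python sketch that uses only the samples quoted in this post. Six and eight samples are far too few to draw conclusions, and the two lists do not line up one-to-one in time, so treat this as the shape of the comparison rather than a result.

```python
from statistics import mean

# Values copied from the two listings in this post.
zabbix_samples = [50, 0, 0, 0, 97, 0]
snmpget_samples = [25, 72, 55, 97, 23, 36, 30, 34]


def summarize(samples):
    """Return the mean and the number of zero readings for a sample list."""
    return {"mean": mean(samples), "zeros": samples.count(0)}


print("zabbix :", summarize(zabbix_samples))
print("snmpget:", summarize(snmpget_samples))
```

The same function could be run over an hour of data from each Zabbix version to see whether only the spread changed or the average as well.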

        I have attached more results in 2 files.


        edit:
        I have reached my attachment quota on this forum, so I uploaded a screenshot here:


        It shows the actual load on this NetApp storage. The screenshot is from the NetApp management console.
        The CPU load is between 30% and 60%.
        Last edited by roby; 30-04-2014, 10:52.

        • jpka
          Junior Member
          • Sep 2013
          • 24

          #5
          A few more things.
          1. As far as I can see, there is no data loss with NetApp, unlike in my issue (ZBX-8145). Still, the results are not useful, because the requests are aggregated and therefore shifted on the timeline too much (much more than 1 s); the jitter applied is around +/- 30 s.
          2. Making the CPU load swing slowly with a huge amplitude gradually kills the CPU through thermal expansion and contraction cycles. (Remember CPU cooler software? For industrial applications there is the reverse option, a 100% CPU loader, which is often used to eliminate thermal cycles.)
          3. If we make the bulk request feature optional, we definitely solve this issue, my issue, and all similar future problems. So if there is some specific reason why the Zabbix team refuses to make this feature optional, for example via the config file (say, it may not fit the software's design philosophy), it would be nice to know.
          Thanks!

          • asaveljevs
            Zabbix developer
            • Feb 2010
            • 36

            #6
            jpka, regarding (1) in your post, could you please elaborate on "shifted in timeline"?

            According to zabbix.zip from roby's post, the values are consistently queried at 54 and 55 seconds, so there is no shift:

            2014.Apr.30 10:33:54 98
            2014.Apr.30 10:34:54 0
            2014.Apr.30 10:35:54 97
            2014.Apr.30 10:36:54 96
            2014.Apr.30 10:37:54 0
            2014.Apr.30 10:38:54 0
            2014.Apr.30 10:39:54 0
            2014.Apr.30 10:40:54 0
            2014.Apr.30 10:41:54 30
            2014.Apr.30 10:42:55 0
            Regarding (3), the answer is simple: in order to provide a solution, we need to understand the problem first. In your case (ZBX-8145), the device does not seem to conform to SNMP RFC3416 (see http://tools.ietf.org/html/rfc3416#section-4.2.1), because it returns fewer variables than it is expected to.

            Zabbix is under no obligation to support such devices, but it may provide a workaround. Currently, we see two options: (a) improve the AI to detect such cases and (b) provide a way to disable SNMP bulk for each interface. For (a), we need to understand how widespread such behavior is to warrant a workaround in Zabbix. It might be that a better solution would be to write to the authors of the SNMP stack on this device and ask them to make it standards-conformant. For (b), we can only do this in a major version, because it requires database changes and database changes can only be done in major versions; 2.4 is already feature complete, so it would probably be 2.6. Introducing a global configuration parameter for disabling SNMP bulk is not an option, because it would disable SNMP bulk globally, even for devices that are OK with it.

            Regarding NetApp, as mentioned above, we need to understand the problem first before proposing a solution.

            • tatapoum
              Senior Member
              • Jan 2014
              • 185

              #7
              That really looks like a poor implementation of the SNMP standard. I monitor a few dozen network and storage devices with SNMP and Zabbix 2.2.3 and see no issue at all with the bulk mode. Most of these are Cisco and Brocade switches, firewalls, etc.
              Having the CPU jump around 60% load due to SNMP bulk gets is scary...
              Most modern SNMP monitoring software uses SNMP walks and bulk gets to query devices, so this behavior would also show up with products other than Zabbix.
              What does NetApp say about this? I agree with asaveljevs: you should ask your manufacturer to investigate this, so the Zabbix team can find the right workaround.

              • asaveljevs
                Zabbix developer
                • Feb 2010
                • 36

                #8
                roby, could you please post a tcpdump of Zabbix to NetApp communication for a longer period of time, say, several hours? Ideally, a pcap file rather than plain text.

                http://support.ipmonitor.com/mibs/NE...-MIB/tree.aspx says that .1.3.6.1.4.1.789.1.2.1.3.0 (cpuBusyTimePerCent) is "The percent of time that the CPU has been doing useful work since the last time a client requested the cpuBusyTimePerCent". Could it be that someone else is querying this device for CPU statistics?
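To illustrate why a second client could matter, here is a toy Python model of a "percent busy since the last request" counter. It is only a sketch of the MIB description quoted above, not of the real NetApp agent; the load pattern, the client names and the poll times are invented for the example.

```python
class ResetOnReadCpuCounter:
    """Toy model: reports percent busy since the previous read, then resets."""

    def __init__(self):
        self.busy_seconds = 0
        self.elapsed_seconds = 0

    def tick(self, busy):
        """Advance the model by one second of busy or idle time."""
        self.elapsed_seconds += 1
        if busy:
            self.busy_seconds += 1

    def read(self):
        """Return the busy percentage since the last read and reset."""
        if self.elapsed_seconds == 0:
            return 0
        percent = round(100 * self.busy_seconds / self.elapsed_seconds)
        self.busy_seconds = 0
        self.elapsed_seconds = 0
        return percent


def is_busy(second):
    # Hypothetical load: busy for 10 s out of every 25 s (40% on average).
    return second % 25 < 10


def simulate(duration, polls):
    """polls maps a client name to the set of seconds at which it reads."""
    counter = ResetOnReadCpuCounter()
    seen = {client: [] for client in polls}
    for second in range(duration):
        counter.tick(is_busy(second))
        now = second + 1
        for client, times in polls.items():
            if now in times:
                seen[client].append(counter.read())
    return seen


alone = simulate(180, {"zabbix": {54, 114, 174}})
shared = simulate(180, {"other": {51, 111, 171}, "zabbix": {54, 114, 174}})
print("zabbix alone      :", alone["zabbix"])
print("zabbix after other:", shared["zabbix"])
```

With this pattern the lone poller sees values close to the 40% average, while the poller that reads three seconds after another client sees 100, 0 and 0, because its measurement window shrinks to those three seconds.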

                Could you please also try using .1.3.6.1.4.1.789.1.2.1.2.0 (cpuBusyTime) and see whether it gives sensible data?
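If cpuBusyTime turns out to be a cumulative counter, a percentage can be derived from two consecutive polls, which makes the result independent of who else reads the value. The Python sketch below assumes that both cpuBusyTime and an uptime value from the same request are in the same tick unit and only ever increase; please check the MIB before relying on that. The function and parameter names are made up.

```python
def cpu_busy_percent(prev_busy, curr_busy, prev_uptime, curr_uptime):
    """Percent busy between two polls of cumulative tick counters.

    Returns None when the interval is unusable, for example after a
    reboot or a counter wrap, so the caller can skip that sample.
    """
    busy_delta = curr_busy - prev_busy
    elapsed = curr_uptime - prev_uptime
    if elapsed <= 0 or busy_delta < 0:
        return None
    return 100.0 * busy_delta / elapsed


# Two hypothetical polls one minute apart (6000 ticks of 1/100 s each).
print(cpu_busy_percent(1000, 3400, 50000, 56000))
```

Because the counter is not reset by reads, another client polling in between does not change the delta between our own two samples.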

                • jpka
                  Junior Member
                  • Sep 2013
                  • 24

                  #9
                  Hi!
                  "could you please elaborate on 'shifted in timeline'?"
                  I am sorry, I am not really sure; it was just a guess. I thought that, in order to aggregate values effectively, Zabbix might wait extra time, adding some random jitter to the timed requests, but right now I can't confirm it. (I am not familiar with NetApp or with full-featured SNMP, sorry.)

                  The following is off-topic here because it is not directly related to the subject.
                  "That really looks like a poor implementation of the SNMP standard."
                  "In your case (ZBX-8145), the device does not seem to conform to SNMP RFC3416"
                  Yes, it is a real-time, industrial, assembler-written SNMP stack for a device with 1 kB of RAM, so it has some limitations. It can't (easily) be fixed to support RFC 3416, section 4.2.1.
                  "(a) improve the AI to detect such cases and (b) provide a way to disable SNMP bulk for each interface."
                  Please do not spend your effort on (a), even though it seems easy, because it affects only me (and, with high probability, other low-end devices, but no such case has been reported so far), and Zabbix was not written for tiny/limited devices.
                  (b) is good, but it is quite hard and requires a lot of programmer time. It does solve all current and future problems (though it may not be the best way to solve them).
                  For me, the absolutely brilliant option is
                  "Introducing a global configuration parameter for disabling SNMP bulk".
                  No DB change is needed. And I know exactly which devices I have in my network, and there are definitely no "fat" ones, nor are any planned.
                  Thanks so much to the Zabbix team.

                  • asaveljevs
                    Zabbix developer
                    • Feb 2010
                    • 36

                    #10
                    Originally posted by jpka
                    Yes, it is realtime industrial assembler-written SNMP stack for 1 kB RAM device, so it have some limitations. It can't be (easy) fixed to support RFC3416.4.2.1.
                    http://tools.ietf.org/html/rfc3416#section-4.2.1 leaves a possibility for a device to return "genErr" or "tooBig". Perhaps the stack can be modified so that it returns this error when multiple variable bindings are detected?
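As a rough illustration of that idea, here is a Python sketch of an agent that answers "tooBig" whenever a GetRequest carries more variable bindings than it can handle. The error-status numbers and the shape of the alternate response (same request-id, error-index zero, empty variable bindings) follow RFC 3416; the class, the function and the `max_varbinds` limit are hypothetical and do not describe the actual device firmware.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# error-status values defined in RFC 3416
NO_ERROR = 0
TOO_BIG = 1
GEN_ERR = 5


@dataclass
class ResponsePdu:
    request_id: int
    error_status: int
    error_index: int
    varbinds: List[Tuple[str, object]] = field(default_factory=list)


def handle_get(request_id: int, oids: List[str],
               lookup: Callable[[str], object],
               max_varbinds: int = 1) -> ResponsePdu:
    """Answer a GetRequest, refusing requests the tiny agent cannot fit."""
    if len(oids) > max_varbinds:
        # Alternate response: tooBig, error-index 0, no variable bindings.
        return ResponsePdu(request_id, TOO_BIG, 0, [])
    return ResponsePdu(request_id, NO_ERROR, 0,
                       [(oid, lookup(oid)) for oid in oids])


# Hypothetical single-value agent that reports 17 for every OID.
reply = handle_get(42, [".1.3.6.1.4.1.789.1.2.1.3.0", ".1.3.6.1.4.1.789.1.2.1.2.0"],
                   lookup=lambda oid: 17)
print(reply.error_status == TOO_BIG)
```

A manager that receives such an explicit error can retry with fewer variables, which is not possible when the agent silently returns fewer variables than were requested.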

                    Originally posted by jpka
                    For me, the absolute brilliant option is "Introducing a global configuration parameter for disabling SNMP bulk"
                    . No DB change need. And i really know which devices i have in my network, and definitely no and never planned fat ones.
                    In your case, if you wish to disable SNMP bulk completely, you can apply the following patch and recompile the server.

                    In src/libs/zbxdbcache/dbconfig.c there is a function called DCconfig_get_suggested_snmp_vars(). If you make it always return 1, it should start working for you.
