
A device randomly stops to be monitored via snmp (timeout while connecting)

  • brynza
    Junior Member
    • Jun 2016
    • 3

    #1


    Hi,

    I have an issue with using snmpv3 with some devices.

    Zabbix server version: 3.0.1
    Net-snmp version: 5.7.2

    I added two Cisco routers to Zabbix and linked the same template to both.
    One of them is monitored without any problem.
    The other router was monitored without errors for 10 hours and then became 'unavailable by snmp' with the error "Timeout while connecting to..."

    Meanwhile, snmpget works fine.
    The tcpdump output of a failed check shows that the router sends no response containing data; the only thing that comes back is a report about dropped requests:

    Code:
    16:26:04.064796 IP 192.168.10.252.52075 > 10.11.11.254.snmp:  F=r U= E=  C= GetRequest(14)
    16:26:04.082442 IP 192.168.10.252.37416 > 10.11.11.254.snmp:  F=apr U=snmpuser [!scoped PDU]0f_92_ab_8e_e7_a2_a9_8d_13_27_d2_60_8f_1e_ae_c6_c3_61_cc_12_0b_05_d1_f7_ac_6c_1d_86_ac_7c_59_1f_89_48_98_1c_d7_04_49_26_80_ce_a3_91_9a_0d_d4_5c_68_91_8a_48_ed_5a_99_cb
    16:26:04.087753 IP 10.11.11.254.snmp > 192.168.10.252.52075:  F= U= E= 0x800x000x000x090x030x000xEC0xBD0x1D0x280x510x58 C= Report(32)  .1.3.6.1.6.3.15.1.1.4.0=3655
    16:26:04.088121 IP 192.168.10.252.52075 > 10.11.11.254.snmp:  F=apr U=snmpuser [!scoped PDU]58_62_f2_9b_52_e0_1a_53_5b_9a_cc_8a_a1_a7_f5_67_9e_d7_9a_94_70_98_2d_77_b3_fd_2d_bc_0c_e7_44_9d_57_22_e9_4e_38_25_0f_3d_b7_33_c7_7c_8f_a5_60_aa_1c_f1_33_2c_27_c0_fc_92
    16:26:19.103239 IP 192.168.10.252.52075 > 10.11.11.254.snmp:  F=apr U=snmpuser [!scoped PDU]3f_c8_b6_bb_39_b6_aa_8b_cb_21_89_6b_bf_3d_59_31_5f_da_87_f4_d5_03_a3_4a_1d_f3_29_6a_94_f0_a0_7f_00_82_b8_ed_92_99_05_2f_ae_43_92_ef_6d_7d_4f_74_78_dc_84_89_c2_29_2c_3c
    16:26:19.113663 IP 192.168.10.252.59323 > 10.11.11.254.snmp:  F=r U= E=  C= GetRequest(14)
    16:26:19.148463 IP 10.11.11.254.snmp > 192.168.10.252.59323:  F= U= E= 0x800x000x000x090x030x000xEC0xBD0x1D0x280x510x58 C= Report(32)  .1.3.6.1.6.3.15.1.1.4.0=3656
    16:26:19.148726 IP 192.168.10.252.59323 > 10.11.11.254.snmp:  F=apr U=snmpuser [!scoped PDU]c1_06_82_ff_f0_26_04_6c_78_f4_27_aa_f5_ec_e1_39_d2_29_e0_25_51_0c_21_8f_a4_bd_0f_b3_f9_df_bd_23_ea_50_84_66_a2_37_f8_2f_23_7c_39_6c_43_01_41_4f_77_5f_af_9b_cd_04_46_f9
    When I run snmpget on zabbix server I have the correct output:
    Code:
    snmpget -v3 -l authPriv -u snmpuser -a sha -A 'authpass' -x AES -X privpass 10.11.11.254 1.3.6.1.4.1.9.9.48.1.1.1.6.2
    
    SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.2 = Gauge32: 24544240
    And this is what tcpdump shows in this case:

    Code:
    19:16:00.267284 IP 192.168.10.252.41494 > 10.11.11.254.snmp:  F=r U= E=  C= GetRequest(14)
    19:16:00.294419 IP 10.11.11.254.snmp > 192.168.10.252.41494:  F= U= E= 0x800x000x000x090x030x000xEC0xBD0x1D0x280x510x58 C= Report(32)  .1.3.6.1.6.3.15.1.1.4.0=3774
    19:16:00.294682 IP 192.168.10.252.41494 > 10.11.11.254.snmp:  F=apr U=snmpuser [!scoped PDU]1a_a9_1e_3f_a3_51_55_a0_b0_46_c7_4c_df_1c_f5_40_79_a1_7c_8d_29_c1_42_0b_0a_70_c4_d1_5b_2b_e6_15_c5_a1_da_41_3e_3a_45_73_d0_b6_f5_f5_b0_ff_03_6b_b8_2e_2f_08_22
    19:16:00.318018 IP 10.11.11.254.snmp > 192.168.10.252.41494:  F=ap U=snmpuser [!scoped PDU]92_80_40_54_90_6a_bd_ae_56_68_d0_51_d6_9e_82_5e_3f_f2_68_a1_37_b0_55_8b_66_4a_9f_62_6d_5a_ff_49_7c_95_f0_ec_87_c3_55_c8_80_35_e8_ef_64_ab_cd_97_42_93_f7_9b_9d_38_4e_cf_3c
    I've read a lot about possible causes but still have no answer.
    I should also mention that the network configuration hasn't changed, and that I've checked all the EngineIDs on my network and they are all unique.

    Has anybody had the same problem?
    Last edited by brynza; 15-06-2016, 14:20.
  • viktorkho
    Member
    • Jul 2013
    • 90

    #2
    SNMP is a low priority process as far as the CPU scheduler is concerned, so another process requiring CPU resources takes priority. Therefore, while CPU spikes occur in this scenario, they should not affect performance.
    This document explains how to troubleshoot high CPU utilization in a router due to the SNMP ENGINE process running in the router, especially in low end routers.


    > When I run snmpget on zabbix server I have the correct output

    Try a loop instead of a single check:
    Code:
    while true; do snmpget ...; sleep N; done
    Also tune UnavailableDelay, UnreachableDelay and UnreachablePeriod in your zabbix_server.conf to avoid long periods of item inactivity.
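
A minimal sketch of such a loop that also timestamps each failure, so it can be matched against the Zabbix log later. The function name `probe_loop` is hypothetical; the snmpget arguments in the usage comment are placeholders taken from the first post, and the `-t`/`-r` values (net-snmp's timeout and retry options) are assumptions to adjust.

```shell
#!/bin/sh
# Run a probe command COUNT times, INTERVAL seconds apart.
# Each failure is logged to stderr with a timestamp; the total
# number of failures is printed to stdout at the end.
# Usage: probe_loop COUNT INTERVAL COMMAND [ARGS...]
probe_loop() {
    count=$1
    interval=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # Discard the probe's own output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date '+%Y-%m-%d %H:%M:%S') probe failed: $*" >&2
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$failures"
}

# Example (placeholder credentials and values, adjust before use):
# probe_loop 1000 15 snmpget -v3 -l authPriv -u snmpuser -a SHA -A 'authpass' \
#     -x AES -X 'privpass' -t 15 -r 0 10.11.11.254 1.3.6.1.4.1.9.9.48.1.1.1.6.2
```

Setting retries to 0 makes every lost packet visible as a failure instead of being hidden by a silent retry.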
    Last edited by viktorkho; 09-06-2016, 12:38. Reason: typo fixed


    • brynza
      Junior Member
      • Jun 2016
      • 3

      #3
      Thanks for the reply.
      I could believe that SNMP's low priority is the reason, but there is no high CPU usage on the router.

      Originally posted by viktorkho
      > When I run snmpget on zabbix server I have the correct output

      Try a loop instead of a single check:
      Code:
      while true; do snmpget ...; sleep N; done
      Also tune UnavailableDelay, UnreachableDelay and UnreachablePeriod in your zabbix_server.conf to avoid long periods of item inactivity.
      I tried a loop and it runs without timeouts.

      The current settings are:
      Timeout=15
      UnreachablePeriod=180
      UnreachableDelay=60
      UnavailableDelay=180
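
To compare these values across servers, the relevant parameters can be pulled straight from the config file. A small sketch; the function name `show_timing` is hypothetical, and the path in the usage comment is only a common default, not something stated in this thread.

```shell
#!/bin/sh
# Print the timeout and availability-related parameters that are
# explicitly set (not commented out) in a Zabbix server config file.
# Usage: show_timing CONFIG_FILE
show_timing() {
    grep -E '^(Timeout|UnreachablePeriod|UnreachableDelay|UnavailableDelay)=' "$1"
}

# Example (assumed default location, adjust to your installation):
# show_timing /etc/zabbix/zabbix_server.conf
```

Parameters that are missing from the output are not set in the file, so the server falls back to its built-in defaults for them.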

      Also today the second router fell into the same state.
      Here is the 'show snmp' output on that router:

      Code:
      104748 SNMP packets input
          0 Bad SNMP version errors
          42 Unknown community name
          0 Illegal operation for community name supplied
          19534 Encoding errors
          41036 Number of requested variables
          0 Number of altered variables
          37837 Get-request PDUs
          0 Get-next PDUs
          0 Set-request PDUs
          0 Input queue packet drops (Maximum queue size 1000)
      85172 SNMP packets output
          1 Too big errors (Maximum packet size 1500)
          0 No such name errors
          0 Bad values errors
          0 General errors
          0 Response PDUs
          0 Trap PDUs
      SNMP Dispatcher:
         queue 0/75 (current/max), 0 dropped
      SNMP Engine:
         queue 0/1000 (current/max), 0 dropped
          0 Unknown Security Models
          0 SNMP Invalid Messages
          0 SNMP Unknown PDU handlers
          0 Unsupported Security Level
          0 Unknown User Names
          47249 Unknown EngineIDs
          0 Not In Time Windows
          0 Wrong MD5 or SHA Digests
          0 Decryption Errors
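
The Report PDUs in the captures carry the OID .1.3.6.1.6.3.15.1.1.4.0, which is one of the USM statistics counters defined in RFC 3414. A small lookup sketch for decoding such OIDs when reading tcpdump output; the function name `usm_stat_name` is hypothetical. Note that a Report with usmStatsUnknownEngineIDs is also the normal reply to the initial engine ID discovery request, so on its own a growing "Unknown EngineIDs" counter may not indicate a fault.

```shell
#!/bin/sh
# Map a USM statistics OID (RFC 3414, usmStats group) to its object name.
# Accepts the OID with or without a leading dot.
# Usage: usm_stat_name OID
usm_stat_name() {
    case "${1#.}" in
        1.3.6.1.6.3.15.1.1.1.0) echo usmStatsUnsupportedSecLevels ;;
        1.3.6.1.6.3.15.1.1.2.0) echo usmStatsNotInTimeWindows ;;
        1.3.6.1.6.3.15.1.1.3.0) echo usmStatsUnknownUserNames ;;
        1.3.6.1.6.3.15.1.1.4.0) echo usmStatsUnknownEngineIDs ;;
        1.3.6.1.6.3.15.1.1.5.0) echo usmStatsWrongDigests ;;
        1.3.6.1.6.3.15.1.1.6.0) echo usmStatsDecryptionErrors ;;
        *) echo unknown; return 1 ;;
    esac
}

# Example:
# usm_stat_name .1.3.6.1.6.3.15.1.1.4.0
```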


      • viktorkho
        Member
        • Jul 2013
        • 90

        #4
        In any case, the first thing you need to do is localize the bug.

        Can you set DebugLevel=4, and LogFileSize=0 to disable automatic log rotation?

        Then run something like this in the background: 'tcpdump -vvv -nn udp port snmp and host $zabbix_server and host $cisco_switch > $logfile'

        Then wait for the "Timeout while connecting to..." error and compare the two log files. UDP does not guarantee delivery.

        You can also use logrotate to rotate the logs.
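
The waiting step can be scripted so the capture stops soon after the error shows up. A rough sketch, assuming the error text appears verbatim in the server log; the function name `wait_for_pattern` is hypothetical, and the log path and variables in the usage comment are placeholders.

```shell
#!/bin/sh
# Poll FILE until a line containing PATTERN (fixed string) appears.
# Prints the most recent matching line and returns 0 on success;
# returns 1 after TRIES unsuccessful checks, INTERVAL seconds apart.
# Usage: wait_for_pattern FILE PATTERN TRIES INTERVAL
wait_for_pattern() {
    file=$1
    pattern=$2
    tries=$3
    interval=$4
    while [ "$tries" -gt 0 ]; do
        line=$(grep -F -- "$pattern" "$file" 2>/dev/null | tail -n 1)
        if [ -n "$line" ]; then
            printf '%s\n' "$line"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# Example (placeholder paths and variables, adjust before use):
# tcpdump -l -vvv -nn udp port snmp and host "$zabbix_server" \
#     and host "$cisco_switch" > "$logfile" 2>&1 &
# tcpdump_pid=$!
# wait_for_pattern /var/log/zabbix/zabbix_server.log \
#     'Timeout while connecting to' 8640 10
# kill "$tcpdump_pid"
```

The timestamp on the matched log line then gives the point in the tcpdump output to inspect.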


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Originally posted by viktorkho
          Then wait for the "Timeout while connecting to..." error and compare the two log files. UDP does not guarantee delivery.
          In many cases the embedded computer that runs snmpd on the monitored device is very weak, which makes querying more metrics per second over SNMP very unreliable.
          It would probably also be good to come back to the subject of using SNMP over TCP (by default, SNMP runs over UDP).
          This memo defines a transport mapping for using the Simple Network Management Protocol (SNMP) over TCP. The transport mapping can be used with any version of SNMP. This document extends the transport mappings defined in STD 62, RFC 3417. This memo defines an Experimental Protocol for the Internet community.
          Last edited by kloczek; 15-06-2016, 18:46.
          http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
          https://kloczek.wordpress.com/
          zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
          My zabbix templates https://github.com/kloczek/zabbix-templates


          • viktorkho
            Member
            • Jul 2013
            • 90

            #6
            Originally posted by kloczek
            In many cases the embedded computer that runs snmpd on the monitored device is very weak, which makes querying more metrics per second over SNMP very unreliable.
            I agree with you; my first answer is in line with yours. But the topic starter said he cannot reproduce the packet loss from the shell.
            Originally posted by kloczek
            It would probably also be good to come back to the subject of using SNMP over TCP (by default, SNMP runs over UDP).
            https://tools.ietf.org/html/rfc3430
            Not all monitored devices support SNMP over TCP (nor do all NMSes). But what interests me most is determining which side the problem is on.

            I can confirm that we have seen "Timeout while connecting to..." errors on the SNMP icons of our Zabbix installations too. To reduce the time of inactivity, we use UnavailableDelay=5, UnreachableDelay=5 and UnreachablePeriod=15 on our servers.

            But I suspect that the Zabbix server side may not be entirely blameless.
