Inconsistent SNMP, and timeouts

  • untergeek
    Senior Member
    Zabbix Certified Specialist
    • Jun 2009
    • 512

    #1

    Inconsistent SNMP, and timeouts

    I'm running 1.8.4 server on RHEL.

    snmpget running every second pulls every value without fail.

    When I put the same OIDs into Zabbix, I get some timeouts and even a few Disabling/Enabling SNMP Host messages, but the data is coming, however sporadically.

    Host0 has no dropouts and always connects and works.
    Host1 has a few dropouts.
    Host2 has the most dropouts and almost never connects and works.
    Host3 has many dropouts.

    These servers are NOT loaded down. They are ENORMOUS Solaris servers with dozens of CPU cores.

    I am running 3 checks on each host with a 120 second interval for each check (12 checks, at most, running simultaneously).

    I increased my StartPollers from 25 to 50. This did not prevent timeouts.

    I don't know what to make of this. This doesn't seem like it should be possible. I reiterate that snmpget hits these and gets the values correctly every time, run from my zabbix server. Why does Zabbix fail?


    7549:20110209:131906.385 Item [host1.example.com:snmp[cpu-idle]] error: Timeout while connecting to [10.95.0.80:161]
    7549:20110209:131906.395 SNMP Host [host1.example.com]: another network error, wait for 15 seconds
    7555:20110209:131926.346 Item [host3.example.com:snmp[cpu-sys]] error: Timeout while connecting to [10.95.0.71:161]
    7555:20110209:131934.919 SNMP Host [host3.example.com]: another network error, wait for 15 seconds
    7567:20110209:131940.964 Item [host2.example.com:snmp[cpu-idle]] error: Timeout while connecting to [10.95.0.67:161]
    7568:20110209:132052.773 Item [host2.example.com:snmp[cpu-user]] error: Timeout while connecting to [10.95.0.67:161]
    7491:20110209:132057.137 Item [host3.example.com:snmp[cpu-sys]] error: Timeout while connecting to [10.95.0.71:161]
    7491:20110209:132057.146 SNMP Host [host3.example.com]: first network error, wait for 15 seconds
    7537:20110209:132059.192 Item [host1.example.com:snmp[cpu-sys]] error: Timeout while connecting to [10.95.0.80:161]
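
One way to put numbers on the per-host dropouts described above is to tally the timeout lines in the server log. The following is a minimal sketch, assuming the log lines look exactly like the excerpt above; the function name `count_timeouts` and the embedded sample are illustrative, not part of Zabbix.

```python
import re
from collections import Counter

# Pattern for the Zabbix 1.8 timeout lines quoted above, for example:
# 7549:20110209:131906.385 Item [host1.example.com:snmp[cpu-idle]] error: Timeout while connecting to [10.95.0.80:161]
TIMEOUT_RE = re.compile(
    r"^\s*\d+:\d{8}:\d{6}\.\d{3} "
    r"Item \[(?P<host>[^:\]]+):(?P<key>.+)\] "
    r"error: Timeout while connecting to \[(?P<addr>[^\]]+)\]"
)


def count_timeouts(lines):
    """Return a Counter of timeout errors keyed by host name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# A few lines copied from the excerpt above (hypothetical stand-in for the real log file).
SAMPLE_LOG = """\
7549:20110209:131906.385 Item [host1.example.com:snmp[cpu-idle]] error: Timeout while connecting to [10.95.0.80:161]
7549:20110209:131906.395 SNMP Host [host1.example.com]: another network error, wait for 15 seconds
7555:20110209:131926.346 Item [host3.example.com:snmp[cpu-sys]] error: Timeout while connecting to [10.95.0.71:161]
7567:20110209:131940.964 Item [host2.example.com:snmp[cpu-idle]] error: Timeout while connecting to [10.95.0.67:161]
7568:20110209:132052.773 Item [host2.example.com:snmp[cpu-user]] error: Timeout while connecting to [10.95.0.67:161]
"""

if __name__ == "__main__":
    for host, count in count_timeouts(SAMPLE_LOG.splitlines()).most_common():
        print(host, count)
```

Running the same function over the full server log, with the lines read from the file instead of `SAMPLE_LOG`, would show whether the dropouts really cluster on particular hosts or are spread evenly.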
  • zychonatic
    Member
    • Jun 2010
    • 52

    #2
    hi,

    I've got the same problem.

    any solution?

    br zychonatic


    • fmrapid
      Member
      • Aug 2010
      • 43

      #3
      The pre-1.8.5 builds (the cvs builds) do include a fix for something related: SNMP queries being doubled or delayed.

      I suggest you see if this fixes the problem or open a support request.

      Have you taken a Wireshark packet trace to see if data is indeed being requested and received from the problematic host without any errors?

      Cheers

      fmrapid


      • bonobo_slr
        Junior Member
        • Nov 2010
        • 15

        #4
        I am noticing the same issue. I have devices that I poll via snmp - they are not experiencing load at all. I am using zabbix and xymon to monitor the load/interface traffic etc.

        Xymon seems to have no problems with the data collection but Zabbix is erratic and therefore reporting incorrectly for items that require the delta between values.

        The devices are on the same VLAN as the zabbix server, plugged into the same switch, so I am ruling out a network issue. The problem is not confined to one device - it is many.

        Some tips on how to troubleshoot this would be appreciated.
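
Why dropped polls matter for delta-based items can be shown with a little arithmetic. The sketch below is not Zabbix code. It assumes a counter that grows at a constant 100 units per second and is polled every 120 seconds, then compares a "simple change" style delta with a "speed per second" style delta when two polls are lost. The function names and sample numbers are made up for illustration.

```python
def simple_change(samples):
    """Difference between consecutive counter readings, ignoring elapsed time."""
    return [(t1, v1 - v0) for (t0, v0), (t1, v1) in zip(samples, samples[1:])]


def speed_per_second(samples):
    """Difference between consecutive readings divided by the elapsed seconds."""
    return [(t1, (v1 - v0) / (t1 - t0)) for (t0, v0), (t1, v1) in zip(samples, samples[1:])]


# (timestamp in seconds, counter value): 100 units/s, one poll every 120 s.
ALL_POLLS = [(0, 0), (120, 12000), (240, 24000), (360, 36000), (480, 48000)]

# The same series with the polls at 240 s and 360 s lost to timeouts.
WITH_GAPS = [(0, 0), (120, 12000), (480, 48000)]

if __name__ == "__main__":
    print("simple change, all polls:", simple_change(ALL_POLLS))
    print("simple change, with gaps:", simple_change(WITH_GAPS))
    print("per second, with gaps:   ", speed_per_second(WITH_GAPS))
```

In this toy series the simple change after the gap is three times the usual value even though the underlying rate never changed, while the per-second figure stays at 100. Either way the graph loses the points in between, which is one plausible reading of the erratic values described above.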


        • untergeek
          Senior Member
          Zabbix Certified Specialist
          • Jun 2009
          • 512

          #5
          Agreed. From what I've read, the problem I'm seeing may or may not be fixed by the 1.8.5 update. I wish there were some other way to get info, as debuglevel=4 is WAY more info than I need.


          • dminstrel
            Member
            • Apr 2005
            • 72

            #6
            Is there a JIRA ticket for this? I'm also having this problem on 1.8.5rc1.

            Thanks,


            • ericgearhart
              Senior Member
              • Jan 2009
              • 115

              #7
              Did anyone find a resolution to this? All the symptoms mentioned here sound eerily similar to issues I'm currently having with SNMP in Zabbix.

              I'm wondering if rolling back the changes that were made to checks_snmp.c and poller.c to close this ticket: https://support.zabbix.com/browse/ZBX-4026 might resolve this. I'll give it a shot.

              See http://git.zabbixzone.com/trunk/.git...d02726405800d8 for the related git commit


              • ericgearhart
                Senior Member
                • Jan 2009
                • 115

                #8
                Please see https://support.zabbix.com/browse/ZBX-4901 for the bug report that seems to match the symptoms described in this thread


                • PhilSynek
                  Junior Member
                  • May 2012
                  • 13

                  #9
                  Hi Everyone,

                  I am experiencing the same problem. I am monitoring 5 appliances from the same manufacturer via a Zabbix proxy 2.2.6 (same as server) with SNMPv3. Four of them are the exact same model, the fifth one is a different appliance and this is the one causing problems. Let’s call them the four “switches” and the fifth one “router”, just for easier explanation.

                  First we were monitoring only the router. No problems so far, everything worked like a charm. LLD of SNMP Items was working fine.
                  Then one day (no changes had been made according to the Zabbix audit log) the queue started to fill up with almost 200 items. The graphs started to show only sporadic lines, and zabbix_proxy.log filled up with lines like these:

                  Code:
                  26158:20141106:100010.785 SNMP agent item "ifOperStatus[GigabitEthernet-10]" on host "Router" failed: first network error, wait for 15 seconds
                  26161:20141106:100025.092 resuming SNMP agent checks on host "Router": connection restored
                  26151:20141106:100040.871 SNMP agent item "ifHCInUcastPkts[GigabitEthernet-5]" on host "Router" failed: first network error, wait for 15 seconds
                  26161:20141106:100055.118 resuming SNMP agent checks on host "Router": connection restored
                  26154:20141106:100110.473 SNMP agent item "ifHCInOctets[GigabitEthernet-3]" on host "Router" failed: first network error, wait for 15 seconds
                  26161:20141106:100125.154 resuming SNMP agent checks on host "Router": connection restored
                  26151:20141106:100140.675 SNMP agent item "ifHCInOctets[GigabitEthernet-12]" on host "Router" failed: first network error, wait for 15 seconds
                  I knew this problem from another company I worked at, so I started checking the items, discovery rules and item prototypes of the template for any misconfiguration (an additional dot in front of the OIDs, wrong port or security settings, etc.). I could not find anything. So I started looking for changes made to the systems involved and found out that the kernel of the proxy had been downgraded. Just to be sure, the proxy was reinstalled with the right kernel from the beginning. Unfortunately, this didn’t solve the problem.
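
The fail/restore pattern in the log excerpt above can be measured rather than eyeballed. The following is a minimal sketch that pairs each "network error" line with the next "connection restored" line for the same host and reports how long checks were suspended. It assumes the Zabbix 2.2 log format shown above; `outage_windows` and the sample are illustrative names, not Zabbix functions.

```python
import re
from datetime import datetime

# "PID:YYYYMMDD:HHMMSS.mmm message", as in the excerpt above.
LINE_RE = re.compile(r"^\s*\d+:(?P<ts>\d{8}:\d{6}\.\d{3}) (?P<msg>.*)$")
FAIL_RE = re.compile(r'on host "(?P<host>[^"]+)" failed: (?:first|another) network error')
RESTORE_RE = re.compile(r'resuming SNMP agent checks on host "(?P<host>[^"]+)": connection restored')


def outage_windows(lines):
    """Return (host, start_time, seconds_until_restored) for each fail/restore pair."""
    open_since = {}
    windows = []
    for line in lines:
        parsed = LINE_RE.match(line)
        if not parsed:
            continue
        when = datetime.strptime(parsed.group("ts"), "%Y%m%d:%H%M%S.%f")
        failed = FAIL_RE.search(parsed.group("msg"))
        restored = RESTORE_RE.search(parsed.group("msg"))
        if failed:
            # Keep the earliest failure if several arrive before a restore.
            open_since.setdefault(failed.group("host"), when)
        elif restored:
            start = open_since.pop(restored.group("host"), None)
            if start is not None:
                windows.append((restored.group("host"), start, (when - start).total_seconds()))
    return windows


# First four lines of the excerpt above.
SAMPLE_LOG = """\
26158:20141106:100010.785 SNMP agent item "ifOperStatus[GigabitEthernet-10]" on host "Router" failed: first network error, wait for 15 seconds
26161:20141106:100025.092 resuming SNMP agent checks on host "Router": connection restored
26151:20141106:100040.871 SNMP agent item "ifHCInUcastPkts[GigabitEthernet-5]" on host "Router" failed: first network error, wait for 15 seconds
26161:20141106:100055.118 resuming SNMP agent checks on host "Router": connection restored
"""

if __name__ == "__main__":
    for host, start, seconds in outage_windows(SAMPLE_LOG.splitlines()):
        print(host, start.time(), f"{seconds:.3f}s")
```

On the sample, each window is a little over 14 seconds, and a new failure follows roughly 15 seconds after each restore. A regular rhythm like this would suggest that the host is reachable but that a specific request keeps failing, rather than a link that drops at random.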

                  In the meantime, the four switches were installed and monitoring started, using the same interface template as for the router. The only LLD rule for the switches is the same one used for the router: the interface discovery based on standard MIB OIDs. Everything is working fine for the switches!

                  We are using two templates in this scenario. “Template switch” and “Template router”, both are linked to our “Template Network Interfaces SNMPv3”. And the items from this network template are the only ones popping up in the logs.

                  The authentication passphrase and security name are controlled via macros and set on every host. The only difference between the switches and the router is that on the switches the authentication and security name strings are not identical, while the router uses identical strings for both. But I can’t believe this is the problem, because it worked this way before.

                  snmpwalk runs performed from the Zabbix proxy work flawlessly. I checked every item in the router template, every discovery rule and every item prototype directly via snmpwalk, and all of them work without problems.

                  I hope someone can help me.
                  Thanks!
                  Philipp


                  • tchjts1
                    Senior Member
                    • May 2008
                    • 1605

                    #10
                    I would double check that your SNMPV3 username and security strings are correct.

                    I just went through this same scenario you are describing. Took days to figure it out. One of our NetApp devices was upgraded to SNMPv3, but they failed to create an SNMPv3 user and passphrase.

                    All of the above are worth checking. Also worth drilling down to the items for your problem device and checking to see if somehow the passphrase or securityname inherited the wrong macro or string. Also make sure to validate you have the correct securitylevel applied for the items.

                    Outside of that, have a look at your Zabbix internal process health. Make sure you have enough resources allocated. See the last paragraph of this post as well as the graphs that follow it: https://www.zabbix.com/forum/showthread.php?t=41219


                    • PhilSynek
                      Junior Member
                      • May 2012
                      • 13

                      #11
                      Hi!

                      Thank you for the answer. Again, I checked all the items for the correct
                      • Type = SNMPv3 agent
                      • Security name = {$SNMP_SECURITY}
                      • Security level
                      • Authentication protocol
                      • Authentication passphrase = {$SNMP_AUTHENTICATION}
                      • Port = empty (Port is set via Host)

                      Everything was correct. I double checked the macros, they are alright. I also checked the Zabbix internals before and now as you mentioned them. I already raised my pollers on the server and the proxy.
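
One way to cross-check the settings listed above outside Zabbix is to run Net-SNMP's `snmpget` with exactly the values the macros resolve to. The sketch below only builds the command line; it assumes Net-SNMP is installed on the proxy. The security level (`authNoPriv`), authentication protocol (`SHA`), the user and passphrase placeholders, and the interface index `.1` are assumptions, since the thread does not state them. The IP address is the one from the proxy log.

```python
import shlex


def build_snmpget_argv(agent, oid, security_name, auth_passphrase,
                       auth_protocol="SHA", security_level="authNoPriv",
                       port=161, timeout_s=3, retries=0):
    """Build an snmpget command line for an SNMPv3 query with authentication."""
    return [
        "snmpget",
        "-v", "3",
        "-l", security_level,      # noAuthNoPriv, authNoPriv or authPriv
        "-u", security_name,       # what {$SNMP_SECURITY} resolves to
        "-a", auth_protocol,
        "-A", auth_passphrase,     # what {$SNMP_AUTHENTICATION} resolves to
        "-t", str(timeout_s),      # same timeout as the proxy's Timeout=3
        "-r", str(retries),        # no retries, so a single lost reply is visible
        f"{agent}:{port}",
        oid,
    ]


if __name__ == "__main__":
    argv = build_snmpget_argv(
        agent="10.255.242.95",
        oid="1.3.6.1.2.1.2.2.1.8.1",       # IF-MIB ifOperStatus, index 1 (assumed index)
        security_name="SECURITY_NAME",     # placeholder
        auth_passphrase="AUTH_PASSPHRASE", # placeholder
    )
    print(" ".join(shlex.quote(part) for part in argv))
```

Running the printed command in a loop with the proxy's timeout and no retries is a stricter test than a plain walk, because retries can hide replies that occasionally go missing.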

                      Just to be sure, see the attachment for screenshots.

                      I turned on log debugging on the proxy and here is what I found:

                      Code:
                      16335:20141107:100240.499 In zbx_snmp_get_values() num:94 level:0
                      16335:20141107:100240.500 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:94
                      16335:20141107:100240.500 End of zbx_snmp_get_values():NETWORK_ERROR
                      16335:20141107:100240.500 End of zbx_snmp_process_standard():NETWORK_ERROR
                      16335:20141107:100240.500 In zbx_snmp_close_session()
                      16335:20141107:100240.500 End of zbx_snmp_close_session()
                      16335:20141107:100240.500 getting SNMP values failed: Cannot connect to "10.255.242.95:161": Too long.
                      16335:20141107:100240.500 End of get_values_snmp()
                      16335:20141107:100240.500 In deactivate_host() hostid:10286 itemid:39984 type:6
                      16335:20141107:100240.501 query [txnlev:1] [begin;]
                      16335:20141107:100240.501 query [txnlev:1] [update hosts set snmp_errors_from=1415350960,snmp_disable_until=1415350975,snmp_error='Cannot connect to "10.255.242.95:161": Too long.' where hostid=10286]
                      16335:20141107:100240.501 query [txnlev:1] [commit;]
                      16335:20141107:100240.503 SNMP agent item "ifHCOutOctets[GigabitEthernet-2]" on host "router" failed: first network error, wait for 15 seconds
                      16335:20141107:100240.503 deactivate_host() errors_from:1415350960 available:1
                      16335:20141107:100240.503 End of deactivate_host()
                      Does "getting SNMP values failed: Cannot connect to "10.255.242.95:161": Too long." mean that the proxy is reaching some kind of timeout?
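
The `snmp_errors_from` and `snmp_disable_until` values in the `update hosts` query above are Unix timestamps, so the length of the suspension can be read straight from them. The following is a small sketch of that arithmetic; `disable_window` is an illustrative helper, not a Zabbix function.

```python
from datetime import datetime, timezone


def disable_window(errors_from, disable_until):
    """Convert the two epoch values to UTC datetimes and return the gap in seconds."""
    start = datetime.fromtimestamp(errors_from, tz=timezone.utc)
    end = datetime.fromtimestamp(disable_until, tz=timezone.utc)
    return start, end, disable_until - errors_from


if __name__ == "__main__":
    # Values copied from the debug log above.
    start, end, seconds = disable_window(1415350960, 1415350975)
    print("errors from   :", start.isoformat())
    print("disabled until:", end.isoformat())
    print("window        :", seconds, "seconds")
```

The gap is 15 seconds, matching the "wait for 15 seconds" message. The first timestamp is 09:02:40 UTC, which lines up with the 10:02:40 log entry if the proxy's clock is on UTC+1. Note that the request failed within a millisecond of being started (100240.499 to 100240.500), so whatever "Too long" refers to here, the poller did not wait for a reply before giving up.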

                      Thank you!
                      Philipp

                      • tchjts1
                        Senior Member
                        • May 2008
                        • 1605

                        #12
                        A few things here.

                        You mentioned you raised your pollers. Did you increase your unreachable pollers also? If not, I would bump those up. I would also allocate some more configuration cache. You are not hitting the alert threshold, but I personally don't like to run that close.

                        As for your timeout. No, that doesn't necessarily mean that you can't reach the device. You would also get that error if there was something wrong with the credentials. I know you have double-checked the settings in the Zabbix frontend. Have you also checked with your SNMP Admin to make sure you are matching what they have set in SNMPv3 on that device?

                        I would explicitly tell him/her what you are using for username, passphrase and security level and ask them to validate that they are an exact match.

                        Also - Any chance you can point that device directly to the Zabbix server to take the proxy out of the mix for troubleshooting purposes?

                        Lastly, there is a Timeout= setting on your Zabbix server in zabbix_server.conf. By default, it is set to 3. I always make it a point to set that to at least 15. (restart Zabbix server process any time you make conf changes)


                        • PhilSynek
                          Junior Member
                          • May 2012
                          • 13

                          #13
                          Thanks a lot for your time and support! I really appreciate that.

                          Originally posted by tchjts1
                          You mentioned you raised your pollers. Did you increase your unreachable pollers also? If not, I would bump those up. I would also allocate some more configuration cache. You are not hitting the alert threshold, but I personally don't like to run that close.
                          No I didn't raise the unreachable pollers. What do you suggest?

                          My settings on the server:
                          • StartPollers=15
                          • StartPollersUnreachable=1
                          • StartPingers=15
                          • StartDiscoverers=15
                          • CacheSize=8M
                          • Timeout=5

                          My settings on the proxy:
                          • StartPollers=10
                          • StartPollersUnreachable=1
                          • StartPingers=5
                          • StartDiscoverers=5
                          • CacheSize=8M
                          • Timeout=3

                          Originally posted by tchjts1
                          As for your timeout. No, that doesn't necessarily mean that you can't reach the device. You would also get that error if there was something wrong with the credentials. I know you have double-checked the settings in the Zabbix frontend. Have you also checked with your SNMP Admin to make sure you are matching what they have set in SNMPv3 on that device?

                          I would explicitly tell him/her what you are using for username, passphrase and security level and ask them to validate that they are an exact match.
                          I snmpwalked the host directly from my zabbix proxy shell. Everything is fine from there. I get values every time. So I guess that's not the solution for my problem, right?

                          Originally posted by tchjts1
                          Also - Any chance you can point that device directly to the Zabbix server to take the proxy out of the mix for troubleshooting purposes?
                          Unfortunately not. The device's network segment is a non-routed management network. I could try to get a temporary management interface on the Zabbix server, but first I would like to eliminate all the other possibilities. Let's keep that in mind.

                          Originally posted by tchjts1
                          Lastly, there is a Timeout= setting on your Zabbix server in zabbix_server.conf. By default, it is set to 3. I always make it a point to set that to at least 15. (restart Zabbix server process any time you make conf changes)
                          See my configuration above. Should I increase both, server and proxy, timeouts or only the server timeout?

                          Again, thank you for the help!
                          Philipp


                          • tchjts1
                            Senior Member
                            • May 2008
                            • 1605

                            #14
                            Without knowing what your new values per second are, how many devices you are monitoring, the number of items and triggers - I am going strictly by your internal process graphs and my personal experience.

                            Since your proxy and server are both fairly close to the stock default settings, I would make them both have these settings. I show the changed lines with <--. I'm sure you know that if you change any values in the conf, you either have to remove the comment (#) at the beginning of the line or put a new line without the comment. Otherwise it will still use default settings.

                            On the server:
                            Code:
                                StartPollers=25   <---
                                StartPollersUnreachable=5   <---
                                StartPingers=15
                                StartDiscoverers=15
                                CacheSize=32M    <---
                                Timeout=15   <---
                            
                            On the proxy:
                            
                                StartPollers=25   <---
                                StartPollersUnreachable=5   <---
                                StartPingers=5
                                StartDiscoverers=5
                                CacheSize=32M    <---
                                Timeout=15    <---
                            The above changes will probably not solve your SNMP issue, but your setup is going to run more smoothly.
                            Those above settings are somewhat liberal, but it gives you some room for growth without having to adjust these settings if you add in a few more hosts/devices or items/triggers.
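
The point about commented-out lines is easy to verify with a few lines of code. The sketch below lists the settings that are actually active in a configuration text by skipping blank lines and lines starting with `#`. The function name `effective_settings` and the sample text are illustrative; this is not how Zabbix itself parses its configuration, and the `<---` markers in the listing above are annotations, not part of the file.

```python
def effective_settings(conf_text):
    """Return {parameter: value} for every uncommented Key=Value line."""
    settings = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out: the built-in default applies
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical fragment of a proxy configuration file.
SAMPLE_CONF = """\
# StartPollers=5
StartPollers=25
# StartPollersUnreachable=1
CacheSize=32M
# Timeout=3
"""

if __name__ == "__main__":
    active = effective_settings(SAMPLE_CONF)
    for name in ("StartPollers", "StartPollersUnreachable", "CacheSize", "Timeout"):
        print(name, "=", active.get(name, "(commented out, default applies)"))
```

In the sample, `StartPollersUnreachable` and `Timeout` are still commented out, so editing the value on those lines would have no effect until the `#` is removed.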
                            Last edited by tchjts1; 20-11-2014, 18:55.


                            • PhilSynek
                              Junior Member
                              • May 2012
                              • 13

                              #15
                              tchjts1, thank you for your help! I adjusted the config as you suggested. You were right, it didn't help with my SNMP problem, but the zabbix performance looks better now.

                              Our network pro took a look at the Zabbix proxy (tcpdump) and noticed that some SNMP packets weren't being sent from the proxy. They appeared in the debug logs, as I had already seen, but did not appear in tcpdump. I had checked tcpdump too, but did not look closely enough. I did not expect the proxy to send only some of the SNMP requests. I thought: if one goes out, all go out.

                              tl;dr:
                              The solution is to upgrade the zabbix proxy to version 2.2.7 and deactivate snmpbulk.
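
A rough picture of what switching off bulk requests changes: the debug log earlier in the thread shows a single `zbx_snmp_get_values()` call with `num:94` failing with "Too long". Assuming that the size of that combined request was the trigger (the thread does not confirm this), the difference between one large request and many small ones can be sketched as below. The `batches` helper and the generated OID list are hypothetical and only illustrate the idea; they are not Zabbix code.

```python
def batches(oids, max_per_request):
    """Split a list of OIDs into consecutive groups of at most max_per_request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [oids[i:i + max_per_request] for i in range(0, len(oids), max_per_request)]


# 94 made-up interface indexes under IF-MIB ifHCInOctets (1.3.6.1.2.1.31.1.1.1.6).
OIDS = [f"1.3.6.1.2.1.31.1.1.1.6.{index}" for index in range(1, 95)]

if __name__ == "__main__":
    for limit in (94, 20, 1):
        groups = batches(OIDS, limit)
        print(f"max {limit:>2} per request -> {len(groups)} requests, "
              f"largest carries {max(len(group) for group in groups)} OIDs")
```

With a limit of 1, every value travels in its own small request, which costs more round trips but avoids any single oversized request.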

                              Cheers,
                              Philipp
