SNMP stops working ... partially

  • MichaelM
    Member
    • Sep 2008
    • 38

    #1

    SNMP stops working ... partially

    Hi there,

    I'm running a fresh installation of Zabbix 1.6.2 on Ubuntu 8.04. For the first tests I'm monitoring 10 routers, all connected through ADSL or SDSL, mostly with a fixed IP.

    My problem is that SNMP monitoring stops working for about 8 hosts after a while (I haven't figured out whether it's always after x hours or random). No new data is gathered anymore, the web interface shows a timeout for the non-working hosts in the item configuration, and the item status goes to "not supported".

    Per host I'm collecting two values via SNMPv3 every 60 seconds:

    interfaces.ifTable.ifEntry.ifInOctets.10001
    interfaces.ifTable.ifEntry.ifOutOctets.10001

    For these values I created a graph for each host and a screen containing all 10 graphs.

    I enabled the log, but there is nothing unexpected in it so far.

    Does this happen to other users too, or is it just me? Any advice?

    --Michael
  • MichaelM
    Member
    • Sep 2008
    • 38

    #2
    UPDATE: stopped again

    Hi,

    I restarted the Zabbix server yesterday, and 7 out of 10 items stopped working again.

    Most of the items stopped working at about 2am, which is when the routers get restarted to force the 24h DSL line termination (German Telekom, bad behavior).

    Does this mean that an item gets disabled if it did not work a couple of times (the lines should be back in about a minute)?

    If I try to switch the item from "not supported" back to "active", it doesn't work. The web interface shows "Timeout while connecting to [nameofhostobject:161]".

    I need to restart the Zabbix server. After that all items come back up automatically.

    The log with DebugLevel=2 shows nothing, so apparently no error was detected. Why is that? I have raised it to DebugLevel=3 in the meantime.

    --Michael
    Last edited by MichaelM; 03-03-2009, 10:02.


    • MichaelM
      Member
      • Sep 2008
      • 38

      #3
      Update: stopped again and again

      Hi,

      with DebugLevel=3, I can see messages like this in my log every 10 minutes:

      Item [gw-location.domain.de:ifInOctets.CityneT] error: Timeout while connecting to [x.x.x.x:161]

      But this makes no sense: the hosts are reachable without any problems, they respond to ping, and I'm able to gather data with snmpwalk.

      If I restart the Zabbix server, everything works again.

      What else can I do to fix this?

      --Michael


      • MrKen
        Senior Member
        • Oct 2008
        • 652

        #4
        Hi Michael,

        You say that restarting the Zabbix server resolves the problem of unreachable hosts. This suggests a configuration problem to me. For example, in your zabbix_server.conf there is a section about 'Unreachable Period'; you may need to increase the period to maybe 3 or 4 minutes. That way, when German Telekom forces your routers to restart, Zabbix will wait 3 or 4 minutes before declaring them unreachable.

        Hope this solves the problem.

        MrKen

        P.S. Don't forget to restart Zabbix after editing zabbix_server.conf.
        Disclaimer: All of the above is pure speculation.


        • MichaelM
          Member
          • Sep 2008
          • 38

          #5
          Hi MrKen,

          that sounds very reasonable.

          I have now configured

          UnreachablePeriod=300
          UnreachableDelay=15
          UnavailableDelay=60

          Hopefully these values work.

          Thanks for your quick reply.

          --Michael
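For readers unsure how the three settings above relate, here is a minimal Python sketch. It assumes the meaning given in the Zabbix configuration comments as I understand them: `UnreachableDelay` is the retry interval while a host is merely unreachable, `UnreachablePeriod` is how long that phase lasts, and `UnavailableDelay` is the retry interval afterwards. The helper names `parse_conf` and `retry_offsets` are hypothetical, and the model is deliberately simplified; it is not how the server is actually implemented.

```python
def parse_conf(text):
    """Parse simple Key=Value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def retry_offsets(conf, outage_seconds):
    """Return the seconds (since the first failure) at which a host would be
    re-checked during an outage, under the simplified model described above."""
    period = int(conf["UnreachablePeriod"])
    unreachable_delay = int(conf["UnreachableDelay"])
    unavailable_delay = int(conf["UnavailableDelay"])
    offsets = []
    t = 0
    while True:
        # Short retry interval inside the unreachable period, long one after it.
        t += unreachable_delay if t < period else unavailable_delay
        if t > outage_seconds:
            break
        offsets.append(t)
    return offsets


CONF_TEXT = """
UnreachablePeriod=300
UnreachableDelay=15
UnavailableDelay=60
"""

conf = parse_conf(CONF_TEXT)
# A hypothetical 7-minute outage, e.g. a slow DSL reconnect.
print(retry_offsets(conf, 420))
```

Under this model a router that comes back within a few minutes is re-checked every 15 seconds at first, so these settings alone should not keep a host offline until the server is restarted.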


          • troffasky
            Senior Member
            • Jul 2008
            • 567

            #6
            Originally posted by MichaelM
            Per host I'm collecting two values via SNMPv3 every 60 seconds:

            interfaces.ifTable.ifEntry.ifInOctets.10001
            interfaces.ifTable.ifEntry.ifOutOctets.10001
            When these stop polling in Zabbix, are you able to poll them manually with snmpget from the host Zabbix is running on?
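A manual poll like the one asked about here could be scripted roughly as follows. This is only a sketch: it builds a Net-SNMP `snmpget` command line for SNMPv3 without running it. The user name, passphrase, host address and the `IF-MIB::` form of the OID are placeholders and assumptions, not values from this thread, and the helper name `snmpget_v3_command` is hypothetical.

```python
import shlex


def snmpget_v3_command(host, oid, user, auth_pass, priv_pass=None,
                       auth_proto="MD5", priv_proto="DES",
                       timeout=5, retries=1):
    """Build (but do not run) a Net-SNMP snmpget command line for SNMPv3."""
    level = "authPriv" if priv_pass else "authNoPriv"
    argv = ["snmpget", "-v", "3", "-l", level, "-u", user,
            "-a", auth_proto, "-A", auth_pass]
    if priv_pass:
        argv += ["-x", priv_proto, "-X", priv_pass]
    # -t is the timeout in seconds, -r the number of retries.
    argv += ["-t", str(timeout), "-r", str(retries), host, oid]
    return argv


# Placeholder credentials and a documentation-range address.
cmd = snmpget_v3_command("192.0.2.10", "IF-MIB::ifInOctets.10001",
                         user="monitor", auth_pass="example-passphrase")
print(shlex.join(cmd))
```

Running the printed command from the Zabbix host while the item is failing shows whether the problem is in the network path or in the poller itself.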


            • MichaelM
              Member
              • Sep 2008
              • 38

              #7
              Originally posted by troffasky
              When these stop polling in Zabbix, are you able to poll them manually with snmpget from the host Zabbix is running on?
              Yes, as I wrote in post #3. The hosts were pingable and answered snmpget; both the snmpget and the ping were run from my Zabbix server machine.

              It seems that the config values are doing the trick. One of the routers had a longer outage today and the SNMP values came back automatically.

              --Michael


              • MichaelM
                Member
                • Sep 2008
                • 38

                #8
                Update: config change didn't do the trick

                Hi,

                it's so frustrating, but the modification of zabbix_server.conf didn't change anything.

                I'm still getting the timeouts after the routers are restarted. I need to restart zabbix_server to get them monitored again.

                Anything else I can check?

                --Michael


                • MrKen
                  Senior Member
                  • Oct 2008
                  • 652

                  #9
                  Hi Michael,

                  Probably not the answer you want, but maybe you could run a cron job at 2:10am to restart the Zabbix server.

                  Alternatively, change the host configs to connect by IP address or DNS name, whichever is the opposite of the current config. Just a thought.

                  MrKen
                  Disclaimer: All of the above is pure speculation.


                  • MichaelM
                    Member
                    • Sep 2008
                    • 38

                    #10
                    Hi MrKen,

                    yeah, a cron job would be an option for this specific problem, but what if a router becomes unreachable during the day? Should I do a restart in cron.hourly to make sure it's working properly?

                    8 out of my 10 routers are connected by IP address, the remaining 2 by DNS name.

                    I would love to dig deeper, maybe it's a bug, but I don't know where to start. DebugLevel=4 produces too much data to look through.

                    --Michael
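One way to cope with a noisy log is to filter it for the error that matters. The sketch below assumes the timeout lines look like the one quoted in post #3; the regular expression, the helper name `count_timeouts` and the second sample line are illustrative only, and the real log may carry extra prefixes such as timestamps.

```python
import re
from collections import Counter

# Matches lines such as:
# Item [gw-location.domain.de:ifInOctets.CityneT] error: Timeout while connecting to [x.x.x.x:161]
TIMEOUT_RE = re.compile(
    r"Item \[(?P<host>[^:\]]+):(?P<key>[^\]]+)\] error: "
    r"Timeout while connecting to \[(?P<target>[^\]]+)\]"
)


def count_timeouts(lines):
    """Count SNMP timeout errors per (host, item key) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("key"))] += 1
    return counts


sample = [
    "Item [gw-location.domain.de:ifInOctets.CityneT] error: "
    "Timeout while connecting to [x.x.x.x:161]",
    "some unrelated debug line",
]
for (host, key), n in count_timeouts(sample).items():
    print(host, key, n)
```

To use it on a real log, pass an open file object instead of the sample list; the path to the server log depends on the installation.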


                    • MrKen
                      Senior Member
                      • Oct 2008
                      • 652

                      #11
                      Originally posted by MichaelM

                      yeah, a cron job would be an option for this specific problem, but what if a router becomes unreachable during the day? Should I do a restart in cron.hourly to make sure it's working properly?

                      --Michael
                      My understanding is that German Telekom is the cause of the routers becoming unreachable. So a restart at 2:10am should fix that. The routers shouldn't become unreachable during the day. I wouldn't want to be restarting Zabbix every hour!

                      Dig deeper - it may have something to do with the router configuration.

                      MrKen
                      Disclaimer: All of the above is pure speculation.


                      • MichaelM
                        Member
                        • Sep 2008
                        • 38

                        #12
                        Hi MrKen,

                        well, the cron.hourly was a bit overdone, more sarcastic than serious.

                        Due to DSL outages or whatever other reason, the line could go down at any time on any day, without warning. I cannot restart zabbix_server whenever something goes wrong.

                        I don't believe it's a router issue, because I can gather data with snmpwalk right after the router is rebooted, so why can't Zabbix? But I will look into this again.

                        --Michael


                        • MichaelM
                          Member
                          • Sep 2008
                          • 38

                          #13
                          Update: because authentication does not work anymore

                          Hi there,

                          I figured out that the router complains about the SNMP checks from Zabbix after its restart. The router message says that authentication fails.

                          So what does that mean? Is there a hash or something that Zabbix calculates for authentication, and does this hash become invalid after the router restarts? I'm not familiar with SNMPv3 authentication, but the fact is that it works again after restarting the Zabbix server, so it must be Zabbix-related, because I didn't touch the router.

                          Help highly appreciated.

                          --Michael
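One plausible explanation, not confirmed anywhere in this thread, is the SNMPv3 User-based Security Model timeliness check from RFC 3414. The agent (here the router) keeps a boot counter and an engine time; an authenticated request is only accepted if it carries the agent's current boot counter and a time within 150 seconds of the agent's clock. A manager that keeps using values cached before the router rebooted would be rejected until it resynchronizes. The sketch below models only that check; the function name and the example numbers are hypothetical.

```python
MAX_ENGINE_BOOTS = 2147483647   # boot counter value at which the check always fails
TIME_WINDOW_SECONDS = 150       # allowed clock difference in RFC 3414


def in_time_window(local_boots, local_time, msg_boots, msg_time):
    """Timeliness check as performed by the authoritative engine (the agent).

    local_* are the agent's own snmpEngineBoots / snmpEngineTime,
    msg_* are the values carried in the incoming authenticated message.
    """
    if local_boots == MAX_ENGINE_BOOTS:
        return False
    if msg_boots != local_boots:
        return False
    return abs(msg_time - local_time) <= TIME_WINDOW_SECONDS


# Router has just rebooted: boot counter went from 41 to 42, time restarted.
router_boots, router_time = 42, 30

# Manager still using values cached before the reboot: rejected.
print(in_time_window(router_boots, router_time, msg_boots=41, msg_time=86060))

# Manager after rediscovering the engine parameters: accepted.
print(in_time_window(router_boots, router_time, msg_boots=42, msg_time=35))
```

If this is what happens, restarting the Zabbix server would help simply because it discards the cached engine values, which matches the symptoms described, but a packet capture would be needed to confirm it.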


                          • MrKen
                            Senior Member
                            • Oct 2008
                            • 652

                            #14
                            I think you need to double-check whatever authentication is required between the router and zabbix server. Personally I don't use v3, so I can't comment.

                            But I think I'm still leaning toward a router config issue.

                            Keep digging.
                            Disclaimer: All of the above is pure speculation.


                            • nottestuser
                              Junior Member
                              • Mar 2009
                              • 3

                              #15
                              I find your problems very similar to mine:



                              Since your monitoring actually does work sometimes, perhaps it might be interesting to see if the snmp responses or queries change. Do you have tcpdump on the Zabbix machine? Perhaps logging all UDP port 161 traffic to/from one of the monitored switches will reveal something.
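The capture suggested here could be started with a command built like this. It is a sketch that only assembles a `tcpdump` command line and does not run it; the interface name, the output file and the device address are placeholders, and the helper name `tcpdump_snmp_command` is hypothetical.

```python
import shlex


def tcpdump_snmp_command(device_ip, interface="eth0", pcap_path=None):
    """Build (but do not run) a tcpdump command for SNMP traffic to one device."""
    # -n disables name resolution, -i selects the capture interface.
    argv = ["tcpdump", "-n", "-i", interface]
    if pcap_path:
        # -w writes raw packets to a file for later inspection.
        argv += ["-w", pcap_path]
    # Capture filter: UDP port 161 traffic to or from the monitored device.
    argv.append(f"udp port 161 and host {device_ip}")
    return argv


cmd = tcpdump_snmp_command("192.0.2.10", pcap_path="snmp-debug.pcap")
print(shlex.join(cmd))
```

Capturing across a router restart would show whether Zabbix keeps sending requests and what kind of response, if any, the device returns.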
