SNMP problems with Zabbix

  • Steveo
    Member
    • Jun 2013
    • 31

    #1

    SNMP problems with Zabbix

    Hi All,

    I have been having several issues with Zabbix not working well with SNMP agents. I have over 900 hosts monitored with a ping check template, and all of the servers have the Zabbix agent installed and seem to be working. The problems start when I use the SNMP templates to monitor and graph switches, routers, etc. I have 7 switches monitored via SNMP, and they seem to work fine. When I tried to add more, I was immediately bombarded with errors in the logs. See the clip below:

    SNMP item [XXX]] on host [XXX] failed: first network error, wait for 15 seconds
    resuming SNMP checks on host [XXXV]: connection restored

    Basically it's a loop: it fails, then restores, over and over. It also affected the 7 that seemed to be working before. I removed the additional hosts and it went back to normal. I thought maybe I had added too many at once, but I tried again with only 5 more and it went nuts again. Any suggestions would be greatly appreciated.

    Thanks,
    Steve
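A minimal sketch for quantifying the fail/restore loop described above, assuming the server log contains the two message formats quoted in the post. The function name `count_flaps` and the sample host name are hypothetical; this is one way to see which hosts flap most, not a Zabbix feature.

```python
import re
from collections import Counter

# Patterns are based on the two log messages quoted in the post.
FAIL_RE = re.compile(r"on host \[(?P<host>[^\]]+)\] failed: first network error")
RESTORE_RE = re.compile(
    r"resuming SNMP checks on host \[(?P<host>[^\]]+)\]: connection restored"
)


def count_flaps(lines):
    """Return (failures, restores) as Counters keyed by host name."""
    failures, restores = Counter(), Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            failures[match.group("host")] += 1
            continue
        match = RESTORE_RE.search(line)
        if match:
            restores[match.group("host")] += 1
    return failures, restores
```

To use it, pass an open log file (or any iterable of lines) to `count_flaps` and print `failures.most_common()`. If every SNMP host shows roughly the same number of failures at the same times, the cause is more likely on the server side than on individual devices.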
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    What version of Zabbix are you running?

    There are some internal checks and related graphs that were introduced in the later 1.8.x version. See this post on those graphs:


    Basically, they are going to show you the performance of Zabbix.

    I suspect that if you increase the Timeout= value to 10 in zabbix_server.conf, increase the number of trappers in StartTrappers=, and then restart your Zabbix service, you will see some relief from those errors.

    I do very light SNMP monitoring of 5 devices and have my trappers set at 15.

    Code:
    ### Option: StartTrappers
    #	Number of pre-forked instances of trappers.
    #
    # Mandatory: no
    # Range: 0-1000
    # Default:
    # StartTrappers=5
    StartTrappers=15
    But those graphs I referenced above will help guide you in tuning those types of settings.

    (Edit)

    Oh, and if you are using ICMP pings, you may want to tune this parameter as well, as the default is 1. But again, the internal checks/graphs will tell you how well your current setting is performing. There is a metric for how busy the icmp pinger process is.

    Code:
    ### Option: StartPingers
    #	Number of pre-forked instances of ICMP pingers.
    #
    # Mandatory: no
    # Range: 0-1000
    # Default:
    # StartPingers=1
    StartPingers=2
    Last edited by tchjts1; 20-06-2013, 23:23.
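When tuning values such as Timeout, StartTrappers, and StartPingers, it is easy to edit a commented-out line by mistake. The sketch below reads configuration text in the `name=value` style shown above and reports only the uncommented assignments. The function name `effective_settings` is hypothetical, and letting the last assignment win is an assumption of this sketch, not documented Zabbix behavior.

```python
def effective_settings(conf_text):
    """Parse zabbix_server.conf-style text into a {name: value} dict.

    Lines starting with '#' are treated as comments and skipped.
    If a name appears more than once, the last assignment wins
    (an assumption made for this sketch).
    """
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings
```

For example, feeding it the two blocks above plus a `Timeout=10` line yields `StartTrappers` as `"15"`, `StartPingers` as `"2"`, and `Timeout` as `"10"`; the commented defaults are ignored.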


    • SupportGuy
      Member
      • Mar 2012
      • 30

      #3
      All equipment reachable?

      Another possibility: have you verified that all the equipment is reachable through SNMP with snmpwalk or snmpget from your Zabbix server?
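A rough sketch of the reachability check suggested above, run from the Zabbix server. It assumes the Net-SNMP `snmpget` tool is installed, SNMP v2c is in use, and the community string is `public`; the host addresses, function names, and defaults are placeholders for illustration.

```python
import subprocess

# sysUpTime.0 from the standard MIB-2 system group; most SNMP agents answer it.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"


def snmpget_command(host, community="public", timeout_s=2, retries=1,
                    oid=SYS_UPTIME_OID):
    """Build a Net-SNMP snmpget command line for SNMP v2c."""
    return ["snmpget", "-v2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries), host, oid]


def check_hosts(hosts, **kwargs):
    """Run snmpget once per host and return {host: True if it answered}."""
    results = {}
    for host in hosts:
        proc = subprocess.run(snmpget_command(host, **kwargs),
                              capture_output=True, text=True)
        results[host] = proc.returncode == 0
    return results
```

Calling `check_hosts(["192.0.2.10", "192.0.2.11"])` several times in a row gives a crude picture of whether the devices answer consistently or only intermittently.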


      • Steveo
        Member
        • Jun 2013
        • 31

        #4
        Originally posted by tchjts1
        What version of Zabbix are you running?

        I am using 2.0.6. I have had the trappers/pingers set as high as 250 each w/o any luck. I did miss that Timeout option in the conf file, so I will give that a try. Thanks!


        • Steveo
          Member
          • Jun 2013
          • 31

          #5
          Originally posted by SupportGuy
          Another possibility: have you verified that all the equipment is reachable through SNMP with snmpwalk or snmpget from your Zabbix server?
          Yes, all of the equipment is reachable with snmpwalk, as well as snmpget via Zabbix. SNMP works, just intermittently. I'm hoping the timeout increase will help. Thanks!


          • SupportGuy
            Member
            • Mar 2012
            • 30

            #6
            I had understood that you were doing active SNMP polling, i.e. usually through port 161.

            I'm not familiar with the whole Zabbix architecture, but I thought trappers were in charge of receiving SNMP traps (usually on port 162) that equipment/servers may send, so they would not be involved in your case. Am I wrong?


            • Steveo
              Member
              • Jun 2013
              • 31

              #7
              Well, I increased the Timeout to 30, which is the maximum, and added 20 hosts to the SNMP template, and it started doing the same thing. It added the hosts, then the loop started again, and it affected the original hosts that had been working. The log is full of the same messages for all of the hosts.

              SNMP item [XXX]] on host [XXX] failed: first network error, wait for 15 seconds
              resuming SNMP checks on host [XXXV]: connection restored


              This of course is causing all of the graphs to have missing data. I never had any of these issues with Cacti.


              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #8
                Did you take a look at the Zabbix internal processes graphs as previously mentioned in this thread?

                They can be a great indicator if a given process is being stressed.

                And when you changed that Timeout value, did you restart the Zabbix App process?


                • Steveo
                  Member
                  • Jun 2013
                  • 31

                  #9
                  Originally posted by tchjts1
                  Did you take a look at the Zabbix internal processes graphs as previously mentioned in this thread?

                  They can be a great indicator if a given process is being stressed.

                  And when you changed that Timeout value, did you restart the Zabbix App process?
                  Yes, nothing out of the ordinary in there either.


                  • nilie
                    Junior Member
                    • May 2013
                    • 16

                    #10
                    Originally posted by Steveo
                    This of course is causing all of the graphs to have missing data. I never had any of these issues with Cacti.
                    Cacti is a totally different beast, created specifically to graph performance using SNMP. Zabbix has its own architectural goals and purposes and does SNMP performance monitoring in a sub-optimal manner because it is not among its main objectives.

                    Now, to solve your problem, I would suggest capturing some traffic on your Zabbix server to see whether the device in question really responds to SNMP queries. Make sure you limit the capture to a single device and to UDP only; otherwise you will have tons of traffic to look at.
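As a sketch of the capture described above, the helper below builds a `tcpdump` command restricted to UDP traffic for one device. It assumes `tcpdump` is available on the Zabbix server, that polling uses the usual SNMP port 161, and that the interface is `eth0`; the address, interface name, and function names are placeholders.

```python
def snmp_capture_filter(device_ip, port=161):
    """Return a pcap filter for UDP traffic to or from one device on one port."""
    return f"udp and host {device_ip} and port {port}"


def tcpdump_command(device_ip, interface="eth0", port=161):
    """Build a tcpdump command line.

    -n turns off name resolution; -i selects the capture interface.
    """
    return ["tcpdump", "-n", "-i", interface,
            snmp_capture_filter(device_ip, port)]
```

`" ".join(tcpdump_command("192.0.2.10"))` gives a command that can be run as root. Requests without matching responses in the capture would point at the device or the network rather than at Zabbix.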


                    • Steveo
                      Member
                      • Jun 2013
                      • 31

                      #11
                      Originally posted by nilie
                      Now, to solve your problem, I would suggest capturing some traffic on your Zabbix server to see whether the device in question really responds to SNMP queries. Make sure you limit the capture to a single device and to UDP only; otherwise you will have tons of traffic to look at.
                      I have already done that. I know the devices are responding to the SNMP queries. The graphs show data for all of the devices I have tried to add so far; the data is just broken and missing. I had 7 hosts running fine with no issues, and the graphs looked great. I added a few more hosts, and those hosts are now failing SNMP and then having the connection restored, looping like that over and over. I have also graphed the Zabbix data gathering and internal processes, and I was nowhere near overloading any of them.


                      • tchjts1
                        Senior Member
                        • May 2008
                        • 1605

                        #12
                        One last thought from me... are you connecting to these devices through Zabbix via IP or DNS? If it is DNS... with 900 hosts, maybe you are overwhelming your DNS server?

                        If the bulk of your current hosts are using DNS, try adding in your new hosts and have them use IP instead.


                        • Steveo
                          Member
                          • Jun 2013
                          • 31

                          #13
                          All are being monitored via IP


                          • Steveo
                            Member
                            • Jun 2013
                            • 31

                            #14
                            Housekeeper was causing the issue...

                            I disabled housekeeping and all of the SNMP issues went away. I added 150 hosts to the template and all are working properly.


                            • tchjts1
                              Senior Member
                              • May 2008
                              • 1605

                              #15
                              That's interesting. Housekeeper is a resource hog when it runs. Do you have housekeeper set at the default of 1 hour?

                              You are going to run into issues if you leave it disabled, though. Your DB will keep growing without limit. (Unless you have it partitioned and are managing the data that way.)
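A back-of-the-envelope sketch of why the database keeps growing with housekeeping disabled: every polled value adds a history row, and nothing removes old ones. The items-per-host count and polling interval used below are hypothetical; only the 150 hosts come from the thread.

```python
SECONDS_PER_DAY = 86400


def history_rows_per_day(hosts, items_per_host, interval_s):
    """Values collected per day if every item is polled every interval_s seconds."""
    return hosts * items_per_host * SECONDS_PER_DAY / interval_s


def history_rows_after(days, hosts, items_per_host, interval_s):
    """Rows accumulated after `days` days when nothing is ever deleted."""
    return days * history_rows_per_day(hosts, items_per_host, interval_s)
```

With 150 hosts, an assumed 20 items per host, and a 60-second interval, `history_rows_per_day(150, 20, 60)` is 4,320,000 rows per day, so the total rises steadily with each day of retention unless housekeeping or partition maintenance trims it.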

