Zabbix Server Causing Network Device Latency Hourly

  • bdaniel
    Junior Member
    • Dec 2019
    • 14

    #1

    We are currently testing Zabbix to replace our SolarWinds NPM deployment. I have built the server on CentOS 8 following the install documentation using MySQL and have set the server to discover and add hosts from 3 of our internal subnets. The VM has 2 vCPU and 6GB of RAM and at this stage is only monitoring 165 items.

    Every hour, a couple of minutes past the hour, I get a notification from NPM that all of our UPSs have a high-latency warning of over 200ms; this goes back to normal 2 minutes later. I also get an email from one of the UPSs stating that it detected a network outage at the same time. Initially, I thought the discovery or an action rule was causing the latency/load, but the issue still occurs with all of these rules disabled. If I log onto the CentOS 8 box and stop zabbix-server, the issue stops. I have tailed all logs in /var/log/zabbix as well as the main logs in /var/log and can't see anything generated that would cause this issue. I have even disabled all cron.hourly tasks in case this was related, but the issue still occurs.

    Can anyone shed any light on this weird one for me?
  • gofree
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2017
    • 400

    #2
    Just a wild guess: could it be the housekeeper? By default it runs every hour.

    • bdaniel
      Junior Member
      • Dec 2019
      • 14

      #3
      Yeah I read something about Housekeeper running hourly but didn't think it would interact with the actual scanning of network devices - it seemed to be more of a db cleanup.

      Can I safely disable it as a test?

      • gofree
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2017
        • 400

        #4
        I guess disabling it for a while as a test will do no harm. Or change its interval in the Zabbix server conf file and see the results.
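For reference, the housekeeper interval is set by the HousekeepingFrequency option in zabbix_server.conf (value in hours, default 1). Below is a minimal Python sketch of changing that option in a config text. The helper name set_conf_option and the sample config are made up for illustration; on a real server you would normally just edit the file by hand and restart zabbix-server.

```python
import re


def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with key=value set, replacing an active or commented-out line."""
    # Match "key=..." or "# key=..." at the start of a line (spaces/tabs allowed).
    pattern = re.compile(rf"^[ \t]*#?[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(conf_text):
        # Replace only the first match; a callable avoids escape handling in the value.
        return pattern.sub(lambda _match: new_line, conf_text, count=1)
    # Option not present at all: append it at the end.
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


# Made-up sample config, not copied from a real installation.
sample = "LogFile=/var/log/zabbix/zabbix_server.log\n# HousekeepingFrequency=1\n"
updated = set_conf_option(sample, "HousekeepingFrequency", "6")
print(updated)
```

The server has to be restarted for the new value to take effect.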

        • bdaniel
          Junior Member
          • Dec 2019
          • 14

          #5
          I have changed the interval to 6h and rebooted the server. A bunch of the UPS's just reported the same latency issue at 5min past the hour.

          Does anyone else have any suggestions of where to look to solve this issue?

          EDIT: Still getting the exact same symptoms every hour, a couple minutes past the hour.
          Last edited by bdaniel; 04-12-2019, 03:10. Reason: More information.

          • dimir
            Zabbix developer
            • Apr 2011
            • 1080

            #6
            Could it be the Network Discovery?

            • bdaniel
              Junior Member
              • Dec 2019
              • 14

              #7
              I don't think it is the Network Discovery rules. For a few days I had them (and Actions) completely disabled and this issue was still occurring. I also had them set to 24h with the same result. I have just set all 3 of them to 6h and will see if this makes a difference for some reason...

              • bdaniel
                Junior Member
                • Dec 2019
                • 14

                #8
                But I just found this (see screenshot) under one of the affected hosts. Could one of these Discovery Rules be putting too much load on the low end network card in the UPS's?

                Attached Files

                • dimir
                  Zabbix developer
                  • Apr 2011
                  • 1080

                  #9
                  Well, according to the image you have 2 discovery rules that attempt to discover the network interfaces of the UPS every hour over SNMP, so that may very well be it. One thing to find out is at what time of the hour those rules fire. Unfortunately this information is not available in the history tables, but you could use a traffic analyzer, e.g. tcpdump, to see at what time of the hour SNMP traffic flows between the Zabbix server and this UPS.
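If the capture can be exported with timestamps, a few lines of Python can show which minute of the hour the SNMP traffic bunches up in. This is only a sketch: the timestamps below are invented, the function name packets_per_minute_of_hour is hypothetical, and parsing real tcpdump or Wireshark output into datetime values is left out.

```python
from collections import Counter
from datetime import datetime


def packets_per_minute_of_hour(timestamps):
    """Count packets by minute-of-hour (0-59), ignoring which hour they fall in."""
    return Counter(ts.minute for ts in timestamps)


# Invented capture times standing in for SNMP packets seen towards one UPS.
capture = [
    datetime(2019, 12, 5, 9, 3, 10),
    datetime(2019, 12, 5, 9, 3, 11),
    datetime(2019, 12, 5, 10, 3, 9),
    datetime(2019, 12, 5, 10, 41, 0),
]

counts = packets_per_minute_of_hour(capture)
busiest_minute, packet_count = counts.most_common(1)[0]
print(f"busiest minute past the hour: {busiest_minute:02d} ({packet_count} packets)")
```

A minute that stands out across several hours would be a good candidate to compare against the time of the latency alerts.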

                  • bdaniel
                    Junior Member
                    • Dec 2019
                    • 14

                    #10
                    So, I have disabled each of those discovery rules independently and left for a 24 hour period, I still have the same results each hour on the hour.

                    Dimir: When I run Wireshark and watch the traffic, there is lots of SNMP get-request and get-response traffic all the time, all appears to be as usual. Around the time the issue happens there is some SNMP getBulkRequest traffic and then we see the issue on the devices. Does this give you any clues?

                    If I stop the Zabbix-Server service the issues completely stop.

                    • Mike2K
                      Member
                      • Oct 2018
                      • 62

                      #11
                      Originally posted by bdaniel
                      So, I have disabled each of those discovery rules independently and left for a 24 hour period, I still have the same results each hour on the hour.

                      Dimir: When I run Wireshark and watch the traffic, there is lots of SNMP get-request and get-response traffic all the time, all appears to be as usual. Around the time the issue happens there is some SNMP getBulkRequest traffic and then we see the issue on the devices. Does this give you any clues?

                      If I stop the Zabbix-Server service the issues completely stop.
                      Zabbix uses the SNMP GetBulk request to fetch all SNMP data from the device in one go, instead of individual SNMP Get requests, which would cause a lot of network traffic. Could it be that the UPS is not able to handle the GetBulk request properly, causing the network stack on the UPS to crash?

                      Edit:
                      Could you please try disabling the bulk requests? You can do this in the host configuration.
                      [Attached image: Annotation 2019-12-09 114242.jpg]
                      Last edited by Mike2K; 09-12-2019, 12:44.
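To put rough numbers on the trade-off being discussed, here is a small Python sketch of how many request/response round trips it takes to read a set of OIDs, depending on how many OIDs are packed into each request. The item count of 40 and the batch size of 16 are assumptions picked for illustration, not values taken from Zabbix or from this UPS, and the sketch does not model how Zabbix actually schedules its SNMP polling.

```python
import math


def requests_needed(num_oids: int, oids_per_request: int) -> int:
    """Round trips needed to fetch num_oids when each request carries oids_per_request."""
    if num_oids < 0 or oids_per_request < 1:
        raise ValueError("num_oids must be >= 0 and oids_per_request must be >= 1")
    return math.ceil(num_oids / oids_per_request)


items = 40  # hypothetical number of SNMP items on one UPS

single = requests_needed(items, 1)    # one OID per request
batched = requests_needed(items, 16)  # assumed batch of 16 OIDs per request
print(f"one OID per request: {single} requests")
print(f"batched requests:    {batched} requests")
```

Fewer, larger requests mean fewer packets for the device to process, but each response is bigger; a weak SNMP agent could struggle with either pattern.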

                      • Mike2K
                        Member
                        • Oct 2018
                        • 62

                        #12
                        Could you try disabling the bulk requests? You can do this in the host configuration...

                        Attached Files

                        • bdaniel
                          Junior Member
                          • Dec 2019
                          • 14

                          #13
                          Since disabling the "use bulk requests" checkbox for each UPS I have not noticed any issues. I will continue to monitor this for the next 24 hours.

                          If this is indeed the solution, is there any way to disable this setting per host group or per VLAN? Possibly as part of the discovery/action process?
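One avenue that might be worth exploring is the Zabbix JSON-RPC API, which exposes a hostinterface.update method. The Python sketch below only builds the request bodies and does not send anything. It assumes a 4.4-style API in which "bulk" is a top-level property of the host interface (later releases are understood to keep it under "details"), and that the interface list was fetched beforehand, for example with host.get filtered by group. The function name, the token and the interface IDs are all hypothetical.

```python
import json

SNMP_INTERFACE_TYPE = 2  # Zabbix interface type code for SNMP


def build_disable_bulk_payloads(interfaces, auth_token):
    """Build one hostinterface.update JSON-RPC request per SNMP interface."""
    snmp_interfaces = (
        iface for iface in interfaces if int(iface["type"]) == SNMP_INTERFACE_TYPE
    )
    payloads = []
    for request_id, iface in enumerate(snmp_interfaces, start=1):
        payloads.append(
            {
                "jsonrpc": "2.0",
                "method": "hostinterface.update",
                # bulk=0 corresponds to unticking "Use bulk requests".
                "params": {"interfaceid": iface["interfaceid"], "bulk": 0},
                "auth": auth_token,
                "id": request_id,
            }
        )
    return payloads


# Hypothetical interfaces; the API returns numeric fields as strings.
interfaces = [
    {"interfaceid": "12", "type": "2"},  # SNMP interface, gets a payload
    {"interfaceid": "13", "type": "1"},  # agent interface, skipped
]

payloads = build_disable_bulk_payloads(interfaces, "HYPOTHETICAL_TOKEN")
for payload in payloads:
    print(json.dumps(payload))
```

Each printed body would then be POSTed to the frontend's api_jsonrpc.php endpoint; check the API reference for your exact Zabbix version before relying on the field names.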

                          • bdaniel
                            Junior Member
                            • Dec 2019
                            • 14

                            #14
                            Originally posted by Mike2K

                            Zabbix uses the SNMP GetBulk request to fetch all SNMP data from the device in one go, instead of individual SNMP Get requests, which would cause a lot of network traffic. Could it be that the UPS is not able to handle the GetBulk request properly, causing the network stack on the UPS to crash?

                            Edit:
                            Could you please try disabling the bulk requests? You can do this in the host configuration.
                            [Attached image: Annotation 2019-12-09 114242.jpg]
                            The system seemed to behave well during the day yesterday after making the change. However, overnight the same issues have been observed on multiple UPS's.

                            • bdaniel
                              Junior Member
                              • Dec 2019
                              • 14

                              #15
                              Disabling "use bulk requests" actually seems to have made the problem worse. All UPSs are now reporting latency spikes over 200ms at random times.

                              Does anyone have a suggestion for me? Surely I am not the only person SNMP monitoring Eaton UPS's using Zabbix...
