Zabbix graphs show multiple gaps.

  • Kshitij Sinha
    Junior Member
    • Jun 2019
    • 12

    #1

    Zabbix graphs show multiple gaps.

    Hi All,

    I am running Zabbix 4.0.7. As I have started adding more hosts, I am noticing that graphs for many items are showing holes/gaps. I am using Zabbix to monitor firewalls, routers and switches. The current issue is on firewalls (Cisco and Checkpoint).

    I am using SNMP OIDs to monitor the devices.

    I searched a bit on Google and tried different things:

    1. Interfaces are already set up with 64-bit SNMP counters.
    2. Disabled "Use bulk requests".
    3. Changed the polling interval to 5 minutes, 1 minute and 30 seconds. Nothing fixes the issue.


    I am attaching a screenshot of the broken graph and values.

    I would appreciate it if someone could help me out with this issue. What can I do to make the graph continuous? The issue persists with most of the new devices I add now.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    The embedded system on your Cisco device that runs the SNMP agent is not made of rubber and has very limited processing power. You are making this even worse by disabling SNMP bulk queries.
    Because of this, you will probably find a lot of entries about SNMP timeouts in the server/proxy logs.
    To mitigate this, you can also switch from SNMPv3 to SNMPv2c, if you are using SNMPv3.
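
A rough way to check the server or proxy log for such entries, sketched in Python. The two patterns and the sample lines below are assumptions for illustration only; adjust them to whatever your log actually contains.

```python
import re
from collections import Counter

# Assumed pattern: log lines name the host as: on host "NAME".
HOST_RE = re.compile(r'on host "([^"]+)"')
# Assumed keywords that indicate an SNMP timeout or network error.
ERROR_RE = re.compile(r"timeout|network error", re.IGNORECASE)


def count_snmp_errors(lines):
    """Count lines that look like SNMP timeouts, grouped by host name."""
    counts = Counter()
    for line in lines:
        if "SNMP" not in line or not ERROR_RE.search(line):
            continue
        match = HOST_RE.search(line)
        host = match.group(1) if match else "(unknown host)"
        counts[host] += 1
    return counts


# Hypothetical sample lines, not copied from a real log.
sample = [
    'SNMP agent item "ifInOctets[1]" on host "fw-01" failed: first network error, wait for 15 seconds',
    'resuming SNMP agent checks on host "fw-01": connection restored',
    'SNMP agent item "ifOutOctets[2]" on host "fw-02" failed: Timeout while connecting',
    'SNMP agent item "ifInOctets[3]" on host "fw-01" failed: Timeout while connecting',
]
for host, n in count_snmp_errors(sample).most_common():
    print(host, n)
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on your installation. Hosts that dominate the count are the ones worth testing by hand.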
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • Kshitij Sinha
      Junior Member
      • Jun 2019
      • 12

      #3
      Hi @kloczek,

      Thanks for the response.

      Switching to SNMPv2c does resolve the issue; however, that is not a viable solution for us since it breaches our corporate policy.

      I read your discussion with colohost in the thread below:
      https://www.zabbix.com/forum/zabbix-...-graphs-snmpv3

      I am using MD5/DES, not even AES, and am still facing this issue. That thread ended with no resolution. Can you tell me what I can do to resolve this issue?

      I am facing the issue on both Cisco and Checkpoint devices, so it is not vendor-specific.


      • colohost
        Junior Member
        • May 2018
        • 19

        #4
        You may need to alter your corporate policy to solve this problem, or not use Zabbix. Zabbix is not written in a manner that is compatible with retrieving a large number of OIDs from a device that serves them too slowly to retrieve them all in one shot, as was demonstrated in that other thread. It needs logic to spread the retrievals out over groups of ports, so it can achieve a query volume the device can sustain before it (Zabbix) times out. It should also have logic to detect when you've set a polling frequency that would be impossible to keep up with, i.e. polling continuously still wouldn't keep up, and then throttle back to a sustainable value until all groups can be retrieved successfully in the desired interval. I tried to push that agenda but got nowhere, because the people who author it can't seem to understand what life is like dealing with high port density network equipment. We have some gear with real or logical management planes that cover hundreds or 1000+ ports; you can't grab 10 OIDs per port with SNMPv3 encryption and Zabbix.
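
To make the "throttle back to a sustainable value" idea concrete, here is a minimal Python sketch. It is not anything Zabbix implements; the function name and the device rate of 50 OIDs per second are made up for illustration.

```python
import math


def sustainable_interval(requested_s, total_oids, oids_per_second):
    """Return the polling interval to use, in seconds.

    If one full pass over all OIDs takes longer than the requested
    interval, fall back to the time one pass actually needs.
    """
    if total_oids < 0 or oids_per_second <= 0:
        raise ValueError("total_oids must be >= 0 and oids_per_second > 0")
    full_pass_s = math.ceil(total_oids / oids_per_second)
    return max(requested_s, full_pass_s)


# Hypothetical device: 1000 ports x 10 OIDs, answering 50 OIDs per second.
print(sustainable_interval(60, 1000 * 10, 50))   # 200: one pass needs 200 s
print(sustainable_interval(300, 1000 * 10, 50))  # 300: the request is sustainable
```

The device rate would have to be measured per device and per security level, since encrypted responses are the slow case described here.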

        We're still using SNMPv3, but with encryption only on the auth side for any high-density device. The data isn't sensitive in any way, just counters, and the management traffic is isolated to management VRFs and VLANs, so it's safe enough regardless. For any SNMP sets or smaller, more sensitive retrievals, we use encryption throughout, but those aren't bulk retrievals where the problems occur.


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          It is possible to reproduce the SNMP timeouts using the snmp{,bulk}walk commands.
          The issue has nothing to do with Zabbix, and the Zabbix developers have already done enough to mitigate those bugs, which are still sitting in the net-snmp code.
          It is really ridiculous that sometimes very expensive devices are equipped with very weak and/or cheap embedded systems running snmpd.
          Some of those devices are able to provide tens of thousands of metrics over SNMP, but the hardware running snmpd is sometimes not able to produce the full list of these metrics using the snmpwalk command.
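
One way to reproduce this outside Zabbix is to time a walk command and see whether it finishes within the timeout. The Python sketch below is only a generic timing harness; the demo command is a stand-in so the sketch runs anywhere, and the function name is made up for illustration.

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s):
    """Run argv, return (finished, elapsed_seconds, output_line_count)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
        finished = result.returncode == 0
        lines = len(result.stdout.splitlines())
    except subprocess.TimeoutExpired:
        # The command was killed because it ran past timeout_s.
        finished = False
        lines = 0
    return finished, time.monotonic() - start, lines


# Stand-in command; replace it with a real snmpwalk/snmpbulkwalk call.
demo = [sys.executable, "-c", "print('line 1'); print('line 2')"]
finished, elapsed, lines = time_command(demo, timeout_s=30)
print(finished, lines)
```

With net-snmp installed, `argv` would be an `snmpbulkwalk` or `snmpwalk` invocation against the device, using the same credentials Zabbix uses. A run that does not finish, or returns far fewer lines than expected, is the kind of evidence a vendor ticket needs.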

          If you have a support contract, just report this to Cisco and Checkpoint.

          Another thing is that more than 2 years ago I was able to reproduce the same timeouts on a Linux box running snmpd.
          The issue is that most network device vendors use the net-snmp code to build their own SNMP agent code.

          • mellis
            Senior Member
            • Oct 2017
            • 145

            #6
            When I see gaps in the graphs, it tells me that data is not getting inserted into the Zabbix database. I would look at the MySQL performance and increase some of the buffers and pools. Next I would give the Zabbix server more memory and processes. I had a system that I grew from 25 devices to 500 devices, and I was able to tune Zabbix to handle the load.



            • ingus.vilnis
              ingus.vilnis commented
              MySQL and Zabbix performance is important but it is not the cause of this particular SNMPv3 issue.
          • Kshitij Sinha
            Junior Member
            • Jun 2019
            • 12

            #7
            Hi colohost, @kloczek,

            I have opened a case with Cisco and am waiting for their reply. I don't think they will be able to help much with this, but I will let you know when I hear from them.
            Interestingly, when I use only Auth and no Priv, I don't face any issue.

            So to summarize:

            When using SNMP version 2c – No Issue
            When using SNMP version 3 with AUTH ONLY – No Issue
            When using SNMP version 3 with both AUTH AND PRIV – Issue
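
The same three-way comparison can be repeated outside Zabbix with the net-snmp client. Below is a small Python sketch that builds the `snmpbulkwalk` command line for each mode, using MD5/DES as in this thread. The user, passphrases, community, host address and the helper name are placeholders, not real values.

```python
def walk_argv(mode, host, oid, user="USER", auth_pass="AUTHPASS",
              priv_pass="PRIVPASS", community="COMMUNITY",
              timeout_s=5, retries=0):
    """Build an snmpbulkwalk command line for one security mode."""
    # -t is the per-request timeout in seconds, -r the number of retries.
    base = ["snmpbulkwalk", "-t", str(timeout_s), "-r", str(retries)]
    if mode == "v2c":
        sec = ["-v2c", "-c", community]
    elif mode == "authNoPriv":
        sec = ["-v3", "-l", "authNoPriv", "-u", user,
               "-a", "MD5", "-A", auth_pass]
    elif mode == "authPriv":
        sec = ["-v3", "-l", "authPriv", "-u", user,
               "-a", "MD5", "-A", auth_pass,
               "-x", "DES", "-X", priv_pass]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return base + sec + [host, oid]


# 192.0.2.1 is a documentation address; 1.3.6.1.2.1.2.2 is ifTable.
for mode in ("v2c", "authNoPriv", "authPriv"):
    print(" ".join(walk_argv(mode, "192.0.2.1", "1.3.6.1.2.1.2.2")))
```

Running the three printed commands against the same device and comparing how long each takes shows whether the slowdown follows Priv, independent of Zabbix.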

            However, under our company policies, using SNMPv3 without Priv is a strict no-no, so I am stuck here. Is there any configuration I can change on the Zabbix server side? mellis mentioned that we can tune MySQL performance. Can you please guide me on what changes I can make on the MySQL side to try to make this better?

            Even if the issue is in the net-snmp code, we have no control or support to get that changed. So we need to figure out a way to fix it in Zabbix itself, if possible. Any help from you experienced folks will be highly appreciated. I am very new to Zabbix and don't have much knowledge.

            Regards,
            Kshitij


            • Kshitij Sinha
              Junior Member
              • Jun 2019
              • 12

              #8
              @ingus.vilnis - Can you help out on how to resolve this problem? Is there a way to resolve it or is this a dead end?


              • colohost
                Junior Member
                • May 2018
                • 19

                #9
                Cisco won't care, unfortunately. I see this problem on 2960 closet switches, all the way up to $500,000+ NCS chassis routers and Nexus 7k switches, and those actually have reasonable processors installed. I've gone the TAC route with no success. The issue is not limited to Cisco; I do see far better performance from Arista devices, but there's still a limit you'll ultimately hit. The issue is the speed at which these devices can output encrypted payloads via SNMPv3 if you're querying a large quantity of OIDs. For example, we query about ten OIDs per port (state change, admin status, description, name, errors, traffic, etc.). If you're querying a switch that has hundreds of ports, or even thousands if you have a high-density chassis doing gigE, and using something like Zabbix that wants to get every piece of data at the defined interval, you're asking this device to spit out possibly tens of thousands of pieces of data in one shot, before either net-snmp or Zabbix times out. That isn't happening with SNMPv3 payload encryption added on top. It won't matter if you decrease the monitoring frequency, because Zabbix was written to query every OID on every attempt.

                Other NMSs know about this issue and work around it by not trying to get every piece of data on every attempt. For example, you could bulk query interfaces 0-100 in group number one, 101-200 in group number two, etc., and space the groups out, so you're only asking for what the switch is capable of returning in the amount of time you, or your SNMP library, are willing to wait. Then, if the switch is capable of working on these in parallel, query the groups in parallel; if not, you are forced to reduce the querying frequency to what the device is capable of sustaining. If it can return no more than (100 interfaces * 10 OIDs) per minute, and you have 500 interfaces to query, then you cannot query them more frequently than every five minutes, with group one in minute one, group two in minute two, and so on, to avoid the timeout issue. The Zabbix folks don't seem interested in deploying code to accomplish this, likely because the number of people using SNMPv3 with data encryption is very low, and the number of people using it AND having a large device to query is even lower.
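
The arithmetic in that example can be written out as a short Python sketch. It assumes the non-parallel case, with one group queried per minute, and is not how any particular NMS implements it; the function name is made up for illustration.

```python
import math


def plan_groups(n_interfaces, oids_per_interface, device_oids_per_minute):
    """Split interfaces into groups the device can answer in one minute each.

    Returns (groups, min_interval_minutes); each group is
    (start_minute_offset, first_index, last_index), indexes 1-based.
    """
    group_size = device_oids_per_minute // oids_per_interface
    if group_size < 1:
        raise ValueError("device cannot return one interface per minute")
    n_groups = math.ceil(n_interfaces / group_size)
    groups = []
    for g in range(n_groups):
        first = g * group_size + 1
        last = min((g + 1) * group_size, n_interfaces)
        groups.append((g, first, last))
    # With one group per minute, a full pass takes n_groups minutes.
    return groups, n_groups


# The numbers from the post: 500 interfaces, 10 OIDs each,
# and a device that returns at most 100 * 10 OIDs per minute.
groups, interval = plan_groups(500, 10, 100 * 10)
for offset, first, last in groups:
    print(f"minute {offset}: interfaces {first}-{last}")
print(f"shortest sustainable interval: {interval} minutes")
```

This yields five groups of 100 interfaces and a five-minute minimum interval, matching the figures above. If the device can serve groups in parallel, the interval shrinks accordingly.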

                Your only real options are not using SNMPv3 Priv, or using another NMS. We opted to turn off Priv because the data is already on a private VLAN and doesn't contain anything of consequence anyway.


                • Kshitij Sinha
                  Junior Member
                  • Jun 2019
                  • 12

                  #10
                  Hi colohost, @kloczek,

                  We have decided to use AuthNoPriv, and it was working fine for most devices. One interesting thing I noticed is that after I changed some of the devices to AuthNoPriv, we are now seeing gaps in the graphs for those devices. Earlier, with AuthPriv, they were working fine. It might be that Zabbix/net-snmp caches the engine ID and that this is causing the issue, so if either of you knows how I can clear the cached engine IDs and retest, please let me know.

                  Thanks in advance.


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #11
                    If you have a support contract for those devices and you can now independently reproduce those timeouts without Zabbix, you should open a support ticket that describes how to reproduce the issue using only net-snmp SNMP client commands.
                    As far as I have been able to investigate this issue, some bugs definitely sit in the net-snmp code (which most network device vendors use as the base platform for implementing their own SNMP agent).
                    The net-snmp code carries a lot of legacy and is widely used, and it looks like the net-snmp developers do not currently have enough man-hours to refresh the net-snmp code base.

                    If network device vendors do not try to do this, then IMO sooner or later some vendors will start embedding the Zabbix agent in network device firmware.
                    If net-snmp does not improve, it is IMO only a matter of time before that happens.

                    • Kshitij Sinha
                      Junior Member
                      • Jun 2019
                      • 12

                      #12
                      Hi kloczek, colohost,

                      As an update, we are facing the issue again. Even with SNMPv3 in AuthNoPriv mode with MD5 as Auth, we are still facing the issue. The worst affected are the Checkpoint devices. The issue started after we completed our POC and added 150+ hosts. During the POC we had 25 hosts (mostly Cisco firewalls, switches and routers), and they were running fine without any gaps in data. Some of the hosts, mostly switches and Bluecoat proxies, are working fine, but firewalls are the biggest problem, especially the Checkpoint firewalls, which have 80%+ data loss.
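
A data-loss figure like that can be checked from the timestamps of the values an item actually stored. The Python sketch below assumes you already have those timestamps (in seconds) and know the item's update interval. The function name and the sample series are hypothetical.

```python
def find_gaps(timestamps, interval_s, tolerance=1.5):
    """Return (gaps, loss_ratio) for sorted sample timestamps in seconds.

    A gap is any pair of neighbouring samples further apart than
    tolerance * interval_s. loss_ratio is the share of expected
    samples that never arrived.
    """
    if len(timestamps) < 2:
        return [], 0.0
    gaps = [
        (a, b) for a, b in zip(timestamps, timestamps[1:])
        if b - a > tolerance * interval_s
    ]
    expected = round((timestamps[-1] - timestamps[0]) / interval_s) + 1
    loss_ratio = max(0.0, 1 - len(timestamps) / expected)
    return gaps, loss_ratio


# Hypothetical 60-second item that missed the samples at 120, 180 and 300.
samples = [0, 60, 240, 360]
gaps, loss = find_gaps(samples, 60)
print(gaps)            # [(60, 240), (240, 360)]
print(round(loss, 2))  # 0.43
```

Comparing the loss ratio per host and per security mode gives a number to put next to each configuration change, rather than judging by the look of the graph.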

                      I have tried everything I could find online: changing StartPollers and unreachable pollers, increasing memory, cache size, etc. Yet the issue still exists.

                      I also opened a case with TAC, but it was of no help. In captures on the firewall, they showed that the firewall is sending replies to the SNMP queries which Zabbix sends. Which in Zabbix

                      Since you both have a working setup, can you share your settings for the Zabbix server? What are the best possible settings I can try for around 300 hosts? I will try them in our environment and check if that makes any difference.

                      Next, we will upgrade the Zabbix server to 4.2.11 (currently running 4.0.7) and update the net-snmp library to the latest version (currently 5.7.2). This is the only thing left to do, and if it doesn't work after the upgrade, I will be at a dead end.

                      Any help will be highly appreciated. Unfortunately, since this is a POC, we do not have any support for Zabbix. If the POC is not successful, we might have to go with SolarWinds, as the POC for SolarWinds was a success.


                      • zumi_fi
                        Junior Member
                        • Sep 2018
                        • 10

                        #13
                        I've seen this issue with multiple switches. They are not capable of returning much information at short intervals.
                        I suggest you reduce the amount of data you poll. Also, disable anything you don't need.


                        • kloczek
                          Senior Member
                          • Jun 2006
                          • 1771

                          #14
                          If you have a support contract, open a support ticket.
                          On the Zabbix side, a lot has been done to mitigate that issue and to minimize the loss of monitoring data.
