Gaps on the graphs, SNMPv3

  • f0x0ff
    Junior Member
    • Aug 2015
    • 11

    #1

    Gaps on the graphs, SNMPv3

    Hello,

    First of all - I'm quite new to zabbix, so please excuse me if I'm asking ...not very smart questions

    I've been using Zabbix (2.2.7) for a month and have noticed some gaps in the graphs.

    I have a centralized deployment (zabbix_server, zabbix frontend, mysql, no proxies) on a single VM (debian 8, 64 bits with 8GB of RAM, 4 CPU cores, 100GB HDD).
    I'm monitoring 18 hosts (17 SNMPv3 network devices and zabbix server itself via agent) with a total number of items - 2974 (most of them are with interval of 300 seconds and the most aggressive are on 180 seconds) with 894 triggers. Maximum processed values per second are 48.

    All my network devices are using SNMPv3 with MD5 authentication and AES encryption.

    In zabbix_server.log there are many entries (515 in the last day) like:

    57228:20150825:003925.909 SNMP agent item "ITEM" on host "HOST" failed: first network error, wait for 15 seconds
    57236:20150825:003940.927 resuming SNMP agent checks on host "HOST": connection restored
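To see which hosts produce these entries most often, the log can be tallied per host. This is a minimal sketch, assuming the log lines look exactly like the two samples above; the function name `count_first_errors` is made up for illustration and is not part of Zabbix.

```python
import re
from collections import Counter

# Matches lines shaped like the sample above:
# 57228:20150825:003925.909 SNMP agent item "ITEM" on host "HOST" failed: first network error, ...
FIRST_ERROR = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d{3} '
    r'SNMP agent item "[^"]*" on host "(?P<host>[^"]*)" '
    r'failed: first network error'
)


def count_first_errors(lines):
    """Return a Counter of 'first network error' entries keyed by host name."""
    counts = Counter()
    for line in lines:
        match = FIRST_ERROR.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Example with the two log lines quoted in this post.
sample = [
    '57228:20150825:003925.909 SNMP agent item "ITEM" on host "HOST" '
    'failed: first network error, wait for 15 seconds',
    '57236:20150825:003940.927 resuming SNMP agent checks on host "HOST": '
    'connection restored',
]
print(count_first_errors(sample).most_common())
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines.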

    However, when I poll the device manually with snmpwalk, it always works.
    I've read that this could be because snmpwalk by default retries 5 times before giving up, unlike Zabbix, which doesn't retry at all.

    Do you believe these gaps are caused by the SNMP errors?
    Is there a simple way to enable SNMP retries? (I've found a patch, but it's quite old and I'm not sure whether it's still valid: https://support.zabbix.com/browse/ZBXNEXT-1096)
    Or could another performance issue be causing the gaps?
    My queue looks green and I've done some tuning on zabbix_server and on the MySQL server.
    I've checked the MySQL history table for one affected item and there are no entries for the time window where the gap occurs, so the problem is not with visualization but with collection of the values.
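The history check can be automated once the timestamps for one item have been exported. This is a rough sketch, assuming the `clock` values (Unix timestamps) for a single item are already in a plain list; `find_gaps` and the 1.5x tolerance are illustrative choices, not anything Zabbix provides.

```python
def find_gaps(clocks, interval, tolerance=1.5):
    """Return (previous, next) timestamp pairs that are further apart
    than interval * tolerance seconds."""
    ordered = sorted(clocks)
    limit = interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Example: an item polled every 300 seconds that missed two collections.
clocks = [0, 300, 600, 1500, 1800]
print(find_gaps(clocks, interval=300))
```

The tolerance leaves room for normal scheduling jitter, so only truly missed collections are reported.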

    Do you have any ideas what could be wrong with my setup?

    Thanks!
  • BDiE8VNy
    Senior Member
    • Apr 2010
    • 680

    #2
    "Starting from 2.2.8 Zabbix server and proxy daemons will always retry at least one time: either through the SNMP library's retrying mechanism or through the internal bulk processing mechanism."
    See: Item type SNMP agent

    Before applying any patch or investigating further, I strongly suggest first trying again with EnableSNMPBulkRequests disabled.
    See: configuration parameter and ZBXNEXT-2301


    • ingus.vilnis
      Senior Member
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Mar 2014
      • 908

      #3
      Hi,

      The problem might be on the device side, in the manufacturer's implementation of the net-snmp library.

      Please check whether your devices allow you to set a unique EngineID for each device.

      Might be a long reading but this could also be related to your problem https://support.zabbix.com/browse/ZBX-8385.

      When your SNMPv3 devices fail to communicate with Zabbix, what happens if you restart Zabbix server? Are the devices monitored fine after the restart and, if so, how long is it before they fail again?

      Best Regards,
      Ingus


      • nucleusv
        Member
        • Apr 2013
        • 40

        #4
        Originally posted by ingus.vilnis
        Hi,

        The problem might be on the device side and the manufacturer's implementation of net-snmp library.
        I have checked the EngineIDs on all my devices; they are unique.

        I have Juniper, ARISTA, ENGECORE devices.

        If only one Juniper device is enabled, Zabbix server starts querying data from the device after each server restart, but stops querying after about 70-80 seconds.

        Are there any requirements on the length of the EngineID?

        And what must be configured on the network device?
        Only the EngineID?


        • Alessan
          Junior Member
          • Jan 2016
          • 2

          #5
          We have the same problem.

          We decided to capture traffic with tcpdump on the Zabbix server's Ethernet interface and decrypt it with Wireshark using the SNMP users table.

          The result: random malformed requests that can't be decrypted.

          Some have an unexpected size (not a multiple of 8); others can't be decrypted with the key.

          We get the same result with a freshly installed Zabbix server on a VirtualBox test machine with only one SNMPv3 device and bulk requests disabled (~500 items).

          Zabbix version: 2.4.7

          Images attached.

          • nucleusv
            Member
            • Apr 2013
            • 40

            #6
            Originally posted by Alessan
            We have the same problem,

            Then we decide capture traffic with tcpdump on zabbix ethernet interface and decrypt with wireshark using snmp user tables.
            Please check all SNMPv3 items on the host and make sure that EVERY item has the same authentication protocol and privacy protocol.

            I had a similar problem where hosts became unavailable. I then discovered that one item prototype had a different authentication protocol from the other prototypes, and the device was not set up to serve that protocol, so Zabbix got no result from the device.


            • Alessan
              Junior Member
              • Jan 2016
              • 2

              #7
              All items are inherited from the same template with an item prototype. Items can't have different authentication options.

              Inbound errors on interface $1
              SNMPv3 agent
              ifInErrors[{#SNMPVALUE}]
              IF-MIB::ifInErrors.{#SNMPINDEX}

              Authentication on interface 1 can't be different from authentication on interface 2.

              Different items fail on each run; every time we go to Latest data, there are random interfaces with no value from the last run.


              • dampersand
                Junior Member
                • Apr 2016
                • 16

                #8
                I would kill for an update to this.

                I am having a very similar issue after my network guys updated firmware on four switches (Y U DO DIS IN PROD), and I too have tried all of these things, to no avail.

                The only difference I have in my problem is that our engineIDs are NOT all unique. I don't know a lot about engineIDs, so I'm curious why this would matter? It didn't matter before we had them set at all, and net-snmp doesn't ever seem to return them, that I see.

                One very interesting symptom is that all switches will poll correctly except for one... and if I restart snmpd on that switch, it will start working - but a DIFFERENT switch will go down.


                • brynza
                  Junior Member
                  • Jun 2016
                  • 3

                  #9
                  I have the same problem with one of our Cisco routers (I also have problems with a second one, but that one has stopped being monitored altogether). All the engineIDs are unique.
                  With the default timeout I get data irregularly, while snmpget runs without pauses or timeouts. As a temporary fix I increased the timeout, and now I get data every 1 to 3 minutes (the check interval is 1 minute).
                  But this is just an ugly workaround.


                  • troffasky
                    Senior Member
                    • Jul 2008
                    • 567

                    #10
                    Is this still an issue in 2.4 and 3.0? I've experienced poor reliability of SNMPv3 in 2.2 with holes in graphs and eventually no data at all. Restarting the zabbix-server service always brings it back.


                    • colohost
                      Junior Member
                      • May 2018
                      • 19

                      #11
                      I'm seeing this issue in 3.4 whenever there is any significant quantity of SNMPv3 OIDs to poll from a given device; i.e. if I'm watching multiple OIDs per port (status, bit rate, error rate, etc.) on a 48-port switch, and using SNMPv3, we're talking 48 ports * 6 OIDs = 288 OIDs to collect. The polling interval doesn't seem to matter; if it's five minutes, the errors and gaps will occur at five minute intervals, if it's an hour, you'll randomly miss items on an hourly basis. I will say I'm using authPriv; I haven't tested with lower since, if you're discarding all the security benefits, what's the point.

                      Just for comparison purposes, I wrote a quick bash script to snmpwalk ifDescr on a particular Cisco switch stack of five devices, 48 ports each, then snmpbulk requested the OID's for admin status and oper status of each port; took about 45 seconds. Same thing with snmpv2 took 6 seconds.

                      I assume this significantly longer query time is some combination of each side having to do the hashing and encryption/decryption; perhaps the low powered cpu's in the switches can't do the sha/aes very fast. I've had to switch all of my network devices with 10+ OID's back to snmpv2 or zabbix will simply always miss some data.

                      Long term, only thing I can think of as a way to get around this is for zabbix to analyze a given host's items, and if the quantity of snmpv3 items exceeds a certain user defined number (which you could tune to your hardware), break them up into sequentially polled batches of that defined quantity, and adjust when they poll to keep the polling interval the same for each batch.
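To make the batching idea concrete, here is a rough sketch of the scheduling part only. It is not how Zabbix schedules checks today; `plan_batches`, the batch size of 50, and the `oid0`... item names are hypothetical, and the sketch only computes start offsets rather than doing any polling.

```python
def plan_batches(item_keys, batch_size, interval):
    """Split items into batches of at most batch_size and spread the batch
    start offsets evenly across one polling interval (in seconds)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = [
        item_keys[i:i + batch_size]
        for i in range(0, len(item_keys), batch_size)
    ]
    if not batches:
        return []
    step = interval / len(batches)
    # Every batch still runs once per interval, just at a different offset.
    return [(round(n * step), batch) for n, batch in enumerate(batches)]


# Example: the 288 OIDs from the 48-port switch, 50 per batch, 5-minute interval.
items = [f"oid{i}" for i in range(288)]
for offset, batch in plan_batches(items, batch_size=50, interval=300):
    print(f"start at +{offset}s: {len(batch)} items")
```

Each item keeps the same polling interval; only its position inside the interval changes, so the device never has to answer for all of them at once.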


                      • kloczek
                        Senior Member
                        • Jun 2006
                        • 1771

                        #12
                        Originally posted by colohost
                        I'm seeing this issue in 3.4 whenever there is any significant quantity of SNMPv3 OIDs to poll from a given device; i.e. if I'm watching multiple OIDs per port (status, bit rate, error rate, etc.) on a 48-port switch, and using SNMPv3, we're talking 48 ports * 6 OIDs = 288 OIDs to collect. The polling interval doesn't seem to matter; if it's five minutes, the errors and gaps will occur at five minute intervals, if it's an hour, you'll randomly miss items on an hourly basis. I will say I'm using authPriv; I haven't tested with lower since, if you're discarding all the security benefits, what's the point.
                        • Check your proxy logs for timeout messages for your device
                        • Check whether "Use bulk requests" is enabled on your host interface
                        • How many LLD rules do you have defined for this monitored device?
                        Generally, the problem with many LLD rules is that snmpd from net-snmp, which is used even on proprietary devices, is not reentrant, so when the proxy queries multiple OIDs at the same time, at least one of those queries may fail with a timeout.
                        If this is the case, it should always be reported as a service request to the vendor of the exact device with that SNMP agent. Maybe at least one of those companies will invest some paid developer time in rewriting the critical net-snmp code.
                        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                        https://kloczek.wordpress.com/
                        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                        My zabbix templates https://github.com/kloczek/zabbix-templates


                        • colohost
                          Junior Member
                          • May 2018
                          • 19

                          #13
                          There isn't a timeout logged; the devices are simply still sending data back when Zabbix hits its own configured timeout value and stops waiting for additional data, and the result is gaps in your item data. This isn't an LLD-specific issue, but it occurs there as well if the underlying discovery doesn't complete, so you end up with items that bounce between no longer discovered and discovered. If there are too many SNMPv3 OIDs to poll, the response simply won't finish until Zabbix has already stopped waiting for the rest of the data. Interestingly, Arista devices seem to send SNMPv3 responses at about 3x the rate of Cisco devices, but even those have the same problem if there are enough OIDs that the responses can't finish within 30 seconds (the maximum Zabbix timeout).

                          I had one switch with 2000+ OID's (stack of several hundred ports) miss more data than it got; switched to SNMPv2 and problem was gone.


                          • kloczek
                            Senior Member
                            • Jun 2006
                            • 1771

                            #14
                            Per-port SNMP OIDs generated by LLD are read in batched SNMP queries if SNMP bulk queries are enabled, so the amount of data is not relevant. More important is the total number of SNMP queries per second.
                            You can measure this with my SNMPv2-MIB template https://github.com/kloczek/zabbix-te...MIB/SNMPv2-MIB


                            • steveroebuck
                              Junior Member
                              • Jan 2018
                              • 19

                              #15
                              We are experiencing exactly the same issue with SNMPv3 using authPriv: some interfaces come back fine, while others have massive gaps in the time series data for switch throughput.

                              The templates are fine and there is no misconfiguration there. We have the timeout set to the maximum of 30 seconds, and with some devices it's just not enough to pull back all the metrics we need to monitor. Does anyone have a workaround or solution? This is a deal breaker for our Zabbix deployment, as we cannot use SNMPv2: it violates our corporate security policy.

                              We experience the timeout issues on the following hardware

                              Trend Tipping Point NGFW
                              Checkpoint 4600 Appliance
                              HPE Flexfabric 5900AF, 5900CP and 5940
                              HPE 6125XLG and 6127XLG
                              F5 Big IP Appliance

                              Devices are not becoming "unsupported" and we can see no timeouts or strange errors in either the Zabbix or proxy logs.
                              Last edited by steveroebuck; 16-05-2018, 15:05. Reason: Added HW detail
