Zabbix queue fills up - SNMP items are not recalculated/requeued

  • rglade
    Junior Member
    • Jan 2017
    • 13

    #1

    Zabbix queue fills up - SNMP items are not recalculated/requeued

    Hi all,

    In many cases the problem of queues filling up has already been discussed. First of all, I have already tested all the performance presets from other articles.

    For me, the problem occurs only in connection with SNMP queries. Above all, items that are not queried frequently (for example, items that are polled only once per hour) are apparently never queried again if the query itself once failed.

    Therefore, I logged the queries over a longer period of time and found that Zabbix does not even try to retrieve these values again!

    I wonder if there is a design flaw in Zabbix here. The Zabbix proxy involved has enough resources, and no caches or similar appear to be overflowing.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by rglade
    Hi all,

    In many cases the problem of queues filling up has already been discussed. First of all, I have already tested all the performance presets from other articles.
    Which queue?
    What exactly happens in your case?
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • rglade
      Junior Member
      • Jan 2017
      • 13

      #3
      We use Zabbix for monitoring servers with the Zabbix agent / IPMI and switches and routers with SNMP.

      We currently monitor more than 84000 items, and in general all IPMI and Zabbix agent items are less troublesome. We use 3 Zabbix proxies with one Zabbix server.
      For SNMP items this looks different. Items that apparently could not be checked are not checked again and then end up among the items that could not be checked for more than 10 minutes.

      I've just watched this a little bit more and checked whether the server ever tries to repeat these previously aborted queries. And to my surprise: it does not.

      This means that items whose query was aborted will never be queried again; they are not put back into the queue.
      In my case, the warning eventually comes up: "More than 100 items having missing data for more than 60 minutes".

      We use Debian with the Zabbix repository - currently version 4.0.4. My impression is that the problems did not yet exist with Zabbix 2.x, and since 3.x/4.x the problem has become more pronounced.
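For context, the queue view in the frontend buckets items by how long they are overdue relative to their scheduled nextcheck (over 5 s, 10 s, 30 s, 1 m, 5 m, 10 m). A minimal sketch of that bucketing logic with hypothetical timestamps (this is an illustration, not the Zabbix source):

```python
# Sketch of how the frontend queue view buckets overdue items:
# an item sits "in the queue" once now exceeds its scheduled nextcheck.
BUCKETS = [
    (5, "over 5 seconds"),
    (10, "over 10 seconds"),
    (30, "over 30 seconds"),
    (60, "over 1 minute"),
    (300, "over 5 minutes"),
    (600, "over 10 minutes"),
]

def queue_bucket(now, nextcheck):
    """Return the queue bucket label for an overdue item, or None if on time."""
    overdue = now - nextcheck
    if overdue <= 5:
        return None
    label = None
    for threshold, name in BUCKETS:
        if overdue > threshold:
            label = name  # keep the deepest bucket whose threshold is exceeded
    return label

# An item whose failed check is never rescheduled keeps drifting and
# eventually parks permanently in the last bucket:
print(queue_bucket(now=1550040000, nextcheck=1550039000))  # over 10 minutes
```

This also explains the symptom: an item that is never put back into the queue with a new nextcheck accumulates in the "over 10 minutes" bucket forever.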


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        1) Which Zabbix queue are you talking about?
        2) What exactly happens?
        3) How many NVPS do you have on the Zabbix server and on each proxy?
        4) How many SNMP and IPMI devices are monitored per proxy?
        5) How many pollers do you have on each proxy? What is the poller utilisation? Do you monitor each proxy itself? (a dummy host on 127.0.0.1/localhost monitored through that exact proxy, using only the standard "Template App Zabbix Proxy" template)

        - used OS doesn't matter
        - please just answer my questions, as precisely and briefly as you can.


        • rglade
          Junior Member
          • Jan 2017
          • 13

          #5
          1) Which Zabbix queue are you talking about?
          The SNMP queue on the respective Zabbix proxy is affected.


          2) What exactly happens?
          Maybe an example. A switch is monitored, and several items are defined for its interfaces. The per-minute queries (in / out / err) are usually not a problem. However, there are items that are queried only once an hour. If these fail, they are not polled again after the defined interval. Apparently even the aborted queries are never retried.

          3) How many NVPS do you have on the Zabbix server and on each proxy?

          [Image: zabbix_perf.png]

          4) How many SNMP and IPMI devices are monitored per proxy?
          IPMI: No requests to the affected zabbix proxies.
          SNMP: 20 to 30 switches.

          5) How many pollers do you have on each proxy? What is the poller utilisation? Do you monitor each proxy itself? (a dummy host on 127.0.0.1/localhost monitored through that exact proxy, using only the standard "Template App Zabbix Proxy" template)

          I have tested many parameters here, from smaller values to larger ones - even utopian ones. As already mentioned, up to version 2.0 we had no problems of this kind. It makes no big difference how many pollers we set. Currently we use these parameters:

          Code:
          ConfigFrequency=1200
          StartPollers=60
          StartIPMIPollers=20
          StartPollersUnreachable=84
          StartTrappers=20
          StartPingers=30
          StartHTTPPollers=5
          StartVMwareCollectors=5
          CacheSize=1024M
          HistoryCacheSize=2048M
          HistoryIndexCacheSize=128M
          ExternalScripts=/usr/lib/zabbix/externalscripts
          StartPingers=50
          Timeout=15
          TrapperTimeout=80
          HousekeepingFrequency=1
          UnavailableDelay=20
          UnreachableDelay=10
          UnreachablePeriod=30
          Because of these described problems, we have outsourced all SNMP requests to two proxies that are also in the same VLAN as the switches themselves.
          Thank you for your support - we are really at a loss. :-) Robert
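As a back-of-the-envelope check on poller sizing with settings like the above: each poller process can sustain roughly one check per average check duration, so the number of busy pollers is approximately NVPS times the average SNMP response time. A sketch of that arithmetic (the 0.25 s average response time is an assumption for illustration, not a measured value):

```python
import math

def pollers_needed(nvps, avg_check_seconds):
    """Estimate busy poller processes: each poller completes ~1/avg_check_seconds checks per second."""
    return math.ceil(nvps * avg_check_seconds)

# e.g. 200 new values per second at ~0.25 s per SNMP get -> ~50 busy pollers,
# comfortably under StartPollers=60.
print(pollers_needed(200, 0.25))  # 50

# But if each check runs into the full Timeout=15 s, the same 200 NVPS would
# need 3000 pollers -- i.e. timeouts, not the poller count, become the bottleneck.
print(pollers_needed(200, 15))    # 3000
```

The point of the sketch: raising StartPollers cannot compensate for checks that consistently time out.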


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by rglade
            1)
            2) What exactly happens?
            Maybe an example. A switch is monitored, and several items are defined for its interfaces. The per-minute queries (in / out / err) are usually not a problem. However, there are items that are queried only once an hour. If these fail, they are not polled again after the defined interval. Apparently even the aborted queries are never retried.
            What do you see in proxy logs about those OIDs?

            3) How many NVPS do you have on the Zabbix server and on each proxy?
            You are below the out-of-the-box limit on monitoring data points, which is defined in the source code in include/proxy.h:

            Code:
            #define ZBX_MAX_HRECORDS       1000
            #define ZBX_MAX_HRECORDS_TOTAL 10000
            so this is not a case where you may have reached such a limit.
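As a rough illustration of what those constants bound (a sketch; the ZBX_MAX_HRECORDS values are quoted from include/proxy.h above, and the 1 s DataSenderFrequency is the proxy's documented default, not something from this thread):

```python
# Per-sync upload limits on a Zabbix proxy, from include/proxy.h:
ZBX_MAX_HRECORDS = 1000         # history records per single data request
ZBX_MAX_HRECORDS_TOTAL = 10000  # records per data-sender cycle

# Proxy config default (assumed here): one sender cycle per second.
DATA_SENDER_FREQUENCY = 1       # seconds

# Upper bound on sustained values/second a proxy can upload to the server:
max_nvps = ZBX_MAX_HRECORDS_TOTAL / DATA_SENDER_FREQUENCY
print(max_nvps)  # 10000.0
```

With 20-30 switches per proxy, the actual NVPS is orders of magnitude below that ceiling, which supports kloczek's point that this limit is not the problem.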

            Check what is logged in the proxies' logs.


            • rglade
              Junior Member
              • Jan 2017
              • 13

              #7
              Ok, I'm not sure whether more than 1000 data points are being monitored. Some switches have 96 ports, each with at least 10 items. If we monitor 10 such switches, then 9600 items are monitored - but not all every minute; some have an update interval of 30m. Is this limit per unit of time?
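The limit quoted above is per upload batch, not a cap on the item count; what matters for load is new values per second (NVPS), i.e. items divided by their update interval. A quick sketch of that arithmetic (the half-and-half split of the 9600 items between 1m and 30m intervals is hypothetical, chosen only to illustrate):

```python
def nvps(item_groups):
    """Sum new-values-per-second over (item_count, interval_seconds) groups."""
    return sum(count / interval for count, interval in item_groups)

# 10 switches x 96 ports x 10 items = 9600 items; suppose half poll every
# 60 s and half every 1800 s (illustrative split, not measured):
rate = nvps([(4800, 60), (4800, 1800)])
print(round(rate, 2))  # 82.67
```

Note how little the 30m items contribute: the hourly/30m items that are getting stuck are a tiny fraction of the proxy's polling load.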

              At DebugLevel 3 nothing is logged.
              In tcpdump, however, I can no longer see any query that polls the affected items again.

              Definitely, only items that have a longer update interval defined are affected. Maybe a detailed example helps. Here are the items that have the problems:


              [Image: Zabbix_Queue_hanging_items.png]


              You can retrieve the entries manually at any time via snmpwalk:
              [Image: Zabbix_Queue_hanging_items_manualquery.png]



              [Image: Zabbix_Queue_hanging_items_definitions.png]

              There is an important correlation: only items with a longer update interval are affected. Other items from the same host with a shorter update interval work without any problems!

              [Image: Zabbix_Queue_hanging_items_correct.png]


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Again... what did you find in the proxy logs about those items?


                • rglade
                  Junior Member
                  • Jan 2017
                  • 13

                  #9

                  Unfortunately I find no log entries for the SAN05 or the OID, neither at DebugLevel 3 nor 4. But in tcpdump I could see the requests.


                  • rglade
                    Junior Member
                    • Jan 2017
                    • 13

                    #10
                    Are these too many items?
                    [Image: Zabbix_Proxies.png]


                    • kloczek
                      Senior Member
                      • Jun 2006
                      • 1771

                      #11
                      Originally posted by rglade
                      Unfortunately I find neither in DebugLevel3 nor 4 any log entries to the SAN05 or the OID. But in TCPDump I could see the requests.
                      Again... what did you find in the logs?
                      Did you find any errors/warnings related to monitoring those hosts with SNMP metrics at the default debug level?
                      Please... I have not been asking you to increase the debug level on the proxies or to fiddle with tcpdump.
                      It is really hard to help when, instead of focusing on what I've asked you to do, you do what you think I'm asking you to do.


                      • rglade
                        Junior Member
                        • Jan 2017
                        • 13

                        #12
                        The problem is that the log does not really contain anything useful. For example, for the disk ID of a SAN:

                        OID: 3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19

                        Code:
                        7033:20190213:065047.619 snmp:[oid:'1.3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19' community:'{$SNMP_COMMUNITY}' oid_type:0]
                        7033:20190213:065047.619 snmpv3:[securityname:'' authpassphrase:'' privpassphrase:'']
                        7033:20190213:065047.619 snmpv3:[contextname:'' securitylevel:0 authprotocol:0 privprotocol:0]
                        7033:20190213:065047.619 itemid:100100000079866 hostid:100100000010357 key:'scDiskID[20]'
                        7033:20190213:065047.619 type:4 value_type:3
                        7033:20190213:065047.619 interfaceid:100100000000200 port:''
                        7033:20190213:065047.619 state:0 error:''
                        7033:20190213:065047.619 flags:4 status:0
                        7033:20190213:065047.619 valuemapid:0
                        7033:20190213:065047.619 lastlogsize:0 mtime:0
                        7033:20190213:065047.619 delay:'1h' nextcheck:1550039800 lastclock:0
                        7033:20190213:065047.619 data_expected_from:1550037047
                        7033:20190213:065047.619 history:1
                        7033:20190213:065047.619 poller_type:0 location:1
                        7033:20190213:065047.619 inventory_link:0
                        7033:20190213:065047.619 priority:1 schedulable:1
                        7033:20190213:065047.619 units:'' trends:1
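The delay, nextcheck and data_expected_from fields in debug output like the above are plain Unix timestamps; a small sketch for decoding them when reading such logs:

```python
from datetime import datetime, timezone

def decode(ts):
    """Render a Zabbix epoch field as a human-readable UTC time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

# Fields taken from the log lines above:
print(decode(1550039800))  # nextcheck:          2019-02-13 06:36:40 UTC
print(decode(1550037047))  # data_expected_from: 2019-02-13 05:50:47 UTC
```

Decoding like this makes it easy to compare the scheduled nextcheck against what (if anything) the poller actually did at that time.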


                        Nevertheless, the item continues to be displayed as not available. What should I look for in the logfile? At LogLevel 5 a great deal is logged.


                        There are really no abnormalities in it. With this SAN in particular it also happens only very rarely that it temporarily does not respond.


                        • rglade
                          Junior Member
                          • Jan 2017
                          • 13

                          #13
                          I have now again spent a lot of time analyzing the log data and researching the cause. I can say the following:
                          • Only items that have a longer update interval (for example 1h) are affected
                          • No query for the affected data can be found in the log data
                          • All servers have the same time and time zone, synchronized by NTP
                          • All affected devices answer other SNMP queries without any problems (for example the queries at the 1m interval)
                          • It is also interesting that Zabbix correctly queries the discovery rules, which partly use the same OIDs. It seems to me that Zabbix simply ignores the queries with longer intervals.
                          Others report something similar - that the queue fills up. Maybe there is a design error?

                          However, the log file itself is difficult to interpret, as the correlations are hard to recognize. Maybe you have an idea?


                          • rglade
                            Junior Member
                            • Jan 2017
                            • 13

                            #14

                            I could find one thing after all. It seems Zabbix sometimes cuts off the first character of the SNMP OID?
                            [Image: zabbix2_missentries.png]

                            In the item configuration, the correct OID is configured!
                            Last edited by rglade; 18-03-2019, 14:12.
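If the first character of the OID really is being dropped somewhere, a cheap way to spot affected items is to sanity-check every configured OID against the usual iso(1) root. A standalone sketch (the validity criterion is an assumption for this situation: SNMP item OIDs in practice start with 1.3.6.1, although other OID roots technically exist):

```python
def oid_is_valid(oid):
    """Accept numeric OIDs rooted at 1.3.6.1 (iso.org.dod.internet),
    with or without a leading dot. Anything else is suspicious here."""
    return oid.lstrip(".").startswith("1.3.6.1.")

# The truncated OID quoted in this thread fails, the configured one passes:
print(oid_is_valid("3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19"))    # False
print(oid_is_valid("1.3.6.1.4.1.674.11000.2000.500.1.2.14.1.2.19"))  # True
```

Running such a check over an item export would quickly show whether the truncation lives in the configuration or only in what the poller sends.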


                            • rglade
                              Junior Member
                              • Jan 2017
                              • 13

                              #15

                              I did the work to understand the problem more deeply and to reproduce it in a test.

                              - Installation 2.4 -> no problems, even without a proxy
                              - Update 3.2 / 3.4 -> Problems, but with a proxy solvable
                              - Update 4.0 -> Problems, but with two proxies solvable

                              In fact, version 2.x was apparently not prone to this problem, and only with version 3.2 does it seem to become a problem when many SNMP devices need to be polled. The problem grew with the increasing number of data points. It also seems that with version 4 the problem is even more extensive. Unfortunately, this statement is very daring, since I already found larger problems with the queue filling up in version 3.4. Contrary to all assumptions, the performance of one Zabbix proxy was no longer enough. We could only fix the problem by installing another proxy for the switches.

