Zabbix Queue - Host Item Rechecking

  • nobody
    Junior Member
    • Jul 2013
    • 17

    #1

    Hi Zabbix Forums!

    I have recently upgraded from 2.4.5 to 3.0.7 LTS and am having some issues. We previously had lots of database corruption caused by a multitude of disk failures, and somehow Zabbix 2.4.5 still ran great. We needed more features and a more intuitive interface, and wanted to take advantage of the performance enhancements in your latest software. So far 3.0.7 seems to be stellar.

    My Administration > Queue is filling up with items that Zabbix can't get (they don't exist on the devices), which takes SNMP pollers away from other items that could be using them.
    My SNMP pollers are currently set to 750.

    Zabbix server is running: Yes (XX.YY.ZZ.16:10051)
    Number of hosts (enabled/disabled/templates): 560 (431 / 41 / 88)
    Number of items (enabled/disabled/not supported): 54047 (20659 / 10 / 33378)
    Number of triggers (enabled/disabled [problem/ok]): 5189 (5173 / 16 [110 / 5063])
    Number of users (online): 9 (2)
    Required server performance, new values per second: 647.01
    Problem: Hosts that lack specific items are being checked for them too frequently. Which configuration option do I need to change so that such items are only rechecked every X seconds or Y minutes?

    For example:

    2017-01-10 15:32:42 21s SwitchNameHere cswSwitchState1004
    2017-01-10 15:34:23 19s SomeOtherSwitch9 cswSwitchState1002
    2017-01-10 15:34:24 18s AnotherPlace16 cswSwitchState1001
    2017-01-10 15:34:24 18s AFarOffGalaxy cswSwitchState1005
    2017-01-10 15:34:24 18s WhereAJedi cswSwitchState1002
    2017-01-10 15:34:24 18s UsedToLive cswSwitchState1001
    2017-01-10 15:34:24 18s ByTheName cswSwitchState1002
    2017-01-10 15:34:24 18s OfNobody cswSwitchState1004
    2017-01-10 15:34:24 18s SkyWalker cswSwitchState1006
    ...
    Truncated (500)...
    ...
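A rough way to see which item names dominate a queue dump like the one above is to tally the rows. This is only a sketch: it assumes the four columns are scheduled check time, delay, host and item name, and the helper names `parse_queue_line` and `summarize` are made up for illustration, not part of Zabbix.

```python
from collections import Counter

# Rows copied from the queue listing above.
# Assumed columns: scheduled check time, delay, host, item name.
QUEUE_SAMPLE = """\
2017-01-10 15:32:42 21s SwitchNameHere cswSwitchState1004
2017-01-10 15:34:23 19s SomeOtherSwitch9 cswSwitchState1002
2017-01-10 15:34:24 18s AnotherPlace16 cswSwitchState1001
2017-01-10 15:34:24 18s AFarOffGalaxy cswSwitchState1005
2017-01-10 15:34:24 18s WhereAJedi cswSwitchState1002
2017-01-10 15:34:24 18s UsedToLive cswSwitchState1001
2017-01-10 15:34:24 18s ByTheName cswSwitchState1002
2017-01-10 15:34:24 18s OfNobody cswSwitchState1004
2017-01-10 15:34:24 18s SkyWalker cswSwitchState1006
"""


def parse_queue_line(line):
    """Split one queue row into (timestamp, delay_seconds, host, item)."""
    date, time, delay, host, item = line.split()
    return f"{date} {time}", int(delay.rstrip("s")), host, item


def summarize(text):
    """Count queued rows per item name and track the longest delay seen."""
    per_item = Counter()
    longest = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _, delay, _, item = parse_queue_line(line)
        per_item[item] += 1
        longest = max(longest, delay)
    return per_item, longest


per_item, longest = summarize(QUEUE_SAMPLE)
for item, count in per_item.most_common():
    print(f"{item}: {count} queued")
print(f"longest delay: {longest}s")
```

If a handful of item names account for most of the queue, those are the ones worth handling differently in the template.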
    Some of our Cisco switches don't have multiple logical members, so Zabbix should set the corresponding items as disabled and forget about them for some time before trying to check again.

    We commonly stack switches together to make logical stacks, however not all of our deployments have multiple switches. Unfortunately I'm not in the position to make templates custom tailored to every specific installation that we have. Sometimes we have up to 6 switches stacked together of the same vendor type, other times they're deployed as single access switches in locations that don't need more than 48 ports or so.

    Our network is vast, so if for some reason a portion of our network goes down, I don't want the items to all become disabled on all of the affected hosts and have to re-enable them manually.

    I look forward to your suggestions!

    Regards,
    Nobody
    Last edited by nobody; 10-01-2017, 23:50.
  • nobody
    Junior Member
    • Jul 2013
    • 17

    #2
    Hi Moderators,

    Please move this post into the "Zabbix Help" forum. I didn't see that "Large environments" means 2K+ hosts (and at least 1K+ values per second).

    We definitely have only half of that.

    Sorry for my mistake!

    I hope someone has some insight as to how to reduce items in the queue.

    Regards,
    Nobody

    • nobody
      Junior Member
      • Jul 2013
      • 17

      #3
      Hi Zabbix Forums,

      Does no one have any insight into this issue? Load on the server is not an issue for disk, CPU, or RAM.

      I can't post any attachments because the forums don't let me. The Zabbix busy poller is 80% busy all of the time looking for SNMP values that don't exist on devices. Removing these items from the templates is not an option, unfortunately. How can I reduce the busy poller % so that I stop losing data in my graphs?

      Regards,
      Nobody

      • Pada
        Senior Member
        • Apr 2012
        • 236

        #4
        Hi,

        Is there no way that you can use LLD (low level discovery) to only add the necessary/available items instead?

        Ideally the "unsupported" items should be handled better by Zabbix and not affect the monitoring of available items as much...
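To make the LLD idea concrete: a discovery rule returns a JSON list of the entities that actually exist on a host, and item prototypes are then created only for those. Below is a minimal sketch of building such a list, assuming the `{"data": [...]}` wrapper used by Zabbix 3.0 discovery rules. The macro name `{#MEMBER}` and the function `build_lld_payload` are hypothetical, and how the member numbers get collected (for example via an SNMP walk) is left out.

```python
import json


def build_lld_payload(member_numbers):
    """Return LLD JSON listing only the stack members found on one host.

    `member_numbers` is whatever the discovery step found, e.g. [1001, 1002].
    `{#MEMBER}` is a hypothetical macro name chosen for this sketch.
    """
    rows = [{"{#MEMBER}": str(number)} for number in sorted(member_numbers)]
    return json.dumps({"data": rows})


# A single access switch reports one member, a stack reports several.
print(build_lld_payload([1001]))
print(build_lld_payload([1002, 1001, 1004]))
```

With something like this in place, a single switch would only get one item from the prototype, so there would be nothing left to poll for members that don't exist.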

        • nobody
          Junior Member
          • Jul 2013
          • 17

          #5
          Hi Pada,

          I appreciate your response. I haven't used discovery or LLD within Zabbix before. I manually create all of my hosts. We have quite a vast network with lots of different devices. I only anticipate this issue becoming much worse as our switches become increasingly modular.

          I'll start looking into LLD to see if I can better optimize Zabbix, but in the meantime do you have any other suggestions (aside from disabling the host items)? Perhaps I can change some server options or something to better manage the unreachable poller?

          Code:
          ### Option: StartPollersUnreachable
          #       Number of pre-forked instances of pollers for unreachable hosts (including IPMI).
          # 
          # Mandatory: no
          # Range: 0-1000
          # Default:
          StartPollersUnreachable=500
          
          ### Option: UnreachablePeriod
          #       After how many seconds of unreachability treat a host as unavailable.
          # 
          # Mandatory: no
          # Range: 1-3600
          # Default:
          UnreachablePeriod=15
          
          ### Option: UnavailableDelay
          #       How often host is checked for availability during the unavailability period, in seconds.
          # 
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # UnavailableDelay=60
          
          ### Option: UnreachableDelay
          #       How often host is checked for availability during the unreachability period, in seconds.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          UnreachableDelay=15

          Zabbix internal process busy %: all values average less than 1% (usually 0.025% to 0.500%).

          Zabbix data gathering process busy %: most items are at about 5% or lower, with the exception of "Zabbix busy unreachable poller process, in %", which averages 71% and occasionally hits about 90% busy.

          Swap is not an issue, server swappiness is set to "last ditch effort", and will only swap out if system is out of ram.

          • Pada
            Senior Member
            • Apr 2012
            • 236

            #6
            What is your "StartPollers" value? I believe it defaults to "5", which could be the reason why you're having issues.

            You should only bump up "StartPollersUnreachable" when you have lots of hosts that are considered unreachable, and not when you have items that became "unsupported" - or at least that is my understanding of it.
            In our environment we bumped it from the default value of 1 to 5.

            Another thing that you can try, is to monitor the hosts from various Zabbix Proxy servers, so that they can split the load.
            For example have Zabbix proxy server per /24 subnet that you have or any other kind of logical/physical grouping that would make sense.

            For instance, our Zabbix server is currently not fetching any data itself, since our proxy servers do that and then push the results to the server.
            For example, our "StartPollers" is set to "70" for our big subnet, which has lots of SNMP or (passive) Zabbix agents.
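As a planning aid for the one-proxy-per-/24 idea, the host list can be grouped by subnet first to see how many proxies that would mean and how many hosts each would carry. This is just a sketch using Python's standard `ipaddress` module. The host names and addresses are invented, and `group_by_subnet` is not a Zabbix function.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(host_ips, prefix_len=24):
    """Map each subnet (as a string) to the list of hosts that fall inside it."""
    groups = defaultdict(list)
    for host, ip in host_ips.items():
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(host)
    return dict(groups)


# Invented example inventory: host name -> management IP
hosts = {
    "switch-a": "10.0.1.10",
    "switch-b": "10.0.1.20",
    "radio-c": "10.0.2.5",
}

for subnet, members in sorted(group_by_subnet(hosts).items()):
    print(f"{subnet}: {len(members)} host(s) -> {', '.join(members)}")
```

A different `prefix_len`, or grouping by site instead of subnet, may fit better depending on how the network is laid out.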

            Update/Edit:
            Are you perhaps monitoring your devices over a high latency network connection (eg. cross continent)? Zabbix proxies are very helpful in cross-continent monitoring scenarios, especially when the links flap or go down.
            I'm not sure if increasing "StartPollers" would result in fewer hosts being regarded as unreachable, and therefore fewer unreachable pollers being required...
            Last edited by Pada; 19-01-2017, 01:11. Reason: I missed the part where your unreachable poller was 71% busy on avg

            • nobody
              Junior Member
              • Jul 2013
              • 17

              #7
              Hi Pada,

              Thanks again for your attention to this matter.

              These are the core settings that I have:


              Code:
              ### Option: StartPollers
              #	Number of pre-forked instances of pollers.
              #
              # Mandatory: no
              # Range: 0-1000
              # Default:
              StartPollers=550
              
              ### Option: StartPingers
              #	Number of pre-forked instances of ICMP pingers.
              #
              # Mandatory: no
              # Range: 0-1000
              # Default:
              StartPingers=750

              After increasing StartPollersUnreachable from 400 to 500 yesterday I saw a ~20% decrease in "Zabbix busy unreachable poller process, in %", so I think I'm on the right track.

              My StartPollers value hasn't changed much since the upgrade from 2.4.5 (I think it was 500 before and I changed it to 550); I didn't see any noticeable improvement with the 50 extra pollers. I don't think there's an issue with MySQL anymore: there have been no database crashes since I trashed my old Zabbix database (just renamed it). I'm using the same MySQL settings, apart from a smaller innodb_buffer_pool_size. I changed it from 12GB to 8GB prior to this, and it seems to have helped overall system performance.

              The "Zabbix busy poller process, in %" is never more than 20% busy in spikes, and usually averages about 5%. I don't think it's an issue with the amount of StartPollers that I have running.

              Our network is relatively low latency: most of our equipment is less than ~20ms away from our primary data center (where the Zabbix server is located), and the vast majority of it is about 5ms away, which is really good (with bursts up to maybe 10-15ms during peak times). All of my hosts are monitored with ICMP (as well as advanced SNMP-based triggers). Nothing is outside of the province/state that we operate in [thus far].

              I have always been interested in using proxies, but have not yet had a chance to deploy them (the feature was removed from one release some time ago). This would certainly help reduce network inconsistencies, but at this point the number of datacenters that we have is small (4), with no more than about 50 dedicated servers (plus maybe around 200 VMs, for which the Zabbix trappers are set as active, so they put little load on the dedicated Zabbix server). The main components of our infrastructure are locked-up appliances that support SNMP only. Bulk requests aren't supported on most of our appliances and cause lots of gaps in graphs. Our IP infrastructure is large and all-encompassing, with a vast mixed-vendor range of deployed networking equipment (servers, telecom radios, backhaul microwave links, LTE equipment, etc.).

              One major improvement we made was to move Zabbix onto its own dedicated StupidMicro server and give it 16GB of RAM. We found with the VMs that disk queuing was causing serious issues with the database.

              I appreciate your insight, Pada. If I have any more improvements I will be sure to post them here.

              Regards,
              James

              • Pada
                Senior Member
                • Apr 2012
                • 236

                #8
                Perhaps changing the "Refresh unsupported items (in sec)" setting from the default 600s to something like 3600s (or more, or perhaps even setting it to 0) may improve your situation - see https://www.zabbix.com/documentation...ration/general
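To get a feel for how much that setting could matter, here is some back-of-the-envelope arithmetic using the 33378 "not supported" items from the status report in post #1. It is a simplification: it assumes every unsupported item is retried exactly once per refresh interval and that the retries are spread evenly, which the real scheduler may not do.

```python
# "not supported" item count taken from the status report in post #1
UNSUPPORTED_ITEMS = 33378


def rechecks_per_second(items, refresh_seconds):
    """Average retry rate if each unsupported item is retried once per interval."""
    return items / refresh_seconds


for interval in (600, 3600):
    rate = rechecks_per_second(UNSUPPORTED_ITEMS, interval)
    print(f"{interval:>5}s refresh: about {rate:.1f} retries per second")
```

Under those assumptions, going from 600s to 3600s cuts the retry rate to one sixth, which is sizeable next to the 647.01 required new values per second reported for the server.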

                If the majority of your items in the "unsupported" state are there because those SNMP items don't exist (and never will) on that particular host, say because you applied a template for a 32-port switch to an 8-port switch, then I reckon you'll benefit a lot from having LLD detect and monitor only the correct number of items for each host.

                "Our network is vast, so if for some reason a portion of our network goes down, I don't want the items to all become disabled on all of the affected hosts and have to re-enable them manually."
                If it goes down due to the link between your Zabbix server and that particular data center, then having a Zabbix proxy at that data center to cache/buffer the results would also help. When you have a local Zabbix proxy in that data center, then the hosts should all still remain in a reachable state, which implies that you'll not have the need for such a high StartPollersUnreachable any more. Also, the proxy has its own unreachable poller processes.

                For example, our Zabbix proxy caches results for 1h when the link between our on-premises Zabbix server and Amazon EC2 goes down, and when it comes back online, the Zabbix server gets all of the data back, except for calculated items (or at least this is the case with Zabbix 1.8).

                There is one big caveat with the proxy: if you are sending alerts when hosts become unreachable, you'll now have to make the triggers of the hosts behind the proxy depend on a trigger that indicates whether the proxy server is reachable. If you already have trigger dependencies like that built into multiple templates, then I suppose this wouldn't be too difficult to apply either.
