Problems to reduce queue in Zabbix agent (active) and SNMP
  • hugo.jose
    Junior Member · Joined Jul 2018 · 10 posts

    #1


    Hi

    We have a Zabbix 3.0 installation with 28 proxies. We have read the documentation and the forums and made several adjustments to the config files, but we can't reduce the values in the queue.
    Can someone give us a hint as to what might be wrong? Is some value too high or too low?

    Our environment is:

    Number of hosts (enabled/disabled/templates): 4305 (2820 / 1283 / 202)
    Number of items (enabled/disabled/not supported): 382524 (321137 / 47813 / 13574)
    Number of triggers (enabled/disabled [problem/ok]): 91399 (47517 / 43882 [230 / 47287])
    Number of users (online): 222 (33)
    Required server performance, new values per second: 1493.58

    Details are in the attachment (posting them inline gives an error).

    Best Regards,
    HJ
  • LenR
    Senior Member · Joined Sep 2009 · 1005 posts

    #2
    How do the queues look per proxy? Are the problems spread across proxies, or isolated to just a few?

    Have you looked at your MySQL write performance?
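
    For a rough first look, a few stock InnoDB status counters indicate write pressure (these are standard MySQL status variables; what counts as "high" depends entirely on your hardware):

    Code:
    -- rough write-pressure indicators from stock MySQL/InnoDB
    SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';           -- total write operations so far
    SHOW GLOBAL STATUS LIKE 'Innodb_os_log_waits';          -- waits for redo log flushes; steady growth hints at slow disk
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free'; -- waits for free pages; nonzero means real write pressure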

    • Linwood
      Senior Member · Joined Dec 2013 · 398 posts

      #3
      Lots of people will help you figure out how to process that much data faster, but I'd offer the observation that you should question whether you have too much data in the first place.

      I'm a fan of looking closely at the data you are collecting, to check that you actually need all of it and need it at the frequency it is collected. Out-of-the-box templates (not only Zabbix's stock ones, but those shared by others as well) often take the conservative approach of collecting everything, fast. Some specifics:

      - LLD of interfaces often yields monitoring of interfaces that have no actionable purpose (by which I mean you will never need to take action based on their state). Microsoft creates a slew of interfaces that no one ever uses directly; Cisco often does as well. You might not want loopback, for example. Each server with one NIC may have half a dozen 'interfaces' when you really only want the NIC. So build a filter on the LLD rule to exclude the types you know you don't want (e.g. a condition that the interface-name macro, such as {#IFDESCR}, does not match something like ^(lo|isatap|Teredo)); as you do, you also eliminate the slew of items associated with each excluded interface.

      - Look at which items produce the most values in a given time period, and question hard whether their polling interval is really needed. I often see people polling things every 30 seconds or even faster. Yes, some have a legitimate need, but most of the time the real question is "what will you do with data that fast?". If you got an alert 30 seconds after a system went down, how long would it take for someone to notice the alert and react? People time is often much, much slower. Reconsider very fast polls and ask whether you can slow them down. Especially consider slowing down the more static entries, e.g. admin state on interfaces, but also things like interface error counts -- not to say they are not important, but mostly they are important as a trend, so capturing interface counters every 5 minutes may be just as good as every 1 minute.

      - Consider whether some of the data you collect even matters. Interface counters are a great example: some interfaces give a ton of data, multicast vs. unicast, even size breakdowns. Ask yourself whether you will use it; do you consider it actionable? At a typical site with 100 switches averaging 24 ports, eliminating just the multicast and unicast counts (in favor of the also-present total octets) removes 100 * 24 * 4 = 9600 items entirely. If you were collecting them every 60 seconds, there went 160 values per second from your total. (A query like the sketch just after this list can count how many such items you have.)
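
      As a minimal sketch of that count, assuming PostgreSQL as in the query further down (the key patterns are illustrative; match them to whatever your templates actually use):

      Code:
      -- count active multicast/unicast counter items by key (key patterns are examples only)
      SELECT i.key_, count(*) AS items
      FROM items i
      JOIN hosts h ON h.hostid = i.hostid AND h.status = 0  -- enabled hosts only
      WHERE i.status = 0                                    -- active items only
        AND (i.key_ ILIKE '%multicast%' OR i.key_ ILIKE '%ucastpkts%')
      GROUP BY i.key_
      ORDER BY items DESC;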

      The above kind of culling is best aimed at the most frequent items. Query the database to find what's taking up the majority of the history tables, then follow that back to the templated items causing it. For everything near the top of the list, ask "is this actionable?" and, even if so, "do I need it updated this rapidly?". When I did this type of culling I dropped my items per second by a factor of about 4, if memory serves. It dropped even further over time as I fine-tuned templates.
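
      As a starting point, something like this samples which items wrote the most rows to the float history table in the last hour (PostgreSQL again; repeat for history_uint and the other history_* tables, since different value types land in different tables):

      Code:
      -- which items wrote the most float-history rows in the last hour
      SELECT i.itemid, i.key_, count(*) AS rows_last_hour
      FROM history h
      JOIN items i ON i.itemid = h.itemid
      WHERE h.clock > extract(epoch FROM now() - interval '1 hour')
      GROUP BY i.itemid, i.key_
      ORDER BY rows_last_hour DESC
      LIMIT 20;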

      If it's useful to you (and if you use PostgreSQL; sorry if not), here's a SQL query that attempts to estimate the original templated item names and their impact on your items per second. It makes a number of estimations, and notably it cannot really account for traps, but it may give you a starting point if you take a data-driven approach to culling:

      Code:
       select count(*) as items, sum(x.qps) as qps, x.delay, x.TemplateHost, x.Key, x.ItemName
              from
              (
                      SELECT lef_delay_to_seconds(i.delay) as delay,
                             case when lef_delay_to_seconds(i.delay) > 0
                                  then 1 / cast(lef_delay_to_seconds(i.delay) as decimal(20,10))
                                  else null end as qps,
                             coalesce(th.name, h.name) as TemplateHost, coalesce(ih.key_, i.key_) as Key, coalesce(ih.name, i.name) as ItemName
                      FROM items i
                      inner join hosts h on i.hostid=h.hostid and h.status=0        -- require the host to be enabled
                      inner join item_discovery id on id.itemid=i.itemid            -- links the discovered item to its prototype
                      inner join items ih on ih.itemid=id.parent_itemid             -- the item prototype itself (in case we need the name)
                      inner join item_discovery id2 on id2.itemid=id.parent_itemid  -- links the prototype to the discovery rule on the host
                      inner join items drh on drh.itemid=id2.parent_itemid          -- the discovery rule on the host
                      left join items ti on ti.itemid=drh.templateid                -- the discovery rule's templated parent, if any
                      left join hosts th on th.hostid=ti.hostid
                      WHERE i.status=0  -- require the item to be active
                        AND i.flags=4   -- LLD-discovered items only
              union all
                      SELECT lef_delay_to_seconds(i.delay) as delay,
                             case when lef_delay_to_seconds(i.delay) > 0
                                  then 1 / cast(lef_delay_to_seconds(i.delay) as decimal(20,10))
                                  else null end as qps,
                             coalesce(th.name, h.name) as TemplateHost, coalesce(ti.key_, i.key_) as Key, coalesce(ti.name, i.name) as ItemName
                      FROM items i
                      inner join hosts h on i.hostid=h.hostid and h.status=0  -- require the host to be enabled
                      left join items ti on ti.itemid=i.templateid            -- the templated parent item, if any
                      left join hosts th on th.hostid=ti.hostid
                      WHERE i.status=0        -- require the item to be active
                        AND i.flags in (0,1)  -- plain items and LLD discovery rules themselves
              ) x
              where x.delay<>0  -- exclude trapper items, which have no polling interval
              group by x.delay, x.TemplateHost, x.Key, x.ItemName
              order by 2 desc;
      It requires a helper function to handle the new delay format (the s/m/h/d/w suffixes) and normalize it to seconds:

      Code:
      -- FUNCTION: public.lef_delay_to_seconds(text)
      
      -- DROP FUNCTION public.lef_delay_to_seconds(text);
      
      CREATE OR REPLACE FUNCTION public.lef_delay_to_seconds(
          delay text)
          RETURNS integer
          LANGUAGE 'plpgsql'
          COST 100
          VOLATILE 
      AS $BODY$
      
      begin
        return 
           case when delay is null then null
                when delay = '' then null 
                else 
                   case right(delay,1)
                        when 's' then cast(left(delay,length(delay)-1) as int)
                        when 'm' then cast(left(delay,length(delay)-1) as int) * 60
                        when 'h' then cast(left(delay,length(delay)-1) as int) * 60 * 60 
                        when 'd' then cast(left(delay,length(delay)-1) as int) * 24 * 60 * 60 
                        when 'w' then cast(left(delay,length(delay)-1) as int) * 24 * 60 * 60 * 7
                         else cast(delay as int)  -- bare number, already in seconds
                   end
            end;
      end;
      
      $BODY$;
      
      ALTER FUNCTION public.lef_delay_to_seconds(text)
          OWNER TO zabbix;
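
      A quick sanity check of the helper (the literals are just illustrative):

      Code:
      select lef_delay_to_seconds('30s') as s30,   -- 30
             lef_delay_to_seconds('5m')  as m5,    -- 300
             lef_delay_to_seconds('1h')  as h1,    -- 3600
             lef_delay_to_seconds('60')  as plain; -- 60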
      Again, you may want to adjust lots of things in the core query to your own preferences and how it estimates. On mine, for example, it finds that pings generate about 4 times as many polls as the next highest entry, which is operational status on interfaces (something I need to reconsider, since I have triggers turned off on many of those). I also just noticed that a new template is way up near the top: when I was testing it, I had it poll alerts on a microwave device every 30 seconds, whereas every few minutes is more than adequate. Mistakes like that, from speeding things up to test and then forgetting, can make a real mess on a system where a lot of template development occurs.

      Anyway.... I'm a big fan of reducing the source of massive volume as a first step. Sure, tune the database, but reduce the flow first.

      Always ask "is this data actionable".


      • kloczek
        Senior Member · Joined Jun 2006 · 1771 posts

        #4
        Your poller utilisation is quite high, which suggests you are still using passive agents and passive item types, and/or passive proxies.
        Pollers are also used to calculate the values of calculated items, so another question is how many items of that kind you have.
        A busy ICMP pinger and a few other busy processes suggest that you are monitoring some hosts directly from the server (which is bad from the point of view of scalability and HA).
        With ~1.5k NVPS and the housekeeper running for 10+ minutes, you probably have at least the history* tables partitioned already; if not, running deletes against tables that large takes so long that it is more or less a waste of IOs.
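
        For reference, a minimal sketch of what range partitioning of the float history table can look like, assuming PostgreSQL 10+ declarative partitioning (Zabbix itself ships no partitioning DDL; the table and partition names here are illustrative, and existing data would have to be migrated into the new structure):

        Code:
        -- illustrative only: a partitioned replacement for the float history table
        CREATE TABLE history_part (
            itemid bigint        NOT NULL,
            clock  integer       DEFAULT 0 NOT NULL,
            value  numeric(16,4) DEFAULT 0.0000 NOT NULL,
            ns     integer       DEFAULT 0 NOT NULL
        ) PARTITION BY RANGE (clock);

        -- one partition per month; bounds are epoch seconds (2018-07-01 to 2018-08-01 UTC)
        CREATE TABLE history_2018_07 PARTITION OF history_part
            FOR VALUES FROM (1530403200) TO (1533081600);

        -- retention then becomes an instant DROP instead of the housekeeper's slow row-by-row deletes
        DROP TABLE history_2018_07;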

        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
        https://kloczek.wordpress.com/
        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
        My zabbix templates https://github.com/kloczek/zabbix-templates
