Problems to reduce queue in Zabbix agent (active) and SNMP
  • hugo.jose
    Junior Member · Joined Jul 2018 · 10 posts

    #1


    Hi

    We have a Zabbix 3.0 installation with 28 proxies. We have read the documentation and the forums and made several adjustments to the config files, but we can't reduce the values in the queue.
    Can someone give us a hint as to what might be wrong? Is some value too high or too low?

    Our environment is:

    Number of hosts (enabled/disabled/templates): 4305 (2820 / 1283 / 202)
    Number of items (enabled/disabled/not supported): 382524 (321137 / 47813 / 13574)
    Number of triggers (enabled/disabled [problem/ok]): 91399 (47517 / 43882 [230 / 47287])
    Number of users (online): 222 (33)
    Required server performance, new values per second: 1493.58

    Details are in the attachment (posting them inline gives an error).

    Best Regards,
    HJ
  • LenR
    Senior Member · Joined Sep 2009 · 1005 posts

    #2
    How do the queues look per proxy? Are the problems spread across proxies, or isolated to just a few?

    Have you looked at your MySQL write performance?
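
    For a rough first look, a few stock InnoDB status counters indicate write pressure (these are standard MySQL status variables; what counts as "high" depends entirely on your hardware):

    Code:
    -- rough write-pressure indicators from stock MySQL/InnoDB
    SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';           -- total write operations so far
    SHOW GLOBAL STATUS LIKE 'Innodb_os_log_waits';          -- waits for redo log flushes; steady growth hints at slow disk
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free'; -- waits for free pages; nonzero means real write pressure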

    • Linwood
      Senior Member · Joined Dec 2013 · 398 posts

      #3
      Lots of people will help you figure out how to process that much data faster, but I'd offer the observation that you should question whether you have too much data in the first place.

      I'm a fan of looking closely at the data you are collecting, to check that you actually need all of it and need it at the frequency it is collected. Out-of-the-box templates (not only Zabbix's stock ones, but those shared by others as well) often take the conservative approach of collecting everything, fast. Some specifics:

      - LLD of interfaces often yields monitoring of interfaces that have no actionable purpose (by which I mean you will never need to take action based on their state). Microsoft creates a slew of interfaces that no one ever uses directly; Cisco often does as well. You might not want loopback, for example. Each server with one NIC may have half a dozen 'interfaces' when you really only want the NIC. So build a filter on the LLD rule to exclude the types you know you don't want (e.g. a condition that the interface-name macro, such as {#IFDESCR}, does not match something like ^(lo|isatap|Teredo)); as you do, you also eliminate the slew of items associated with each excluded interface.

      - Look at which items produce the most values in a given time period, and question hard whether their polling interval is really needed. I often see people polling things every 30 seconds or even faster. Yes, some have a legitimate need, but most of the time the real question is "what will you do with data that fast?". If you got an alert 30 seconds after a system went down, how long would it take for someone to notice the alert and react? People time is often much, much slower. Reconsider very fast polls and ask whether you can slow them down. Especially consider slowing down the more static entries, e.g. admin state on interfaces, but also things like interface error counts -- not to say they are not important, but mostly they are important as a trend, so capturing interface counters every 5 minutes may be just as good as every 1 minute.

      - Consider whether some of the data you collect even matters. Interface counters are a great example: some interfaces give a ton of data, multicast vs. unicast, even size breakdowns. Ask yourself whether you will use it; do you consider it actionable? At a typical site with 100 switches averaging 24 ports, eliminating just the multicast and unicast counts (in favor of the also-present total octets) removes 100 * 24 * 4 = 9600 items entirely. If you were collecting them every 60 seconds, there went 160 values per second from your total. (A query like the sketch just after this list can count how many such items you have.)
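
      As a minimal sketch of that count, assuming PostgreSQL as in the query further down (the key patterns are illustrative; match them to whatever your templates actually use):

      Code:
      -- count active multicast/unicast counter items by key (key patterns are examples only)
      SELECT i.key_, count(*) AS items
      FROM items i
      JOIN hosts h ON h.hostid = i.hostid AND h.status = 0  -- enabled hosts only
      WHERE i.status = 0                                    -- active items only
        AND (i.key_ ILIKE '%multicast%' OR i.key_ ILIKE '%ucastpkts%')
      GROUP BY i.key_
      ORDER BY items DESC;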

      The above kind of culling is best aimed at the most frequent items. Query the database to find what's taking up the majority of the history tables, then follow that back to the templated items causing it. For everything near the top of the list, ask "is this actionable?" and, even if so, "do I need it updated this rapidly?". When I did this type of culling I dropped my items per second by a factor of about 4, if memory serves. It dropped even further over time as I fine-tuned templates.
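
      As a starting point, something like this samples which items wrote the most rows to the float history table in the last hour (PostgreSQL again; repeat for history_uint and the other history_* tables, since different value types land in different tables):

      Code:
      -- which items wrote the most float-history rows in the last hour
      SELECT i.itemid, i.key_, count(*) AS rows_last_hour
      FROM history h
      JOIN items i ON i.itemid = h.itemid
      WHERE h.clock > extract(epoch FROM now() - interval '1 hour')
      GROUP BY i.itemid, i.key_
      ORDER BY rows_last_hour DESC
      LIMIT 20;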

      If it's useful to you (and if you use PostgreSQL; sorry if not), here's a SQL query that attempts to estimate the original templated item names and their impact on your items per second. It makes a number of estimations, and notably it cannot really account for traps, but it may give you a starting point if you take a data-driven approach to culling:

      Code:
       select count(*) as items, sum(x.qps) as qps, x.delay, x.TemplateHost, x.Key, x.ItemName
              from
              (
                      SELECT lef_delay_to_seconds(i.delay) as delay,
                             case when lef_delay_to_seconds(i.delay) > 0
                                  then 1 / cast(lef_delay_to_seconds(i.delay) as decimal(20,10))
                                  else null end as qps,
                             coalesce(th.name, h.name) as TemplateHost, coalesce(ih.key_, i.key_) as Key, coalesce(ih.name, i.name) as ItemName
                      FROM items i
                      inner join hosts h on i.hostid=h.hostid and h.status=0        -- require the host to be enabled
                      inner join item_discovery id on id.itemid=i.itemid            -- links the discovered item to its prototype
                      inner join items ih on ih.itemid=id.parent_itemid             -- the item prototype itself (in case we need the name)
                      inner join item_discovery id2 on id2.itemid=id.parent_itemid  -- links the prototype to the discovery rule on the host
                      inner join items drh on drh.itemid=id2.parent_itemid          -- the discovery rule on the host
                      left join items ti on ti.itemid=drh.templateid                -- the discovery rule's templated parent, if any
                      left join hosts th on th.hostid=ti.hostid
                      WHERE i.status=0  -- require the item to be active
                        AND i.flags=4   -- LLD-discovered items only
              union all
                      SELECT lef_delay_to_seconds(i.delay) as delay,
                             case when lef_delay_to_seconds(i.delay) > 0
                                  then 1 / cast(lef_delay_to_seconds(i.delay) as decimal(20,10))
                                  else null end as qps,
                             coalesce(th.name, h.name) as TemplateHost, coalesce(ti.key_, i.key_) as Key, coalesce(ti.name, i.name) as ItemName
                      FROM items i
                      inner join hosts h on i.hostid=h.hostid and h.status=0  -- require the host to be enabled
                      left join items ti on ti.itemid=i.templateid            -- the templated parent item, if any
                      left join hosts th on th.hostid=ti.hostid
                      WHERE i.status=0        -- require the item to be active
                        AND i.flags in (0,1)  -- plain items and LLD discovery rules themselves
              ) x
              where x.delay<>0  -- exclude trapper items, which have no polling interval
              group by x.delay, x.TemplateHost, x.Key, x.ItemName
              order by 2 desc;
      It requires a helper function to handle the new delay format (the s/m/h/d/w suffixes) and normalize it to seconds:

      Code:
      -- FUNCTION: public.lef_delay_to_seconds(text)
      
      -- DROP FUNCTION public.lef_delay_to_seconds(text);
      
      CREATE OR REPLACE FUNCTION public.lef_delay_to_seconds(
          delay text)
          RETURNS integer
          LANGUAGE 'plpgsql'
          COST 100
          VOLATILE 
      AS $BODY$
      
      begin
        return 
           case when delay is null then null
                when delay = '' then null 
                else 
                   case right(delay,1)
                        when 's' then cast(left(delay,length(delay)-1) as int)
                        when 'm' then cast(left(delay,length(delay)-1) as int) * 60
                        when 'h' then cast(left(delay,length(delay)-1) as int) * 60 * 60 
                        when 'd' then cast(left(delay,length(delay)-1) as int) * 24 * 60 * 60 
                        when 'w' then cast(left(delay,length(delay)-1) as int) * 24 * 60 * 60 * 7
                         else cast(delay as int)  -- bare number, already in seconds
                   end
            end;
      end;
      
      $BODY$;
      
      ALTER FUNCTION public.lef_delay_to_seconds(text)
          OWNER TO zabbix;
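
      A quick sanity check of the helper (the literals are just illustrative):

      Code:
      select lef_delay_to_seconds('30s') as s30,   -- 30
             lef_delay_to_seconds('5m')  as m5,    -- 300
             lef_delay_to_seconds('1h')  as h1,    -- 3600
             lef_delay_to_seconds('60')  as plain; -- 60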
      Again, you may want to adjust lots of things in the core query to your own preferences and how it estimates. On mine, for example, it finds that pings generate about 4 times as many polls as the next highest entry, which is operational status on interfaces (something I need to reconsider, since I have triggers turned off on many of those). I also just noticed that a new template is way up near the top: when I was testing it, I had it poll alerts on a microwave device every 30 seconds, whereas every few minutes is more than adequate. Mistakes like that, from speeding things up to test and then forgetting, can make a real mess on a system where a lot of template development occurs.

      Anyway.... I'm a big fan of reducing the source of massive volume as a first step. Sure, tune the database, but reduce the flow first.

      Always ask "is this data actionable".


      • kloczek
        Senior Member · Joined Jun 2006 · 1771 posts

        #4
        Your poller utilisation is quite high, which suggests you are still using passive agents and passive item types, and/or passive proxies.
        Pollers are also used to calculate the values of calculated items, so another question is how many items of that kind you have.
        A busy ICMP pinger and a few other busy processes suggest that you are monitoring some hosts directly from the server (which is bad from the point of view of scalability and HA).
        With ~1.5k NVPS and the housekeeper running for 10+ minutes, you probably have at least the history* tables partitioned already; if not, running deletes against tables that large takes so long that it is more or less a waste of IOs.
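
        For reference, a minimal sketch of what range partitioning of the float history table can look like, assuming PostgreSQL 10+ declarative partitioning (Zabbix itself ships no partitioning DDL; the table and partition names here are illustrative, and existing data would have to be migrated into the new structure):

        Code:
        -- illustrative only: a partitioned replacement for the float history table
        CREATE TABLE history_part (
            itemid bigint        NOT NULL,
            clock  integer       DEFAULT 0 NOT NULL,
            value  numeric(16,4) DEFAULT 0.0000 NOT NULL,
            ns     integer       DEFAULT 0 NOT NULL
        ) PARTITION BY RANGE (clock);

        -- one partition per month; bounds are epoch seconds (2018-07-01 to 2018-08-01 UTC)
        CREATE TABLE history_2018_07 PARTITION OF history_part
            FOR VALUES FROM (1530403200) TO (1533081600);

        -- retention then becomes an instant DROP instead of the housekeeper's slow row-by-row deletes
        DROP TABLE history_2018_07;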

        http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
        https://kloczek.wordpress.com/
        zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
        My zabbix templates https://github.com/kloczek/zabbix-templates
