Calculated items putting heavy load on Zabbix; need scheduling ideas
  • LemurTech
    Junior Member
    • Mar 2018
    • 8

    #1

    Calculated items putting heavy load on Zabbix; need scheduling ideas

    We have a half-dozen calculated checks for each of our 150 or so hosts that average performance values (CPU load, free RAM, web server connections, etc.) over long intervals (8 days at present, but I would like to extend this to 30 days). We would like the checks to run once per day. I quickly found out that calculated items can put an enormous load on Zabbix (iowait) as the system reads values from the DB and performs the calculation for each host. This sometimes makes the web GUI unavailable for several minutes.

    I've put quite a bit of effort into optimizing our system for performance: we use DB partitioning, and I've tuned MariaDB and the cache values. Unfortunately, the one thing I cannot do is swap our RAID 10 SATA array for SSDs, as they are prohibitively expensive at our current data center host. Other than these I/O-expensive checks, everything is working great.
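    For context, a single long-window calculated item of this kind might look like the following (pre-Zabbix 5.4 calculated-item formula syntax; the item key and window are illustrative, not the poster's actual configuration):

```
avg("system.cpu.load[percpu,avg1]",8d)
```

    Each evaluation of such a formula forces the server to read the full 8 days of history for that key from the database, which is where the iowait described above comes from.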

    Currently I'm using macros to assign flexible scheduling for each item, so that checks are broken up over the course of the day. I've created 4 periods of 4 hours each, scheduled so that any 4-hour period has only 1 or 2 items to check. The macros look like this:

[Screenshot: macro definitions, 2020-04-24_11-08-45.jpg]
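    As a sketch of the macro approach described above (the macro names and interval values here are hypothetical, not taken from the screenshot), Zabbix scheduling intervals assigned via template-level user macros might look like:

```
# One user macro per 4-hour window; each calculated item's
# "Custom intervals" field references one macro, so the daily
# runs are staggered across the day.
{$CALC_SCHED_1} = h1m0     # scheduling interval: daily at 01:00
{$CALC_SCHED_2} = h5m0     # daily at 05:00
{$CALC_SCHED_3} = h9m0     # daily at 09:00
{$CALC_SCHED_4} = h13m0    # daily at 13:00
```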

    The resulting CPU graph looks like this. Each peak represents one set of checks:

[Screenshot: CPU utilization graph, 2020-04-27_9-37-32.jpg]

    This is working, but it is not a happy situation, and I'm afraid I can't go beyond the 8-day averaging. How can I get away from the clustering of these checks so that they put less of a load on the server? I could go into each and every host and customize the schedule intervals so that there are more unique schedules, but I want to retain the management ease of assigning the check interval via template.

    My question for you Zabbix experts is: Am I missing something? Is there a scheduling strategy I can adopt so that checks are better spread over a particular time period of each day?

    Thanks beforehand for any tips!
  • gofree
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2017
    • 400

    #2
    In my experience, every time I tried to reinvent the wheel (for example, scheduling checks at different times to make them less aggressive on the DB, as you did), I failed again and again: there simply isn't enough time in the day to schedule them all reasonably, and with that many devices the I/O locks happen eventually. This just isn't the way it should be handled. The question I had to ask myself (and maybe you should too) is: what's the actual benefit of those 30d (8d) calculated checks?

    I can't remember exactly how Grafana handles the data, but I had better results with it (querying raw data via the API). I believe you can also use avg and other functions in it.

    Edit: One more thing against long-term averages: an 8d average dilutes spikes (1, 1, 1, 1, 10, 1, 1 > avg ≈ 2.29). Maybe a percentile trigger over a shorter period (1d, perhaps) would be more suitable.
    Last edited by gofree; 28-04-2020, 12:07.
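    To illustrate gofree's point numerically (a standalone Python sketch, not Zabbix syntax; the series is the one from the post above), a long average hides a one-day spike while a high percentile surfaces it:

```python
import math

# The example series from the post: six quiet days and one spike.
samples = [1, 1, 1, 1, 10, 1, 1]

avg = sum(samples) / len(samples)  # ~2.29: the spike is diluted away

def percentile(values, pct):
    """Nearest-rank percentile (one common definition)."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

p95 = percentile(samples, 95)  # 10: the spike is clearly visible

print(f"avg={avg:.2f} p95={p95}")
```

    A trigger on a short-window percentile would fire on the spike day, while a trigger on the 8-day average would stay quiet.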



    • LemurTech
      LemurTech commented
      Thanks, gofree. The checks are used to populate inventory fields, which then get exported into our CRM system to run reports that compare utilization of servers and so forth, and provide justifications for pushing CPU/RAM upgrades to clients. If we bottom out at 10 days or so for the averages, we may just have to live with that. I hear what you're saying about using Grafana to query the raw data, but that's another layer of complication. I was hoping I was just missing out on some easier way of distributing the checks across time.
  • LemurTech
    Junior Member
    • Mar 2018
    • 8

    #3
    Based on splitek's suggestion, I changed my methodology to calculate daily averages at midnight each night for all servers, then an hour later I run the calculation that gets the 30-day averages based off those 1-day averages. After letting this run for a few days, none of this puts a dent in my Zabbix Server's performance, so I'm very happy with the results! Thanks, splitek!
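    The two-stage approach can be sketched as follows (plain Python standing in for the two calculated items; the numbers are made up). When every day contributes the same number of samples, the average of the daily averages equals the flat average over the whole window:

```python
# Stage 1: one cheap daily average per day (the midnight item).
# Stage 2: an average over those daily averages (the later item).
days = [
    [2.0, 4.0, 6.0],   # day 1 samples (illustrative)
    [1.0, 3.0, 5.0],   # day 2 samples
    [4.0, 4.0, 4.0],   # day 3 samples
]

daily_avgs = [sum(d) / len(d) for d in days]      # stage 1
avg_of_avgs = sum(daily_avgs) / len(daily_avgs)   # stage 2

flat = [s for d in days for s in d]
flat_avg = sum(flat) / len(flat)                  # single expensive pass

assert abs(avg_of_avgs - flat_avg) < 1e-9
print(avg_of_avgs)
```

    Note that if the days contribute unequal numbers of samples, the two results can differ slightly; for evenly collected performance metrics the difference is negligible, and each stage only ever reads a small slice of history from the DB.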


    • gajala
      Junior Member
      • Aug 2020
      • 1

      #4
      As I see it, calculating anything in one go over a 30d period will require more resources and time. It's much better to split the data into small parts that are calculated more frequently but don't eat all the resources. Also, some operations can be done recursively, like an avg of other avgs: no data is lost, and the calculation for this "big" avg is faster because the data has already been prepared beforehand.
      Last edited by gajala; 08-09-2020, 04:44.
