How is the average of proc.cpu.util calculated (avg1, avg5, avg15)?

  • badbob001
    Junior Member
    • May 2020
    • 2

    #1

    A customer wants to know how the cpu util averages (avg1/5/15) are calculated when specified in the template.
    Note that I'm asking about cpu util (cpu idle vs. cpu busy) and NOT cpu load (number of processes in the run queue).
    For avg1, which is supposed to be the average over the last minute, they want to know how many samples are taken to calculate the average, but I'm not sure the averages are even calculated that way.

    I know that for cpu load, which is very different from cpu util, Linux itself provides the avg1/avg5/avg15 load averages, based on a fairly complicated weighted average using exponential math.
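
For comparison, the exponential math behind the load average can be sketched roughly as follows. The constants are the fixed-point values from the Linux kernel headers (the load is sampled about every 5 seconds); the function name decay_load and its simplified form without the kernel's rounding are assumptions made for illustration, not the kernel code itself.

```c
/* Fixed-point constants as defined in the Linux kernel headers. */
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed-point */
#define EXP_1    1884UL             /* 1/exp(5s/1min) in fixed-point */
#define EXP_5    2014UL             /* 1/exp(5s/5min) */
#define EXP_15   2037UL             /* 1/exp(5s/15min) */

/*
 * Hypothetical helper: one update step of the exponentially damped average.
 * `load` and `active` are fixed-point values (real value * FIXED_1).
 * The old average keeps weight factor/FIXED_1; the new sample gets the rest.
 */
unsigned long decay_load(unsigned long load, unsigned long factor,
                         unsigned long active)
{
    return (load * factor + active * (FIXED_1 - factor)) >> FSHIFT;
}
```

Starting from an idle system, a single update with one runnable task moves the 1-minute value to 164/2048, or about 0.08, so every older sample still contributes with a weight that shrinks on each update. That is quite different from a plain average over a fixed window.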

    But for cpu util, which is probably calculated from /proc/stat, I don't think the OS provides those avg1/5/15 averages. I'm guessing Zabbix collects the /proc/stat data and calculates the average internally. But again, how are those cpu util averages calculated?
    Do the cpu util averages use the same exponential math as the cpu load averages?
    Does Zabbix sample cpu util at fixed intervals and calculate the averages from that data? If yes, then how many samples? I almost convinced myself that the sampling is based on the Linux scheduler quantum, which is about 10 ms.
    I've seen articles that say only two samples are needed to calculate cpu util between two points in time:
    Code:
    (total_cputime_now - total_cputime_prev) - (idletime_now - idletime_prev)
    ------------------------------------------------------------------------- x 100
                 (total_cputime_now - total_cputime_prev)
    Is this it? Then that means:
    avg1: two samples from /proc/stat 1 minute apart
    avg5: two samples from /proc/stat 5 minutes apart
    avg15: two samples from /proc/stat 15 minutes apart
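
As a rough sketch of the two-sample formula above: the function name and parameters below are made up for illustration and are not taken from Zabbix. The inputs are assumed to be cumulative tick counts (for example, summed from the "cpu" line of /proc/stat) taken at two points in time.

```c
/*
 * Hypothetical helper: busy percentage between two snapshots of the
 * cumulative CPU counters. "total" is the sum of all CPU time fields,
 * "idle" is the idle field alone.
 */
double cpu_busy_percent(unsigned long long total_prev,
                        unsigned long long idle_prev,
                        unsigned long long total_now,
                        unsigned long long idle_now)
{
    unsigned long long total_delta = total_now - total_prev;
    unsigned long long idle_delta = idle_now - idle_prev;

    /* No time elapsed between snapshots: avoid dividing by zero. */
    if (total_delta == 0)
        return 0.0;

    return 100.0 * (double)(total_delta - idle_delta) / (double)total_delta;
}
```

With snapshots of (total 1000, idle 900) and (total 1100, idle 950), 100 ticks elapsed and 50 of them were idle, so the result is 50 percent busy. Because the counters are cumulative, the result is already the average over the whole interval between the two snapshots, however long that interval is.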
    This all started when the customer saw that Azure graphs can show max, average, and min values, and we're trying to figure out whether Zabbix can show similar values.

    Thanks!
    Last edited by badbob001; 22-05-2020, 21:39.
  • badbob001
    Junior Member
    • May 2020
    • 2

    #2
    Tried looking through the source code and I have to admit that it has been a very long time since I've looked at C code.

    In cpustat.c, I see avg1, 5, and 15 used here:
    Code:
    int get_cpustat(AGENT_RESULT *result, int cpu_num, int state, int mode)
    {
    ...

        switch (mode)
        {
            case ZBX_AVG1:
                time = SEC_PER_MIN;
                break;
            case ZBX_AVG5:
                time = 5 * SEC_PER_MIN;
                break;
            case ZBX_AVG15:
                time = 15 * SEC_PER_MIN;
                break;
            default:
                return SYSINFO_RET_FAIL;
        }

    ...

        if (1 == cpu->h_count)
        {
            for (i = 0; i < ZBX_CPU_STATE_COUNT; i++)
                total += cpu->h_counter[i][idx_curr];
            counter = cpu->h_counter[state][idx_curr];
        }
        else
        {
            if (0 > (idx_base = idx_curr - MIN(cpu->h_count - 1, time)))
                idx_base += MAX_COLLECTOR_HISTORY;

            while (SYSINFO_RET_OK != cpu->h_status[idx_base])
                if (MAX_COLLECTOR_HISTORY == ++idx_base)
                    idx_base -= MAX_COLLECTOR_HISTORY;

            for (i = 0; i < ZBX_CPU_STATE_COUNT; i++)
                total += cpu->h_counter[i][idx_curr] - cpu->h_counter[i][idx_base];
            counter = cpu->h_counter[state][idx_curr] - cpu->h_counter[state][idx_base];
        }

    ...

        SET_DBL_RESULT(result, 0 == total ? 0 : 100. * (double)counter / (double)total);
    https://git.zabbix.com/projects/ZBX/...ture/ZBX-15210

    As best I can make out, avgX sets the value of the integer time, which then determines idx_base (the starting point in the stored metrics to look at?). The variable total is then the cumulative sum of the current metric value minus the base metric value... I'm guessing the "current minus base" aspect is related to how /proc/stat stores cpu time cumulatively from the start of the system. It's unclear to me what counter is, but I'm guessing it's the number of items between base and current.

    And the last line, I think, says to return 0 if total is 0 and otherwise return "100 x counter / total", which I don't totally understand. Some sort of average formula that results in a percentage? I would expect an average formula to be like: (total / counter) x 100.
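
Reading the excerpt again, counter looks like the current-minus-base delta for the one requested state (it is taken from h_counter[state]), while total is the same delta summed over all states, so 100 x counter / total would be that state's share of the elapsed cpu time. A minimal sketch of that idea follows. All names are hypothetical, only three states are modelled, and one stored snapshot per second is an assumption suggested by the index arithmetic rather than something the excerpt confirms. The h_status check from the excerpt is left out.

```c
#define STATE_COUNT 3               /* hypothetical: user, system, idle */
#define HISTORY     (15 * 60 + 1)   /* assumed: one slot per second, 15 minutes plus current */

enum { ST_USER, ST_SYSTEM, ST_IDLE };

typedef struct {
    unsigned long long counter[STATE_COUNT][HISTORY]; /* cumulative ticks per state */
    int count;  /* number of valid snapshots stored, at most HISTORY */
    int curr;   /* index of the newest snapshot */
} cpu_history;

/* Store one snapshot of the cumulative counters in the ring buffer. */
void history_push(cpu_history *h, const unsigned long long ticks[STATE_COUNT])
{
    if (h->count > 0)
        h->curr = (h->curr + 1) % HISTORY;
    for (int s = 0; s < STATE_COUNT; s++)
        h->counter[s][h->curr] = ticks[s];
    if (h->count < HISTORY)
        h->count++;
}

/* Percentage of time spent in `state` over the last `window` snapshots. */
double history_util(const cpu_history *h, int state, int window)
{
    unsigned long long total = 0, counter = 0;

    if (h->count == 0)
        return 0.0;

    if (h->count == 1) {
        /* Only one snapshot: fall back to the counters since boot. */
        for (int s = 0; s < STATE_COUNT; s++)
            total += h->counter[s][h->curr];
        counter = h->counter[state][h->curr];
    } else {
        /* Step back `window` slots, or as far as the history allows. */
        int back = window < h->count - 1 ? window : h->count - 1;
        int base = h->curr - back;
        if (base < 0)
            base += HISTORY;

        for (int s = 0; s < STATE_COUNT; s++)
            total += h->counter[s][h->curr] - h->counter[s][base];
        counter = h->counter[state][h->curr] - h->counter[state][base];
    }

    return total == 0 ? 0.0 : 100.0 * (double)counter / (double)total;
}
```

In this sketch only two stored snapshots enter each result, the newest one and the one `window` slots back. Because the counters are cumulative, their difference already covers everything that happened in between.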

    It's still unclear to me whether this means that sampling happens once per second or whether that is just the minimum possible sampling interval. Or maybe that depends on the update interval set for the monitored item? I have my template item for cpu percent set to a 1m update interval. Would that mean that avg1 over a 1-minute sample would just be the same value without any averaging?
    Last edited by badbob001; 26-05-2020, 23:10.
