Finding CPU consumers

  • jerrylenk
    Member
    Zabbix Certified Specialist
    • May 2010
    • 62

    #1

    Finding CPU consumers

    I wrote a little script to capture the output of the Linux top command in a Zabbix item. This was inspired by terataz, who asked how to find out which process may be the cause when, e.g., Zabbix reports high CPU load.

    My solution is not very sophisticated, but maybe somebody will find it useful or feel like improving it.
    For now, it reports the names of the top CPU-time-consuming processes if their %CPU is at least a given value. With small modifications and an adjusted toprc configuration, one could use it for RAM consumers or anything else top is able to report.

    Code:
    #!/bin/bash
    #####################################################
    # topcpu.sh
    # returns names of most CPU time consuming processes
    # as reported by 'top'
    #####################################################
    # 05-07-2010 by Jerry Lenk
    # Use at your own risk!
    #####################################################
    
    # set limit to 1st argument, or 2% if not specified
    lim=$1
    test -z $lim && lim=2
    
    # run 2 iterations of top in batch mode with 1 s delay
    top -b -d1 -n2 |\
    gawk --assign lim=$lim  'BEGIN { reply=""}
            END { print reply, "." }
            # if reply is empty, at least a period is returned
    
            # in 2nd iteration, first 3 lines
            # add columns 9 (%cpu) and 12 (process name)
            # to reply string, if cpu at least lim%
            itr == 2 && NR <= 3 && $9 >= lim { reply=reply " " $9 "%" $12 }
    
            # count iterations by header lines beginning with "PID"
            # reset linenumber
            $1 == "PID" { NR=0 ; itr +=1 }
           '
    # Only 2nd iteration of top is of interest because
    # load values are calculated since previous iteration
    I save it as "topcpu.sh" in my scripts directory on the monitored machine (/etc/zabbix/userscript in my case)
    and add it as a UserParameter to zabbix_agentd.conf:
    Code:
    UserParameter=system.topcpu[*],/etc/zabbix/userscript/topcpu.sh $1
    If you prefer to place the script somewhere else, change the path /etc/zabbix/userscript/ accordingly.

    Now all I need to do is restart zabbix_agentd and create an item for it:
    Description: CPU top consumer
    Type: Zabbix agent
    key: system.topcpu[5] (5 being the minimum %CPU load I want reported)
    Type of information: text
    Interval: 30 (I tried 60 first, but wasn't very satisfied.)

    I have yet to try whether it makes sense to put the content of this item into an alert mail triggered by high CPU load. The relevant process name will probably show up only half a minute after the trigger turns "on".

    Have fun,
    Jerry
    Last edited by jerrylenk; 05-07-2010, 15:40.
  • jerrylenk
    Member
    Zabbix Certified Specialist
    • May 2010
    • 62

    #2
    So what do I do with it?
    For example, that is what I see:


    A sudden peak of cpu load on my testing machine at 12:15
    What was going on then? (I know of course, but let's assume I doughnut ;-)

    I look at the item "CPU top consumer" history for server Jysles:


    100% CPU used by vmware, so obviously the real cause was inside the VM. Fortunately, my VM is also monitored and also runs Linux, or my script would not have worked*.
    So here is the vm's "top consumer" history:


    Aha: Some java program created a load peak at 12:16. That would be my tomcat server being restarted.

    Agreed, all this is somewhat intricate, and it will only catch peaks that really stand out and are not too short either. But it may be useful for curious admins anyway.

    * Btw, perhaps someone feels inspired to create one for Windows? I don't know if there is a top for Windows. There is a gawk, at any rate. One could perhaps try something with typeperf.exe or tasklist.exe. But I am no DOS shell pro.

    • untergeek
      Senior Member
      Zabbix Certified Specialist
      • Jun 2009
      • 512

      #3
      Pretty slick! I wish you could put those values more easily in a screen or run concurrently with a graph. As things are you have to correlate it on your own.

      • Zophren
        Junior Member
        • Jul 2010
        • 13

        #4
        Good Job

        Hi!

        It's a good idea, and your script works perfectly!

        However, it is missing the user who owns the top process. This information is very interesting with Apache.

        I tried to implement this in your script:

        Change this:

        Code:
        itr == 2 && NR <= 3 && $9 >= lim { reply=reply " " $9 "%" $12}
        To this:

        Code:
        itr == 2 && NR <= 3 && $9 >= lim { reply=reply " " $9 "%:" $12 ":"$2 }
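To see what the modified rule emits without waiting for a busy machine, the same awk logic can be fed canned `top -b` output. This is only a minimal sketch: the function name `top_rows_with_user` is made up for illustration, it uses its own row counter instead of resetting NR, and it assumes top's default column order (PID in column 1, USER in column 2, %CPU in column 9, COMMAND in column 12).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): reads the output of
# `top -b -n2` on stdin and collects "<cpu>%:<command>:<user>" for the first
# 3 rows of the 2nd iteration whose %CPU (column 9) is at least the limit.
top_rows_with_user() {
    local lim="${1:-2}"
    awk -v lim="$lim" '
        # a header line starts a new iteration and resets the row counter
        $1 == "PID" { itr++; row = 0; next }
        { row++ }
        # 2nd iteration, first 3 rows, cpu at least lim%
        itr == 2 && row <= 3 && $9 + 0 >= lim + 0 {
            reply = reply " " $9 "%:" $12 ":" $2
        }
        # if reply is empty, at least a period is returned
        END { print reply, "." }
    '
}

# Possible usage on a live system:
#   top -b -d1 -n2 | top_rows_with_user 5
```

Because the function only reads stdin, it can be tried on a saved top dump before it is wired into a UserParameter.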
        Last edited by Zophren; 12-07-2010, 09:03. Reason: Add modif

        • frater
          Senior Member
          • Oct 2010
          • 340

          #5
          Good good job

          I implemented this too on 1 server, but I had a similar problem with its output.
          My server runs several processes with the same name and I need the full name (ps -ef) to distinguish between them.

          I still have to learn advanced gawking, so I did it more bashy.
          I'm not 100% sure it does exactly the same, but I think it does.

          If a certain process consumes a lot of CPU power it will also launch 'lsof -p <pid>' to find out which files it uses. Of course there are many files in use, but I filter them, and there's a very good chance the remaining ones will give you extra info.

          I tested it with my Apache server and it shows me which site is 'troubling' it.

          lsof needs to be added to the /etc/sudoers file
          Code:
          echo zabbix ALL = NOPASSWD: `which lsof` >> /etc/sudoers
          Code:
          #!/bin/bash
          #####################################################
          # topcpu
          # returns names of most CPU time consuming processes
          # as reported by 'top'
          #####################################################
          # 05-07-2010 by Jerry Lenk
          # 02-11-2010 by Frater (rewrite in bash)
          #
          # Use at your own risk!
          #####################################################
          
          # Add lsof to /etc/sudoers (as root) with the following command
          ##########################
          #     echo zabbix ALL = NOPASSWD: `which lsof` >> /etc/sudoers
          
          # Add to zabbix_agentd.conf
          ###########################
          #     echo 'UserParameter=system.topcpu[*],/usr/local/sbin/topcpu $1 $2' >>/etc/zabbix/zabbix_agentd.conf
          
          # Restart Zabbix
          ################
          #     /etc/init.d/zabbix-agent restart
          
          # Constants
          nodata='.'
          # The delay between the 2 samples 0.5 .... 2
          defdelay=1.1
          deflimit=2
          use_lsof=1
          GREP='grep --color=never -a'
          DEBUG=0
          
          # set limit to 1st argument (given from zabbix), or deflimit if not specified
          lim=`echo "$1" | tr -cd '0-9.'`
          [ -z "${lim}" ] && lim=${deflimit}
          
          # set delay to 2nd argument (given from zabbix), or defdelay if not specified
          delay=`echo "$2" | tr -cd '0-9.'`
          [ -z "${delay}" ] && delay=${defdelay}
          expr $delay \> 2 >/dev/null && delay=2
          
          topboth="`top -b -d${delay} -n2 | ${GREP} -A1 '^ *PID '`"
          toptail="`echo "${topboth}" | tail -n1`"
          
          # get the 2 top scoring PID's
          pid1=`echo "${topboth}" | awk '{print $1}' | head -n2 | tail -n1`
          pid2=`echo "${toptail}" | awk '{print $1}'`
          
          [ ${DEBUG} -ne 0 ] && echo "Debug: \$1=$1  \$2=$2  limit=$lim  delay=$delay  pid1=$pid1  pid2=$pid2"
          
          # if both PID's are the same continue the check
          if [ "${pid1}" = "${pid2}" ] ; then
          
            cpu=`echo "${toptail}"  | awk '{print $9}'`
            if expr ${cpu} \<= ${lim} >/dev/null ; then
              echo "${nodata}"
            else
              # get FULL process name (it may contain more info)
              procname="`ps --pid ${pid2} -o args --no-headers 2>/dev/null`"
              if [ -z "${procname}" ] ; then
                # process is not running anymore... I might as well return nothing and quit
                echo "${nodata}"
              else
                user=`echo "${toptail}" | awk '{print $2}'`
          
                # return CPU usage, process owner and process name
                echo "${cpu}%   ${user}:${procname}"
          
                if [ ${use_lsof} -ne 0 ] ; then
                  # calculate the limit when it should execute lsof
                  lim=$(( 2 * ${lim} + 5 ))
                  [ ${lim} -gt 50 ] && lim=50
                  expr ${cpu} \> ${lim} >/dev/null && sudo lsof -p ${pid2} -S -b -w -n -Fftn0 | ${GREP} -v '^fDEL' | ${GREP} 'tREG'  | ${GREP} -o '/.*' | tr -d '\0' | ${GREP} -vE '(log$|^/var/lib|^/lib|^/var/run|^/tmp|^/usr/|^/var/log/)' | sort -u | head -n5
                fi
              fi
            fi
          else
            echo "${nodata}"
          fi

          changes and additions:

          - implemented everything in bash
          - sanity checking of parameters (limit & delay)
          - delay can be set (between the 2 samples)
          - if the CPU usage is more than twice the limit plus 5 (capped at 50) it will execute an lsof to check the files that are open by that process
          - you need to add lsof in the /etc/sudoers for zabbix
          - /etc/zabbix/zabbix_agentd.conf now needs a 2nd parameter
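One caveat on the threshold check: `expr` compares numerically only when both operands are integers. With a value such as `12.3` from top it falls back to string comparison, so `12.3 <= 4` can come out true, and the later `$(( 2 * ${lim} + 5 ))` likewise expects an integer limit. A minimal sketch of a numeric comparison done in awk instead (the helper name `cpu_exceeds` is hypothetical and not part of the script above):

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when $1 is numerically
# greater than $2. Both values may contain a decimal point.
cpu_exceeds() {
    awk -v cpu="$1" -v lim="$2" 'BEGIN { exit !(cpu + 0 > lim + 0) }'
}

# Possible usage, mirroring the check in the script:
#   if cpu_exceeds "${cpu}" "${lim}"; then
#       echo "above limit"
#   else
#       echo "."
#   fi
```

Forcing both values to numbers with `+ 0` keeps awk from comparing them as strings.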


          item can be:

          topcpu[]
          topcpu[4]
          topcpu[8]
          topcpu[8 1.4]

          I recommend:

          topcpu[4]
          Last edited by frater; 05-11-2010, 17:51. Reason: added lsof
          Zabbix agents on Linux, FreeBSD, Windows, AVM-Fritz!box, DD-WRT and QNAP

          • frater
            Senior Member
            • Oct 2010
            • 340

            #6
            Although I'm getting good results with the previous script, I decided to write a simpler one which does about the same.

            In my previous script I only give output if the PID in the first iteration is the same as in the 2nd iteration. This will at least ignore processes that just peak, but it may also ignore 2 processes that are fighting for the first place.
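The "same PID in both samples" test described above can be illustrated on canned input. This is a sketch only: `stable_top_pid` is a made-up name, and it assumes the row directly after each `PID` header is the top consumer, as in top's default sort order.

```shell
#!/bin/bash
# Hypothetical helper: reads two iterations of `top -b` output on stdin and
# prints the leading PID only if it leads in both iterations; otherwise ".".
stable_top_pid() {
    awk '
        grab        { pid[++n] = $1; grab = 0 }   # row right after a header
        $1 == "PID" { grab = 1 }                  # next row is the top consumer
        END {
            if (n >= 2 && pid[1] == pid[2]) print pid[1]
            else print "."
        }
    '
}

# Possible usage on a live system:
#   top -b -d1 -n2 | stable_top_pid
```

Two processes that alternate in first place produce "." here, which is exactly the blind spot mentioned above.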

            It may all be hypothetical and maybe I'm taking it too seriously.

            The output still has to be interpreted by a human, and the log is probably only looked at when other graphs indicate that a process is hogging your machine.
            Code:
            #!/bin/bash
            #####################################################
            # topcpu
            # returns names of most CPU time consuming processes
            # as reported by 'top'
            #####################################################
            # 05-07-2010 by Jerry Lenk
            # 02-11-2010 by Frater (rewrite in bash)
            #
            # Use at your own risk!
            #####################################################
            
            # Add lsof to /etc/sudoers (as root) with the following command
            ##########################
            #     echo zabbix ALL = NOPASSWD: `which lsof` >> /etc/sudoers
            
            # Add to zabbix_agentd.conf
            ###########################
            #     echo 'UserParameter=system.topcpu[*],/usr/local/sbin/topcpu $1' >>/etc/zabbix/zabbix_agentd.conf
            
            # Restart Zabbix
            ################
            #     /etc/init.d/zabbix-agent restart
            
            # Constants
            nodata='.'
            deflimit=4
            use_lsof=1
            GREP='grep --color=never -a'
            DEBUG=0
            
            # set limit to 1st argument (given from zabbix), or deflimit if not specified
            lim=`echo "$1" | tr -cd '0-9.'`
            [ -z "${lim}" ] && lim=${deflimit}
            
            toptail="`top -b -d1 -n2 | ${GREP} -A1 '^ *PID ' | tail -n1`"
            cpu=`echo "${toptail}"  | awk '{print $9}'`
            
            [ ${DEBUG} -ne 0 ] && echo "Debug: \$1=$1  limit=$lim  cpu=$cpu"
            
            if expr ${cpu} \<= ${lim} >/dev/null ; then
              echo "${nodata}"
            else
            
              # get PID & FULL process name (it may contain more info)
              pid2=`echo "${toptail}" | awk '{print $1}'`
              procname="`ps --pid ${pid2} -o args --no-headers 2>/dev/null`"
            
              if [ -z "${procname}" ] ; then
                # process is not running anymore... I might as well return nothing and quit
                echo "${nodata}"
              else
            
                user=`echo "${toptail}" | awk '{print $2}'`
                # return CPU usage, process owner and process name
                echo "${cpu}%   ${user}:${procname}"
            
                if [ ${use_lsof} -ne 0 ] ; then
                  # calculate the limit when it should execute lsof
                  lim=$(( 2 * ${lim} + 5 ))
                  [ ${lim} -gt 50 ] && lim=50
                  # Run an lsof, but exclude log files, apache modules and several runtime/library directories
                  expr ${cpu} \> ${lim} >/dev/null && sudo lsof -p ${pid2} -S -b -w -n -Fftn0 | ${GREP} -v '^fDEL' | ${GREP} 'tREG'  | ${GREP} -o '/.*' | tr -d '\0' | ${GREP} -vE '(log$|\.mo$|^/var/lib|^/lib|^/var/run|^/tmp|^/usr/|^/var/log/)' | sort -u | head -n7
                fi
              fi
            fi
            Last edited by frater; 10-11-2010, 14:29.

            • Murz
              Junior Member
              • Sep 2008
              • 17

              #7
              Thanks for the script, works perfectly!
              Can I display not just one process, but the 3 or more that use the most CPU? When CPU usage is at 100% on my server, I see many processes that each take 10-25% of CPU but together eat the full 100%, and I want to see the top 3 or 5 of them, not only one.

              And please give more info about the parameters:
              topcpu[8 1.4] - what do the first parameter (8) and the second (1.4) mean?

              Can I get the average CPU usage per process over a time interval of, for example, 10 seconds?
              For example, the first process uses 70% of CPU for only one second (at check time), while a second process has been eating 30%-50% of CPU for a long time (the whole 10-second interval); in this case the cause of the problem is the second process, not the first.
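For the averaging question, one possible approach (a sketch, not something the scripts in this thread do) is to let top take several samples and average %CPU per PID in awk. `avg_cpu_per_pid` is a hypothetical name. The first iteration is skipped because, as noted in the original script, only later iterations report load measured since the previous sample, and a process missing from a sample counts as 0% for that sample.

```shell
#!/bin/bash
# Hypothetical helper: reads `top -b -n<N>` output on stdin and prints
# "<average %CPU>% <pid> <command>" per process, highest average first.
avg_cpu_per_pid() {
    awk '
        $1 == "PID" { itr++; next }          # header starts a new iteration
        itr >= 2 && $1 ~ /^[0-9]+$/ {        # process rows of later iterations
            sum[$1] += $9
            name[$1] = $12
        }
        END {
            samples = itr - 1                # iterations after the first
            if (samples < 1) exit
            for (p in sum)
                printf "%.1f%% %s %s\n", sum[p] / samples, p, name[p]
        }
    ' | sort -rn
}

# Possible usage: 11 iterations at 1 s delay give 10 measured intervals
#   top -b -d1 -n11 | avg_cpu_per_pid | head -n 3
```

Note that a run like this keeps the agent waiting for about ten seconds, so the agent's timeout setting would have to allow for it.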

              • frater
                Senior Member
                • Oct 2010
                • 340

                #8
                Originally posted by Murz
                Thanks for the script, works perfectly!
                I'm glad you like it, but I will pass some of the thanks to the OP.

                The last script of mine takes only 1 parameter, which is the percentage threshold. The first script takes 2 samples and will only give an answer when both samples have the same process in first place. Its 2nd parameter is the delay between the 2 samples.
                This makes the script less sensitive to peaks, which may be something you want.

                You're asking to have a top 3 instead of a top 1...
                This means the 1st script of mine is not suited for this....

                I adapted the 2nd script to show more than 1 process.
                You need to give the number of processes as a 2nd parameter in Zabbix.

                Only the 1st process will show its open files.
                The script is run twice a minute and you don't want the probe itself to be the process that consumes too much.

                Code:
                #!/bin/bash
                #####################################################
                # topcpu
                # returns names of most CPU time consuming processes
                # as reported by 'top'
                #####################################################
                # 05-07-2010 by Jerry Lenk
                # 02-11-2010 by Frater (rewrite in bash)
                #
                # Use at your own risk!
                #####################################################
                
                # Add lsof to /etc/sudoers (as root) with the following command
                ##########################
                #     echo zabbix ALL = NOPASSWD: `which lsof` >> /etc/sudoers
                
                # Comment out the tty requirement for sudo
                ##########################
                #     sed -i -e 's/^Defaults.*requiretty/# &/' /etc/sudoers
                
                # Add to zabbix_agentd.conf
                ###########################
                #     echo 'UserParameter=system.topcpu[*],/usr/local/sbin/topcpu $1 $2' >>/etc/zabbix/zabbix_agentd.conf
                
                # Restart Zabbix
                ################
                #     /etc/init.d/zabbix-agent restart
                
                # Constants
                nodata='.'
                deflimit=4
                defanswers=1
                use_lsof=1
                GREP='grep --color=never -a'
                DEBUG=0
                
                # set limit to 1st argument (given from zabbix), or deflimit if not specified
                lim=`echo "$1" | tr -cd '0-9.'`
                [ -z "${lim}" ] && lim=${deflimit}
                
                answers=`echo "$2" | tr -cd '0-9'`
                [ -z "${answers}" ] && answers=${defanswers}
                [ $answers -gt 5  ] && answers=5
                [ $answers -lt 1  ] && answers=1
                
                toptail="`top -b -d1 -n2 | ${GREP} -A${answers} '^ *PID ' | tail -n${answers}`"
                cpu=`echo "${toptail}"  | head -n1 | awk '{print $9}'`
                
                [ ${DEBUG} -ne 0 ] && echo "Debug: \$1=$1  limit=$lim  cpu=$cpu"
                
                if expr ${cpu} \<= ${lim} >/dev/null ; then
                  echo "${nodata}"
                else
                  # get PID & FULL process name (it may contain more info)
                  pid=`echo "${toptail}" | head -n1 | awk '{print $1}'`
                  procname="`ps --pid ${pid} -o args --no-headers 2>/dev/null`"
                
                  if [ -z "${procname}" ] ; then
                    # process is not running anymore... I might as well return nothing and quit
                    echo "${nodata}"
                  else
                
                    user=`echo "${toptail}" | head -n1 | awk '{print $2}'`
                    # return CPU usage, process owner and process name
                    echo "${cpu}%   ${user}:${procname}"
                
                    if [ ${use_lsof} -ne 0 ] ; then
                      # calculate the limit when it should execute lsof
                      lim=$(( 2 * ${lim} + 5 ))
                      [ ${lim} -gt 50 ] && lim=50
                      expr ${cpu} \> ${lim} >/dev/null && sudo lsof -p ${pid} -S -b -w -n -Fftn0 | ${GREP} -v '^fDEL' | ${GREP} 'tREG'  | ${GREP} -o '/.*' | tr -d '\0' | ${GREP} -vE '(log$|^/var/lib|^/lib|^/var/run|^/tmp|^/usr/|^/var/log/)' | sort -u | head -n5
                    fi
                
                    n=2
                    while [ $n -le ${answers} ] ; do
                      topline="`echo "${toptail}" | tail -n+${n} | head -n1`"
                       pid=`echo "${topline}" | awk '{print $1}'`
                      user=`echo "${topline}" | awk '{print $2}'`
                       cpu=`echo "${topline}" | awk '{print $9}'`
                      procname="`ps --pid ${pid} -o args --no-headers 2>/dev/null`"
                      echo "${cpu}%   ${user}:${procname}"
                
                      n=$(($n + 1))
                    done
                
                  fi
                fi
                Last edited by frater; 02-12-2010, 11:54. Reason: requiretty out /etc/sudoers

                • frater
                  Senior Member
                  • Oct 2010
                  • 340

                  #9
                  It didn't work on all my servers, but luckily I found the culprit.
                  Your /etc/sudoers may contain the line 'Defaults requiretty'.
                  This means any sudo command run without a tty, as the Zabbix agent does, will fail.

                  This was in my /var/log/zabbix/zabbix_agentd.log
                  Code:
                  sudo: sorry, you must have a tty to run sudo
                  You can comment it out with visudo or just this command:
                  Code:
                  sed -i -e 's/^Defaults.*requiretty/# &/' /etc/sudoers
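Instead of commenting out `requiretty` for every account, sudoers also accepts a per-user default. The sketch below writes such a rule, together with the lsof permission, to a separate file so it can be syntax-checked before it goes live. The function name and the file names are assumptions, and whether your sudo reads `/etc/sudoers.d` depends on an include directive in `/etc/sudoers`.

```shell
#!/bin/bash
# Hypothetical helper: writes a sudoers fragment for the zabbix user to the
# file given as $1. $2 is the lsof path (default: the lsof found on PATH).
write_zabbix_sudoers() {
    local target="$1"
    local lsof_path="${2:-$(command -v lsof)}"
    # refuse to write a rule without a target file or a command path
    [ -n "${target}" ] && [ -n "${lsof_path}" ] || return 1
    {
        echo 'Defaults:zabbix !requiretty'
        echo "zabbix ALL = NOPASSWD: ${lsof_path}"
    } > "${target}"
    chmod 0440 "${target}"
}

# Possible usage as root (file names are examples only):
#   write_zabbix_sudoers /tmp/zabbix-lsof &&
#   visudo -cf /tmp/zabbix-lsof &&
#   install -m 0440 -o root -g root /tmp/zabbix-lsof /etc/sudoers.d/zabbix-lsof
```

Checking the fragment with `visudo -cf` first avoids locking yourself out of sudo with a syntax error.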

                  • tof233
                    Member
                    • Nov 2010
                    • 94

                    #10
                    Thank you for these scripts
                    I modified the first one a little so that it displays the full command.

                    Code:
                    #!/bin/bash
                    #####################################################
                    # topcpu.sh
                    # returns names of most CPU time consuming processes
                    # as reported by 'top'
                    #####################################################
                    # 05-07-2010 by Jerry Lenk
                    # Use at your own risk!
                    #####################################################
                    
                    # set limit to 1st argument, or 2% if not specified
                    lim=$1
                    test -z $lim && lim=2
                    
                    # run 2 iterations of top in batch mode with 1 s delay
                    top -b -d1 -n2 |\
                    gawk --assign lim=$lim  'BEGIN { reply=""}
                            END { print reply, "." }
                            # if reply is empty, at least a period is returned
                    
                            # in 2nd iteration, first 3 lines:
                            # print column 9 (%cpu) followed by the full command line
                            # (looked up with ps by PID, column 1), if cpu at least lim%
                            itr == 2 && NR <= 3 && $9 >= lim { printf(" %s%% ", $9); system("ps --no-headers -o args -p " $1) }
                    
                            # count iterations by header lines beginning with "PID"
                            # reset linenumber
                            $1 == "PID" { NR=0 ; itr +=1 }
                           '
                    # Only 2nd iteration of top is of interest because
                    # load values are calculated since previous iteration
                    However, I would like to get rid of the "." when no process is over the threshold (i.e. have an empty result).
                    Do you know what a text item should output for "empty"? (Just returning nothing turns the item into unsupported.)

                    • frater
                      Senior Member
                      • Oct 2010
                      • 340

                      #11
                      Originally posted by tof233
                      However, I would like to get rid of the "." when no process is over the threshold (i.e. have an empty result).
                      Do you know what a text item should output for "empty"? (Just returning nothing turns the item into unsupported.)
                      I do not understand why you want to do that. It means zabbix will stop monitoring the machine when it becomes idle, which is hopefully most of the time.

                      But the answer is easy.
                      Just don't run any command that outputs something, and Zabbix will turn the item into unsupported.

                      • tof233
                        Member
                        • Nov 2010
                        • 94

                        #12
                        Originally posted by frater
                        I do not understand why you want to do that. It means zabbix will stop monitoring the machine when it becomes idle, which is hopefully most of the time.
                        Of course, I want the opposite (sorry, I misspoke...)

                        I just would like Zabbix not to store "." when no process is over the threshold:
                        [2011.Jan.26 13:07:32] .
                        [2011.Jan.26 13:07:00] .
                        [2011.Jan.26 13:06:28] .
                        [2011.Jan.26 13:05:56] .
                        [2011.Jan.26 13:05:25] .
                        [2011.Jan.26 13:04:54] .
                        [2011.Jan.26 13:04:23] .
                        [2011.Jan.26 13:03:50] .
                        [2011.Jan.26 13:03:18] .
                        [2011.Jan.26 13:02:47] .
                        [2011.Jan.26 13:02:14] 71% /usr/lib/firefox-3.6.13/firefox-bin -ProfileManager -no-remote .

                        • frater
                          Senior Member
                          • Oct 2010
                          • 340

                          #13
                          This is in /etc/zabbix/zabbix_agentd.conf
                          Code:
                          ####### USER-DEFINED MONITORED PARAMETERS #######
                          # Format: UserParameter=<key>,<shell command>
                          # Note that shell command must not return empty string or EOL only
                          You can return an EOL ( echo -en '\n' )
                          If you don't return anything, the item will get disabled.

                          If you don't return a key with zabbix_sender, it's as if nothing is sent.
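The "must not return an empty string" rule quoted above can be enforced in one place with a small wrapper. This is a sketch under the assumption that a lone "." is an acceptable placeholder, as in the scripts earlier in this thread; `nonempty_or_placeholder` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical wrapper: runs the given command and prints its output, or the
# placeholder in $1 if the command printed nothing (or only whitespace).
nonempty_or_placeholder() {
    local placeholder="$1"
    shift
    local out
    out="$("$@" 2>/dev/null)"
    # strip all whitespace to detect "EOL only" output as empty
    if [ -z "${out//[[:space:]]/}" ]; then
        echo "${placeholder}"
    else
        echo "${out}"
    fi
}

# Possible usage (script path as used earlier in the thread):
#   nonempty_or_placeholder . /usr/local/sbin/topcpu 5
```

This keeps the item supported at the cost of storing the placeholder in the item history.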

                          You have modified the first example (awk), but have you tried the one I wrote in bash?

                          • tof233
                            Member
                            • Nov 2010
                            • 94

                            #14
                            Thank you Frater.
                            I tried with a printf("\n") (as echo doesn't work in awk) and the item still becomes not supported when no process is over the threshold.

                            I tested your last script. It's quite useful. But I still have the problem (a "." in the Zabbix history when there is no top process).

                            • frater
                              Senior Member
                              • Oct 2010
                              • 340

                              #15
                              Can't you first switch to my version of the script and we can work from there?