Zabbix China Summit 2022

9 Notes on selecting processes in proc.mem and proc.num items

Processes modifying their commandline

Some programs modify their command line to display their current activity. A user can see this activity by running the ps and top commands. Examples of such programs include PostgreSQL, Sendmail and Zabbix.

Let's look at an example from Linux. Assume we want to monitor the number of Zabbix agent processes.

The ps command shows the processes of interest as:

$ ps -fu zabbix
       UID        PID  PPID  C STIME TTY          TIME CMD
       ...
       zabbix    6318     1  0 12:01 ?        00:00:00 sbin/zabbix_agentd -c /home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf
       zabbix    6319  6318  0 12:01 ?        00:00:01 sbin/zabbix_agentd: collector [idle 1 sec]                          
       zabbix    6320  6318  0 12:01 ?        00:00:00 sbin/zabbix_agentd: listener #1 [waiting for connection]            
       zabbix    6321  6318  0 12:01 ?        00:00:00 sbin/zabbix_agentd: listener #2 [waiting for connection]            
       zabbix    6322  6318  0 12:01 ?        00:00:00 sbin/zabbix_agentd: listener #3 [waiting for connection]            
       zabbix    6323  6318  0 12:01 ?        00:00:00 sbin/zabbix_agentd: active checks #1 [idle 1 sec]                   
       ...

Selecting processes by name and user does the job:

$ zabbix_get -s localhost -k 'proc.num[zabbix_agentd,zabbix]'
       6

Now let's rename the zabbix_agentd executable to zabbix_agentd_30 and restart it.

ps now shows:

$ ps -fu zabbix
       UID        PID  PPID  C STIME TTY          TIME CMD
       ...
       zabbix    6715     1  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30 -c /home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf
       zabbix    6716  6715  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30: collector [idle 1 sec]                          
       zabbix    6717  6715  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30: listener #1 [waiting for connection]            
       zabbix    6718  6715  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30: listener #2 [waiting for connection]            
       zabbix    6719  6715  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30: listener #3 [waiting for connection]            
       zabbix    6720  6715  0 12:53 ?        00:00:00 sbin/zabbix_agentd_30: active checks #1 [idle 1 sec]                   
       ...

Now selecting processes by name and user produces an incorrect result:

$ zabbix_get -s localhost -k 'proc.num[zabbix_agentd_30,zabbix]'
       1

Why does simply renaming the executable to a longer name lead to such a different result?

Zabbix agent starts by checking the process name: it opens the /proc/<pid>/status file and checks the Name line. In our case the Name lines are:

$ grep Name /proc/{6715,6716,6717,6718,6719,6720}/status
       /proc/6715/status:Name:   zabbix_agentd_3
       /proc/6716/status:Name:   zabbix_agentd_3
       /proc/6717/status:Name:   zabbix_agentd_3
       /proc/6718/status:Name:   zabbix_agentd_3
       /proc/6719/status:Name:   zabbix_agentd_3
       /proc/6720/status:Name:   zabbix_agentd_3

The process name in the status file is truncated to 15 characters.

A similar result can be seen with the ps command:
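
The truncation can be reproduced with a short Python sketch. It assumes the Linux limit of 15 visible characters for a task name (TASK_COMM_LEN is 16 bytes including the terminating null byte) and a process that has not changed its own name. The helper name status_name is made up for this illustration.

```python
import os

# Assumption: Linux keeps at most 15 visible characters of the task name
# (TASK_COMM_LEN is 16 bytes, one of which is the terminating null byte).
COMM_VISIBLE_LEN = 15


def status_name(executable_path: str) -> str:
    """Return the name as it would appear on the Name line of /proc/<pid>/status."""
    return os.path.basename(executable_path)[:COMM_VISIBLE_LEN]


print(status_name("sbin/zabbix_agentd"))     # zabbix_agentd
print(status_name("sbin/zabbix_agentd_30"))  # zabbix_agentd_3
```

The original name fits within the limit, while the renamed executable loses its last character.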

$ ps -u zabbix
         PID TTY          TIME CMD
       ...
        6715 ?        00:00:00 zabbix_agentd_3
        6716 ?        00:00:01 zabbix_agentd_3
        6717 ?        00:00:00 zabbix_agentd_3
        6718 ?        00:00:00 zabbix_agentd_3
        6719 ?        00:00:00 zabbix_agentd_3
        6720 ?        00:00:00 zabbix_agentd_3
        ...

Obviously, that is not equal to the name parameter value zabbix_agentd_30 in our proc.num[] item. Having failed to match the process name from the status file, Zabbix agent turns to the /proc/<pid>/cmdline file.

How the agent sees the "cmdline" file can be illustrated by running this command:

$ for i in 6715 6716 6717 6718 6719 6720; do cat /proc/$i/cmdline | awk '{gsub(/\x0/,"<NUL>"); print};'; done
       sbin/zabbix_agentd_30<NUL>-c<NUL>/home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf<NUL>
       sbin/zabbix_agentd_30: collector [idle 1 sec]<NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL>...
       sbin/zabbix_agentd_30: listener #1 [waiting for connection]<NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL>...
       sbin/zabbix_agentd_30: listener #2 [waiting for connection]<NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL>...
       sbin/zabbix_agentd_30: listener #3 [waiting for connection]<NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL>...
       sbin/zabbix_agentd_30: active checks #1 [idle 1 sec]<NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL><NUL>...

In our case the /proc/<pid>/cmdline files contain invisible, non-printable null bytes, which terminate strings in the C language. The null bytes are shown as "<NUL>" in this example.

For the main process, Zabbix agent checks "cmdline" and takes zabbix_agentd_30 from it, which matches our name parameter value zabbix_agentd_30. So the main process is counted by the item proc.num[zabbix_agentd_30,zabbix].

When checking the next process, the agent takes zabbix_agentd_30: collector [idle 1 sec] from the cmdline file, and this does not match our name parameter value zabbix_agentd_30. So only the main process, which does not modify its command line, gets counted. The other agent processes modify their command lines and are ignored.
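
For readers who prefer to inspect the raw bytes programmatically, here is a minimal Python sketch. The sample byte strings are copied from the listing above. The amount of trailing padding is arbitrary because the listing elides it, and the helper names show_nuls and split_arguments are hypothetical.

```python
from typing import List


def show_nuls(raw: bytes) -> str:
    """Make the null bytes of a raw cmdline visible as <NUL>."""
    return raw.decode("utf-8", errors="replace").replace("\x00", "<NUL>")


def split_arguments(raw: bytes) -> List[str]:
    """Split a raw cmdline into its null-terminated strings.

    Empty strings are dropped, so padding disappears (and so would any
    genuinely empty argument, which is acceptable for this illustration).
    """
    text = raw.decode("utf-8", errors="replace")
    return [part for part in text.split("\x00") if part]


# On a live Linux system the bytes could be read with:
#   open("/proc/<pid>/cmdline", "rb").read()
main_raw = b"sbin/zabbix_agentd_30\x00-c\x00/home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf\x00"
# Four padding bytes are an arbitrary choice for this example.
collector_raw = b"sbin/zabbix_agentd_30: collector [idle 1 sec]" + b"\x00" * 4

print(show_nuls(main_raw))
print(split_arguments(main_raw))
print(split_arguments(collector_raw))
```

The main process yields three separate arguments, whereas the collector process, having overwritten its command line, yields a single string.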
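
The two-step check described above can be modelled in a few lines of Python. This is a simplified sketch of the behaviour as this page describes it, not code taken from Zabbix. The function name name_matches and the padding length are assumptions made for the example.

```python
import os
from typing import List, Tuple


def name_matches(name: str, status_name: str, raw_cmdline: bytes) -> bool:
    """Simplified model of the name check (not Zabbix source code)."""
    # Step 1: compare with the Name line of /proc/<pid>/status.
    if status_name == name:
        return True
    # Step 2: fall back to the first null-terminated string of cmdline,
    # with any leading directory stripped.
    first = raw_cmdline.split(b"\x00", 1)[0].decode("utf-8", errors="replace")
    return os.path.basename(first) == name


PAD = b"\x00" * 4  # arbitrary padding; the real amount is elided in the listing
TRUNCATED = "zabbix_agentd_3"  # Name line shared by all six processes

processes: List[Tuple[str, bytes]] = [
    (TRUNCATED, b"sbin/zabbix_agentd_30\x00-c\x00/home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf\x00"),
    (TRUNCATED, b"sbin/zabbix_agentd_30: collector [idle 1 sec]" + PAD),
    (TRUNCATED, b"sbin/zabbix_agentd_30: listener #1 [waiting for connection]" + PAD),
    (TRUNCATED, b"sbin/zabbix_agentd_30: listener #2 [waiting for connection]" + PAD),
    (TRUNCATED, b"sbin/zabbix_agentd_30: listener #3 [waiting for connection]" + PAD),
    (TRUNCATED, b"sbin/zabbix_agentd_30: active checks #1 [idle 1 sec]" + PAD),
]

count = sum(name_matches("zabbix_agentd_30", status, raw) for status, raw in processes)
print(count)  # 1
```

Only the first entry passes step 2, which reproduces the count of 1 returned by proc.num[zabbix_agentd_30,zabbix].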

This example shows that the name parameter cannot be used in proc.mem[] and proc.num[] for selecting processes in this case.

Using the cmdline parameter with a suitable regular expression produces the correct result:

$ zabbix_get -s localhost -k 'proc.num[,zabbix,,zabbix_agentd_30[ :]]'
       6
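
Why the expression zabbix_agentd_30[ :] selects all six processes can be checked with a small Python sketch. It assumes that null bytes in cmdline are treated as spaces before the expression is applied, and it uses Python's re module as a stand-in for the agent's regular expression engine. The name cmdline_matches, the padding length and the zabbix_agentd_301 binary are hypothetical.

```python
import re
from typing import List


def cmdline_matches(pattern: str, raw_cmdline: bytes) -> bool:
    """Apply a regular expression to a raw cmdline (simplified model)."""
    # Assumption: null bytes are treated as spaces before matching.
    text = raw_cmdline.decode("utf-8", errors="replace").replace("\x00", " ")
    return re.search(pattern, text) is not None


PAD = b"\x00" * 4  # arbitrary padding; the real amount is elided in the listing

cmdlines: List[bytes] = [
    b"sbin/zabbix_agentd_30\x00-c\x00/home/zabbix/ZBXNEXT-1078/zabbix_agentd.conf\x00",
    b"sbin/zabbix_agentd_30: collector [idle 1 sec]" + PAD,
    b"sbin/zabbix_agentd_30: listener #1 [waiting for connection]" + PAD,
    b"sbin/zabbix_agentd_30: listener #2 [waiting for connection]" + PAD,
    b"sbin/zabbix_agentd_30: listener #3 [waiting for connection]" + PAD,
    b"sbin/zabbix_agentd_30: active checks #1 [idle 1 sec]" + PAD,
]

pattern = r"zabbix_agentd_30[ :]"
count = sum(cmdline_matches(pattern, raw) for raw in cmdlines)
print(count)  # 6

# A hypothetical binary with a longer name is not selected by accident.
print(cmdline_matches(pattern, b"sbin/zabbix_agentd_301\x00-c\x00"))  # False
```

The character class [ :] accepts the space after the main process name and the colon used by the child processes, while rejecting names that merely start with zabbix_agentd_30.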

Be careful when using proc.mem[] and proc.num[] items to monitor programs that modify their command lines.

Before putting name and cmdline parameters into proc.mem[] and proc.num[] items, you may want to test them using a proc.num[] item and the ps command.

Linux kernel threads

Kernel threads cannot be selected with the cmdline parameter in proc.mem[] and proc.num[] items

Let's take one of the kernel threads as an example:

$ ps -ef | grep kthreadd
       root         2     0  0 09:33 ?        00:00:00 [kthreadd]

It can be selected with the process name parameter:

$ zabbix_get -s localhost -k 'proc.num[kthreadd,root]'
       1

But selection by the process cmdline parameter does not work:

$ zabbix_get -s localhost -k 'proc.num[,root,,kthreadd]'
       0

The reason is that Zabbix agent takes the regular expression specified in the cmdline parameter and applies it to the contents of the process's /proc/<pid>/cmdline file. For kernel threads this file is empty, so the cmdline parameter never matches.

Counting of threads in proc.mem[] and proc.num[] items

Linux kernel threads are counted by the proc.num[] item but do not report memory in the proc.mem[] item. For example:

$ ps -ef | grep kthreadd
       root         2     0  0 09:51 ?        00:00:00 [kthreadd]

$ zabbix_get -s localhost -k 'proc.num[kthreadd]'
       1

$ zabbix_get -s localhost -k 'proc.mem[kthreadd]'
       ZBX_NOTSUPPORTED: Cannot get amount of "VmSize" memory.

But what happens if there is a user process with the same name as a kernel thread? Then it could look like this:

$ ps -ef | grep kthreadd
       root         2     0  0 09:51 ?        00:00:00 [kthreadd]
       zabbix    9611  6133  0 17:58 pts/1    00:00:00 ./kthreadd

$ zabbix_get -s localhost -k 'proc.num[kthreadd]'
       2

$ zabbix_get -s localhost -k 'proc.mem[kthreadd]'
       4157440

proc.num[] counted both the kernel thread and the user process. proc.mem[] reports memory for the user process only and counts the kernel thread's memory as if it were 0. This differs from the case above, where ZBX_NOTSUPPORTED was reported.

Be careful when using proc.mem[] and proc.num[] items if the program name happens to match the name of one of the kernel threads.

Before putting parameters into proc.mem[] and proc.num[] items, you may want to test them using a proc.num[] item and the ps command.
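
The difference between the two proc.mem[] outcomes can be captured in a small Python model. It covers only the two cases shown on this page and is not derived from Zabbix source code. The function name proc_mem_sum is hypothetical, None stands in for a process without a VmSize value (a kernel thread), and a None result stands in for ZBX_NOTSUPPORTED.

```python
from typing import List, Optional


def proc_mem_sum(vm_sizes: List[Optional[int]]) -> Optional[int]:
    """Sum the VmSize of matched processes; None models ZBX_NOTSUPPORTED.

    The case of no matching process at all is not modelled here.
    """
    reported = [size for size in vm_sizes if size is not None]
    if not reported:
        # No matched process reports memory, e.g. only a kernel thread matched.
        return None
    # Processes without a VmSize value contribute nothing to the total.
    return sum(reported)


print(proc_mem_sum([None]))           # None
print(proc_mem_sum([None, 4157440]))  # 4157440
```

As soon as one matched process reports memory, the kernel thread is silently counted as 0 instead of making the item unsupported.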