1、概述

在日常工作中，经常会需要对内核进行Debug、或者进行优化工作。一些简单的问题，可以通过dmesg/printk查看，优化借助一些工具进行。但是当问题逻辑复杂，优化面宽泛的时候，往往无从下手。需要从上到下、模块到模块之间分析，这时候就不得不借助于Linux提供的静态(Trace Event)动态(各种Tracer)进行分析。同时还不得不借助工具、或者编写脚本进行分析，以缩小问题范围、发现问题。直接拿printk和Trace作对比可能并不太合适，因为他们针对不同的使用场景，但是不管什么方法，能解决问题才是王道～。

ftrace 是 Function Trace 的简写，由 Steven Rostedt 开发的，从 2008 年发布的内核 2.6.27 中开始就内置了。这是为记录数据提供的一个调试 Ring 缓冲区的框架。这些数据由集成到内核中的跟踪程序来采集，现在大概分为两大类：tracer、event。

tracer。发展出了function tracer、function_graph tracer、irqsoff tracer、preemptoff tracer、wakeup tracer等一系列tracer。
event。也发展出trace event、kprobe event、uprobe event、syscall event等一系列的event。

trace采集数据的手段归根到底就两种：插桩、采样。ftrace是插桩法的集大成者，各种trace为了插桩使出了浑身解数给出了花样百变的插桩方法。

2、ftrace 文件接口

ftrace挂载在debugfs上，当前android系统中默认已经挂载了，不过为了方便，我这里首先基于ubuntu使用，后面在介绍android。基本一致。让我们首先看看这里面都是有什么

sudo tree -L 1 /sys/kernel/debug/tracing
├── available_events 
├── available_filter_functions
├── available_tracers
├── buffer_percent
├── buffer_size_kb
├── buffer_total_size_kb
├── current_tracer
├── dynamic_events
├── dyn_ftrace_total_info
├── enabled_functions
├── error_log
├── events
├── free_buffer
├── function_profile_enabled
├── hwlat_detector
├── instances
├── kprobe_events
├── kprobe_profile
├── max_graph_depth
├── options
├── per_cpu
├── printk_formats
├── README
├── saved_cmdlines
├── saved_cmdlines_size
├── saved_tgids
├── set_event
├── set_event_notrace_pid
├── set_event_pid
├── set_ftrace_filter
├── set_ftrace_notrace
├── set_ftrace_notrace_pid
├── set_ftrace_pid
├── set_graph_function
├── set_graph_notrace
├── snapshot
├── stack_max_size
├── stack_trace
├── stack_trace_filter
├── synthetic_events
├── timestamp_mode
├── trace
├── trace_clock
├── trace_marker
├── trace_marker_raw
├── trace_options
├── trace_pipe
├── trace_stat
├── tracing_cpumask
├── tracing_max_latency
├── tracing_on
├── tracing_thresh
├── uprobe_events
└── uprobe_profile

参数说明
接口	说明
tracing_on	0: disable ring buffer 1: enable ring buffer 这个仅仅只是禁止数据写入ring buffer，但是各种桩函数还是被调用，trace动作仍然发生，开销依然存在。
available_tracers	可用的跟踪程序
current_tracer	正在运行的跟踪程序
trace	该文件把ringbuffer数据输出成用户可读的格式。注意：在这个文件open时，ring buffer是临时关闭的。读操作并不会清除掉ring buffer中的数据，可以重复读；
trace_pipe	文件内容和“trace”文件是一样的，区别在于： 1、并行读，读操作不会disable写操作； 2、只支持读一次，该读操作会清除掉ring buffer中的数据，再次去读没有内容了；

3、Tracers

目前支持的Tracer有以下这些：

Tracer
Tracer	Description	说明
function	Function call tracer to trace all kernel functions.	追踪所有的内核函数
function_graph	Similar to the function tracer except that the function tracer probes the functions on their entry whereas the function graph tracer traces on both entry and exit of the functions. It then provides the ability to draw a graph of function calls similar to C code source.	和“function tracer”比较类似，但它除了探测函数的入口还探测函数的出口。它可以画出一个图形化的函数调用，类似于c源代码风格。
blk	The block tracer. The tracer used by the blktrace user application.	块设备tracer
hwlat	The Hardware Latency tracer is used to detect if the hardware produces any latency. See “Hardware Latency Detector” section below.	用来侦测硬件产生的延迟。
irqsoff	Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this trace with the latency-format option enabled, which happens automatically when the tracer is selected.	追踪最大关闭中断时间。
preemptoff	Similar to irqsoff but traces and records the amount of time for which preemption is disabled.	追踪最大关闭抢占时间。
preemptirqsoff	Similar to irqsoff and preemptoff, but traces and records the largest time for which irqs and/or preemption is disabled.	追踪关闭中断 and/or 关闭抢占的最大时间。
wakeup	Traces and records the max latency that it takes for the highest priority task to get scheduled after it has been woken up. Traces all tasks as an average developer would expect.	追踪最高优先级普通任务从获得调度到被唤醒的最大延迟时间。
wakeup_rt	Traces and records the max latency that it takes for just RT tasks (as the current “wakeup” does). This is useful for those interested in wake up timings of RT tasks.	追踪RT类型的任务从获得调度到被唤醒的最大延迟时间。
wakeup_dl	Traces and records the max latency that it takes for a SCHED_DEADLINE task to be woken (as the “wakeup” and “wakeup_rt” does).	追踪Deadline类型的任务从获得调度到被唤醒的最大延迟时间。
mmiotrace	A special tracer that is used to trace binary module. It will trace all the calls that a module makes to the hardware. Everything it writes and reads from the I/O as well.	追踪硬件IO
branch	This tracer can be configured when tracing likely/unlikely calls within the kernel. It will trace when a likely and unlikely branch is hit and if it was correct in its prediction of being correct.	追踪likely/unlikely的分支预测情况
nop	This is the “trace nothing” tracer. To remove all tracers from tracing simply echo “nop” into current_tracer.	空的tracer

直接使用

在开始介绍函数跟踪程序 ftrace 之前，我们先看一个测试脚本：

#!/bin/sh

dir=/sys/kernel/debug/tracing

sysctl kernel.ftrace_enabled=1
echo function > ${dir}/current_tracer
echo 1 > ${dir}/tracing_on
sleep 1
echo 0 > ${dir}/tracing_on
less ${dir}/trace

这个脚本是非常简单的，但是还有几个需要注意的地方。命令 sysctl ftrace.enabled=1 启用了函数跟踪程序。然后我们通过写它的名字到 current_tracer 文件来启用 current tracer。

接下来，我们写入一个 1 到 tracing_on，它启用了 Ring 缓冲区。这些语法都要求在 1 和 > 符号前后有一个空格；写成像 echo 1> tracing_on 这样将不能工作。一行之后我们禁用它（如果 0 写入到 tracing_on，缓冲区不会被清除并且 ftrace 并不会被禁用）。

我们为什么这样做呢？在两个 echo 命令之间，我们看到了命令 sleep 1。我们启用了缓冲区，运行了这个命令，然后禁用它。这将使跟踪程序采集了这个命令运行期间发生的所有系统调用的信息。

在脚本的最后一行，我们写了一个在控制台上显示跟踪数据的命令。

一旦脚本运行完成后，我们将看到下列的输出（这里只列出了一个小片断）：

# tracer: function
#
# entries-in-buffer/entries-written: 410125/1236625   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
          <idle>-0       [007] d...  1700.415152: _raw_spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0       [007] d...  1700.415152: _raw_spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0       [007] d...  1700.415153: timekeeping_max_deferment <-tick_nohz_next_event
          <idle>-0       [007] d...  1700.415153: hrtimer_next_event_without <-tick_nohz_get_sleep_length
          <idle>-0       [007] d...  1700.415153: _raw_spin_lock_irqsave <-hrtimer_next_event_without
          <idle>-0       [007] d...  1700.415154: __hrtimer_next_event_base <-hrtimer_next_event_without
          <idle>-0       [007] d...  1700.415154: __hrtimer_next_event_base <-hrtimer_next_event_without
          <idle>-0       [007] d...  1700.415155: _raw_spin_unlock_irqrestore <-hrtimer_next_event_without
          <idle>-0       [007] d...  1700.415155: nr_iowait_cpu <-menu_select

名词	解释
tracer	trace名称
TASK	进程名称
CPU#	进程所在的cpu
irqs-off	中断是否disable： `‘d’：中断disable； ‘.’：中断没有disable； ‘X’：架构不支持读取该信息`
need-resched	需要调度标志： `‘N’ both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set, ‘n’ only TIF_NEED_RESCHED is set, ‘p’ only PREEMPT_NEED_RESCHED is set, ‘.’ otherwise.`
hardirq/softirq	当前所处中断环境： `‘Z’ - NMI occurred inside a hardirq ‘z’ - NMI is running ‘H’ - hard irq occurred inside a softirq. ‘h’ - hard irq is running ‘s’ - soft irq is running ‘.’ - normal context.`
preempt-depth	表示抢占关闭的嵌套层级
time	时间戳，从latency开始的相对时间
delay	时间间隔大小所处的等级，可以迅速分类出不同的严重程度。 `‘$’ - greater than 1 second ‘@’ - greater than 100 milisecond ‘*’ - greater than 10 milisecond ‘#’ - greater than 1000 microsecond ‘!’ - greater than 100 microsecond ‘+’ - greater than 10 microsecond ‘ ‘ - less than or equal to 10 microsecond.`

ftrace利器之trace-cmd和kernelshark

trace-cmd是设置读取ftrace的命令行工具，kernelshark既可以记录数据，也可以图形化分析结果。

trace-cmd和kernelshark源码都由kernel.org维护在trace-cmd.git。

kernekshark还有自己的帮助网站kernelshark.org。

1. trace-cmd编译安装

可以通过git下载trace-cmd相关代码并编译安装：

1、安装依赖：
sudo apt-get install build-essential git cmake libjson-c-dev -y
sudo apt-get install freeglut3-dev libxmu-dev libxi-dev -y
sudo apt-get install qtbase5-dev -y
2、下载源码：
git clone https://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git/
3、编译安装：
cd trace-cmd
make gui
sudo make install_gui
或者
sudo apt-get install trace-cmd
sudo apt-get install kernelshark

2. trace-cmd的使用

2.1 trace-cmd -h

trace-cmd version 2.8.3
usage:
trace-cmd [COMMAND] ...
commands:
record - record a trace into a trace.dat file
start - start tracing without recording into a file
extract - extract a trace from the kernel------------------------------------------------将系统当前的trace保存到trace.dat中。
stop - stop the kernel from recording trace data-------------------------------------start和stop配置使用，用于开始停止录制。
restart - restart the kernel trace data recording
show - show the contents of the kernel tracing buffer------------------------------读取/sys/kernel/debug/tracing/trace
reset - disable all kernel tracing and clear the trace buffers----------------------对ftrace的设置和ring buffer复位
report - read out the trace stored in a trace.dat file
stream - Start tracing and read the output directly----------------------------------实时在shell中显式ftrace结果
profile - Start profiling and read the output directly
hist - show a historgram of the trace.dat information-------------------------------对trace.dat显式统计信息
stat - show the status of the running tracing (ftrace) system---------------------显示当前ftrace的events、ring buffer等情况
split - parse a trace.dat file into smaller file(s)
options - list the plugin options available for trace-cmd report
listen - listen on a network socket for trace clients
list - list the available events, plugins or options-------------------------------------显示当前ftrace支持的events、tracers、options
restore - restore a crashed record
snapshot - take snapshot of running trace
stack - output, enable or disable kernel stack tracing------------------------------echo 1 > /proc/sys/kernel/stack_tracer_enabled打开stack_tracer，然后trace-cmd stack查看
check-events - parse trace event formats

trace-cmd record开始记录，ctrl+c停止记录并保存到trace.dat中。

还可以通过trace-cmd reset对各种设置进行复位，然后trace-cmd start进行录制，trace-cmd stop停止录制，trace-cmd extract将数据保存到trace.dat中。

2.2 trace-cmd record

trace-cmd record用于录制ftrace信息，通过如下选项可以指定只跟踪特定traceevents，或者跟踪特定pid、或者跟踪特定funtion/function_graph函数。

还可以设置cpumask、ringbuffer大小等等。

usage:
trace-cmd record [-v][-e event [-f filter]][-p plugin][-F][-d][-D][-o file] \
[-s usecs][-O option ][-l func][-g func][-n func] \
[-P pid][-N host:port][-t][-r prio][-b size][-B buf][command ...]
[-m max][-C clock]
-e run command with event enabled-----------------------------------------------------指定只抓取某一事件或者某一类型事件
-f filter for previous -e event
-R trigger for previous -e event
-p run command with plugin enabled
-F filter only on the given process
-P trace the given pid like -F for the command-----------------------------------------只跟踪某一个pid
-c also trace the childen of -F
-C set the trace clock
-T do a stacktrace on all events
-l filter function name
-g set graph function
-n do not trace function
-m max size per CPU in kilobytes
-M set CPU mask to trace------------------------------------------------------------------这5个选项对应ftrace的设置set_ftrace_filter、set_graph_function、set_ftrace_notrace、buffer_size_kb、tracing_cpumask。
-v will negate all -e after it (disable those events)
-d disable function tracer when running
-D Full disable of function tracing (for all users)
-o data output file [default trace.dat]-----------------------------------------------------指定输出文件名
-O option to enable (or disable)
-r real time priority to run the capture threads
-s sleep interval between recording (in usecs) [default: 1000]--------------------默认1ms保存一次数据，加大有利于将此操作频率。1000000变成1s写一次数据。
-S used with --profile, to enable only events in command line
-N host:port to connect to (see listen)
-t used with -N, forces use of tcp in live trace
-b change kernel buffersize (in kilobytes per CPU)-----------------------------------改变ring buffer大小
-B create sub buffer and folling events will be enabled here
-k do not reset the buffers after tracing.
-i do not fail if an event is not found
--profile enable tracing options needed for report --profile
--func-stack perform a stack trace for function tracer
(use with caution)

如下表示只记录sched_switch和sched_wakeup两个时间，每1s写入一次数据。

trace-cmd record -e sched_switch -e sched_wakeup -s 1000000

一个不大不小的坑！正确发送ctrl+c

在使用trace-cmd record记录事件的时候，通过ctrl+c可以停止记录。

但是如果在adb shell中，ctrl+c可能优先退出了shell，而没有正常停止trace-cmd record。

最终在目录下只有trace.dat.cpuX的文件，这些文件是中间文件，kernelshark是无法解析的

解决方法有两种，一是在串口console中ctrl+c，另一种是通过kill发送SIGINT信号kill -2 pid。

3. kernelshark的使用

3.1 kernelshark记录事件

通过菜单栏执行Capture->Record，会弹出Capture对话框。

常用的设置有Events和Plugin，比如这里设置了sched_switch和sched_wakeup两个事件，选择了function_graph插件。

然后开始Run，停止使用Stop。

Close关闭后，可以在Kenelshark中查看细节。

可以看出这里不光显示了sched_switch和sched_wakeup事件，同时还显示了function_graph函数调用。

3.2 kernelshark分析结果

3.2.1 加载trace.dat

通过File->Load data加载trace-cmd record记录的数据，然后开始分析。

默认界面如下：

3.2.2 过滤功能

通常获取的数据需要进行一些过滤，才能发现问题。

filter提供了丰富的过滤功能，在File中提供了加载Load filter、保存Save filter、到处Export filter等功能。

具体在Filter中提供三类功能，分别针对事件events、进程tasks、CPU；还有两个菜单决定是否同步显式Filter结果。

task-打开某些task、关闭某些task；event-事件过滤以及高级事件过滤；cpu-开关某些cpu跟踪。

以最常用的场景，只想关注某些进程的内容：

3.4 图表扩展

横轴图表提供了两大类：CPU和进程。

cpu-每个cpu一个plot；task-可以选择不同task，一个task一个plot。

比如在Filter->tasks选定fork_rr_affinity，同时在Plots中选定fork_rr_affinity之后。

增加了4个plots，并且CUP0只显示了fork_rr_affinity相关执行情况，CPU1没有调度，说明affinity设置成功。

3.5 其它功能

还有一些其它有用的功能，比如搜索功能，选定要搜索的Column，文本框中输入字符串即可。

勾选graph follows，可以实现Plots和下面的events同步显示。

鼠标向右拖拉可以放大选择区域；向左拖拉则缩小当前选择区域。

向左向右调整显式区域，有助于全局或者细节查看执行情况。

4. 小结

trace-cmd作为ftrace的前端，对ftrace的各种设置进行包装，同时能对结果进行处理，极大地提高了ftrace的使用效率。

kernelshark作为trace-cmd的前端，借助图形化，灵活的filter，缩放功能，能更有效的帮助分析，高效的得到结果。

参考

ftrace利器之trace-cmd和kernelshark - ArnoldLu - 博客园

Linux ftrace 2.1、ftrace的使用

使用 ftrace 跟踪内核

原文链接：https://blog.csdn.net/hn_zhangkun/article/details/118462902