osgx osgx - 10 months ago 101
Linux Question

How to use linux `perf` tool to generate "Off-CPU" profile

Brendan D. Gregg (author of the DTrace book) has an interesting variant of profiling: "Off-CPU" profiling (and the Off-CPU Flame Graph; slides 2013, p112-137), which shows where a thread or application was blocked (not executing on a CPU, but waiting for I/O, waiting in the page-fault handler, or descheduled because of a shortage of CPU resources):

This time reveals which code-paths are blocked and waiting while off-CPU, and for how long exactly. This differs from traditional profiling which often samples the activity of threads at a given interval, and (usually) only examine threads if they are executing work on-CPU.

He can also combine Off-CPU and On-CPU profile data.

The examples given by Gregg are made using `dtrace`, which is not usually available on Linux. But there are some similar tools (ktap, systemtap, perf), and `perf`, I think, has the widest installed base. Usually `perf` generates On-CPU profiles (which functions were executed most often on the CPU).

  • How can I translate Gregg's Off-CPU examples to the `perf` profiling tool in Linux?

PS: There is a link to a SystemTap variant of Off-CPU flame graphs in the slides from LISA13, p124: "Yichun Zhang created these, and has been using them on Linux with SystemTap to collect the profile data. See: •" (CloudFlare Beer Meeting on 23 August 2013)


Brendan Gregg has published instructions for generating off-CPU flame graphs:

Off-CPU time flame graphs may solve (say) 60% of the issues, with the remainder requiring walking the thread wakeups to find root cause. I explained off-CPU time flame graphs, this wakeup issue, and additional work, in my LISA13 talk on flame graphs (slides, youtube).

Here I'll show one way to do off-CPU time flame graphs using Linux perf_events.

# perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
# perf inject -v -s -i perf.data.raw -o perf.data
# perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

`stackcollapse.pl` and `flamegraph.pl` from Gregg are used to draw the flame graph.
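
The awk stage is the least obvious part of the pipeline above, so here is a minimal sketch that runs the same filter on a hand-written sample instead of real profile data. The sample lines are made up for illustration and only assume the layout the filter relies on (command name in field 1, period in nanoseconds in field 5, one frame per line with the symbol in field 2, a blank line ending each stack). `fold_offcpu` and `sample` are hypothetical helper names, not part of perf or of Gregg's scripts.

```shell
#!/bin/sh
# The awk filter from the pipeline, wrapped in a function so it can be tried without perf.
fold_offcpu() {
    awk '
    # Event header line: remember the command name and the period converted to ms.
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    # Stack frame line (address, symbol, dso): print only the symbol.
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    # Blank line ends the stack: append the command name and the time in ms.
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }'
}

# Made-up input (an assumption, not captured from a real run):
# one event that was off-CPU for 250 ms and one for 0.4 ms.
sample() {
    printf '%s\n' \
        'myapp 1234/1234 [000] 100.000000: 250000000 sched:sched_switch: prev_comm=myapp' \
        '    ffffffff81000001 __schedule ([kernel.kallsyms])' \
        '    ffffffff81000002 schedule ([kernel.kallsyms])' \
        '    00007f0000000003 read (/lib/libc.so.6)' \
        '' \
        'myapp 1234/1234 [000] 100.300000: 400000 sched:sched_switch: prev_comm=myapp' \
        '    ffffffff81000001 __schedule ([kernel.kallsyms])' \
        ''
}

sample | fold_offcpu
```

For the first event this prints the three symbols (`__schedule`, `schedule`, `read`), then `myapp`, then `250`, followed by a blank line. That is the stack-then-count layout the pipeline feeds to `stackcollapse.pl`. The second event produces no output, because its period rounds down to 0 ms and the filter drops it.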

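The perf options used here are for 3.17 and newer kernels (in more recent perf versions, `perf script` takes the field list with `-F`, because `-f` now means `--force`)...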