Perf can do many things, like collecting cache misses, context switches, per-thread and per-CPU statistics, etc., but it needs kernel support. Perf is usually used for system performance debugging and gperftools for application performance debugging (perf can also handle application performance debugging).
The most commonly used perf commands are:
  annotate        Read perf.data (created by perf record) and display annotated code
  diff            Read two perf.data files and display the differential profile
  list            List all symbolic event types
  probe           Define new dynamic tracepoints
  record          Run a command and record its profile into perf.data
  report          Read perf.data (created by perf record) and display the profile
  script          Read perf.data (created by perf record) and display trace output
  stat            Run a command and gather performance counter statistics
  top             System profiling tool
For details about each command:
$ perf annotate --help
Summary: perf has two main ACTIONS for profiling, 'stat' and 'record'. 'stat' uses counters, while 'record' uses samples (xx samples per second) and can show call graphs (stacks). Both collect information about the system, CPUs, threads, and call graphs.
EVENT is the core part of perf: it is what perf monitors. EVENT types include:
  Hardware events (cache misses, LLC cache, etc.)
  Software events (page faults, context switches, etc.)
  Tracepoints predefined in the kernel (requires a kernel compiled with debugfs)
  Dynamic events (created by perf probe --add)
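As a minimal sketch of the two actions (./myapp is a hypothetical binary; the events are standard hardware/software events):
$ perf stat -e cycles,context-switches ./myapp       # 'stat': counter totals for a hardware and a software event
$ perf record -g -F 99 ./myapp                       # 'record': sample at 99 Hz and capture call graphs (stacks)
$ perf report --stdio                                # browse the recorded samples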
# run a command under perf stat
$ perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
$ perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
# OR attach to a running process
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -p $pid
Important options:
-i: by default (without -i), perf stat counts all processes forked (and threads cloned) from the target process after counting starts. With perf stat ./program it profiles all child processes too, and with perf stat -p $PID (attaching to an already running $PID) it profiles the target process and all child processes started after attaching. Use -i to disable counting children.
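A small sketch of the difference (./program is a placeholder binary):
$ perf stat -e cycles ./program        # counts ./program plus any children it forks
$ perf stat -i -e cycles ./program     # -i (--no-inherit): counts only ./program itself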
# ================= remote cache and remote dram access for multiple numa =======================================
#  offcore_response.all_data_rd.llc_miss.remote_dram
#       [Counts all demand & prefetch data reads that miss the L3 and the data is returned from remote dram]
#  offcore_response.all_data_rd.llc_miss.remote_hit_forward
#       [Counts all demand & prefetch data reads that miss the L3 and clean or shared data is transferred from remote cache]
#  offcore_response.all_data_rd.llc_miss.remote_hitm
#       [Counts all demand & prefetch data reads that miss the L3 and the modified data is transferred from remote cache]
#  offcore_response.all_reads.llc_miss.remote_dram
#       [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the data is returned from remote dram]
#  offcore_response.all_reads.llc_miss.remote_hit_forward
#       [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and clean or shared data is transferred from remote cache]
#  offcore_response.all_reads.llc_miss.remote_hitm
#       [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the modified data is transferred from remote cache]
#  offcore_response.demand_rfo.llc_miss.remote_hitm
#       [Counts all demand data writes (RFOs) that miss the L3 and the modified data is transferred from remote cache]
# ================= remote cache and remote dram access for multiple numa =======================================
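For example (assuming your CPU exposes these offcore events; check perf list), remote DRAM reads can be counted system-wide at a one-second interval:
$ perf stat -e offcore_response.all_data_rd.llc_miss.remote_dram -a -I 1000 sleep 10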
# split lock
$ perf stat -e cpu/event=0xf4,umask=0x10/ -a -I 1000
# which is equivalent to
$ perf stat -e sq_misc.split_lock -a -I 1000
# run an application that makes atomic memory accesses across a cache line, hence generating split locks
$ perf stat -e sq_misc.split_lock -a -I 100 ./sp
#           time             counts unit events
     0.108722215             36,456      sq_misc.split_lock
     0.214217317             37,401      sq_misc.split_lock
     0.318182137             37,700      sq_misc.split_lock
# total last level cache size
$ lscpu | grep 'L3 cache'
L3 cache:              40960K
# Intel CQM needs hardware and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RDT_A' /boot/config-$(uname -r)
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 -p $pid
# last level cache occupancy
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 sleep 100
#           time             counts unit  events
     1.004945555         35,061,760 Bytes intel_cqm/llc_occupancy/
     2.007805514         46,989,312 Bytes intel_cqm/llc_occupancy/
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 ./sp
#           time             counts unit  events
     1.011122701         31,391,744 Bytes intel_cqm/llc_occupancy/
     2.018970467         42,401,792 Bytes intel_cqm/llc_occupancy/
$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 sleep 30
#           time             counts unit  events
     1.007463857               0.00 MB    intel_cqm/total_bytes/
     2.011665924               0.00 MB    intel_cqm/total_bytes/
$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 sleep 30
#           time             counts unit  events
     1.008000842               0.00 MB    intel_cqm/local_bytes/
     2.012502066               0.00 MB    intel_cqm/local_bytes/
# -----For memory bandwidth on newer kernels (>= 4.14), use this way----------
# this needs CPU and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RESCTRL' /boot/config-$(uname -r)
# Mount the resctrl filesystem.
$ mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# after mounting, check
$ ls -p /sys/fs/resctrl/
cpus  cpus_list  info/  mode  mon_data/  mon_groups/  schemata  size  tasks
# ======================host level=====================
# Print the llc occupancy on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
# Print the number of local bytes on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
# Print the number of total bytes on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
# Print the llc occupancy on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy
# Print the number of local bytes on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
# Print the number of total bytes on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
# ======================host level=====================
# ====================== group level=====================
# group level means collecting MBM data of all tasks in this group
$ mkdir /sys/fs/resctrl/mon_groups/test
$ ls /sys/fs/resctrl/mon_groups/test
cpus  cpus_list  mon_data/  tasks
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00
llc_occupancy  mbm_local_bytes  mbm_total_bytes
# mbm data of this group
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/
mon_L3_00/  mon_L3_01/
# then add a task (process) to this group
$ echo 245543 > /sys/fs/resctrl/mon_groups/test/tasks
# ====================== group level=====================
# -----For memory bandwidth on newer kernels (>= 4.14), use this way----------
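A minimal sketch, assuming resctrl is mounted as above: sample mbm_total_bytes twice one second apart to estimate the memory bandwidth of socket 0.
$ B1=$(cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes); sleep 1
$ B2=$(cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes)
$ echo "$(( (B2 - B1) / 1048576 )) MB/s on socket 0"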
# command help
# List all symbolic event types; an event also has another format: Raw, which is a number!!!
$ perf list [hw|sw|cache|tracepoint|pmu|event_glob]
  - 'hw' or 'hardware' to list hardware events such as cache-misses, etc.
  - 'sw' or 'software' to list software events such as context switches, etc.
  - 'cache' or 'hwcache' to list hardware cache events such as L1-dcache-loads, etc.
  - 'tracepoint' to list all tracepoint events; alternatively use subsys_glob:event_glob to filter by tracepoint subsystems such as sched, block, etc.
  - 'pmu' to print the kernel supplied PMU events.
  - As a last resort, it will do a substring search in all event names.
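For example, filtering the listing (the glob below is just an illustration):
$ perf list hw                 # only hardware events
$ perf list 'sched:*'          # only tracepoints in the sched subsystem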
# list the available PMUs
$ ls /sys/devices/*/events/
/sys/devices/cpu/events/:
branch-instructions  bus-cycles    cache-references  cycles-ct  el-abort     el-commit    el-start      mem-loads   ref-cycles  tx-capacity  tx-conflict
branch-misses        cache-misses  cpu-cycles        cycles-t   el-capacity  el-conflict  instructions  mem-stores  tx-abort    tx-commit    tx-start
# check the Raw form of one PMU event
$ cat /sys/devices/cpu/events/cache-misses
event=0x2e,umask=0x41
# L1 cache
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores sleep 1
# L3 cache
$ perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches sleep 1
$ perf stat -e cache-misses sleep 1
# is equal to
$ perf stat -e cpu/event=0x2e,umask=0x41/ sleep 1
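The same counter can also be written in the raw rUUEE form (umask 0x41, event 0x2e), assuming an Intel CPU where cache-misses maps to this encoding:
$ perf stat -e r412e sleep 1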
perf for application performance
Even though perf can check application performance, it does this from a system view of that process: it collects events and stats for that process, and in the call graph kernel symbols and kernel functions are counted as well, unlike gperftools which only looks at user functions.
Suggestion: if you want to debug a performance issue in application code, gperftools is a good choice; if you want to see the whole path of the application, user level and kernel level, use perf.
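As an illustration of the whole-path view (./test is the sample program below):
$ perf record -g ./test     # sample user and kernel stacks
$ perf report --stdio       # the report mixes user symbols ([.]) and kernel symbols ([k])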
#include <stdio.h>
#include <stdlib.h>

void func1() {
    int i = 0;
    while (i < 100000) {
        ++i;
    }
}

void func2() {
    int i = 0;
    while (i < 200000) {
        ++i;
    }
}

void func3() {
    int i = 0;
    for (; i < 1000; ++i) {
        func1();
        func2();
    }
}

int main() {
    func3();
    return 0;
}
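A note on building it (the exact flags the author used are not stated): compiling with -g and without optimization keeps the busy loops in place and gives perf annotate source lines.
$ gcc -g -O0 -o test test.c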
$ perf record ./test
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 2K of event 'cpu-clock'
# Event count (approx.): 723750000
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..........................
#
    66.22%  test     test               [.] func2
    33.75%  test     test               [.] func1
     0.03%  test     [kernel.kallsyms]  [k] get_page_from_freelist
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

$ perf annotate -s func2
 Percent |      Source code & Disassembly of test
--------------------------------------------------
         :
         :      Disassembly of section .text:
         :
         :      0000000000400509 <func2>:
         :          int i = 0;
         :          while (i < 100000) {
         :              ++i;
         :          }
         :      }
         :      void func2() {
    0.00 :        400509:       push   %rbp
    0.00 :        40050a:       mov    %rsp,%rbp
         :          int i = 0;
    0.00 :        40050d:       movl   $0x0,-0x4(%rbp)
         :          while (i < 200000) {
    0.05 :        400514:       jmp    40051a <func2+0x11>
         :              ++i;
   17.48 :        400516:       addl   $0x1,-0x4(%rbp)
         :              ++i;
         :          }
         :      }
         :      void func2() {
         :          int i = 0;
         :          while (i < 200000) {
    1.88 :        40051a:       cmpl   $0x30d3f,-0x4(%rbp)
   80.59 :        400521:       jle    400516 <func2+0xd>
         :              ++i;
         :          }
         :      }
    0.00 :        400523:       pop    %rbp
    0.00 :        400524:       retq
# Besides this, you can monitor system information (events) while running your process, e.g. see how many cache misses occur when executing test
$ perf stat -e instructions,cycles,branches,branch-misses,cache-misses ./test
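A small variation, using only standard perf stat options: -r repeats the run several times and reports the mean and stddev of each counter.
$ perf stat -r 5 -e instructions,cycles,branches,branch-misses,cache-misses ./test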
#===============attach to a running process===================================
# show top cpu usage for a process from a system-wide view (includes kernel symbols)
$ sysctl -w kernel.pax.softmode=1
$ sysctl -w kernel.kptr_restrict=1
$ perf top -p $pid
# show cpu counters until Ctrl+C
$ perf stat -p $pid
# show syscall events for $PID for 10 seconds
$ perf stat -e "syscalls:sys_*" -p $PID sleep 10
# show syscall events for $PID until Ctrl+C
$ perf stat -e "syscalls:sys_*" -p $PID
# record the cpu-clock event for a given pid until Ctrl+C
$ perf record -p $pid
# record context-switch events of a given pid until Ctrl+C
$ perf record -e "cs" -p $pid
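To also capture call graphs while attached (a minimal sketch; stop with Ctrl+C as above):
$ perf record -g -p $pid
$ perf report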
$ perf probe --add tcp_sendmsg
Failed to find path of kernel module.
Added new event:
  probe:tcp_sendmsg    (on tcp_sendmsg)
You can now use it in all perf tools, such as:
    perf record -e probe:tcp_sendmsg -a sleep 10
    perf stat -e probe:tcp_sendmsg -a sleep 10
$ perf record -e probe:tcp_sendmsg -aR sleep 10
# show currently probed events
$ perf probe -l
# delete the dynamic tracepoint
$ perf probe -d tcp_sendmsg
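If kernel debuginfo is available, the probe can also capture a function argument. A hedged sketch (the 'size' argument name assumes the tcp_sendmsg signature in your kernel):
$ perf probe --add 'tcp_sendmsg size'
$ perf record -e probe:tcp_sendmsg -a sleep 5
$ perf script        # prints each hit along with its size= value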
# show page faults, cache misses, etc. while running a command (add -a for the whole system)
$ perf stat -B -e context-switches,cpu-migrations,cache-references,cache-misses,cycles,instructions,page-faults sleep 5
 Performance counter stats for 'sleep 5':
                 1      context-switches
                 1      cpu-migrations
            18,143      cache-references
               957      cache-misses              #    5.275 % of all cache refs
         1,598,875      cycles
           667,345      instructions              #    0.42  insn per cycle
               179      page-faults
5.002159188 seconds time elapsed
# default output for perf stat
$ perf stat -B sleep 5
 Performance counter stats for 'sleep 5':
# check which functions triggered cache-miss or page-fault events on the system
$ perf record -e cache-misses sleep 5
$ perf record -e page-faults sleep 5
$ perf report
# check which functions triggered cache-miss or page-fault events of a given process
$ perf record -e cache-misses -p $pid
# perf stat with multiple events
$ perf stat -e intel_cqm/local_bytes/,intel_cqm/total_bytes/ -a -I 1000
FAQ
perf kvm stat vs perf stat
You can use the perf command with the kvm option to collect guest operating system statistics from the host.
In order to use perf kvm record on the host, you must have access to the /proc/modules and /proc/kallsyms files from the guest; otherwise, perf report cannot show symbols correctly. These two files are not needed for perf kvm stat, as it does not use the symbol table.
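A hedged sketch of both modes ($qemu_pid and the /tmp paths are placeholders; the kallsyms/modules files must be copied from the guest to the host first):
# vmexit statistics, no guest symbols needed
$ perf kvm stat record -p $qemu_pid
$ perf kvm stat report
# sampling with guest symbols, which needs the copied guest files
$ perf kvm --host --guest --guestkallsyms=/tmp/guest-kallsyms \
      --guestmodules=/tmp/guest-modules record -a -o perf.data.kvm sleep 10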