linux-performance-perf

Overview

Perf can collect many kinds of data: cache misses, context switches, per-thread and per-CPU statistics, and more, but it needs kernel support. Perf is typically used for system performance debugging, while gperftools is used for application performance debugging (perf can also be used for application performance debugging).
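A minimal illustration of the kind of data it collects (the event names and the 5-second window are arbitrary examples):

# count cache misses and context switches system-wide for 5 seconds
$ perf stat -a -e cache-misses,context-switches sleep 5
# the same counters reported per CPU instead of aggregated
$ perf stat -a -A -e cache-misses,context-switches sleep 5
# or attached to a single running process
$ perf stat -p $pid -e cache-misses,context-switches sleep 5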


Usage

#perf
usage: perf [--version] [--help] COMMAND [ARGS]

The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
diff Read two perf.data files and display the differential profile
list List all symbolic event types
probe Define new dynamic tracepoints
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
top System profiling tool.

# for details about each command
$ perf annotate --help
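As a small illustration of one of the less obvious commands, 'perf diff' compares two recordings; by default it diffs perf.data.old against perf.data (a sketch; ./your_app is a placeholder workload):

# record a baseline, record a second run, then compare the two profiles
$ perf record -o perf.data.old ./your_app
$ perf record -o perf.data ./your_app
$ perf diff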

Summary: perf has two main ACTIONS for profiling, 'stat' and 'record'.
'stat' uses counters, while 'record' uses sampling (N samples per second)
and can show call graphs (stacks),
collecting information about the system, CPUs, threads, and call paths.
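A minimal side-by-side sketch of the two actions (./your_app is a placeholder workload):

# 'stat': count events over the whole run and print totals at the end
$ perf stat ./your_app
# 'record': sample periodically, keep stacks with -g, then browse the profile
$ perf record -g ./your_app
$ perf report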

EVENTs are the core of what perf can monitor. EVENTs include:
- Hardware events (cache misses, LLC loads, etc.)
- Software events (page faults, context switches, etc.)
- Tracepoints predefined in the kernel (require a kernel compiled with debugfs support)
- Dynamic events (created with 'perf probe --add')
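Each category can be listed separately with 'perf list' (covered in more detail in the Usage section below):

$ perf list hw          # hardware events
$ perf list sw          # software events
$ perf list tracepoint  # kernel tracepoints
$ perf probe -l         # dynamic events currently defined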

Event overview

# run a command under perf stat
$ perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
$ perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]

# OR attach to a running process
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -p $pid

# important options
#
# -i (--no-inherit): by default (without -i) perf stat counts the target process and all
# processes forked (and threads cloned) from it after counting starts. With `perf stat ./program`
# it also profiles all child processes, and with `perf stat -p $PID` (attaching to an already
# running $PID) it profiles the target process plus all children started after attaching.
# Use -i to disable counting of children.
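# for example (a sketch; ./program is a placeholder workload): compare the default,
# which also counts children forked after start, with -i / --no-inherit, which counts only the parent
$ perf stat -e context-switches ./program
$ perf stat -i -e context-switches ./program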

# ================= remote cache and remote dram access for multiple numa =======================================
# offcore_response.all_data_rd.llc_miss.remote_dram
# [Counts all demand & prefetch data reads that miss the L3 and the data is returned from remote dram]
# offcore_response.all_data_rd.llc_miss.remote_hit_forward
# [Counts all demand & prefetch data reads that miss the L3 and clean or shared data is transferred from remote cache]
# offcore_response.all_data_rd.llc_miss.remote_hitm
# [Counts all demand & prefetch data reads that miss the L3 and the modified data is transferred from remote cache]
# offcore_response.all_reads.llc_miss.remote_dram
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the data is returned from remote dram]
# offcore_response.all_reads.llc_miss.remote_hit_forward
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and clean or shared data is transferred from remote cache]
# offcore_response.all_reads.llc_miss.remote_hitm
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the modified data is transferred from remote cache]
# offcore_response.demand_rfo.llc_miss.remote_hitm
# [Counts all demand data writes (RFOs) that miss the L3 and the modified data is transferred from remote cache]
# ================= remote cache and remote dram access for multiple numa =======================================

# splitlock
$ perf stat -e cpu/event=0xf4,umask=0x10/ -a -I 1000
# equivalent to the symbolic event name:
$ perf stat -e sq_misc.split_lock -a -I 1000

# run an application that performs atomic accesses straddling a cache-line boundary, hence generating split locks
$ perf stat -e sq_misc.split_lock -a -I 100 ./sp
# time counts unit events
0.108722215 36,456 sq_misc.split_lock
0.214217317 37,401 sq_misc.split_lock
0.318182137 37,700 sq_misc.split_lock

# total last level cache
$ lscpu | grep 'L3 cache'
L3 cache: 40960K

# Intel CQM needs hardware and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RDT_A' /boot/config-$(uname -r)

$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 -p $pid

# last level cache occupancy
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 sleep 100
# time counts unit events
1.004945555 35,061,760 Bytes intel_cqm/llc_occupancy/
2.007805514 46,989,312 Bytes intel_cqm/llc_occupancy/
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 ./sp
# time counts unit events
1.011122701 31,391,744 Bytes intel_cqm/llc_occupancy/
2.018970467 42,401,792 Bytes intel_cqm/llc_occupancy/

$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 sleep 30
# time counts unit events
1.007463857 0.00 MB intel_cqm/total_bytes/
2.011665924 0.00 MB intel_cqm/total_bytes/

$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 sleep 30
# time counts unit events
1.008000842 0.00 MB intel_cqm/local_bytes/
2.012502066 0.00 MB intel_cqm/local_bytes/

# -----For memory bandwidth on newer kernels (>=4.14), use resctrl as follows----------
# this needs CPU and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RESCTRL' /boot/config-$(uname -r)

# Mount the resctrl filesystem.
$ mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# after mount check
$ ls -p /sys/fs/resctrl/
cpus cpus_list info/ mode mon_data/ mon_groups/ schemata size tasks
# ======================host level=====================
# Print the LLC occupancy (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
# Print the local memory bandwidth counter (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
# Print the total memory bandwidth counter (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes

# Print the LLC occupancy (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy
# Print the local memory bandwidth counter (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
# Print the total memory bandwidth counter (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
# ======================host level=====================

# ====================== group level=====================
# group level means collecting MBM data for all tasks in this group
$ mkdir /sys/fs/resctrl/mon_groups/test
$ ls /sys/fs/resctrl/mon_groups/test
cpus cpus_list mon_data/ tasks
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00
llc_occupancy mbm_local_bytes mbm_total_bytes
# mbm data of this group
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/
mon_L3_00/ mon_L3_01/

# then add a task (process) to this group
$ echo 245543 > /sys/fs/resctrl/mon_groups/test/tasks
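# the per-group counters can then be read just like the host-level ones, e.g.
$ cat /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00/mbm_total_bytes
$ cat /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00/llc_occupancy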
# ====================== group level=====================
# -----For memory bandwith on new kernel(>=4.14), use this way----------

Usage

# command help
# List all symbolic event types; events also have another format, a raw numeric code
$ perf list [hw|sw|cache|tracepoint|pmu|event_glob]
- 'hw' or 'hardware' to list hardware events such as cache-misses, etc.
- 'sw' or 'software' to list software events such as context switches, etc.
- 'cache' or 'hwcache' to list hardware cache events such as L1-dcache-loads, etc.
- 'tracepoint' to list all tracepoint events, alternatively use subsys_glob:event_glob to filter by tracepoint subsystems such as sched, block, etc
- 'pmu' to print the kernel supplied PMU events.
- As a last resort, it will do a substring search in all event names

# list the available PMUs
$ ls /sys/devices/*/events/
/sys/devices/cpu/events/:
branch-instructions bus-cycles cache-references cycles-ct el-abort el-commit el-start mem-loads ref-cycles tx-capacity tx-conflict
branch-misses cache-misses cpu-cycles cycles-t el-capacity el-conflict instructions mem-stores tx-abort tx-commit tx-start

/sys/devices/intel_cqm/events/:
llc_occupancy llc_occupancy.scale llc_occupancy.unit local_bytes.per-pkg local_bytes.unit total_bytes.per-pkg total_bytes.unit
llc_occupancy.per-pkg llc_occupancy.snapshot local_bytes local_bytes.scale total_bytes total_bytes.scale

/sys/devices/power/events/:
energy-cores energy-cores.scale energy-cores.unit energy-gpu energy-gpu.scale energy-gpu.unit energy-pkg energy-pkg.scale energy-pkg.unit energy-ram energy-ram.scale energy-ram.unit

/sys/devices/uncore_imc_0/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_1/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_4/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_5/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

# check the raw encoding of a PMU event
$ cat /sys/devices/cpu/events/cache-misses
event=0x2e,umask=0x41

# L1 cache
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores sleep 1
# L3 cache
$ perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches sleep 1

$ perf stat -e cache-misses sleep 1
# equivalent to:
$ perf stat -e cpu/event=0x2e,umask=0x41/ sleep 1
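# the raw form should also be equivalent (for Intel core events the raw code is roughly
# umask<<8 | event, so event=0x2e,umask=0x41 -> r412e); verify the encoding on your CPU
$ perf stat -e r412e sleep 1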

perf for application performance

Perf can also examine application performance, but it does so from the system's point of view of that process: it collects events and statistics for the process, and its call graphs include kernel symbols and kernel functions as well, unlike gperftools, which profiles user functions only.

Suggestion
If you want to debug a performance issue in application code, gperftools is a good choice; if you want to see the whole path of the application, both user level and kernel level, use perf.
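If you only care about user-space samples while still using perf, the event privilege modifiers can restrict what is counted (a sketch; ./test is the sample program built below):

# sample user-space cycles only (:u), or compare user vs kernel instructions (:u / :k)
$ perf record -e cycles:u ./test
$ perf stat -e instructions:u,instructions:k ./test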

User-level application performance

#include <stdio.h>
#include <stdlib.h>

void func1() {
    int i = 0;
    while (i < 100000) {
        ++i;
    }
}

void func2() {
    int i = 0;
    while (i < 200000) {
        ++i;
    }
}

void func3() {
    int i = 0;
    for (; i < 1000; ++i) {
        func1();
        func2();
    }
}

int main() {
    func3();
    return 0;
}
$ gcc -o test test.c -g -O0

$ perf record ./test
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 2K of event 'cpu-clock'
# Event count (approx.): 723750000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..........................
#
66.22% test test [.] func2
33.75% test test [.] func1
0.03% test [kernel.kallsyms] [k] get_page_from_freelist


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
$ perf annotate -s func2
Percent | Source code & Disassembly of test
--------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 0000000000400509 <func2>:
: int i = 0;
: while (i < 100000) {
: ++i;
: }
: }
: void func2() {
0.00 : 400509: push %rbp
0.00 : 40050a: mov %rsp,%rbp
: int i = 0;
0.00 : 40050d: movl $0x0,-0x4(%rbp)
: while (i < 200000) {
0.05 : 400514: jmp 40051a <func2+0x11>
: ++i;
17.48 : 400516: addl $0x1,-0x4(%rbp)
: ++i;
: }
: }
: void func2() {
: int i = 0;
: while (i < 200000) {
1.88 : 40051a: cmpl $0x30d3f,-0x4(%rbp)
80.59 : 400521: jle 400516 <func2+0xd>
: ++i;
: }
: }
0.00 : 400523: pop %rbp
0.00 : 400524: retq

# Besides this, you can monitor system events while running your process, e.g. how many cache misses occur while executing test

$ perf stat -e instructions,cycles,branches,branch-misses,cache-misses ./test

#===============attach to a running process===================================
# show top CPU usage for a process from a system-wide view (including kernel symbols)
$ sysctl -w kernel.pax.softmode=1
$ sysctl -w kernel.kptr_restrict=0
$ perf top -p $pid

# show CPU counters for $pid until Ctrl+C
$ perf stat -p $pid

# count syscall events for $PID for 10 seconds
$ perf stat -e "syscalls:sys_*" -p $PID sleep 10

# count syscall events for $PID until Ctrl+C
$ perf stat -e "syscalls:sys_*" -p $PID

# record cpu-clock events for a given pid until Ctrl+C
$ perf record -p $pid

# record context-switch events for a given pid until Ctrl+C
$ perf record -e "cs" -p $pid

System-wide performance

# show functions (from different applications or the kernel) that consume the most CPU
$ perf top

# show all supported EVENTs that can be used by 'stat' or 'record'
$ perf list

# show detailed counters for all CPUs for 10s
$ perf stat -a -d sleep 10

# show CPU counters while executing the 'ls' command
$ perf stat ls

# count syscall events on all CPUs for 5 seconds
$ perf stat -e "syscalls:sys_*" -a sleep 5

# record all CPUs for 10s
# (-g: record call graphs / stacks)
$ perf record -a -g -- sleep 10
$ perf record -a -- sleep 10

$ perf record -e "cpu-clock" -a sleep 5
$ perf record -e "sched:sched_switch" -a sleep 5

# record until ctrl+c
$ perf record -e block:block_rq_insert -a

$ perf report -n --stdio -i perf.data
# (need kernel debug info)
$ perf annotate --stdio

# add a dynamic tracepoint at the entry of tcp_sendmsg

$ perf probe --add tcp_sendmsg
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_sendmsg -a sleep 10
perf stat -e probe:tcp_sendmsg -a sleep 10

$ perf record -e probe:tcp_sendmsg -aR sleep 10

# show currently probed events
$ perf probe -l

# delete the dynamic tracepoint
$ perf probe -d tcp_sendmsg
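# with kernel debuginfo installed, a probe can also capture function arguments,
# e.g. the size argument of tcp_sendmsg (a sketch; the exact syntax depends on your kernel)
$ perf probe --add 'tcp_sendmsg size'
$ perf record -e probe:tcp_sendmsg -a sleep 10
$ perf script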

# count context switches, cache misses, page faults, etc. while running 'sleep 5' (add -a to cover the whole system)
$ perf stat -B -e context-switches,cpu-migrations,cache-references,cache-misses,cycles,instructions,page-faults sleep 5
Performance counter stats for 'sleep 5':

1 context-switches
1 cpu-migrations
18,143 cache-references
957 cache-misses # 5.275 % of all cache refs
1,598,875 cycles
667,345 instructions # 0.42 insn per cycle
179 page-faults

5.002159188 seconds time elapsed


# default output for perf stat
$ perf stat -B sleep 5
Performance counter stats for 'sleep 5':

1.136437 task-clock (msec) # 0.000 CPUs utilized
1 context-switches # 0.880 K/sec
0 cpu-migrations # 0.000 K/sec
179 page-faults # 0.158 M/sec
1,612,448 cycles # 1.419 GHz
671,118 instructions # 0.42 insn per cycle
129,284 branches # 113.763 M/sec
8,996 branch-misses # 6.96% of all branches

5.002229617 seconds time elapsed

# check which functions trigger cache-miss or page-fault events on the system
$ perf record -e cache-misses sleep 5
$ perf record -e page-faults sleep 5
$ perf report

# check which functions trigger cache-miss or page-fault events for a given process
$ perf record -e cache-misses -p $pid

# perf stat with multiple events
$ perf stat -e intel_cqm/local_bytes/,intel_cqm/total_bytes/ -a -I 1000

FAQ

perf kvm stat vs perf stat

You can use the perf command with the kvm option to collect guest operating system statistics from the host.

In order to use perf kvm record on the host, you must have access to the guest's /proc/modules and /proc/kallsyms files; otherwise, perf report cannot resolve guest symbols correctly. These two files are not needed for perf kvm stat, since it does not use the symbol table.
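For example, the two files can be copied out of /proc inside the guest first (a sketch; the /tmp paths simply match the scp commands below):

# inside the guest
$ cp /proc/kallsyms /tmp/kallsyms
$ cp /proc/modules /tmp/modules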

# On host
################## perf kvm stat ##########################
$ ps -ef | grep 84857
root 84857 1 60 10:07 ? 01:02:09 /usr/libexec/qemu-kvm -daemonize -D /tmp/qemu_vm.log -machine pc,accel=kvm,usb=off,dump-guest-core=off -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/huge_1GB,share=no,size=34359738368,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-15,memdev=ram-node0 -m 32G -cpu host,host-cache-info=on,l3-cache=off -smp 16,maxcpus=16,sockets=1,cores=16,threads=1 -realtime mlock=off -no-user-config -drive file=./centos7.qcow2 -qmp unix:/var/run/qmp.sock,server,nowait -nodefaults -no-hpet -vga std -vnc 0.0.0.0:0 -serial pty -netdev tap,id=net1,ifname=vnet10,script=./qemu-ifup,downscript=./qemu-ifdown -device virtio-net,netdev=net1,mac=52:54:00:12:34:56 -device qemu-xhci -device usb-tablet
# check splitlock from qemu-kvm
$ perf kvm stat -e sq_misc.split_lock -p 84857 -a -I 1000
# time counts unit events
1.000627884 373,366 sq_misc.split_lock (100.00%)

################## perf kvm record ##########################
$ scp root@GuestMachine:/tmp/kallsyms guest-kallsyms
$ scp root@GuestMachine:/tmp/modules guest-modules
$ perf kvm --host --guest --guestkallsyms=guest-kallsyms \
--guestmodules=guest-modules record -a -o perf.data
$ perf kvm --host --guest --guestmodules=guest-modules report -i perf.data.kvm \
--force > analyze

Ref