linux-performance-perf

Overview

Perf can collect many kinds of data: cache misses, context switches, per-thread and per-CPU statistics, and more, but it needs kernel support. Perf is typically used for system performance debugging, while gperftools is used for application performance debugging (perf can also be used for application performance debugging).
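A minimal illustration of the kind of data it collects (the event names and the 5-second window are arbitrary examples):

# count cache misses and context switches system-wide for 5 seconds
$ perf stat -a -e cache-misses,context-switches sleep 5
# the same counters reported per CPU instead of aggregated
$ perf stat -a -A -e cache-misses,context-switches sleep 5
# or attached to a single running process
$ perf stat -p $pid -e cache-misses,context-switches sleep 5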


Usage

#perf
usage: perf [--version] [--help] COMMAND [ARGS]

The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
diff Read two perf.data files and display the differential profile
list List all symbolic event types
probe Define new dynamic tracepoints
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
top System profiling tool.

# for details about each command
$ perf annotate --help
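As a small illustration of one of the less obvious commands, 'perf diff' compares two recordings; by default it diffs perf.data.old against perf.data (a sketch; ./your_app is a placeholder workload):

# record a baseline, record a second run, then compare the two profiles
$ perf record -o perf.data.old ./your_app
$ perf record -o perf.data ./your_app
$ perf diff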

Summary: perf has two main ACTIONS for profiling, 'stat' and 'record'.
'stat' uses counters, while 'record' uses sampling (N samples per second)
and can show call graphs (stacks),
collecting information about the system, CPUs, threads, and call paths.
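A minimal side-by-side sketch of the two actions (./your_app is a placeholder workload):

# 'stat': count events over the whole run and print totals at the end
$ perf stat ./your_app
# 'record': sample periodically, keep stacks with -g, then browse the profile
$ perf record -g ./your_app
$ perf report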

EVENTs are the core of what perf can monitor. EVENTs include:
- Hardware events (cache misses, LLC loads, etc.)
- Software events (page faults, context switches, etc.)
- Tracepoints predefined in the kernel (require a kernel compiled with debugfs support)
- Dynamic events (created with 'perf probe --add')
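Each category can be listed separately with 'perf list' (covered in more detail in the Usage section below):

$ perf list hw          # hardware events
$ perf list sw          # software events
$ perf list tracepoint  # kernel tracepoints
$ perf probe -l         # dynamic events currently defined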

Event overview

# run a command under perf stat
$ perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
$ perf stat [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]

# OR attach to a running process
$ perf stat [-e <EVENT> | --event=EVENT] [-a] -p $pid

# important options
#
# -i (--no-inherit): by default (without -i) perf stat counts the target process and all
# processes forked (and threads cloned) from it after counting starts. With `perf stat ./program`
# it also profiles all child processes, and with `perf stat -p $PID` (attaching to an already
# running $PID) it profiles the target process plus all children started after attaching.
# Use -i to disable counting of children.
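# for example (a sketch; ./program is a placeholder workload): compare the default,
# which also counts children forked after start, with -i / --no-inherit, which counts only the parent
$ perf stat -e context-switches ./program
$ perf stat -i -e context-switches ./program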

# ================= remote cache and remote dram access for multiple numa =======================================
# offcore_response.all_data_rd.llc_miss.remote_dram
# [Counts all demand & prefetch data reads that miss the L3 and the data is returned from remote dram]
# offcore_response.all_data_rd.llc_miss.remote_hit_forward
# [Counts all demand & prefetch data reads that miss the L3 and clean or shared data is transferred from remote cache]
# offcore_response.all_data_rd.llc_miss.remote_hitm
# [Counts all demand & prefetch data reads that miss the L3 and the modified data is transferred from remote cache]
# offcore_response.all_reads.llc_miss.remote_dram
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the data is returned from remote dram]
# offcore_response.all_reads.llc_miss.remote_hit_forward
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and clean or shared data is transferred from remote cache]
# offcore_response.all_reads.llc_miss.remote_hitm
# [Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the modified data is transferred from remote cache]
# offcore_response.demand_rfo.llc_miss.remote_hitm
# [Counts all demand data writes (RFOs) that miss the L3 and the modified data is transferred from remote cache]
# ================= remote cache and remote dram access for multiple numa =======================================

# splitlock
$ perf stat -e cpu/event=0xf4,umask=0x10/ -a -I 1000
# equivalent to the symbolic event name:
$ perf stat -e sq_misc.split_lock -a -I 1000

# run an application that performs atomic accesses straddling a cache-line boundary, hence generating split locks
$ perf stat -e sq_misc.split_lock -a -I 100 ./sp
# time counts unit events
0.108722215 36,456 sq_misc.split_lock
0.214217317 37,401 sq_misc.split_lock
0.318182137 37,700 sq_misc.split_lock

# total last level cache
$ lscpu | grep 'L3 cache'
L3 cache: 40960K

# Intel CQM needs hardware and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RDT_A' /boot/config-$(uname -r)

$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 -p $pid
$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 -p $pid

# last level cache occupancy
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 sleep 100
# time counts unit events
1.004945555 35,061,760 Bytes intel_cqm/llc_occupancy/
2.007805514 46,989,312 Bytes intel_cqm/llc_occupancy/
$ perf stat -a -e intel_cqm/llc_occupancy/ -I 1000 ./sp
# time counts unit events
1.011122701 31,391,744 Bytes intel_cqm/llc_occupancy/
2.018970467 42,401,792 Bytes intel_cqm/llc_occupancy/

$ perf stat -a -e intel_cqm/total_bytes/ -I 1000 sleep 30
# time counts unit events
1.007463857 0.00 MB intel_cqm/total_bytes/
2.011665924 0.00 MB intel_cqm/total_bytes/

$ perf stat -a -e intel_cqm/local_bytes/ -I 1000 sleep 30
# time counts unit events
1.008000842 0.00 MB intel_cqm/local_bytes/
2.012502066 0.00 MB intel_cqm/local_bytes/

# -----For memory bandwidth on newer kernels (>=4.14), use resctrl as follows----------
# this needs CPU and kernel support
$ cat /proc/cpuinfo | grep rdt_a
$ grep 'RESCTRL' /boot/config-$(uname -r)

# Mount the resctrl filesystem.
$ mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# after mount check
$ ls -p /sys/fs/resctrl/
cpus cpus_list info/ mode mon_data/ mon_groups/ schemata size tasks
# ======================host level=====================
# Print the LLC occupancy (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
# Print the local memory bandwidth counter (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
# Print the total memory bandwidth counter (bytes) on the first socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes

# Print the LLC occupancy (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/llc_occupancy
# Print the local memory bandwidth counter (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
# Print the total memory bandwidth counter (bytes) on the second socket.
$ cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
# ======================host level=====================

# ====================== group level=====================
# group level means collecting MBM data for all tasks in this group
$ mkdir /sys/fs/resctrl/mon_groups/test
$ ls /sys/fs/resctrl/mon_groups/test
cpus cpus_list mon_data/ tasks
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00
llc_occupancy mbm_local_bytes mbm_total_bytes
# mbm data of this group
$ ls /sys/fs/resctrl/mon_groups/test/mon_data/
mon_L3_00/ mon_L3_01/

# then add a task (process) to this group
$ echo 245543 > /sys/fs/resctrl/mon_groups/test/tasks
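# the per-group counters can then be read just like the host-level ones, e.g.
$ cat /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00/mbm_total_bytes
$ cat /sys/fs/resctrl/mon_groups/test/mon_data/mon_L3_00/llc_occupancy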
# ====================== group level=====================
# -----For memory bandwith on new kernel(>=4.14), use this way----------

Usage

# command help
# List all symbolic event types; events also have another format, a raw numeric code
$ perf list [hw|sw|cache|tracepoint|pmu|event_glob]
- 'hw' or 'hardware' to list hardware events such as cache-misses, etc.
- 'sw' or 'software' to list software events such as context switches, etc.
- 'cache' or 'hwcache' to list hardware cache events such as L1-dcache-loads, etc.
- 'tracepoint' to list all tracepoint events, alternatively use subsys_glob:event_glob to filter by tracepoint subsystems such as sched, block, etc
- 'pmu' to print the kernel supplied PMU events.
- As a last resort, it will do a substring search in all event names

# list the available PMUs
$ ls /sys/devices/*/events/
/sys/devices/cpu/events/:
branch-instructions bus-cycles cache-references cycles-ct el-abort el-commit el-start mem-loads ref-cycles tx-capacity tx-conflict
branch-misses cache-misses cpu-cycles cycles-t el-capacity el-conflict instructions mem-stores tx-abort tx-commit tx-start

/sys/devices/intel_cqm/events/:
llc_occupancy llc_occupancy.scale llc_occupancy.unit local_bytes.per-pkg local_bytes.unit total_bytes.per-pkg total_bytes.unit
llc_occupancy.per-pkg llc_occupancy.snapshot local_bytes local_bytes.scale total_bytes total_bytes.scale

/sys/devices/power/events/:
energy-cores energy-cores.scale energy-cores.unit energy-gpu energy-gpu.scale energy-gpu.unit energy-pkg energy-pkg.scale energy-pkg.unit energy-ram energy-ram.scale energy-ram.unit

/sys/devices/uncore_imc_0/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_1/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_4/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

/sys/devices/uncore_imc_5/events/:
cas_count_read cas_count_read.scale cas_count_read.unit cas_count_write cas_count_write.scale cas_count_write.unit clockticks

# check the raw encoding of a PMU event
$ cat /sys/devices/cpu/events/cache-misses
event=0x2e,umask=0x41

# L1 cache
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores sleep 1
# L3 cache
$ perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches sleep 1

$ perf stat -e cache-misses sleep 1
# equivalent to:
$ perf stat -e cpu/event=0x2e,umask=0x41/ sleep 1
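# the raw form should also be equivalent (for Intel core events the raw code is roughly
# umask<<8 | event, so event=0x2e,umask=0x41 -> r412e); verify the encoding on your CPU
$ perf stat -e r412e sleep 1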

perf for application performance

Perf can also examine application performance, but it does so from the system's point of view of that process: it collects events and statistics for the process, and its call graphs include kernel symbols and kernel functions as well, unlike gperftools, which profiles user functions only.

Suggestion
If you want to debug a performance issue in application code, gperftools is a good choice; if you want to see the whole path of the application, both user level and kernel level, use perf.
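If you only care about user-space samples while still using perf, the event privilege modifiers can restrict what is counted (a sketch; ./test is the sample program built below):

# sample user-space cycles only (:u), or compare user vs kernel instructions (:u / :k)
$ perf record -e cycles:u ./test
$ perf stat -e instructions:u,instructions:k ./test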

User-level application performance

#include <stdio.h>
#include <stdlib.h>

void func1() {
    int i = 0;
    while (i < 100000) {
        ++i;
    }
}

void func2() {
    int i = 0;
    while (i < 200000) {
        ++i;
    }
}

void func3() {
    int i = 0;
    for (; i < 1000; ++i) {
        func1();
        func2();
    }
}

int main() {
    func3();
    return 0;
}
$ gcc -o test test.c -g -O0

$ perf record ./test
$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 2K of event 'cpu-clock'
# Event count (approx.): 723750000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..........................
#
66.22% test test [.] func2
33.75% test test [.] func1
0.03% test [kernel.kallsyms] [k] get_page_from_freelist


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
$ perf annotate -s func2
Percent | Source code & Disassembly of test
--------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 0000000000400509 <func2>:
: int i = 0;
: while (i < 100000) {
: ++i;
: }
: }
: void func2() {
0.00 : 400509: push %rbp
0.00 : 40050a: mov %rsp,%rbp
: int i = 0;
0.00 : 40050d: movl $0x0,-0x4(%rbp)
: while (i < 200000) {
0.05 : 400514: jmp 40051a <func2+0x11>
: ++i;
17.48 : 400516: addl $0x1,-0x4(%rbp)
: ++i;
: }
: }
: void func2() {
: int i = 0;
: while (i < 200000) {
1.88 : 40051a: cmpl $0x30d3f,-0x4(%rbp)
80.59 : 400521: jle 400516 <func2+0xd>
: ++i;
: }
: }
0.00 : 400523: pop %rbp
0.00 : 400524: retq

# Besides this, you can monitor system events while running your process, e.g. how many cache misses occur while executing test

$ perf stat -e instructions,cycles,branches,branch-misses,cache-misses ./test

#===============attach to a running process===================================
# show top CPU usage for a process from a system-wide view (including kernel symbols)
$ sysctl -w kernel.pax.softmode=1
$ sysctl -w kernel.kptr_restrict=0
$ perf top -p $pid

# show CPU counters for $pid until Ctrl+C
$ perf stat -p $pid

# count syscall events for $PID for 10 seconds
$ perf stat -e "syscalls:sys_*" -p $PID sleep 10

# count syscall events for $PID until Ctrl+C
$ perf stat -e "syscalls:sys_*" -p $PID

# record cpu-clock events for a given pid until Ctrl+C
$ perf record -p $pid

# record context-switch events for a given pid until Ctrl+C
$ perf record -e "cs" -p $pid

System-wide performance

# show functions (from different applications or the kernel) that consume the most CPU
$ perf top

# show all supported EVENTs that can be used by 'stat' or 'record'
$ perf list

# show detailed counters for all CPUs for 10s
$ perf stat -a -d sleep 10

# show CPU counters while executing the 'ls' command
$ perf stat ls

# count syscall events on all CPUs for 5 seconds
$ perf stat -e "syscalls:sys_*" -a sleep 5

# record all CPUs for 10s
# (-g: record call graphs / stacks)
$ perf record -a -g -- sleep 10
$ perf record -a -- sleep 10

$ perf record -e "cpu-clock" -a sleep 5
$ perf record -e "sched:sched_switch" -a sleep 5

# record until ctrl+c
$ perf record -e block:block_rq_insert -a

$ perf report -n --stdio -i perf.data
# (need kernel debug info)
$ perf annotate --stdio

# add a dynamic tracepoint at the entry of tcp_sendmsg

$ perf probe --add tcp_sendmsg
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_sendmsg -a sleep 10
perf stat -e probe:tcp_sendmsg -a sleep 10

$ perf record -e probe:tcp_sendmsg -aR sleep 10

# show currently probed events
$ perf probe -l

# delete the dynamic tracepoint
$ perf probe -d tcp_sendmsg
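# with kernel debuginfo installed, a probe can also capture function arguments,
# e.g. the size argument of tcp_sendmsg (a sketch; the exact syntax depends on your kernel)
$ perf probe --add 'tcp_sendmsg size'
$ perf record -e probe:tcp_sendmsg -a sleep 10
$ perf script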

# count context switches, cache misses, page faults, etc. while running 'sleep 5' (add -a to cover the whole system)
$ perf stat -B -e context-switches,cpu-migrations,cache-references,cache-misses,cycles,instructions,page-faults sleep 5
Performance counter stats for 'sleep 5':

1 context-switches
1 cpu-migrations
18,143 cache-references
957 cache-misses # 5.275 % of all cache refs
1,598,875 cycles
667,345 instructions # 0.42 insn per cycle
179 page-faults

5.002159188 seconds time elapsed


# default output for perf stat
$ perf stat -B sleep 5
Performance counter stats for 'sleep 5':

1.136437 task-clock (msec) # 0.000 CPUs utilized
1 context-switches # 0.880 K/sec
0 cpu-migrations # 0.000 K/sec
179 page-faults # 0.158 M/sec
1,612,448 cycles # 1.419 GHz
671,118 instructions # 0.42 insn per cycle
129,284 branches # 113.763 M/sec
8,996 branch-misses # 6.96% of all branches

5.002229617 seconds time elapsed

# check which functions trigger cache-miss or page-fault events on the system
$ perf record -e cache-misses sleep 5
$ perf record -e page-faults sleep 5
$ perf report

# check which functions trigger cache-miss or page-fault events for a given process
$ perf record -e cache-misses -p $pid

# perf stat with multiple events
$ perf stat -e intel_cqm/local_bytes/,intel_cqm/total_bytes/ -a -I 1000

FAQ

perf kvm stat vs perf stat

You can use the perf command with the kvm option to collect guest operating system statistics from the host.

In order to use perf kvm record on the host, you must have access to the guest's /proc/modules and /proc/kallsyms files; otherwise, perf report cannot resolve guest symbols correctly. These two files are not needed for perf kvm stat, since it does not use the symbol table.
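For example, the two files can be copied out of /proc inside the guest first (a sketch; the /tmp paths simply match the scp commands below):

# inside the guest
$ cp /proc/kallsyms /tmp/kallsyms
$ cp /proc/modules /tmp/modules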

# On host
################## perf kvm stat ##########################
$ ps -ef | grep 84857
root 84857 1 60 10:07 ? 01:02:09 /usr/libexec/qemu-kvm -daemonize -D /tmp/qemu_vm.log -machine pc,accel=kvm,usb=off,dump-guest-core=off -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/mnt/huge_1GB,share=no,size=34359738368,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-15,memdev=ram-node0 -m 32G -cpu host,host-cache-info=on,l3-cache=off -smp 16,maxcpus=16,sockets=1,cores=16,threads=1 -realtime mlock=off -no-user-config -drive file=./centos7.qcow2 -qmp unix:/var/run/qmp.sock,server,nowait -nodefaults -no-hpet -vga std -vnc 0.0.0.0:0 -serial pty -netdev tap,id=net1,ifname=vnet10,script=./qemu-ifup,downscript=./qemu-ifdown -device virtio-net,netdev=net1,mac=52:54:00:12:34:56 -device qemu-xhci -device usb-tablet
# check splitlock from qemu-kvm
$ perf kvm stat -e sq_misc.split_lock -p 84857 -a -I 1000
# time counts unit events
1.000627884 373,366 sq_misc.split_lock (100.00%)

################## perf kvm record ##########################
$ scp root@GuestMachine:/tmp/kallsyms guest-kallsyms
$ scp root@GuestMachine:/tmp/modules guest-modules
$ perf kvm --host --guest --guestkallsyms=guest-kallsyms \
--guestmodules=guest-modules record -a -o perf.data
$ perf kvm --host --guest --guestmodules=guest-modules report -i perf.data.kvm \
--force > analyze

Ref