linux-performance-systemtap

Introduction

SystemTap provides the infrastructure to monitor a running Linux kernel and running applications for detailed analysis. This can help administrators and developers identify the underlying cause of a bug or performance problem. Without SystemTap, gathering this kind of kernel information means writing a kernel module, compiling it, and loading it yourself. SystemTap is designed to eliminate that work: you write a SystemTap script, and the framework does everything else for you (it uses kprobes underneath).

In short: add hooks at event points (function entry, function return, etc.) in a running application or kernel, and print or check something in each hook.

How it works

  1. First, SystemTap checks the script against the existing tapset library (normally in /usr/share/systemtap/tapset/) for any tapsets used. SystemTap then substitutes any located tapsets with their corresponding definitions in the tapset library.
  2. SystemTap then translates the script to C and runs the system C compiler to create a kernel module from it. The tool that performs this step (stap) is contained in the systemtap package.
  3. SystemTap loads the module, then enables all the probes (events and handlers) in the script. The staprun tool in the systemtap-runtime package provides this functionality.
  4. As the events occur, their corresponding handlers are executed.
  5. Once the SystemTap session is terminated, the probes are disabled, and the kernel module is unloaded.
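
The compile and load steps above can also be run separately, which makes the split between stap and staprun visible. The sketch below stops stap after pass 4 with -p4, keeps the module under a fixed name with -m, and then loads it with staprun. The script body and the names read_once.stp and read_once are illustrative choices, not part of SystemTap. It assumes SystemTap and kernel debuginfo are installed and that you have enough privileges (normally root), so the run is skipped when stap is missing.

```shell
# Save a one-probe script; vfs.read is a tapset alias.
cat > read_once.stp <<'EOF'
probe vfs.read {
    printf("read performed by %s\n", execname())
    exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    # Passes 1-4: parse, elaborate, translate to C, compile into read_once.ko.
    # Pass 5 is done by hand: staprun loads the module and runs the probes.
    stap -p4 -m read_once read_once.stp &&
        staprun read_once.ko ||
        echo "stap/staprun failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```

Running plain `stap read_once.stp` performs all five passes in one go; the split form is mainly useful when the module is built on one machine and loaded on another.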

Setup to use systemtap

$ yum install systemtap systemtap-runtime
# install kernel debug info, refer to https://sourceware.org/systemtap/SystemTap_Beginners_Guide/using-systemtap.html
$ stap-prep
# check if systemtap works or not(script from command line)
$ stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
Pass 1: parsed user script and 487 library scripts using 299532virt/96108res/3516shr/93332data kb, in 940usr/150sys/1278real ms.
Pass 2: analyzed script: 1 probe, 1 function, 7 embeds, 0 globals using 468708virt/260648res/4900shr/262508data kb, in 2750usr/1330sys/4835real ms.
Pass 3: translated to C into "/tmp/stapYzHqgZ/stap_cde75c591f0c1e1bb6070ad11276b42b_2771_src.c" using 468708virt/260908res/5160shr/262508data kb, in 10usr/50sys/69real ms.    <-- translate the script to C
Pass 4: compiled C into "stap_cde75c591f0c1e1bb6070ad11276b42b_2771.ko" in 9250usr/2530sys/11991real ms.    <-- compile the C into a kernel module
Pass 5: starting run.    <-- load the module and run it

# command to run systemtap

# monitor events for any process(script from a file)
$ stap -vv -F xx.stp -o output.log

# OR

# monitor events for one process only
# -x sets the value returned by the SystemTap function target() to the given process ID;
# the script itself must still filter on it, e.g. if (pid() == target())
$ stap -vv -F xx.stp -o output.log -x process_id
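
Because -x only sets target(), the filtering has to be written in the script. The following is a minimal sketch of that pattern. The file name target_reads.stp and the TARGET_PID variable are hypothetical, and TARGET_PID defaults to the current shell only so that the command is complete; point it at the process you actually want to watch.

```shell
cat > target_reads.stp <<'EOF'
probe vfs.read {
    # Only report reads issued by the process passed with -x.
    if (pid() == target())
        printf("%s (pid %d) called read\n", execname(), pid())
}

# Stop after 10 seconds instead of waiting for Ctrl-C.
probe timer.s(10) {
    exit()
}
EOF

# PID to watch; override it in the environment before running.
TARGET_PID="${TARGET_PID:-$$}"

if command -v stap >/dev/null 2>&1; then
    stap -x "$TARGET_PID" target_reads.stp ||
        echo "stap failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```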

Probe

Probe = event + its handler

The essential idea behind a SystemTap script is to name events, and to give them handlers. When SystemTap runs the script, SystemTap monitors for the event; once the event occurs, the Linux kernel then runs the handler as a quick sub-routine, then resumes.

There are several kinds of events: entering/exiting a function, timer expiration, session termination, etc. A handler is a series of script language statements that specify the work to be done whenever the event occurs. This work normally includes extracting data from the event context, storing it in internal variables, and printing results.

FORMAT
probe PROBEPOINT [, PROBEPOINT] { [STMT ...] }

PROBEPOINT supports wildcard (*) matching
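
As a sketch of the format above, the script below attaches one handler to two comma-separated probe points, and `stap -l` is used to see what a wildcard pattern expands to without running any probe. The functions vfs_read and vfs_write and the file name multi_point.stp are example choices; whether they can be probed depends on your kernel and its debuginfo.

```shell
cat > multi_point.stp <<'EOF'
# One handler shared by two probe points.
probe kernel.function("vfs_read"), kernel.function("vfs_write") {
    printf("%s -> %s\n", execname(), ppfunc())
}

probe timer.s(5) {
    exit()
}
EOF

# A wildcard pattern; the * matches any function whose name starts with vfs_.
pattern='kernel.function("vfs_*")'

if command -v stap >/dev/null 2>&1; then
    # -l only lists the probe points the pattern matches.
    stap -l "$pattern" || echo "listing failed (kernel debuginfo missing?)" >&2
    stap multi_point.stp || echo "stap failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```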

alias

New probe points may be defined using aliases. A probe point alias looks like a probe definition, but instead of activating a probe at the given point, it defines a new probe point name as an alias for an existing one. New probe aliases may refer to one or more existing probe aliases, and multiple aliases may share the same underlying probe points.

FORMAT
probe <alias> = <probepoint> { <prologue_stmts> }

probe socket.sendmsg = kernel.function ("sock_sendmsg") { ... }
probe socket.sendmsg {...}
probe socket.sendmsg.return {...}

is equivalent to

probe kernel.function("sock_sendmsg") {...}
probe kernel.function("sock_sendmsg").return {...}
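
The prologue statements in an alias run before the handler of every probe that uses the alias, and variables set there are visible in that handler. A minimal sketch follows; the alias name my_sendmsg, the variable sender, and the file name alias_demo.stp are made up for illustration.

```shell
cat > alias_demo.stp <<'EOF'
# Alias definition: the block is the prologue.
probe my_sendmsg = kernel.function("sock_sendmsg") {
    sender = execname()
}

# Use of the alias: sender was filled in by the prologue.
probe my_sendmsg {
    printf("%s called sock_sendmsg\n", sender)
}

probe timer.s(5) {
    exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap alias_demo.stp || echo "stap failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```

This is the same mechanism the tapset library uses to expose convenience variables on aliases such as vfs.read.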

function

SystemTap scripts may define subroutines to factor out common work. Functions may take any number of scalar arguments, and must return a single scalar value. Scalars in this context are integers or strings.

FORMAT

function <name>[:<type>] ( <arg1>[:<type>], ... ) { <stmts> }

function thisfn (arg1, arg2) {
    return arg1 + arg2
}

function thatfn:string(arg1:long, arg2) {
    return sprintf("%d%s", arg1, arg2)
}
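
To show a function being called from a handler, here is a small sketch that needs no kernel debuginfo because it only uses the begin probe, which fires once when the session starts. The function name describe and the file name func_demo.stp are hypothetical.

```shell
cat > func_demo.stp <<'EOF'
# Typed function: returns a string built from a string and an integer.
function describe:string(name:string, id:long) {
    return sprintf("%s(%d)", name, id)
}

probe begin {
    # In a begin probe, execname() and pid() describe the stap runtime process.
    println(describe(execname(), pid()))
    exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap func_demo.stp || echo "stap failed (privileges missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```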

Kernel Probe

ProbePoint

# these two are actually aliases defined in the tapset library
syscall.system_call
vfs.file_operation

kernel.function("foo")
kernel.statement("func@file:linenumber")
kernel.function("foo").return
module("ext3").function("ext3_*")
module("modname").statement("func@file:linenumber")
kernel.function("*@net/socket.c")
timer.ms(5000)

Probe examples

probe kernel.function("sys_mkdir").call { log ("enter") }
probe kernel.function("sys_mkdir").return { log ("exit") }
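
A common use of a .call and .return pair is measuring how long a function takes. The sketch below stores the entry time per thread in a global array and reports the difference on return. It probes vfs_read rather than sys_mkdir only because reads happen constantly; the function choice, the array name entry_us, and the file name latency.stp are assumptions for the example.

```shell
cat > latency.stp <<'EOF'
global entry_us

probe kernel.function("vfs_read").call {
    # Remember when this thread entered the function.
    entry_us[tid()] = gettimeofday_us()
}

probe kernel.function("vfs_read").return {
    started = entry_us[tid()]
    if (started) {
        printf("%s: vfs_read took %d us\n", execname(), gettimeofday_us() - started)
        delete entry_us[tid()]
    }
}

probe timer.s(5) {
    exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap latency.stp || echo "stap failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```

Indexing by tid() keeps concurrent calls on different threads from overwriting each other's start time.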

NOTE
The kernel also has prebuilt probe markers. This family of probe points connects to static probe markers inserted into the kernel or a module. These markers are special macro calls in the kernel that make probing faster and more reliable than with DWARF-based probes. DWARF debugging information is not required to use probe markers.

User Probe

ProbePoint

# PATH can be an executable or a shared library
process.begin
process("PATH").begin
process.thread.begin
process("PATH").thread.begin
process.end
process("PATH").end
process.thread.end
process("PATH").thread.end

process.syscall
process("PATH").syscall
process.syscall.return
process("PATH").syscall.return

process("PATH").mark("LABEL")

process("PATH").function("NAME")
process("PATH").statement("*@FILE.c:123")
process("PATH").function("*").return

Probe examples

probe process("/usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so").function("qemuMonitorSend") {}
probe process("/usr/sbin/libvirtd").function("main") {}

static probing
You can probe symbolic static instrumentation compiled into programs and shared libraries with the following syntax:
process("PATH").mark("LABEL")

Markers are added to the program with the STAP_PROBE macros from <sys/sdt.h>. Use STAP_PROBE1 for probes with one argument, STAP_PROBE2 for probes with two arguments, and so on. The arguments of the probe are available in the context variables $arg1, $arg2, and so on.

As an alternative to the STAP_PROBE macros, you can use the dtrace script to create custom macros. The sdt.h file also provides dtrace compatible markers through DTRACE_PROBE and an associated python dtrace script.

NOTE: a static probe can sit at any point in the code because the programmer places it, so it can pass any arguments, such as local variables. Dynamic probes are normally placed at function entry and return, and need debug information to reach anything else.

tapset

Tapset scripts are libraries of probe aliases and auxiliary functions; see the tapset function reference for the full list.

location: /usr/share/systemtap/tapset/

Frequently used

tid()               The id of the current thread.
pid()               The process (task group) id of the current thread.
uid()               The id of the current user.
execname()          The name of the current process.
cpu()               The current cpu number.
gettimeofday_s()    Number of seconds since epoch.
get_cycles()        Snapshot of hardware cycle counter.
pp()                A string describing the probe point being currently handled.
ppfunc()            If known, the function name in which this probe was placed.
$$vars              If available, a pretty-printed listing of all local variables in scope.
print_backtrace()   If possible, print a kernel backtrace.
print_ubacktrace()  If possible, print a user-space backtrace.
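
Several of these functions are often combined in a single printf to tag each event with who triggered it and where. The sketch below does that for the first few vfs.read events and then exits. The counter name shown, the limit of five, and the file name context.stp are arbitrary choices for the example; a few extra lines may be printed before the session actually stops.

```shell
cat > context.stp <<'EOF'
global shown

probe vfs.read {
    printf("%d %s pid=%d tid=%d uid=%d cpu=%d at %s\n",
           gettimeofday_s(), execname(), pid(), tid(), uid(), cpu(), pp())
    # Ask the session to end once about five events have been printed.
    if (++shown >= 5)
        exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap context.stp || echo "stap failed (privileges or debuginfo missing?)" >&2
else
    echo "stap not found; install systemtap first" >&2
fi
```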

Example of User application with systemtap

The SystemTap script language is similar to C; see the language reference for script syntax, flow control, and arrays.

To use SystemTap on a user application when dtrace markers are not compiled in, make sure the application is built with symbols (-g), or install its debug symbols separately.

dtrace disabled

The symbol table is required to find the functions named in the .stp script.

test.c

#include <stdio.h>
#include <stdlib.h>

void hello(char *name) {
    printf("hello %s\n", name);
}

int main() {
    hello("Tom");

    return 0;
}
// gcc -o test test.c -g

test.stp

probe process("./test").begin {
    printf("start to probe\n")
}

probe process("./test").function("hello") {
    // $name is a pointer into the target process, so copy the string out first
    printf("%s\n", user_string($name))
}

probe process("./test").end {
    printf("probe end\n")
}
$ stap test.stp
# on another terminal
$ ./test

dtrace enabled

There is no need to build with debug info if you only use markers, because the marker's address is stored when the code is compiled. If you also want to probe by function name, the symbol table is still required.

test.c with marker (dtrace probe)

#include <stdio.h>
#include <stdlib.h>
#include <sys/sdt.h>

void hello(char *name) {
    printf("hello %s\n", name);
}

int main() {
    int age = 12;
    char *name = "Tom";

    DTRACE_PROBE2(test, hello_marker, age, name);
    hello(name);

    return 0;
}
// gcc -o test test.c

test.stp

probe process("./test").begin {
    printf("start to probe\n")
}

probe process("./test").mark("hello_marker") {
    // arg1 is int, not pointer, can access it directly
    printf("age: %d name: %s\n", $arg1, user_string($arg2))
}

probe process("./test").end {
    printf("probe end\n")
}
$ stap test.stp
$ ./test
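
Before writing a marker probe it can help to check which markers the binary actually contains. A minimal sketch, assuming the ./test binary built above is in the current directory: `stap -L` lists the probe points matching a pattern together with the variables available at each one. The variable name probe_spec is only for illustration.

```shell
# Pattern matching every static marker compiled into ./test.
probe_spec='process("./test").mark("*")'

if command -v stap >/dev/null 2>&1 && [ -x ./test ]; then
    # Lists matching markers and their $argN variables; nothing is probed.
    stap -L "$probe_spec" || echo "listing failed" >&2
else
    echo "stap or ./test not found; build test.c and install systemtap first" >&2
fi
```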
