debugging application

Posted on 2019-09-27 Edited on 2025-05-12 In linux , command

debug application

Here is a list of frequently used commands, plus some tips that may be useful when debugging an application.

Tested only on CentOS 7.

show dynamic library used by an application

Way 1 uses ldd and Way 2 uses readelf; both are shown below in that order.

$ ldd app_binary
output:
linux-vdso.so.1 => (0x00007fff76bcd000)
libopenvswitch.so.1 => /usr/lib/libopenvswitch.so.1 (0x00007f19c83c2000)
libboost_system.so.1.54.0 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x00007f19c81be000)
…

$ readelf -d app_binary | grep NEEDED
output:
0x0000000000000001 (NEEDED) Shared library: [libopenvswitch.so.1]
0x0000000000000001 (NEEDED) Shared library: [libboost_system.so.1.54.0]
…

then find the library path
$ locate libopenvswitch.so.1
/usr/lib/debug/usr/lib/libopenvswitch.so.1.0.0
/usr/lib/libopenvswitch.so.1
/usr/lib/libopenvswitch.so.1.0.0

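how an application searches for libraries

The dynamic loader searches these locations in order:

directories in the executable's DT_RPATH, but only if it has no DT_RUNPATH (show either one with $ chrpath -l app)
directories listed in the LD_LIBRARY_PATH environment variable (DYLD_LIBRARY_PATH on OSX)
directories in the executable's DT_RUNPATH
directories on the system search path, which (on Linux at least) consists of the entries in /etc/ld.so.conf (cached in /etc/ld.so.cache) plus /lib and /usr/lib.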

create and use a static library

create static lib from object files
$ ar cru libopenvswitch.a stream-ssl.o stream.o
later on you may want to add a new object in the lib
$ ar crs libopenvswitch.a foo.o

use it in your application
$ gcc -o app main.c /xx/libopenvswitch.a
OR
$ gcc -o app main.c -lopenvswitch -L/path/to/openvswitch

The order of libraries on the gcc command line matters: a library must come before the libraries it depends on. For example, if A.a depends on B.a, the command should be $ gcc -o app main.c A.a B.a

show all objects in static library
$ ar -t libopenvswitch.a
extract object file from static library
$ ar -xv libopenvswitch.a stream-ssl.o

create and use a dynamic library

$ gcc -fPIC -g -c foo.c
$ gcc -shared -o libfoo.so foo.o

Use it later like this
$ gcc -o app main.c libfoo.so
Or
$ gcc -o app main.c -lfoo -L/path/to/foo

show symbols in dynamic and static libraries

$ readelf -s /usr/lib/x86_64-linux-gnu/libssl.so
$ readelf -Ws /usr/lib/x86_64-linux-gnu/libssl.a
$ readelf -Ws stream-ssl.o
-W (--wide) prints long symbol names in full instead of truncating them

check process thread

show all processes(not thread) of a given program
$ pidof program_name

-s Single shot - this instructs the program to return only one pid
$ pidof -s program_name

-x Scripts too - this causes the program to also return process id’s of shells running the named scripts
$ pidof -x shell_scripts

-o omit pid Tells pidof to omit processes with that process id.
The special pid %PPID can be used to name the parent process of the pidof program, in other words the calling shell or shell script.
example:

if pidof -o %PPID -x "abc.sh" >/dev/null; then
    echo "Process already running"
fi

show threads in a process
$ ps -Lf $pid
Or see from /proc
$ ls /proc/$pid/task
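
The /proc lookup above can also be done from a program. Below is a minimal sketch (Linux only) that starts a few idle threads and then lists the thread IDs of the current process from /proc/PID/task. The helper name thread_ids is made up for this example.

```python
import os
import threading


def thread_ids(pid):
    """Return the thread IDs of a process, as listed under /proc/PID/task."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))


# Start a few idle worker threads so that there is something to list.
stop = threading.Event()
workers = [threading.Thread(target=stop.wait) for _ in range(3)]
for worker in workers:
    worker.start()

tids = thread_ids(os.getpid())
print(f"pid {os.getpid()} has {len(tids)} threads: {tids}")

# Let the workers finish.
stop.set()
for worker in workers:
    worker.join()
```

The main thread's ID is the same as the pid, which is why the pid itself shows up in the list.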

do everything in memory

If the machine has plenty of memory, you can do all the work on a memory-backed filesystem such as /dev/shm. It is fast, but note that these files
are lost after a reboot.

For example, compile the kernel in memory
$ tar xzvf linux.tar.gz -C /dev/shm
$ cd /dev/shm/linux
$ cp /boot/config-4.3.3-4.3.y.20151215.ol7.x86_64 .config
$ make oldconfig && make -j32
$ make -j32 modules_install
$ make install

debug dynamic library loading when running a program

$ LD_DEBUG=help ls
Valid options for the LD_DEBUG environment variable are:
libs        display library search paths
reloc       display relocation processing
files       display progress for input file
symbols     display symbol table processing
bindings    display information about symbol binding
versions    display version dependencies
scopes      display scope information
all         all previous options combined
statistics  display relocation statistics
unused      determined unused DSOs
help        display this help message and exit
$ LD_DEBUG=libs ls
$ LD_DEBUG=all ls

useful commands in binutils

addr2line, ar, gprof, nm, objdump, readelf, gcov, strip, ranlib, size, strings

ranlib: generates an index of the contents of an archive
(the command below generates the symbol index for the archive, but it is usually not needed because 'ar' already takes care of it; use 'nm -s xxx.a' to check the index before running ranlib)
$ ranlib xxx.a

size: lists the section sizes of an object or archive file

strings: lists printable strings in files
(by default strings prints every run of printable characters that is at least four characters long; change the minimum length with -n)
strings is the usual way to pull text out of a binary file.

$ strings /bin/ls | grep Copyright

show C++ symbols with namespaces
$ nm --demangle xxx.a
(--demangle shows a human-readable symbol such as namespace::Builder::hello(int) rather than the mangled name that the compiler actually stores)
output format for nm

A: Absolute symbol (its value will not change on further linking)
B: BSS section (uninitialized or zero-initialized data)
C: Common symbol (uninitialized data, merged at link time)
D: Data section (initialized data)
G: Initialized data section for small objects
I: Indirect reference to another symbol (lowercase i marks a GNU indirect function)
N: Debugging symbol
R: Read-only data section
S: Uninitialized or zero-initialized data section for small objects
T: Text section (code)
U: Undefined symbol (a symbol that is referenced but not defined in this object file)
W: Weak symbol (a symbol that can be overridden by a definition with the same name)
t: Local (static) text (code) symbol
d: Local (static) data symbol

In general an uppercase letter means a global (external) symbol and a lowercase letter means a local one.

process affinity

task state: ZOMBIE (when a child exits it enters this state and waits for its parent to read its exit status; once the parent has read it with wait() or similar, the child's remaining resource (task_struct) is freed)

Make sure the parent calls wait() for each child process that exits, otherwise the resource described above leaks in the kernel!
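
To see the zombie state in practice, here is a small sketch (Linux only). It forks a child that exits at once, reads the child's state letter from /proc/PID/stat before the parent calls wait, and then reaps it. The helper name task_state is made up for this example.

```python
import os
import time


def task_state(pid):
    """Return the one-letter state from /proc/PID/stat (for example R, S or Z)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state is the first field after the parenthesised command name.
    return data[data.rindex(")") + 2]


child = os.fork()
if child == 0:
    os._exit(0)  # child: exit immediately

# Parent: poll (up to about 2 s) until the exited child shows up as a zombie.
state_before_wait = None
for _ in range(200):
    state_before_wait = task_state(child)
    if state_before_wait == "Z":
        break
    time.sleep(0.01)

os.waitpid(child, 0)  # reap the child so the kernel can free its entry
reaped = not os.path.exists(f"/proc/{child}")
print("state before wait():", state_before_wait, "| reaped:", reaped)
```

Until waitpid() runs, the child stays visible in /proc with state Z even though it no longer executes any code.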

priority:
        the nice value is given when a process is created and is used to calculate the static priority,
        while scheduling is based on the dynamic priority (effective_prio), which takes
        sleep time and the static priority into account

command
    show nice value
    #ps -axl

    default priority (20) corresponds to nice (0)
    run a program with an adjusted nice value
    (nice value for ls == default nice value (0) + (-10))
    #nice -n -10 ls

    change nice value of processes
    #renice -n 10 -p 1203

    retrieve task cpu affinity
    #taskset -p 1203

     cpu7    cpu6  cpu5   cpu4 cpu3  cpu2  cpu1  cpu0
       +      +     +      +    +     +     +     +
       |      |     |      |    |     |     |     |
       +------+-----+-----------+-----+-----+-----+
       |                   |                      |
       |   1111(f)         |       0000(0)        |
       +   cpu mask        +       cpu mask       +


    Let's say we have 8 cpus with core ids
    0-7. The cpu mask is a bit field in which
    each bit represents one cpu (1 = allowed, 0 = not allowed)

    #taskset -p 1223
    0x03
    affinity mask 0x03 means it runs only on cpu0, cpu1

    -c means use a cpu list (0,1,2) instead of the 0x mask format
    #taskset -cp 1223
    pid 1223's current affinity list: 0,1

    set task cpu affinity: run only on cpu 0 by setting its bit to 1
    #taskset -p 0x01 1223
    OR
    #taskset -cp 0 1223
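
The relation between the hex mask and the cpu list is plain bit arithmetic. Here is a minimal sketch of the conversion in both directions; the function names are made up for this example. os.sched_getaffinity is the Linux-only standard library call that returns the set of cpus the current process may run on.

```python
import os


def mask_to_cpus(mask):
    """Expand an affinity bitmask (for example 0x03) into a sorted list of cpu ids."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpus_to_mask(cpus):
    """Build the bitmask that has one bit set for each cpu id."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(mask_to_cpus(0x03))                 # [0, 1]
print(hex(cpus_to_mask([4, 5, 6, 7])))    # 0xf0

# Affinity of the current process, shown in both formats.
current = sorted(os.sched_getaffinity(0))
print("cpu list:", current, "| mask:", hex(cpus_to_mask(current)))
```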

CPU bandwidth
    /proc/sys/kernel/sched_rt_period_us:
    The scheduling period that is equivalent to 100% CPU
    bandwidth
    /proc/sys/kernel/sched_rt_runtime_us:
    A global limit on how much time realtime scheduling may
    use.
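
As a rough illustration of how the two values relate: the share of cpu time available to realtime tasks is runtime divided by period. The sketch below assumes the usual kernel defaults (period 1000000 us, runtime 950000 us) for the worked number, and treats a runtime of -1 as "no limit". The function names are made up for this example.

```python
def read_int(path):
    """Read a single integer from a /proc/sys file."""
    with open(path) as f:
        return int(f.read())


def rt_share(runtime_us, period_us):
    """Fraction of each period that realtime tasks may use (-1 runtime = no limit)."""
    if runtime_us < 0:
        return 1.0
    return runtime_us / period_us


# With the usual defaults, realtime tasks may use 95% of each period.
print(rt_share(950000, 1000000))  # 0.95

try:
    period = read_int("/proc/sys/kernel/sched_rt_period_us")
    runtime = read_int("/proc/sys/kernel/sched_rt_runtime_us")
    print(f"this system: {rt_share(runtime, period):.0%} of cpu time for realtime tasks")
except OSError:
    print("sched_rt_* files are not readable on this system")
```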

IRQ affinity setting

see cpu affinity for IRQ
$ cat /proc/irq/145/smp_affinity
f
(hex f = binary 1111, i.e. cpu0-cpu3)
$ cat /proc/irq/145/smp_affinity_list
0-3
(cpu0, cpu1, cpu2, cpu3)

trace process by strace

strace/ltrace/ptrace/truss
show how much time a process spends in system calls or library calls (library here means the C library, e.g. memset, fgets)

ptrace is a system call
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

strace and ltrace are built on top of ptrace
strace/truss trace system calls and signals as they occur
ltrace traces a process's library calls

truss is similar to strace, but it is for other UNIX systems, not Linux

strace is the tool we use on Linux
(mount /proc first if it is not mounted yet)
#mount -t proc proc /proc
#strace -f -tt -o vim.strace vim

attach to a running process to see which system calls it makes
#strace -p pid

#strace -f -p pid
(monitor the parent and all its children; this is what the -f option does)

#strace -o filename
(save the result to a file)
#strace -T
(show the time spent in each system call)
#strace -t
(show the time of day)
#strace -s 1024
(maximum size of strings to print)
#strace -e trace=nanosleep
-e trace=network
-e trace=file
-e trace=desc
-e trace=signal
(trace only a particular system call or class of system calls)
#strace -f
(trace children as well)
#strace -c
(count time, calls, and errors for each system call and report a summary on program exit)

ltrace accepts most of the options above as well.

so the usual use case is
$ strace -f -c -o result vim

show the time of day and the time spent in each system call
$ strace -f -t -T -o result vim
root@manager:~# strace -f -c vim

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 21.81    0.000935           3       309           read
 15.35    0.000658           2       383       127 stat
  9.96    0.000427           3       142         5 open
  9.68    0.000415           6        65           munmap
  6.02    0.000258           2       113           mmap
  5.25    0.000225           2       145           close
  4.15    0.000178           2        86        80 openat
  3.48    0.000149           6        25           write
  3.15    0.000135           4        36           mprotect
  2.61    0.000112           2        59           select
  2.59    0.000111           1       160           fcntl
  2.15    0.000092           1       106           fchdir
  2.05    0.000088           2        37           brk
  1.94    0.000083           1        56           chdir
  1.70    0.000073           1        83           fstat
  1.24    0.000053           1        56           getcwd
  1.19    0.000051           1        56           lseek
  1.10    0.000047           4        12           getdents
  1.03    0.000044           2        21        20 access
  0.84    0.000036          36         1           unlink
  0.63    0.000027           1        25           ioctl
  0.42    0.000018           1        24           rt_sigaction
  0.30    0.000013           7         2         2 connect
  0.23    0.000010          10         1           rename
  0.19    0.000008           4         2         2 statfs
  0.16    0.000007           1         7           rt_sigprocmask
  0.16    0.000007           4         2           socket
  0.16    0.000007           7         1           fchown
  0.12    0.000005           1         6           getuid
  0.09    0.000004           4         1           sysinfo
  0.05    0.000002           1         2           getrlimit
  0.05    0.000002           2         1         1 futex
  0.02    0.000001           1         1           execve
  0.02    0.000001           1         1           uname
  0.02    0.000001           1         2           umask
  0.02    0.000001           1         1           sigaltstack
  0.02    0.000001           1         1           arch_prctl
  0.02    0.000001           1         1           set_tid_address
  0.02    0.000001           1         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.004287                  2033       237 total

show shared memory on the system

show System V IPC objects, including shared memory (use ipcs -m for shared memory only)
$ ipcs
show posix shared memory
$ ls /dev/shm

only show mem/cpu usage of a particular pid

$ ps -p 13687 -o %mem=
$ ps -p 13687 -o %cpu=

show processes that have a file or dir open

who has the dir itself open (top level only)
$ lsof $dir
search for all open instances of directory $s and the files and directories it contains at its top level
$ lsof +d $s
search for all open instances of directory $s and all the files and directories it contains to its complete depth
$ lsof +D $s

who has this particular file open
$ lsof $file

force a running process to generate a core dump

gcore (from gdb) does NOT kill the process
$ gcore $pid
(useful for taking a snapshot of the process memory)

sending SIGABRT kills the process and makes it dump core
$ kill -SIGABRT $pid

signal

A few days ago, I landed upon the unix signals that lead to process termination. I was trying to remember which signals linux generates when one presses Ctrl+Z and Ctrl+C. Memory did not serve me at that moment, so I decided to look them up one more time. I realized that a single book which explains these terms clearly is better than searching loads of webpages; I did the latter since my unix os book was out of reach. To my disappointment, there was no single link that listed all the differences in an orderly fashion.

Hence, in this post, I wish to delineate these terms by consolidating my findings from stackoverflow, wikipedia and other unix internals websites. Here it goes:

SIGKILL: Terminates a process immediately. This signal cannot be handled (caught), ignored or blocked. (The “kill -9” command in linux generates the same signal).

SIGTERM: Asks a process to terminate. Unlike SIGKILL, this signal can be handled, ignored or blocked in code, which is why it is used for graceful termination. If the process does not catch it, the process is killed. (The “kill” command in linux, if specified without any signal number like -9, sends SIGTERM)

SIGINT: Interrupts a process. (The default action is to terminate the process.) Like SIGTERM, it can be handled, ignored or blocked. The difference between SIGINT and SIGTERM is that the former can be sent from a terminal as an input character. This is the signal generated when a user presses Ctrl+C. (Sidenote: as a character, Ctrl+C is ETX (End of Text))

SIGQUIT: Terminates a process. It differs from both SIGKILL and SIGTERM in that its default action also generates a core dump of the process. Like SIGINT, it can be sent from the terminal as an input character, and it can be handled, ignored or blocked in code. This is the signal generated when a user presses Ctrl+\.

SIGTSTP: Suspends a process. This too can be handled, ignored or blocked. Since it does not terminate the process, the process can be resumed by sending a SIGCONT signal. This signal is generated by pressing Ctrl+Z. (Sidenote: Ctrl+Z stands for the substitute character, which indicates End-of-File in DOS)

SIGHUP: (From Wikipedia): Sent to a process when its controlling terminal is disconnected. This especially relates to modem/dial-in connections. The default action is to terminate the process, so a process has to handle this signal explicitly if it wants to survive. A good use is to “poke” a process and let the process (as defined by the programmer) decide what to do with the signal, for example reload its configuration. SIGHUP can be handled, ignored or blocked. Note that Ctrl+D does not generate a signal: it sends End-of-File (EOT) to the program reading the terminal. When that makes a shell exit and the terminal closes, processes still attached to the terminal may then receive SIGHUP.

Sometimes you may want to pause a process for a while (for example to simulate slow response) and resume it later

$ kill -SIGSTOP $pid
$ kill -SIGCONT $pid
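
The difference between a catchable signal and SIGKILL is easy to check from code. The sketch below installs a handler for SIGTERM, sends SIGTERM to itself, and then tries to install the same handler for SIGKILL, which the kernel refuses. It assumes it runs in the main thread of the interpreter; the names on_term, received and kill_catchable are made up for this example.

```python
import os
import signal
import time

received = []


def on_term(signum, frame):
    """Record the signal name instead of terminating."""
    received.append(signal.Signals(signum).name)


signal.signal(signal.SIGTERM, on_term)
os.kill(os.getpid(), signal.SIGTERM)

# Give the interpreter a moment to run the handler.
for _ in range(100):
    if received:
        break
    time.sleep(0.01)

# SIGKILL cannot be caught: installing a handler is rejected by the kernel.
try:
    signal.signal(signal.SIGKILL, on_term)
    kill_catchable = True
except OSError:
    kill_catchable = False

signal.signal(signal.SIGTERM, signal.SIG_DFL)  # restore the default action
print("handled:", received, "| SIGKILL catchable:", kill_catchable)
```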

show environment variables of a running process.

use gdb: environ is a global variable

It holds the live environment variables of the process, so it is the authoritative place to look.
attach to the process (gdb -p $pid), then
(gdb) p environ[0]
(gdb) p environ[1]
(gdb) p environ[2]
…
(gdb) p environ[x]

check /proc/$pid/environ

It only shows the environment variables the process was launched with. Variables
added later with setenv() [C function] do not appear there.
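
The difference between the live environment and /proc/PID/environ can be demonstrated on the current process. In this sketch (Linux only) the variable name DEMO_ADDED_AFTER_LAUNCH and the helper initial_environ are made up; assigning to os.environ changes the C-level environment of the running process.

```python
import os


def initial_environ(pid="self"):
    """Parse /proc/PID/environ (NUL-separated KEY=VALUE entries) into a dict."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read()
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


name = "DEMO_ADDED_AFTER_LAUNCH"  # hypothetical variable, added after start-up
os.environ[name] = "1"

in_live = name in os.environ          # what gdb would show via environ
in_proc = name in initial_environ()   # what /proc/PID/environ shows
print("live environment:", in_live, "| /proc/self/environ:", in_proc)
```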

check process name when the pid is known

$ cat /proc/$pid/status

memory leak/ overflow tool

Use tools:
valgrind and AddressSanitizer

shutdown() vs close()

 shutdown() is useful for delineating the end of a request sent to a server over TCP.  A typical use is to send a request to a server and then call shutdown().  The server reads your request followed by an EOF (a read of 0 on most unix implementations), which tells the server that it has your full request.  You then block reading on the socket.  The server processes your request, sends the necessary data back to you and then calls close().  When you have finished reading the whole response you read an EOF, which signifies that the response is complete.

The shutdown(sockfd, how) call causes all or part of a full-duplex connection on the socket associated with sockfd to be shut down.
- If how is SHUT_RD, further receptions will be disallowed.
    - No FIN is sent; the kernel only marks the socket as not readable (the peer does not know), and the application should not read any more
    - a read on a SHUT_RD socket typically returns end of file; it fails with -1 (ECONNRESET) only if the peer has reset the connection.

- If how is SHUT_WR, further transmissions will be disallowed.
    - a FIN is sent to the peer and the peer reads EOF. Only this direction is closed: `the peer can still send data even after receiving EOF`
    - a further write on a SHUT_WR socket fails with `SIGPIPE, Broken pipe` (EPIPE).

- If how is SHUT_RDWR, further receptions and transmissions will be disallowed.
    - Both actions above

- close() does the equivalent of SHUT_RDWR (once the last descriptor for the socket is closed) and tells the kernel to free the socket resources

Usage:
- shutdown() followed by close() (close() is then only needed to free resources)
- close() only
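
The request/response pattern described above can be sketched over a loopback TCP connection in a single process. The request and response payloads and the helper recv_until_eof are made up for this example; a real client and server would of course live in separate processes.

```python
import socket


def recv_until_eof(sock):
    """Read from the socket until the peer signals EOF (recv returns b'')."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

# Client: send the request, then half-close to mark its end (FIN is sent).
client.sendall(b"GET status")
client.shutdown(socket.SHUT_WR)

# Server: EOF tells it the request is complete; it can still reply.
request = recv_until_eof(server)
server.sendall(b"status: ok")
server.close()

# Client: the read side is still open, so the whole response arrives.
response = recv_until_eof(client)
client.close()
listener.close()
print(request, response)
```

The point of SHUT_WR here is that the client keeps its read side open, which close() would not allow.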

RESET TCP from application

If an application makes TCP send a RST (a segment with the RST flag set) on a socket, the socket goes to the CLOSED state immediately (no FIN is sent). The peer that receives the RST also goes to CLOSED, and a RST is never acknowledged. This is the quick way to close a TCP connection and free the port, because it skips TIME_WAIT.

struct linger linger;

linger.l_onoff = 1;
linger.l_linger = 0;
if (setsockopt(fd, SOL_SOCKET, SO_LINGER,
               (const void *) &linger, sizeof(struct linger)) == -1)
{
    /* log error */
}

//RST packet is sent when call close()
close(fd);
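
To observe the effect from the peer's side, here is a sketch of the same SO_LINGER setting over a loopback connection (Linux assumed). struct.pack("ii", 1, 0) builds the two-int struct linger with l_onoff = 1 and l_linger = 0; the variable outcome is made up for this example.

```python
import socket
import struct

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()
server.settimeout(2.0)

# l_onoff = 1, l_linger = 0: close() aborts the connection with a RST.
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
client.close()

# A normal close would give the peer EOF (b''); a RST gives ECONNRESET.
try:
    data = server.recv(1)
    outcome = "eof" if data == b"" else "data"
except ConnectionResetError:
    outcome = "reset"
except socket.timeout:
    outcome = "timeout"

server.close()
listener.close()
print("peer saw:", outcome)
```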

install debuginfo of kernel or application on centos

# debuginfo of kernel(kernel is not built by yourself)
$debuginfo-install -y kernel-$(uname -r)

# debuginfo of application or library installed by yum
$cp CentOS-Debuginfo.repo /etc/yum.repos.d/
$yum update
$debuginfo-install libgcc-4.8.5-44.el7.x86_64

show stack for a given process (thread)

# show the user-space stacks of all threads in the thread group
$pstack 11289

# show the kernel stack of this task only (usually needs root)
$cat /proc/11289/stack

# or generate a core without killing the process

$gcore $pid

pkg-config

pkg-config is a tool that reports how to build against a library: it outputs the version, header paths and linker flags of that library, so that users of the library can pass them to the compiler.

pkg-config gets all this information from xxx.pc files found on several search paths, so a library that wants to be managed by pkg-config must provide an xxx.pc file on one of those paths.

/usr/lib64/pkgconfig/glib.pc

prefix=/usr
exec_prefix=/usr
libdir=/usr/lib64
includedir=/usr/include

Name: GLib
Description: C Utility Library
Version: 1.2.10
Libs: -L${libdir} -lglib
Cflags: -I${includedir}/glib-1.2 -I${libdir}/glib/include
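
To make the file format concrete, here is a rough sketch of how the variables in a .pc file expand into the final flags. It is not how pkg-config itself is implemented, only an illustration on the glib.pc text shown above; parse_pc is a made-up helper and it ignores features such as Requires.

```python
import re

PC_TEXT = """\
prefix=/usr
exec_prefix=/usr
libdir=/usr/lib64
includedir=/usr/include

Name: GLib
Description: C Utility Library
Version: 1.2.10
Libs: -L${libdir} -lglib
Cflags: -I${includedir}/glib-1.2 -I${libdir}/glib/include
"""


def parse_pc(text):
    """Return (variables, fields) with ${name} references expanded."""
    variables, fields = {}, {}

    def expand(value):
        return re.sub(r"\$\{(\w+)\}", lambda m: variables.get(m.group(1), ""), value)

    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        var = re.match(r"^([A-Za-z_][\w.]*)=(.*)$", line)
        field = re.match(r"^([A-Za-z][\w.]*):\s*(.*)$", line)
        if var:
            variables[var.group(1)] = expand(var.group(2))
        elif field:
            fields[field.group(1)] = expand(field.group(2))
    return variables, fields


variables, fields = parse_pc(PC_TEXT)
print(fields["Libs"])    # -L/usr/lib64 -lglib
print(fields["Cflags"])  # -I/usr/include/glib-1.2 -I/usr/lib64/glib/include
```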

# get default search paths for pkg-config
$pkg-config --variable pc_path pkg-config

# change search paths for pkg-config
$ export PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/share/pkgconfig:/new/path

# list all known packages
$pkg-config --list-all


##################### example ###########################
$pkg-config --modversion gnutls
3.3.29
$pkg-config --libs  gnutls
-L/usr/lib64 -lgnutls
$pkg-config --cflags gnutls
-I/usr/include/p11-kit-1


$cat /usr/lib64/pkgconfig/gnutls.pc 
# Process this file with autoconf to produce a pkg-config metadata file.

# Copyright (C) 2004-2012 Free Software Foundation, Inc.

# Copying and distribution of this file, with or without modification,
# are permitted in any medium without royalty provided the copyright
# notice and this notice are preserved.  This file is offered as-is,
# without any warranty.

# Author: Simon Josefsson

prefix=/usr
exec_prefix=/usr
libdir=/usr/lib64
includedir=/usr/include

Name: GnuTLS
Description: Transport Security Layer implementation for the GNU system
URL: http://www.gnutls.org/
Version: 3.3.29
Libs: -L${libdir} -lgnutls
Libs.private: /usr/lib64/libz.so     -lp11-kit    -ltspi -lgmp
Requires.private: nettle, hogweed, libtasn1, p11-kit-1, zlib
Cflags: -I${includedir}
##################### example ###########################

ld finding library

# at link time ld is used; show the paths it checks for a library
$ ld -lopenvswitch --verbose

get hardware meta

$dmidecode
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
11 structures occupying 539 bytes.
Table at 0x000F70E0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: SeaBIOS
	Version: 1.10.2-1.el7
	Release Date: 04/01/2014
	Address: 0xE8000
	Runtime Size: 96 kB
	ROM Size: 64 kB
	Characteristics:
		BIOS characteristics not supported
		Targeted content distribution is supported
	BIOS Revision: 0.0

Handle 0x0100, DMI type 1, 27 bytes
System Information
	Manufacturer: Red Hat
	Product Name: KVM
	Version: RHEL 7.4.0 PC (i440FX + PIIX, 1996)
	Serial Number: Not Specified
	UUID: Not Settable
	Wake-up Type: Power Switch
	SKU Number: Not Specified
	Family: Red Hat Enterprise Linux
...

####################### PCIE Slot information ============================
# get free pci-e slot, lanes of each slot
# then you can install devices into these slots
$dmidecode --type slot
...
Handle 0x0902, DMI type 9, 17 bytes
System Slot Information
	Designation: PCIe Slot 4 ----> like a description
	Type: x16 PCI Express 3  ----> 16 lanes PCIE 3.0
	Current Usage: Available ----> this slot is free, can be inserted with pcie device
	Length: Long
	ID: 4
	Characteristics:
		3.3 V is provided
		PME signal is supported

Handle 0x0903, DMI type 9, 17 bytes
System Slot Information
	Designation: PCIe Slot 5 ----> like a description
	Type: x8 PCI Express 3 x16---> electrical 8 lanes but x16 (physical slot), [think it as fake x16]
	Current Usage: In Use     ---> this slot is in use
	Length: Long
	ID: 5
	Characteristics:
		3.3 V is provided
		PME signal is supported
	Bus Address: 0000:86:00.0

# In some linux kernel, the type may not contain pcie generation info
# then check the pcie generation from link rate
$lspci -s 0000:86:00.0 -vv | egrep  "LnkCap|LnkSta"
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us, L1 <4us
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

# LnkCap (Link Capabilities register): Speed 8GT/s (data transfer rate), Width x8 (lanes) is the maximum that the port supports
# LnkSta (Link Status register): Speed 8GT/s, Width x8 is the result (current speed and width) after link negotiation
...

####################### PCIE Slot information ============================

lspci output

lspci only lists pci(e) devices that are present, i.e. slots that are in use. Use lspci to query more details about a pci device and setpci to read and set pci registers

# list all pci devices
$ls -al /sys/bus/pci/devices
lrwxrwxrwx 1 root root 0 Aug 29 17:59 0000:3b:12.0 -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:12.0
...

# We can break the device string "0000:3b:12.0" down as follows:
# 0000 : PCI domain (each domain can contain up to 256 PCI buses)
# 3b   : the bus number the device is attached to 
# 12   : the device number 
# .0   : PCI device function
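
The same breakdown can be written as a small parser. This is only a sketch: PciAddress and parse_pci_address are made-up names, and the domain is treated as optional because lspci usually omits it.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


_PCI_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(text):
    """Split DDDD:BB:DD.F (hex fields, domain optional) into its numeric parts."""
    match = _PCI_RE.match(text)
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    domain, bus, device, function = match.groups()
    return PciAddress(int(domain or "0", 16), int(bus, 16), int(device, 16), int(function))


addr = parse_pci_address("0000:3b:12.0")
print(addr)  # PciAddress(domain=0, bus=59, device=18, function=0)
```

Note that the fields are hexadecimal, so bus 3b is 59 and device 12 is 18 in decimal.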

# To get additional information about the device, we can change into the 0000:3b:12.0 directory and read its pseudo-files (vendor, device, class, ...) with our favorite pager


# show number instead of name
$lspci -s 3b:12.0 -n
3b:12.0 0100: 1af4:1042

# Field 1 : 3b:12.0 : bus number (3b), device number (12) and function (0) 
# Field 2 : 0100    : device class 
# Field 3 : 1af4    : vendor ID 
# Field 4 : 1042    : device ID

# Or more easy way and more details
$lspci -s 3b:12.0 -nmv
Device:	3b:12.0
Class:	0100
Vendor:	1af4
Device:	1042
SVendor:	1172
SDevice:	0001
NUMANode:	0

# To convert the identifiers to human-readable strings, we can look up the identifiers in the PCI ID repository: http://pci-ids.ucw.cz/
# Field 2 : 0100   : class 0100 is listed as a "SCSI storage controller"
# Field 3 : 1af4   : vendor ID 1af4 is listed as "Red Hat, Inc."
# Field 4 : 1042   : device ID 1042 is listed as a "Virtio 1.0 block device"

$ lspci
3b:12.0 SCSI storage controller: Red Hat, Inc Virtio block device
...

coredump without symbol table

In production environments the symbol table is usually stripped to reduce the binary size, but an external symbol file is normally generated and stored somewhere, so that when a core dump happens it is easy to find out why.

But what can we do when there is no external symbol file?
Stack analysis: Analyze the stack trace in the core dump to understand the sequence of function calls leading up to the crash (take the virtual addresses, disassemble the binary with objdump -d, and try to map the assembly code back to the source). This can provide insights into the program’s execution flow and potentially identify the source of the problem.

Register analysis: Examine the register values in the core dump to understand the state of the program at the time of the crash. This can help you identify any abnormal values or conditions that may have caused the crash.

Keep in mind that without symbols, the information obtained from the core dump analysis will be limited. It will be difficult to pinpoint the exact cause of the crash or understand the specific code paths leading to it.

enable forwarding in kernel

open /etc/sysctl.conf
to enable ipv4 forwarding set net.ipv4.ip_forward=1
to enable ipv6 forwarding set net.ipv6.conf.all.forwarding=1
then reload the settings with $ sysctl -p

Actually, every variable in /etc/sysctl.conf is applied to the corresponding kernel variable through /proc or the sysctl API
Note: the proc interface for sysctl is at /proc/sys
$ cat /proc/sys/net/ipv4/ip_forward

IP address and Route

IP Address
Displaying existing addresses
ip [-6] addr show [dev DEVICE]

Add an IP address
ip [-6] addr add ADDRESS/PREFIXLEN dev DEVICE
ip -6 addr add 2001:0db8:0:f101::1/64 dev eth0

Removing an IP address
ip [-6] addr del ADDRESS/PREFIXLEN dev DEVICE
ip -6 addr del 2001:0db8:0:f101::1/64 dev eth0

IP Route
show ip routes on all interfaces
ip [-6] route show
show ip routes on a particular interface
ip [-6] route show dev eth0

Add an IP route through a gateway
ip [-6] route add NETWORK/PREFIXLEN via GATEWAY [dev DEVICE]
ip -6 route add 2000::/3 via 2001:0db8:0:f101::1

Removing an IP route through a gateway
ip [-6] route del NETWORK/PREFIXLEN via GATEWAY [dev DEVICE]
ip -6 route del 2000::/3 via 2001:0db8:0:f101::1

Add an IP route through an interface
ip [-6] route add NETWORK/PREFIXLEN dev DEVICE metric 1
ip -6 route add 2000::/3 dev eth0 metric 1

Removing an IP route through an interface
ip [-6] route del NETWORK/PREFIXLEN dev DEVICE
ip -6 route del 2000::/3 dev eth0

Deep into route table

Actually, the kernel supports multiple route tables (classically 255) that are selected by prioritized rules. The two most
used ones are the local table and the main table. The local table is maintained by the kernel and normally should not be modified,
while the main table is the default table: when you use tools to add/delete/create routes without naming a table, they go into the main table

Which table is used when a packet comes in?
The rules listed by ip rule are walked in priority order. By default that means: check the local table first and, if nothing matches, check the main table. Other tables are NOT checked unless a
policy (ip rule) is configured; when a rule matches the packet, the table it names is consulted at that rule's priority, and the lookup moves on to the next rule if that table has no matching route

For example, tag every skb with mark 1, and if an skb has mark 1, look up table 100

iptables -t mangle -A PREROUTING -j MARK --set-mark 1
ip rule add fwmark 1 lookup 100
ip route add local 0.0.0.0/0 dev lo table 100

Show routes in different tables(default is main)
ip -6 route show
ip -6 route show table main
ip -6 route show table all
ip -6 route show table local
ip -6 route show table 100

let's explain the output fields of each route, with ipv4 as the example

$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:32:26:c0:ce
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          inet6 addr: fe80::42:32ff:fe26:c0ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:27408 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30149 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1717078 (1.7 MB)  TX bytes:67951999 (67.9 MB)

eth0      Link encap:Ethernet  HWaddr 00:50:56:91:d7:c1
          inet addr:10.117.6.8  Bcast:10.117.7.255  Mask:255.255.252.0
          inet6 addr: fe80::250:56ff:fe91:d7c1/64 Scope:Link
          inet6 addr: fc00:10:117:7:250:56ff:fe91:d7c1/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20424652 errors:1753 dropped:2707017 overruns:0 frame:0
          TX packets:2390188 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:9346260108 (9.3 GB)  TX bytes:32523341970 (32.5 GB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:754204 errors:0 dropped:0 overruns:0 frame:0
          TX packets:754204 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:74578327 (74.5 MB)  TX bytes:74578327 (74.5 MB)

(base) root@dev:~/$ ip route show
default via 10.117.7.253 dev eth0 onlink
# default gateway 10.117.7.253 output eth0
10.117.4.0/22 dev eth0  proto kernel  scope link  src 10.117.6.8
# scope link(local network 10.117.4.0/22), src 10.117.6.8 eth0's ip
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.2
# scope link(local network 172.17.0.0/16), src 172.17.0.2 docker0's ip
#local network, match subnet.


(base) root@dev:~/# ip route show table local
broadcast 10.117.4.0 dev eth0  proto kernel  scope link  src 10.117.6.8
local 10.117.6.8 dev eth0  proto kernel  scope host  src 10.117.6.8
# local mapped to RT_LOCAL(kernel) eth0, scope host->0 hop, myself
broadcast 10.117.7.255 dev eth0  proto kernel  scope link  src 10.117.6.8
broadcast 127.0.0.0 dev lo  proto kernel  scope link  src 127.0.0.1
local 127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1
local 127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1
# local mapped to RT_LOCAL(kernel) lo, scope host->0 hop, myself
broadcast 127.255.255.255 dev lo  proto kernel  scope link  src 127.0.0.1
broadcast 172.17.0.0 dev docker0  proto kernel  scope link  src 172.17.0.2
local 172.17.0.2 dev docker0  proto kernel  scope host  src 172.17.0.2
# local mapped to RT_LOCAL(kernel) docker0, scope host->0 hop, myself
broadcast 172.17.255.255 dev docker0  proto kernel  scope link  src 172.17.0.2
# local address, match 255.255.255.255, exact match
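
The lookup order (local table first, then main table, longest prefix wins inside a table) can be sketched with the routes from the output above. This is a simplified model, not the kernel's implementation: broadcast entries and policy rules are left out, and the names LOCAL_TABLE, MAIN_TABLE and lookup are made up.

```python
import ipaddress

# Routes copied from the output above; broadcast entries are omitted.
LOCAL_TABLE = [
    ("10.117.6.8/32", "local dev eth0"),
    ("127.0.0.0/8", "local dev lo"),
    ("172.17.0.2/32", "local dev docker0"),
]
MAIN_TABLE = [
    ("0.0.0.0/0", "via 10.117.7.253 dev eth0"),
    ("10.117.4.0/22", "dev eth0 scope link"),
    ("172.17.0.0/16", "dev docker0 scope link"),
]


def longest_prefix_match(table, addr):
    """Return (network, action) of the most specific route containing addr."""
    best = None
    for prefix, action in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best


def lookup(dst):
    """Check the local table first, then the main table."""
    addr = ipaddress.ip_address(dst)
    for name, table in (("local", LOCAL_TABLE), ("main", MAIN_TABLE)):
        hit = longest_prefix_match(table, addr)
        if hit:
            return name, str(hit[0]), hit[1]
    return None


for dst in ("10.117.6.8", "10.117.5.20", "172.17.0.9", "8.8.8.8"):
    print(dst, "->", lookup(dst))
```

An address owned by the host hits the local table, a neighbour on the same subnet hits the scope link route in main, and everything else falls through to the default route.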

prevent route missing after reboot

There are two ways to do this

add static route

# Centos
# /etc/sysconfig/network-scripts/route-eth0 created if not there

default via 10.10.10.1 dev eth0
10.117.0.0/16 via 10.117.1.1 dev eth0
# from terminal restart network service
$sudo service network restart

save route table and restore it after reboot

$sudo ip route save >dump

# after reboot
$sudo ip route restore <dump

show arp entry

# incomplete means request is sent and no reply
$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
172.17.0.6                       (incomplete)                              docker0
172.17.0.2                       12:64:0b:38:ea:49                         docker0
10.226.134.65            ether   1c:ab:34:33:81:84   C                     eth0
127.0.0.3                ether   02:00:00:12:34:05   CM                    loopback-279

# if you delete an entry, it will be marked incomplete, you still see such entry
$ arp -d 172.17.0.2
$ arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
172.17.0.6                       (incomplete)                              docker0
172.17.0.2                       (incomplete)                              docker0
10.226.134.65            ether   1c:ab:34:33:81:84   C                     eth0
127.0.0.3                ether   02:00:00:12:34:05   CM                    loopback-279

Flags:
C: the entry is complete
M: permanent entry, added by the user
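
The C and M letters are a rendering of a flags bitmask that the kernel also exposes, in hex, in the third column of /proc/net/arp. A minimal sketch of the decoding, assuming the bit values from linux/if_arp.h (ATF_COM=0x02, ATF_PERM=0x04, ATF_PUBL=0x08); the function name arp_flags_decode is made up for this example.

```shell
# Decode a flags value from /proc/net/arp (third column) into arp(8) letters.
# Assumed bit values, from <linux/if_arp.h>:
#   ATF_COM=0x02 (complete), ATF_PERM=0x04 (permanent), ATF_PUBL=0x08 (published)
arp_flags_decode() {
    flags=$(( $1 ))                      # accepts hex input such as 0x6
    out=""
    [ $(( flags & 0x02 )) -ne 0 ] && out="${out}C"
    [ $(( flags & 0x04 )) -ne 0 ] && out="${out}M"
    [ $(( flags & 0x08 )) -ne 0 ] && out="${out}P"
    if [ -z "$out" ]; then out="(incomplete)"; fi
    printf '%s\n' "$out"
}

# usage on a live system (skips the header line):
#   awk 'NR > 1 { print $1, $3 }' /proc/net/arp |
#       while read -r ip flags; do echo "$ip $(arp_flags_decode "$flags")"; done
```

With this mapping, the loopback-279 entry above (flags CM) corresponds to 0x6 and an (incomplete) entry to 0x0.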

save tcpdump captured packet to file

$ tcpdump -i eth0 -s 0 -w /home/lzq/pt.pcap
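-s sets the snapshot length, i.e. how many bytes of each packet are captured
-s 0 means no limit chosen by the user (tcpdump then uses its default maximum of 262144 bytes)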

packet shows as tcp segment of a reassembled pdu in wireshark

This happens when an application message (PDU) is larger than the MSS: TCP splits it into several TCP segments. A TCP segment has no fragment flag, so how does wireshark know that several TCP segments belong to the same PDU?

In practice, the segments of one PDU usually carry the same ACK number and consecutive sequence numbers. The reassembly itself is driven by the upper-layer dissector, which tells the TCP dissector how many more bytes the current PDU still needs.

get two sides of established STREAM socket

A STREAM socket can run over TCP or over a UNIX domain socket. For TCP the two sides are easy to read from the output, so here we focus on UNIX STREAM sockets with the ss command.

State      Recv-Q Send-Q Local Address:Port              Peer Address:Port                local process who uses local address          
# -t for tcp
# -u for udp
# -w for Raw
# -x for unix
# -p show process name and id
$ss -tp | grep libvirtd
ESTAB      0      0      172.17.0.2:55758                172.17.0.3:16508                 users:(("libvirtd",pid=5056,fd=22))

# for an established unix socket
# stream client side: local and peer address are both *, the port column is the unix socket inode
# stream server side: peer address is *, local address is the unix socket path, the port column is the unix socket inode

# as you can see, virsh uses unix socket 254243 to talk to a peer that uses unix socket 254245
$ss -xp | grep virsh
u_str  ESTAB      0      0       * 254243                * 254245                users:(("virsh",pid=11049,fd=6))

# now it is clear:
# virsh is connected to libvirtd on path /var/run/libvirt/libvirt-sock
$ss -xp | grep 254245
u_str  ESTAB      0      0      /var/run/libvirt/libvirt-sock 254245                * 254243                users:(("libvirtd",pid=5056,fd=21))
u_str  ESTAB      0      0       * 254243                * 254245                users:(("virsh",pid=11049,fd=6))
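
The two grep steps above can be folded into one lookup. This is only a sketch: it assumes the column layout of the ss -xp output shown here (local inode in column 6, peer inode in column 8, process in column 9), and the function name unix_peer_of is hypothetical.

```shell
# Read `ss -xp` output on stdin and print "<path> <process>" for the peer
# of the socket whose local inode is $1.
unix_peer_of() {
    awk -v inode="$1" '
        $6 == inode { peer = $8 }            # remember the peer inode of our socket
        { path[$6] = $5; owner[$6] = $9 }    # index every socket by its local inode
        END {
            if (peer != "" && (peer in owner))
                print path[peer], owner[peer]
        }
    '
}

# usage on a live system (root is needed to see other users' processes):
#   ss -xp | unix_peer_of 254243
```

Fed the two lines above with inode 254243, it prints the libvirtd line's path and process.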

show all sockets

$ netstat -a -t #all tcp sockets
$ netstat -a -u #all udp sockets
$ netstat -a -w #all raw sockets
$ netstat -a -x #all unix sockets
$ netstat -tulpn # -l listening sockets only, -p show program name, -n numeric addresses
Note: without -a (or -l), netstat only shows non-listening sockets, such as established connections

show all files opened by a process; -P prints port numbers instead of service names (add -n for numeric hosts)
$ lsof -p pid -P
show which process opens a socket on a particular port
$ lsof -i :80

check/set mtu or MAC

$ ifconfig eth1
$ ifconfig eth1 mtu 1500

temporary
$ ifconfig eth0 down
$ ifconfig eth0 hw ether 00:80:48:BA:d1:30
$ ifconfig eth0 up

permanently
Centos7
edit /etc/sysconfig/network-scripts/ifcfg-eth0
MACADDR=02:01:02:03:04:0

bind a socket to a non-local address

With a normal bind(), the IP address must be one of the host’s own addresses.

bind() is used in two cases:

server: socket/bind/listen/accept. The server only accepts packets whose destination is the bound address.
client: socket/bind/send. The client uses the bound address as the source address of the packets it sends.

In some cases you want to bind to a non-local address (one that is not configured on the host). Load balancers typically need this.

global setting, so that every application can bind to a non-local address
$ echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind

enable it on one socket of a process
In the program code, call setsockopt() with IP_TRANSPARENT set to 1, then grant the capability before the program runs
$ sudo setcap cap_net_admin+ep program

IP_TRANSPARENT (since Linux 2.6.24)
Setting this boolean option enables transparent proxying on
this socket. This socket option allows the calling
application to bind to a nonlocal IP address and operate both
as a client and a server with the foreign address as the local
endpoint.

NOTE: to receive packets for a non-local address, routing must be set up so that packets going to the foreign address are routed through the TProxy box (i.e., the system hosting the application that employs the IP_TRANSPARENT socket option). Make sure the packets are not dropped before they reach the tproxy module in the kernel.

Enabling this socket option requires superuser privilege(the CAP_NET_ADMIN capability).

source port selection

if the source port is not set explicitly with bind(), it is selected from the range below
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999

change the source port range (quote the value, it contains a space)
$ sysctl -w net.ipv4.ip_local_port_range="1024 4095"
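
The size of this range matters because every outgoing connection from one source address to the same destination ip:port needs its own source port. A small sketch that computes the size; port_range_size is a made-up helper that reads the "low high" pair from stdin.

```shell
# Print how many ports a "low high" range contains (both ends inclusive).
# Reads the pair from stdin, e.g. < /proc/sys/net/ipv4/ip_local_port_range
port_range_size() {
    read -r low high
    echo $(( high - low + 1 ))
}

# usage on a live system:
#   port_range_size < /proc/sys/net/ipv4/ip_local_port_range
```

For the default range shown above (32768 60999) this gives 28232 ports.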

create vlan interface and check its real dev

A vlan interface is a virtual device, so it must be attached to a ‘real’ device. ‘Real’ does not mean physical, it can be another virtual device as well. When traffic is sent on the vlan interface, the vlan id is added and the packet is handed to the ‘real’ device through its dev_queue_xmit().

# the 'real' device must be given when creating a vlan interface; two vlan interfaces can point to the same 'real' device.
$ ip link add link ens192 name ens192.100 type vlan id 100
$ ip link set dev ens192.100 up
$ ip -d link show ens192.100
446: ens192.100@ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:50:56:b2:01:75 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 100 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

# ens192.100@ens192  'real device' ens192
# vlan protocol 802.1Q id 100

$ ip link add link ens192 name ens192.200 type vlan id 200
$ ip link set dev ens192.200 up
$ ip -d link show ens192.200
447: ens192.200@ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:50:56:b2:01:75 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 200 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535


# ens192.200@ens192  'real device' ens192
# vlan protocol 802.1Q id 200
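
To pull the 'real' device and the vlan id out of that output in a script, something like the following could work. It is a sketch that assumes the ip -d link show layout printed above (name@real in the second field of the first line, "id N" on the vlan line); vlan_info is a hypothetical name.

```shell
# Read `ip -d link show <dev>` output on stdin and print
# "<vlan interface> <real device> <vlan id>".
vlan_info() {
    awk '
        /^[0-9]+: / {                       # header line: "446: ens192.100@ens192: <...>"
            split($2, a, "@")
            name = a[1]
            real = a[2]
            sub(/:$/, "", real)             # drop the trailing colon
        }
        $1 == "vlan" {                      # detail line: "vlan protocol 802.1Q id 100 ..."
            for (i = 1; i < NF; i++)
                if ($i == "id") print name, real, $(i + 1)
        }
    '
}

# usage on a live system:
#   ip -d link show ens192.100 | vlan_info
```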

tcp fastopen cookie

When TCP fastopen is enabled on both the client and the server, the client sends a cookie request as a TCP option in its first SYN, and the server returns the cookie as a TCP option as well. The client kernel saves the cookie in its per-destination metrics cache and sends it in the next SYN to the same destination.

# check cookie saved by kernel
$ sudo ip tcp_metrics
10.10.10.10 age 4.764sec cwnd 10 rtt 110us rttvar 188us fo_mss 65495 fo_cookie a427a77724e9229f source 10.117.6.2
# fo_cookie is sent by the server and saved by the client kernel

# check cookie only for dst 10.10.10.10
$ sudo ip tcp_metrics 10.10.10.10

# flush cookie
$ sudo ip tcp_metrics flush
$ sudo ip tcp_metrics flush 10.10.10.10

NOTE: when TCP fastopen is used, the client sets up the connection with sendto() or sendmsg() and the MSG_FASTOPEN flag instead of connect(), so that data can travel in the SYN
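
Whether fastopen is enabled "on both sides" can be checked per host through the net.ipv4.tcp_fastopen sysctl, which is a bitmask. A minimal sketch that decodes the two basic bits (0x1 client, 0x2 server) and ignores the other bits of that sysctl; tfo_mode is a made-up helper name.

```shell
# Describe a net.ipv4.tcp_fastopen value:
#   bit 0x1 enables fastopen for outgoing connections (client side)
#   bit 0x2 enables fastopen for listening sockets (server side)
tfo_mode() {
    v=$(( $1 ))
    mode=""
    [ $(( v & 1 )) -ne 0 ] && mode="client"
    [ $(( v & 2 )) -ne 0 ] && mode="${mode:+$mode }server"
    printf '%s\n' "${mode:-disabled}"
}

# usage on a live system:
#   tfo_mode "$(cat /proc/sys/net/ipv4/tcp_fastopen)"
```

A value of 3 therefore means the host can act as both fastopen client and server.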

How to know if a network interface is tap, tun, bridge or physical

ethtool -i tunOrTapDeviceName

In case of a TAP device we will get: “bus-info: tap”.
In case of a TUN device we will get: “bus-info: tun”.

$ ethtool -i vnet0
driver: tun
version: 1.6
firmware-version: 
expansion-rom-version: 
bus-info: tap
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
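
ethtool covers tun and tap; for bridge and physical devices sysfs can be used as well. The sketch below is one possible classification, under these assumptions: tun/tap devices expose a tun_flags file (bit 0x1 = tun, bit 0x2 = tap), bridges have a bridge directory, and devices backed by a bus device have a device link. Note that the last case also matches virtio NICs and SR-IOV VFs, so "physical" is a loose label. iface_kind is a hypothetical name.

```shell
# Classify an interface from its sysfs directory, e.g. /sys/class/net/vnet0
iface_kind() {
    dir=$1
    if [ -f "$dir/tun_flags" ]; then
        flags=$(cat "$dir/tun_flags")        # hex value such as 0x1002
        if [ $(( $flags & 0x2 )) -ne 0 ]; then echo tap; else echo tun; fi
    elif [ -d "$dir/bridge" ]; then
        echo bridge
    elif [ -e "$dir/device" ]; then
        echo physical                        # backed by a bus device (PCI, virtio, ...)
    else
        echo other                           # e.g. lo, vlan, veth
    fi
}

# usage on a live system:
#   for d in /sys/class/net/*; do echo "${d##*/}: $(iface_kind "$d")"; done
```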

bind irq(s) of ethx to specific cpu

irq binding

do it by yourself

get irq of the ethx
set affinity of each irq

use vendor script

Mellanox: /usr/sbin/set_irq_affinity_bynode.sh socket ethN
Chelsio: /sbin/t4_perftune.sh

# get the irqs of eth0; a multi-queue NIC has one irq number per queue
$ grep eth0-TxRx /proc/interrupts | awk '{printf "  %s\n", $1}'
103:
104:
105:
106:
107:
108:
109:
110:
111:
112:
113:
114:
115:
116:
117:
118:
...

# pin irq 103 to cpu 0; the value is a hexadecimal cpu bitmask (1 = cpu 0, 2 = cpu 1, 4 = cpu 2)
$ echo 1 > /proc/irq/103/smp_affinity
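
Since smp_affinity takes a hex bitmask rather than a cpu number, a helper that converts one into the other is handy when pinning many irqs. A minimal sketch; it assumes the kernel's comma-separated 32-bit group format for cpus above 31, and cpu_to_affinity_mask is a made-up name.

```shell
# Convert a cpu number into the hex bitmask expected by /proc/irq/N/smp_affinity.
# cpus >= 32 need extra comma-separated 32-bit groups, e.g. cpu 32 -> 1,00000000
cpu_to_affinity_mask() {
    cpu=$1
    word=$(( cpu / 32 ))                     # how many full 32-bit groups follow
    bit=$(( cpu % 32 ))
    mask=$(printf '%x' $(( 1 << bit )))
    while [ "$word" -gt 0 ]; do
        mask="$mask,00000000"
        word=$(( word - 1 ))
    done
    printf '%s\n' "$mask"
}

# usage on a live system (as root): pin irq 103 to cpu 2
#   cpu_to_affinity_mask 2 > /proc/irq/103/smp_affinity
```

If the irqbalance service is running it may rewrite these values, so it is usually stopped before pinning by hand.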

check rss of ethx

#+++++++++++++++++++++++++++++ One Way++++++++++++++++++++++++++++++++++++++++
# if rss is enabled, ethx should have multiple queues

# with hardware rss 
$ls /sys/class/net/eth0/queues/
rx-0   rx-12  rx-16  rx-2   rx-23  rx-27  rx-30  rx-34  rx-38  rx-41  rx-45  rx-49  rx-52  rx-56  rx-6   rx-7  tx-1   tx-13  tx-17  tx-20  tx-24  tx-28  tx-31  tx-35  tx-39  tx-42  tx-46  tx-5   tx-53  tx-57  tx-60  tx-8
rx-1   rx-13  rx-17  rx-20  rx-24  rx-28  rx-31  rx-35  rx-39  rx-42  rx-46  rx-5   rx-53  rx-57  rx-60  rx-8  tx-10  tx-14  tx-18  tx-21  tx-25  tx-29  tx-32  tx-36  tx-4   tx-43  tx-47  tx-50  tx-54  tx-58  tx-61  tx-9
rx-10  rx-14  rx-18  rx-21  rx-25  rx-29  rx-32  rx-36  rx-4   rx-43  rx-47  rx-50  rx-54  rx-58  rx-61  rx-9  tx-11  tx-15  tx-19  tx-22  tx-26  tx-3   tx-33  tx-37  tx-40  tx-44  tx-48  tx-51  tx-55  tx-59  tx-62
rx-11  rx-15  rx-19  rx-22  rx-26  rx-3   rx-33  rx-37  rx-40  rx-44  rx-48  rx-51  rx-55  rx-59  rx-62  tx-0  tx-12  tx-16  tx-2   tx-23  tx-27  tx-30  tx-34  tx-38  tx-41  tx-45  tx-49  tx-52  tx-56  tx-6   tx-7

# no rss
$ls /sys/class/net/eth0/queues/
rx-0 tx-0

# with RSS there are multiple irqs; each irq serves one queue pair (rx-N and tx-N)

#+++++++++++++++++++++++++++++ Another Way++++++++++++++++++++++++++++++++++++++++
$ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	1 --------------------------> one queue, no RSS
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	1

$ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:		0
TX:		0
Other:		1
Combined:	63------------------------->63 queues (63 rx and 63 tx), RSS supported by hardware!!!
Current hardware settings:
RX:		0
TX:		0
Other:		1
Combined:	63
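
The first way (looking at the queues directory) is easy to script. A small sketch, where count_queues is a hypothetical helper that takes the queues directory and the prefix (rx or tx):

```shell
# Count the rx-* or tx-* entries in a queues directory,
# e.g. /sys/class/net/eth0/queues
count_queues() {
    n=0
    for q in "$1"/"$2"-*; do
        # the glob stays literal when nothing matches, so check existence
        [ -e "$q" ] && n=$(( n + 1 ))
    done
    echo "$n"
}

# usage on a live system: more than one rx queue suggests RSS is in use
#   count_queues /sys/class/net/eth0/queues rx
```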

get bus info for a network device like eth0

################# use ethtool=====================

# you can also get bus info, driver info etc
$ethtool -i eth0
driver: mlx5_core      --------------------> driver info
version: 5.6-1.0.3
firmware-version: 16.33.1048 (MT_0000000241)
expansion-rom-version: 
bus-info: 0000:4b:00.0 --------------------> bus info here
supports-statistics: yes
...

################ check /sys filesystem ===========
$grep PCI_SLOT_NAME /sys/class/net/*/device/uevent | grep eth0
/sys/class/net/eth0/device/uevent:PCI_SLOT_NAME=0000:4b:00.0

# check device for this (it's physical function)
$lspci -D | grep 0000:4b:00.0
0000:4b:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

# show more details about this device

$lspci -vv -s 0000:4b:00.0
4b:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] --------------------->Mellanox ConnectX-5
	Subsystem: Mellanox Technologies Device 0052
	Physical Slot: 19                                                          ---------------------->physical Slot
    ...
	Interrupt: pin A routed to IRQ 18                                          ----------------------> IRQ
	NUMA node: 0                                                               ----------------------> Numa node it belongs to
	Region 0: Memory at 9c000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at 99000000 [disabled] [size=1M]
    ....
	Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)             ----------------------->Caps like SR-IOV
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 127, Total VFs: 127, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 2, stride: 1, Device ID: 1018
		Supported Page Size: 000007ff, System Page Size: 00000001
		Region 0: Memory at 00000000a5f00000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
    ...
	Kernel driver in use: mlx5_core                                           -------------------------> driver in use
	Kernel modules: mlx5_core

####################### another example##################################
$ethtool -i enp75s17f2np0
driver: mlx5_core
version: 5.6-1.0.3
firmware-version: 16.33.1048 (MT_0000000241)
expansion-rom-version: 
bus-info: 0000:4b:11.2
....

$lspci -D | grep 0000:4b:11.2
0000:4b:11.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]----> VF

# show only given pci device
$lspci -vv -s 0000:4b:11.2
4b:11.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]  -----> Mellanox VF
	Subsystem: Mellanox Technologies Device 0052
	NUMA node: 0                                                                                 ------> no physical slot, no IRQ; NUMA node it belongs to
	Region 0: [virtual] Memory at 9e900000 (64-bit, prefetchable) [size=1M]

    ...
    Capabilities: [9c] MSI-X: Enable+ Count=6 Masked-                                            -------> MSI-X with six vectors
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000

	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
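
The sysfs lookup and the PF/VF distinction can also be scripted without lspci. A minimal sketch with two made-up helpers; it assumes that the device uevent file carries a PCI_SLOT_NAME line (as shown above) and that the kernel creates a physfn link only in the device directory of an SR-IOV virtual function.

```shell
# Print the PCI address recorded in a device uevent file,
# e.g. /sys/class/net/eth0/device/uevent
pci_slot_name() {
    sed -n 's/^PCI_SLOT_NAME=//p' "$1"
}

# Succeed when the device directory (e.g. /sys/class/net/eth0/device)
# has a physfn entry, i.e. the device is a virtual function.
is_virtual_function() {
    [ -e "$1/physfn" ]
}

# usage on a live system:
#   pci_slot_name /sys/class/net/eth0/device/uevent
#   is_virtual_function /sys/class/net/enp75s17f2np0/device && echo VF
```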

create bonding interface through sysfs

# load the bonding module, then list the existing bonding interfaces
$ modprobe --first-time bonding
$ cat /sys/class/net/bonding_masters

# create the bonding interface
$ echo +bond1 > /sys/class/net/bonding_masters
(then run ifconfig -a, or check that /sys/class/net/bond1 was created)

# set the mode while the bond is still down and has no slaves
$ echo active-backup > /sys/class/net/bond1/bonding/mode

# add slaves to the bonding interface
$ ifconfig eth0 down
$ ifconfig eth1 down
$ echo +eth0 > /sys/class/net/bond1/bonding/slaves
$ echo +eth1 > /sys/class/net/bond1/bonding/slaves

# configure the address of the bonding interface
$ ifconfig bond1 192.168.100.1 netmask 255.255.255.0

# enable the bonding interface
$ ifconfig bond1 up
# (this also brings up all of its slaves)

# NOTE: to undo, remove in reverse order with the "-" prefix

icmp ping is ok but tcp connect fails

TCP connect fails with ‘No route to host’ while ping is ok: the packet is probably dropped by a firewall.

# --- Way1: use tcpdump to see where the packet is dropped
# 
04:10:28.906424 IP dev-162 > 22.7.73.161: ICMP host dev-162 unreachable - admin prohibited filter, length 68 -----> as you can see, "admin prohibited" means a firewall rejects it.

# --- Way2: check iptables rules
$ sudo iptables -L -v

# --- Way3: check firewalld (firewalld can use iptables or nftables as backend)
# with the nftables backend the rules do not show up in iptables -L; use 'sudo nft list ruleset' instead
$ sudo systemctl status firewalld

# To see the current configuration, including allowed services and ports, use:
$ sudo firewall-cmd --list-all

# For a more detailed view of all active zones and their settings, use:
$ sudo firewall-cmd --get-active-zones
$ sudo firewall-cmd --get-default-zone

# list rich rules
$ sudo firewall-cmd --list-rich-rules