performance tools

Performance knowledge

Memory usage metrics

Show process memory usage with top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

1 root 20 0 38116 6136 3984 S 0.0 0.0 0:05.71 systemd


VIRT(VSS): The total amount of virtual memory used by the task.
It includes all code, data and shared libraries, plus pages that have been swapped out (it is not the physical memory currently in use).

RES(RSS): The non-swapped physical memory a task has used (CODE + DATA).

SHR: The amount of shared memory used by a task.
It simply reflects memory that could potentially be shared with other processes.

%MEM: The task's share of physical memory, based on RES.

Show process memory usage with smem (metrics used by smem and ps)

Swap: swap space used by each process

VSS (virtual set size)
VSS (reported as VSZ by ps) is the total accessible address space of a process
(all allocated virtual addresses: malloc'd regions, the stack, mappings such as shared libraries).
This size also includes memory that may not be resident in RAM, such as allocations that have been made but never written to.
VSS is of very little use for determining the real memory usage of a process.

RSS (resident set size)
RSS is the total memory actually held in RAM for a process. RSS can be misleading,
because it counts the full size of every shared library the process uses,
even though a shared library is only loaded into memory once regardless of how many processes use it.
RSS is not an accurate representation of the memory usage of a single process.

PSS (proportional set size)
PSS differs from RSS in that it reports the proportional size of shared libraries:
if three processes all use a shared library that has 30 pages,
that library will only contribute 10 pages to the PSS reported for each of the three processes.
PSS is a very useful number because when the PSS of all processes in the system is summed,
the result is a good representation of the total memory usage of the system.
When a process is killed, the shared library pages that contributed to its PSS are proportionally redistributed to
the PSS totals of the remaining processes still using that library.
In this way PSS can be slightly misleading, because when a process is killed, its PSS does not accurately represent the memory returned to the overall system.

USS (unique set size)
USS is the total private memory of a process, i.e. the memory that is completely unique to that process.
USS is an extremely useful number because it indicates the true incremental cost of running a particular process.
When a process is killed, the USS is the total memory that is actually returned to the system.
USS is the best number to watch when you first suspect a memory leak in a process.

For example, suppose two processes share a library that occupies 2M of physical memory:

            VSS  RSS  PSS  USS
process A   20M  18M  17M  16M
process B   20M  19M  18M  17M

(RSS = USS + shared_library_memory, PSS = USS + shared_library_memory / sharing_process_count)
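The relationships in this example can be sketched in a few lines of Python (the helper names are my own; the 2M shared library and the USS values come from the table above):

```python
def rss(uss_mb, shared_mb):
    # RSS charges the full shared mapping to every process that uses it
    return uss_mb + shared_mb

def pss(uss_mb, shared_mb, n_sharers):
    # PSS splits the shared mapping evenly among the processes sharing it
    return uss_mb + shared_mb / n_sharers

SHARED_MB = 2  # the library occupies 2M of physical memory, shared by 2 processes

for name, uss in [("process A", 16), ("process B", 17)]:
    print(name, "RSS =", rss(uss, SHARED_MB), "PSS =", pss(uss, SHARED_MB, 2))
```

Note that the two PSS values sum to 35M, counting the shared 2M exactly once, while the RSS values sum to 37M, double-counting it.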

load average

The load average is the average system load on a Linux server over a defined period of time. In other words, it is the CPU demand on a server: the sum of the running and the waiting threads. On Linux it tracks not only running tasks, but also tasks in uninterruptible sleep (usually waiting for IO).

Measuring the load average is critical to understanding how your servers are performing; if they are overloaded, you need to kill or optimize the processes consuming large amounts of resources, or provide more resources to balance the workload.

To keep it simple, assume a server with a single processor: if the load is less than 1, then on average every process that needed the CPU could use it immediately without being blocked. Conversely, if the load is greater than 1, then on average there were processes ready to run that could not, because no CPU was available.

For a single processor, is the ideal load average 1.00, with anything above that a call to troubleshoot? Although that is a safe bet, a more proactive approach is to leave some extra headroom to absorb unexpected load; many people aim for a load of about 0.7 to cater for spikes.

Whether the system is overloaded or not depends on how many logical CPUs (cores/hardware threads) you have.

You probably have a system with multiple CPUs. The load average numbers work a bit differently on such a system. For example, if you have a load average of 2 on a single-CPU system, your system was overloaded by 100 percent: for the entire period, one process was using the CPU while another was waiting. On a system with two CPUs, this would be full usage: two processes were using the two CPUs the entire time. On a system with four CPUs, this would be half usage: two processes were using two CPUs, while two CPUs sat idle.

check load average

(py3.9) [root@dev ~]# uptime
14:41:58 up 11 days, 23:10, 3 users, load average: 1.68, 0.55, 5.91
# These numbers are the averages of the system load over a period of one, five, and 15 minutes
  • The first value is 1.68. This is the CPU load during the last minute: a measure of how many programs (processes in the ready state) were using or waiting for CPU time during the last minute. So, during the last minute on this machine, there were on average 1.68 programs either using CPU processing time or waiting for it. If this is a single-core CPU, the computer is overloaded: users are waiting for their programs to run on the CPU and experiencing degraded performance. If, instead, this is a dual-core or quad-core machine, users were able to get CPU time just as quickly as they needed it during the last minute.

  • The second value is 0.55. This is the measurement over the last 5 minutes. As previously discussed, a measurement below 1 means that the CPU spent some of that window completely idle. In this case, the CPU was idle for almost half the time. If we're optimizing our CPU to be constantly doing something, that's not a good sign.

  • The final number, 5.91, is the measurement over the last 15 minutes. If you're using an eight-core CPU, this number isn't particularly shocking. If you're using a dual-core CPU, a number like 5.91 means your CPU is heavily overloaded: users are regularly waiting for CPU time and probably experiencing significantly degraded performance.
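A quick way to put these numbers in context is to normalize them by the number of logical CPUs; a minimal Python sketch (the function name and the 4-CPU figure are my own assumptions; on a live system you would use os.getloadavg() and os.cpu_count()):

```python
def per_cpu_load(loadavg, ncpus):
    # below 1.0: spare capacity on average; above 1.0: runnable tasks had to wait
    return loadavg / ncpus

# the three samples from the uptime output above, assuming a 4-CPU box
for load in (1.68, 0.55, 5.91):
    ratio = per_cpu_load(load, 4)
    state = "overloaded" if ratio > 1.0 else "ok"
    print(f"load {load}: {ratio:.2f} per CPU -> {state}")
```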

troubleshooting high load average

# show current load average
$ uptime


########################################################## cpu usage per cpu====================================
# show per-cpu usage periodically
$ mpstat -P ALL 1
########################################################## cpu usage per cpu====================================

########################################################## cpu usage per process====================================
# show all active processes' cpu usage (like top)
$ pidstat 1
# show cpu usage of a given process periodically (here every 2 seconds)
# $ pidstat -p 823471 2
$ pidstat -p 823471 1
Linux 3.10.0-693.21.7.el7.x86_64 (A06-R08-I132-181-815KSRH.JCLOUD.COM) 07/20/2022 _x86_64_ (32 CPU)
04:42:46 PM UID PID %usr %system %guest %CPU CPU Command
04:42:47 PM 0 823471 33.00 5.00 0.00 38.00 10 node_monitor
04:42:48 PM 0 823471 54.00 7.00 0.00 61.00 10 node_monitor
04:42:49 PM 0 823471 46.00 4.00 0.00 50.00 10 node_monitor
04:42:50 PM 0 823471 31.00 7.00 0.00 38.00 10 node_monitor
# show cpu usage of a given process, broken down per thread, periodically
$ pidstat -p 823471 -t 1
########################################################## cpu usage per process====================================


########################################################## cpu schedule latency per process=========================

# CPU run queue latency, schedule latency for each process
# monitor 10 seconds
$ perf sched record -- sleep 10
$ perf sched latency
-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
:632308:632308 | 0.478 ms | 2 | avg: 47.545 ms | max: 95.088 ms | max at: 1397496.686810 s
:632316:632316 | 13.820 ms | 8 | avg: 23.729 ms | max: 94.070 ms | max at: 1397496.685531 s
ovs-vsctl:(3) | 107.932 ms | 31 | avg: 14.581 ms | max: 95.035 ms | max at: 1397496.485498 s
sh:(2) | 6.058 ms | 11 | avg: 8.146 ms | max: 87.049 ms | max at: 1397495.885529 s
sleep:(26) | 17.129 ms | 75 | avg: 6.211 ms | max: 94.541 ms | max at: 1397496.386032 s
:632320:632320 | 3.305 ms | 18 | avg: 5.288 ms | max: 94.205 ms | max at: 1397495.785685 s
node_monitor:(76) | 342.577 ms | 2342 | avg: 4.063 ms | max: 192.182 ms | max at: 1397496.385540 s
perf:(183) | 569.297 ms | 1597 | avg: 3.975 ms | max: 196.652 ms | max at: 1397496.386045 s
kworker/25:2:461486 | 0.009 ms | 1 | avg: 0.677 ms | max: 0.677 ms | max at: 1397496.187200 s
kworker/9:1:30855 | 0.012 ms | 1 | avg: 0.379 ms | max: 0.379 ms | max at: 1397496.186873 s
kworker/1:6:615135 | 0.022 ms | 2 | avg: 0.358 ms | max: 0.711 ms | max at: 1397496.189183 s
kworker/17:1:715010 | 0.009 ms | 1 | avg: 0.357 ms | max: 0.357 ms | max at: 1397496.186876 s
kworker/31:1:303204 | 0.011 ms | 1 | avg: 0.314 ms | max: 0.314 ms | max at: 1397496.186854 s
:632321:632321 | 3.882 ms | 7 | avg: 0.313 ms | max: 1.140 ms | max at: 1397495.887512 s
:632319:632319 | 0.815 ms | 6 | avg: 0.258 ms | max: 1.277 ms | max at: 1397495.792539 s
kworker/3:1:86183 | 0.010 ms | 1 | avg: 0.226 ms | max: 0.226 ms | max at: 1397496.186715 s
:632317:632317 | 0.866 ms | 7 | avg: 0.216 ms | max: 1.099 ms | max at: 1397495.794319 s
kworker/14:1:197420 | 0.010 ms | 1 | avg: 0.215 ms | max: 0.215 ms | max at: 1397496.186723 s
:632307:632307 | 0.872 ms | 8 | avg: 0.205 ms | max: 1.420 ms | max at: 1397496.188524 s
kworker/26:0:308384 | 0.011 ms | 1 | avg: 0.193 ms | max: 0.193 ms | max at: 1397496.186720 s
:632301:632301 | 0.275 ms | 4 | avg: 0.165 ms | max: 0.658 ms | max at: 1397496.591534 s

########################################################## cpu schedule latency per process=========================

CPU usage

CPU usage is a measurement, as a percentage, of how much time the CPU spends actively computing something. For instance, if a program required uninterrupted processing power for 54 of the last 60 seconds, your CPU usage on one core would be 90%. If, instead, the program only required six seconds of processing time on one core, the usage would be 10%.

Most companies seek to keep the CPU usage of their servers as close to 100% as possible. Most servers are sold by overall computing power, and if your server is only sitting at 30% CPU usage, you’re paying for too much processor power. You could downgrade your processor to a lower tier, save money, and see no reduction in the quality of your server’s performance.

cpu usage vs load average

CPU usage: The ratio (usually expressed as a percentage) of time that the CPU is busy doing work. This measure only makes sense if you know over which period the percentage is calculated.

Load: Average queue length for the CPU - including the process currently executing. For this to make sense, you need to know the period over which this is being measured.

They are related, but one does not necessarily correlate to the other.

Imagine this scenario - with slightly contrived numbers: An ideal world with a single CPU. No scheduling overhead, no I/O overhead. Just keeping things simple.

  • You have 100 processes waiting for something.
  • When that “something” happens, each process will need 0.05 seconds of CPU time to do stuff in response.
  • When “something” does not happen, you have 0% CPU utilisation, and a queue length of 0. Basically stuff is just waiting. Life is good, and you’re merely wasting electrons and heating up the planet.
  • “something” happens. All 100 processes wake up. Your queue length jumps to 100, and your CPU is busy.
  • 0.05 seconds later, your queue length is 99 as the first process has finished doing “stuff”. CPU is still busy.
  • After 0.1 seconds, your queue length is 98 as the 2nd process has finished doing “stuff”. CPU is still busy.
  • Every 0.05 your queue length drops by 1 as a process finishes. CPU remains busy.
  • After 5 seconds, all the processes have finished; CPU becomes idle again and your queue length is back to zero.
  • Your CPU utilisation over the last 60 seconds is now: 5/60 = 8.33%. But your average queue length (=load average) over the last 60 seconds will be about 4.2.
    len:  100  99    98    97    ...  1     0    ...  0
    time: 0    0.05  0.1   0.15  ...  4.95  5    ...  59

    average = (100 + 99 + ... + 1 + 0 + ... + 0) / (59/0.05 + 1) ≈ 4.27

Looking at the 1-minute CPU utilisation alone (8.33%), you look good. But the 1-minute load average (4.2) shows that you have a performance bottleneck during that minute. Whether this is “bad” or not depends on whether you want it to be faster - do you need to respond to “something” happening more frequently than every 5 seconds?
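The arithmetic of this toy scenario can be checked with a short Python sketch, sampling the queue length every 0.05 s over the 60-second window as in the example:

```python
# 100 processes wake at t=0, each needs 0.05 s of CPU on a single CPU,
# so one process finishes per 0.05 s tick until the queue drains at t=5.
N_SAMPLES = 59 * 20 + 1                              # ticks from t=0 to t=59
queue = [max(0, 100 - k) for k in range(N_SAMPLES)]  # queue length at tick k

load_avg = sum(queue) / len(queue)
cpu_util = 5 / 60                                    # CPU busy for 5 s out of 60

print(f"average queue length ~ {load_avg:.2f}")      # ~4.28
print(f"CPU utilisation      ~ {cpu_util:.2%}")      # ~8.33%
```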

NOTE

  • Load average is always high for bursty load (many processes become runnable at the same time), even if each runs only briefly, so load can be high while CPU usage is not.
  • Too many D-state processes: a process in state D (uninterruptible sleep, usually waiting for I/O) is counted in the load average as well.

useful commands and performance tools

To debug a performance issue there are lots of tools that can help us identify the problem, but some of them are old and some are new, so here we only introduce the newer tools that are used today.

Old tools

  • gprof
  • OProfile

New tools

  • gperftools
    gperftools is newer, developed by Google since 2007; it is simpler and works only from the process's point of view (the stacks of a process).
  • perf
    perf has been in the kernel source tree (upstream) since 2009; it is more complex and can show
    more information from a system-wide view. It uses hardware counters to profile the application.
    The results of this profiler are very precise, and because it does not instrument the code, it is really fast.

perf can profile a process (stacks from the kernel down to the process) or the whole system (when no process id is given)

useful commands


show cpu load
Each running process either using or waiting for CPU resources adds 1 to the load. So, if your system has a load of 5, five processes are either using or waiting for the CPU. An instantaneous load number by itself doesn't mean much: a computer might have a load of 0 one split-second, and a load of 5 the next split-second as several processes use the CPU. Even if you could see the load at any given time, that number would be basically meaningless. That's why Unix-like systems don't display the current load. They display the load average: an average of the computer's load over several periods of time. This lets you see how much work your computer has been performing.

# uptime
10:11:01 up 18:57, 4 users, load average: 0.50, 2.13, 1.85
From left to right, these numbers show you the average load over the last one minute, the last five minutes, and the last fifteen minutes

show how much time a process spends in sys and user mode

Real time is wall-clock time (what we could measure with a stopwatch).
User time is the amount of CPU time spent in user mode within the process.
Sys time is the CPU time spent in the kernel on behalf of the process.

NOTE: real can be less than user if the app is multi-threaded or multi-process!!!

The rule of thumb is:
real < user: The process is CPU bound and takes advantage of parallel execution on multiple cores/CPUs.
real ≈ user: The process is CPU bound and takes no advantage of parallel execution.
real > user: The process is I/O bound. Execution on multiple cores would be of little to no advantage.

#time ls
share windows
real 0m0.002s
user 0m0.001s
sys 0m0.001s
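The rule of thumb above can be turned into a tiny classifier; a Python sketch (the function name and the 10% tolerance band are my own choices):

```python
def classify(real, user, tol=0.1):
    # compare wall-clock time against user CPU time, with a 10% tolerance band
    if real < user * (1 - tol):
        return "CPU bound, parallel"
    if abs(real - user) <= tol * max(real, user):
        return "CPU bound, serial"
    return "I/O bound"

print(classify(2.0, 7.6))   # a parallel build: real < user
print(classify(5.1, 5.0))   # a single-threaded number cruncher
print(classify(9.0, 0.3))   # a process that mostly waits on disk
```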

show latency of RT linux kernel

#cyclictest
(git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git)

show slab info

#cat /proc/slabinfo
#slabtop
Active / Total Objects (% used) : 133629 / 147300 (90.7%)
Active / Total Slabs (% used) : 11492 / 11493 (100.0%)
Active / Total Caches (% used) : 77 / 121 (63.6%)
Active / Total Size (% used) : 41739.83K / 44081.89K (94.7%)
Minimum / Average / Maximum Object : 0.01K / 0.30K / 128.00K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
44814 43159 96% 0.62K 7469 6 29876K ext3_inode_cache
36900 34614 93% 0.05K 492 75 1968K buffer_head
35213 33124 94% 0.16K 1531 23 6124K dentry_cache
7364 6463 87% 0.27K 526 14 2104K radix_tree_node
1280 1015 79% 0.25K 40 32 320K kmalloc-256 ---> two pages for one slab
(Note: the slab management overhead is not counted here!!!, but it's small)

Each cache may have many slabs (empty, partial, full); each slab is one or more pages
(PAGE_SIZE, usually 4K)!

USE = (ACTIVE / OBJS) * 100%
OBJS = SLABS * (OBJ/SLAB)
OBJ/SLAB = (4K * n) / OBJ_SIZE    (n = pages per slab)
CACHE SIZE = SLABS * (4K * n)
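These relationships can be checked against the kmalloc-256 row above (0.25K objects, 32 objects per slab, so n = 2 pages per slab); a quick Python sketch:

```python
PAGE_K = 4  # page size in KiB

# the kmalloc-256 row from the slabtop output above
objs, active, obj_size_k, slabs, obj_per_slab = 1280, 1015, 0.25, 40, 32

n_pages = int(obj_per_slab * obj_size_k / PAGE_K)        # pages per slab: 8K / 4K = 2
print("pages per slab:", n_pages)
print("OBJS       =", slabs * obj_per_slab)              # 40 * 32 = 1280
print("CACHE SIZE =", slabs * PAGE_K * n_pages, "K")     # 40 * 8K = 320K
print("USE        =", round(active / objs * 100), "%")   # 1015/1280 ~ 79%
```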


show swap size used by each process
# smem
(smem reports sizes in KiB by default, so RSS 656 means 656K)
PID User Command Swap USS PSS RSS
2516 rabbitmq sh -c /usr/lib/rabbitmq/bin 0 96 116 656
1451 lightdm /bin/sh /usr/lib/lightdm/li 0 100 121 700
1130 root /bin/sh -e /proc/self/fd/9 0 100 122 680
1157 root /sbin/getty -8 38400 tty3 0 156 174 964

Show basic process information smem
Show library-oriented view smem -m
Show user-oriented view smem -u
Show system view smem -R 4G -K /path/to/vmlinux -w
Show totals and percentages smem -t -p
Show different columns smem -c "name user pss"
Sort by reverse RSS smem -s rss -r
Show processes filtered by mapping smem -M libxml
Show mappings filtered by process smem -m -P [e]volution
Read data from capture tarball smem --source capture.tar.gz
Show a bar chart labeled by pid smem --bar pid -c "pss uss"
Show a pie chart of RSS labeled by name smem --pie name -s rss

Show memory usage by 'free' command

$ free
total used free shared buff/cache available
Mem: 24687560 11825536 8579812 258488 4282212 12299492
Swap: 16774140 0 16774140

total== 11825536 + 8579812 + 4282212 == 24687560
available = free + the part of buff/cache that can be reclaimed

total: your total (physical) RAM (excluding a small bit that the kernel permanently reserves for itself at startup);
used: memory in use (total - free - buff/cache), i.e. what applications are currently using;
free: memory not in use for anything.

total = used + free + buff/cache

shared: memory used by tmpfs and shared memory;
buff/cache: buffers are used when writing data to disk, while the cache stores data read from disk in memory.

The last line (Swap:) gives information about swap space usage (i.e. memory contents that have been temporarily moved to disk).

To actually understand what the numbers mean, you need a bit of background about the virtual memory (VM) subsystem in Linux.
The short version: Linux (like most modern OSes) will always try to use free RAM for caching, so Mem: free will almost always be very low.
Caches are freed automatically if memory gets scarce, so they do not really matter.
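The identity total = used + free + buff/cache can be verified against the sample free output above:

```python
# numbers (KiB) taken from the `free` output above
total, used, free_mem, cache, available = 24687560, 11825536, 8579812, 4282212, 12299492

assert used + free_mem + cache == total          # every KiB is accounted for

# `available` exceeds `free` because most of buff/cache can be reclaimed on demand
reclaimable = available - free_mem
print("free:", free_mem, "K  available:", available, "K")
print("reclaimable part of buff/cache ~", reclaimable, "K")
```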

Inside exec()

In computing, exec is a functionality of an operating system that runs an executable file in the context of an already existing process,
replacing the previous executable. This act is also referred to as an overlay. It is especially important in Unix-like systems, although it exists elsewhere.
As a new process is not created, the process identifier (PID) does not change,
but the machine code, data, heap, and stack of the process are replaced by those of the new program.


====================================================SAR===================================================================================
sar (System Activity Report): shows system activity information; it gives more detail about cpu, memory, interrupts, io, power, network etc.
You can also check other commands for specific resources in the sections below

# -B reports paging statistics (covers both swapping of process pages and disk io paging)
# sar -B 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/12/2022 _x86_64_ (16 CPU)

05:14:19 PM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
05:14:24 PM 0.00 31948.80 10.00 0.00 136.20 0.00 0.00 0.00 0.00
05:14:29 PM 0.00 236544.00 10.80 0.00 57.40 0.00 0.00 0.00 0.00

# -W is about swapping of process memory (process pages are swapped to disk when there is not enough memory)
#sar -W 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/12/2022 _x86_64_ (16 CPU)

05:14:43 PM pswpin/s pswpout/s
05:14:48 PM 0.00 0.00
05:14:53 PM 0.00 0.00


Report I/O and transfer rate statistics
# sar -b 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:34:02 PM tps rtps wtps bread/s bwrtn/s
05:34:07 PM 0.00 0.00 0.00 0.00 0.00
05:34:12 PM 0.00 0.00 0.00 0.00 0.00

Report activity for each block device
# sar -d 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:34:20 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
05:34:25 PM dev8-0 0.20 0.00 6.40 32.00 0.00 1.00 1.00 0.02
05:34:25 PM dev253-0 0.20 0.00 6.40 32.00 0.00 1.00 1.00 0.02
05:34:25 PM dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:34:25 PM dev253-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

show interrupt per 5s
# sar -I ALL 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:37:36 PM INTR intr/s
05:37:41 PM 0 0.00
05:37:41 PM 1 0.00
05:37:41 PM 2 0.00
05:37:41 PM 3 0.00
05:37:41 PM 4 0.00
05:37:41 PM 5 0.00
05:37:41 PM 6 0.00
05:37:41 PM 7 0.00
05:37:41 PM 8 0.00
05:37:41 PM 9 0.00
05:37:41 PM 10 0.00
05:37:41 PM 11 0.00
05:37:41 PM 12 0.00
05:37:41 PM 13 0.00
05:37:41 PM 14 0.80
05:37:41 PM 15 0.00

show power management
$ sar -m ALL 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:40:01 PM CPU MHz
05:40:11 PM all 1258.82

05:40:01 PM TEMP degC %temp DEVICE
05:40:11 PM 1 43.00 55.84 coretemp-isa-0000
05:40:11 PM 2 38.00 49.35 coretemp-isa-0000
05:40:11 PM 3 37.00 48.05 coretemp-isa-0000
05:40:11 PM 4 34.00 44.16 coretemp-isa-0000
05:40:11 PM 5 38.00 49.35 coretemp-isa-0000
05:40:11 PM 6 33.00 42.86 coretemp-isa-0000
05:40:11 PM 7 34.00 44.16 coretemp-isa-0000
05:40:11 PM 8 37.00 48.05 coretemp-isa-0000
05:40:11 PM 9 35.00 45.45 coretemp-isa-0000
05:40:11 PM 10 44.00 57.14 coretemp-isa-0001
05:40:11 PM 11 36.00 46.75 coretemp-isa-0001
05:40:11 PM 12 36.00 46.75 coretemp-isa-0001
05:40:11 PM 13 37.00 48.05 coretemp-isa-0001
05:40:11 PM 14 37.00 48.05 coretemp-isa-0001
05:40:11 PM 15 34.00 44.16 coretemp-isa-0001
05:40:11 PM 16 36.00 46.75 coretemp-isa-0001
05:40:11 PM 17 34.00 44.16 coretemp-isa-0001
05:40:11 PM 18 34.00 44.16 coretemp-isa-0001

05:40:01 PM BUS idvendor idprod maxpower manufact product
05:40:11 PM 1 8087 800a 0
05:40:11 PM 2 8087 8002 0
05:40:11 PM 1 413c a001 200 no manufacturer Gadget USB HUB

show network stats; there are lots of fields, only some are listed
# sar -n ALL 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:43:15 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
05:43:25 PM tap_metadata 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM vxlan_sys_4789 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM br0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy_ns 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM ovs-system 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM eth0 2.10 2.00 0.23 0.19 0.00 0.00 0.00
05:43:25 PM lo 2.00 2.00 0.79 0.79 0.00 0.00 0.00
05:43:25 PM em2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em3 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

05:43:15 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s
05:43:25 PM tap_metadata 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM vxlan_sys_4789 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM br0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy_ns 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM ovs-system 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Report cpu queue length and load averages
# sar -P ALL -q 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:48:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
05:48:42 PM 0 1069 0.43 0.46 0.49 0

Report memory utilization statistics
# sar -r 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:46:25 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
05:46:35 PM 119478308 12269620 9.31 1868 9411160 5926572 3.99 4968572 5430432 1560

Report CPU utilization
# sar -P ALL -u 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:47:32 PM CPU %user %nice %system %iowait %steal %idle
05:47:42 PM all 1.39 0.00 0.40 0.00 0.00 98.20
05:47:42 PM 0 40.49 0.00 0.00 0.00 0.00 59.51
05:47:42 PM 1 0.00 0.00 0.30 0.00 0.00 99.70
05:47:42 PM 2 0.41 0.00 1.32 0.00 0.00 98.28
05:47:42 PM 3 0.10 0.00 0.20 0.00 0.00 99.70
05:47:42 PM 4 0.30 0.00 0.80 0.00 0.00 98.89
05:47:42 PM 5 0.20 0.00 0.40 0.00 0.00 99.40
05:47:42 PM 6 0.20 0.00 0.40 0.00 0.00 99.40
05:47:42 PM 7 0.00 0.00 0.20 0.00 0.00 99.80
05:47:42 PM 8 0.70 0.00 0.90 0.00 0.00 98.39

Report task creation and system switching activity
# sar -w 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:48:56 PM proc/s cswch/s
05:49:06 PM 2.30 12558.70
====================================================SAR===================================================================================

Show CPU stats
CPU utilization broken down into user, sys, and virtual processor (guest VM) time
# mpstat -P ALL -u
04:38:08 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
04:38:08 PM all 7.03 0.06 2.89 0.01 0.00 0.00 0.00 8.28 0.00 81.73
04:38:08 PM 0 0.88 0.08 3.26 0.01 0.00 0.19 0.00 10.02 0.00 85.56
04:38:08 PM 1 0.89 0.04 3.30 0.01 0.00 0.02 0.00 9.47 0.00 86.28
04:38:08 PM 2 0.83 0.08 3.15 0.01 0.00 0.01 0.00 10.15 0.00 85.78
04:38:08 PM 3 0.82 0.04 3.15 0.01 0.00 0.00 0.00 9.61 0.00 86.39
04:38:08 PM 4 1.03 0.07 4.59 0.01 0.00 0.01 0.00 12.51 0.00 81.78
04:38:08 PM 5 0.92 0.04 3.22 0.01 0.00 0.00 0.00 9.58 0.00 86.23
04:38:08 PM 6 1.10 0.07 4.63 0.01 0.00 0.00 0.00 12.52 0.00 81.66
04:38:08 PM 7 0.83 0.05 3.42 0.01 0.00 0.00 0.00 10.44 0.00 85.25
04:38:08 PM 8 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00

CPU soft irq
# mpstat -I SCPU
Linux 3.10.0-693.21.4.el7.x86_64 (A01-R15-I124-40-CCK4HP2.JCLOUD.COM) 10/22/2021 _x86_64_ (64 CPU)

04:39:57 PM CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s BLOCK_IOPOLL/s TASKLET/s SCHED/s HRTIMER/s RCU/s
04:39:57 PM 0 0.00 54.90 0.20 2.42 0.00 0.00 0.05 12.35 0.00 10.65
04:39:57 PM 1 0.00 41.11 0.00 0.48 0.04 0.00 7.10 43.85 0.00 7.03
04:39:57 PM 2 0.00 60.01 0.01 14.90 0.00 0.00 0.57 59.44 0.00 10.80
04:39:57 PM 3 0.00 33.81 0.00 0.50 0.04 0.00 0.00 52.79 0.00 3.72
04:39:57 PM 4 0.00 40.35 0.01 17.83 0.00 0.00 0.75 6.86 0.00 23.19
04:39:57 PM 5 0.00 44.60 0.00 0.51 0.04 0.00 0.00 53.62 0.00 7.76
04:39:57 PM 6 0.00 44.92 0.01 12.48 0.00 0.00 0.51 7.00 0.00 24.59
04:39:57 PM 7 0.00 58.52 0.00 0.46 0.04 0.00 0.00 57.85 0.00 12.73
04:39:57 PM 8 0.00 33.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58.50

Show CPU live stats
# top
top - 16:45:28 up 771 days, 3:16, 1 user, load average: 6.66, 7.24, 6.54
Tasks: 670 total, 9 running, 661 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.7 us, 1.6 sy, 0.1 ni, 88.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26379142+total, 3811248 free, 23686681+used, 23113348 buff/cache
KiB Swap: 16777212 total, 16691312 free, 85900 used. 25565136 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
127796 root 10 -10 0.105t 573032 20444 S 403.6 0.2 42365,18 vswitchd
131315 root 20 0 9188560 87436 5940 S 108.6 0.0 668887:05 qemu-kvm
113257 root 20 0 9220320 77628 5916 S 78.8 0.0 300804:39 qemu-kvm
69578 root 20 0 9043128 66088 3640 S 18.2 0.0 2199:03 qemu-system-x86
123753 root 20 0 9039996 63892 3604 S 15.9 0.0 764:21.77 qemu-system-x86
113084 root 20 0 9074688 68036 1916 S 12.3 0.0 170215:57 qemu-system-x86
99040 root 20 0 16.647g 65140 1900 S 9.6 0.0 158111:53 qemu-system-x86
133933 root 20 0 4836392 64328 3648 S 8.6 0.0 381:15.29 qemu-system-x86
92403 root 20 0 4825240 62916 3308 S 7.3 0.0 21040:53 qemu-system-x86
100018 root 20 0 3471696 5216 2616 S 7.0 0.0 8384:57 logd

# htop

show live virtual memory usage
show stats every 2s; actually, it also shows io, system, and cpu as well
$ vmstat -n 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 4 85900 4185664 4396 22772168 0 0 1 91 0 0 15 3 82 0 0
6 0 85900 4181612 4396 22772580 0 0 64 479 55060 82145 8 1 91 0 0
8 0 85900 4184968 4396 22772636 0 0 96 98 56364 87759 8 1 91 0 0
8 0 85900 4183828 4396 22772936 0 0 96 152 58835 88482 9 1 90 0 0
6 0 85900 4180524 4396 22772920 0 0 0 320 58749 94072 9 1 90 0 0
5 0 85900 4184580 4396 22773588 0 0 0 234 67631 111630 9 2 89 0 0

show io statistics, most often used to find which disk has high io await.
the io wait of the whole system (96.0%wa):
# top
top - 14:31:20 up 35 min, 4 users, load average: 2.25, 1.74, 1.68
Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.3%us, 1.7%sy, 0.0%ni, 0.0%id, 96.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 245440k total, 241004k used, 4436k free, 496k buffers
Swap: 409596k total, 5436k used, 404160k free, 182812k cached

show iostat every 10s for each block device (check which block device has high io wait)
# sar -d 5
# iostat -txz 10
Linux 3.10.0-693.21.4.el7.x86_64 (A01-R15-I124-40-CCK4HP2.JCLOUD.COM) 10/22/2021 _x86_64_ (64 CPU)

10/22/2021 05:14:10 PM
avg-cpu: %user %nice %system %iowait %steal %idle
15.31 0.06 2.89 0.01 0.00 81.73

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.06 1.06 47.46 43.69 4897.74 203.70 0.01 0.25 0.26 0.25 0.06 0.29
sda 0.00 0.03 0.00 0.73 0.07 7.51 20.56 0.01 19.20 6.80 19.27 3.83 0.28
nb100 0.00 0.01 0.29 7.33 36.40 545.36 152.62 0.02 3.25 9.77 2.99 1.07 0.82
nb101 0.00 0.00 0.00 1.76 0.03 14.69 16.69 0.00 0.94 1.29 0.94 0.29 0.05
nb102 0.00 0.00 0.00 6.72 0.01 132.59 39.46 0.01 1.18 0.70 1.18 0.28 0.19
nb103 0.00 0.00 0.00 0.55 0.02 7.14 25.97 0.00 0.86 0.54 0.86 0.46 0.03
nb104 0.00 0.00 0.00 0.31 0.01 10.68 68.74 0.00 1.45 0.50 1.45 0.43 0.01
nb105 0.00 0.00 0.00 1.29 0.02 78.13 121.00 0.00 3.56 0.62 3.56 0.50 0.06
nb106 0.00 0.00 0.01 0.71 0.83 40.81 116.51 0.00 1.19 0.57 1.19 0.69 0.05
nb107 0.00 0.00 0.00 0.17 0.01 1.37 16.55 0.00 1.21 8.04 1.20 0.39 0.01
nb108 0.00 0.00 0.00 0.17 0.00 1.29 15.10 0.00 0.90 0.53 0.90 0.44 0.01
nb109 0.00 0.00 0.00 0.00 0.00 0.04 60.20 0.00 42.15 0.42 43.57 0.46 0.00

show io per process, to find which process is doing heavy io
#iotop
Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
17391 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % [kworker/6:0]
16896 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-0fe83e4c-ccd4-49f4-ae7e-4b07fabb2dc3.json
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % systemd --switched-root --system --deserialize 22
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
6 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]

show interface statistics (per-interface counters)
# ifstat
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
lo 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
enp0s3 8 0 6 0 560 0 1424 0
0 0 0 0 0 0 0 0
docker0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
vethbedf2bf 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

# live stats on each interface
#iftop

# live stats on each process which has network io
# nethogs
NetHogs version 0.8.5

PID USER PROGRAM DEV SENT RECEIVED
13337 root sshd: root@pts/2 enp0s3 0.218 0.186 KB/sec
? root unknown TCP 0.000 0.000 KB/sec
TOTAL 0.218 0.186 KB/sec

show details about an interface: config, stats, driver info, etc.
# ethtool -h
ethtool -g|--show-ring DEVNAME Query RX/TX ring parameters
ethtool -k|--show-features|--show-offload DEVNAME Get state of protocol offload and other features
ethtool -i|--driver DEVNAME Show driver information
ethtool -S|--statistics DEVNAME Show adapter statistics
ethtool -n|-u|--show-nfc|--show-ntuple DEVNAME Show Rx network flow classification options or rules
ethtool -x|--show-rxfh-indir|--show-rxfh DEVNAME Show Rx flow hash indirection and/or hash key

show power management
show power used by each process live
# powertop
PowerTOP v2.9 Overview Idle stats Frequency stats Device stats Tunables
Summary: 72.3 wakeups/second, 0.0 GPU ops/seconds, 0.0 VFS ops/sec and 0.3% CPU use

Usage Events/s Category Description
122.3 µs/s 20.0 Process [PID 460] [xfsaild/dm-0]
72.8 µs/s 9.5 Timer tick_sched_timer
116.2 µs/s 6.7 Timer hrtimer_wakeup
93.0 µs/s 5.7 Process [PID 1084] /usr/bin/containerd
63.7 µs/s 5.7 Process [PID 9] [rcu_sched]
632.7 µs/s 4.8 Process [PID 1049] /home/data/Anaconda3/bin/python /home/data/Anaconda3/bin/jupyter-notebook -y --no-browser --allow-root --ip=10.0.2.1
61.6 µs/s 4.8 Process [PID 1082] /usr/bin/containerd
34.9 µs/s 2.9 Interrupt [3] net_rx(softirq)
183.8 µs/s 1.9 Interrupt [7] sched(softirq)
239.2 µs/s 1.0 kWork e1000_watchdog

Benchmark tools

for benchmarking basic OS operations (syscalls, context switches, memory latency/bandwidth)
#apt-get install lmbench

Layer 4 throughput: use NetPerf or iPerf, two open-source network benchmark tools that support both UDP and TCP. Each tool also provides other information:
NetPerf, for example, provides end-to-end latency tests (round-trip times, RTT) and is a good replacement for ping.

iPerf provides packet loss and delay jitter, useful for troubleshooting network performance.

for the network, test the path between client (netperf) and server (netserver)

server side
#netserver

client side, testing for 300s (use -l 0 to never stop)
#netperf -H $server -l 300 -t TCP_STREAM

server side
#iperf3 --server --interval 30
client side
#iperf3 --client $server --time 300 --interval 30

check bottleneck, call graph

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Don't use gprof: it dates from the 1980s.

OProfile is also old (releasing about one version a year since 2002). It uses the same
kernel backend as 'perf', so it can give almost the same output, but the community
recommends 'perf' and intends it to replace OProfile.

gperftools is newer (since 2007, developed by Google); it is simpler and profiles
only from a per-process view.

perf has been in the kernel source tree (upstream) since 2009; it is more complex and
can show information from a system-wide view. It uses hardware counters to profile the
application, so its results are precise, and because it does not instrument the code
it is fast.

gperftools and perf are the two good choices nowadays.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

gperftools ("Google Performance Tools", which is also the package name)

gperftools is a collection of a high-performance multi-threaded malloc() implementation plus some pretty nifty performance analysis tools.


Ubuntu18
#apt-get install google-perftools graphviz libgoogle-perftools-dev
Centos7
$ yum install -y pprof gperftools-devel

Usage
CPUPROFILE (find which functions or lines consume the most CPU time); note: it does NOT track forked children!

gperftools provides tcmalloc, a heap checker, a heap profiler and a CPU profiler
(the heap checker and heap profiler are in '-ltcmalloc',
the CPU profiler in '-lprofiler';
google-pprof is used to analyze the profile file)

There are two ways to use gperftools: compile it into your program, or use
LD_PRELOAD and set environment variables (always prefer the first way!)

Generate profile file
a. Compile it within your program

#include <gperftools/profiler.h>
#include <stdio.h>
#include <stdlib.h>

void func1() {
    int i = 0;
    while (i < 100000) {
        ++i;
    }
}

void func2() {
    int i = 0;
    while (i < 200000) {
        ++i;
    }
}

void func3() {
    int i = 0;
    for (i = 0; i < 1000; ++i) {
        func1();
        func2();
    }
}

int main() {
    ProfilerStart("my.prof");
    func3();
    ProfilerStop();
    return 0;
}

# gcc -o test test.c -g -Wall -lprofiler

Compiled this way, the CPU profiler always runs (ProfilerStart/ProfilerStop control it; no environment switch is needed)
#CPUPROFILE_FREQUENCY=100 ./test
(100 samples per second, the default value)


b. Use LD_PRELOAD (not recommended!); no recompiling needed
#export LD_PRELOAD=/usr/lib64/libprofiler.so

/*turn on cpu profile during whole life*/
#env CPUPROFILE=my.prof ./test

-------------------------------------------------------------------------
| For a daemon process, run it in the foreground (not daemonized)       |
| while profiling.                                                      |
-------------------------------------------------------------------------

---------------------------------------------------------------------------
Analyze the profile file (pay attention to the first three columns)
to see which functions or lines consume the most CPU time.
---------------------------------------------------------------------------
See which function takes the most time
root@ubuntu:~# google-pprof --text ./test my.prof
OR
root@centos:~# pprof --text ./test my.prof

Using local file ./test. (test is the program)
Using local file my.prof.(my.prof is the data collected before)
Removing killpg from all stack traces.
Total: 71 samples
53 74.6% 74.6% 53 74.6% func2
18 25.4% 100.0% 18 25.4% func1
0 0.0% 100.0% 71 100.0% __libc_start_main
0 0.0% 100.0% 71 100.0% _start
0 0.0% 100.0% 71 100.0% func3
0 0.0% 100.0% 71 100.0% main

column meanings
1. Number of profiling samples in this function
2. Percentage of profiling samples in this function
3. Percentage of profiling samples in the functions printed so far
4. Number of profiling samples in this function and its callees
5. Percentage of profiling samples in this function and its callees
6. Function name


If you run perf instead, from a system-wide view, you get

$perf record ./test
$perf report
65.73% test test [.] func2
33.68% test test [.] func1
0.16% test [kernel.vmlinux] [k] native_write_msr_safe
0.06% test [kernel.vmlinux] [k] x86_pmu_enable
0.05% test [kernel.vmlinux] [k] __intel_pmu_disable_all
0.05% test libc-2.17.so [.] __GI___dl_iterate_phdr
0.00% test [kernel.vmlinux] [k] __do_page_fault
0.00% test libc-2.17.so [.] __memset_sse2
0.00% test [kernel.vmlinux] [k] lapic_next_deadline


To see which line takes the most time, build test with -g
root@ubuntu:~#google-pprof --lines --text ./test my.prof
OR
root@centos:~#pprof --lines --text ./test my.prof

Using local file ./test.
Using local file my.prof.
Removing killpg from all stack traces.
Total: 71 samples
37 52.1% 52.1% 37 52.1% func2 /root/test.c:12 (discriminator 1)
22 31.0% 83.1% 22 31.0% func1 /root/test.c:6 (discriminator 1)
11 15.5% 98.6% 13 18.3% func2 /root/test.c:13
1 1.4% 100.0% 1 1.4% func1 /root/test.c:7
0 0.0% 100.0% 71 100.0% __libc_start_main /build/eglibc-3GlaMS/eglibc-2.19/csu/libc-start.c:287
0 0.0% 100.0% 71 100.0% _start ??:?
0 0.0% 100.0% 1 1.4% func1 /root/test.c:6
0 0.0% 100.0% 11 15.5% func2 /root/test.c:12
0 0.0% 100.0% 23 32.4% func3 /root/test.c:19 (discriminator 2)
0 0.0% 100.0% 48 67.6% func3 /root/test.c:20 (discriminator 2)
0 0.0% 100.0% 71 100.0% main /root/test.c:25
(--text, --pdf, --web, --dot, --gif, --gv etc )
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Below seems not to work for nginx; unclear why.

TCMALLOC (thread-caching malloc): your code keeps calling malloc/free as usual,
and tcmalloc manages the memory for you.

tcmalloc implements a cache/pool, so memory is handed out quickly from the cache.
As allocation pressure grows, tcmalloc takes more memory from the system; when
pressure decreases it returns memory to the system, at a rate controlled by
TCMALLOC_RELEASE_RATE.
Usage
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#gcc -o test test.c -ltcmalloc_minimal
(in your program, use malloc, free as you did before)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HEAP CHECKER (checks for memory leaks; does not always work well, reason unknown)
#gcc -o test test.c -ltcmalloc
#HEAPCHECK=normal ./test


HEAP Profile(check where/who alloc memory)

#gcc -o test test.c -ltcmalloc

dump a heap profile each time another 1 MiB has been allocated (malloc allocations only)
#HEAPPROFILE=heap.prof HEAP_PROFILE_ALLOCATION_INTERVAL=1048576 ./test
(the interval must be a plain number of bytes; the shell does not evaluate 1024*1024)

also count allocations made via sbrk and mmap
#HEAPPROFILE=heap.prof HEAP_PROFILE_MMAP=true HEAP_PROFILE_ALLOCATION_INTERVAL=1048576 ./test

root@ubuntu:~#google-pprof --gv test test.0004.heap
root@centos:~#pprof --gv test test.0004.heap

Ref