performance tools

Performance knowledge

Memory usage metrics

Show process memory usage with top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

1 root 20 0 38116 6136 3984 S 0.0 0.0 0:05.71 systemd


VIRT(VSS): The total amount of virtual memory used by the task.
It includes all code, data and shared libraries, plus pages that have been swapped out (it is not the physical memory currently in use).

RES(RSS): The non-swapped physical memory a task has used (CODE + DATA).

SHR: The amount of shared memory used by a task.
It simply reflects memory that could potentially be shared with other processes.

%MEM: The task's share of physical memory, based on RES.

Show process memory usage with smem (metrics used by smem and ps)

Swap: swap space used by each process

VSS (virtual set size)
VSS (reported as VSZ by ps) is the total accessible address space of a process
(all allocated virtual addresses: malloc'd regions, the stack, mappings such as shared libraries).
This size also includes memory that may not be resident in RAM, such as allocations that have been made but never written to.
VSS is of very little use for determining the real memory usage of a process.

RSS (resident set size)
RSS is the total memory actually held in RAM for a process. RSS can be misleading,
because it counts the full size of every shared library the process uses,
even though a shared library is only loaded into memory once regardless of how many processes use it.
RSS is not an accurate representation of the memory usage of a single process.

PSS (proportional set size)
PSS differs from RSS in that it reports the proportional size of shared libraries:
if three processes all use a shared library that has 30 pages,
that library will only contribute 10 pages to the PSS reported for each of the three processes.
PSS is a very useful number because when the PSS of all processes in the system is summed,
the result is a good representation of the total memory usage of the system.
When a process is killed, the shared library pages that contributed to its PSS are proportionally redistributed to
the PSS totals of the remaining processes still using that library.
In this way PSS can be slightly misleading, because when a process is killed, its PSS does not accurately represent the memory returned to the overall system.

USS (unique set size)
USS is the total private memory of a process, i.e. the memory that is completely unique to that process.
USS is an extremely useful number because it indicates the true incremental cost of running a particular process.
When a process is killed, the USS is the total memory that is actually returned to the system.
USS is the best number to watch when you first suspect a memory leak in a process.

For example, suppose two processes share a library that occupies 2M of physical memory:

            VSS  RSS  PSS  USS
process A   20M  18M  17M  16M
process B   20M  19M  18M  17M

(RSS = USS + shared_library_memory, PSS = USS + shared_library_memory / sharing_process_count)
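The relationships in this example can be sketched in a few lines of Python (the helper names are my own; the 2M shared library and the USS values come from the table above):

```python
def rss(uss_mb, shared_mb):
    # RSS charges the full shared mapping to every process that uses it
    return uss_mb + shared_mb

def pss(uss_mb, shared_mb, n_sharers):
    # PSS splits the shared mapping evenly among the processes sharing it
    return uss_mb + shared_mb / n_sharers

SHARED_MB = 2  # the library occupies 2M of physical memory, shared by 2 processes

for name, uss in [("process A", 16), ("process B", 17)]:
    print(name, "RSS =", rss(uss, SHARED_MB), "PSS =", pss(uss, SHARED_MB, 2))
```

Note that the two PSS values sum to 35M, counting the shared 2M exactly once, while the RSS values sum to 37M, double-counting it.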

load average

The load average is the average system load on a Linux server over a defined period of time. In other words, it is the CPU demand on a server: the sum of the running and the waiting threads. On Linux it tracks not only running tasks, but also tasks in uninterruptible sleep (usually waiting for IO).

Measuring the load average is critical to understanding how your servers are performing; if they are overloaded, you need to kill or optimize the processes consuming large amounts of resources, or provide more resources to balance the workload.

To keep it simple, assume a server with a single processor: if the load is less than 1, then on average every process that needed the CPU could use it immediately without being blocked. Conversely, if the load is greater than 1, then on average there were processes ready to run that could not, because no CPU was available.

For a single processor, is the ideal load average 1.00, with anything above that a call to troubleshoot? Although that is a safe bet, a more proactive approach is to leave some extra headroom to absorb unexpected load; many people aim for a load of about 0.7 to cater for spikes.

Whether the system is overloaded or not depends on how many logical CPUs (cores/hardware threads) you have.

You probably have a system with multiple CPUs. The load average numbers work a bit differently on such a system. For example, if you have a load average of 2 on a single-CPU system, your system was overloaded by 100 percent: for the entire period, one process was using the CPU while another was waiting. On a system with two CPUs, this would be full usage: two processes were using the two CPUs the entire time. On a system with four CPUs, this would be half usage: two processes were using two CPUs, while two CPUs sat idle.

check load average

(py3.9) [root@dev ~]# uptime
14:41:58 up 11 days, 23:10, 3 users, load average: 1.68, 0.55, 5.91
# These numbers are the averages of the system load over a period of one, five, and 15 minutes
  • The first value is 1.68. This is the CPU load during the last minute: a measure of how many programs (processes in the ready state) were using or waiting for CPU time during the last minute. So, during the last minute on this machine, there were on average 1.68 programs either using CPU processing time or waiting for it. If this is a single-core CPU, the computer is overloaded: users are waiting for their programs to run on the CPU and experiencing degraded performance. If, instead, this is a dual-core or quad-core machine, users were able to get CPU time just as quickly as they needed it during the last minute.

  • The second value is 0.55. This is the measurement over the last 5 minutes. As previously discussed, a measurement below 1 means that the CPU spent some of that window completely idle. In this case, the CPU was idle for almost half the time. If we're optimizing our CPU to be constantly doing something, that's not a good sign.

  • The final number, 5.91, is the measurement over the last 15 minutes. If you're using an eight-core CPU, this number isn't particularly shocking. If you're using a dual-core CPU, a number like 5.91 means your CPU is heavily overloaded: users are regularly waiting for CPU time and probably experiencing significantly degraded performance.
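A quick way to put these numbers in context is to normalize them by the number of logical CPUs; a minimal Python sketch (the function name and the 4-CPU figure are my own assumptions; on a live system you would use os.getloadavg() and os.cpu_count()):

```python
def per_cpu_load(loadavg, ncpus):
    # below 1.0: spare capacity on average; above 1.0: runnable tasks had to wait
    return loadavg / ncpus

# the three samples from the uptime output above, assuming a 4-CPU box
for load in (1.68, 0.55, 5.91):
    ratio = per_cpu_load(load, 4)
    state = "overloaded" if ratio > 1.0 else "ok"
    print(f"load {load}: {ratio:.2f} per CPU -> {state}")
```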

troubleshooting high load average

# show current load average
$ uptime


########################################################## cpu usage per cpu====================================
# show per-cpu usage periodically
$ mpstat -P ALL 1
########################################################## cpu usage per cpu====================================

########################################################## cpu usage per process====================================
# show all active processes' cpu usage (like top)
$ pidstat 1
# show cpu usage of a given process periodically (here every 2 seconds)
# $ pidstat -p 823471 2
$ pidstat -p 823471 1
Linux 3.10.0-693.21.7.el7.x86_64 (A06-R08-I132-181-815KSRH.JCLOUD.COM) 07/20/2022 _x86_64_ (32 CPU)
04:42:46 PM UID PID %usr %system %guest %CPU CPU Command
04:42:47 PM 0 823471 33.00 5.00 0.00 38.00 10 node_monitor
04:42:48 PM 0 823471 54.00 7.00 0.00 61.00 10 node_monitor
04:42:49 PM 0 823471 46.00 4.00 0.00 50.00 10 node_monitor
04:42:50 PM 0 823471 31.00 7.00 0.00 38.00 10 node_monitor
# show cpu usage of a given process, broken down per thread, periodically
$ pidstat -p 823471 -t 1
########################################################## cpu usage per process====================================


########################################################## cpu schedule latency per process=========================

# CPU run queue latency, schedule latency for each process
# monitor 10 seconds
$ perf sched record -- sleep 10
$ perf sched latency
-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
:632308:632308 | 0.478 ms | 2 | avg: 47.545 ms | max: 95.088 ms | max at: 1397496.686810 s
:632316:632316 | 13.820 ms | 8 | avg: 23.729 ms | max: 94.070 ms | max at: 1397496.685531 s
ovs-vsctl:(3) | 107.932 ms | 31 | avg: 14.581 ms | max: 95.035 ms | max at: 1397496.485498 s
sh:(2) | 6.058 ms | 11 | avg: 8.146 ms | max: 87.049 ms | max at: 1397495.885529 s
sleep:(26) | 17.129 ms | 75 | avg: 6.211 ms | max: 94.541 ms | max at: 1397496.386032 s
:632320:632320 | 3.305 ms | 18 | avg: 5.288 ms | max: 94.205 ms | max at: 1397495.785685 s
node_monitor:(76) | 342.577 ms | 2342 | avg: 4.063 ms | max: 192.182 ms | max at: 1397496.385540 s
perf:(183) | 569.297 ms | 1597 | avg: 3.975 ms | max: 196.652 ms | max at: 1397496.386045 s
kworker/25:2:461486 | 0.009 ms | 1 | avg: 0.677 ms | max: 0.677 ms | max at: 1397496.187200 s
kworker/9:1:30855 | 0.012 ms | 1 | avg: 0.379 ms | max: 0.379 ms | max at: 1397496.186873 s
kworker/1:6:615135 | 0.022 ms | 2 | avg: 0.358 ms | max: 0.711 ms | max at: 1397496.189183 s
kworker/17:1:715010 | 0.009 ms | 1 | avg: 0.357 ms | max: 0.357 ms | max at: 1397496.186876 s
kworker/31:1:303204 | 0.011 ms | 1 | avg: 0.314 ms | max: 0.314 ms | max at: 1397496.186854 s
:632321:632321 | 3.882 ms | 7 | avg: 0.313 ms | max: 1.140 ms | max at: 1397495.887512 s
:632319:632319 | 0.815 ms | 6 | avg: 0.258 ms | max: 1.277 ms | max at: 1397495.792539 s
kworker/3:1:86183 | 0.010 ms | 1 | avg: 0.226 ms | max: 0.226 ms | max at: 1397496.186715 s
:632317:632317 | 0.866 ms | 7 | avg: 0.216 ms | max: 1.099 ms | max at: 1397495.794319 s
kworker/14:1:197420 | 0.010 ms | 1 | avg: 0.215 ms | max: 0.215 ms | max at: 1397496.186723 s
:632307:632307 | 0.872 ms | 8 | avg: 0.205 ms | max: 1.420 ms | max at: 1397496.188524 s
kworker/26:0:308384 | 0.011 ms | 1 | avg: 0.193 ms | max: 0.193 ms | max at: 1397496.186720 s
:632301:632301 | 0.275 ms | 4 | avg: 0.165 ms | max: 0.658 ms | max at: 1397496.591534 s

########################################################## cpu schedule latency per process=========================

CPU usage

CPU usage is a measurement, as a percentage, of how much time the CPU spends actively computing something. For instance, if a program required uninterrupted processing power for 54 of the last 60 seconds, your CPU usage on one core would be 90%. If, instead, the program only required six seconds of processing time on one core, the usage would be 10%.

Most companies seek to keep the CPU usage of their servers as close to 100% as possible. Most servers are sold by overall computing power, and if your server is only sitting at 30% CPU usage, you’re paying for too much processor power. You could downgrade your processor to a lower tier, save money, and see no reduction in the quality of your server’s performance.

cpu usage vs load average

CPU usage: The ratio (usually expressed as a percentage) of time that the CPU is busy doing work. This measure only makes sense if you know over which period the percentage is calculated.

Load: Average queue length for the CPU - including the process currently executing. For this to make sense, you need to know the period over which this is being measured.

They are related, but one does not necessarily correlate to the other.

Imagine this scenario - with slightly contrived numbers: An ideal world with a single CPU. No scheduling overhead, no I/O overhead. Just keeping things simple.

  • You have 100 processes waiting for something.
  • When that “something” happens, each process will need 0.05 seconds of CPU time to do stuff in response.
  • When “something” does not happen, you have 0% CPU utilisation, and a queue length of 0. Basically stuff is just waiting. Life is good, and you’re merely wasting electrons and heating up the planet.
  • “something” happens. All 100 processes wake up. Your queue length jumps to 100, and your CPU is busy.
  • 0.05 seconds later, your queue length is 99 as the first process has finished doing “stuff”. CPU is still busy.
  • After 0.1 seconds, your queue length is 98 as the 2nd process has finished doing “stuff”. CPU is still busy.
  • Every 0.05 your queue length drops by 1 as a process finishes. CPU remains busy.
  • After 5 seconds, all the processes have finished; CPU becomes idle again and your queue length is back to zero.
  • Your CPU utilisation over the last 60 seconds is now: 5/60 = 8.33%. But your average queue length (=load average) over the last 60 seconds will be about 4.2.
    len:  100  99    98    97    ...  1     0    ...  0
    time: 0    0.05  0.1   0.15  ...  4.95  5    ...  59

    average = (100 + 99 + ... + 1 + 0 + ... + 0) / (59/0.05 + 1) ≈ 4.27

Looking at the 1-minute CPU utilisation alone (8.33%), you look good. But the 1-minute load average (4.2) shows that you have a performance bottleneck during that minute. Whether this is “bad” or not depends on whether you want it to be faster - do you need to respond to “something” happening more frequently than every 5 seconds?
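The arithmetic of this toy scenario can be checked with a short Python sketch, sampling the queue length every 0.05 s over the 60-second window as in the example:

```python
# 100 processes wake at t=0, each needs 0.05 s of CPU on a single CPU,
# so one process finishes per 0.05 s tick until the queue drains at t=5.
N_SAMPLES = 59 * 20 + 1                              # ticks from t=0 to t=59
queue = [max(0, 100 - k) for k in range(N_SAMPLES)]  # queue length at tick k

load_avg = sum(queue) / len(queue)
cpu_util = 5 / 60                                    # CPU busy for 5 s out of 60

print(f"average queue length ~ {load_avg:.2f}")      # ~4.28
print(f"CPU utilisation      ~ {cpu_util:.2%}")      # ~8.33%
```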

NOTE

  • Load average is always high for bursty load (many processes become runnable at the same time), even if each runs only briefly, so load can be high while CPU usage is not.
  • Too many D-state processes: a process in state D (uninterruptible sleep, usually waiting for I/O) is counted in the load average as well.

useful commands and performance tools

To debug a performance issue there are lots of tools that can help us identify the problem, but some of them are old and some are new, so here we only introduce the newer tools that are used today.

Old tools

  • gprof
  • OProfile

New tools

  • gperftools
    gperftools is newer, developed by Google since 2007; it is simpler and works only from the process's point of view (the stacks of a process).
  • perf
    perf has been in the kernel source tree (upstream) since 2009; it is more complex and can show
    more information from a system-wide view. It uses hardware counters to profile the application.
    The results of this profiler are very precise, and because it does not instrument the code, it is really fast.

perf can profile a process (stacks from the kernel down to the process) or the whole system (when no process id is given)

useful commands


show cpu load
Each running process either using or waiting for CPU resources adds 1 to the load. So, if your system has a load of 5, five processes are either using or waiting for the CPU. An instantaneous load number by itself doesn't mean much: a computer might have a load of 0 one split-second, and a load of 5 the next split-second as several processes use the CPU. Even if you could see the load at any given time, that number would be basically meaningless. That's why Unix-like systems don't display the current load. They display the load average: an average of the computer's load over several periods of time. This lets you see how much work your computer has been performing.

# uptime
10:11:01 up 18:57, 4 users, load average: 0.50, 2.13, 1.85
From left to right, these numbers show you the average load over the last one minute, the last five minutes, and the last fifteen minutes

show how much time a process spends in sys and user mode

Real time is wall-clock time (what we could measure with a stopwatch).
User time is the amount of CPU time spent in user mode within the process.
Sys time is the CPU time spent in the kernel on behalf of the process.

NOTE: real can be less than user if the app is multi-threaded or multi-process!!!

The rule of thumb is:
real < user: The process is CPU bound and takes advantage of parallel execution on multiple cores/CPUs.
real ≈ user: The process is CPU bound and takes no advantage of parallel execution.
real > user: The process is I/O bound. Execution on multiple cores would be of little to no advantage.

#time ls
share windows
real 0m0.002s
user 0m0.001s
sys 0m0.001s
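The rule of thumb above can be turned into a tiny classifier; a Python sketch (the function name and the 10% tolerance band are my own choices):

```python
def classify(real, user, tol=0.1):
    # compare wall-clock time against user CPU time, with a 10% tolerance band
    if real < user * (1 - tol):
        return "CPU bound, parallel"
    if abs(real - user) <= tol * max(real, user):
        return "CPU bound, serial"
    return "I/O bound"

print(classify(2.0, 7.6))   # a parallel build: real < user
print(classify(5.1, 5.0))   # a single-threaded number cruncher
print(classify(9.0, 0.3))   # a process that mostly waits on disk
```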

show latency of RT linux kernel

#cyclictest
(git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git)

show slab info

#cat /proc/slabinfo
#slabtop
Active / Total Objects (% used) : 133629 / 147300 (90.7%)
Active / Total Slabs (% used) : 11492 / 11493 (100.0%)
Active / Total Caches (% used) : 77 / 121 (63.6%)
Active / Total Size (% used) : 41739.83K / 44081.89K (94.7%)
Minimum / Average / Maximum Object : 0.01K / 0.30K / 128.00K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
44814 43159 96% 0.62K 7469 6 29876K ext3_inode_cache
36900 34614 93% 0.05K 492 75 1968K buffer_head
35213 33124 94% 0.16K 1531 23 6124K dentry_cache
7364 6463 87% 0.27K 526 14 2104K radix_tree_node
1280 1015 79% 0.25K 40 32 320K kmalloc-256 ---> two pages for one slab
(Note: the slab management overhead is not counted here!!!, but it's small)

Each cache may have many slabs (empty, partial, full); each slab is one or more pages
(PAGE_SIZE, usually 4K)!

USE = (ACTIVE / OBJS) * 100%
OBJS = SLABS * (OBJ/SLAB)
OBJ/SLAB = (4K * n) / OBJ_SIZE    (n = pages per slab)
CACHE SIZE = SLABS * (4K * n)
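These relationships can be checked against the kmalloc-256 row above (0.25K objects, 32 objects per slab, so n = 2 pages per slab); a quick Python sketch:

```python
PAGE_K = 4  # page size in KiB

# the kmalloc-256 row from the slabtop output above
objs, active, obj_size_k, slabs, obj_per_slab = 1280, 1015, 0.25, 40, 32

n_pages = int(obj_per_slab * obj_size_k / PAGE_K)        # pages per slab: 8K / 4K = 2
print("pages per slab:", n_pages)
print("OBJS       =", slabs * obj_per_slab)              # 40 * 32 = 1280
print("CACHE SIZE =", slabs * PAGE_K * n_pages, "K")     # 40 * 8K = 320K
print("USE        =", round(active / objs * 100), "%")   # 1015/1280 ~ 79%
```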


show swap size used by each process
# smem
(smem reports sizes in KiB by default, so RSS 656 means 656K)
PID User Command Swap USS PSS RSS
2516 rabbitmq sh -c /usr/lib/rabbitmq/bin 0 96 116 656
1451 lightdm /bin/sh /usr/lib/lightdm/li 0 100 121 700
1130 root /bin/sh -e /proc/self/fd/9 0 100 122 680
1157 root /sbin/getty -8 38400 tty3 0 156 174 964

Show basic process information smem
Show library-oriented view smem -m
Show user-oriented view smem -u
Show system view smem -R 4G -K /path/to/vmlinux -w
Show totals and percentages smem -t -p
Show different columns smem -c "name user pss"
Sort by reverse RSS smem -s rss -r
Show processes filtered by mapping smem -M libxml
Show mappings filtered by process smem -m -P [e]volution
Read data from capture tarball smem --source capture.tar.gz
Show a bar chart labeled by pid smem --bar pid -c "pss uss"
Show a pie chart of RSS labeled by name smem --pie name -s rss

Show memory usage by 'free' command

$ free
total used free shared buff/cache available
Mem: 24687560 11825536 8579812 258488 4282212 12299492
Swap: 16774140 0 16774140

total== 11825536 + 8579812 + 4282212 == 24687560
available = free + the part of buff/cache that can be reclaimed

total: your total (physical) RAM (excluding a small bit that the kernel permanently reserves for itself at startup);
used: memory in use (total - free - buff/cache), i.e. what applications are currently using;
free: memory not in use for anything.

total = used + free + buff/cache

shared: memory used by tmpfs and shared memory;
buff/cache: buffers are used when writing data to disk, while the cache stores data read from disk in memory.

The last line (Swap:) gives information about swap space usage (i.e. memory contents that have been temporarily moved to disk).

To actually understand what the numbers mean, you need a bit of background about the virtual memory (VM) subsystem in Linux.
The short version: Linux (like most modern OSes) will always try to use free RAM for caching, so Mem: free will almost always be very low.
Caches are freed automatically if memory gets scarce, so they do not really matter.
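The identity total = used + free + buff/cache can be verified against the sample free output above:

```python
# numbers (KiB) taken from the `free` output above
total, used, free_mem, cache, available = 24687560, 11825536, 8579812, 4282212, 12299492

assert used + free_mem + cache == total          # every KiB is accounted for

# `available` exceeds `free` because most of buff/cache can be reclaimed on demand
reclaimable = available - free_mem
print("free:", free_mem, "K  available:", available, "K")
print("reclaimable part of buff/cache ~", reclaimable, "K")
```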

Inside exec()

In computing, exec is a functionality of an operating system that runs an executable file in the context of an already existing process,
replacing the previous executable. This act is also referred to as an overlay. It is especially important in Unix-like systems, although it exists elsewhere.
As a new process is not created, the process identifier (PID) does not change,
but the machine code, data, heap, and stack of the process are replaced by those of the new program.


====================================================SAR===================================================================================
sar (System Activity Report): shows system activity information; it gives more detail about cpu, memory, interrupts, io, power, network etc.
You can also check other commands for specific resources in the sections below

# -B reports paging statistics (covers both swapping of process pages and disk io paging)
# sar -B 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/12/2022 _x86_64_ (16 CPU)

05:14:19 PM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
05:14:24 PM 0.00 31948.80 10.00 0.00 136.20 0.00 0.00 0.00 0.00
05:14:29 PM 0.00 236544.00 10.80 0.00 57.40 0.00 0.00 0.00 0.00

# -W is about swapping of process memory (process pages are swapped to disk when there is not enough memory)
#sar -W 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/12/2022 _x86_64_ (16 CPU)

05:14:43 PM pswpin/s pswpout/s
05:14:48 PM 0.00 0.00
05:14:53 PM 0.00 0.00


Report I/O and transfer rate statistics
# sar -b 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:34:02 PM tps rtps wtps bread/s bwrtn/s
05:34:07 PM 0.00 0.00 0.00 0.00 0.00
05:34:12 PM 0.00 0.00 0.00 0.00 0.00

Report activity for each block device
# sar -d 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:34:20 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
05:34:25 PM dev8-0 0.20 0.00 6.40 32.00 0.00 1.00 1.00 0.02
05:34:25 PM dev253-0 0.20 0.00 6.40 32.00 0.00 1.00 1.00 0.02
05:34:25 PM dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:34:25 PM dev253-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

show interrupt per 5s
# sar -I ALL 5
Linux 3.10.0-1160.el7.x86_64 (dev) 10/22/2021 _x86_64_ (8 CPU)

05:37:36 PM INTR intr/s
05:37:41 PM 0 0.00
05:37:41 PM 1 0.00
05:37:41 PM 2 0.00
05:37:41 PM 3 0.00
05:37:41 PM 4 0.00
05:37:41 PM 5 0.00
05:37:41 PM 6 0.00
05:37:41 PM 7 0.00
05:37:41 PM 8 0.00
05:37:41 PM 9 0.00
05:37:41 PM 10 0.00
05:37:41 PM 11 0.00
05:37:41 PM 12 0.00
05:37:41 PM 13 0.00
05:37:41 PM 14 0.80
05:37:41 PM 15 0.00

show power management
$ sar -m ALL 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:40:01 PM CPU MHz
05:40:11 PM all 1258.82

05:40:01 PM TEMP degC %temp DEVICE
05:40:11 PM 1 43.00 55.84 coretemp-isa-0000
05:40:11 PM 2 38.00 49.35 coretemp-isa-0000
05:40:11 PM 3 37.00 48.05 coretemp-isa-0000
05:40:11 PM 4 34.00 44.16 coretemp-isa-0000
05:40:11 PM 5 38.00 49.35 coretemp-isa-0000
05:40:11 PM 6 33.00 42.86 coretemp-isa-0000
05:40:11 PM 7 34.00 44.16 coretemp-isa-0000
05:40:11 PM 8 37.00 48.05 coretemp-isa-0000
05:40:11 PM 9 35.00 45.45 coretemp-isa-0000
05:40:11 PM 10 44.00 57.14 coretemp-isa-0001
05:40:11 PM 11 36.00 46.75 coretemp-isa-0001
05:40:11 PM 12 36.00 46.75 coretemp-isa-0001
05:40:11 PM 13 37.00 48.05 coretemp-isa-0001
05:40:11 PM 14 37.00 48.05 coretemp-isa-0001
05:40:11 PM 15 34.00 44.16 coretemp-isa-0001
05:40:11 PM 16 36.00 46.75 coretemp-isa-0001
05:40:11 PM 17 34.00 44.16 coretemp-isa-0001
05:40:11 PM 18 34.00 44.16 coretemp-isa-0001

05:40:01 PM BUS idvendor idprod maxpower manufact product
05:40:11 PM 1 8087 800a 0
05:40:11 PM 2 8087 8002 0
05:40:11 PM 1 413c a001 200 no manufacturer Gadget USB HUB

show network stats; there are lots of fields, only some are listed
# sar -n ALL 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:43:15 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
05:43:25 PM tap_metadata 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM vxlan_sys_4789 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM br0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy_ns 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM ovs-system 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM eth0 2.10 2.00 0.23 0.19 0.00 0.00 0.00
05:43:25 PM lo 2.00 2.00 0.79 0.79 0.00 0.00 0.00
05:43:25 PM em2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em3 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

05:43:15 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s
05:43:25 PM tap_metadata 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM vxlan_sys_4789 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM br0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy_ns 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM ovs-system 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM tap_proxy 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM em3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:43:25 PM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Report cpu queue length and load averages
# sar -P ALL -q 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:48:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
05:48:42 PM 0 1069 0.43 0.46 0.49 0

Report memory utilization statistics
# sar -r 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:46:25 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
05:46:35 PM 119478308 12269620 9.31 1868 9411160 5926572 3.99 4968572 5430432 1560

Report CPU utilization
# sar -P ALL -u 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:47:32 PM CPU %user %nice %system %iowait %steal %idle
05:47:42 PM all 1.39 0.00 0.40 0.00 0.00 98.20
05:47:42 PM 0 40.49 0.00 0.00 0.00 0.00 59.51
05:47:42 PM 1 0.00 0.00 0.30 0.00 0.00 99.70
05:47:42 PM 2 0.41 0.00 1.32 0.00 0.00 98.28
05:47:42 PM 3 0.10 0.00 0.20 0.00 0.00 99.70
05:47:42 PM 4 0.30 0.00 0.80 0.00 0.00 98.89
05:47:42 PM 5 0.20 0.00 0.40 0.00 0.00 99.40
05:47:42 PM 6 0.20 0.00 0.40 0.00 0.00 99.40
05:47:42 PM 7 0.00 0.00 0.20 0.00 0.00 99.80
05:47:42 PM 8 0.70 0.00 0.90 0.00 0.00 98.39

Report task creation and system switching activity
# sar -w 10
Linux 3.10.0-327.36.4.el7.x86_64 (A04-R08-I138-47-91TYB72.JCLOUD.COM) 10/22/2021 _x86_64_ (32 CPU)

05:48:56 PM proc/s cswch/s
05:49:06 PM 2.30 12558.70
====================================================SAR===================================================================================

Show CPU stats
CPU utilization broken down into user, sys, and virtual processor (guest VM) time
# mpstat -P ALL -u
04:38:08 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
04:38:08 PM all 7.03 0.06 2.89 0.01 0.00 0.00 0.00 8.28 0.00 81.73
04:38:08 PM 0 0.88 0.08 3.26 0.01 0.00 0.19 0.00 10.02 0.00 85.56
04:38:08 PM 1 0.89 0.04 3.30 0.01 0.00 0.02 0.00 9.47 0.00 86.28
04:38:08 PM 2 0.83 0.08 3.15 0.01 0.00 0.01 0.00 10.15 0.00 85.78
04:38:08 PM 3 0.82 0.04 3.15 0.01 0.00 0.00 0.00 9.61 0.00 86.39
04:38:08 PM 4 1.03 0.07 4.59 0.01 0.00 0.01 0.00 12.51 0.00 81.78
04:38:08 PM 5 0.92 0.04 3.22 0.01 0.00 0.00 0.00 9.58 0.00 86.23
04:38:08 PM 6 1.10 0.07 4.63 0.01 0.00 0.00 0.00 12.52 0.00 81.66
04:38:08 PM 7 0.83 0.05 3.42 0.01 0.00 0.00 0.00 10.44 0.00 85.25
04:38:08 PM 8 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00

CPU soft irq
# mpstat -I SCPU
Linux 3.10.0-693.21.4.el7.x86_64 (A01-R15-I124-40-CCK4HP2.JCLOUD.COM) 10/22/2021 _x86_64_ (64 CPU)

04:39:57 PM CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s BLOCK_IOPOLL/s TASKLET/s SCHED/s HRTIMER/s RCU/s
04:39:57 PM 0 0.00 54.90 0.20 2.42 0.00 0.00 0.05 12.35 0.00 10.65
04:39:57 PM 1 0.00 41.11 0.00 0.48 0.04 0.00 7.10 43.85 0.00 7.03
04:39:57 PM 2 0.00 60.01 0.01 14.90 0.00 0.00 0.57 59.44 0.00 10.80
04:39:57 PM 3 0.00 33.81 0.00 0.50 0.04 0.00 0.00 52.79 0.00 3.72
04:39:57 PM 4 0.00 40.35 0.01 17.83 0.00 0.00 0.75 6.86 0.00 23.19
04:39:57 PM 5 0.00 44.60 0.00 0.51 0.04 0.00 0.00 53.62 0.00 7.76
04:39:57 PM 6 0.00 44.92 0.01 12.48 0.00 0.00 0.51 7.00 0.00 24.59
04:39:57 PM 7 0.00 58.52 0.00 0.46 0.04 0.00 0.00 57.85 0.00 12.73
04:39:57 PM 8 0.00 33.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58.50

Show CPU live stats
# top
top - 16:45:28 up 771 days, 3:16, 1 user, load average: 6.66, 7.24, 6.54
Tasks: 670 total, 9 running, 661 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.7 us, 1.6 sy, 0.1 ni, 88.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26379142+total, 3811248 free, 23686681+used, 23113348 buff/cache
KiB Swap: 16777212 total, 16691312 free, 85900 used. 25565136 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
127796 root 10 -10 0.105t 573032 20444 S 403.6 0.2 42365,18 vswitchd
131315 root 20 0 9188560 87436 5940 S 108.6 0.0 668887:05 qemu-kvm
113257 root 20 0 9220320 77628 5916 S 78.8 0.0 300804:39 qemu-kvm
69578 root 20 0 9043128 66088 3640 S 18.2 0.0 2199:03 qemu-system-x86
123753 root 20 0 9039996 63892 3604 S 15.9 0.0 764:21.77 qemu-system-x86
113084 root 20 0 9074688 68036 1916 S 12.3 0.0 170215:57 qemu-system-x86
99040 root 20 0 16.647g 65140 1900 S 9.6 0.0 158111:53 qemu-system-x86
133933 root 20 0 4836392 64328 3648 S 8.6 0.0 381:15.29 qemu-system-x86
92403 root 20 0 4825240 62916 3308 S 7.3 0.0 21040:53 qemu-system-x86
100018 root 20 0 3471696 5216 2616 S 7.0 0.0 8384:57 logd

# htop

show live virtual memory usage
show stats every 2s; actually, it also shows io, system, and cpu as well
$ vmstat -n 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 4 85900 4185664 4396 22772168 0 0 1 91 0 0 15 3 82 0 0
6 0 85900 4181612 4396 22772580 0 0 64 479 55060 82145 8 1 91 0 0
8 0 85900 4184968 4396 22772636 0 0 96 98 56364 87759 8 1 91 0 0
8 0 85900 4183828 4396 22772936 0 0 96 152 58835 88482 9 1 90 0 0
6 0 85900 4180524 4396 22772920 0 0 0 320 58749 94072 9 1 90 0 0
5 0 85900 4184580 4396 22773588 0 0 0 234 67631 111630 9 2 89 0 0

show io statistics, most often used to find which disk has high io await.
the io wait of the whole system (96.0%wa):
# top
top - 14:31:20 up 35 min, 4 users, load average: 2.25, 1.74, 1.68
Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.3%us, 1.7%sy, 0.0%ni, 0.0%id, 96.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 245440k total, 241004k used, 4436k free, 496k buffers
Swap: 409596k total, 5436k used, 404160k free, 182812k cached

show iostat every 10s for each block device (check which block device has high io wait)
# sar -d 5
# iostat -txz 10
Linux 3.10.0-693.21.4.el7.x86_64 (A01-R15-I124-40-CCK4HP2.JCLOUD.COM) 10/22/2021 _x86_64_ (64 CPU)

10/22/2021 05:14:10 PM
avg-cpu: %user %nice %system %iowait %steal %idle
15.31 0.06 2.89 0.01 0.00 81.73

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.06 1.06 47.46 43.69 4897.74 203.70 0.01 0.25 0.26 0.25 0.06 0.29
sda 0.00 0.03 0.00 0.73 0.07 7.51 20.56 0.01 19.20 6.80 19.27 3.83 0.28
nb100 0.00 0.01 0.29 7.33 36.40 545.36 152.62 0.02 3.25 9.77 2.99 1.07 0.82
nb101 0.00 0.00 0.00 1.76 0.03 14.69 16.69 0.00 0.94 1.29 0.94 0.29 0.05
nb102 0.00 0.00 0.00 6.72 0.01 132.59 39.46 0.01 1.18 0.70 1.18 0.28 0.19
nb103 0.00 0.00 0.00 0.55 0.02 7.14 25.97 0.00 0.86 0.54 0.86 0.46 0.03
nb104 0.00 0.00 0.00 0.31 0.01 10.68 68.74 0.00 1.45 0.50 1.45 0.43 0.01
nb105 0.00 0.00 0.00 1.29 0.02 78.13 121.00 0.00 3.56 0.62 3.56 0.50 0.06
nb106 0.00 0.00 0.01 0.71 0.83 40.81 116.51 0.00 1.19 0.57 1.19 0.69 0.05
nb107 0.00 0.00 0.00 0.17 0.01 1.37 16.55 0.00 1.21 8.04 1.20 0.39 0.01
nb108 0.00 0.00 0.00 0.17 0.00 1.29 15.10 0.00 0.90 0.53 0.90 0.44 0.01
nb109 0.00 0.00 0.00 0.00 0.00 0.04 60.20 0.00 42.15 0.42 43.57 0.46 0.00

show io per process, to find which process is doing heavy io
#iotop
Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
17391 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % [kworker/6:0]
16896 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-0fe83e4c-ccd4-49f4-ae7e-4b07fabb2dc3.json
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % systemd --switched-root --system --deserialize 22
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
6 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]

show interface statistics (per-interface counters)
# ifstat
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
lo 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
enp0s3 8 0 6 0 560 0 1424 0
0 0 0 0 0 0 0 0
docker0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
vethbedf2bf 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

# live stats on each interface
#iftop

# live stats on each process which has network io
# nethogs
NetHogs version 0.8.5

PID USER PROGRAM DEV SENT RECEIVED
13337 root sshd: root@pts/2 enp0s3 0.218 0.186 KB/sec
? root unknown TCP 0.000 0.000 KB/sec
TOTAL 0.218 0.186 KB/sec

show details about an interface: config, stats, driver info, etc.
# ethtool -h
ethtool -g|--show-ring DEVNAME Query RX/TX ring parameters
ethtool -k|--show-features|--show-offload DEVNAME Get state of protocol offload and other features
ethtool -i|--driver DEVNAME Show driver information
ethtool -S|--statistics DEVNAME Show adapter statistics
ethtool -n|-u|--show-nfc|--show-ntuple DEVNAME Show Rx network flow classification options or rules
ethtool -x|--show-rxfh-indir|--show-rxfh DEVNAME Show Rx flow hash indirection and/or hash key

show power management
show power used by each process live
# powertop
PowerTOP v2.9 Overview Idle stats Frequency stats Device stats Tunables
Summary: 72.3 wakeups/second, 0.0 GPU ops/seconds, 0.0 VFS ops/sec and 0.3% CPU use

Usage Events/s Category Description
122.3 µs/s 20.0 Process [PID 460] [xfsaild/dm-0]
72.8 µs/s 9.5 Timer tick_sched_timer
116.2 µs/s 6.7 Timer hrtimer_wakeup
93.0 µs/s 5.7 Process [PID 1084] /usr/bin/containerd
63.7 µs/s 5.7 Process [PID 9] [rcu_sched]
632.7 µs/s 4.8 Process [PID 1049] /home/data/Anaconda3/bin/python /home/data/Anaconda3/bin/jupyter-notebook -y --no-browser --allow-root --ip=10.0.2.1
61.6 µs/s 4.8 Process [PID 1082] /usr/bin/containerd
34.9 µs/s 2.9 Interrupt [3] net_rx(softirq)
183.8 µs/s 1.9 Interrupt [7] sched(softirq)
239.2 µs/s 1.0 kWork e1000_watchdog

Benchmark tools

for benchmarking basic OS operations (syscalls, context switches, memory latency/bandwidth)
#apt-get install lmbench

Layer 4 throughput: use NetPerf or iPerf, two open-source network benchmark tools that support both UDP and TCP. Each tool also provides other information:
NetPerf, for example, provides end-to-end latency tests (round-trip times, RTT) and is a good replacement for ping.

iPerf provides packet loss and delay jitter, useful for troubleshooting network performance.

for the network, test the path between client (netperf) and server (netserver)

server side
#netserver

client side, testing for 300s (use -l 0 to never stop)
#netperf -H $server -l 300 -t TCP_STREAM

server side
#iperf3 --server --interval 30
client side
#iperf3 --client $server --time 300 --interval 30

check bottleneck, call graph

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Don't use gprof: it dates from the 1980s.

OProfile is also old (releasing about one version a year since 2002). It uses the same
kernel backend as 'perf', so it can give almost the same output, but the community
recommends 'perf' and intends it to replace OProfile.

gperftools is newer (since 2007, developed by Google); it is simpler and profiles
only from a per-process view.

perf has been in the kernel source tree (upstream) since 2009; it is more complex and
can show information from a system-wide view. It uses hardware counters to profile the
application, so its results are precise, and because it does not instrument the code
it is fast.

gperftools and perf are the two good choices nowadays.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

gperftools ("Google Performance Tools", which is also the package name)

gperftools is a collection of a high-performance multi-threaded malloc() implementation plus some pretty nifty performance analysis tools.


Ubuntu18
#apt-get install google-perftools graphviz libgoogle-perftools-dev
Centos7
$ yum install -y pprof gperftools-devel

Usage
CPUPROFILE (find which functions or lines consume the most CPU time); note: it does NOT track forked children!

gperftools provides tcmalloc, a heap checker, a heap profiler and a CPU profiler
(the heap checker and heap profiler are in '-ltcmalloc',
the CPU profiler in '-lprofiler';
google-pprof is used to analyze the profile file)

There are two ways to use gperftools: compile it into your program, or use
LD_PRELOAD and set environment variables (always prefer the first way!)

Generate profile file
a. Compile it within your program

#include <gperftools/profiler.h>
#include <stdio.h>
#include <stdlib.h>

void func1() {
    int i = 0;
    while (i < 100000) {
        ++i;
    }
}

void func2() {
    int i = 0;
    while (i < 200000) {
        ++i;
    }
}

void func3() {
    int i = 0;
    for (i = 0; i < 1000; ++i) {
        func1();
        func2();
    }
}

int main() {
    ProfilerStart("my.prof");
    func3();
    ProfilerStop();
    return 0;
}

# gcc -o test test.c -g -Wall -lprofiler

Compiled this way, the CPU profiler always runs (ProfilerStart/ProfilerStop control it; no environment switch is needed)
#CPUPROFILE_FREQUENCY=100 ./test
(100 samples per second, the default value)


b. Use LD_PRELOAD (not recommended!); no recompiling needed
#export LD_PRELOAD=/usr/lib64/libprofiler.so

/*turn on cpu profile during whole life*/
#env CPUPROFILE=my.prof ./test

-------------------------------------------------------------------------
| For a daemon process, run it in the foreground (not daemonized)       |
| while profiling.                                                      |
-------------------------------------------------------------------------

---------------------------------------------------------------------------
Analyze the profile file (pay attention to the first three columns)
to see which functions or lines consume the most CPU time.
---------------------------------------------------------------------------
See which function takes the most time
root@ubuntu:~# google-pprof --text ./test my.prof
OR
root@centos:~# pprof --text ./test my.prof

Using local file ./test. (test is the program)
Using local file my.prof.(my.prof is the data collected before)
Removing killpg from all stack traces.
Total: 71 samples
53 74.6% 74.6% 53 74.6% func2
18 25.4% 100.0% 18 25.4% func1
0 0.0% 100.0% 71 100.0% __libc_start_main
0 0.0% 100.0% 71 100.0% _start
0 0.0% 100.0% 71 100.0% func3
0 0.0% 100.0% 71 100.0% main

column meanings
1. Number of profiling samples in this function
2. Percentage of profiling samples in this function
3. Percentage of profiling samples in the functions printed so far
4. Number of profiling samples in this function and its callees
5. Percentage of profiling samples in this function and its callees
6. Function name


If you run perf instead, from a system-wide view, you get

$perf record ./test
$perf report
65.73% test test [.] func2
33.68% test test [.] func1
0.16% test [kernel.vmlinux] [k] native_write_msr_safe
0.06% test [kernel.vmlinux] [k] x86_pmu_enable
0.05% test [kernel.vmlinux] [k] __intel_pmu_disable_all
0.05% test libc-2.17.so [.] __GI___dl_iterate_phdr
0.00% test [kernel.vmlinux] [k] __do_page_fault
0.00% test libc-2.17.so [.] __memset_sse2
0.00% test [kernel.vmlinux] [k] lapic_next_deadline


To see which line takes the most time, build test with -g
root@ubuntu:~#google-pprof --lines --text ./test my.prof
OR
root@centos:~#pprof --lines --text ./test my.prof

Using local file ./test.
Using local file my.prof.
Removing killpg from all stack traces.
Total: 71 samples
37 52.1% 52.1% 37 52.1% func2 /root/test.c:12 (discriminator 1)
22 31.0% 83.1% 22 31.0% func1 /root/test.c:6 (discriminator 1)
11 15.5% 98.6% 13 18.3% func2 /root/test.c:13
1 1.4% 100.0% 1 1.4% func1 /root/test.c:7
0 0.0% 100.0% 71 100.0% __libc_start_main /build/eglibc-3GlaMS/eglibc-2.19/csu/libc-start.c:287
0 0.0% 100.0% 71 100.0% _start ??:?
0 0.0% 100.0% 1 1.4% func1 /root/test.c:6
0 0.0% 100.0% 11 15.5% func2 /root/test.c:12
0 0.0% 100.0% 23 32.4% func3 /root/test.c:19 (discriminator 2)
0 0.0% 100.0% 48 67.6% func3 /root/test.c:20 (discriminator 2)
0 0.0% 100.0% 71 100.0% main /root/test.c:25
(--text, --pdf, --web, --dot, --gif, --gv etc )
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Below seems not to work for nginx; unclear why.

TCMALLOC (thread-caching malloc): your code keeps calling malloc/free as usual,
and tcmalloc manages the memory for you.

tcmalloc implements a cache/pool, so memory is handed out quickly from the cache.
As allocation pressure grows, tcmalloc takes more memory from the system; when
pressure decreases it returns memory to the system, at a rate controlled by
TCMALLOC_RELEASE_RATE.
Usage
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#gcc -o test test.c -ltcmalloc_minimal
(in your program, use malloc, free as you did before)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HEAP CHECKER (checks for memory leaks; does not always work well, reason unknown)
#gcc -o test test.c -ltcmalloc
#HEAPCHECK=normal ./test


HEAP Profile(check where/who alloc memory)

#gcc -o test test.c -ltcmalloc

dump a heap profile each time another 1 MiB has been allocated (malloc allocations only)
#HEAPPROFILE=heap.prof HEAP_PROFILE_ALLOCATION_INTERVAL=1048576 ./test
(the interval must be a plain number of bytes; the shell does not evaluate 1024*1024)

also count allocations made via sbrk and mmap
#HEAPPROFILE=heap.prof HEAP_PROFILE_MMAP=true HEAP_PROFILE_ALLOCATION_INTERVAL=1048576 ./test

root@ubuntu:~#google-pprof --gv test test.0004.heap
root@centos:~#pprof --gv test test.0004.heap

Ref