linux-numa

Overview

A NUMA architecture is a hardware design that separates a system's cores into multiple clusters, where each cluster has its own local memory region while cores in any cluster can still access all memory in the system. However, when a processor needs memory outside its own local region, accessing that (remote) memory takes longer. For applications where performance is crucial, avoiding the need to access memory from other clusters is critical.

  • A socket refers to the physical location where a processor package plugs into a motherboard; the processor package that plugs into it is also commonly called a socket.
  • A core is an individual execution unit within a processor that can independently execute a software thread and maintains its execution state separately from that of the other cores in the processor.
  • A thread refers to a hardware thread execution capability. For example, the Intel Xeon 7560 has eight cores, each of which has hardware that can effectively execute two software threads simultaneously, yielding 16 hardware threads.
$ lscpu 
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 128
Socket(s): 1
NUMA node(s): 2
Vendor ID: ARM
...

NUMA topology

From the hardware perspective, a NUMA system is a computer platform that comprises multiple components or assemblies, each of which may contain zero or more CPUs, local memory, and/or IO buses. We'll call these components/assemblies 'cells' in what follows.

The cells of the NUMA system are connected together with some sort of system interconnect–e.g., a crossbar or point-to-point link are common types of NUMA system interconnects. Both of these types of interconnects can be aggregated to create NUMA platforms with cells at multiple distances from other cells.

Memory access time and effective memory bandwidth varies depending on how far away the cell containing the CPU or IO bus making the memory access is from the cell containing the target memory.

Linux divides the system’s hardware resources into multiple software abstractions called “nodes”. Linux maps the nodes onto the physical cells of the hardware platform, abstracting away some of the details for some architectures. As with physical cells, software nodes may contain 0 or more CPUs, memory and/or IO buses. And, again, memory accesses to memory on “closer” nodes–nodes that map to closer cells–will generally experience faster access times and higher effective bandwidth than accesses to more remote cells.

For each node with memory, Linux constructs an independent memory management subsystem, complete with its own free page lists, in-use page lists, usage statistics and locks to mediate access. In addition, Linux constructs for each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], an ordered “zonelist”. A zonelist specifies the zones/nodes to visit when a selected zone/node cannot satisfy the allocation request. This situation, when a zone has no available memory to satisfy a request, is called “overflow” or “fallback”.

By default, Linux will attempt to satisfy memory allocation requests from the node to which the CPU that executes the request is assigned. Specifically, Linux will attempt to allocate from the first node in the appropriate zonelist for the node where the request originates. This is called “local allocation.” If the “local” node cannot satisfy the request, the kernel will examine other nodes’ zones in the selected zonelist looking for the first zone in the list that can satisfy the request.
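
The per-node memory management structures and inter-node distances described above can be inspected directly; a minimal sketch using standard procfs/sysfs paths (output omitted, will vary per system):

# per-node, per-zone free page lists used by the buddy allocator
$ cat /proc/buddyinfo

# relative access cost from node 0 to every node (10 = local node)
$ cat /sys/devices/system/node/node0/distance

# per-node version of /proc/meminfo
$ cat /sys/devices/system/node/node0/meminfo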

NUMA

CPU

Each CPU (socket) is assigned its own local memory and can also access the memory attached to other CPUs in the system.
Each processor contains many cores that share an on-chip cache and off-chip memory, so memory access costs vary across different parts of the memory within a server.

Memory

In Non-Uniform Memory Access (NUMA), system memory is divided into zones (called nodes), which are allocated to particular CPUs or sockets. Access to memory that is local to a CPU is faster than memory connected to remote CPUs on that system.

Memory allocation policies for NUMA systems

  • Default (local allocation): This mode specifies that any nondefault thread memory policy be removed, so that the memory policy "falls back" to the system default policy. The system default policy is "local allocation", that is, allocate memory on the node of the CPU that triggered the allocation. nodemask must be specified as NULL. If the "local node" contains no free memory, the system will attempt to allocate memory from a "nearby" node.

  • Bind: This mode defines a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with the lowest numeric node ID first, until that node contains no free memory. Allocations will then come from the node with the next highest node ID specified in nodemask and so forth, until none of the specified nodes contain free memory. Pages will not be allocated from any node not specified in the nodemask.

  • Interleave: This mode interleaves page allocations across the nodes specified in nodemask in numeric node ID order. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. However, accesses to a single page will still be limited to the memory bandwidth of a single node.

  • Preferred: This mode sets the preferred node for allocation. The kernel will try to allocate pages from this node first and fall back to “near by” nodes if the preferred node is low on free memory. If nodemask specifies more than one node ID, the first node in the mask will be selected as the preferred node. If the nodemask and maxnode arguments specify the empty set, then the policy specifies “local allocation” (like the system default policy discussed above).
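
Whichever policy is in effect, it can be observed per process: the kernel tags every memory mapping with its NUMA policy in /proc/<pid>/numa_maps. A small sketch, assuming $pid holds the target process ID:

# each line shows one mapping: its start address, its policy
# (default, bind:<nodes>, interleave:<nodes>, prefer:<node>),
# and per-node page counts such as N0=<pages> N1=<pages>
$ head -5 /proc/$pid/numa_maps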

Debug

Taskset

The taskset command is considered the most portable Linux way of setting or retrieving the CPU affinity (binding) of a running process (thread). It only sets the CPU affinity of a process; it does not touch memory allocation (a quick way to verify the resulting affinity is shown after the examples below).

# start process on given cpu
$ taskset -c 0 ./app

# change process to run on given cpu
$ taskset -p -c 0 $pid
$ taskset -c 0 -p $pid # error!!!

$ taskset -p -c 0,2 $pid

# get process affinity
$ taskset -cp $pid
$ taskset -p $pid
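
The affinity set by taskset can also be cross-checked against what the kernel reports in /proc: Cpus_allowed_list shows the CPU affinity, while Mems_allowed_list shows which NUMA nodes the process may allocate memory from (which taskset does not change). A small sketch:

# CPU affinity as seen by the kernel
$ grep Cpus_allowed_list /proc/$pid/status

# memory nodes the process may allocate from (untouched by taskset)
$ grep Mems_allowed_list /proc/$pid/status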

numactl

numactl can be used to control the NUMA policy for processes, shared memory, or both. One key thing about numactl is that, unlike taskset, you can’t use it to change the policy of a running application.

$ yum install -y numactl

# show node topology and free/total memory of each NUMA node
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 65442 MB
node 0 free: 1903 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 65536 MB
node 1 free: 17423 MB
node distances:
node 0 1
0: 10 21
1: 21 10

# show policy
$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1
nodebind: 0 1
membind: 0 1
  • --interleave=<nodes> policy has the application allocate memory in a round-robin fashion on "nodes." With only two NUMA nodes, this means memory will be allocated first on node 0, then node 1, then node 0, and so on. If the allocation cannot be satisfied on the current interleave target (node x), it falls back to other nodes, still in round-robin fashion. You can control which nodes are used for memory interleaving or use them all (a combined usage example follows this list).

    --interleave=0,1 or --interleave=all

  • --membind=<nodes> policy forces memory to be allocated from the list of provided nodes

    --membind=0 or --membind=all

  • --preferred=<node> policy causes memory allocation on the node you specify, but if it can’t, it will fall back to using memory from other nodes.

    --preferred=1

  • --localalloc policy forces allocation of memory on the current node

    --localalloc

  • --cpunodebind=<nodes> option causes processes to run only on the CPUs of the specified node(s)

    --cpunodebind=0

  • --physcpubind=<CPUs> policy executes the process(es) on the list of CPUs provided

    --physcpubind=+0-4,8-12
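
These options are commonly combined so that a process runs on one node's CPUs and allocates only from that node's memory. A short sketch (./app is a placeholder for your workload):

# keep CPU and memory on the same node (pure local access)
$ numactl --cpunodebind=0 --membind=0 ./app

# spread allocations across all nodes for bandwidth-bound workloads
$ numactl --interleave=all ./app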

numastat

The numastat tool is provided by the numactl package, and displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis. The default tracking categories for the numastat command are outlined as follows:

  • numa_hit

    The number of pages that were successfully allocated to this node.

  • numa_miss

    The number of pages that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node.

  • numa_foreign

    The number of pages initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node.

  • interleave_hit

    The number of interleave policy pages successfully allocated to this node.

  • local_node

    The number of pages successfully allocated on this node, by a process on this node.

  • other_node

    The number of pages allocated on this node, by a process on another node.
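
The categories above are aggregated from per-node kernel counters; the raw values (event counts, not MB) can be read straight from sysfs, for example:

# raw counters behind numastat: numa_hit, numa_miss, numa_foreign,
# interleave_hit, local_node, other_node
$ cat /sys/devices/system/node/node0/numastat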

Options

  • -m Displays system-wide memory usage information on a per-node basis, similar to the information found in /proc/meminfo
  • -p pattern Displays per-node memory information for the specified pattern. If the value for pattern consists of digits, numastat assumes that it is a numerical process identifier
  • -s Sorts the displayed data in descending order so that the biggest memory consumers (according to the total column) are listed first
  • -v Displays more verbose information. Namely, process information for multiple processes will display detailed information for each process
# system-wide memory usage of each NUMA node
$ numastat -m
# more verbose system-wide view
$ numastat -mv

Per-node system memory usage (in MBs):
Node 0 Node 1 Total
--------------- --------------- ---------------
MemTotal 65442.34 65536.00 130978.34
MemFree 2073.16 17484.08 19557.24
MemUsed 63369.18 48051.92 111421.10
Active 28696.10 29089.88 57785.98
Inactive 31393.39 15842.16 47235.55
Active(anon) 1541.27 1925.21 3466.48
Inactive(anon) 9547.17 1200.94 10748.11
Active(file) 27154.82 27164.67 54319.50
Inactive(file) 21846.21 14641.22 36487.43
Unevictable 0.00 0.00 0.00
Mlocked 0.00 0.00 0.00
Dirty 0.13 0.02 0.15
Writeback 0.00 0.00 0.00
FilePages 58718.86 43096.30 101815.16
Mapped 101.45 222.64 324.10
AnonPages 1370.99 1835.88 3206.87
Shmem 9712.72 1288.01 11000.73
KernelStack 9.66 8.83 18.48
PageTables 18.27 18.41 36.68
NFS_Unstable 0.00 0.00 0.00
Bounce 0.00 0.00 0.00
WritebackTmp 0.00 0.00 0.00
Slab 1441.85 1498.11 2939.95
SReclaimable 1312.22 1387.39 2699.61
SUnreclaim 129.62 110.72 240.34
AnonHugePages 140.00 1238.00 1378.00
HugePages_Total 0.00 0.00 0.00
HugePages_Free 0.00 0.00 0.00
HugePages_Surp 0.00 0.00 0.00

# compact system-wide view of per-node numastat info
$ numastat -c

Per-node numastat info (in MBs):
Node 0 Node 1 Total
---------- ---------- -----------
Numa_Hit 6889434917 6552564429 13441999346
Numa_Miss 19506824 18047982 37554806
Numa_Foreign 18047982 19506824 37554806
Interleave_Hit 232 230 462
Local_Node 6889391241 6552564755 13441955995
Other_Node 19550500 18047656 37598156

# show NUMA memory usage of processes whose command matches qemu-kvm
$ numastat -p qemu-kvm
$ numastat -p $pid

# sort by total
$ numastat -s -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
116345 (qemu-kvm) 73.78 47.09 120.87
37545 (qemu-kvm) 3.90 106.01 109.91
117212 (qemu-kvm) 74.95 16.20 91.15
114870 (qemu-kvm) 74.74 8.19 82.93
20080 (qemu-kvm) 5.45 76.87 82.32
134180 (qemu-kvm) 59.51 22.23 81.73
131889 (qemu-kvm) 76.20 4.18 80.38
50070 (qemu-kvm) 50.39 26.17 76.57
60596 (qemu-kvm) 22.01 50.14 72.14
16097 (qemu-kvm) 68.00 4.11 72.12
131511 (qemu-kvm) 44.75 26.11 70.87
----------------- --------------- --------------- ---------------
Total 553.68 387.30 940.98

# more detail for each matching process
$ numastat -v -p qemu-kvm
Per-node process memory usage (in MBs) for PID 16097 (qemu-kvm)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 4.00 0.00 4.00
Heap 45.83 0.00 45.83
Stack 0.04 0.00 0.04
Private 18.14 4.11 22.25
---------------- --------------- --------------- ---------------
Total 68.00 4.11 72.12

Per-node process memory usage (in MBs) for PID 20080 (qemu-kvm)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 0.00 4.00 4.00
Heap 2.00 54.64 56.64
Stack 0.00 0.04 0.04
Private 3.44 18.20 21.64
---------------- --------------- --------------- ---------------
Total 5.45 76.88 82.32
...

numad

numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management, periodically scanning all processes on the system.

Note that when numad is enabled, its behavior overrides the default behavior of automatic NUMA balancing performed by the kernel scheduler.

$ yum install numad
$ service numad start

# log file
$ ls /var/log/numad.log
/var/log/numad.log

$ cat /etc/numad.conf
# Config file for numad
#
# Default INTERVAL is 15
# modify below to change it
INTERVAL=15
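
Because numad and the scheduler-based automatic NUMA balancing mentioned above interact, it is worth checking whether balancing is enabled and, if desired, turning it off while numad manages placement. A small sketch using the standard kernel.numa_balancing sysctl:

# 1 = automatic NUMA balancing enabled, 0 = disabled
$ cat /proc/sys/kernel/numa_balancing

# disable automatic NUMA balancing (run as root)
$ sysctl -w kernel.numa_balancing=0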

lstopo

lstopo and lstopo-no-graphics are capable of displaying a topological map of the system in a variety of different output formats. The only difference between lstopo and lstopo-no-graphics is that graphical outputs are only supported by lstopo, to reduce dependencies on external libraries. hwloc-ls is identical to lstopo-no-graphics.

$ lstopo-no-graphics
$ lstopo-no-graphics -.ascii
$ lstopo topo.png
