docker-core-tech-cgroups

cgroups

cgroups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behavior (implemented by resource controllers), such as
  • Resource limiting (e.g. not to exceed a memory limit) - the most common use
  • Prioritization (e.g. some groups may get a larger share of CPU)
  • Isolation (e.g. isolate GPUs for particular processes)
  • Accounting (e.g. monitor resource usage of processes)
  • Control (e.g. suspending and resuming processes)

Definitions:

  • A cgroup associates a set of tasks with a set of parameters (resource controls) for one
    or more subsystems.

  • A subsystem is a module that makes use of the task grouping
    facilities provided by cgroups to treat groups of tasks in
    particular ways. A subsystem is typically a “resource controller” that
    schedules a resource or applies per-cgroup limits, but it may be
    anything that wants to act on a group of processes, e.g. a
    virtualization subsystem.

  • A hierarchy is a set of cgroups arranged in a tree, such that
    every task in the system is in exactly one of the cgroups in the
    hierarchy, and a set of subsystems (e.g. CPU, memory); each subsystem
    has system-specific state attached to each cgroup in the hierarchy.
    Each hierarchy has an instance of the cgroup virtual filesystem
    associated with it.

cgroup pseudo filesystem

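cgroups are exposed to user space through this cgroup pseudo (virtual) filesystem: in v1 each hierarchy is a separate mount of filesystem type cgroup, typically one per subsystem under /sys/fs/cgroup on a systemd host. A quick way to look at them (a sketch; paths may differ per distribution):

$ mount -t cgroup            # list the mounted cgroup v1 hierarchies, one per subsystem group
$ grep cgroup /proc/mounts   # same information, read directly from the kernel
$ ls /sys/fs/cgroup/         # on a typical systemd host: one directory per subsystem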

Here are the main cgroup subsystems (controllers) supported by Linux

  • cpuset – assigns tasks in a cgroup to individual CPUs and memory nodes
  • cpu – schedules CPU access for cgroups (how much CPU and for how long)
  • cpuacct – reports CPU resource usage of the tasks in a cgroup
  • memory – sets limits on memory use and reports memory usage for a cgroup (normal pages, not huge pages)
  • hugetlb – allows the use of large (huge) virtual memory pages and enforces resource limits on them
  • devices – allows or denies access to devices (e.g. GPUs) for the tasks in a cgroup
  • freezer – suspends and resumes the tasks in a cgroup
  • net_cls – tags network packets from a cgroup so that network traffic can be prioritized
  • blkio – tracks I/O ownership, allowing control of access to block I/O resources
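To see which of these controllers the running kernel actually supports, the kernel exports /proc/cgroups; lssubsys (from libcgroup, package name varies by distribution) additionally shows where each one is mounted:

$ cat /proc/cgroups    # columns: subsys_name, hierarchy id, num_cgroups, enabled
$ lssubsys -am         # each subsystem and its mount point (requires the libcgroup tools)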

Rules for cgroups

  • Each subsystem (memory, CPU…) has a hierarchy (tree)
  • Hierarchies are independent
    • (the trees for e.g. memory and CPU can be different)
  • Each process belongs to exactly 1 node in each hierarchy
    • (think of each hierarchy as a different dimension or axis)
  • Each hierarchy (subsystem) starts with 1 node (the root)
# for the cpu subsystem the root node is at /sys/fs/cgroup/cpu
# tasks        # attach a task (thread) here and list attached threads
# cgroup.procs # attach a process here and list attached processes

$ ls /sys/fs/cgroup/cpu/
cgroup.clone_children cpuacct.stat cpuacct.usage_percpu cpuacct.usage_sys cpu.cfs_quota_us docker/ release_agent user.slice/
cgroup.procs cpuacct.usage cpuacct.usage_percpu_sys cpuacct.usage_user cpu.shares init.scope/ system.slice/
cgroup.sane_behavior cpuacct.usage_all cpuacct.usage_percpu_user cpu.cfs_period_us cpu.stat notify_on_release tasks
  • All processes initially belong to the root of each hierarchy
  • Each node = a group of processes (sharing the same resource settings)

NOTE

  • All subdirectories inherit the settings of their parent, and their total resource usage must not exceed the parent's limit, even though tasks in a subdirectory do not appear in the parent's tasks file; see the example below.
# NOTE: remove a cgroup directory with rmdir, not rm -rf
$ mkdir /sys/fs/cgroup/cpu/agent
# quota = 2 * period => tasks in agent may use up to two CPUs in total
$ echo 100000 >/sys/fs/cgroup/cpu/agent/cpu.cfs_period_us
$ echo 200000 >/sys/fs/cgroup/cpu/agent/cpu.cfs_quota_us

$ cat /sys/fs/cgroup/cpu/agent/tasks
# empty: tasks attached to the subdirectories do not show up in the parent's tasks file

$ mkdir /sys/fs/cgroup/cpu/agent/sub1
$ cat /sys/fs/cgroup/cpu/agent/sub1/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/agent/sub1/cpu.cfs_quota_us
-1
$ cat /sys/fs/cgroup/cpu/agent/sub1/tasks
9877

$ mkdir /sys/fs/cgroup/cpu/agent/sub2
$ cat /sys/fs/cgroup/cpu/agent/sub2/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/agent/sub2/cpu.cfs_quota_us
$ cat /sys/fs/cgroup/cpu/agent/sub2/tasks
9853

cpu subsystem

Sysfs path: /sys/fs/cgroup/cpu. It exposes CPU state (usage), weight and limits for the tasks attached to a group. Here are several key parameters of this subsystem (a hands-on sketch follows the list).

  • cpu.shares (relative weight compared to other cpu cgroups)

    A relative weight against other cgroups in the cpu subsystem when competing for host CPU cycles; the default is 1024. It only matters when other cgroups have runnable tasks competing for host CPU cycles: say cgroupA holds taskA with the default cpu.shares and cgroupB holds taskB with cpu.shares 2048; while both tasks are running, taskB gets twice as many host CPU cycles as taskA, but if taskB sleeps or exits, taskA gets all the host CPU cycles because nothing is competing with it.
    Because it is a relative weight, it does not guarantee or reserve any specific amount of CPU!

  • cpu.cfs_period_us

    specifies a period of time in microseconds (not milliseconds, hence the “us”) for how regularly a cgroup’s access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of the cpu.cfs_period_us parameter is 1 second (1,000,000 microseconds) and the lower limit is 1000 microseconds (1 millisecond).

  • cpu.cfs_quota_us and cpu.cfs_period_us (hard limit for tasks in the group)

    The CPU quota controls how much CPU time the tasks in a cgroup can use during one period: cpu.cfs_quota_us is the number of microseconds of CPU time allowed per period, and cpu.cfs_period_us is the accounting period that the quota is counted against. Together they set the upper bound (max CPU) for the cgroup, i.e. for all tasks in it. By default the period is 100000 us and the quota is -1 (no limit). If the quota is 100000 (equal to the period), the tasks in the group can use the equivalent of one host CPU; if the quota is 200000, twice the period, they can use the equivalent of two host CPUs.
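A minimal hands-on sketch of these parameters, assuming the cpu hierarchy is mounted at /sys/fs/cgroup/cpu; the group name demo is made up for illustration:

# create a group; the control files appear automatically
$ mkdir /sys/fs/cgroup/cpu/demo
# relative weight: twice the default 1024 when competing with sibling groups
$ echo 2048 > /sys/fs/cgroup/cpu/demo/cpu.shares
# hard cap: 150000 us of CPU time per 100000 us period => at most 1.5 CPUs
$ echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
$ echo 150000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
# attach the current shell (and its future children) to the group
$ echo $$ > /sys/fs/cgroup/cpu/demo/cgroup.procs
# how often and for how long the group has been throttled so far
$ cat /sys/fs/cgroup/cpu/demo/cpu.stat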

NOTE

  • A cgroup controls resources for all tasks in the group together: for the CPU quota and period, the CPU time of all tasks counts against the same budget, not a per-task budget.
  • A child process forked from a task in the cgroup obeys the same limits as its parent (the child is added to the parent's cgroup automatically).
  • Even if the quota allows two CPUs, a single task that is not multi-threaded can use at most one CPU, as the sketch below shows.
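A quick way to see the last point, reusing the agent group created earlier (quota 200000 over a 100000 period, i.e. two CPUs):

$ cat /sys/fs/cgroup/cpu/agent/cpu.cfs_quota_us
200000
# put the current shell into the group, then start one single-threaded busy loop
$ echo $$ > /sys/fs/cgroup/cpu/agent/cgroup.procs
$ while :; do :; done &
$ top -p $!     # hovers around 100% (one core), never 200%: one thread, one CPU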

cpu.shares and cpu.cfs_quota_us / cpu.cfs_period_us can work together!

Assume the host has two CPUs, both taskA and taskB are always runnable, and no other task is running (the ideal case).

  • taskA in groupA with share 1024, 1 CPU (cpu.cfs_period_us 100000, cpu.cfs_quota_us 100000)
  • taskB in groupB with share 2048, 0.5 CPU (cpu.cfs_period_us 100000, cpu.cfs_quota_us 50000)
  • For each period the host provides 2*100000 us of CPU time. By the share setting taskB could get 0.67 of it, but taskB has a hard limit of 0.5 CPU, so once taskB has used its 0.5 CPU it is throttled and taskA continues to run. After one period taskA has used 1 CPU and taskB 0.5 CPU (the remaining 0.5 CPU is idle), although taskB reaches its 0.5 CPU before taskA reaches its own first 0.5 CPU. A setup sketch follows this list.
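A sketch of how this scenario could be set up by hand; the group names and the $pidA/$pidB variables are placeholders for the two tasks:

# groupA: default weight, hard cap of one CPU
$ mkdir /sys/fs/cgroup/cpu/groupA
$ echo 1024 > /sys/fs/cgroup/cpu/groupA/cpu.shares
$ echo 100000 > /sys/fs/cgroup/cpu/groupA/cpu.cfs_period_us
$ echo 100000 > /sys/fs/cgroup/cpu/groupA/cpu.cfs_quota_us
$ echo $pidA > /sys/fs/cgroup/cpu/groupA/cgroup.procs

# groupB: twice the weight, but hard capped at half a CPU
$ mkdir /sys/fs/cgroup/cpu/groupB
$ echo 2048 > /sys/fs/cgroup/cpu/groupB/cpu.shares
$ echo 100000 > /sys/fs/cgroup/cpu/groupB/cpu.cfs_period_us
$ echo 50000 > /sys/fs/cgroup/cpu/groupB/cpu.cfs_quota_us
$ echo $pidB > /sys/fs/cgroup/cpu/groupB/cgroup.procs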

warn

  • taskA in groupA with share 1024, 1 CPU (cpu.cfs_period_us 100000, cpu.cfs_quota_us 100000)
  • taskB in groupB with share 1024, 0.5 CPU (cpu.cfs_period_us 100000, cpu.cfs_quota_us 50000)
  • For each period the host provides 2*100000 us of CPU time. taskA and taskB accumulate their first 0.5 CPU at the same rate; then taskB is throttled and taskA continues to run. After one period taskA has used 1 CPU and taskB 0.5 CPU (the remaining 0.5 CPU is idle).

related docker parameters

# say the host has four CPUs and all three containers run CPU-intensive workloads:
# the first container gets one CPU
# the second container gets one CPU
# the third container gets two CPUs

# but if the third container exits or sleeps, the other two each get two CPUs!
$ docker run -it --rm --cpu-shares 1024 ubuntu /bin/bash
$ docker run -it --rm --cpu-shares 1024 ubuntu /bin/bash
$ docker run -it --rm --cpu-shares 2048 ubuntu /bin/bash

# --cpu-period & --cpu-quota
# CPU quota (cpu_quota) is a feature of Linux control groups (cgroups). The CPU quota controls `how much CPU time a container can use`: `cpu_quota is the number of microseconds of CPU time` a container can use `per cpu_period`.

# cpu_quota sets an `upper bound on the amount of CPU time a container gets`. Linux enforces the limit even if CPU time is available. Quotas can hinder utilization while `providing a predictable upper bound on CPU time`.

# one host CPU for this container
$ docker run -it --rm --cpu-period 100000 --cpu-quota 100000 ubuntu /bin/bash

# --cpus=<value>
# Specifies how much of the available CPU resources a container can use. For instance, if the host machine has two CPUs and you set --cpus="1.5", the container is guaranteed at most one and a half of the CPUs. This is the equivalent of setting --cpu-period="100000" and --cpu-quota="150000"; it is the short form of cpu-period and cpu-quota.

# To set cpu_quota correctly you need to know how many CPUs the container may use, then set cpu_quota and cpu_period accordingly.
$ docker run -it --rm --cpus=1.5 ubuntu /bin/bash
# Same as below
$ docker run -it --rm --cpu-period=100000 --cpu-quota=150000 ubuntu /bin/bash
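To confirm how these flags end up in the cgroup files, docker inspect shows the recorded values, and with the default cgroupfs driver they appear under the docker/ directory of the cpu hierarchy (with the systemd cgroup driver the path is under system.slice instead); <container> is a placeholder for a container name or ID:

$ docker inspect --format '{{.HostConfig.CpuPeriod}} {{.HostConfig.CpuQuota}} {{.HostConfig.CpuShares}}' <container>
$ CID=$(docker inspect --format '{{.Id}}' <container>)
$ cat /sys/fs/cgroup/cpu/docker/$CID/cpu.cfs_period_us
$ cat /sys/fs/cgroup/cpu/docker/$CID/cpu.cfs_quota_us
$ cat /sys/fs/cgroup/cpu/docker/$CID/cpu.shares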

memory subsystem

The memory subsystem generates automatic reports on memory resources used by the tasks in a cgroup, and sets limits on memory use of those tasks.

$ ls /sys/fs/cgroup/memory/
cgroup.clone_children memory.force_empty memory.kmem.tcp.limit_in_bytes memory.memsw.failcnt memory.oom_control memory.use_hierarchy
cgroup.event_control memory.kmem.failcnt memory.kmem.tcp.max_usage_in_bytes memory.memsw.limit_in_bytes memory.pressure_level notify_on_release
cgroup.procs memory.kmem.limit_in_bytes memory.kmem.tcp.usage_in_bytes memory.memsw.max_usage_in_bytes memory.soft_limit_in_bytes release_agent
cgroup.sane_behavior memory.kmem.max_usage_in_bytes memory.kmem.usage_in_bytes memory.memsw.usage_in_bytes memory.stat system.slice
machine.slice memory.kmem.slabinfo memory.limit_in_bytes memory.move_charge_at_immigrate memory.swappiness tasks
memory.failcnt memory.kmem.tcp.failcnt memory.max_usage_in_bytes memory.numa_stat memory.usage_in_bytes user.slice

Let’s focus on the memory limitation parameters memory.limit_in_bytes, memory.memsw.limit_in_bytes, memory.oom_control, memory.soft_limit_in_bytes and memory.swappiness (a combined sketch follows the list).

  • memory.limit_in_bytes
    sets the maximum amount of user memory (including file cache). If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units (k/K, m/M, g/G), e.g. echo 1G > /cgroup/memory/lab1/memory.limit_in_bytes

  • memory.memsw.limit_in_bytes
    sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units (k/K, m/M, g/G).

    It is important to set the memory.limit_in_bytes parameter before setting the memory.memsw.limit_in_bytes parameter, because memory.memsw.limit_in_bytes becomes available only after all memory limitations (previously set in memory.limit_in_bytes) are exhausted. For example, memory.limit_in_bytes = 2G and memory.memsw.limit_in_bytes = 4G for a certain cgroup will allow processes in that cgroup to allocate 2 GB of memory and, once that is exhausted, allocate only another 2 GB of swap.

  • memory.soft_limit_in_bytes
    enables flexible sharing of memory. Under normal circumstances, control groups are allowed to use as much of the memory as needed, constrained only by their hard limits set with the memory.limit_in_bytes parameter. However, when the system detects memory contention or low memory, control groups are forced to restrict their consumption to their soft limits (reclaim memory). If lowering the memory usage to the soft limit does not solve the contention, cgroups are pushed back as much as possible to make sure that one control group does not starve the others of memory. Note that soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups.

  • memory.oom_control
    contains a flag (0 or 1) that enables or disables the Out of Memory killer for a cgroup. If enabled (0), tasks that attempt to consume more memory than they are allowed are immediately killed by the OOM killer. The OOM killer is enabled by default in every cgroup using the memory subsystem; to disable it, write 1 to the memory.oom_control file.

    When the OOM killer is disabled, tasks that attempt to use more memory than they are allowed are paused until additional memory is freed.

  • memory.swappiness
    sets the tendency of the kernel to swap out process memory used by tasks in this cgroup instead of reclaiming pages from the page cache. This is the same tendency, calculated the same way, as set in /proc/sys/vm/swappiness for the system as a whole. The default value is 60. Values lower than 60 decrease the kernel's tendency to swap out process memory, values greater than 60 increase the kernel's tendency to swap out process memory, and values greater than 100 permit the kernel to swap out pages that are part of the address space of the processes in this cgroup.

    Note that a value of 0 does not prevent process memory being swapped out; swap out might still happen when there is a shortage of system memory because the global virtual memory management logic does not read the cgroup value. To lock pages completely, use mlock() instead of cgroups.

    NOTE:
    Increasing this value will make the system more inclined to utilize swap space, leaving more memory free for caches.
    Decreasing this value will make the system less inclined to swap, and may improve application responsiveness.

    Tuning vm.swappiness incorrectly may hurt performance or may have a different impact between light and heavy workloads. Changes to this parameter should be made in small increments and should be tested under the same conditions that the system normally operates.
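A combined sketch of these knobs, reusing the lab1 group name from the example above (shown here under the modern /sys/fs/cgroup/memory path; the memory.memsw.* files only exist when swap accounting is enabled in the kernel):

# create a memory group; the control files appear automatically
$ mkdir /sys/fs/cgroup/memory/lab1
# hard limit on user memory first, then the combined memory+swap limit
$ echo 2G > /sys/fs/cgroup/memory/lab1/memory.limit_in_bytes
$ echo 4G > /sys/fs/cgroup/memory/lab1/memory.memsw.limit_in_bytes
# soft limit, enforced only under memory contention
$ echo 1G > /sys/fs/cgroup/memory/lab1/memory.soft_limit_in_bytes
# prefer reclaiming page cache over swapping this group's process memory
$ echo 10 > /sys/fs/cgroup/memory/lab1/memory.swappiness
# 0 = OOM killer enabled (default), 1 = tasks are paused instead of killed
$ cat /sys/fs/cgroup/memory/lab1/memory.oom_control
# attach a process and watch its usage and limit-hit counter
$ echo $pid > /sys/fs/cgroup/memory/lab1/cgroup.procs
$ cat /sys/fs/cgroup/memory/lab1/memory.usage_in_bytes
$ cat /sys/fs/cgroup/memory/lab1/memory.failcnt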

NOTE

If soft_limit_in_bytes is unlimited while limit_in_bytes is set, then when the processes in this group reach limit_in_bytes, the kernel will try to swap some of their memory to disk so that usage drops below the limit; but if the system has no swap left, or the group reaches memsw.limit_in_bytes, one of the processes (the one with the highest OOM score) is killed by the OOM killer.
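This can be provoked deliberately with a small limit and a memory hog; the test group and the sizes below are arbitrary, and cgexec/stress-ng come from the libcgroup and stress-ng packages (both are also used further below):

$ mkdir /sys/fs/cgroup/memory/test
$ echo 128M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# one worker trying to allocate more than the group allows
$ cgexec -g memory:test stress-ng --vm 1 --vm-bytes 256M -t 10s
# memory.failcnt climbs; if swap cannot absorb the overflow, dmesg shows the cgroup OOM kill
$ cat /sys/fs/cgroup/memory/test/memory.failcnt
$ dmesg | tail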

cgroup subsystem kernel

Here is where each subsystem is defined in the kernel source:

cpuset_subsys - defined in kernel/cpuset.c.
freezer_subsys - defined in kernel/cgroup_freezer.c.
mem_cgroup_subsys - defined in mm/memcontrol.c; Aka memcg - memory control groups.
blkio_subsys - defined in block/blk-cgroup.c.
net_cls_subsys - defined in net/sched/cls_cgroup.c ( can be built as a kernel module)
net_prio_subsys - defined in net/core/netprio_cgroup.c ( can be built as a kernel module)
devices_subsys - defined in security/device_cgroup.c.
perf_subsys (perf_event) - defined in kernel/events/core.c
hugetlb_subsys - defined in mm/hugetlb_cgroup.c.
cpu_cgroup_subsys - defined in kernel/sched/core.c
cpuacct_subsys - defined in kernel/sched/core.c

cgroup kernel part1

cgroup kernel part2

css_set: (a set of cgroups, one from each subsystem hierarchy)
When a process forks, the child will be in all the same cgroups that the parent is in. While either process could be moved around, they very often are not. This means that it is quite common for a collection of processes (and the threads within them) to all be in the same set of cgroups. To make use of this commonality, the struct css_set exists. It identifies a set of cgroups (css stands for “cgroup subsystem state”) and each thread is attached to precisely one css_set. All css_sets are linked together in a hash table so that when a process or thread is moved to a new cgroup, a pre-existing css_set can be reused, if one exists with the required set of cgroups.

cgroupv2

Why cgroupv2 (kernel >= 4.5)?
There was a lot of criticism of the v1 implementation of cgroups, which presents a number of inconsistencies and a lot of chaos. For example, when creating subgroups (cgroups within cgroups), several cgroup controllers propagate parameters to their immediate subgroups, while other controllers do not. Or, for a different example, some controllers use interface files (such as the cpuset controller’s clone_children) that appear in all controllers even though they only affect one. Also, because the subsystems (cpu, memory, blkio) live in separate hierarchies, a limit only takes effect within its own subsystem: with buffered I/O (the default), writes and reads go through the page cache, and blkio, which works at the block layer, does not see them, so it cannot limit disk I/O correctly. With cgroupv2, memory and I/O can be enabled on the same group, so the I/O controller works as expected.

The biggest change to cgroups in v2 is a focus on simplicity of the hierarchy. Where v1 used independent trees for each controller (such as /sys/fs/cgroup/cpu/GROUPNAME and /sys/fs/cgroup/memory/GROUPNAME), v2 unifies those in /sys/fs/cgroup/GROUPNAME. In the same vein, if process X joins /sys/fs/cgroup/test, every controller enabled for test will control process X.

For example, in cgroups v2, memory protection is configured in four files:

  • memory.min: this memory will never be reclaimed.
  • memory.low: memory below this threshold is reclaimed if there’s no other reclaimable memory in other cgroups.
  • memory.high: the kernel will attempt to keep memory usage below this configuration.
  • memory.max: if memory reaches this level, the OOM killer (a system used to sacrifice one or more processes to free up memory for the system when all else fails) is invoked on the cgroup.
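A minimal v2 sketch, assuming a kernel/distribution where the unified hierarchy is mounted at /sys/fs/cgroup; the group name test and the sizes are illustrative:

# which controllers are available on the unified hierarchy
$ cat /sys/fs/cgroup/cgroup.controllers
# enable controllers for child groups, then create a single group for everything
$ echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/test
# memory protection/limits sit next to the cpu and io knobs in the same directory
$ echo 256M > /sys/fs/cgroup/test/memory.min
$ echo 512M > /sys/fs/cgroup/test/memory.low
$ echo 1G > /sys/fs/cgroup/test/memory.high
$ echo 2G > /sys/fs/cgroup/test/memory.max
# cpu.max replaces cfs_quota_us/cfs_period_us: "<quota> <period>" (1.5 CPUs here)
$ echo "150000 100000" > /sys/fs/cgroup/test/cpu.max
# one cgroup.procs file: every enabled controller now applies to this process
$ echo $pid > /sys/fs/cgroup/test/cgroup.procs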

cgroupv1 vs cgroupv2

cgroupv2 usage and redhat cgroupv2 guideline

FAQ

check cgroups of a given process

# cat /proc/$pid/cgroup

change the cgroup of a given process

change the cpu cgroup of a given process

# you can create a sub-group under a subsystem root or under another group
# tasks        # attach a task (thread) here and list attached threads
# cgroup.procs # attach a process here and list attached processes
$ echo $pid > /sys/fs/cgroup/cpu/$sub_group/cgroup.procs

check the root node of a particular cgroup (subsystem)

# ls /sys/fs/cgroup/xx

create a cgroup (sub-group of a given cgroup subsystem)

create a cpu control group; you can create a cgroup without any process in it and add processes later
# cd /sys/fs/cgroup/cpu/
# mkdir test_cg
after this the system adds the attribute (control) files automatically; then you can change the
parameters of this group to control cpu etc.

The created cgroup will be deleted after a reboot! When it is created, there is no process in the group.

when you delete a cgroup, all its processes are moved to their parent group.
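With the libcgroup tools installed, the same create/configure/delete steps can also be done with its wrapper commands instead of mkdir/echo/rmdir (the test_cg group name and quota value are just examples):

$ cgcreate -g cpu:/test_cg                 # same effect as mkdir /sys/fs/cgroup/cpu/test_cg
$ cgset -r cpu.cfs_quota_us=50000 test_cg  # set a parameter on the group
$ cgclassify -g cpu:test_cg $pid           # move an existing process into the group
$ cgdelete -g cpu:/test_cg                 # same as rmdir; tasks move back to the parent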

run a program in a given cgroup

$ cgexec -g controllers:path_to_cgroup command arguments
# the cgroup must be created beforehand!
# nested path example: -g memory:test/sub
$ cgexec -g memory:test stress-ng -m 1 --vm-bytes 128M -t 10s --metrics-brief
$ cgexec -g memory:test -g cpu:test stress-ng -m 1 --vm-bytes 128M -t 10s --metrics-brief