docker-core-tech-ns

Posted on 2019-10-14 Edited on 2025-04-07 In docker , ns

Introduction

Namespaces give processes their own view of the system: a namespace limits what a process can see, and therefore what it can use.

Overview

Namespaces

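Linux has several namespace types. This post covers the six below; the kernel also provides a cgroup namespace (since Linux 4.6) and a time namespace (since Linux 5.6), which are not covered here: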

pid
each PID namespace has its own numbering, starting at 1. When PID 1 of a namespace exits, every other process in that namespace is killed. A process running in a PID namespace is still visible from the root PID namespace (and from any other ancestor PID namespace), but under a different PID number. Processes in other, non-ancestor PID namespaces do NOT see it.
net
processes within a given network namespace get their own private network stack, including:
- network interfaces
- routing tables
- iptables rules
- sockets(inet socket, NOT unix socket)
A network interface belongs to exactly one net namespace, and so does an (inet) socket. A newly created net namespace contains only a loopback interface. When a network namespace is deleted, its movable network devices are moved back to the default network namespace. Unmovable devices (those with NETIF_F_NETNS_LOCAL in their features) and virtual devices are not moved back.
mnt(mount)
processes can have ‘private’ mounts: mounts and unmounts made in one mount namespace are invisible to the rest of the system, unless the propagation type of the mount says otherwise
uts
each UTS namespace has its own hostname, so gethostname/sethostname work on a per-namespace value. Changing the hostname inside one UTS namespace does not affect the other namespaces.
ipc(rarely used)
Allows a process to have its own:
- IPC semaphores
- IPC message queues
- IPC shared memory
user
Allows UIDs/GIDs to be mapped: processes inside the user namespace see and use the mapped UID/GID, not the real UID/GID they have on the host.
- e.g. UID 0-1999 in container C1 is mapped to UID 10000-11999 on the host (a security improvement: root in the container is an unprivileged user on the host)
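
The UID mapping in the last example can be worked through by hand. This is a minimal sketch, assuming a single mapping line in the `/proc/$pid/uid_map` format (`inside-start outside-start length`); the function name `map_uid` is made up for illustration.

```shell
# map_uid INSIDE_START OUTSIDE_START LENGTH UID
# Translate a UID seen inside a user namespace to the UID the host sees,
# using one uid_map line "INSIDE_START OUTSIDE_START LENGTH".
map_uid() {
    inside_start=$1
    outside_start=$2
    length=$3
    uid=$4
    if [ "$uid" -lt "$inside_start" ] || [ "$uid" -ge $((inside_start + length)) ]; then
        echo "unmapped"
        return 1
    fi
    echo $((outside_start + uid - inside_start))
}

# Mapping from the example above: container UIDs 0-1999 -> host UIDs 10000-11999
map_uid 0 10000 2000 0              # prints 10000
map_uid 0 10000 2000 1999           # prints 11999
map_uid 0 10000 2000 2000 || true   # prints unmapped
```

On a live system the mapping of a process can be read with `cat /proc/$pid/uid_map`.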

mount namespace

Mount namespace provides isolation of the list of mounts seen by the processes in each namespace instance. Thus, the processes in each of the mount namespace instances will see distinct single-directory hierarchies.

A new mount namespace is created using either clone(2) or unshare(2) with the CLONE_NEWNS flag. When a new mount namespace is created, its mount list is initialized as follows:

If the namespace is created using clone(2), the mount list of the child’s namespace is a copy of the mount list in the parent process’s mount namespace.
If the namespace is created using unshare(2), the mount list of the new namespace is a copy of the mount list in the caller’s previous mount namespace.

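Subsequent modifications to the mount list (mount(2) and umount(2)) in either mount namespace do not affect the mount list seen in the other namespace, as long as the mounts involved are private (MS_PRIVATE, the kernel default). Whether events cross namespaces is controlled by the propagation type of each mount. Note that systemd marks all mounts as shared at boot, while unshare(1) resets the propagation to private in the new namespace unless told otherwise.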

propagation types

MS_SHARED
This mount shares events with the members of its peer group (a set of mounts that propagate events to one another, for example a shared mount and its copy in a new mount namespace). Mount and unmount events immediately under this mount will propagate to the other mounts that are members of the peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mounts in the peer group. Conversely, mount and unmount events that take place under peer mounts will propagate to this mount.
MS_PRIVATE(kernel default)
This mount is private; it does not have a peer group. Mount and unmount events do not propagate into or out of this mount.
MS_SLAVE
Mount and unmount events propagate into this mount from a (master) shared peer group. Mount and unmount events under this mount do not propagate to any peer.

Note that a mount can be the slave of another peer group while at the same time sharing mount and unmount events with a peer group of which it is a member. (More precisely, one peer group can be the slave of another peer group.)

MS_UNBINDABLE
This is like a private mount, and in addition this mount can’t be bind mounted. Attempts to bind mount this mount (mount(2) with the MS_BIND flag) will fail.

When a recursive bind mount (mount(2) with the MS_BIND and MS_REC flags) is performed on a directory subtree, any unbindable mounts within the subtree are automatically pruned (i.e., not replicated) when replicating that subtree to produce the target subtree.
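
The propagation type of each mount can be read back from the optional fields of `/proc/self/mountinfo` (`shared:N`, `master:N`, `unbindable`; none of them means private). This is a rough sketch, assuming the field layout documented in proc(5); the function name `propagation` and the sample lines are made up for illustration.

```shell
# propagation: read mountinfo lines on stdin, print "<mount point> <propagation>".
# Optional fields start at field 7 and end at the "-" separator.
propagation() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "") prop = (prop == "" ? tag : prop "," tag)
        }
        if (prop == "") prop = "private"
        print $5, prop
    }'
}

# Hypothetical sample lines in mountinfo format
printf '%s\n' \
    '36 35 98:0 / /mnt rw,noatime shared:1 - ext4 /dev/sda1 rw' \
    '37 36 0:40 / /mnt/a rw master:1 - tmpfs tmpfs rw' \
    '38 36 0:41 / /mnt/b rw - tmpfs tmpfs rw' | propagation
```

For the current mount namespace, run `propagation < /proc/self/mountinfo`. A mount that is both shared and a slave is reported as `shared,slave`. `findmnt -o TARGET,PROPAGATION` gives a similar view.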

kernel view

Each namespace is identified by a unique inode number. Two processes are in the same namespace if they see the same inode for that namespace type.
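
The inode comparison can be done from the shell, because the links under `/proc/$pid/ns` have targets of the form `type:[inode]`. This is a small sketch; the helper names `ns_inode` and `same_ns` are made up for illustration, and reading the links of another user's process usually needs root.

```shell
# ns_inode LINK: extract the inode number from a link target such as "net:[4026531956]"
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9]*\)]$/\1/p'
}

# same_ns PID1 PID2 TYPE: succeed if both processes are in the same TYPE namespace
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ -n "$a" ] && [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

ns_inode 'net:[4026531956]'   # prints 4026531956
```

Typical use: `same_ns $$ 1 net && echo "same net namespace as PID 1"`.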

namespace in kernel

FAQ

how to create different namespaces

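Namespaces are created with the clone() system call (by passing extra flags when creating a new process) or with unshare() for the calling process. When the last process of a namespace exits, the kernel destroys the namespace automatically (it can be preserved, for example by bind mounting its /proc/$pid/ns file or by keeping a file descriptor to that file open).
A process can ‘join’ an existing namespace with setns().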

clone flags for namespace

CLONE_NEWNS
CLONE_NEWNET
CLONE_NEWPID
CLONE_NEWIPC
CLONE_NEWUTS
CLONE_NEWUSER

All of them need CAP_SYS_ADMIN except CLONE_NEWUSER.
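
As a quick cross-reference, the lookup below ties each clone flag to its value in `<linux/sched.h>`, to the entry name under `/proc/$pid/ns`, and to the matching `unshare` long option. It is only a sketch; the function name `ns_info` is made up for illustration.

```shell
# ns_info NAME: print "<clone flag> <flag value> <unshare option>"
# for an entry name found under /proc/$pid/ns
ns_info() {
    case $1 in
        mnt)  echo "CLONE_NEWNS 0x00020000 --mount" ;;
        uts)  echo "CLONE_NEWUTS 0x04000000 --uts" ;;
        ipc)  echo "CLONE_NEWIPC 0x08000000 --ipc" ;;
        user) echo "CLONE_NEWUSER 0x10000000 --user" ;;
        pid)  echo "CLONE_NEWPID 0x20000000 --pid" ;;
        net)  echo "CLONE_NEWNET 0x40000000 --net" ;;
        *)    echo "unknown namespace type: $1" >&2; return 1 ;;
    esac
}

for t in mnt uts ipc user pid net; do
    ns_info "$t"
done
```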

how to check namespaces of given process

$ ls -al /proc/$pid/ns
dr-x--x--x 2 root root 0 Oct 25 16:24 .
dr-xr-xr-x 9 root root 0 Oct 11 22:15 ..
lrwxrwxrwx 1 root root 0 Oct 25 16:24 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Oct 25 16:24 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Oct 25 16:24 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Oct 25 16:24 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Oct 25 16:24 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Oct 25 16:24 uts -> uts:[4026531838]
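
The same links can be scanned to find every process that shares a given namespace, for any namespace type. This is a minimal sketch; the function name `pids_in_ns` is made up, and the optional third argument (an alternative to `/proc`) exists only so the function can be tried on a fake directory tree.

```shell
# pids_in_ns TYPE TARGET [PROCROOT]
# Print the PIDs whose TYPE namespace link equals TARGET, e.g. "net:[4026531956]".
# Processes whose links cannot be read (permission denied) are skipped.
pids_in_ns() {
    nstype=$1
    target=$2
    root=${3:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -d "$dir" ] || continue
        link=$(readlink "$dir/ns/$nstype" 2>/dev/null) || continue
        if [ "$link" = "$target" ]; then
            echo "${dir##*/}"
        fi
    done
    return 0
}

# All processes visible to us that share our own network namespace
pids_in_ns net "$(readlink /proc/self/ns/net 2>/dev/null)"
```

Run it as root to cover processes owned by other users.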

how to communicate between network namespaces

Use a veth pair or a unix socket.

how to check if device is netns local or not

A device with NETIF_F_NETNS_LOCAL set is not allowed to move between network namespaces. Examples of such devices:

loopback, vxlan, ppp, bridge
Use ethtool to check whether the flag is set on a device.

$ ethtool -k eth0 | grep local
netns-local: off [fixed]

$ ethtool -k lo | grep local
netns-local: on [fixed]
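
The check can be wrapped in a small helper so that scripts can test a device before trying to move it. This is a sketch that only parses the `netns-local` line shown above; the function name `netns_movable` is made up for illustration.

```shell
# netns_movable: read `ethtool -k <dev>` output on stdin and succeed (exit 0)
# only if the device reports "netns-local: off", i.e. it may be moved.
netns_movable() {
    awk -F': *' '
        $1 ~ /netns-local$/ { found = 1; split($2, v, " "); state = v[1] }
        END { exit !(found && state == "off") }'
}

printf 'netns-local: off [fixed]\n' | netns_movable && echo "movable"
printf 'netns-local: on [fixed]\n' | netns_movable || echo "netns-local, cannot be moved"
```

Typical use: `ethtool -k eth0 | netns_movable && ip link set dev eth0 netns ns1`.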

how to check mounted point of the system

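# the real mount list from the kernel, as seen from the current mount namespace
$ cat /proc/mounts

# mount without arguments prints the mount list too; older versions read /etc/mtab,
# which on most current systems is just a symlink to /proc/self/mounts
$ mount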

how to run a program in another namespace from shell

There are two commands for this.

unshare: run a command in new namespaces (any namespace type not selected stays shared with the parent)
nsenter: run a command in the namespaces of another process (existing namespaces)

unshare [options] <program> [<argument>...]

Options:
 -m, --mount[=<file>]      unshare mounts namespace
 -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
 -i, --ipc[=<file>]        unshare System V IPC namespace
 -n, --net[=<file>]        unshare network namespace
 -p, --pid[=<file>]        unshare pid namespace
 -U, --user[=<file>]       unshare user namespace
 -f, --fork                fork before launching <program>
     --mount-proc[=<dir>]  mount proc filesystem first (implies --mount)
 -r, --map-root-user       map current user to root (implies --user)
     --propagation slave|shared|private|unchanged
                           modify mount propagation in mount namespace
 -s, --setgroups allow|deny  control the setgroups syscall in user namespaces

# namespace types that are not selected by an option stay shared with the parent
# PRIVATE mount namespace (the default propagation of unshare -m)
# after the mount list is copied from the parent, the two namespaces never affect each other
$ unshare -m /bin/bash

# shared mount namespace
# after the mount list is copied from the parent, the copies stay in the parent's peer groups
# (provided the mounts were shared in the parent), so a mount on one side also appears on the other
$ unshare --propagation shared -m /bin/bash
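
The options can be combined. The sketch below uses `-U -r -u` to show that a hostname change inside a new UTS namespace does not leak out. It assumes unprivileged user namespaces are enabled and that the `hostname` utility is installed; when either is missing it reports that instead of failing. The function name `uts_demo` and the hostname `ns-demo` are made up for illustration.

```shell
# uts_demo: change the hostname inside a new user+UTS namespace and
# check that the hostname outside is untouched.
uts_demo() {
    before=$(uname -n)
    # -U -r: new user namespace with the caller mapped to root
    # -u:    new UTS namespace, so sethostname only affects the child
    inside=$(unshare -U -r -u sh -c 'hostname ns-demo && uname -n' 2>/dev/null) || inside=""
    after=$(uname -n)
    if [ "$inside" = "ns-demo" ] && [ "$after" = "$before" ]; then
        echo "isolated"
    else
        echo "not demonstrated"
    fi
}

uts_demo
```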

network namespace command list

# show all named net ns
$ ip netns list

# create netns
$ ip netns add ns1
$ ls /var/run/netns

# move eth0 to ns1 if it's movable
$ ip link set dev eth0 netns ns1

# run command in netns
$ ip netns exec ns1 bash
# then
$ ifconfig -a

# show all processes that have joined a netns
$ ip netns pids ns1

# check which netns a process belongs to
$ ip netns identify $pid

# delete a namespace
$ ip netns delete ns1

# how two netns communicate (ping each other or send packets through sockets)
# use a veth pair: two connected virtual net devices (like a pipe), one placed in each netns
$ ip link add name v1 type veth peer name v1_peer

# check veth pair
$ ip -d link show v1
443: v1@v1_peer: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 76:01:28:8f:09:4f brd ff:ff:ff:ff:ff:ff promiscuity 0 
-->  veth addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

$ ip -d link show v1_peer
442: v1_peer@v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:12:df:8a:ce:4c brd ff:ff:ff:ff:ff:ff promiscuity 0 
-->  veth addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

# the part after "@" is the peer's name; "veth" on the line marked --> is the device type

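# both namespaces must exist first (ns1 was deleted above, ns2 was never created)
$ ip netns add ns1
$ ip netns add ns2
$ ip link set dev v1 netns ns1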
$ ip link set dev v1_peer netns ns2
$ ip netns exec ns1 ifconfig v1 192.168.1.1/24 up
$ ip netns exec ns2 ifconfig v1_peer 192.168.1.2/24 up

# now the two netns can ping each other
$ ip netns exec ns1 bash
$ ping 192.168.1.2

# how a process running in a netns communicates with the internet

# One way:
# move a physical ethX into that namespace
# 
# The other (most common) way:
# create a tunnel (a veth pair) between the netns and the default netns, then create a bridge in the default netns that contains a physical interface and one end of the veth pair; the other end lives in the netns
# 
# create a bridge and add a physical device to it

$ sudo brctl addbr br0
$ sudo ifconfig eth0 0.0.0.0
$ sudo brctl addif br0 eth0
$ sudo ifconfig br0 up
$ sudo dhclient br0

# add a veth pair and attach one end to the bridge
$ sudo ip link add name veth1 type veth peer name veth1_peer
$ sudo ifconfig veth1 up
$ sudo brctl addif br0 veth1

# move the peer into the netns and get an IP address through the bridge
$ sudo ip link set dev veth1_peer netns ns1
$ sudo ip netns exec ns1 bash
# the next command runs inside the netns
$ dhclient veth1_peer


# show all network namespaces(named and unnamed)
# -------- show named netns --------
$ ip netns list
ns1

# -------- show every netns that has processes, named or unnamed --------
$ lsns --type=net
        NS TYPE NPROCS   PID USER     NETNSID NSFS                           COMMAND
4026532008 net     686     1 root  unassigned                                /usr/lib/systemd/systemd --switched-root --system --deserialize 17
4026533573 net       9 26128 jaluo          0 /run/docker/netns/e18a6fbd7cae bash
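
The veth steps above use ifconfig from net-tools. The same setup can be written with iproute2 only. The sketch below just prints the commands, so it can be reviewed before anything is changed; the function name `veth_plan` is made up, and the names and addresses are the ones used in the veth example above.

```shell
# veth_plan NS_A NS_B DEV_A DEV_B ADDR_A ADDR_B
# Print the iproute2 commands that connect two network namespaces with a veth pair.
# Nothing is executed here.
veth_plan() {
    ns_a=$1
    ns_b=$2
    dev_a=$3
    dev_b=$4
    addr_a=$5
    addr_b=$6
    cat <<EOF
ip netns add $ns_a
ip netns add $ns_b
ip link add name $dev_a type veth peer name $dev_b
ip link set dev $dev_a netns $ns_a
ip link set dev $dev_b netns $ns_b
ip netns exec $ns_a ip addr add $addr_a dev $dev_a
ip netns exec $ns_b ip addr add $addr_b dev $dev_b
ip netns exec $ns_a ip link set dev $dev_a up
ip netns exec $ns_b ip link set dev $dev_b up
EOF
}

veth_plan ns1 ns2 v1 v1_peer 192.168.1.1/24 192.168.1.2/24
```

To apply the plan, pipe the output to a root shell, for example `veth_plan ... | sudo sh`, then test with `ip netns exec ns1 ping 192.168.1.2`.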