k8s_service_deep

Service

A Kubernetes Service is a resource you create to provide a single, constant point of entry to a group of Pods (selected by a label selector) that provide the same service. A Service has an IP address and port that never change while the Service exists, whereas Pod addresses can change at any time: Pods are replaced during upgrades and removed or added when scaling. That is why clients SHOULD NOT access Pod addresses directly; they need a stable, dedicated IP, and that is exactly what a Service provides.
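
As a minimal sketch (the names, labels and ports below are made up for illustration), a Service that fronts all Pods labelled app: my-app looks like this:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # label selector: picks the backend Pods
  ports:
  - port: 80             # the stable port exposed on the Service IP
    targetPort: 8080     # the port the Pods actually listen on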

For more details about Services, refer to k8s service.

enable source ip persistence for a service
If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select session affinity based on the client’s IP address by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"). You can also set the maximum session sticky time by setting service.spec.sessionAffinityConfig.clientIP.timeoutSeconds appropriately (the default value is 10800 seconds, which works out to 3 hours).
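
For example, a sketch of the relevant fields (the Service name and selector below are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # maximum session sticky time; default is 3 hours
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080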

kube-proxy

kube-proxy is a key component of any Kubernetes deployment. Its role is to load-balance traffic that is destined for services (via cluster IPs and node ports) to the correct backend pods. Kube-proxy can run in one of three modes, each implemented with different data plane technologies: userspace, iptables, or IPVS.

The userspace mode is very old, slow, and definitely not recommended, so it is not discussed here.

iptables vs IPVS

  • IPVS performs better with a large number of Services and Pods
  • IPVS supports more load-balancing algorithms than iptables
  • IPVS supports server health checking, connection retries, etc.

Note

  • Endpoints objects are created and updated by the endpoints controller (part of the controller manager); Service cluster IPs are allocated by the API server, and Pod IPs by the network plugin
  • kube-proxy watches the apiserver for Service and Endpoints objects, then updates the iptables or IPVS rules accordingly
  • kube-proxy runs on each node (in the kube-system namespace)

Why not use round-robin DNS to replace kube-proxy?
A question that pops up every now and then is why Kubernetes relies on proxying to forward inbound traffic to backends. What about other approaches? For example, would it be possible to configure DNS records that have multiple A values (or AAAA for IPv6), and rely on round-robin name resolution?

There are a few reasons for using proxying for Services:

  • There is a long history of DNS implementations not respecting record TTLs, and caching the results of name lookups after they should have expired.
  • Some apps do DNS lookups only once and cache the results indefinitely.
  • Even if apps and libraries did proper re-resolution, the low or zero TTLs on the DNS records could impose a high load on DNS that then becomes difficult to manage.

Iptables

In this mode, kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects. For each Service, it installs iptables rules, which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service’s backend sets. For each Endpoint object, it installs iptables rules which select a backend Pod.

iptables mode

By default, kube-proxy in iptables mode chooses a backend at random.
If kube-proxy is running in iptables mode and the first Pod that is selected does not respond, the connection fails; there is NO retry against another Pod!
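
With more than one endpoint, the random choice is implemented with the iptables statistic match: the KUBE-SVC chain holds one rule per endpoint, each matching with a fixed probability. The sketch below (in iptables-save format, with made-up chain suffixes) shows what this typically looks like for two endpoints; the real chains for a single-endpoint Service appear in the enable iptables mode section below.

# illustrative rules for a Service with two endpoints (chain suffixes are made up)
# the first rule matches ~50% of new connections; the rest fall through to the second endpoint
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-BBBBBBBBBBBBBBBB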

When a Service is accessed by its cluster IP (from inside the cluster), the OUTPUT chain is traversed, while access via a NodePort address goes through the PREROUTING chain. Both jump to the KUBE-SERVICES chain created by kube-proxy; for more detail see the enable iptables mode section below.

IPVS

In IPVS mode, kube-proxy watches Kubernetes Services and Endpoints, calls the netlink interface to create IPVS rules accordingly, and synchronizes the IPVS rules with Kubernetes Services and Endpoints periodically. This control loop ensures that the IPVS state matches the desired state. When accessing a Service, IPVS directs traffic to one of the backend Pods.

The IPVS proxy mode is based on netfilter hook functions, similar to the iptables mode, but it uses a hash table as the underlying data structure and works in the kernel space. That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.

IPVS provides more options for balancing traffic to backend Pods; these are:

  • rr: round-robin
  • lc: least connection (smallest number of open connections)
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue
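
The scheduling algorithm is chosen through the kube-proxy configuration (the ipvs.scheduler field of KubeProxyConfiguration). A minimal sketch of the relevant part of the kube-proxy ConfigMap:

# excerpt of the kube-proxy ConfigMap (kubectl edit configmaps kube-proxy -n kube-system)
mode: ipvs
ipvs:
  scheduler: "lc"   # e.g. least connection; an empty string defaults to rr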

IPVS mode

When creating a ClusterIP type Service, the IPVS proxier does the following three things:

  • Make sure a dummy interface exists on the node (kube-ipvs0 by default)
  • Bind the Service IP addresses to the dummy interface
  • Create an IPVS virtual server for each Service IP address

config

enable iptables mode

As noted above, traffic to the cluster IP (from inside the cluster) is matched in the OUTPUT chain, while traffic to a NodePort address is matched in the PREROUTING chain; both jump to the KUBE-SERVICES chain created by kube-proxy. The commands below walk through these chains.


# set at cluster creation time (see Cluster Created by Kubeadm)
# kubeadm init configuration file
...
kubeProxy:
  config:
    mode: ""    # an empty string selects the default, which is iptables
...



# or change the mode after kube-proxy is already running (restart the kube-proxy pods for the change to take effect)
$ kubectl edit configmaps kube-proxy -n kube-system


$ kubectl get svc
NAME               TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
nginx-1623168286   NodePort   10.1.172.5   <none>        80:31067/TCP   9m11s
# the Service has a cluster IP (10.1.172.5:80) and a NodePort (<node address>:31067)

$ kubectl get ep
NAME               ENDPOINTS        AGE
nginx-1623168286   10.2.2.18:8080   9m2s



# check the KUBE-SERVICES chain: both PREROUTING and OUTPUT jump to it
$ iptables -nv -L PREROUTING -t nat
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
25960 2231K KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

$ iptables -nv -L OUTPUT -t nat
Chain OUTPUT (policy ACCEPT 39 packets, 2451 bytes)
pkts bytes target prot opt in out source destination
986K 60M KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */


# KUBE-SERVICES chain in the nat table: cluster-IP traffic jumps to a KUBE-SVC-XXX chain; the NodePort rule must stay last
$ iptables -nv -L KUBE-SERVICES -t nat
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-YTBFCGJW6SOUTSSA tcp -- * * 0.0.0.0/0 10.1.172.5 /* default/nginx-1623168286:http cluster IP */ tcp dpt:80
303 18611 KUBE-NODEPORTS all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL


$ iptables -nv -L KUBE-NODEPORTS -t nat

Chain KUBE-NODEPORTS (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-YTBFCGJW6SOUTSSA tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-1623168286:http */ tcp dpt:31067

# each KUBE-SVC-XXX chain jumps to per-endpoint chains named KUBE-SEP-XXX
$ iptables -nv -L KUBE-SVC-YTBFCGJW6SOUTSSA -t nat
Chain KUBE-SVC-YTBFCGJW6SOUTSSA (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-NUA5P77FXIMWW66U all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-1623168286:http */

$ iptables -nv -L KUBE-SEP-NUA5P77FXIMWW66U -t nat
Chain KUBE-SEP-NUA5P77FXIMWW66U (1 references)
pkts bytes target prot opt in out source destination
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-1623168286:http */ tcp to:10.2.2.20:8080
# DNAT to the real backend Pod
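
A quicker way to find all chains belonging to one Service is to dump the nat table with iptables-save and grep for the Service name (using the example Service above):

# dump the nat table and filter by Service name
$ iptables-save -t nat | grep nginx-1623168286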

enable IPVS mode

# load the required IPVS kernel modules
$ modprobe -- ip_vs
$ modprobe -- ip_vs_rr
$ modprobe -- ip_vs_wrr
$ modprobe -- ip_vs_sh

$ modprobe -- nf_conntrack_ipv4
# OR (use nf_conntrack instead of nf_conntrack_ipv4 for Linux kernel 4.19 and later)
$ modprobe -- nf_conntrack

$ lsmod | grep -e ip_vs -e nf_conntrack_ipv4

#---------------------------------------------------------------------------
# set at cluster creation time (see Cluster Created by Kubeadm)
# kubeadm init configuration file
...
kubeProxy:
  config:
    mode: ipvs
...

# or change the mode after kube-proxy is already running (restart the kube-proxy pods for the change to take effect)
$ kubectl edit configmaps kube-proxy -n kube-system

# install ipvsadm to inspect IPVS rules
# Ubuntu 18.04
$ apt-get install ipvsadm
# CentOS 7
$ yum install -y ipvsadm

# after creating a Service, check the IPVS rules
$ ipvsadm -ln


$ kubectl get svc
NAME               TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
nginx-1623168286   NodePort   10.1.172.5   <none>        80:31067/TCP   9m11s
# the Service has a cluster IP (10.1.172.5:80) and a NodePort (<node address>:31067)

$ kubectl get ep
NAME               ENDPOINTS        AGE
nginx-1623168286   10.2.2.18:8080   9m2s
# the Pods that provide the nginx service

# For the cluster IP, kube-proxy binds the address to the virtual device kube-ipvs0
# and creates one IPVS virtual server for it

# for NodePort 31067, kube-proxy creates several virtual servers, one for each address of the node's interfaces
#
$ ip addr
8: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 9a:f2:1d:c0:84:ec brd ff:ff:ff:ff:ff:ff
    inet 10.1.172.5/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

#
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn

# one virtual server for each address of every interface on the node
TCP 192.168.56.11:31067 rr
-> 10.2.2.18:8080 Masq 1 0 0

# cluster ip
TCP 10.1.172.5:80 rr
-> 10.2.2.18:8080 Masq 1 0 0

# one virtual server for each address of every interface on the node
TCP 10.2.0.0:31067 rr
-> 10.2.2.18:8080 Masq 1 0 0
TCP 10.2.0.1:31067 rr
-> 10.2.2.18:8080 Masq 1 0 0
TCP 127.0.0.1:31067 rr
-> 10.2.2.18:8080 Masq 1 0 0
TCP 172.17.0.1:31067 rr
-> 10.2.2.18:8080 Masq 1 0 0

NOTE
When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy falls back to running in iptables proxy mode.
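
To verify which mode kube-proxy actually ended up in, you can query its local metrics endpoint on the node (10249 is the default metrics port; adjust if your deployment binds it elsewhere):

# ask kube-proxy which proxy mode it is running in (run on the node itself)
$ curl localhost:10249/proxyMode
ipvs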

debug kube-proxy

# check that the kube-proxy process / pod is running
$ ps -ef | grep kube-proxy
$ kubectl get pod -n kube-system | grep kube-proxy

$ kubectl get configmaps -n kube-system
# check the kube-proxy configuration: the mode in use and the parameters for each mode
$ kubectl describe configmaps kube-proxy -n kube-system
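
If the configuration looks correct but no rules show up, the kube-proxy logs are the next place to look (the pod name below is a placeholder):

# inspect the logs of one kube-proxy pod (replace <kube-proxy-pod> with a real pod name)
$ kubectl logs -n kube-system <kube-proxy-pod>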

coredns

Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service’s IP to resolve DNS names.

Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. By default, a client Pod's DNS search list includes the Pod's own namespace and the cluster's default domain.

You can (and almost always should) set up a DNS service for your Kubernetes cluster using an add-on.

A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name.

For example, if you have a Service called my-service in a Kubernetes namespace my-ns, the control plane and the DNS Service acting together create a DNS record for my-service.my-ns. Pods in the my-ns namespace should be able to find the service by doing a name lookup for my-service (my-service.my-ns would also work).

Pods in other namespaces must qualify the name as my-service.my-ns. These names will resolve to the cluster IP assigned for the Service.

Kubernetes also supports DNS SRV (Service) records for named ports. If the my-service.my-ns Service has a port named http with the protocol set to TCP, you can do a DNS SRV query for _http._tcp.my-service.my-ns to discover the port number for http, as well as the IP address.
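
As a quick check (a sketch assuming a Service my-service in namespace my-ns and the default cluster domain cluster.local), the lookups can be run from any Pod that has nslookup and dig installed:

# A/AAAA record: the short name works from inside my-ns, the qualified names from any namespace
$ nslookup my-service
$ nslookup my-service.my-ns
$ nslookup my-service.my-ns.svc.cluster.local

# SRV record for the named port "http": returns the port number as well as the Service address
$ dig SRV _http._tcp.my-service.my-ns.svc.cluster.local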

Ref