An "irresponsible" share of K8s network troubleshooting experience


January 6, 2022


Author: Luo Bingli | Source: Erda official account

One night, a customer ran into the following problem: their K8s cluster could not be scaled out, and none of the newly added nodes could join the cluster. After many attempts they still had no solution, so they reported the problem to us and asked for technical support. The investigation turned out to be quite interesting, so this article summarizes the troubleshooting ideas and methods in the hope that they will be a useful reference when you run into similar problems.

Symptoms

While adding nodes to the customer's K8s cluster, the operations engineers found that every newly added node failed to join. The preliminary findings were as follows:

  • On the new node, accessing the K8s master service VIP does not work.
  • On the new node, accessing the K8s master host IP on port 6443 directly works fine.
  • On the new node, pinging the container IPs of other nodes works.
  • On the new node, accessing the CoreDNS service VIP works fine.

The customer was running Kubernetes 1.13.10, and the host kernel version was 4.18 (CentOS 8.2).
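
The checks listed above can be reproduced from the new node with commands roughly along these lines. This is only a sketch: 10.96.0.1 is the master service VIP that appears later in this case, while the CoreDNS VIP 10.96.0.10 and the angle-bracket placeholders are assumptions, so substitute the values from your own cluster.

# 1. Master service VIP: in this case the connection hangs / times out
timeout 5 curl -sk https://10.96.0.1:443/healthz; echo "exit=$?"

# 2. Master host IP + 6443 directly: works
timeout 5 curl -sk https://<master-host-ip>:6443/healthz; echo "exit=$?"

# 3. Pod IP on another node: ping succeeds
ping -c 3 <pod-ip-on-another-node>

# 4. CoreDNS service VIP: works (10.96.0.10 is only the conventional default)
dig @10.96.0.10 kubernetes.default.svc.cluster.local +short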

Troubleshooting process

After receiving the feedback from our front-line colleagues, our initial suspicion was an IPVS problem. Based on past experience with network troubleshooting, we first ran some routine checks on site:

  • Confirm that the kernel module ip_tables is loaded (normal)
  • Confirm that the iptables FORWARD chain default policy is ACCEPT (normal)
  • Confirm that the host network is working (normal)
  • Confirm that the container network is working (normal)
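
For reference, a minimal sketch of these routine checks (the placeholder addresses are assumptions, not values from the original case):

# Kernel module ip_tables loaded?
lsmod | grep -w ip_tables

# FORWARD chain default policy is ACCEPT?
iptables -S FORWARD | head -1        # expect "-P FORWARD ACCEPT"

# Host network reachable?
ping -c 3 <another-node-host-ip>

# Container network reachable?
ping -c 3 <pod-ip-on-another-node>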

With the usual suspects eliminated, we could narrow the scope and continue the investigation at the IPVS level.

1. Investigating with the ipvsadm command

10.96.0.1 is the K8s master service VIP of the customer's cluster. Listing the IPVS connections showed an abnormal connection stuck in the SYN_RECV state, while the connections established at boot by kubelet and kube-proxy looked normal. In other words, the K8s service network broke at some point after startup.
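
The commands used here were roughly the following (a sketch; the exact output in this case was a screenshot that is not reproduced here):

# Virtual service and real servers behind the master service VIP
ipvsadm -Ln | grep -A 2 "10.96.0.1:443"

# IPVS connection entries for the VIP: one entry was stuck in SYN_RECV,
# while the boot-time kubelet / kube-proxy connections were ESTABLISHED
ipvsadm -Lnc | grep "10.96.0.1"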

2. Packet capture analysis with tcpdump

We captured packets at both ends while running the telnet 10.96.0.1 443 command to trigger a connection attempt.

Conclusion: the SYN packet was never sent out from the new node at all.
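
A sketch of the capture setup, assuming the addresses used above (the master and node IP placeholders are not from the original case):

# On the new node: watch traffic towards the VIP and towards the master
tcpdump -nn -i any host 10.96.0.1 and port 443 &
tcpdump -nn -i any host <master-host-ip> and port 6443 &
telnet 10.96.0.1 443

# On the master: watch for the new node; in this case no SYN from the
# new node ever showed up, confirming the SYN never left the new node
tcpdump -nn -i any port 6443 and host <new-node-ip>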

3. Preliminary summary

With the findings above we could narrow the scope again: the problem was almost certainly in kube-proxy. We were using IPVS mode, which still relies on iptables for some forwarding, SNAT, and drop rules.

Having narrowed the scope this far, we started analyzing the prime suspect: kube-proxy.

4. Checking the kube-proxy logs

The kube-proxy logs contained an exception: the iptables-restore command was failing (a sketch of how to pull these logs follows the issue links below). Searching Google and the community issues confirmed the problem. The relevant issues and fixes are:

  • https://github.com/kubernetes/kubernetes/issues/73360
  • https://github.com/kubernetes/kubernetes/pull/84422/files
  • https://github.com/kubernetes/kubernetes/pull/82214/files
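
Pulling the logs looks roughly like this (the pod name is a placeholder; pick the kube-proxy pod running on the affected node):

kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system logs <kube-proxy-pod-name> | grep -i iptables-restore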

5. Digging deeper

Reading the code (pkg/proxy/ipvs/proxier.go:1427 in version 1.13.10) shows that this version neither checks whether the KUBE-MARK-DROP chain exists nor creates it. When the chain is missing, this logic flaw makes the iptables-restore command fail.
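
A quick way to confirm the missing piece is sketched below. Note that it must be run with the same iptables backend/binary that kube-proxy itself uses (for example from inside the kube-proxy container), and creating the chain by hand is only an illustrative stop-gap, not the proper fix (upgrading is):

# Does the jump target exist in the nat table?
iptables -t nat -nL KUBE-MARK-DROP

# Illustrative stop-gap only: create the missing chain so that
# iptables-restore no longer fails on the unknown jump target
iptables -t nat -N KUBE-MARK-DROP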

As for why the K8s master service VIP was unreachable while the real container IPs were reachable, the reason is related to the following iptables rule:

iptables -t nat -A KUBE-SERVICES ! -s 9.0.0.0/8 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
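
This rule only matches destinations in the KUBE-CLUSTER-IP ipset, i.e. service VIPs. Traffic to a VIP such as 10.96.0.1 goes through this service path, while traffic to a real pod or host IP does not, which is consistent with only the VIP being affected. The set and the surrounding rules can be inspected like this (a sketch):

# Is the master service VIP in the service-VIP ipset?
ipset list KUBE-CLUSTER-IP | grep "10.96.0.1"

# Dump the service-related nat rules for comparison
iptables -t nat -S KUBE-SERVICES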

6. Root cause research

At this point we knew that kube-proxy 1.13.10 has a flaw: it runs iptables-restore to configure rules without first creating the KUBE-MARK-DROP chain. But why does K8s 1.13.10 fail on CentOS 8.2 with a 4.18 kernel, while it runs fine on CentOS 7.6 with a 3.10 kernel?

Looking at the kube-proxy source, it configures its rules by executing iptables commands. Since kube-proxy reported that iptables-restore was failing, we picked a machine with the 4.18 kernel and went into the kube-proxy container to take a look. Running iptables-save inside the container showed that kube-proxy had indeed not created the KUBE-MARK-DROP chain (consistent with the code). Running iptables-save on the host, however, showed that the KUBE-MARK-DROP chain did exist.
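
Roughly the comparison we did (the pod name is a placeholder):

# Inside the kube-proxy container: no KUBE-MARK-DROP chain
kubectl -n kube-system exec <kube-proxy-pod-name> -- iptables-save | grep KUBE-MARK-DROP

# On the 4.18 / CentOS 8.2 host: the chain is there
iptables-save | grep KUBE-MARK-DROP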

This raised two questions:

  • Why does the iptables rule set on the 4.18-kernel host contain the KUBE-MARK-DROP chain?
  • Why are the iptables rules on the 4.18-kernel host inconsistent with the rules inside the kube-proxy container?

For the first question, we suspected that some program other than kube-proxy was also manipulating iptables, so we kept digging through the K8s code. Conclusion: besides kube-proxy, kubelet also modifies iptables rules; the specific code is in pkg/kubelet/kubelet_network_linux.go.

For the second question, we kept digging (and googling) into why the host and the container show different iptables rules even though the kube-proxy container mounts the host's /run/xtables.lock file. Conclusion: CentOS 8 drops iptables and uses the nftables framework as its default packet filtering tool.
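
In other words, kubelet on the host writes its rules through the host's nft-backed iptables, while the old iptables binary inside the kube-proxy image uses the legacy backend, so the two see different rule sets. If the legacy tools happen to be installed on the host, the split can be observed side by side (a sketch; binary availability varies by distribution):

iptables-save | grep KUBE-MARK-DROP          # nft backend view (kubelet's rules)
iptables-legacy-save | grep KUBE-MARK-DROP   # legacy backend view, if installed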

At this point, all the mysteries were solved.

Our team has delivered a large number of customer projects; here are answers to a few more questions that tend to come up:

  • Question 1: With so many customer environments, why did we only hit this now?

Because it takes the combination of K8s 1.13.10 and a CentOS 8.2 operating system, which is rare, for the problem to appear. Upgrading to K8s 1.16.0+ avoids the problem.

  • Question 2: Why is there no such problem with K8s 1.13.10 on a 5.5 kernel?

Because on a CentOS 8 system where we had manually upgraded the kernel to version 5.5, the iptables framework was still used by default.

You can confirm whether nftables is in use with the iptables -V command.
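
The backend shows up in the version string; the exact version numbers below are only examples:

# CentOS 8 style output: the "(nf_tables)" suffix means the nftables backend
iptables -V
#   iptables v1.8.4 (nf_tables)

# CentOS 7 style output: legacy iptables, no backend suffix
iptables -V
#   iptables v1.4.21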

A side note: what exactly is nftables, and is it better than iptables? That is another topic worth digging into, but we will not go deeper here.

Summary and takeaways

For the problem above, the solutions can be summarized as follows:

  • Use a 3.10 kernel (CentOS 7.6+), or manually upgrade the kernel to 5.0+;
  • Upgrade the Kubernetes version; we have confirmed that 1.16.10+ does not have this problem.

That is our bit of experience with Kubernetes network troubleshooting. We hope it helps you investigate similar problems efficiently and pinpoint the root cause.

If there is anything else you would like to know about the Erda project, feel free to add the assistant on WeChat (Erda202106) and join the community!

Welcome to open source

Erda is an open-source, one-stop cloud-native PaaS platform with platform-level capabilities such as DevOps, microservice observability and governance, multi-cloud management, and fast-data governance. Click the links below to take part in the open source project, discuss and exchange ideas with other developers, and help build the open source community. Follows, code contributions, and Stars are all welcome!

  • Erda Github Address :https://github.com/erda-project/erda
  • Erda Cloud Official website :https://www.erda.cloud/
