Accidentally drained all nodes in Kubernetes (even master)

Posted 2020-07-27 19:10

I accidentally drained all nodes in Kubernetes (even master). How can I bring my Kubernetes back? kubectl is not working anymore:

kubectl get nodes

Result:

The connection to the server 172.16.16.111:6443 was refused - did you specify the right host or port?

Here is the output of systemctl status kubelet on the master node (node1):

● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-06-23 21:42:39 UTC; 25min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 15541 (kubelet)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/kubelet.service
           └─15541 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.16.16.111 --hostname-override=node1 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --cpu-manager-policy=static --kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin

Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330009   15541 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330201   15541 setters.go:73] Using node IP: "172.16.16.111"
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331475   15541 kubelet_node_status.go:472] Recording NodeHasSufficientMemory event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331494   15541 kubelet_node_status.go:472] Recording NodeHasNoDiskPressure event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331500   15541 kubelet_node_status.go:472] Recording NodeHasSufficientPID event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331661   15541 policy_static.go:244] [cpumanager] static policy: RemoveContainer (container id: 6dd59735cabf973b6d8b2a46a14c0711831daca248e918bfcfe2041420931963)
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.332058   15541 pod_workers.go:191] Error syncing pod 93ff1a9840f77f8b2b924a85815e17fe ("kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.427587   15541 kubelet.go:2267] node "node1" not found
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.506152   15541 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Get https://172.16.16.111:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.16.16.111:6443: connect: connection refused
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.527813   15541 kubelet.go:2267] node "node1" not found

I'm using Ubuntu 18.04, and there are 7 compute nodes in my cluster. All of them have been drained (accidentally, kind of!). I installed my K8s cluster using Kubespray.

Is there any way to uncordon any of these nodes, so that the necessary k8s pods can be scheduled again?
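
Normally I would just uncordon each node with kubectl, but since the apiserver itself is down, every kubectl call fails (node names below are just from my inventory):

kubectl uncordon node1   # what I would normally run, once per node
kubectl uncordon node2
# ...but right now every call fails with "connection refused" on 172.16.16.111:6443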

Any help would be appreciated.

Update:

I asked a separate question about how to connect to etcd here: Can't connect to the ETCD of Kubernetes

1 Answer

一夜七次
#2 · 2020-07-27 19:45

If you have production or 'live' workloads, the safest approach is to provision a new cluster and switch the workloads over gradually.

Kubernetes keeps its state in etcd, so you could potentially connect to etcd and clear the 'drained' state. However, you would probably have to look at the source code to see where that happens and where the specific keys/values are stored in etcd.

The logs you shared basically show that the kube-apiserver cannot start, so it's likely that during startup it tries to talk to etcd and the stored state effectively tells it: "you cannot start on this node because it has been drained".
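
As a first check on node1 (assuming Docker is the container runtime, which is what Kubespray used by default around that version), you can look at the crashing kube-apiserver container directly, without going through kubectl:

# list all containers, including exited ones, and find the kube-apiserver
docker ps -a | grep kube-apiserver

# print the last log lines of that container to see why it exits
# (replace the placeholder with the ID printed by the previous command)
docker logs --tail 50 <kube-apiserver-container-id>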

The typical startup sequence for the masters is something like this:

  • etcd
  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
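
A quick way to see how far that sequence gets on the master (again assuming a Docker runtime) is to check for each component's container:

# etcd has to be up first; if this prints nothing, the apiserver has nothing to talk to
docker ps | grep etcd

# then the control-plane static pods
docker ps | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler'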

You can also follow any guide to connect to etcd and see if you can troubleshoot further, for example this one. Then you could examine or delete some of the node keys at your own risk (a rough etcdctl sketch follows the list of keys below):

/registry/minions/node-x1
/registry/minions/node-x2
/registry/minions/node-x3
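
As a minimal sketch, assuming the etcd v3 API, that etcdctl is available on the master, and the certificate paths that Kubespray typically generates under /etc/ssl/etcd/ssl (the endpoint and the exact cert/key file names are assumptions, so adjust them to your cluster; deleting keys is entirely at your own risk):

# list the node keys only; values under /registry are protobuf-encoded, so they are not human-readable anyway
export ETCDCTL_API=3
etcdctl --endpoints=https://172.16.16.111:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-node1.pem \
  --key=/etc/ssl/etcd/ssl/node-node1-key.pem \
  get /registry/minions/ --prefix --keys-only

# at your own risk: delete the key for one node (the node name here is just an example)
etcdctl --endpoints=https://172.16.16.111:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-node1.pem \
  --key=/etc/ssl/etcd/ssl/node-node1-key.pem \
  del /registry/minions/node-x1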