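Problem description:
This morning the physical host running the master failed, which caused the virtual machine to migrate to another host and left etcd in an abnormal state. The failing containers are shown below: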
[root@region-master2 ~]# kubectl get po -nkube-system -owide |grep 32.45
etcd-region-master1 0/1 CrashLoopBackOff 69 4m35s 10.39.32.45 region-master1 <none> <none>
kube-apiserver-region-master1 0/1 CrashLoopBackOff 56 4m24s 10.39.32.45 region-master1 <none> <none>
The etcd error log is shown in the screenshot:

image.png
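Solution
Removing and re-adding an etcd member
Remove the member (remove the faulty member so that it can rejoin the cluster and resync its data). In this example the member to remove is https://192.168.1.73:2379.
Run member list to print the IDs of all members: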
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379,https://192.168.1.73:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list -w table
Run member remove with the ID of the faulty member:
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379,https://192.168.1.73:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member remove f926bd1d34241ce0
Member f926bd1d34241ce0 removed from cluster 6294eac8c3e80ca
After this, member list no longer shows the removed member, and the etcd process on the removed member exits.
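The lookup between member list and member remove can be scripted. The following is a minimal sketch, assuming the default comma-separated output of etcdctl v3 member list (ID in the first field, peer URL in the fourth) and a single peer URL per member. The function name member_id_for_peer is hypothetical and not part of etcdctl.

```shell
# Hypothetical helper: print the ID of the member whose peer URL equals $1.
# Reads `etcdctl member list` output (default comma-separated format) on stdin.
# Assumes each member advertises exactly one peer URL.
member_id_for_peer() {
  awk -F', ' -v peer="$1" '$4 == peer { print $1 }'
}

# Usage, on a healthy member, with the same endpoint and TLS flags as above:
#   ETCDCTL_API=3 etcdctl <flags> member list | member_id_for_peer https://192.168.1.73:2380
```

Note that the match is on the peer URL (port 2380), not on the client URL (port 2379).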
Add the member
Clear the etcd data directory (run this on the host where etcd is failing):
$ rm -rf /var/lib/etcd/*
Check these three flags under spec.containers.command in /etc/kubernetes/manifests/etcd.yaml:
1. --initial-cluster-state=existing
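2. --initial-cluster lists every member of the cluster
3. --name is the member name
# Note: do not put a backup copy of etcd.yaml in /etc/kubernetes/manifests/, otherwise two etcd instances will be started, because kubelet loads every manifest in this directory at startup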
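The three checks above can be run mechanically before touching the cluster. This is a rough sketch, assuming a kubeadm-style manifest in which each flag sits on its own line as "- --flag=value". The function name check_etcd_manifest and its arguments are hypothetical. It only prints the --initial-cluster value for manual review and does not judge whether that list is complete.

```shell
# Hypothetical helper: check the three flags in an etcd static Pod manifest.
# $1 = manifest path, $2 = expected member name (the value of --name).
check_etcd_manifest() {
  manifest=$1
  name=$2
  rc=0
  # Flag 1: the member must join an existing cluster, not bootstrap a new one.
  grep -q -- '--initial-cluster-state=existing' "$manifest" \
    || { echo "missing --initial-cluster-state=existing"; rc=1; }
  # Flag 3: exact string match on the member name.
  awk -v want="--name=$name" \
    '{ for (i = 1; i <= NF; i++) if ($i == want) found = 1 } END { exit !found }' \
    "$manifest" || { echo "--name is not $name"; rc=1; }
  # Flag 2: print the member list so it can be compared with the full cluster.
  grep -o -- '--initial-cluster=[^[:space:]]*' "$manifest" \
    || { echo "missing --initial-cluster"; rc=1; }
  return "$rc"
}

# Usage:
#   check_etcd_manifest /etc/kubernetes/manifests/etcd.yaml region-master1
```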
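Once the manifest is correct, run the following command on a node where etcd is running normally. --endpoints should list only the members currently in the cluster, and the argument after member add is the value of --name: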
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.71:2379,https://192.168.1.72:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member add 192.168.1.73 --peer-urls=https://192.168.1.73:2380
Member 2226f8cff2cbbfa9 added to cluster 6294eac8c3e80ca
ETCD_NAME="192.168.1.73"
ETCD_INITIAL_CLUSTER="192.168.1.71=https://192.168.1.71:2380,192.168.1.73=https://192.168.1.73:2380,192.168.1.72=https://192.168.1.72:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.1.73:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Restart kubelet so that it brings etcd back up:
$ systemctl restart kubelet
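To confirm that the re-added member has actually joined, member list can be run again and the status column checked. This is a minimal sketch under the same assumptions about the default comma-separated output (status in the second field, peer URL in the fourth). The function name member_started is hypothetical.

```shell
# Hypothetical helper: succeed only if the member with peer URL $1 is "started".
# Reads `etcdctl member list` output (default comma-separated format) on stdin.
# A member that was added but has not joined yet is listed as "unstarted".
member_started() {
  awk -F', ' -v peer="$1" \
    '$4 == peer && $2 == "started" { ok = 1 } END { exit !ok }'
}

# Usage, on a healthy member, with the same endpoint and TLS flags as above:
#   ETCDCTL_API=3 etcdctl <flags> member list | member_started https://192.168.1.73:2380 && echo joined
```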