Symptom: the nodes check out as healthy, but the cs (componentstatus) check reports the scheduler and controller-manager components as unhealthy
$ kubectl get nodes,cs
NAME STATUS ROLES AGE VERSION
node/test-k8s-master00 Ready master 25h v1.18.6
node/test-k8s-master01 Ready master 24h v1.18.6
node/test-k8s-master02 Ready master 24h v1.18.6
node/test-k8s-node00 Ready <none> 24h v1.18.6
node/test-k8s-node01 Ready <none> 24h v1.18.6
NAME STATUS MESSAGE ERROR
componentstatus/controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
componentstatus/scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
componentstatus/etcd-0 Healthy {"health":"true"}
Troubleshooting process:
1. A manual connection attempt confirms that the connection is indeed refused
$ curl -k http://127.0.0.1:10251/healthz
curl: (7) Failed to connect to 127.0.0.1 port 10251: Connection refused
2. Check the pod and container status
$ kubectl get pod -n kube-system | grep sche
kube-scheduler-apron-k8s-master00 1/1 Running 8 25h
kube-scheduler-apron-k8s-master01 1/1 Running 0 14m
kube-scheduler-apron-k8s-master02 1/1 Running 3 24h
$ docker ps | grep sched
15fbf835497b 0e0972b2b5d1 "kube-scheduler --au…" 23 minutes ago Up 23 minutes k8s_kube-scheduler_kube-scheduler-apron-k8s-master02_kube-system_0643afa2262d08f779c8829c02532d96_3
811c213657d0 k8s.gcr.io/pause:3.2 "/pause" 23 minutes ago Up 23 minutes k8s_POD_kube-scheduler-apron-k8s-master02_kube-system_0643afa2262d08f779c8829c02532d96_2
The output above shows that the pods are ready
3. Check the scheduler's YAML manifest
$ cat /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
.......(omitted)
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
......
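The manifest shows that the liveness probe uses port 10259 (over HTTPS) for its health check, so the port can be checked directly inside the container. The container image does not include network tools such as netstat or ss, so the only option is to read /proc/net/tcp directly, which lists the IPv4 TCP sockets and their states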
$ docker exec -it 15fbf835497b /bin/sh
# grep -E ":$(printf "%04X" 10259)\\b" /proc/net/tcp
8: 0100007F:2813 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 40368 1 0000000000000000 100 0 0 10 0
Note: printf "%04X" 10259 converts 10259 to its hexadecimal form (2813) on the fly
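For reference, the conversion can be reproduced by hand, and the matching entry can be decoded back into a readable address. The following is a minimal sketch in POSIX shell. The helper names port_to_hex and decode_proc_tcp_addr are made up for this example, and the byte reversal assumes a little-endian host such as x86_64.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any Kubernetes tooling.

# Convert a decimal port to the 4-digit uppercase hex form used in /proc/net/tcp.
port_to_hex() {
    printf '%04X\n' "$1"
}

# Decode a local_address field such as 0100007F:2813 into dotted-quad:port.
# Assumes a little-endian host (e.g. x86_64), where the IPv4 bytes appear reversed.
decode_proc_tcp_addr() {
    addr_hex=${1%%:*}    # text before the colon, e.g. 0100007F
    port_hex=${1##*:}    # text after the colon, e.g. 2813
    b1=$(printf '%s' "$addr_hex" | cut -c1-2)
    b2=$(printf '%s' "$addr_hex" | cut -c3-4)
    b3=$(printf '%s' "$addr_hex" | cut -c5-6)
    b4=$(printf '%s' "$addr_hex" | cut -c7-8)
    # printf accepts 0x-prefixed arguments for %d, so no external tool is needed.
    printf '%d.%d.%d.%d:%d\n' "0x$b4" "0x$b3" "0x$b2" "0x$b1" "0x$port_hex"
}

port_to_hex 10259                      # prints 2813
decode_proc_tcp_addr 0100007F:2813     # prints 127.0.0.1:10259
```

Decoding the field this way makes it easy to tell whether a socket is bound to the loopback address only or to all interfaces (00000000).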
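The grep returns a matching entry (0100007F:2813 in state 0A, i.e. 127.0.0.1:10259 in LISTEN), so the port is open as expected inside the container;
it just cannot be reached from outside. The next step is therefore to check the security-related settings in its YAML file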
$ cat /etc/kubernetes/manifests/kube-scheduler.yaml
......(omitted)
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --feature-gates=TTLAfterFinished=true
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.18.6
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
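What this configuration shows: the liveness probe accesses port 10259 over HTTPS, and --port=0 disables the HTTP health-check port (the insecure port). In other words, the plain-HTTP request seen in the earlier kubectl get cs output, Get http://127.0.0.1:10251/healthz, is not supported with this configuration.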
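Whether a control-plane manifest disables the insecure port can also be checked mechanically. This is a minimal sketch, assuming the kubeadm default manifest directory /etc/kubernetes/manifests and assuming the controller-manager manifest is named kube-controller-manager.yaml (only the scheduler path appears in the output above). The function name insecure_port_disabled is hypothetical.

```shell
#!/bin/sh
# Succeeds (exit 0) if the manifest passes "--port=0" as a container argument.
insecure_port_disabled() {
    grep -qE -- '^[[:space:]]*-[[:space:]]+--port=0[[:space:]]*$' "$1"
}

# Assumed kubeadm default location; adjust if the cluster uses another path.
manifest_dir=/etc/kubernetes/manifests
for name in kube-scheduler kube-controller-manager; do
    manifest="$manifest_dir/$name.yaml"
    [ -r "$manifest" ] || continue   # skip files that are absent on this host
    if insecure_port_disabled "$manifest"; then
        echo "$name: insecure HTTP port disabled (--port=0)"
    else
        echo "$name: --port=0 not set"
    fi
done
```

With the insecure port disabled, the health endpoint should still be reachable on the secure port from the master node itself, for example with curl -k https://127.0.0.1:10259/healthz. This assumes the default secure port and that /healthz is still served without authentication.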
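Solution: comment out the "- --port=0" line, or set it to the port that kubectl get cs actually probes, i.e. - --port=10251 (10259 is the HTTPS port used by the liveness probe, so it is not the value to use here). After that, the scheduler check returns to normal. The controller-manager component is handled the same way, with 10252 as its port. Note, however, that re-enabling the insecure HTTP port is not a particularly safe practice in a production environment.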
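The first option can be sketched as a small filter that comments out the flag while keeping its indentation. This is an illustration under assumptions: the function name comment_out_port_zero is made up, and the result is printed to standard output rather than written in place, so it can be reviewed before it replaces the manifest.

```shell
#!/bin/sh
# Print the manifest with the "- --port=0" argument commented out.
# The original file is left untouched.
comment_out_port_zero() {
    sed -E 's/^([[:space:]]*)(-[[:space:]]+--port=0[[:space:]]*)$/\1# \2/' "$1"
}

# Example on a throwaway file holding two argument lines from the manifest.
sample=$(mktemp)
printf '    - --leader-elect=true\n    - --port=0\n' > "$sample"
comment_out_port_zero "$sample"
rm -f "$sample"
```

Once the edited manifest is saved under /etc/kubernetes/manifests, the kubelet picks up the change and recreates the static pod. Rerunning kubectl get cs then confirms the result.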