Symptom
The k8s cluster contains pods in the Evicted state that have not been cleaned up.
# kubectl get pod -o wide -A | grep Evicted
simulation-prod cloud-simulation-dead-letter-worker-d96bdcf98-dxt7h 0/1 Evicted 0 42d <none> cn-shanghai.172.22.0.194 <none> <none>
Investigation
The pod's status shows Status: Failed and Reason: Evicted. The Message field shows that the pod was evicted because the node ran low on disk resources (ephemeral-storage).
# kubectl -n simulation-prod describe pod cloud-simulation-dead-letter-worker-d96bdcf98-dxt7h
Name:           cloud-simulation-dead-letter-worker-d96bdcf98-dxt7h
Namespace:      simulation-prod
Priority:       0
Node:           cn-shanghai.172.22.0.194/
Start Time:     Mon, 29 Nov 2021 15:48:25 +0800
Labels:         app.kubernetes.io/instance=cloud-simulation-dead-letter-worker
                app.kubernetes.io/name=cloud-simulation-dead-letter-worker
                pod-template-hash=d96bdcf98
Annotations:    kubernetes.io/psp: ack.privileged
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage. Container cloud-simulation-dead-letter-worker was using 291599484Ki, which exceeds its request of 0.
IP:
IPs:            <none>
Controlled By:  ReplicaSet/cloud-simulation-dead-letter-worker-d96bdcf98
Containers:
  cloud-simulation-dead-letter-worker:
    Image:      registry-vpc.cn-shanghai.aliyuncs.com/xxx/cloud_sim:1.1.2111290718.f0cfa04
    Port:       <none>
    Host Port:  <none>
    Command:
      /root/entry/dead_letter_worker.py
    Environment:
      DEPLOYMENT:  prod
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cloud-simulation-dead-letter-worker-token-4z2xv (ro)
Volumes:
  cloud-simulation-dead-letter-worker-token-4z2xv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cloud-simulation-dead-letter-worker-token-4z2xv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-type=simulation-prod
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
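When only the eviction reason is of interest, the full describe output is more than needed. A minimal sketch, assuming the pod still exists in the API server: the helper name eviction_reason is hypothetical, and it only reads the status.reason and status.message fields shown above.

```shell
# eviction_reason NAMESPACE POD
# Hypothetical helper: prints "<reason>: <message>" for a single pod,
# taken from .status.reason and .status.message.
eviction_reason() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'
}
```

For the pod above, this would be called as eviction_reason simulation-prod cloud-simulation-dead-letter-worker-d96bdcf98-dxt7h.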
Cause
Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on a node.
The kubelet monitors resources such as CPU, memory, disk space, and filesystem inodes on the cluster's nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.
During a node-pressure eviction, the kubelet sets the PodPhase of the selected pods to Failed, which terminates them.
Node-pressure eviction is different from API-initiated eviction. The kubelet does not respect your configured PodDisruptionBudget or the pod's terminationGracePeriodSeconds.
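To check whether a node is still under the kind of pressure described above, the DiskPressure node condition can be queried. This is a rough sketch rather than part of the original troubleshooting: the function name nodes_under_disk_pressure is hypothetical, and it assumes awk is available next to kubectl.

```shell
# Hypothetical helper: prints the names of nodes whose DiskPressure
# condition currently has status "True".
nodes_under_disk_pressure() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' |
    awk '$2 == "True" { print $1 }'
}
```

An empty result means no node reports disk pressure at the moment, although pods evicted earlier may still be listed as Failed.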
Solution
The kubelet does not delete pods that are in the Status: Failed, Reason: Evicted state, so a k8s CronJob is used to delete these pods on a schedule.
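Before automating the cleanup, it can help to see which pods would be affected. A minimal sketch, assuming awk is available: the function name list_evicted is hypothetical. It uses a field selector to fetch only Failed pods and then keeps those whose reason is Evicted.

```shell
# Hypothetical helper: prints "namespace name" for every pod whose phase is
# Failed and whose status.reason is Evicted.
list_evicted() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.reason}{"\n"}{end}' |
    awk '$3 == "Evicted" { print $1, $2 }'
}
```

The reason is filtered client-side because status.reason is not assumed here to be usable as a field selector.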
$ vim 01-sa.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: delete-evicted-pods
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: delete-evicted-pods
  namespace: delete-evicted-pods
$ vim 02-cr.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: delete-evicted-pods
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
$ vim 03-crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: delete-evicted-pods
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: delete-evicted-pods
subjects:
- kind: ServiceAccount
  name: delete-evicted-pods
  namespace: delete-evicted-pods
$ vim 04-cj.yaml
# batch/v1beta1 was removed in Kubernetes 1.25; use batch/v1 on 1.21 and later.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: delete-evicted-pods
  namespace: delete-evicted-pods
spec:
  schedule: "*/30 * * * *"   # every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: delete-evicted-pods
          containers:
          - name: kubectl-runner
            image: bitnami/kubectl:1.21.8
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - |
              # Print "name namespace creationTimestamp reason" for each Failed pod,
              # then delete the Evicted ones created more than 259200 s (3 days) ago.
              kubectl get pods --all-namespaces -o go-template='{{range .items}}{{if eq .status.phase "Failed"}}{{.metadata.name}} {{.metadata.namespace}} {{.metadata.creationTimestamp}} {{.status.reason}}{{"\n"}}{{end}}{{end}}' |
              while read epod namespace ct reason; do
                if [ "$reason" = "Evicted" ] && [ $(( $(date +%s) - $(date -d "$ct" +%s) )) -gt 259200 ]; then
                  echo "$(date '+%Y-%m-%d %H:%M:%S') delete $namespace $reason $epod"
                  kubectl -n "$namespace" delete pod "$epod"
                fi
              done
          restartPolicy: OnFailure
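The age filter inside the CronJob command can be tried out without deleting anything. The sketch below lifts that logic into a function that reads the same "name namespace creationTimestamp reason" lines from stdin. The name select_evicted and its two arguments are hypothetical, and it assumes GNU date (for date -d), as the CronJob command itself does.

```shell
# select_evicted NOW MAX_AGE
# Hypothetical dry-run helper. Reads "name namespace creationTimestamp reason"
# lines on stdin and prints "namespace name" for pods whose reason is Evicted
# and whose creation time is more than MAX_AGE seconds before NOW
# (NOW is given in seconds since the epoch).
select_evicted() {
  now=$1
  max_age=$2
  while read -r epod namespace ct reason; do
    [ "$reason" = "Evicted" ] || continue
    # Skip lines whose timestamp cannot be parsed.
    created=$(date -d "$ct" +%s) || continue
    if [ $((now - created)) -gt "$max_age" ]; then
      echo "$namespace $epod"
    fi
  done
}
```

Piping the output of the kubectl get pods command from the CronJob into select_evicted "$(date +%s)" 259200 lists the pods the job would delete, without deleting them.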
References:
- Pod Lifecycle: https://kubernetes.io/zh/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
- Node-pressure Eviction: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/node-pressure-eviction/
- Pod selection for kubelet eviction: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/node-pressure-eviction/#kubelet-%E9%A9%B1%E9%80%90%E6%97%B6-pod-%E7%9A%84%E9%80%89%E6%8B%A9
- Kubelet does not delete evicted pods: https://github.com/kubernetes/kubernetes/issues/55051
- Field Selectors, chained selectors: https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/field-selectors/#chained-selectors
- Using RBAC Authorization: https://kubernetes.io/zh/docs/reference/access-authn-authz/rbac/