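I'm LEE, Lao Li, a technical veteran who has been working in the IT industry for 16 years.

Background

We all know that when a k8s cluster runs short of capacity, the usual answer is to add nodes. A few days ago a colleague ran into a problem while expanding capacity: he upgraded the CPU of one node in the cluster and rebooted it, after which kubelet failed to start, a large number of pods were evicted, and the alert phone would not stop ringing. To restore the business quickly, I jumped in to help with the recovery.

Observing the symptoms

Once I knew the background, I logged in to the rebooted node and ran a series of network tests, confirming that the host could reach the IPs and ports of both the Apiserver and the Loadbalancer. I then checked the kubelet log, and sure enough, the following errors pointed straight at the problem: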

E1121 23:43:52.644552   23453 policy_static.go:158] "Static policy invalid state, please drain node and remove policy state file" err="current set of available CPUs \"0-7\" doesn't match with CPUs in state \"0-3\""
E1121 23:43:52.644569   23453 cpu_manager.go:230] "Policy start error" err="current set of available CPUs \"0-7\" doesn't match with CPUs in state \"0-3\""
E1121 23:43:52.644587   23453 kubelet.go:1431] "Failed to start ContainerManager" err="start cpu manager error: current set of available CPUs \"0-7\" doesn't match with CPUs in state \"0-3\""

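At this point many of you will say: "That's it??" Yes, that really is it. So why does it happen? Because kubelet has an important startup parameter, --cpu-manager-policy, which determines the policy kubelet follows when handing out host CPUs. If it is set to static, kubelet writes a state file named cpu_manager_state in the directory given by --root-dir (by default /var/lib/kubelet).

The contents of cpu_manager_state look roughly like this: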

{ "policyName": "static", "defaultCpuSet": "0-7", "checksum": 14413152 }

If you change the CPU configuration of a k8s node that uses the static CPU management policy, kubelet reads cpu_manager_state at startup and compares it with the CPUs the host currently has. If the two do not match, kubelet refuses to start.

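How it works

Now that we have seen the symptom and located the fault, let's use this small problem as an excuse to review how k8s manages CPUs.

The official documentation is here:
https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/cpu-management-policies/

I also want to say a little more about the overall CPU Manager architecture, so that you get the big picture and a deeper understanding of what the official CPU management policies actually do.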

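CPU Manager architecture

When CPU Manager assigns specific CPUs to an eligible Container, it follows the CPU Topology as far as possible, that is, it takes CPU affinity into account. CPUs are selected in the following order of preference (logical CPUs are hyperthreads):

If the number of logical CPUs the Container requests is not less than the number of logical CPUs in a single CPU socket, the logical CPUs of a whole CPU socket are assigned to the Container first.
If the number of logical CPUs the Container still needs is not less than the number of logical CPUs provided by a single physical CPU core, the logical CPUs of a whole physical CPU core are assigned to the Container first.

The remaining logical CPUs the Container needs are picked from a list of logical CPUs sorted by the following rules:

Number of available CPUs on the same socket
Number of available CPUs on the same core

Reference code: pkg/kubelet/cm/cpumanager/cpu_assignment.go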

func takeByTopology(topo *topology.CPUTopology, availableCPUs cpuset.CPUSet, numCPUs int) (cpuset.CPUSet, error) {
    acc := newCPUAccumulator(topo, availableCPUs, numCPUs)
    if acc.isSatisfied() {
        return acc.result, nil
    }
    if acc.isFailed() {
        return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
    }

    // Algorithm: topology-aware best-fit
    // 1. Acquire whole sockets, if available and the container requires at
    //    least a socket's-worth of CPUs.
    for _, s := range acc.freeSockets() {
        if acc.needs(acc.topo.CPUsPerSocket()) {
            glog.V(4).Infof("[cpumanager] takeByTopology: claiming socket [%d]", s)
            acc.take(acc.details.CPUsInSocket(s))
            if acc.isSatisfied() {
                return acc.result, nil
            }
        }
    }

    // 2. Acquire whole cores, if available and the container requires at least
    //    a core's-worth of CPUs.
    for _, c := range acc.freeCores() {
        if acc.needs(acc.topo.CPUsPerCore()) {
            glog.V(4).Infof("[cpumanager] takeByTopology: claiming core [%d]", c)
            acc.take(acc.details.CPUsInCore(c))
            if acc.isSatisfied() {
                return acc.result, nil
            }
        }
    }

    // 3. Acquire single threads, preferring to fill partially-allocated cores
    //    on the same sockets as the whole cores we have already taken in this
    //    allocation.
    for _, c := range acc.freeCPUs() {
        glog.V(4).Infof("[cpumanager] takeByTopology: claiming CPU [%d]", c)
        if acc.needs(1) {
            acc.take(cpuset.NewCPUSet(c))
        }
        if acc.isSatisfied() {
            return acc.result, nil
        }
    }

    return cpuset.NewCPUSet(), fmt.Errorf("failed to allocate cpus")
}

Discovering the CPU Topology

Reference code: vendor/github.com/google/cadvisor/info/v1/machine.go

type MachineInfo struct {
    // The number of cores in this machine.
    NumCores int `json:"num_cores"`

    ...

    // Machine Topology
    // Describes cpu/memory layout and hierarchy.
    Topology []Node `json:"topology"`

    ...
}

type Node struct {
    Id int `json:"node_id"`
    // Per-node memory
    Memory uint64  `json:"memory"`
    Cores  []Core  `json:"cores"`
    Caches []Cache `json:"caches"`
}

cAdvisor builds the CPU topology information in GetTopology. It mainly reads /proc/cpuinfo on the host to construct the CPU Topology, and reads /sys/devices/system/cpu/cpu to obtain the CPU cache information.

Reference code: vendor/github.com/google/cadvisor/info/v1/machine.go

func GetTopology(sysFs sysfs.SysFs, cpuinfo string) ([]info.Node, int, error) {
    nodes := []info.Node{}

    ...
    return nodes, numCores, nil
}

Pod creation flow

So how is a Container created under the static policy mentioned earlier? kubelet picks the best CPU Set for it according to CPU affinity.

When a Container is created, the CPU Manager workflow is roughly as follows:

Kuberuntime calls the container runtime to create the container.
Kuberuntime hands the container over to CPU Manager.
CPU Manager processes the Container according to the static policy.
CPU Manager selects the CPUs with the "best" topology from the current Shared Pool. For a Container that does not qualify under the Static Policy, it returns the set of all CPUs in the Shared Pool.
CPU Manager records the Container's CPU assignment in the Checkpoint State and removes the newly assigned CPUs from the Shared Pool.
CPU Manager then reads the Container's CPU assignment back from the state and applies it to the cpuset cgroup through the UpdateContainerResources CRI interface. This also covers Containers that do not qualify under the Static Policy.
Kuberuntime calls the container runtime to start the container.
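Which containers "qualify" under the static policy is worth spelling out. According to the official CPU management documentation, only containers in Guaranteed QoS pods that request a whole number of CPUs get exclusive CPUs; everything else runs on the shared pool. The Go sketch below expresses that rule with plain values; exclusiveCPUs is a hypothetical helper, and the QoS class is passed in as a boolean rather than derived from a real Pod object.

```go
package main

import "fmt"

// exclusiveCPUs returns how many CPUs a container would get exclusively
// under the static policy, or 0 if it stays on the shared pool.
// cpuRequestMilli is the CPU request in millicores (1000 = one full CPU).
func exclusiveCPUs(podIsGuaranteed bool, cpuRequestMilli int64) int {
	if !podIsGuaranteed {
		return 0 // Burstable and BestEffort pods always use the shared pool
	}
	if cpuRequestMilli%1000 != 0 {
		return 0 // fractional requests such as 1500m are not pinned
	}
	return int(cpuRequestMilli / 1000)
}

func main() {
	fmt.Println(exclusiveCPUs(true, 2000))  // Guaranteed, 2 CPUs
	fmt.Println(exclusiveCPUs(true, 1500))  // Guaranteed, 1.5 CPUs
	fmt.Println(exclusiveCPUs(false, 2000)) // not Guaranteed
}
```

In the Allocate function quoted later, this decision corresponds to the check on the value returned by p.guaranteedCPUs: a result of 0 means the container is left on the default cpuset.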

Reference code: pkg/kubelet/cm/cpumanager/cpu_manager.go

func (m *manager) AddContainer(pod *v1.Pod, container *v1.Container, containerID string) {
    m.Lock()
    defer m.Unlock()
    if cset, exists := m.state.GetCPUSet(string(pod.UID), container.Name); exists {
        m.lastUpdateState.SetCPUSet(string(pod.UID), container.Name, cset)
    }
    m.containerMap.Add(string(pod.UID), container.Name, containerID)
}

Reference code: pkg/kubelet/cm/cpumanager/policy_static.go

func NewStaticPolicy(topology *topology.CPUTopology, numReservedCPUs int, reservedCPUs cpuset.CPUSet, affinity topologymanager.Store, cpuPolicyOptions map[string]string) (Policy, error) {
    opts, err := NewStaticPolicyOptions(cpuPolicyOptions)
    if err != nil {
        return nil, err
    }

    klog.InfoS("Static policy created with configuration", "options", opts)

    policy := &staticPolicy{
        topology:    topology,
        affinity:    affinity,
        cpusToReuse: make(map[string]cpuset.CPUSet),
        options:     opts,
    }

    allCPUs := topology.CPUDetails.CPUs()
    var reserved cpuset.CPUSet
    if reservedCPUs.Size() > 0 {
        reserved = reservedCPUs
    } else {
        // takeByTopology allocates CPUs associated with low-numbered cores from
        // allCPUs.
        //
        // For example: Given a system with 8 CPUs available and HT enabled,
        // if numReservedCPUs=2, then reserved={0,4}
        reserved, _ = policy.takeByTopology(allCPUs, numReservedCPUs)
    }

    if reserved.Size() != numReservedCPUs {
        err := fmt.Errorf("[cpumanager] unable to reserve the required amount of CPUs (size of %s did not equal %d)", reserved, numReservedCPUs)
        return nil, err
    }

    klog.InfoS("Reserved CPUs not available for exclusive assignment", "reservedSize", reserved.Size(), "reserved", reserved)
    policy.reserved = reserved

    return policy, nil
}

func (p *staticPolicy) Allocate(s state.State, pod *v1.Pod, container *v1.Container) error {
    if numCPUs := p.guaranteedCPUs(pod, container); numCPUs != 0 {
        klog.InfoS("Static policy: Allocate", "pod", klog.KObj(pod), "containerName", container.Name)
        // container belongs in an exclusively allocated pool

        if p.options.FullPhysicalCPUsOnly && ((numCPUs % p.topology.CPUsPerCore()) != 0) {
            // Since CPU Manager has been enabled requesting strict SMT alignment, it means a guaranteed pod can only be admitted
            // if the CPU requested is a multiple of the number of virtual cpus per physical cores.
            // In case CPU request is not a multiple of the number of virtual cpus per physical cores the Pod will be put
            // in Failed state, with SMTAlignmentError as reason. Since the allocation happens in terms of physical cores
            // and the scheduler is responsible for ensuring that the workload goes to a node that has enough CPUs,
            // the pod would be placed on a node where there are enough physical cores available to be allocated.
            // Just like the behaviour in case of static policy, takeByTopology will try to first allocate CPUs from the same socket
            // and only in case the request cannot be sattisfied on a single socket, CPU allocation is done for a workload to occupy all
            // CPUs on a physical core. Allocation of individual threads would never have to occur.
            return SMTAlignmentError{
                RequestedCPUs: numCPUs,
                CpusPerCore:   p.topology.CPUsPerCore(),
            }
        }
        if cpuset, ok := s.GetCPUSet(string(pod.UID), container.Name); ok {
            p.updateCPUsToReuse(pod, container, cpuset)
            klog.InfoS("Static policy: container already present in state, skipping", "pod", klog.KObj(pod), "containerName", container.Name)
            return nil
        }

        // Call Topology Manager to get the aligned socket affinity across all hint providers.
        hint := p.affinity.GetAffinity(string(pod.UID), container.Name)
        klog.InfoS("Topology Affinity", "pod", klog.KObj(pod), "containerName", container.Name, "affinity", hint)

        // Allocate CPUs according to the NUMA affinity contained in the hint.
        cpuset, err := p.allocateCPUs(s, numCPUs, hint.NUMANodeAffinity, p.cpusToReuse[string(pod.UID)])
        if err != nil {
            klog.ErrorS(err, "Unable to allocate CPUs", "pod", klog.KObj(pod), "containerName", container.Name, "numCPUs", numCPUs)
            return err
        }
        s.SetCPUSet(string(pod.UID), container.Name, cpuset)
        p.updateCPUsToReuse(pod, container, cpuset)

    }
    // container belongs in the shared pool (nothing to do; use default cpuset)
    return nil
}

func (p *staticPolicy) allocateCPUs(s state.State, numCPUs int, numaAffinity bitmask.BitMask, reusableCPUs cpuset.CPUSet) (cpuset.CPUSet, error) {
    klog.InfoS("AllocateCPUs", "numCPUs", numCPUs, "socket", numaAffinity)

    allocatableCPUs := p.GetAllocatableCPUs(s).Union(reusableCPUs)

    // If there are aligned CPUs in numaAffinity, attempt to take those first.
    result := cpuset.NewCPUSet()
    if numaAffinity != nil {
        alignedCPUs := cpuset.NewCPUSet()
        for _, numaNodeID := range numaAffinity.GetBits() {
            alignedCPUs = alignedCPUs.Union(allocatableCPUs.Intersection(p.topology.CPUDetails.CPUsInNUMANodes(numaNodeID)))
        }

        numAlignedToAlloc := alignedCPUs.Size()
        if numCPUs < numAlignedToAlloc {
            numAlignedToAlloc = numCPUs
        }

        alignedCPUs, err := p.takeByTopology(alignedCPUs, numAlignedToAlloc)
        if err != nil {
            return cpuset.NewCPUSet(), err
        }

        result = result.Union(alignedCPUs)
    }

    // Get any remaining CPUs from what's leftover after attempting to grab aligned ones.
    remainingCPUs, err := p.takeByTopology(allocatableCPUs.Difference(result), numCPUs-result.Size())
    if err != nil {
        return cpuset.NewCPUSet(), err
    }
    result = result.Union(remainingCPUs)

    // Remove allocated CPUs from the shared CPUSet.
    s.SetDefaultCPUSet(s.GetDefaultCPUSet().Difference(result))

    klog.InfoS("AllocateCPUs", "result", result)
    return result, nil
}

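Pod deletion flow

When a Container whose CPUs were assigned by CPU Manager is deleted, the CPU Manager workflow is roughly as follows:

Kuberuntime calls CPU Manager, which handles the removal according to the static policy.
CPU Manager returns the CPU Set assigned to the container to the Shared Pool.
Kuberuntime calls the container runtime to remove the container.
CPU Manager runs its reconcile loop asynchronously and updates the CPU set of the containers that use CPUs from the shared pool.

Reference code: pkg/kubelet/cm/cpumanager/cpu_manager.go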

func (m *manager) RemoveContainer(containerID string) error {
    m.Lock()
    defer m.Unlock()

    err := m.policyRemoveContainerByID(containerID)
    if err != nil {
        klog.ErrorS(err, "RemoveContainer error")
        return err
    }

    return nil
}

Reference code: pkg/kubelet/cm/cpumanager/policy_static.go

func (p *staticPolicy) RemoveContainer(s state.State, podUID string, containerName string) error {
    klog.InfoS("Static policy: RemoveContainer", "podUID", podUID, "containerName", containerName)
    if toRelease, ok := s.GetCPUSet(podUID, containerName); ok {
        s.Delete(podUID, containerName)
        // Mutate the shared pool, adding released cpus.
        s.SetDefaultCPUSet(s.GetDefaultCPUSet().Union(toRelease))
    }
    return nil
}

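The fix

Once the cause is known, the fix is simple and takes only two steps. As the kubelet log message itself suggests, drain the node first if it is still running workloads.

Delete the existing cpu_manager_state file
Restart kubelet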
