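1 Introduction

1.1 K8s architecture: a layered ring view

In terms of architectural layers and component dependencies, a K8s cluster can be compared to a Linux host as follows:

Fig 1. Analogy: a Linux host and a Kubernetes cluster

For a K8s cluster, the components and their roles, from the inside out, are:

1. etcd: persistent KV store; the single authoritative source of data (state) for cluster resources (pods/services/networkpolicies/...);
2. apiserver: reads (ListWatch) the full data set from etcd and caches it in memory; a stateless service that scales horizontally;
3. Infrastructure services (e.g. kubelet, *-agent, *-operator): connect to the apiserver and fetch (List/ListWatch) the data each of them needs;
4. Workloads in the cluster: when 1 and 2 are healthy, they are created, managed and reconciled by 3; for example, kubelet creates pods and cilium configures networking and security policies.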

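1.2 The roles of apiserver and etcd

As shown above, there are two levels of List/ListWatch on the system path (both over the same data):

1. apiserver List/ListWatch against etcd
2. infrastructure services List/ListWatch against apiserver

So in its simplest form, the apiserver is a proxy sitting in front of etcd: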

           +--------+              +-----------------+         +---------------+
           | Client | -----------> | Proxy (cache)   |  -----> | Data store    |
           +--------+              +-----------------+         +---------------+

         infra services               apiserver                         etcd

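In the vast majority of cases, the apiserver serves requests directly from its local cache (it caches the cluster's full data set).
In some special cases, for example,
1. the client explicitly asks to read from etcd (for the strongest data accuracy), or
2. the apiserver's local cache has not been built yet,

the apiserver has no choice but to forward the request to etcd. Pay special attention here: badly chosen LIST parameters on the client side can also land on this path.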

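1.3 `apiserver/etcd` List overhead

1.3.1 Example requests

Consider the following LIST operations:

1. LIST apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0

   Two parameters are passed here, but resourceVersion=0 makes the apiserver ignore limit=500, so the client receives the full set of ciliumendpoints.

   The full data of one resource type can be fairly large, so think carefully about whether you really need all of it. Methods for measuring and analyzing this quantitatively are described later.

2. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1

   This request fetches all pods on node1 (%3D is the escaped form of =).

   Filtering by nodename may give the impression that the data volume is small, but what happens behind the scenes is more involved than it looks:
   - First, resourceVersion=0 is not specified, so the apiserver skips its cache and reads directly from etcd;
   - Second, etcd is only a KV store and cannot filter by label/field (it only handles limit/continue);
   - So the apiserver pulls the full data set from etcd and filters it in memory, which is also expensive; the code analysis comes later.

   Avoid this behavior unless you have extremely strict requirements on data accuracy and deliberately want to bypass the apiserver cache.

3. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0

   The difference from 2 is the added resourceVersion=0, so the apiserver reads from its cache, and performance improves by orders of magnitude.

   Note, however, that although what is actually returned to the client may be only a few hundred KB to over a hundred MB (depending on the number of pods on the node, the number of labels on the pods, and so on), the amount of data the apiserver has to process may be several GB. A quantitative analysis follows later.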

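As shown above, different LIST operations have different impacts, and what the client sees may be only a small fraction of the data that apiserver/etcd process. If infrastructure services start or restart at large scale, they can very easily overwhelm the control plane.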
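The three request shapes differ only in their query strings, so it helps to build them programmatically instead of typing them by hand. The following is a minimal sketch using only the Go standard library; `listURL` is a hypothetical helper, not part of client-go. Note that an empty resourceVersion argument means the parameter is left out entirely, which is not the same as resourceVersion=0.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// listURL builds a LIST request path with its query string.
// Empty fieldSelector/resourceVersion and limit <= 0 are omitted from the query.
func listURL(resourcePath, fieldSelector, resourceVersion string, limit int) string {
	q := url.Values{}
	if fieldSelector != "" {
		q.Set("fieldSelector", fieldSelector) // "=" inside the value is escaped to %3D
	}
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	if limit > 0 {
		q.Set("limit", strconv.Itoa(limit))
	}
	if len(q) == 0 {
		return resourcePath
	}
	return resourcePath + "?" + q.Encode() // Encode sorts parameters by key
}

func main() {
	// The three example requests from this section.
	fmt.Println(listURL("apis/cilium.io/v2/ciliumendpoints", "", "0", 500))
	fmt.Println(listURL("api/v1/pods", "spec.nodeName=node1", "", 0))
	fmt.Println(listURL("api/v1/pods", "spec.nodeName=node1", "0", 0))
}
```

Building the query through `url.Values` also rules out misspelled parameter names creeping in through copy and paste, since the name appears in exactly one place.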

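1.3.2 Processing overhead

List requests fall into two types:

1. List the full data set: the cost is mainly data transfer;
2. Filter by label or field, so that only matching data is needed.

The second case, a list request with filter conditions, needs special explanation:

- In most cases the apiserver filters using its own cache, which is fast, so the time is mainly spent on data transfer;
- In the cases where the request has to be forwarded to etcd:

  as mentioned earlier, etcd is only a KV store and does not understand label/field information, so it cannot handle filter requests. What actually happens is that the apiserver pulls the full data set from etcd, filters it in memory, and then returns the result to the client.
  So besides the data transfer cost (network bandwidth), this case also consumes a lot of apiserver CPU and memory.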
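The gap between "data returned" and "data scanned" is easy to see in a toy model. The sketch below is an illustration only, with a hypothetical `pod` type and `filterByNode` function; it is not how the apiserver represents objects. It simply shows that a selector on a store without server-side filtering forces a scan of every object.

```go
package main

import "fmt"

// pod is a deliberately tiny stand-in for a real Pod object.
type pod struct {
	Name     string
	NodeName string
}

// filterByNode scans every pod and keeps those scheduled on nodeName.
// It returns the matches together with the number of objects it had to look at.
func filterByNode(all []pod, nodeName string) (matched []pod, scanned int) {
	for _, p := range all {
		scanned++
		if p.NodeName == nodeName {
			matched = append(matched, p)
		}
	}
	return matched, scanned
}

func main() {
	// 1000 pods spread evenly over 100 nodes.
	var all []pod
	for i := 0; i < 1000; i++ {
		all = append(all, pod{
			Name:     fmt.Sprintf("pod-%d", i),
			NodeName: fmt.Sprintf("node%d", i%100),
		})
	}
	matched, scanned := filterByNode(all, "node1")
	fmt.Printf("returned %d pods, scanned %d\n", len(matched), scanned)
}
```

The client only ever sees the returned slice; the scanned count is the cost paid on the server side.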

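1.4 Potential problems in large-scale deployments

Here is another example. The following line uses k8s client-go to filter pods by nodename:

 podList, err := Client().CoreV1().Pods("").List(ctx(), ListOptions{FieldSelector: "spec.nodeName=node1"})

It looks like a very simple operation, so let's look at the amount of data behind it. Taking a cluster of 4000 nodes and 100K pods as an example, the full pod data is:

1. In etcd: compact unstructured KV storage, on the order of 1GB;
2. In the apiserver cache: already structured golang objects, on the order of 2GB (TODO: needs further confirmation);
3. Returned by the apiserver: clients usually receive the default json format, which is also structured data. The json for all pods is also on the order of 2GB.

As you can see, some requests look trivial, just one line of client code, but the amount of data behind them is staggering. Filtering pods by nodeName may return only 500KB of data, but the apiserver has to filter 2GB of data. In the worst case, etcd has to process 1GB of data as well (the parameters above do hit the worst case; see the code analysis below).

When the cluster is small, the problem may go unnoticed (etcd only starts printing warning logs once the LIST response latency exceeds a threshold); at larger scale, if there are many such requests, apiserver/etcd will not be able to cope.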
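Put as a ratio, the numbers above give the amplification between what the client receives and what the control plane has to touch. The following sketch just does that arithmetic; it assumes decimal units (1GB = 10^9 bytes, 1KB = 10^3 bytes) for round numbers, and `amplification` is a hypothetical helper.

```go
package main

import "fmt"

// amplification returns how many bytes the server side processes
// for every byte that is finally returned to the client.
func amplification(processedBytes, returnedBytes int64) int64 {
	return processedBytes / returnedBytes
}

func main() {
	const (
		kb int64 = 1000
		gb int64 = 1000 * 1000 * 1000
	)
	returned := 500 * kb // pods on one node, as received by the client

	fmt.Println("apiserver:", amplification(2*gb, returned)) // 2GB filtered in memory
	fmt.Println("etcd:     ", amplification(1*gb, returned)) // 1GB read in the worst case
}
```

With these figures the apiserver processes about 4000 bytes, and etcd about 2000 bytes, for every byte the client receives.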

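1.5 Purpose of this article

By digging into the code of the k8s List/ListWatch implementation, this article aims to deepen the understanding of these performance problems and to provide some reference for stability optimization of large-scale K8s clusters.

2 Source code analysis of the apiserver `List()` operation

With the theory above as a warm-up, we can now look at the code.

2.1 Call stack and flowchart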

store.List
|-store.ListPredicate
   |-if opt == nil
   |   opt = ListOptions{ResourceVersion: ""}
   |-Init SelectionPredicate.Limit/Continue field
   |-list := e.NewListFunc()                               // objects will be stored in this list
   |-storageOpts := storage.ListOptions{opt.ResourceVersion, opt.ResourceVersionMatch, Predicate: p}
   |
   |-if MatchesSingle ok                                   // 1. when "metadata.name" is specified, get single obj
   |   // Get single obj from cache or etcd
   |
   |-return e.Storage.List(KeyRootFunc(ctx), storageOpts)  // 2. get all objs and perform filtering
      |-cacher.List()
         | // case 1: list all from etcd and filter in apiserver
         |-if shouldDelegateList(opts)                     // true if resourceVersion == ""
         |    return c.storage.List                        // list from etcd
         |             |- fromRV *int64 = nil
         |             |- if len(storageOpts.ResourceVersion) > 0
         |             |     rv = ParseResourceVersion
         |             |     fromRV = &rv
         |             |
         |             |- for hasMore {
         |             |    objs := etcdclient.KV.Get()
         |             |    filter(objs)                   // filter by labels or fields
         |             | }
         |
         | // case 2: apiserver local cache not ready yet, fall back to etcd
         |-if cache.notready()
         |   return c.storage.List                         // get from etcd
         |
         | // case 3: list & filter from apiserver local cache (memory)
         |-obj := watchCache.WaitUntilFreshAndGet
         |-for elem in obj.(*storeElement)
         |   listVal.Set()                                 // append results to listObj
         |-return  // results stored in listObj

The corresponding flowchart:

Fig 2-1. List operation processing in apiserver

2.2 Request handling entry point: `List()`

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L361

// Abridged. Filters by the LabelSelector and FieldSelector given in PredicateFunc and returns a list of objects.
func (e *Store) List(ctx context.Context, options *metainternalversion.ListOptions) (runtime.Object, error) {
    label := labels.Everything()
    if options != nil && options.LabelSelector != nil {
        label = options.LabelSelector // label filter, e.g. app=nginx
    }

    field := fields.Everything()
    if options != nil && options.FieldSelector != nil {
        field = options.FieldSelector // field filter, e.g. spec.nodeName=node1
    }

    out, err := e.ListPredicate(ctx, e.PredicateFunc(label, field), options) // fetch (List) the data and filter it (Predicate)
    if err != nil {
        return nil, err
    }
    if e.Decorator != nil {
        e.Decorator(out)
    }

    return out, nil
}

2.3 `ListPredicate()`

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L411

// Abridged excerpt; error interpretation is omitted.
func (e *Store) ListPredicate(ctx context.Context, p storage.SelectionPredicate, options *metainternalversion.ListOptions) (runtime.Object, error) {
    // Step 1: initialization
    if options == nil {
        options = &metainternalversion.ListOptions{ResourceVersion: ""}
    }

    p.Limit    = options.Limit
    p.Continue = options.Continue
    list      := e.NewListFunc()        // the result will be stored here
    storageOpts := storage.ListOptions{ // convert the API-side ListOptions into the storage-side ListOptions; field differences are described below
        ResourceVersion:      options.ResourceVersion,
        ResourceVersionMatch: options.ResourceVersionMatch,
        Predicate:            p,
        Recursive:            true,
    }

    // Step 2: if the request specifies metadata.name, fetch a single object; no need to filter the full data set
    if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
        if key, err := e.KeyFunc(ctx, name); err == nil { // the key of this object in etcd (unique, or it does not exist)
            storageOpts.Recursive = false
            err := e.Storage.GetList(ctx, key, storageOpts, list)
            return list, err
        }
        // else: reaching this point means no key could be derived from the context, so fall back to fetching everything and filtering
    }

    // Step 3: filter the full data set
    err := e.Storage.GetList(ctx, e.KeyRootFunc(ctx), storageOpts, list) // KeyRootFunc() returns the root key of this resource in etcd (i.e. the prefix, without the trailing /)
    return list, err
}

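In 1.24.0 both case 1 and case 2 call e.Storage.GetList(); earlier versions were slightly different:

- Case 1 called e.Storage.GetToList
- Case 2 called e.Storage.List

The basic flow is the same, though.

1. If the client passes no ListOption, a default one is initialized with ResourceVersion set to the empty string. This makes the apiserver pull data from etcd to answer the client instead of using the local cache (unless the local cache has not been built yet);

   For example, when the client lists ciliumendpoints with ListOption{Limit: 500, ResourceVersion: "0"}, the request sent is /apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0.
   How an empty-string ResourceVersion is handled is analyzed later.

2. The limit/continue fields of the filter (SelectionPredicate) are initialized from the corresponding fields in the list options;
3. The result is initialized: list := e.NewListFunc();
4. The API-side ListOption is converted into the storage-side ListOption; the field differences are described below.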

metainternalversion.ListOptions is the API-side struct. It contains:

```
 // staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/types.go

 // ListOptions is the query options to a standard REST list call.
 type ListOptions struct {
     metav1.TypeMeta

     LabelSelector labels.Selector // label filter, e.g. app=nginx
     FieldSelector fields.Selector // field filter, e.g. spec.nodeName=node1

     Watch bool
     AllowWatchBookmarks bool
     ResourceVersion string
     ResourceVersionMatch metav1.ResourceVersionMatch

     TimeoutSeconds *int64         // Timeout for the list/watch call.
     Limit int64
     Continue string               // a token returned by the server. return a 410 error if the token has expired.
 }

```

`storage.ListOptions` is the struct passed to the underlying storage; its fields differ somewhat:

```
 // staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

 // ListOptions provides the options that may be provided for storage list operations.
 type ListOptions struct {
     ResourceVersion string
     ResourceVersionMatch metav1.ResourceVersionMatch
     Predicate SelectionPredicate // Predicate provides the selection rules for the list operation.
     Recursive bool               // false: key is an exact match (single object); true: key is a prefix (full data set)
     ProgressNotify bool          // storage-originated bookmark, ignored for non-watch requests.
 }

```

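2.4 Request specifies a resource name: fetch a single object

Next, there are two cases depending on whether the request specifies metadata.name:

1. If it does, this is a query for a single object, because Name is unique, and processing moves to the single-object logic;
2. If it does not, the full data set has to be fetched and filtered in apiserver memory according to the conditions in SelectionPredicate, and the final result is returned to the client.

The code:

    // case 1: fetch a single object by metadata.name; no need to filter the full data set
    if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
        if key, err := e.KeyFunc(ctx, name); err == nil {
            err := e.Storage.GetList(ctx, key, storageOpts, list)
            return list, err
        }
        // else: reaching this point means no key could be derived from the context, so fall back to fetching everything and filtering
    }

e.Storage is an Interface: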

// staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

// Interface offers a common interface for object marshaling/unmarshaling operations and
// hides all the storage-related operations behind it.
type Interface interface {
    Create(ctx , key string, obj, out runtime.Object, ttl uint64) error
    Delete(ctx , key string, out runtime.Object, preconditions *Preconditions,...)
    Watch(ctx , key string, opts ListOptions) (watch.Interface, error)
    Get(ctx , key string, opts GetOptions, objPtr runtime.Object) error

    // unmarshall objects found at key into a *List api object (an object that satisfies runtime.IsList definition).
    // If 'opts.Recursive' is false, 'key' is used as an exact match; if is true, 'key' is used as a prefix.
    // The returned contents may be delayed, but it is guaranteed that they will
    // match 'opts.ResourceVersion' according 'opts.ResourceVersionMatch'.
    GetList(ctx , key string, opts ListOptions, listObj runtime.Object) error

    // ... (remaining methods omitted)
}

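e.Storage.GetList() ends up in the cacher code.

Whether fetching a single object or the full data set, the process is similar:

1. Prefer the apiserver local cache (the deciding factors include ResourceVersion, among others);
2. Go to etcd only as a last resort.

The logic for fetching a single object is relatively simple, so we skip it here. Next, let's look at the logic that lists the full data set and then filters it.

2.5 Request does not specify a resource name: fetch the full data set and filter

2.5.1 apiserver cache layer: `GetList()` processing logic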

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx context.Context, key string, opts storage.ListOptions, listObj runtime.Object) error {
    recursive := opts.Recursive
    resourceVersion := opts.ResourceVersion
    pred := opts.Predicate

    // Case 1: the ListOption requires reading from etcd
    if shouldDelegateList(opts) {
        return c.storage.GetList(ctx, key, opts, listObj) // c.storage points to etcd
    }

    // If resourceVersion is specified, serve it from cache.
    listRV := c.versioner.ParseResourceVersion(resourceVersion)

    // Case 2: the apiserver cache is not ready yet, so read from etcd
    if listRV == 0 && !c.ready.check() {
        return c.storage.GetList(ctx, key, opts, listObj)
    }

    // Case 3: the apiserver cache is healthy, read from it; returned objects are guaranteed to be no older than `listRV`
    listPtr := meta.GetItemsPtr(listObj)
    listVal := conversion.EnforcePtr(listPtr)
    filter  := filterWithAttrsFunction(key, pred) // the final filter

    objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // pre-select using an index (performance optimization)
    for _, obj := range objs {
        elem := obj.(*storeElement)
        if filter(elem.Key, elem.Labels, elem.Fields) {                         // the actual filtering
            listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem.Object).Elem()))
        }
    }

    // record the ResourceVersion that was read
    if c.versioner != nil {
        c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
    }
    return nil
}

2.5.2 Deciding whether data must be read from etcd: `shouldDelegateList()`

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L591

func shouldDelegateList(opts storage.ListOptions) bool {
    resourceVersion := opts.ResourceVersion
    pred            := opts.Predicate
    pagingEnabled   := utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking) // enabled by default
    hasContinuation := pagingEnabled && len(pred.Continue) > 0                          // Continue is a token
    hasLimit        := pagingEnabled && pred.Limit > 0 && resourceVersion != "0"        // hasLimit can only be true when resourceVersion != "0"

    // 1. If resourceVersion is not specified, pull data from the underlying storage (etcd);
    // 2. If there is a continuation, also pull data from the underlying storage;
    // 3. The limit is passed to the underlying storage (etcd) only when resourceVersion != "0", because the watch cache does not support continuation
    return resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact
}

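This is very important:

Q: If the client does not set the ResourceVersion field in ListOption{}, does that correspond to resourceVersion == "" here?

A: Yes, so the example in Section 1 causes the full data set to be pulled from etcd.

Q: If the client sets limit=500&resourceVersion=0, will that lead to hasContinuation==true next time?

A: No. resourceVersion=0 causes limit to be ignored (see the hasLimit line), which means that although limit=500 is specified, this request returns the full data set.

Q: What is ResourceVersionMatch for?

A: It tells the apiserver how to interpret ResourceVersion. The official documentation has a rather complicated table for it; take a look if you are interested.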
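The decision logic of shouldDelegateList() can be replayed in isolation against the example requests from Section 1. The following is a minimal sketch, not the apiserver code: `listOpts` and `delegatesToEtcd` are hypothetical names, the struct keeps only the fields the function reads, and the APIListChunking feature gate is assumed to be enabled.

```go
package main

import "fmt"

// listOpts keeps only the request properties that shouldDelegateList() looks at.
type listOpts struct {
	ResourceVersion string // "" means the client did not send the parameter
	Limit           int64
	Continue        string
	MatchExact      bool // stands in for ResourceVersionMatch == Exact
}

// delegatesToEtcd reports whether the request bypasses the apiserver cache.
// Assumption: paging (APIListChunking) is enabled.
func delegatesToEtcd(o listOpts) bool {
	hasContinuation := len(o.Continue) > 0
	hasLimit := o.Limit > 0 && o.ResourceVersion != "0"
	return o.ResourceVersion == "" || hasContinuation || hasLimit || o.MatchExact
}

func main() {
	examples := []struct {
		name string
		opts listOpts
	}{
		{"limit=500&resourceVersion=0", listOpts{ResourceVersion: "0", Limit: 500}},
		{"fieldSelector=... (no resourceVersion)", listOpts{}},
		{"fieldSelector=...&resourceVersion=0", listOpts{ResourceVersion: "0"}},
		{"limit=500 (no resourceVersion)", listOpts{Limit: 500}},
	}
	for _, e := range examples {
		fmt.Printf("%-40s etcd=%v\n", e.name, delegatesToEtcd(e.opts))
	}
}
```

Field and label selectors do not appear in the struct at all: whether a request carries a selector has no influence on where the data is read from.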

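Now let's return to the cacher's GetList() logic and look at the individual cases it handles.

2.5.3 Case 1: the ListOption requires reading from etcd

In this case the apiserver reads all objects directly from etcd, filters them, and returns the result to the client. This suits scenarios with extremely strict data consistency requirements. It is also easy to end up here by accident and put too much pressure on etcd, as in the example in Section 1.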

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L563

// GetList implements storage.Interface.
func (s *store) GetList(ctx context.Context, key string, opts storage.ListOptions, listObj runtime.Object) error {
    recursive       := opts.Recursive
    resourceVersion := opts.ResourceVersion
    pred            := opts.Predicate

    listPtr   := meta.GetItemsPtr(listObj)
    v         := conversion.EnforcePtr(listPtr)
    key        = path.Join(s.pathPrefix, key)
    keyPrefix := key // append '/' if needed

    newItemFunc := getNewItemFunc(listObj, v)

    var fromRV *uint64
    if len(resourceVersion) > 0 { // if RV is non-empty (it defaults to the empty string when the client does not pass it)
        parsedRV := s.versioner.ParseResourceVersion(resourceVersion)
        fromRV = &parsedRV
    }

    // handling of ResourceVersion, ResourceVersionMatch, etc. (also sets `paging`, `options`, `limitOption`)
    switch {
    case recursive && s.pagingEnabled && len(pred.Continue) > 0: ...
    case recursive && s.pagingEnabled && pred.Limit > 0        : ...
    default                                                    : ...
    }

    // loop until we have filled the requested limit from etcd or there are no more results
    for {
        getResp = s.client.KV.Get(ctx, key, options...) // pull data from etcd
        numFetched += len(getResp.Kvs)
        hasMore = getResp.More

        for _, kv := range getResp.Kvs {
            if limitOption != nil && int64(v.Len()) >= pred.Limit {
                hasMore = true
                break
            }

            lastKey = kv.Key
            data := s.transformer.TransformFromStorage(ctx, kv.Value, kv.Key)
            appendListItem(v, data, kv.ModRevision, pred, s.codec, s.versioner, newItemFunc) // filtering happens in here
            numEvald++
        }

        // no more results remain or we didn't request paging
        if !hasMore || !paging {
            break
        }
        // we're paging but we have filled our bucket
        if int64(v.Len()) >= pred.Limit {
            break
        }
        key = string(lastKey) + "\x00"
    }

    // instruct the client to begin querying from immediately after the last key we returned
    if hasMore {
        // we want to start immediately after the last key
        next := encodeContinue(string(lastKey)+"\x00", keyPrefix, returnedRV)
        return s.versioner.UpdateList(listObj, uint64(returnedRV), next, remainingItemCount)
    }

    // no continuation
    return s.versioner.UpdateList(listObj, uint64(returnedRV), "", nil)
}

client.KV.Get() enters the etcd client library; dig further if you are interested.
appendListItem() filters the fetched data; this is the in-memory filtering in the apiserver mentioned in Section 1.
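The shape of that loop, page through the store, filter each page in memory, continue right after the last key seen, can be reproduced with a sorted slice standing in for etcd. The following is a simplified sketch under stated assumptions: `kv`, `rangeFrom` and `listFiltered` are hypothetical names, the store is an in-memory slice sorted by key, and unlike the real loop it does not stop early once a result limit is filled.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a stored entry; Node plays the role of a field the store cannot filter on.
type kv struct{ Key, Node string }

// rangeFrom mimics a paginated range read: up to pageSize entries with
// Key >= start, plus a flag telling whether more entries remain.
// store must be sorted by Key.
func rangeFrom(store []kv, start string, pageSize int) (page []kv, more bool) {
	i := sort.Search(len(store), func(i int) bool { return store[i].Key >= start })
	end := i + pageSize
	if end >= len(store) {
		return store[i:], false
	}
	return store[i:end], true
}

// listFiltered reads the whole store page by page and filters in memory.
// It returns the matching keys and the number of entries fetched from the store.
func listFiltered(store []kv, node string, pageSize int) (matched []string, fetched int) {
	start := ""
	for {
		page, more := rangeFrom(store, start, pageSize)
		fetched += len(page)
		for _, e := range page {
			if e.Node == node { // the store cannot do this for us
				matched = append(matched, e.Key)
			}
		}
		if !more || len(page) == 0 {
			break
		}
		// continue immediately after the last key of this page
		start = page[len(page)-1].Key + "\x00"
	}
	return matched, fetched
}

func main() {
	var store []kv
	for i := 0; i < 10; i++ {
		store = append(store, kv{
			Key:  fmt.Sprintf("/registry/pods/default/pod-%02d", i),
			Node: fmt.Sprintf("node%d", i%3),
		})
	}
	matched, fetched := listFiltered(store, "node1", 4)
	fmt.Printf("matched %d, fetched %d\n", len(matched), fetched)
}
```

Appending "\x00" to the last key yields the smallest key that sorts strictly after it, so the next page starts exactly where the previous one ended without repeating an entry.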

2.5.4 Case 2: the local cache is not ready yet, so data must be read from etcd

The execution is the same as in case 1.

2.5.5 Case 3: use the local cache

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx context.Context, key string, opts storage.ListOptions, listObj runtime.Object) error {
    // Case 1: the ListOption requires reading from etcd
    ...
    // Case 2: the apiserver cache is not ready yet, so read from etcd
    ...
    // Case 3: the apiserver cache is healthy, read from it; returned objects are guaranteed to be no older than `listRV`
    listPtr := meta.GetItemsPtr(listObj) // List elements with at least 'listRV' from cache.
    listVal := conversion.EnforcePtr(listPtr)
    filter  := filterWithAttrsFunction(key, pred) // the final filter

    objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // pre-select using an index (performance optimization)
    for _, obj := range objs {
        elem := obj.(*storeElement)
        if filter(elem.Key, elem.Labels, elem.Fields) {                         // the actual filtering
            listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem.Object).Elem()))
        }
    }

    if c.versioner != nil {
        c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
    }
    return nil
}

3 LIST tests

To keep client libraries (e.g. client-go) from setting parameters for us automatically, we test directly with curl; specifying the certificates is enough:

$ cat curl-k8s-apiserver.sh
curl -s --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key --cacert /etc/kubernetes/pki/ca.crt "$@"

Usage:

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}]
}

3.1 With `limit=2`: the response carries pagination info (`continue`)

3.1.1 `curl` test

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}]
}

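As we can see,

1. two pods are indeed returned, in the items[] field;
2. metadata also carries a continue field. If the client sends this parameter with the next request, the apiserver continues with the remaining content, until it no longer returns continue.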
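On the client side, following continue is a simple loop. The sketch below shows that loop with the HTTP call abstracted away: `page`, `listAll` and `fakeFetch` are hypothetical names, and `fakeFetch` stands in for a real `GET ...?limit=2&continue=<token>` request by serving five items in pages of two, using a plain offset as the token (real tokens are opaque).

```go
package main

import (
	"fmt"
	"strconv"
)

// page is one LIST response: some items plus an optional continue token.
type page struct {
	Items    []string
	Continue string // empty means there are no more pages
}

// listAll calls fetch repeatedly, passing back the previous continue token,
// until a response without a token arrives. It also counts the requests made.
func listAll(fetch func(continueToken string) (page, error)) (items []string, requests int, err error) {
	token := ""
	for {
		p, err := fetch(token)
		if err != nil {
			return nil, requests, err
		}
		requests++
		items = append(items, p.Items...)
		if p.Continue == "" {
			return items, requests, nil
		}
		token = p.Continue
	}
}

// fakeFetch simulates a server holding five pods and a page size of two.
func fakeFetch(token string) (page, error) {
	data := []string{"pod1", "pod2", "pod3", "pod4", "pod5"}
	start := 0
	if token != "" {
		n, err := strconv.Atoi(token)
		if err != nil || n < 0 || n > len(data) {
			return page{}, fmt.Errorf("invalid continue token %q", token)
		}
		start = n
	}
	end := start + 2
	if end >= len(data) {
		return page{Items: data[start:]}, nil
	}
	return page{Items: data[start:end], Continue: strconv.Itoa(end)}, nil
}

func main() {
	items, requests, err := listAll(fakeFetch)
	fmt.Println(items, requests, err)
}
```

Keeping the transport behind a function argument makes the loop testable without a cluster; in real use the function would issue the HTTP request and decode metadata.continue from the response.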

3.1.2 `kubectl` test

Raising kubectl's log verbosity shows that it also uses continue behind the scenes to fetch all pods:

$ kubectl get pods --all-namespaces --v=10
# Everything below is log output, slightly adjusted
# curl -k -v -XGET  -H "User-Agent: kubectl/v1.xx" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json"
#   'http://localhost:8080/api/v1/pods?limit=500'
# GET http://localhost:8080/api/v1/pods?limit=500 200 OK in 202 milliseconds
# Response Body: {"kind":"Table","metadata":{"continue":"eyJ2Ijoib...","remainingItemCount":54},"columnDefinitions":[...],"rows":[...]}
# 
# curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.xx"
#   'http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500'
# GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500 200 OK in 44 milliseconds
# Response Body: {"kind":"Table","metadata":{"resourceVersion":"2122644698"},"columnDefinitions":[],"rows":[...]}

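The first request returned 500 pods, and the second request carried the returned continue: GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500. continue is a token and is rather long, so it is truncated here for readability.

3.2 With `limit=2&resourceVersion=0`: `limit=2` is ignored and the full data set is returned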

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2&resourceVersion=0"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}, ...]
}

items[] contains the full pod data.

3.3 `spec.nodeName=node1&resourceVersion=0` vs. `spec.nodeName=node1`

Same result

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...

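The results are the same, unless the apiserver cache and etcd are inconsistent, which is highly unlikely and not discussed here.

Very different speed

Measuring both cases with time shows that on larger clusters the response times of the two requests differ noticeably.

$ time ./curl-k8s-apiserver.sh <url> > result

For a cluster of 4K nodes and 100K pods, the following numbers are for reference:

- Without resourceVersion=0 (read etcd and filter in apiserver): 10s
- With resourceVersion=0 (read the apiserver cache): 0.05s

That is a 200x difference.

With the total size of all pods taken as 2GB, each pod averages 20KB.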
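Both figures follow directly from the measurements. The sketch below spells out the arithmetic, assuming decimal units (2GB = 2 x 10^9 bytes) and latencies expressed in milliseconds; `speedup` and `avgObjectSize` are hypothetical helpers.

```go
package main

import "fmt"

// speedup returns how many times longer the slow request takes than the fast one.
func speedup(slowMillis, fastMillis int64) int64 {
	return slowMillis / fastMillis
}

// avgObjectSize returns the mean size in bytes of one object.
func avgObjectSize(totalBytes, objects int64) int64 {
	return totalBytes / objects
}

func main() {
	// 10s via etcd vs. 0.05s (50ms) via the apiserver cache.
	fmt.Println("speedup:", speedup(10_000, 50))
	// 2GB of pod data spread over 100K pods.
	fmt.Println("bytes per pod:", avgObjectSize(2_000_000_000, 100_000))
}
```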

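4 Pressure of LIST requests on the control plane: quantitative analysis

This section uses cilium-agent as an example to show how to quantitatively measure the pressure it puts on the control plane at startup.

4.1 Collecting LIST requests

First, find out which k8s resources the agent LISTs at startup. There are several ways to collect this:

1. From the k8s access log, filtering by ServiceAccount, verb, request_uri, etc.;
2. From the agent logs;
3. Through further code analysis, and so on.

Suppose we collected the following LIST requests: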

api/v1/namespaces?resourceVersion=0
api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0
api/v1/nodes?fieldSelector=metadata.name%3Dnode1&resourceVersion=0
api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name
apis/discovery.k8s.io/v1beta1/endpointslices?resourceVersion=0
apis/networking.k8s.io/networkpolicies?resourceVersion=0
apis/cilium.io/v2/ciliumnodes?resourceVersion=0
apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=0
apis/cilium.io/v2/ciliumclusterwidenetworkpolicies?resourceVersion=0

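4.2 Measuring the data volume and latency of LIST requests

With the list of LIST requests, we can run them manually and obtain:

1. The request latency
2. The amount of data processed by the request, of two kinds:
   1. the amount of data processed by the apiserver (the full data set); the performance impact on apiserver/etcd should be assessed mainly on this
   2. the amount of data the agent finally receives (after filtering by selector)

The following script (placed on a k8s master in a real environment) runs one round of the test: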

$ cat benchmark-list-overheads.sh
apiserver_url="https://localhost:6443"

# List k8s core resources (e.g. pods, services)
# API: GET/LIST /api/v1/<resources>?<field/label selector>&resourceVersion=0
function benchmark_list_core_resource() {
    resource=$1
    selectors=$2

    echo "----------------------------------------------------"
    echo "Benchmarking list $2"
    listed_file="listed-$resource"
    url="$apiserver_url/api/v1/$resource?resourceVersion=0"

    # first perform a request without selectors, this is the size apiserver really handles
    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file

    # perform another request if selectors are provided, this is the size client receives
    listed_file2="$listed_file-filtered"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
        echo "curl $url"
        time ./curl-k8s-apiserver.sh "$url" > $listed_file2
    fi

    ls -ahl $listed_file $listed_file2 2>/dev/null

    echo "----------------------------------------------------"
    echo ""
}

# List k8s API-group resources, including CRDs (e.g. ciliumendpoints, ciliumnodes)
# API: GET/LIST /apis/<api group>/<resources>?<field/label selector>&resourceVersion=0
function benchmark_list_apiexternsion_resource() {
    api_group=$1
    resource=$2
    selectors=$3

    echo "----------------------------------------------------"
    echo "Benchmarking list $api_group/$resource"
    api_group_flatten_name=$(echo $api_group | sed 's/\//-/g')
    listed_file="listed-$api_group_flatten_name-$resource"
    url="$apiserver_url/apis/$api_group/$resource?resourceVersion=0"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
    fi

    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file
    ls -ahl $listed_file
    echo "----------------------------------------------------"
    echo ""
}

benchmark_list_core_resource "namespaces" ""
benchmark_list_core_resource "pods"       "fieldSelector=spec.nodeName%3Dnode1"
benchmark_list_core_resource "nodes"      "fieldSelector=metadata.name%3Dnode1"
benchmark_list_core_resource "services"   "labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name"

benchmark_list_apiexternsion_resource "discovery.k8s.io/v1beta1" "endpointslices"                   ""
benchmark_list_apiexternsion_resource "apiextensions.k8s.io/v1"  "customresourcedefinitions"        ""
benchmark_list_apiexternsion_resource "networking.k8s.io"        "networkpolicies"                  ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnodes"                      ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumendpoints"                  ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnetworkpolicies"            ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumclusterwidenetworkpolicies" ""

The output looks like this:

$ benchmark-list-overheads.sh
----------------------------------------------------
Benchmarking list
curl https://localhost:6443/api/v1/namespaces?resourceVersion=0

real    0m0.090s
user    0m0.038s
sys     0m0.044s
-rw-r--r-- 1 root root 69K listed-namespaces
----------------------------------------------------

Benchmarking list fieldSelector=spec.nodeName%3Dnode1
curl https://localhost:6443/api/v1/pods?resourceVersion=0

real    0m18.332s
user    0m1.355s
sys     0m1.822s
curl https://localhost:6443/api/v1/pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1

real    0m0.242s
user    0m0.044s
sys     0m0.188s
-rw-r--r-- 1 root root 2.0G listed-pods
-rw-r--r-- 1 root root 526K listed-pods-filtered
----------------------------------------------------

...

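Note: for any LIST with a selector, e.g. LIST pods?spec.nodeName=node1, the script first runs the request without the selector, in order to measure how much data the apiserver has to process. For list pods above:

1. What the agent actually executes is pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1, so the request latency should be taken from this one
2. pods?resourceVersion=0 is executed additionally, to measure how much data the apiserver has to process for request 1

Caution: an operation like list all pods produces a 2GB file, so use this benchmark tool with care. First understand what your script is measuring, and in particular do not automate it or run it concurrently, as that may overwhelm apiserver/etcd.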

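4.3 Analysis of the results

The output above carries the following key information:

1. The resource type being LISTed, e.g. pods/endpoints/services
2. The latency of the LIST operation
3. The amount of data involved in the LIST operation
   1. data the apiserver has to process (json format): for list pods above, this is the listed-pods file, 2GB in total;
   2. data received by the agent (the agent may have specified label/field filters): for list pods above, this is the listed-pods-filtered file, 526K in total

Collecting and sorting all LIST requests this way shows the pressure that one agent startup puts on apiserver/etcd.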

$ ls -ahl listed-*
-rw-r--r-- 1 root root  222 listed-apiextensions.k8s.io-v1-customeresourcedefinitions
-rw-r--r-- 1 root root 5.8M listed-apiextensions.k8s.io-v1-customresourcedefinitions
-rw-r--r-- 1 root root 2.0M listed-cilium.io-v2-ciliumclusterwidenetworkpolicies
-rw-r--r-- 1 root root 193M listed-cilium.io-v2-ciliumendpoints
-rw-r--r-- 1 root root  185 listed-cilium.io-v2-ciliumnetworkpolicies
-rw-r--r-- 1 root root 6.6M listed-cilium.io-v2-ciliumnodes
-rw-r--r-- 1 root root  42M listed-discovery.k8s.io-v1beta1-endpointslices
-rw-r--r-- 1 root root  69K listed-namespaces
-rw-r--r-- 1 root root  222 listed-networking.k8s.io-networkpolicies
-rw-r--r-- 1 root root  70M listed-nodes    # only used to assess the amount of data the apiserver has to process
-rw-r--r-- 1 root root  25K listed-nodes-filtered
-rw-r--r-- 1 root root 2.0G listed-pods     # only used to assess the amount of data the apiserver has to process
-rw-r--r-- 1 root root 526K listed-pods-filtered
-rw-r--r-- 1 root root  23M listed-services # only used to assess the amount of data the apiserver has to process
-rw-r--r-- 1 root root  23M listed-services-filtered

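Again taking cilium as an example, the ranking looks roughly like this (data processed by the apiserver, json format):

| LIST resource type | Data processed by apiserver (json) | Latency |
|---|---|---|
| CiliumEndpoints (full) | 193MB | 11s |
| CiliumNodes (full) | 70MB | 0.5s |
| … | … | … |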

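5 Large-scale infrastructure services: deployment and tuning recommendations

5.1 Set `ResourceVersion=0` on List requests by default

As described earlier, leaving this parameter unset makes the apiserver pull the full data set from etcd and then filter it, which

1. is slow
2. cannot be sustained by etcd at scale

So unless the requirements on data accuracy are so strict that data must be pulled from etcd, set ResourceVersion=0 on LIST requests so that the apiserver serves them from its cache.

If you use the client-go ListWatch/informer interfaces, they already set ResourceVersion=0 by default.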

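5.2 Prefer the namespaced API

If the resources to LIST are in a single namespace or a few namespaces, consider using the namespaced API:

- Namespaced API: /api/v1/namespaces/<ns>/pods?query=xxx
- Un-namespaced API: /api/v1/pods?query=xxx

5.3 Restart backoff

For infrastructure services deployed per node, such as kubelet, cilium-agent and daemonsets, an effective restart backoff is needed to reduce the pressure on the control plane during large-scale restarts.

For example, after all of them go down at once, the number of agents restarted per minute should not exceed 10% of the cluster size (configurable, or computed automatically).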
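One simple way to approximate such a limit without any coordination between nodes is to spread restarts over a fixed window and let each node derive its own delay from its name. The following is a minimal sketch of that idea, not a description of how kubelet or cilium-agent behave: `restartWindow` and `restartDelay` are hypothetical helpers, and the hash-based spreading only bounds the restart rate on average, not strictly.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// restartWindow returns the time a full-cluster restart must be spread over
// so that, on average, at most maxPercentPerMinute percent of the agents
// restart in any one minute. Values below 1 are treated as 1.
func restartWindow(maxPercentPerMinute int) time.Duration {
	if maxPercentPerMinute < 1 {
		maxPercentPerMinute = 1
	}
	return time.Minute * 100 / time.Duration(maxPercentPerMinute)
}

// restartDelay maps a node name to a stable delay in [0, window).
// Hashing spreads nodes roughly uniformly across the window,
// and the same node always gets the same delay.
func restartDelay(nodeName string, window time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(nodeName))
	return time.Duration(h.Sum64() % uint64(window))
}

func main() {
	window := restartWindow(10) // at most ~10% of agents per minute
	fmt.Println("window:", window)
	for _, node := range []string{"node1", "node2", "node3"} {
		fmt.Println(node, "waits", restartDelay(node, window).Round(time.Second))
	}
}
```

A deterministic delay keeps the behavior reproducible when debugging; a random jitter would serve the same purpose if reproducibility does not matter.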

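5.4 Prefer server-side filtering with label/field selectors

If you need to cache some resources and watch for changes, use the ListWatch mechanism: pull the data to the local side and let the business logic filter from the local cache as needed. This is client-go's ListWatch/informer mechanism.

But for a one-off LIST operation with filter conditions, such as the earlier example of filtering pods by nodename, you should clearly set a label or field selector and let the apiserver filter the data for us. LISTing 100K pods takes tens of seconds (most of it spent on data transfer, while also using a lot of apiserver CPU/BW/IO), whereas if only the pods on the local node are needed, with nodeName=node1 set the LIST may return in as little as 0.05s. Another very important point: do not forget to also include resourceVersion=0 in the request.

5.4.1 Label selector

Filtered in apiserver memory.

5.4.2 Field selector

Filtered in apiserver memory.

5.4.3 Namespace selector

In etcd the namespace is part of the key prefix, so resources can be filtered by specifying the namespace, which is much faster than selectors that are not part of the prefix.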

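5.5 Supporting infrastructure (monitoring, alerting, etc.)

The analysis above shows that a single client request may return only a few hundred KB of data, while the apiserver (and in the worse case, etcd) has to process GBs of data. Large-scale restarts of infrastructure services should therefore be avoided as far as possible, and monitoring and alerting need to be as complete as possible for that purpose.

5.5.1 Use dedicated ServiceAccounts

Each infrastructure service (e.g. kubelet, cilium-agent), as well as every operator that issues many LIST operations against the apiserver, should use its own SA. This lets the apiserver tell request sources apart, which is very useful for monitoring, troubleshooting and server-side rate limiting.

5.5.2 Liveness monitoring and alerting

Infrastructure services must be covered by liveness monitoring.

There must be P1-level liveness alerts so that large-scale failures are detected right away. Restart backoff then reduces the pressure on the control plane.

5.5.3 Monitoring and tuning etcd

Monitor and alert on the key performance-related metrics:

1. Memory
2. Bandwidth
3. Number of large LIST requests and their response latency

   For example, the following log entry for a LIST of all pods: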

 {
     "level":"warn",
     "msg":"apply request took too long",
     "took":"5357.87304ms",
     "expected-duration":"100ms",
     "prefix":"read-only range ",
     "request":"key:\"/registry/pods/\" range_end:\"/registry/pods0\" ",
     "response":"range_response_count:60077 size:602251227"
 }
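The key and range_end in this log entry describe a prefix scan: every key that starts with /registry/pods/ sorts before /registry/pods0, because '0' is the byte immediately after '/'. The sketch below shows how such an end key can be derived; `prefixRangeEnd` is a hypothetical standalone function written for illustration, which as far as I know follows the same rule the etcd client uses for prefix ranges.

```go
package main

import "fmt"

// prefixRangeEnd returns the smallest key that is greater than every key
// starting with prefix: the last byte that is not 0xff is incremented and
// everything after it is dropped.
func prefixRangeEnd(prefix string) string {
	end := []byte(prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return string(end[:i+1])
		}
	}
	// Empty prefix or all bytes are 0xff: there is no finite upper bound.
	// Assumption: "\x00" is used here as the "no upper bound" marker.
	return "\x00"
}

func main() {
	fmt.Println(prefixRangeEnd("/registry/pods/"))         // all pods
	fmt.Println(prefixRangeEnd("/registry/pods/default/")) // pods in one namespace
}
```

The second call shows why namespaced LISTs are cheaper on the etcd path: the range covers only the keys of one namespace instead of the whole resource type.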

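Deployment and configuration tuning:

1. Move K8s events to a separate etcd cluster
2. Others.

6 Miscellaneous

6.1 Get requests: `GetOptions{}`

The principle is the same as with ListOption{}: not setting ResourceVersion=0 makes the apiserver go to etcd for the data, which should be avoided where possible.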

LIST requests: source code analysis, performance evaluation, and deployment tuning for large-scale infrastructure services