Knative Serving: A Worked Example of Platform and Application Monitoring

I'm LEE, also known as Lao Li, a technical veteran who has spent 16 years in the trenches of the IT industry.

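Background

Our Knative application management and release platform recently went live. With the tooling in place, monitoring and alerting is the next critical piece; once that exists, application-level alerting follows naturally.

After working through the monitoring and alerting documentation for the official Knative Serving module, I found the official solution extremely cumbersome. Perhaps the starting point is different: the docs lean toward standing up a brand-new monitoring system. But with Kubernetes as widespread as it is today, is there really a cluster that doesn't already run Prometheus/Thanos? I wanted a simpler way to solve the monitoring problem rather than a complicated one.

As an aside, the Grafana dashboards that Knative provides are also hard to use and don't really fit real-world needs.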

Tooling

Tip: we replaced Thanos with VictoriaMetrics. Under heavy query and write volumes Thanos did not perform well for us, so we ended up on VictoriaMetrics.

The versions on our platform:

  • Kubernetes: 1.23
  • Istio: 1.13
  • Knative: 1.5
  • Grafana: 8.3.3
  • VictoriaMetrics: 1.79

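Hands-on

Since the plan is to monitor the Knative Serving control plane our own way, the official Knative documentation is of little use as a reference from here on.

Monitoring the control plane

A basic Knative installation consists of the following components: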

NAME                                    READY   STATUS    RESTARTS   AGE
activator-58b96bdb7d-nf6hf              1/1     Running   0          30d
autoscaler-75c4975cd8-bg2nt             1/1     Running   0          30d
controller-66475c8469-d5w2h             1/1     Running   0          30d
domain-mapping-68768c5ddc-999ng         1/1     Running   0          30d
domainmapping-webhook-d4bbcb544-bjtfz   1/1     Running   0          30d
net-istio-controller-689d984c59-4vtdx   1/1     Running   0          27d
net-istio-webhook-74f9465d86-jtj72      1/1     Running   0          27d
webhook-996d56c7-ms6js                  1/1     Running   0          30d

We can tailor a metrics scraping setup to these components. Before scraping, though, it's worth a quick look at how their Deployments are configured.

Take activator as an example:

apiVersion: apps/v1
kind: Deployment
metadata:
    annotations:
        deployment.kubernetes.io/revision: "1"
    labels:
        app.kubernetes.io/component: activator
        app.kubernetes.io/name: knative-serving
        app.kubernetes.io/version: 1.5.0
    name: activator
    namespace: knative-serving
spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
        matchLabels:
            app: activator
            role: activator
    strategy:
        rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
        type: RollingUpdate
    template:
        metadata:
            annotations:
                cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
            creationTimestamp: null
            labels:
                app: activator
                app.kubernetes.io/component: activator
                app.kubernetes.io/name: knative-serving
                app.kubernetes.io/version: 1.5.0
                role: activator
        spec:
            containers:
                - env:
                      - name: GOGC
                        value: "500"
                      - name: POD_NAME
                        valueFrom:
                            fieldRef:
                                apiVersion: v1
                                fieldPath: metadata.name
                      - name: POD_IP
                        valueFrom:
                            fieldRef:
                                apiVersion: v1
                                fieldPath: status.podIP
                      - name: SYSTEM_NAMESPACE
                        valueFrom:
                            fieldRef:
                                apiVersion: v1
                                fieldPath: metadata.namespace
                      - name: CONFIG_LOGGING_NAME
                        value: config-logging
                      - name: CONFIG_OBSERVABILITY_NAME
                        value: config-observability
                      - name: METRICS_DOMAIN
                        value: knative.dev/internal/serving
                  image: knative-serving/activator:1.5.0
                  imagePullPolicy: IfNotPresent
                  livenessProbe:
                      failureThreshold: 12
                      httpGet:
                          httpHeaders:
                              - name: k-kubelet-probe
                                value: activator
                          path: /
                          port: 8012
                          scheme: HTTP
                      initialDelaySeconds: 15
                      periodSeconds: 10
                      successThreshold: 1
                      timeoutSeconds: 1
                  name: activator
                  ports:
                      - containerPort: 9090
                        name: metrics ## This is it: port 9090 is exposed as the metrics endpoint
                        protocol: TCP
                      - containerPort: 8008
                        name: profiling
                        protocol: TCP
                      - containerPort: 8012
                        name: http1
                        protocol: TCP
                      - containerPort: 8013
                        name: h2c
                        protocol: TCP
                  readinessProbe:
                      failureThreshold: 5
                      httpGet:
                          httpHeaders:
                              - name: k-kubelet-probe
                                value: activator
                          path: /
                          port: 8012
                          scheme: HTTP
                      periodSeconds: 5
                      successThreshold: 1
                      timeoutSeconds: 1
                  resources:
                      limits:
                          cpu: "1"
                          memory: 600Mi
                      requests:
                          cpu: 300m
                          memory: 60Mi
                  securityContext:
                      allowPrivilegeEscalation: false
                      capabilities:
                          drop:
                              - all
                      readOnlyRootFilesystem: true
                      runAsNonRoot: true
                  terminationMessagePath: /dev/termination-log
                  terminationMessagePolicy: File
            dnsPolicy: ClusterFirst
            restartPolicy: Always
            schedulerName: default-scheduler
            securityContext: {}
            serviceAccount: controller
            serviceAccountName: controller
            terminationGracePeriodSeconds: 600

Reading the activator Deployment shows that port 9090 (named metrics; we'll need that name later) is where metrics are exposed. We then surveyed the metrics ports of the other Knative Serving components and compiled the following table:

Component | Port | Port name | Description
activator | 9090 | metrics | Request buffer and a key traffic-forwarding component of Knative. Buffers HTTP requests while an application scales from 0 to 1 or from 1 to 0.
autoscaler | 9090 | metrics | Scaling controller; the component that governs the number of application Pod replicas. Decides how many Pods to run based on data reported by queue-proxy and activator.
controller | 9090 | metrics | Controller; reconciles all public Knative objects and the autoscaling CRDs. When a user applies a Knative Service to the Kubernetes API, it creates the Configuration and Route.
webhook | 9090 | metrics | Webhook; the key link between the Knative control plane and Kubernetes. Intercepts all Kubernetes API calls and all CRD inserts and updates; sets defaults, rejects inconsistent or invalid objects, and validates and mutates Kubernetes API calls.
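
The port survey behind the table above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used: the helper name named_ports is made up, and the dictionary is a hand-trimmed copy of the activator Deployment shown earlier, reduced to its ports.

```python
def named_ports(deployment: dict) -> dict:
    """Return {port_name: container_port} for every container in a Deployment."""
    ports = {}
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for port in container.get("ports", []):
            # Unnamed ports cannot be referenced by name in a scrape job, so skip them.
            if "name" in port:
                ports[port["name"]] = port["containerPort"]
    return ports


# Trimmed-down copy of the activator Deployment shown earlier (ports only).
activator = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "activator",
        "ports": [
            {"containerPort": 9090, "name": "metrics"},
            {"containerPort": 8008, "name": "profiling"},
            {"containerPort": 8012, "name": "http1"},
            {"containerPort": 8013, "name": "h2c"},
        ],
    }]}}}
}

# Look up the port that a scrape job would reference by name.
print(named_ports(activator)["metrics"])
```

In practice the dictionary would come from the cluster (for example from the JSON output of kubectl), and the same lookup would be repeated for autoscaler, controller and webhook.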

The four components in the table above are the ones with a real impact on workloads. With that settled, writing the scrape jobs is straightforward. Using VictoriaMetrics as the example:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: controller-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        matchNames:
            - knative-serving
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: metrics # the name of listening port 9090 mentioned above
    selector:
        matchLabels:
            app: controller
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: autoscaler-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        matchNames:
            - knative-serving
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: metrics # the name of listening port 9090 mentioned above
    selector:
        matchLabels:
            app: autoscaler
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: activator-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        matchNames:
            - knative-serving
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: metrics # the name of listening port 9090 mentioned above
    selector:
        matchLabels:
            app: activator
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: webhook-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        matchNames:
            - knative-serving
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: metrics # the name of listening port 9090 mentioned above
    selector:
        matchLabels:
            app: webhook

I wrote four VMPodScrape jobs to collect metrics from the control-plane Pods. The data is collected into VictoriaMetrics automatically, ready for building Grafana dashboards later.
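
The four manifests above differ only in the component name, so they could just as well be generated. The following is a minimal sketch, assuming plain string templating is good enough; render_scrapes and COMPONENTS are illustrative names, and the template simply repeats the fields already used in the manifests above.

```python
COMPONENTS = ["controller", "autoscaler", "activator", "webhook"]

# Same fields as the hand-written manifests; only {app} varies.
TEMPLATE = """\
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: {app}-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        matchNames:
            - knative-serving
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: metrics
    selector:
        matchLabels:
            app: {app}
"""


def render_scrapes(components):
    """Render one VMPodScrape document per component, joined by YAML separators."""
    return "---\n".join(TEMPLATE.format(app=app) for app in components)


manifests = render_scrapes(COMPONENTS)
print(manifests)
```

The output can be saved to a file and applied with kubectl like any other multi-document YAML. Whether generating the manifests beats keeping four short files by hand is a matter of taste; with only four components either approach works.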

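Monitoring deployed applications

Following the same pattern, before scraping metrics from deployed applications it's worth a quick look at the workload's configuration. Here is a Pod from test-app-18 as an example: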

apiVersion: v1
kind: Pod
metadata:
    annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/initial-scale: "1"
        autoscaling.knative.dev/max-scale: "6"
        autoscaling.knative.dev/metric: rps
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/target: "60"
        kubernetes.io/limit-ranger: "LimitRanger plugin set: ephemeral-storage request
            for container app; ephemeral-storage limit for container app; ephemeral-storage
            request for container queue-proxy; ephemeral-storage limit for container queue-proxy"
        serving.knative.dev/creator: system:serviceaccount:default:oms-admin
    creationTimestamp: "2022-08-04T07:05:03Z"
    generateName: test-app-18-ac403-deployment-988b7b66f-
    labels:
        k_type: knative # Important: this label tells us whether the Pod belongs to a Knative application or is an ordinary Pod
        app: test-app-18
        app_id: test-app-18
        pod-template-hash: 988b7b66f
        service.istio.io/canonical-name: test-app-18
        service.istio.io/canonical-revision: test-app-18-ac403
        serving.knative.dev/configuration: test-app-18
        serving.knative.dev/configurationGeneration: "4"
        serving.knative.dev/configurationUID: d896cd40-ce9c-4027-9229-4af9f2aa5630
        serving.knative.dev/revision: test-app-18-ac403
        serving.knative.dev/revisionUID: 1b3dc38f-5aed-4252-a07b-aefc32f7f9f9
        serving.knative.dev/service: test-app-18
        serving.knative.dev/serviceUID: 0af741d0-a74f-44dd-ab6e-458a5d3743a2
    name: test-app-18-ac403-deployment-988b7b66f-tlw27
    namespace: knative-apps
    ownerReferences:
        - apiVersion: apps/v1
          blockOwnerDeletion: true
          controller: true
          kind: ReplicaSet
          name: test-app-18-ac403-deployment-988b7b66f
          uid: 429f5cc4-20e6-4f85-a985-4da1de578844
    resourceVersion: "755594194"
    uid: c632adba-5c66-4c0b-ac31-67c8c231b591
spec:
    containers:
        - env:
              - name: PORT
                value: "8080"
              - name: K_REVISION
                value: test-app-18-ac403
              - name: K_CONFIGURATION
                value: test-app-18
              - name: K_SERVICE
                value: test-app-18
          image: knative-apps/fn_test-app-18_qa@sha256:e86ed5117e91b4d11f9e169526d734981deb31c99744d65cb6a6debf9262d97f
          imagePullPolicy: IfNotPresent
          lifecycle:
              preStop:
                  httpGet:
                      path: /wait-for-drain
                      port: 8022
                      scheme: HTTP
          livenessProbe:
              failureThreshold: 3
              httpGet:
                  httpHeaders:
                      - name: K-Kubelet-Probe
                        value: queue
                  path: /ping
                  port: 8080
                  scheme: HTTP
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 1
          name: app
          ports:
              - containerPort: 8080
                name: user-port
                protocol: TCP
          resources:
              limits:
                  cpu: "2"
                  ephemeral-storage: 7Gi
                  memory: 4Gi
              requests:
                  cpu: 200m
                  ephemeral-storage: 256Mi
                  memory: 409Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
              - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                name: kube-api-access-jp8zk
                readOnly: true
        - env:
              - name: SERVING_NAMESPACE
                value: knative-apps
              - name: SERVING_SERVICE
                value: test-app-18
              - name: SERVING_CONFIGURATION
                value: test-app-18
              - name: SERVING_REVISION
                value: test-app-18-ac403
              - name: QUEUE_SERVING_PORT
                value: "8012"
              - name: QUEUE_SERVING_TLS_PORT
                value: "8112"
              - name: CONTAINER_CONCURRENCY
                value: "0"
              - name: REVISION_TIMEOUT_SECONDS
                value: "10"
              - name: SERVING_POD
                valueFrom:
                    fieldRef:
                        apiVersion: v1
                        fieldPath: metadata.name
              - name: SERVING_POD_IP
                valueFrom:
                    fieldRef:
                        apiVersion: v1
                        fieldPath: status.podIP
              - name: SERVING_LOGGING_CONFIG
              - name: SERVING_LOGGING_LEVEL
              - name: SERVING_REQUEST_LOG_TEMPLATE
                value: '{"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl":
                    "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}",
                    "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent":
                    "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp":
                    "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s",
                    "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}'
              - name: SERVING_ENABLE_REQUEST_LOG
                value: "false"
              - name: SERVING_REQUEST_METRICS_BACKEND
                value: prometheus
              - name: TRACING_CONFIG_BACKEND
                value: none
              - name: TRACING_CONFIG_ZIPKIN_ENDPOINT
              - name: TRACING_CONFIG_DEBUG
                value: "false"
              - name: TRACING_CONFIG_SAMPLE_RATE
                value: "0.1"
              - name: USER_PORT
                value: "8080"
              - name: SYSTEM_NAMESPACE
                value: knative-serving
              - name: METRICS_DOMAIN
                value: knative.dev/internal/serving
              - name: SERVING_READINESS_PROBE
                value: '{"httpGet":{"path":"/ping","port":8080,"host":"127.0.0.1","scheme":"HTTP","httpHeaders":[{"name":"K-Kubelet-Probe","value":"queue"}]},"successThreshold":1}'
              - name: ENABLE_PROFILING
                value: "false"
              - name: SERVING_ENABLE_PROBE_REQUEST_LOG
                value: "false"
              - name: METRICS_COLLECTOR_ADDRESS
              - name: CONCURRENCY_STATE_ENDPOINT
              - name: CONCURRENCY_STATE_TOKEN_PATH
                value: /var/run/secrets/tokens/state-token
              - name: HOST_IP
                valueFrom:
                    fieldRef:
                        apiVersion: v1
                        fieldPath: status.hostIP
              - name: ENABLE_HTTP2_AUTO_DETECTION
                value: "false"
          image: knative-serving/queue:1.5.0
          imagePullPolicy: IfNotPresent
          name: queue-proxy
          ports:
              - containerPort: 8022
                name: http-queueadm
                protocol: TCP
              - containerPort: 9090
                name: http-autometric
                protocol: TCP
              - containerPort: 9091
                name: http-usermetric # This is it: port 9091 is exposed as the metrics endpoint. All application traffic is forwarded through queue-proxy, so this is the best place to measure it.
                protocol: TCP
              - containerPort: 8012
                name: queue-port
                protocol: TCP
              - containerPort: 8112
                name: https-port
                protocol: TCP
          readinessProbe:
              failureThreshold: 3
              httpGet:
                  httpHeaders:
                      - name: K-Network-Probe
                        value: queue
                  path: /
                  port: 8012
                  scheme: HTTP
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 1
          resources:
              limits:
                  ephemeral-storage: 7Gi
              requests:
                  cpu: 25m
                  ephemeral-storage: 256Mi
          securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                  drop:
                      - all
              readOnlyRootFilesystem: true
              runAsNonRoot: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
              - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                name: kube-api-access-jp8zk
                readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: false
    imagePullSecrets:
        - name: key.key
    nodeName: 10.11.96.79
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 10
    tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 120
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 120
    volumes:
        - name: kube-api-access-jp8zk
          projected:
              defaultMode: 420
              sources:
                  - serviceAccountToken:
                        expirationSeconds: 3607
                        path: token
                  - configMap:
                        items:
                            - key: ca.crt
                              path: ca.crt
                        name: kube-root-ca.crt
                  - downwardAPI:
                        items:
                            - fieldRef:
                                  apiVersion: v1
                                  fieldPath: metadata.namespace
                              path: namespace

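Reading the test-app-18 Pod spec shows that port 9091 (named http-usermetric; we'll need that name later) is where metrics are exposed. In the same way, I wrote a generic job that scrapes traffic metrics from Knative application Pods in any Namespace. (One challenge here: the application's Namespace is not known in advance, so the job has to match all Namespaces.)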

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
    name: custom-apps-monitor
    namespace: knative-serving
spec:
    namespaceSelector:
        any: true # match any Namespace
    podMetricsEndpoints:
        - path: /metrics
          scheme: http
          targetPort: http-usermetric
    selector:
        matchLabels:
            k_type: knative # match the label on the actual Pods that marks the application type

With this generic VMPodScrape job collecting metrics from the application Pods, the data lands in VictoriaMetrics automatically, ready for building Grafana dashboards later.
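
To make the selection rule concrete, here is a small Python model of what the job above selects: any Namespace, filtered only by the k_type label. It is a sketch of the matching semantics, not of how the operator is implemented. The function matches is an illustrative name, and the two team-b Pods are made up; only the test-app-18 Pod comes from the listing above.

```python
def matches(pod: dict, match_labels: dict, namespaces=None) -> bool:
    """Mimic a matchLabels selector plus an optional Namespace allow-list.

    namespaces=None plays the role of `namespaceSelector.any: true`.
    """
    if namespaces is not None and pod["namespace"] not in namespaces:
        return False
    labels = pod.get("labels", {})
    # matchLabels semantics: every listed key must be present with an equal value.
    return all(labels.get(key) == value for key, value in match_labels.items())


pods = [
    {"name": "test-app-18-ac403-deployment-988b7b66f-tlw27",
     "namespace": "knative-apps",
     "labels": {"k_type": "knative", "app": "test-app-18"}},
    # Hypothetical Pods, made up for illustration.
    {"name": "demo-fn-abc", "namespace": "team-b",
     "labels": {"k_type": "knative"}},
    {"name": "plain-nginx", "namespace": "team-b",
     "labels": {"app": "nginx"}},
]

# Any Namespace, label k_type=knative: the ordinary nginx Pod is left out.
selected = [pod["name"] for pod in pods if matches(pod, {"k_type": "knative"})]
print(selected)
```

The point of the design is that the label, not the Namespace, carries the "this is a Knative application" signal, so new application Namespaces are picked up without touching the scrape job.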

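Final result

After hooking up Grafana, I didn't use the Knative community templates either; many of their panels turned out not to be useful. In the end I decided to build a custom dashboard that is more meaningful.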

Knative-monitoring Dashboard