《Site.Reliability.Engineering.2016.3》 SRE: Google Operations Demystified


Introduction

What is SRE?

  • The essence of SRE:

    • availability
    • latency
    • performance
    • efficiency
    • change management
    • monitoring
    • emergency response
    • capacity planning
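  • The difference between 100% and 99.9% availability:

Reaching 100% takes far more effort, yet users notice little difference between 99.9% and 100%, because many intermediaries sit between the service and the user (Wi-Fi, network conditions, and so on). Even if the service is 100% available, those intermediaries can mean the user experiences only 99.9%.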

Core tenets

Ensuring a durable focus on engineering

  • Goal: keep operational work under 50% of each SRE's time; when it exceeds that, SREs build automation to bring it back under 50%
  • Methods:
    • Shift work to the development team
    • Assign bugs and tickets to the development team

Pursuing maximum change velocity without violating the SLO

SLO: service level objective

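Defining availability

  • What level of availability will users be satisfied with?
  • What alternatives do users have when they are dissatisfied?
  • How do users' usage habits change at different availability levels?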

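Monitoring

Valid monitoring outputs:

  • alerts: a human must respond immediately
  • tickets: akin to a warning; no immediate response is needed and handling can be deferred
  • logging: information nobody needs to look at right away, recorded for later reference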

Emergency Response

Metrics:

  • MTTF: mean time to failure

  • MTTR: mean time to repair

Method: prepare playbooks for failures in advance
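The two metrics combine into a steady-state availability figure, availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration: the function name and the sample numbers are assumptions, not values from the book.

```python
def availability_from_mttf_mttr(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails on average every 999 hours, takes 1 hour to restore.
baseline = availability_from_mttf_mttr(999, 1)

# Halving MTTR (for example with a prepared playbook) raises availability
# without changing how often failures happen.
with_playbook = availability_from_mttf_mttr(999, 0.5)
```

Shortening recovery time is often the cheaper lever, which is why preparation for failures matters.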

Change Management

  • Implementing progressive rollouts
  • Quickly and accurately detecting problems
  • Rolling back changes safely when problems arise
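The three practices above can be sketched as a single loop: expand the release in stages, check a health signal at each stage, and roll back on the first bad reading. Everything here is hypothetical (the stage fractions, the `observe_error_rate` callback, and the threshold); a real rollout would push the release and wait for monitoring data at each step.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[float],
    observe_error_rate: Callable[[float], float],
    max_error_rate: float,
) -> Tuple[str, float]:
    """Walk through rollout stages; roll back at the first unhealthy stage.

    Returns ("completed", final_fraction) or ("rolled_back", failing_fraction).
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")
    for fraction in stages:
        # Stand-in for "deploy to this fraction of instances, then measure".
        error_rate = observe_error_rate(fraction)
        if error_rate > max_error_rate:
            return ("rolled_back", fraction)
    return ("completed", stages[-1])


# Simulated release whose errors only show up once 10% or more of traffic sees it.
def simulated_error_rate(fraction: float) -> float:
    return 0.001 if fraction < 0.10 else 0.05


result = progressive_rollout(
    [0.01, 0.10, 0.50, 1.0], simulated_error_rate, max_error_rate=0.01
)
healthy = progressive_rollout([0.01, 0.10, 1.0], lambda f: 0.001, max_error_rate=0.01)
```

In this simulation the problem is caught at the 10% stage, so 90% of users never see the bad release.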

Demand Forecasting and Capacity Planning

Capacity planning must take into account:

  • An accurate forecast of organic growth in demand
  • A forecast of demand from inorganic growth
  • Regular load testing to correlate raw capacity with service capacity

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.

Efficiency and Performance

SREs need to pay attention to efficiency and performance, which is closely tied to provisioning.

Google Environment

terminology

  • Machine: A piece of hardware (or perhaps a VM)
  • Server: A piece of software that implements a service
  • Racks: Tens of machines are placed in a rack.
  • Row: Racks stand in a row
  • Cluster: One or more rows form a cluster
  • Datacenter: A datacenter building houses multiple clusters
  • Campus: Multiple datacenter buildings that are located close together form a campus

Embracing Risk

Calculating availability

  • Time-based: availability = uptime / (uptime + downtime)
  • Request-based (for distributed services): availability = successful requests / total requests
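As a worked example of the two formulas above, here is a minimal sketch. The function names and the sample figures (a 30-day window, one million requests) are illustrative assumptions.

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("the measurement window must have positive length")
    return uptime_s / total


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests"""
    if total <= 0:
        raise ValueError("at least one request is required")
    return successful / total


# A 30-day window with 2,592 seconds (43.2 minutes) of downtime.
window_s = 30 * 24 * 3600
by_time = time_based_availability(window_s - 2_592, 2_592)

# One million requests, of which 500 failed.
by_requests = request_based_availability(999_500, 1_000_000)
```

Here `by_time` works out to 99.9% and `by_requests` to 99.95%. The request-based form still makes sense for a service that is always partly up, which is the usual case in a distributed system.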

Risk Tolerance of Consumer Services

  • Target level of availability
  • Types of failures
  • Cost
  • Other service metrics

Risk Tolerance of Infrastructure Services

  • Target level of availability
  • Types of failures
  • Cost

Forming Your Error Budget

  • Product Management defines an SLO, which sets an expectation of how much
    uptime the service should have per quarter
  • The actual uptime is measured by a neutral third party: our monitoring system.
  • The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
  • As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
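The budget arithmetic can be sketched as follows. This version assumes a request-based SLO rather than measured uptime, and the function names and numbers are hypothetical.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% (or an empty window) leaves no budget to measure")
    return 1.0 - failed_requests / allowed_failures


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """New releases may be pushed only while some budget remains."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)   # roughly 60% of the budget left
ok_to_push = can_release(0.999, 1_000_000, 400)
frozen = can_release(0.999, 1_000_000, 1_200)               # budget overspent
```

Expressing the budget as a fraction makes it easy to chart and to share between product and SRE teams.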

Service Level Objective

service level indicator in practice

  • Collecting Indicators(users care about):
    • User-facing serving systems: availability, latency, and throughput
    • Storage systems: latency, availability, and durability
    • Big data systems: data processing pipelines, throughput, end-to-end latency
    • All systems: correctness
    • Others: error rate
  • Aggregation
    • Using percentiles for indicators
  • Standardize Indicators
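To see why percentiles are preferred for aggregation, the sketch below compares the mean with the 50th and 99th percentiles on a skewed sample. It uses the nearest-rank definition of a percentile, and the latency data is made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# 100 requests: 95 fast ones and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

The mean (119 ms) describes no real request: the typical request took 20 ms and the slowest 5% took 2 seconds, which only the percentiles reveal.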

service level objective in practice

  • example:
    • lower bound ≤ SLI ≤ upper bound.
    • SLI ≤ target
  • Defining Objectives:
    • For maximum clarity, SLOs should specify how they’re measured and the conditions
      under which they’re valid.
    • eg:
      • 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
      • 90% of Get RPC calls will complete in less than 1 ms
      • 99% of Get RPC calls will complete in less than 10 ms
  • Choosing Targets:
    • Don’t pick a target based on current performance
    • Keep it simple
    • Avoid absolutes
    • Have as few SLOs as possible
    • Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
  • Control Measures:
    • Monitor and measure the system’s SLIs
    • Compare the SLIs to the SLOs, and decide whether or not action is needed
    • If action is needed, figure out what needs to happen in order to meet the target
    • Take that action
  • SLOs Set Expectations:
    • Keep a safety margin
    • Don’t overachieve
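The first two control-measure steps (measure the SLIs, compare them to the SLOs) can be sketched like this, using the three Get RPC latency objectives from the example above. The `LatencySLO` type and the sample measurement window are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    target_fraction: float  # e.g. 0.99 for "99% of calls"
    threshold_ms: float     # e.g. 100 for "in less than 100 ms"


def measure_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of calls that completed in less than the threshold."""
    if not latencies_ms:
        raise ValueError("no measurements in the window")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def slo_met(latencies_ms: Sequence[float], slo: LatencySLO) -> bool:
    return measure_sli(latencies_ms, slo.threshold_ms) >= slo.target_fraction


slos = [LatencySLO(0.90, 1), LatencySLO(0.99, 10), LatencySLO(0.99, 100)]

# Hypothetical window of 100 Get RPC latencies, in milliseconds.
window_ms = [0.5] * 92 + [5] * 6 + [50] * 2

report = [slo_met(window_ms, slo) for slo in slos]
```

In this window the 1 ms and 100 ms objectives hold, but only 98% of calls finish within 10 ms, so the middle objective is the one that needs action.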

service level agreements in practice

  • an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
  • SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
  • It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.

Eliminating Toil

Toil Defined

  • manual
  • repetitive
  • automatable
  • tactical
  • no enduring value
  • O(n) with service growth

calculating toil

  • Individual on-call time / length of one full rotation. With four engineers each on call for one week per rotation, the share of time spent on operations is 1/4 = 25%
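The same arithmetic as a small helper, so other team sizes can be checked. The function name and the `shift_weeks` parameter are illustrative assumptions.

```python
def on_call_fraction(team_size: int, shift_weeks: int = 1) -> float:
    """Share of each engineer's time spent on call in a simple rotation."""
    if team_size <= 0 or shift_weeks <= 0:
        raise ValueError("team_size and shift_weeks must be positive")
    rotation_weeks = team_size * shift_weeks  # time until the same person is on call again
    return shift_weeks / rotation_weeks


four_person_team = on_call_fraction(4)  # one week on call out of every four
two_person_team = on_call_fraction(2)   # already at the 50% cap on operational work
```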

What Qualifies as Engineering

  • Software engineering
  • Systems engineering: configuring and tuning the production environment. One-off work that removes repeated effort, such as initial setup and parameter tuning.
  • Toil: Work directly tied to running a service that is repetitive, manual, etc.
  • Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.

Why toil is bad

  • Career stagnation
  • Low morale
  • Creates confusion
  • Slows progress
  • Sets bad precedent
  • Promotes attrition
  • Causes breach of faith

Monitoring

Definitions

  • Monitoring
  • White-box monitoring
  • Black-box monitoring
  • Dashboard
  • Alert
  • Root cause
  • Node and machine
  • Push

Four Golden Signals

  • Latency
  • Traffic
  • Errors
  • Saturation

Worrying About Your Tail

  • Use a histogram of latencies instead of a mean (average) metric
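A minimal sketch of collecting latencies into histogram buckets rather than averaging them. The bucket boundaries (10, 30, 100, 300 ms) and the sample data are illustrative assumptions; each bucket covers lower <= latency < upper.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def latency_histogram(
    latencies_ms: Sequence[float], upper_bounds_ms: Sequence[float]
) -> Dict[str, int]:
    """Count requests per latency bucket; the last bucket catches everything slower."""
    bounds = sorted(upper_bounds_ms)
    counts = [0] * (len(bounds) + 1)  # final slot is the overflow bucket
    for latency in latencies_ms:
        counts[bisect_right(bounds, latency)] += 1

    labels = []
    lower = 0
    for upper in bounds:
        labels.append(f"{lower}-{upper} ms")
        lower = upper
    labels.append(f">={lower} ms")
    return dict(zip(labels, counts))


samples_ms = [4, 6, 8, 9, 12, 15, 25, 40, 250, 900]
histogram = latency_histogram(samples_ms, [10, 30, 100, 300])
mean_ms = sum(samples_ms) / len(samples_ms)
```

The mean here is 126.9 ms even though eight of the ten requests finished in under 100 ms; the histogram keeps the slow tail visible instead of smearing it into one number.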

Choosing an Appropriate Resolution for Measurements

  • Collect
  • Set the granularity and sample
  • Aggregate

Principles

  • Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
  • Extra code to detect and expose possible causes
  • Associated dashboards for each of these possible causes
  • The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
  • Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
  • Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
  • Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
  • Every page should be actionable.
  • Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
  • Pages should be about a novel problem or an event that hasn’t been seen before.

Short-term workarounds

  • Adjust some thresholds
  • Use a temporary fix as a stopgap

Effective Troubleshooting

The best approach: know how the system is designed and built (not necessarily in great detail), then troubleshoot by working through the model.

model

  1. Problem Report
  2. Triage
  3. Examine
  4. Diagnose
  5. Test/Treat -loop-> 2/3
  6. Cure

Problem Report

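A report should include the following:

  • expected behavior
  • actual behavior
  • optional: how to reproduce this behavior.

Supporting tools:

  • An alerting platform that shows the information associated with each alert; ideally that information alone is enough to locate the cause and fix it.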

Triage

  • Incident severity: assess it calmly
  • Stopping the damage takes priority over finding the root cause

Examine

  • Monitoring system: watch the relevant metrics
  • logging:
    • Log levels
    • Sampling
    • Log query platform: supports searching logs with a query language

Diagnose

  • Simplify and reduce
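    • Black-box testing
      • Positive tests
      • Negative tests
    • Divide and conquer
      • Split in two: for example by partition or by region
      • Split by layer
  • Ask "what", "where" and "why": recursively work backward to the cause
  • Event records (what changed recently):
    • Configuration changes
    • Code releases
    • System configuration changes
    • Node changes
    • Other
  • Specialized systems: diagnostic tools built for specific services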
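The divide-and-conquer idea can be sketched as a bisection over a linear chain of components. This assumes the fault is monotonic (once one stage emits bad output, every downstream stage does too) and that the last stage is known to be bad. The stage names and the `output_is_good` probe are hypothetical.

```python
from typing import Callable, List, Sequence


def first_bad_stage(stages: Sequence[str], output_is_good: Callable[[str], bool]) -> str:
    """Binary-search a linear pipeline for the first stage whose output is bad."""
    if not stages:
        raise ValueError("the pipeline has no stages")
    lo, hi = 0, len(stages) - 1  # invariant: the first bad stage lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if output_is_good(stages[mid]):
            lo = mid + 1  # fault is further downstream
        else:
            hi = mid      # fault is here or further upstream
    return stages[lo]


pipeline = ["frontend", "auth", "cache", "backend", "storage-client"]
probed: List[str] = []


def probe(stage: str) -> bool:
    """Simulated check: output is corrupted from the cache stage onward."""
    probed.append(stage)
    return pipeline.index(stage) < pipeline.index("cache")


culprit = first_bad_stage(pipeline, probe)
```

Two probes are enough to isolate the faulty stage among five, and the saving grows with the length of the chain.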

Test And Treat

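  • List several possible causes
  • Design tests
    • Start with the tests that are easiest to run
    • Tests should be mutually exclusive
    • Test results can be misleading.
    • Earlier and later tests may affect each other, for example when an earlier test raises the load
    • Some tests are hard to carry out; avoid them where possible.
  • Summary:
    • Be clear about what you are testing, which tests you ran, and what the results were
    • When the tests are complex and numerous, document them as you go to avoid repeating steps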

Negative Results Are Magic

  • Negative results must not be ignored
  • Negative results are crucial
  • The tools and methods used in an experiment will be useful in future work
  • Publishing negative results helps the whole industry

Cure

  • Confirm the cause
  • Write the incident report
  • Fix the problem

Make Troubleshooting Easier

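Key principles:

  • Observable services: expose useful metrics and logs; this has to be considered when the service is designed
  • Well-designed, easy-to-understand interfaces between components
  • A good end-to-end tracing system: makes it easy to follow a request upstream and downstream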