[TOC]
introduction
what is SRE?
The essence of SRE:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning
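The difference between 100% and 99.9% availability:
Reaching 100% takes far more effort, yet users can barely tell 99.9% from 100%, because many intermediaries (Wi-Fi, network conditions, and so on) sit between the service and the user. Even if the service itself is 100% available, those intermediaries may mean the user only experiences 99.9%.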
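A rough worked example of the point above. It assumes the service and the user's network path fail independently, so their availabilities multiply; the 99.9% network figure is illustrative, not a measurement.

```python
def perceived_availability(service: float, path: float) -> float:
    """Availability seen by the user when the service and the network path
    fail independently (an assumption made for this illustration)."""
    return service * path


# Hypothetical figure: the user's Wi-Fi and ISP are up 99.9% of the time.
NETWORK_PATH = 0.999

for service in (0.999, 0.9999, 1.0):
    seen = perceived_availability(service, NETWORK_PATH)
    print(f"service {service:.2%} -> user sees {seen:.2%}")
```

Under this assumption the user never sees more than the network path allows, however many nines the service adds.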
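Core tenets
Ensuring a durable focus on engineering
- Goal: cap operational work at 50% of each SRE's time; when it exceeds that, the SREs build automation to bring it back under 50%
- Methods:
- Shift operational work to the development team
- Assign bugs and tickets to the development team
Pursuing maximum change velocity without violating the SLO
SLO: service level objective
Defining availability
- What level of availability will users be happy with?
- What alternatives are available to users who are dissatisfied?
- How do users use the product at different availability levels?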
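Monitoring
Valid kinds of monitoring output:
- alerts: a human must respond immediately
- tickets: comparable to a warning; no immediate response is needed and handling can be deferred
- logging: information nobody needs to watch; recorded so it can be consulted later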
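The three outputs can be read as a small decision rule. The sketch below is one possible encoding; the names `Output` and `route` and the two boolean inputs are assumptions for illustration, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not right away
    LOG = "log"        # nobody needs to look; kept for later analysis


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Map a monitoring event to one of the three outputs."""
    if needs_human_action and urgent:
        return Output.ALERT
    if needs_human_action:
        return Output.TICKET
    return Output.LOG


print(route(needs_human_action=True, urgent=True))
print(route(needs_human_action=True, urgent=False))
print(route(needs_human_action=False, urgent=False))
```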
Emergency Response
Metrics:
MTTF: mean time to failure
MTTR: mean time to restoration
Method: prepare failure playbooks in advance
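The two metrics combine into the standard steady-state availability formula, MTTF / (MTTF + MTTR). A minimal sketch follows; the hour figures are hypothetical and only show why shortening restoration time (for example with a prepared playbook) raises availability.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and
    mean time to restoration."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails about once every 30 days (720 hours).
baseline = availability(mttf_hours=720, mttr_hours=2.0)
with_playbook = availability(mttf_hours=720, mttr_hours=0.5)
print(f"{baseline:.4%} -> {with_playbook:.4%}")
```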
Change Management
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
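The three practices above fit together as a loop: deploy to a small fraction, check health, and roll back on the first bad signal. The sketch below simulates that loop; the callbacks, stage fractions, and error rates are all hypothetical stand-ins for a real deployment system.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy stage by stage; roll back on the first failed health check."""
    for fraction in stages:
        deploy(fraction)
        if not healthy():
            rollback()
            return False
    return True


# Simulated environment; every number here is hypothetical.
ERROR_RATE_AT = {0.01: 0.001, 0.10: 0.002, 0.50: 0.080, 1.00: 0.080}
MAX_ERROR_RATE = 0.01
state = {"fraction": 0.0}
log: List[str] = []


def deploy(fraction: float) -> None:
    state["fraction"] = fraction
    log.append(f"deploy {fraction:.0%}")


def healthy() -> bool:
    return ERROR_RATE_AT[state["fraction"]] < MAX_ERROR_RATE


def rollback() -> None:
    state["fraction"] = 0.0
    log.append("rollback")


ok = progressive_rollout([0.01, 0.10, 0.50, 1.00], deploy, healthy, rollback)
print(ok, log)
```

In this simulation the problem only shows up at 50% of traffic, so the rollout stops there and never reaches 100%.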
Demand Forecasting and Capacity Planning
Things capacity planning must account for:
- An accurate forecast of organic demand growth
- Forecasts of demand tied to inorganic growth
- Periodic load testing to correlate raw capacity with service capacity
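A minimal sketch of how the three inputs could be combined. It assumes compound monthly organic growth, a one-off inorganic bump, and a per-machine capacity obtained from load testing; the function names and all numbers are hypothetical.

```python
import math


def forecast_demand(current_qps: float, monthly_growth: float,
                    months: int, inorganic_qps: float = 0.0) -> float:
    """Organic demand grows at a compound monthly rate; inorganic demand
    (a launch, a marketing campaign) is added on top."""
    return current_qps * (1 + monthly_growth) ** months + inorganic_qps


def machines_needed(demand_qps: float, qps_per_machine: float,
                    headroom: float = 0.0) -> int:
    """qps_per_machine is assumed to come from a periodic load test."""
    if qps_per_machine <= 0:
        raise ValueError("qps_per_machine must be positive")
    return math.ceil(demand_qps * (1 + headroom) / qps_per_machine)


demand = forecast_demand(current_qps=10_000, monthly_growth=0.05,
                         months=6, inorganic_qps=2_000)
print(machines_needed(demand, qps_per_machine=500, headroom=0.25))
```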
Provisioning
Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.
Efficiency and Performance
SRE needs to pay attention to efficiency and performance, which are closely tied to provisioning.
Google Environment
terminology
- Machine: A piece of hardware (or perhaps a VM)
- Server: A piece of software that implements a service
- Rack: Tens of machines are placed in a rack.
- Row: Racks stand in a row
- Cluster: One or more rows form a cluster
- Datacenter: A datacenter building houses multiple clusters
- Campus: Multiple datacenter buildings that are located close together form a campus
Embracing Risk
Calculating availability
- Time-based: availability = uptime / (uptime + downtime)
- Request-based (for distributed services): availability = successful requests / total requests
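Both definitions are simple ratios. The sketch below writes them out; the sample quarter and request counts are made up.

```python
def time_based_availability(uptime: float, downtime: float) -> float:
    """availability = uptime / (uptime + downtime), in any common time unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime / total


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return successful / total


# Hypothetical quarter: 90 days with 2 hours of downtime.
quarter_hours = 90 * 24
print(f"{time_based_availability(quarter_hours - 2, 2):.4%}")

# Hypothetical day: 2.5 million requests, 2,500 of them failed.
print(f"{request_based_availability(2_497_500, 2_500_000):.4%}")
```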
Risk Tolerance of Consumer Services
- Target level of availability
- Types of failures
- Cost
- Other service metrics
Risk Tolerance of Infrastructure Services
- Target level of availability
- Types of failures
- Cost
Forming Your Error Budget
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
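The budget arithmetic can be sketched as follows, assuming a time-based SLO and a 90-day quarter; the function names and downtime figures are hypothetical.

```python
QUARTER_MINUTES = 90 * 24 * 60  # assumes a 90-day quarter


def error_budget_minutes(slo: float, period_minutes: float) -> float:
    """Total downtime the SLO allows over the period."""
    return (1 - slo) * period_minutes


def budget_remaining(slo: float, period_minutes: float,
                     downtime_minutes: float) -> float:
    """Budget left after the downtime measured by monitoring (may be negative)."""
    return error_budget_minutes(slo, period_minutes) - downtime_minutes


def can_release(slo: float, period_minutes: float,
                downtime_minutes: float) -> bool:
    """New releases may be pushed only while budget remains."""
    return budget_remaining(slo, period_minutes, downtime_minutes) > 0


print(error_budget_minutes(0.999, QUARTER_MINUTES))
print(can_release(0.999, QUARTER_MINUTES, downtime_minutes=50))
print(can_release(0.999, QUARTER_MINUTES, downtime_minutes=130))
```

A 99.9% SLO over this quarter allows roughly 129.6 minutes of downtime, so 50 minutes leaves budget to spend while 130 minutes exhausts it.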
Service Level Objective
service level indicator in practice
- Collecting Indicators(users care about):
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems: data processing pipelines, throughput, end-to-end latency
- All systems: correctness
- Others: error rate
- Aggregation
- Using percentiles for indicators
- Standardize Indicators
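A minimal sketch of using percentiles for an indicator. It uses the nearest-rank definition, which is one of several common percentile definitions, and made-up latency samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900]

print("mean:", sum(latencies_ms) / len(latencies_ms))
for p in (50, 90, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

Here the median is 13 ms while the mean is pulled above 100 ms by two slow requests, which is why a single average is a poor indicator.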
service level objective in practice
- example:
- lower bound ≤ SLI ≤ upper bound.
- SLI ≤ target
- Defining Objectives:
- For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid.
- eg:
- 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
- 90% of Get RPC calls will complete in less than 1 ms
- 99% of Get RPC calls will complete in less than 10 ms
- Choosing Targets:
- Don’t pick a target based on current performance
- Keep it simple
- Avoid absolutes
- Have as few SLOs as possible
- Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
- Control Measures:
- Monitor and measure the system’s SLIs
- Compare the SLIs to the SLOs, and decide whether or not action is needed
- If action is needed, figure out what needs to happen in order to meet the target
- Take that action
- SLOs Set Expectations:
- Keep a safety margin
- Don’t overachieve
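The bound forms and the control measures in this section can be sketched together: represent each SLO as optional lower and upper bounds, compare measured SLIs against them, and list the SLOs that need action. The class, the SLO names, and the measured values are hypothetical; the two targets reuse the Get RPC example.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional, Sequence


@dataclass(frozen=True)
class SLO:
    name: str
    lower: Optional[float] = None  # lower bound <= SLI, if set
    upper: Optional[float] = None  # SLI <= upper bound, if set

    def is_met(self, sli: float) -> bool:
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


def slos_needing_action(slos: Sequence[SLO],
                        measured: Mapping[str, float]) -> List[str]:
    """Compare each measured SLI to its SLO and return the names that miss."""
    return [slo.name for slo in slos if not slo.is_met(measured[slo.name])]


slos = [
    SLO("get_rpc_p90_latency_ms", upper=1.0),
    SLO("get_rpc_p99_latency_ms", upper=10.0),
]
measured = {"get_rpc_p90_latency_ms": 0.8, "get_rpc_p99_latency_ms": 14.2}
print(slos_needing_action(slos, measured))
```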
service level agreements in practice
- an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
- SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
- It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.
Eliminating Toil
Toil defined
- manual
- repetitive
- automatable
- tactical
- no enduring value
- O(n) with service growth
calculating toil
- Individual on-call time / length of one full on-call rotation. With four engineers each on call for one week, the share of time spent on operations is 1/4 = 25%.
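The same ratio as a small function, assuming every engineer takes shifts of equal length; the function name is hypothetical.

```python
def oncall_fraction(engineers: int, shift_weeks: float = 1.0) -> float:
    """Share of each engineer's time spent on call: one shift divided by
    the length of a full rotation through the team."""
    if engineers <= 0 or shift_weeks <= 0:
        raise ValueError("engineers and shift_weeks must be positive")
    rotation_weeks = engineers * shift_weeks
    return shift_weeks / rotation_weeks


print(f"{oncall_fraction(4):.0%}")  # four engineers, one week each
```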
What Qualifies as Engineering
- Software engineering
- Systems engineering: configuring and tuning production systems. One-off work that removes repetitive labor, such as initial setup and parameter tuning.
- Toil: Work directly tied to running a service that is repetitive, manual, etc.
- Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.
why toil is bad
- Career stagnation
- Low morale
- Creates confusion
- Slows progress
- Sets bad precedent
- Promotes attrition
- Causes breach of faith
monitoring
Definitions
- Monitoring
- White-box monitoring
- Black-box monitoring
- Dashboard
- Alert
- Root cause
- Node and machine
- Push
Four Golden Signals
- Latency
- Traffic
- Errors
- Saturation
Worrying About Your Tail
- Use a histogram of the distribution instead of a mean (average) metric
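A short illustration of why the mean hides the tail. The bucket boundaries are an assumed, roughly exponential choice, and the latencies are made up.

```python
import bisect
from typing import Dict, Sequence


def latency_histogram(latencies_ms: Sequence[float],
                      lower_bounds: Sequence[float]) -> Dict[float, int]:
    """Count requests per bucket; each bucket is keyed by its lower bound
    and the last bucket is open-ended."""
    counts = [0] * len(lower_bounds)
    for value in latencies_ms:
        index = bisect.bisect_right(lower_bounds, value) - 1
        if index < 0:
            raise ValueError("latency below the lowest bucket bound")
        counts[index] += 1
    return dict(zip(lower_bounds, counts))


# Hypothetical traffic: 99 fast requests and one very slow one.
latencies_ms = [20] * 99 + [5000]
BOUNDS = (0, 10, 30, 100, 300)  # assumed bucket lower bounds, in ms

print("mean:", sum(latencies_ms) / len(latencies_ms))
print(latency_histogram(latencies_ms, BOUNDS))
```

The mean suggests a uniformly mediocre service, while the histogram shows that almost every request is fast and one sits far out in the tail.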
Choosing an Appropriate Resolution for Measurements
- Collect
- Set the granularity and sample
- Aggregate
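One way to picture the steps above is to collect fine-grained samples and then aggregate them into coarser buckets. The sketch assumes `(unix_seconds, value)` samples and a one-minute resolution; both are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_per_minute(
    samples: Iterable[Tuple[int, float]],
) -> Dict[int, Tuple[float, float, float]]:
    """Group (unix_seconds, value) samples by minute and return
    {minute_start: (min, mean, max)}."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % 60].append(value)
    return {
        minute: (min(values), sum(values) / len(values), max(values))
        for minute, values in sorted(buckets.items())
    }


# Hypothetical per-second CPU readings (percent).
samples = [(0, 10.0), (1, 20.0), (59, 90.0), (60, 30.0), (61, 50.0)]
print(aggregate_per_minute(samples))
```

Keeping the minimum and maximum next to the mean preserves short spikes that a per-minute average alone would smooth away.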
Principles
- Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
- Extra code to detect and expose possible causes
- Associated dashboards for each of these possible causes
- The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
- Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
- Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
- Pages should be about a novel problem or an event that hasn’t been seen before.
Temporary measures
- Adjust some thresholds
- Use a stopgap fix as a transition
Effective Troubleshooting
The best approach: know how the system is designed and how it was built (not necessarily in great detail), then troubleshoot by working through the model.
model
- Problem Report
- Triage
- Examine
- Diagnose
- Test/Treat (loop back to steps 2/3)
- Cure
Problem Report
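A report should include the following information:
- expected behavior
- actual behavior
- optional: how to reproduce this behavior.
Supporting tools:
- An alerting platform that shows the information associated with each alert; ideally that information alone is enough to locate the cause and fix it.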
Triage
- Incident severity: assess it calmly
- Stopping the damage takes priority over finding the root cause
Examine
- Monitoring system: watch specific metrics
- logging:
- Log levels
- Sampling
- Log query platform: supports searching with a query language
Diagnose
- Simplify and reduce
- Black-box testing
- Positive tests
- Negative tests
- Divide and conquer
- Split in two: for example by partition or by region
- Split by layer
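- Ask "what", "where" and "why": work backward recursively to the cause
- Event records:
- Configuration changes
- Code releases
- System configuration changes
- Node changes
- Other
- Specialized tools: diagnostic systems built for specific services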
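Divide and conquer pairs naturally with the event records: given an ordered list of recent changes, a binary search finds the first one after which the system misbehaves. The sketch assumes the failure is monotonic (everything before the culprit is good, everything from it on is bad); the change names and the `is_good` check are hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_bad_change(changes: Sequence[str],
                     is_good: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest change at which the system is bad, or None if
    the system is still good after the latest change."""
    if not changes or is_good(changes[-1]):
        return None
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(changes[mid]):
            lo = mid + 1  # culprit is after mid
        else:
            hi = mid      # culprit is mid or earlier
    return changes[lo]


# Hypothetical history: eight changes, the sixth introduced the fault.
changes = [f"c{i}" for i in range(1, 9)]
probes = []


def is_good(change: str) -> bool:
    probes.append(change)
    return int(change[1:]) < 6


print(first_bad_change(changes, is_good), "after", len(probes), "probes")
```

Each probe halves the remaining range, so a long change history needs only a handful of tests.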
Test And Treat
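- List several possible causes
- Design tests
- Start with the tests that are easiest to run
- Tests should be mutually exclusive
- Test results can be misleading.
- Earlier tests can affect later ones, for example by raising the load
- Some tests are hard to carry out; avoid them where possible.
- Summary:
- Be clear about what you are testing, which tests to run, and what the results were
- When the tests are numerous and complex, document them promptly so the steps do not have to be repeated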
Negative Results Are Magic
- Negative results should not be ignored
- Negative results are crucial
- The tools and methods used in the tests will be useful in future work
- Publishing negative results helps the whole industry
Cure
- Confirm the cause
- Write the incident report
- Fix the problem
Make Troubleshooting Easier
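Two main principles:
- Observable services: expose useful metrics and logs; this has to be considered when the service is designed
- Well-designed, easy-to-understand interfaces between components
- A good end-to-end tracing system: makes it easy to follow requests upstream and downstream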