
[TOC]

Introduction

What is SRE?

The essence of SRE:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning
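The difference between 100% and 99.9% availability:

Reaching 100% takes far more effort, yet users can barely tell 99.9% from 100%. Many intermediaries sit between the service and the user (Wi-Fi, the network path, and so on), so even a service that is 100% available may be experienced as only 99.9% available because of those intermediaries.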
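The effect of intermediaries can be sketched numerically. This is a minimal illustration, assuming every hop must work for a request to succeed and that failures are independent; the 99.9% network figure is a made-up example value, not a measurement.

```python
def perceived_availability(service: float, *intermediaries: float) -> float:
    """Availability seen by the user when the service and every hop must all work.

    Assumes independent failures, so the availabilities multiply.
    """
    result = service
    for hop in intermediaries:
        result *= hop
    return result


# Hypothetical 99.9% home network in front of the service.
perfect = perceived_availability(1.0, 0.999)     # a "100%" service
realistic = perceived_availability(0.999, 0.999)  # a 99.9% service
print(f"{perfect:.4%} vs {realistic:.4%}")
```

Under these assumptions the user of the perfect service still sees about 99.9%, and the 99.9% service is only about 0.1 percentage points worse.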

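Core methodology

Sustained focus on engineering work

Goal: cap operational work at 50% of each SRE's time. When the operational load exceeds that cap, SREs build automation to bring it back under 50%.
Methods:
- Shift work to the development team
- Assign bugs and tickets to the development team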

Pursuing maximum change velocity without violating the SLO

SLO: service level objective

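Defining availability

What level of availability will users be satisfied with?
What alternatives do users have when they are dissatisfied?
How does users' usage of the product change at different availability levels?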

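Monitoring

Valid kinds of monitoring output:

alerts: a human must respond immediately
tickets: comparable to a warning; no immediate response is needed and handling can be deferred
logging: information nobody needs to watch; recorded for later inspection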
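The three outputs above amount to a small routing decision. A minimal sketch follows; the `Output` enum, the `classify` function, and its two boolean inputs are illustrative names chosen here, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # nobody needs to look; kept for later inspection


def classify(needs_human: bool, urgent: bool) -> Output:
    """Route a monitoring event to one of the three valid outputs."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG
```

Anything that does not need a human ends up in the log, regardless of how urgent it looks to the emitting system.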

Emergency Response

Metrics:

MTTF: mean time to failure
MTTR: mean time to restoration

Method: prepare incident playbooks in advance
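The two metrics relate to availability through the standard steady-state formula availability = MTTF / (MTTF + MTTR). The sketch below uses made-up hour values purely to show why shortening recovery time (for example with a prepared playbook) raises availability.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: a failure every 999 hours on average.
slow_recovery = steady_state_availability(999, 1.0)  # one hour to restore
fast_recovery = steady_state_availability(999, 0.5)  # half an hour to restore
print(f"{slow_recovery:.4%} -> {fast_recovery:.4%}")
```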

Change Management

Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
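The three practices above fit together as one loop. This is a minimal sketch, assuming caller-supplied `deploy`, `is_healthy`, and `rollback` callables and an illustrative list of traffic fractions; all of these names are hypothetical and not taken from the source.

```python
from typing import Callable, Sequence


def progressive_rollout(
    fractions: Sequence[float],
    deploy: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll a change out in stages; roll back at the first unhealthy stage."""
    for fraction in fractions:
        deploy(fraction)      # expose the change to this share of traffic
        if not is_healthy():  # detect problems before widening exposure
            rollback()        # return to the last known-good version
            return False
    return True
```

A call such as `progressive_rollout([0.01, 0.1, 1.0], deploy, is_healthy, rollback)` limits the damage of a bad change to the smallest stage at which the problem becomes visible.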

Demand Forecasting and Capacity Planning

Capacity planning needs to account for:

An accurate forecast of organic growth
A forecast of demand from inorganic growth sources
Regular load testing to correlate raw capacity with service capacity

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.

Efficiency and Performance

SRE needs to care about efficiency and performance, which ties back to provisioning.

Google Environment

terminology

Machine: A piece of hardware (or perhaps a VM)
Server: A piece of software that implements a service
Racks: Tens of machines are placed in a rack.
Row: Racks stand in a row
Cluster: One or more rows form a cluster
Datacenter: A datacenter building houses multiple clusters
Campus: Multiple datacenter buildings that are located close together form a campus

Embracing Risk

Calculating availability

Time-based: availability = uptime / (uptime + downtime)
Distributed-system (request-based) view: availability = successful requests / total requests
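Both formulas above translate directly into code. The sketch below uses illustrative function names and made-up inputs (a 30-day month and one million requests) to show how each is computed.

```python
def time_based_availability(uptime: float, downtime: float) -> float:
    """availability = uptime / (uptime + downtime), in any consistent time unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime / total


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return successful / total


# A 30-day month has 43,200 minutes; 43.2 minutes of downtime is 0.1% of it.
monthly = time_based_availability(uptime=43_200 - 43.2, downtime=43.2)
# 1,000 failed requests out of 1,000,000.
by_request = request_based_availability(successful=999_000, total=1_000_000)
print(f"{monthly:.3%} {by_request:.3%}")
```

The request-based form is usually the more useful one for a distributed service, because the service is rarely entirely up or entirely down.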

Risk Tolerance of Consumer Services

Target level of availability
Types of failures
Cost
Other service metrics

Risk Tolerance of Infrastructure Services

Target level of availability
Types of failures
Cost

Forming Your Error Budget

Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
The actual uptime is measured by a neutral third party: our monitoring system.
The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
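The budget arithmetic can be sketched as follows. This version counts requests rather than uptime, which is an assumption made here for simplicity; the `ErrorBudget` class and the example figures are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target success ratio for the quarter, e.g. 0.999
    total_requests: int   # requests served so far this quarter
    failed_requests: int  # failures reported by the monitoring system

    @property
    def allowed_failures(self) -> float:
        """Total failures the SLO tolerates for this request volume."""
        return (1 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; negative means the SLO has been missed."""
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        """New releases may be pushed only while budget remains."""
        return self.remaining > 0


budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.remaining), budget.can_release())
```

With a 99.9% SLO over one million requests, roughly 1,000 failures are allowed, so 400 failures leave about 600 in the budget and releases can continue.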

Service Level Objective

service level indicator in practice

Collecting Indicators(users care about):
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems: data processing pipelines, throughput, end-to-end latency
- All systems: correctness
- Others: error rate
Aggregation
- Using percentiles for indicators
Standardize Indicators
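To illustrate why percentiles are preferred for aggregation, here is a minimal nearest-rank percentile sketch. The latency samples are invented, and nearest-rank is only one of several common percentile definitions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 900, 14]
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

In this sample the mean is 102.2 ms, a value no request actually experienced, while the 50th percentile (14 ms) and 99th percentile (900 ms) describe the typical and worst cases separately.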

service level objective in practice

example:
- lower bound ≤ SLI ≤ upper bound.
- SLI ≤ target
Defining Objectives:
- For maximum clarity, SLOs should specify how they’re measured and the conditions
  under which they’re valid.
- eg:
  - 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
  - 90% of Get RPC calls will complete in less than 1 ms
  - 99% of Get RPC calls will complete in less than 10 ms
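Objectives of the form "N% of calls complete in less than T ms" can be checked mechanically. A minimal sketch, assuming the latencies have already been collected over the measurement window; the function name and the sample data are illustrative.

```python
from typing import Sequence


def meets_latency_slo(
    latencies_ms: Sequence[float], threshold_ms: float, target_fraction: float
) -> bool:
    """True if at least target_fraction of calls finished in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no samples in the measurement window")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


# Hypothetical window of 1,000 Get RPC calls.
window = [0.5] * 950 + [5.0] * 45 + [50.0] * 5
print(meets_latency_slo(window, 1, 0.90))    # 90% under 1 ms
print(meets_latency_slo(window, 10, 0.99))   # 99% under 10 ms
print(meets_latency_slo(window, 10, 0.999))  # a stricter, unmet target
```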
Choosing Targets:
- Don’t pick a target based on current performance
- Keep it simple
- Avoid absolutes
- Have as few SLOs as possible
- Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
Control Measures:
- Monitor and measure the system’s SLIs
- Compare the SLIs to the SLOs, and decide whether or not action is needed
- If action is needed, figure out what needs to happen in order to meet the target
- Take that action
SLOs Set Expectations:
- Keep a safety margin
- Don’t overachieve

service level agreements in practice

an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.

Eliminating Toil

Toil Defined

manual
repetitive
automatable
tactical
no enduring value
O(n) with service growth

calculating toil

Toil share = one person's on-call time / the length of one full rotation. With four engineers each taking a one-week shift, the operational share is 1/4 = 25%.
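The same arithmetic as a small helper. The `shifts_per_rotation` parameter is an addition made here to cover rotations where each engineer takes more than one shift per cycle (for example primary and secondary on-call); it is not part of the note above.

```python
def oncall_share(team_size: int, shifts_per_rotation: int = 1) -> float:
    """Fraction of an engineer's time spent on-call in an evenly shared rotation."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return shifts_per_rotation / team_size


print(oncall_share(4))     # four engineers, one shift each per rotation: 0.25
print(oncall_share(8, 2))  # eight engineers, primary plus secondary shifts: 0.25
```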

What Qualifies as Engineering

Software engineering
Systems engineering: configuring and tuning production environments. One-off work that removes repeated labor, such as initial setup and parameter tuning.
Toil: Work directly tied to running a service that is repetitive, manual, etc.
Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.

Why too much toil is bad

Career stagnation
Low morale
Creates confusion
Slows progress
Sets bad precedent
Promotes attrition
Causes breach of faith

monitoring

Definitions

Monitoring
White-box monitoring
Black-box monitoring
Dashboard
Alert
Root cause
Node and machine
Push

Four Golden Signals

Latency
Traffic
Errors
Saturation

Worrying About Your Tail

Use a histogram (the distribution of values) instead of a mean (average) metric.
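A short sketch of the difference, using invented latency samples and illustrative bucket boundaries. `bisect_right` places each sample in the bucket whose range contains it.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence

# Illustrative bucket boundaries in milliseconds (roughly exponential).
BOUNDS_MS = [10, 30, 100, 300, 1000]


def latency_histogram(latencies_ms: Sequence[float]) -> List[int]:
    """Count samples per bucket: <10, 10-30, 30-100, 100-300, 300-1000, >=1000 ms."""
    counts = Counter(bisect_right(BOUNDS_MS, value) for value in latencies_ms)
    return [counts.get(index, 0) for index in range(len(BOUNDS_MS) + 1)]


# Hypothetical samples: nine fast requests and one very slow one.
samples = [5, 8, 12, 25, 7, 9, 2500, 6, 11, 4]
print(sum(samples) / len(samples))  # mean
print(latency_histogram(samples))   # distribution
```

For these samples the mean is 258.7 ms, which hides both facts that matter: nine of ten requests finished in under 30 ms, and one took more than a second. The histogram shows both.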

Choosing an Appropriate Resolution for Measurements

Collect
Set the granularity and sample
Aggregate
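One way to read the three steps: collect fine-grained samples, sort each into a coarse bucket, and aggregate the bucket counts per interval instead of storing every sample. The sketch below assumes per-second CPU utilization readings and a 5% bucket width; both are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable


def bucketize(samples: Iterable[float], width: int = 5) -> Dict[int, int]:
    """Count utilization samples (0-100%) per bucket, keyed by the bucket's lower bound."""
    counts: Counter = Counter()
    for value in samples:
        # Clamp so that exactly 100% lands in the top bucket.
        low = min(int(value // width) * width, 100 - width)
        counts[low] += 1
    return dict(sorted(counts.items()))


# Hypothetical per-second readings collected during one interval.
readings = [3, 4, 12, 97, 100, 7]
print(bucketize(readings))
```

Storing one small dictionary per interval keeps short spikes visible (the two samples in the 95% bucket here) at a fraction of the cost of retaining every reading.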

Principles

Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
Extra code to detect and expose possible causes
Associated dashboards for each of these possible causes
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
Every page should be actionable.
Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
Pages should be about a novel problem or an event that hasn’t been seen before.

Temporary workarounds

Adjust some thresholds
Use a temporary workaround as a stopgap

Effective Troubleshooting

The best approach: know how the system is designed and how it was built (not necessarily in great detail), then troubleshoot by following the model.

model

Problem Report
Triage
Examine
Diagnose
Test/Treat (loop back to Examine/Diagnose as needed)
Cure

Problem Report

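It should include the following information:

expected behavior
actual behavior
optional: how to reproduce this behavior.

Supporting tools:

An alerting platform that shows the information associated with each alert, ideally enough to locate the cause and fix it from that information alone.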

Triage

Incident severity: assess it calmly
Mitigation comes before investigation

Examine

Monitoring system: watch specific metrics
logging:
- Levels
- Sampling
- A log query platform that supports a query language
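Levels and sampling can be combined with the standard `logging` module. A minimal sketch, assuming that everything at WARNING and above should be kept and that lower levels are sampled at a fixed rate; the class name, logger name, and 1% rate are illustrative.

```python
import logging
import random
from typing import Optional


class SampleBelowWarning(logging.Filter):
    """Keep every record at WARNING or above; keep lower levels at a sampling rate."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


logger = logging.getLogger("example_service")  # hypothetical logger name
handler = logging.StreamHandler()
handler.addFilter(SampleBelowWarning(rate=0.01))  # keep about 1% of INFO/DEBUG
logger.addHandler(handler)
```

Attaching the filter to the handler rather than the logger means other handlers, such as one writing full debug logs to local disk, can still receive every record.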

Diagnose

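Simplify and reduce
- Black-box testing
  - Positive testing
  - Negative testing
- Divide and conquer
  - Split in two, for example by partition or by region
    - Split by layer
Ask "what", "where" and "why": work backward recursively to the cause
Event records:
- Configuration changes
- Code releases
- System configuration changes
- Node changes
- Other
Specialized systems: diagnostic systems built for specific services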
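Divide and conquer over a layered system is essentially a binary search. The sketch below assumes a linear chain of stages, a caller-supplied probe that reports whether the output is still correct after a given stage, and that the fault persists downstream once introduced; the stage names are invented.

```python
from typing import Callable, Sequence


def first_bad_stage(
    stages: Sequence[str], is_healthy_through: Callable[[int], bool]
) -> str:
    """Return the first stage whose output is wrong.

    Assumes the final output is known to be wrong and that everything
    downstream of the faulty stage is also wrong.
    """
    if not stages:
        raise ValueError("stages must not be empty")
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy_through(mid):
            lo = mid + 1  # fault lies after mid
        else:
            hi = mid      # fault is at mid or earlier
    return stages[lo]


stages = ["load balancer", "frontend", "cache", "backend", "database"]
# Hypothetical probe: output is correct up to and including the cache.
print(first_bad_stage(stages, lambda index: index < 3))
```

Each probe halves the search space, so five stages need at most three probes; the example above finds the backend after two.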

Test And Treat

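List several possible causes
Design the tests
- Start with the test that is easiest to run
- Tests should be mutually exclusive
- Test results can be misleading
- Earlier tests can affect later ones, for example by raising the load
- Some tests are hard to carry out; avoid them where possible
Summary:
- Be clear about what is being tested, which tests to run, and what each result was
- When the tests are many and complex, document them promptly so the steps need not be repeated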

Negative Results Are Magic

Negative results should not be ignored
Negative results are essential
The tools and methods used in the tests will be useful in future work
Publishing negative results helps the whole industry

Cure

Confirm the cause
Write the incident report
Fix

Make Troubleshooting Easier

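Main principles:

Observable services: expose useful metrics and logs; this has to be considered when the service is designed
Well-designed, easy-to-understand component interfaces
A good end-to-end tracing system, which makes it easy to follow requests upstream and downstream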


Source: Site Reliability Engineering (2016.3), published in Chinese as 《SRE：Google运维解密》