I. About goreman
- go get github.com/mattn/goreman downloads the source into $GOPATH/src
- Run go build in the goreman root directory to produce the goreman binary
- Copy the binary into $GOPATH/bin: cp goreman $GOPATH/bin
II. Start a local multi-member cluster
- Change to the etcd root directory
- Run goreman -f Procfile start, which prints output like the following:
16:00:04 etcd2 | Starting etcd2 on port 5100
16:00:04 etcd1 | Starting etcd1 on port 5000
16:00:04 etcd3 | Starting etcd3 on port 5200
16:00:04 etcd1 | {"level":"info","ts":"2020-01-06T16:00:04.125+0800","caller":"etcdmain/etcd.go:110","msg":"failed to detect default host","error":"default host not supported on darwin_amd64"}
16:00:04 etcd1 | {"level":"warn","ts":"2020-01-06T16:00:04.126+0800","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"infra1.etcd"}
16:00:04 etcd2 | {"level":"info","ts":"2020-01-06T16:00:04.125+0800","caller":"etcdmain/etcd.go:110","msg":"failed to detect default host","error":"default host not supported on darwin_amd64"}
16:00:04 etcd2 | {"level":"warn","ts":"2020-01-06T16:00:04.126+0800","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"infra2.etcd"}
16:00:04 etcd2 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:22380"]}
16:00:04 etcd1 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:12380"]}
16:00:04 etcd3 | {"level":"info","ts":"2020-01-06T16:00:04.126+0800","caller":"etcdmain/etcd.go:110","msg":"failed to detect default host","error":"default host not supported on darwin_amd64"}
16:00:04 etcd1 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2379"]}
16:00:04 etcd1 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:602","msg":"pprof is enabled","path":"/debug/pprof"}
16:00:04 etcd3 | {"level":"warn","ts":"2020-01-06T16:00:04.126+0800","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"infra3.etcd"}
16:00:04 etcd3 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:32380"]}
16:00:04 etcd3 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:32379"]}
16:00:04 etcd3 | {"level":"info","ts":"2020-01-06T16:00:04.127+0800","caller":"embed/etcd.go:602","msg":"pprof is enabled","path":"/debug/pprof"}
III. Basic operations
1. List the members of the cluster: etcdctl --write-out=table --endpoints=localhost:2379 member list
+------------------+---------+--------+------------------------+------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+------------------------+------------------------+------------+
| 8211f1d0f64f3269 | started | infra1 | http://127.0.0.1:12380 | http://127.0.0.1:2379 | false |
| 91bc3c398fb3c146 | started | infra2 | http://127.0.0.1:22380 | http://127.0.0.1:22379 | false |
| fd422379fda50e48 | started | infra3 | http://127.0.0.1:32380 | http://127.0.0.1:32379 | false |
+------------------+---------+--------+------------------------+------------------------+------------+
2. Put a key-value pair and read the key back
etcdctl --endpoints=localhost:32379 put foo1 bar1
etcdctl --endpoints=localhost:32379 get foo1
etcdctl get foo1
3.watch
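./bin/etcdctl watch key<the actual key>
or, in interactive mode, start the session and then type the watch command:
./bin/etcdctl watch -i
watch key<the actual key>
To replay the history of a key, start the watch from an earlier revision:
./bin/etcdctl watch --rev=<revision> key<the actual key>
raft_term: incremented every time the leader changes (cluster-wide);
revision: the cluster-wide counter, incremented by every modification (e.g. Put/Delete/Txn);
mod_revision: the revision at which this key was last modified;
create_revision: the revision at which this key was created;
version: the number of times this key has been written since it was created, starting at 1 on creation. It is not mod_revision - create_revision, because writes to other keys also advance the global revision;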
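The relationship between these counters is easier to see in a small model. The sketch below is a toy in-memory store, not etcd's code; the class and field names (TinyStore, KeyMeta) are made up for illustration, and the starting revision of 1 is an assumption of the model.

```python
from dataclasses import dataclass


@dataclass
class KeyMeta:
    create_revision: int
    mod_revision: int
    version: int
    value: str


class TinyStore:
    """Toy model of the per-key counters (hypothetical, not etcd's implementation)."""

    def __init__(self):
        self.revision = 1  # assumption: an empty store starts at revision 1
        self.keys = {}

    def put(self, key, value):
        self.revision += 1  # every write advances the global revision
        meta = self.keys.get(key)
        if meta is None:
            # New key: both revisions point at this write, version starts at 1.
            self.keys[key] = KeyMeta(self.revision, self.revision, 1, value)
        else:
            meta.mod_revision = self.revision
            meta.version += 1
            meta.value = value
        return self.keys[key]


store = TinyStore()
store.put("foo", "a")        # global revision 2
store.put("bar", "x")        # global revision 3, a different key
store.put("bar", "y")        # global revision 4
foo = store.put("foo", "b")  # global revision 5
print(foo)
```

Here foo has been written twice, so its version is 2, while mod_revision - create_revision is 3 because the two writes to bar also advanced the global revision.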
4.transaction:
$ etcdctl put flag 0
$ etcdctl txn -i # run a transaction interactively
compares:
value("flag") = "1"
success requests (get, put, del):
put hello world_123
failure requests (get, put, del):
put hello world_world_ooo
FAILURE
OK
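The session above follows a compare-then-branch pattern: if every comparison holds, the success requests run, otherwise the failure requests run. A minimal sketch of that logic on a plain dict, assuming only equality comparisons and put requests (the txn function and its parameters are hypothetical names, not an etcd client API):

```python
def txn(store, compares, on_success, on_failure):
    """Apply on_success if every compare holds, otherwise on_failure (toy model)."""
    succeeded = all(store.get(key) == expected for key, expected in compares)
    for key, value in (on_success if succeeded else on_failure):
        store[key] = value
    return succeeded


store = {"flag": "0"}
ok = txn(
    store,
    compares=[("flag", "1")],                   # value("flag") = "1"
    on_success=[("hello", "world_123")],
    on_failure=[("hello", "world_world_ooo")],
)
print("SUCCESS" if ok else "FAILURE", store["hello"])
```

Because flag holds "0", the comparison fails and the failure branch writes hello, which matches the FAILURE result in the etcdctl session.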
5.lease:
etcd can give a key an expiry time. First create a lease, then attach it to the key by passing --lease=<lease ID> to the put command.
$ etcdctl lease grant 100 # create a lease
lease 38015a3c00490513 granted with TTL(100s)
$ etcdctl put hello world --lease=38015a3c00490513 # attach the lease to the key
OK
$ etcdctl lease timetolive 38015a3c00490513 # inspect a lease
lease 38015a3c00490513 granted with TTL(100s), remaining(67s)
$ etcdctl lease timetolive 38015a3c00490513 --keys # list the keys attached to a lease
lease 38015a3c00490513 granted with TTL(100s), remaining(59s), attached keys([hello])
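The lease behaviour shown above (grant, attach, remaining TTL, expiry removing the attached keys) can be sketched as a toy model. This is only an illustration with an explicit clock; LeaseTable and its methods are hypothetical names, not etcd's implementation or client API.

```python
class LeaseTable:
    """Toy lease model with a manual clock in seconds (hypothetical)."""

    def __init__(self):
        self.now = 0
        self.leases = {}  # lease id -> {"ttl", "expires", "keys"}
        self.kv = {}
        self._next_id = 1

    def grant(self, ttl):
        lease_id = self._next_id
        self._next_id += 1
        self.leases[lease_id] = {"ttl": ttl, "expires": self.now + ttl, "keys": set()}
        return lease_id

    def put(self, key, value, lease=None):
        self.kv[key] = value
        if lease is not None:
            self.leases[lease]["keys"].add(key)

    def time_to_live(self, lease_id):
        return self.leases[lease_id]["expires"] - self.now

    def advance(self, seconds):
        """Move the clock forward and delete keys whose lease has expired."""
        self.now += seconds
        expired = [i for i, lease in self.leases.items() if lease["expires"] <= self.now]
        for lease_id in expired:
            for key in self.leases.pop(lease_id)["keys"]:
                self.kv.pop(key, None)


table = LeaseTable()
lease = table.grant(100)
table.put("hello", "world", lease=lease)
table.advance(33)
remaining = table.time_to_live(lease)
attached = sorted(table.leases[lease]["keys"])
table.advance(67)  # the lease expires and takes its keys with it
print(remaining, attached, "hello" in table.kv)
```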
Supplement:
1.Logical view
In etcd the logical view is a flat binary key space. Keys are indexed in lexical order, so range queries are supported. The key space keeps multiple revisions of its contents: every modification creates a new revision of the key space, while earlier revisions stay unchanged, so an older version of a key can be read by specifying a revision. Revisions are indexed as well, which is what lets a watch start from a given revision and follow the changes to a key. As the key space keeps producing new revisions, the data the cluster has to maintain grows and consumes more and more resources; compaction reclaims that space.
Every key in the key space has a lifetime, called a generation, that runs from creation to deletion. Each key has at least one generation and may have several. Creating a key that does not exist starts its version at 1;
deleting a key writes a tombstone and resets the key's current version to 0;
every modification of a key increments its version by 1;
so within a generation a key's version increases monotonically. Once a compaction happens, revisions older than the compaction revision are removed, and so are the values set before that revision (apart from the latest one for each key that still exists).
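The rules above (one new revision per write, tombstones ending a generation, reads at an old revision, compaction) can be sketched with a toy multi-version store. It is a simplified illustration only: MVStore and its methods are hypothetical names, and the flat history list stands in for etcd's real storage.

```python
class MVStore:
    """Toy multi-version key space (hypothetical, not etcd's implementation)."""

    def __init__(self):
        self.revision = 1
        self.history = []   # entries: (revision, key, value or None, version)
        self.compacted = 0

    def _last(self, key):
        for entry in reversed(self.history):
            if entry[1] == key:
                return entry
        return None

    def put(self, key, value):
        self.revision += 1
        last = self._last(key)
        alive = last is not None and last[2] is not None
        version = last[3] + 1 if alive else 1  # a new generation starts at 1
        self.history.append((self.revision, key, value, version))

    def delete(self, key):
        last = self._last(key)
        if last is None or last[2] is None:
            return  # nothing to delete
        self.revision += 1
        self.history.append((self.revision, key, None, 0))  # tombstone, version 0

    def get(self, key, rev=None):
        rev = self.revision if rev is None else rev
        if rev < self.compacted:
            raise ValueError("required revision has been compacted")
        value = None
        for r, k, v, _ in self.history:
            if k == key and r <= rev:
                value = v
        return value

    def version(self, key):
        last = self._last(key)
        return 0 if last is None else last[3]

    def changes_since(self, key, rev):
        """What a watch started at revision rev would replay for key."""
        return [(r, v) for r, k, v, _ in self.history if k == key and r >= rev]

    def compact(self, rev):
        """Drop superseded entries at or before rev, keeping each live key's latest."""
        latest = {}
        for r, k, _, _ in self.history:
            if r <= rev:
                latest[k] = r
        self.history = [
            e for e in self.history
            if e[0] > rev or (latest[e[1]] == e[0] and e[2] is not None)
        ]
        self.compacted = rev


s = MVStore()
s.put("foo", "a")    # revision 2, version 1
s.put("foo", "b")    # revision 3, version 2
s.delete("foo")      # revision 4, tombstone
s.put("foo", "c")    # revision 5, version 1 again (new generation)
old = s.get("foo", rev=3)
replay = s.changes_since("foo", 2)
s.compact(4)
print(old, replay, s.get("foo"), s.version("foo"))
```

After compact(4), reading foo at revision 3 raises an error in this model, while the current value from the new generation is still available.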
2.Physical view
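etcd stores its data as key-value pairs in a B+ tree. For efficiency, each revision of the store's state contains only the delta from the previous revision. A single revision may correspond to multiple keys in the tree.
The key of each key-value pair is a 3-tuple <major, sub, type>: major is the revision of the store that holds the key; sub distinguishes different keys within the same revision; type is an optional suffix for special values.
The value of each key-value pair holds the modification relative to the previous revision, that is, one delta. The B+ tree is ordered in lexical byte order, so range lookups are relatively fast. Compaction removes out-of-date key-value pairs.
etcd also keeps a secondary index in memory to speed up queries, range queries in particular.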
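As a rough sketch of this layout, the toy class below keys the stored pairs by (major, sub) and keeps a separate in-memory index from user key to the revisions that touched it. It is an assumption-laden illustration: a dict stands in for the B+ tree, the type element and tombstones are left out, and Backend, txn_put and range_query are hypothetical names.

```python
import bisect


class Backend:
    """Toy revision-keyed store plus in-memory key index (hypothetical)."""

    def __init__(self):
        self.revision = 1
        self.tree = {}    # (major, sub) -> (key, value); stands in for the B+ tree
        self.index = {}   # key -> ascending list of (major, sub)

    def txn_put(self, pairs):
        """Write several keys in one revision; sub tells them apart."""
        self.revision += 1
        for sub, (key, value) in enumerate(pairs):
            rev = (self.revision, sub)
            self.tree[rev] = (key, value)
            self.index.setdefault(key, []).append(rev)

    def get(self, key, at_rev=None):
        revs = self.index.get(key, [])
        at = self.revision if at_rev is None else at_rev
        # Find the newest entry for this key whose major revision is <= at.
        i = bisect.bisect_right(revs, (at, float("inf")))
        if i == 0:
            return None
        return self.tree[revs[i - 1]][1]

    def range_query(self, start, end, at_rev=None):
        """Keys in [start, end) in lexical order, resolved through the index."""
        out = []
        for key in sorted(self.index):
            if start <= key < end:
                value = self.get(key, at_rev)
                if value is not None:
                    out.append((key, value))
        return out


b = Backend()
b.txn_put([("a", "1"), ("b", "2")])  # revision 2 holds two keys: (2, 0) and (2, 1)
b.txn_put([("a", "3")])              # revision 3 holds one key: (3, 0)
print(sorted(b.tree))
print(b.get("a"), b.get("a", at_rev=2), b.range_query("a", "c"))
```

The index answers "which revisions touched this key" without scanning the revision-ordered store, which is the role the in-memory index plays in the description above.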
IV. Add a node
Add a member to the local multi-member cluster started earlier.
Add the member: etcdctl member add infra4 --peer-urls="http://127.0.0.1:42380" --learner=true
Start the member:
etcd --name infra4 --listen-client-urls http://127.0.0.1:42379 --advertise-client-urls http://127.0.0.1:42379 --listen-peer-urls http://127.0.0.1:42380 --initial-advertise-peer-urls http://127.0.0.1:42380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra4=http://127.0.0.1:42380,infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
Promote the learner to a voting member (the server accepts the request only once the learner has caught up with the leader): etcdctl member promote 8de0eb3c0ff43347<ID of the new member>
- List the current members of the cluster: etcdctl --write-out=table member list (in the listing below infra4 still shows IS LEARNER = true, i.e. it has not been promoted yet)
+------------------+---------+--------+------------------------+------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+------------------------+------------------------+------------+
| 8211f1d0f64f3269 | started | infra1 | http://127.0.0.1:12380 | http://127.0.0.1:2379 | false |
| 8de0eb3c0ff43347 | started | infra4 | http://127.0.0.1:42380 | http://127.0.0.1:42379 | true |
| 91bc3c398fb3c146 | started | infra2 | http://127.0.0.1:22380 | http://127.0.0.1:22379 | false |
| fd422379fda50e48 | started | infra3 | http://127.0.0.1:32380 | http://127.0.0.1:32379 | false |
+------------------+---------+--------+------------------------+------------------------+------------+
V. Learner design
Part 1: Examples
Example 1: a new member overloads the leader
A member that has just been added to an existing etcd cluster has no data and must sync from the leader until it has caught up with the leader's latest log. This can overload the leader's network, so that heartbeats to the followers are blocked or dropped. If that lasts longer than the election timeout, followers may start a new round of leader election. In other words, adding a new member to an etcd cluster can disturb leadership. Both the leader election and the subsequent data transfer to the new member affect the cluster and can make it unavailable.
In short: a new member joins with no data and requests updates from the leader until it catches up. If the snapshot sent to the new member is too large, the leader's network is overloaded, its heartbeats to the other followers are blocked or lost, and once the election timeout elapses a follower starts a new leader election.
Example 2: leader isolation
When the leader is cut off from part of the followers, the whole cluster is affected and can become unavailable. The leader tracks the progress of every follower; once it can no longer communicate with a quorum of them, the followers start a new round of leader election after the election timeout has passed.

The two examples above show the problems, actual or potential, that adding a new member can cause. Example 3 looks at network partitions: what happens when a partition occurs after a new member has been added to a 3-node cluster? The answer depends on which partition the new member ends up in.
Example 3: adding a new member to a 3-node cluster
- The new member is in the same partition as the leader
In this case the leader still holds an active quorum of 3, so no leader election takes place and the existing cluster is not affected, as shown below:
image.png
- The new member is not in the same partition as the leader: the cluster splits 2-2, neither partition reaches the quorum of 3, and a leader election is triggered
image.png
- Quorum lost
What if the cluster is partitioned first and a new member joins afterwards? Take a 3-node cluster in which one follower cannot reach the leader. When a new member is added, the quorum of the cluster changes from 2 to 3, but the 4-node cluster actually has only 2 active members, so a new round of leader election cannot reach quorum.
image.png
Because adding a member changes the quorum size of the cluster, first remove the unhealthy node from the cluster, and only then add the new member that replaces it.
Adding a new member to a 1-node cluster changes the quorum size to 2. The previous leader finds that it no longer has quorum, and a leader election is triggered immediately. This happens because "member add" is a 2-step operation: the membership change is applied first, and only afterwards is the process of the new member started. See the figure below.
image.png
- Cluster misconfiguration
Things can get worse in practice when a new member is added with a wrong configuration. Membership reconfiguration is a 2-step operation: first etcdctl member add, then start the server process with the given peer URL. The member add command is applied whatever the URL is, even when its value is invalid. If the first step used an invalid URL, the second step cannot even start the new etcd. And once the cluster has lost quorum, the membership change cannot be reverted.

The same applies to a multi-node cluster. Suppose two members are down (one failed, the other misconfigured): with two members down, a quorum of at least 3 is now required to change the cluster membership. See the figure below.

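As shown above, a simple misconfiguration can leave the whole cluster unable to work. In that case the operator has to recreate the cluster manually with etcd --force-new-cluster. And because etcd has become a mission-critical service for Kubernetes, even the slightest outage can have a significant impact on users. How can such operations on etcd be made easier? Among other things, leader election is critical to cluster availability, which raises several questions:
- Can membership reconfiguration be made less disruptive by not changing the quorum size?
- Can a new node stay idle, requesting only the minimum of updates from the leader, until it has caught up?
- Can a membership misconfiguration always be reversible and handled in a safer way (running a wrong member add command should never fail the cluster)?
- Should users have to worry about network topology when adding a new member?
- Can the member add API work regardless of where the nodes are located and of ongoing network partitions?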
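The quorum arithmetic behind the examples in this part can be checked with a few lines. The sketch assumes the usual majority rule (quorum = floor(n/2) + 1 voting members); the function names and scenario labels are made up for illustration.

```python
def quorum(voters):
    """Majority of the voting members."""
    return voters // 2 + 1


def has_quorum(reachable, voters):
    """True if a partition that can reach `reachable` voters holds a majority."""
    return reachable >= quorum(voters)


# (reachable voters on the side being considered, total voters in the configuration)
scenarios = {
    "3 nodes, all healthy": (3, 3),
    "4 nodes, 3-1 split, leader's side": (3, 4),
    "4 nodes, 2-2 split, either side": (2, 4),
    "3 nodes with 1 unreachable, then member add": (2, 4),
    "1 node, member add before the new process starts": (1, 2),
}

for name, (reachable, voters) in scenarios.items():
    print(f"{name}: quorum={quorum(voters)} ok={has_quorum(reachable, voters)}")
```

Only the first two scenarios keep a working quorum; in the other three the membership change itself raises the quorum above the number of members that can actually respond.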
Part 2: Raft Learner
To address the situations in the examples above, a new node state was introduced: learner. A new member first joins the cluster as a non-voting member and is promoted to a voting member only after it has caught up with the leader's log.
- Features in v3.4
Adding a learner node is straightforward: member add --learner adds the node as a non-voting member that receives the leader's log until it has caught up.

Once the learner has caught up with the leader, the "member promote" API turns it into a voting member that counts toward quorum.

Whether a learner may become a voting member is decided by the etcd server, which validates the promote request to keep the operation safe and to make sure the learner has caught up with the leader's progress.

Until it is promoted, a learner acts as a standby node: leadership cannot be transferred to a learner, and a learner serves neither reads nor writes (the client balancer does not route requests to learners). This also means that a learner does not need to send read index requests to the leader.

etcd also limits the number of learners a cluster may have, so that log replication does not overload the leader. A learner never promotes itself to a voting member: etcd provides learner status information and safety checks, and the cluster operator makes the final decision on whether to promote the learner.
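The v3.4 behaviour described above can be sketched as a toy membership model: learners do not count toward quorum, their number is capped, and promotion is refused until the learner is close enough to the leader. This is an illustration only. The Cluster class, the max_learners cap of 1 and the ready_ratio threshold of 0.9 are assumptions chosen for the sketch, not values taken from this document.

```python
class Cluster:
    """Toy membership model with learners (hypothetical, not etcd's code)."""

    def __init__(self, voters, max_learners=1, ready_ratio=0.9):
        self.voters = set(voters)
        self.learners = {}            # learner name -> replicated log index
        self.leader_index = 0         # leader's latest log index
        self.max_learners = max_learners  # assumed cap, for illustration
        self.ready_ratio = ready_ratio    # assumed "caught up" threshold

    def quorum(self):
        # Only voting members count; learners never change the quorum size.
        return len(self.voters) // 2 + 1

    def add_learner(self, name):
        if len(self.learners) >= self.max_learners:
            raise RuntimeError("too many learners in the cluster")
        self.learners[name] = 0

    def replicate(self, name, index):
        self.learners[name] = index

    def promote(self, name):
        if self.learners[name] < self.ready_ratio * self.leader_index:
            raise RuntimeError("learner is not in sync with the leader")
        del self.learners[name]
        self.voters.add(name)


c = Cluster(["infra1", "infra2", "infra3"])
c.leader_index = 1000
c.add_learner("infra4")
quorum_with_learner = c.quorum()   # still 2: the learner does not vote

rejected = None
try:
    c.promote("infra4")            # the learner has replicated nothing yet
except RuntimeError as err:
    rejected = str(err)

c.replicate("infra4", 950)
c.promote("infra4")                # caught up far enough in this model
print(quorum_with_learner, rejected, c.quorum())
```

Adding the learner leaves the quorum at 2; only the successful promotion raises it to 3, at a point where the new member is able to take part.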
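- Features planned for v3.5
New members join as learners by default. Until a new member becomes a voting member it does not change the quorum size, and a misconfiguration can always be reverted without losing quorum.
Fully automate promotion to voting member: once a learner has caught up with the leader's log, the cluster promotes it automatically. etcd requires the user to define certain thresholds; once they are met, the learner is promoted to a voting member. From the user's point of view the "member add" command works the same way, but the learner feature provides more safety.
Make the learner a standby failover node: a learner joins as a standby node and is promoted automatically when cluster availability is affected.
Make the learner a read-only node: a learner can serve as a read-only node that is never promoted. In weak consistency mode the learner only receives data from the leader and never processes writes. Serving reads locally without consensus would greatly reduce the leader's workload, but may return stale data. In strong consistency mode the learner requests a read index from the leader in order to serve the latest data, but still rejects writes.
- Learner vs Mirror Maker
etcd implements a "mirror maker" on top of the watch API to continuously replicate key creations and updates to a separate cluster. Once the initial sync has completed, mirroring usually has a low latency overhead. Learner and mirror overlap in that both can be used to replicate existing data for read-only use. However, a mirror does not guarantee linearizability. During a network disconnect, previous key-values may have been discarded, and clients are expected to verify that watch responses arrive in the correct order. A mirror therefore gives no ordering guarantee. Use a mirror to reduce latency (for example across data centers) at the cost of consistency. Use a learner to retain all historical data and its ordering.
Part 3: Learner implementation
The etcd client adds a flag to the MemberAdd API to mark a learner node. See the [Add a node] section above for the steps.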
References
- Original github issue: etcd#9161
- Use case: etcd#3715
- Use case: etcd#8888
- Use case: etcd#10114
- Learner design: design-learner.md




