The official explanation
As mentioned, the implementation up to version 3.3.3 has not included epoch variables acceptedEpoch and currentEpoch. This omission has generated problems [5] (issue ZOOKEEPER-335 in Apache's issue tracking system) in a production version and was noticed by many ZooKeeper clients. The origin of this problem is at the beginning of Recovery Phase (Algorithm 4, line 2), when the leader increments its epoch (contained in lastZxid) even before acquiring a quorum of successfully connected followers (such a leader is called a false leader). Since a follower goes back to FLE if its epoch is larger than the leader's epoch (line 25), when a false leader drops leadership and becomes a follower of a leader from a previous epoch, it finds a smaller epoch (line 25) and goes back to FLE. This behavior can loop, switching from Recovery Phase to FLE.
Consequently, using lastZxid to store the epoch number, there is no distinction between a tried epoch and a joined epoch in the implementation. Those are the respective purposes of acceptedEpoch and currentEpoch, hence omitting them causes such problems. These variables have been properly inserted in recent (unstable) ZooKeeper versions to fix the problems mentioned above.
In other words, older versions did not distinguish acceptedEpoch from currentEpoch; the epoch was extracted directly from the upper 32 bits of the zxid. This causes a problem. Suppose there are three servers s1, s2, s3: s1 and s2 are in contact, with s1 as leader, and s3 is LOOKING:
- s2 restarts and, combined with s3's vote, elects s3 as leader
- s3 considers itself leader and increments its epoch, but cannot reach the other servers. At this point s1 still considers itself leader (why is discussed below).
- s2 cannot reach s3, receives s1's LEADING message, and rejoins s1's old ensemble
- s3 cannot reach anyone, gives up leadership, returns to FLE, then receives a message from old leader s1 and rejoins the old ensemble as a follower
- s3, now a follower, finds its own epoch is larger than the old leader's, and so goes back to FLE again

After that, s3 keeps oscillating between the last two steps, looping between the FLE phase and the Recovery phase.
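The broken follower-side check can be sketched in a few lines. This is my own model of the pre-3.3.3 behavior, not ZooKeeper code; names like goesBackToFLE and zxidWithEpoch are made up for illustration:

```java
public class EpochLoopSketch {
    // Pre-3.3.3: the epoch lives in the upper 32 bits of lastZxid.
    static long epochOf(long zxid) { return zxid >>> 32; }
    static long zxidWithEpoch(long epoch) { return epoch << 32; }

    // Follower-side check from Recovery Phase: abandon the leader and go
    // back to FLE if our own epoch is larger than the leader's.
    static boolean goesBackToFLE(long ownZxid, long leaderZxid) {
        return epochOf(ownZxid) > epochOf(leaderZxid);
    }

    public static void main(String[] args) {
        long s1 = zxidWithEpoch(1);     // old leader s1, epoch 1
        long s3 = zxidWithEpoch(2);     // false leader s3 bumped its epoch
                                        // before gathering a quorum
        // s3 fails to get a quorum and tries to follow s1 again:
        System.out.println(goesBackToFLE(s3, s1));  // true -> endless loop
    }
}
```

Because the bump is baked into lastZxid, there is no way for s3 to "forget" the tried epoch when it falls back to following s1; the separate acceptedEpoch/currentEpoch variables fix exactly this.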
Everything else makes sense, but one key question remains. Isn't a leader supposed to return to FLE when it cannot reach a quorum of followers? So when follower s2 of the old ensemble restarted, why did s1 still consider itself LEADER?
Experiments show the leader does not give up leadership immediately
To verify, I ran an experiment on a 4-server ensemble with 3 servers started, s3 as leader and s1, s2 as followers, and added a log line at the start of the election algorithm. After quickly stopping and restarting s2, s3 did not enter election mode; it simply re-accepted s2.
How the leader maintains leadership
I believe it is based on a heartbeat mechanism with a period of self.tickTime / 2.
Let's look at the source of Leader::lead; I have added some comments (translated here) with my own understanding:
while (true) {
synchronized (this) {
long start = Time.currentElapsedTime();
long cur = start;
long end = start + self.tickTime / 2;
while (cur < end) {
// I believe end - cur is the heartbeat period: each time it elapses, the
// leader re-checks the ensemble's connections to decide whether to keep
// its leadership
// Also, while debugging I saw self.tickTime == 200000, i.e. 200 seconds
wait(end - cur);
cur = Time.currentElapsedTime();
}
if (!tickSkip) {
self.tick.incrementAndGet();
}
// We use an instance of SyncedLearnerTracker to
// track synced learners to make sure we still have a
// quorum of current (and potentially next pending) view.
SyncedLearnerTracker syncedAckSet = new SyncedLearnerTracker();
syncedAckSet.addQuorumVerifier(self.getQuorumVerifier());
if (self.getLastSeenQuorumVerifier() != null
&& self.getLastSeenQuorumVerifier().getVersion() > self
.getQuorumVerifier().getVersion()) {
syncedAckSet.addQuorumVerifier(self
.getLastSeenQuorumVerifier());
}
syncedAckSet.addAck(self.getId());
// check whether each follower is still connected (synced)
for (LearnerHandler f : getLearners()) {
if (f.synced()) {
syncedAckSet.addAck(f.getSid());
}
}
// check leader running status
if (!this.isRunning()) {
// set shutdown flag
shutdownMessage = "Unexpected internal error";
break;
}
if (!tickSkip && !syncedAckSet.hasAllQuorums()) {
// Lost quorum of last committed and/or last proposed
// config, set shutdown flag
// if a quorum of connections is no longer present, stop maintaining leadership
shutdownMessage = "Not sufficient followers synced, only synced with sids: [ "
+ syncedAckSet.ackSetsToString() + " ]";
break;
}
tickSkip = !tickSkip;
}
// by the logic above, the code below runs twice per tickTime
// send a ping to every follower
for (LearnerHandler f : getLearners()) {
f.ping();
}
}
if (shutdownMessage != null) {
shutdown(shutdownMessage);
// the leader then goes back into the LOOKING state
}
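The tickSkip toggle above gives the loop its cadence: it wakes every tickTime / 2, pings after every wake, but only runs the quorum check on alternate wakes, i.e. once per full tickTime. A tiny sketch of just that cadence (my simplification; tickSkip starts as true in the versions I have read):

```java
import java.util.ArrayList;
import java.util.List;

public class TickCadenceSketch {
    // Returns the wakes (1-based) on which the quorum check runs, out of
    // the first n wakes of the lead loop.
    static List<Integer> checkingWakes(int n) {
        List<Integer> checks = new ArrayList<>();
        boolean tickSkip = true;    // initial value, per my reading
        for (int wake = 1; wake <= n; wake++) {
            if (!tickSkip) {
                checks.add(wake);   // quorum check happens on this wake
            }
            tickSkip = !tickSkip;   // toggled once per iteration
            // (a ping goes out to every follower after every wake)
        }
        return checks;
    }

    public static void main(String[] args) {
        System.out.println(checkingWakes(6));   // [2, 4, 6]
    }
}
```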
Key fragments
In the while (true) loop of Leader::lead, the leader checks whether each LearnerHandler thread is synced:
for (LearnerHandler f : getLearners()) {
if (f.synced()) {
syncedAckSet.addAck(f.getSid());
}
}
...
if (!tickSkip && !syncedAckSet.hasAllQuorums()) {
// Lost quorum of last committed and/or last proposed
// config, set shutdown flag
shutdownMessage = "Not sufficient followers synced, only synced with sids: [ "
+ syncedAckSet.ackSetsToString() + " ]";
break;
}
Leader::getLearners returns a copy of the learners field:
public List<LearnerHandler> getLearners() {
synchronized (learners) {
return new ArrayList<LearnerHandler>(learners);
}
}
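Returning a copy here is the defensive-copy idiom: the lead loop can iterate a stable snapshot while other threads add or remove handlers concurrently. A small illustration of why that matters (my own example, not ZooKeeper code):

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotIterationSketch {
    // Removes "h2" while iterating over a snapshot of the list; mutating
    // the list being iterated directly would throw
    // ConcurrentModificationException.
    static List<String> dropWhileIterating(List<String> learners) {
        for (String h : new ArrayList<>(learners)) {  // iterate a copy
            if (h.equals("h2")) {
                learners.remove(h);  // safe: we only mutate the original
            }
        }
        return learners;
    }

    public static void main(String[] args) {
        List<String> learners = new ArrayList<>(List.of("h1", "h2", "h3"));
        System.out.println(dropWhileIterating(learners));  // [h1, h3]
    }
}
```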
LearnerHandler::synced mainly checks whether the thread is alive (and that its ack deadline has not passed):
public boolean synced() {
return isAlive()
&& leader.self.tick.get() <= tickOfNextAckDeadline;
}
From this we can conclude that whether SyncedLearnerTracker syncedAckSet judges the quorum to hold depends mainly on whether each LearnerHandler thread in Leader.learners is alive.
The health of the ensemble therefore comes down to when a LearnerHandler thread exits, and when elements are added to or removed from Leader.learners.
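The tick-versus-deadline half of synced() can be modeled in a few lines. This is my own sketch; the syncLimit value and the exact rule for advancing tickOfNextAckDeadline are my assumptions about how the deadline is maintained:

```java
public class SyncDeadlineSketch {
    long tick = 0;                  // stand-in for leader.self.tick
    long tickOfNextAckDeadline = 0;
    final long syncLimit = 5;       // hypothetical config value

    // Follower acked in time: push its deadline forward.
    void onAck() {
        tickOfNextAckDeadline = tick + syncLimit;
    }

    // Mirrors the tick check in LearnerHandler::synced.
    boolean synced() {
        return tick <= tickOfNextAckDeadline;
    }

    public static void main(String[] args) {
        SyncDeadlineSketch f = new SyncDeadlineSketch();
        f.onAck();                       // ack at tick 0 -> deadline 5
        f.tick = 5;
        System.out.println(f.synced()); // true: deadline not yet passed
        f.tick = 6;                      // no further ack arrived
        System.out.println(f.synced()); // false: follower drops out
    }
}
```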
How a LearnerHandler is started
In LearnerCnxAcceptor::run, the LearnerCnxAcceptor thread keeps accepting new socket connections and starts a LearnerHandler with each socket as an argument.
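The pattern is a classic accept loop: one thread blocks on accept() and spawns a per-connection handler thread. A minimal, self-contained sketch of that pattern (my own; the real LearnerCnxAcceptor does much more validation and error handling):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class AcceptorSketch implements Runnable {
    final AtomicInteger accepted = new AtomicInteger(); // for observability
    private final ServerSocket ss;

    AcceptorSketch(ServerSocket ss) { this.ss = ss; }

    @Override
    public void run() {
        try {
            while (true) {
                Socket s = ss.accept();               // a follower connects
                accepted.incrementAndGet();
                new Thread(() -> handle(s)).start();  // stand-in for
            }                                         // starting a LearnerHandler
        } catch (IOException e) {
            // acceptor socket closed: the leader is shutting down
        }
    }

    private void handle(Socket s) {
        // the per-follower protocol loop would live here
    }
}
```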
How a LearnerHandler is added to Leader.learners
@Override
public void run() {
try {
leader.addLearnerHandler(this);
...
void addLearnerHandler(LearnerHandler learner) {
synchronized (learners) {
learners.add(learner);
}
}
private final HashSet<LearnerHandler> learners =
new HashSet<LearnerHandler>();
So a LearnerHandler adds itself to Leader.learners as soon as it starts.
(Reading the source, LearnerHandler does not override hashCode or equals. I find this a bit sloppy: when a follower restarts, there can briefly be two LearnerHandler objects representing that follower, although one of them should get shut down.)
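The consequence of the missing equals/hashCode is easy to demonstrate with a plain HashSet (my illustration, not ZooKeeper code; Handler is a made-up stand-in):

```java
import java.util.HashSet;

public class DuplicateHandlerSketch {
    static class Handler {            // stand-in for LearnerHandler
        final long sid;
        Handler(long sid) { this.sid = sid; }
        // no equals/hashCode override: identity semantics, as in the
        // real LearnerHandler
    }

    public static void main(String[] args) {
        HashSet<Handler> learners = new HashSet<>();
        learners.add(new Handler(2));  // handler for server 2
        learners.add(new Handler(2));  // new handler after server 2 restarts
        System.out.println(learners.size());  // 2: both objects are kept
    }
}
```

With identity semantics, deduplication never happens; the set only shrinks when a handler's own shutdown removes it.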
How a LearnerHandler is removed from Leader.learners
LearnerHandler::shutdown
public void shutdown() {
...
this.interrupt();
leader.removeLearnerHandler(this);
}
This method is called from two places:
LearnerHandler::run
@Override
public void run() {
try {
leader.addLearnerHandler(this);
...
}
} catch (IOException e) {
...
} finally {
LOG.warn("******* GOODBYE "
+ (sock != null ? sock.getRemoteSocketAddress() : "<null>")
+ " ********");
shutdown();
}
}
LearnerHandler::ping
public void ping() {
// If learner hasn't sync properly yet, don't send ping packet
// otherwise, the learner will crash
if (!sendingThreadStarted) {
return;
}
long id;
if (syncLimitCheck.check(System.nanoTime())) {
synchronized(leader) {
id = leader.lastProposed;
}
QuorumPacket ping = new QuorumPacket(Leader.PING, id, null, null);
queuePacket(ping);
} else {
LOG.warn("Closing connection to peer due to transaction timeout.");
shutdown();
}
}
This shows two things:
- If the connection drops, the LearnerHandler thread itself detects the closed socket and removes itself from learners.
- Alternatively, when the Leader calls ping and finds the sync deadline exceeded, it shuts that LearnerHandler down.

In my own follower disconnect/reconnect experiment, it was the first shutdown path that fired. In other words, as a rule, once the connection drops, the corresponding LearnerHandler goes away with it.
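The syncLimitCheck used by ping can be pictured like this. This is only my reconstruction of the idea; the real SyncLimitCheck tracks individual outstanding proposals rather than a single timestamp:

```java
public class SyncLimitCheckSketch {
    private long oldestUnackedNanos = -1;  // -1: nothing outstanding
    private final long limitNanos;

    SyncLimitCheckSketch(long limitNanos) { this.limitNanos = limitNanos; }

    // A proposal was sent to this follower and is awaiting its ack.
    void proposalSent(long nowNanos) {
        if (oldestUnackedNanos == -1) oldestUnackedNanos = nowNanos;
    }

    // The follower caught up on everything outstanding.
    void allAcked() { oldestUnackedNanos = -1; }

    // Fails once the oldest unacked proposal has been outstanding
    // longer than the limit -> ping() calls shutdown().
    boolean check(long nowNanos) {
        return oldestUnackedNanos == -1
            || nowNanos - oldestUnackedNanos < limitNanos;
    }
}
```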
Summary of the leadership mechanism
From the above we can conclude:
- A LearnerHandler is added to Leader.learners when it is created and removed when its socket closes, so a live LearnerHandler represents one connected follower.
- Every self.tickTime, the leader checks whether the number of live LearnerHandlers reaches a quorum (if (!tickSkip && !syncedAckSet.hasAllQuorums())); if not, it gives up leadership.
- Every self.tickTime / 2, the leader pings all followers, which can itself cause a LearnerHandler to be shut down.
- Debugging showed self.tickTime to be 200 seconds here; it should be set by quorumPeer.setTickTime(config.getTickTime()) in QuorumPeerMain::runFromConfig.

In short, the leader does maintain leadership with a heartbeat strategy, and at each 200-second check it merely needs a quorum of live followers (counting itself).
- This answers the opening question "when follower s2 of the old ensemble restarted, why did s1 still consider itself LEADER?": the reconnect happened quickly, so s1 never gave up its leadership.
- It also raises a new question: if the network flaps such that a quorum is present exactly at each 200-second check but absent the rest of the time, does the system still work correctly? I believe the requirement that a write gather a quorum of ACKs before commit covers this: whenever flapping prevents a quorum of ACKs, writes simply block.
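For completeness, with the default majority verifier the quorum test behind hasAllQuorums reduces to a strict-majority count (my simplification of the QuorumMaj logic):

```java
public class MajoritySketch {
    // acked includes the leader itself, matching
    // syncedAckSet.addAck(self.getId()) in Leader::lead.
    static boolean hasQuorum(int acked, int ensembleSize) {
        return acked > ensembleSize / 2;  // strict majority
    }

    public static void main(String[] args) {
        // the 4-server experiment above, with 3 servers up:
        System.out.println(hasQuorum(3, 4));  // true: leader keeps leading
        // had s2's absence been noticed at a check:
        System.out.println(hasQuorum(2, 4));  // false: leader steps down
    }
}
```

This also explains why the experiment's 4-server ensemble tolerated one missing follower only as long as the leader plus two followers stayed synced.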