【Cause】At 22:31:04 we were notified that the cluster was in an abnormal state and could not process scheduled tasks. Remote login and inspection showed that the HDFS NameNodes had entered a split-brain state.
NameNode hosts: name21.hadoop, name22.hadoop
Hadoop version: 2.7.1
【Incident Report】
1.【Time】=> 22:31:04
【HA State】=> name21.hadoop: Active, name22.hadoop: Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/gc.log-xxxxxxxxxx
At this point the Active NameNode hit an abnormal JVM GC pause. Normally a collection finishes in under 2s; this one took 32.46s:
2018-02-11T22:31:04.707+0800: 801030.875: [GC2018-02-11T22:31:04.707+0800: 801030.875: [ParNew (promotion failed): 1550355K->1544189K(1710528K), 0.8542820 secs]
2018-02-11T22:31:05.562+0800: 801031.729: [CMS: 12100861K->3636381K(13303808K), 31.6119450 secs] 13650016K->3636381K(15014336K), [CMS Perm : 55251K->55008K(131072K)], 32.4666280 secs] [Times: user=33.04 sys=0.00, **real=32.46 secs**]
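Pauses like the one above can be spotted mechanically. A minimal sketch that pulls the stop-the-world `real=` times out of a CMS gc.log and flags any pause longer than 20s (the log path is an assumption; 20s matches the default JournalNode write timeout that mattered in this incident):

```shell
# Extract 'real=NN.NN secs' pause times from a CMS gc.log and print
# any pause longer than 20 seconds.
# /var/log/hadoop/hdfs/gc.log is an assumed path; adjust per node.
grep -oE 'real=[0-9]+\.[0-9]+ secs' /var/log/hadoop/hdfs/gc.log \
  | awk -F'[= ]' '$2 + 0 > 20 { print $2 }'
```

Run periodically (or from the monitoring system), this would have surfaced the 32.46s pause as soon as it happened.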
2.【Time】=> 22:31:37
【HA State】=> name21.hadoop: Active, name22.hadoop: Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-namenode-name21.hadoop.log
Because the GC pause ran so long, the Active NameNode's writes to the JournalNodes timed out and could not get a quorum response (the default write timeout is 20s; the GC pause was 32s):
2018-02-11 22:31:37,201 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485], stream=QuorumOutputStream starting at txid 3108262584)) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
2018-02-11 22:31:37,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(191)) - Detected pause in JVM or host machine (eg GC): pause of approximately 32158ms GC pool 'ParNew' had collection(s): count=1 time=855ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=31612ms
2018-02-11 22:31:37,211 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 172.21.54.22
2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs 2018-02-11 22:31:37,211 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 3108262584
2018-02-11 22:31:37,228 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
At this point the Active NameNode entered an abnormal state: the process hung and could not exit, and its RPC port 8021 was no longer listening.
3.【Time】=> 22:32:22
【HA State】=> name21.hadoop: Active, name22.hadoop: Standby
【Log】=> name21.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name21.hadoop.log
At this point the ZKFailoverController on name21.hadoop could not obtain the local NameNode's state (and so could not transition it from Active to Standby); from then on it retried every 45s. Meanwhile, its session with ZooKeeper was closed:
2018-02-11 22:32:22,267 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at name21.hadoop/xxx.xx.xx.xx:8021 entered state: SERVICE_NOT_RESPONDING
2018-02-11 22:32:22,267 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(768)) - Quitting master election for NameNode at name21.hadoop2/ xxx.xx.xx.xx:8021 and marking that fencing is necessary
2018-02-11 22:32:22,267 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(401)) - Yielding from election
2018-02-11 22:32:22,268 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(830)) - Terminating ZK connection for elector id=660690290 appData=0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e cb=Elector callbacks for NameNode at name21.hadoop2/172.21.54.21:8021
2018-02-11 22:32:22,270 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x15f5451238a08b3 closed
2018-02-11 22:32:22,271 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1066)) - Ignoring stale result from old client with sessionId 0x15f5451238a08b3
2018-02-11 22:32:22,271 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
4.【Time】=> 22:32:22
【HA State】=> name21.hadoop: Active, name22.hadoop: Standby
【Log】=> name22.hadoop:/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-name22.hadoop.log
At the same time, name22.hadoop established a session with ZooKeeper. Its ZKFailoverController tried to connect to the NameNode on name21.hadoop to transition it from Active to Standby, and the connection timed out:
2018-02-11 22:32:22,286 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(931)) - Checking for any old active which needs to be fenced...
2018-02-11 22:32:22,290 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:fenceOldActive(952)) - Old node exists: 0a06636261734e4112036e6e311a0e6e616d6532312e6861646f6f703220d53e28d33e
2018-02-11 22:32:22,294 INFO ha.ZKFailoverController (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at name21.hadoop/ xxx.xx.xx.xx:8021
2018-02-11 22:32:27,304 WARN ha.FailoverController (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make NameNode at name21.hadoop/ xxx.xx.xx.xx:8021 standby (unable to connect) java.net.SocketTimeoutException: Call From name22.hadoop/ xxx.xx.xx.xx to name21.hadoop:8021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx :56511remote=name21.hadoop/ xxx.xx.xx.xx :8021];
......
Caused by: **java.net.SocketTimeoutException: 5000 millis timeout** while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ xxx.xx.xx.xx :56511 remote=name21.hadoop/ xxx.xx.xx.xx :8021]
......
Fencing then ran. Since the cluster had no real fencing configured (the method was /bin/true, which simply returns success), fencing "succeeded" immediately, and the name22.hadoop NameNode was transitioned from Standby to Active:
2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(91)) - ====== Beginning Service Fencing Process... ======
2018-02-11 22:32:27,323 INFO ha.NodeFencer (NodeFencer.java:fence(94)) - Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2018-02-11 22:32:27,339 INFO ha.ShellCommandFencer (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' with pid 48839
2018-02-11 22:32:27,359 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2018-02-11 22:32:27,360 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(878)) - Writing znode /hadoop-ha/cbasNA/ActiveBreadCrumb to indicate that the local node is the most recent active...
2018-02-11 22:32:27,365 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 active...
2018-02-11 22:32:46,337 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned NameNode at name22.hadoop/ xxx.xx.xx.xx :8021 to active state
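The /bin/true fence above means the old Active is never actually cut off before the new one takes over. A hedged hdfs-site.xml sketch of a real fencing setup, trying sshfence first and falling back to shell(/bin/true) only if SSH itself is unreachable (key path and timeout are illustrative assumptions):

```xml
<!-- hdfs-site.xml: fencing methods are tried in order, one per line. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
```

With sshfence in place, the ZKFC on name22.hadoop would have SSHed to name21.hadoop and killed the hung NameNode process before promoting itself, avoiding the dual-active window below.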
5.【Time】=> 22:32:46
【HA State】=> name21.hadoop: Active, name22.hadoop: Active
At this point both NameNodes reported Active: the cluster was in a split-brain state.
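The dual-active state can also be detected from outside via each NameNode's HTTP JMX endpoint on port 50070 (the same port the restart script polls below). A sketch, assuming the Hadoop 2.x NameNodeStatus MBean and its JSON output format:

```shell
# Query each NameNode's HA state over HTTP JMX; each line should show
# "active" or "standby". Two "active" lines means split-brain.
for h in name21.hadoop name22.hadoop; do
  curl -s "http://$h:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" \
    | grep -o '"State" : "[a-z]*"'
done
```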
6.【Recovery】
When operations staff stepped in, they first tried restarting a NameNode, but the restart stalled at 35% and could not proceed.
Reason: they restarted the NameNode on name22.hadoop first. The restart script detected two Active NameNodes and kept polling the HA state, waiting until only one was Active:
NameNode HA states: active_namenodes = [('nn1', 'name21.hadoop:50070'), ('nn2', 'name22.hadoop:50070')], standby_namenodes = [], unknown_namenodes = []
The situation at that point:
The NameNode on name21.hadoop was hung (connections to its port 8021 timed out), so it could not transition itself from Active to Standby.
The NameNode on name22.hadoop had already been elected Active by the ZKFailoverController, precisely because name21.hadoop's port 8021 was unreachable.
Therefore, the fix was to restart the NameNode on name21.hadoop (restoring port 8021): on startup it checks the state of the name22.hadoop NameNode (via port 8021 on name22.hadoop) and switches itself from Active to Standby. name22.hadoop did not need to be restarted.
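The check-and-recover steps above map to standard hdfs haadmin and daemon commands. A sketch, assuming the NameService IDs are nn1 (name21.hadoop) and nn2 (name22.hadoop), as in the restart script's output above:

```shell
# Confirm the split-brain: both IDs report "active".
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# On name21.hadoop: restart the hung NameNode. On startup it sees
# nn2 active and comes back up as standby.
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode

# Verify: nn1 should now be standby, nn2 active.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```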
7.【Prevention】
a. Tune the GC parameters, add GC monitoring for the NameNodes, and find and fix the root cause of the abnormal GC pause.
b. Raise the JournalNode write timeout. The default is 20s; for a production cluster it is worth raising it to 60s.
c. Configure a real fencing method.
d. Monitor the HA state of every NameNode in the cluster.
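Measures (a) and (b) can be sketched as configuration fragments. All heap sizes and thresholds are assumptions to be tuned against the node's real memory; the timeout key is the one whose 20s default governed this incident:

```shell
# hadoop-env.sh sketch for (a): fix the heap size and log GC so long
# pauses are visible; values are illustrative assumptions.
export HADOOP_NAMENODE_OPTS="-Xms16g -Xmx16g \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop/hdfs/gc.log \
  ${HADOOP_NAMENODE_OPTS}"

# For (b), raise the QJM write timeout in hdfs-site.xml:
#   dfs.qjournal.write-txns.timeout.ms = 60000   (default 20000)
```

With the timeout at 60s, a 32s GC pause would no longer be fatal to the edit-log quorum, though the GC root cause still needs fixing.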