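Concepts
- Chassis cluster: cluster-id (a chassis cluster can contain many redundancy groups)
- Node: node id
- Redundancy group
- Three factors decide which node is primary for a redundancy group: the priority configured for each node, the node ID (when priorities tie, the node with the lowest ID wins), and the order in which the nodes come up. If the lower-priority node comes up first, it becomes primary for the redundancy group, and it stays primary unless preempt is enabled.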
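The three factors above can be sketched as a small Python model. This is only an illustration of the rule as described in these notes, not Junos code: the names Node and elect_primary are hypothetical, and the sketch ignores interface/IP monitoring, timers, and manual failover.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    node_id: int   # 0 or 1
    priority: int  # configured priority; a higher value wins


def elect_primary(up_nodes: List[Node],
                  current_primary: Optional[Node] = None,
                  preempt: bool = False) -> Optional[Node]:
    """Pick the primary for one redundancy group (simplified model)."""
    if not up_nodes:
        return None
    # Order of appearance: a primary that is already up keeps the role
    # unless preempt is enabled.
    if current_primary in up_nodes and not preempt:
        return current_primary
    # Otherwise: highest priority wins, ties go to the lowest node ID.
    return min(up_nodes, key=lambda n: (-n.priority, n.node_id))


node0 = Node(node_id=0, priority=100)
node1 = Node(node_id=1, priority=200)

first_up = elect_primary([node0])                      # node0 came up alone
kept = elect_primary([node0, node1], current_primary=first_up)
preempted = elect_primary([node0, node1], current_primary=first_up,
                          preempt=True)
print(first_up.node_id, kept.node_id, preempted.node_id)
```

In this model node0 stays primary after the higher-priority node1 joins, and only loses the role when preempt is set, which matches the behaviour described in the last bullet.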
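Summary
After a device joins a cluster, it becomes a node of that cluster. Apart from unique node settings and management IP addresses, nodes in the same cluster share the same configuration.
Chassis cluster overview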
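Control plane
- Synchronizes configuration and kernel state between the nodes
- The nodes are connected through the control port (note which port serves as the control port)
- Runs in active/backup mode: the two nodes back each other up, one as primary and one as secondary; if the primary fails, the secondary takes over traffic processing
Data plane
- The nodes are connected through the fabric ports to form a unified data plane (note which port serves as the fabric port)
- Synchronizes session information for traffic flowing through each node, so that established sessions are not dropped on failover
- The data plane software runs in active/active mode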
Cluster node states
- hold
- primary
- secondary-hold
- secondary
- ineligible
- disabled
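Things to note before configuring a chassis cluster
- Hardware and software must be identical on both nodes
- Set the root-authentication password on each node first, and use the same password on both
- The management and control ports must have no configuration; otherwise the cluster may fail to form because the control port is in use and the nodes cannot communicate. If you are not sure which ports these are, restore the factory defaults first (in configuration mode: load factory-default), then check for remaining interface statements ( run show configuration | display set | match interface ); if any are left, remove them with the delete command
The error below is caused by not setting the password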
root# commit
[edit]
'system'
Missing mandatory statement: 'root-authentication'
error: commit failed: (missing statements)
The message says root authentication is not configured: on first login to Junos the root password is empty. Set the root password and commit again:
[edit]
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
[edit]
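Configuration steps
- First go through "Things to note before configuring a chassis cluster" above and make sure every item is satisfied
1. Set root-authentication
No host name needs to be set at this step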
- srx-a
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
- srx-b
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
2. Set up the chassis cluster
Run the following command in operational (CLI) mode on each node. Note that the node argument differs between the two
srx-a
root> set chassis cluster cluster-id 1 node 0 reboot
srx-b
root> set chassis cluster cluster-id 1 node 1 reboot
Verify the cluster
- show chassis cluster status
Abnormal case
If the management or control port is in use, or the nodes cannot communicate for some other reason, you will see the following.
- Node srx-a
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 0 lost n/a n/a n/a
{primary:node0}
- Node srx-b likewise cannot see the peer's state
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 0 lost n/a n/a n/a
node1 1 primary no no None
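Check the status of both nodes. If the peer shows as lost like this, it is usually because the control port is in use and the two nodes cannot synchronize; deleting the configuration on the control port resolves it.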
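When this check has to be repeated across several captures, the status table can be scanned with a short script. The following Python is a minimal sketch that assumes the output has the row layout shown in the transcripts above (node, priority, status, preempt, manual, monitor-failures); parse_cluster_status and lost_nodes are hypothetical helper names, and other Junos releases may format the output differently.

```python
import re

# One status row: node name, priority, status, preempt, manual, failures.
ROW = re.compile(r"^(node\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)$")
GROUP = re.compile(r"^Redundancy group:\s*(\d+)")


def parse_cluster_status(text):
    """Return {redundancy_group: {node_name: row_dict}} from CLI output."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = GROUP.match(line)
        if m:
            current = int(m.group(1))
            groups[current] = {}
            continue
        m = ROW.match(line)
        if m and current is not None:
            name, prio, status, preempt, manual, failures = m.groups()
            groups[current][name] = {
                "priority": int(prio),
                "status": status,
                "preempt": preempt,
                "manual": manual,
                "monitor_failures": failures.strip(),
            }
    return groups


def lost_nodes(groups):
    """Names of nodes reported as lost in any redundancy group."""
    return sorted({name
                   for nodes in groups.values()
                   for name, row in nodes.items()
                   if row["status"] == "lost"})


SAMPLE = """\
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 0 lost n/a n/a n/a
"""

print(lost_nodes(parse_cluster_status(SAMPLE)))
```

The sample text is copied from the srx-a transcript above; a non-empty result is the cue to inspect the control port configuration.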
Normal case
- Node node0
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 1 secondary no no None
{primary:node0}
- Node node1
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 1 primary no no None
node1 1 secondary no no None
{secondary:node1}
Testing takeover
- Reboot the primary; the backup takes over immediately. After node0 finishes rebooting it does not automatically switch back to primary: it goes to "hold" and then becomes "secondary"
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 hold no no None
node1 1 primary no no None
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 secondary no no None
node1 1 primary no no None
Manual failover
- Command in operational (CLI) mode
request chassis cluster failover requires you to specify which redundancy-group to fail over and which node becomes the new primary
root> request chassis cluster failover ?
Possible completions:
node Node identifier of the new primary (0..1)
redundancy-group Redundancy-group identifier (0..63)
reset Undo the previous failover command
root> request chassis cluster failover node 0 redundancy-group 0
node0:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 0
{secondary:node0}
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 255 primary no yes None
node1 1 secondary-hold no yes None
Note that node0's priority is now 255. The command below resets it to 1. (In general, after a failover, reset restores the originally configured primary/secondary priorities.)
root> request chassis cluster failover reset redundancy-group 0
node0:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 0
node1:
--------------------------------------------------------------------------
No reset required for redundancy group 0.
{primary:node0}
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 1 secondary no no None
{primary:node0}
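How to remove the cluster configuration
There are two ways, both in operational (CLI) mode.
- set chassis cluster disable reboot (disables clustering directly)
- set chassis cluster cluster-id 0 node 1 reboot (a cluster-id of 0 also disables clustering; the node only needs to reboot)
Problems encountered during configuration
Problem 1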
root@srx-a# set groups node0 system host-name srx-A
{primary:node0}[edit]
root@srx-a# set groups node1 system host-name srx-B
{primary:node0}[edit]
root@srx-a# commit
[edit interfaces]
'ge-0/0/0'
HA management port cannot be configured
error: configuration check-out failed
{primary:node0}[edit]
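Solution:
When clustering is enabled, ge-0/0/0 becomes fxp0 (the management interface) and ge-0/0/1 becomes fxp1 (the control link), so ge-0/0/0 can no longer be configured as an ordinary interface. Delete its configuration and commit again.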
https://kb.juniper.net/InfoCenter/index?page=content&id=KB15356
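To catch this before committing, a saved "display set" dump of the configuration can be scanned for statements that still reference the repurposed ports. The sketch below is Python and makes two assumptions: the port mapping is the one quoted above (it varies by SRX model), and reserved_interface_lines is a hypothetical helper name, not a Junos tool.

```python
import re

# Assumption: mapping quoted in the solution above; platform-specific.
RESERVED = {
    "ge-0/0/0": "fxp0 (management)",
    "ge-0/0/1": "fxp1 (control link)",
}


def reserved_interface_lines(set_lines, reserved=RESERVED):
    """Return (line, interface, role) for lines naming a reserved port."""
    hits = []
    for line in set_lines:
        for ifname, role in reserved.items():
            # Whole-token match, with an optional logical unit such as ".0",
            # so that ge-0/0/10 is not mistaken for ge-0/0/1.
            pattern = rf"(?<![\w/-]){re.escape(ifname)}(?:\.\d+)?(?![\w/])"
            if re.search(pattern, line):
                hits.append((line.strip(), ifname, role))
    return hits


config = [
    "set interfaces ge-0/0/0 unit 0 family inet address 192.0.2.1/24",
    "set security zones security-zone untrust interfaces ge-0/0/0.0",
    "set interfaces ge-0/0/10 unit 0",
    "set system host-name srx-a",
]

for line, ifname, role in reserved_interface_lines(config):
    print(f"{ifname} is reserved as {role}: {line}")
```

The addresses and host name in the sample list are placeholders. Each reported line is a candidate for the delete command before the next commit.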
Problem 2
root@srx-a# commit
[edit security zones security-zone untrust]
'interfaces ge-0/0/0.0'
Interface ge-0/0/0.0 must be configured under interfaces
error: configuration check-out failed
The cause is that interface ge-0/0/0.0 does not exist under interfaces. Look at show interfaces and check whether unit 0 of ge-0/0/0 is configured there.
