
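Notes before reading
- Following the official documentation does restore the etcd data, but because of certificate problems the restored cluster is not usable as-is; the cluster's tokens have to be cleaned up separately.
- The official documentation does not spell out how the restored etcd cluster should be deployed. After repeated testing, restoring an etcd cluster comes down to three steps: 1. deploy etcd on a single node; 2. restore the data on that single etcd node; 3. scale etcd out to 3 nodes with the Ansible scale-up playbook.
- Updating /etc/etcd/etcd.conf to change the etcd name resolves the certificate mismatch that etcd clients hit when connecting to the server.
- Tested scripts for restoring an etcd cluster from a backup in one step are included at the end of the article.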
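An OpenShift cluster can be fully restored from a backup (see the companion article, Full-environment backup of an OpenShift cluster).
Before restoring, make sure that a complete backup of the cluster exists and that the OpenShift cluster has been reinstalled.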
Restoring Master nodes
Once the Master host files have been backed up, files that are later corrupted or accidentally deleted can be restored by copying them back to the Master host and then restarting the affected services.
Restore procedure
- Restore the /etc/origin/master/master-config.yaml file
  $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
  $ cp /etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml.old
  $ cp /backup/$(hostname)/$(date +%Y%m%d)/origin/master/master-config.yaml /etc/origin/master/master-config.yaml
  $ master-restart api
  $ master-restart controllers
  Restarting the master services can cause downtime. To avoid it, remove the host from the load balancer pool, restore it, and add it back to the pool once the restore is complete and the master services are running again.
- If the master services cannot start because some binary packages are missing, reinstall the missing packages
  - List the packages currently installed
    $ rpm -qa | sort > /tmp/current_packages.txt
  - Compare against the package list saved in the backup to find the missing packages
    $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
    > ansible-2.4.0.0-5.el7.noarch
  - Install the missing packages
    $ yum reinstall -y <packages>
- Restore the system-trusted certificates
  $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
  $ sudo cp ${MYBACKUPDIR}/external_certificates/my_company.crt /etc/pki/ca-trust/source/anchors/
  $ sudo update-ca-trust
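The package comparison step above can also be scripted so that only the missing package names are printed, without diff markers. The following is a minimal sketch, assuming both files hold one package name per line as produced by rpm -qa | sort; the function name missing_packages is hypothetical and not part of any OpenShift tooling.

```shell
#!/bin/sh
# Print the packages that appear in the backup list but not in the current list.
# Both arguments are files with one package name per line.
# The function name is an illustrative assumption.
missing_packages() {
    current_list=$1
    backup_list=$2
    # -F fixed strings, -x whole-line match, -v invert: keep the backup lines
    # that have no identical line in the current list.
    # grep exits 1 when nothing is missing, which is not an error here.
    grep -Fxv -f "$current_list" "$backup_list" || true
}
```

It could be called as missing_packages /tmp/current_packages.txt "${MYBACKUPDIR}/packages.txt", and the printed names passed to the yum command shown above.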
Restoring compute nodes
Compute nodes usually do not need to be restored, but if a particularly important node does, the procedure is similar to the one for Master nodes.
Restore procedure
- Restore the /etc/origin/node/node-config.yaml file
  $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
  $ cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.old
  $ cp /backup/$(hostname)/$(date +%Y%m%d)/etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml
  $ reboot
- If the node service cannot start because some binary packages are missing, reinstall the missing packages
  - List the packages currently installed
    $ rpm -qa | sort > /tmp/current_packages.txt
  - Compare against the package list saved in the backup to find the missing packages
    $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
    > ansible-2.4.0.0-5.el7.noarch
  - Install the missing packages
    $ yum reinstall -y <packages>
- Restore the system-trusted certificates
  $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
  $ sudo cp ${MYBACKUPDIR}/external_certificates/my_company.crt /etc/pki/ca-trust/source/anchors/
  $ sudo update-ca-trust
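The configuration restore above follows a simple pattern: keep the live file as a .old copy, then copy the backed-up file over it. A small helper makes that pattern harder to get wrong. This is only a sketch; the name restore_file is hypothetical, and the paths are whatever the caller passes in.

```shell
#!/bin/sh
# Copy a backed-up file over a live one, keeping the live copy as <dest>.old.
# The function name is an illustrative assumption.
restore_file() {
    src=$1
    dest=$2
    if [ ! -f "$src" ]; then
        echo "backup file not found: $src" >&2
        return 1
    fi
    # Keep the current file (if any) so the change can be rolled back.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.old" || return 1
    fi
    cp -p "$src" "$dest"
}
```

For the node configuration this might be invoked as restore_file "${MYBACKUPDIR}/etc/origin/node/node-config.yaml" /etc/origin/node/node-config.yaml. Because the helper refuses to run when the backup file is missing, a wrong date in MYBACKUPDIR leaves the live file untouched.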
Restoring etcd data
Restore procedure
- Restore the etcd configuration file
  Replace the cluster's current etcd configuration file with the one from the backup, then restart the service or the static pod.
  $ ssh master-0
  $ cp /backup/yesterday/master-0-files/etcd.conf /etc/etcd/etcd.conf
  $ restorecon -Rv /etc/etcd/etcd.conf
  $ systemctl restart etcd.service
- Restore the etcd data
  - Restoring from etcd v2 and v3 data
    This procedure must restore the data on a single host first; the remaining hosts are then added by scaling out.
    - Stop the etcd pod by moving the pod YAML files out of the static pod directory
      $ mkdir -p /etc/origin/node/pods-stopped
      $ mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
      $ reboot
    - Clear the previous data, in one of two ways
      - Move the current data aside as a backup
        $ mv /var/lib/etcd /var/lib/etcd.old
        $ mkdir /var/lib/etcd
        $ restorecon -Rv /var/lib/etcd/
      - Or delete the current data outright
        $ rm -rf /var/lib/etcd
    - On all etcd nodes, restore the data as follows
      $ cp -R /backup/etcd-xxx/* /var/lib/etcd/
      $ mv /var/lib/etcd/db /var/lib/etcd/member/snap/db
      $ chcon -R --reference /backup/etcd-xxx/* /var/lib/etcd/
    - On every etcd host, force the creation of a new etcd cluster
      $ mkdir -p /etc/systemd/system/etcd.service.d/
      $ echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
      $ echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
      $ sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
          /usr/lib/systemd/system/etcd.service \
          >> /etc/systemd/system/etcd.service.d/temp.conf
      $ systemctl daemon-reload
      $ master-restart etcd
    - Check the logs for errors
      $ master-logs etcd etcd
    - Check the health of the etcd cluster
      # etcdctl2 cluster-health
      member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
      cluster is healthy
    - Start etcd again with the cluster's default configuration
      $ rm -f /etc/systemd/system/etcd.service.d/temp.conf
      $ systemctl daemon-reload
      $ master-restart etcd
    - Check the etcd status and the member list
      $ etcdctl2 cluster-health
      member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
      cluster is healthy
      $ etcdctl2 member list
      5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
    Once the first instance is running, the remaining etcd servers can be restored.
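The temporary systemd drop-in used above to force a new cluster is built from three pieces: a [Service] header, an empty ExecStart= that clears the inherited command, and the unit's own ExecStart line with --force-new-cluster appended. The sketch below wraps the same echo and sed commands in a function with both paths as parameters, so it can be tried against a copy of the unit file before touching /etc/systemd. The function name is hypothetical.

```shell
#!/bin/sh
# Write a temporary drop-in that re-runs the unit's ExecStart line with
# --force-new-cluster appended. The function name is an illustrative assumption.
write_force_new_cluster_dropin() {
    unit_file=$1      # e.g. /usr/lib/systemd/system/etcd.service
    dropin_dir=$2     # e.g. /etc/systemd/system/etcd.service.d
    mkdir -p "$dropin_dir" || return 1
    {
        echo "[Service]"
        # An empty ExecStart= clears the command inherited from the unit file.
        echo "ExecStart="
        # Insert the flag just before the closing quote of the ExecStart line.
        sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' "$unit_file"
    } > "$dropin_dir/temp.conf"
}
```

Note that the sed expression only matches when the ExecStart line ends with a double quote, as in the command above. If the resulting temp.conf has only two lines, the unit file uses a different ExecStart layout and the drop-in should not be used as written.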
    Fixing the PEERURL parameter
    After the data is restored, the new etcd cluster's peer URL is localhost rather than an IP address, so it has to be changed to the IP address.
    - Run etcdctl member list to get the member ID
      $ etcdctl member list
    - Find the IP address that etcd uses for peer communication
      $ ss -l4n | grep 2380
    - Update the peer address of that member
      $ etcdctl2 member update 5ee217d17301 https://192.168.55.8:2380
      Updated member with ID 5ee217d17301 in cluster
    - Check the member list again to verify the new peer address
      $ etcdctl2 member list
      5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
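To see which members still need the peer URL fix, the member list output can be filtered for peerURLs that point at localhost. The following is a rough sketch that assumes the v2 output format shown above (ID followed by a colon, then key=value fields); the function name is hypothetical.

```shell
#!/bin/sh
# Read `etcdctl2 member list` output on stdin and print the ID of every
# member whose peerURLs field still points at localhost.
# The function name is an illustrative assumption.
members_with_localhost_peer() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^peerURLs=/ && $i ~ /localhost/) {
                    id = $1
                    sub(/:$/, "", id)   # drop the colon that follows the member ID
                    print id
                }
            }
        }'
}
```

Used as etcdctl2 member list | members_with_localhost_peer, each printed ID is a candidate for the member update command shown above. An empty result means no peer URL is left on localhost.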
  - Restoring from a v3 snapshot
    A snapshot taken with etcdctl snapshot save carries a hash that etcdctl snapshot restore verifies. A db file copied straight out of the data directory has no such hash, so restoring from it requires --skip-hash-check.
    This procedure must restore the data on a single host first; the remaining hosts are then added by scaling out.
    - Stop the etcd pod by moving the pod YAML files out of the static pod directory
      $ mkdir -p /etc/origin/node/pods-stopped
      $ mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
      $ reboot
    - Clear the previous data
      $ rm -rf /var/lib/etcd
    - Restore the data with the snapshot restore command
      # etcdctl3 snapshot restore /backup/etcd-xxxxxx/backup.db \
          --data-dir /var/lib/etcd \
          --name master-0.example.com \
          --initial-cluster "master-0.example.com=https://192.168.55.8:2380" \
          --initial-cluster-token "etcd-cluster-1" \
          --initial-advertise-peer-urls https://192.168.55.8:2380 \
          --skip-hash-check=true
      2017-10-03 08:55:32.440779 I | mvcc: restore compact to 1041269
      2017-10-03 08:55:32.468244 I | etcdserver/membership: added member 40bef1f6c79b3163 [https://192.168.55.8:2380] to cluster 26841ebcf610583c
      The values for these options come from /etc/etcd/etcd.conf.
    - Set the SELinux context on the restored files and directories
      $ restorecon -Rv /var/lib/etcd/
    - Start the etcd service
      $ systemctl start etcd
    - Check the logs for errors
      $ master-logs etcd etcd
  - Restoring etcd running as a static pod
    - Stop the etcd pod by moving its YAML file out of the static pod directory
      $ mv /etc/origin/node/pods/etcd.yaml .
    - Clear the previous data
      $ rm -rf /var/lib/etcd
    - Restore the cluster data from the snapshot
      $ export ETCDCTL_API=3
      $ etcdctl snapshot restore /etc/etcd/backup/etcd/snapshot.db \
          --data-dir /var/lib/etcd/ \
          --name ip-172-18-3-48.ec2.internal \
          --initial-cluster "ip-172-18-3-48.ec2.internal=https://172.18.3.48:2380" \
          --initial-cluster-token "etcd-cluster-1" \
          --initial-advertise-peer-urls https://172.18.3.48:2380 \
          --skip-hash-check=true
      The values for these options come from the $/backup_files/etcd.conf file.
    - Set the SELinux context on the restored files and directories
      $ restorecon -Rv /var/lib/etcd/
    - Restart etcd by moving the etcd pod YAML file back into the static pod directory
      $ mv etcd.yaml /etc/origin/node/pods/.
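Both snapshot procedures above take their --name, --initial-cluster and related values from etcd.conf. Reading them by hand is error-prone because the ETCD_INITIAL_CLUSTER value itself contains an equals sign. The helper below is a minimal sketch for pulling a single value out of the file; the function name is hypothetical, and it assumes plain KEY=value lines, optionally wrapped in double quotes.

```shell
#!/bin/sh
# Print the value of one KEY=value setting from an etcd.conf style file.
# The function name is an illustrative assumption.
etcd_conf_value() {
    key=$1
    conf_file=$2
    # Use the last uncommented KEY=... line, keep everything after the first
    # '=', and strip one pair of surrounding double quotes if present.
    sed -n "s/^${key}=//p" "$conf_file" | tail -n 1 | sed 's/^"\(.*\)"$/\1/'
}
```

For example, etcd_conf_value ETCD_INITIAL_CLUSTER /etc/etcd/etcd.conf prints the whole name=URL pair rather than cutting it at the second equals sign, and the trailing = in the match keeps ETCD_INITIAL_CLUSTER from also matching ETCD_INITIAL_CLUSTER_TOKEN.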
Adding etcd nodes with Ansible
Once the etcd data has been restored, etcd can be scaled out either with Ansible or manually.
Procedure
- Add a [new_etcd] group to the inventory hosts file
  [OSEv3:children]
  masters
  nodes
  etcd
  new_etcd
  ... [OUTPUT ABBREVIATED] ...
  [etcd]
  master-0.example.com
  master-1.example.com
  master-2.example.com
  [new_etcd]
  etcd0.example.com
- Run the Ansible scale-up playbook
  $ cd /usr/share/ansible/openshift-ansible
  $ ansible-playbook playbooks/openshift-etcd/scaleup.yml
- Move the hosts from the [new_etcd] group into the [etcd] group
  [OSEv3:children]
  masters
  nodes
  etcd
  new_etcd
  ... [OUTPUT ABBREVIATED] ...
  [etcd]
  master-0.example.com
  master-1.example.com
  master-2.example.com
  etcd0.example.com
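Before running the scale-up playbook, and again after moving hosts between groups, it can help to print exactly which hosts an inventory group contains. The following sketch does that for a plain INI inventory like the one above. The function name is hypothetical, and it assumes each group header sits on its own line with no trailing text.

```shell
#!/bin/sh
# Print the host names listed under one [group] of an INI-style inventory.
# The function name is an illustrative assumption.
inventory_group_hosts() {
    group=$1
    inventory=$2
    awk -v want="[$group]" '
        # Any section header switches the flag on or off.
        /^\[/ { in_group = ($0 == want); next }
        # Inside the wanted group, print the first field of each host line.
        in_group && NF && $0 !~ /^#/ { print $1 }
    ' "$inventory"
}
```

Running inventory_group_hosts new_etcd against the inventory file before the playbook should list only the hosts to be added. After the last step above it should print nothing, and inventory_group_hosts etcd should include the new host.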
Restoring services on the OpenShift cluster nodes
Restore procedure
- On every Master node, restore the configuration files and restart the related services
  $ cp ${MYBACKUPDIR}/etc/origin/node/pods/* /etc/origin/node/pods/
  $ cp ${MYBACKUPDIR}/etc/origin/master/master.env /etc/origin/master/master.env
  $ cp ${MYBACKUPDIR}/etc/origin/master/master-config.yaml.<timestamp> /etc/origin/master/master-config.yaml
  $ cp ${MYBACKUPDIR}/etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
  $ cp ${MYBACKUPDIR}/etc/origin/master/scheduler.json.<timestamp> /etc/origin/master/scheduler.json
  $ master-restart api
  $ master-restart controllers
- On every Node, restore the configuration file and restart the node service
  $ cp /etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
  $ systemctl enable atomic-openshift-node
  $ systemctl start atomic-openshift-node
Restoring a Project
To restore a project, create the project first and then restore its objects with oc create -f. Pay attention to dependencies between objects: for example, if a pod depends on a configmap, the configmap has to be created first.
Restore procedure
$ oc new-project <projectname>
$ oc create -f project.yaml
$ oc create -f secret.yaml
$ oc create -f serviceaccount.yaml
$ oc create -f pvc.yaml
$ oc create -f rolebindings.yaml
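Because the order of the oc create calls matters, it can be convenient to let a small helper emit the manifest files in a fixed order and skip the ones a particular backup does not contain. The sketch below assumes the file names used above (project.yaml, secret.yaml, and so on) all sit in one directory; the function name is hypothetical.

```shell
#!/bin/sh
# Print the manifest files that exist in a directory, in the order used above:
# project, secrets, service accounts, PVCs, role bindings.
# The function name is an illustrative assumption.
ordered_manifests() {
    dir=$1
    for name in project secret serviceaccount pvc rolebindings; do
        if [ -f "$dir/$name.yaml" ]; then
            printf '%s\n' "$dir/$name.yaml"
        fi
    done
}
```

A restore loop could then read ordered_manifests ./backup | while read -r f; do oc create -f "$f"; done. Objects with further dependencies, such as a configmap that a pod needs, would be added to the name list ahead of the objects that depend on them.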
Restoring application data
As with backing up application data, oc rsync can be used to restore it.
The following example restores a Jenkins application from its backup data.
Restore procedure
- Check the backup data
  $ ls -la /tmp/jenkins-backup/
  total 8
  drwxrwxr-x.  3 user user   20 Sep  6 11:14 .
  drwxrwxrwt. 17 root root 4096 Sep  6 11:16 ..
  drwxrwsrwx. 12 user user 4096 Sep  6 11:14 jenkins
- Restore the application data with oc rsync
  $ oc rsync /tmp/jenkins-backup/jenkins jenkins-1-37nux:/var/lib
- Restart the application
  $ oc delete pod jenkins-1-37nux
  Alternatively, restart the application with oc scale by scaling the pod count to 0 and then back to 1
  $ oc scale --replicas=0 dc/jenkins
  $ oc scale --replicas=1 dc/jenkins
Restoring persistent volume data
If the application already has a new PV mounted, delete the existing data on that PV and then copy the backup data into the corresponding directory. If the application has no PV mounted, mount one first and then restore the data.
Restore procedure
- If no PV is mounted, add a new volume mount
  $ oc set volume dc/demo --add --name=persistent-volume \
      --type=persistentVolumeClaim --claim-name=filestore \
      --mount-path=/opt/app-root/src/uploaded --overwrite
- Delete the data currently under the PV mount directory
  $ oc rsh demo-2-fxx6d
  sh-4.2$ ls /opt/app-root/src/uploaded/
  lost+found ocp_sop.txt
  sh-4.2$ rm -rf /opt/app-root/src/uploaded/ocp_sop.txt
  sh-4.2$ ls /opt/app-root/src/uploaded/
  lost+found
- Copy the previously backed-up data into the corresponding directory
  $ oc rsync uploaded demo-2-fxx6d:/opt/app-root/src/
- Verify the application data
  $ oc rsh demo-2-fxx6d
  sh-4.2$ ls /opt/app-root/src/uploaded/
  lost+found ocp_sop.txt
Hands-on walkthrough
- Deploy an OpenShift cluster with 3 Masters, 1 etcd node, and 2 Nodes
- Restore the etcd data by following "Restoring from a v3 snapshot" under "Restoring etcd data"
  If the pod fails to start at this point, run the following command
  $ echo "ETCD_FORCE_NEW_CLUSTER=true" >> /etc/etcd/etcd.conf
  Then restart the pod. Once etcd is running normally, remove the ETCD_FORCE_NEW_CLUSTER=true line that was just added from /etc/etcd/etcd.conf.
- Follow the steps in "Adding etcd nodes with Ansible" to scale the single etcd node out to 3 etcd nodes
- Clean up the tokens in the restored OpenShift cluster and restart the affected pods to complete the full restore of the OpenShift cluster.
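Adding ETCD_FORCE_NEW_CLUSTER=true with echo and removing it again by hand is easy to get wrong, for instance by appending the line twice or forgetting to remove it. The two helpers below sketch an idempotent way to toggle the setting. The function names are hypothetical, and the configuration file path is passed in rather than hard-coded.

```shell
#!/bin/sh
# Add ETCD_FORCE_NEW_CLUSTER=true to a config file unless it is already there.
# Both function names are illustrative assumptions.
enable_force_new_cluster() {
    conf_file=$1
    grep -qx 'ETCD_FORCE_NEW_CLUSTER=true' "$conf_file" ||
        echo 'ETCD_FORCE_NEW_CLUSTER=true' >> "$conf_file"
}

# Remove every ETCD_FORCE_NEW_CLUSTER=true line from the config file.
disable_force_new_cluster() {
    conf_file=$1
    tmp_file=$(mktemp) || return 1
    # grep exits 1 when no other lines remain; only a real error (2) aborts.
    grep -vx 'ETCD_FORCE_NEW_CLUSTER=true' "$conf_file" > "$tmp_file" ||
        [ $? -eq 1 ] || { rm -f "$tmp_file"; return 1; }
    # Write back in place so the file keeps its owner, mode and SELinux context.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}
```

In the walkthrough above this would be enable_force_new_cluster /etc/etcd/etcd.conf before restarting the pod, and disable_force_new_cluster /etc/etcd/etcd.conf once etcd is healthy.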
One-step restore and certificate-fix scripts
One-step etcd restore
[root@master01 ~]# cat restore_etcd.sh
#!/bin/bash
# Usage: restore_etcd.sh <snapshot file>
if [ $# -lt 1 ]
then
    echo "Please input snapshot file path"
    exit 2
fi
snapshot_file_dir=$1
export ETCD_POD_MANIFEST="/etc/origin/node/pods/etcd.yaml"
# Stop the etcd static pod by moving its manifest out of the pods directory
mv "${ETCD_POD_MANIFEST}" .
rm -rf /var/lib/etcd
## Read the etcd initialization settings
ETCD_CONFIG_FILE="/etc/etcd/etcd.conf"
etcd_data_dir=$(grep ^ETCD_DATA_DIR= $ETCD_CONFIG_FILE|cut -d= -f2)
etcd_name=$(grep ^ETCD_NAME= $ETCD_CONFIG_FILE|cut -d= -f2)
# The value itself contains '=', so split on the key instead of using cut
etcd_initial_cluster=$(grep ^ETCD_INITIAL_CLUSTER= $ETCD_CONFIG_FILE|awk -F'ETCD_INITIAL_CLUSTER=' '{print $2}')
etcd_initial_cluster_token=$(grep ^ETCD_INITIAL_CLUSTER_TOKEN= $ETCD_CONFIG_FILE|cut -d= -f2)
etcd_initial_advertise_peer_urls=$(grep ^ETCD_INITIAL_ADVERTISE_PEER_URLS= $ETCD_CONFIG_FILE|cut -d= -f2)
## Restore the etcd data
export ETCDCTL_API=3
etcdctl snapshot restore "$snapshot_file_dir" --data-dir "$etcd_data_dir" --name "$etcd_name" --initial-cluster "$etcd_initial_cluster" --initial-cluster-token "$etcd_initial_cluster_token" --initial-advertise-peer-urls "$etcd_initial_advertise_peer_urls" --skip-hash-check=true
restorecon -Rv /var/lib/etcd
# Move the manifest back so that the etcd static pod starts again
mv etcd.yaml "$ETCD_POD_MANIFEST"
One-step etcd data cleanup to fix the certificate problem
[root@master01 ~]# cat reset.sh
#!/bin/bash
oc login -u system:admin
# Every project except kube-system
projects=$(oc get projects | awk '{print $1}' | grep -v kube-system|grep -v NAME)
for project in $projects
do
    # Delete the token secrets in the project
    oc delete secret $(oc get secret -n $project | grep token | awk '{print $1}') -n $project
    # Force-delete every pod in the project so that it is recreated
    oc delete pod $(oc get pod -n $project | grep -v NAME | awk '{print $1}') -n $project --force --grace-period=0
done
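The grep token filter in the script above matches any line that contains the word token, including secrets whose name merely happens to contain it. A narrower filter is sketched below. It assumes the default oc get secret column layout (NAME, TYPE, DATA, AGE) and selects on the TYPE column; the function name is hypothetical.

```shell
#!/bin/sh
# Read `oc get secret` output on stdin and print only the names of
# service account token secrets, selected by the TYPE column.
# The function name is an illustrative assumption.
token_secret_names() {
    awk '$2 == "kubernetes.io/service-account-token" { print $1 }'
}
```

Inside the loop this could replace the grep and awk pair, as in oc get secret -n $project | token_secret_names. Whether user-created secrets with token in their name should also be deleted depends on the cluster, so this is a variation on the tested script rather than a replacement for it.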