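Full-Environment Restore of an OpenShift Cluster

Agile Automation

Notes Before Reading

  • Following the official documentation restores the etcd data, but because of certificate problems the restored cluster does not work properly; the cluster's tokens have to be cleaned up separately.
  • The official documentation does not spell out how to deploy the etcd cluster that is being restored. Repeated testing confirmed that restoring an etcd cluster takes three steps: 1. deploy etcd on a single node; 2. restore the data on that single etcd node; 3. scale etcd out to 3 nodes with Ansible.
  • You can change the etcd name by updating /etc/etcd/etcd.conf, which resolves the certificate mismatch that occurs when etcd clients access the server.
  • A tested script that restores the etcd cluster from a backup in one step is attached at the end of the article.

An OpenShift cluster can be fully restored from a backup. See: Full-Environment Backup of an OpenShift Cluster

Before restoring the cluster, make sure you have taken a complete backup of it and have reinstalled the OpenShift cluster.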

Restoring Master Nodes

Once the files on the master hosts have been backed up, a file that is corrupted or accidentally deleted can be recovered by copying it back to the master host and then restarting the affected services.

Procedure

  1. Restore the /etc/origin/master/master-config.yaml file

    $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
    $ cp /etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml.old
    $ cp /backup/$(hostname)/$(date +%Y%m%d)/origin/master/master-config.yaml /etc/origin/master/master-config.yaml
    $ master-restart api
    $ master-restart controllers
    

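    Restarting the master services can cause downtime. To avoid it, remove the host from the load-balancer pool, restore it, and add it back to the pool once the restore is complete and the master services are running again.

  2. If the master services cannot start because some packages are missing, reinstall the missing packages

    • List the currently installed packages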

      $ rpm -qa | sort > /tmp/current_packages.txt
      
    • Compare against the package list saved in the backup to find the missing packages

      $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
      > ansible-2.4.0.0-5.el7.noarch
      
    • Install the missing packages

      $ yum reinstall -y <packages>
      
  3. Restore the certificates trusted by the system

    $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
    $ sudo cp ${MYBACKUPDIR}/external_certificates/my_company.crt /etc/pki/ca-trust/source/anchors/
    $ sudo update-ca-trust
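
The package comparison in step 2 can also be done with comm, which prints only the entries that exist in the backup list but not on the host. The following is a minimal sketch, not part of the official procedure; missing_packages is a hypothetical helper name, and it assumes both lists contain one package per line, as produced by rpm -qa.

```shell
#!/bin/bash
# Sketch: list packages recorded in the backup that are missing from this host.
# missing_packages is a hypothetical helper, not an OpenShift or rpm command.
missing_packages() {
    local current_list=$1 backup_list=$2 tmpdir
    tmpdir=$(mktemp -d) || return 1
    # comm requires sorted input, so sort both lists first
    sort "$current_list" > "$tmpdir/current"
    sort "$backup_list" > "$tmpdir/backup"
    # -13 suppresses lines unique to the first file and lines common to both,
    # leaving only the lines that appear in the backup list alone
    comm -13 "$tmpdir/current" "$tmpdir/backup"
    rm -rf "$tmpdir"
}

# Example use on a real host (commented out so the sketch has no side effects):
# rpm -qa > /tmp/current_packages.txt
# missing_packages /tmp/current_packages.txt "${MYBACKUPDIR}/packages.txt"
```

The output can be passed to yum once it has been reviewed; unlike raw diff output, it carries no "<" or ">" markers that would have to be stripped first.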
    

Restoring Compute Nodes

Compute nodes generally do not need to be restored, but if a particularly important node does, the procedure is similar to the one for master nodes.

Procedure

  1. Restore the /etc/origin/node/node-config.yaml file

    $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
    $ cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.old
    $ cp /backup/$(hostname)/$(date +%Y%m%d)/etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml
    $ reboot
    
    
  2. If the node service cannot start because some packages are missing, reinstall the missing packages

    • List the currently installed packages

      $ rpm -qa | sort > /tmp/current_packages.txt
      
    • Compare against the package list saved in the backup to find the missing packages

      $ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
      > ansible-2.4.0.0-5.el7.noarch
      
    • Install the missing packages

      $ yum reinstall -y <packages>
      
  3. Restore the certificates trusted by the system

    $ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
    $ sudo cp ${MYBACKUPDIR}/external_certificates/my_company.crt /etc/pki/ca-trust/source/anchors/
    $ sudo update-ca-trust
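
Both the master and the node procedure follow the same pattern for configuration files: keep the current file as a .old copy, then overwrite it with the copy from the backup. A minimal sketch of that pattern is shown below. restore_file is a hypothetical helper name introduced only for illustration, and refusing to touch the target when the backup copy is missing is an added safety assumption, not something the original steps do.

```shell
#!/bin/bash
# Sketch: restore one file from a backup while keeping the current version.
# restore_file is a hypothetical helper used only for illustration.
restore_file() {
    local backup_copy=$1 target=$2
    # Do nothing if the backup copy is missing, so the target is never lost
    if [ ! -f "$backup_copy" ]; then
        echo "backup not found: $backup_copy" >&2
        return 1
    fi
    # Keep the current file as <target>.old before overwriting it
    if [ -f "$target" ]; then
        cp -p "$target" "$target.old" || return 1
    fi
    cp -p "$backup_copy" "$target"
}

# Example use on a node (commented out so the sketch has no side effects):
# restore_file "${MYBACKUPDIR}/etc/origin/node/node-config.yaml" \
#              /etc/origin/node/node-config.yaml
```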
    

Restoring etcd Data

Procedure

  • Restore the etcd configuration file

    Replace the current cluster's etcd configuration file with the one from the backup, then restart the service or the static pod.

    $ ssh master-0
    $ cp /backup/yesterday/master-0-files/etcd.conf /etc/etcd/etcd.conf
    $ restorecon -Rv /etc/etcd/etcd.conf
    $ systemctl restart etcd.service
    
  • Restore the etcd data

    • Restoring from etcd v2 and v3 data

      This procedure must restore the data on a single host first; the remaining hosts are then added by scaling out.

      1. Stop the etcd pod by moving the pod YAML files out of the static pod directory

        $ mkdir -p /etc/origin/node/pods-stopped
        $ mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
        $ reboot
        
      2. Clear the previous data

        • Either move the current data aside as a backup

          $ mv /var/lib/etcd /var/lib/etcd.old
          $ mkdir /var/lib/etcd
          $ restorecon -Rv /var/lib/etcd/
          
          
        • Or delete the current data outright

          $ rm -rf /var/lib/etcd
          
          
      3. Restore the data by running the following on every etcd node

        $ cp -R /backup/etcd-xxx/* /var/lib/etcd/
        $ mv /var/lib/etcd/db /var/lib/etcd/member/snap/db
        $ chcon -R --reference /backup/etcd-xxx/* /var/lib/etcd/
        
        
      4. On each etcd host, run the following to force the creation of a new etcd cluster

        $ mkdir -p /etc/systemd/system/etcd.service.d/
        $ echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
        $ echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
        $ sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
            /usr/lib/systemd/system/etcd.service \
            >> /etc/systemd/system/etcd.service.d/temp.conf
        
        $ systemctl daemon-reload
        $ master-restart etcd
        
        
      5. Check the logs for errors

        $ master-logs etcd etcd
        
        
      6. Check the health of the etcd cluster

        # etcdctl2 cluster-health
        member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
        cluster is healthy
        
        
      7. Start etcd again with the cluster's default configuration

        $ rm -f /etc/systemd/system/etcd.service.d/temp.conf
        $ systemctl daemon-reload
        $ master-restart etcd
        
        
      8. Check the etcd status and view the member list

        $ etcdctl2 cluster-health
        member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
        cluster is healthy
        
        $ etcdctl2 member list
        5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
        
        
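      9. Once the first instance is running, the remaining etcd servers can be restored

      Fixing the peerURLs parameter

      After the data is restored, the new etcd cluster's peerURLs parameter points to localhost instead of the IP address, so it has to be changed to the IP address.

      1. Run etcdctl member list to get the member ID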

        $ etcdctl member list
        
        
      2. Get the IP address etcd uses for peer communication

        $ ss -l4n | grep 2380
        
        
      3. Update the peer URL of that member

        $ etcdctl2 member update 5ee217d17301 https://192.168.55.8:2380
        Updated member with ID 5ee217d17301 in cluster
        
        
      4. Verify by checking the new peer URL

        $ etcdctl2 member list
        5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
        
    • Restoring from a v3 snapshot

      If the snapshot was taken with etcdctl snapshot save, etcdctl snapshot restore verifies the hash of the data during the restore. A snapshot copied directly out of the data directory has no hash to verify, so the restore needs the --skip-hash-check option.

      This procedure must restore the data on a single host first; the remaining hosts are then added by scaling out.

      1. Stop the etcd pod by moving the pod YAML files out of the static pod directory

        $ mkdir -p /etc/origin/node/pods-stopped
        $ mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
        $ reboot
        
      2. Clear the previous data

        $ rm -rf /var/lib/etcd
        
      3. Restore the data with the snapshot restore command

        # etcdctl3 snapshot restore /backup/etcd-xxxxxx/backup.db \
          --data-dir /var/lib/etcd \
          --name master-0.example.com \
          --initial-cluster "master-0.example.com=https://192.168.55.8:2380" \
          --initial-cluster-token "etcd-cluster-1" \
          --initial-advertise-peer-urls https://192.168.55.8:2380 \
          --skip-hash-check=true
        
        2017-10-03 08:55:32.440779 I | mvcc: restore compact to 1041269
        2017-10-03 08:55:32.468244 I | etcdserver/membership: added member 40bef1f6c79b3163 [https://192.168.55.8:2380] to cluster 26841ebcf610583c
        

        The values for these options come from /etc/etcd/etcd.conf

      4. Set the SELinux context on the restored files and directories

        $ restorecon -Rv /var/lib/etcd/
        
      5. Start the etcd service

        $ systemctl start etcd
        
      6. Check the logs for errors

        $ master-logs etcd etcd
        
    • Restoring etcd that runs as a static pod

      1. Stop the etcd pod by moving the pod YAML file out of the static pod directory

        $ mv /etc/origin/node/pods/etcd.yaml .
        
      2. Clear the previous data

        $ rm -rf /var/lib/etcd
        
      3. Restore the cluster data from the snapshot

        $ export ETCDCTL_API=3
        $ etcdctl snapshot restore /etc/etcd/backup/etcd/snapshot.db \
          --data-dir /var/lib/etcd/ \
          --name ip-172-18-3-48.ec2.internal \
          --initial-cluster "ip-172-18-3-48.ec2.internal=https://172.18.3.48:2380" \
          --initial-cluster-token "etcd-cluster-1" \
          --initial-advertise-peer-urls https://172.18.3.48:2380 \
          --skip-hash-check=true
        

        The values for these options come from the $/backup_files/etcd.conf file

      4. Set the SELinux context on the restored files and directories

        $ restorecon -Rv /var/lib/etcd/
        
      5. Restart etcd by moving the etcd pod YAML file back into the static pod directory

        $ mv etcd.yaml /etc/origin/node/pods/.
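
After a single-host restore, the member list may still advertise a localhost peer URL, as described under "Fixing the peerURLs parameter" above. The check can be scripted; the following is a minimal sketch that assumes the member list output has the same layout as the sample lines in this article (ID followed by a colon, then name=, peerURLs=, clientURLs=). localhost_peers is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the IDs of etcd members whose peerURLs still point at localhost.
# Reads `etcdctl2 member list` output on stdin; localhost_peers is a
# hypothetical helper, and the line layout is assumed from the samples above.
localhost_peers() {
    awk '$0 ~ "peerURLs=[^ ]*//localhost" {
        id = $1
        sub(/:$/, "", id)   # drop the trailing colon after the member ID
        print id
    }'
}

# Example use on an etcd host (commented out so the sketch has no side effects):
# etcdctl2 member list | localhost_peers
```

Each ID printed is a candidate for the etcdctl2 member update command shown earlier; an empty result suggests no peer URL needs fixing.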
        

Adding etcd Nodes with Ansible

After the etcd data has been restored, etcd can be scaled out either with Ansible or manually.

Procedure

  1. Add a [new_etcd] group to the hosts inventory file

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd 
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    
    [new_etcd] 
    etcd0.example.com 
    
  2. Run the Ansible scale-up playbook

    $ cd /usr/share/ansible/openshift-ansible
    $ ansible-playbook playbooks/openshift-etcd/scaleup.yml
    
  3. Move the hosts from the [new_etcd] group to the [etcd] group

    [OSEv3:children]
    masters
    nodes
    etcd
    new_etcd
    
    ... [OUTPUT ABBREVIATED] ...
    
    [etcd]
    master-0.example.com
    master-1.example.com
    master-2.example.com
    etcd0.example.com
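
Before running the scale-up playbook it can help to confirm which hosts the inventory actually lists under [new_etcd], and afterwards that they have been moved to [etcd]. The following is a minimal sketch for an INI-style inventory like the one above; inventory_group_hosts is a hypothetical helper name, and it assumes the host name is the first field on each line.

```shell
#!/bin/bash
# Sketch: print the hosts listed under one group of an INI-style inventory.
# inventory_group_hosts is a hypothetical helper, not part of openshift-ansible.
inventory_group_hosts() {
    local inventory=$1 group=$2
    awk -v want="[$group]" '
        /^\[/ { in_group = ($1 == want); next }    # a header opens or closes the group
        in_group && NF && $1 !~ /^#/ { print $1 }  # first field is the host name
    ' "$inventory"
}

# Example use (the inventory path is an assumption; adjust to your environment):
# inventory_group_hosts /etc/ansible/hosts new_etcd
```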
    

Restoring Services on the OpenShift Cluster Nodes

Procedure

  1. On each master node, restore the configuration files and restart the affected services

    $ cp ${MYBACKUPDIR}/etc/origin/node/pods/* /etc/origin/node/pods/
    $ cp ${MYBACKUPDIR}/etc/origin/master/master.env /etc/origin/master/master.env
    $ cp ${MYBACKUPDIR}/etc/origin/master/master-config.yaml.<timestamp> /etc/origin/master/master-config.yaml
    $ cp ${MYBACKUPDIR}/etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    $ cp ${MYBACKUPDIR}/etc/origin/master/scheduler.json.<timestamp> /etc/origin/master/scheduler.json
    $ master-restart api
    $ master-restart controllers
    
  2. On each node, restore the configuration file and restart the node service (origin-node, or atomic-openshift-node as in the commands below)

    $ cp /etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    $ systemctl enable atomic-openshift-node
    $ systemctl start atomic-openshift-node
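
The commands above refer to files named with a <timestamp> suffix. When several such copies exist, the newest one has to be picked. The following is a minimal sketch under two assumptions that the article does not state: the suffix starts with a digit, and the timestamps sort lexicographically (for example YYYYMMDDHHMMSS). latest_backup is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the newest "<name>.<timestamp>" file in a directory.
# Assumes the timestamp starts with a digit and sorts lexicographically;
# latest_backup is a hypothetical helper used only for illustration.
latest_backup() {
    local dir=$1 name=$2 newest="" candidate
    for candidate in "$dir/$name".[0-9]*; do
        [ -e "$candidate" ] || continue   # the pattern matched nothing
        # The shell expands the pattern in sorted order, so the last match wins
        newest=$candidate
    done
    # Fails (non-zero status) when no timestamped copy exists
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example use (commented out so the sketch has no side effects):
# cp "$(latest_backup "${MYBACKUPDIR}/etc/origin/master" master-config.yaml)" \
#    /etc/origin/master/master-config.yaml
```

Restricting the pattern to a leading digit keeps files such as master-config.yaml.old from being mistaken for the newest backup.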
    

Restoring Projects

Before restoring a project, create the project first, then restore the objects in it with oc create -f. Pay attention to the dependencies between objects when restoring: for example, if a pod depends on a configmap, the configmap has to be created first.

Procedure

$ oc new-project <projectname>
$ oc create -f project.yaml
$ oc create -f secret.yaml
$ oc create -f serviceaccount.yaml
$ oc create -f pvc.yaml
$ oc create -f rolebindings.yaml
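
The sequence above can be wrapped in a small function so that the order stays fixed and missing backup files are reported rather than failing silently. This is only a sketch: restore_project and the OC variable are hypothetical names, the file names are the ones from the listing above, and setting OC=echo is an assumed convention for printing the commands instead of running them.

```shell
#!/bin/bash
# Sketch: recreate a project and its objects in the order used above.
# restore_project and OC are hypothetical names; set OC=echo for a dry run.
OC=${OC:-oc}

restore_project() {
    local project=$1 backup_dir=$2 kind failed=0
    "$OC" new-project "$project" || return 1
    # Same order as the listing above
    for kind in project secret serviceaccount pvc rolebindings; do
        if [ -f "$backup_dir/$kind.yaml" ]; then
            "$OC" create -f "$backup_dir/$kind.yaml" -n "$project" || failed=1
        else
            echo "skipping $kind.yaml (not found in $backup_dir)" >&2
        fi
    done
    return "$failed"
}

# Example dry run (the project name and directory are placeholders):
# OC=echo; restore_project myproject /backup/projects/myproject
```

If the project also contains objects that others depend on, such as configmaps, their files would need to be added to the loop ahead of the objects that use them.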

Restoring Application Data

As with backing up application data, oc rsync can be used to restore it.

The following example restores a Jenkins application from its backup data.

Procedure

  1. Check the backup data

    $ ls -la /tmp/jenkins-backup/
    total 8
    drwxrwxr-x.  3 user     user   20 Sep  6 11:14 .
    drwxrwxrwt. 17 root     root 4096 Sep  6 11:16 ..
    drwxrwsrwx. 12 user     user 4096 Sep  6 11:14 jenkins
    
  2. Restore the application data with oc rsync

    $ oc rsync /tmp/jenkins-backup/jenkins jenkins-1-37nux:/var/lib
    
  3. Restart the application

    $ oc delete pod jenkins-1-37nux
    

    Alternatively, use oc scale to set the number of pods to 0 and then back to 1, which restarts the application

    $ oc scale --replicas=0 dc/jenkins
    $ oc scale --replicas=1 dc/jenkins
    

Restoring Persistent Volume Data

If the application already has a new PV mounted, delete the existing data on that PV and then copy the backup data into the corresponding directory. If the application has no PV mounted, mount one first and then restore the data.

Procedure

  1. If no PV is mounted, create a new volume mount

    $ oc set volume dc/demo --add --name=persistent-volume \
         --type=persistentVolumeClaim --claim-name=filestore \
         --mount-path=/opt/app-root/src/uploaded --overwrite
    
  2. Delete the data currently in the PV's mount directory

    $ oc rsh demo-2-fxx6d
    sh-4.2$ ls /opt/app-root/src/uploaded/
    lost+found  ocp_sop.txt
    sh-4.2$ rm -rf /opt/app-root/src/uploaded/ocp_sop.txt
    sh-4.2$ ls /opt/app-root/src/uploaded/
    lost+found
    
  3. Copy the previously backed-up data into the corresponding directory

    $ oc rsync uploaded demo-2-fxx6d:/opt/app-root/src/
    
  4. Verify the application data

    $ oc rsh demo-2-fxx6d
    sh-4.2$ ls /opt/app-root/src/uploaded/
    lost+found  ocp_sop.txt
    

Hands-On Walkthrough

  1. Deploy an OpenShift cluster with 3 masters, 1 etcd node, and 2 nodes
  2. Restore the etcd data by following "Restoring from a v3 snapshot" in the "Restoring etcd Data" section
    If the pod fails to start at this point, run the following command

    $ echo "ETCD_FORCE_NEW_CLUSTER=true" >> /etc/etcd/etcd.conf

    Then restart the pod. Once etcd is running normally, remove the ETCD_FORCE_NEW_CLUSTER=true line you just added from /etc/etcd/etcd.conf.

  3. Follow the steps in "Adding etcd Nodes with Ansible" to scale out from 1 etcd node to 3
  4. Clean up the tokens in the restored OpenShift cluster and restart the affected pods to complete the full restore of the cluster.
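
The temporary ETCD_FORCE_NEW_CLUSTER=true setting from step 2 is easy to add twice or to forget to remove. The following sketch toggles it in a repeatable way. The two function names are hypothetical, and the sketch assumes the setting appears on a line of its own, exactly as written by the echo command above.

```shell
#!/bin/bash
# Sketch: add and remove the temporary ETCD_FORCE_NEW_CLUSTER=true line.
# Both function names are hypothetical helpers used only for illustration.
set_force_new_cluster() {
    local conf=$1
    # Append the setting only if the exact line is not already present
    grep -qx 'ETCD_FORCE_NEW_CLUSTER=true' "$conf" \
        || echo 'ETCD_FORCE_NEW_CLUSTER=true' >> "$conf"
}

unset_force_new_cluster() {
    local conf=$1 tmp
    tmp=$(mktemp) || return 1
    # Keep every line except the exact setting
    grep -vx 'ETCD_FORCE_NEW_CLUSTER=true' "$conf" > "$tmp"
    # Write back in place so the file keeps its owner, mode and SELinux context
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example use on the etcd host (commented out so the sketch has no side effects):
# set_force_new_cluster /etc/etcd/etcd.conf     # before restarting the pod
# unset_force_new_cluster /etc/etcd/etcd.conf   # once etcd is healthy again
```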

Scripts for One-Step Restore and for Fixing the Certificate Issue

One-step etcd restore

[root@master01 ~]# cat restore_etcd.sh 
#!/bin/bash
# Usage: restore_etcd.sh <path to snapshot file>
snapshot_file_dir=$1
if [ $# -lt 1 ]
then
    echo "Please input snapshot file path"
    exit 2
fi

# Stop the etcd static pod by moving its manifest out of the pod directory
export ETCD_POD_MANIFEST="/etc/origin/node/pods/etcd.yaml"
mv "${ETCD_POD_MANIFEST}" .
rm -rf /var/lib/etcd

## Read the etcd initialization settings
ETCD_CONFIG_FILE="/etc/etcd/etcd.conf"
etcd_data_dir=$(grep ^ETCD_DATA_DIR= "$ETCD_CONFIG_FILE"|cut -d= -f2)
etcd_name=$(grep ^ETCD_NAME= "$ETCD_CONFIG_FILE"|cut -d= -f2)
# The value itself contains "=", so split on the key name instead of using cut
etcd_initial_cluster=$(grep ^ETCD_INITIAL_CLUSTER= "$ETCD_CONFIG_FILE"|awk -F'ETCD_INITIAL_CLUSTER=' '{print $2}')
etcd_initial_cluster_token=$(grep ^ETCD_INITIAL_CLUSTER_TOKEN= "$ETCD_CONFIG_FILE"|cut -d= -f2)
etcd_initial_advertise_peer_urls=$(grep ^ETCD_INITIAL_ADVERTISE_PEER_URLS= "$ETCD_CONFIG_FILE"|cut -d= -f2)

## Restore the etcd data
export ETCDCTL_API=3
etcdctl snapshot restore "$snapshot_file_dir" \
    --data-dir "$etcd_data_dir" \
    --name "$etcd_name" \
    --initial-cluster "$etcd_initial_cluster" \
    --initial-cluster-token "$etcd_initial_cluster_token" \
    --initial-advertise-peer-urls "$etcd_initial_advertise_peer_urls" \
    --skip-hash-check=true

# Reset the SELinux context on the restored data
restorecon -Rv /var/lib/etcd

# Move the manifest back so that the etcd static pod starts again
mv etcd.yaml "$ETCD_POD_MANIFEST"
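
The script above deletes /var/lib/etcd before it reads /etc/etcd/etcd.conf, so a missing setting is only noticed after the old data is gone. One possible safeguard, shown here as a sketch rather than as part of the tested script, is to check the configuration first. missing_etcd_keys is a hypothetical helper name; the key names are the ones the script reads.

```shell
#!/bin/bash
# Sketch: report which settings needed by restore_etcd.sh are absent or empty.
# missing_etcd_keys is a hypothetical helper used only for illustration.
missing_etcd_keys() {
    local conf=$1 key
    for key in ETCD_DATA_DIR ETCD_NAME ETCD_INITIAL_CLUSTER \
               ETCD_INITIAL_CLUSTER_TOKEN ETCD_INITIAL_ADVERTISE_PEER_URLS; do
        # A key counts as present only if it is followed by a non-empty value
        grep -q "^${key}=." "$conf" || echo "$key"
    done
}

# Example use near the top of the script, before anything is deleted
# (commented out so the sketch has no side effects):
# missing=$(missing_etcd_keys /etc/etcd/etcd.conf)
# [ -z "$missing" ] || { echo "etcd.conf is missing: $missing" >&2; exit 1; }
```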

One-step cleanup of etcd data to fix the certificate issue

[root@master01 ~]# cat reset.sh 
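#!/bin/bash
# Log in as the cluster administrator
oc login -u system:admin

# All projects except kube-system (the header row is dropped)
projects=$(oc get projects | awk '{print $1}' | grep -v kube-system|grep -v NAME)

for project in $projects
do
  # Delete the token secrets that were issued before the restore
  oc delete secret $(oc get secret -n $project | grep token | awk '{print $1}') -n $project
  # Force-delete every pod so that it is recreated and picks up new tokens
  oc delete pod $(oc get pod -n $project | grep -v NAME | awk '{print $1}') -n $project --force --grace-period=0
done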
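
The script selects secrets with grep token, which matches any line containing the word "token", including a user-created secret that merely has it in its name. A narrower selection is sketched below. It assumes the default oc get secret output with the columns NAME, TYPE, DATA and AGE, and it keeps only secrets of type kubernetes.io/service-account-token. token_secrets is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: select only service-account token secrets by their TYPE column.
# Reads `oc get secret` output on stdin; token_secrets is a hypothetical helper.
token_secrets() {
    # NR > 1 skips the header row; column 2 is the secret type
    awk 'NR > 1 && $2 == "kubernetes.io/service-account-token" { print $1 }'
}

# Example use inside the loop above (commented out so the sketch has no side effects):
# oc get secret -n "$project" | token_secrets | xargs -r oc delete secret -n "$project"
```

With xargs -r (a GNU extension), nothing is executed when a project has no matching secrets, which avoids calling oc delete secret without a name.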

References

OpenShift official documentation: restoring the cluster
