Problem Description
Our department (BI) runs its day-to-day business computation on two Hadoop clusters, A and B. Cluster A hosts most of the business lines' jobs, and it has recently started failing frequently and slowing down. To set up a hot standby, we decided to mirror A's computation on B. Newly extracted data can run independently on both clusters, but there is no need to re-extract the historical data from MySQL; even if we could, it would be very time-consuming. The fastest approach is therefore to copy A's existing data over to B.
Solution
- Hadoop's built-in inter-cluster copy tool: distcp
distcp (distributed copy) is a tool for copying data at scale, both within a cluster and between clusters. It uses Map/Reduce to carry out file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to its map tasks, and each task copies a portion of the files in the source list. Because it is built on Map/Reduce, the tool has some peculiarities in both semantics and execution.
- Format
Simply specify the source and destination HDFS paths:
hadoop distcp hdfs://namenode01/user/hive/test.txt hdfs://namenode02/user/hive/test.txt
- Options
Running
hadoop distcp
with no arguments prints the following usage message:
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and append new
data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are
saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with
hftps://
-overwrite Choose to overwrite target files unconditionally,
even if they exist.
-p <arg> preserve status (rbugpcax)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR). If -p is specified
with no <arg>, then preserves replication, block
size, user, group, permission and checksum
                       type. raw.* xattrs are preserved when both the
source and destination paths are in the
/.reserved/raw hierarchy (HDFS only). raw.*
                       xattr preservation is independent of the -p flag.
Refer to the DistCp documentation for more
details.
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n
bytes
-skipcrccheck Whether to skip CRC checks between source and
target paths.
-strategy <arg> Copy strategy to use. Default is dividing work
based on file sizes
-tmp <arg> Intermediate work path to be used for atomic
commit
-update                Update target, copying only missing files or
directories
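For a recurring sync, several of these flags are commonly combined. A hypothetical invocation (the paths, map count, and bandwidth cap below are illustrative, not taken from the migration described here):

```shell
# Incremental sync: copy only files missing from the target, cap the job
# at 20 maps with 10 MB/s per map, and keep execution logs for auditing.
# All paths are hypothetical examples.
hadoop distcp \
    -update \
    -m 20 \
    -bandwidth 10 \
    -log hdfs://namenode02/tmp/distcp_logs \
    hdfs://namenode01/user/hive/warehouse/some_table \
    hdfs://namenode02/user/hive/warehouse/some_table
```

-update makes reruns cheap: a failed job can simply be restarted and it will skip files that already arrived intact.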
Example
- Other preparation
For data that already has a Hive table on the source cluster, obtain the table DDL with the following command, then execute that DDL on the target cluster:
show create table xxx
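One way to move the schema over, sketched with the pay table from the script below (the .ddl filename is just a convention):

```shell
# On a client of the source cluster: dump the table DDL to a file
hive -e "show create table mbd.function_pay" > function_pay.ddl
# Copy the file to a client of the target cluster (e.g. with scp),
# then replay the DDL there:
hive -f function_pay.ddl
```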
- Example script
The script runs on a client machine of the target cluster, which is why the target HDFS path can be written without a namenode prefix. The copy.sh script:
#!/bin/sh
# Usage: sh copy.sh [date] <mysql_table>
if [ $# -eq 1 ]; then
    # No date given: default to yesterday's partition
    DT=$(date -d "-1 day" +"%Y-%m-%d")
    mysqlTable=$1
else
    DT=$(date -d "$1" +"%Y-%m-%d")
    mysqlTable=$2
fi
table_name=function_$mysqlTable
# Copy one day's increment at a time: a per-day copy limits the damage if a
# run fails part-way, and makes per-day data validation straightforward
src_path=hdfs://namenode01/user/hive/warehouse/$table_name/dt=$DT
dest_path=/user/hive/warehouse/$table_name/dt=$DT
hadoop distcp -D mapred.job.queue.name=compute_daily $src_path $dest_path
# Register the new partition on the target cluster's table
hive -e "
use mbd;
alter table $table_name add partition(dt='$DT');
"
- Invoking the script
sh copy.sh 2017-08-18 pay
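The invocation above resolves to one concrete source partition path. Reproducing the script's variable expansion locally (no cluster needed) is a quick sanity check of the argument handling:

```shell
#!/bin/sh
# Mirror copy.sh's path construction for the call: sh copy.sh 2017-08-18 pay
DT=$(date -d "2017-08-18" +"%Y-%m-%d")
mysqlTable=pay
table_name=function_$mysqlTable
src_path=hdfs://namenode01/user/hive/warehouse/$table_name/dt=$DT
echo "$src_path"
# prints hdfs://namenode01/user/hive/warehouse/function_pay/dt=2017-08-18
```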
- Data validation
After the script runs, validate the copy: run the same SQL on both clusters, comparing per-day distinct counts of several fields. Matching results on A and B give good confidence that the data copied to the target cluster is correct.
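A sketch of one such check, assuming a hypothetical uid field; the identical command is run on a client of each cluster and the two outputs compared:

```shell
# Per-day validation query (uid is a placeholder for any field worth checking)
DT=2017-08-18
hive -e "
use mbd;
select count(*), count(distinct uid)
from function_pay
where dt='$DT';
"
# Run once on cluster A's client and once on cluster B's client;
# the two result rows should be identical for every copied day.
```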