Requirements
We have two large MySQL tables, roughly 100 million and 450 million rows (large enough that the DBAs are considering sharding), and the data keeps growing. We need to process them with Spark jobs, but reading MySQL directly is a strain, so the plan is to use sqoop for scheduled incremental imports into Hive and have Spark SQL work against Hive instead, which avoids many of MySQL's bottlenecks. After several days of working out sqoop scheduled jobs, this is the setup I arrived at, using sqoop1, version sqoop-1.4.6-cdh5.7.0.
1. 創(chuàng)建并執(zhí)行sqoop job:
sqoop job --delete torderincrementjob   # delete the previous job of the same name first
sqoop job --create torderincrementjob -- import \
--connect jdbc:mysql://172.16.*.*:3306/*?useCursorFetch=true \
--username * \
--password-file /input/sqoop/pwd/109mysql.pwd \
--target-dir /mysqldb/t_order \
--table t_order \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--null-string '\\N' \
--null-non-string '\\N' \
--incremental append \
--check-column id \
--last-value 1281 \
-m 1
Where:
--password-file points to a password file stored on HDFS
--fields-terminated-by "\t" sets the field delimiter (a tab)
--lines-terminated-by "\n" sets the record delimiter (a newline)
--split-by id sets the split column (needed when more than one mapper is used; with -m 1, as here, it can be omitted)
--null-string '\\N' writes string-typed NULLs as the literal text \N, which Hive reads back as NULL
--null-non-string '\\N' does the same for non-string columns
--incremental append
--check-column id
--last-value 1281
The last three options together drive the incremental import.
創(chuàng)建完成后,執(zhí)行這個job:sqoop job -exec torderincrementjob會看到在日志里有如下select語句:
SELECT MIN(id), MAX(id) FROM t_order WHERE ( id >= '1281' AND id < '100701508' ),也就是last-value指定的id,sqoop會自己維護記錄,下次再執(zhí)行這個任務(wù),起始id就是100701508,每次都是從上次執(zhí)行的id到當(dāng)前id的區(qū)間增量,這樣就能通過crontab做定時任務(wù),定時增量導(dǎo)入到hdfs
每次執(zhí)行sqoop都會更新last-value的值,下次從新的值開始,以下是我從3次打印的日志里截取的:
Lower bound value: 1281
Upper bound value: 100701508
Lower bound value: 100701508
Upper bound value: 100703035
Lower bound value: 100703035
Upper bound value: 100704475
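The bookkeeping visible in these logs can be sketched in a few lines of Python (an illustrative model, not sqoop's actual code): each run imports the rows whose check column lies between the saved last-value and the table's current maximum, then stores that maximum as the new last-value for the next run.

```python
# Minimal sketch of sqoop append-mode bookkeeping (illustrative only).
def incremental_windows(last_value, max_ids_per_run):
    """Return one (lower, upper) bound pair per run; each run imports rows
    with check-column > lower and <= upper, then saves upper as the new
    last-value (sqoop keeps it in the saved job's metadata)."""
    windows = []
    for current_max in max_ids_per_run:
        windows.append((last_value, current_max))
        last_value = current_max
    return windows

print(incremental_windows(1281, [100701508, 100703035, 100704475]))
# [(1281, 100701508), (100701508, 100703035), (100703035, 100704475)]
```

The three pairs reproduce exactly the lower/upper bounds from the three logged runs above.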
第一次執(zhí)行完job后hdfs就有數(shù)據(jù)了,可以在命令行或者通過50070查看
2. Create an external table in Hive over the data on HDFS:
CREATE external TABLE `t_order`(
`id` bigint,
`serial` string,
`product_id` int,
`product_type` tinyint,
`product_name` string,
`quantity` double,
`buyer_id` bigint,
`payer_id` bigint,
`price` double,
`vip_price` double,
`settle_price` double,
`currency` string,
`payer_level` tinyint,
`status` tinyint,
`pay_mode` tinyint,
`payment_serial` string,
`client_type` string,
`app_type` tinyint,
`seller_id` string,
`partner_id` int,
`reference` string,
`channel_source` string,
`note` string,
`expiration_time` string,
`operator` string,
`create_time` string,
`pay_time` string,
`update_time` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION
'hdfs://golive-master:8020/mysqldb/t_order'
這時候就可以通過hive查詢hdfs上的數(shù)據(jù)了
select * from golivecms20.t_order limit 10;
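This is also why the job sets --null-string and --null-non-string to '\\N': sqoop writes SQL NULLs as the literal two-character token \N, which is what Hive's default text SerDe interprets as NULL when reading the files back. A minimal sketch of that round trip (hypothetical parser, for illustration only; Hive's real SerDe does much more):

```python
# Sketch: reading one tab-delimited line as written by sqoop.
# The literal token \N (backslash + capital N) stands for SQL NULL,
# matching Hive's default null representation for text tables.
def parse_row(line, null_token="\\N"):
    return [None if field == null_token else field
            for field in line.rstrip("\n").split("\t")]

row = parse_row("1281\torder-abc\t\\N\t9.99\n")
print(row)  # ['1281', 'order-abc', None, '9.99']
```

If the nulls were left at sqoop's default ("null" as plain text), Hive would read them back as the string "null" rather than NULL.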
3. Scheduling with crontab
Create the following three files:
timermysqltohdfs.cron   # the crontab entry
timermysqltohdfs.sh     # the shell script
timermysqltohdfs.log    # the log file
timermysqltohdfs.sh:
#!/bin/sh
current_time=$(date +%Y%m%d%H%M%S)
echo $current_time >> /data/bigdata/app/sqoopjob/timermysqltohdfs.log
echo ............................>> /data/bigdata/app/sqoopjob/timermysqltohdfs.log
# sync the t_order table
/data/bigdata/sqoop-1.4.6-cdh5.7.0/bin/sqoop job -exec torderincrementjob
# sync the t_userlogout table
/data/bigdata/sqoop-1.4.6-cdh5.7.0/bin/sqoop job -exec tuserlogoutincrementjob
timermysqltohdfs.cron (runs daily at 01:00, 07:00, 13:00, and 19:00):
00 1,7,13,19 * * * /bin/bash /data/bigdata/app/sqoopjob/timermysqltohdfs.sh >> /data/bigdata/app/sqoopjob/timermysqltohdfs.log 2>&1
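As a quick sanity check on the schedule, here is a small sketch (plain Python, no cron library) that computes the next fire time implied by the spec 00 1,7,13,19 * * *:

```python
from datetime import datetime, timedelta

# Sketch: next fire time for the crontab spec "00 1,7,13,19 * * *",
# i.e. minute 0 of hours 1, 7, 13 and 19, every day.
HOURS = (1, 7, 13, 19)

def next_run(now):
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).replace(
            minute=0, second=0, microsecond=0)
        for h in HOURS:
            candidate = day.replace(hour=h)
            if candidate > now:
                return candidate

print(next_run(datetime(2016, 5, 1, 2, 30)))   # 2016-05-01 07:00:00
print(next_run(datetime(2016, 5, 1, 20, 0)))   # 2016-05-02 01:00:00
```

Four runs a day keeps each incremental window small, so every sqoop execution only moves a few thousand ids, as the logged bounds above show.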
另一個表t_userlogout也是一樣,相關(guān)命令如下:
創(chuàng)建job:
sqoop job --create tuserlogoutincrementjob -- import \
--connect jdbc:mysql://172.16.*.*:3306/*?useCursorFetch=true \
--username * \
--password-file /input/sqoop/pwd/68mysql.pwd \
--target-dir /mysqldb/t_userlogout \
--table t_userlogout \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--null-string '\\N' \
--null-non-string '\\N' \
--incremental append \
--check-column ID \
--last-value 1 \
-m 1
首次執(zhí)行:sqoop job -exec tuserlogoutincrementjob
在hive創(chuàng)建外部表:
CREATE external TABLE `t_userlogout`(
`ID` bigint,
`GoliveId` string,
`InstalmentCode` string,
`ManufacturerCode` string,
`MacAddress` string,
`AreaCode` string,
`IpAddress` string,
`LoginTime` string,
`LogoutTime` string,
`DeviceID` string,
`VersionType` string,
`Version` string,
`Platform` string,
`PartnerID` int,
`BranchType` int,
`LicenseProviderCode` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION
'hdfs://golive-master:8020/mysqldb/t_userlogout'
后邊就是定時任務(wù),增量導(dǎo)入了
附錄:
除了指定--table導(dǎo)入表的全部字段,也可以通過--query指定sql:
--query 'select ID,GoliveId,InstalmentCode,ManufacturerCode,MacAddress,IpAddress,LoginTime,VersionType,Version,PartnerID from t_userlogout where $CONDITIONS' \
--query 'select id,serial,product_id,product_type,product_name,buyer_id,price,vip_price,settle_price,status,pay_mode,client_type,seller_id,partner_id,channel_source,expiration_time,create_time,pay_time,update_time from t_order where $CONDITIONS' \
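For context on the placeholder: sqoop substitutes $CONDITIONS at run time, first with a never-true predicate to probe the query's metadata, then, when more than one mapper is used, with one range predicate per mapper over the --split-by column. A simplified, hypothetical illustration of that rewriting (the real split logic lives inside sqoop's input format):

```python
# Illustration of how sqoop rewrites the $CONDITIONS placeholder into
# one WHERE range per mapper (simplified; not sqoop source code).
def mapper_queries(query, split_col, lo, hi, num_mappers):
    step = (hi - lo) // num_mappers
    queries = []
    for i in range(num_mappers):
        lower = lo + i * step
        if i == num_mappers - 1:
            cond = f"{split_col} >= {lower} AND {split_col} <= {hi}"
        else:
            cond = f"{split_col} >= {lower} AND {split_col} < {lower + step}"
        queries.append(query.replace("$CONDITIONS", cond))
    return queries

q = "select id, serial from t_order where $CONDITIONS"
for sql in mapper_queries(q, "id", 0, 100, 4):
    print(sql)
```

With -m 1, as in the jobs above, there is only one such range, which is why --split-by can be left out.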