Introduction to Sqoop
Sqoop is an open-source tool for transferring data between Hadoop (and Hive) and traditional relational databases (MySQL, PostgreSQL, and so on). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and export data from HDFS back into a relational database. Under the hood, each import or export command is translated into a MapReduce job.
The generated MapReduce job mainly customizes the InputFormat and OutputFormat.

Installing Sqoop
1. Upload the Sqoop tarball to the Hadoop cluster; I am using sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz here. Extract it and rename the directory to sqoop.
[root@mini1 ~]# tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
[root@mini1 ~]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop
2. Edit the configuration file
In the conf directory, rename sqoop-env-template.sh to sqoop-env.sh.
Then set three paths in that file: where the hadoop command lives, where MapReduce lives, and where the hive command lives. (You can locate a command with which, e.g. which hive; note that each setting here takes a parent directory.)
[root@mini1 ~]# cd sqoop
[root@mini1 sqoop]# cd conf/
[root@mini1 conf]# ll
total 28
-rw-rw-r--. 1 root root 3895 Apr 27 2015 oraoop-site-template.xml
-rw-rw-r--. 1 root root 1404 Apr 27 2015 sqoop-env-template.cmd
-rwxr-xr-x. 1 root root 1345 Apr 27 2015 sqoop-env-template.sh
-rw-rw-r--. 1 root root 5531 Apr 27 2015 sqoop-site-template.xml
-rw-rw-r--. 1 root root 5531 Apr 27 2015 sqoop-site.xml
[root@mini1 conf]# mv sqoop-env-template.sh sqoop-env.sh
[root@mini1 conf]# vi sqoop-env.sh
...
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/root/apps/hadoop-2.6.4/
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/root/apps/hadoop-2.6.4/
#Set the path to where bin/hive is available
export HIVE_HOME=/root/apps/hive/
...
3. Add the MySQL JDBC driver
The MySQL driver jar was already placed in Hive's lib directory when Hive was installed, so just copy it from there:
[root@mini1 conf]# cd ..
[root@mini1 sqoop]# cp /root/apps/hive/lib/mysql-connector-java-5.1.28.jar ./lib/
That completes the installation.
A possible pitfall
The MySQL connector jar used above is mysql-connector-java-5.1.28.jar; the driver should be version 5.1.28 or newer, otherwise you may run into problems.
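A quick way to verify the installation is to print Sqoop's version; if the paths in sqoop-env.sh are correct, this runs without classpath errors:
[root@mini1 sqoop]# bin/sqoop version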
Data Import
1. Importing a database table into HDFS
First create a table in MySQL and insert some data; the session is copied below for easy reference.
mysql> use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> CREATE TABLE `emp` (
`id` int(32) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`deg` varchar(255) NOT NULL,
`salary` int(11) NOT NULL,
`dept` varchar(32) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.03 sec)
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| emp |
| t_user |
+----------------+
2 rows in set (0.01 sec)
mysql> desc emp;
+--------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+----------------+
| id | int(32) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| deg | varchar(255) | NO | | NULL | |
| salary | int(11) | NO | | NULL | |
| dept | varchar(32) | NO | | NULL | |
+--------+--------------+------+-----+---------+----------------+
5 rows in set (0.02 sec)
mysql> INSERT INTO `test`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) VALUES ('1', 'zhangsan', 'manager', '30000', 'AA');
Query OK, 1 row affected (0.02 sec)
mysql> INSERT INTO `test`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) VALUES ('2', 'lisi', 'programmer', '20000', 'AA');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO `test`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) VALUES ('2', 'wangwu', 'programmer', '15000', 'BB');
ERROR 1062 (23000): Duplicate entry '2' for key 'PRIMARY'
mysql> INSERT INTO `test`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) VALUES ('3', 'wangwu', 'programmer', '15000', 'BB');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `test`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) VALUES ('4', 'hund', 'programmer', '5000', 'CC');
Query OK, 1 row affected (0.01 sec)
mysql> select * from emp;
+----+----------+------------+--------+------+
| id | name | deg | salary | dept |
+----+----------+------------+--------+------+
| 1 | zhangsan | manager | 30000 | AA |
| 2 | lisi | programmer | 20000 | AA |
| 3 | wangwu | programmer | 15000 | BB |
| 4 | hund | programmer | 5000 | CC |
+----+----------+------------+--------+------+
Use the following command to import the emp table in the test database into HDFS (a default target directory is used):
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--m 1
--connect: the database host/IP and the database to use
--username: the MySQL username
--password: the MySQL password
--table: the table to import
Note: --m 1 means run the import with a single map task (a Sqoop import is a map-only MapReduce job).
While the command runs you can see that a MapReduce job is launched.
When it finishes, view the result in the HDFS web UI (/user/root is the default directory; I am running as the root user).

View the file contents (fields are separated by commas):
[root@mini1 sqoop]# hadoop fs -ls /user/root/emp
Found 2 items
-rw-r--r-- 2 root supergroup 0 2017-10-26 09:49 /user/root/emp/_SUCCESS
-rw-r--r-- 2 root supergroup 110 2017-10-26 09:49 /user/root/emp/part-m-00000
[root@mini1 sqoop]# hadoop fs -cat /user/root/emp/part-m-00000
1,zhangsan,manager,30000,AA
2,lisi,programmer,20000,AA
3,wangwu,programmer,15000,BB
4,hund,programmer,5000,CC
Note: the import is quite likely to fail with the following exception:
java.sql.SQLException: Access denied for user 'root'@'mini1' (using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1086)
...
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
17/10/26 00:01:46 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter
This is almost always caused by missing privileges; grant access for host mini1 as follows:
mysql> grant all privileges on *.* to root@mini1 identified by "123456";
Query OK, 0 rows affected (0.01 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
mysql> show grants for root@mini1;
+------------------------------------------------------------------------------------------------------------------------------------+
| Grants for root@mini1 |
+------------------------------------------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'mini1' IDENTIFIED BY PASSWORD '*6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9' WITH GRANT OPTION |
| GRANT PROXY ON ''@'' TO 'root'@'mini1' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------------------------------------+
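One more side note on connecting: passing --password on the command line leaves the password in the shell history. Sqoop also accepts -P, which prompts for the password interactively; a sketch of the same import using it (everything else unchanged):
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
-P \
--table emp \
--m 1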
2. Importing the emp table into a Hive table
Under the hood this first imports into HDFS and then loads the files from HDFS into Hive (a move, not a copy).
First delete the directory generated earlier:
[root@mini2 ~]# hadoop fs -rm -r /user/root
Run the following command to import the emp table into Hive (the Hive table is also named emp):
[root@mini1 sqoop]# bin/sqoop import \
> --connect jdbc:mysql://192.168.25.127:3306/test \
> --username root \
> --password 123456 \
> --table emp \
> --hive-import \
> --m 1
...
17/10/26 10:04:13 INFO mapreduce.Job: Job job_1508930025306_0022 running in uber mode : false
17/10/26 10:04:13 INFO mapreduce.Job: map 0% reduce 0%
17/10/26 10:04:17 INFO mapreduce.Job: map 100% reduce 0%
17/10/26 10:04:18 INFO mapreduce.Job: Job job_1508930025306_0022 completed successfully
17/10/26 10:04:19 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=124217
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=110
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=2926
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2926
Total vcore-milliseconds taken by all map tasks=2926
Total megabyte-milliseconds taken by all map tasks=2996224
...
17/10/26 10:04:21 INFO hive.HiveImport: It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/10/26 10:04:27 INFO hive.HiveImport: OK
17/10/26 10:04:27 INFO hive.HiveImport: Time taken: 1.649 seconds
17/10/26 10:04:27 INFO hive.HiveImport: Loading data to table default.emp
17/10/26 10:04:28 INFO hive.HiveImport: Table default.emp stats: [numFiles=1, totalSize=110]
17/10/26 10:04:28 INFO hive.HiveImport: OK
17/10/26 10:04:28 INFO hive.HiveImport: Time taken: 0.503 seconds
17/10/26 10:04:28 INFO hive.HiveImport: Hive import complete.
17/10/26 10:04:28 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
The important output lines are pasted above; you can see that the data is first written to files in HDFS and then moved into Hive.
Check in Hive that the table was created and the data imported:
hive> select * from emp;
OK
1 zhangsan manager 30000 AA
2 lisi programmer 20000 AA
3 wangwu programmer 15000 BB
4 hund programmer 5000 CC
Time taken: 0.641 seconds, Fetched: 4 row(s)
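By default --hive-import creates a Hive table of the same name in the default database and appends on re-runs. If you want a different Hive table name, or want re-runs to replace the data instead, Sqoop's --hive-table and --hive-overwrite options cover that; a sketch (the name emp_hive is just an example, not from the run above):
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--hive-import \
--hive-table emp_hive \
--hive-overwrite \
--m 1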
3. Importing into a specified HDFS directory
This is the same as the plain import to HDFS, except that a target directory is specified with --target-dir:
[root@mini1 sqoop]# bin/sqoop import \
> --connect jdbc:mysql://192.168.25.127:3306/test \
> --username root \
> --password 123456 \
> --table emp \
> --target-dir /queryresult \
> --m 1
After it runs, check the directory:
[root@mini3 ~]# hadoop fs -ls /queryresult
Found 2 items
-rw-r--r-- 2 root supergroup 0 2017-10-26 10:14 /queryresult/_SUCCESS
-rw-r--r-- 2 root supergroup 110 2017-10-26 10:14 /queryresult/part-m-00000
[root@mini3 ~]# hadoop fs -cat /queryresult/part-m-00000
1,zhangsan,manager,30000,AA
2,lisi,programmer,20000,AA
3,wangwu,programmer,15000,BB
4,hund,programmer,5000,CC
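Note that re-running this command fails if /queryresult already exists, because the underlying MapReduce job refuses to write into an existing output directory. One option (assuming Sqoop 1.4.x, where this flag is available) is --delete-target-dir, which removes the directory before importing:
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--delete-target-dir \
--target-dir /queryresult \
--m 1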
4. Importing a subset of a table
Sometimes you do not want the whole table, and you can import only what you need.
For example, import only the id, name, and deg columns, restricted to rows where deg = 'programmer', as follows:
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--target-dir /wherequery2 \
--query 'select id,name,deg from emp WHERE deg = "programmer" and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 1
--split-by id means the input is split by the id column; --fields-terminated-by '\t' sets the field delimiter of the files written to HDFS to a tab (the default is a comma).
[root@mini3 ~]# hadoop fs -ls /wherequery2
Found 2 items
-rw-r--r-- 2 root supergroup 0 2017-10-26 10:21 /wherequery2/_SUCCESS
-rw-r--r-- 2 root supergroup 56 2017-10-26 10:21 /wherequery2/part-m-00000
[root@mini3 ~]# hadoop fs -cat /wherequery2/part-m-00000
2 lisi programmer
3 wangwu programmer
4 hund programmer
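For a simple filter like this you do not strictly need a free-form --query: the --columns and --where options express the same subset declaratively. A sketch of an equivalent import (the target directory /wherequery3 is just an example name):
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--columns "id,name,deg" \
--where "deg='programmer'" \
--fields-terminated-by '\t' \
--target-dir /wherequery3 \
--m 1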
How --split-by works
1) The splitting mechanism
When parallelism is set (say --num-mappers 4) and --split-by is given, Sqoop first queries the minimum and maximum of the split column, i.e.: select min(key_id), max(key_id) from tb_oracle_stock_info_key.
If key_id (the primary key) of tb_oracle_stock_info_key (a stock-information table) has minimum 300 and maximum 400, the four parallel splits work out roughly as the following queries (Sqoop generates half-open ranges of the form key_id >= lo AND key_id < hi, so the splits do not overlap):
select * from tb_oracle_stock_info_key where key_id >= 300 and key_id < 325;
select * from tb_oracle_stock_info_key where key_id >= 325 and key_id < 350;
select * from tb_oracle_stock_info_key where key_id >= 350 and key_id < 375;
select * from tb_oracle_stock_info_key where key_id >= 375 and key_id <= 400;
In short, when running with more than one mapper, a job with a well-chosen --split-by column should in theory beat the same job without it.
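As a concrete sketch against the emp table from earlier (the directory /emp_parallel is just an example), a 4-way parallel import looks like this; Sqoop will first run select min(id), max(id) from emp and then give each mapper one quarter of the id range:
bin/sqoop import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--split-by id \
--num-mappers 4 \
--target-dir /emp_parallel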
2) Data skew
Suppose the Oracle table tb_oracle_stock_info_key (the stock-information table) has primary key key_id, and Sqoop splits the [min(key_id), max(key_id)] range evenly into 4 pieces. With min(key_id)=1 and max(key_id)=400, the splits come out as:
select * from tb_oracle_stock_info_key where key_id between 1 and 100;
select * from tb_oracle_stock_info_key where key_id between 101 and 200;
select * from tb_oracle_stock_info_key where key_id between 201 and 300;
select * from tb_oracle_stock_info_key where key_id between 301 and 400;
But suppose that, because of how the data happens to be distributed, the key_id=[1,100] split holds only 1 row, key_id=[101,300] holds no rows at all, and 99% of the data sits in key_id=[301,400]. The import is then badly skewed: three of the four mappers finish almost immediately while one does nearly all the work, so the parallelism buys very little.
Therefore, before running with parallelism, it is worth finding out how the split column is actually distributed.
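A quick way to check for skew up front is to look at the split column in the source database yourself; for example (using the table and column from the example above), comparing the row count with the width of the key range shows how sparse the splits will be:
select min(key_id), max(key_id), count(*) from tb_oracle_stock_info_key;
If the default even split fits the data poorly, Sqoop's --boundary-query option lets you supply your own query for the boundary values used in the split calculation.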
5. Incremental import
An incremental import brings in only the rows newly added to the table. For example, emp has 4 rows, but suppose the target should receive only the rows with id 3 and 4, i.e. everything after id 2.
Run the following command:
[root@mini1 sqoop]# bin/sqoop import \
> --connect jdbc:mysql://192.168.25.127:3306/test \
> --username root \
> --password 123456 \
> --table emp --m 1 \
> --incremental append \
> --check-column id \
> --last-value 2
[root@mini1 sqoop]# hadoop fs -ls /user/root/emp
Found 1 items
-rw-r--r-- 2 root supergroup 55 2017-10-26 10:28 /user/root/emp/part-m-00000
[root@mini1 sqoop]# hadoop fs -cat /user/root/emp/part-m-00000
3,wangwu,programmer,15000,BB
4,hund,programmer,5000,CC
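With a one-off command like this you must bump --last-value by hand before every run. Sqoop's saved jobs can do that bookkeeping: a job created with sqoop job --create stores the last imported value in the Sqoop metastore and updates it after each execution. A sketch (the job name emp_incr is just an example):
bin/sqoop job --create emp_incr -- import \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp --m 1 \
--incremental append \
--check-column id \
--last-value 2
bin/sqoop job --exec emp_incr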
Data Export
This sends data from HDFS into a MySQL table.
Note: the target database and table must already exist in MySQL before you export.
We keep using the emp table from above, but first empty it:
mysql> select * from emp;
+----+----------+------------+--------+------+
| id | name | deg | salary | dept |
+----+----------+------------+--------+------+
| 1 | zhangsan | manager | 30000 | AA |
| 2 | lisi | programmer | 20000 | AA |
| 3 | wangwu | programmer | 15000 | BB |
| 4 | hund | programmer | 5000 | CC |
+----+----------+------------+--------+------+
4 rows in set (0.00 sec)
mysql> truncate emp;
Query OK, 0 rows affected (0.05 sec)
mysql> select * from emp;
Empty set (0.00 sec)
Use the following command to export the data under the given HDFS directory into the specified MySQL database and table:
bin/sqoop export \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp \
--export-dir /user/root/emp/
After it finishes, check the emp table again:
mysql> select * from emp;
+----+--------+------------+--------+------+
| id | name | deg | salary | dept |
+----+--------+------------+--------+------+
| 3 | wangwu | programmer | 15000 | BB |
| 4 | hund | programmer | 5000 | CC |
+----+--------+------------+--------+------+
2 rows in set (0.00 sec)
The export is complete. Only rows 3 and 4 come back because /user/root/emp currently holds just the output of the incremental import above.
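One detail to watch with sqoop export: it assumes comma-separated input by default. If the HDFS files were written with another delimiter (such as the '\t' used in the subset import earlier), pass --input-fields-terminated-by. A sketch, assuming a hypothetical three-column MySQL table emp_sub(id, name, deg) to receive the tab-separated /wherequery2 data:
bin/sqoop export \
--connect jdbc:mysql://192.168.25.127:3306/test \
--username root \
--password 123456 \
--table emp_sub \
--export-dir /wherequery2 \
--input-fields-terminated-by '\t'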