Author: foochane
Original link: https://foochane.cn/article/2019063001.html
Using the Sqoop Data Migration Tool: a brief introduction to Sqoop; importing data to HDFS/Hive; exporting data to MySQL
1 A brief introduction to Sqoop
Sqoop is an Apache tool for "transferring data between Hadoop and relational database servers". It handles both import and export of data.
- Import: bring data from MySQL, Oracle, and other relational databases into Hadoop storage systems such as HDFS, Hive, and HBase;
- Export: move data from the Hadoop file system out to a relational database such as MySQL.
Sqoop works by translating each import or export command into a MapReduce program; the generated job mainly customizes the InputFormat and OutputFormat.
2 Installing Sqoop
Install Java and Hadoop before installing Sqoop.
Sqoop is only a client tool; it can be installed on any node, as long as that node has Java and Hadoop environments and can reach the target database.
2.1 Download and extract
Download address: http://mirror.bit.edu.cn/apache/sqoop/
Download sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz.
Extract it into the installation directory.
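For example (a minimal sketch, assuming /usr/local/bigdata is the installation prefix used throughout this article):
$ tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/local/bigdata/
$ mv /usr/local/bigdata/sqoop-1.4.7.bin__hadoop-2.6.0 /usr/local/bigdata/sqoop-1.4.7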
2.2 Edit the configuration file
Copy sqoop-env-template.sh to a new file named sqoop-env.sh, and add the following lines to sqoop-env.sh:
export HADOOP_COMMON_HOME=/usr/local/bigdata/hadoop-2.7.1
export HADOOP_MAPRED_HOME=/usr/local/bigdata/hadoop-2.7.1
export HIVE_HOME=/usr/local/bigdata/hive-2.3.5
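The copy itself can be done like this (assuming the template sits in the conf directory of the installation above):
$ cd /usr/local/bigdata/sqoop-1.4.7/conf
$ cp sqoop-env-template.sh sqoop-env.sh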
2.3 Install the MySQL JDBC driver
Copy mysql-connector-java-5.1.45.jar into Sqoop's lib directory:
$ sudo apt-get install libmysql-java    # already installed earlier
$ ln -s /usr/share/java/mysql-connector-java-5.1.45.jar /usr/local/bigdata/sqoop-1.4.7/lib
Alternatively, copy mysql-connector-java-5.1.45.jar over by hand.
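For example (assuming the jar is in the current directory):
$ cp mysql-connector-java-5.1.45.jar /usr/local/bigdata/sqoop-1.4.7/lib/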
2.4 Verify Sqoop
Check the Sqoop version:
$ bin/sqoop-version
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/06/30 03:03:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 20
Several warnings appear; they can be ignored for now.
Verify connectivity between Sqoop and the MySQL database:
$ bin/sqoop-list-databases --connect jdbc:mysql://Master:3306 --username hiveuser --password 123456
$ bin/sqoop-list-tables --connect jdbc:mysql://Master:3306/metastore --username hiveuser --password 123456
3 Importing data with Sqoop
3.1 Importing from MySQL to HDFS
First, create the test tables in MySQL and insert some test data:
SET FOREIGN_KEY_CHECKS=0;
-- ----------------------------
-- Table structure for `emp`
-- ----------------------------
DROP TABLE IF EXISTS `emp`;
CREATE TABLE `emp` (
`id` int(11) DEFAULT NULL,
`name` varchar(100) DEFAULT NULL,
`deg` varchar(100) DEFAULT NULL,
`salary` int(11) DEFAULT NULL,
`dept` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- ----------------------------
-- Records of emp
-- ----------------------------
INSERT INTO `emp` VALUES ('1201', 'gopal', 'manager', '50000', 'TP');
INSERT INTO `emp` VALUES ('1202', 'manisha', 'Proof reader', '50000', 'TP');
INSERT INTO `emp` VALUES ('1203', 'khalil', 'php dev', '30000', 'AC');
INSERT INTO `emp` VALUES ('1204', 'prasanth', 'php dev', '30000', 'AC');
INSERT INTO `emp` VALUES ('1205', 'kranthi', 'admin', '20000', 'TP');
-- ----------------------------
-- Table structure for `emp_add`
-- ----------------------------
DROP TABLE IF EXISTS `emp_add`;
CREATE TABLE `emp_add` (
`id` int(11) DEFAULT NULL,
`hno` varchar(100) DEFAULT NULL,
`street` varchar(100) DEFAULT NULL,
`city` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- ----------------------------
-- Records of emp_add
-- ----------------------------
INSERT INTO `emp_add` VALUES ('1201', '288A', 'vgiri', 'jublee');
INSERT INTO `emp_add` VALUES ('1202', '108I', 'aoc', 'sec-bad');
INSERT INTO `emp_add` VALUES ('1203', '144Z', 'pgutta', 'hyd');
INSERT INTO `emp_add` VALUES ('1204', '78B', 'old city', 'sec-bad');
INSERT INTO `emp_add` VALUES ('1205', '720X', 'hitec', 'sec-bad');
-- ----------------------------
-- Table structure for `emp_conn`
-- ----------------------------
DROP TABLE IF EXISTS `emp_conn`;
CREATE TABLE `emp_conn` (
`id` int(100) DEFAULT NULL,
`phno` varchar(100) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- ----------------------------
-- Records of emp_conn
-- ----------------------------
INSERT INTO `emp_conn` VALUES ('1201', '2356742', 'gopal@tp.com');
INSERT INTO `emp_conn` VALUES ('1202', '1661663', 'manisha@tp.com');
INSERT INTO `emp_conn` VALUES ('1203', '8887776', 'khalil@ac.com');
INSERT INTO `emp_conn` VALUES ('1204', '9988774', 'prasanth@ac.com');
INSERT INTO `emp_conn` VALUES ('1205', '1231231', 'kranthi@tp.com');
Run the import:
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username root \
--password root \
--target-dir /sqooptest \
--fields-terminated-by ',' \
--table emp \
--m 2 \
--split-by id;
- --connect: the JDBC URL of the source database
- --username: the database user
- --password: the database password
- --table: the table to import
- --target-dir: the target HDFS directory
- --fields-terminated-by: the field delimiter for the output files
- --m: the number of map tasks; if it is greater than 1, --split-by must also be given. With --m 2, two output files are produced.
- --split-by: the column used to split the data across map tasks
If the table is not very large, there is no need to set --m at all.
Note: start HDFS and YARN before importing; the job is submitted to YARN rather than run locally.
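If the daemons are not running yet, they can be started with the standard Hadoop scripts (assuming Hadoop's sbin directory is on the PATH):
$ start-dfs.sh
$ start-yarn.sh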
Example:
$ bin/sqoop import --connect jdbc:mysql://Master:3306/test --username hadoop --password 123456 --target-dir /sqooptest --fields-terminated-by ',' --table emp --m 1 --split-by id;
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/06/30 05:00:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/06/30 05:00:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/06/30 05:00:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/06/30 05:00:44 INFO tool.CodeGenTool: Beginning code generation
Sun Jun 30 05:00:45 UTC 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
19/06/30 05:00:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
19/06/30 05:00:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
19/06/30 05:00:46 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/bigdata/hadoop-2.7.1
Note: /tmp/sqoop-hadoop/compile/cd17c36add75dfe67edd3facf7538def/emp.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/06/30 05:00:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/cd17c36add75dfe67edd3facf7538def/emp.jar
19/06/30 05:00:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/06/30 05:00:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/06/30 05:00:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/06/30 05:00:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/06/30 05:00:56 INFO mapreduce.ImportJobBase: Beginning import of emp
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hbase-2.0.5/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/06/30 05:00:58 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/06/30 05:01:06 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/06/30 05:01:07 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.233.200:8032
Sun Jun 30 05:01:55 UTC 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
19/06/30 05:01:56 INFO db.DBInputFormat: Using read commited transaction isolation
19/06/30 05:01:56 INFO mapreduce.JobSubmitter: number of splits:1
19/06/30 05:01:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561868076549_0002
19/06/30 05:02:05 INFO impl.YarnClientImpl: Submitted application application_1561868076549_0002
19/06/30 05:02:06 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1561868076549_0002/
19/06/30 05:02:06 INFO mapreduce.Job: Running job: job_1561868076549_0002
19/06/30 05:02:47 INFO mapreduce.Job: Job job_1561868076549_0002 running in uber mode : false
19/06/30 05:02:48 INFO mapreduce.Job: map 0% reduce 0%
19/06/30 05:03:35 INFO mapreduce.Job: map 100% reduce 0%
19/06/30 05:03:36 INFO mapreduce.Job: Job job_1561868076549_0002 completed successfully
19/06/30 05:03:37 INFO mapreduce.Job: Counters: 30
  File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=135030
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=87
    HDFS: Number of bytes written=151
    HDFS: Number of read operations=4
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
  Job Counters
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=42476
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=42476
    Total vcore-seconds taken by all map tasks=42476
    Total megabyte-seconds taken by all map tasks=43495424
  Map-Reduce Framework
    Map input records=5
    Map output records=5
    Input split bytes=87
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=250
    CPU time spent (ms)=2700
    Physical memory (bytes) snapshot=108883968
    Virtual memory (bytes) snapshot=1934733312
    Total committed heap usage (bytes)=18415616
  File Input Format Counters
    Bytes Read=0
  File Output Format Counters
    Bytes Written=151
19/06/30 05:03:37 INFO mapreduce.ImportJobBase: Transferred 151 bytes in 150.6495 seconds (1.0023 bytes/sec)
19/06/30 05:03:37 INFO mapreduce.ImportJobBase: Retrieved 5 records.
Check whether the import succeeded:
$ hdfs dfs -cat /sqooptest/part-m-*
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
3.2 Importing from MySQL to Hive
Command:
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--table emp \
--hive-import \
--split-by id \
--m 1;
To import into Hive, add the --hive-import flag; --target-dir is not needed, and the other parameters are the same as for an HDFS import.
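By default the imported table keeps the source table's name and lands in Hive's default database, so a quick sanity check might look like this:
hive> select * from emp;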
3.3 Importing a subset of a table
Sometimes we do not need all of the rows in a table; Sqoop also supports importing just part of one.
For this, use Sqoop's --where option. The condition becomes part of the SQL query executed on the database server, and the result set is stored in the target HDFS directory.
The syntax of the where clause is:
--where <condition>
The following command imports the subset of the emp_add table whose city is sec-bad:
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--where "city ='sec-bad'" \
--target-dir /wherequery \
--table emp_add \
--m 1
Alternatively, use a free-form query with --query. The query must contain the literal token $CONDITIONS in its WHERE clause, which Sqoop replaces with the split condition for each map task (single-quote the query so the shell does not expand it):
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--target-dir /wherequery2 \
--query 'select id,name,deg from emp WHERE id>1207 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2
3.4 Incremental import
Incremental import is a technique for importing only the newly added rows of a table.
Sqoop supports two incremental import modes: append, which tracks a monotonically increasing column, and lastmodified, which tracks a timestamp column.
3.4.1 append
Specify the following parameters:
--incremental append
--check-column num_id
--last-value 0
--check-column names the increasing column; --last-value is the largest value imported by the previous run.
For example:
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--table emp --m 1 \
--incremental append \
--check-column id \
--last-value 1208
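In practice the --last-value bookkeeping is easier to automate with a saved Sqoop job, which records the new last value after every successful run. A sketch, using a hypothetical job name incr_emp:
bin/sqoop job --create incr_emp -- import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--table emp --m 1 \
--incremental append \
--check-column id \
--last-value 1208
Each later run is then just: bin/sqoop job --exec incr_emp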
3.4.2 Timestamp-based (lastmodified)
Add the following parameters to the command:
--incremental lastmodified
--check-column created
--last-value '2012-02-01 11:0:00'
This imports only the rows whose created value is later than 2012-02-01 11:0:00.
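A full command might look like the sketch below. It assumes the source table has a created timestamp column and that the target directory is new; if the directory already exists, lastmodified mode additionally requires --append or --merge-key.
bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--table emp --m 1 \
--target-dir /sqooptest_inc \
--incremental lastmodified \
--check-column created \
--last-value '2012-02-01 11:0:00'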
4 Exporting data with Sqoop
Export moves files from HDFS into an RDBMS; the target table must exist in the target database before the export runs. The default mode inserts the file's records into the table with INSERT statements; in update mode, UPDATE statements are generated to update existing rows.
Syntax:
$ sqoop export (generic-args) (export-args)
Export procedure:
1. First, manually create the target table in MySQL:
mysql> USE test;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
2. Run the export command:
bin/sqoop export \
--connect jdbc:mysql://Master:3306/test \
--username root \
--password root \
--table employee \
--export-dir /user/hadoop/emp/
3. Verify the table from the mysql command line:
mysql> select * from employee;
If the data was stored successfully, you will find it in the employee table:
+------+----------+---------+--------+------+
| id   | name     | deg     | salary | dept |
+------+----------+---------+--------+------+
| 1201 | gopal    | manager |  50000 | TP   |
| 1202 | manisha  | preader |  50000 | TP   |
| 1203 | kalil    | php dev |  30000 | AC   |
| 1204 | prasanth | php dev |  30000 | AC   |
| 1205 | kranthi  | admin   |  20000 | TP   |
| 1206 | satish p | grp des |  20000 | GR   |
+------+----------+---------+--------+------+
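For the update mode mentioned at the start of this section, the export command takes two extra flags (a sketch; it assumes id is the key column to match on, and --update-mode allowinsert also inserts rows with no match):
bin/sqoop export \
--connect jdbc:mysql://Master:3306/test \
--username root \
--password root \
--table employee \
--export-dir /user/hadoop/emp/ \
--update-key id \
--update-mode allowinsert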