Hive的表在邏輯上由存儲(chǔ)的數(shù)據(jù)和描述表中數(shù)據(jù)形式的相關(guān)元數(shù)據(jù)組成。數(shù)據(jù)通常存儲(chǔ)在HDFS中,元數(shù)據(jù)通常保存在關(guān)系型數(shù)據(jù)庫(kù)中。
數(shù)據(jù)庫(kù)
Hive中也由database的概念,本質(zhì)是在HDFS中的一個(gè)目錄。
創(chuàng)建數(shù)據(jù)庫(kù)
create database dbname;
切換數(shù)據(jù)庫(kù)
use dbname;
刪除數(shù)據(jù)庫(kù)
drop database dbname;
全限定指定表
dbname.tablename
默認(rèn)庫(kù)
在不指定數(shù)據(jù)庫(kù)時(shí),hive默認(rèn)使用default數(shù)據(jù)庫(kù)。
托管表和外部表
默認(rèn)情況下,表數(shù)據(jù)由Hive負(fù)責(zé)管理,hive會(huì)把數(shù)據(jù)文件移動(dòng)到倉(cāng)庫(kù)目錄下,我們把這種表稱為托管表,也叫內(nèi)部表。
我們也可以指定Hive從倉(cāng)庫(kù)目錄以外的位置訪問(wèn)數(shù)據(jù),這種表我們稱為外部表。
托管表的創(chuàng)建和數(shù)據(jù)加載
//創(chuàng)建
create table managed_table (name string);
//加載數(shù)據(jù)
load data inpath '/user/stefan/data.txt' into table managed_table;
//刪除表
drop table managed_table;
這里創(chuàng)建的表為托管表(內(nèi)部表),加載數(shù)據(jù)的過(guò)程實(shí)際上是hive將數(shù)據(jù)文件移動(dòng)到倉(cāng)庫(kù)目錄下的對(duì)應(yīng)表目錄下,而刪除表,hive不光會(huì)刪除元數(shù)據(jù)信息,還會(huì)刪除對(duì)應(yīng)的數(shù)據(jù)文件。所以,托管表(內(nèi)部表)是完全將數(shù)據(jù)交給了Hive來(lái)管理。如果誤刪除了表,相應(yīng)的數(shù)據(jù)也會(huì)丟失。
外部表的創(chuàng)建和數(shù)據(jù)加載
//創(chuàng)建
create external table external_table (name string) location '/user/stefan/external_table';
//加載數(shù)據(jù)
load data inpath '/user/stefan/data.txt' into table external_table;
//刪除表
drop table managed_table;
通過(guò)external關(guān)鍵字來(lái)指定當(dāng)前表是外部表,注意外部表在drop表時(shí),只會(huì)刪除元數(shù)據(jù)信息,并不會(huì)刪除數(shù)據(jù)文件。
托管表和外部表如何選擇?
可以看到托管表和外部表最大的區(qū)別在于drop表時(shí),是否會(huì)刪除數(shù)據(jù)。我們知道drop表是一個(gè)高危的操作,通常情況下應(yīng)該嚴(yán)格控制drop操作。如果我們采用外部表,在一定程度上也提高了數(shù)據(jù)的可恢復(fù)性,挽回?fù)p失。另外,外部表的數(shù)據(jù)通常可以方便其他工具來(lái)處理相同的數(shù)據(jù)。不過(guò)外部表同時(shí)也增加了數(shù)據(jù)維護(hù)的成本和復(fù)雜性。
在ETL過(guò)程中,我們可以將初始來(lái)源數(shù)據(jù)設(shè)置為外部表,然后中間的數(shù)據(jù)加工整體依賴于托管表,最后輸出的時(shí)候,可以再變換為外部表供其他工具等使用。
案例:將mysql數(shù)據(jù)同步到hive,進(jìn)行加工后,再同步到mysql供線上系統(tǒng)使用。方案:將mysql數(shù)據(jù)導(dǎo)入至hdfs,然后加載到hive外部表中進(jìn)行一系列的加工處理之后,寫入外部表作為輸出。其中,mysql2hdfs(hive)和hdfs(hive)2mysql都采用datax開源工具完成。
分區(qū)和桶
Hive表分區(qū)(pattition)是根據(jù)分區(qū)列的值對(duì)表進(jìn)行粗略劃分的,具體表現(xiàn)為對(duì)應(yīng)的分區(qū)對(duì)應(yīng)一個(gè)具體的目錄。
Hive表或分區(qū)還可以進(jìn)一步劃分為桶(bucket),桶是在數(shù)據(jù)的基礎(chǔ)上提供額外的結(jié)構(gòu)以獲得更高效的查詢處理。
分區(qū)(partition)
我們最常見(jiàn)的分區(qū)方式有根據(jù)時(shí)間和省份或者國(guó)家來(lái)劃分分區(qū)。
分區(qū)的定義:
//創(chuàng)建分區(qū)表
create table partition_table (id int, name string) partitioned by (dt string, country string);
//加載數(shù)據(jù)
load data local inpath '/user/home/stefan/partition_table_data.txt' into table partition_table partition ( dt = '20190402', country = 'china');
hive> create table partition_table (id int, name string) partitioned by (dt string, country string);
OK
Time taken: 1.195 seconds
hive> show tables;
OK
first_table
partition_table
Time taken: 1.302 seconds, Fetched: 2 row(s)
hive> desc partition_table;
OK
id int
name string
dt string
country string
# Partition Information
# col_name data_type comment
dt string
country string
Time taken: 2.638 seconds, Fetched: 9 row(s)
hive> load data local inpath '/home/hadoop/stefan/test/partition_table_data.txt' into table partition_table partition (dt = '20190402', country = 'china');
Loading data to table default.partition_table partition (dt=20190402, country=china)
OK
Time taken: 1.944 seconds
hive> select * from partition_table;
OK
1 楊康 20190402 china
2 郭靖 20190402 china
3 黃蓉 20190402 china
Time taken: 1.612 seconds, Fetched: 3 row(s)
查看分區(qū)數(shù)據(jù)目錄結(jié)構(gòu)
查看當(dāng)前表分區(qū)
show pattitions tablename;
hive> show partitions partition_table;
OK
dt=20190402/country=china
Time taken: 0.449 seconds, Fetched: 1 row(s)
查看數(shù)據(jù)存儲(chǔ)位置
show create table table_name;
其中l(wèi)ocation指定了當(dāng)前表數(shù)據(jù)在hdfs中的存儲(chǔ)位置。
hive> show create table partition_table;
OK
CREATE TABLE `partition_table`(
`id` int, `name` string)
PARTITIONED BY (
`dt` string,
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://jms-master-01:9000/user/hive/warehouse/partition_table'
TBLPROPERTIES (
'transient_lastDdlTime'='1554190131')
Time taken: 0.218 seconds, Fetched: 15 row(s)
查看分區(qū)表目錄結(jié)構(gòu)
hadoop fs -ls -R locationpath;
通過(guò) hadoop fs ls -R 命令我們可以看出分區(qū)表的數(shù)據(jù)目錄結(jié)構(gòu)。
[hadoop@jms-master-01 ~]$ hadoop fs -ls -R /user/hive/warehouse/partition_table
19/04/02 15:49:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x - hadoop supergroup 0 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402
drwxr-xr-x - hadoop supergroup 0 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402/country=china
-rwxr-xr-x 3 hadoop supergroup 21 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402/country=china/partition_table_data.txt
關(guān)于分區(qū)列
partitioned by指定的列在hive表中是正式的列,稱為分區(qū)列(partition column)。數(shù)據(jù)文件中是不包含這些列的值的,分區(qū)列源自于目錄名。
桶
桶(bucket)的功能:一是提高查詢處理效率;二是數(shù)據(jù)取樣更高效。
桶的定義
//創(chuàng)建
create table bucketed_table (id int, name string) clustered by (id) into 4 buckets;
//向桶表中插入數(shù)據(jù)
insert overwriter table bucketed_table select id, name from partition_table;
向桶表中添加數(shù)據(jù)時(shí),啟動(dòng)了一個(gè)MR任務(wù)來(lái)散列數(shù)據(jù)。
hive> insert overwrite table bucketed_table select id, name from partition_table;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190402172813_54c4fc16-0b17-4e5e-af95-2308156acf88
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1552651623473_0004, Tracking URL = http://jms-master-01:8088/proxy/application_1552651623473_0004/
Kill Command = /home/hadoop/tools/hadoop-2.7.7/bin/hadoop job -kill job_1552651623473_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-04-02 17:28:23,843 Stage-1 map = 0%, reduce = 0%
2019-04-02 17:28:29,339 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.61 sec
2019-04-02 17:28:36,812 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.4 sec
2019-04-02 17:28:37,860 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 10.36 sec
2019-04-02 17:28:38,987 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12.99 sec
MapReduce Total cumulative CPU time: 12 seconds 990 msec
Ended Job = job_1552651623473_0004
Loading data to table default.bucketed_table
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 12.99 sec HDFS Read: 19374 HDFS Write: 393 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 990 msec
OK
Time taken: 27.706 seconds
桶表的每個(gè)桶就是表數(shù)據(jù)目錄下的一個(gè)文件。其實(shí)一個(gè)桶對(duì)應(yīng)一個(gè)MapReduce的輸出文件分區(qū):一個(gè)作業(yè)產(chǎn)生的桶和reduce任務(wù)個(gè)數(shù)相同。
[hadoop@jms-master-01 test]$ hadoop fs -ls hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table
19/04/02 17:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x 3 hadoop supergroup 24 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000000_0
-rwxr-xr-x 3 hadoop supergroup 18 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000001_0
-rwxr-xr-x 3 hadoop supergroup 18 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000002_0
-rwxr-xr-x 3 hadoop supergroup 21 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000003_0
[hadoop@jms-master-01 test]$ hadoop fs -cat hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000000_0
19/04/02 17:30:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
8楊鐵心
4穆念慈
分桶規(guī)則:分桶字段除以桶數(shù)取余。
桶表的取樣
通過(guò)tablesample子句對(duì)表取樣。
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
說(shuō)明:colname表明抽取樣本的列,可以是非分區(qū)列中的任意一列,或者使用rand()表明咋整個(gè)行中抽取樣本而不是單個(gè)列。在colname上分桶的行隨機(jī)進(jìn)入1到y(tǒng)個(gè)桶中,返回屬于桶x的行。舉個(gè)例子:
//將數(shù)據(jù)隨機(jī)進(jìn)入20個(gè)桶中,返回第三個(gè)桶的數(shù)據(jù)
TABLESAMPLE (BUCKET 3 OUT OF 20 rand())
hive> select * from bucketed_table tablesample(bucket 1 out of 4 on id);
OK
8 楊鐵心
4 穆念慈
Time taken: 0.17 seconds, Fetched: 2 row(s)
hive> select * from bucketed_table tablesample(bucket 1 out of 2 on id);
OK
8 楊鐵心
4 穆念慈
6 西毒
2 黃蓉
Time taken: 0.093 seconds, Fetched: 4 row(s)