Hive學(xué)習(xí)筆記(3)表

Hive的表在邏輯上由存儲(chǔ)的數(shù)據(jù)和描述表中數(shù)據(jù)形式的相關(guān)元數(shù)據(jù)組成。數(shù)據(jù)通常存儲(chǔ)在HDFS中,元數(shù)據(jù)通常保存在關(guān)系型數(shù)據(jù)庫(kù)中。

數(shù)據(jù)庫(kù)

Hive中也由database的概念,本質(zhì)是在HDFS中的一個(gè)目錄。

創(chuàng)建數(shù)據(jù)庫(kù)

create database dbname;

切換數(shù)據(jù)庫(kù)

use dbname;

刪除數(shù)據(jù)庫(kù)

drop database dbname;

全限定指定表

dbname.tablename

默認(rèn)庫(kù)

在不指定數(shù)據(jù)庫(kù)時(shí),hive默認(rèn)使用default數(shù)據(jù)庫(kù)。

托管表和外部表

默認(rèn)情況下,表數(shù)據(jù)由Hive負(fù)責(zé)管理,hive會(huì)把數(shù)據(jù)文件移動(dòng)到倉(cāng)庫(kù)目錄下,我們把這種表稱為托管表,也叫內(nèi)部表。
我們也可以指定Hive從倉(cāng)庫(kù)目錄以外的位置訪問(wèn)數(shù)據(jù),這種表我們稱為外部表。

托管表的創(chuàng)建和數(shù)據(jù)加載

//創(chuàng)建
create table managed_table (name string);
//加載數(shù)據(jù)
load data inpath '/user/stefan/data.txt' into table managed_table;
//刪除表
drop table managed_table;

這里創(chuàng)建的表為托管表(內(nèi)部表),加載數(shù)據(jù)的過(guò)程實(shí)際上是hive將數(shù)據(jù)文件移動(dòng)到倉(cāng)庫(kù)目錄下的對(duì)應(yīng)表目錄下,而刪除表,hive不光會(huì)刪除元數(shù)據(jù)信息,還會(huì)刪除對(duì)應(yīng)的數(shù)據(jù)文件。所以,托管表(內(nèi)部表)是完全將數(shù)據(jù)交給了Hive來(lái)管理。如果誤刪除了表,相應(yīng)的數(shù)據(jù)也會(huì)丟失。

外部表的創(chuàng)建和數(shù)據(jù)加載

//創(chuàng)建
create external table external_table (name string) location '/user/stefan/external_table';
//加載數(shù)據(jù)
load data inpath '/user/stefan/data.txt' into table external_table;
//刪除表
drop table managed_table;

通過(guò)external關(guān)鍵字來(lái)指定當(dāng)前表是外部表,注意外部表在drop表時(shí),只會(huì)刪除元數(shù)據(jù)信息,并不會(huì)刪除數(shù)據(jù)文件。

托管表和外部表如何選擇?

可以看到托管表和外部表最大的區(qū)別在于drop表時(shí),是否會(huì)刪除數(shù)據(jù)。我們知道drop表是一個(gè)高危的操作,通常情況下應(yīng)該嚴(yán)格控制drop操作。如果我們采用外部表,在一定程度上也提高了數(shù)據(jù)的可恢復(fù)性,挽回?fù)p失。另外,外部表的數(shù)據(jù)通常可以方便其他工具來(lái)處理相同的數(shù)據(jù)。不過(guò)外部表同時(shí)也增加了數(shù)據(jù)維護(hù)的成本和復(fù)雜性。
在ETL過(guò)程中,我們可以將初始來(lái)源數(shù)據(jù)設(shè)置為外部表,然后中間的數(shù)據(jù)加工整體依賴于托管表,最后輸出的時(shí)候,可以再變換為外部表供其他工具等使用。
案例:將mysql數(shù)據(jù)同步到hive,進(jìn)行加工后,再同步到mysql供線上系統(tǒng)使用。方案:將mysql數(shù)據(jù)導(dǎo)入至hdfs,然后加載到hive外部表中進(jìn)行一系列的加工處理之后,寫入外部表作為輸出。其中,mysql2hdfs(hive)和hdfs(hive)2mysql都采用datax開源工具完成。

分區(qū)和桶

Hive表分區(qū)(pattition)是根據(jù)分區(qū)列的值對(duì)表進(jìn)行粗略劃分的,具體表現(xiàn)為對(duì)應(yīng)的分區(qū)對(duì)應(yīng)一個(gè)具體的目錄。
Hive表或分區(qū)還可以進(jìn)一步劃分為桶(bucket),桶是在數(shù)據(jù)的基礎(chǔ)上提供額外的結(jié)構(gòu)以獲得更高效的查詢處理。

分區(qū)(partition)

我們最常見(jiàn)的分區(qū)方式有根據(jù)時(shí)間和省份或者國(guó)家來(lái)劃分分區(qū)。
分區(qū)的定義:

//創(chuàng)建分區(qū)表
create table partition_table (id int, name string) partitioned by (dt string, country string);
//加載數(shù)據(jù)
load data local inpath '/user/home/stefan/partition_table_data.txt' into table partition_table partition ( dt = '20190402', country = 'china');

hive> create table partition_table (id int, name string) partitioned by (dt string, country string);
OK
Time taken: 1.195 seconds
hive> show tables;
OK
first_table
partition_table
Time taken: 1.302 seconds, Fetched: 2 row(s)
hive> desc partition_table;
OK
id      int
name                    string
dt                      string
country                 string

# Partition Information
# col_name              data_type               comment

dt                      string
country                 string
Time taken: 2.638 seconds, Fetched: 9 row(s)
hive> load data local inpath '/home/hadoop/stefan/test/partition_table_data.txt' into table partition_table partition (dt = '20190402', country = 'china');
Loading data to table default.partition_table partition (dt=20190402, country=china)
OK
Time taken: 1.944 seconds
hive> select * from partition_table;
OK
1  楊康   20190402    china
2  郭靖   20190402    china
3  黃蓉   20190402    china
Time taken: 1.612 seconds, Fetched: 3 row(s)

查看分區(qū)數(shù)據(jù)目錄結(jié)構(gòu)

查看當(dāng)前表分區(qū)

show pattitions tablename;

 hive> show partitions partition_table;
OK
dt=20190402/country=china
Time taken: 0.449 seconds, Fetched: 1 row(s)
   

查看數(shù)據(jù)存儲(chǔ)位置

show create table table_name;

其中l(wèi)ocation指定了當(dāng)前表數(shù)據(jù)在hdfs中的存儲(chǔ)位置。

hive> show create table partition_table;
OK
CREATE TABLE `partition_table`(
  `id` int, `name` string)
PARTITIONED BY (
  `dt` string,
  `country` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://jms-master-01:9000/user/hive/warehouse/partition_table'
TBLPROPERTIES (
  'transient_lastDdlTime'='1554190131')
Time taken: 0.218 seconds, Fetched: 15 row(s)

查看分區(qū)表目錄結(jié)構(gòu)

hadoop fs -ls -R locationpath;

通過(guò) hadoop fs ls -R 命令我們可以看出分區(qū)表的數(shù)據(jù)目錄結(jié)構(gòu)。

[hadoop@jms-master-01 ~]$ hadoop fs -ls -R /user/hive/warehouse/partition_table
19/04/02 15:49:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - hadoop supergroup          0 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402
drwxr-xr-x   - hadoop supergroup          0 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402/country=china
-rwxr-xr-x   3 hadoop supergroup         21 2019-04-02 15:34 /user/hive/warehouse/partition_table/dt=20190402/country=china/partition_table_data.txt

關(guān)于分區(qū)列

partitioned by指定的列在hive表中是正式的列,稱為分區(qū)列(partition column)。數(shù)據(jù)文件中是不包含這些列的值的,分區(qū)列源自于目錄名。

桶(bucket)的功能:一是提高查詢處理效率;二是數(shù)據(jù)取樣更高效。

桶的定義

//創(chuàng)建
create table bucketed_table (id int, name string) clustered by (id) into 4 buckets;
//向桶表中插入數(shù)據(jù)
insert overwriter table bucketed_table select id, name from partition_table;

向桶表中添加數(shù)據(jù)時(shí),啟動(dòng)了一個(gè)MR任務(wù)來(lái)散列數(shù)據(jù)。

hive> insert overwrite table bucketed_table select id, name from partition_table;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190402172813_54c4fc16-0b17-4e5e-af95-2308156acf88
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1552651623473_0004, Tracking URL = http://jms-master-01:8088/proxy/application_1552651623473_0004/
Kill Command = /home/hadoop/tools/hadoop-2.7.7/bin/hadoop job  -kill job_1552651623473_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-04-02 17:28:23,843 Stage-1 map = 0%,  reduce = 0%
2019-04-02 17:28:29,339 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.61 sec
2019-04-02 17:28:36,812 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 7.4 sec
2019-04-02 17:28:37,860 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 10.36 sec
2019-04-02 17:28:38,987 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.99 sec
MapReduce Total cumulative CPU time: 12 seconds 990 msec
Ended Job = job_1552651623473_0004
Loading data to table default.bucketed_table
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 12.99 sec   HDFS Read: 19374 HDFS Write: 393 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 990 msec
OK
Time taken: 27.706 seconds

桶表的每個(gè)桶就是表數(shù)據(jù)目錄下的一個(gè)文件。其實(shí)一個(gè)桶對(duì)應(yīng)一個(gè)MapReduce的輸出文件分區(qū):一個(gè)作業(yè)產(chǎn)生的桶和reduce任務(wù)個(gè)數(shù)相同。

[hadoop@jms-master-01 test]$ hadoop fs -ls hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table
19/04/02 17:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rwxr-xr-x   3 hadoop supergroup         24 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000000_0
-rwxr-xr-x   3 hadoop supergroup         18 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000001_0
-rwxr-xr-x   3 hadoop supergroup         18 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000002_0
-rwxr-xr-x   3 hadoop supergroup         21 2019-04-02 17:28 hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000003_0
[hadoop@jms-master-01 test]$ hadoop fs -cat hdfs://jms-master-01:9000/user/hive/warehouse/bucketed_table/000000_0
19/04/02 17:30:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
8楊鐵心
4穆念慈

分桶規(guī)則:分桶字段除以桶數(shù)取余。

桶表的取樣

通過(guò)tablesample子句對(duì)表取樣。

TABLESAMPLE (BUCKET x OUT OF y [ON colname])

說(shuō)明:colname表明抽取樣本的列,可以是非分區(qū)列中的任意一列,或者使用rand()表明咋整個(gè)行中抽取樣本而不是單個(gè)列。在colname上分桶的行隨機(jī)進(jìn)入1到y(tǒng)個(gè)桶中,返回屬于桶x的行。舉個(gè)例子:

//將數(shù)據(jù)隨機(jī)進(jìn)入20個(gè)桶中,返回第三個(gè)桶的數(shù)據(jù)
TABLESAMPLE (BUCKET 3 OUT OF 20 rand())

hive> select * from bucketed_table tablesample(bucket 1 out of 4 on id);
OK
8   楊鐵心
4   穆念慈
Time taken: 0.17 seconds, Fetched: 2 row(s)
hive> select * from bucketed_table tablesample(bucket 1 out of 2 on id);
OK
8   楊鐵心
4   穆念慈
6   西毒
2   黃蓉
Time taken: 0.093 seconds, Fetched: 4 row(s)

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容