1. 借鑒
HIVE的四種排序,以及分區(qū)分桶
談?wù)刪ive的order by ,sort by ,distribute by 和 cluster by
hive hsq中的group by & Distribute by & partition by & cluster by & partitioned by & clustered by
2. 開始
排序的四種方式:
- order by
- sort by
- distribute by
- cluster by
下面我們依次來看下
order by
全局排序,只有一個(gè)reducer執(zhí)行,跟配置的mapreduce.job.reduces無關(guān),將所有數(shù)據(jù)進(jìn)行排序(默認(rèn)升序)。
select * from hotel_checkin_log order by order_time;
sort by
每個(gè)reducer內(nèi)部排序,跟配置的mapreduce.job.reduces有關(guān),從全局的角度來看并不是排序的。
在排序之前,先確認(rèn)一下我們r(jià)educer的數(shù)量
hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=4
是4個(gè)reducer,然后我們執(zhí)行一下sort by,按照order_time升序排序
select * from hotel_checkin_log sort by order_time;
然后我們看下排序后的數(shù)據(jù),其中倒數(shù)第三列為排序列。
就下面這個(gè)結(jié)果結(jié)合order_time排序列來看,它并不是全局排序的,我們可以根據(jù)數(shù)據(jù)來推算出它每個(gè)reducer都計(jì)算幾行數(shù)據(jù)(這里我用回車表示每個(gè)reducer的結(jié)果,實(shí)際數(shù)據(jù)沒有回車的)。
hotel_checkin_log.code hotel_checkin_log.checkin_date hotel_checkin_log.checkin_time hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time hotel_checkin_log.uid hotel_checkin_log.order_date
10271563 2020-03-20 14:00 2020-03-21 12:00 02:55:22 1124759681900888144 2020-03-19
10271563 2020-04-02 14:00 2020-04-03 12:00 02:55:22 1123659685898059853 2020-04-01
02371493 2020-04-01 17:00 2020-04-02 12:00 09:20:00 1133659694081146884 2020-03-31
00277553 2020-03-22 12:00 2020-03-25 12:00 12:00:00 1213559689173811267 2020-03-20
02371493 2020-04-02 17:00 2020-04-03 12:00 09:20:00 1133659694081146884 2020-04-01
02371493 2020-03-20 17:00 2020-03-21 12:00 09:20:00 1236559684014817343 2020-03-19
00277553 2020-04-02 12:00 2020-04-03 12:00 12:00:00 1213559689173811267 2020-04-01
00277553 2020-03-19 12:00 2020-03-22 12:00 12:00:00 1124559675462631493 2020-03-18
10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19
10271563 2020-03-21 14:00 2020-03-23 12:00 02:55:22 1123659685898059853 2020-03-20
02371493 2020-03-22 17:00 2020-03-23 12:00 09:20:00 1133659694081146884 2020-03-20
Time taken: 33.041 seconds, Fetched: 15 row(s)
為了更方便確認(rèn)它的reducer內(nèi)部排序,我們將排序的結(jié)果寫入文件,據(jù)我們所知,一個(gè)reducer的結(jié)果會(huì)生成一個(gè)文件
insert overwrite local directory '/opt/envs/datas/sortby'
select * from hotel_checkin_log sort by order_time;
我們看下文件數(shù)量就知道了
[root@e2d21b78fdb0 sortby]# ll
total 16
-rw-r--r-- 1 root root 332 Jun 25 11:07 000000_0
-rw-r--r-- 1 root root 332 Jun 25 11:07 000001_0
-rw-r--r-- 1 root root 415 Jun 25 11:07 000002_0
-rw-r--r-- 1 root root 166 Jun 25 11:07 000003_0
另外我也查看了一下文件內(nèi)容,跟我的推算完全一致,這里我只粘貼出來000002_0這個(gè)文件的內(nèi)容
102715632020-03-1914:002020-03-2012:0002:55:2214312596750359540082020-03-18
102715632020-04-0114:002020-04-0212:0002:55:2211236596858980598532020-03-31
023714932020-03-1917:002020-03-1912:0009:20:0016346596769641922992020-03-18
002775532020-04-0112:002020-04-0212:0012:00:0012135596891738112672020-03-31
002775532020-03-2012:002020-03-2212:0012:00:0012347596823203185492020-03-19
distribute by
分區(qū)排序,結(jié)合sort by使用,另外和是否是分區(qū)表沒有半毛錢關(guān)系。
按照指定的字段對(duì)數(shù)據(jù)進(jìn)行劃分到不同的輸出reduce文件中。
默認(rèn)是用hash分區(qū),即按照分區(qū)字段的hash值來進(jìn)行分區(qū)。
在排序之前,先確認(rèn)一下我們r(jià)educer的數(shù)量
hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=3
是3個(gè)reducer,然后我們執(zhí)行一下distribute by。按照order_date進(jìn)行分區(qū),并按照order_time升序排序
select * from hotel_checkin_log distribute by order_date sort by order_time;
結(jié)果如下,這個(gè)就不太直觀了,我們還是輸出到文件中看吧。
hotel_checkin_log.code hotel_checkin_log.checkin_date hotel_checkin_log.checkin_time hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time hotel_checkin_log.uid hotel_checkin_log.order_date
10271563 2020-03-20 14:00 2020-03-21 12:00 02:55:22 1124759681900888144 2020-03-19
10271563 2020-04-01 14:00 2020-04-02 12:00 02:55:22 1123659685898059853 2020-03-31
02371493 2020-04-01 17:00 2020-04-02 12:00 09:20:00 1133659694081146884 2020-03-31
02371493 2020-03-20 17:00 2020-03-21 12:00 09:20:00 1236559684014817343 2020-03-19
00277553 2020-03-20 12:00 2020-03-22 12:00 12:00:00 1234759682320318549 2020-03-19
00277553 2020-04-01 12:00 2020-04-02 12:00 12:00:00 1213559689173811267 2020-03-31
10271563 2020-04-02 14:00 2020-04-03 12:00 02:55:22 1123659685898059853 2020-04-01
10271563 2020-03-21 14:00 2020-03-23 12:00 02:55:22 1123659685898059853 2020-03-20
02371493 2020-03-22 17:00 2020-03-23 12:00 09:20:00 1133659694081146884 2020-03-20
02371493 2020-04-02 17:00 2020-04-03 12:00 09:20:00 1133659694081146884 2020-04-01
00277553 2020-03-22 12:00 2020-03-25 12:00 12:00:00 1213559689173811267 2020-03-20
00277553 2020-04-02 12:00 2020-04-03 12:00 12:00:00 1213559689173811267 2020-04-01
10271563 2020-03-19 14:00 2020-03-20 12:00 02:55:22 1431259675035954008 2020-03-18
02371493 2020-03-19 17:00 2020-03-19 12:00 09:20:00 1634659676964192299 2020-03-18
00277553 2020-03-19 12:00 2020-03-22 12:00 12:00:00 1124559675462631493 2020-03-18
我們將數(shù)據(jù)保存到文件中再看下
insert overwrite local directory '/opt/envs/datas/distributeby'
select * from hotel_checkin_log distribute by order_date sort by order_time;
我們先看下文件數(shù)量
[root@e2d21b78fdb0 distributeby]# ll
total 12
-rw-r--r-- 1 root root 498 Jun 25 12:42 000000_0
-rw-r--r-- 1 root root 498 Jun 25 12:42 000001_0
-rw-r--r-- 1 root root 249 Jun 25 12:42 000002_0
再分別看下文件內(nèi)容
[root@e2d21b78fdb0 distributeby]# more 000000_0
102715632020-03-2014:002020-03-2112:0002:55:2211247596819008881442020-03-19
102715632020-04-0114:002020-04-0212:0002:55:2211236596858980598532020-03-31
023714932020-04-0117:002020-04-0212:0009:20:0011336596940811468842020-03-31
023714932020-03-2017:002020-03-2112:0009:20:0012365596840148173432020-03-19
002775532020-03-2012:002020-03-2212:0012:00:0012347596823203185492020-03-19
002775532020-04-0112:002020-04-0212:0012:00:0012135596891738112672020-03-31
[root@e2d21b78fdb0 distributeby]# more 000001_0
102715632020-04-0214:002020-04-0312:0002:55:2211236596858980598532020-04-01
102715632020-03-2114:002020-03-2312:0002:55:2211236596858980598532020-03-20
023714932020-03-2217:002020-03-2312:0009:20:0011336596940811468842020-03-20
023714932020-04-0217:002020-04-0312:0009:20:0011336596940811468842020-04-01
002775532020-03-2212:002020-03-2512:0012:00:0012135596891738112672020-03-20
002775532020-04-0212:002020-04-0312:0012:00:0012135596891738112672020-04-01
[root@e2d21b78fdb0 distributeby]# more 000002_0
102715632020-03-1914:002020-03-2012:0002:55:2214312596750359540082020-03-18
023714932020-03-1917:002020-03-1912:0009:20:0016346596769641922992020-03-18
002775532020-03-1912:002020-03-2212:0012:00:0011245596754626314932020-03-18
cluster by
1> 當(dāng)distribute by和sorts by字段相同時(shí),可以使用cluster by方式。
2> 排序只能是升序排序,不能指定排序規(guī)則為asc或者desc。
select * from hotel_checkin_log cluster by order_date;
如果指定倒序會(huì)出現(xiàn)下列錯(cuò)誤
hive (db_company)> select * from hotel_checkin_log cluster by order_date desc;
FAILED: ParseException line 1:54 extraneous input 'desc' expecting EOF near '<EOF>'