Hive高級
1)產(chǎn)生背景
2)部署
3)DDL
4)DML
5)JOIN
6)function:build-in & udf
7)Sqoop
hive cli:hive/webui/beeline/Java API?
HiveServer2? 服務(wù)
beeline/Java API 連上HS2: client發(fā)起sql
!connect jdbc:hive2://localhost:10000 hadoop
beeline -u jdbc:hive2://localhost:14000 -n hadoop
大數(shù)據(jù)中著名的端口:
50070
8088
4040
2181
8020
7077
60010
10000
19888
hiveserver2 --hiveconf hive.server2.thrift.port=14000
Java API操作Hive
1)pom.xml添加如下dependency
org.apache.hivehive-jdbc${hive.version}
2)JDBC CODE
3)官網(wǎng)有錯,請慎重
note:
java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://hadoop000:14000/default: java.net.ConnectException: Connection refused: connect
WEB UI
Hive官方的不建議使用
HUE: Hadoop User Experience
http://github.com/cloudera/hue
Zeppelin
基礎(chǔ)數(shù)據(jù)類型
復(fù)雜數(shù)據(jù)類型:array、map、struct
* 根據(jù)你所需要的復(fù)雜數(shù)據(jù)類型創(chuàng)建表
* 取數(shù)據(jù)
1) arraycreate table hive_array(name string, work_locations array)
row format delimited fields terminated by '\t'
COLLECTION ITEMS TERMINATED BY ',';
load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;
array[index]
2) map? Map('a'#1,'b'#2)
item , item
key#value
create table ruoze_map(id int, name string, family map, age int)
row format delimited fields terminated by ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY ':'? ;
load data local inpath '/home/hadoop/data/hive_map.txt' overwrite into table ruoze_map;
map['key']
3) struct 192.168.1.1#zhangsan:40:xxx:bbb:aaacreate table ruoze_struct(ip string,userinfo struct)
row format delimited fields terminated by '#'
COLLECTION ITEMS TERMINATED BY ':';
struct.property
metadata
VERSION
Hive是一個進(jìn)程級別
TODO... VERSION + 1record ==> 2條記錄
DBS
DATABASE_PARAMS
TBLS? 和DBS是通過DB_ID關(guān)聯(lián)的
JOIN
執(zhí)行計劃
explain sql
ABSTRACT SYNTAX TREE? <== extended
STAGE DEPENDENCIES
STAGE PLANS
SQL on Hadoop
common join/shuffle join/reduce join? 有shuffle
mapjoin/broadcastjoin? 沒有shuffle 通常情況下性能高于common join,但是有前提
explain select e.empno,e.ename,d.dname from ruozedata_emp e join ruoze_dept d on e.depnto=d.deptno;
empempno, ename, deptnomap:ruoze_deptdname,deptnomap:shuffle: 相同的key分發(fā)到一個reduce task上去執(zhí)行 join的過程其實真正是發(fā)生在reduce階段的
mapjoin: join是發(fā)生在map階段,無shuffle
前提:大表 join 小表
原理:把小表加入到分布式緩存中去,在讀取大表的時候,
直接和分布式緩存中的數(shù)據(jù)匹配,匹配上就ok,匹配不上就滾蛋
select /*+mapjoin(d)*/ e.empno,e.ename,d.dname from emp e join ruoze_dept d on e.depnto=d.deptno;
hint
壓縮!??! 殺手锏 圈起來必考
大數(shù)據(jù)? ==> HDFS? <== 壓縮
帶來的好處:
1)disk
2)file
3)shuffle
壞處:cpu
壓縮技術(shù):
有損
無損
壓縮常用場景:
map輸入
map輸出
reduce輸出