Installation
Choosing the metastore storage
By default Hive uses an embedded Derby database, which supports only a single session at a time and is mainly used for testing.
Storing the metastore in a MySQL database allows concurrent users and is the usual choice for production.
HDFS warehouse directories
Create the warehouse directory:
hadoop fs -mkdir -p /user/hive/warehouse
Grant write permission:
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -chmod a+w /tmp
hive安裝
hadoop 用戶將HIVE安裝包解壓到/home/hadoop/apps安裝目錄
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /home/hadoop/apps切換到root用戶
創(chuàng)建軟連接
ln -s /home/hadoop/apps/hive-1.2.2 /usr/local/hive
修改屬主
chown -R hadoop:hadoop /usr/local/hive
Add environment variables:
vim /etc/profile
Append:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:${HIVE_HOME}/bin
Apply the changes:
source /etc/profile
Switch back to the hadoop user.
Edit $HIVE_HOME/conf/hive-site.xml (create it if it does not exist):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.183.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive123</value>
<description>password to use against metastore database</description>
</property>
</configuration>
Note: replace the IP address and password with your own.
Starting Hive
/usr/local/hive/bin/hive
[hadoop@hadoop4 bin]$ hive
Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
Hive started successfully.
To start Hive with log output printed to the console:
hive -hiveconf hive.root.logger=DEBUG,console
Hive operations
- List databases
show databases;
hive> show databases;
OK
default
Time taken: 0.02 seconds, Fetched: 1 row(s)
- Create a database
create database mytestDB;
create database IF NOT EXISTS mytestDB;
hadoop fs -ls /user/hive/warehouse
A mytestdb.db directory is added under the warehouse path.
- Switch databases
use mytestdb;
- List tables
show tables;
- Create a table
Create a user table, user_info, with columns: user id, area id, age, occupation.
create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
After the table is created, a matching directory is also created in HDFS:
/user/hive/warehouse/mytestdb.db/user_info
- Drop a table
drop table user_info;
The user_info directory in HDFS is deleted along with the table.
Creating an internal (managed) table
- Create a student info table in database rel
create table student_info(
student_id string comment 'student id',
name string comment 'name',
age int comment 'age',
origin string comment 'origin region'
)
comment 'student info table'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Use load to load data into student_info from a local file
load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
Check the student_info table's HDFS path; a new student_info_data.txt file appears:
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
- Query the students whose origin is 11
hive> select * from student_info where origin='11'
    > ;
OK
1 xiaoming 20 11
6 zhangsan 20 11
7 lisi 19 11
Time taken: 0.473 seconds, Fetched: 3 row(s)
hive>
- Use load to load an HDFS file into student_info
Upload the file to HDFS:
hadoop fs -put student_info_data.txt /
Load in append mode:
load data inpath '/student_info_data.txt' into table student_info;
Two student_info_data.txt files now appear in the table's HDFS directory:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 2 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:39 /user/hive/warehouse/mydb.db/student_info/student_info_data_copy_1.txt
Also note that /student_info_data.txt in HDFS is moved (not copied) into the table's HDFS path, /user/hive/warehouse/mydb.db/student_info.
Load in overwrite mode:
load data inpath '/student_info_data.txt' overwrite into table student_info;
This replaces the existing data.
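The difference between the two load modes can be sketched with a small Python model (the file names and the _copy_N renaming pattern mirror the listing above; this is an illustration, not Hive internals):

```python
# Toy model of how "load data ... into table" (append) versus
# "load data ... overwrite into table" affect the files under a
# table's HDFS directory. Names here are illustrative only.

def load_into(table_files, new_file):
    """Append mode: on a name collision, Hive stores the file with a _copy_N suffix."""
    name = new_file
    n = 1
    while name in table_files:
        base, dot, ext = new_file.rpartition('.')
        name = f"{base}_copy_{n}.{ext}" if dot else f"{new_file}_copy_{n}"
        n += 1
    return table_files + [name]

def load_overwrite(table_files, new_file):
    """Overwrite mode: existing files are removed first."""
    return [new_file]

files = ["student_info_data.txt"]
files = load_into(files, "student_info_data.txt")
# -> ["student_info_data.txt", "student_info_data_copy_1.txt"]
files = load_overwrite(files, "student_info_data.txt")
# -> ["student_info_data.txt"]
```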
Data types
- Create an employee table: employee
Columns: user id, salary, cities worked in, social-security contributions (pension, medical), welfare (meal allowance (float), regular employee or not (boolean), commercial insurance (float))
create table employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
- Load local data into the employee table
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
hive> select * from employee;
OK
zhangsan 10800 ["beijing","shanghai"] {"養(yǎng)老":1000.0,"醫(yī)療":600.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi 20000 ["beijing","nanjing"] {"養(yǎng)老":2000.0,"醫(yī)療":1200.0} {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu 17000 ["shanghai","nanjing"] {"養(yǎng)老":1800.0,"醫(yī)療":1100.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
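How the declared delimiters carve one data line into these nested types can be sketched in Python (the sample line is an assumption reconstructed from the query output above, not the actual data file):

```python
# Rough sketch of how Hive splits one employee_data.txt line using the
# delimiters declared on the table: fields by '\t', collection items
# by ',', map keys by ':'.

def parse_employee(line):
    user_id, salary, cities, social, welfare = line.split('\t')
    meal, regular, insurance = welfare.split(',')
    return {
        "user_id": user_id,
        "salary": int(salary),
        "worked_citys": cities.split(','),                  # array<string>
        "social_security": {k: float(v) for k, v in         # map<string,float>
                            (kv.split(':') for kv in social.split(','))},
        "welfare": {                                        # struct
            "meal_allowance": float(meal),
            "if_regular": regular == "true",
            "commercial_insurance": float(insurance),
        },
    }

row = parse_employee("zhangsan\t10800\tbeijing,shanghai\t養(yǎng)老:1000,醫(yī)療:600\t2000,true,500")
```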
- Query, for regular employees, the employee id, salary, first city worked in, pension contribution, and meal allowance
hive> select user_id,
> salary,
> worked_citys[0],
> social_security['養(yǎng)老'],
> welfare.meal_allowance
> from employee
> where welfare.if_regular=true;
OK
zhangsan 10800 beijing 1000.0 2000.0
wangwu 17000 shanghai 1800.0 2000.0
Creating an external table
The HDFS path can be created in advance:
hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
If it is not created in advance, it is created automatically at the path given when the external table is defined.
- Create an external student enrollment table
Columns:
student id, name, institute id, major id, enrollment year
HDFS data path: /user/hive/warehouse/data/student_school_info
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
- Upload the local data file to HDFS
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
Query:
select * from student_school_info;
Creating an internal partitioned table
Create a student enrollment table.
Columns: student id, name, institute id
Partition column: major id
create table student_school_info_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Use insert into to copy the 'bigdata' major records from student_school_info into the student_school_info_partition_maj partitioned table
insert into table student_school_info_partition_maj partition(major_id='bigdata')
select t1.student_id,t1.name,t1.institute_id
from student_school_info t1
where t1.major_id = 'bigdata';
List the partitions:
show partitions student_school_info_partition_maj;
Check the HDFS path; a major_id=bigdata directory has been added:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj/
Drop the partition:
alter table student_school_info_partition_maj drop partition (major_id='bigdata');
Query the partitioned table again: the data has been deleted.
Inserting data with dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1 ;
- List the partitions
show partitions student_school_info_partition_maj;
hive> show partitions student_school_info_partition_maj;
OK
major_id=bigdata
major_id=computer
major_id=software
Time taken: 0.114 seconds, Fetched: 3 row(s)
- Check the HDFS path
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=software
Three directories are added, one holding the data for each partition.
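The routing a dynamic-partition insert performs can be modeled in a few lines of Python (row values are invented for illustration):

```python
# Toy model of a dynamic-partition insert: each row is routed to one
# directory per distinct value of the partition column (major_id),
# which is what produces the major_id=... directories listed above.

def route_partitions(rows, part_col):
    parts = {}
    for row in rows:
        key = f"{part_col}={row[part_col]}"
        # the partition column is stored in the directory name, not the data
        data = {k: v for k, v in row.items() if k != part_col}
        parts.setdefault(key, []).append(data)
    return parts

rows = [
    {"student_id": "1", "name": "a", "major_id": "bigdata"},
    {"student_id": "2", "name": "b", "major_id": "software"},
    {"student_id": "3", "name": "c", "major_id": "bigdata"},
]
parts = route_partitions(rows, "major_id")
# -> two keys: "major_id=bigdata" (2 rows) and "major_id=software" (1 row)
```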
Creating an external partitioned table
Create a student enrollment table.
Columns: student id, name, institute id
Partition column: major id
create external table rel.student_school_info_external_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition_maj';
- Dynamic partition insert
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_external_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
- Drop the internal partitioned table; the table and its HDFS data are both deleted
drop table student_school_info_partition_maj;
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/
Found 1 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:23 /user/hive/warehouse/rel.db/student_school_info_partition
- Drop the external partitioned table
hive> drop table student_school_info_external_partition_maj;
OK
Time taken: 0.63 seconds
hive> show tables;
OK
student_school_info
student_school_info_partition
Time taken: 0.027 seconds, Fetched: 2 row(s)
Check the files in HDFS; the data is still there:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:06 /user/hive/warehouse/data/student_school_info
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 46 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software/000000_0
Creating tables with LIKE and AS; renaming tables; adding, changing, and dropping columns
Use the like keyword to copy an existing table's structure into a new table with an identical schema:
create table student_info2 like student_info;
Use the as keyword to create a table whose columns match a query result, and insert that result into the new table:
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
Rename student_info4 to student_id_name:
alter table student_info4 rename to student_id_name;
Add a gender column to student_info3. A newly added column is placed after all existing columns and before any partition columns; for rows in existing data files that have no value for the new column, queries return NULL. To add several columns at once, separate them with commas:
alter table student_info3 add columns (gender string comment 'gender');
Change or drop columns
To change a column, redefine every column that stays and give the new definition for the one being replaced:
alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
To drop a column, redefine every column that stays and omit the one being dropped:
alter table student_info3 replace columns(student_id string,name string,age int,origin string);
Creating a bucketed table
Rows are assigned to buckets by hashing a chosen column.
Create a bucketed student table.
Columns: student id, name, age, origin
Bucketing: by student id, into 4 buckets, sorted by student id ascending within each bucket
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Insert data
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
Load is generally not used to populate a bucketed table: load only copies a file into the table's storage directory; Hive does not parse the data and split it by the bucketing column, so the single file is imported whole, without any bucketing. Instead, insert the data from another table.
Defining a bucketed table only fixes the table's layout; each insert must do the following:
In the session that runs the bucketed insert, set hive.enforce.bucketing = true; to declare that this is a bucketing operation.
Set the number of reducers equal to the number of buckets (set mapreduce.job.reduces=4;), so that exactly one file is produced per bucket.
If the table is declared sorted by the bucketing column, the query feeding the insert must sort the data before it is written into the buckets; the bucketed table does not sort bucket contents on its own.
When the sort column and the bucketing column are the same, use cluster by (col), which distributes by the column and sorts ascending within each bucket. For descending order inside the buckets, use the combination distribute by (col) sort by (col desc) instead.
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
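The mechanism behind the 4 buckets can be sketched in Python (the hash below is a stand-in; Hive uses its own hashing, so the exact bucket assignment differs):

```python
# Toy model of how a 4-bucket clustered table distributes rows: each
# row goes to bucket hash(student_id) % 4, and rows are kept sorted
# inside each bucket, giving one sorted file per bucket.

def bucketize(keys, num_buckets=4):
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        h = sum(ord(c) for c in k)   # stand-in hash, not Hive's
        buckets[h % num_buckets].append(k)
    return [sorted(b) for b in buckets]

buckets = bucketize(["s001", "s002", "s003", "s004", "s005"])
# every key lands in exactly one of the 4 buckets, sorted within it
```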
Exporting data
- Use insert to export the student_info table to a local path
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data' row format delimited fields terminated by '\t' select * from student_info;
- A common way to export data locally
hive -e "select * from rel.student_info" > ./student_info_data.txt
The default column separator in the result is '\t'.
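A file exported this way is plain tab-separated text, so reading it back is a one-line split per row (the sample rows below reuse the query output shown earlier):

```python
# Parse the tab-separated text that `hive -e "..." > file` writes out.

def parse_export(text):
    """Split a Hive text export into rows of string columns."""
    return [line.split('\t') for line in text.splitlines() if line]

rows = parse_export("1\txiaoming\t20\t11\n6\tzhangsan\t20\t11\n")
# -> [["1", "xiaoming", "20", "11"], ["6", "zhangsan", "20", "11"]]
```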
Joins
- join / inner join
The two tables are matched on id, and only rows with equal id values are returned. join and inner join produce the same result.
select * from a join b on a.id=b.id;
which is equivalent to
select * from a inner join b on a.id=b.id;
- full outer join / full join
The tables are matched on id, and all rows from both tables are returned.
select * from a full join b on a.id=b.id;
- left join
All left-table rows are kept; right-table columns are NULL for rows with no match.
select * from a left join b on a.id=b.id;
- right join
All right-table rows are kept; left-table columns are NULL for rows with no match.
select * from a right join b on a.id=b.id;
- left semi join
A left semi join implements IN/EXISTS-style query semantics, returning the left-table rows that have a match. Hive does not support the IN ... EXISTS subquery forms of relational databases, and it currently has no right semi join.
select a.id, a.name from a where a.id in (select b.id from b);
is written in Hive as:
select a.id,a.name from a left semi join b on a.id = b.id;
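Left semi join semantics are easy to state in Python: each left row is emitted at most once if any matching right row exists, the same result as the IN subquery (the sample rows are invented for illustration):

```python
# Sketch of left semi join: keep each left row at most once when the
# right table contains its key, regardless of how many matches exist.

def left_semi_join(left, right, key):
    right_keys = {r[key] for r in right}
    return [row for row in left if row[key] in right_keys]

a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 1}, {"id": 1}, {"id": 3}]
# left_semi_join(a, b, "id") keeps only the id=1 row, and only once
```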
- map side join
The small table is loaded via the distributed cache into every map task and the join is completed on the map side; the map output does not need to be copied to a reduce stage for joining, which cuts the cost of moving data between nodes. It is one way to mitigate data skew in multi-table joins. In a multi-table join where only one table is large and the others are all small, the join is converted into a map-only job; the run log then shows "Number of reduce tasks is set to 0 since there's no reduce operator".
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
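The broadcast-hash mechanism behind a map-side join can be sketched as follows (sample rows are invented; this models the idea, not Hive's implementation):

```python
# Sketch of a map-side (broadcast hash) join: the small table is loaded
# into an in-memory hash map, and each big-table row is joined during
# the map phase, so no reduce/shuffle step is needed.

def map_side_join(big_rows, small_rows, key):
    small_index = {}
    for r in small_rows:                      # "broadcast" the small table
        small_index.setdefault(r[key], []).append(r)
    out = []
    for row in big_rows:                      # map phase: probe the hash map
        for match in small_index.get(row[key], []):
            out.append({**row, **match})
    return out

big = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 1, "name": "z"}]
small = [{"id": 1, "dept": "eng"}]
# only the id=1 rows of the big table find a match
```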
Hive built-in functions
- case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
If a equals b, return c; if a equals d, return e; otherwise return f.
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
If a is TRUE, return b; if c is TRUE, return d; otherwise return e.
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
Custom UDFs
When Hive's built-in functions cannot express your processing logic, consider a user-defined function (UDF). A UDF operates on a single row and produces a single row of output.
Steps:
- Write a Java class that extends UDF and overloads the evaluate method
- Package it as a jar and upload it to the server
- Add the jar to Hive's classpath when using it
hive> add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
- Create a temporary function bound to the Java class
hive> create temporary function age_partition as 'cn.hadoop.udf.AgePartitionFunction';
- The custom function can then be used in HQL
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);
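The AgePartitionFunction source is not shown here; a hypothetical Python rendering of what such an age-bucketing UDF might compute looks like this (the bucket boundaries and labels are invented for illustration):

```python
# Hypothetical logic for an age-partitioning UDF: map an age to a
# bucket label so the query above can group scores by age range.
# The boundaries below are assumptions, not from the original jar.

def age_partition(age):
    if age < 20:
        return "0-20"
    if age < 30:
        return "20-30"
    return "30+"
```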
Problems encountered installing and using Hive
- Creating a table fails
Cause: MySQL character set. The hive database in MySQL must use the latin1 character set.
- Dropping a table hangs
Cause: also the character set. The tables were created while the MySQL charset was still utf-8, and the charset was changed to latin1 afterwards.
Fix: drop the hive database in MySQL, recreate it, and set the character set before creating any tables.