
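Hive is a SQL parsing engine: it translates SQL statements into MapReduce (MR) jobs and runs them on Hadoop.

MySQL stores data itself, whereas Hive does not. A Hive table is purely logical: it is only a table definition, that is, the table's metadata. The actual data sits on Hadoop's disks (HDFS).

Hive workloads are read-heavy and write-light. Rewriting or deleting individual rows is not supported here, so the only way to delete data is to drop the whole table.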

When data to be imported into Hive contains '\n' inside a text field, Hive breaks the record at every '\n', and the rows end up misaligned.
How can this be handled?
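
One common workaround, offered here as an assumption rather than as the original author's answer, is to clean the text before loading it, replacing line breaks inside a field with a space. The helper name sanitize_field and the sample record are hypothetical.

```python
def sanitize_field(text, replacement=" "):
    """Replace line breaks embedded in a field so each record stays on one line."""
    # Handle Windows (\r\n), classic Mac (\r) and Unix (\n) line endings.
    return (
        text.replace("\r\n", replacement)
        .replace("\r", replacement)
        .replace("\n", replacement)
    )


# Hypothetical two-column record whose second field contains a line break.
record = ["42", "first line\nsecond line"]
clean_line = "\t".join(sanitize_field(field) for field in record)
print(clean_line)
```

The cleaned line can then be written to the file that is loaded into Hive, so one record maps to exactly one line.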

MapReduce in Hive


select word, count(*)
from (
select explode(split(sentence,' ')) as word from article_1
) t
group by word
 
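Explanation:
select explode(split(sentence,' ')) as word from article: this is the map step
explode(): expands an array into one row per element
split(sentence,' '): splits the contents of the sentence column on spaces and returns an array of words
as word: names the newly generated column word
t: the alias of the derived table, which is a temporary table (the syntax requires a table after from)
select word, count(*)
from () t 
group by  word
--
group by word: aggregates on word; this is the reduce step
count(*): counts the rows in each group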
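
The same map and reduce steps can be sketched in plain Python. This is only an illustration of the idea, not how Hive executes the query; the function names and the two sample sentences are made up, and Python's split(" ") is a literal split whereas Hive's split takes a regular expression.

```python
from collections import Counter


def explode_split(sentences):
    """Map step: split each sentence on spaces and emit one word per row."""
    for sentence in sentences:
        for word in sentence.split(" "):
            yield word


def group_count(words):
    """Reduce step: group identical words and count the rows in each group."""
    return Counter(words)


# Hypothetical stand-in for the article table (one sentence per row).
article = ["the man of property", "the man"]
counts = group_count(explode_split(article))
print(counts.most_common())
```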

Test:
select explode(split(sentence,' ')) as word from article limit 30

select word, count(1) as cnt
from (
select explode(split(sentence,' ')) as word from article
) t
group by word

Hive architecture
Data storage:
Hive data is stored as files under a designated HDFS directory
Hive statements are compiled into a query plan, which MapReduce then executes
Statement translation
Parser: generates the abstract syntax tree
Syntax analyzer: validates the query statement
Logical plan generator (including the optimizer): generates the operator tree
Query plan generator: converts the plan into map-reduce tasks
User interfaces
CLI: starting it also starts a copy of Hive
JDBC: a Hive client; users connect to Hive Server
WUI: access Hive through a browser

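A Hive table is essentially a directory in Hadoop.

Ways to create a table in Hive:
Create an internal table: create table <internal table>
Create an external table: create external table ... location 'hdfs_path' (the path must be a directory)

When data is loaded into an external table, it is not moved into Hive's own warehouse directory; in other words, the external table's data is not managed by Hive itself. An internal table is different.
When an internal table is dropped, Hive deletes both the table's metadata and its data; when an external table is dropped, Hive deletes only the metadata and leaves the data in place.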
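
The difference in drop behavior can be modeled with a toy sketch. The two dictionaries stand in for the metastore and for HDFS; the shortened paths and the drop_table helper are hypothetical and only mirror the rule described above.

```python
def drop_table(name, metastore, hdfs):
    """Remove a table's metadata; remove its files too unless it is external."""
    meta = metastore.pop(name)
    if not meta["external"]:
        hdfs.pop(meta["location"], None)


# Toy metastore: one internal table and one external table.
metastore = {
    "article_1": {"external": False, "location": "/warehouse/badou.db/article_1"},
    "article_2": {"external": True, "location": "/data/ext"},
}
# Toy file system: directory -> files.
hdfs = {
    "/warehouse/badou.db/article_1": ["The_Man_of_Property.txt"],
    "/data/ext": ["The_Man_of_Property.txt"],
}

drop_table("article_1", metastore, hdfs)
drop_table("article_2", metastore, hdfs)
print(sorted(hdfs))  # only the external location is left
```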

============================
Hands-on section

List the databases
show databases;

List the tables
show tables;

Create the database user_base_1:
CREATE DATABASE IF NOT EXISTS user_base_1;

MapReduce in Hive:
Code:
select word, count(1) as cnt
from (
select explode(split(sentence,' ')) as word from article
) t
group by word
order by cnt desc
limit 100

Notes:
1. order by sorts globally, so it can only run in a single reduce
2. order by is a task of its own, so the code above launches two jobs: the first job has a map and a reduce, and the second job has only a single reduce
3. There is also a dependency: the second job can only run after the first job has finished
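
The two dependent jobs can be pictured with a small Python sketch. It is a simplified model under made-up names and sample data: the first function plays the counting job, the second plays the single-reducer sort that needs the complete output of the first.

```python
from collections import Counter


def job1_count(words):
    """Job 1: the map emits words and the reduce counts them per key."""
    return Counter(words)


def job2_order_limit(counts, limit):
    """Job 2: one reducer sees every (word, cnt) pair, so the ordering is global."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]


words = "a b a c b a".split(" ")
# Job 2 cannot start until job 1 has produced all of its counts.
top = job2_order_limit(job1_count(words), limit=2)
print(top)  # [('a', 3), ('b', 2)]
```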

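Writing SQL costs very little. Large companies usually have an internal web interface where you can write SQL statements directly, with autocomplete, which is very convenient. Once you are used to Hive, there is no going back to writing MapReduce in Python.

SQL trains data thinking and data-processing skills, and it needs regular practice.

Hive SQL is highly extensible: it supports UDF/UDAF/UDTF, that is, user-defined functions.

Hive architecture:
Think of it as running a C program.
First the statement is compiled: the syntax is checked, and the metadata Hive needs to fetch is checked. Then the Hive code is converted into MapReduce tasks, the tasks run on Hadoop, and finally the result data is produced.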

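Partition
A Hive table name is a directory. The benefit: partition by time or date, one partition per day, so each day's data goes into its own directory, which amounts to splitting the data by date.
To query only yesterday's data, only the data under the directory for yesterday's date needs to be read.
Bucket
With 10 buckets the data is split into 10 parts, and taking 1/10 means reading just one part. Because the split happens through the shuffle, the parts may not be exactly equal in size.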

Creating a table only creates the metadata plus a directory named after the table under HDFS; the directory contains no data yet
create table article(sentence string)
row format delimited fields terminated by '\n';

Load data from the local filesystem. This works much like using hadoop fs -put to copy the file at the local path into /hive/warehouse/badou.db
load data local inpath 'localpath' into table article;

View the data:
select * from article limit 3;

View the data in Hadoop:
 hadoop fs -ls /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/article_1

-rwxr-xr-x   3 root supergroup     632207 2019-03-15 22:27 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/article_1/The_Man_of_Property.txt

External table
create external table article_2(sentence string)
row format delimited fields terminated by '\n'
stored as textfile -- store as plain text
location '/data/ext';
No data for the new external table appears under the badou.db directory (because it is an external table)
The data at the external source is unchanged
drop table article_2;
-- The metadata is deleted, but the data under the HDFS path /data/ext still exists, much like a symbolic link

Creating a partitioned table
create table art_dt(sentence string)
partitioned by (dt string)
row format delimited fields terminated by '\n';

Insert data from an existing Hive table into the new (partitioned) table: take 100 rows from article and insert them into art_dt
insert overwrite table art_dt partition(dt='20190329')
select * from article limit 100;

Under the database's directory in Hive's HDFS location: badou.db/art_dt/dt=20190329

select * from art_dt  limit 10;
Analysis: this query scans everything, that is, every partition of the table. For example, with only two partitions it is equivalent to:
select * from art_dt  where dt between  '20190328' and '20190329' limit 10;
If the table has a very large number of partitions, the query becomes very slow.
If you know which partition the data is in, go straight to that partition and the query is far more efficient.
select * from art_dt  where dt = '20190329' limit 10;
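
Why a partition filter helps can be shown with a toy model in which each partition is a directory. The dictionary, the row values and the scan helper are all hypothetical; the point is only that a dt filter narrows the set of directories that get read.

```python
# Toy table: partition directory -> rows stored in it.
partitions = {
    "dt=20190328": ["row1", "row2"],
    "dt=20190329": ["row3"],
}


def scan(partitions, dt=None):
    """Return (directories read, rows); a dt filter limits the scan to one directory."""
    wanted = [p for p in partitions if dt is None or p == f"dt={dt}"]
    rows = [row for p in wanted for row in partitions[p]]
    return wanted, rows


print(scan(partitions))                 # reads both directories
print(scan(partitions, dt="20190329"))  # reads one directory
```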

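How do partitions come about in practice? What kind of data are they used for?
Every day produces records of users browsing, clicking, favoriting and purchasing.
Store the data day by day, partitioning by day
--
Distinguish by data source: app/m/pc
For example: logs/dt=20190329/type=app
is the storage path for the app-side log data of the logs table on 20190329
logs/dt=20190329/type=app
logs/dt=20190329/type=m
logs/dt=20190329/type=pc
--
When the data volume is too large, besides splitting by day, the data can also be split by the three client types
A database stores data such as user attributes: age, gender, blog and so on
New users sign up and existing users change their information every day; keeping dt=20190328, dt=20190329 and so on, each with the full information, is far too redundant
Solution:
Overwrite with 7 partitions: overwrite every day, so that dt=20190328 contains the information of all users up to that day (the full data set as of that day)
Keep 7 partitions, that is, 7 redundant copies, to guard against loss (not loss from a machine failure, but loss from an operator mistake, which is a blame nobody wants to carry). There is still redundancy, but only 7 copies, and each day the data older than 7 days is deleted.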
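
The 7-day retention rule can be sketched as a small helper that lists which dt partitions are due for deletion. The function name and the sample dates are assumptions; actually dropping the partitions in Hive is a separate step not shown here.

```python
from datetime import date, timedelta


def partitions_to_drop(existing, today, keep_days=7):
    """Return the dt partitions older than the last keep_days days, today included."""
    oldest_kept = today - timedelta(days=keep_days - 1)
    keep_from = oldest_kept.strftime("%Y%m%d")
    # yyyymmdd strings sort in date order, so string comparison is enough.
    return sorted(dt for dt in existing if dt < keep_from)


# Hypothetical table with ten daily partitions: 20190320 .. 20190329.
existing = [(date(2019, 3, 20) + timedelta(days=i)).strftime("%Y%m%d") for i in range(10)]
print(partitions_to_drop(existing, today=date(2019, 3, 29)))
```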

Bucketing

create table udata(
user_id string,
item_id string,
rating string,
`timestamp` string
) row format delimited fields terminated by '\t';
load data local inpath '/home/badou/Documents/u.data' into table udata;

# Print column names
set hive.cli.print.header=true;

bucket
A Hive table can be split into partitions; tables and partitions can be further divided into buckets with 'CLUSTERED BY', and the data inside a bucket can be sorted with 'sort by'.
sort by sorts within a bucket; order by sorts globally.
Use: data sampling

# Create the table
create table bucket_users (
user_id int,
item_id string,
rating string,
`timestamp` string
) clustered by(user_id) into 4 buckets;

# Insert data
# Because the data has to be split into 4 buckets, bucketing must be enforced; otherwise, depending on the data volume, only one reduce would be used
set hive.enforce.bucketing = true;

insert overwrite table bucket_users
select cast(user_id as int ) as user_id, item_id, rating, `timestamp` from udata;

# Check the result: the table now has 4 bucket files
$ hadoop fs -ls /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users

-rwxr-xr-x   3 root supergroup     466998 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000000_0
-rwxr-xr-x   3 root supergroup     497952 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000001_0
-rwxr-xr-x   3 root supergroup     522246 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000002_0
-rwxr-xr-x   3 root supergroup     491977 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000003_0

# Sampling
The tablesample() function
Format: tablesample(bucket x out of y)
For example, with 32 buckets, bucket 3 out of 16 means 32/16=2, so two buckets are read, starting from the 3rd bucket: 3%16=3 and 19%16=3, so the result is the data of bucket 3 and bucket 19. That is how sampling is achieved.
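
The arithmetic in this example can be checked with a short helper. It is a sketch that covers only the case where y divides the number of buckets, and the name sampled_buckets is made up.

```python
def sampled_buckets(total_buckets, x, y):
    """Buckets (1-based) read by tablesample(bucket x out of y) when y divides total_buckets."""
    if total_buckets % y != 0:
        raise ValueError("this sketch only covers y dividing the bucket count")
    # Start at bucket x and step by y: every bucket b with b % y == x % y.
    return list(range(x, total_buckets + 1, y))


print(sampled_buckets(32, 3, 16))  # [3, 19]
print(sampled_buckets(4, 1, 4))    # [1]
```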
 
# View the data of one bucket
select * from bucket_users tablesample(bucket 1 out of 4 on user_id);

# Count the rows in one bucket
select count(*) from bucket_users tablesample(bucket 1 out of 4 on user_id);
Result: 23572 (out of 100000 rows in total)

select count(*) from bucket_users tablesample(bucket 2 out of 4 on user_id);
Result: 25159 (out of 100000 rows in total)

Bucketing goes through a partitioning step, so the split is not perfectly even.
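
The uneven bucket sizes follow from how rows are assigned. The sketch below assumes a row goes to the bucket given by its integer key modulo the bucket count; the rating counts per user are invented to show that users with more rows make their bucket larger.

```python
from collections import Counter


def bucket_of(user_id, num_buckets=4):
    """Assign a row to a bucket by taking the integer key modulo the bucket count."""
    return user_id % num_buckets


# Hypothetical rows: user 1 has 5 ratings, user 2 has 1, users 3 and 4 have 2 each.
rows = [1] * 5 + [2] * 1 + [3] * 2 + [4] * 2
sizes = Counter(bucket_of(user_id) for user_id in rows)
print(sorted(sizes.items()))  # [(0, 2), (1, 5), (2, 1), (3, 2)]
```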

# Sample the data and insert it into a newly created table
$ create table tmp as select * from bucket_users tablesample(bucket 1 out of 4 on user_id);

hive join in MR


# Historical behavior data for the products in each order
create table order_product_prior(
order_id string,
product_id string,
add_to_cart string,  -- added to cart
reordered string  -- bought again
) row format delimited fields terminated by ',';
load data local inpath '/home/badou/Documents/data/order_data/order_products__prior.csv' into table order_product_prior;

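# Orders table
# order_number: the purchase sequence number of the order
# eval_set: marks whether the row belongs to the training set or the test set
# order_dow: dow = day of week, the day the order was placed
# order_hour_of_day: the hour of the day the order was placed
# days_since_prior_order: days since the previous order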
create table orders (
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string
) row format delimited fields terminated by ',';
load data local inpath '/home/badou/Documents/data/order_data/orders.csv' into table orders;

$ select * from order_product_prior limit 10;
order_id        product_id      add_to_cart_order       reordered
2       33120   1       1
2       28985   2       1
2       9327    3       0
2       45918   4       1
2       30035   5       0
2       17794   6       1
2       40141   7       1
2       1819    8       1
2       43668   9       0

$ select * from orders limit 10;
order_id        user_id eval_set        order_number    order_dow       order_hour_of_day       days_since_prior_order
2539329 1       prior   1       2       08
2398795 1       prior   2       3       07      15.0
473747  1       prior   3       3       12      21.0
2254736 1       prior   4       4       07      29.0
431534  1       prior   5       4       15      28.0
3367565 1       prior   6       2       07      19.0
550135  1       prior   7       1       09      20.0
3108588 1       prior   8       1       14      14.0
2295261 1       prior   9       1       16      0.0

Requirement: count how many products each user has bought
1. Number of products per order
select order_id, count(1) as prod_cnt 
from order_product_prior
group by order_id
order by prod_cnt desc
limit 30;

2. Relating users to product counts
Join to carry each order's product count over to the user
table1: order_id  prod_cnt
table2: order_id user_id
table1 + table2 => order_id, user_id, prod_cnt
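
How such a join can run as a MapReduce job is sketched below as a reduce-side join: both inputs are grouped by order_id, then rows that share a key are paired. This is a simplified model, not Hive's actual implementation, and the sample rows and user ids are invented.

```python
from collections import defaultdict


def reduce_side_join(order_counts, orders):
    """Group both inputs by order_id (the shuffle), then pair rows that share a key."""
    grouped = defaultdict(lambda: {"cnt": [], "user": []})
    for order_id, prod_cnt in order_counts:  # table1: order_id, prod_cnt
        grouped[order_id]["cnt"].append(prod_cnt)
    for order_id, user_id in orders:         # table2: order_id, user_id
        grouped[order_id]["user"].append(user_id)
    # Inner join: keys present on only one side produce no output.
    return [
        (order_id, user_id, prod_cnt)
        for order_id, sides in grouped.items()
        for user_id in sides["user"]
        for prod_cnt in sides["cnt"]
    ]


table1 = [("2", 9), ("3", 8)]
table2 = [("2", "u1"), ("4", "u2")]
print(reduce_side_join(table1, table2))  # [('2', 'u1', 9)]
```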

-- how many products (prod_cnt) this user bought in this order
select 
t2.order_id as order_id, 
t2.user_id as user_id,
t1.prod_cnt as prod_cnt 
from orders t2
join
(select order_id, count(1) as prod_cnt
from order_product_prior
group by order_id) t1
on t2.order_id=t1.order_id
limit 30;

3. Total number of products across all of a user's orders
select 
user_id,
sum(prod_cnt) as sum_prod_cnt
from
(select
t2.order_id as order_id,
t2.user_id as user_id,
t1.prod_cnt as prod_cnt
from orders t2
join
(select order_id, count(1) as prod_cnt
from order_product_prior
group by order_id) t1
on t2.order_id=t1.order_id) t12
group by user_id
order by sum_prod_cnt desc
limit 30;

In short:
select x from (select x from t1) join (select x from t2) on x 
group by x
order by x
limit n

SQL scripts can run to more than a thousand lines.
This one is nothing by comparison.

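Hive optimization

Merge small files to reduce the number of maps?

Increase the number of maps where appropriate?
set mapred.map.tasks = 10;

Map-side optimization is mostly about the number of files and comes up less often. Most of the work is on the reduce side, for example data skew, the most important case.

Set the amount of data each reduce task handles
hive.exec.reducers.bytes.per.reducer
Adjust the number of reduces
Set the number of reducers
set mapred.reduce.tasks=10;
Cases that run in a single reduce
A global sort runs in a single reduce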
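
How the per-reducer data volume relates to the reducer count can be sketched as follows. This assumes the count is estimated as the input size divided by the bytes-per-reducer setting, rounded up and capped by a maximum; the function name and the numbers are illustrative, not values read from a real cluster.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Estimate the reducer count from the input size, capped at max_reducers."""
    needed = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, needed))


MB = 1024 * 1024
# Hypothetical job: 1000 MB of input, 256 MB per reducer, at most 1009 reducers.
print(estimate_reducers(1000 * MB, 256 * MB, 1009))  # 4
```

Setting mapred.reduce.tasks explicitly, as in the line above, bypasses this kind of estimate.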
Cartesian product:
select
t1.u1 as u1,
t2.u2 as u2
from
(select user_id as u1 from tmp) t1
join
(select user_id as u2 from tmp) t2;

A Cartesian product makes the data grow very quickly and should be avoided where possible; it runs in a single reduce.
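
The growth is easy to see in a small sketch: joining without a condition pairs every row on the left with every row on the right, so n rows joined with m rows give n * m rows. The helper name and the three sample user ids are made up.

```python
from itertools import product


def cross_join(left, right):
    """Pair every row of left with every row of right (a join with no condition)."""
    return list(product(left, right))


users = ["u1", "u2", "u3"]
pairs = cross_join(users, users)
print(len(pairs))  # 9
```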
