一、應用場景:
- 用于分區(qū)排序
- 動態(tài)Group By
- top N
- 累計計算
二、函數(shù)介紹
1、窗口函數(shù):
first_value:取分組內排序后,截止到當前行,第一個值;
last_value:取分組內排序后,截止到當前行,最后一個值;
lead(col, n, default):用于統(tǒng)計窗口內往下第n行值。第一個參數(shù)為列名,第二個參數(shù)為往下第n行(可選,默認為1),第三個參數(shù)為默認值(當往下第n行為null時,取默認值,如不指定則為null);
lag(col, n, default):與lead相反,用于統(tǒng)計窗口內往上第n行值。第一個參數(shù)為列名,第二個參數(shù)為往上第n行(可選,默認為1),第三個參數(shù)為默認值(當往上第n行為null時,取默認值,如不指定,則為null)。
2、over從句
1)使用標準的聚合函數(shù)count、sum、min、max、avg
2)使用partition by語句,使用一個或多個原始列
3)使用partition by與order by語句,使用一個或多個分區(qū)或者排序列
4)使用窗口規(guī)范,窗口規(guī)范支持以下格式:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
當ORDER BY后面缺少窗口從句條件,窗口規(guī)范默認是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
當ORDER BY和窗口從句都缺失, 窗口規(guī)范默認是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
OVER從句支持以下函數(shù), 但是并不支持和窗口一起使用它們。
Ranking函數(shù): Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead 和 Lag 函數(shù).
3、分析函數(shù)
row_number():從1開始,按照順序生成組內記錄的序列,比如按照pv降序排列生成分組內的pv排名;獲取分組內的top1記錄;獲取一個session內的第一條記錄等等。
rank():生成數(shù)據項在分組內的排名,排名相等會在名次中留下空位。
dense_rank():生成數(shù)據項在分組內的排名,排名相對不會在名次中留下空位。
cume_dist:小于等于當前值的行數(shù)/分組內總行數(shù)。比如,統(tǒng)計小于等于當前薪資的人數(shù)占總人數(shù)的比例。
percent_rank: (分組內當前行的rank值-1)/(分組內總行數(shù)-1)。
ntile(n):用于將分組數(shù)據按照順序切分成n片,返回當前切片值,如果切片不均勻,默認增加第一個切片的分布。ntile不支持rows between,比如ntile(2) over(partition by cookieied order by createtime rows between 3 preceding and current row)。
--- Hive2.1.0及以后支持Distinct
COUNT(DISTINCT a) OVER (PARTITION BY c)
--- Hive 2.2.0中在使用ORDER BY和窗口限制時支持distinct
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
--- Hive2.1.0及以后支持在OVER從句中支持聚合函數(shù)
SELECT rank() OVER (ORDER BY sum(b))
FROM t
GROUP BY a
;
4、測試數(shù)據集

-- COUNT、SUM、MIN、MAX、AVG
select
user_id,
user_type,
sales,
--默認為從起點到當前行
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc) AS sales_1,
--從起點到當前行,結果與sales_1不同。
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sales_2,
--當前行+往前3行
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sales_3,
--當前行+往前3行+往后1行
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sales_4,
--當前行+往后所有行
sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sales_5,
--分組內所有行
SUM(sales) OVER(PARTITION BY user_type) AS sales_6
from
order_detail
order by
user_type,
sales,
user_id
;
-- 注意:
-- 輸出結果和order by相關,默認為升序;
-- 如果不指定rows between,默認為起點到當前行;
-- 如果不指定order by,則將分組內所有值累加;

關鍵是理解
ROWS BETWEEN含義,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:當前行
UNBOUNDED:無界限(起點或終點)
UNBOUNDED PRECEDING:表示從前面的起點
UNBOUNDED FOLLOWING:表示到后面的終點
其他COUNT、AVG,MIN,MAX,和SUM用法一樣。
-- first_value與last_value
select
user_id,
user_type,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
order_detail;

-- lead與lag
select
user_id,device_id,
lead(device_id) over (order by sales) as default_after_one_line,
lag(device_id) over (order by sales) as default_before_one_line,
lead(device_id,2) over (order by sales) as after_two_line,
lag(device_id,2,'abc') over (order by sales) as before_two_line
from
order_detail;

-- RANK、ROW_NUMBER、DENSE_RANK
select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
order_detail;

-- NTILE
select
user_type,sales,
--分組內將數(shù)據分成2片
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
--分組內將數(shù)據分成3片
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
--分組內將數(shù)據分成4片
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
--將所有數(shù)據分成4片
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
order_detail
order by
user_type,
sales

--取sale前20%的用戶ID
select
user_id
from
(
select
user_id,
NTILE(5) OVER(ORDER BY sales desc) AS nt
from
order_detail
)A
where nt=1;
-- CUME_DIST、PERCENT_RANK
select
user_id,user_type,sales,
--沒有partition,所有數(shù)據均為1組
CUME_DIST() OVER(ORDER BY sales) AS cd1,
--按照user_type進行分組
CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
order_detail;

select
user_type,sales
--分組內總行數(shù)
SUM(1) OVER(PARTITION BY user_type) AS s,
--RANK值
RANK() OVER(ORDER BY sales) AS r,
PERCENT_RANK() OVER(ORDER BY sales) AS pr,
--分組內
PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
order_detail;
