Hive開窗函數(shù)

一、應用場景:

  • 用于分區(qū)排序
  • 動態(tài)Group By
  • top N
  • 累計計算

二、函數(shù)介紹

1、窗口函數(shù):

first_value:取分組內排序后,截止到當前行,第一個值;
last_value:取分組內排序后,截止到當前行,最后一個值;
lead(col, n, default):用于統(tǒng)計窗口內往下第n行值。第一個參數(shù)為列名,第二個參數(shù)為往下第n行(可選,默認為1),第三個參數(shù)為默認值(當往下第n行為null時,取默認值,如不指定則為null);
lag(col, n, default):與lead相反,用于統(tǒng)計窗口內往上第n行值。第一個參數(shù)為列名,第二個參數(shù)為往上第n行(可選,默認為1),第三個參數(shù)為默認值(當往上第n行為null時,取默認值,如不指定,則為null)。

2、over從句

1)使用標準的聚合函數(shù)count、sum、min、max、avg
2)使用partition by語句,使用一個或多個原始列
3)使用partition byorder by語句,使用一個或多個分區(qū)或者排序列
4)使用窗口規(guī)范,窗口規(guī)范支持以下格式:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

ORDER BY后面缺少窗口從句條件,窗口規(guī)范默認是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

ORDER BY和窗口從句都缺失, 窗口規(guī)范默認是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

OVER從句支持以下函數(shù), 但是并不支持和窗口一起使用它們。
Ranking函數(shù): Rank, NTile, DenseRank, CumeDist, PercentRank.
LeadLag 函數(shù).

3、分析函數(shù)

row_number():從1開始,按照順序生成組內記錄的序列,比如按照pv降序排列生成分組內的pv排名;獲取分組內的top1記錄;獲取一個session內的第一條記錄等等。
rank():生成數(shù)據項在分組內的排名,排名相等會在名次中留下空位。
dense_rank():生成數(shù)據項在分組內的排名,排名相對不會在名次中留下空位。
cume_dist:小于等于當前值的行數(shù)/分組內總行數(shù)。比如,統(tǒng)計小于等于當前薪資的人數(shù)占總人數(shù)的比例。
percent_rank: (分組內當前行的rank值-1)/(分組內總行數(shù)-1)。
ntile(n):用于將分組數(shù)據按照順序切分成n片,返回當前切片值,如果切片不均勻,默認增加第一個切片的分布。ntile不支持rows between,比如ntile(2) over(partition by cookieied order by createtime rows between 3 preceding and current row)。

--- Hive2.1.0及以后支持Distinct
COUNT(DISTINCT a) OVER (PARTITION BY c)

--- Hive 2.2.0中在使用ORDER BY和窗口限制時支持distinct
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

--- Hive2.1.0及以后支持在OVER從句中支持聚合函數(shù)
SELECT rank() OVER (ORDER BY sum(b))
FROM t
GROUP BY a
;

4、測試數(shù)據集


-- COUNT、SUM、MIN、MAX、AVG
select 
    user_id,
    user_type,
    sales,
    --默認為從起點到當前行
    sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc) AS sales_1,
    --從起點到當前行,結果與sales_1不同。
    sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sales_2,
    --當前行+往前3行
    sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sales_3,
    --當前行+往前3行+往后1行
    sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sales_4,
    --當前行+往后所有行  
    sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sales_5,
    --分組內所有行
    SUM(sales) OVER(PARTITION BY user_type) AS sales_6                          
from 
    order_detail
order by 
    user_type,
    sales,
    user_id
;
-- 注意:
-- 輸出結果和order by相關,默認為升序;
-- 如果不指定rows between,默認為起點到當前行;
-- 如果不指定order by,則將分組內所有值累加;

關鍵是理解ROWS BETWEEN含義,也叫做WINDOW子句
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:當前行
UNBOUNDED:無界限(起點或終點)
UNBOUNDED PRECEDING:表示從前面的起點
UNBOUNDED FOLLOWING:表示到后面的終點
其他COUNT、AVG,MIN,MAX,和SUM用法一樣。

--  first_value與last_value
select 
    user_id,
    user_type,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,  
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from 
    order_detail;
-- lead與lag
select 
    user_id,device_id,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from 
    order_detail;
-- RANK、ROW_NUMBER、DENSE_RANK
select 
    user_id,user_type,sales,
    RANK() over (partition by user_type order by sales desc) as r,
    ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
    DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
    order_detail;  
-- NTILE

select 
    user_type,sales,
    --分組內將數(shù)據分成2片
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --分組內將數(shù)據分成3片    
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --分組內將數(shù)據分成4片    
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --將所有數(shù)據分成4片
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from 
    order_detail
order by 
    user_type,
    sales
--取sale前20%的用戶ID
select
    user_id
from
(
    select 
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from 
        order_detail
)A
where nt=1;
-- CUME_DIST、PERCENT_RANK 

select 
user_id,user_type,sales,
--沒有partition,所有數(shù)據均為1組
CUME_DIST() OVER(ORDER BY sales) AS cd1,
--按照user_type進行分組
CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2 
from 
order_detail;   

select 
user_type,sales
--分組內總行數(shù)      
SUM(1) OVER(PARTITION BY user_type) AS s, 
--RANK值  
RANK() OVER(ORDER BY sales) AS r,    
PERCENT_RANK() OVER(ORDER BY sales) AS pr,
--分組內     
PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg 
from 
order_detail; 
最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

友情鏈接更多精彩內容