Hive分析函數(shù)和窗口函數(shù)

Hive分析函數(shù)和窗口函數(shù)

在Hive 0.11之后支持的,掃描多個輸入的行計算每行的結(jié)果。通常和OVER,PARTITION BY, ORDER BY, WINDOWING配合使用。和傳統(tǒng)的分組結(jié)果不一樣,傳統(tǒng)的結(jié)果每組中只有一個結(jié)果。分析函數(shù)的結(jié)果會出現(xiàn)多次,和每條記錄都連接輸出。

語法形式如下:

Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause])

窗口函數(shù)

函數(shù)名 說明
FIRST_VALUE 取出分組內(nèi)排序后,截止到當(dāng)前行,第一個值
LAST_VALUE 取出分組內(nèi)排序后,截止到當(dāng)前行,最后一個值
LEAD(col, n, DEFAULT) 用于統(tǒng)計窗口內(nèi)往下第n行的值。第一個參數(shù)為列名,第二個參數(shù)為往下第n行(可選,默認(rèn)為1),第三個參數(shù)為默認(rèn)值(當(dāng)往下第n行為NULL時,取默認(rèn)值)
LAG(col,n,DEFAULT) 與lead相反,用于統(tǒng)計窗口內(nèi)往下第n個值。第一個參數(shù)為列名,第二個參數(shù)為往上第n行(可選,默認(rèn)為1)

OVER從句

  1. 使用標(biāo)準(zhǔn)的聚合函數(shù)COUNT,SUM,MIN,MAX,AVG
  2. 使用PARTITION BY語句,使用一個或者多個原始數(shù)據(jù)類型的列
  3. 使用PARTITION BY與ORDER BY語句,使用一個或者多個數(shù)據(jù)類型的分區(qū)或者拍序列
  4. 使用窗口規(guī)范,窗口規(guī)范支持一下格式:
(ROW | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING

當(dāng)ORDER BY后面缺少窗口從句條件,窗口規(guī)范默認(rèn)是

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

當(dāng)ORDER BY和窗口從句都缺失,窗口規(guī)范默認(rèn)是:

ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

分析函數(shù)

函數(shù) 說明
ROW_NUMBER() 從1開始,按照順序,生成分組內(nèi)記錄的序列,比如,按照pv降序排列,生成分組內(nèi)每天的pv名次,ROW_NUMBER()的應(yīng)用 場景非常多,再比如,獲取分組內(nèi)排序第一的記錄,獲取一個session中的第一條refer等
RANK() 生成數(shù)據(jù)項在分組中的排名,排名相等會在名次中留下空位
DENSE_RANK() 生成數(shù)據(jù)項在分組中的排名,排名相等會在名次中不會留下空位
CUME_DIST() 小于等于當(dāng)前值的行數(shù)除以分組內(nèi)總行數(shù)。比如,統(tǒng)計小于等于當(dāng)前薪水的人數(shù)所占總?cè)藬?shù)的比例
PERCENT_RANK() 分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi) 總行數(shù)-1
NTILE(n) 用于將分組數(shù)據(jù)按照順序切分成n片,返回當(dāng)前切片值,如果切片不均勻,默認(rèn)增加第一個切片的分布。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PERCEDING AND CURRENT ROW)

Hive2.1.0及以后版本支持distinct

在聚合函數(shù)(sum, count, avg)中支持distinct,但是在order by或者 窗口限制中不支持。
conut(distinct a) over(partition by c)

Hive2.1.0以后支持在OVER從句中支持聚合函數(shù)

select rank() over(order by sum(b))

Hive2.2.0中在使用ORDER BY和窗口限制時支持distinct

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

通過實例深入理解窗口函數(shù)和分析函數(shù)

COUNT、SUM、MIN、MAX、AVG案例分析


## 創(chuàng)建數(shù)據(jù)表
create table orders(
    user_id string,
    device_id string,
    user_type string,
    price float,
    sales int);

## 添加數(shù)據(jù)orders.txt
zhangsa test1   new     67.1    2
lisi    test2   old     43.32   1
wanger  test3   new     88.88   3
liliu   test4   new     66.0    1
tom     test5   new     54.32   1
tomas   test6   old     77.77   2
tomson  test7   old     88.44   3
tom1    test8   new     56.55   6
tom2    test9   new     88.88   5
tom3    test10  new     66.66   5

## 開窗函數(shù)案例
select
    user_id,
    user_type,
    sales,
    -- 默認(rèn)從起點到當(dāng)前所有重復(fù)行
    sum(sales) over(partition by user_type order by sales asc) as sales_1,
    -- 從起點到當(dāng)前所有重復(fù)行與sales_1結(jié)果相同
    sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
    -- 從起點到當(dāng)前行,結(jié)果與sale_1結(jié)果不同
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
    -- 當(dāng)前行加上往前3行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
    -- 當(dāng)前范圍往上加3行
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
    -- 當(dāng)前行+往前3行+往后1行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
    --
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
    -- 當(dāng)前行+之后所有行
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
    --
    sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
    -- 分組內(nèi)所有行
    sum(sales) over(partition by user_type) as sales_10
from
    orders
order by
    user_type,
    sales,
    user_id;

##上述查詢結(jié)果如下:

| user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  | sales_8  | sales_9  | sales_10  |
|----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| liliu    | new        | 1      | 2        | 2        | 2        | 2        | 2        | 4        | 4        | 22       | 23       | 23        |
| tom      | new        | 1      | 2        | 2        | 1        | 1        | 2        | 2        | 4        | 23       | 23       | 23        |
| zhangsa  | new        | 2      | 4        | 4        | 4        | 4        | 4        | 7        | 7        | 21       | 21       | 23        |
| wanger   | new        | 3      | 7        | 7        | 7        | 7        | 7        | 12       | 7        | 19       | 19       | 23        |
| tom2     | new        | 5      | 17       | 17       | 17       | 15       | 15       | 21       | 21       | 11       | 16       | 23        |
| tom3     | new        | 5      | 17       | 17       | 12       | 11       | 15       | 16       | 21       | 16       | 16       | 23        |
| tom1     | new        | 6      | 23       | 23       | 23       | 19       | 19       | 19       | 19       | 6        | 6        | 23        |
| lisi     | old        | 1      | 1        | 1        | 1        | 1        | 1        | 3        | 3        | 6        | 6        | 6         |
| tomas    | old        | 2      | 3        | 3        | 3        | 3        | 3        | 6        | 6        | 5        | 5        | 6         |
| tomson   | old        | 3      | 6        | 6        | 6        | 6        | 6        | 6        | 6        | 3        | 3        | 6         |

注意

  1. 結(jié)果和ORDER BY相關(guān),默認(rèn)為升序
  2. 如果不指定ROWS BETWEEN,默認(rèn)為從起點到當(dāng)前行;
  3. 如果不指定ORDER BY,則將分組內(nèi)所有值累加;
  4. PRECEDING:往前
  5. FOLLOWING:往后
  6. CURRENT ROW:當(dāng)前行
  7. UNBOUNDED:無界限(起點或終點)
  8. UNBOUNDED PRECEDING:表示從前面的起點
  9. UNBOUNDED FOLLOWING:表示到后面的終點
  10. 其他COUNT、AVG,MIN,MAX,和SUM用法一樣。

FIRST_VALUE和LAST_VALUE案例分析


select
    user_id,
    user_type,
    sales,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    orders
order by
    user_type,
    sales;

##上述查詢結(jié)果如下:

| user_id | user_type | sales | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
| ------- | --------- | ----- | ------- | -------------- | -------------- | ------------------ | ------------------ |
| tom     | new       | 1     | 1       | tom1           | tom            | tom                | liliu              |
| liliu   | new       | 1     | 2       | tom1           | tom            | tom                | liliu              |
| zhangsa | new       | 2     | 3       | tom1           | tom            | zhangsa            | zhangsa            |
| wanger  | new       | 3     | 4       | tom1           | tom            | wanger             | wanger             |
| tom3    | new       | 5     | 5       | tom1           | tom            | tom3               | tom2               |
| tom2    | new       | 5     | 6       | tom1           | tom            | tom3               | tom2               |
| tom1    | new       | 6     | 7       | tom1           | tom            | tom1               | tom1               |
| lisi    | old       | 1     | 1       | tomson         | lisi           | lisi               | lisi               |
| tomas   | old       | 2     | 2       | tomson         | lisi           | tomas              | tomas              |
| tomson  | old       | 3     | 3       | tomson         | lisi           | tomson             | tomson             |

LEAD與LAG


select
    user_id,
    device_id,
    sales,
    ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from
    orders
order by
    sales;

上述查詢結(jié)果如下

| user_id  | device_id  | sales  | row_num  | default_after_one_line  | default_before_one_line  | after_two_line  | before_two_line  |
|----------|------------|--------|----------|-------------------------|--------------------------|-----------------|------------------|
| lisi     | test2      | 1      | 3        | test6                   | test4                    | test1           | test5            |
| liliu    | test4      | 1      | 2        | test2                   | test5                    | test6           | abc              |
| tom      | test5      | 1      | 1        | test4                   | NULL                     | test2           | abc              |
| zhangsa  | test1      | 2      | 5        | test7                   | test6                    | test3           | test2            |
| tomas    | test6      | 2      | 4        | test1                   | test2                    | test7           | test4            |
| wanger   | test3      | 3      | 7        | test10                  | test7                    | test9           | test1            |
| tomson   | test7      | 3      | 6        | test3                   | test1                    | test10          | test6            |
| tom2     | test9      | 5      | 9        | test8                   | test10                   | NULL            | test3            |
| tom3     | test10     | 5      | 8        | test9                   | test3                    | test8           | test7            |
| tom1     | test8      | 6      | 10       | NULL                    | test9                    | NULL            | test10           |

RANK、ROW_NUMBER、DENSE_RANK


select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;

##上述查詢結(jié)果如下

| user_id | user_type | sales |  r  | rn  | dr  |
| ------- | --------- | ----- | --- | --- | --- |
| tom1    | new       | 6     | 1   | 1   | 1   |
| tom3    | new       | 5     | 2   | 2   | 2   |
| tom2    | new       | 5     | 2   | 3   | 2   |
| wanger  | new       | 3     | 4   | 4   | 3   |
| zhangsa | new       | 2     | 5   | 5   | 4   |
| tom     | new       | 1     | 6   | 6   | 5   |
| liliu   | new       | 1     | 6   | 7   | 5   |
| tomson  | old       | 3     | 1   | 1   | 1   |
| tomas   | old       | 2     | 2   | 2   | 2   |
| lisi    | old       | 1     | 3   | 3   | 3   |

NTILE


select
    user_type,sales,
    --分組內(nèi)將數(shù)據(jù)分成2片
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --分組內(nèi)將數(shù)據(jù)分成3片
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --分組內(nèi)將數(shù)據(jù)分成4片
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --將所有數(shù)據(jù)分成4片
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    orders
order by
    user_type,
    sales;

##上述查詢結(jié)果如下
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
| --------- | ----- | --- | --- | --- | ------- |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 2     | 1   | 1   | 2   | 2       |
| new       | 3     | 1   | 2   | 2   | 3       |
| new       | 5     | 2   | 2   | 3   | 4       |
| new       | 5     | 2   | 3   | 3   | 3       |
| new       | 6     | 2   | 3   | 4   | 4       |
| old       | 1     | 1   | 1   | 1   | 1       |
| old       | 2     | 1   | 2   | 2   | 2       |
| old       | 3     | 2   | 3   | 3   | 2       |

求取sale前20%的用戶ID

select
    user_id
from
(
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from
        orders
)A
where nt=1;
## 結(jié)果如下
+----------+
| user_id  |
+----------+
| tom1     |
| tom3     |
+----------+

CUME_DIST、PERCENT_RANK


select
    user_id,user_type,sales,
    --沒有partition,所有數(shù)據(jù)均為1組
    CUME_DIST() OVER(ORDER BY sales) AS cd1,
    --按照user_type進(jìn)行分組
    CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
    orders;

## 上述結(jié)果如下
+----------+------------+--------+------+----------------------+--+
| user_id  | user_type  | sales  | cd1  |         cd2          |
+----------+------------+--------+------+----------------------+--+
| liliu    | new        | 1      | 0.3  | 0.2857142857142857   |
| tom      | new        | 1      | 0.3  | 0.2857142857142857   |
| zhangsa  | new        | 2      | 0.5  | 0.42857142857142855  |
| wanger   | new        | 3      | 0.7  | 0.5714285714285714   |
| tom2     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom3     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom1     | new        | 6      | 1.0  | 1.0                  |
| lisi     | old        | 1      | 0.3  | 0.3333333333333333   |
| tomas    | old        | 2      | 0.5  | 0.6666666666666666   |
| tomson   | old        | 3      | 0.7  | 1.0                  |
+----------+------------+--------+------+----------------------+--+

select
    user_type,sales,
    --分組內(nèi)總行數(shù)
    SUM(1) OVER(PARTITION BY user_type) AS s,
    --RANK值
    RANK() OVER(ORDER BY sales) AS r,
    PERCENT_RANK() OVER(ORDER BY sales) AS pr,
    --分組內(nèi)
    PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
    orders;

## 上述結(jié)果如下
+------------+--------+----+-----+---------------------+---------------------+--+
| user_type  | sales  | s  |  r  |         pr          |         prg         |
+------------+--------+----+-----+---------------------+---------------------+--+
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 2      | 7  | 4   | 0.3333333333333333  | 0.3333333333333333  |
| new        | 3      | 7  | 6   | 0.5555555555555556  | 0.5                 |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 6      | 7  | 10  | 1.0                 | 1.0                 |
| old        | 1      | 3  | 1   | 0.0                 | 0.0                 |
| old        | 2      | 3  | 4   | 0.3333333333333333  | 0.5                 |
| old        | 3      | 3  | 6   | 0.5555555555555556  | 1.0                 |
+------------+--------+----+-----+---------------------+---------------------+--+

參考博文

http://blog.csdn.net/scgaliguodong123_/article/details/60135385

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容