準備數據
CREATE TABLE lxy (cookieid INT, create_time STRING, pv INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/chenlinlin2156233/lxy.csv';
SELECT * FROM lxy;
查看結果

SUM(), MIN(),MAX(),AVG()等聚合函數
對一定窗口期內的數據進行聚合
SELECT *,
SUM(a.pv) OVER (PARTITION BY cookieid ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv1,
SUM(a.pv) OVER (PARTITION BY cookieid ORDER BY create_time ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS pv2
FROM lxy AS a;
在這里根據cookieid進行分組,然后按照create_time進行分組,選擇不同的窗口進行一定函數的聚合運算。
基本的語法是ROWS BETWEEN 一個時間點 AND 一個時間點
時間點分別可以是以當前行作為參考系,前面幾行n PRECEDING或者是后面幾行n FOLLOWING,也可以是當前行CURRENT ROW。總之可以想象有一個滑動窗口,我們可以規(guī)定一個滑動窗口的中心位置和大小,然后每次畫過一個步長,計算一次窗口內的值。

新增加序號列NTILE, ROW_NUMBER(), RANK(), DENSE_RANK()
我們先來試試看這幾個函數的實際返回結果。

SELECT *,
NTILE(3) OVER (PARTITION BY cookid2 ORDER BY pv) AS n1,
ROW_NUMBER() OVER (PARTITION BY cookid2 ORDER BY pv) AS n2,
RANK() OVER (PARTITION BY cookid2 ORDER BY pv) AS n3,
DENSE_RANK() OVER (PARTITION BY cookid2 ORDER BY pv) AS n4
FROM lxy3;

我們可以看到,對于NTILE函數,傳入的參數n是指要切分成多少份,返回對應的序號,ROW_NUMBER()則是生成一列連續(xù)的序號,RANK()與ROW_NUMBER()類似,只是對于數值相同的這一項會同時為相同的序號,下一個序號跳過,比如倒數第二列當中有出現4,4,6沒有5;而DENSE_RANK()則相反,會緊跟著下一個是緊接著的序號,比如4,4,5。
LAG, LEAD, FIRST_VALUE, LAST_VALUE
這幾個函數可以通過字面意思記得,LAG是遲滯的意思,也就是對某一列進行往后錯行;LEAD是LAG的反義詞,也就是對某一列進行提前幾行;FIRST_VALUE是對該列到目前為止的首個值,而LAST_VALUE是到目前行為止的最后一個值。
仍舊是這張表

SELECT *,
LAG(pv, 2) OVER(PARTITION BY cookid2 ORDER BY log_date) AS lag1,
LEAD(pv, 2, 0) OVER(PARTITION BY cookid2 ORDER BY log_date) AS lead1,
FIRST_VALUE() OVER(PARTITION BY cookid2 ORDER BY log_date) AS first_pv,
FIRST_VALUE() OVER(PARTITION BY cookid2 ORDER BY log_date) AS last_pv,
LAST_VALUE() OVER(PARTITION BY cookid2 ORDER BY log_date) AS current_last_pv
FROM lxy3;

LAG和LEAD里面都是傳入三個參數,分別是排序的列名,滯后/往前的行數,以及默認填充值。因為我們在這里的LEAD()里面設置默認填充值為0,所以對于cookid后面兩行缺失值填充為0。
如果我們要返回每個分組下排序后的最后一個數,可以對該組進行DESC的操作,注意ORDER BY對返回的結果很有影響。
SELECT *,
FIRST_VALUE() OVER(PARTITION BY cookid2 ORDER BY pv DESC) AS first_pv
FROM lxy3;
GROUPING SET, CUBE, ROLL UP
我們先準備一張表格
CREATE EXTERNAL TABLE lxw1234 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/chenlinlin2156233/lxy2/';

SELECT * FROM lxw1234;

從上面看到我們已經成功導入了一張外部表。
GROUPING SET(key1, key2)相當于是對不同字段進行group操作以后,再進行union all的操作。
SELECT month,
day,
count(DISTINCT cookieid) AS count_id,
GROUPING__ID
FROM lxw1234
GROUP BY month, day
GROUPING SETS(month, day)
ORDER BY GROUPING__ID;

在這里注意,
- GROUPING_ID是自動生成的,是進行了GROUPING_SET()的操作之后。
- 下劃線有兩個
- 需要先做GROUP BY操作再傳入GROUPING SETS
等價于先group再union all的做法
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day

CUBE就是比以上的GROUPING SETS多了一個兩列的整合,也就是笛卡爾乘積。
SELECT month,
day,
count(DISTINCT cookieid) AS count_id,
GROUPING__ID
FROM lxw1234
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;

假如我們把上面的代碼里面的CUBE改成了ROLL UP,我們看下會返回什么結果。
SELECT month,
day,
count(DISTINCT cookieid) AS count_id,
GROUPING__ID
FROM lxw1234
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;

可以看到,這個時候就不會返回以右邊為關鍵字的聚合結果,只是返回左邊的鍵以及笛卡爾乘積的結果。
我們如果換一下聚合的關鍵字month和day的順序呢?
SELECT month,
day,
count(DISTINCT cookieid) AS count_id,
GROUPING__ID
FROM lxw1234
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;

從上面結果可以看到,關鍵字的順序對rollup的結果也是很有影響的。
以上就是所學習hive窗口函數的總結。
參考資源
以上總結主要參考該博客。