一 . 開窗函數(shù)
分組函數(shù)
語法結(jié)構(gòu):select col1,分析函數(shù)(col2) from table group by col1
開窗函數(shù)
語法結(jié)構(gòu):函數(shù)名(col2)over(partition by ool1 order by col3)
分組與開窗的區(qū)別
分組函數(shù)每組只返回一行,而開窗函數(shù)每組返回多行。如下:
分組查詢
select visit_id, count(session_id) ct from event group by visit_id
開窗查詢
select visit_id,count(session_id) over(partition by visit_id) ct from event
對(duì)比可以發(fā)現(xiàn),visit_id為00009e20-e3ec-4340-bf44-484b523fafbc時(shí),用分組查詢,ct只返回一行結(jié)果:8,而開窗查詢visit_id的8條記錄,都返回了,并且這八條記錄的ct都為8。
二 . rows between關(guān)鍵字含義
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:當(dāng)前行
UNBOUNDED:無邊界,UNBOUNDED PRECEDING 表示從最前面的起點(diǎn)開始, UNBOUNDED FOLLOWING:表示到最后面的終點(diǎn)
舉例如下:
三 . 常見分析函數(shù)總結(jié)
測(cè)試數(shù)據(jù)
1)sum求和函數(shù)
加order by時(shí)
select
cookieid,
createtime,
pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from event
不加order by時(shí)
select
cookieid,
createtime,
pv,
sum(pv) over(partition by cookieid) as pv1
from event
對(duì)比可以發(fā)現(xiàn),sum函數(shù)加了order by之后,不僅會(huì)對(duì)createtime排序,而且求的和是從起點(diǎn)到當(dāng)前行的求和,而不是整個(gè)分區(qū)(分組)的求和。而不加order by,是對(duì)整個(gè)分組(分片)求和注:max()函數(shù)無論有沒有order by 都是計(jì)算整個(gè)分區(qū)的最大值
2)幾個(gè)排序函數(shù)
排序函數(shù):row_number() 、rank() 、dense_rank() 、ntile() ,其區(qū)別如下:
(1) row_number() over():對(duì)相等的值不進(jìn)行區(qū)分,相等的值對(duì)應(yīng)的排名相同,序號(hào)從1到n連續(xù)。
(2)rank() over():相等的值排名相同,但若有相等的值,則序號(hào)從1到n不連續(xù)。如果有兩個(gè)人都排在第3名,則沒有第4名。
(3)dense_rank() over():對(duì)相等的值排名相同,但序號(hào)從1到n連續(xù)。如果有兩個(gè)人都排在第一名,則排在第2名(假設(shè)僅有1個(gè)第二名)的人是第3個(gè)人。
(4) ntile( n ) over():可以看作是把有序的數(shù)據(jù)集合平均分配到指定的數(shù)量n的桶中,將桶號(hào)分配給每一行,排序?qū)?yīng)的數(shù)字為桶號(hào)。如果不能平均分配,則較小桶號(hào)的桶分配額外的行,并且各個(gè)桶中能放的數(shù)據(jù)條數(shù)最多相差1。學(xué)生成績(jī)表同上,查詢語句如下:
select
id,
stu_name,
course_name,
grades,
row_number() over(order by grades) as row_num,
rank() over(order by grades) as rank,
dense_rank() over(order by grades) as dense_rank,
ntile(5) over(order by grades) as ntile
from students_grades
查詢結(jié)果如下:
3)lag和leag函數(shù)
lag(col,n,default) 用于統(tǒng)計(jì)窗口內(nèi)往前第n行值,default表示默認(rèn)值,意思是如果往前第n行的值null時(shí),則返回默認(rèn)值,如果不指定,則返回null。
SELECT
cookieid,
createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01') OVER(PARTITION BY cookieid ORDER BY createtime) AS lag1,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS lag2
FROM event;
ps:leag(col,n,default) 用于統(tǒng)計(jì)窗口內(nèi)往下第n行值
4)first_value和last_value函數(shù)
first_value返回分組內(nèi)排序后,截止到當(dāng)前行,第一個(gè)值
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS first
FROM event
ps:last_value和first_value相反,取最后一個(gè)值
參考文檔:
1)Hive開窗函數(shù)整理
2)數(shù)據(jù)分析(SQL)常見面試題(一):開窗函數(shù)
https://www.douban.com/group/topic/155112949/
最后歡迎關(guān)注微信公眾號(hào)
