HiveSQL窗口函數(shù)

課堂目標

  • 掌握sum(),avg()用于累計計算的窗口函數(shù)
  • 掌握row_number(),rank()用于排序的窗口函數(shù)【最高頻※※※】
  • 掌握ntile()用于分組查詢的窗口函數(shù)【很低頻】
  • 掌握lag(),lead()偏移分析窗口函數(shù)【很低頻】


1.2018年每月的支付總額和當年累計支付總額
'''sum后面加需要累計求和的字段
over后面是按照月份的升序進行的【排到某月份前面的所有都進行匯總】'''
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month)
from
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by month(dt)) a;
image.png
2.2017-2018年每月的支付總額和當年累計支付總額
  • 先找出每年每月的支付金額
  • 是按年分組的前提下進行月份升序求和的
    partition by 起到分組的作用
    order by 按照什么順序進行累加,默認升序,降序desc
select a.year,
a.month,
a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month)
from
(select year(dt) year,
month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),month(dt)) a;
image.png
3.2018年每個月的近三月移動平均支付金額

移動范圍:rows between 2 preceding and current row


image.png
select a.month,
a.pay_amount,
avg(a.pay_amount) over(order by a.month rows between 2 preceding and current row)
from 
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by month(dt)) a;
image.png

方法總結(jié)

image.png

拓展:
max(......) over(parttition by ......order by ......rows between ......and......)
min(......) over(parttition by ......order by ......rows between ......and......)



row_number() over(......)
rank() over(......)
dense_rand() over(......)

1.2019年1月,用戶購買商品品類數(shù)量的排名
select user_name,
count(distinct goods_category),
row_number() over(order by count(distinct goods_category)),
rank() over(order by count(distinct goods_category)),
dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name;
image.png

row_number:名額是固定的,挑選一定數(shù)量的人
rank:成績排名
dense_number:三種類型的獎牌,同樣分數(shù)獲得

2.選出2019年支付金額排名在第 10、20、30 名的用戶
select a.user_name,
a.pay_amount,
a.rank
from
(select user_name,
sum(pay_amount) pay_amount,
rank() over(order by sum(pay_amount) desc) rank
from user_trade
where year(dt)='2019'
group by user_name) a
where a.rank in (10,20,30);
image.png


分組窗口函數(shù)
ntile(n) over(partition by ... order by ...)

1.將2019年1月的支付用戶,按照支付金額分成5組
select user_name,
sum(pay_amount) pay_amount,
ntile(5) over(order by sum(pay_amount) desc)
from user_trade
where dt between'2019-01-01' and '2019-01-31'
group by user_name;

'''
dt between'2019-01-01' and '2019-01-31'
等價于
substr(dt,1,7)='2019-01'
'''
image.png
2.選出2019年退款金額排名前10%的用戶
select a.user_name,
a.refund_amount,
a.rank
from
(select user_name,
sum(refund_amount) refund_amount,
ntile(10) over(order by sum(refund_amount) desc) rank 
from user_refund
where year(dt)='2019'
group by user_name) a
where a.rank=1;

image.png


偏移分析窗口函數(shù)

  • 這一次和上一次,今天和明天,這個月和上個月等
    lag(exp_str,offset,defval) over(partition by ... order by ...)【向前】
    lead(exp_str,offset,defval) over(partition by ... order by ...)【向后】
    exp_str:字段名稱
    offset:偏移量(默認值是1)
    defval:默認值,超出范圍
1.Alice和Alexander的各種時間偏移量

【場景】購買用戶購買的時間間隔分布

select user_name,
dt,
lag(dt,1,dt) over(partition by user_name order by dt),
lag(dt) over(partition by user_name order by dt),
lag(dt,2,dt) over(partition by user_name order by dt),
lag(dt,2) over(partition by user_name order by dt)
from user_trade
where dt>'0'
and user_name in ('Alice','Alexander');
image.png
2.支付時間間隔超過100天的用戶數(shù)【※】
select count(distinct user_name)
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt
from user_trade
where dt>'0') a
where a.lead_dt is not null
and datediff(a.lead_dt,a.dt)>100;
'''結(jié)果:180'''


重點練習

1.每個城市,不同性別,2018年支付金額最高的TOP3用戶

【user_trade和user_info表】

'''
1.每個用戶的支付金額匯總
2.每個用戶的城市和性別(直接左連接user_info表)
3.每個城市不同性別的用戶的支付金額排名
4.找出TOP3
'''
select c.city,
c.sex,
c.user_name,
c.pay_amount,
c.rank
from 
(select b.city,
b.sex,
a.user_name,
a.pay_amount,
row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
from
(select user_name,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by user_name) a
left join user_info b
on a.user_name=b.user_name
-- 這里不要 group by 了,因為在排序中已經(jīng)指定了
) c
where c.rank<=3;
image.png
2.2.每個手機品牌退款金額前25%的用戶【desc千萬不要忘記】【user_refund和user_info表】

ntile(4) over()

'''
自己做錯原因:
1.將user_info 表左連接user_trade表
而應該user_trade表左連接user_info 表
2.可以直接 left join user_info b 表示可以去全部字段
a.user_name=b.user_name
而當left join (select ... from ...) b 時,就一定要寫select user_name,否則就不能
a.user_name=b.user_name
3.get_json_object(extra1,'$.phonebrand') phonebrand
等價于extra2['phonebrand'] phonebrand

'''
'''
1.每個用戶的手機品牌
2.每個用戶的退款金額
'''
select c.user_name,
c.phonebrand,
c.refund_amount,
c.level
from
(select a.user_name,
b.phonebrand,
a.refund_amount,
ntile(4) over(partition by b.phonebrand order by a.refund_amount desc) level
from
(select user_name,
sum(refund_amount) refund_amount
from user_refund
group by user_name) a
left join 
(select user_name,
get_json_object(extra1,'$.phonebrand') phonebrand
from user_info) b
on a.user_name=b.user_name) c
where c.level=1;
image.png
3.計算每12個月的用戶累計支付金額
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row)
from
(select substr(dt,1,7) month,
sum(pay_amount) pay_amount
from user_trade
where dt>'0'
group by substr(dt,1,7)) a;
image.png
4.計算出每4個月的最大退款金額
select a.month,
a.refund_amount,
max(a.refund_amount) over(order by a.month rows between 3 preceding and current row)
from
(select substr(dt,1,7) month,
sum(refund_amount) refund_amount
from user_refund
where dt>'0'
group by substr(dt,1,7)) a;
image.png
5.退款時間間隔最長的用戶
  • 用戶的退款時間間隔(天),然后order by 就好,真搞不懂當初為啥要用max()

select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null;
image.png
  • 退款時間間隔最長的用戶【錯誤】:要用max()必須要用 group by
select a.user_name,
max(a.date_diff)
from
(select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null) a;
  • 正確寫法:
select b.user_name,
b.date_diff
from
(select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null) b
order by b.date_diff desc limit 1;
image.png
最后編輯于
?著作權歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

友情鏈接更多精彩內(nèi)容