1. 啟發(fā)
在業(yè)務(wù)上,我們常常會(huì)碰到這樣的需求
例如原始表如下:
| uid | state | time |
|---|---|---|
| a | 1 | 0 |
| a | null | 1 |
| a | null | 2 |
| a | 3 | 3 |
| b | null | 4 |
| b | 2 | 5 |
| b | null | 6 |
需要變成如下:
| uid | state | time |
|---|---|---|
| a | 1 | 0 |
| a | 1 | 1 |
| a | 1 | 2 |
| a | 3 | 3 |
| b | null | 4 |
| b | 2 | 5 |
| b | 2 | 6 |
簡(jiǎn)單來(lái)講就是,對(duì)所有空值,復(fù)制每個(gè)分組下排序后的最后一個(gè)非空值,這種需求也叫做The Last non-NULL Puzzle。對(duì)于這種問(wèn)題,pandas有ffill函數(shù)可以簡(jiǎn)單輕易的解決這個(gè)問(wèn)題,但是在SQL中就沒(méi)這么暢快了。
對(duì)與這個(gè)問(wèn)題有多個(gè)解法,以下是筆者找到的一種,作為拋磚引玉
代碼如下:
select
uid
, coalesce(last_value(state, true) over(partition by uid order by time rows between unbounded preceding and current row)) state
, time
from values
('a', 1, 0),
('a', null, 1),
('a', null, 2),
('a', 3, 3) ,
('b', null, 4),
('b', 2, 5),
('b', 2, 6)
as tab(uid, state, time);
/*
output:
uid state2 time
a 1 0
a 1 1
a 1 2
a 3 3
b NULL 4
b 2 5
b 2 6
*/
解釋?zhuān)?br>
over()開(kāi)窗函數(shù)中,unbounded preceding and current row表示從第一行到目前這一行
last_value()表示取一個(gè)數(shù)組中的最后一個(gè)值
coalesce()表示聚合一個(gè)數(shù)組,剔除所有的空值
2. 拓展
在Hive中,SQL中額外添加了一些函數(shù):
2.1 開(kāi)窗函數(shù)中的聚合函數(shù)
- LEAD() 返回分組中的上一個(gè)值
- LAG() 返回分組中的下一個(gè)值
- FIRST_VALUE() 返回分組中的第一個(gè)值
- LAST_VALUE() 返回分組中的最后一個(gè)值
2.2 OVER() 表達(dá)式中
OVER()中開(kāi)窗,可以使用ROWS表達(dá)式子分別對(duì)行進(jìn)行選擇/RANGE表達(dá)式對(duì)范圍進(jìn)行選擇,例如:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
reference:
https://cwiki.apache.org/confluence/display/hive/languagemanual+windowingandanalytics