英文代碼以及講解來自datacamp
窗口的概念非常重要,它可以理解為記錄集合,窗口函數(shù)也就是在滿足某種條件的記錄集合上執(zhí)行的特殊函數(shù)。對于每條記錄都要在此窗口內(nèi)執(zhí)行函數(shù),有的函數(shù)隨著記錄不同,窗口大小都是固定的,這種屬于靜態(tài)窗口;有的函數(shù)則相反,不同的記錄對應(yīng)著不同的窗口,這種動態(tài)變化的窗口叫滑動窗口。
1.window function中的over()
1.1 over用法
The OVER() clause allows you to pass an aggregate function down a data set, similar to subqueries in SELECT. The OVER() clause offers significant benefits over subqueries in select -- namely, your queries will run faster, and the OVER() clause has a wide range of additional functions and clauses you can include with it that we will cover later on in this chapter.
函數(shù)名([expr]) over子句
其中,over是關(guān)鍵字,用來指定函數(shù)執(zhí)行的窗口范圍,如果后面括號中什么都不寫,則意味著窗口包含滿足where條件的所有行,窗口函數(shù)基于所有行進行計算;如果不為空,則支持以下四種語法來設(shè)置窗口:
a.window_name:給窗口指定一個別名,如果SQL中涉及的窗口較多,采用別名可以看起來更清晰易讀
b.partition子句:窗口按照那些字段進行分組,窗口函數(shù)在不同的分組上分別執(zhí)行。
c.order by子句:按照哪些字段進行排序,窗口函數(shù)將按照排序后的記錄順序進行編號。可以和partition子句配合使用,也可以單獨使用。
d.frame子句:frame是當(dāng)前分區(qū)的一個子集,子句用來定義子集的規(guī)則,通常用來作為滑動窗口使用。
1.2 frame滑動窗口
對于滑動窗口的范圍指定,有兩種方式,基于行和基于范圍,具體區(qū)別如下
1.2.1 基于行
通常使用BETWEEN frame_start AND frame_end語法來表示行范圍,frame_start和frame_end可以支持如下關(guān)鍵字,來確定不同的動態(tài)行記錄:
CURRENT ROW 邊界是當(dāng)前行,一般和其他范圍關(guān)鍵字一起使用
UNBOUNDED PRECEDING 邊界是分區(qū)中的第一行
UNBOUNDED FOLLOWING 邊界是分區(qū)中的最后一行
expr PRECEDING 邊界是當(dāng)前行減去expr的值
expr FOLLOWING 邊界是當(dāng)前行加上expr的值
1.2.2 基于范圍
和基于行類似,但有些范圍不是直接可以用行數(shù)來表示的,比如希望窗口范圍是一周前的訂單開始,截止到當(dāng)前行,則無法使用rows來直接表示,此時就可以使用范圍來表示窗口:INTERVAL 7 DAY PRECEDING。Linux中常見的最近1分鐘、5分鐘負載是一個典型的應(yīng)用場景。
函數(shù)包括:CUME_DIST()
DENSE_RANK()
LAG()
LEAD()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()
SELECT
# Select the id, country name, season, home, and away goals
m.id,
c.name AS country,
m.season,
m.home_goal,
m.away_goal,
# Use a window to include the aggregate average in each row
avg(m.home_goal +m.away_goal ) over() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;
2. window function 中的 rank用法;row_number()用法
2.1 rank
Window functions allow you to create a RANK of information according to any variable you want to use to sort your data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY clause inside the OVER() clause.
SELECT
id,
RANK() OVER(ORDER BY home_goal) AS rank
FROM match;
EG
In this exercise, you will create a data set of ranked matches according to which leagues, on average, score the most goals in a match.
SELECT
#Select the league name and average goals scored
l.name AS league,
AVG(m.home_goal + m.away_goal) AS avg_goals,
# Rank each league according to the average goals
RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal)) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
# Order the query by the rank you created
ORDER BY league_rank;
來自datacamp
EG:
In the last exercise, the rank generated in your query was organized from smallest to largest. By adding DESC to your window function, you can create a rank sorted from largest to smallest.
SELECT
# Select the league name and average goals scored
l.name AS league,
avg(m.home_goal + m.away_goal) AS avg_goals,
# Rank leagues in descending order by average goals
rank ()over(order by avg(m.home_goal + m.away_goal) desc) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
# Order the query by the rank you created
order by league_rank;
2.2 row_number()
eg: 取出沒門課程的第一名
CREATE TABLE window_test
(id int,
name text,
subject text,
score numeric
);
INSERT INTO window_test VALUES (1,'小黃','數(shù)學(xué)',99.5), (2,'小黃','語文',89.5),(3,'小黃','英語',79.5), (4,'小黃','物理',99.5), (5,'小黃','化學(xué)',98.5), (6,'小紅','數(shù)學(xué)',89.5), (7,'小紅','語文',99.5), (8,'小紅','英語',79.5), (9,'小紅','物理',89.5), (10,'小紅','化學(xué)',69.5),(11,'小綠','數(shù)學(xué)',89.5), (12,'小綠','語文',91.5), (13,'小綠','英語',92.5),(14,'小綠','物理',93.5), (15,'小綠','化學(xué)',94.5);
>select * from window_test;
正常解法:
select b.* from
(select subject,max(score) as score from window_test group by subject) a
join window_test b on a.score = b.score and a.subject = b.subject;
用row_number
select id,name,subject,score from (select row_number() over (partition by subject order by score desc) as rn,
id,name,subject,score from window_test )t where rn=1;
ROW_NUMBER():順序排序——1、2、3
RANK():并列排序,跳過重復(fù)序號——1、1、3
DENSE_RANK():并列排序,不跳過重復(fù)序號——1、1、2
3.window function中的 over 與partition by用法
partition by: calculate separate values for different categories
calculate different calculations in the same column
AVG(home_goal) OVER (PARTITION BY season)
3.1 partition by 一列
datacamp練習(xí):
In this exercise, you will be creating a data set of games played by Legia Warszawa (Warsaw League), the top ranked team in Poland, and comparing their individual game performance to the overall average for that season.
Where do you see the more outliers? Are they Legia Warszawa's home or away games?
SELECT
date,
season,
home_goal,
away_goal,
CASE WHEN hometeam_id = 8673 THEN 'home'
ELSE 'away' END AS warsaw_location,
#Calculate the average goals scored partitioned by season
avg(home_goal) over(PARTITION BY season) AS season_homeavg,
avg(away_goal) over(PARTITION BY season) AS season_awayavg
FROM match
# Filter the data set for Legia Warszawa matches only
WHERE
hometeam_id = 8673
OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
3.2 partition by 多列
The PARTITION BY clause can be used to break out window averages by multiple data points (columns). You can even calculate the information you want to use to partition your data! For example, you can calculate average goals scored by season and by country, or by the calendar year (taken from the date column).
In this exercise, you will calculate the average number home and away goals scored Legia Warszawa, and their opponents, partitioned by the month in each season.
SELECT
date,
season,
home_goal,
away_goal,
CASE WHEN hometeam_id = 8673 THEN 'home'
ELSE 'away' END AS warsaw_location,
#Calculate average goals partitioned by season and month
avg(home_goal) over(partition by season,
EXTRACT(month FROM date)) AS season_mo_home,
avg(away_goal) over(partition by season,
EXTRACT(month FROM date)) AS season_mo_away
FROM match
WHERE
hometeam_id = 8673
OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;
4.sliding windows
- perform calculations relative to the current row
- can be used to calculate running totals, sums, averages
- can be partition by one or more columns
ROW BETWEEN <start> AND <finish>
Sliding windows allow you to create running calculations between any two points in a window using functions such as PRECEDING, FOLLOWING, and CURRENT ROW. You can calculate running counts, sums, averages, and other aggregate functions between any two points you specify in the data set.
rows BETWEEN 1 PRECEDING AND 1 FOLLOWING 窗口范圍是當(dāng)前行、前一行、后一行一共三行記錄。
rows UNBOUNDED FOLLOWING 窗口范圍是當(dāng)前行到分區(qū)中的最后一行。
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 窗口范圍是當(dāng)前分區(qū)中所有行,等同于不寫。
In this exercise, you will expand on the examples discussed in the video, calculating the running total of goals scored by the FC Utrecht when they were the home team during the 2011/2012 season. Do they score more goals at the end of the season as the home or away team?
SELECT
date,
home_goal,
away_goal,
# Create a running total and running average of home goals
SUM(home_goal) over(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
avg(home_goal) over(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
hometeam_id = 9908
AND season = '2011/2012';
In this exercise, you will slightly modify the query from the previous exercise by sorting the data set in reverse order and calculating a backward running total from the CURRENT ROW to the end of the data set (earliest record).
SELECT
-- Select the date, home goal, and away goals
DATE,
home_goal,
away_goal,
#Create a running total and running average of home goals
SUM(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
avg(home_goal) over(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
awayteam_id = 9908
AND season = '2011/2012';
5. 與CASE,CTE合用
how badly did Manchester United lose in each match?
In order to determine this, let's add a window function to the main query that ranks matches by the absolute value of the difference between home_goal and away_goal. This allows us to directly compare the difference in scores without having to consider whether Manchester United played as the home or away team!
The equation is complete for you -- all you need to do is properly complete the window function!
WITH HOME AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
WHEN m.home_goal < m.away_goal THEN 'MU Loss'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id),
# Set up the away team CTE
AWAY AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
WHEN m.home_goal < m.away_goal THEN 'MU Win'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id)
# Select columns and and rank the matches by date
SELECT DISTINCT
m.date,
home.team_long_name AS home_team,
away.team_long_name AS away_team,
m.home_goal, m.away_goal,
rank() over(order by ABS(home_goal - away_goal) desc) as match_rank
# Join the CTEs onto the match table
FROM match AS m
left JOIN home ON m.id = home.id
left JOIN away ON m.id = away.id
WHERE m.season = '2014/2015'
AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));
6.應(yīng)用場景
6.1 希望查詢每個用戶訂單金額最高的前三個訂單

上面紅色粗體顯示了三個函數(shù)的區(qū)別,row_number()在amount都是800的兩條記錄上隨機排序,但序號按照1、2遞增,后面amount為600的的序號繼續(xù)遞增為3,中間不會產(chǎn)生序號間隙;rank()/dense_rank()則把amount為800的兩條記錄序號都設(shè)置為1,但后續(xù)amount為600的需要則分別設(shè)置為3(rank)和2(dense_rank)。即rank()會產(chǎn)生序號相同的記錄,同時可能產(chǎn)生序號間隙;而dense_rank()也會產(chǎn)生序號相同的記錄,但不會產(chǎn)生序號間隙。
一般排序用rank, 但是數(shù)據(jù)里面有重復(fù)的值,最好用dense_rank???
6.2 查詢上一個訂單距離當(dāng)前訂單的時間間隔。

內(nèi)層SQL先通過lag函數(shù)得到上一次訂單的日期,外層SQL再將本次訂單和上次訂單日期做差得到時間間隔diff。
6.3 查詢截止到當(dāng)前訂單,按照日期排序第一個訂單和最后一個訂單的訂單金額。

結(jié)果和預(yù)期一致,比如order_id為4的記錄,first_amount和last_amount分別記錄了用戶‘001’截止到時間2018-01-03 00:00:00為止,第一條訂單金額100和最后一條訂單金額800,注意這里是按時間排序的最早訂單和最晚訂單,并不是最小金額和最大金額訂單。
6.4 每個用戶按照訂單id,截止到當(dāng)前的累計訂單金額/平均訂單金額/最大訂單金額/最小訂單金額/訂單數(shù)是多少?

reference:
https://dbaplus.cn/news-11-2258-1.html