Neil Zhu, Jianshu ID Not_GOD, is the founder and Chief Scientist of University AI, dedicated to advancing the adoption of artificial intelligence worldwide; he sets and executes UAI's medium- and long-term growth strategy and has led the team to become a highly professional force in the AI field.
As an industry leader, he and UAI founded TASA (one of China's earliest AI societies) in 2014, along with DL Center (a global value network for deep learning knowledge) and AI Growth (an industry think-tank training program), feeding a great deal of talent into China's AI community. He has organized and taken part in international AI summits and events, written some 600,000 words of technical AI content, produced the translation of the first introductory deep learning book, "Neural Networks and Deep Learning" (widely republished by professional media and public accounts), and has been invited by top Chinese universities to design AI curricula and teach frontier AI courses, to consistently positive reviews from students and faculty.

Note: thanks to Tambet Matiisen for the original article; only the most essential parts are translated here.
Two years ago, a small company in London called DeepMind uploaded their pioneering paper “Playing Atari with Deep Reinforcement Learning” to Arxiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!
It has been hailed since then as the first step towards general artificial intelligence – an AI that can survive in a variety of environments, instead of being confined to strict realms such as playing chess. No wonder DeepMind was immediately bought by Google and has been on the forefront of deep learning research ever since. In February 2015 their paper “Human-level control through deep reinforcement learning” was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.
Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will be trying to demystify this technique and understand the rationale behind it. The intended audience is someone who already has background in machine learning and possibly in neural networks, but hasn’t had time to delve into reinforcement learning yet.
Let's look at what deep reinforcement learning really is by way of the following questions:
- What are the main challenges in reinforcement learning? We will cover the credit assignment problem and the exploration-exploitation dilemma here.
- How can reinforcement learning be formalized in mathematical terms? We will define the Markov Decision Process and use it to reason about reinforcement learning.
- How do we form long-term strategies? We define the discounted future reward, which forms the basis for the algorithms in the sections that follow.
- How can we estimate or approximate the future reward? Simple table-based Q-learning is defined and analyzed here.
- What if the state space is too big? Here the Q-table is replaced with a (deep) neural network.
- What do we need to make the model actually work? The experience replay technique is used to stabilize the learning of the neural network.
- Are we done then? Finally we will look at some simple solutions to the exploration-exploitation problem.
Reinforcement Learning
Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases – you get a reward.

Suppose you want to teach a neural network to play this game. Input to your network would be screen images, and output would be three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem – for each game screen you have to decide whether you should move left, move right or press fire. Sounds straightforward? Sure, but then you need training examples, and lots of them. Of course you could go and record game sessions using expert players, but that's not really how we learn. We don't need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.
This is the task **reinforcement learning** tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels – the rewards. Based on those rewards alone, the agent has to learn to behave in the environment.
盡管這個(gè)想法非常符合直覺(jué),但是實(shí)際操作時(shí)卻困難重重。例如,當(dāng)你擊中一個(gè)磚塊,并得到收益時(shí),這常常和最近做出的行動(dòng)(paddle 的移動(dòng))沒(méi)有關(guān)系。在你將 paddle 移動(dòng)到了正確的位置時(shí)就可以將球彈回,其實(shí)所有的困難的工作已經(jīng)完成。這個(gè)問(wèn)題被稱(chēng)作是 credit assignment 問(wèn)題——先前的行動(dòng)會(huì)影響到當(dāng)前的收益的獲得與否及收益的總量。
Once you have figured out a strategy to collect a certain number of rewards, should you stick with it or experiment with something that could lead to even bigger rewards? In the Breakout game above, a simple strategy is to move to the left edge and wait there. When launched, the ball tends to fly left more often than right, and you will easily score about 10 points before you die. But will you be satisfied with that? This is the exploration-exploitation dilemma: should you exploit the known working strategy or explore other, possibly better strategies?
強(qiáng)化學(xué)習(xí)是關(guān)于人類(lèi)(或者更一般的動(dòng)物)學(xué)習(xí)的重要模型。我們受到來(lái)自父母、分?jǐn)?shù)、薪水的獎(jiǎng)賞都是收益的各類(lèi)例子。credit assignment 問(wèn)題 和 exploration-exploitation 困境 在商業(yè)和人際關(guān)系中常常出現(xiàn)。這也是研究強(qiáng)化學(xué)習(xí)及那些提供嘗試新的觀點(diǎn)的沙盒的博弈游戲的重要原因。
Markov Decision Process
現(xiàn)在的問(wèn)題就是,如何來(lái)形式化這個(gè)強(qiáng)化學(xué)習(xí)問(wèn)題使得我們可以對(duì)其進(jìn)行分析推理。目前最常見(jiàn)的就是將其表示成一個(gè) Markov decision process。
假設(shè)你是一個(gè) agent,在一個(gè)環(huán)境中(比如說(shuō) Breakout 游戲)。環(huán)境會(huì)處在某個(gè)狀態(tài)下(比如說(shuō),paddle 的位置、球的位置和方向、每個(gè)磚塊是否存在等等)。agent 可以在環(huán)境中執(zhí)行某種行動(dòng)(比如說(shuō),將 paddle 向左或者向右移動(dòng))。這些行動(dòng)有時(shí)候會(huì)帶來(lái)收益(比如說(shuō)分?jǐn)?shù)的增加)。行動(dòng)會(huì)改變環(huán)境并使其新的狀態(tài),然后 agent 又可以這行另外的行動(dòng),如此進(jìn)行下去。如何選擇這些行動(dòng)的規(guī)則稱(chēng)為策略。一般來(lái)說(shuō)環(huán)境是隨機(jī)的,這意味著下一個(gè)狀態(tài)的出現(xiàn)在某種程度上是隨機(jī)的(例如,你輸了一個(gè)球的時(shí)候,重新啟動(dòng)新球,這個(gè)時(shí)候的射出方向是隨機(jī)選定的)。

狀態(tài)和行動(dòng)的集合,以及從一個(gè)狀態(tài)跳轉(zhuǎn)到另一個(gè)狀態(tài)的規(guī)則,共同真誠(chéng)了 Markov decision process。這個(gè)過(guò)程(比方說(shuō)一次游戲過(guò)程)的一個(gè) episode 形成了一個(gè)有限的狀態(tài)、行動(dòng)和收益的序列:

這里的 s_i 表示狀態(tài),a_i 表示行動(dòng),而 r_{i+1} 是在執(zhí)行行動(dòng)的匯報(bào)。episode 以 terminal 狀態(tài) s_n 結(jié)尾(可能就是“游戲結(jié)束”畫(huà)面)。MDP 依賴(lài)于 Markov 假設(shè)——下一個(gè)狀態(tài) s_{i+1} 的概率僅僅依賴(lài)于當(dāng)前的狀態(tài) s_i 和行動(dòng) a_i,但不依賴(lài)此前的狀態(tài)和行動(dòng)。
Discounted Future Reward
To perform well in the long-term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that?
Given one run of the Markov decision process, we can easily calculate the total reward for one episode:
R = r_1 + r_2 + r_3 + … + r_n
Given that, the total future reward from time point t onward can be expressed as:
R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The more into the future we go, the more it may diverge. For that reason it is common to use **discounted future reward** instead:
R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{n-t} r_n
Here γ is the discount factor between 0 and 1 – the more into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same thing at time step t+1:
R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + …)) = r_t + γ R_{t+1}
If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in same rewards, then we can set discount factor γ=1.
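To make the recursion concrete, here is a tiny sketch (in Python, with made-up reward values) that computes the discounted return by working backwards through the episode using R_t = r_t + γ R_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Compute R_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... by applying the
    recursion R_t = r_t + gamma * R_{t+1} from the end of the episode."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

rewards = [0, 0, 1, 0, 1]                # hypothetical per-step rewards
print(discounted_return(rewards, 0.9))   # discounted return
print(discounted_return(rewards, 1.0))   # γ=1: plain sum of rewards
print(discounted_return(rewards, 0.0))   # γ=0: immediate reward only
```

Note how γ=1 recovers the plain total reward and γ=0 keeps only the immediate reward, matching the discussion above.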
A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
Q-learning
In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on.
**The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called Q-function, because it represents the "quality" of a certain action in a given state.**
This may sound like quite a puzzling definition. How can we estimate the score at the end of game, if we know just the current state and action, and not the actions and rewards coming after that? We really can’t. But as a theoretical construct we can assume existence of such a function. Just close your eyes and repeat to yourself five times: “Q(s, a) exists, Q(s, a) exists, …”. Feel it?
If you’re still not convinced, then consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of game. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!
π(s) = argmax_a Q(s, a)
Here π represents the policy, the rule for how we choose an action in each state.
OK, how do we get that Q-function then? Let’s focus on just one transition <s, a, r, s’>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’.
Q(s, a) = r + γ max_{a'} Q(s', a')
This is called the Bellman equation. If you think about it, it is quite logical – maximum future reward for this state and action is the immediate reward plus maximum future reward for the next state.
The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm is as simple as the following[1]:
initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select and carry out an action a
    observe reward r and new state s'
    Q[s, a] = Q[s, a] + α (r + γ max_{a'} Q[s', a'] - Q[s, a])
    s = s'
until terminated
α in the algorithm is the learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1, the two Q[s,a] terms cancel and the update is exactly the same as the Bellman equation.
The max_{a'} Q[s',a'] that we use to update Q[s,a] is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function converges and represents the true Q-value.
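As an illustration of the update rule, here is a minimal sketch of table-based Q-learning on an invented toy problem: a four-state corridor where stepping right off the last state ends the episode with reward 1, and everything else gives reward 0. Because Q-learning is off-policy, even a purely random behavior policy is enough for the table to converge to the true values:

```python
import random

N_STATES, N_ACTIONS = 4, 2            # actions: 0 = left, 1 = right
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma = 0.5, 0.9

def step(s, a):
    """Toy deterministic corridor: stepping right from the last state
    ends the episode with reward 1; every other move gives reward 0."""
    if a == 1:
        if s == N_STATES - 1:
            return None, 1.0          # terminal state reached
        return s + 1, 0.0
    return max(s - 1, 0), 0.0         # move left, clipped at the wall

random.seed(0)
for _ in range(200):
    s = 0
    while s is not None:
        a = random.randrange(N_ACTIONS)        # purely random exploration
        s2, r = step(s, a)
        target = r if s2 is None else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])  # the update from the text
        s = s2

print([round(max(q), 2) for q in Q])  # state values grow toward the goal
```

After a couple hundred episodes the state values approach the true discounted values 0.9³, 0.9², 0.9, 1 along the corridor.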
Deep Q Network
環(huán)境的狀態(tài)可以用 paddle 的位置,球的位置和方向以及每個(gè)磚塊是否消除來(lái)確定。不過(guò)這個(gè)直覺(jué)上的表示與游戲相關(guān)。我們能不能獲得某種更加通用適合所有游戲的表示呢?最明顯的選擇就是屏幕像素——他們隱式地包含所有關(guān)于除了球的速度和方向外的游戲情形的相關(guān)信息。不過(guò)兩個(gè)時(shí)間上相鄰接的屏幕可以包含這兩個(gè)丟失的信息。
If we preprocess the game screens the way the DeepMind paper does: take the four last screen images, resize them to 84×84 and convert them to 256 gray levels, we get 256^{84×84×4} possible game states. That would mean 10^67970 rows in our Q-table: far more than the number of atoms in the known universe! One could argue that many pixel combinations (and therefore states) never occur, so we could get away with a sparse table containing only the visited states. Even so, most states would still be visited very rarely, and it would take the lifetime of the universe for the Q-table to converge. Ideally, we would also like to have a good guess for the Q-values of states we have never seen before.
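The size quoted above is easy to verify: the number of decimal digits of 256^{84×84×4} is just the exponent multiplied by log10(256):

```python
import math

n_pixels = 84 * 84 * 4           # four stacked 84x84 frames
digits = n_pixels * math.log10(256)
print(round(digits))             # 67970 decimal digits in the state count
```

So the naive Q-table would need on the order of 10^67970 rows, as claimed.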
This is where deep learning comes in. Neural networks are exceptionally good at extracting features from highly structured data. We could represent the Q-function with a neural network that takes the state (four screen images) and the action as input and outputs the corresponding Q-value. Alternatively, we could take only the game screens as input and output a Q-value for each possible action. The latter approach is more convenient when we want to perform a Q-value update or pick the action with the highest Q-value: we only need a single forward pass through the network to obtain the Q-values for all actions at once.

The deep neural network architecture DeepMind used is the following:

這實(shí)際上是一個(gè)經(jīng)典的卷積神經(jīng)網(wǎng)絡(luò),包含 3 個(gè)卷積層加上 2 個(gè)全連接層。對(duì)圖像識(shí)別的人們應(yīng)該會(huì)注意到這里沒(méi)有包含 pooling 層。但如果你好好想想這里的情況,你會(huì)明白,pooling 層會(huì)帶來(lái)變換不變性 —— 網(wǎng)絡(luò)會(huì)對(duì)圖像中的對(duì)象的位置沒(méi)有很強(qiáng)的感知。這個(gè)特性在諸如 ImageNet 這樣的分類(lèi)問(wèn)題中會(huì)有用,但是在這里游戲的球的位置其實(shí)是潛在能夠決定收益的關(guān)鍵因素,我們自然不希望失去這樣的信息了!
網(wǎng)絡(luò)的輸入是 4 幅 84X84 的灰度屏幕畫(huà)面。網(wǎng)絡(luò)的輸出是對(duì)每個(gè)可能的行動(dòng)(在 Atari 中是 18 個(gè))。Q-value 可能是任何實(shí)數(shù),其實(shí)這是一個(gè)回歸任務(wù),我們可以通過(guò)簡(jiǎn)單的平方誤差來(lái)進(jìn)行優(yōu)化。
L = 1/2 [r + γ max_{a'} Q(s', a') − Q(s, a)]²
給定轉(zhuǎn)移 <s, a, r, s'>,Q-table 更新規(guī)則變動(dòng)如下:
- 對(duì)當(dāng)前的狀態(tài) s 執(zhí)行前向傳播,獲得對(duì)所有行動(dòng)的預(yù)測(cè) Q-value
- 對(duì)下一狀態(tài) s' 執(zhí)行前向傳播,計(jì)算網(wǎng)絡(luò)輸出最大操作:max_{a'} Q(s', a')
- 設(shè)置行動(dòng)的 Q-value 目標(biāo)值為 r + γ max_{a'} Q(s', a')。使用第二步的 max 值。對(duì)所有其他的行動(dòng),設(shè)置為和第一步返回結(jié)果相同的 Q-value 目標(biāo)值,讓這些輸出的誤差設(shè)置為 0
- 使用反向傳播算法更新權(quán)重
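The four steps can be sketched with NumPy. Here `q_net` is a hypothetical stand-in for the network's forward pass (it just returns random numbers, not DeepMind's actual model), and terminal-state handling is omitted for brevity:

```python
import numpy as np

n_actions, gamma = 3, 0.99
rng = np.random.default_rng(0)

def q_net(states):
    """Hypothetical stand-in for the Q-network's forward pass: one Q-value
    per action for each state in the batch (random numbers for illustration)."""
    return rng.standard_normal((len(states), n_actions))

# A dummy minibatch of transitions <s, a, r, s'>
states      = np.zeros((32, 84, 84, 4))
actions     = rng.integers(0, n_actions, size=32)
rewards     = rng.standard_normal(32)
next_states = np.zeros((32, 84, 84, 4))

q_pred  = q_net(states)                    # step 1: Q-values for state s
q_next  = q_net(next_states).max(axis=1)   # step 2: max_a' Q(s', a')
targets = q_pred.copy()                    # step 3: keep other actions as-is,
targets[np.arange(32), actions] = rewards + gamma * q_next  # overwrite taken a
loss = ((targets - q_pred) ** 2).mean()    # step 4: squared error to minimize
```

Because the targets equal the predictions everywhere except at the taken action, every other output contributes zero error, exactly as described in step 3.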
Experience Replay
到現(xiàn)在,我們有了使用 Q-learning 如何估計(jì)未來(lái)回報(bào)和用卷積神經(jīng)網(wǎng)絡(luò)近似 Q-function 的方法。但是有個(gè)問(wèn)題是,使用非線(xiàn)性函數(shù)來(lái)近似 Q-value 其實(shí)是非常不穩(wěn)定的。對(duì)這個(gè)問(wèn)題其實(shí)也有一堆技巧來(lái)讓其收斂。不過(guò)這樣也會(huì)花點(diǎn)時(shí)間,在單個(gè) GPU 上估計(jì)要一個(gè)禮拜。
The most important trick is experience replay. During gameplay, all the experiences <s, a, r, s'> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transitions. This breaks the similarity between subsequent training samples, which otherwise might drive the network into a local minimum. Experience replay also makes the training task more similar to ordinary supervised learning, which simplifies debugging and testing the algorithm. One could actually collect all those experiences from human gameplay and then train the network on them.
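A replay memory in this spirit is just a bounded buffer with uniform random sampling. A minimal sketch (the capacity and batch size here are arbitrary choices):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out

    def push(self, s, a, r, s_next):
        """Store one transition <s, a, r, s'>."""
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform random minibatch: breaks the correlation between
        consecutive transitions."""
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=10000)
for t in range(100):
    memory.push(t, t % 3, 0.0, t + 1)          # dummy transitions
batch = memory.sample(32)
print(len(batch))                              # 32
```

The `maxlen` bound means that once the memory is full, new experiences silently evict the oldest ones, which is the usual behavior for DQN-style replay buffers.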
Exploration-Exploitation
Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until it reaches the crucial decision point that was the actual cause of the obtained reward. But we have not addressed the exploration-exploitation dilemma yet…
First observe that when a Q-table or a Q-network is initialized randomly, its predictions are initially random as well. If we always pick the action with the highest Q-value, that action will be random, and the agent performs crude "exploration". As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.
對(duì)上面問(wèn)題的一個(gè)簡(jiǎn)單卻有效的解決方案是 ** ε-greedy exploration——以概率ε選擇一個(gè)隨機(jī)行動(dòng),否則按照最高的 Q-value 進(jìn)行貪心行動(dòng)。在 DeepMind 的系統(tǒng)中,對(duì)ε本身根據(jù)時(shí)間進(jìn)行的從 1 到 0.1 的下降,也就是說(shuō)開(kāi)始時(shí)系統(tǒng)完全進(jìn)行隨機(jī)的行動(dòng)來(lái)最大化地 explore 狀態(tài)空間,然后逐漸下降到一個(gè)固定的 exploration 的比例。
The Deep Q-learning Algorithm
This gives us the final deep Q-learning algorithm with experience replay:
initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
    select an action a
        with probability ε select a random action
        otherwise select a = argmax_{a'} Q(s, a')
    carry out action a
    observe reward r and new state s'
    store experience <s, a, r, s'> in replay memory D

    sample random transitions <ss, aa, rr, ss'> from replay memory D
    calculate target for each minibatch transition
        if ss' is a terminal state then tt = rr
        otherwise tt = rr + γ max_{a'} Q(ss', a')
    train the Q network using (tt − Q(ss, aa))² as loss

    s = s'
until terminated
DeepMind actually used several more tricks to make this work well, such as target networks, error clipping, reward clipping and so on, but these are beyond the scope of this post.
The most amazing part of this algorithm is that it learns anything at all. Just think about it: because our Q-function is initialized randomly, it initially outputs complete garbage. And then we use that garbage (the maximum Q-value of the next state) as targets for the network, only occasionally folding in a tiny reward. That sounds insane: how could the network learn anything meaningful at all? The fact is, it does.
Final notes
Many improvements to deep Q-learning have been proposed since its first introduction – Double Q-learning, Prioritized Experience Replay, Dueling Network Architecture and extension to continuous action space, to name a few. For the latest advancements check out the NIPS 2015 deep reinforcement learning workshop and ICLR 2016 (search for "reinforcement" in the title). But beware that deep Q-learning has been patented by Google.
It is often said that artificial intelligence is something we haven’t figured out yet. Once we know how it works, it doesn’t seem intelligent any more. But deep Q-networks still continue to amaze me. Watching them figure out a new game is like observing an animal in the wild – a rewarding experience by itself.
Credits
Thanks to Ardi Tampuu, Tanel Pärnamaa, Jaan Aru, Ilya Kuzovkin, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.
Links
David Silver’s lecture about deep reinforcement learning
Slightly awkward but accessible illustration of Q-learning
UC Berkeley’s course on deep reinforcement learning
David Silver’s reinforcement learning course
Nando de Freitas’ course on machine learning (two lectures about reinforcement learning in the end)
Andrej Karpathy’s course on convolutional neural networks
[1] Algorithm adapted from http://artint.info/html/ArtInt_265.html