Playing Flappy Bird with Reinforcement Learning

I've just started getting into reinforcement learning. To meet the requirements of my graduation project, I combined Q-learning with a video game: I spent a few days working through an excellent project by yenchenlin, adding plenty of comments and my own interpretation, so it should now be quite easy to read. I hope it is of some help to you.

GitHub repository: ReDeepLearningFlappyBird
https://github.com/ZhangRui111/ReDeepLearningFlappyBird

Using Deep Q-Network to Learn How To Play Flappy Bird

Based on yenchenlin/DeepLearningFlappyBird

Overview

This project implements the Deep Q-Learning algorithm described in Playing Atari with Deep Reinforcement Learning [2] and shows that this learning algorithm can be further generalized to the notorious Flappy Bird.

[Demo GIF: the trained agent playing Flappy Bird (bird_demo.gif)]

7-minute demo video: DQN for Flappy Bird

How to Run?

git clone https://github.com/ZhangRui111/ReDeepLearningFlappyBird.git
cd ReDeepLearningFlappyBird
python deep_q_network.py

By default, the program runs with pretrained weights. If you want to train the network from scratch, delete saved_networks/checkpoint. Note that training may take several days depending on your hardware.
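
The pretrained-weights behavior comes from TF1-style checkpoint handling. Below is a hedged sketch, reconstructed from the upstream project and not guaranteed verbatim, of why deleting saved_networks/checkpoint forces training from scratch:

import tensorflow as tf

# A sketch of the repo's TF1-style checkpoint handling (reconstructed,
# not verbatim). It assumes the Q-network graph has already been built.
def load_or_init(sess):
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    # get_checkpoint_state reads the saved_networks/checkpoint index file;
    # deleting that file makes the lookup fail, so training starts fresh.
    ckpt = tf.train.get_checkpoint_state("saved_networks")
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        print("Successfully loaded:", ckpt.model_checkpoint_path)
    else:
        print("Could not find old network weights")
    return saver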

About deep_q_network.py

[Figures: two annotated screenshots walking through deep_q_network.py (case 1 and case 2)]

Deep Q-Network Algorithm

The pseudo-code for the Deep Q Learning algorithm, as given in [1], can be found below:

Initialize replay memory D to size N
Initialize action-value function Q with random weights θ
for episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the emulator and observe r_t and s_(t+1)
        Store transition (s_t, a_t, r_t, s_(t+1)) in D
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from D
        Set y_j :=
            r_j                                     for terminal s_(j+1)
            r_j + γ * max_(a') Q(s_(j+1), a'; θ)    for non-terminal s_(j+1)
        Perform a gradient step on (y_j - Q(s_j, a_j; θ))^2 with respect to θ
    end for
end for
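
To make the inner loop concrete, here is a minimal Python sketch of ε-greedy action selection and one gradient step. It assumes a TF2/Keras model q_network mapping an 80x80x4 state stack to per-action Q values; the actual repo implements the same logic with TF1 graph code, and replay_memory, optimizer, and the transition tuple layout here are illustrative assumptions.

import random
import numpy as np
import tensorflow as tf

GAMMA = 0.99    # discount factor γ (assumed value)
EPSILON = 0.1   # exploration rate ε (assumed value)

def select_action(q_network, state, num_actions=2):
    # ε-greedy: with probability ε take a random action,
    # otherwise take the action with the highest Q value.
    if random.random() < EPSILON:
        return random.randrange(num_actions)
    q_values = q_network(state[np.newaxis])[0]
    return int(np.argmax(q_values))

def train_step(q_network, optimizer, replay_memory, batch_size=32):
    # Sample a random minibatch of (s_j, a_j, r_j, s_{j+1}, terminal) tuples.
    batch = random.sample(replay_memory, batch_size)
    states      = np.stack([b[0] for b in batch]).astype(np.float32)
    actions     = np.array([b[1] for b in batch], dtype=np.int32)
    rewards     = np.array([b[2] for b in batch], dtype=np.float32)
    next_states = np.stack([b[3] for b in batch]).astype(np.float32)
    terminal    = np.array([b[4] for b in batch], dtype=np.float32)

    # y_j = r_j for terminal s_{j+1}; else r_j + γ * max_{a'} Q(s_{j+1}, a').
    next_q = q_network(next_states).numpy().max(axis=1)
    targets = tf.constant(rewards + GAMMA * next_q * (1.0 - terminal))

    # Gradient step on (y_j - Q(s_j, a_j))^2 with respect to θ.
    with tf.GradientTape() as tape:
        q_all = q_network(states)
        q_taken = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return float(loss)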

Network Architecture

According to [1], I first preprocessed the game screens with the following steps (a sketch follows the list):

  1. Convert image to grayscale
  2. Resize image to 80x80
  3. Stack last 4 frames to produce an 80x80x4 input array for network
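
A minimal sketch of these three steps, assuming OpenCV (cv2) and NumPy; the upstream code additionally thresholds the grayscale frame to a binary image, which is reproduced here:

import cv2
import numpy as np

def preprocess(frame, prev_stack=None):
    # Step 1: convert the raw RGB game screen to grayscale.
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # Step 2: resize to 80x80 (the upstream code also binarizes the result).
    small = cv2.resize(gray, (80, 80))
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
    if prev_stack is None:
        # First frame: repeat it 4 times to bootstrap the stack.
        return np.stack([binary] * 4, axis=2)
    # Step 3: drop the oldest frame and append the newest -> 80x80x4.
    return np.append(prev_stack[:, :, 1:], binary[:, :, np.newaxis], axis=2)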

The architecture of the network is shown in the figure below. The first layer convolves the input image with an 8x8x4x32 kernel at a stride of 4. The output is then put through a 2x2 max pooling layer. The second layer convolves with a 4x4x32x64 kernel at a stride of 2. We then max pool again. The third layer convolves with a 3x3x64x64 kernel at a stride of 1. We then max pool one more time. The last hidden layer consists of 256 fully connected ReLU nodes.

[Figure: network architecture diagram]

The final output layer has the same dimensionality as the number of valid actions which can be performed in the game, where the 0th index always corresponds to doing nothing. The values at this output layer represent the Q function given the input state for each valid action. At each time step, the network performs whichever action corresponds to the highest Q value under an ε-greedy policy.
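
The following Keras snippet is an equivalent reconstruction of the layer stack described above, not the repo's actual TF1 graph code; the padding choices and num_actions=2 (do nothing, flap) are assumptions:

import tensorflow as tf

def build_q_network(num_actions=2):
    # Input is 80x80x4: four stacked grayscale frames.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(80, 80, 4)),
        # 8x8x4x32 kernel, stride 4, then 2x2 max pooling.
        tf.keras.layers.Conv2D(32, 8, strides=4, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2, padding="same"),
        # 4x4x32x64 kernel, stride 2, then max pooling again.
        tf.keras.layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2, padding="same"),
        # 3x3x64x64 kernel, stride 1, then one more max pooling.
        tf.keras.layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2, padding="same"),
        tf.keras.layers.Flatten(),
        # Fully connected hidden layer of 256 ReLU units.
        tf.keras.layers.Dense(256, activation="relu"),
        # One linear output per valid action: Q(s, a) for each a.
        tf.keras.layers.Dense(num_actions),
    ])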

Disclaimer

This work is based heavily on the following repos:

  1. [sourabhv/FlapPyBird](https://github.com/sourabhv/FlapPyBird)
  2. asrivat1/DeepLearningVideoGames
  3. yenchenlin/DeepLearningFlappyBird


Original article. Please credit the source when republishing: http://www.itdecent.cn/p/755f9f2604d0
