
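Notes from reading documentation at work: a brief review of some of the background knowledge involved. Most of the ideas come from DeepMind or OpenAI.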

PPO objective functions

PPO has two forms of objective function. The first is usually called adaptive KL:

$\theta_{k+1}=\arg\max_{\theta}\mathbb{E}_{\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t})-\beta_{k}D_{KL}(\pi'||\pi_{\theta})\Big]$

The second is usually called the clipped surrogate:

$\theta_{k+1}=\arg\max_{\theta}\mathbb{E}_{\pi'}\Big[\sum_{t=0}^{\infty}\min\Big(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t}),\ \text{clip}\Big(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})},1-\epsilon,1+\epsilon\Big)A^{\pi}(s_{t},a_{t})\Big)\Big]$

where

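$\theta$ denotes the parameters of the policy model, and $\pi_{\theta}$ is the policy being trained.
$\pi'$ is the old policy from before the update. The usual approach is to initialize two networks with the same architecture, use the training data (trajectories) collected by $\pi'$ interacting with the environment to update $\pi_{\theta}$, and after a number of steps copy all parameters of $\pi_{\theta}$ into $\pi'$.
$A^{\pi}(s_{t},a_{t})=Q^{\pi}(s_{t},a_{t})-V^{\pi}(s_{t})\approx R_{t}+\gamma{V^{\pi}(s_{t+1})}-V^{\pi}(s_{t})$ is the advantage function. The last expression is its one-step sample estimate; in practice it is evaluated on data collected by $\pi'$.
$\epsilon$ is usually set to a small value such as 0.1.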
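$D_{KL}(\pi'||\pi)=\mathbb{E}_{s_{t}\sim{d^{\pi'}(s)}}\mathbb{E}_{a_{t}\sim\pi'}[\log(\frac{\pi'(a_{t}|s_{t})}{\pi(a_{t}|s_{t})})]$ is the usual KL divergence. For a discrete action space it can be computed directly as the cross-entropy between $\pi'$ and $\pi$ minus the entropy of $\pi'$. For a continuous action space, the policy is usually parameterized as a Gaussian distribution (the network outputs two vectors, one for $\mu$ and one for $\sigma$, and actions are sampled from them with the reparameterization trick). The KL between two Gaussians has a closed form:
$D_{KL}(\mathcal{N}(\mu_{1},\sigma_{1}^{2})||\mathcal{N}(\mu_{2},\sigma_{2}^{2}))=\log(\frac{\sigma_{2}}{\sigma_{1}})+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}$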
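$V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\mid s_{0}=s]$. The right-hand side is the total discounted reward, the final objective that all reinforcement learning algorithms optimize. It is usually parameterized as a value network and trained directly with supervised learning. Reinforcement learning algorithms of recent years commonly use GAE when estimating it.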
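
To make the clipped surrogate concrete, here is a minimal NumPy sketch. It assumes the log-probabilities of the taken actions under both policies and the advantage estimates are already available as arrays. The function name `clipped_surrogate` and the sample numbers are illustrative and do not come from any particular library.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.1):
    """Mean clipped surrogate objective (a quantity to maximize).

    logp_new: log pi_theta(a_t | s_t) for the sampled actions
    logp_old: log pi'(a_t | s_t) for the same actions (data-collecting policy)
    """
    ratio = np.exp(logp_new - logp_old)  # pi_theta / pi'
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - epsilon, 1 + epsilon].
    return float(np.mean(np.minimum(unclipped, clipped)))

# Illustrative (made-up) numbers: three sampled actions.
logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.5, 0.1]))
advantages = np.array([1.0, 1.0, -1.0])
objective = clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.1)
```

In this example the first ratio (1.8) is clipped down to 1.1, and for the third sample the clipped term is the smaller one, so the minimum keeps the pessimistic value.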

Why the objectives are designed this way

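Both objectives aim to approximate the natural gradient $\tilde{g}=F^{-1}g=(\nabla^{2}_{\theta}D_{KL}(\pi'||\pi_{\theta}))^{-1}g$, where $F$ is the Fisher information matrix. $F$ defines a Riemannian metric on the space of probability distributions that is invariant to reparameterization, which makes $F^{-1}g$ a contravariant vector; i.e., the natural gradient determined by $F^{-1}g$ does not depend on how $\pi_{\theta}$ is parameterized, and training therefore has lower variance.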
Updating within a bounded KL range gives a (weak) monotonic improvement bound:
$J(\pi_{\theta})-J(\pi')\geq \mathbb{E}_{\pi'}[\sum_{t=0}^{\infty}\gamma^{t}\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t})]-\frac{4\gamma\max_{s,a}|A^{\pi}(s,a)|}{(1-\gamma)^{2}}\max_{s}D_{KL}(\pi'(\cdot|s)||\pi_{\theta}(\cdot|s))$
i.e., updating the policy inside a bounded KL ball keeps convergence stable. In practice the maximum over states is replaced by the expectation under $d^{\pi'}$.
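Off-policyness: trajectories sampled from $\pi'$ can be used to optimize $\pi_{\theta}$, which makes it easier to build a distributed computing framework. One can expect, though, that iteration will not be very fast, because the form of the PPO objective limits the size of $D_{KL}(\pi'||\pi_{\theta})$ in each update, and after every update the parameters of $\pi_{\theta}$ have to be copied back to $\pi'$.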
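
As a small illustration of the natural gradient $F^{-1}g$, the sketch below works through a toy case that is not from the original notes: a single-state softmax policy over a few actions with known rewards. The names `natural_gradient` and `damping` are hypothetical, and the damping term is an assumption added because the Fisher matrix of a softmax is singular.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def natural_gradient(logits, rewards, damping=1e-8):
    """Return (F^-1 g, g, F) for J(theta) = sum_a pi_theta(a) * r_a."""
    p = softmax(logits)
    # Vanilla gradient of J with respect to the logits.
    g = p * (rewards - p @ rewards)
    # Fisher matrix of a softmax distribution: E[grad log p grad log p^T].
    F = np.diag(p) - np.outer(p, p)
    # Adding a constant to all logits changes nothing, so F is singular;
    # a small damping term (an assumption of this sketch) makes it solvable.
    nat = np.linalg.solve(F + damping * np.eye(len(p)), g)
    return nat, g, F

logits = np.array([2.0, 0.5, -1.0])
rewards = np.array([1.0, 0.0, 3.0])
nat, g, F = natural_gradient(logits, rewards)
```

In this toy case the vanilla gradient `g` is scaled by the current action probabilities, while the natural gradient comes out as the centered rewards, independent of how peaked the current policy is.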

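From personal experience, pay particular attention to how the KL evolves during training. When the KL is well bounded, policy improvement has a theoretical guarantee; when it is not, the policy sometimes degrades and gets worse the longer it trains.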
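
One way to watch the KL during training is sketched below, assuming a diagonal Gaussian policy. `gaussian_kl` implements the closed form for two Gaussians, and `adapt_beta` follows the heuristic from the PPO paper for the adaptive KL variant (double the penalty when the KL overshoots the target by 1.5x, halve it when it undershoots). The function names and the target value are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), summed over independent dimensions."""
    mu1, sigma1, mu2, sigma2 = (np.asarray(x, dtype=float)
                                for x in (mu1, sigma1, mu2, sigma2))
    per_dim = (np.log(sigma2 / sigma1)
               + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
               - 0.5)
    return float(np.sum(per_dim))

def adapt_beta(beta, kl, kl_target):
    """Adjust the KL penalty coefficient after a policy update."""
    if kl > 1.5 * kl_target:
        return beta * 2.0   # KL too large: penalize harder
    if kl < kl_target / 1.5:
        return beta / 2.0   # KL too small: relax the penalty
    return beta

# Old policy pi' versus updated policy pi_theta on one state (made-up numbers).
kl = gaussian_kl(mu1=[0.0, 0.0], sigma1=[1.0, 1.0], mu2=[0.1, -0.1], sigma2=[1.0, 1.0])
beta = adapt_beta(beta=1.0, kl=kl, kl_target=0.01)
```

Logging `kl` per update makes a badly bounded KL visible before the policy starts to degrade.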

A brief explanation of GAE

GAE stands for generalized advantage estimator and comes from the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation. In a word, its goal is to find a bias-variance tradeoff among the various ways of estimating the advantage function $A(s,a)=Q(s,a)-V(s)$.

[Image not available: the six forms of the policy gradient]

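Of the six forms of the policy gradient above, form 3 has the smallest theoretical variance, but in practice form 5 is usually used because of computational complexity.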

One way to estimate $A^{\pi}(s_{t},a_{t})$ is to take the rewards of the whole trajectory into account. This estimate has smaller bias and larger variance:
$\hat{A}^{\pi}(s_{t},a_{t})=\sum_{k=0}^{T-t}\gamma^{k}R_{t+k}+\gamma^{T-t+1}V(s_{T+1})-V(s_{t})$

Another way is to use the existing $V(s)$ function to assist the estimate. This has larger bias and smaller variance:
$\hat{A}(s_{t},a_{t})=R_{t}+\gamma V(s_{t+1})-V(s_{t})$

Forms in between the two can also be constructed. In summary:

$\begin{aligned} \hat{A}_t^{(1)} &= R_t + \gamma V(s_{t+1}) - V(s_t) \\ \hat{A}_t^{(2)} &= R_t + \gamma R_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\ \hat{A}_{t}^{(n)} &= \sum_{k=0}^{n-1}\gamma^{k}R_{t+k}+\gamma^{n}V(s_{t+n})-V(s_{t}) \\ \hat{A}_t^{(\infty)} &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots - V(s_t) \end{aligned}$
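
The n-step family can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that `values` holds one more entry than `rewards` (the value of the state reached after the last reward); the function name and sample numbers are made up for the example.

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_t^(n) = sum_{k=0}^{n-1} gamma^k R_{t+k} + gamma^n V(s_{t+n}) - V(s_t).

    rewards: R_0 .. R_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    n = min(n, len(rewards) - t)  # truncate at the end of the trajectory
    discounts = gamma ** np.arange(n)
    n_step_return = np.sum(discounts * rewards[t:t + n]) + gamma ** n * values[t + n]
    return float(n_step_return - values[t])

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # last entry: value after the final step
one_step = n_step_advantage(rewards, values, t=0, n=1, gamma=0.5)
two_step = n_step_advantage(rewards, values, t=0, n=2, gamma=0.5)
full = n_step_advantage(rewards, values, t=0, n=3, gamma=0.5)
```

Small `n` leans on the value function (more bias, less variance); large `n` leans on sampled rewards (less bias, more variance).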

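The way to strike a bias-variance balance among all the advantage estimators above is to take an exponentially weighted average, with geometrically decaying weights $\lambda^{n-1}$, of $\hat{A}_{t}^{(1)}$ through $\hat{A}_{t}^{(\infty)}$:

$\hat{A}_t^{GAE(\gamma,\lambda)} = (1-\lambda)\Big(\hat{A}_{t}^{(1)} + \lambda \hat{A}_{t}^{(2)} + \lambda^2 \hat{A}_{t}^{(3)} + \cdots \Big) = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^{V}$, where $\delta_{t}^{V}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$ is the one-step TD residual.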
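
For a finite trajectory, the sum over TD residuals is usually computed with a backward recursion. The sketch below assumes the trajectory is truncated after the last reward and that `values` contains one extra bootstrap entry; the function name `gae` and the numbers are illustrative.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for every step of one trajectory.

    rewards: R_0 .. R_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # TD residuals: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # A_t = delta_t + (gamma * lam) * A_{t+1}, accumulated from the end.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
td_like = gae(rewards, values, gamma=0.5, lam=0.0)  # reduces to the one-step estimate
mc_like = gae(rewards, values, gamma=0.5, lam=1.0)  # reduces to the full-trajectory estimate
```

Setting `lam` between 0 and 1 interpolates between the two extremes.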

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II

[Image not available: AlphaStar training figure]

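This is the framework DeepMind used for StarCraft II, and it contains a great deal.

The model is an off-policy actor-critic, with experience replay, self-imitation learning, and policy distillation.
To keep the strategies diverse, a baseline model is first trained with SL (001 in the figure above). At the start of each iteration, several identical copies of the previous iteration's models are made and trained through self-play. Each copy has different hyperparameters, and even a different reward definition, and they are trained with PBT. Models from the previous iteration are no longer updated and are called frozen competitors. Opponent matchmaking in self-play is designed in a way similar to the ladder matchmaking system for human players.
A transformer outputs the action for each unit, combined with a pointer network and a centralized value baseline.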

Population based training of neural networks

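PBT for short, from the DeepMind paper Population based training of neural networks; see also the DeepMind blog post.

In essence it uses the idea of a genetic algorithm for hyper-parameter tuning: train N models at the same time, each with its own set of hyperparameters; after training for a while, take some of the better-performing models, explore around their hyperparameters, and continue training.

This also helps keep the models from getting stuck in local optima.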
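
The loop described above can be sketched as one exploit/explore step over a population. Everything here is a simplified assumption for illustration: the dictionary layout, the `frac` of models replaced, and the multiplicative perturbation factors are hypothetical choices, and real training between steps is omitted.

```python
import random

def pbt_step(population, rng, frac=0.25, perturb=(0.8, 1.2)):
    """One exploit/explore step.

    population: list of dicts with keys "score", "hparams", "weights".
    The worst `frac` of models copy a top model, then perturb its hyperparameters.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for loser in bottom:
        winner = rng.choice(top)
        # Exploit: inherit the weights of a better-performing model.
        loser["weights"] = list(winner["weights"])
        # Explore: move to nearby hyperparameters.
        loser["hparams"] = {name: value * rng.choice(perturb)
                            for name, value in winner["hparams"].items()}
    return population

rng = random.Random(0)
population = [
    {"score": 0.1, "hparams": {"lr": 1e-2}, "weights": [0.1]},
    {"score": 0.9, "hparams": {"lr": 1e-3}, "weights": [0.9]},
    {"score": 0.5, "hparams": {"lr": 1e-4}, "weights": [0.5]},
    {"score": 0.3, "hparams": {"lr": 1e-1}, "weights": [0.3]},
]
pbt_step(population, rng)
```

After the step, the lowest-scoring model carries the best model's weights and a learning rate perturbed around the best model's value; training would then continue for all members.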

Policy distillation


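In the original Policy distillation paper the teacher is a DQN, which is very different from our setting: we generally use policy-gradient methods, which can be distilled directly, so the question of different losses discussed in the paper does not arise.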

For distilling the student network, the authors tried three different losses:

Parameterize the student as $\pi_{\theta}:\mathcal{S}\rightarrow{\mathcal{A}}$ and take past data directly from the DQN replay memory as action labels.
Parameterize the student as $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ and use mean squared error as the loss.
Parameterize the student as $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, pass the Q-values output by both the student and the teacher through a softmax, and use the KL divergence from Hinton's paper as the loss.

The experiments show that the third loss works best.
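
The three losses can be written down compactly in NumPy. This is a sketch under stated assumptions: the action label for the first loss is taken as the teacher's greedy action, the teacher softmax in the third loss is sharpened with a temperature whose default value here is only illustrative, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(x, temperature=1.0):
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll_loss(student_logits, teacher_q):
    """Loss 1: classification against the teacher's greedy action."""
    labels = np.argmax(teacher_q, axis=-1)
    logp = log_softmax(student_logits)
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def mse_loss(student_q, teacher_q):
    """Loss 2: regress the student's Q-values onto the teacher's."""
    diff = np.asarray(student_q, dtype=float) - np.asarray(teacher_q, dtype=float)
    return float(np.mean(diff ** 2))

def kl_loss(student_q, teacher_q, temperature=0.01):
    """Loss 3: KL(softmax(teacher_q / temperature) || softmax(student_q))."""
    log_p = log_softmax(teacher_q, temperature)
    log_q = log_softmax(student_q)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

teacher_q = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])  # made-up Q-values
student_q = np.array([[3.0, 2.0, 1.0], [5.0, 0.0, 0.0]])
losses = (nll_loss(student_q, teacher_q),
          mse_loss(student_q, teacher_q),
          kl_loss(student_q, teacher_q))
```

Working in log space keeps the KL finite even when a low temperature drives some teacher probabilities to zero.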

Centralized value baseline

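From the AAAI 2018 paper Counterfactual Multi-Agent Policy Gradients (Foerster et al., University of Oxford). The idea is simple: in a multi-agent setting, all actors share a single critic.

The idea looks like a weakened version of OpenAI's Multi-agent actor critic paper. The OpenAI paper at least seriously considered the non-stationary MDP problem that arises in the multi-agent setting, whereas this paper ignores it outright. This indirectly suggests that the non-stationary MDP problem that may exist in theory does not actually hurt model performance in engineering practice.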

Transformer

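The following draws on the blog post The Illustrated Transformer and the original Google Brain paper "Attention is All You Need".

The Transformer is the foundation of BERT; with it, NLP tasks started down the road of doing away with RNNs.

In short, the paper achieves very good results on many tasks using only attention and feed-forward networks. It consists of several parts: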

Self-attention, encoder-decoder attention, and multi-head attention
Feed-forward network
Positional encoding
Residual blocks

First, self-attention.

[Image not available: self-attention illustration]

Self-attention has the form
$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V, \quad \text{where}\ Q=XW_{Q},\ K=XW_{K},\ V=XW_{V}$ (each row of $X$ is one token)

A picture is worth a thousand words:
[Image not available]
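
Since the figures are missing, a short NumPy sketch of scaled dot-product self-attention may help. It assumes each row of `X` is one token embedding, so the projections are written as `X @ W`; the shapes and random weights are illustrative only.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence.

    X: (n_tokens, d_model); W_q, W_k, W_v: (d_model, d_k)
    Returns the attended values and the attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_tokens, n_tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8 (made-up sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `weights` says how much token `i` attends to every token in the sequence, and each row sums to 1.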

Next, encoder-decoder attention.

This attention is in fact the same as the attention used earlier in seq2seq models.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.

Multi-head attention: put plainly, it runs several self-attention heads side by side in the same layer. It serves two main purposes:

It expands the model’s ability to focus on different positions

It gives the attention layer multiple "representation subspaces"
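
A minimal multi-head sketch follows, assuming the common arrangement in which `d_model` is split evenly across heads and the concatenated heads pass through an output projection `W_o`. Names, shapes, and random weights are illustrative, and masking is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head): one subspace per head
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    heads = softmax(scores) @ V                           # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

Each head computes its own attention pattern over its own slice of the projected features, which is what gives the layer several representation subspaces.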

Positional Encoding

To be updated...

In addition, the transformer uses residual blocks to prevent the gradient degradation caused by very deep networks.
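
The residual connection can be sketched as below, following the `LayerNorm(x + Sublayer(x))` arrangement of the original Transformer paper. The learnable gain and bias of layer normalization are left out for brevity, and the toy sublayer is only a stand-in for attention or a feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Apply a sublayer with a skip connection, then normalize."""
    # The identity path x + ... lets gradients bypass the sublayer entirely.
    return layer_norm(x + sublayer(x))

x = np.arange(12, dtype=float).reshape(3, 4)
toy_sublayer = lambda h: 0.1 * np.tanh(h)  # hypothetical stand-in sublayer
y = residual_sublayer(x, toy_sublayer)
```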



Game AI fundamentals: a review
