PPO is a stochastic-policy DRL algorithm proposed by OpenAI in 2017. It not only performs well (especially on continuous control problems), it is also much easier to implement than the earlier TRPO method. PPO is currently OpenAI's default algorithm and arguably one of the strongest policy-gradient methods.
The PPO implementation in this post follows Morvan's (莫煩) TensorFlow implementation, because the same training pipeline fails to converge when implemented with Keras, and I have not yet found the cause.
Paper:
TRPO: Trust Region Policy Optimization
PPO: Proximal Policy Optimization Algorithms
OpenAI PPO Blog: Proximal Policy Optimization
Github: https://github.com/xiaochus/Deep-Reinforcement-Learning-Practice
Environment
- Python 3.6
- Tensorflow-gpu 1.8.0
- Keras 2.2.2
- Gym 0.10.8
TRPO
Policy-gradient algorithms such as PG and DDPG have achieved good results in both discrete and continuous action spaces. The gradient updates of this family of algorithms follow the relation below:
$$\theta_{new} = \theta_{old} + \alpha \nabla_{\theta} J(\theta)$$
It is hard to get good results with policy-gradient methods, because they are very sensitive to the step size used for each update: if it is too small, training is hopelessly slow; if it is too large, the learning signal gets drowned in noise and performance may even collapse. These methods also tend to have very poor sample efficiency, often needing millions or even billions of total iterations to learn simple tasks.
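For reference, here is a minimal sketch of what such a vanilla policy-gradient update looks like in TensorFlow 1.x. This is an illustrative example, not part of the PPO code below: the placeholder shapes happen to match the Pendulum setup used later, and the Gaussian policy with a fixed standard deviation is an assumption made just for this sketch. The learning rate alpha is exactly the step size the paragraph above worries about.

import tensorflow as tf

# states / actions / advantage are fed from rollouts collected with the current policy.
states = tf.placeholder(tf.float32, [None, 3], 'states')
actions = tf.placeholder(tf.float32, [None, 1], 'actions')
advantage = tf.placeholder(tf.float32, [None, 1], 'advantage')

# A tiny Gaussian policy with a fixed standard deviation (illustrative only).
mu = tf.layers.dense(states, 1, tf.nn.tanh)
pi = tf.distributions.Normal(loc=mu, scale=1.0)

# Vanilla policy gradient: maximize E[log pi(a|s) * A] by minimizing its negative.
pg_loss = -tf.reduce_mean(pi.log_prob(actions) * advantage)

alpha = 0.0001  # the step size; too small trains slowly, too large destabilizes training
train_op = tf.train.AdamOptimizer(alpha).minimize(pg_loss)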
A suitable step size here means one for which the return does not get worse after the policy update. How do we pick such a step size? In other words, how do we find a new policy whose return is monotonically increasing, or at least non-decreasing? The core of TRPO is solving exactly this problem of choosing the learning rate (or step size).
TRPO's approach is to decompose the return of the new policy into the return of the old policy plus an additional term. As long as that additional term is greater than or equal to zero, the new policy is guaranteed not to decrease the return.
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]$$
For the detailed TRPO theory and derivations, refer to the TRPO article; it is very well written. Below we use its conclusions directly.
The reward above can be expanded into the following form:
$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s)\, A_{\pi}(s, a)$$
The TRPO optimization problem is:
$$\max_{\theta}\ \left[ L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta) \right]$$
Finally, the TRPO problem can be simplified to:
$$\max_{\theta}\ \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, \hat{A}_t\right] \quad \text{s.t.}\quad \mathbb{E}_t\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t)\,\|\,\pi_{\theta}(\cdot|s_t)\big)\right] \le \delta$$
PPO
PPO is a policy algorithm implemented on the Actor-Critic architecture and can be seen as an improved version of TRPO.

PPO1
The policy update objective corresponding to PPO1 is:
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\,\hat{A}_t - \beta\, D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t)\,\|\,\pi_{\theta}(\cdot|s_t)\big)\right]$$
In TRPO we require that θ and θ' do not differ too much. This does not mean the parameter values themselves must stay close; it means that, for the same input state, the action distributions produced by the two networks should not differ too much. The KL divergence is used to measure how similar the two action distributions are. This policy loss is implemented as follows (ratio is the probability ratio between the new and old policies, and surr is the surrogate objective):
ratio = tf.exp(nd.log_prob(self.action) - old_nd.log_prob(self.action))
surr = ratio * self.adv
self.tflam = tf.placeholder(tf.float32, None, 'lambda')
kl = tf.distributions.kl_divergence(old_nd, nd)
self.kl_mean = tf.reduce_mean(kl)
self.aloss = -(tf.reduce_mean(surr - self.tflam * kl))
The idea behind PPO1 is simple. TRPO argued that the penalty coefficient β is hard to choose and therefore replaced the penalty with a hard constraint. PPO1 keeps the penalty but avoids hand-tuning the hyperparameter by adapting β with the following rule:
If $\bar{D}_{KL} < d_{targ} / 1.5$: $\beta \leftarrow \beta / 2$; if $\bar{D}_{KL} > d_{targ} \times 1.5$: $\beta \leftarrow \beta \times 2$.
After each round of training, β is adjusted once according to this rule:
if kl < self.kl_target / 1.5:
self.lam /= 2
elif kl > self.kl_target * 1.5:
self.lam *= 2
PPO2
Besides this, the original paper proposes another way to limit the size of each update, usually called PPO2. The paper reports that PPO2 works better than PPO1, so when people say PPO they usually mean PPO2. The idea behind PPO2 is also simple and starts from an observation about the objective.
First, make the following definition:
$$r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$
The corresponding policy update objective is:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$
In this way the distributions before and after an update are guaranteed to stay close, which prevents θ from changing too fast.
self.aloss = -tf.reduce_mean(tf.minimum(
surr,
tf.clip_by_value(ratio, 1.- self.epsilon, 1.+ self.epsilon) * self.adv))
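To make the clipping concrete, here is a small standalone numpy sketch (illustrative only, with made-up numbers) that evaluates the clipped objective for a single sample:

import numpy as np

def clipped_objective(ratio, adv, epsilon=0.2):
    # min(r * A, clip(r, 1 - eps, 1 + eps) * A), as in the PPO2 loss above
    return np.minimum(ratio * adv, np.clip(ratio, 1. - epsilon, 1. + epsilon) * adv)

# Positive advantage: the objective is capped at (1 + eps) * A,
# so there is no incentive to push the ratio above 1.2.
print(clipped_objective(1.5, 1.0))   # 1.2
# Negative advantage: once the ratio has dropped below 1 - eps the objective is
# flat at (1 - eps) * A, so there is no incentive to push the probability down further.
print(clipped_objective(0.5, -1.0))  # -0.8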
PS:
For both the PPO1 and PPO2 policies, the loss is negated at the end. This is because the loss we define is actually the additional term in TRPO's decomposition of the new policy's return; as shown above, we need this term to be greater than zero, so our goal is to maximize it. Gradient-descent optimizers, however, minimize a loss, so we negate it: minimizing the negative of the term is the same as maximizing the term itself.
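As a tiny sanity check of this sign flip (illustrative only; the variable x and the quadratic objective are made up, not part of the PPO code), minimizing the negative of an objective with a gradient-based optimizer does maximize the objective:

import tensorflow as tf

x = tf.Variable(0.0)
objective = -(x - 3.0) ** 2                                   # maximized at x = 3
train_op = tf.train.AdamOptimizer(0.1).minimize(-objective)   # minimize the negative

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)
    print(sess.run(x))  # close to 3.0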
Implementation
We use Pendulum to experiment with continuous control; the PPO implementation is shown below.
For continuous actions, DDPG regresses the action value directly, whereas PPO still uses a stochastic policy: the Actor network outputs a mean and a standard deviation, builds a normal distribution from them, and the action is sampled from that distribution.
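Before the full listing, here is a minimal standalone sketch of that sampling step (the names mu, sigma and bound are illustrative; the class below does the same thing with tf.distributions.Normal inside choose_action):

import numpy as np

def sample_action(mu, sigma, bound):
    # Draw one action from N(mu, sigma^2) and clip it to the valid action range.
    action = np.random.normal(loc=mu, scale=sigma)
    return np.clip(action, -bound, bound)

print(sample_action(mu=0.3, sigma=0.5, bound=2.0))

The full PPO implementation follows: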
import os
import gym
import numpy as np
import pandas as pd
import tensorflow as tf
class PPO:
def __init__(self, ep, batch, t='ppo2'):
self.t = t
self.ep = ep
self.batch = batch
self.log = 'model/{}_log'.format(t)
self.env = gym.make('Pendulum-v0')
self.bound = self.env.action_space.high[0]
self.gamma = 0.9
self.A_LR = 0.0001
self.C_LR = 0.0002
self.A_UPDATE_STEPS = 10
self.C_UPDATE_STEPS = 10
        # KL penalty hyperparameters for ppo1: d_target and β (self.lam)
self.kl_target = 0.01
self.lam = 0.5
# ε for ppo2
self.epsilon = 0.2
self.sess = tf.Session()
self.build_model()
def _build_critic(self):
"""critic model.
"""
with tf.variable_scope('critic'):
x = tf.layers.dense(self.states, 100, tf.nn.relu)
self.v = tf.layers.dense(x, 1)
self.advantage = self.dr - self.v
def _build_actor(self, name, trainable):
"""actor model.
"""
with tf.variable_scope(name):
x = tf.layers.dense(self.states, 100, tf.nn.relu, trainable=trainable)
mu = self.bound * tf.layers.dense(x, 1, tf.nn.tanh, trainable=trainable)
sigma = tf.layers.dense(x, 1, tf.nn.softplus, trainable=trainable)
norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)
params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
return norm_dist, params
def build_model(self):
"""build model with ppo loss.
"""
# inputs
self.states = tf.placeholder(tf.float32, [None, 3], 'states')
self.action = tf.placeholder(tf.float32, [None, 1], 'action')
self.adv = tf.placeholder(tf.float32, [None, 1], 'advantage')
self.dr = tf.placeholder(tf.float32, [None, 1], 'discounted_r')
# build model
self._build_critic()
nd, pi_params = self._build_actor('actor', trainable=True)
old_nd, oldpi_params = self._build_actor('old_actor', trainable=False)
# define ppo loss
with tf.variable_scope('loss'):
# critic loss
self.closs = tf.reduce_mean(tf.square(self.advantage))
# actor loss
with tf.variable_scope('surrogate'):
ratio = tf.exp(nd.log_prob(self.action) - old_nd.log_prob(self.action))
surr = ratio * self.adv
if self.t == 'ppo1':
self.tflam = tf.placeholder(tf.float32, None, 'lambda')
kl = tf.distributions.kl_divergence(old_nd, nd)
self.kl_mean = tf.reduce_mean(kl)
self.aloss = -(tf.reduce_mean(surr - self.tflam * kl))
else:
self.aloss = -tf.reduce_mean(tf.minimum(
surr,
tf.clip_by_value(ratio, 1.- self.epsilon, 1.+ self.epsilon) * self.adv))
# define Optimizer
with tf.variable_scope('optimize'):
self.ctrain_op = tf.train.AdamOptimizer(self.C_LR).minimize(self.closs)
self.atrain_op = tf.train.AdamOptimizer(self.A_LR).minimize(self.aloss)
with tf.variable_scope('sample_action'):
self.sample_op = tf.squeeze(nd.sample(1), axis=0)
# update old actor
with tf.variable_scope('update_old_actor'):
self.update_old_actor = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]
tf.summary.FileWriter(self.log, self.sess.graph)
self.sess.run(tf.global_variables_initializer())
def choose_action(self, state):
"""choice continuous action from normal distributions.
Arguments:
state: state.
Returns:
action.
"""
state = state[np.newaxis, :]
action = self.sess.run(self.sample_op, {self.states: state})[0]
return np.clip(action, -self.bound, self.bound)
def get_value(self, state):
"""get q value.
Arguments:
state: state.
Returns:
q_value.
"""
if state.ndim < 2: state = state[np.newaxis, :]
return self.sess.run(self.v, {self.states: state})
def discount_reward(self, states, rewards, next_observation):
"""Compute target value.
Arguments:
states: state in episode.
rewards: reward in episode.
next_observation: state of last action.
Returns:
targets: q targets.
"""
s = np.vstack([states, next_observation.reshape(-1, 3)])
q_values = self.get_value(s).flatten()
        targets = rewards + self.gamma * q_values[1:]  # one-step TD target: r + gamma * V(s')
targets = targets.reshape(-1, 1)
return targets
    # Note: the handwritten Gaussian log-likelihood below did not work; kept for reference.
# def neglogp(self, mean, std, x):
# """Gaussian likelihood
# """
# return 0.5 * tf.reduce_sum(tf.square((x - mean) / std), axis=-1) \
# + 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
# + tf.reduce_sum(tf.log(std), axis=-1)
def update(self, states, action, dr):
"""update model.
Arguments:
states: states.
action: action of states.
dr: discount reward of action.
"""
self.sess.run(self.update_old_actor)
adv = self.sess.run(self.advantage,
{self.states: states,
self.dr: dr})
# update actor
if self.t == 'ppo1':
# run ppo1 loss
for _ in range(self.A_UPDATE_STEPS):
_, kl = self.sess.run(
[self.atrain_op, self.kl_mean],
{self.states: states,
self.action: action,
self.adv: adv,
self.tflam: self.lam})
if kl < self.kl_target / 1.5:
self.lam /= 2
elif kl > self.kl_target * 1.5:
self.lam *= 2
else:
# run ppo2 loss
for _ in range(self.A_UPDATE_STEPS):
self.sess.run(self.atrain_op,
{self.states: states,
self.action: action,
self.adv: adv})
# update critic
for _ in range(self.C_UPDATE_STEPS):
self.sess.run(self.ctrain_op,
{self.states: states,
self.dr: dr})
def train(self):
"""train method.
"""
tf.reset_default_graph()
history = {'episode': [], 'Episode_reward': []}
for i in range(self.ep):
observation = self.env.reset()
states, actions, rewards = [], [], []
episode_reward = 0
j = 0
while True:
a = self.choose_action(observation)
next_observation, reward, done, _ = self.env.step(a)
states.append(observation)
actions.append(a)
episode_reward += reward
                rewards.append((reward + 8) / 8)    # normalize Pendulum reward from roughly [-16, 0] to roughly [-1, 1]
observation = next_observation
if (j + 1) % self.batch == 0:
states = np.array(states)
actions = np.array(actions)
rewards = np.array(rewards)
d_reward = self.discount_reward(states, rewards, next_observation)
self.update(states, actions, d_reward)
states, actions, rewards = [], [], []
if done:
break
j += 1
history['episode'].append(i)
history['Episode_reward'].append(episode_reward)
print('Episode: {} | Episode reward: {:.2f}'.format(i, episode_reward))
return history
def save_history(self, history, name):
name = os.path.join('history', name)
df = pd.DataFrame.from_dict(history)
df.to_csv(name, index=False, encoding='utf-8')
if __name__ == '__main__':
model = PPO(1000, 32, 'ppo1')
history = model.train()
model.save_history(history, 'ppo1.csv')
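To train the PPO2 variant instead, only the constructor argument changes (the history file name below is chosen just for illustration):

model = PPO(1000, 32, 'ppo2')
history = model.train()
model.save_history(history, 'ppo2.csv')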
Results
We can see that PPO converges successfully, and that PPO2 converges faster than PPO1.
