OpenAI Gym
While setting up some experiments, I found articles that use OpenAI Gym to control small games, mainly as a way to study RL algorithms, and over time the Gym examples have become standard test cases. So this blog briefly walks through Gym's architecture, shows how to install and use OpenAI Gym, and finishes with a simple control example.
- https://gym.openai.com/docs/ the official English documentation
- http://c.biancheng.net/view/1972.html a Chinese blog post.
0. A brief introduction to gym [this part is simple; please read it through]
- 1.Environments & interface
The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
The environment is gym's central concept: one env is one scenario, one test case. Gym reportedly has 700+ envs (859 by my own count below). So how is the shared interface defined? Consider the simple cart-pole example:
import gym
env = gym.make('CartPole-v0')   # 1. construct the env by name
env.reset()                     # 2. initialize the env
for _ in range(1000):
    env.render()                # 3. render the scene
    env.step(env.action_space.sample())   # 4. take a random action
env.close()
For the resulting animation, see http://s3-us-west-2.amazonaws.com/rl-gym-doc/cartpole-no-reset.mp4
As the example shows, the standard interface consists of four calls: (a) gym.make(name) constructs the env by name; (b) env.reset() initializes it; (c) env.render() draws the scene; (d) env.step(action) advances the simulation one step.
- 2.env_name
gym.make(env_name)
gym has many envs, so how do you pick one? The official site explains this. The env names actually live in https://github.com/openai/gym/blob/master/gym/envs/__init__.py; click through to see every env_name. Of course you can also list all available envs programmatically:
from gym import envs
print(envs.registry.all())
print(len(envs.registry.all()))  # 859
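The registry can also be filtered in code. A minimal sketch (each entry returned by envs.registry.all() is an EnvSpec, whose .id attribute is the name you pass to gym.make):

from gym import envs

# Print only the CartPole variants among all registered environments.
for spec in envs.registry.all():
    if spec.id.startswith('CartPole'):
        print(spec.id)   # e.g. CartPole-v0, CartPole-v1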
Below are the env_names I have used, together with their registration entries:
# Classic
# ----------------------------------------
register(
    id='CartPole-v0',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=200,
    reward_threshold=195.0,
)
register(
    id='CartPole-v1',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=500,
    reward_threshold=475.0,
)
register(
    id='Pendulum-v0',
    entry_point='gym.envs.classic_control:PendulumEnv',
    max_episode_steps=200,
)
register(
    id='Acrobot-v1',
    entry_point='gym.envs.classic_control:AcrobotEnv',
    reward_threshold=-100.0,
    max_episode_steps=500,
)
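You can inspect these registration parameters without constructing the env. A small sketch using gym.spec, which returns the registered EnvSpec (the attribute names below match the register() keywords above):

import gym

spec = gym.spec('CartPole-v0')
print(spec.max_episode_steps)   # 200: episode is cut off after this many steps
print(spec.reward_threshold)    # 195.0: score regarded as "solved"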
- 3.Observations
Once the environment (env) is built, observing it is the first step of any control or RL scheme. The observation is returned by the step function:
observation, reward, done, info = env.step(env.action_space.sample())
- observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
Python code that makes use of the observation looks like this:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()   # reset returns the initial observation
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)   # the four step() return values
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
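Since the stated goal is always to increase total reward, a natural next step is to accumulate the reward over an episode. A minimal sketch with a random policy, using only the step() interface shown above:

import gym

env = gym.make('CartPole-v0')
for i_episode in range(5):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()                # random policy
        observation, reward, done, info = env.step(action)
        total_reward += reward                            # episode return
    print("Episode {}: return = {}".format(i_episode, total_reward))
env.close()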
- 4.Action space
The space of control inputs is of course determined by the simulated system; in other words, the env encodes the type and range of the control variables. In RL, a policy picks an action a from the action space A. First look at the spaces, both the action space and the observation space:
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)   # discrete type {0, 1}; discrete actions are numbered from 0
print(env.observation_space)
#> Box(4,)       # continuous; each of the 4 values lies in its own closed interval, see the bounds below
print(env.observation_space.high)
#> array([ 2.4 , inf, 0.20943951, inf])
print(env.observation_space.low)
#> array([-2.4 , -inf, -0.20943951, -inf])
Looking back now at the step call in the earlier code, it simply samples a random action from the action space:
action = env.action_space.sample()   # aha: just pick a random control a from A
observation, reward, done, info = env.step(action)
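Spaces can also be queried and validated programmatically. A small sketch; the Pendulum-v0 part assumes its action space is a 1-dimensional Box of torques, which is what I would expect from the classic-control registration above:

import gym
from gym import spaces

env = gym.make('CartPole-v0')
print(env.action_space.n)             # 2: number of discrete actions
print(env.action_space.contains(0))   # True: 0 is a valid action
print(env.action_space.contains(2))   # False: only {0, 1} are valid

# Continuous action space example (Pendulum is controlled by a torque).
penv = gym.make('Pendulum-v0')
print(isinstance(penv.action_space, spaces.Box))    # True
print(penv.action_space.low, penv.action_space.high)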
- 5.Render
Render means drawing the env's state. Depending on who consumes the output (a human or a machine), there are two main cases. The render function's definition makes this clear; it takes a mode parameter that is often worth setting explicitly:
def render(self, mode='human'):
    """Renders the environment.

    The set of supported modes varies per environment. (And some
    environments do not support rendering at all.) By convention,
    if mode is:
    - human: render to the current display or terminal and
      return nothing. Usually for human consumption.
    - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
      representing RGB values for an x-by-y pixel image, suitable
      for turning into a video.
    - ansi: Return a string (str) or StringIO.StringIO containing a
      terminal-style text representation. The text can include newlines
      and ANSI escape sequences (e.g. for colors).

    Note:
        Make sure that your class's metadata 'render.modes' key includes
        the list of supported modes. It's recommended to call super()
        in implementations to use the functionality of this method.

    Args:
        mode (str): the mode to render with
    """
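With mode='rgb_array', the frames can be collected programmatically, e.g. for assembling a video. A small sketch:

import gym

env = gym.make('CartPole-v0')
env.reset()
frames = []
for _ in range(50):
    frames.append(env.render(mode='rgb_array'))   # (height, width, 3) RGB array
    env.step(env.action_space.sample())
env.close()
print(len(frames), frames[0].shape)   # 50 frames, each an RGB image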
1. Installation
- 1.Via pip
pip install gym
- 2.From source
git clone https://github.com/openai/gym
cd gym
pip install -e .
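Some environment families need extra native dependencies, which gym exposes as pip extras (the exact extras names may vary across gym versions):

pip install 'gym[atari]'    # Atari environments
pip install 'gym[box2d]'    # Box2D environments such as LunarLander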
2. Usage
import gym
# For the rest, the clever reader has surely got it by now.
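Still, as a slightly fuller usage example, here is a tiny hand-coded CartPole controller. It is only a sketch and assumes observation[2] is the pole angle in radians, which is consistent with the ±0.20943951 bounds printed earlier:

import gym

env = gym.make('CartPole-v0')
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    # Push the cart toward the side the pole is leaning to
    # (assumes observation[2] is the pole angle).
    action = 1 if observation[2] > 0 else 0
    observation, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("Return:", total_reward)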
3. Refs
- https://gym.openai.com/docs/
- https://gym.openai.com/
- https://github.com/openai/gym
- http://c.biancheng.net/view/1972.html
If you found this post even slightly useful and not a waste of your time, please give it a like so more people can find it.