1. Introduction

1.1 The Exploration-Exploitation Dilemma

Online decision-making involves a fundamental choice:
Exploitation: Make the best decision given current information
Exploration: Gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions

1.2 Everyday Examples

Restaurant Selection
Exploitation: Go to your favorite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move

1.3 Five Strategy Principles

Naive Exploration
Add noise to the greedy policy (e.g. $\epsilon$-greedy)
Optimistic Initialization
Assume the best until proven otherwise
Optimism in the Face of Uncertainty
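
As a rough illustration of naive exploration, the following is a minimal sketch of $\epsilon$-greedy action selection. The function name `epsilon_greedy` and the list `q_values` (current value estimates, one per action) are assumptions made for this example and do not come from the notes above.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick a greedy action, breaking ties at random.
    best = max(q_values)
    greedy_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(greedy_actions)
```

With `epsilon = 0` this reduces to the purely greedy policy. Starting `q_values` at deliberately high values is one simple way to combine it with optimistic initialization.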

2. Introducing the Multi-Armed Bandit

A row of slot machines in Las Vegas

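Wikipedia explains it as follows:
The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different one. In the problem, each machine provides a random reward from a probability distribution specific to that machine. The gambler's objective is to maximize the sum of rewards earned through a sequence of lever pulls.^[3]^[4] The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.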

2.1 Maximizing Cumulative Reward and Minimizing Total Regret

Action space and reward distribution
At time step $t$, the agent takes an action $\alpha_t \in \mathcal{A}$, and the environment generates the corresponding reward from an unknown distribution $\mathcal{R}^{\alpha}(r)=\mathbb{P}[r \mid \alpha]$, that is, $r_t \sim \mathcal{R}^{\alpha_t}=\mathbb{P}[r \mid \alpha_t]$. The action space and reward distribution can be written as the tuple $\langle \mathcal{A}, \mathcal{R} \rangle$, and each concrete observation as $\langle \alpha_t, r_t \rangle$.
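
To make the tuple $\langle \mathcal{A}, \mathcal{R} \rangle$ concrete, here is a minimal sketch of a bandit environment. It assumes Bernoulli rewards (each arm pays 1 with a fixed probability, otherwise 0), which is only one possible choice of $\mathcal{R}$. The class name `BernoulliBandit` and its attributes are hypothetical names chosen for this example.

```python
import random

class BernoulliBandit:
    """Bandit whose arm a pays reward 1 with probability success_probs[a]."""

    def __init__(self, success_probs, seed=None):
        # The probabilities play the role of the distribution R,
        # which the agent does not get to see.
        self.success_probs = list(success_probs)
        self.rng = random.Random(seed)

    @property
    def n_actions(self):
        # Size of the action space A.
        return len(self.success_probs)

    def pull(self, action):
        # Sample r_t from R^{a_t}.
        return 1.0 if self.rng.random() < self.success_probs[action] else 0.0
```

The agent only ever observes the pairs `(action, reward)` returned through `pull`, never `success_probs` directly.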
Maximizing cumulative reward
$\max \sum_{\tau=1}^{t}{r_\tau}$
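
The sum above can be sketched in code as the total reward collected over $t$ steps. This is a hedged illustration that again assumes Bernoulli rewards. `cumulative_reward`, `success_probs`, and `choose_action` are hypothetical names introduced for this example.

```python
import random

def cumulative_reward(success_probs, choose_action, steps, seed=0):
    """Sum the rewards r_1 + ... + r_steps collected by a given policy."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(1, steps + 1):
        # The policy chooses an arm; it may use the step index and the RNG.
        action = choose_action(t, rng)
        # Assumed Bernoulli reward: 1 with the arm's success probability.
        reward = 1.0 if rng.random() < success_probs[action] else 0.0
        total += reward
    return total

# Example: a uniformly random policy over three arms.
random_policy = lambda t, rng: rng.randrange(3)
total = cumulative_reward([0.2, 0.5, 0.8], random_policy, steps=1000)
```

Comparing `total` across different policies on the same `success_probs` gives a simple empirical way to see which one accumulates more reward.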