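Eligibility traces are one of the basic mechanisms of reinforcement learning. In the popular TD(λ) algorithm, for example, the λ refers to the use of an eligibility trace. Almost any temporal-difference method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more efficient method.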

The λ-return

Now we note that a valid update can be done not just toward any n-step return, but toward any average of n-step returns, as long as the weights on the component returns are positive and sum to 1.

The TD(λ) algorithm can be understood as one particular way of averaging n-step updates.
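To make the averaging concrete, the λ-return can be written as follows. This assumes the usual notation in which G_{t:t+n} is the n-step return from time t, T is the time at which the episode terminates, and λ lies in [0, 1]; none of these symbols are defined in the notes above.

$$
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_{t:t+n}
             = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1} G_{t:t+n} + \lambda^{\,T-t-1} G_t
$$

The weights (1-λ)λ^{n-1} are positive and sum to 1, so this is a valid average of n-step returns. With λ = 0 only the one-step return remains, and with λ = 1 only the full Monte Carlo return G_t remains.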

The off-line λ-return algorithm

The λ-return gives us an alternative way of moving smoothly between Monte Carlo and one-step TD methods that can be compared with the n-step TD way of Chapter 7.
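As an illustration of how the λ-return interpolates between the two extremes, here is a minimal Python sketch that computes it for one finished episode. The function names, the list-based episode layout (`rewards[k]` is the reward received on leaving the state at step k, `values[k]` is the current estimate for that state) and the example numbers are assumptions made for this sketch, not something taken from the notes.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from step t: up to n discounted rewards, then bootstrap.

    If t + n reaches the end of the episode there is nothing to bootstrap
    from (the terminal value is 0), so this is the full Monte Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Weighted average of all n-step returns from step t (episodic case)."""
    T = len(rewards)
    g = 0.0
    # Returns that still bootstrap get weight (1 - lam) * lam^(n-1).
    for n in range(1, T - t):
        weight = (1 - lam) * lam ** (n - 1)
        g += weight * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight, lam^(T-t-1), goes to the full return.
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g


# Hypothetical three-step episode.
rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5]
one_step = lambda_return(rewards, values, 0, 0.9, 0.0)     # one-step TD target
monte_carlo = lambda_return(rewards, values, 0, 0.9, 1.0)  # full return
```

Setting `lam` to 0 recovers the one-step TD target and setting it to 1 recovers the Monte Carlo return; values in between blend all the n-step returns.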

TD(λ)

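TD(λ) is one of the oldest and most widely used algorithms in reinforcement learning. It was the first algorithm for which a formal relationship was shown, using eligibility traces, between a more theoretical forward view and a more computationally convenient backward view. Here we will show empirically that it approximates the off-line λ-return algorithm presented in the previous section.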

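TD(λ) improves on the off-line λ-return algorithm in three ways. First, it updates the weight vector on every step of an episode rather than only at the end, so its estimates may become better sooner. Second, its computations are spread evenly over time rather than concentrated at the end of the episode. Third, it can be applied to continuing problems rather than only to episodic problems. In this section we present the semi-gradient version of TD(λ) with function approximation.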
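The per-step update described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes a linear value function (so the value gradient is just the feature vector), an episode given as a list of hypothetical `(x, reward, x_next)` tuples with `x_next` set to `None` at termination, and the standard update in which the trace decays by γλ before the current gradient is added. The names `dot` and `td_lambda_episode` are made up for this example.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td_lambda_episode(w, transitions, alpha, gamma, lam):
    """Semi-gradient TD(lambda) over one episode with linear features.

    Returns a new weight list; the input list is not modified.
    """
    z = [0.0] * len(w)  # eligibility trace starts at zero each episode
    for x, reward, x_next in transitions:
        # Decay the trace by gamma * lam, then add the value gradient,
        # which for a linear value function is the feature vector x.
        z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
        v = dot(w, x)
        v_next = 0.0 if x_next is None else dot(w, x_next)
        delta = reward + gamma * v_next - v  # one-step TD error
        # Every weight moves in proportion to its current trace.
        w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w


# Hypothetical two-state episode with one-hot features and a final reward of 1.
episode = [([1.0, 0.0], 0.0, [0.0, 1.0]),
           ([0.0, 1.0], 1.0, None)]
w_traces = td_lambda_episode([0.0, 0.0], episode, alpha=0.5, gamma=1.0, lam=1.0)
w_no_traces = td_lambda_episode([0.0, 0.0], episode, alpha=0.5, gamma=1.0, lam=0.0)
```

With `lam=1.0` the final reward is credited to both visited states in a single pass, while with `lam=0.0` only the last state is updated, which is the one-step TD behaviour.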

In TD(λ), the eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient