Overview:
The prerequisite for understanding the Q-learning algorithm is knowing Markov processes and reward functions. Q-learning replaces the plain reward value with the reward plus the discounted maximum Q-value of the next state (the value(max) term).
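That replacement can be sketched in a single backup step. The tiny 3-state reward table below is illustrative only (it is not the maze used later); the discount factor matches the code that follows:

```python
import numpy as np

GAMMA = 0.8  # discount factor, same value as in the maze code below

# Toy reward table: R[i, j] = reward for moving from state i to state j
R = np.array([[0.,  10.,   0.],
              [0.,   0., 100.],
              [0.,   0.,   0.]])
Q = np.zeros((3, 3))

# One Q-learning backup per transition: the plain reward is replaced by
# reward + GAMMA * max over the next state's Q-values
Q[1, 2] = R[1, 2] + GAMMA * Q[2].max()   # 100 + 0.8 * 0   = 100.0
Q[0, 1] = R[0, 1] + GAMMA * Q[1].max()   # 10  + 0.8 * 100 = 90.0
print(Q[0, 1])  # 90.0
```

Note how the value of moving 0 → 1 is not just its immediate reward 10: it also inherits 80% of the best value reachable from state 1.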
Code:
import numpy as np

GAMMA = 0.8   # discount factor
FINAL = 5     # goal state

# Build a small 6x6 maze: R[i, j] is the reward for moving from state i to state j
R = np.random.randint(1, 100, [6, 6])

# Initialize the Q table (float, so discounted updates are not truncated to int)
Q = np.zeros_like(R, dtype=float)

# Q-table update: one Bellman-style backup, then follow the greedy action
def update_q(i, j):
    try:
        Q[i, j] = R[i, j] + GAMMA * Q[j].max()
        if j == FINAL:
            return
        return update_q(j, Q[j].argmax())
    except RecursionError:  # a random walk can chain very deep; just end this episode
        pass

# Test function: greedily follow the Q table from `node` to the goal
def findway(node):
    if node != FINAL:
        way = int(Q[node].argmax())
        if way in ways:     # guard: stop if the greedy path revisits a state
            return
        ways.append(way)
        return findway(way)

# Train on 600 randomly chosen (state, action) pairs
for _ in range(600):
    update_q(*np.random.randint(0, 6, 2))

ways = []
findway(2)
print(ways)
Test result:
We want to find a path from node 2 to node 5:
[3, 1, 5]
Pretty neat, right?