About the pseudocode of Algorithms 5.2 and 5.3 #21
Libertax-coder
started this conversation in
General
Replies: 2 comments 1 reply
If the policy is not used within an episode, the policy improvement step can be moved outside the loop.
1 reply
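I have the same confusion. Is doing policy improvement inside the second for loop more efficient than doing it outside the second for loop (that is, after the whole episode has been traversed)? Both seem to be valid, but does the accuracy of policy evaluation differ between them, leading to different convergence speeds? Also, there is one thing about the pseudocode of 5.2 and 5.3 that I have not figured out: in the first assignment inside the second-level for loop, should the subscript of r be t-1, since the loop runs from back to front?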
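On the subscript question, a small sketch may help. It assumes the common convention that r_{t+1} is the reward received after taking a_t in s_t; this is an assumption here, since the pseudocode itself is not quoted in this thread. Under that convention, the backward recursion g ← γg + r_{t+1} for t = T-1, ..., 0 reproduces the discounted return from each step. The function names and the sample rewards are made up for illustration.

```python
def returns_backward(rewards, gamma):
    """Backward recursion g <- gamma * g + r_{t+1}.

    Assumption: rewards[t] stores r_{t+1}, the reward that follows (s_t, a_t).
    """
    g = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):  # t = T-1, T-2, ..., 0
        g = gamma * g + rewards[t]
        out[t] = g
    return out


def returns_forward(rewards, gamma):
    """Direct definition: G_t = r_{t+1} + gamma * r_{t+2} + ..."""
    T = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        for t in range(T)
    ]


# Hypothetical 4-step episode, rewards r_1..r_4.
rewards = [0.0, -1.0, 0.0, 1.0]
gamma = 0.5

backward = returns_backward(rewards, gamma)
forward = returns_forward(rewards, gamma)
print(backward)
print(forward)
```

Both lists agree, so with this indexing the reward paired with step t in the backward pass is r_{t+1}, not r_{t-1}.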
0 replies
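I would like to ask whether the policy improvement step in these two pseudocode listings should be placed outside the second for loop.
The second for loop traverses one complete episode, so updating the policy inside the loop seems to have no effect. Shouldn't policy improvement be performed once, after the whole episode has been traversed?
As the book says: "Then, the policy can be improved in an episode-by-episode fashion."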
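To make the question concrete, here is a minimal sketch that compares the two placements on a single, already generated episode. It is not the book's code: the action set, the state names, the episode, and the every-visit update starting from empty tables are all assumptions chosen for illustration. Since the inner loop never consults the policy, both placements should end the episode with the same greedy policy.

```python
from collections import defaultdict

ACTIONS = (0, 1)  # hypothetical action set


def greedy(q, s):
    # Deterministic tie-breaking: the first maximizing action wins.
    return max(ACTIONS, key=lambda a: q[(s, a)])


def process_episode(episode, gamma, improve_inside):
    """episode: list of (s_t, a_t, r_{t+1}); every-visit MC update."""
    returns = defaultdict(float)
    num = defaultdict(int)
    q = defaultdict(float)
    policy = {}
    g = 0.0
    for s, a, r in reversed(episode):  # t = T-1, ..., 0
        g = gamma * g + r
        returns[(s, a)] += g
        num[(s, a)] += 1
        q[(s, a)] = returns[(s, a)] / num[(s, a)]  # policy evaluation
        if improve_inside:
            policy[s] = greedy(q, s)  # improvement inside the loop
    if not improve_inside:
        for s in {s for s, _, _ in episode}:
            policy[s] = greedy(q, s)  # improvement once, after the loop
    return policy


# Hypothetical episode that visits each state twice.
episode = [("s1", 0, 0.0), ("s2", 1, -1.0), ("s1", 1, 1.0), ("s2", 0, 0.0)]
inside = process_episode(episode, gamma=0.9, improve_inside=True)
outside = process_episode(episode, gamma=0.9, improve_inside=False)
print(inside, outside)
```

In this sketch the only difference is how often the argmax is computed: once per step inside the loop versus once per visited state after it.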