About the pseudocode for Algorithms 5.2 and 5.3 #21

Libertax-coder · 2025-04-20T07:07:21Z

Libertax-coder
Apr 20, 2025

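I'd like to ask whether the Policy improvement step in these two pseudocode listings should be placed outside the second for loop.
The second for loop iterates over one complete episode, so updating the policy inside the loop does not seem to accomplish anything. Shouldn't Policy improvement be performed once, after the whole episode has been traversed?
As the book says: "Then, the policy can be improved in an episode-by-episode fashion."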

MathFoundationRL · 2025-04-20T08:37:34Z

MathFoundationRL
Apr 20, 2025
Maintainer

If the policy is not used within an episode, the step can be moved outside the loop.
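
The two placements can be compared with a minimal Python sketch. This is not the book's code: the function names, the toy episode, the tuple layout `(s_t, a_t, r_{t+1})`, the deterministic tie-breaking, and the choice to start unvisited q-values at 0 are all assumptions made for illustration. The episode is generated before the backward loop starts, so the policy is never consulted inside the loop, and q(s, ·) changes only at steps where s is visited. Improving once after the loop should therefore give the same policy as improving at every step.

```python
from collections import defaultdict


def greedy_action(q, s, actions):
    # Deterministic tie-breaking: the first maximal action in `actions` wins.
    return max(actions, key=lambda a: q[(s, a)])


def process_episode(episode, actions, gamma=0.9, improve_inside=True):
    """Process one already-generated episode with a backward sweep.

    episode: list of (s_t, a_t, r_{t+1}) tuples (hypothetical layout).
    Returns the policy as a dict mapping each visited state to its greedy action.
    """
    returns = defaultdict(float)  # sum of returns per (s, a)
    num = defaultdict(int)        # visit count per (s, a)
    q = defaultdict(float)        # unvisited pairs default to 0 (assumption)
    policy = {}
    g = 0.0
    for s, a, r in reversed(episode):
        g = gamma * g + r
        returns[(s, a)] += g
        num[(s, a)] += 1
        q[(s, a)] = returns[(s, a)] / num[(s, a)]  # policy evaluation
        if improve_inside:
            policy[s] = greedy_action(q, s, actions)  # improvement at every step
    if not improve_inside:
        # Improvement once, after the whole episode has been swept.
        for s in {s for s, _, _ in episode}:
            policy[s] = greedy_action(q, s, actions)
    return policy


actions = ["right", "down"]
episode = [
    ("s1", "right", 0.0),
    ("s2", "down", -1.0),
    ("s1", "down", 0.0),
    ("s3", "right", 1.0),
]
inside = process_episode(episode, actions, improve_inside=True)
outside = process_episode(episode, actions, improve_inside=False)
print(inside == outside)
```

In the full algorithm the return sums and counts persist across episodes; the sketch handles a single episode only, but the same reasoning applies because each q(s, ·) is last modified at the final backward step that visits s.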

1 reply

Libertax-coder Apr 21, 2025
Author

Thank you, professor.

Lily-sh · 2026-03-26T10:52:08Z

Lily-sh
Mar 26, 2026

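I have the same confusion. Is performing policy improvement inside the second for loop more efficient than performing it outside the loop (that is, after each episode has been fully traversed)? Or are both acceptable, with the difference being that the accuracy of policy evaluation differs and therefore the convergence speed differs?

Also, there is one thing about the pseudocode of 5.2 and 5.3 that I can't figure out: in the first assignment inside the second for loop, should the subscript of r be t-1, given that the loop runs from the end of the episode backward?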
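
On the index question, a small sketch may help. It assumes the common convention that an episode is written s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T, so that r_{t+1} is the reward received after taking a_t in s_t; this convention is an assumption here and should be checked against the book's notation. Under it, the backward recursion g ← γ g + r_{t+1} for t = T-1, ..., 0 gives the discounted return from step t, which the code checks against the direct sum. The function names and the reward values are hypothetical.

```python
def returns_backward(rewards, gamma):
    """rewards[t] holds r_{t+1}, the reward received after taking a_t in s_t."""
    T = len(rewards)
    g = 0.0
    out = [0.0] * T
    for t in range(T - 1, -1, -1):    # t = T-1, T-2, ..., 0
        g = gamma * g + rewards[t]    # g <- gamma * g + r_{t+1}
        out[t] = g                    # discounted return starting from step t
    return out


def returns_direct(rewards, gamma):
    """Discounted return from each step t, computed from the definition."""
    T = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        for t in range(T)
    ]


rewards = [0.0, -1.0, 0.0, 1.0]
backward = returns_backward(rewards, 0.9)
direct = returns_direct(rewards, 0.9)
print(all(abs(b - d) < 1e-12 for b, d in zip(backward, direct)))
```

With this convention, the reward paired with (s_t, a_t) in the backward sweep carries the subscript t+1, regardless of the direction in which the loop runs.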

0 replies
