
Reinforcement Learning (2): Sample-based Learning Methods

Category: Articles • 2024-01-06 18:46:10


Chapter 1: Monte Carlo Methods for Prediction & Control
Chapter 2: Temporal Difference Learning Methods for Prediction
Chapter 3: Temporal Difference Learning Methods for Control

Chapter 1: Monte Carlo Methods for Prediction & Control

1.1 What is Monte Carlo?


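Monte Carlo (MC) methods learn directly from episodes of experience, without prior knowledge of the environment (compare the k-armed bandit setting). They learn from complete episodes, so when MC is applied to an MDP every episode must be finite, that is, it must terminate. Updates are made per episode rather than per step.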

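In the k-armed bandit problem, an arm's value is estimated by averaging the rewards observed after pulling it, and those rewards are observed directly. Monte Carlo methods instead average sampled returns, and they update their estimates from those samples without any prior knowledge of the model.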

Recall the definition of the return G.
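
For reference, a sketch of the usual discounted form, assuming a discount factor γ in [0, 1] and an episode that terminates at time T. Both symbols are assumptions here, chosen to match the description of the return as a discounted sum of rewards given in this section.

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T
    = R_{t+1} + \gamma G_{t+1}
```

The recursive form on the second line is what lets an implementation compute every return in one backward pass over a finished episode.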


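To summarize:

First, following policy π, generate an episode S0, A0, R1, S1, … until it terminates. Initialize the return G, then add the return observed for each state to a running sum S. The average of those returns is the value estimate V.

As more episodes arrive, returns keep accumulating in the sum for each state s, and the average gives the value estimate V(s).

Rewards and returns are two different concepts: R usually denotes a reward, while G denotes a return. (The return from a given state in a state-action sequence is the discounted sum of the immediate rewards received from that state onward.)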
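
The procedure summarized above can be sketched as first-visit Monte Carlo prediction. This is a minimal sketch, not the course's code: the five-state random-walk environment, the function names, and the uniform random policy are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def generate_episode(policy, rng, n_states=5):
    """Hypothetical random walk over states 0..n_states-1, starting in the middle.

    Stepping off the right end yields reward 1, off the left end reward 0.
    Returns a list of (S_t, R_{t+1}) pairs.
    """
    state = n_states // 2
    episode = []
    while True:
        action = policy(state, rng)  # -1 (left) or +1 (right)
        next_state = state + action
        if next_state < 0:
            episode.append((state, 0.0))
            return episode
        if next_state >= n_states:
            episode.append((state, 1.0))
            return episode
        episode.append((state, 0.0))
        state = next_state


def first_visit_mc_prediction(policy, num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    returns_sum = defaultdict(float)   # the running sum S for each state
    returns_count = defaultdict(int)   # how many returns were added to S
    for _ in range(num_episodes):
        episode = generate_episode(policy, rng)
        # Record the first time step at which each state appears.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backward so that G = R_{t+1} + gamma * G_{t+1}.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    # V(s) is the average of the returns observed from s.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


def random_policy(state, rng):
    return rng.choice([-1, 1])


V = first_visit_mc_prediction(random_policy, num_episodes=20000)
```

With γ = 1 the true values for this walk are (s + 1) / 6, so the estimates should approach 1/6, 2/6, ..., 5/6 as the number of episodes grows.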

1.2 Using Monte Carlo for Prediction


1.3 Using Monte Carlo for Action Values


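When the model is unknown, estimating action values is more useful than estimating state values.

When the model is known, state values alone are enough to determine a policy, because a one-step lookahead through the model shows which action leads to the best combination of reward and next state.

When the model is unknown, state values alone cannot determine a policy.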
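
The difference can be made concrete with a small sketch. The dictionaries `Q`, `V`, and `model`, the state names, and both helper functions are hypothetical and exist only for this illustration. Picking a greedy action from state values requires the transition model, while picking one from action values does not.

```python
def greedy_from_q(Q, state, actions):
    """Model-free: pick the action with the highest estimated action value."""
    return max(actions, key=lambda a: Q[(state, a)])


def greedy_from_v(V, state, actions, model, gamma=1.0):
    """Model-based: one-step lookahead through p(s', r | s, a).

    model[(s, a)] is an assumed list of (probability, next_state, reward) triples.
    """
    def lookahead(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model[(state, a)])
    return max(actions, key=lookahead)


actions = ["left", "right"]
V = {"A": 0.0, "B": 1.0, "C": 0.0}
model = {
    ("A", "left"): [(1.0, "C", 0.0)],
    ("A", "right"): [(1.0, "B", 0.0)],
}
Q = {("A", "left"): 0.0, ("A", "right"): 1.0}

best_with_model = greedy_from_v(V, "A", actions, model)  # needs `model`
best_from_q = greedy_from_q(Q, "A", actions)             # needs only `Q`
```

Both calls select the same action here, but only the second one works when `model` is not available.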


1.4 Using Monte Carlo methods for generalized policy iteration

GPI (generalized policy iteration) alternates between policy evaluation and policy improvement.

1.5 Solving the Blackjack Example


1.6 Epsilon-soft policies


1.7 Why does off-policy learning matter?


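Off-policy learning is one way to handle the exploration-exploitation trade-off: the agent can keep exploring while still learning about an optimal policy.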

Target policy: small; behavior policy: large.

1.8 Importance Sampling


In other words, the distributions b and π differ by a ratio, the importance sampling ratio.
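
Written out, assuming the standard trajectory-level ratio for an episode that ends at time T. The symbols ρ, T, A_k, and S_k follow the usual textbook notation and are not taken from this page.

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
\mathbb{E}_b\left[ \rho_{t:T-1} \, G_t \mid S_t = s \right] = v_\pi(s)
```

The ratio is only defined when b gives nonzero probability to every action that π might take, which is why the behavior policy has to cover the target policy.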

1.9 Off-Policy Monte Carlo Prediction


1.10 Emma Brunskill: Batch Reinforcement Learning


1.11 Week 1 Summary

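MC
Monte Carlo algorithms are sample-based methods. They can be used when a model is unavailable or hard to write down. Monte Carlo algorithms estimate value functions by averaging over multiple observed returns, and they wait for the full return before updating their values. For this reason, we use Monte Carlo only for episodic MDPs. We discussed how to use Monte Carlo within generalized policy iteration, which led to our first sample-based control algorithm: Monte Carlo with exploring starts. Unlike dynamic programming, Monte Carlo algorithms do not sweep over the state-action space, so they need an exploration mechanism to make sure they learn about every state-action pair.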
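MC with Exploring Starts
We first considered exploring starts. Exploring starts requires the first state and action of each episode to be chosen at random. Using exploring starts is not always feasible or safe: imagine doing exploring starts with a self-driving car. This realization motivated us to study additional exploration methods, and we covered two other strategies for the exploration problem.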
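
A minimal sketch of Monte Carlo control with exploring starts, assuming a hypothetical two-state episodic MDP. The `TRANSITIONS` table, the reward values, and the function name are made up for illustration. No state repeats within an episode here, so first-visit and every-visit updates coincide.

```python
import random
from collections import defaultdict

# Hypothetical MDP: (state, action) -> (next_state, reward).
# A next_state of None means the episode terminates.
TRANSITIONS = {
    (0, 0): (1, 0.0),
    (0, 1): (None, 0.5),
    (1, 0): (None, 0.0),
    (1, 1): (None, 1.0),
}
STATES, ACTIONS = [0, 1], [0, 1]


def mc_exploring_starts(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: ACTIONS[0] for s in STATES}
    for _ in range(num_episodes):
        # Exploring start: any state-action pair can begin an episode.
        state, action = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while state is not None:
            next_state, reward = TRANSITIONS[(state, action)]
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]  # after the start, follow the policy
        # Policy evaluation and improvement, walking backward through the episode.
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # incremental average
            policy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
    return Q, policy


Q, policy = mc_exploring_starts(num_episodes=2000)
```

In this toy problem the greedy policy should settle on action 0 in state 0 and action 1 in state 1, because the random starts guarantee that every pair keeps being tried.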
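MC with epsilon-soft
The two strategies are on-policy learning with epsilon-soft policies and off-policy learning. In the first, the agent follows and learns about a stochastic policy. It usually takes the greedy action, and a small fraction of the time it takes a random action. This ensures that the value estimates for all state-action pairs keep improving over time. This on-policy strategy forces us to learn a near-optimal policy rather than the optimal one. But what if we want to learn an optimal policy and still keep exploring?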
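
A minimal sketch of an epsilon-greedy policy, one common kind of epsilon-soft policy. The function names and the example action values are assumptions for illustration. Every action receives probability at least epsilon / n, and the greedy action receives the remainder.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the action probabilities of an epsilon-greedy policy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # the random share, spread over all actions
    probs[greedy] += 1.0 - epsilon     # the greedy share
    return probs


def sample_action(q_values, epsilon, rng):
    """Draw one action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs([0.2, 1.0, -0.5, 0.1], epsilon=0.1)
action = sample_action([0.2, 1.0, -0.5, 0.1], epsilon=0.1, rng=random.Random(0))
```

Because no action ever has zero probability, the policy never becomes fully greedy, which is why the result is near-optimal rather than optimal.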
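Off-Policy
The answer lies in off-policy learning. We introduced some new definitions for off-policy learning, so let's review them. The behavior policy is the policy the agent uses to select actions, and the target policy is the policy the agent learns about. By following a suitably exploratory behavior policy, the agent can learn any deterministic target policy. One way to learn about one policy while following another is importance sampling, which uses experience sampled under the behavior policy to estimate the expected return under the target policy. The ratio reweights the samples: it increases the importance of returns that are more likely under π and decreases the importance of those that are less likely. The sample average then effectively contains the right proportion of each return, so in expectation it is as if the returns had been sampled under π.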
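
The reweighting can be sketched with ordinary importance sampling on a toy problem. Everything below is an assumption for illustration: the two-step episodes, the reward that equals the chosen action, and the state-independent policies `TARGET` and `BEHAVIOR`.

```python
import random

ACTIONS = [0, 1]
TARGET = {0: 0.1, 1: 0.9}    # pi(a): assumed target policy
BEHAVIOR = {0: 0.5, 1: 0.5}  # b(a): assumed exploratory behavior policy


def run_episode(rng, horizon=2):
    """Follow the behavior policy for `horizon` steps.

    The reward at each step equals the action taken (0 or 1).
    Returns the importance sampling ratio and the undiscounted return.
    """
    rho, G = 1.0, 0.0
    for _ in range(horizon):
        a = rng.choices(ACTIONS, weights=[BEHAVIOR[x] for x in ACTIONS], k=1)[0]
        rho *= TARGET[a] / BEHAVIOR[a]  # product of pi(a) / b(a) over the episode
        G += float(a)
    return rho, G


def off_policy_mc_estimate(num_episodes, seed=0):
    """Ordinary importance sampling: the average of rho * G under b."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        rho, G = run_episode(rng)
        total += rho * G
    return total / num_episodes


estimate = off_policy_mc_estimate(50000)
true_value = 2 * TARGET[1]  # expected return under pi: 0.9 per step, two steps
```

Although every episode is generated by the uniform behavior policy, the weighted average should land close to `true_value`, the expected return under the target policy.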

Chapter 2: Temporal Difference Learning Methods for Prediction

2.1 What is Monte Carlo?

2.2 What is Temporal Difference (TD) learning?

2.3 Rich Sutton: The Importance of TD Learning

2.4 The advantages of temporal difference learning

2.5 Comparing TD and Monte Carlo

2.6 Andy Barto and Rich Sutton: More on the History of RL

2.7 Week 2 Summary

Chapter 3: Temporal Difference Learning Methods for Control

3.1 Sarsa: GPI with TD

3.2 Sarsa in the Windy Grid World

3.3 What is Q-learning?

3.4 Q-learning in the Windy Grid World

3.5 How is Q-learning off-policy?

3.6 Expected Sarsa

3.7 Expected Sarsa in the Cliff World

3.8 Generality of Expected Sarsa

3.9 Week 3 Summary