2017 Fall CS294 Lecture 7: Value Function Methods

回忆 $A^{π} (s_{t}, a_{t})$ 的含义，如果使用下述的 $π^{'} (a_{t} | s_{t})$ 来取代 $a_{t} \sim π (a_{t} | s_{t})$ ，那么由于 $π^{'}$ 是取了max的，那么至少不会比 $π$ 要差。那么算法的流程就如右小角的那个图一样，不断的用 $π^{'}$ 来更新 $π$ ，然后用 $π$ 生成samples，接着继续fit $A^{π}$ ，迭代优化下去。这个思想就是value based methods的主要思想。
2017 Fall CS294 Lecture 7: Value Function Methods

接着，使用以上的思想，介绍了policy iteration的算法：

假设已知了研究对象的dynamics，也就是 $p (s^{'} | s, a)$ ，并且是离散动作，离散状态的系统，那么可以使用动态规划

2017 Fall CS294 Lecture 7: Value Function Methods

然后，由于我们定义 $π^{'}$ 的方式，是一个deterministic policy，所以上图的期望符号都可以去掉了。

最后，使用动态规划求解policy iteration的算法如下：

2017 Fall CS294 Lecture 7: Value Function Methods

进一步简化之后，得到value iteration，下图中之所以第二步可以直接 $V (s) \leftarrow m a x_{a} Q (s, a)$ 是因为等价于上一张ppt的policy evaluation：

2017 Fall CS294 Lecture 7: Value Function Methods

由此看来，value iteration比policy iteration简单！

如果state少的时候，可以使用table来表示 $V (s)$ ，但是如果状态太多了，那么就不行了。解决办法是fit a function approximator！这样就会得到Fitted value iteration：

2017 Fall CS294 Lecture 7: Value Function Methods

可是，观察fitted value iteration的步骤，虽然我们通过拟合value function，可以摒弃离散state的约束，将算法用于连续状态，但是执行第一步的max操作时，我们仍然需要假设知道研究对象的dynamcis才行，而这是一个很苛刻的假设

2017 Fall CS294 Lecture 7: Value Function Methods

解决办法就是fitted Q iteration algorithm，下面的ppt中按理说，由于我们采用的是greedy policy（旨在最大化 $A$ ），那么直觉上感觉 $E [V (s_{i}^{'})]$ 和 $m a x_{a^{'}} Q_{ϕ} (s_{i}^{'}, a_{i}^{'})$ 应该是严格的等号关系。Sergey Levine解释说，这里之所以用的是约等于只因为 $E [V (s_{i}^{'})]$ 本应该用很多样本去估计，但是我们只用了一个sample来估计 $m a x_{a^{'}} Q_{ϕ} (s_{i}^{'}, a_{i}^{'})$ 。

2017 Fall CS294 Lecture 7: Value Function Methods

但是一个缺陷在于，使用non-linear approximator的时候，收敛性无法得到保证（之后会推导出一个更使用的形式！），因为许多tabular情形下的证明在此情形下无法得到满足了。再一个就是，这个算法仍然只适用于discrete action的情况，后面会讲到一些trick，将其拓展到continuous action。

完整形式的fitted Q-iteration algorithm，第一步是collect data（或者说generate samples），这一步使用的collection policy可以是来自于任意policy（不仅限于current policy）。因为，fitted Q-iteration algorithm不对collection policy做任何assumption。

2017 Fall CS294 Lecture 7: Value Function Methods

解释为什么fitted Q-iteration algorithm是一个off-policy algorithm。因为step2/3不要求第一步collect data的policy是current policy，只需要是一个 $(s_{i}, a_{i}, s_{i}^{'}, r_{i})$ 的序列就好了。第二步的 $γ m a x_{a_{i}^{'}} Q_{ϕ} (s_{i}^{'}, a_{i}^{'})$ ，循着前面PPT的思路容易发现，其实是近似的 $V^{π^{'}} (s_{i}^{'})$ 。

2017 Fall CS294 Lecture 7: Value Function Methods

What is fitted Q-iteration optimizing?

2017 Fall CS294 Lecture 7: Value Function Methods

下面的第二个，online Q iteration algorithm就是标准的Q learning algorithm：

2017 Fall CS294 Lecture 7: Value Function Methods

online Q iteration algorithm，既然第一步的behaviour policy（用于generate samples）可以是任意的policy，那么考虑一下使用greedy policy呢… 这样的缺点在于容易陷入局部极值（比如算法认为某一个action好，那么就会一直选择这个action，而不再考虑其他action了），exploration不足。

2017 Fall CS294 Lecture 7: Value Function Methods

exploration的设计对于Q-learning很重要。虽然它是off-policy算法，但是exploration的设计会影响到 $β$ 分布（也就是generated samples的分布），所以对于Q-learning的收敛性等都会有影响。后面还会提到更多的exploration的方法。

简单的总结：

2017 Fall CS294 Lecture 7: Value Function Methods

三分钟的break… break之后将会讲到一些theory

Value function leanring theory：

2017 Fall CS294 Lecture 7: Value Function Methods

上面几张ppt基本没听懂，大概记一下结论吧。值得注意的是，有学生提问为什么policy gradient算法没有这样的converge问题。Sergey Levine回答说，因为PG算法直接是对expected reward进行的梯度上升，所以是会收敛的，而这里的value iteration algorithm没有对任何东西做梯度优化（下面第二个ppt会解释原因）。

What about fitted Q-iteration? It’s exactly same.

2017 Fall CS294 Lecture 7: Value Function Methods

Q-learning中虽然有看似梯度的操作，但是实际上它并没有gradient descent，因为它的target( $y_{i}$ )一直在变。所以本质上它只是在做回归。

2017 Fall CS294 Lecture 7: Value Function Methods

有点令人”伤心”的是，上面的结论（不保证收敛）在batch actor-critic算法中也同样适用！观察actor-critic算法的第二步，同样是一个回归的问题。So, this in general is not guaranteed to converage. 但是几天后我们将会有一些happy conclusions。

2017 Fall CS294 Lecture 7: Value Function Methods

总结本课后半部分

2017 Fall CS294 Lecture 7: Value Function Methods

2017 Fall CS294 Lecture 7: Value Function Methods

相关推荐