
Lecture 6: Actor-Critic Algorithms

Category: Articles • 2022-10-04 11:44:03

Improving the policy gradient

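The gradient formula below estimates the return that follows each action from the data of a single trajectory. Because the real situation is complex (both the policy and the environment are stochastic), that single-sample estimate is noisy, so we would rather use an expectation:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})\right)$$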

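So the trailing sum (the reward-to-go) is replaced by the following expectation:

$$Q^\pi(s_t,a_t)=\sum_{t'=t}^{T}E_{\pi_\theta}\left[r(s_{t'},a_{t'})\mid s_t,a_t\right]$$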
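To make the single-trajectory quantity concrete, here is a minimal sketch of the reward-to-go, the sum that the expectation above replaces. The function name `reward_to_go` and the plain-list representation of rewards are assumptions made for illustration, not something from the lecture.

```python
def reward_to_go(rewards):
    """For each step t, return the sum of rewards from t to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


# One sampled trajectory with three steps (illustrative numbers).
print(reward_to_go([1.0, 2.0, 3.0]))
```

Averaging this quantity over many rollouts that start from the same state and action would approach the expectation; a single rollout gives only one sample of it.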

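The baseline is set to the expectation of $Q^\pi(s_t,a_t)$ over actions, $V^\pi(s_t)=E_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[Q^\pi(s_t,a_t)\right]$, which captures the average return from that state. After subtracting the baseline, the gradient becomes:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,A^\pi(s_{i,t},a_{i,t}),\qquad A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$$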

Here $A^\pi(s_t,a_t)$ is the advantage value: it tells us how much better (or worse) this action is than expected.
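As a small illustration of the baseline idea, the sketch below computes the state value as the policy-weighted average of the Q-values in one state with a handful of discrete actions, then subtracts it. The function names and the toy numbers are assumptions for illustration only.

```python
def state_value(q_values, action_probs):
    """V(s): expectation of Q(s, a) under the policy's action probabilities."""
    return sum(p * q for p, q in zip(action_probs, q_values))


def advantages(q_values, action_probs):
    """A(s, a) = Q(s, a) - V(s) for every action in this state."""
    v = state_value(q_values, action_probs)
    return [q - v for q in q_values]


# Two actions, chosen with equal probability (hypothetical values).
q = [1.0, 3.0]
probs = [0.5, 0.5]
print(advantages(q, probs))
```

Under the policy, the advantages average to zero, which is what "better or worse than expected" means here.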

The next question is how to obtain $Q^\pi$, $V^\pi$, and $A^\pi$. The answer is that fitting $V^\pi$ alone is enough. Why?

First expand $Q^\pi(s_t,a_t)$, which gives:

$$Q^\pi(s_t,a_t)=r(s_t,a_t)+\sum_{t'=t+1}^{T}E_{\pi_\theta}\left[r(s_{t'},a_{t'})\mid s_t,a_t\right]$$

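The second term can be expressed through $V^\pi(s_{t+1})$, so the equation becomes:

$$Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1}\mid s_t,a_t)}\left[V^\pi(s_{t+1})\right]\approx r(s_t,a_t)+V^\pi(s_{t+1})$$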

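Intuitively, we use the expectation from the next state onward rather than the expectation over the whole remaining trajectory. Note the approximation sign: the right-hand side is only an approximation, because the expectation over $s_{t+1}$ is replaced by the single next state that was actually observed. This introduces some error, but only for one step. The payoff is that we no longer need to fit a function of both $s_t$ and $a_t$; fitting the function $V^\pi(s)$ is enough. Correspondingly, the advantage becomes:

$$A^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$$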
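The one-step advantage estimate can be sketched as follows. This assumes the value estimates are already available as a list with one more entry than the rewards (the value of the state reached after the last step, taken as 0 if the episode ended there); that layout and the function name are illustrative choices, not from the lecture.

```python
def one_step_advantages(rewards, values):
    """A(s_t, a_t) ~ r_t + V(s_{t+1}) - V(s_t).

    `values` must hold V(s_0), ..., V(s_T), i.e. len(rewards) + 1 entries.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs exactly one more entry than rewards")
    return [rewards[t] + values[t + 1] - values[t] for t in range(len(rewards))]


# Two steps; the final state is terminal, so its value is 0 (toy numbers).
print(one_step_advantages([1.0, 0.0], [2.0, 1.5, 0.0]))
```

Only state values appear on the right-hand side, which is why no separate function of the action has to be fitted.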

The next question is how to fit $V^\pi(s)$. A neural network can be used, for example:

[Figure: a network with parameters $\phi$ that maps a state $s$ to the estimate $\hat V^\pi_\phi(s)$]

Fitting $V^\pi$ is also called policy evaluation, because it only evaluates how good the policy is and does not improve it.

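One thing to watch out for: estimating $V^\pi(s_t)$ properly requires starting from that same state again and running many rollouts to obtain an expectation, but in practice it is hard to start over from a given state. When a DNN is used as the function of the state, a kind of approximation comes into play: similar states produce similar outputs, so if one action turns out well and another turns out badly, the DNN outputs a value in between. (There are also bad cases, for example when the two outcomes are very far apart (a cliff). I am recording the lecturer's remark here and do not fully understand it yet.) The situation is illustrated below: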

[Figure: illustration from the lecture slides; image not preserved in the source]

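How do we train $\hat V^\pi_\phi$? With supervised regression: for each $s_t$, the target value is built from data produced by actual rollouts. Suppose the data has the following form:

$$\left\{\left(s_{i,t},\ y_{i,t}\right)\right\},\qquad y_{i,t}=\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})$$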

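Then the loss function (this is only one possible form; the first term is the network output) can be written as:

$$\mathcal{L}(\phi)=\frac{1}{2}\sum_{i}\left\|\hat V^\pi_\phi(s_i)-y_i\right\|^2$$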
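A minimal sketch of this regression setup, assuming each trajectory is stored as a list of `(state, reward)` pairs and that the value predictions come from some model not shown here. The helper names `monte_carlo_dataset` and `regression_loss` are made up for illustration.

```python
def monte_carlo_dataset(trajectories):
    """Build (state, target) pairs, where the target is the reward-to-go from that state."""
    data = []
    for traj in trajectories:
        running = 0.0
        pairs = []
        # Accumulate rewards from the end of the trajectory backwards.
        for state, reward in reversed(traj):
            running += reward
            pairs.append((state, running))
        data.extend(reversed(pairs))  # restore time order
    return data


def regression_loss(predictions, targets):
    """Half the sum of squared errors between predicted values and targets."""
    return 0.5 * sum((v - y) ** 2 for v, y in zip(predictions, targets))


# One toy trajectory with hypothetical states "a" and "b".
dataset = monte_carlo_dataset([[("a", 1.0), ("b", 2.0)]])
targets = [y for _, y in dataset]
print(dataset, regression_loss([2.5, 2.0], targets))
```

In practice the predictions would come from the value network and the loss would be minimized with gradient descent; that part is omitted here.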

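Note how the target $y_{i,t}$ can be obtained by reusing the previous estimate of the value function (the approximation sign appears again):

$$y_{i,t}=\sum_{t'=t}^{T}E_{\pi_\theta}\left[r(s_{t'},a_{t'})\mid s_{i,t}\right]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat V^\pi_\phi(s_{i,t+1})$$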

The training data can therefore be written as:

$$\left\{\left(s_{i,t},\ r(s_{i,t},a_{i,t})+\hat V^\pi_\phi(s_{i,t+1})\right)\right\}$$

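With an infinite number of time steps, however, the expression above runs into trouble (the values can become infinitely large). How can convergence be guaranteed? The fix is to add a discount factor $\gamma \in [0,1)$:

$$y_{i,t}\approx r(s_{i,t},a_{i,t})+\gamma\,\hat V^\pi_\phi(s_{i,t+1})$$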
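The discounted bootstrapped target can be sketched like this. The `dones` flags (no bootstrapping past the end of an episode) and the default `gamma=0.99` are assumptions added for the example; neither is stated in these notes.

```python
def bootstrapped_targets(rewards, next_values, dones, gamma=0.99):
    """y_t = r_t + gamma * V_hat(s_{t+1}), with no bootstrap term after a terminal step.

    rewards[t]     : reward observed at step t
    next_values[t] : current value estimate for the state reached after step t
    dones[t]       : True if the episode ended at step t (assumed bookkeeping)
    """
    targets = []
    for r, v_next, done in zip(rewards, next_values, dones):
        bootstrap = 0.0 if done else gamma * v_next
        targets.append(r + bootstrap)
    return targets


# Toy numbers; the second step ends the episode, so its next value is ignored.
print(bootstrapped_targets([1.0, 2.0], [10.0, 7.0], [False, True], gamma=0.5))
```

These targets would then replace the Monte Carlo sums in the regression loss, with the value estimates refreshed as the network is retrained.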

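The intuition behind the discount factor is that the further in the future a reward lies, the less it affects the present. It can also be loosely understood as a kind of average of the rewards that follow.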
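A short worked bound shows why discounting removes the blow-up. It assumes rewards are bounded in magnitude by some constant $r_{\max}$ (an assumption introduced here, not stated in the notes) and that $\gamma < 1$; the result is just the geometric series:

$$\left|\sum_{t'=t}^{\infty}\gamma^{\,t'-t}\,r(s_{t'},a_{t'})\right|\le\sum_{k=0}^{\infty}\gamma^{k}\,r_{\max}=\frac{r_{\max}}{1-\gamma}$$

So the discounted return stays finite however long the trajectory runs, whereas with $\gamma = 1$ the same sum can grow without bound.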