1. Q-Learning concept

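Q-learning is an off-policy temporal-difference (TD) method.
Here, off-policy means that the policy the agent uses to select actions (the $\epsilon$-greedy policy on line 5 of the pseudocode)
is not the same as the policy being evaluated (line 6 of the pseudocode).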
Deep Reinforcement Learning, Chapters 6-8: Q-Learning
A temporal-difference method updates the current state-action value function Q toward the TD target, which is $r_t + \gamma \max_a Q(S_{t+1}, a)$.
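The two ideas above can be sketched in a few lines: an $\epsilon$-greedy behavior policy picks the action, while the update bootstraps from the greedy value $\max_a Q(S_{t+1}, a)$. This is a minimal tabular sketch, not the pseudocode the text refers to. The learning rate `alpha`, the `done` flag, the state names and the action names are assumptions introduced for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def td_target(Q, reward, next_state, actions, gamma, done=False):
    """TD target r_t + gamma * max_a Q(S_{t+1}, a).

    Assumption: at the end of an episode there is nothing to bootstrap from,
    so the target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Move Q(state, action) a step of size alpha toward the TD target."""
    target = td_target(Q, reward, next_state, actions, gamma, done)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


# Hypothetical usage: unseen (state, action) pairs start at 0.
Q = defaultdict(float)
actions = ["stay", "up", "down"]
Q[("s1", "up")] = 2.0
new_value = q_learning_update(Q, "s0", "up", reward=1.0, next_state="s1",
                              actions=actions, alpha=0.5, gamma=0.9)
```

Note that `epsilon_greedy` is used only to choose what to do next; the target inside `q_learning_update` always takes the maximum over actions, which is what makes the method off-policy.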

What Q-learning learns is a policy evaluation function: a function that evaluates the policy the agent uses to select actions.

2. Q-Learning Function

2.1 State Value Function Estimation $V^{\pi}(s)$


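Given a state $s$, and assuming that the actor interacting with the environment from then on is $\pi$, the state value function $V^{\pi}(s)$ is the expected cumulative reward collected until the interaction ends.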

The input of $V^{\pi}(s)$ is a state $s$ and the output is a scalar: the expected cumulative reward until the end of the interaction, assuming the actor from then on is $\pi$.

Approximation with a Monte-Carlo (MC) based method

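Suppose the parameters of the neural network are the weights $\theta$ of each layer, and let $G$ denote the cumulative reward. When the input is state $s_a$, the correct output is $G_a$, the cumulative reward actually observed from $s_a$ until the end of the episode, so training is a regression of the network output toward $G_a$. The state-action value function $Q(S_{t+1}, a)$ is parameterized by the same kind of network.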
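To make the MC regression target concrete, here is a minimal sketch that computes $G_a$ for every state visited in one episode and nudges a value estimate toward it. It swaps the neural network for a plain lookup table to stay short; the function names, the step size `alpha`, the default `gamma=1.0` (no discounting) and the example episode are all assumptions for illustration.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Return G_t for every time step, using G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_regression_pairs(states, rewards, gamma=1.0):
    """Pair each visited state s_a with its regression target G_a."""
    return list(zip(states, returns_from_rewards(rewards, gamma)))


def mc_update(V, pairs, alpha=0.1):
    """Move each table entry V[s_a] a step of size alpha toward its target G_a."""
    for state, target in pairs:
        current = V.get(state, 0.0)
        V[state] = current + alpha * (target - current)
    return V


# Hypothetical three-step episode.
pairs = mc_regression_pairs(["s_a", "s_b", "s_c"], [1.0, 0.0, 2.0])
V = mc_update({}, pairs, alpha=0.5)
```

The targets only become available once the episode has finished, because each $G_a$ depends on every reward that follows $s_a$.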
Approximation with a temporal-difference (TD) based method

Given a state $s_t$, taking action $a_t$ yields reward $r_t$ and moves to state $s_{t+1}$; TD can then be applied:

$$V^{\pi}(s_t) = V^{\pi}(s_{t+1}) + r_t$$

Feeding $s_t$ into the network gives $V^{\pi}(s_t)$, and feeding $s_{t+1}$ into the network gives $V^{\pi}(s_{t+1})$. Training updates the parameters of $V^{\pi}(s)$ so that the difference $V^{\pi}(s_t) - V^{\pi}(s_{t+1})$ gets as close as possible to $r_t$; once it does, $V^{\pi}(s)$ has been learned.
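The training signal described above can be sketched as follows. The sketch uses a lookup table instead of a network and follows the undiscounted relation $V^{\pi}(s_t) - V^{\pi}(s_{t+1}) \approx r_t$ from the text. The step size `alpha`, the `terminal` flag (value 0 after the episode ends) and the two-state chain are assumptions made for illustration.

```python
def td_error(V, s_t, r_t, s_next, terminal=False):
    """How far V(s_t) - V(s_next) currently is from the observed reward r_t."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    return r_t - (V.get(s_t, 0.0) - v_next)


def td_update(V, s_t, r_t, s_next, alpha=0.1, terminal=False):
    """Adjust V(s_t) so that V(s_t) - V(s_next) moves toward r_t."""
    delta = td_error(V, s_t, r_t, s_next, terminal)
    V[s_t] = V.get(s_t, 0.0) + alpha * delta
    return delta


# Hypothetical chain: s0 --(r=1)--> s1 --(r=2)--> end of episode.
V = {}
for _ in range(200):
    td_update(V, "s0", 1.0, "s1", alpha=0.5)
    td_update(V, "s1", 2.0, None, alpha=0.5, terminal=True)
```

Unlike the MC approach, each update here needs only a single transition $(s_t, r_t, s_{t+1})$, so learning can proceed before the episode is over.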

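Because the MC target $G_a$ sums the rewards of many random steps, the MC method is affected by randomness more than the TD method, whose target involves only the single reward $r_t$. TD is therefore the more commonly used of the two.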

2.2 State-action Value Function $Q^{\pi}(s, a)$

The input of $Q^{\pi}(s, a)$ is a state $s$ and an action $a$, and the output is the expected accumulated reward.

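There are two ways to write the Q-function:

the input is a state and an action, and the output is a single scalar;
the input is a state $s$ only, and the output is several values, one per action (this form requires a discrete set of actions).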
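The two forms can be sketched side by side. This is only an illustration: the table `Q_TABLE`, its numbers, the state name `near_edge` and the action names are made up, and a real Q-function would usually be a neural network.

```python
from typing import List

ACTIONS = ["stay", "up", "down"]  # assumed discrete action set

# Made-up values for one hypothetical state.
Q_TABLE = {
    ("near_edge", "stay"): -1.0,
    ("near_edge", "up"): 1.0,
    ("near_edge", "down"): -1.0,
}


def q_scalar(state: str, action: str) -> float:
    """Form 1: input is (state, action), output is one scalar."""
    return Q_TABLE.get((state, action), 0.0)


def q_vector(state: str) -> List[float]:
    """Form 2: input is a state, output is one value per action in ACTIONS."""
    return [q_scalar(state, a) for a in ACTIONS]


def greedy_action(state: str) -> str:
    """Pick the action with the largest Q-value using the vector form."""
    values = q_vector(state)
    return ACTIONS[values.index(max(values))]


best = greedy_action("near_edge")
```

The vector form makes the greedy choice a single lookup followed by an argmax, which is one reason it is convenient when the actions can be enumerated.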

For example:

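Suppose there are 3 actions: stay still, move up, and move down.

In the first state, whichever action is taken, the expected reward by the end of the game is about the same. With the ball where it is, even if you move down you still have time to recover and save it, so the choice of action makes little difference.

In the second state, the ball has bounced very close to the edge. Only moving up earns a positive reward, because only then do you catch the ball. If you stay still or move down, you will miss the ball and the reward will be negative.

In the third state, the ball is very close, so you must move up.

In the fourth state, the ball has been bounced back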
