Chapter 2: Multi-armed Bandits

The k-armed bandit problem (Section 2.1):

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.

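Solutions (Sections 2.2, 2.7, 2.8, 2.9)

Action-value Methods (ε-greedy)

Suppose we have a rough estimate of each action's reward distribution. There are two approaches: always select the action with the highest estimated reward (the greedy method), or select the highest-estimated action most of the time and, with a small probability, explore the other actions (the $ϵ$-greedy method).

A simple way to estimate the value of an action is to average the rewards received from it. This can be computed directly: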

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

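An incremental implementation (Section 2.4) needs only the previous average and the current reward. For a single action, with $Q_n$ denoting the estimate after its first $n-1$ rewards: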

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
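The incremental rule $Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n]$ can be checked against the plain average with a few lines of Python. This is only a minimal sketch; the function name `incremental_mean` and the sample rewards are made up for illustration.

```python
def incremental_mean(rewards):
    """Return the estimate Q after processing each reward in turn."""
    q = 0.0  # Q_1: arbitrary, because the first update uses step size 1/1
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    return q


rewards = [1.0, 0.0, 2.0, 5.0]        # hypothetical rewards for one action
batch = sum(rewards) / len(rewards)   # direct average
online = incremental_mean(rewards)    # incremental average, same value
```

The incremental form keeps only `q` and `n` per action, so memory and per-step computation stay constant no matter how many rewards have been seen.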

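(Section 2.3) compares the greedy method with two $ϵ$-greedy methods experimentally: the larger ε finds the optimal action faster, but in the long run the smaller ε does better. The book also notes that if the bandit task were nonstationary, exploration would be needed to check that no non-greedy action has become better than the greedy one, and nonstationarity is the case most commonly encountered in reinforcement learning.

ε-greedy: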
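A minimal ε-greedy sketch in Python, using sample-average estimates updated incrementally. The Gaussian rewards with unit variance, the function name `run_epsilon_greedy`, and the example arm means are assumptions for illustration, not something specified in these notes.

```python
import random


def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Simulate one epsilon-greedy run on a stationary Gaussian bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # value estimates Q(a)
    n = [0] * k    # selection counts N(a)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore: uniform random arm
        else:
            a = max(range(k), key=lambda i: q[i])  # exploit: ties go to lowest index
        r = rng.gauss(true_means[a], 1.0)          # assumed unit-variance reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                  # incremental sample average
        total += r
    return q, n, total / steps


estimates, counts, avg_reward = run_epsilon_greedy([0.0, 1.0, 2.0], epsilon=0.1, steps=5000)
```

Setting `epsilon=0.0` turns this into the pure greedy method, which makes it easy to compare the two on the same arms.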

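So far an action has been evaluated by averaging its rewards, which suits problems whose reward distributions do not change. Nonstationarity is common in reinforcement learning, however, so (Section 2.5) introduces a weighted average of the rewards with a constant step size $\alpha \in (0, 1]$: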

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$

Expanding the recursion gives

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$

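The coefficients in the last step sum to 1, so the result is a weighted average of $Q_1$ and the rewards received since.

This weighted average gives more recent rewards larger weights, and the weights decay exponentially going back in time. It does not guarantee that the estimate converges, however; instead the estimate keeps varying in response to the most recent rewards.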

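(Section 2.6) covers a trick for initializing $Q_1$: optimistic initial values. If the initial $Q_1$ is set well above the rewards that can realistically be expected, every action is disappointing at first, so even a greedy learner tries all of them, which helps exploration. This works well only on stationary problems, because the drive to explore is temporary; it is less effective on nonstationary problems. Simple exploration techniques like this one are used frequently in later chapters.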
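The effect of an optimistic $Q_1$ is easiest to see with a purely greedy learner. The sketch below assumes deterministic rewards (each arm always pays its fixed value) purely to keep the trace reproducible; the function name `greedy_choices` and the arm values are hypothetical.

```python
def greedy_choices(rewards_per_arm, q_init, alpha, steps):
    """Run a purely greedy learner and return the sequence of chosen arms."""
    q = [q_init] * len(rewards_per_arm)
    chosen = []
    for _ in range(steps):
        a = max(range(len(q)), key=lambda i: q[i])  # greedy; ties go to lowest index
        q[a] += alpha * (rewards_per_arm[a] - q[a])  # constant step-size update
        chosen.append(a)
    return chosen


arms = [0.1, 0.5, 0.9]  # assumed deterministic rewards, best arm is index 2
optimistic = greedy_choices(arms, q_init=5.0, alpha=0.1, steps=3)  # [0, 1, 2]
realistic = greedy_choices(arms, q_init=0.0, alpha=0.1, steps=3)   # [0, 0, 0]
```

With `q_init=5.0` each pull drags the chosen arm's estimate down, so the other arms look better and get tried next. With `q_init=0.0` the first arm's estimate rises above zero after one pull and the learner never leaves it.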

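Upper-Confidence-Bound (UCB)

Greedy method: does not consider that other, unexplored actions might be better.

ε-greedy method: spends part of the time exploring unknown actions, but without any preference among them, even though some actions are more likely to be optimal than others.

To address these problems, (Section 2.7) presents the UCB method, which takes into account both the estimated value of an action and its uncertainty. On the one hand we want actions with a high estimate; on the other hand, an action with a slightly lower estimate may be uncertain enough that it could still turn out to be the best (a dark horse, so to speak). That is,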

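$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $N_t(a)$ is the number of times action $a$ has been selected before time $t$ and $c > 0$ controls the degree of exploration.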
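UCB action selection, $Q_t(a) + c\sqrt{\ln t / N_t(a)}$, can be sketched as a small Python function. Treating arms with $N_t(a) = 0$ as maximizing (so they are tried first) follows the book's convention; the function name `ucb_action` and the default `c=2.0` are assumptions for illustration.

```python
import math


def ucb_action(q, n, t, c=2.0):
    """Return argmax_a of Q(a) + c * sqrt(ln t / N(a)); untried arms go first."""
    for a, count in enumerate(n):
        if count == 0:
            return a  # an arm never tried is treated as a maximizing action
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])


# Arm 1 has a slightly lower estimate but has been tried only once,
# so its uncertainty bonus is large and UCB prefers it.
choice = ucb_action(q=[1.0, 0.9], n=[100, 1], t=101)
```

Setting `c=0.0` removes the uncertainty bonus and the selection reduces to the greedy choice.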

Gradient Bandit Algorithms

The methods above all estimate the reward of each action and then select actions based on those estimates.

(Section 2.8) We consider learning a numerical preference for each action a, which we denote $H_{t}(a)$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward.

Given the preferences $H_{t}(a)$, a softmax gives the probability of selecting each action:

$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$$

After selecting an action $A_{t}$ and receiving the reward $R_{t}$, update $H_{t}$ as follows:

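$$H_{t+1}(A_t) = H_t(A_t) + \alpha \left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$

$$H_{t+1}(a) = H_t(a) - \alpha \left(R_t - \bar{R}_t\right)\pi_t(a) \quad \text{for all } a \neq A_t$$

where $\bar{R}_t$ is the average of the rewards received so far (the baseline) and $\alpha > 0$ is the step size.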

This method can be understood as stochastic gradient ascent. The core idea is:

$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}, \qquad \mathbb{E}[R_t] = \sum_{x} \pi_t(x)\, q_*(x)$$
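The softmax policy and the preference update of the gradient bandit algorithm can be sketched in Python as follows. Both branches of the update are written as one expression, `alpha * (reward - baseline) * (indicator - pi)`, which is algebraically the same rule. The function names, and passing the baseline in as an argument rather than tracking it internally, are choices made for this illustration.

```python
import math


def softmax(h):
    """Convert preferences H(a) into selection probabilities pi(a)."""
    m = max(h)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    return [e / z for e in exps]


def gradient_bandit_update(h, action, reward, baseline, alpha):
    """Return updated preferences after taking `action` and observing `reward`."""
    pi = softmax(h)
    new_h = []
    for a, (pref, p) in enumerate(zip(h, pi)):
        indicator = 1.0 if a == action else 0.0
        # Chosen arm: + alpha * (R - baseline) * (1 - pi); others: - alpha * (R - baseline) * pi
        new_h.append(pref + alpha * (reward - baseline) * (indicator - p))
    return new_h


h = [0.0, 0.0, 0.0]
h_next = gradient_bandit_update(h, action=1, reward=2.0, baseline=1.0, alpha=0.3)
```

Here the reward beats the baseline, so the preference of the chosen arm rises and the other two fall by equal amounts; the sum of the preferences is unchanged by the update.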

Contextual Bandits

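The k-armed bandit problems above all involve a single situation, but reinforcement learning problems generally involve more than one. Next we consider the problem with multiple situations: the contextual bandit problem.

Consider the following problem: a machine shows different colors, and each color corresponds to a different k-armed bandit problem, so the arm to choose differs from color to color. This is the contextual bandit problem. It is somewhat more complex than the k-armed bandit problem discussed earlier, because the colors add distinct situations and a policy has to be learned, i.e. a mapping color -> arm. It is still simpler than the general reinforcement learning problem, because each action affects only the immediate reward and not the next situation.

Summary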
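One simple way to handle the color example is to keep a separate table of action-value estimates per color and run ε-greedy within each. The sketch below assumes Gaussian rewards with unit variance and colors drawn uniformly at random; the function name `run_contextual_bandit` and the example means are hypothetical.

```python
import random


def run_contextual_bandit(means_by_color, epsilon, steps, seed=0):
    """Learn a color -> arm policy with one epsilon-greedy learner per color."""
    rng = random.Random(seed)
    colors = sorted(means_by_color)
    k = len(means_by_color[colors[0]])
    q = {c: [0.0] * k for c in colors}  # separate estimates for each color
    n = {c: [0] * k for c in colors}
    for _ in range(steps):
        color = rng.choice(colors)      # the situation shown to the learner
        if rng.random() < epsilon:
            a = rng.randrange(k)        # explore
        else:
            a = max(range(k), key=lambda i: q[color][i])  # exploit within this color
        r = rng.gauss(means_by_color[color][a], 1.0)      # assumed reward model
        n[color][a] += 1
        q[color][a] += (r - q[color][a]) / n[color][a]
    # The learned policy maps each color to its highest-valued arm.
    return {c: max(range(k), key=lambda i: q[c][i]) for c in colors}


policy = run_contextual_bandit(
    {"red": [0.0, 3.0], "green": [3.0, 0.0]}, epsilon=0.1, steps=4000
)
```

Because the chosen arm never influences which color appears next, each color can be learned independently, which is what separates this setting from the full reinforcement learning problem.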

[Figure: summary of the multi-armed bandit (k-bandits) chapter]
