Reinforcement Learning: An Introduction, Markov Decision Processes (MDP) - Personal Notes
Chapter 3 Markov Decision Processes (MDP)
Put simply, in an MDP the agent faces a state, takes an action, and then lands in some next state with a certain probability.
1 state, action
The two most important concepts are state and action; at its simplest, reinforcement learning is about which action to take in each state.
2 p
The four-argument function $p$ characterizes the environment's dynamics:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
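For a finite MDP, $p$ can be stored as a simple table. Below is a minimal sketch using a made-up two-state MDP (the state and action names `s0`, `s1`, `stay`, `go` and all numbers are invented for illustration, not from the book):

```python
# Toy representation of MDP dynamics p(s', r | s, a): a dict mapping each
# (state, action) pair to a list of (next_state, reward, probability) triples.
# This two-state MDP is a made-up example for illustration only.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

# For every (s, a), the probabilities over (s', r) outcomes must sum to 1.
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-9
```

The sum-to-one check mirrors the normalization condition $\sum_{s', r} p(s', r \mid s, a) = 1$ that any valid dynamics function must satisfy.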
3 G

The return $G_t$ is the discounted sum of future rewards:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $0 \le \gamma \le 1$ is the discount rate.
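The discounted return over a finite reward sequence can be computed efficiently by sweeping backwards; a minimal sketch (the function name and sample rewards are my own):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... by a backward sweep,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # -> 1.75
```

The backward sweep avoids recomputing powers of $\gamma$ and is the same recursion that underlies the Bellman equations later in the chapter.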
4 v, q
The value function $v$ for policy $\pi$:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function $q$ for policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
Property: returns at successive time steps are related recursively,

$$G_t = R_{t+1} + \gamma G_{t+1}$$
The Bellman equation for $v_\pi$:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$
5 optimal
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run.
Optimal policies are denoted $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$ and defined as $v_*(s) \doteq \max_\pi v_\pi(s)$.
Optimal policies also share the same optimal action-value function, denoted $q_*$ and defined as $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$.
Two forms of the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_a \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right] = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$
Two forms of the Bellman optimality equation for $q_*$:

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$
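Replacing the expectation over a fixed policy with a max over actions turns policy evaluation into value iteration. A sketch on the same kind of made-up two-state MDP (all names and numbers invented for illustration):

```python
# Value iteration: use the Bellman optimality equation for v_* as an
# update rule until the values converge.
gamma = 0.9

# Toy dynamics: (state, action) -> list of (next_state, reward, probability).
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}
states, actions = ["s0", "s1"], ["stay", "go"]

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        new_v = max(
            sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)
```

In this toy MDP "go" is optimal in both states, so the fixed point solves $v_*(s_0) = 1 + \gamma v_*(s_1)$ and $v_*(s_1) = 2 + \gamma v_*(s_0)$.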
Graphical representation: the backup diagrams for $v_*$ and $q_*$.
Once one has $v_*$, it is relatively easy to determine an optimal policy: any policy that is greedy with respect to $v_*$ after a one-step lookahead is optimal.
Having $q_*$ makes choosing optimal actions even easier: in any state $s$, the agent simply picks an action that maximizes $q_*(s, a)$, with no lookahead and no model of the dynamics needed.
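Acting greedily with respect to $q_*$ is a one-line argmax; a minimal sketch with a made-up $q_*$ table (states, actions, and values are invented for illustration):

```python
# A made-up optimal action-value table q_*(s, a) for a two-state MDP.
q_star = {
    ("s0", "stay"): 1.0, ("s0", "go"): 2.5,
    ("s1", "stay"): 0.3, ("s1", "go"): 4.0,
}

def greedy_action(q, state, actions):
    """Pick the action maximizing q(state, action) -- no lookahead needed."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action(q_star, "s0", ["stay", "go"]))  # -> go
print(greedy_action(q_star, "s1", ["stay", "go"]))  # -> go
```

This is the sense in which $q_*$ "caches" the results of the one-step search that $v_*$ would still require.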
6 approximation
Real-world problems are often very large; mapping every state to an action with a table is impractical in both computation and memory, so we need approximate value functions.
When approximating optimal behavior, we put more effort into making good decisions in frequently encountered states, at the expense of rarely encountered ones; this is a key way reinforcement learning differs from other approaches to solving MDPs.