Reinforcement Learning: An Introduction, Markov Decision Processes (MDP) - Personal Notes
Chapter 3 Markov Decision Processes (MDP)
Put simply, in an MDP the agent faces a state, takes an action, and then lands in some next state with a certain probability.
1 state, action
The two most important concepts are state and action; at its simplest, reinforcement learning is about which action to take in each state.
2 p
The four-argument function $p$ characterizes the environment's dynamics:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
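For a finite MDP, $p$ can be stored as a simple table. Below is a minimal sketch using a made-up two-state MDP (the state and action names `s0`, `s1`, `stay`, `go` and all numbers are invented for illustration, not from the book):

```python
# Toy representation of MDP dynamics p(s', r | s, a): a dict mapping each
# (state, action) pair to a list of (next_state, reward, probability) triples.
# This two-state MDP is a made-up example for illustration only.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

# For every (s, a), the probabilities over (s', r) outcomes must sum to 1.
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-9
```

The sum-to-one check mirrors the normalization condition $\sum_{s', r} p(s', r \mid s, a) = 1$ that any valid dynamics function must satisfy.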
3 G

The return $G_t$ is the discounted sum of future rewards:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $0 \le \gamma \le 1$ is the discount rate.
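The discounted return over a finite reward sequence can be computed efficiently by sweeping backwards; a minimal sketch (the function name and sample rewards are my own):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... by a backward sweep,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # -> 1.75
```

The backward sweep avoids recomputing powers of $\gamma$ and is the same recursion that underlies the Bellman equations later in the chapter.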
4 v, q
The value function $v$ for policy $\pi$:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function $q$ for policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
Property: returns at successive time steps are related recursively,

$$G_t = R_{t+1} + \gamma G_{t+1}$$
The Bellman equation for $v_\pi$:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$
5 optimal
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run.
Optimal policies are denoted $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$ and defined as $v_*(s) \doteq \max_\pi v_\pi(s)$.
Optimal policies also share the same optimal action-value function, denoted $q_*$ and defined as $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$.
Two forms of the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_a \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right] = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$
Two forms of the Bellman optimality equation for $q_*$:

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$
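Replacing the expectation over a fixed policy with a max over actions turns policy evaluation into value iteration. A sketch on the same kind of made-up two-state MDP (all names and numbers invented for illustration):

```python
# Value iteration: use the Bellman optimality equation for v_* as an
# update rule until the values converge.
gamma = 0.9

# Toy dynamics: (state, action) -> list of (next_state, reward, probability).
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}
states, actions = ["s0", "s1"], ["stay", "go"]

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        new_v = max(
            sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)
```

In this toy MDP "go" is optimal in both states, so the fixed point solves $v_*(s_0) = 1 + \gamma v_*(s_1)$ and $v_*(s_1) = 2 + \gamma v_*(s_0)$.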
Graphical representation: the backup diagrams for $v_*$ and $q_*$.
Once one has $v_*$, it is relatively easy to determine an optimal policy: any policy that is greedy with respect to $v_*$ after a one-step lookahead is optimal.
Having $q_*$ makes choosing optimal actions even easier: in any state $s$, the agent simply picks an action that maximizes $q_*(s, a)$, with no lookahead and no model of the dynamics needed.
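Acting greedily with respect to $q_*$ is a one-line argmax; a minimal sketch with a made-up $q_*$ table (states, actions, and values are invented for illustration):

```python
# A made-up optimal action-value table q_*(s, a) for a two-state MDP.
q_star = {
    ("s0", "stay"): 1.0, ("s0", "go"): 2.5,
    ("s1", "stay"): 0.3, ("s1", "go"): 4.0,
}

def greedy_action(q, state, actions):
    """Pick the action maximizing q(state, action) -- no lookahead needed."""
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action(q_star, "s0", ["stay", "go"]))  # -> go
print(greedy_action(q_star, "s1", ["stay", "go"]))  # -> go
```

This is the sense in which $q_*$ "caches" the results of the one-step search that $v_*$ would still require.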
6 approximation
Real-world problems are often very large; mapping every state to an action with a table is impractical in both computation and memory, so we need approximate value functions.
When approximating optimal behavior, we put more effort into making good decisions in frequently encountered states, at the expense of rarely encountered ones; this is a key way reinforcement learning differs from other approaches to solving MDPs.