DRL — Policy Based Methods — Chapter 3-3 Policy Gradient Methods
3.3.1 What are Policy Gradient Methods?
Policy-based methods are a class of algorithms that search directly for the optimal policy without simultaneously maintaining value function estimates.
Policy-based methods estimate the weights of an optimal policy through gradient ascent.
3.3.2 The Big Picture
The key difference between policy gradient methods and supervised learning: both adjust network weights using gradients, but in supervised learning every input comes with a known correct label, whereas a policy gradient method has no labels and must instead use the reward signal to decide which actions to reinforce.
3.3.4 Problem Setup
A trajectory is just a state-action sequence. It can correspond to a full episode or to just a small part of an episode. We denote a trajectory with the Greek letter τ. The cumulative reward from that trajectory is then written as R(τ).
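As a concrete illustration (the representation and function name here are my own, not from the text), a trajectory can be stored as a list of (state, action, reward) tuples, and R(τ) is simply the sum of the rewards along it:

```python
# Hypothetical representation: a trajectory tau as a list of
# (state, action, reward) tuples.
def trajectory_return(trajectory):
    """R(tau): the sum of the rewards collected along the trajectory."""
    return sum(reward for _, _, reward in trajectory)

# A toy trajectory with three steps.
tau = [(0, 1, 1.0), (1, 0, 0.5), (2, 1, 2.0)]
print(trajectory_return(tau))  # → 3.5
```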
3.3.5 REINFORCE
Our goal is to find the values θ of the weights in the neural network that maximize the expected return

U(θ) = ∑_τ P(τ; θ) R(τ),

where τ is an arbitrary trajectory and P(τ; θ) is the probability of that trajectory under the policy with weights θ. One way to determine the value of θ that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, where the differences are that:
- gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
- gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.
Our update step for gradient ascent appears as follows:

θ ← θ + α ∇_θ U(θ),

where α is the step size, which is generally allowed to decay over time. Once we know how to calculate or estimate this gradient, we can repeatedly apply this update step in the hope that θ converges to the value that maximizes U(θ).
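The update step above can be sketched in a few lines (a toy list-based version of my own, not an implementation from the text; real code would update a neural network's weight tensors):

```python
def gradient_ascent_step(theta, grad, step_size):
    """One gradient-ascent update: step in the direction of the gradient.
    (Gradient descent would subtract instead of add.)"""
    return [t + step_size * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0]
grad = [1.0, -2.0]
theta = gradient_ascent_step(theta, grad, step_size=0.1)
print(theta)  # moves each weight in the direction of its gradient component
```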
In the expected return above, we have to take into account the probability P(τ; θ) of each possible trajectory and the return R(τ) of each trajectory.
In fact, calculating the gradient ∇_θ U(θ) exactly is computationally expensive, because it requires summing over every possible trajectory. Instead, we sample a few trajectories by letting the agent interact with the environment using the policy, and we use only those sampled trajectories to estimate the gradient.
For a single trajectory τ, the gradient estimate is

∇_θ U(θ) ≈ ∑_t ∇_θ log π_θ(a_t | s_t) R(τ),

where π_θ(a_t | s_t) is the probability that the policy with weights θ selects action a_t in state s_t.
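This single-trajectory estimate can be checked numerically. A minimal sketch (a tabular softmax policy on a toy problem of my own construction, not from the text) using the fact that the gradient of log-softmax with respect to the action preferences is one-hot(a) minus the action probabilities:

```python
import numpy as np

def softmax_policy(theta, state):
    """Action probabilities pi_theta(. | state) for a tabular softmax policy
    with one preference theta[state, action] per state-action pair."""
    prefs = theta[state]
    exp = np.exp(prefs - prefs.max())  # subtract max for numerical stability
    return exp / exp.sum()

def reinforce_gradient(theta, trajectory):
    """Single-trajectory REINFORCE estimate:
    sum_t grad_theta log pi_theta(a_t | s_t), scaled by R(tau)."""
    R = sum(r for _, _, r in trajectory)
    grad = np.zeros_like(theta)
    for s, a, _ in trajectory:
        probs = softmax_policy(theta, s)
        grad[s, a] += 1.0   # one-hot part of grad log softmax
        grad[s] -= probs    # minus the action probabilities
    return grad * R

theta = np.zeros((2, 2))              # 2 states, 2 actions, uniform policy
tau = [(0, 1, 1.0), (1, 0, 0.5)]      # (state, action, reward) steps
print(reinforce_gradient(theta, tau))
```

With a uniform policy the probabilities are 0.5 each, so each step contributes ±0.5 to the visited state's row, scaled by R(τ) = 1.5.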
REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
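The discrete/continuous distinction only changes how the policy produces actions. A minimal sketch (toy sampling functions of my own, not from the text) of the two cases with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: the policy outputs one probability per action,
# and an action index is sampled from that categorical distribution.
def sample_discrete(probs):
    return int(rng.choice(len(probs), p=probs))

# Continuous action space: the policy outputs the parameters of a
# distribution (here the mean and standard deviation of a Gaussian),
# and a real-valued action is sampled from it.
def sample_continuous(mean, std):
    return float(rng.normal(mean, std))

print(sample_discrete([0.3, 0.7]))   # an action index, 0 or 1
print(sample_continuous(0.0, 1.0))   # a real-valued action
```

In both cases the gradient estimate has the same form; only the expression for log π_θ(a_t | s_t) changes with the distribution.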