DRL — Policy Based Methods — Chapter 3-3 Policy Gradient Methods


3.3.1 What are Policy Gradient Methods?

Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.

3.3.2 The Big Picture

The difference between policy gradient methods and supervised learning.

3.3.4 Problem Setup

A trajectory is just a state-action sequence. It can correspond to a full episode or to just a small part of an episode. We denote a trajectory with the Greek letter $\tau$, and write the sum of rewards collected along that trajectory as $R(\tau)$.
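To make the notation concrete, here is a minimal Python sketch (the states, actions, and rewards are made-up placeholder values) that stores a trajectory as a list of (state, action, reward) tuples and computes its return $R(\tau)$ as the sum of the rewards.

```python
# A trajectory stored as (state, action, reward) tuples; the values below
# are arbitrary placeholders, not from any particular environment.
trajectory = [
    (0, 1, -1.0),   # (state, action, reward) at time step 0
    (2, 0,  0.0),   # time step 1
    (3, 1, 10.0),   # time step 2
]

def trajectory_return(tau):
    """Return R(tau): the sum of the rewards along the trajectory tau."""
    return sum(reward for _, _, reward in tau)

print(trajectory_return(trajectory))  # 9.0
```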

3.3.5 REINFORCE

Our goal is to find the values of the weights $\theta$ in the neural network that maximize the expected return $U$:

$$U(\theta) = \sum_\tau P(\tau;\theta)\, R(\tau)$$

where $\tau$ is an arbitrary trajectory. One way to determine the value of $\theta$ that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, where the differences are that:

  • gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
  • gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.
Our update step for gradient ascent appears as follows:

$$\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$$
where $\alpha$ is the step size, which is generally allowed to decay over time. Once we know how to calculate or estimate this gradient, we can repeatedly apply this update step, in the hope that $\theta$ converges to the value that maximizes $U(\theta)$.
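As a toy illustration of the update rule itself (not of the policy objective), the sketch below runs gradient ascent on an assumed concave function $U(\theta) = -(\theta - 3)^2$, whose gradient is known in closed form; the initial $\theta$, the step size, and the number of iterations are arbitrary choices.

```python
# Toy gradient ascent demo on U(theta) = -(theta - 3)^2,
# whose exact gradient is -2 * (theta - 3).
theta = 0.0          # initial parameter (arbitrary)
alpha = 0.1          # step size (could be decayed over time)

for step in range(100):
    grad = -2.0 * (theta - 3.0)      # exact gradient of the toy objective
    theta = theta + alpha * grad     # gradient ASCENT: step along the gradient

print(theta)  # approaches 3.0, the maximizer of the toy objective
```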

In the above equation, we need to take into account the probability of each possible trajectory and the return of each trajectory.
In fact, to calculate the gradient $\nabla_\theta U(\theta)$ exactly, we would have to consider every possible trajectory, which is computationally intractable. Instead, we sample a few trajectories using the policy and use only those sampled trajectories to estimate the gradient.
For a single sampled trajectory $\tau$, the gradient can be estimated as

$$\nabla_\theta U(\theta) \approx \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

where $\pi_\theta(a_t \mid s_t)$ is the probability that the policy with weights $\theta$ selects action $a_t$ in state $s_t$, and $H$ is the final time step of the trajectory.
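A minimal PyTorch-style sketch of this single-trajectory estimate is shown below. The environment interaction is faked with random placeholder states and rewards, and the network sizes and learning rate are arbitrary assumptions; the point is only the shape of the computation: accumulate $\log \pi_\theta(a_t \mid s_t)$ along the trajectory, weight the sum by $R(\tau)$, and take one ascent step on the policy weights.

```python
import torch
import torch.nn as nn

# Single-trajectory REINFORCE sketch:
# g_hat = sum_t grad_theta log pi_theta(a_t | s_t) * R(tau).
# States, rewards, sizes, and the learning rate are placeholder assumptions.
state_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(state_dim, 16), nn.ReLU(),
                       nn.Linear(16, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Collect one "trajectory": sample actions from the current policy.
log_probs, rewards = [], []
for t in range(5):
    state = torch.randn(state_dim)                      # placeholder state
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))             # log pi_theta(a_t | s_t)
    rewards.append(torch.randn(()).item())              # placeholder reward

R_tau = sum(rewards)                                    # return R(tau)
# Maximizing sum_t log pi(a_t|s_t) * R(tau) equals minimizing its negative,
# so the optimizer's descent step implements gradient ascent on U(theta).
loss = -torch.stack(log_probs).sum() * R_tau
optimizer.zero_grad()
loss.backward()
optimizer.step()
```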

REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
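The sketch below (with made-up dimensions) illustrates why: the same $\log \pi_\theta(a \mid s)$ machinery works whether the policy head parameterizes a Categorical distribution over a discrete action set or a Gaussian distribution over a continuous action vector.

```python
import torch
import torch.nn as nn

# Placeholder state; all dimensions below are arbitrary for illustration.
state = torch.randn(4)

# Discrete action space: the policy outputs logits over a finite set of actions.
discrete_head = nn.Linear(4, 3)
dist_d = torch.distributions.Categorical(logits=discrete_head(state))
a_d = dist_d.sample()
log_prob_d = dist_d.log_prob(a_d)

# Continuous action space: the policy outputs the mean of a Gaussian
# (here with a fixed standard deviation for simplicity).
mean_head = nn.Linear(4, 2)
dist_c = torch.distributions.Normal(loc=mean_head(state), scale=1.0)
a_c = dist_c.sample()
log_prob_c = dist_c.log_prob(a_c).sum()   # sum over action dimensions

print(log_prob_d.item(), log_prob_c.item())
```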