Summary of Chapter 9, "On-policy Prediction with Approximation", of Reinforcement Learning: An Introduction

A new member has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

I did not read it very carefully before; this time I hope to be more thorough and to write a brief summary of the corresponding topics.





9.1 Value-function Approximation
9.2 The Prediction Objective (MSVE)
9.3 Stochastic-gradient and Semi-gradient Methods
9.4 Linear Methods
9.5 Feature Construction for Linear Methods
9.5.1 Polynomials
9.5.2 Fourier Basis
9.5.3 Coarse Coding
9.5.4 Tile Coding
9.5.5 Radial Basis Functions
9.6 Nonlinear Function Approximation: Artificial Neural Networks
9.7 Least-Squares TD
9.8 Summary





Not all function approximation methods are equally well suited for use in reinforcement learning. In particular, they need to:

==> learn efficiently from incrementally acquired data

==> handle nonstationary target functions



with genuine approximation, an update at one state affects many others, and it is not possible to get all states exactly correct.

By assumption we have far more states than weights, so making one state’s estimate more accurate invariably means making others’ less accurate.



We obtain a natural objective function, the Mean Squared Value Error (MSVE):

MSVE(θ) = Σ_s d(s) [vπ(s) − v(s;θ)]²

Typically one chooses d(s) to be the fraction of time spent in s under the target policy π; this is called the on-policy distribution.

Remember that our ultimate purpose, the reason we are learning a value function, is to use it in finding a better policy. The best value function for this purpose is not necessarily the best for minimizing MSVE. Nevertheless, it is not yet clear what a more useful alternative goal for value prediction might be.




The gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution, because Gt is an unbiased estimate of vπ(St).

One does not obtain the same guarantees if a bootstrapping estimate of vπ(St) is used as the target. Because the bootstrapping estimate depends on the current value of the weight vector θt, these methods are called semi-gradient methods. Semi-gradient TD methods are not true gradient methods: in such bootstrapping methods (including DP), the weight vector appears in the update target, yet this dependence is not taken into account in computing the gradient, which is why they are only semi-gradient methods.

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case discussed in the next section.
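To make the distinction concrete, here is a minimal sketch (not the book's pseudocode) of the two update rules for a generic differentiable approximator; the names v_hat and grad_v_hat and the tanh parameterization are my own choices for illustration, picked only so that the weights enter the target nonlinearly.

```python
import numpy as np

# A small differentiable approximator, chosen only for illustration:
# v_hat(s, theta) = theta[0] * tanh(theta[1] * s)
def v_hat(s, theta):
    return theta[0] * np.tanh(theta[1] * s)

def grad_v_hat(s, theta):
    t = np.tanh(theta[1] * s)
    return np.array([t, theta[0] * (1.0 - t**2) * s])

alpha, gamma = 0.01, 0.99

def gradient_mc_update(theta, state, G):
    # Gradient Monte Carlo: the target G (the return) does not depend on theta,
    # so this is a true stochastic-gradient step on the squared value error.
    return theta + alpha * (G - v_hat(state, theta)) * grad_v_hat(state, theta)

def semi_gradient_td0_update(theta, state, reward, next_state):
    # Semi-gradient TD(0): the bootstrapped target R + gamma * v_hat(s', theta)
    # also depends on theta, but that dependence is ignored when taking the gradient.
    target = reward + gamma * v_hat(next_state, theta)
    return theta + alpha * (target - v_hat(state, theta)) * grad_v_hat(state, theta)
```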





Linear model: the approximate function, v(·;θ), is a linear function of the weight vector θ. Corresponding to every state s, there is a real-valued vector of features φ(s) = (φ1(s), φ2(s), ..., φn(s)), with the same number of components as θ.

In the linear case there is only one optimum (or, in degenerate cases, one set of equally good optima), and thus any method that is guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum. For example, the two methods mentioned above: the gradient Monte Carlo algorithm and the semi-gradient TD(0) algorithm.
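In the linear case the sketch above simplifies: v(s;θ) = θᵀφ(s), so the gradient with respect to θ is just the feature vector φ(s). A minimal sketch (the feature map producing phi_s is assumed, not specified here):

```python
import numpy as np

def linear_v(theta, phi_s):
    # v(s; theta) = theta . phi(s); its gradient w.r.t. theta is simply phi(s).
    return theta @ phi_s

def semi_gradient_td0_linear(theta, phi_s, reward, phi_s_next, alpha=0.01, gamma=0.99):
    td_error = reward + gamma * linear_v(theta, phi_s_next) - linear_v(theta, phi_s)
    return theta + alpha * td_error * phi_s   # gradient of linear_v is phi_s
```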




Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.

State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector θ) for each group.
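State aggregation can be written as a one-hot feature vector over groups; the sizes below (1000 states, 10 groups, matching the book's random-walk example) are just an assumed setup for illustration.

```python
import numpy as np

N_STATES, N_GROUPS = 1000, 10          # assumed sizes, for illustration only
GROUP_WIDTH = N_STATES // N_GROUPS

def phi_state_aggregation(s):
    # One-hot feature over groups: every state in a group shares one weight,
    # i.e. one component of theta holds that group's estimated value.
    x = np.zeros(N_GROUPS)
    x[s // GROUP_WIDTH] = 1.0
    return x

theta = np.zeros(N_GROUPS)
# v(s; theta) = theta @ phi_state_aggregation(s) == theta[s // GROUP_WIDTH]
```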

Polynomial basis: using polynomial features of the state variables (e.g. 1, s1, s2, s1·s2, s1², s2², ...) means that functions are approximated as multi-dimensional quadratic (or higher-order) functions, even though the approximation is still linear in the weights that have to be learned. (In general, the book does not recommend using the polynomial basis for online learning.)
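A sketch of building order-n polynomial features for a k-dimensional state; the ordering of the feature vector is my own choice.

```python
import itertools
import numpy as np

def polynomial_features(s, n):
    # For state s = (s1, ..., sk), build all products s1^c1 * ... * sk^ck
    # with each exponent ci in {0, ..., n}; the weights stay linear.
    k = len(s)
    feats = []
    for exponents in itertools.product(range(n + 1), repeat=k):
        feats.append(np.prod([s[i] ** c for i, c in enumerate(exponents)]))
    return np.array(feats)

# Example: for a 2-d state and n = 2 this yields the 9 features
# 1, s2, s2^2, s1, s1*s2, s1*s2^2, s1^2, s1^2*s2, s1^2*s2^2.
```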

Fourier basis: a Fourier series expresses a periodic function as a weighted sum of sine and cosine basis functions of different frequencies; with enough basis functions, essentially any function can be approximated as accurately as desired.
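A sketch of the order-n Fourier cosine basis described in the book, assuming the state has been normalized to [0, 1]^k:

```python
import itertools
import numpy as np

def fourier_features(s, n):
    # s is assumed normalized to [0, 1]^k; each integer vector c gives one
    # feature cos(pi * c . s), i.e. a cosine of a different frequency.
    k = len(s)
    features = []
    for c in itertools.product(range(n + 1), repeat=k):
        features.append(np.cos(np.pi * np.dot(c, s)))
    return np.array(features)
```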

Coarse coding: essentially, the state space is covered by many overlapping receptive fields (e.g. circles), each corresponding to one binary feature that is present whenever the state falls inside that field; it is called "coarse" because each individual feature only says roughly where the state is. Initial generalization from one point to another is indeed controlled by the size and shape of the receptive fields, but the finest discrimination ultimately possible is controlled more by the total number of features. Receptive field shape tends to have a strong effect on generalization but little effect on asymptotic solution quality.
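A sketch of coarse coding in 2-D with circular receptive fields; the number of circles, their centers and the radius below are arbitrary choices, just to show the idea of overlapping binary features.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(50, 2))   # 50 random circle centers in [0,1]^2
radius = 0.2                                    # size/shape controls generalization

def phi_coarse(s):
    # One binary feature per circle: 1 if the state falls inside that circle.
    # A state typically activates several overlapping circles at once.
    s = np.asarray(s)
    return (np.linalg.norm(centers - s, axis=1) < radius).astype(float)
```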
 

Tile coding is a form of coarse coding for multi-dimensional continuous spaces that is flexible and computationally efficient. With just one tiling we would not have coarse coding but just a case of state aggregation. To get true coarse coding with tile coding, multiple tilings are used, each offset by a fraction of a tile width; note that the choice of how to offset the tilings from each other affects generalization.
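A minimal sketch of tile coding for a 2-D state in [0,1]^2, using uniform offsets of 1/num_tilings of a tile width; real implementations (e.g. Sutton's tiles3) use hashing and asymmetric offsets, so this is only an illustration of the idea.

```python
import numpy as np

def tile_indices(s, num_tilings=8, tiles_per_dim=10):
    # Returns one active tile index per tiling; each tiling is shifted by a
    # fraction of a tile width, which is what turns state aggregation into
    # genuine coarse coding.
    s = np.asarray(s, dtype=float)              # state assumed scaled to [0, 1]^2
    tile_width = 1.0 / tiles_per_dim
    active = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings
        coords = np.floor((s + offset) / tile_width).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim)   # shifted states may spill one tile over
        flat = t * (tiles_per_dim + 1) ** 2 + coords[0] * (tiles_per_dim + 1) + coords[1]
        active.append(flat)
    return active   # v(s) = sum of theta[i] for i in active
```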

Radial basis functions (RBFs) are the natural generalization of coarse coding to continuous-valued features.
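A sketch of RBF features, where each feature responds smoothly in (0, 1] rather than being binary; the centers and width here are arbitrary choices for a 1-D state.

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 20).reshape(-1, 1)   # 20 centers on a 1-d state space
sigma = 0.05                                          # width of each Gaussian bump

def phi_rbf(s):
    # Unlike coarse/tile coding, each feature varies continuously with the
    # distance between the state and the feature's prototypical center.
    return np.exp(-np.sum((centers - s) ** 2, axis=1) / (2.0 * sigma ** 2))
```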



Nonlinear Function Approximation: Artificial Neural Networks: an ANN with a single hidden layer having a large enough finite number of sigmoid units can approximate any continuous function on a compact region of the network's input space to any degree of accuracy (Cybenko, 1989). Related keywords: the backpropagation algorithm / overfitting / vanishing gradients and exploding gradients / dropout / batch normalization / stacked auto-encoders / deep convolutional networks.
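As a toy illustration of a nonlinear approximator, here is a numpy sketch of a one-hidden-layer sigmoid network used as a value function with semi-gradient TD(0); the sizes, initialization and step size are arbitrary, and a real implementation would use an autodiff framework rather than hand-written backprop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 32
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def v_hat(s):
    h = sigmoid(W1 @ s + b1)
    return w2 @ h, h

def semi_gradient_td0_step(s, r, s_next, alpha=1e-3, gamma=0.99):
    global W1, b1, w2
    v, h = v_hat(s)
    v_next, _ = v_hat(s_next)
    delta = r + gamma * v_next - v            # TD error; the target is treated as fixed
    # Gradients of v_hat(s) w.r.t. the weights (semi-gradient: target not differentiated).
    grad_w2 = h
    grad_hidden = w2 * h * (1.0 - h)          # chain rule through the sigmoid
    w2 += alpha * delta * grad_w2
    W1 += alpha * delta * np.outer(grad_hidden, s)
    b1 += alpha * delta * grad_hidden
```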

deep reinforcement learning






Below are the points from Silver's course, Lecture 6 "Value Function Approximation", that I think one should know:

Slides 4-6: motivation.

Slide 7: types of value function approximation.

Slide 9: we consider differentiable function approximators; furthermore, we require a training method that is suitable for non-stationary, non-iid data, e.g. linear combinations of features and neural networks.

Slides 16-20: the various forms of Incremental Prediction Algorithms.



Slide 30: convergence of prediction: MC converges in all cases; everything converges in the tabular case; on-policy linear approximation converges; off-policy TD and TD with non-linear approximation are not guaranteed to converge.



Slide 24: the various forms of Incremental Control Algorithms.


Slide 32: convergence of control: everything converges in the tabular case; non-linear approximation methods are not guaranteed to converge; MC and SARSA with linear approximation converge, but Q-learning with linear approximation is not guaranteed to converge (because Q-learning is off-policy, there is a risk of divergence; see Baird's Counterexample).





Slides 34-35: motivation for Batch Reinforcement Learning (better data efficiency; using least squares to find the best fit to all of the historical data).

Slide 37: the idea of Stochastic Gradient Descent with Experience Replay.

Slides 38-41: DQN, which uses two tricks (experience replay to break the correlation between consecutive samples; a target network to break the correlation with the TD target) to achieve stable training with non-linear approximation.
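A compact sketch of just these two tricks (replay buffer plus periodically synced target weights). To keep the snippet self-contained, the Q-function here is linear in hand-built features, whereas DQN itself uses a deep convolutional network; all names and sizes below are my own assumptions.

```python
import random
from collections import deque
import numpy as np

N_FEATURES, N_ACTIONS = 8, 4                 # assumed sizes, for illustration
theta = np.zeros((N_ACTIONS, N_FEATURES))    # online weights
theta_target = theta.copy()                  # target-network weights (trick 2)
replay = deque(maxlen=10_000)                # experience replay buffer (trick 1)

def q_values(weights, phi_s):
    return weights @ phi_s                   # Q(s, a) = weights[a] . phi(s)

def store(phi_s, a, r, phi_s_next, done):
    replay.append((phi_s, a, r, phi_s_next, done))

def train_step(batch_size=32, alpha=1e-3, gamma=0.99):
    if len(replay) < batch_size:
        return
    # Sampling uniformly from the buffer breaks the correlation between
    # consecutive transitions.
    batch = random.sample(replay, batch_size)
    for phi_s, a, r, phi_s_next, done in batch:
        # The TD target uses the frozen target weights, which breaks the
        # correlation between the estimate and its own moving target.
        target = r if done else r + gamma * np.max(q_values(theta_target, phi_s_next))
        td_error = target - q_values(theta, phi_s)[a]
        theta[a] += alpha * td_error * phi_s

def sync_target():
    # Called every C steps: copy the online weights into the target network.
    global theta_target
    theta_target = theta.copy()
```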


Slides 42-45: Linear Least Squares Prediction. Because the approximator is linear, the optimal w can be computed directly in closed form (the insight: when LS(w) is minimized, the expected update of w must be zero).
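A sketch of how that closed-form solution (LSTD) is computed: accumulate A and b from the data and solve A·θ = b. The small ridge term epsilon added for invertibility is an assumed choice.

```python
import numpy as np

def lstd(transitions, n_features, gamma=0.99, epsilon=1e-3):
    # transitions: iterable of (phi_s, reward, phi_s_next), with phi_* as feature vectors.
    # Setting the expected TD update to zero gives the linear system A theta = b, with
    #   A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T
    #   b = sum_t R_{t+1} phi(s_t)
    A = epsilon * np.eye(n_features)          # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for phi_s, r, phi_s_next in transitions:
        A += np.outer(phi_s, phi_s - gamma * phi_s_next)
        b += r * phi_s
    return np.linalg.solve(A, b)
```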

Slide 50: Least Squares Q-Learning (control).