- Optimization Algorithms

Andrew Ng

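Mini-batch gradient descent

  • Split a very large dataset into small parts (mini-batches)

  • 5,000,000 examples = 5,000 mini-batches of 1,000 examples each: $X^{\{1\}}, \dots, X^{\{5000\}}$ and $Y^{\{1\}}, \dots, Y^{\{5000\}}$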

  • An epoch is a single pass through the training set

  • With batch gradient descent, the cost should decrease on every iteration

  • With mini-batch gradient descent, the cost may not decrease on every iteration. It trends downwards, but it is noisier.

  • mini-batch size = m: Batch gradient descent

  • mini-batch size = 1: Stochastic gradient descent

  • Typical mini-batch sizes are 64, 128, 256, and 512
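
As a rough illustration of the splitting step, here is a minimal NumPy sketch. The function name `make_mini_batches`, the column-per-example layout, and the shuffling step are assumptions made for this example, not something the notes prescribe.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples and split them into mini-batches.

    Assumes X has shape (n_features, m) and Y has shape (n_outputs, m),
    i.e. one column per training example.
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)  # same permutation for X and Y keeps pairs aligned
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]

    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size  # slicing past m is safe; the last batch may be smaller
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

# Toy data (hypothetical): 1,000 examples with 3 features each
X = np.arange(3000, dtype=float).reshape(3, 1000)
Y = np.arange(1000, dtype=float).reshape(1, 1000)
batches = make_mini_batches(X, Y, batch_size=64)
```

One epoch then corresponds to looping once over `batches` and taking one gradient step per mini-batch. Setting `batch_size` to the number of examples gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.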

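Exponentially weighted averages

  • $V_t = \beta V_{t-1} + (1-\beta)\theta_t$

  • $V_t$ is approximately an average over the last $\frac{1}{1-\beta}$ items

  • Apply the formula a few times to expand the recurrence

  • Initialize $V_0 = 0$

  • It is efficient in computation and memory: only the single running value $V_t$ has to be stored.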
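
Expanding the recurrence for a few steps, with $V_0 = 0$, suggests why $V_t$ behaves like an average over recent items. This is a worked sketch; the summation index $k$ is introduced here only for illustration.

$$
\begin{aligned}
V_t &= (1-\beta)\,\theta_t + \beta V_{t-1} \\
    &= (1-\beta)\,\theta_t + (1-\beta)\,\beta\,\theta_{t-1} + \beta^{2} V_{t-2} \\
    &= (1-\beta)\sum_{k=0}^{t-1} \beta^{k}\,\theta_{t-k} + \beta^{t} V_0
     = (1-\beta)\sum_{k=0}^{t-1} \beta^{k}\,\theta_{t-k}
\end{aligned}
$$

The weight on $\theta_{t-k}$ decays as $\beta^{k}$ and falls to roughly $1/e$ once $k \approx \frac{1}{1-\beta}$. For $\beta = 0.9$, $0.9^{10} \approx 0.35$, so $V_t$ is roughly an average over the last 10 items.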

Bias correction in exponentially weighted average

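  • Makes the estimate more accurate during the initial phase, when $V_t$ is biased towards the initial value 0

  • Use $\frac{V_t}{1-\beta^t}$ in place of $V_t$

  • oscillations: swings or fluctuations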
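
The effect of bias correction is easiest to see on a constant signal. The sketch below is a minimal pure-Python illustration; the function name `ewa` and the sample values are hypothetical.

```python
def ewa(values, beta=0.9, bias_correction=False):
    """Return the exponentially weighted average after each item in `values`."""
    v = 0.0  # V_0 = 0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta^t) removes the bias towards the zero initial value
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

signal = [10.0] * 5  # hypothetical constant signal
raw = ewa(signal)
corrected = ewa(signal, bias_correction=True)
```

Without correction the first estimate is only $(1-\beta) \times 10 = 1$, far below the true value of 10, while the corrected estimates stay at 10 from the first step. As $t$ grows, $\beta^t$ goes to 0 and the correction fades out.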

Momentum (gradient descent with momentum)

  • ball rolling down a bowl

  • usually β=0.9
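
A minimal sketch of one common form of the momentum update, in which the velocity is an exponentially weighted average of the gradients. The elongated quadratic "bowl" and the names `momentum_step` and `grad_fn` are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(w, grad, v, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum."""
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of gradients
    w = w - alpha * v                 # step along the averaged gradient
    return w, v

def grad_fn(w):
    # Gradient of the hypothetical bowl f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
    return np.array([1.0, 25.0]) * w

w = np.array([10.0, 1.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = momentum_step(w, grad_fn(w), v, alpha=0.05, beta=0.9)
```

In the ball analogy, `v` plays the role of velocity and the gradient acts like acceleration. Averaging damps the oscillations along the steep direction while progress along the shallow direction accumulates.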


RMSprop (root mean square prop)
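
Root mean square prop (RMSprop) keeps an exponentially weighted average of the squared gradients and divides each step by its square root. The sketch below is one common formulation; the default values and the name `rmsprop_step` are assumptions for this example.

```python
import numpy as np

def rmsprop_step(w, grad, s, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop step; `s` is the running average of squared gradients."""
    s = beta2 * s + (1 - beta2) * grad ** 2
    w = w - alpha * grad / (np.sqrt(s) + eps)  # eps guards against division by zero
    return w, s

w0 = np.array([1.0, 1.0])
grad = np.array([1.0, 25.0])  # hypothetical gradient, much steeper in one direction
w1, s1 = rmsprop_step(w0, grad, np.zeros(2))
```

Although the second gradient component is 25 times larger, both coordinates move by nearly the same amount, because each component is scaled by its own root mean square. This scaling is what damps oscillations in steep directions.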

Adam optimization algorithm

  • Adam stands for Adaptive Moment Estimation

  • Combines Momentum and RMSprop

  • Works well across a wide range of problems

  • Hyperparameters:

    $\alpha$: needs to be tuned, $\beta_1$: 0.9, $\beta_2$: 0.999, $\epsilon$: $10^{-8}$
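
A minimal sketch of a single Adam step, combining a momentum-style average of the gradients with an RMSprop-style average of the squared gradients, both bias-corrected. The function name `adam_step`, the default learning rate, and the sample values are assumptions; the other defaults are the hyperparameter values listed above.

```python
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; `t` is the 1-based step counter used for bias correction."""
    v = beta1 * v + (1 - beta1) * grad       # first moment (Momentum part)
    s = beta2 * s + (1 - beta2) * grad ** 2  # second moment (RMSprop part)
    v_hat = v / (1 - beta1 ** t)             # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w0 = np.array([10.0, 1.0])
grad = np.array([10.0, 25.0])  # hypothetical gradient at w0
w1, v1, s1 = adam_step(w0, grad, np.zeros(2), np.zeros(2), t=1, alpha=0.05)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by about `alpha` regardless of the gradient's magnitude. In a training loop, `v`, `s`, and `t` are carried over from one step to the next.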

Learning rate decay


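  • $\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{iteration}} \alpha_0$

  • $\alpha = 0.95^{\text{epoch\_num}} \alpha_0$

  • $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \alpha_0$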

  • discrete staircase

  • manually controlling alpha (small model)
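
The schedules in this section can be sketched as small Python functions. The function names, the argument `t` (an iteration or epoch counter), and the staircase parameters `drop` and `every` are assumptions for this example. The square-root form assumes the third schedule is $k / \sqrt{\text{epoch\_num}}$ times $\alpha_0$.

```python
import math

def inverse_decay(alpha0, decay_rate, t):
    """alpha = alpha0 / (1 + decay_rate * t)."""
    return alpha0 / (1 + decay_rate * t)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0."""
    return (base ** epoch_num) * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0, for epoch_num >= 1."""
    return k / math.sqrt(epoch_num) * alpha0

def staircase_decay(alpha0, epoch_num, drop=0.5, every=10):
    """Discrete staircase: multiply alpha by `drop` once every `every` epochs."""
    return alpha0 * drop ** (epoch_num // every)

# Inverse decay with alpha0 = 0.2 and decay_rate = 1 over the first three epochs
schedule = [inverse_decay(0.2, 1.0, t) for t in (1, 2, 3)]
```

With these values the learning rate falls from 0.2 to 0.1 after the first epoch, and then to about 0.067 and 0.05.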

The problem of local optima

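  • saddle point
  • Many intuitions from low-dimensional spaces do not carry over to high-dimensional spaces; local optima are rare in high-dimensional spaces
  • Plateaus slow down learning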
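
A small numerical illustration of a saddle point, using the textbook function $f(x, y) = x^2 - y^2$. The function is chosen here as an example and does not come from the notes.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 - y ** 2

def grad_f(p):
    x, y = p
    return np.array([2 * x, -2 * y])

# The Hessian of f is constant: curvature +2 along x and -2 along y
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])

origin = np.array([0.0, 0.0])
grad_at_origin = grad_f(origin)            # zero gradient: a critical point
eigenvalues = np.linalg.eigvalsh(hessian)  # sorted in ascending order
```

The gradient vanishes at the origin, yet the Hessian has one negative and one positive eigenvalue, so the function curves up along one axis and down along the other. For a critical point to be a local minimum, every direction must curve upwards, which becomes less likely as the number of dimensions grows.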