Stanford CS231n Course Notes (Part 1)

CS231n: Convolutional Neural Networks for Visual Recognition

Chinese translation of the notes:
https://zhuanlan.zhihu.com/p/21930884?refer=intelligentunit
http://blog.****.net/u010004460/article/details/53432575


Lecture 5:    Convolutional Neural Networks

Convolutional layer parameter settings:

(figure omitted)
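As a reminder of what the missing figure summarized (standard formulas written from memory, not recovered from the slide; W is the input width, F the filter size, S the stride, P the zero padding, D_in the input depth, K the number of filters):

    # minimal sketch of the usual conv-layer bookkeeping (assumed standard formulas)
    def conv_output_size(W, F, S, P):
        return (W - F + 2 * P) // S + 1          # spatial output size

    def conv_num_params(F, D_in, K):
        return (F * F * D_in + 1) * K            # weights plus one bias per filter

    # e.g. a 32x32x3 input with ten 5x5 filters, stride 1, pad 2:
    # conv_output_size(32, 5, 1, 2) == 32, conv_num_params(5, 3, 10) == 760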

Pooling layer parameter settings:

(figure omitted)
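Likewise for the pooling layer (again the usual formula rather than the missing slide); pooling typically uses no zero padding and has no learnable parameters:

    def pool_output_size(W, F, S):
        return (W - F) // S + 1                  # spatial output size after pooling

    # e.g. 2x2 max pooling with stride 2 halves the spatial size: pool_output_size(32, 2, 2) == 16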


Lecture 6:    Training Neural Networks, Part I

Choosing an activation function

Sigmoid:
  • - Squashes numbers to range [0,1]
  • - Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron
  • - Saturated neurons “kill” the gradients (when x is very small or very large, the gradient is close to 0)
  • - Sigmoid outputs are not zero-centered (the gradients on w are then always all positive or all negative, i.e. every update pushes all of w in the same direction)
        (figure omitted: w should move along the blue diagonal direction, but the updates can only zig-zag along the red directions)
  • - exp() is a bit compute expensive
tanh:
  • - Squashes numbers to range [-1,1]
  • - zero centered (nice)
  • - still kills gradients when saturated 
ReLU:
  • - Does not saturate (in +region)
  • - Very computationally efficient
  • - Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  • - Actually more biologically plausible than sigmoid
  • - Not zero-centered output
  • - An annoyance: what is the gradient when x < 0? It is zero, so a dead ReLU will never activate => it never updates
          => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)

Leaky ReLU:
  • - Does not saturate
  • - Computationally efficient
  • - Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  • - will not “die”
ELU:
  • - All benefits of ReLU
  • - Closer to zero mean outputs
  • - Negative saturation regime compared with Leaky ReLU adds some robustness to noise
  • - Computation requires exp()
Maxout:
          - Does not have the basic form of dot product -> nonlinearity
          - Generalizes ReLU and Leaky ReLU
          - Linear regime! Does not saturate! Does not die!
          - Problem: doubles the number of parameters per neuron

In practice:
        - Use ReLU. Be careful with your learning rates
        - Try out Leaky ReLU / Maxout / ELU
        - Try out tanh but don’t expect much
        - Don’t use sigmoid
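A minimal NumPy sketch of the activations discussed above (my own illustration, not course code; the 0.01 slope for Leaky ReLU and alpha = 1.0 for ELU are common default choices):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                      # squashes to (0, 1), not zero-centered

    def tanh(x):
        return np.tanh(x)                                    # squashes to (-1, 1), zero-centered

    def relu(x):
        return np.maximum(0, x)                              # no saturation for x > 0, can "die" for x < 0

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)                 # small negative slope, never fully dies

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # smooth negative saturation

    def maxout(x, W1, b1, W2, b2):
        return np.maximum(x.dot(W1) + b1, x.dot(W2) + b2)    # max of two linear functions (2x parameters)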

Data Preprocessing

(figure omitted)
Zero-centering: avoids the data being all-positive or all-negative, which causes the same gradient problem as non-zero-centered activations
In practice, you may also see PCA and Whitening of the data:
(figure omitted)
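A sketch of these preprocessing steps on a toy data matrix (zero-centering and normalization are the common ones; PCA/whitening are shown only for reference and are rarely used with convnets):

    import numpy as np

    X = np.random.randn(100, 20)            # toy data matrix, shape (N, D)

    X = X - np.mean(X, axis=0)              # zero-center every feature
    X = X / (np.std(X, axis=0) + 1e-8)      # normalize every feature (less common for images)

    cov = X.T.dot(X) / X.shape[0]           # data covariance matrix, shape (D, D)
    U, S, V = np.linalg.svd(cov)
    Xrot = X.dot(U)                         # decorrelate the data (PCA)
    Xwhite = Xrot / np.sqrt(S + 1e-5)       # whiten: roughly unit variance in every direction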


Weight Initialization

Small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
(figure omitted)
Works ~okay for small networks, but causes problems with deeper networks: all activations become zero!
If the initial weights are set too large: almost all neurons become completely saturated at either -1 or 1 (tanh), and the gradients will all be zero.

- Reasonable initialization (Xavier: scale the weights by 1/sqrt(fan_in)). The mathematical derivation assumes linear activations.
(figure omitted)
But when using the ReLU nonlinearity it breaks:
(figure omitted)

Summary:
Initialization too small:    activations go to zero, gradients also zero, no learning
Initialization too big:    activations saturate (for tanh), gradients zero, no learning
Initialization just right:    nice distribution of activations at all layers, learning proceeds nicely
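A sketch of the three initializations being compared (layer sizes are illustrative):

    import numpy as np

    fan_in, fan_out = 512, 512                                            # example layer sizes

    W_small  = 0.01 * np.random.randn(fan_in, fan_out)                    # too small: activations vanish in deep nets
    W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)         # Xavier: keeps activation variance roughly constant
    W_relu   = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # extra factor of 2 for ReLU (He et al.)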

Batch Normalization

(figure omitted)
Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.
We do not necessarily want to normalize every activation to a unit Gaussian (too rigid), so an extra step is added: learned parameters gamma and beta scale and shift the normalized values.
(figure omitted)
  • - Improves gradient flow through the network (faster convergence)
  • - Allows higher learning rates (faster training)
  • - Reduces the strong dependence on initialization
  • - Acts as a form of regularization in a funny way, and slightly reduces the need for dropout (maybe)
  • - It also counteracts vanishing gradients: by normalizing activations to a fixed mean and variance, BN restores the scale of activations that would otherwise keep shrinking.
Note: at test time BatchNorm layer functions differently:
The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used.
(e.g. can be estimated during training with running averages)
At test time, the mean and variance are not computed from the current batch; the estimates collected during training are used instead.
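A minimal sketch of a batch-norm layer for a fully connected input, showing the train/test difference described above (gamma and beta are the learned scale and shift; the running statistics are the training-time estimates reused at test time):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                          momentum=0.9, eps=1e-5, train=True):
        if train:
            mu = x.mean(axis=0)                            # per-feature batch mean
            var = x.var(axis=0)                            # per-feature batch variance
            x_hat = (x - mu) / np.sqrt(var + eps)          # normalize to zero mean, unit variance
            running_mean = momentum * running_mean + (1 - momentum) * mu   # keep estimates for test time
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            x_hat = (x - running_mean) / np.sqrt(running_var + eps)        # test time: fixed statistics
        return gamma * x_hat + beta, running_mean, running_var             # learned scale and shift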

Choosing Hyperparameters

- First check that the initial loss is in a reasonable range; then add regularization and check that the loss goes up.
- Make sure you can overfit a very small portion of the training data.
- Start with small regularization and find a learning rate that makes the loss go down.
    loss not going down: learning rate too low
    loss exploding: learning rate too high (a loss of NaN almost always means the learning rate is too high…)

cross-validation in stages:
- First stage: only a few epochs to get a rough idea of which params work
- Second stage: longer running time, finer search... (repeat as necessary)
    Tip for detecting explosions in the solver: if the cost is ever > 3 * original cost, break out early
    Note it’s best to optimize in log space!
    Once a promising range has been found, sample hyperparameters again within that narrowed range and search again (see the sketch below).
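A sketch of the coarse random search in log space described above (the ranges are illustrative; the training call is only indicated by a comment):

    import numpy as np

    for _ in range(100):
        lr  = 10 ** np.random.uniform(-6, -3)   # sample learning rate in log space
        reg = 10 ** np.random.uniform(-5, 5)    # sample regularization strength in log space
        # train for a few epochs with (lr, reg), record validation accuracy,
        # then narrow both ranges around the best results and repeat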

Random Search vs. Grid Search
(figure omitted)

Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / dropout strength)


Lecture 7:    Training Neural Networks, Part 2

Optimization Methods

Problems with SGD: 1. Progress is very slow along flat dimensions, and it jitters along steep ones.
                    2. It easily gets stuck at local minima or saddle points (zero gradient, so gradient descent gets stuck).
                            In high-dimensional spaces, saddle points are far more common than local minima.
(figure omitted)
 

SGD+Momentum

At every step, add momentum along the previous direction of motion.

This makes descent faster along dimensions where the gradient is small, and reduces the jitter along steep dimensions.
(figure omitted)
v is the velocity: we keep moving a bit further along the direction we were already moving in.
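A sketch of SGD+Momentum on a toy quadratic loss (compute_gradient, x, rho and learning_rate are all placeholders of my own):

    import numpy as np

    def compute_gradient(x):               # placeholder: gradient of a toy quadratic loss
        return 2 * x

    x = np.array([5.0, -3.0])              # parameters
    v = np.zeros_like(x)                   # velocity
    rho, learning_rate = 0.9, 0.1

    for _ in range(100):
        dx = compute_gradient(x)
        v = rho * v + dx                   # accumulate a running direction of motion
        x -= learning_rate * v             # step along the velocity, not the raw gradient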

Nesterov Momentum

(figure omitted)
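Nesterov momentum evaluates the gradient at the "looked-ahead" point x + rho*v instead of at x; a sketch in the same toy setup (one of several equivalent formulations):

    import numpy as np

    def compute_gradient(x):                      # placeholder toy gradient
        return 2 * x

    x, v = np.array([5.0, -3.0]), np.zeros(2)
    rho, learning_rate = 0.9, 0.1

    for _ in range(100):
        dx_ahead = compute_gradient(x + rho * v)  # gradient at the looked-ahead position
        v = rho * v - learning_rate * dx_ahead
        x += v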

AdaGrad

Keep a running sum of each dimension's squared gradients, and divide the step in that dimension by the square root of this sum.
Each dimension therefore has its own effective learning rate, and all of them keep shrinking.
Descent slows down along dimensions with large gradients and speeds up along dimensions with small gradients.
Problem: as the number of iterations grows, the effective learning rate keeps decaying until progress stops.

(figure omitted)
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension.
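A sketch of AdaGrad matching the description above (compute_gradient is a toy placeholder; the 1e-7 only avoids division by zero):

    import numpy as np

    def compute_gradient(x):                      # placeholder toy gradient
        return 2 * x

    x = np.array([5.0, -3.0])
    grad_squared = np.zeros_like(x)
    learning_rate = 0.1

    for _ in range(100):
        dx = compute_gradient(x)
        grad_squared += dx * dx                                    # per-dimension sum of squared gradients
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # larger history => smaller steps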

RMSProp

An improvement over AdaGrad.

(figure omitted)
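RMSProp replaces AdaGrad's running sum with a leaky (exponentially decaying) average, so the step size does not shrink to zero; a sketch with the usual decay_rate of 0.9:

    import numpy as np

    def compute_gradient(x):                      # placeholder toy gradient
        return 2 * x

    x = np.array([5.0, -3.0])
    grad_squared = np.zeros_like(x)
    learning_rate, decay_rate = 0.1, 0.9

    for _ in range(100):
        dx = compute_gradient(x)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky sum: old history decays
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)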

Adam

(figure omitted)
Bias correction for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
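A sketch of the full Adam update with the bias correction mentioned above, using the suggested beta1 = 0.9, beta2 = 0.999 and learning_rate = 1e-3 (compute_gradient is again a toy placeholder):

    import numpy as np

    def compute_gradient(x):                      # placeholder toy gradient
        return 2 * x

    x = np.array([5.0, -3.0])
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    beta1, beta2, learning_rate = 0.9, 0.999, 1e-3

    for t in range(1, 1001):
        dx = compute_gradient(x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum-like term
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-like term
        first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction: both moments
        second_unbias = second_moment / (1 - beta2 ** t)                 #   start at zero
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)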

Learning rate decay over time

step decay:     e.g. decay learning rate by half every few epochs.
exponential decay:     alpha = alpha_0 * exp(-k*t)
1/t decay:    alpha = alpha_0 / (1 + k*t)
*More critical with SGD+Momentum, less common with Adam
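The three schedules as functions of the epoch t (alpha_0 is the initial learning rate; k, drop and every are illustrative hyperparameters of my own):

    import numpy as np

    def step_decay(alpha0, t, drop=0.5, every=10):
        return alpha0 * (drop ** (t // every))    # e.g. halve the learning rate every 10 epochs

    def exp_decay(alpha0, t, k=0.1):
        return alpha0 * np.exp(-k * t)            # exponential decay

    def one_over_t_decay(alpha0, t, k=0.1):
        return alpha0 / (1.0 + k * t)             # 1/t decay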

Model Ensembles

  • Instead of training independent models, use multiple snapshots of a single model during training!
  • Instead of using actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging)
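A one-line sketch of Polyak averaging as described in the second bullet (beta = 0.995 and the toy parameter vector are my own illustrative choices):

    import numpy as np

    x = np.random.randn(10)                  # current parameters (toy)
    x_test = np.copy(x)                      # smoothed copy used for evaluation
    beta = 0.995

    # after every training update of x:
    x_test = beta * x_test + (1 - beta) * x  # use x_test, not x, at test time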

Regularization

  • Add a regularization term to the loss: L2 regularization, L1 regularization
  • Dropout: on every forward pass, randomly set some neurons to 0 (see the sketch after this list)
    • One interpretation: forces the network to have a redundant representation; prevents co-adaptation of features
    • Another interpretation: dropout trains a large ensemble of models (that share parameters); each binary mask is one model
  • Data augmentation
  • Batch Norm
  • DropConnect
  • Fractional Max Pooling
  • Stochastic Depth
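A sketch of (inverted) dropout as referenced in the Dropout bullet above (p = 0.5 is the usual keep probability):

    import numpy as np

    p = 0.5  # probability of keeping a unit

    def dropout_forward(x, train=True):
        if train:
            mask = (np.random.rand(*x.shape) < p) / p   # inverted dropout: scale at train time
            return x * mask
        return x                                        # test time: no masking, no extra scaling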