Stanford CS231n Course Notes (Part 1)
CS231n: Convolutional Neural Networks for Visual Recognition
Chinese translation of the notes: https://zhuanlan.zhihu.com/p/21930884?refer=intelligentunit
Lecture 5: Convolutional Neural Networks
Parameter settings for convolutional layers:
Parameter settings for pooling layers:
Lecture 6: Training Neural Networks, Part I
Choice of Activation Functions
Sigmoid:
- - Squashes numbers to range [0,1]
- - Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron
- - Saturated neurons “kill” the gradients (when x is very small or very large, the gradient approaches 0)
- - Sigmoid outputs are not zero-centered (the gradients on w will always be all positive or all negative, i.e. every w moves in the same direction on each update)
In the lecture figure, w should be able to move in the blue direction, but can only zig-zag along the red directions
- - exp() is a bit compute expensive
tanh:
- - Squashes numbers to range [-1,1]
- - zero centered (nice)
- - still kills gradients when saturated
ReLU:
- - Does not saturate (in +region)
- - Very computationally efficient
- - Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- - Actually more biologically plausible than sigmoid
- - Not zero-centered output
- - An annoyance: what is the gradient when x < 0?
A dead ReLU will never activate => never update
=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
Leaky ReLU:
- - Does not saturate
- - Computationally efficient
- - Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- - will not “die”
ELU:
- - All benefits of ReLU
- - Closer to zero mean outputs
- - Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- - Computation requires exp()
Maxout:
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
- Problem: doubles the number of parameters/neuron
In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid
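The activation functions compared above can be sketched in NumPy (a minimal sketch; the alpha defaults for Leaky ReLU and ELU are conventional choices, not taken from these notes):

```python
import numpy as np

def sigmoid(x):
    # Squashes to (0, 1); saturates for large |x| and is not zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered squashing to (-1, 1); still saturates.
    return np.tanh(x)

def relu(x):
    # No saturation for x > 0; gradient is exactly 0 for x < 0 (the "dead" region).
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope for x < 0 keeps a nonzero gradient, so units cannot "die".
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha; closer to zero-mean outputs,
    # but requires exp().
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```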
Data Preprocessing
Zero-centering: avoid all-positive or all-negative inputs, which cause the gradient problems described above
In practice, you may also see PCA and Whitening of the data:
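A minimal sketch of zero-centering, PCA, and whitening on a synthetic (N, D) data matrix (the data here is made up for illustration):

```python
import numpy as np

np.random.seed(0)
# Hypothetical data matrix: N = 100 examples, D = 10 features per row.
X = np.random.randn(100, 10) * 5 + 3

# Zero-center: subtract the per-feature mean.
X_centered = X - np.mean(X, axis=0)

# PCA: rotate the data into the eigenbasis of its covariance matrix.
cov = X_centered.T @ X_centered / X_centered.shape[0]
U, S, _ = np.linalg.svd(cov)
X_pca = X_centered @ U          # decorrelated data

# Whitening: additionally scale each dimension to unit variance.
X_white = X_pca / np.sqrt(S + 1e-5)
```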
Weight Initialization
- Small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
Works ~okay for small networks, but problems with deeper networks: all activations become zero!
If the initial weights are too large: almost all neurons become completely saturated, at either -1 or 1 (tanh). Gradients will be all zero.
- Reasonable initialization, e.g. Xavier initialization (the mathematical derivation assumes linear activations)
but when using the ReLU nonlinearity it breaks
Summary:
Initialization too small: Activations go to zero, gradients also zero, No learning
Initialization too big: Activations saturate (for tanh), Gradients zero, no learning
Initialization just right: Nice distribution of activations at all layers, Learning proceeds nicely
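The three regimes above can be sketched as follows (a minimal sketch; the extra factor of 2 in the last variant is the standard "He" fix for the ReLU case, stated here as an assumption rather than quoted from the notes):

```python
import numpy as np

np.random.seed(0)
fan_in, fan_out = 512, 256   # illustrative layer sizes

# Small random numbers: okay for shallow nets, but activations shrink
# toward zero as depth grows.
W_small = 0.01 * np.random.randn(fan_in, fan_out)

# "Xavier" initialization: variance 1/fan_in
# (the derivation assumes linear activations).
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization: the factor of 2 compensates for ReLU
# zeroing roughly half the units.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```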
Batch Normalization
Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.
We don't have to force every activation to a unit Gaussian; that would be too rigid. So an extra step learns scale and shift parameters γ and β, which let the network shift the normalized values (or even undo the normalization entirely).
- - Improves gradient flow through the network (faster convergence)
- - Allows higher learning rates (faster training)
- - Reduces the strong dependence on initialization
- - Acts as a form of regularization in a funny way, and slightly reduces the need for dropout (maybe)
- BN also counteracts vanishing gradients: by normalizing activations to a consistent mean and variance, it restores the scale of activations that would otherwise keep shrinking.
Note: at test time BatchNorm layer functions differently:
The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used.
(e.g. can be estimated during training with running averages)
That is, at test time the mean and variance are not computed from the current batch; the estimates gathered during training are used instead.
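A minimal BatchNorm forward pass illustrating the train/test difference (a sketch of the forward computation only, not the full layer with backprop):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Minimal BatchNorm sketch for an (N, D) batch."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Keep running estimates for use at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test time: use the fixed empirical statistics from training.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learned scale/shift gamma, beta let the network undo BN if it wants.
    out = gamma * x_hat + beta
    return out, running_mean, running_var
```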
Choosing Hyperparameters
- First check that the initial loss is in a reasonable range; then turn on regularization and check that the loss goes up
- Make sure you can overfit a very small portion of the training data
- Start with small regularization and find learning rate that makes the loss go down.
loss not going down: learning rate too low
loss exploding: learning rate too high
loss going NaN almost always means the learning rate is too high…
cross-validation in stages:
- First stage: only a few epochs to get rough idea of what params work
- Second stage: longer running time, finer search... (repeat as necessary)
Tip for detecting explosions in the solver: If the cost is ever > 3 * original cost, break out early
note it’s best to optimize in log space!
After finding a good range, sample hyperparameters again within that narrowed range and search once more
Random Search vs. Grid Search
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / Dropout strength)
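Random search over such hyperparameters is usually done in log space, as noted above; a minimal sketch (the ranges are illustrative assumptions, and the actual training call is left as a comment):

```python
import numpy as np

np.random.seed(0)
results = []
for _ in range(10):
    # Sampling the exponent uniformly covers a multiplicative range
    # like 1e-6 .. 1e-2 evenly, which a linear sample would not.
    lr = 10 ** np.random.uniform(-6, -2)    # learning rate (illustrative range)
    reg = 10 ** np.random.uniform(-5, 1)    # L2 strength (illustrative range)
    # ... train for a few epochs with (lr, reg), record validation accuracy ...
    results.append((lr, reg))
```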
Lecture 7: Training Neural Networks, Part 2
Optimization Methods
Problems with SGD:
1. Progress is very slow along shallow dimensions, and the path jitters along steep ones
2. It easily gets stuck at local minima or saddle points: zero gradient, gradient descent gets stuck
In high-dimensional spaces, saddle points are much more common than local minima
SGD+Momentum
At each step, add momentum along the previous direction of motion
This descends faster along gently-sloped dimensions and jitters less along steep ones
v is the velocity: keep moving a bit further along the previous direction of motion
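A minimal sketch of one SGD+Momentum update (one common formulation; rho is the usual momentum coefficient, e.g. 0.9):

```python
import numpy as np

def sgd_momentum_step(w, dw, v, learning_rate=1e-2, rho=0.9):
    # v accumulates an exponentially decaying sum of past gradients;
    # rho ("friction") controls how much old velocity is kept.
    v = rho * v - learning_rate * dw
    w = w + v
    return w, v
```

With v initialized to zero, the first step reduces to plain SGD; later steps keep part of the previous direction of motion.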
Nesterov Momentum
AdaGrad
Keep a running sum of the squared gradients in each dimension, and divide each dimension's step by the square root of this sum
Each dimension gets its own effective learning rate, and all of them keep shrinking
This slows descent along large-gradient dimensions and speeds it up along small-gradient dimensions
Problem: as iterations accumulate, the learning rate keeps decaying until descent stops altogether
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
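The AdaGrad update described above, as a minimal sketch:

```python
import numpy as np

def adagrad_step(w, dw, grad_squared, learning_rate=1e-2, eps=1e-7):
    # Accumulate the historical sum of squared gradients per dimension ...
    grad_squared = grad_squared + dw * dw
    # ... and scale each dimension's step by 1/sqrt of that sum:
    # large-gradient dimensions slow down, small-gradient ones speed up.
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```

Because grad_squared only ever grows, the effective step size decays monotonically, which is exactly the problem RMSProp addresses below.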
RMSProp
An improvement over AdaGrad: it replaces the full historical sum with an exponentially decaying average of squared gradients, so the effective learning rate no longer shrinks to zero
Adam
Bias correction for the fact that first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
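The Adam update with bias correction, as a minimal sketch using the defaults quoted above:

```python
import numpy as np

def adam_step(w, dw, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum-like) and second moment (AdaGrad/RMSProp-like).
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw * dw
    # Bias correction: m and v start at zero, so the raw estimates are
    # biased toward zero on early steps (t counts steps, starting at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the very first step the bias correction makes m_hat equal the raw gradient, so the update magnitude is roughly the learning rate itself.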
Learning rate decay over time
step decay: e.g. decay learning rate by half every few epochs.
exponential decay: α = α₀ e^(−kt)
1/t decay: α = α₀ / (1 + kt)
*More critical with SGD+Momentum, less common with Adam
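The decay schedules above can be sketched as follows (α₀ and k are illustrative values, not from the notes):

```python
import numpy as np

alpha0 = 1e-1   # initial learning rate (illustrative)
k = 0.1         # decay strength (illustrative)

def step_decay(epoch, drop=0.5, every=10):
    # Halve the learning rate every `every` epochs.
    return alpha0 * (drop ** (epoch // every))

def exponential_decay(t):
    # alpha = alpha0 * e^(-k t)
    return alpha0 * np.exp(-k * t)

def one_over_t_decay(t):
    # alpha = alpha0 / (1 + k t)
    return alpha0 / (1 + k * t)
```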
Model Ensembles
- Instead of training independent models, use multiple snapshots of a single model during training!
- Instead of using actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging)
Regularization
- Add a regularization term to the loss: L2 regularization, L1 regularization
- Dropout: on each forward pass, randomly set some neurons to 0
- One interpretation: forces the network to have a redundant representation; prevents co-adaptation of features
- Another interpretation: dropout trains a large ensemble of models (that share parameters); each binary mask is one model
- Data augmentation
- Batch Norm
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
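The Dropout idea above, in its common "inverted dropout" form (scaling at train time so that test time needs no change), can be sketched as:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    # Inverted dropout: keep each unit with probability p, then divide by p
    # so the expected activation is unchanged and test time is an identity.
    if train:
        mask = (np.random.rand(*x.shape) < p) / p
        return x * mask
    return x   # test time: no masking, no rescaling needed
```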