cs231n-notes-Lecture-4/5/6: Backpropagation / Activation Functions / Data Preprocessing / Weight Initialization / Batch Norm
Lecture-4 Backpropagation and Neural Networks
Computational Graphs
- Node gradient = [local gradient] x [upstream gradient]
- add gate: gradient distributor
- max gate: gradient router (passes the full gradient only to the input that was larger in the forward pass)
- mul gate: gradient switcher (each input receives the upstream gradient scaled by the other input's value)
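A minimal numeric sketch (the values are illustrative, not from the lecture) of how these gates pass gradients through a tiny graph f = (x + y) * max(z, w):

```python
# forward pass through f = (x + y) * max(z, w)
x, y, z, w = 3.0, -1.0, 2.0, 5.0
a = x + y          # add gate
b = max(z, w)      # max gate
f = a * b          # mul gate

# backward pass (upstream gradient df/df = 1)
df = 1.0
da = b * df                        # mul gate: each input gets upstream * the other input
db = a * df
dx = 1.0 * da                      # add gate: distributes the gradient unchanged
dy = 1.0 * da
dz = db * (1.0 if z > w else 0.0)  # max gate: routes the gradient only to the larger input
dw = db * (1.0 if w > z else 0.0)
print(dx, dy, dz, dw)              # 5.0 5.0 0.0 2.0
```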
Lecture-5 Convolutional Neural Networks
For an N*N image, an F*F filter, and stride S, the output feature map size is (N-F)/S + 1 on each side (assuming no zero padding).
Common settings (e.g. for pooling layers): F = 2 or 3, S = 2.
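A small helper (the function name is illustrative) for this output-size computation:

```python
# spatial output size of a conv/pool layer with no zero padding: (N - F) / S + 1
def output_size(N, F, S):
    assert (N - F) % S == 0, "filter does not tile the input evenly"
    return (N - F) // S + 1

print(output_size(32, 5, 1))  # 28: a 32x32 input, 5x5 filter, stride 1
print(output_size(32, 2, 2))  # 16: a common 2x2, stride-2 pooling setting
```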
ConvNetJS demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Lecture-6 Training Neural Networks
Activation function
Sigmoid
- Pros:
- squashes numbers into range [0,1].
- nice interpretation as a saturating “firing rate” of a neuron
- Cons:
- Saturated neurons kill the gradients
- not zero-centered
- exp() is somewhat computationally expensive
tanh
- squashes numbers into range [-1,1]
- zero-centered
- Saturated neurons still kill the gradients
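A quick numpy check (not from the lecture) of the saturation problem: for large |x| the local gradients of sigmoid and tanh are essentially zero, so whatever gradient arrives from upstream is killed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))   # d(sigmoid)/dx
tanh_grad = 1.0 - np.tanh(x) ** 2            # d(tanh)/dx

print(sig_grad)   # ~[4.5e-05, 0.20, 0.25, 0.20, 4.5e-05]
print(tanh_grad)  # ~[8.2e-09, 0.42, 1.00, 0.42, 8.2e-09]
```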
ReLU
- Pros:
- Does not saturate (in the positive region)
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid.
- Cons:
- not zero-centered
- kills the gradient for all negative inputs; a "dead" ReLU will never update its weights
Leaky ReLU
- Pros:
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- will not “die”
- Parametric ReLU (PReLU): f(x) = max(αx, x), where the slope α is learned by backpropagation
Exponential Linear Units (ELU): f(x) = x for x > 0 and α(e^x − 1) otherwise; keeps the benefits of ReLU with outputs closer to zero mean, but requires exp()
Maxout: max(w_1^T x + b_1, w_2^T x + b_2); generalizes ReLU and Leaky ReLU and does not saturate or die, but doubles the number of parameters per neuron
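Minimal numpy sketches of the ReLU-family activations above; the default alpha values and the two-piece maxout are illustrative assumptions, not lecture code:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # parametric ReLU: alpha is a learned parameter rather than a fixed constant
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # maxout with two linear pieces: max(W1^T x + b1, W2^T x + b2);
    # note it needs two sets of weights per neuron
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```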
Data Preprocessing
- Zero-centering and normalization (for images, it is common to only zero-center)
- For images, e.g. consider the CIFAR-10 example with [32,32,3] images:
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
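A minimal sketch of the two zero-centering schemes, assuming a CIFAR-10-style array X_train of shape [N, 32, 32, 3] (the random data here is just a stand-in):

```python
import numpy as np

X_train = np.random.rand(1000, 32, 32, 3).astype(np.float32)  # stand-in for CIFAR-10 images

# AlexNet-style: subtract the mean image (a [32, 32, 3] array)
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# VGGNet-style: subtract the per-channel mean (3 numbers)
channel_mean = X_train.mean(axis=(0, 1, 2))
X_centered = X_train - channel_mean
```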
Weight Initialization
- One option: initialize from a pre-trained network and fine-tune it
- Small random numbers.
- e.g. W ~ N(0, 1e-2); works for small networks, but in deep networks the activations of deeper layers collapse toward zero, so the gradients flowing back are tiny and the deeper weights barely update
- Large random numbers: the neurons saturate easily (with tanh/sigmoid), which again kills the gradients
- Xavier initialization: $W_{a \times b} = \frac{\mathcal{N}(0,1)}{\sqrt{a}}$, i.e. scale by the square root of the fan-in $a$
- Works well with tanh but breaks with ReLU (which zeroes half the activations). Hence, for ReLU the He et al. variant adds a factor of 2 under the square root: $W_{a \times b} = \mathcal{N}(0,1)\sqrt{2/a}$ (see the sketch below)
ref: https://www.leiphone.com/news/201703/3qMp45aQtbxTdzmK.html
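A minimal sketch (the layer sizes are made up) of Xavier initialization for a tanh layer and the He et al. variant commonly used with ReLU:

```python
import numpy as np

fan_in, fan_out = 512, 256

W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)        # for tanh
W_he     = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)  # for ReLU
```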
Batch Normalization
X : N*D (N: number of samples in the batch; D: feature dimension)
- compute the empirical mean and variance for each dimension over the mini-batch
- normalize, then scale and shift with the learned parameters γ (scale) and β (shift)
- usually inserted after fc and conv layers and before the nonlinearity
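A minimal sketch of the training-time batch-norm forward pass for an N*D batch of activations X; gamma, beta, and eps are the usual learned scale/shift parameters and a small constant for numerical stability:

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                      # per-dimension mean, shape (D,)
    var = X.var(axis=0)                      # per-dimension variance, shape (D,)
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize
    return gamma * X_hat + beta              # scale and shift

X = np.random.randn(128, 64)                 # N = 128 samples, D = 64 features
out = batchnorm_forward(X, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1 per dimension
```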
Tip:
Before training on the full dataset, check that the model can overfit a small subset of the data; if it cannot, something is wrong with the model or the training code.
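A minimal sketch of this sanity check with a tiny two-layer numpy net; the sizes, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 10)              # 20 samples, 10 features (stand-in data)
y = np.random.randint(0, 3, size=20)     # 3 classes

W1 = np.random.randn(10, 50) * np.sqrt(2.0 / 10); b1 = np.zeros(50)
W2 = np.random.randn(50, 3) * np.sqrt(2.0 / 50);  b2 = np.zeros(3)
lr = 1e-1

for step in range(1000):
    # forward: ReLU hidden layer + softmax loss
    h = np.maximum(0, X @ W1 + b1)
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(20), y]).mean()

    # backward
    dscores = probs.copy()
    dscores[np.arange(20), y] -= 1
    dscores /= 20
    dW2 = h.T @ dscores; db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)

    # vanilla gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # should end up close to 0 as the net memorizes the 20 samples
```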