COURSE 2 Improving Deep Neural Networks Hyperparameter tuning, Regularization and Optimization
Week1
Train/Dev/Test sets
- train set: train model (60% or higher)
- dev set: hold-out cross validation
- test set: take the best model
Make sure dev set and test set come from same distribution
Not having a test set might be okay
Bias and Variance
- high bias: underfitting
- just right
- high variance: overfitting
When judging whether bias or variance is high, compare against the optimal error (Bayes error).
Basic “recipe” for machine learning
- high bias -> bigger network (or train longer)
- high variance -> more data (or regularization)
Norm Regularization
One of the first things you should try to solve a high variance problem is probably regularization.
We usually omit the regularization term for b, because w is usually a pretty high-dimensional parameter vector, especially with a high variance problem, so regularizing b makes little practical difference.
Different Regularization
- L2 regularization (most often): $||w||_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$
- L1 regularization (more zeros, more sparse): $||w||_1 = \sum_{j=1}^{n_x} |w_j|$
- Frobenius norm regularization (the sum of squares of the elements of a matrix): $||w^{[l]}||_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} (w^{[l]}_{ij})^2$
Derivatives
Process
L2 regularization is sometimes called weight decay because the coefficient of w is going to be a little bit less than 1.
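As a hedged sketch (the function name and hyperparameter values are illustrative, not from the course code), the weight-decay view of the L2-regularized update can be written as:

```python
import numpy as np

def l2_regularized_update(W, dW, alpha=0.01, lambd=0.7, m=1000):
    """One gradient step with L2 (Frobenius) regularization.

    The update W := W - alpha*(dW + (lambd/m)*W) can be rewritten as
    W := (1 - alpha*lambd/m)*W - alpha*dW, i.e. W is first multiplied
    by a factor slightly less than 1 -- hence "weight decay".
    """
    decay = 1 - alpha * lambd / m          # slightly less than 1
    return decay * W - alpha * dW

W_new = l2_regularized_update(np.ones((3, 3)), np.zeros((3, 3)))
# With a zero gradient, the weights still shrink a little every step.
```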
Why Regularization Reduces Overfitting
If the regularization becomes very large, the parameters W become very small, so Z will be relatively small, kind of ignoring the effects of b for now, so Z will be relatively small or, really, it takes on a small range of values. And so if the activation function is tanh, say, it will be relatively linear. And so your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function rather than a very complex highly non-linear function, and so it is also much less able to overfit.
When debugging gradient descent, plot the cost function including the regularization term; if you plot the old cost function without it, you might not see it decrease monotonically.
Dropout Regularization
With dropout, what we’re going to do is go through each of the layers of the network and set some probability of eliminating a node in neural network.
Train
Suppose we apply dropout to layer 3 with keep_prob = 0.8.
Then each unit in $a^{[3]}$ is kept with probability 0.8 and zeroed out with probability 0.2.
Because about 20% of the elements of $a^{[3]}$ are zeroed out, $z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$ would be reduced in expectation.
So we need to divide $a^{[3]}$ by keep_prob so that the expected value of $z^{[4]}$ is not reduced.
This is inverted dropout.
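A minimal sketch of the inverted-dropout training step (the layer number, shapes, and keep_prob = 0.8 are illustrative):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                              # probability of keeping a unit

a3 = np.random.randn(5, 10)                  # activations of a hypothetical layer 3

# Inverted dropout: zero out ~20% of units, then scale up by 1/keep_prob
d3 = np.random.rand(*a3.shape) < keep_prob   # boolean keep-mask
a3 = a3 * d3                                 # eliminate dropped units
a3 = a3 / keep_prob                          # keep E[a3] unchanged so z4 is not reduced
```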
Test
No dropout.
Why Does Dropout Work
Cannot rely on any one feature, so have to spread out weights (shrink weights)
Data Augmentation
This can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat.
Early Stopping
The main downside of early stopping is that it couples two tasks: optimizing the cost function J and preventing overfitting. So you no longer can work on these two problems independently, because by stopping gradient descent early, you're sort of breaking whatever you're doing to optimize the cost function J, because now you're not doing a great job reducing J.
Normalizing Inputs
Subtract the mean: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, then $x := x - \mu$
Normalize the variance: $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})^2$ (element-wise, after mean subtraction), then $x := x / \sigma$
And use the same parameters to normalize test set
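A minimal NumPy sketch of this normalization (shapes and data are illustrative), including reusing the training-set statistics on the test set:

```python
import numpy as np

np.random.seed(0)
X_train = np.random.randn(2, 100) * 5 + 3    # features x m training examples
X_test = np.random.randn(2, 20) * 5 + 3      # features x m test examples

mu = np.mean(X_train, axis=1, keepdims=True)                    # per-feature mean
sigma2 = np.mean((X_train - mu) ** 2, axis=1, keepdims=True)    # per-feature variance

X_train_norm = (X_train - mu) / np.sqrt(sigma2 + 1e-8)
# Use the SAME mu and sigma2, computed on the training set, for the test set:
X_test_norm = (X_test - mu) / np.sqrt(sigma2 + 1e-8)
```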
Why Normalize Inputs
If you normalize the features, then your cost function will on average look more symmetric. If you run gradient descent on an unnormalized cost function with very elongated contours, you might have to use a very small learning rate, because gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum.
Vanishing / Exploding Gradients
In a very deep network,
If your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small. This makes training difficult, especially if your gradients are exponentially small in L: gradient descent will take tiny little steps, and it will take a long time to learn anything.
Weight Initialization for Deep Networks
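One common scheme here is He initialization, which scales the random weights by $\sqrt{2 / n^{[l-1]}}$ (a good default for ReLU; $\sqrt{1 / n^{[l-1]}}$ is often used for tanh). A hedged sketch, with illustrative layer sizes:

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """He initialization: Var(w) = 2 / n^{[l-1]}, which keeps the variance
    of z roughly constant across layers and mitigates vanishing/exploding
    gradients (good default for ReLU)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])
        ) * np.sqrt(2.0 / layer_dims[l - 1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([1000, 100, 1])   # illustrative 2-layer network
```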
Gradient Checking
for each i: $d\theta_{approx}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon}$
check $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$: around $10^{-7}$ is great; greater than $10^{-3}$ suggests a bug
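A minimal sketch of the check (the toy cost function is illustrative):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical
    estimate of dJ/dtheta; returns the relative difference."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy();  plus[i] += eps
        minus = theta.copy(); minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    denom = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / denom

# Toy check: J(theta) = sum(theta^2), so dJ/dtheta = 2*theta
theta = np.array([1.0, -2.0, 3.0])
diff = grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta)
# diff should be very small for a correct gradient
```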
Notes
- Don’t use in training - only to debug
- If algorithm fails grad check, look at components to try to identify bug
- Remember regularization
- Doesn’t work with dropout
- Run at random initialization
Week2
Mini-Batch Gradient Descent
You split up your training set into smaller, little baby training sets and these baby training sets are called mini-batches.
$$
X = [x^{(1)}, x^{(2)}, \ldots, x^{(i)}, \ldots, x^{(m)}], \quad Y = [y^{(1)}, y^{(2)}, \ldots, y^{(i)}, \ldots, y^{(m)}] \\
\text{mini-batches: } X = [X^{\{1\}}, X^{\{2\}}, \ldots, X^{\{t\}}, \ldots], \quad Y = [Y^{\{1\}}, Y^{\{2\}}, \ldots, Y^{\{t\}}, \ldots], \\
\text{where } (X^{\{t\}}, Y^{\{t\}}) \text{ is a mini-batch}
$$
The code I have written down here is also called doing one epoch of training; an epoch is a word that means a single pass through the training set. Whereas with batch gradient descent, a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take 5,000 gradient descent steps (one per mini-batch, if there are 5,000 mini-batches).
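A hedged sketch of splitting a training set into shuffled mini-batches (the function name and sizes are illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m training examples, then partition them into
    mini-batches of size batch_size (the last one may be smaller)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        batches.append((X_shuf[:, t:t + batch_size],
                        Y_shuf[:, t:t + batch_size]))
    return batches

X = np.random.randn(5, 330)
Y = np.random.randn(1, 330)
batches = random_mini_batches(X, Y)   # 5 batches of 64, plus one of 10
```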
Understanding Mini-Batch Gradient Descent
- If mini-batch size = m, then batch gradient descent.
  - too long per iteration
- If mini-batch size = 1, then stochastic gradient descent
  - lose speedup from vectorization
  - more noisy
If you have a small training set (m <= 2000), use batch gradient descent. Otherwise, typical mini-batch sizes are 64, 128, 256, 512 … And make sure each mini-batch fits in CPU/GPU memory.
Exponentially Weighted Averages
Exponentially weighted averages, $v_t = \beta v_{t-1} + (1 - \beta) \theta_t$ (averaging over roughly the last $\frac{1}{1-\beta}$ values), are a building block of optimization algorithms faster than plain gradient descent, and we'll use them to build up to more sophisticated optimization algorithms.
Understanding Exponentially Weighted Averages
This is a very efficient way to do so both from computation and memory efficiency point of view which is why it’s used in a lot of machine learning.
Bias Correction
When t is small, $v_t$ is very small, because v starts at 0 and the previous values of v are very small.
During this initial phase of learning, while your estimates are still warming up, bias correction can help you obtain a better estimate.
In machine learning, for most implementations of the exponential weighted average, people don’t often bother to implement bias corrections. Because most people would rather just wait that initial period and have a slightly more biased estimate and go from there. But if you are concerned about the bias during this initial phase, while your exponentially weighted moving average is still warming up. Then bias correction can help you get a better estimate early on.
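A small sketch of the exponentially weighted average with and without bias correction (the constant input signal is illustrative; it makes the early bias easy to see):

```python
def ewa(thetas, beta=0.9, bias_correct=True):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t,
    averaging over roughly the last 1/(1-beta) values. Bias correction
    divides by (1 - beta**t) so early estimates are not biased toward 0."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correct else v)
    return out

temps = [10.0] * 5                            # constant signal
raw = ewa(temps, bias_correct=False)          # starts at 1.0: badly biased low
corrected = ewa(temps, bias_correct=True)     # starts at 10.0: bias removed
```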
Gradient Descent With Momentum
Compute dW, db on current mini-batch
Then compute $v_{dW} = \beta v_{dW} + (1 - \beta) dW$ and $v_{db} = \beta v_{db} + (1 - \beta) db$
Then update parameters: $W := W - \alpha v_{dW}$, $b := b - \alpha v_{db}$
What this does is smooth out steps of gradient descent.
With a few iterations you find that the gradient descent with momentum ends up eventually just taking steps that are much smaller oscillations in the vertical direction, but are more directed to just moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in this path to the minimum.
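The momentum steps above can be sketched as (hyperparameter values are illustrative):

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha=0.01, beta=0.9):
    """Gradient descent with momentum: v accumulates an exponentially
    weighted average of past gradients, smoothing out oscillations."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW

# With an oscillating gradient, the velocity averages toward zero,
# damping the back-and-forth steps:
W, v = np.ones((2, 2)), np.zeros((2, 2))
for sign in [1.0, -1.0, 1.0, -1.0]:
    W, v = momentum_update(W, sign * np.ones_like(W), v)
```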
RMSprop (Root Mean Squared prop)
Compute dW, db on current mini-batch
Then compute $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) dW^2$ and $s_{db} = \beta_2 s_{db} + (1 - \beta_2) db^2$ (element-wise squares)
Then update parameters: $W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon}$, $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \varepsilon}$
The net effect of this is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.
Also, add a small $\varepsilon$ (e.g. $10^{-8}$) to the denominator to make sure your algorithm doesn't divide by 0.
RMSprop, and similar to momentum, has the effects of damping out the oscillations in gradient descent, in mini-batch gradient descent. And allowing you to maybe use a larger learning rate alpha. And certainly speeding up the learning speed of your algorithm.
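A hedged sketch of the RMSprop step (hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_update(W, dW, s_dW, alpha=0.001, beta2=0.999, eps=1e-8):
    """RMSprop: keep an exponentially weighted average of the SQUARED
    gradients, then divide the update by its square root. Directions
    with large oscillating gradients get damped; flat directions don't."""
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2     # element-wise square
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)      # eps avoids divide-by-0
    return W, s_dW

W, s = rmsprop_update(np.array([1.0]), np.array([2.0]), np.array([0.0]))
```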
Adam Optimization Algorithm
Adam stands for Adaptive Moment Estimation.
Compute dW, db on current mini-batch
Then compute V with momentum: $v_{dW} = \beta_1 v_{dW} + (1 - \beta_1) dW$
Then compute S with RMSprop: $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) dW^2$
Then do bias correction: $v^{corrected}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}$, $s^{corrected}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$
Then update parameters: $W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}$
So this algorithm combines the effect of gradient descent with momentum together with RMSprop.
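A hedged sketch of one Adam step combining the pieces above (defaults follow the common choices $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment v) + RMSprop (second moment s),
    each with bias correction, combined in one update. t is the
    1-indexed step count."""
    v = beta1 * v + (1 - beta1) * dW            # momentum-style average
    s = beta2 * s + (1 - beta2) * dW ** 2       # RMSprop-style average
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.array([1.0])
v, s = np.zeros_like(W), np.zeros_like(W)
W, v, s = adam_update(W, np.array([2.0]), v, s, t=1)
# After bias correction, the first step has magnitude roughly alpha.
```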
Hyperparameters Choice
- $\alpha$: needs to be tuned
- $\beta_1$: 0.9
- $\beta_2$: 0.999
- $\varepsilon$: $10^{-8}$
Learning Rate Decay
One of the things that might help speed up your learning algorithm, is to slowly reduce your learning rate over time. We call this learning rate decay.
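The standard decay formula, $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \cdot \text{epoch\_num}}$, can be sketched as:

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """Learning rate decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# With alpha0 = 0.2 and decay_rate = 1, the rate shrinks each epoch:
rates = [decayed_lr(0.2, 1.0, e) for e in range(4)]   # 0.2, 0.1, ...
```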
Other Learning Rate Decay
- exponential decay: $\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$
- $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \cdot \alpha_0$
- discrete staircase
- manual decay
Week3
Tuning Process
Hyperparameters
Try random values and do not use a grid
Coarse to fine search
Using an Appropriate Scale to Pick Hyperparameters
Sample the learning rate $\alpha$ uniformly on a log scale rather than a linear scale. For exponentially weighted averages, sample $1 - \beta$ on a log scale, because the averaging window $\frac{1}{1-\beta}$ is very sensitive to changes in $\beta$ near 1.
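A hedged sketch of log-scale sampling (the ranges are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample alpha uniformly on a log scale between 1e-4 and 1:
r = rng.uniform(-4, 0, size=1000)
alphas = 10.0 ** r

# For beta (exponentially weighted averages), sample 1-beta on a log
# scale instead, since the window 1/(1-beta) is very sensitive near 1:
r = rng.uniform(-3, -1, size=1000)       # 1-beta in [0.001, 0.1)
betas = 1 - 10.0 ** r                    # beta in (0.9, 0.999]
```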
Hyperparameters Tuning in Practice: Pandas vs. Caviar
Intuitions do get stale. Re-evaluate occasionally.
Babysitting One Model
If you have maybe a huge data set but not a lot of computational resources, not a lot of CPUs and GPUs, you can basically afford to train only one model or a very small number of models at a time. In that case you might gradually babysit that model even as it's training: watch its performance and patiently nudge the learning rate up or down. That's usually what happens if you don't have enough computational capacity to train a lot of models at the same time.
Training Many Models in Parallel
You might train many different models in parallel, where these orange lines are different models, and this way you can try a lot of different hyperparameter settings and then quickly, at the end, pick the one that works best. In this example, maybe this curve looks best.
Normalizing Activations in a Network
Normalizing Inputs to Speed Up Learning
Implementing Batch Norm
Given some intermediate values $z^{(1)}, \ldots, z^{(m)}$ in the NN (for a mini-batch of size m):
$\mu = \frac{1}{m} \sum_i z^{(i)}$, $\sigma^2 = \frac{1}{m} \sum_i (z^{(i)} - \mu)^2$
$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$, where $\gamma$ and $\beta$ are learnable parameters
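A minimal NumPy sketch of the Batch Norm forward computation for one layer (shapes are illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z across the mini-batch, then rescale with learnable
    gamma (scale) and beta (shift), so hidden units are not forced to
    stay at mean 0 / variance 1."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

np.random.seed(3)
Z = np.random.randn(4, 32) * 3 + 5           # 4 units, mini-batch of 32
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde = batch_norm_forward(Z, gamma, beta)
# With gamma = 1, beta = 0, each unit now has mean 0 and variance 1.
```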
Fitting Batch Norm Into a Neural Network
Working With Mini-Batches
Because Batch Norm zeroes out the mean of these Z values in the layer, there’s no point having this parameter b.
Implementing Gradient Descent
Why Batch Norm Works
One intuition behind why batch norm works: just as normalizing the input features to take on a similar range of values speeds up learning, batch norm does the same thing, not just for the input layer but for the values in the hidden units as well.
A second reason why batch norm works is that it makes the weights in later or deeper layers of your network, say the weights in layer 10, more robust to changes in the weights in earlier layers, because batch norm reduces the covariate shift in the values those later layers see.
Batch Norm as Regularization
- Each mini-batch is scaled by the mean / variance computed on just that mini-batch
- This adds some noise to the values z within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations
- This has a slight regularization effect, because by adding noise to the hidden units, it's forcing the downstream hidden units not to rely too much on any one hidden unit.
Softmax Regression
If we have multiple possible classes, there’s a generalization of logistic regression called Softmax regression.
The number of units in the output layer, layer L, is going to equal C, the number of possible classes.
And the output $\hat{y}$ is going to be a C by 1 dimensional vector, because it now has to output C numbers giving you the C probabilities.
And the upper layer’s activation function is
Understanding Softmax
Softmax regression generalizes logistic regression to C classes
Loss Function
$\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, and the cost $J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
Backward Prop
$dz^{[L]} = \hat{y} - y$
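A hedged NumPy sketch tying together the softmax activation, the cross-entropy loss, and the backward-prop shortcut $dz^{[L]} = \hat{y} - y$ (the toy logits and labels are illustrative):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax: exponentiate, then normalize so each column
    sums to 1 (subtracting the column max first for numerical stability)."""
    t = np.exp(Z - Z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)

def cross_entropy_cost(A, Y):
    """J = -(1/m) * sum over examples of sum_j y_j * log(y_hat_j)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A + 1e-12)) / m

Z = np.array([[5.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [3.0, 1.0]])  # C=4, m=2
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])   # one-hot labels
A = softmax(Z)
cost = cross_entropy_cost(A, Y)
dZ = A - Y            # the backward-prop shortcut: dz[L] = y_hat - y
```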