[Chinese-English] [Andrew Ng Course Quiz] Course 2 - Improving Deep Neural Networks - Week 2 Quiz
Week 2 Quiz - Optimization algorithms
-
Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
- a^[3]{8}(7)
Note: [i]{j}(k) superscript means i-th layer, j-th minibatch, k-th example
-
Which of these statements about mini-batch gradient descent do you agree with?
- [ ] You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
- [ ] Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- [x] One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
Note: Vectorization does not let you compute several mini-batches at the same time.
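The point above can be made concrete in code: you still need an explicit for-loop over mini-batches, and vectorization applies only across the examples *inside* each mini-batch. A minimal sketch (the helper name `random_mini_batches` and the toy shapes are assumptions for illustration, not part of the quiz):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the (n_x, m) data and split it into mini-batches along axis 1."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Toy data: 5 features, 130 examples -> 3 mini-batches of sizes 64, 64, 2.
X = np.random.randn(5, 130)
Y = np.random.randn(1, 130)
batches = random_mini_batches(X, Y, batch_size=64)

# The explicit for-loop the quiz refers to: one iteration of mini-batch
# gradient descent touches only one mini-batch, which is why it is faster
# than one iteration of batch gradient descent.
for X_batch, Y_batch in batches:
    pass  # forward prop, cost, backprop, parameter update on this batch only
```

Note that the last mini-batch is smaller when `batch_size` does not divide `m`, which is the usual convention.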
-
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
- If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
- If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
-
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
- If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
Note: There will be some oscillations when you use mini-batch gradient descent, since a mini-batch may contain noisy examples. Batch gradient descent, however (with a suitable learning rate), decreases J on every iteration until it reaches the optimum.
-
Suppose the temperature in Casablanca over the first three days of January are the same:
Jan 1st: θ_1 = 10
Jan 2nd: θ_2 = 10
Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = βv_{t−1} + (1 − β)θ_t. If v_2 is the value computed after day 2 without bias correction, and v^corrected_2 is the value you compute with bias correction, what are these values?
- v_2 = 7.5, v^corrected_2 = 10
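The arithmetic behind this answer can be checked directly. A short sketch of the recurrence and the bias-correction term (variable names are my own):

```python
beta = 0.5
thetas = [10, 10]          # same temperature on day 1 and day 2

v = 0.0
vs, vs_corrected = [], []
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta          # v_t = beta*v_{t-1} + (1-beta)*theta_t
    vs.append(v)
    vs_corrected.append(v / (1 - beta ** t))   # bias correction: v_t / (1 - beta^t)

# Day 1: v_1 = 0.5*0 + 0.5*10 = 5,   corrected: 5 / (1 - 0.5)   = 10
# Day 2: v_2 = 0.5*5 + 0.5*10 = 7.5, corrected: 7.5 / (1 - 0.25) = 10
```

Bias correction exactly undoes the cold start from v_0 = 0 here, recovering the true constant temperature of 10.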
-
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
- α = e^t * α_0
Note: This will explode the learning rate rather than decay it.
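For contrast, a sketch of decay schedules from the course alongside the exploding option (α_0 = 0.2 and the decay constants are assumed values for illustration):

```python
import math

alpha0 = 0.2  # assumed initial learning rate

def inverse_decay(t, decay_rate=1.0):
    # alpha = alpha0 / (1 + decay_rate * t)
    return alpha0 / (1 + decay_rate * t)

def exponential_decay(t, k=0.95):
    # alpha = k^t * alpha0, with 0 < k < 1
    return (k ** t) * alpha0

def sqrt_decay(t):
    # alpha = alpha0 / sqrt(t)
    return alpha0 / math.sqrt(t) if t >= 1 else alpha0

def exploding(t):
    # The "NOT a good scheme" option: alpha = e^t * alpha0 grows without bound.
    return math.exp(t) * alpha0
```

The first three all shrink α as the epoch number t grows; only e^t · α_0 blows up.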
-
You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: v_t = βv_{t−1} + (1 − β)θ_t. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)
- Increasing β will shift the red line slightly to the right.
- Decreasing β will create more oscillation within the red line.
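Both effects can be checked numerically: an average with parameter β spans roughly the last 1/(1−β) days, so a larger β smooths more (and lags, shifting the curve right) while a smaller β tracks the noise. A small sketch on synthetic noisy temperatures (the `ewa` helper and the data are my own, not from the dataset):

```python
import numpy as np

def ewa(theta, beta):
    """Bias-corrected exponentially weighted average of a sequence."""
    v, out = 0.0, []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))
    return np.array(out)

rng = np.random.default_rng(0)
theta = 10 + rng.normal(0, 2, size=365)   # a year of noisy daily temperatures

smooth_05 = ewa(theta, beta=0.5)   # averages roughly the last 2 days
smooth_09 = ewa(theta, beta=0.9)   # averages roughly the last 10 days

# Day-to-day movement of each curve: the smaller-beta curve oscillates
# more, the larger-beta curve is smoother.
wiggle_05 = np.std(np.diff(smooth_05))
wiggle_09 = np.std(np.diff(smooth_09))
```

Plotting both curves against `theta` would show the β = 0.9 line trailing slightly behind sharp temperature changes, which is the rightward shift the answer describes.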
-
Consider this figure:
These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)
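The momentum update behind this question is small enough to sketch in full (the helper name `momentum_step` and the toy values are assumptions for illustration):

```python
import numpy as np

def momentum_step(params, grads, velocities, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum update over a dict of parameters."""
    for key in params:
        # Exponentially weighted average of past gradients:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        # Step in the direction of the averaged gradient:
        params[key] = params[key] - lr * velocities[key]
    return params, velocities

# beta = 0 recovers plain gradient descent; beta = 0.9 averages roughly the
# last 10 gradients, damping oscillations (curve (3) vs (1) in the figure).
W = {"W1": np.ones((2, 2))}
g = {"W1": np.full((2, 2), 0.5)}
v = {"W1": np.zeros((2, 2))}
W, v = momentum_step(W, g, v, beta=0.9, lr=0.1)
```

With a larger β the velocity changes more slowly, so successive steps point in more consistent directions, which is why curve (3) oscillates least.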
-
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1],b[1],…,W[L],b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
- [x] Try using Adam
- [x] Try better random initialization for the weights
- [x] Try tuning the learning rate α
- [x] Try mini-batch gradient descent
- [ ] Try initializing all the weights to zero
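The first suggestion, Adam, combines two of the ideas from this quiz: a momentum-style average of the gradients and an RMSprop-style average of their squares, both bias-corrected. A minimal single-parameter sketch (the helper name `adam_step` and the toy values are my own):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; works equally with mini-batch or full-batch gradients."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSprop term)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
# After bias correction, the very first step has size close to lr,
# regardless of the gradient's scale.
```

Nothing in the update depends on how the gradient was computed, which is why the "batch only" claim in the next question is false.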
-
Which of the following statements about Adam is False?
- Adam should be used with batch gradient computations, not with mini-batches.
Note: Adam could be used with both.