Practical Aspects of Deep Learning: Quiz 1

1. If you have 10,000,000 examples, how would you split the train/dev/test set?

98% train . 1% dev . 1% test

60% train . 20% dev . 20% test

33% train . 33% dev . 33% test
Explanation: in the big-data era, the main purpose of the dev/test sets is to evaluate the model; when you have millions of examples, picking around 1,000 of them is usually enough to evaluate a single classifier. Typical splits (a split sketch follows after the list):

  • ~1,000,000 examples: 98% / 1% / 1%;
  • well over 1,000,000 examples: 99.5% / 0.25% / 0.25% (or 99.5% / 0.4% / 0.1%)
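A minimal sketch of such a split (the shuffling and the exact index arithmetic are illustrative assumptions, not part of the quiz), using the 98% / 1% / 1% ratio:

```python
import numpy as np

m = 10_000_000                        # total number of examples
indices = np.random.permutation(m)    # shuffle before splitting

n_train = int(0.98 * m)               # 9,800,000 training examples
n_dev = int(0.01 * m)                 # 100,000 dev examples

train_idx = indices[:n_train]
dev_idx = indices[n_train:n_train + n_dev]
test_idx = indices[n_train + n_dev:]  # the remaining ~100,000 test examples
```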

2. The dev and test set should:

Come from the same distribution

Come from different distributions

Be identical to each other (same (x,y) pairs)

Have the same number of examples
Explanation: the training, dev (cross-validation), and test sets should be identically distributed; in particular, the dev and test sets must come from the same distribution so that dev-set performance predicts test-set performance.

3. If your Neural Network model seems to have high variance, which of the following would be promising things to try?

Make the Neural Network deeper

Increase the number of units in each hidden layer

Add regularization

Get more training data

Get more test data
Explanation:
1. Remedies for high bias:
- enlarge the network, e.g. add more hidden layers or more units per layer;
- train for a longer time;
- search for a better-suited (bigger) network architecture;
2. Remedies for high variance:
- get more training data;
- use regularization;
- search for a better-suited network architecture;

4. You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

Increase the regularization parameter lambda

Decrease the regularization parameter lambda

Get more training data

Use a bigger neural network
Explanation: a training error of 0.5% together with a dev error of 7% indicates high variance (overfitting), so per question 3 the promising options are increasing the regularization parameter λ and getting more training data.

5. What is weight decay?

A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.

The process of gradually decreasing the learning rate during training.

Gradual corruption of the weights in the neural network if it is trained on noisy data.

A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
Explanation: weight decay is L2 regularization. Adding the (λ/m)·W term to dW makes every gradient-descent update multiply the weights by (1 − αλ/m) before taking the usual step, so the weights shrink ("decay") on each iteration.
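A minimal numpy sketch of this shrinking update (the layer shape, learning rate α, mini-batch size m, and λ are illustrative assumptions):

```python
import numpy as np

m = 64                       # mini-batch size (assumed)
alpha, lambd = 0.01, 0.7     # learning rate and regularization strength (assumed)

W = np.random.randn(5, 4) * 0.01
dW_data = np.random.randn(5, 4)      # gradient of the unregularized loss (placeholder)

dW = dW_data + (lambd / m) * W       # L2 term added to the gradient
W = W - alpha * dW                   # same as W = (1 - alpha*lambd/m) * W - alpha * dW_data
```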

6. What happens when you increase the regularization hyperparameter lambda?

Weights are pushed toward becoming smaller (closer to 0)

Weights are pushed toward becoming bigger (further from 0)

Doubling lambda should roughly result in doubling the weights

Gradient descent taking bigger steps with each iteration (proportional to lambda)
Explanation: increasing λ increases the penalty term (λ/2m)·Σ‖W‖², so minimizing the cost pushes the weights toward smaller values (closer to 0).

7. With the inverted dropout technique, at test time:

You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training

You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training
Explanation:
When debugging a network that uses dropout:
- turn dropout off first, i.e. set keep_prob = 1.0;
- run the code and check that the cost J(W,b) decreases monotonically;
- then turn dropout back on.
Likewise, at test time dropout is turned off (no units are randomly eliminated), and because inverted dropout already divides the activations by keep_prob during training, no 1/keep_prob factor is needed at test time (see the sketch below).
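A minimal numpy sketch of inverted dropout for one hidden layer (the ReLU activation and the keep_prob value are illustrative assumptions):

```python
import numpy as np

keep_prob = 0.8   # assumed probability of keeping a unit

def forward_hidden(A_prev, W, b, training=True):
    """One hidden layer with inverted dropout."""
    A = np.maximum(0, W @ A_prev + b)                 # ReLU activation
    if training:
        D = np.random.rand(*A.shape) < keep_prob      # dropout mask
        A = A * D / keep_prob                         # scale up so the expected activation is unchanged
    # at test time: no mask is applied and no 1/keep_prob factor is kept
    return A
```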

8. Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)

Increasing the regularization effect

Reducing the regularization effect

Causing the neural network to end up with a higher training set error

Causing the neural network to end up with a lower training set error
Explanation: increasing keep_prob means fewer units are dropped, which weakens the regularization effect (at keep_prob = 1 there is no dropout at all). With weaker regularization the network fits the training set more closely, i.e. it moves toward overfitting, so the training error tends to decrease.

9. Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)

Vanishing gradient

Data augmentation

Xavier initialization

Exploding gradient

Dropout

Gradient Checking

L2 regularization
Explanation: the remedies for high variance are (1) getting more data, (2) regularization, and (3) finding a better-suited network architecture.
Vanishing and exploding gradients are unrelated to this question.
Data augmentation, dropout, and L2 regularization (like early stopping) are all forms of regularization and help against overfitting; a data-augmentation sketch follows below.
Gradient checking only verifies that gradients are computed correctly; it has nothing to do with overfitting.
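A minimal numpy sketch of image data augmentation (the horizontal flip and padded random crop are assumed augmentations; the H×W×C image layout is illustrative):

```python
import numpy as np

def augment(image):
    """Return a randomly flipped and randomly cropped copy of an HxWxC image."""
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    h, w, c = image.shape
    padded = np.pad(image, ((4, 4), (4, 4), (0, 0)), mode="reflect")
    top, left = np.random.randint(0, 9, size=2)         # random offset into the 4-pixel padding
    return padded[top:top + h, left:left + w, :]         # crop back to the original size
```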

10. Why do we normalize the inputs x?

Normalization is another word for regularization–It helps to reduce variance

It makes the parameter initialization faster

It makes it easier to visualize the data

It makes the cost function faster to optimize
Explanation: with unnormalized inputs the cost surface is elongated, so with a small learning rate gradient descent may need many iterations to reach the global optimum; with normalized inputs the cost surface is much more symmetric, and gradient descent finds the optimum in relatively few iterations from any starting point. Normalization therefore makes the cost function faster to optimize (see the sketch below).
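A minimal numpy sketch of input normalization (the data shapes are illustrative placeholders; the key point is that the mean and variance computed on the training set are reused for dev/test data):

```python
import numpy as np

X_train = np.random.randn(1000, 20) * 5 + 3   # placeholder training data (m examples x n features)
X_test = np.random.randn(200, 20) * 5 + 3     # placeholder test data

mu = X_train.mean(axis=0)                     # per-feature mean from the training set
sigma = X_train.std(axis=0) + 1e-8            # per-feature std (epsilon avoids division by zero)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma           # apply the same mu and sigma to test data
```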

Reference: http://blog.csdn.net/koala_tree/article/details/78125697