[Chinese/English] [Andrew Ng Course Quiz] Course 5 - Sequence Models - Week 1 Quiz - Recurrent Neural Networks


Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j$-th word in the $i$-th training example?
- 【★】 $x^{(i) < j >}$
- 【】 $x^{< i > (j)}$
- 【】 $x^{(j) < i >}$
- 【】 $x^{< j > (i)}$
We index into the $i^{th}$ row first to get the $i^{th}$ training example (indicated by parentheses), then the $j^{th}$ column to get the $j^{th}$ word (indicated by angle brackets).
Consider this RNN:

This specific type of architecture is appropriate when:
- 【★】 $T_{x} = T_{y}$
- 【】 $T_{x} < T_{y}$
- 【】 $T_{x} > T_{y}$
- 【】 $T_{x} = 1$
This architecture is appropriate when every input should be matched to an output.
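To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply.)
- 【】 Speech recognition (input an audio clip and output a transcript).
- 【★】 Sentiment classification (input a piece of text and output 0 or 1 to denote positive or negative sentiment).
- 【】 Image classification (input an image and output a label).
- 【★】 Gender recognition from speech (input an audio clip and output the speaker's gender).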
Suppose you are training this RNN language model:

At time step $t$, what is the RNN doing?
- 【】 Estimating $P (y^{< 1 >}, y^{< 2 >}, \dots, y^{< t - 1 >})$
- 【】 Estimating $P (y^{< t >})$
- 【★】 Estimating $P (y^{< t >} ∣ y^{< 1 >}, y^{< 2 >}, \dots, y^{< t - 1 >})$
- 【】 Estimating $P (y^{< t >} ∣ y^{< 1 >}, y^{< 2 >}, \dots, y^{< t >})$

Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
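You have finished training a language model RNN and are using it to sample random sentences, as shown in the figure:

What are you doing at each time step $t$?
- 【】 (i) Use the probabilities output by the RNN to pick the highest-probability word for that time step as ${\hat{y}}^{< t >}$. (ii) Then pass the ground-truth word from the training set to the next time step.
- 【】 (i) Use the probabilities output by the RNN to randomly sample a word for that time step as ${\hat{y}}^{< t >}$. (ii) Then pass the ground-truth word from the training set to the next time step.
- 【】 (i) Use the probabilities output by the RNN to pick the highest-probability word for that time step as ${\hat{y}}^{< t >}$. (ii) Then pass this selected word to the next time step.
- 【★】 (i) Use the probabilities output by the RNN to randomly sample a word for that time step as ${\hat{y}}^{< t >}$. (ii) Then pass this selected word to the next time step.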
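You are training an RNN and find that your weights and activations are all taking on the value NaN ("Not a Number"). Which of these is the most likely cause of this problem?
- 【】 Vanishing gradient problem.
- 【★】 Exploding gradient problem.
- 【】 The ReLU activation function g(.) is used to compute g(z), where z is too large.
- 【】 The Sigmoid activation function g(.) is used to compute g(z), where z is too large.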
Suppose you are training an LSTM. You have a 10,000-word vocabulary and are using an LSTM with 100-dimensional activations. What is the dimension of $Γ_{u}$ at each time step?
- 【】 1
- 【★】 100
- 【】300
- 【】 10000
Correct, $Γ_{u}$ is a vector whose dimension equals the number of hidden units in the LSTM.
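Here are the update equations for the GRU:

Alice proposes to simplify the GRU by removing $Γ_{u}$, i.e., always setting $Γ_{u} = 1$. Betty proposes to simplify the GRU by removing $Γ_{r}$, i.e., always setting $Γ_{r} = 1$. Which of these models is more likely to train without vanishing gradient problems, even on very long input sequences?
- 【】 Alice's model (removing $Γ_{u}$), because if $Γ_{r} \approx 0$ for a time step, the gradient can propagate back through that time step without much decay.
- 【】 Alice's model (removing $Γ_{u}$), because if $Γ_{r} \approx 1$ for a time step, the gradient can propagate back through that time step without much decay.
- 【★】 Betty's model (removing $Γ_{r}$), because if $Γ_{u} \approx 0$ for a time step, the gradient can propagate back through that time step without much decay.
- 【】 Betty's model (removing $Γ_{r}$), because if $Γ_{u} \approx 1$ for a time step, the gradient can propagate back through that time step without much decay.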
For the signal to backpropagate without vanishing, we need $c^{< t >}$ to be highly dependent on $c^{< t - 1 >}$.
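Here are the equations for the GRU and the LSTM:

From these, we can see that the update gate and forget gate in the LSTM play a role similar to _____ and _____ in the GRU. What should go in the blanks?
- 【★】 $Γ_{u}$ and $1 − Γ_{u}$
- 【】 $Γ_{u}$ and $Γ_{r}$
- 【】 $1 − Γ_{u}$ and $Γ_{u}$
- 【】 $Γ_{r}$ and $Γ_{u}$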
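You have a pet dog whose mood depends heavily on the weather of the current and past few days. You have collected weather data for the past 365 days, represented as a sequence $x^{< 1 >}, \dots, x^{< 365 >}$, and data on your dog's mood, represented as $y^{< 1 >}, \dots, y^{< 365 >}$. You would like to build a model that maps from x to y. Should you use a unidirectional RNN or a bidirectional RNN for this problem?
- 【】 Bidirectional RNN, because this allows the prediction of mood on day $t$ to take more information into account.
- 【】 Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- 【★】 Unidirectional RNN, because the value of $y^{< t >}$ depends only on $x^{< 1 >}, \dots, x^{< t >}$, and not on $x^{< t + 1 >}, \dots, x^{< 365 >}$.
- 【】 Unidirectional RNN, because the value of $y^{< t >}$ depends only on $x^{< t >}$, and not on other days' weather.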

Recurrent Neural Networks

Suppose your training examples are sentences (sequences of words). Which of the following refers to the jth word in the ith training example?
- [x] $x^{(i) < j >}$
- [ ] $x^{< i > (j)}$
- [ ] $x^{(j) < i >}$
- [ ] $x^{< j > (i)}$
We index into the $i^{th}$ row first to get the $i^{th}$ training example (represented by parentheses), then the $j^{th}$ column to get the $j^{th}$ word (represented by the angle brackets).
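
As a small illustration of the indexing order, here is a sketch using nested Python lists. The toy dataset `X` and the helper `word` are hypothetical names introduced only for this example; they are not part of the course material.

```python
# Hypothetical toy dataset: each training example is a list of words.
X = [
    ["the", "cat", "sat"],        # training example i = 1
    ["dogs", "bark", "loudly"],   # training example i = 2
]


def word(X, i, j):
    """Return x^{(i)<j>}: the j-th word of the i-th example (1-based, as in the notation)."""
    example = X[i - 1]     # parentheses index: pick the training example first
    return example[j - 1]  # angle-bracket index: then pick the position in the sequence


print(word(X, 2, 3))  # loudly
```
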
Consider this RNN:

This specific type of architecture is appropriate when:
- [x] $T_{x} = T_{y}$
- [ ] $T_{x} < T_{y}$
- [ ] $T_{x} > T_{y}$
- [ ] $T_{x} = 1$
It is appropriate when every input should be matched to an output.
To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply).
- [ ] Speech recognition (input an audio clip and output a transcript)
- [x] Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
- [ ] Image classification (input an image and output a label)
- [x] Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
You are training this RNN language model.

At the $t^{th}$ time step, what is the RNN doing? Choose the best answer.
- [ ] Estimating $P (y^{< 1 >}, y^{< 2 >}, \dots, y^{< t - 1 >})$
- [ ] Estimating $P (y^{< t >})$
- [x] Estimating $P (y^{< t >} ∣ y^{< 1 >}, y^{< 2 >}, \dots, y^{< t - 1 >})$
- [ ] Estimating $P (y^{< t >} ∣ y^{< 1 >}, y^{< 2 >}, \dots, y^{< t >})$
Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
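
To see why the conditional form is the right one, recall the chain rule of probability: multiplying the per-step conditionals gives the probability of the whole sentence. Here $T_{y}$ denotes the sentence length, following the notation used elsewhere in this quiz.

$$
P (y^{< 1 >}, y^{< 2 >}, \dots, y^{< T_{y} >}) = \prod_{t = 1}^{T_{y}} P (y^{< t >} ∣ y^{< 1 >}, \dots, y^{< t - 1 >})
$$
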
You have finished training a language model RNN and are using it to sample random sentences, as follows:

What are you doing at each time step t?
- [ ] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as ${\hat{y}}^{< t >}$ . (ii) Then pass the ground-truth word from the training set to the next time-step.
- [ ] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as ${\hat{y}}^{< t >}$ . (ii) Then pass the ground-truth word from the training set to the next time-step.
- [ ] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as ${\hat{y}}^{< t >}$ . (ii) Then pass this selected word to the next time-step.
- [x] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as ${\hat{y}}^{< t >}$ . (ii) Then pass this selected word to the next time-step.
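
The sampling procedure in the correct answer can be sketched as follows. This is a minimal illustration, not the course's implementation: `VOCAB` and `next_word_probs` are hypothetical stand-ins for a vocabulary and for one step of a trained RNN, and a real model would also carry a hidden state between steps.

```python
import random

VOCAB = ["<EOS>", "the", "cat", "sat"]  # hypothetical toy vocabulary


def next_word_probs(prev_word):
    """Hypothetical stand-in for one RNN step: a distribution over VOCAB."""
    if prev_word == "sat":
        return [0.7, 0.1, 0.1, 0.1]  # more likely to end the sentence after "sat"
    return [0.1, 0.3, 0.3, 0.3]


def sample_sentence(rng, max_len=20):
    sentence = []
    prev = None  # plays the role of the all-zero first input
    for _ in range(max_len):
        probs = next_word_probs(prev)
        # (i) randomly sample a word from the output distribution (not argmax)
        y_hat = rng.choices(VOCAB, weights=probs, k=1)[0]
        if y_hat == "<EOS>":
            break
        sentence.append(y_hat)
        # (ii) feed the sampled word, not a ground-truth word, to the next step
        prev = y_hat
    return sentence


print(sample_sentence(random.Random(0)))
```

Replacing `rng.choices` with an argmax would make the output deterministic, which corresponds to the third option, not to sampling.
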
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
- [ ] Vanishing gradient problem.
- [x] Exploding gradient problem.
- [ ] ReLU activation function g(.) used to compute g(z), where z is too large.
- [ ] Sigmoid activation function g(.) used to compute g(z), where z is too large.
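
The toy loop below shows how repeated multiplication overflows to `inf` and then produces `nan`, which is how exploding gradients typically show up. Gradient clipping is a common remedy; the helper `clip_by_norm` is a hypothetical name for a minimal pure-Python sketch of norm-based clipping, not a library function.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Toy illustration: multiplying by a factor > 1 at every time step overflows.
g = 1.0
for _ in range(2000):
    g *= 3.0
print(g)      # inf
print(g - g)  # nan, since inf - inf is undefined

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # same direction as [3, 4], but with norm 1 (up to rounding)
```
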
Suppose you are training an LSTM. You have a 10,000-word vocabulary, and are using an LSTM with 100-dimensional activations $a$. What is the dimension of $Γ_{u}$ at each time step?
- [ ] 1
- [x] 100
- [ ] 300
- [ ] 10000
Correct, $Γ_{u}$ is a vector of dimension equal to the number of hidden units in the LSTM.
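
A shape check makes this concrete. The sketch below assumes the input $x^{< t >}$ is a 10,000-dimensional one-hot vector and that the gate is computed as a sigmoid of a linear function of $[a^{< t - 1 >}, x^{< t >}]$; the parameter names `W_u` and `b_u` are illustrative, and the weights are zero because only the shapes matter here.

```python
import math

n_a, n_x = 100, 10000  # hidden units; assumed one-hot input size (vocabulary size)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Zero-initialised parameters: W_u has shape (n_a, n_a + n_x), b_u has shape (n_a,).
W_u = [[0.0] * (n_a + n_x) for _ in range(n_a)]
b_u = [0.0] * n_a

a_prev = [0.0] * n_a
x_t = [0.0] * n_x
x_t[42] = 1.0  # one-hot vector for an arbitrary word index

concat = a_prev + x_t  # [a<t-1>, x<t>], length n_a + n_x
gamma_u = [
    sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
    for row, b in zip(W_u, b_u)
]

print(len(gamma_u))  # 100
```

The vocabulary size only affects the number of columns of `W_u`; the gate itself has one entry per hidden unit.
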
Here are the update equations for the GRU.

Alice proposes to simplify the GRU by removing $Γ_{u}$, i.e., always setting $Γ_{u} = 1$. Betty proposes to simplify the GRU by removing $Γ_{r}$, i.e., always setting $Γ_{r} = 1$. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
- [ ] Alice’s model (removing $Γ_{u}$ ), because if $Γ_{r}$ ≈0 for a timestep, the gradient can propagate back through that timestep without much decay.
- [ ] Alice’s model (removing $Γ_{u}$ ), because if $Γ_{r}$ ≈1 for a timestep, the gradient can propagate back through that timestep without much decay.
- [x] Betty’s model (removing $Γ_{r}$ ), because if $Γ_{u}$ ≈0 for a timestep, the gradient can propagate back through that timestep without much decay.
- [ ] Betty’s model (removing $Γ_{r}$ ), because if $Γ_{u}$ ≈1 for a timestep, the gradient can propagate back through that timestep without much decay.
Yes. For the signal to backpropagate without vanishing, we need $c^{< t >}$ to be highly dependent on $c^{< t - 1 >}$.
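
A scalar sketch shows the effect numerically. It assumes the usual GRU memory update $c^{< t >} = Γ_{u} \tilde{c}^{< t >} + (1 - Γ_{u}) c^{< t - 1 >}$ and, as a simplification, treats the candidate $\tilde{c}^{< t >}$ as a constant, so the path through the candidate is ignored. `gru_step` is a hypothetical helper name.

```python
def gru_step(c_prev, c_tilde, gamma_u):
    """Memory-cell update of a GRU; with Γ_r = 1 only the computation of c_tilde changes."""
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev


gamma_u = 1e-6   # update gate close to 0
steps = 1000

c = 0.8
for _ in range(steps):
    c = gru_step(c, c_tilde=-0.5, gamma_u=gamma_u)
print(c)  # still close to the initial 0.8

# Along the direct path, d c<t> / d c<t-1> = 1 - gamma_u at every step,
# so the gradient across all steps is (1 - gamma_u) ** steps.
grad = (1.0 - gamma_u) ** steps
print(grad)  # close to 1, i.e. almost no decay
```

With $Γ_{u} = 1$ (Alice's model) the same update reduces to $c^{< t >} = \tilde{c}^{< t >}$, so this direct path from $c^{< t - 1 >}$ disappears.
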
Here are the equations for the GRU and the LSTM:

From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to _____ and _____ in the GRU. What should go in the blanks?
- [x] $Γ_{u}$ and 1− $Γ_{u}$
- [ ] $Γ_{u}$ and $Γ_{r}$
- [ ] 1− $Γ_{u}$ and $Γ_{u}$
- [ ] $Γ_{r}$ and $Γ_{u}$
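
The equation figure does not appear on this page. Assuming the standard forms of the memory-cell updates, with $\tilde{c}^{< t >}$ the candidate value and $Γ_{f}$ the LSTM forget gate, the comparison is:

$$
\text{GRU:} \quad c^{< t >} = Γ_{u} * \tilde{c}^{< t >} + (1 - Γ_{u}) * c^{< t - 1 >}
$$

$$
\text{LSTM:} \quad c^{< t >} = Γ_{u} * \tilde{c}^{< t >} + Γ_{f} * c^{< t - 1 >}
$$

Matching the two lines term by term, the LSTM's update gate corresponds to $Γ_{u}$ and its forget gate to $1 - Γ_{u}$.
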
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence $x^{< 1 >}, \dots, x^{< 365 >}$. You’ve also collected data on your dog’s mood, which you represent as $y^{< 1 >}, \dots, y^{< 365 >}$. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
- [ ] Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
- [ ] Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- [x] Unidirectional RNN, because the value of $y^{< t >}$ depends only on $x^{< 1 >}, \dots, x^{< t >}$, but not on $x^{< t + 1 >}, \dots, x^{< 365 >}$.
- [ ] Unidirectional RNN, because the value of $y^{< t >}$ depends only on $x^{< t >}$ , and not other days’ weather.
