[coursera/SequenceModels/week1] Recurrent Neural Networks (summary & questions)
1.1 sequence models
1.2 notation
one-hot
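As a quick illustration (not from the course materials), here is a minimal NumPy sketch of turning a word into a one-hot column vector over a toy vocabulary; the vocabulary and the word_to_index mapping below are made up for illustration, whereas the course uses a 10,000-word dictionary.

```python
import numpy as np

def one_hot(word, word_to_index, vocab_size):
    """Return a (vocab_size, 1) column vector with a 1 at the word's index."""
    vec = np.zeros((vocab_size, 1))
    vec[word_to_index[word]] = 1.0
    return vec

# Toy vocabulary for illustration only.
word_to_index = {"a": 0, "and": 1, "harry": 2, "potter": 3, "<UNK>": 4}
x1 = one_hot("harry", word_to_index, vocab_size=len(word_to_index))
```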
1.3 Recurrent Neural Network Model
forward propagation
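A minimal NumPy sketch of forward propagation, assuming the lecture's equations a&lt;t&gt; = tanh(Waa a&lt;t-1&gt; + Wax x&lt;t&gt; + ba) and ŷ&lt;t&gt; = softmax(Wya a&lt;t&gt; + by); the weight names follow that notation, but the params tuple and shapes are assumptions, not the assignment code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # yhat<t> = softmax(Wya a<t> + by)
    y_t = softmax(Wya @ a_t + by)
    return a_t, y_t

def rnn_forward(xs, a0, params):
    """Run the cell over a list of one-hot inputs xs, reusing the same weights at every step."""
    a, outputs = a0, []
    for x_t in xs:
        a, y_t = rnn_cell_forward(x_t, a, *params)  # params = (Waa, Wax, Wya, ba, by)
        outputs.append(y_t)
    return outputs
```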
1.4 backpropagation through time
1.5 different types of RNNs
1.6 Language model and sequence generation
RNN architecture for a language model
1.7 Vanishing gradients with RNNs
1.8 Gated Recurrent Unit (GRU)
The GRU helps prevent the vanishing gradient problem: the update gate Γu can be very small (e.g. 0.000001), which leads to c&lt;t&gt; ≈ c&lt;t−1&gt;, so the memory cell can carry information across many time steps.
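A rough sketch of one GRU step in the lecture's notation (with a&lt;t&gt; = c&lt;t&gt;); the weight names and shapes here are assumptions for illustration, not the official assignment code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wr, Wu, Wc, br, bu, bc):
    """One GRU step; gates act on the stacked vector [c<t-1>; x<t>]."""
    concat = np.vstack([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)   # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)
    # If gamma_u is ~0 (e.g. 1e-6), c<t> stays ~equal to c<t-1>, carrying memory far back.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t
```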
1.9 Long Short Term Memory (LSTM)
LSTM in pictures
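A rough sketch of one LSTM step with the three gates from the lecture (update, forget, output); again, the parameter names and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    """One LSTM step; all gates act on the stacked vector [a<t-1>; x<t>]."""
    concat = np.vstack([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate memory
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # new memory cell
    a_t = gamma_o * np.tanh(c_t)                 # new hidden state
    return a_t, c_t
```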
1.10 Bidirectional RNN
1.11 Deep RNNs
Week 1 of the course only gives a brief introduction to what NLP is.
If you want to learn more, reading some of the original papers is a better way to go deeper.
Q&A:
1. Question 1
Suppose your training examples are sentences (sequences of words). Which of the following refers to the jth word in the ith training example?
x(i)<j>
We index into the ith row first to get the ith training example (represented by parentheses), then the jth column to get the jth word (represented by the brackets).
x<i>(j)
x(j)<i>
x<j>(i)
2. Question 2
Consider this RNN:
This specific type of architecture is appropriate when:
Tx=Ty
It is appropriate when every input should be matched to an output.
Tx<Ty
Tx>Ty
Tx=1
3. Question 3
To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply).
Speech recognition (input an audio clip and output a transcript)
Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
Correct!
Image classification (input an image and output a label)
This is an example of one-to-one architecture.
Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
Correct!
4. Question 4
You are training this RNN language model.
At the tth time step, what is the RNN doing? Choose the best answer.
Estimating P(y<1>,y<2>,…,y<t−1>)
Estimating P(y<t>)
Estimating P(y<t>∣y<1>,y<2>,…,y<t−1>)
Yes, in a language model we try to predict the next step based on the knowledge of all prior steps.
Estimating P(y<t>∣y<1>,y<2>,…,y<t>)
5. Question 5
You have finished training a language model RNN and are using it to sample random sentences, as follows:
What are you doing at each time step t?
(i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as y^<t>. (ii) Then pass the ground-truth word from the training set to the next time-step.
The probabilities output by the RNN are not used to pick the highest probability word and the ground-truth word from the training set is not the input to the next time-step.
(i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as y^<t>. (ii) Then pass the ground-truth word from the training set to the next time-step.
(i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as y^<t>. (ii) Then pass this selected word to the next time-step.
(i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as y^<t>. (ii) Then pass this selected word to the next time-step.
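A sketch of what that sampling loop could look like; step_fn is a hypothetical wrapper around the trained RNN cell that returns the next hidden state and the softmax distribution over the vocabulary, and eos_index marks an assumed end-of-sentence token.

```python
import numpy as np

def sample_sentence(step_fn, a0, x0, vocab_size, eos_index, max_len=50):
    """At each step, draw the next word from the softmax distribution,
    then feed that sampled word back in as the next input."""
    a, x, sampled = a0, x0, []
    for _ in range(max_len):
        a, y_probs = step_fn(x, a)                 # y_probs: (vocab_size, 1) softmax output
        idx = np.random.choice(vocab_size, p=y_probs.ravel())
        sampled.append(idx)
        if idx == eos_index:
            break
        x = np.zeros((vocab_size, 1))              # one-hot of the word we just sampled
        x[idx] = 1.0
    return sampled
```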
6. Question 6
You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
Vanishing gradient problem.
Vanishing and exploding gradients are both common problems when training RNNs, but weights and activations turning into NaN indicates numerical overflow, which points to exploding rather than vanishing gradients.
Exploding gradient problem.
ReLU activation function g(.) used to compute g(z), where z is too large.
Sigmoid activation function g(.) used to compute g(z), where z is too large.
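Exploding gradients are usually addressed with gradient clipping. A minimal sketch of element-wise clipping, assuming the gradients are kept in a dict of NumPy arrays (the threshold of 5.0 is an arbitrary illustrative value):

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Clip every gradient array element-wise into [-max_value, max_value];
    a common remedy when exploding gradients produce NaNs."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}
```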
7. Question 7
Suppose you are training an LSTM. You have a 10000 word vocabulary, and are using an LSTM with 100-dimensional activations a&lt;t&gt;. What is the dimension of Γu at each time step?
1
100
Correct, Γu is a vector of dimension equal to the number of hidden units in the LSTM.
300
10000
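A small shape check (illustrative values only) showing why the gate has one entry per hidden unit: the gate is computed from the stacked vector [a&lt;t−1&gt;; x&lt;t&gt;], but its output dimension matches the number of hidden units.

```python
import numpy as np

n_a, n_x = 100, 10000                    # 100 hidden units, 10,000-word vocabulary
Wu = np.random.randn(n_a, n_a + n_x)     # gate weights act on [a<t-1>; x<t>]
bu = np.zeros((n_a, 1))
a_prev = np.zeros((n_a, 1))
x_t = np.zeros((n_x, 1))
gamma_u = 1 / (1 + np.exp(-(Wu @ np.vstack([a_prev, x_t]) + bu)))
print(gamma_u.shape)                     # (100, 1): one gate value per hidden unit
```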
8. Question 8
Here are the update equations for the GRU.
Alice proposes to simplify the GRU by always removing Γu, i.e., setting Γu = 1. Betty proposes to simplify the GRU by always removing Γr, i.e., setting Γr = 1. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
Alice’s model (removing Γu), because if Γr≈0 for a timestep, the gradient can propagate back through that timestep without much decay.
Alice’s model (removing Γu), because if Γr≈1 for a timestep, the gradient can propagate back through that timestep without much decay.
Betty’s model (removing Γr), because if Γu≈0 for a timestep, the gradient can propagate back through that timestep without much decay.
Yes. For the signal to backpropagate without vanishing, we need c&lt;t&gt; to be highly dependent on c&lt;t−1&gt;.
Betty’s model (removing Γr), because if Γu≈1 for a timestep, the gradient can propagate back through that timestep without much decay.
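A tiny numeric demonstration of that point (all numbers made up): with Γu near zero, the memory cell barely changes across many steps, so the stored value, and hence the gradient path through it, survives long sequences.

```python
import numpy as np

# With Betty's simplification (Gamma_r = 1), the memory update is
#   c<t> = Gamma_u * c_tilde<t> + (1 - Gamma_u) * c<t-1>.
gamma_u = 1e-6
c = np.array([0.7])                 # some stored value at t = 0
for t in range(1000):               # 1,000 time steps with arbitrary candidates
    c_tilde = np.random.randn(1)
    c = gamma_u * c_tilde + (1 - gamma_u) * c
print(c)                            # still ~0.7: the signal is preserved
```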
9. Question 9
Here are the equations for the GRU and the LSTM:
From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to ______ and ______ in the GRU. What should go in the blanks?
Γu and 1−Γu
Yes, correct!
Γu and Γr
1−Γu and Γu
Γr and Γu
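In the lecture's notation, the memory-cell updates make the correspondence explicit:
GRU: c&lt;t&gt; = Γu * c̃&lt;t&gt; + (1 − Γu) * c&lt;t−1&gt;
LSTM: c&lt;t&gt; = Γu * c̃&lt;t&gt; + Γf * c&lt;t−1&gt;
so the LSTM's update gate plays the role of the GRU's Γu, and its forget gate plays the role of (1 − Γu).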
10. Question 10
You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence as x<1>,…,x<365>. You’ve also collected data on your dog’s mood, which you represent as y<1>,…,y<365>. You’d like to build a model to map from x→y. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
Your dog's mood is contingent on the current and past few days' weather, not on the current, past, AND future days' weather.
Unidirectional RNN, because the value of y<t> depends only on x<1>,…,x<t>, but not on x<t+1>,…,x<365>
Unidirectional RNN, because the value of y<t> depends only on x<t>, and not other days’ weather.