CS224d-Lecture8
Language Model
probability of a sequence of words
- P(w1, w2, …, wT)
Useful for machine translation:
word ordering
- p(the cat is small) > p(small the is cat)
word choice
- p(walking home after school) > p(walking house after school)
Traditional Language Model
Conditional probability with window size n (Markov assumption: only the previous n-1 words matter)
n-gram estimates by counting:
- unigram: p(w1) = count(w1) / N
- bigram: p(w2 | w1) = count(w1, w2) / count(w1)
- trigram: p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
n-grams consume a lot of memory (a count must be stored for every observed n-gram)
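A minimal sketch of the count-based bigram estimate above, on a toy two-sentence corpus (no smoothing, names are my own):

```python
from collections import Counter

def bigram_prob(corpus):
    # corpus: list of tokenized sentences
    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    # p(w2 | w1) = count(w1, w2) / count(w1)
    def p(w2, w1):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p

corpus = [["the", "cat", "is", "small"], ["the", "dog", "is", "big"]]
p = bigram_prob(corpus)
# p("cat" | "the") = count(the, cat) / count(the) = 1/2
```

Every distinct bigram gets its own counter entry, which is why memory grows with the number of observed n-grams.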
RNN
- weights are tied (shared) across all time steps
- conditions on all previous words, not just a fixed window
- RAM requirement scales only with the number of words, not with the context length
Training RNNs is hard
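A minimal forward-pass sketch of those points: the same two weight matrices (hypothetical names W_hh, W_hx) are reused at every step, and the hidden state carries information from all previous inputs:

```python
import math

def rnn_forward(xs, W_hh, W_hx, h0):
    # Tied weights: W_hh and W_hx are the SAME matrices at every time step.
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    h = h0
    states = []
    for x in xs:
        # h_t = tanh(W_hh h_{t-1} + W_hx x_t): h_t depends on all earlier inputs via h_{t-1}
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_hx, x))]
        h = [math.tanh(z) for z in pre]
        states.append(h)
    return states

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # toy one-hot-ish inputs
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hx = [[1.0, 0.0], [0.0, 1.0]]
states = rnn_forward(xs, W_hh, W_hx, [0.0, 0.0])
```

Memory for parameters is fixed regardless of how long the input sequence is.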
vanishing / exploding gradient problem
total error is the sum of each time step's error:
- E = sum_{t=1..T} E_t, so dE/dW = sum_{t=1..T} dE_t/dW
where, by the chain rule,
- dE_t/dW = sum_{k=1..t} (dE_t/dy_t)(dy_t/dh_t)(dh_t/dh_k)(dh_k/dW)
and thus
- dh_t/dh_k = prod_{j=k+1..t} dh_j/dh_{j-1}
since h_j = f(W_hh h_{j-1} + W_hx x_j) with a bounded nonlinearity f (e.g. tanh), each factor satisfies ||dh_j/dh_{j-1}|| <= beta_W * beta_h, so
- ||dh_t/dh_k|| <= (beta_W * beta_h)^(t-k)
which can become very large or very small very quickly.
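A quick numeric illustration of that (beta_W * beta_h)^(t-k) bound with made-up per-step factors: even a modest factor away from 1 compounds dramatically over 50 steps:

```python
# Hypothetical per-step factors beta_W * beta_h slightly below / above 1
shrink = 0.9 ** 50   # vanishing: roughly 0.005
grow = 1.1 ** 50     # exploding: over 100
```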
vanishing gradient problem: words many steps back have a negligible influence on the current training step
exploding gradient -> clip gradients (rescale when the norm exceeds a threshold)
vanishing gradient -> better initialization + ReLUs
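A minimal sketch of the gradient-clipping fix for exploding gradients (treating the gradient as a flat vector; threshold is a hyperparameter):

```python
import math

def clip_gradient(grad, threshold):
    # If the L2 norm exceeds the threshold, rescale the whole
    # vector so its norm equals the threshold (direction preserved).
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        grad = [g * threshold / norm for g in grad]
    return grad
```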
softmax over the full vocabulary is huge and slow
- class-based trick: factor p(w) = p(class(w)) * p(w | class(w)), so each step only needs a softmax over classes plus one over the words in that class
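A toy sketch of that class-based factorization (all scores and the two-class split are made up for illustration):

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

class_scores = [1.0, 0.5]                          # toy scores over 2 classes
word_scores_in_class = [[2.0, 0.1], [0.3, 0.3]]    # toy scores over words per class

p_class = softmax(class_scores)
p_word = [softmax(s) for s in word_scores_in_class]
# p(w) = p(class(w)) * p(w | class(w))
p_first_word = p_class[0] * p_word[0][0]
```

Each normalization is over a small set instead of the whole vocabulary, yet the factored probabilities still sum to one.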
Bidirectional RNN
- both the preceding and the following words influence the prediction at each position
Deep bidirectional RNN
F1 measure
precision = tp / (tp + fp)
recall = tp / (tp + fn)
F1 = 2 * precision * recall / (precision + recall)
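The three formulas above as a small function, with a worked example:

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# precision = 8/10 = 0.8, recall = 8/12 = 2/3, F1 = 8/11 ≈ 0.727
score = f1_score(8, 2, 4)
```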