Dropout Variants for RNNs
Problem
When an RNN builds a fixed-length representation of an arbitrarily long sequence by repeatedly applying the "input-to-hidden-state" transition, it runs into a difficulty: it is overly sensitive to perturbations of the hidden state.
dropout
Mathematical formulation of dropout:
-
y = f(W · d(x)), where

d(x) = mask ⊙ x   (training phase)
d(x) = (1 − p) x  (test phase)

Here p is the dropout rate, and mask is a binary vector whose entries are drawn from a Bernoulli distribution with success probability 1 − p.
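A minimal sketch of this formulation in numpy (non-inverted dropout, exactly as formalized above; the function name and fixed seed are illustrative):

```python
import numpy as np

_rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

def dropout(x, p, train=True):
    """Non-inverted dropout: at train time zero each unit with probability p;
    at test time scale by (1 - p) so expected activations match."""
    if train:
        mask = _rng.binomial(1, 1.0 - p, size=x.shape)  # Bernoulli(1 - p) mask
        return mask * x
    return (1.0 - p) * x
```

In practice, most libraries use "inverted" dropout instead (divide by 1 − p at train time, no test-time scaling), which leaves inference untouched.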
RNN dropout (RNNDrop)
Departing from the conventional practice of sampling a different mask at every time step to drop hidden units, this work proposes a new strategy (shown in the figure below) with two characteristics: 1) it generates the dropout mask only at the beginning of each training sequence and fixes it through the sequence; 2) it drops both the non-recurrent and the recurrent connections.
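A minimal sketch of this per-sequence masking, simplified to a vanilla RNN (the paper applies it to LSTM cell states); `Wh`, `Wx`, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.3                  # hidden size, input size, dropout rate
Wh = rng.normal(0.0, 0.1, (H, H))    # recurrent weights (hypothetical values)
Wx = rng.normal(0.0, 0.1, (H, X))    # input-to-hidden weights

def run_sequence(xs, train=True):
    # RNNDrop-style: sample ONE mask at the start of the sequence and keep
    # it fixed for every time step; it hits the hidden state, so both the
    # recurrent and non-recurrent connections out of it are dropped.
    mask = rng.binomial(1, 1.0 - p, size=H) if train else (1.0 - p) * np.ones(H)
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(Wh @ (mask * h) + Wx @ x)
    return h
```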
Reference: Moon T, Choi H, Lee H, et al. RNNDROP: A novel dropout for RNNs in ASR. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2016: 65-70.
recurrent dropout
- Idea: apply dropout to the input/update gate of the LSTM/GRU, which prevents the loss of long-term memories built up in the states/cells.
Formulations for the simple RNN, its dropout variant, the LSTM, and the GRU:
In principle, masks can be applied to any subset of the gates, cells, and states.
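As a sketch (per my reading of the Semeniuta et al. formulation; d(·) is the dropout operator defined earlier, i_t, f_t, o_t are the LSTM gates, g_t the candidate, and z_t the GRU update gate):

```latex
% Simple RNN, then its recurrent-dropout variant:
h_t = f(W_h h_{t-1} + W_x x_t + b)
\quad\longrightarrow\quad
h_t = f\big(W_h\, d(h_{t-1}) + W_x x_t + b\big)

% LSTM: drop only the cell candidate g_t, so memory stored in c_t is never erased:
c_t = f_t \odot c_{t-1} + i_t \odot d(g_t)

% GRU: likewise drop only the candidate update:
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot d(g_t)
```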
Reference: Semeniuta S, Severyn A, Barth E. Recurrent Dropout without Memory Loss. COLING 2016.
Dropout on vertical connections
For multi-layer LSTM networks, dropout is applied randomly to the vertical (layer-to-layer) connections only, i.e., at each step it is decided at random whether a lower layer's output is allowed to flow into the layer above; the recurrent connections are left untouched.
The dashed lines in the figure mark the connections that dropout is applied to.
Information flow after the dropout operation
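A minimal sketch of this vertical-only scheme for a two-layer vanilla RNN (the paper uses LSTMs; all names and sizes here are illustrative). The mask is resampled at every application, but only on the upward, layer-to-layer connections, never on the recurrent ones:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.5
W1h, W1x = rng.normal(0.0, 0.1, (H, H)), rng.normal(0.0, 0.1, (H, X))
W2h, W2x = rng.normal(0.0, 0.1, (H, H)), rng.normal(0.0, 0.1, (H, H))

def drop(v):
    # Inverted dropout, resampled every time it is applied.
    return v * rng.binomial(1, 1.0 - p, size=v.shape) / (1.0 - p)

def run(xs):
    h1, h2 = np.zeros(H), np.zeros(H)
    for x in xs:
        # Recurrent inputs h1, h2 pass through untouched; only the vertical
        # inputs (x into layer 1, h1 up into layer 2) are dropped.
        h1 = np.tanh(W1h @ h1 + W1x @ drop(x))
        h2 = np.tanh(W2h @ h2 + W2x @ drop(h1))
    return h2
```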
Reference: Zaremba W, Sutskever I, Vinyals O. Recurrent Neural Network Regularization. ICLR 2015.
Source code: https://github.com/wojzaremba/lstm
Dropout based on variational inference
In the figure, dashed lines denote connections without dropout, while solid lines in different colors denote different dropout masks.
Conventional dropout for RNNs: use different masks at different time steps.
Variational-inference-based dropout: use the same dropout mask at each time step, including on the recurrent layers.
Concrete implementation (as the solid-line colors in panel (b) of the figure indicate): sample a Bernoulli mask once for each connection matrix, then reuse that same mask at every subsequent time step.
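A sketch of this scheme, again on a single vanilla-RNN layer (names and sizes illustrative): one mask per connection matrix, sampled once per sequence and reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.3
Wh = rng.normal(0.0, 0.1, (H, H))
Wx = rng.normal(0.0, 0.1, (H, X))

def run_variational(xs):
    # One Bernoulli mask per weight matrix, sampled once per sequence and
    # reused at EVERY time step (inverted-dropout scaling).
    mx = rng.binomial(1, 1.0 - p, size=X) / (1.0 - p)  # input-connection mask
    mh = rng.binomial(1, 1.0 - p, size=H) / (1.0 - p)  # recurrent-connection mask
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(Wh @ (mh * h) + Wx @ (mx * x))
    return h
```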
Reference: Gal Y, Ghahramani Z. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
Source code: http://yarin.co/BRNN
Zoneout
c_t = d_t^c ⊙ c_{t−1} + (1 − d_t^c) ⊙ (f_t ⊙ c_{t−1} + i_t ⊙ g_t)
h_t = d_t^h ⊙ h_{t−1} + (1 − d_t^h) ⊙ (o_t ⊙ tanh(f_t ⊙ c_{t−1} + i_t ⊙ g_t))
where d_t^c and d_t^h are random binary (0/1) vectors; a 1 entry preserves the previous cell/hidden state, and a 0 entry applies the ordinary LSTM update.
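One zoneout LSTM step written directly from the equations above (gate values are assumed precomputed; `p_c` and `p_h` are the per-unit zoneout probabilities, i.e., the chance of keeping the previous value):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout_step(c_prev, h_prev, i, f, o, g, p_c=0.1, p_h=0.1):
    """One LSTM step with zoneout: each unit keeps its previous value with
    probability p (mask entry d = 1), otherwise it takes the ordinary LSTM
    update (d = 0)."""
    c_new = f * c_prev + i * g                     # ordinary cell update
    d_c = rng.binomial(1, p_c, size=c_prev.shape)  # 1 -> preserve c_{t-1}
    d_h = rng.binomial(1, p_h, size=h_prev.shape)  # 1 -> preserve h_{t-1}
    c_t = d_c * c_prev + (1 - d_c) * c_new
    h_t = d_h * h_prev + (1 - d_h) * (o * np.tanh(c_new))
    return c_t, h_t
```

With p_c = p_h = 0 this reduces to a plain LSTM step, which makes the masks easy to sanity-check.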
Reference: Krueger D, Maharaj T, Kramár J, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. ICLR 2017.
Source code: http://github.com/teganmaharaj/zoneout