Dropout Variants for RNNs
Problem
When an RNN builds a fixed-length representation of an arbitrarily long sequence by repeatedly applying the "input-to-hidden-state" transition, it runs into a difficulty: it is overly sensitive to perturbations of the hidden state.
dropout
Mathematical formulation of dropout:
-
y = f(W · d(x)), where

d(x) = mask ⊙ x   (training phase)
d(x) = (1 − p) x  (test phase)

Here p is the dropout rate, and mask is a binary vector whose entries are drawn from a Bernoulli distribution with success probability 1 − p.
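A minimal sketch of this formulation in numpy (non-inverted dropout, exactly as formalized above; the function name and fixed seed are illustrative):

```python
import numpy as np

_rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

def dropout(x, p, train=True):
    """Non-inverted dropout: at train time zero each unit with probability p;
    at test time scale by (1 - p) so expected activations match."""
    if train:
        mask = _rng.binomial(1, 1.0 - p, size=x.shape)  # Bernoulli(1 - p) mask
        return mask * x
    return (1.0 - p) * x
```

In practice, most libraries use "inverted" dropout instead (divide by 1 − p at train time, no test-time scaling), which leaves inference untouched.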
RNN dropout (RNNDrop)
Departing from the conventional practice of sampling a different mask at every time step to drop hidden units, this work proposes a new strategy (shown in the figure below) with two characteristics: 1) it generates the dropout mask only at the beginning of each training sequence and fixes it through the sequence; 2) it drops both the non-recurrent and the recurrent connections.
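A minimal sketch of this per-sequence masking, simplified to a vanilla RNN (the paper applies it to LSTM cell states); `Wh`, `Wx`, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.3                  # hidden size, input size, dropout rate
Wh = rng.normal(0.0, 0.1, (H, H))    # recurrent weights (hypothetical values)
Wx = rng.normal(0.0, 0.1, (H, X))    # input-to-hidden weights

def run_sequence(xs, train=True):
    # RNNDrop-style: sample ONE mask at the start of the sequence and keep
    # it fixed for every time step; it hits the hidden state, so both the
    # recurrent and non-recurrent connections out of it are dropped.
    mask = rng.binomial(1, 1.0 - p, size=H) if train else (1.0 - p) * np.ones(H)
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(Wh @ (mask * h) + Wx @ x)
    return h
```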
Reference: Moon T, Choi H, Lee H, et al. RNNDROP: A novel dropout for RNNs in ASR. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2016: 65-70.
recurrent dropout
- Idea: apply dropout to the input/update gate of the LSTM/GRU, which prevents the loss of long-term memories built up in the states/cells.
Formulations for the simple RNN, its dropout variant, the LSTM, and the GRU:
In principle, masks can be applied to any subset of the gates, cells, and states.
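As a sketch (per my reading of the Semeniuta et al. formulation; d(·) is the dropout operator defined earlier, i_t, f_t, o_t are the LSTM gates, g_t the candidate, and z_t the GRU update gate):

```latex
% Simple RNN, then its recurrent-dropout variant:
h_t = f(W_h h_{t-1} + W_x x_t + b)
\quad\longrightarrow\quad
h_t = f\big(W_h\, d(h_{t-1}) + W_x x_t + b\big)

% LSTM: drop only the cell candidate g_t, so memory stored in c_t is never erased:
c_t = f_t \odot c_{t-1} + i_t \odot d(g_t)

% GRU: likewise drop only the candidate update:
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot d(g_t)
```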
Reference: Semeniuta S, Severyn A, Barth E. Recurrent Dropout without Memory Loss. COLING 2016.
Dropout on vertical connections
For multi-layer LSTM networks, dropout is applied randomly to the vertical (layer-to-layer) connections only, i.e., at each step it is decided at random whether a lower layer's output is allowed to flow into the layer above; the recurrent connections are left untouched.
The dashed lines in the figure mark the connections that dropout is applied to.
Information flow after the dropout operation
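A minimal sketch of this vertical-only scheme for a two-layer vanilla RNN (the paper uses LSTMs; all names and sizes here are illustrative). The mask is resampled at every application, but only on the upward, layer-to-layer connections, never on the recurrent ones:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.5
W1h, W1x = rng.normal(0.0, 0.1, (H, H)), rng.normal(0.0, 0.1, (H, X))
W2h, W2x = rng.normal(0.0, 0.1, (H, H)), rng.normal(0.0, 0.1, (H, H))

def drop(v):
    # Inverted dropout, resampled every time it is applied.
    return v * rng.binomial(1, 1.0 - p, size=v.shape) / (1.0 - p)

def run(xs):
    h1, h2 = np.zeros(H), np.zeros(H)
    for x in xs:
        # Recurrent inputs h1, h2 pass through untouched; only the vertical
        # inputs (x into layer 1, h1 up into layer 2) are dropped.
        h1 = np.tanh(W1h @ h1 + W1x @ drop(x))
        h2 = np.tanh(W2h @ h2 + W2x @ drop(h1))
    return h2
```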
Reference: Zaremba W, Sutskever I, Vinyals O. Recurrent Neural Network Regularization. ICLR 2015.
Source code: https://github.com/wojzaremba/lstm
Dropout based on variational inference
In the figure, dashed lines denote connections without dropout, while solid lines in different colors denote different dropout masks.
Conventional dropout for RNNs: use different masks at different time steps.
Variational-inference-based dropout: use the same dropout mask at each time step, including on the recurrent layers.
Concrete implementation (as the solid-line colors in panel (b) of the figure indicate): sample a Bernoulli mask once for each connection matrix, then reuse that same mask at every subsequent time step.
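A sketch of this scheme, again on a single vanilla-RNN layer (names and sizes illustrative): one mask per connection matrix, sampled once per sequence and reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, p = 8, 4, 0.3
Wh = rng.normal(0.0, 0.1, (H, H))
Wx = rng.normal(0.0, 0.1, (H, X))

def run_variational(xs):
    # One Bernoulli mask per weight matrix, sampled once per sequence and
    # reused at EVERY time step (inverted-dropout scaling).
    mx = rng.binomial(1, 1.0 - p, size=X) / (1.0 - p)  # input-connection mask
    mh = rng.binomial(1, 1.0 - p, size=H) / (1.0 - p)  # recurrent-connection mask
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(Wh @ (mh * h) + Wx @ (mx * x))
    return h
```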
Reference: Gal Y, Ghahramani Z. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
Source code: http://yarin.co/BRNN
Zoneout
c_t = d_t^c ⊙ c_{t−1} + (1 − d_t^c) ⊙ (f_t ⊙ c_{t−1} + i_t ⊙ g_t)
h_t = d_t^h ⊙ h_{t−1} + (1 − d_t^h) ⊙ (o_t ⊙ tanh(f_t ⊙ c_{t−1} + i_t ⊙ g_t))
where d_t^c and d_t^h are random binary (0/1) vectors; a 1 entry preserves the previous cell/hidden state, and a 0 entry applies the ordinary LSTM update.
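One zoneout LSTM step written directly from the equations above (gate values are assumed precomputed; `p_c` and `p_h` are the per-unit zoneout probabilities, i.e., the chance of keeping the previous value):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout_step(c_prev, h_prev, i, f, o, g, p_c=0.1, p_h=0.1):
    """One LSTM step with zoneout: each unit keeps its previous value with
    probability p (mask entry d = 1), otherwise it takes the ordinary LSTM
    update (d = 0)."""
    c_new = f * c_prev + i * g                     # ordinary cell update
    d_c = rng.binomial(1, p_c, size=c_prev.shape)  # 1 -> preserve c_{t-1}
    d_h = rng.binomial(1, p_h, size=h_prev.shape)  # 1 -> preserve h_{t-1}
    c_t = d_c * c_prev + (1 - d_c) * c_new
    h_t = d_h * h_prev + (1 - d_h) * (o * np.tanh(c_new))
    return c_t, h_t
```

With p_c = p_h = 0 this reduces to a plain LSTM step, which makes the masks easy to sanity-check.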
Reference: Krueger D, Maharaj T, Kramár J, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. ICLR 2017.
Source code: http://github.com/teganmaharaj/zoneout