QANet Study Notes
Notation: Question Q = {q1, q2, ..., qm}, Context C = {c1, c2, ..., cn}, answer span S.
A symbol such as x is used to represent both the original word and its embedding.
Like most other Reading Comprehension models, QANet consists of five modules: Embedding layer, Embedding encoder layer, Context-query attention layer, Model encoder layer, and Output layer.
1. Embedding Layer
Word:
- 300-dim GloVe pre-trained word vectors
- fixed during training
- OOV words are mapped to <UNK>, whose vector is randomly initialized and trained
Char:
- each character embedding is 200-dim; each word is truncated or padded to a max length of 16 characters
- stack the char vectors of a word into a matrix and take the maximum of each row (max-pooling over the characters) to obtain a final 200-dim vector
- trained
The final representation of a word is the concatenation [x_w; x_c] ∈ R^{300+200} = R^{500}, which is then passed through a two-layer highway network.
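A minimal NumPy sketch of the embedding layer described above: char max-pooling, concatenation with the word vector, then two highway layers. Dimensions follow the notes (300 + 200 = 500); the weights are random placeholders and the helper names (`char_feature`, `highway_layer`) are mine, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

WORD_DIM, CHAR_DIM, MAX_WORD_LEN = 300, 200, 16  # dims from the notes

def char_feature(char_vecs):
    """char_vecs: (MAX_WORD_LEN, CHAR_DIM) matrix of a word's char embeddings.
    Max over the character axis yields one CHAR_DIM vector per word."""
    return char_vecs.max(axis=0)

def highway_layer(x, W_t, b_t, W_h, b_h):
    """One highway layer: gate * transform(x) + (1 - gate) * x."""
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))   # transform gate (sigmoid)
    h = np.maximum(0.0, W_h @ x + b_h)           # candidate (ReLU)
    return t * h + (1.0 - t) * x

# One word: its GloVe vector and its per-character embeddings (random stand-ins).
x_w = rng.standard_normal(WORD_DIM)
chars = rng.standard_normal((MAX_WORD_LEN, CHAR_DIM))
x = np.concatenate([x_w, char_feature(chars)])   # 500-dim word representation

# Two highway layers over the 500-dim vector (hypothetical random weights).
d = x.size
for _ in range(2):
    x = highway_layer(x,
                      rng.standard_normal((d, d)) * 0.01, np.zeros(d),
                      rng.standard_normal((d, d)) * 0.01, np.zeros(d))

print(x.shape)  # (500,)
```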
2. Embedding Encoder Layer
A stack of building blocks: [conv-layer x # + self-attention-layer + feed-forward-layer]
- depthwise separable convolutions, which are memory-efficient and generalize better
- kernel size is 7, number of filters is d = 128, number of conv layers within a block is 4
- self-attention uses multi-head attention, head number is 8
- Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block
- for an input x and a given operation f, the output is f(layernorm(x)) + x
- total number of encoder blocks is 1
- input dim is 300 + 200 = 500, output dim is d = 128 (the input is mapped to d by a one-dimensional convolution)
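The two building blocks above can be sketched in NumPy: a depthwise separable convolution (one length-k filter per channel, then a 1x1 pointwise mix) wrapped in the f(layernorm(x)) + x residual form. Shapes and kernel size match the notes (k = 7, d = 128); weights are random placeholders and the function names are mine.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Normalize each position's feature vector; x: (seq_len, d)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_separable_conv(x, depth_w, point_w):
    """x: (seq_len, d). depth_w: (k, d), one filter per channel;
    point_w: (d, d_out), the 1x1 pointwise conv mixing channels."""
    k, d = depth_w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # 'same' padding in time
    # depthwise: each channel convolved with its own length-k filter
    depth_out = np.stack([
        (xp[i:i + k] * depth_w).sum(axis=0) for i in range(len(x))
    ])
    return depth_out @ point_w

def residual_block(x, f):
    """QANet residual wrapper: f(layernorm(x)) + x."""
    return f(layernorm(x)) + x

rng = np.random.default_rng(0)
seq_len, d, k = 10, 128, 7
x = rng.standard_normal((seq_len, d))
depth_w = rng.standard_normal((k, d)) * 0.01
point_w = rng.standard_normal((d, d)) * 0.01
y = residual_block(x, lambda h: depthwise_separable_conv(h, depth_w, point_w))
print(y.shape)  # (10, 128)
```

The depthwise step uses k * d parameters and the pointwise step d * d, versus k * d * d for a full convolution, which is where the memory saving comes from.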
3. Context-Query Attention Layer
- similarity matrix S ∈ R^{n×m} with S_{ij} = f(q_j, c_i); the trilinear similarity function is f(q, c) = W0 [q; c; q ⊙ c]
- apply softmax over each row of S to get S̄; the context-to-query attention is A = S̄ · Q^T ∈ R^{n×d}
- query-to-context attention (as in DCN) brings a small additional benefit: apply softmax over each column of S to get S̿, then the query-to-context attention is B = S̄ · S̿^T · C^T ∈ R^{n×d}
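The attention computation above, sketched in NumPy with toy dimensions. The trilinear score is expanded so the full [q; c; q ⊙ c] vectors never need to be materialized; `context_query_attention` is my name for the helper, and the weights are random placeholders. Here Q and C are stored row-wise, so the matrix products differ by a transpose from the notes' notation.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_query_attention(C, Q, w):
    """C: (n, d) context rows, Q: (m, d) query rows, w: (3d,) trilinear weights.
    S[i, j] = f(q_j, c_i) = w . [q_j; c_i; q_j * c_i]."""
    n, d = C.shape
    w_q, w_c, w_qc = w[:d], w[d:2 * d], w[2 * d:]
    # expand the dot product with [q; c; q*c] into three cheap terms
    S = (Q @ w_q)[None, :] + (C @ w_c)[:, None] + (C * w_qc) @ Q.T
    S_row = softmax(S, axis=1)      # softmax over the query axis -> S_bar
    S_col = softmax(S, axis=0)      # softmax over the context axis
    A = S_row @ Q                   # context-to-query attention, (n, d)
    B = S_row @ S_col.T @ C         # query-to-context attention (DCN), (n, d)
    return A, B

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8
C, Q = rng.standard_normal((n, d)), rng.standard_normal((m, d))
A, B = context_query_attention(C, Q, rng.standard_normal(3 * d))
print(A.shape, B.shape)  # (6, 8) (6, 8)
```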
4. Model Encoder Layer
- the input to this layer is [c; a; c ⊙ a; c ⊙ b], where a and b are the rows of the attention matrices A and B respectively
- parameters are the same as embedding encoder layer
- number of blocks is 7
- number of conv layers within a block is 2
- the three repetitions of the stacked model encoder share weights
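Assembling the model encoder input [c; a; c ⊙ a; c ⊙ b] described above is a row-wise concatenation, sketched here with random stand-ins for the encoded context and the two attention matrices (d = 128 as in the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 128                          # toy context length, channel dim
C = rng.standard_normal((n, d))        # encoded context (one row per position)
A = rng.standard_normal((n, d))        # context-to-query attention rows
B = rng.standard_normal((n, d))        # query-to-context attention rows

# Row i is [c; a; c*a; c*b] for the i-th context position.
model_input = np.concatenate([C, A, C * A, C * B], axis=1)
print(model_input.shape)  # (6, 512)
```

So the model encoder sees 4d = 512 input channels per position before its first convolution maps back down to d.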
5. Output Layer
The probability of each position being the start or end of the answer span is predicted from the outputs M0, M1, M2 of the three stacked model encoders: p_start = softmax(W1 [M0; M1]), p_end = softmax(W2 [M0; M2]).
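A NumPy sketch of this output layer, assuming M0, M1, M2 each have one row per context position; W1 and W2 are random placeholder weight vectors and `span_probs` is my name for the helper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def span_probs(M0, M1, M2, W1, W2):
    """M0, M1, M2: (n, d) outputs of the three stacked model encoders.
    p_start = softmax(W1 [M0; M1]), p_end = softmax(W2 [M0; M2])."""
    p_start = softmax(np.concatenate([M0, M1], axis=1) @ W1)
    p_end = softmax(np.concatenate([M0, M2], axis=1) @ W2)
    return p_start, p_end

rng = np.random.default_rng(0)
n, d = 6, 128
M0, M1, M2 = (rng.standard_normal((n, d)) for _ in range(3))
W1, W2 = rng.standard_normal(2 * d), rng.standard_normal(2 * d)
p_start, p_end = span_probs(M0, M1, M2, W1, W2)
print(p_start.shape)  # (6,)
```

At inference the span (i, j) with i <= j maximizing p_start[i] * p_end[j] is selected.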