Paper Reading - Bidirectional LSTM-CRF Models for Sequence Tagging

Bidirectional LSTM-CRF Models for Sequence Tagging


Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, arXiv:1508.01991 (2015)


Abstract

Sequence tagging models based on long short-term memory (LSTM) networks: LSTM, bidirectional LSTM (BI-LSTM), LSTM with a conditional random field layer (LSTM-CRF), and bidirectional LSTM with a conditional random field layer (BI-LSTM-CRF).

BI-LSTM-CRF model: the BI-LSTM can make full use of past and future input features, while the CRF can use sentence-level tag information.

1 Introduction

Sequence tagging covers part-of-speech (POS) tagging, chunking, and named entity recognition (NER).

Most existing sequence tagging models are linear statistical models, e.g., Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRF).

This paper presents four sequence tagging models: LSTM, BI-LSTM, LSTM-CRF, and BI-LSTM-CRF:

  • BI-LSTM uses past and future input features; CRF uses sentence-level tag information

  • BI-LSTM-CRF is robust and has less dependence on word embeddings

2 Models

LSTM, BI-LSTM, LSTM-CRF, BI-LSTM-CRF

2.1 LSTM Networks

Recurrent neural networks (RNN) keep a memory based on history information, which lets them predict the current output conditioned on long-distance features. The network structure consists of an input layer $x$, a hidden layer $h$, and an output layer $y$:

  • the input layer represents the features at time step $t$ and has the same dimensionality as the feature size;

  • the output layer represents a probability distribution over labels at time step $t$ and has the same dimensionality as the size of the label set.

An RNN introduces a connection between the previous hidden state and the current hidden state, i.e., the recurrent layer weight parameters. The recurrent layer is designed to store history information.

(Figure 1 in the paper: a simple RNN model.)
$$\mathbf{h}_{t} = f(\mathbf{U} \mathbf{x}_{t} + \mathbf{W} \mathbf{h}_{t-1}) \tag{1}$$

$$\mathbf{y}_{t} = g(\mathbf{V} \mathbf{h}_{t}) \tag{2}$$

where $\mathbf{U}$, $\mathbf{W}$, and $\mathbf{V}$ are connection weights (computed during training), and $f(z)$ and $g(z_{m})$ are the sigmoid and softmax activation functions, respectively:

$$f(z) = \frac{1}{1 + e^{-z}} \tag{3}$$

$$g(z_{m}) = \frac{e^{z_{m}}}{\sum_{k} e^{z_{k}}} \tag{4}$$
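As a concrete illustration of Equations (1)-(4), here is a minimal NumPy sketch of a single RNN step; the function names, random weights, and dimensions are illustrative, not from the paper (the 300 hidden units echo the experimental setting in Section 4.3).

```python
import numpy as np

def sigmoid(z):
    # Eq. (3): f(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Eq. (4): g(z_m) = e^{z_m} / sum_k e^{z_k}, shifted by max(z) for stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V):
    # Eq. (1): hidden state from the current input and the previous hidden state
    h_t = sigmoid(U @ x_t + W @ h_prev)
    # Eq. (2): probability distribution over tags at time step t
    y_t = softmax(V @ h_t)
    return h_t, y_t

# Illustrative sizes: 50-dim input features, 300 hidden units, 10 tags
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(300, 50))
W = rng.normal(scale=0.1, size=(300, 300))
V = rng.normal(scale=0.1, size=(10, 300))
h = np.zeros(300)
for x in rng.normal(size=(7, 50)):  # a toy 7-word sentence
    h, y = rnn_step(x, h, U, W, V)
```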

An LSTM (long short-term memory) network replaces the hidden layer updates with purpose-built memory cells, which makes it better at finding and exploiting long-range dependencies in the data.

(Figure 2 in the paper: an LSTM memory cell.)

■ Note: the structure shown in Figure 2 is not entirely accurate; e.g., $\mathbf{h}_{t-1}$ is not shown feeding back into the gate inputs. ■

The LSTM memory cell is implemented as follows:

$$\begin{aligned}
\mathbf{i}_{t} &= \sigma(\mathbf{W}_{xi} \mathbf{x}_{t} + \mathbf{W}_{hi} \mathbf{h}_{t-1} + \mathbf{W}_{ci} \mathbf{c}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{f}_{t} &= \sigma(\mathbf{W}_{xf} \mathbf{x}_{t} + \mathbf{W}_{hf} \mathbf{h}_{t-1} + \mathbf{W}_{cf} \mathbf{c}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{c}_{t} &= \mathbf{f}_{t} \mathbf{c}_{t-1} + \mathbf{i}_{t} \tanh(\mathbf{W}_{xc} \mathbf{x}_{t} + \mathbf{W}_{hc} \mathbf{h}_{t-1} + \mathbf{b}_{c}) \\
\mathbf{o}_{t} &= \sigma(\mathbf{W}_{xo} \mathbf{x}_{t} + \mathbf{W}_{ho} \mathbf{h}_{t-1} + \mathbf{W}_{co} \mathbf{c}_{t} + \mathbf{b}_{o}) \\
\mathbf{h}_{t} &= \mathbf{o}_{t} \tanh(\mathbf{c}_{t})
\end{aligned}$$

where $\sigma$ is the logistic sigmoid function, and $\mathbf{i}$, $\mathbf{f}$, $\mathbf{o}$, $\mathbf{c}$, and $\mathbf{h}$ are the input gate, forget gate, output gate, cell, and hidden vectors, all of the same size; the subscripts of $\mathbf{W}$ indicate which vectors each matrix connects. The weight matrices from the cell to the gate vectors (e.g., $\mathbf{W}_{ci}$) are diagonal, so the $m$-th element of each gate vector depends only on the $m$-th element of the cell vector.
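A minimal NumPy sketch of this memory-cell update; parameter names and dimensions are illustrative. Because the cell-to-gate matrices are diagonal, they are stored here as vectors (`w_ci`, `w_cf`, `w_co`) and the corresponding products become elementwise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # Gate and cell updates; w_ci, w_cf, w_co hold the diagonals of the
    # cell-to-gate matrices, so those terms are elementwise products.
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

# Illustrative initialization: 50-dim inputs, 300 hidden/cell units
d_x, d_h = 50, 300
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(d_h, d_x)) for k in ("W_xi", "W_xf", "W_xc", "W_xo")}
p |= {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("W_hi", "W_hf", "W_hc", "W_ho")}
p |= {k: rng.normal(scale=0.1, size=d_h) for k in ("w_ci", "w_cf", "w_co", "b_i", "b_f", "b_c", "b_o")}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, p)
```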

(Figure 3 in the paper: an LSTM network.)

2.2 Bidirectional LSTM Networks

A bidirectional LSTM network can make use of past features (via forward states) and future features (via backward states) for a specific time frame.

Training uses back-propagation through time (BPTT): the forward and backward passes are run over whole sentences, and the hidden states only need to be reset to 0 at the beginning of each sentence.
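A minimal sketch of the bidirectional pass, reusing `lstm_step`, `p`, `d_x`, `d_h`, and `rng` from the sketch above (all illustrative): states are zeroed at the sentence boundary, one pass runs left to right, another right to left, and the two hidden sequences are concatenated per time step.

```python
import numpy as np

def bilstm_features(xs, params_fwd, params_bwd, d_h):
    # xs: (T, d_x) array for one sentence; states are reset to 0 at its start
    T = len(xs)
    h_f = np.zeros((T, d_h))
    h_b = np.zeros((T, d_h))
    h, c = np.zeros(d_h), np.zeros(d_h)
    for t in range(T):                    # forward states: past features
        h, c = lstm_step(xs[t], h, c, params_fwd)
        h_f[t] = h
    h, c = np.zeros(d_h), np.zeros(d_h)
    for t in reversed(range(T)):          # backward states: future features
        h, c = lstm_step(xs[t], h, c, params_bwd)
        h_b[t] = h
    # each time step now sees both past and future context
    return np.concatenate([h_f, h_b], axis=1)

feats = bilstm_features(rng.normal(size=(7, d_x)), p, p, d_h)  # (7, 2 * d_h)
```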

(Figure 4 in the paper: a bidirectional LSTM network.)

2.3 CRF Networks

There are two ways to make use of neighbor tag information when predicting current tags:

(1) Predict a distribution over tags at each time step and use beam-like decoding to find the optimal tag sequence; the maximum entropy classifier and Maximum Entropy Markov Models (MEMMs) fall in this category.

(2) Focus on the sentence level instead of individual positions, as Conditional Random Fields (CRF) models do; here inputs and outputs are directly connected.

(Figure 5 in the paper: a CRF network.)

2.4 LSTM-CRF Networks

An LSTM-CRF network uses an LSTM layer to process past input features and a CRF layer to handle sentence-level tag information.

The CRF layer has a state transition matrix as parameters; with it, the layer can use past and future tags to predict the current tag.

The network outputs a matrix of scores $f_{\theta}([x]_{1}^{T})$, whose element $[f_{\theta}]_{i,t}$ is the score output by the network with parameters $\theta$ for the $i$-th tag at the $t$-th word of the sentence $[x]_{1}^{T}$. The transition score $[A]_{i,j}$ models the transition from the $i$-th state to the $j$-th state for a pair of consecutive time steps. The transition matrix is position independent.

Rewriting the network parameters as $\tilde{\theta} = \theta \cup \{[A]_{i,j}\ \forall i, j\}$, the score of sentence $[x]_{1}^{T}$ along a path of tags $[i]_{1}^{T}$ is the sum of transition scores and network scores:

$$s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[i]_{t-1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t} \right) \tag{5}$$

Both $[A]_{i,j}$ and the optimal tag sequence used at inference time can be computed efficiently with dynamic programming.

(Figure 6 in the paper: an LSTM-CRF model.)

In Equation (5):

$t$: the time step, $t = 1, 2, \cdots, T$

$[x]_{1}^{T}$: the input sentence to the LSTM network, $[x]_{1}^{T} = (x_{1}, x_{2}, \cdots, x_{T})$

$[i]_{1}^{T}$: the output tag sequence of the LSTM-CRF, $[i]_{1}^{T} = (i_{1}, i_{2}, \cdots, i_{T})$, where each $i_{t}$ ranges over all possible tags

$f_{\theta}([x]_{1}^{T})$: the scores output by the LSTM network

$[A]_{i,j}$: the CRF score for transitioning from tag $i$ to tag $j$; this score does not depend on the time step $t$.

$s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta})$: the total score output by the LSTM-CRF for the given sequence $[x]_{1}^{T}$

$$s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta}) = \sum_{t=1}^{T} s([x]_{t}, [i]_{t}, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[i]_{t-1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t} \right)$$

At time step $t$, the LSTM-CRF contributes the score $s([x]_{t}, [i]_{t}, \tilde{\theta})$ (the CRF transition score depends on the tag at time step $t-1$, while the LSTM network score depends on the inputs at time steps $1, 2, \cdots, t$):

$$s([x]_{t}, [i]_{t}, \tilde{\theta}) = [A]_{[i]_{t-1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t}$$
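The path score of Equation (5) and the dynamic-programming (Viterbi) search for the best path can be sketched as follows. `start` stands in for the tag index $[i]_{0}$ before the sentence, and all names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def path_score(f, A, tags, start=0):
    # Eq. (5): sum of transition scores A[i_{t-1}, i_t] and network scores
    # f[i_t, t] along one tag path; 'start' plays the role of i_0.
    score, prev = 0.0, start
    for t, i in enumerate(tags):
        score += A[prev, i] + f[i, t]
        prev = i
    return score

def viterbi(f, A, start=0):
    # delta[j]: best score of any tag path ending in tag j at the current step
    K, T = f.shape
    delta = A[start] + f[:, 0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + A + f[None, :, t]  # cand[i, j]: ... i -> j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # trace the best path back
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(delta.max())

# Toy check: the decoded path attains its own Eq. (5) score
rng = np.random.default_rng(0)
f = rng.normal(size=(5, 8))   # network scores [f_theta]_{i,t}: 5 tags, 8 words
A = rng.normal(size=(5, 5))   # transition scores [A]_{i,j}
best, score = viterbi(f, A)
assert np.isclose(score, path_score(f, A, best))
```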

2.5 BI-LSTM-CRF Networks

A BI-LSTM-CRF network processes both past and future input features (via the bidirectional LSTM) in addition to sentence-level tag information (via the CRF layer).

(Figure 7 in the paper: a BI-LSTM-CRF model.)

3 Training Procedure

Model training uses an SGD forward and backward training procedure: for each batch of sentences, run the network forward pass, then the backward pass, and then update the parameters.
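The notes give no more detail than this loop structure. As a stand-in to make it concrete, the sketch below reuses `viterbi` and `path_score` from the previous sketch and applies a structured-perceptron-style update to the transition matrix only; the actual model instead backpropagates a CRF objective through the whole network, which is not shown here.

```python
def sgd_epoch(f_sentences, gold_sentences, A, start=0, lr=0.1):
    # One pass over the data (one sentence per step here): decode each
    # sentence (forward), then nudge transition scores toward the gold path
    # (a perceptron-style stand-in for the CRF gradient step on A).
    for f, gold in zip(f_sentences, gold_sentences):
        pred, _ = viterbi(f, A, start)
        prev_g = prev_p = start
        for g, p_t in zip(gold, pred):
            if (prev_g, g) != (prev_p, p_t):
                A[prev_g, g] += lr     # reward the gold transition
                A[prev_p, p_t] -= lr   # penalize the predicted transition
            prev_g, prev_p = g, p_t
    return A
```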


4 Experiments

POS tagging assigns each word a unique tag that indicates its syntactic role.

Chunking: each word is tagged with its phrase type.

NER: each word is tagged as other or as one of four entity types: Person, Location, Organization, or Miscellaneous.

Chunking and NER use the BIO2 annotation standard.
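For example (an illustrative sentence fragment, not from the paper): under BIO2 every chunk opens with a B- tag, so the noun phrase "the European Commission" is tagged B-NP I-NP I-NP in chunking, and words outside any chunk are tagged O.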

4.1 Data

The benchmarks are the Penn Treebank (PTB) POS tagging task, the CoNLL-2000 chunking task, and the CoNLL-2003 NER task (corpus statistics are tabulated in the paper).

4.2 Features

4.2.1 Spelling features

(The paper's spelling features include, e.g., whether a word starts with a capital letter, is all caps or all lower case, mixes letters and digits, contains punctuation, plus letter prefixes and suffixes and word pattern features.)

4.2.2 Context features

Uni-gram and bi-gram features.

4.2.3 Word embedding

A 50-dimensional embedding vector per word.

4.2.4 Features connection tricks

(The paper connects the spelling and context features directly to the output layer, bypassing the LSTM layer, which accelerates training.)

4.3 Results

Word embeddings are initialized either randomly or with Senna embeddings.

Model training: learning rate $0.1$, hidden layer size $300$; model performance is insensitive to the hidden layer size.

Metric for POS tagging: per-word accuracy; for chunking and NER: $\text{F}_1$ scores computed over chunks.
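For reference, the chunk $\text{F}_1$ score is the harmonic mean of precision and recall computed over complete chunks (a predicted chunk counts as correct only if both its boundaries and its type match the gold chunk):

$$\text{F}_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$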

4.3.1 Comparison with Conv-CRF

CRF models form strong baselines.


4.3.2 Model robustness

Robustness is tested by removing the spelling and context features and retaining only the word features.


4.3.3 Comparison with existing systems


5 Discussion

6 Conclusions