形式与定义
什么是条件随机场
条件随机场(conditional random field) 是给定随机变量X的条件下,随机变量Y 的马尔可夫随机场。常用于标注、实体识别等问题。
接下来我们通过命名实体识别来介绍crf。
先介绍实体识别:
给定一个句子:我爱*。实体识别的任务是从句子中识别出地点实体:*。
由于汉字不能直接作为模型的输入,我们先构造一个简单的实体识别的模型输入与label:
word_alphabet: {0: ‘我’, 1: ‘爱’, 2: ‘天’, 3: ‘安’, 4:‘门’}
label_alphabet: {0: b, 1: m, 2: e, 3: s,4: o, 5: ‘start’, 6: ‘pad’ }
label_alphabet中的b 代表实体的开始(begin),m代表实体的中间部分(middle),e代表尸体的结束(end),o代表不是实体(none),start、pad分别表示这个label序列的开始和结束。
所以有:
我们将构造的word_index作为模型的输入,label_index作为数据的label。我们希望实体识别模型能过通过word_index 预测出label_index,进而得到实体。
对于每一个字来说label有7种可能,为了数字化这些可能,我们从word_index到label_index设置一种分数,叫做状态分数 ,每一个字对应7种状态,如下图的红色箭头所示。
把状态分数记作:μ 1 − 4 \mu_{1-4} μ 1 − 4 ,表示’爱’对应状态4的得分是多少。
同时因为这是一个序列,label_index 的状态应该应该还与前一个状态有关,比如状态4(o: 不是实体)的后面不可能是状态1(m:实体的中间部分),也不可能是状态3(e:实体的结束)。所以这个时候需要一个转移分数 代表前一个label到后一个label的分数。
把转移分数记作:λ 4 − 4 \lambda_{4-4} λ 4 − 4 ,表示从状态4转移到状态4的得分是多少。 所以我们得到word_index=1到label_index=4的分数为μ 1 − 4 + λ 4 − 4 \mu_{1-4}+\lambda_{4-4} μ 1 − 4 + λ 4 − 4 ,CRF考虑了全局:将前一个分数累加到当前分数上,将其作为已经预测的序列的整体分数。所以当前得分为:s c o r e s = μ 0 − 4 + λ 5 − 4 + μ 1 − 4 + λ 4 − 4 + μ 2 − 0 + λ 4 − 0 + μ 3 − 1 + λ 0 − 1 + μ 4 − 3 + λ 1 − 3 + λ 3 − 6
scores = \mu_{0-4}+\lambda_{5-4} +\mu_{1-4}+\lambda_{4-4} + \mu_{2-0}+\lambda_{4-0} +\mu_{3-1}+\lambda_{0-1} + \mu_{4-3}+\lambda_{1-3}+\lambda_{3-6}
s c o r e s = μ 0 − 4 + λ 5 − 4 + μ 1 − 4 + λ 4 − 4 + μ 2 − 0 + λ 4 − 0 + μ 3 − 1 + λ 0 − 1 + μ 4 − 3 + λ 1 − 3 + λ 3 − 6
即:s c o r e ( x , y ) = ∑ i = 0 n λ y i − 1 , y i + ∑ i = 0 n μ i , y i
score(x,y) = \sum_{i=0}^n \lambda_{y_{i-1},y_i} + \sum_{i=0}^n \mu_{i,y_i}
s c o r e ( x , y ) = i = 0 ∑ n λ y i − 1 , y i + i = 0 ∑ n μ i , y i
因为这个预测序列有很多种,种类为label的排列组合大小。其中只有一种组合是对的,我们想通过模型训练使得对的score的比重在总体的所有score的越大越好。而这个时候我们一般softmax化,即:P ( y ∣ x ) = e s c o r e ( x , y ) ∑ y e s c o r e ( x , y )
P(y|x) = \frac{e^{score(x,y)}}{\sum_y e^{score(x,y)}}
P ( y ∣ x ) = ∑ y e s c o r e ( x , y ) e s c o r e ( x , y )
上面这个实体识别中,X和Y有相同的图结构,其条件随机场的概率图模型应该长这样:
在一些问题种 X 和Y结构不同,其概率图模型如下图所示:
上面的例子有这样一个问题:有些组合根本不存在,如:状态4(o: 不是实体)的后面不可能是状态1(m:实体的中间部分),也不可能是状态3(e:实体的结束), 因此我们不能单纯的求分数的相加,需要引入一些限制。即引入转移特征函数t k t_k t k 和状态特征函数s l s_l s l ,取值为0或者1,当满足条件时取值1,否则取值0。
说明:因为状态4(o: 不是实体)的后面不可能是状态1(m:实体的中间部分),即:t 1 = t 1 ( y i − 1 = 4 , y i = 1 , x , i ) = 0 , i = 1 , 2 … 6
t_1 = t_1(y_{i-1}=4,y_i=1,x,i)=0, i=1,2\ldots6
t 1 = t 1 ( y i − 1 = 4 , y i = 1 , x , i ) = 0 , i = 1 , 2 … 6
此时某一转移可能性的得分也更新为:t k λ k t_k\lambda_k t k λ k 如果t k = 0 t_k=0 t k = 0 则此种转移不存在,转移得分为0,对于状态得分也存在这样的情况。
此时CRF的参数化形式定义如下:P ( y ∣ x ) = 1 Z exp ( ∑ i , k λ k t k ( y i − 1 , y i , x , i + ∑ i , l μ l s l ( y i , x , i ) )
P(y|x)=\frac{1}{Z}\exp\left(\sum_{i,k}\lambda_kt_k(y_{i-1},y_i,x,i+\sum_{i,l}\mu_ls_l(y_i,x,i)\right)
P ( y ∣ x ) = Z 1 exp ⎝ ⎛ i , k ∑ λ k t k ( y i − 1 , y i , x , i + i , l ∑ μ l s l ( y i , x , i ) ⎠ ⎞
Z ( x ) = ∑ y e x p ( ∑ i , k λ k t k ( y i − 1 , y i , x , i ) + ∑ i , l μ l s l ( y i , x , i ) )
Z(x)=\sum_y exp \left(\sum_{i,k}\lambda_kt_k(y_{i-1},y_i,x,i)+\sum_{i,l}\mu_ls_l(y_i,x,i)\right)
Z ( x ) = y ∑ e x p ⎝ ⎛ i , k ∑ λ k t k ( y i − 1 , y i , x , i ) + i , l ∑ μ l s l ( y i , x , i ) ⎠ ⎞
式中t k t_k t k 是定义在边上的特征函数和s l s_l s l 是定义在节点上的特征函数,λ k \lambda_k λ k 和μ l \mu_l μ l 是对应的权值。Z ( x ) Z(x) Z ( x ) 是规范化因子,求和是在所有可能的输出序列上进行的 。
至此我们已经了解了CRF基本的思想了。
下面来看一个简单得例子深入了解CRF。
假设有一个标注问题,输入序列为X = ( X 1 , X 2 , X 3 ) X=(X_1,X_2,X_3) X = ( X 1 , X 2 , X 3 ) ,输出的标记序列为Y = ( Y 1 , Y 2 , Y 3 ) Y=(Y_1,Y_2,Y_3) Y = ( Y 1 , Y 2 , Y 3 ) ,Y i Y_i Y i 的取值为1或2。假设t k , s l t_k,s_l t k , s l 和对应的权值λ k , μ l \lambda_k,\mu_l λ k , μ l 如下:t 1 = t 1 ( y i − 1 = 1 , y i = 2 , x , i ) , i = 2 , 3 , λ 1 = 1 t 2 = t 2 ( y i − 1 = 1 , y i = 1 , x , i ) , i = 2 , λ 2 = 0.5 t 3 = t 3 ( y i − 1 = 2 , y i = 1 , x , i ) , i = 3 , λ 3 = 1 t 4 = t 4 ( y i − 1 = 2 , y i = 1 , x , i ) , i = 2 , λ 4 = 1 t 5 = t 5 ( y i − 1 = 2 , y i = 2 , x , i ) , i = 3 , λ 5 = 0.2 s 1 = s 1 ( y i = 1 , x , i ) , i = 1 , μ 1 = 1 s 2 = s 2 ( y i = 1 , x , i ) , i = 1 , 2 , μ 2 = 0.5 s 3 = s 3 ( y i = 1 , x , i ) , i = 2 , 3 , μ 3 = 0.8 s 4 = s 4 ( y i = 2 , x , i ) , i = 3 μ 4 = 0.5
\begin{aligned}
t_1&=t_1(y_{i-1}=1,y_i=2,x,i),&i&=2,3,&\lambda_1&=1 \\
t_2&=t_2(y_{i-1}=1,y_i=1,x,i),&i&=2,&\lambda_2&=0.5\\
t_3&=t_3(y_{i-1}=2,y_i=1,x,i),&i&=3,&\lambda_3&=1\\
t_4&=t_4(y_{i-1}=2,y_i=1,x,i),&i&=2,&\lambda_4&=1\\
t_5&=t_5(y_{i-1}=2,y_i=2,x,i),&i&=3,&\lambda_5&=0.2\\
s_1&=s_1(y_i=1,x,i),&i&=1,&\mu_1&=1\\
s_2&=s_2(y_i=1,x,i),&i&=1,2,&\mu_2&=0.5\\
s_3&=s_3(y_i=1,x,i),&i&=2,3,&\mu_3&=0.8\\
s_4&=s_4(y_i=2,x,i),&i&=3&\mu_4&=0.5\\
\end{aligned}
t 1 t 2 t 3 t 4 t 5 s 1 s 2 s 3 s 4 = t 1 ( y i − 1 = 1 , y i = 2 , x , i ) , = t 2 ( y i − 1 = 1 , y i = 1 , x , i ) , = t 3 ( y i − 1 = 2 , y i = 1 , x , i ) , = t 4 ( y i − 1 = 2 , y i = 1 , x , i ) , = t 5 ( y i − 1 = 2 , y i = 2 , x , i ) , = s 1 ( y i = 1 , x , i ) , = s 2 ( y i = 1 , x , i ) , = s 3 ( y i = 1 , x , i ) , = s 4 ( y i = 2 , x , i ) , i i i i i i i i i = 2 , 3 , = 2 , = 3 , = 2 , = 3 , = 1 , = 1 , 2 , = 2 , 3 , = 3 λ 1 λ 2 λ 3 λ 4 λ 5 μ 1 μ 2 μ 3 μ 4 = 1 = 0 . 5 = 1 = 1 = 0 . 2 = 1 = 0 . 5 = 0 . 8 = 0 . 5
这里只注明特征值取1的条件,取值为0的省略。
对于给定的观测序列 x, 求标记序列为y = ( y 1 , y 2 , y 3 ) = ( 1 , 2 , 2 ) y=(y_1,y_2,y_3)=(1,2,2) y = ( y 1 , y 2 , y 3 ) = ( 1 , 2 , 2 ) 的非规范化条件概率。
根据t k , s l t_k,s_l t k , s l 和对应的权值λ k , μ l \lambda_k,\mu_l λ k , μ l ,我们可以画出转移图如下:
图中红色路径的得分即为y = ( 1 , 2 , 2 ) y=(1,2,2) y = ( 1 , 2 , 2 ) 的非规范化条件概率。即:P ( y 1 = 1 , y 2 = 2 , y 3 = 2 ∣ x ) → e ( λ 1 t 1 + λ 5 t 5 + μ 1 s 1 + μ 2 s 2 + μ 4 s 4 ) = e 3.2
P(y_1=1,y_2=2,y_3=2|x) \rightarrow e^{(\lambda_1t_1 + \lambda_5 t_5 +\mu_1s_1+\mu_2s_2 + \mu_4s_4)} = e^{3.2}
P ( y 1 = 1 , y 2 = 2 , y 3 = 2 ∣ x ) → e ( λ 1 t 1 + λ 5 t 5 + μ 1 s 1 + μ 2 s 2 + μ 4 s 4 ) = e 3 . 2
从公式的角度看:P ( y ∣ x ) → exp ( ∑ k = 1 5 λ k ∑ i = 2 3 t k ( y i − 1 , y i , x , i ) + ∑ k = 1 4 μ l ∑ i = 1 3 s l ( y i , x , i ) )
\begin{aligned}
P(y|x) &\rightarrow \exp\left(\sum_{k=1}^5\lambda_k \sum_{i=2}^3t_k(y_{i-1},y_i,x,i)+\sum_{k=1}^4\mu_l\sum_{i=1}^3 s_l(y_i,x,i)\right)
\end{aligned}
P ( y ∣ x ) → exp ( k = 1 ∑ 5 λ k i = 2 ∑ 3 t k ( y i − 1 , y i , x , i ) + k = 1 ∑ 4 μ l i = 1 ∑ 3 s l ( y i , x , i ) )
其中:∑ k = 1 5 λ k ∑ i = 2 3 t k ( y i − 1 , y i , x , i ) = λ 1 ( t 1 ( y 1 , y 2 , x , 2 ) + t 1 ( y 2 , y 3 , x , 3 ) ) + λ 2 ( t 2 ( y 1 , y 2 , x , 2 ) + t 2 ( y 2 , y 3 , x , 3 ) ) + λ 3 ( t 3 ( y 1 , y 2 , x , 2 ) + t 3 ( y 2 , y 3 , x , 3 ) ) + λ 4 ( t 4 ( y 1 , y 2 , x , 2 ) + t 4 ( y 2 , y 3 , x , 3 ) ) + λ 5 ( t 5 ( y 1 , y 2 , x , 2 ) + t 5 ( y 2 , y 3 , x , 3 ) ) = λ 1 t 1 ( y 1 = 1 , y 2 = 2 , x , 2 ) + λ 5 t 5 ( y 2 = 2 , y 3 = 2 , x , 3 )
\begin{aligned}
&\sum_{k=1}^5\lambda_k \sum_{i=2}^3t_k(y_{i-1},y_i,x,i) \\
=&\lambda_1 (t_1(y_{1},y_2,x,2)+t_1(y_{2},y_3,x,3)) + \lambda_2 (t_2(y_{1},y_2,x,2)+t_2(y_{2},y_3,x,3)) \\
&+\lambda_3 (t_3(y_{1},y_2,x,2)+t_3(y_{2},y_3,x,3))+\lambda_4 (t_4(y_{1},y_2,x,2)+t_4(y_{2},y_3,x,3)) \\
&+\lambda_5 (t_5(y_{1},y_2,x,2)+t_5(y_{2},y_3,x,3))\\
=&\lambda_1 t_1(y_{1}=1,y_2=2,x,2) + \lambda_5 t_5(y_{2}=2,y_3=2,x,3)
\end{aligned}
= = k = 1 ∑ 5 λ k i = 2 ∑ 3 t k ( y i − 1 , y i , x , i ) λ 1 ( t 1 ( y 1 , y 2 , x , 2 ) + t 1 ( y 2 , y 3 , x , 3 ) ) + λ 2 ( t 2 ( y 1 , y 2 , x , 2 ) + t 2 ( y 2 , y 3 , x , 3 ) ) + λ 3 ( t 3 ( y 1 , y 2 , x , 2 ) + t 3 ( y 2 , y 3 , x , 3 ) ) + λ 4 ( t 4 ( y 1 , y 2 , x , 2 ) + t 4 ( y 2 , y 3 , x , 3 ) ) + λ 5 ( t 5 ( y 1 , y 2 , x , 2 ) + t 5 ( y 2 , y 3 , x , 3 ) ) λ 1 t 1 ( y 1 = 1 , y 2 = 2 , x , 2 ) + λ 5 t 5 ( y 2 = 2 , y 3 = 2 , x , 3 )
∑ l = 1 4 μ l ∑ i = 1 3 s l ( y i , x , i ) = μ 1 ( s 1 ( y 1 , x , 1 ) + s 1 ( y 2 , x , 2 ) + s 1 ( y 3 , x , 3 ) ) + μ 2 ( s 2 ( y 1 , x , 1 ) + s 2 ( y 2 , x , 2 ) + s 2 ( y 3 , x , 3 ) ) + μ 3 ( s 3 ( y 1 , x , 1 ) + s 3 ( y 2 , x , 2 ) + s 3 ( y 3 , x , 3 ) ) + μ 4 ( s 4 ( y 1 , x , 1 ) + s 4 ( y 2 , x , 2 ) + s 4 ( y 3 , x , 3 ) ) = μ 1 s 1 ( y 1 = 1 , x , 1 ) + μ 2 s 2 ( y 2 = 1 , x , 2 ) + μ 4 s 4 ( y 3 = 3 , x , 3 )
\begin{aligned}
&\sum_{l=1}^4\mu_l\sum_{i=1}^3 s_l(y_i,x,i) \\
=&\mu_1(s_1(y_1,x,1)+s_1(y_2,x,2)+s_1(y_3,x,3))+\mu_2(s_2(y_1,x,1)+s_2(y_2,x,2)+s_2(y_3,x,3)) \\
&+\mu_3(s_3(y_1,x,1)+s_3(y_2,x,2)+s_3(y_3,x,3))+\mu_4(s_4(y_1,x,1)+s_4(y_2,x,2)+s_4(y_3,x,3))\\
=&\mu_1s_1(y_1=1,x,1)+\mu_2s_2(y_2=1,x,2)+\mu_4s_4(y_3=3,x,3)
\end{aligned}
= = l = 1 ∑ 4 μ l i = 1 ∑ 3 s l ( y i , x , i ) μ 1 ( s 1 ( y 1 , x , 1 ) + s 1 ( y 2 , x , 2 ) + s 1 ( y 3 , x , 3 ) ) + μ 2 ( s 2 ( y 1 , x , 1 ) + s 2 ( y 2 , x , 2 ) + s 2 ( y 3 , x , 3 ) ) + μ 3 ( s 3 ( y 1 , x , 1 ) + s 3 ( y 2 , x , 2 ) + s 3 ( y 3 , x , 3 ) ) + μ 4 ( s 4 ( y 1 , x , 1 ) + s 4 ( y 2 , x , 2 ) + s 4 ( y 3 , x , 3 ) ) μ 1 s 1 ( y 1 = 1 , x , 1 ) + μ 2 s 2 ( y 2 = 1 , x , 2 ) + μ 4 s 4 ( y 3 = 3 , x , 3 )
带入数值即可。
简化形式
上述公式中,有两个特征函数,两个权值,为了简化,我们首先将转移特征和状态特征及其权值用统一的符号表示。假设有K 1 K_1 K 1 个转移特征,K 2 K_2 K 2 个状态特征,K = K 1 + K 2 K=K_1+K_2 K = K 1 + K 2 ,记:f k ( y i − 1 , y i , x , i ) = { t k ( y i − 1 , y i , x , i ) , k = 1 , 2 , … , K 1 s l ( y i , x , i ) , k = K 1 + l ; l = 1 , 2 , … , K 2
\color{red}
f_k(y_{i-1},y_i,x,i)=
\begin{cases}
t_k(y_{i-1},y_i,x,i),&k=1,2,\dots,K_1\\
s_l(y_i,x,i),&k=K_1+l;l=1,2,\dots,K_2
\end{cases}
f k ( y i − 1 , y i , x , i ) = { t k ( y i − 1 , y i , x , i ) , s l ( y i , x , i ) , k = 1 , 2 , … , K 1 k = K 1 + l ; l = 1 , 2 , … , K 2
然后,对转和状态特征在各个位置i i i 求和,记作:f k ( y , x ) = ∑ i = 1 n f k ( y i − 1 , y i , x , i ) = { ∑ i = 1 n t k ( y i − 1 , y i , x , i ) , k < K 1 ∑ i = 1 n s l ( y i , x , i ) , K 1 < k < K
f_k(y,x)=\sum_{i=1}^nf_k(y_{i-1},y_i,x,i)=\begin{cases}
\sum_{i=1}^n t_k(y_{i-1},y_i,x,i),&k<K_1\\
\sum_{i=1}^ns_l(y_i,x,i),&K_1<k<K
\end{cases}
f k ( y , x ) = i = 1 ∑ n f k ( y i − 1 , y i , x , i ) = { ∑ i = 1 n t k ( y i − 1 , y i , x , i ) , ∑ i = 1 n s l ( y i , x , i ) , k < K 1 K 1 < k < K
用w k w_k w k 表示特征f k ( y , x ) f_k(y,x) f k ( y , x ) 的权值w k = { λ k , k = 1 , 2 , … , K 1 μ l , k = K 1 + l ; l = 1 , 2 , … , K 2
w_k=
\begin{cases}
\lambda_k,&k=1,2,\dots,K_1\\
\mu_l,&k=K_1+l;l=1,2,\dots,K_2
\end{cases}
w k = { λ k , μ l , k = 1 , 2 , … , K 1 k = K 1 + l ; l = 1 , 2 , … , K 2
于是条件随机场可以表示为P ( y ∣ x ) = 1 Z ( x ) exp ∑ k = 1 K w k f k ( y , x ) Z ( x ) = ∑ y exp ∑ k = 1 K w k f k ( y , x )
\begin{aligned}
P(y|x)&=\frac{1}{Z(x)}\exp\sum_{k=1}^Kw_kf_k(y,x)\\
Z(x)&=\sum_y\exp\sum_{k=1}^Kw_kf_k(y,x)
\end{aligned}
P ( y ∣ x ) Z ( x ) = Z ( x ) 1 exp k = 1 ∑ K w k f k ( y , x ) = y ∑ exp k = 1 ∑ K w k f k ( y , x )
若以w w w 表示权值向量, 即w = ( w 1 , w 2 , … , w K ) T
w=(w_1,w_2,\dots,w_K)^T
w = ( w 1 , w 2 , … , w K ) T
以F F F 表示全局特征向量,即F ( y , x ) = ( f 1 ( y , x ) , f 2 ( y , x ) , … , f K ( y , x ) ) T
F(y,x)=(f_1(y,x),f_2(y,x),\dots,f_K(y,x))^T
F ( y , x ) = ( f 1 ( y , x ) , f 2 ( y , x ) , … , f K ( y , x ) ) T
条件随机场可以表示成向量内积的形式P w ( y ∣ x ) = exp ( w ⋅ F ( y , x ) ) Z w ( x ) Z w ( x ) = ∑ y exp ( w ⋅ F ( y , x ) )
\begin{aligned}
P_w(y|x)&=\frac{\exp(w\cdot F(y,x))}{Z_w(x)}\\
Z_w(x)&=\sum_y\exp\left(w\cdot F(y,x)\right)
\end{aligned}
P w ( y ∣ x ) Z w ( x ) = Z w ( x ) exp ( w ⋅ F ( y , x ) ) = y ∑ exp ( w ⋅ F ( y , x ) )
矩阵形式
针对线性链条件随机场引入起点和终点状态标记y 0 = s t a r t , y n + 1 = e n d y_0=start,y_{n+1}=end y 0 = s t a r t , y n + 1 = e n d , 这时P w ( y ∣ x ) P_w(y|x) P w ( y ∣ x ) 可以矩阵形式表示。
对观测序列x x x 的每一个位置i = 1 , 2 , … , n + 1 i = 1,2,\ldots,n+1 i = 1 , 2 , … , n + 1 ,由于y i − 1 y_{i-1} y i − 1 和y i y_i y i 在m个标记中取值,可以定义一个m 阶矩阵:M i ( x ) = [ M i ( y i − 1 , y i ∣ x ) ]
\begin{aligned}
M_i(x)&=\left[M_i(y_{i-1},y_i|x)\right]\\
\end{aligned}
M i ( x ) = [ M i ( y i − 1 , y i ∣ x ) ]
矩阵随机变量的元素为:M i ( y i − 1 , y i ∣ x ) = exp ( W i ( y i − 1 , y i ∣ x ) ) W i ( y i − 1 , y i ∣ x ) = ∑ k = 1 K w k f k ( y i − 1 , y i ∣ x )
\begin{aligned}
M_i(y_{i-1},y_i|x)&=\exp\left(W_i(y_{i-1},y_i|x)\right)\\
W_i(y_{i-1},y_i|x)&=\sum_{k=1}^Kw_kf_k(y_{i-1},y_i|x)
\end{aligned}
M i ( y i − 1 , y i ∣ x ) W i ( y i − 1 , y i ∣ x ) = exp ( W i ( y i − 1 , y i ∣ x ) ) = k = 1 ∑ K w k f k ( y i − 1 , y i ∣ x )
举例:假设y 1 ∈ { 1 , 2 } y_1 \in \{1,2\} y 1 ∈ { 1 , 2 } W 2 ( y 1 , y 2 ∣ x ) = ∑ k = 1 K w k f k ( y 1 , y 2 ∣ x ) = ∑ k = 1 K 1 λ k t k ( y 1 , y 2 , x , 2 ) + ∑ l = 1 K − K 1 μ l s l ( y 2 , x , 2 )
\begin{aligned}
W_2(y_{1},y_2|x)&=\sum_{k=1}^Kw_kf_k(y_{1},y_2|x) \\
&= \sum_{k=1}^{K_1}\lambda_kt_k(y_{1},y_2,x,2)+\sum_{l=1}^{K-K_1} \mu_ls_l(y_2,x,2)
\end{aligned}
W 2 ( y 1 , y 2 ∣ x ) = k = 1 ∑ K w k f k ( y 1 , y 2 ∣ x ) = k = 1 ∑ K 1 λ k t k ( y 1 , y 2 , x , 2 ) + l = 1 ∑ K − K 1 μ l s l ( y 2 , x , 2 )
则可以求得:W 2 ( y 1 = 1 , y 2 = 1 ∣ x ) , W 2 ( y 1 = 1 , y 2 = 2 ∣ x ) , W 2 ( y 1 = 2 , y 2 = 1 ∣ x ) , W 2 ( y 1 = 2 , y 2 = 2 ∣ x ) W_2(y_{1}=1,y_2=1|x),W_2(y_{1}=1,y_2=2|x),W_2(y_{1}=2,y_2=1|x),W_2(y_{1}=2,y_2=2|x) W 2 ( y 1 = 1 , y 2 = 1 ∣ x ) , W 2 ( y 1 = 1 , y 2 = 2 ∣ x ) , W 2 ( y 1 = 2 , y 2 = 1 ∣ x ) , W 2 ( y 1 = 2 , y 2 = 2 ∣ x )
取对数,构成2 × 2 2\times2 2 × 2 矩阵矩阵M 2 ( x ) M_2(x) M 2 ( x ) ,分别表示y 1 y_1 y 1 转移到y 2 y_2 y 2 的可能性的大小。
这样给定观测序列x x x ,相应的标记序列y y y 的非规范化概率可以通过该序列的n+1个矩阵原色的乘积∏ i = 1 n + 1 M i ( y i − 1 , y i ∣ x ) \prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x) ∏ i = 1 n + 1 M i ( y i − 1 , y i ∣ x ) 表示。所以:P w ( y ∣ x ) = 1 Z w ( x ) ∏ i = 1 n + 1 M i ( y i − 1 , y i ∣ x )
P_w(y|x)=\frac{1}{Z_w(x)}\prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x)
P w ( y ∣ x ) = Z w ( x ) 1 i = 1 ∏ n + 1 M i ( y i − 1 , y i ∣ x )
其中,Z w Z_w Z w 为规范化因子,是n+1个矩阵的乘积的(start,stop)元素 Z w ( x ) = ( M 1 ( x ) M 2 ( x ) … M n + 1 ( x ) ) s t a r t , s t o p
Z_w(x)=(M_1(x)M_2(x)\dots M_{n+1}(x))_{start,stop}
Z w ( x ) = ( M 1 ( x ) M 2 ( x ) … M n + 1 ( x ) ) s t a r t , s t o p
注:y 0 = s t a r t , y n + 1 = s t o p y_0=start, y_{n+1}=stop y 0 = s t a r t , y n + 1 = s t o p 表示开始状态和终止状态,规范化因子Z w ( x ) Z_w(x) Z w ( x ) 是以start为起点,end为终点通过状态的所有路径y 1 y 2 … y n y_1y_2\ldots y_n y 1 y 2 … y n 的非规范化概率∏ i = 1 n + 1 M i ( y i − 1 , y i ∣ x ) \prod_{i=1}^{n+1}M_i(y_{i-1},y_i|x) ∏ i = 1 n + 1 M i ( y i − 1 , y i ∣ x ) 之和。
举例如下:
给定下图的线性链条件随机场,观测序列x,状态序列y,假设y 0 = s t a r t = 1 , y 4 = s t o p = 1 y_0=start=1, y_{4}=stop=1 y 0 = s t a r t = 1 , y 4 = s t o p = 1 。
各个位置的随机矩阵(可以理解为状态转移矩阵)为:M 1 ( x ) = [ a 11 a 12 0 0 ] , M 2 ( x ) = [ b 11 b 12 b 21 b 22 ] M 3 ( x ) = [ c 11 c 12 c 21 c 22 ] , M 4 ( x ) = [ 1 0 1 0 ]
\begin{aligned}
M_1(x)=
\begin{bmatrix}
&a_{11}&a_{12}\\
&0&0
\end{bmatrix}
&,M_2(x)=
\begin{bmatrix}
&b_{11}&b_{12}\\
&b_{21}&b_{22}
\end{bmatrix}
\\
M_3(x)=
\begin{bmatrix}
&c_{11}&c_{12}\\
&c_{21}&c_{22}
\end{bmatrix}
&,M_4(x)=
\begin{bmatrix}
&1&0\\
&1&0
\end{bmatrix}
\end{aligned}
M 1 ( x ) = [ a 1 1 0 a 1 2 0 ] M 3 ( x ) = [ c 1 1 c 2 1 c 1 2 c 2 2 ] , M 2 ( x ) = [ b 1 1 b 2 1 b 1 2 b 2 2 ] , M 4 ( x ) = [ 1 1 0 0 ]
其中:M i ( x ) = [ exp ( ∑ k = 1 K w k f k ( y i − 1 , y i ∣ x ) ) ] , i = 1 , 2 , … , n + 1
M_i(x)=\left[\exp\left(\sum_{k=1}^Kw_kf_k(y_{i-1},y_i|x)\right)\right],i=1,2,\dots,n+1
M i ( x ) = [ exp ( k = 1 ∑ K w k f k ( y i − 1 , y i ∣ x ) ) ] , i = 1 , 2 , … , n + 1
例如M 1 ( x ) M_1(x) M 1 ( x ) :M 1 ( x ) = [ a 11 = exp ( ∑ k = 1 K w k f k ( y 0 = 1 , y 1 = 1 ∣ x ) ) a 12 = exp ( ∑ k = 1 K w k f k ( y 0 = 1 , y 1 = 2 ∣ x ) ) a 21 = exp ( ∑ k = 1 K w k f k ( y 0 = 2 , y 1 = 1 ∣ x ) ) a 22 = exp ( ∑ k = 1 K w k f k ( y 0 = 2 , y 2 = 1 ∣ x ) ) ]
M_1(x)=
\begin{bmatrix}
&a_{11} =\exp\left(\sum_{k=1}^Kw_kf_k(y_{0}=1,y_1=1|x)\right)&a_{12}=\exp\left(\sum_{k=1}^Kw_kf_k(y_{0}=1,y_1=2|x)\right)\\
&a_{21}=\exp\left(\sum_{k=1}^Kw_kf_k(y_{0}=2,y_1=1|x)\right)&a_{22}=\exp\left(\sum_{k=1}^Kw_kf_k(y_{0}=2,y_2=1|x)\right)
\end{bmatrix}
M 1 ( x ) = ⎣ ⎡ a 1 1 = exp ( ∑ k = 1 K w k f k ( y 0 = 1 , y 1 = 1 ∣ x ) ) a 2 1 = exp ( ∑ k = 1 K w k f k ( y 0 = 2 , y 1 = 1 ∣ x ) ) a 1 2 = exp ( ∑ k = 1 K w k f k ( y 0 = 1 , y 1 = 2 ∣ x ) ) a 2 2 = exp ( ∑ k = 1 K w k f k ( y 0 = 2 , y 2 = 1 ∣ x ) ) ⎦ ⎤
显然,a 21 , a 22 a_{21}, a_{22} a 2 1 , a 2 2 为0。
将上述矩阵相乘:∏ i = 1 4 M i ( y i − 1 , y i ∣ x )
\prod_{i=1}^{4}M_i(y_{i-1},y_i|x)
i = 1 ∏ 4 M i ( y i − 1 , y i ∣ x )
可以得到各个路径的非规范化概率为:a 11 b 11 c 11 , a 11 b 11 c 12 , a 11 b 12 c 21 , a 11 b 12 c 22 , a 12 b 21 c 11 , a 12 b 21 c 12 , a 12 b 22 c 21 , a 12 b 22 c 22 ,
a_{11}b_{11}c_{11},\quad a_{11}b_{11}c_{12},\quad a_{11}b_{12}c_{21},\quad a_{11}b_{12}c_{22},\quad \\
a_{12}b_{21}c_{11},\quad a_{12}b_{21}c_{12},\quad a_{12}b_{22}c_{21},\quad a_{12}b_{22}c_{22},\quad
a 1 1 b 1 1 c 1 1 , a 1 1 b 1 1 c 1 2 , a 1 1 b 1 2 c 2 1 , a 1 1 b 1 2 c 2 2 , a 1 2 b 2 1 c 1 1 , a 1 2 b 2 1 c 1 2 , a 1 2 b 2 2 c 2 1 , a 1 2 b 2 2 c 2 2 ,
规范化因子,即最终计算结果左上角的元素,为:a 11 b 11 c 11 + a 11 b 11 c 12 + a 11 b 12 c 21 + a 11 b 12 c 22 + a 12 b 21 c 11 + a 12 b 21 c 12 + a 12 b 22 c 21 + a 12 b 22 c 22
a_{11}b_{11}c_{11}+ a_{11}b_{11}c_{12}+ a_{11}b_{12}c_{21}+ a_{11}b_{12}c_{22}+ \\
a_{12}b_{21}c_{11}+ a_{12}b_{21}c_{12}+ a_{12}b_{22}c_{21}+ a_{12}b_{22}c_{22}
a 1 1 b 1 1 c 1 1 + a 1 1 b 1 1 c 1 2 + a 1 1 b 1 2 c 2 1 + a 1 1 b 1 2 c 2 2 + a 1 2 b 2 1 c 1 1 + a 1 2 b 2 1 c 1 2 + a 1 2 b 2 2 c 2 1 + a 1 2 b 2 2 c 2 2
学习算法
条件随机场模型的学习通过拟牛顿法进行。
CRF的模型:P ( y ∣ x ) = 1 Z ( x ) exp ∑ i = 1 n w i f i ( y , x ) Z ( x ) = ∑ y exp ∑ i = 1 n w i f i ( y , x )
\begin{aligned}P(y|x)&=\frac{1}{Z(x)}\exp\sum_{i=1}^nw_if_i(y,x)\\Z(x)&=\sum_y\exp\sum_{i=1}^nw_if_i(y,x)
\end{aligned}
P ( y ∣ x ) Z ( x ) = Z ( x ) 1 exp i = 1 ∑ n w i f i ( y , x ) = y ∑ exp i = 1 ∑ n w i f i ( y , x )
已知训练数据的经验概率分布P ~ ( x , y ) \widetilde {P}(x,y) P ( x , y ) ,条件概率分布的对数似然函数表示为:L P ~ ( P w ) = l o g ∏ x , y P ( y ∣ x ) P ~ ( x , y ) = ∑ x , y P ~ ( x , y ) log P ( y ∣ x )
L_{\widetilde {P}}(P_w)=log \prod_{x,y}{P}(y|x)^{\widetilde {P}(x,y)} =\sum \limits_{x,y}\widetilde {P}(x,y)\log{P}(y|x)
L P ( P w ) = l o g x , y ∏ P ( y ∣ x ) P ( x , y ) = x , y ∑ P ( x , y ) log P ( y ∣ x )
所以L P ~ ( P w ) = ∑ x , y P ~ ( x , y ) log P ( y ∣ x ) = ∑ x , y P ~ ( x , y ) ∑ i = 1 n w i f i ( x , y ) − ∑ x , y P ~ ( x , y ) log ( Z w ( x ) ) = ∑ x , y P ~ ( x , y ) ∑ i = 1 n w i f i ( x , y ) − ∑ x , y P ~ ( x ) P ( y ∣ x ) log ( Z w ( x ) ) = ∑ x , y P ~ ( x , y ) ∑ i = 1 n w i f i ( x , y ) − ∑ x P ~ ( x ) log ( Z w ( x ) ) ∑ y P ( y ∣ x ) = ∑ x , y P ~ ( x , y ) ∑ i = 1 n w i f i ( x , y ) − ∑ x P ~ ( x ) log ( Z w ( x ) )
\begin{aligned}
L_{\widetilde {P}}(P_w)&=\sum \limits_{x,y}\widetilde {P}(x,y)\log{P}(y|x)\\
&=\sum \limits_{x,y}\widetilde {P}(x,y)\sum \limits_{i=1}^{n}w_if_i(x,y) -\sum \limits_{x,y}\widetilde{P}(x,y)\log{(Z_w(x))}\\
&=\sum \limits_{x,y}\widetilde {P}(x,y)\sum \limits_{i=1}^{n}w_if_i(x,y) -\sum \limits_{x,y}\widetilde{P}(x)P(y|x)\log{(Z_w(x))}\\
&=\sum \limits_{x,y}\widetilde {P}(x,y)\sum \limits_{i=1}^{n}w_if_i(x,y) -\sum \limits_{x}\widetilde{P}(x)\log{(Z_w(x))}\sum_{y}P(y|x)\\
&=\sum \limits_{x,y}\widetilde {P}(x,y)\sum \limits_{i=1}^{n}w_if_i(x,y) -\sum \limits_{x}\widetilde{P}(x)\log{(Z_w(x))}
\end{aligned}
L P ( P w ) = x , y ∑ P ( x , y ) log P ( y ∣ x ) = x , y ∑ P ( x , y ) i = 1 ∑ n w i f i ( x , y ) − x , y ∑ P ( x , y ) log ( Z w ( x ) ) = x , y ∑ P ( x , y ) i = 1 ∑ n w i f i ( x , y ) − x , y ∑ P ( x ) P ( y ∣ x ) log ( Z w ( x ) ) = x , y ∑ P ( x , y ) i = 1 ∑ n w i f i ( x , y ) − x ∑ P ( x ) log ( Z w ( x ) ) y ∑ P ( y ∣ x ) = x , y ∑ P ( x , y ) i = 1 ∑ n w i f i ( x , y ) − x ∑ P ( x ) log ( Z w ( x ) )
以上推导用到了∑ y P ( y ∣ x ) = 1 \sum\limits_yP(y|x)=1 y ∑ P ( y ∣ x ) = 1
要极大化似然函数,即极小化− L P ~ ( P w ) -L_{\widetilde {P}}(P_w) − L P ( P w ) 。
所以学习的优化目标是:min w ∈ R n f ( w ) = ∑ x P ~ ( x ) log ∑ y exp ( ∑ i = 1 n w i f i ( y , x ) ) − ∑ x , y P ~ ( x , y ) ∑ i = 1 n w i f i ( x , y )
\min\limits_{w \in \R^n} f(w) =\sum \limits_{x}\widetilde{P}(x)\log{\sum_y\exp \left(\sum_{i=1}^nw_if_i(y,x)\right)} - \sum \limits_{x,y}\widetilde {P}(x,y)\sum \limits_{i=1}^{n}w_if_i(x,y)
w ∈ R n min f ( w ) = x ∑ P ( x ) log y ∑ exp ( i = 1 ∑ n w i f i ( y , x ) ) − x , y ∑ P ( x , y ) i = 1 ∑ n w i f i ( x , y )
其梯度函数是g ( w ) = ( ∂ f ( w ) ∂ w 1 , ∂ f ( w ) ∂ w 2 , … ∂ f ( w ) ∂ w n ) T
g(w) = \left( \frac{\partial f(w)}{\partial w_1},\frac{\partial f(w)}{\partial w_2},\ldots \frac{\partial f(w)}{\partial w_n}\right)^T
g ( w ) = ( ∂ w 1 ∂ f ( w ) , ∂ w 2 ∂ f ( w ) , … ∂ w n ∂ f ( w ) ) T
其中:∂ f ( w ) ∂ w i = ∑ x , y P ~ ( x ) P w ( y ∣ x ) f i ( y , x ) − ∑ x , y P ~ ( x , y ) f i ( x , y )
\frac{\partial f(w)}{\partial w_i}=\sum \limits_{x,y}\widetilde{P}(x)P_w(y|x)f_i(y,x) - \sum \limits_{x,y}\widetilde {P}(x,y)f_i(x,y)
∂ w i ∂ f ( w ) = x , y ∑ P ( x ) P w ( y ∣ x ) f i ( y , x ) − x , y ∑ P ( x , y ) f i ( x , y )
向量化:∂ f ( w ) ∂ w = ∑ x , y P ~ ( x ) P w ( y ∣ x ) f ( y , x ) − ∑ x , y P ~ ( x , y ) f ( x , y )
\frac{\partial f(w)}{\partial w}=\sum \limits_{x,y}\widetilde{P}(x)P_w(y|x)f(y,x) - \sum \limits_{x,y}\widetilde {P}(x,y)f(x,y)
∂ w ∂ f ( w ) = x , y ∑ P ( x ) P w ( y ∣ x ) f ( y , x ) − x , y ∑ P ( x , y ) f ( x , y )
条件随机场学习的BFGS算法 :
输入:特征函数f 1 , f 2 , … , f n f_1,f_2,\ldots,f_n f 1 , f 2 , … , f n ;经验分布P ~ ( x , y ) \widetilde P(x,y) P ( x , y ) ;
输出:最优参数w ^ \hat w w ^ ; 最优模型P w ( y ∣ x ) P_w(y|x) P w ( y ∣ x ) 。
(1)选定初始点w ( 0 ) w^{(0)} w ( 0 ) 取 B 0 \mathbf B_0 B 0 为正定对称矩阵,k = 0 k=0 k = 0 。
(2)计算g k = g ( w ( k ) ) g_k=g(w^{(k)}) g k = g ( w ( k ) ) 。若g k = 0 g_k=0 g k = 0 则停止计算,否则转(3)。
(3)由B k p k = − g k B_kp_k=-g_k B k p k = − g k 求出p k p_k p k
(4)一维搜索:求λ k \lambda_k λ k 使得:f ( w ( k ) + λ k p k ) = min λ ≥ 0 f ( w ( k ) + λ p k )
f(w^{(k)}+\lambda_kp_k)= \min\limits_{\lambda \geq 0}f(w^{(k)}+\lambda p_k)
f ( w ( k ) + λ k p k ) = λ ≥ 0 min f ( w ( k ) + λ p k )
(5)置 w ( k + 1 ) = w ( k ) + λ k p k w^{(k+1)} = w^{(k)} + \lambda_k p_k w ( k + 1 ) = w ( k ) + λ k p k
(6)计算 g k + 1 = g ( w ( k + 1 ) ) g_{k+1} = g(w^{(k+1)}) g k + 1 = g ( w ( k + 1 ) ) ,若g k + 1 = 0 g_{k+1} = 0 g k + 1 = 0 ,则停止计算,否则,按下式更新B k + 1 B_{k+1} B k + 1 :B k + 1 = B k + y k y k T y k T δ k − B k δ k δ k T B k δ k T B k δ k
\mathbf B_{k+1} = \mathbf B_{k}+\frac{y_ky_k^T}{y_k^T\delta_k}-\frac{\mathbf B_k\delta_k \delta_k^T\mathbf B_k}{\delta_k^T\mathbf B_k\delta_k}
B k + 1 = B k + y k T δ k y k y k T − δ k T B k δ k B k δ k δ k T B k
其中:y k = g k + 1 − g k , δ k = w ( k + 1 ) − w k
y_k = g_{k+1} - g_k, \qquad\delta_k=w^{(k+1)} - w^{k}
y k = g k + 1 − g k , δ k = w ( k + 1 ) − w k
(7)置 k = k + 1 k=k+1 k = k + 1 , 转(3)
概率计算
即给定 linear-CRF的条件概率分布P(y|x), 在给定输入序列x和输出序列y时,计算条件概率P ( y i ∣ x ) P(y_i|x) P ( y i ∣ x ) 和P ( y i − 1 , y i ∣ x ) P(y_i−1,y_i|x) P ( y i − 1 , y i ∣ x ) 以及对应的期望。
前向后向概率概述
要计算条件概率P ( y i ∣ x ) P(y_i|x) P ( y i ∣ x ) 和P ( y i − 1 , y i ∣ x ) P(y_{i-1},y_i|x) P ( y i − 1 , y i ∣ x ) ,可以使用前向后向算法来完成。
前向概率
定义α i ( y i ∣ x ) \alpha_i(y_i|x) α i ( y i ∣ x ) 表示序列位置i的标记是y i y_i y i 时,在位置i之前的部分标记序列的非规范化概率。
而我们在上面定义了:M i ( y i − 1 , y i ∣ x ) = e x p ( ∑ k = 1 K w k f k ( y i − 1 , y i , x , i ) )
M_i(y_{i-1},y_i |x) = exp(\sum\limits_{k=1}^Kw_kf_k(y_{i-1},y_i, x,i))
M i ( y i − 1 , y i ∣ x ) = e x p ( k = 1 ∑ K w k f k ( y i − 1 , y i , x , i ) )
用于计算在给定y i − 1 y_{i-1} y i − 1 时,从y i − 1 y_{i-1} y i − 1 转移到y i y_i y i 的非规范化概率。
那么在得知在位置i + 1 i+1 i + 1 处标记为y i + 1 y_{i+1} y i + 1 时,位置i + 1 i+1 i + 1 之前的标记序列非规范化概率α i + 1 ( y i + 1 ∣ x ) \alpha_{i+1}(y_{i+1}|x) α i + 1 ( y i + 1 ∣ x ) 的递推公式:α i + 1 ( y i + 1 ∣ x ) = α i ( y i ∣ x ) M i + 1 ( y i + 1 , y i ∣ x ) i = 1 , 2 , . . . , n + 1
\alpha_{i+1}(y_{i+1}|x) = \alpha_i(y_i|x)M_{i+1}(y_{i+1},y_i|x) \;\; i=1,2,...,n+1
α i + 1 ( y i + 1 ∣ x ) = α i ( y i ∣ x ) M i + 1 ( y i + 1 , y i ∣ x ) i = 1 , 2 , . . . , n + 1
特别的,在起点处,我们令:α 0 ( y 0 ∣ x ) = { 1 y 0 = s t a r t 0 e l s e
\alpha_0(y_0|x)= \begin{cases} 1 & {y_0 =start}\\ 0 & {else} \end{cases}
α 0 ( y 0 ∣ x ) = { 1 0 y 0 = s t a r t e l s e
由于在位置i + 1 i+1 i + 1 处,y i + 1 y_{i+1} y i + 1 的可能取值有m种,我们用α i ( x ) \alpha_i(x) α i ( x ) 表示这m个可能取值对应的前向向量:α i ( x ) = ( α i ( y i = 1 ∣ x ) , α i ( y i = 2 ∣ x ) , . . . α i ( y i = m ∣ x ) ) T
\alpha_i(x) = (\alpha_i(y_i=1|x), \alpha_i(y_i=2|x), ... \alpha_i(y_i=m|x))^T
α i ( x ) = ( α i ( y i = 1 ∣ x ) , α i ( y i = 2 ∣ x ) , . . . α i ( y i = m ∣ x ) ) T
则递推公式可以表示为:α i + 1 T ( x ) = α i T ( x ) M i + 1 ( x )
\alpha_{i+1}^T(x) = \alpha_i^T(x)M_{i+1}(x)
α i + 1 T ( x ) = α i T ( x ) M i + 1 ( x ) 后向概率
同样定义β i ( y i ∣ x ) \beta_i(y_i|x) β i ( y i ∣ x ) 表示序列位置i的标记是y i y_i y i 时,在位置i之后的部分(i+1到n的部分)标记序列的非规范化概率。
那么在得知i + 1 i+1 i + 1 处标记为y ( i + 1 ) y_(i+1) y ( i + 1 ) 时,位置i之后的部分标记序列的非规范化概率β i ( y i ∣ x ) \beta_i(y_i|x) β i ( y i ∣ x ) 的递推公式:β i ( y i ∣ x ) = M i + 1 ( y i , y i + 1 ∣ x ) β i + 1 ( y i + 1 ∣ x )
\beta_{i}(y_{i}|x) = M_{i+1}(y_i,y_{i+1}|x)\beta_{i+1}(y_{i+1}|x)
β i ( y i ∣ x ) = M i + 1 ( y i , y i + 1 ∣ x ) β i + 1 ( y i + 1 ∣ x )
特别的,在终点处定义:β n + 1 ( y n + 1 ∣ x ) = { 1 y n + 1 = s t o p 0 e l s e
\beta_{n+1}(y_{n+1}|x)= \begin{cases} 1 & {y_{n+1} =stop}\\ 0 & {else} \end{cases}
β n + 1 ( y n + 1 ∣ x ) = { 1 0 y n + 1 = s t o p e l s e
如果用向量表示则有:β i ( x ) = M i + 1 ( x ) β i + 1 ( x )
\beta_i(x) = M_{i+1}(x)\beta_{i+1}(x)
β i ( x ) = M i + 1 ( x ) β i + 1 ( x )
规范化因子Z ( x ) Z(x) Z ( x ) 的表达式为:Z ( x ) = ∑ c = 1 m α n ( y c ∣ x ) = ∑ c = 1 m β 1 ( y c ∣ x )
Z(x) = \sum\limits_{c=1}^m\alpha_{n}(y_c|x) = \sum\limits_{c=1}^m\beta_{1}(y_c|x)
Z ( x ) = c = 1 ∑ m α n ( y c ∣ x ) = c = 1 ∑ m β 1 ( y c ∣ x )
向量化的表示为:Z ( x ) = α n T ( x ) ∙ 1 = 1 T ∙ β 1 ( x )
Z(x) = \alpha_{n}^T(x) \bullet \mathbf{1} = \mathbf{1}^T \bullet \beta_{1}(x)
Z ( x ) = α n T ( x ) ∙ 1 = 1 T ∙ β 1 ( x )
其中,1是m维全1向量。
前向后向概率计算
有了前向后向概率的定义和计算方法,我们就很容易计算序列位置i的标记是y i y_i y i 时的条件概率P ( y i ∣ x ) P(y_i|x) P ( y i ∣ x ) :P ( y i ∣ x ) = α i T ( y i ∣ x ) β i ( y i ∣ x ) Z ( x ) = α i T ( y i ∣ x ) β i ( y i ∣ x ) α n T ( x ) ∙ 1
P(y_i|x) = \frac{\alpha_i^T(y_i|x)\beta_i(y_i|x)}{Z(x)} = \frac{\alpha_i^T(y_i|x)\beta_i(y_i|x)}{ \alpha_{n}^T(x) \bullet \mathbf{1}}
P ( y i ∣ x ) = Z ( x ) α i T ( y i ∣ x ) β i ( y i ∣ x ) = α n T ( x ) ∙ 1 α i T ( y i ∣ x ) β i ( y i ∣ x )
也容易计算序列位置i的标记是y i y_i y i ,位置i − 1 i-1 i − 1 的标记是y i − 1 y_{i-1} y i − 1 时的条件概率P ( y i − 1 , y i ∣ x ) P(y_{i-1},y_i|x) P ( y i − 1 , y i ∣ x ) :P ( y i − 1 , y i ∣ x ) = α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x ) Z ( x ) = α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x ) α n T ( x ) ∙ 1
\begin{aligned}
P(y_{i-1},y_i|x) &= \frac{\alpha_{i-1}^T(y_{i-1}|x)M_i(y_{i-1},y_i|x)\beta_i(y_i|x)}{Z(x)} \\
&= \frac{\alpha_{i-1}^T(y_{i-1}|x)M_i(y_{i-1},y_i|x)\beta_i(y_i|x)}{ \alpha_{n}^T(x) \bullet \mathbf{1}}
\end{aligned}
P ( y i − 1 , y i ∣ x ) = Z ( x ) α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x ) = α n T ( x ) ∙ 1 α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x )
期望计算
有了上一节计算的条件概率,我们也可以很方便的计算联合分布P ( x , y ) P(x,y) P ( x , y ) 与条件分布P ( y ∣ x ) P(y|x) P ( y ∣ x ) 的期望。
特征函数f k ( x , y ) f_k(x,y) f k ( x , y ) 关于条件分布P ( y ∣ x ) P(y|x) P ( y ∣ x ) 的期望表达式是:E P ( y ∣ x ) [ f k ] = E P ( y ∣ x ) [ f k ( y , x ) ] = ∑ i = 1 n + 1 ∑ y i − 1 y i P ( y i − 1 , y i ∣ x ) f k ( y i − 1 , y i , x , i ) = ∑ i = 1 n + 1 ∑ y i − 1 y i f k ( y i − 1 , y i , x , i ) α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x ) α n T ( x ) ∙ 1
\begin{aligned} E_{P(y|x)}[f_k] & = E_{P(y|x)}[f_k(y,x)] \\ & = \sum\limits_{i=1}^{n+1} \sum\limits_{y_{i-1}\;\;y_i}P(y_{i-1},y_i|x)f_k(y_{i-1},y_i,x, i) \\ & = \sum\limits_{i=1}^{n+1} \sum\limits_{y_{i-1}\;\;y_i}f_k(y_{i-1},y_i,x, i) \frac{\alpha_{i-1}^T(y_{i-1}|x)M_i(y_{i-1},y_i|x)\beta_i(y_i|x)}{ \alpha_{n}^T(x) \bullet \mathbf{1}} \end{aligned}
E P ( y ∣ x ) [ f k ] = E P ( y ∣ x ) [ f k ( y , x ) ] = i = 1 ∑ n + 1 y i − 1 y i ∑ P ( y i − 1 , y i ∣ x ) f k ( y i − 1 , y i , x , i ) = i = 1 ∑ n + 1 y i − 1 y i ∑ f k ( y i − 1 , y i , x , i ) α n T ( x ) ∙ 1 α i − 1 T ( y i − 1 ∣ x ) M i ( y i − 1 , y i ∣ x ) β i ( y i ∣ x )
同样可以计算联合分布P ( x , y ) P(x,y) P ( x , y ) 的期望:KaTeX parse error: No such environment: align at position 8:
\begin{̲a̲l̲i̲g̲n̲}̲ E_{P(x,y)}[f_k…
假设一共有K个特征函数,则k = 1 , 2 , . . . K k=1,2,...K k = 1 , 2 , . . . K 。
预测算法—维特比算法
预测问题也可以理解为解码问题:给定条件随机场的条件概率P ( y ∣ x ) P(y|x) P ( y ∣ x ) 和一个观测序列x,要求出满足P ( y ∣ x ) P(y|x) P ( y ∣ x ) 最大的序列y。这个解码算法最常用的还是和HMM解码类似的维特比算法。
对于我们linear-CRF中的维特比算法,我们定义一个局部状态δ i ( l ) \delta_i(l) δ i ( l ) ,表示在位置i i i 标记l l l 各个可能取值(1,2…m)对应的非规范化概率的最大值。之所以用非规范化概率是,规范化因子Z ( x ) Z(x) Z ( x ) 不影响最大值的比较。根据δ i ( l ) \delta_i(l) δ i ( l ) 的定义,我们递推在位置i + 1 i+1 i + 1 标记l l l 的表达式为:δ i + 1 ( l ) = max 1 ≤ j ≤ m { δ i ( j ) + ∑ k = 1 K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
\delta_{i+1}(l) = \max_{1 \leq j \leq m}\{\delta_i(j) + \sum\limits_{k=1}^Kw_kf_k(y_{i} =j,y_{i+1} = l,x,i)\}\;, l=1,2,...m
δ i + 1 ( l ) = 1 ≤ j ≤ m max { δ i ( j ) + k = 1 ∑ K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
我们需要用另一个局部状态Ψ i + 1 ( l ) \Psi_{i+1}(l) Ψ i + 1 ( l ) 来记录使δ i + 1 ( l ) \delta_{i+1}(l) δ i + 1 ( l ) 达到最大的位置i i i 的标记取值,这个值用来最终回溯最优解,Ψ i + 1 ( l ) \Psi_{i+1}(l) Ψ i + 1 ( l ) 的递推表达式为:Ψ i + 1 ( l ) = a r g max 1 ≤ j ≤ m { δ i ( j ) + ∑ k = 1 K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
\Psi_{i+1}(l) = arg\;\max_{1 \leq j \leq m}\{\delta_i(j) + \sum\limits_{k=1}^Kw_kf_k(y_{i} =j,y_{i+1} = l,x,i)\}\; ,l=1,2,...m
Ψ i + 1 ( l ) = a r g 1 ≤ j ≤ m max { δ i ( j ) + k = 1 ∑ K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
维特比算法流程
linear-CRF模型维特比算法流程:
输入:模型的K个特征函数,和对应的K个权重。观测序列x = ( x 1 , x 2 , . . . x n ) x=(x_1,x_2,...x_n) x = ( x 1 , x 2 , . . . x n ) ,可能的标记个数m
输出:最优标记序列y ∗ = ( y 1 ∗ , y 2 ∗ , . . . y n ∗ ) y^* =(y_1^*,y_2^*,...y_n^*) y ∗ = ( y 1 ∗ , y 2 ∗ , . . . y n ∗ )
具体而言:
1,初始化:δ 1 ( l ) = ∑ k = 1 K w k f k ( y 0 = s t a r t , y 1 = l , x , i ) } , l = 1 , 2 , . . . m Ψ 1 ( l ) = s t a r t , l = 1 , 2 , . . . m
\delta_{1}(l) = \sum\limits_{k=1}^Kw_kf_k(y_{0} =start,y_{1} = l,x,i)\}\;, l=1,2,...m \\
\Psi_{1}(l) = start\;, l=1,2,...m
δ 1 ( l ) = k = 1 ∑ K w k f k ( y 0 = s t a r t , y 1 = l , x , i ) } , l = 1 , 2 , . . . m Ψ 1 ( l ) = s t a r t , l = 1 , 2 , . . . m
2,对于i = 1 , 2... n − 1 i=1,2...n-1 i = 1 , 2 . . . n − 1 进行递推:δ i + 1 ( l ) = max 1 ≤ j ≤ m { δ i ( j ) + ∑ k = 1 K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
\delta_{i+1}(l) = \max_{1 \leq j \leq m}\{\delta_i(j) + \sum\limits_{k=1}^Kw_kf_k(y_{i} =j,y_{i+1} = l,x,i)\}\;, l=1,2,...m
δ i + 1 ( l ) = 1 ≤ j ≤ m max { δ i ( j ) + k = 1 ∑ K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
Ψ i + 1 ( l ) = a r g max 1 ≤ j ≤ m { δ i ( j ) + ∑ k = 1 K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
\Psi_{i+1}(l) = arg\;\max_{1 \leq j \leq m}\{\delta_i(j) + \sum\limits_{k=1}^Kw_kf_k(y_{i} =j,y_{i+1} = l,x,i)\}\; ,l=1,2,...m
Ψ i + 1 ( l ) = a r g 1 ≤ j ≤ m max { δ i ( j ) + k = 1 ∑ K w k f k ( y i = j , y i + 1 = l , x , i ) } , l = 1 , 2 , . . . m
3,i i i 迭代到n-1时停止:y n ∗ = a r g max 1 ≤ j ≤ m δ n ( j )
y_n^* = arg\;\max_{1 \leq j \leq m}\delta_n(j)
y n ∗ = a r g 1 ≤ j ≤ m max δ n ( j )
4,回溯:y i ∗ = Ψ i + 1 ( y i + 1 ∗ ) , i = n − 1 , n − 2 , . . . 1
y_i^* = \Psi_{i+1}(y_{i+1}^*)\;, i=n-1,n-2,...1
y i ∗ = Ψ i + 1 ( y i + 1 ∗ ) , i = n − 1 , n − 2 , . . . 1
最终得到的标记序列为:y ∗ = ( y 1 ∗ , y 2 ∗ , . . . y n ∗ )
y^* =(y_1^*,y_2^*,...y_n^*)
y ∗ = ( y 1 ∗ , y 2 ∗ , . . . y n ∗ )
维特比算法实例
假设输入的都是三个词的句子,即X = ( X 1 , X 2 , X 3 ) X=(X_1,X_2,X_3) X = ( X 1 , X 2 , X 3 ) ,输出的词性标记为Y = ( Y 1 , Y 2 , Y 3 ) Y=(Y_1,Y_2,Y_3) Y = ( Y 1 , Y 2 , Y 3 ) ,其中Y ∈ { 1 ( 名 词 ) , 2 ( 动 词 ) } Y \in \{1(名词),2(动词)\} Y ∈ { 1 ( 名 词 ) , 2 ( 动 词 ) } 。
这里只标记出取值为1的特征函数如下:t 1 = t 1 ( y i − 1 = 1 , y i = 2 , x , i ) , i = 2 , 3 , λ 1 = 1 t 2 = t 2 ( y 1 = 1 , y 2 = 1 , x , 2 ) λ 2 = 0.6 t 3 = t 3 ( y 2 = 2 , y 3 = 1 , x , 3 ) λ 3 = 1 t 4 = t 4 ( y 1 = 2 , y 2 = 1 , x , 2 ) λ 4 = 1 t 5 = t 5 ( y 2 = 2 , y 3 = 2 , x , 3 ) λ 5 = 0.2 s 1 = s 1 ( y 1 = 1 , x , 1 ) μ 1 = 1 s 2 = s 2 ( y i = 2 , x , i ) , i = 1 , 2 , μ 2 = 0.5 s 3 = s 3 ( y i = 1 , x , i ) , i = 2 , 3 , μ 3 = 0.8 s 4 = s 4 ( y 3 = 2 , x , 3 ) μ 4 = 0.5
t_1 =t_1(y_{i-1} = 1, y_i =2,x,i), i =2,3,\;\;\lambda_1=1\\
t_2 =t_2(y_1=1,y_2=1,x,2)\;\;\lambda_2=0.6\\
t_3 =t_3(y_2=2,y_3=1,x,3)\;\;\lambda_3=1\\
t_4 =t_4(y_1=2,y_2=1,x,2)\;\;\lambda_4=1\\
t_5 =t_5(y_2=2,y_3=2,x,3)\;\;\lambda_5=0.2\\
s_1 =s_1(y_1=1,x,1)\;\;\mu_1 =1\\
s_2 =s_2( y_i =2,x,i), i =1,2,\;\;\mu_2=0.5\\
s_3 =s_3( y_i =1,x,i), i =2,3,\;\;\mu_3=0.8\\
s_4 =s_4(y_3=2,x,3)\;\;\mu_4 =0.5
t 1 = t 1 ( y i − 1 = 1 , y i = 2 , x , i ) , i = 2 , 3 , λ 1 = 1 t 2 = t 2 ( y 1 = 1 , y 2 = 1 , x , 2 ) λ 2 = 0 . 6 t 3 = t 3 ( y 2 = 2 , y 3 = 1 , x , 3 ) λ 3 = 1 t 4 = t 4 ( y 1 = 2 , y 2 = 1 , x , 2 ) λ 4 = 1 t 5 = t 5 ( y 2 = 2 , y 3 = 2 , x , 3 ) λ 5 = 0 . 2 s 1 = s 1 ( y 1 = 1 , x , 1 ) μ 1 = 1 s 2 = s 2 ( y i = 2 , x , i ) , i = 1 , 2 , μ 2 = 0 . 5 s 3 = s 3 ( y i = 1 , x , i ) , i = 2 , 3 , μ 3 = 0 . 8 s 4 = s 4 ( y 3 = 2 , x , 3 ) μ 4 = 0 . 5
求标记(1,2,2)的最可能的标记序列。
状态转移图如下:
首先初始化:δ 1 ( 1 ) = μ 1 s 1 = 1 δ 1 ( 2 ) = μ 2 s 2 = 0.5 Ψ 1 ( 1 ) = Ψ 1 ( 2 ) = s t a r t
\delta_1(1) = \mu_1s_1 = 1\;\;\;\delta_1(2) = \mu_2s_2 = 0.5\;\;\;\Psi_{1}(1) =\Psi_{1}(2) = start
δ 1 ( 1 ) = μ 1 s 1 = 1 δ 1 ( 2 ) = μ 2 s 2 = 0 . 5 Ψ 1 ( 1 ) = Ψ 1 ( 2 ) = s t a r t
接下来开始递推,先看位置2的:δ 2 ( 1 ) = m a x { δ 1 ( 1 ) + t 2 λ 2 + μ 3 s 3 , δ 1 ( 2 ) + t 4 λ 4 + μ 3 s 3 } = m a x { 1 + 0.6 + 0.8 , 0.5 + 1 + 0.8 } = 2.4
\begin{aligned}
\delta_2(1) &= max\{\delta_1(1) + t_2\lambda_2+\mu_3s_3, \delta_1(2) + t_4\lambda_4+\mu_3s_3 \} \\
&= max\{1+0.6+0.8,0.5+1+0.8\} \\
&=2.4\;\;\;\\
\end{aligned}
δ 2 ( 1 ) = m a x { δ 1 ( 1 ) + t 2 λ 2 + μ 3 s 3 , δ 1 ( 2 ) + t 4 λ 4 + μ 3 s 3 } = m a x { 1 + 0 . 6 + 0 . 8 , 0 . 5 + 1 + 0 . 8 } = 2 . 4
Ψ 2 ( 1 ) = 1
\Psi_{2}(1) =1
Ψ 2 ( 1 ) = 1
δ 2 ( 2 ) = m a x { δ 1 ( 1 ) + t 1 λ 1 + μ 2 s 2 , δ 1 ( 2 ) + μ 2 s 2 } = m a x { 1 + 1 + 0.5 , 0.5 + 0.5 } = 2.5
\begin{aligned}
\delta_2(2) &= max\{\delta_1(1) + t_1\lambda_1+\mu_2s_2, \delta_1(2) + \mu_2s_2\}\\& = max\{1+1+0.5,0.5+0.5\} \\&=2.5\;\;\;
\end{aligned}
δ 2 ( 2 ) = m a x { δ 1 ( 1 ) + t 1 λ 1 + μ 2 s 2 , δ 1 ( 2 ) + μ 2 s 2 } = m a x { 1 + 1 + 0 . 5 , 0 . 5 + 0 . 5 } = 2 . 5
Ψ 2 ( 2 ) = 1
\Psi_{2}(2) =1
Ψ 2 ( 2 ) = 1
再看位置3的:δ 3 ( 1 ) = m a x { δ 2 ( 1 ) + μ 3 s 3 , δ 2 ( 2 ) + t 3 λ 3 + μ 3 s 3 } = m a x { 2.4 + 0.8 , 2.5 + 1 + 0.8 } = 4.3
\begin{aligned}
\delta_3(1) &= max\{\delta_2(1) +\mu_3s_3, \delta_2(2) + t_3\lambda_3+\mu_3s_3\} \\&= max\{2.4+0.8,2.5+1+0.8\} \\&=4.3
\end{aligned}
δ 3 ( 1 ) = m a x { δ 2 ( 1 ) + μ 3 s 3 , δ 2 ( 2 ) + t 3 λ 3 + μ 3 s 3 } = m a x { 2 . 4 + 0 . 8 , 2 . 5 + 1 + 0 . 8 } = 4 . 3
Ψ 3 ( 1 ) = 2
\Psi_{3}(1) =2
Ψ 3 ( 1 ) = 2
δ 3 ( 2 ) = m a x { δ 2 ( 1 ) + t 1 λ 1 + μ 4 s 4 , δ 2 ( 2 ) + t 5 λ 5 + μ 4 s 4 } = m a x { 2.4 + 1 + 0.5 , 2.5 + 0.2 + 0.5 } = 3.9
\begin{aligned}
\delta_3(2) &= max\{\delta_2(1) +t_1\lambda_1 + \mu_4s_4, \delta_2(2) + t_5\lambda_5+\mu_4s_4\} \\&= max\{2.4+1+0.5,2.5+0.2+0.5\} \\&=3.9
\end{aligned}
δ 3 ( 2 ) = m a x { δ 2 ( 1 ) + t 1 λ 1 + μ 4 s 4 , δ 2 ( 2 ) + t 5 λ 5 + μ 4 s 4 } = m a x { 2 . 4 + 1 + 0 . 5 , 2 . 5 + 0 . 2 + 0 . 5 } = 3 . 9
Ψ 3 ( 2 ) = 1
\Psi_{3}(2) =1
Ψ 3 ( 2 ) = 1
最终得到y 3 ∗ = arg m a x { δ 3 ( 1 ) , δ 3 ( 2 ) } y_3^* =\arg\;max\{\delta_3(1), \delta_3(2)\} y 3 ∗ = arg m a x { δ 3 ( 1 ) , δ 3 ( 2 ) } 递推回去,得到:y 2 ∗ = Ψ 3 ( 1 ) = 2 y 1 ∗ = Ψ 2 ( 2 ) = 1
y_2^* = \Psi_3(1) =2\;\;y_1^* = \Psi_2(2) =1
y 2 ∗ = Ψ 3 ( 1 ) = 2 y 1 ∗ = Ψ 2 ( 2 ) = 1
即最终的结果为( 1 , 2 , 1 ) (1,2,1) ( 1 , 2 , 1 ) ,即标记为(名词,动词,名词)。
LR & CRF & HMM
说明:
第一行是生成模型,第二行是判别模型
第一行是有向图模型,第二行是无向图模型
HMM可以看成朴素贝叶斯的扩展模型
CRF可以看成逻辑回归的扩展模型