Backpropagation Algorithm Derivation

13.3.1 Backpropagation Algorithm Derivation

The figure below shows the structure of a neural network. Since this article focuses on the role of the activation function during backpropagation, we will not plug in numerical values. Instead, we derive the update formulas for two weights as examples: how backpropagation updates the values of $w^2_{11}$ and $w^1_{11}$.

[Figure: structure of the example neural network]

13.3.1.1 Forward Propagation

First, note that throughout the network the inputs $i_1$, $i_2$ and all weight values are fixed; the weights are assigned randomly from some probability distribution when the network is initialized. The internal structure of $h_1$ is as follows:

[Figure: internal structure of neuron $h_1$]

Here $net_{h_1}$ denotes the weighted sum, $out_{h_1}$ denotes the value obtained by passing the weighted sum through the activation function, and $target_{o_1}$ denotes the label value. The computations are:

$net_{h_1} = i_1 \cdot w^1_{11} + i_2 \cdot w^1_{21} + b_1$

$out_{h_1} = f(net_{h_1})$

Similarly:

$net_{o_1} = out_{h_1} \cdot w^2_{11} + out_{h_2} \cdot w^2_{21} + b_2$

$out_{o_1} = f(net_{o_1})$

Therefore, the error at output $o_1$ can be computed as follows:

$E_{total} = E_{o_1} + E_{o_2}$

$E_{o_1} = \frac{1}{2}(target_{o_1} - out_{o_1})^2$
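The forward pass above can be sketched in a few lines of Python. The sigmoid activation and all numeric values below are illustrative assumptions; the text keeps $f$ generic and assigns no numbers.

```python
import math

def sigmoid(x):
    # One possible activation f; the derivation keeps f generic.
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs, weights, and biases (the text assigns no numbers).
i1, i2 = 0.05, 0.10
w1_11, w1_21, b1 = 0.15, 0.25, 0.35   # hidden-layer weights feeding h1
w2_11, w2_21, b2 = 0.40, 0.50, 0.60   # output-layer weights feeding o1
target_o1 = 0.01

# Forward pass for h1 (and an assumed second hidden neuron h2).
net_h1 = i1 * w1_11 + i2 * w1_21 + b1
out_h1 = sigmoid(net_h1)
net_h2 = i1 * 0.20 + i2 * 0.30 + b1   # assumed weights for h2
out_h2 = sigmoid(net_h2)

# Forward pass for o1, then the squared error E_o1.
net_o1 = out_h1 * w2_11 + out_h2 * w2_21 + b2
out_o1 = sigmoid(net_o1)
E_o1 = 0.5 * (target_o1 - out_o1) ** 2
print(out_o1, E_o1)
```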

13.3.1.2 Backward Propagation

1. Updating $w^2_{11}$

First, we compute how to update the weight $w^2_{11}$.

We use $\frac{\partial E_{total}}{\partial w^2_{11}}$ to measure how much the parameter $w^2_{11}$ influences the final error. Since $w^2_{11}$ does not affect $E_{o_2}$:

$\frac{\partial E_{total}}{\partial w^2_{11}} = \frac{\partial E_{o_1}}{\partial w^2_{11}} + \frac{\partial E_{o_2}}{\partial w^2_{11}} = \frac{\partial E_{o_1}}{\partial w^2_{11}}$

Then, by the chain rule, $\frac{\partial E_{o_1}}{\partial w^2_{11}}$ expands to:

$\frac{\partial E_{o_1}}{\partial w^2_{11}} = \frac{\partial E_{o_1}}{\partial out_{o_1}} \cdot \frac{\partial out_{o_1}}{\partial net_{o_1}} \cdot \frac{\partial net_{o_1}}{\partial w^2_{11}}$
The figure below illustrates this derivation, with orange arrows marking the derivation path.

[Figure: chain-rule path for $\partial E_{o_1} / \partial w^2_{11}$]

Next, we compute each derivative. From the error defined during forward propagation:

$E_{o_1} = \frac{1}{2}(target_{o_1} - out_{o_1})^2$

we have:

$\frac{\partial E_{o_1}}{\partial out_{o_1}} = 2 \cdot \frac{1}{2} \cdot (target_{o_1} - out_{o_1}) \cdot (0 - 1) = out_{o_1} - target_{o_1}$

Since

$out_{o_1} = f(net_{o_1})$

we see that the value of $\frac{\partial out_{o_1}}{\partial net_{o_1}}$ depends on the form of the activation function; we leave it in generic form here and defer its discussion.

From

$net_{o_1} = out_{h_1} \cdot w^2_{11} + out_{h_2} \cdot w^2_{21} + b_2$

we can derive:

$\frac{\partial net_{o_1}}{\partial w^2_{11}} = out_{h_1} + 0 + 0 = out_{h_1}$

Therefore, the influence of $w^2_{11}$ on the total error is:

$\frac{\partial E_{total}}{\partial w^2_{11}} = \frac{\partial E_{o_1}}{\partial w^2_{11}} = (out_{o_1} - target_{o_1}) \cdot \frac{\partial out_{o_1}}{\partial net_{o_1}} \cdot out_{h_1}$

In this formula, $out_{o_1}$, $target_{o_1}$, and $out_{h_1}$ are all fixed values, so $\frac{\partial out_{o_1}}{\partial net_{o_1}}$ is the only variable factor in the result; it depends on the activation function, and different activation functions have different derivatives.

With learning rate $\eta$, the update rule is:

$\tilde{w}^2_{11} = w^2_{11} - \eta \cdot \frac{\partial E_{total}}{\partial w^2_{11}}$
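The gradient and update rule above can be sketched as follows; the sigmoid is assumed as $f$ (so $f'(net_{o_1}) = out_{o_1}(1 - out_{o_1})$), and all numeric values are hypothetical since the text assigns none.

```python
# Hypothetical forward-pass results and hyperparameters (not from the text).
out_h1 = 0.6
out_o1, target_o1 = 0.75, 0.01
w2_11, eta = 0.40, 0.5

# With sigmoid as f, f'(net_o1) = out_o1 * (1 - out_o1);
# for another activation, substitute its derivative here.
d_out_d_net = out_o1 * (1 - out_o1)

# dE_total/dw2_11 = (out_o1 - target_o1) * f'(net_o1) * out_h1
grad = (out_o1 - target_o1) * d_out_d_net * out_h1

# Gradient-descent update of w2_11 with learning rate eta.
w2_11_new = w2_11 - eta * grad
print(grad, w2_11_new)
```

Here the error exceeds the target, so the gradient is positive and the update decreases the weight.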
2. Updating $w^1_{11}$

As shown in the figure below, the weight $w^1_{11}$ affects both $o_1$ and $o_2$. We therefore compute the influence of $w^1_{11}$ on the total loss via $\frac{\partial E_{total}}{\partial w^1_{11}}$.

[Figure: the two paths through which $w^1_{11}$ affects the outputs]

Since:

$E_{total} = E_{o_1} + E_{o_2}$

the sum rule for derivatives gives:

$\frac{\partial E_{total}}{\partial w^1_{11}} = \frac{\partial E_{o_1}}{\partial w^1_{11}} + \frac{\partial E_{o_2}}{\partial w^1_{11}}$

This splits the problem into two independent paths: solve for the weight's influence on each output's error separately, then add the two values to obtain its influence on the overall error.

First, we solve $\frac{\partial E_{o_1}}{\partial w^1_{11}}$:

$\frac{\partial E_{o_1}}{\partial w^1_{11}} = \frac{\partial E_{o_1}}{\partial out_{o_1}} \cdot \frac{\partial out_{o_1}}{\partial net_{o_1}} \cdot \frac{\partial net_{o_1}}{\partial out_{h_1}} \cdot \frac{\partial out_{h_1}}{\partial net_{h_1}} \cdot \frac{\partial net_{h_1}}{\partial w^1_{11}}$
We then solve each factor in turn:

$\frac{\partial E_{o_1}}{\partial out_{o_1}} = out_{o_1} - target_{o_1}$;

$\frac{\partial out_{o_1}}{\partial net_{o_1}}$ is a variable factor governed by the activation function;

since $net_{o_1} = w^2_{11} \cdot out_{h_1} + w^2_{21} \cdot out_{h_2} + b_2$, we have:

$\frac{\partial net_{o_1}}{\partial out_{h_1}} = w^2_{11} + 0 + 0 = w^2_{11}$;

$\frac{\partial out_{h_1}}{\partial net_{h_1}}$ is a variable factor whose form is governed by the activation function;

since $net_{h_1} = w^1_{11} \cdot i_1 + w^1_{21} \cdot i_2 + b_1$, we have:

$\frac{\partial net_{h_1}}{\partial w^1_{11}} = i_1$
Hence:

$\frac{\partial E_{o_1}}{\partial w^1_{11}} = (out_{o_1} - target_{o_1}) \cdot \frac{\partial out_{o_1}}{\partial net_{o_1}} \cdot w^2_{11} \cdot \frac{\partial out_{h_1}}{\partial net_{h_1}} \cdot i_1$

Similarly:

$\frac{\partial E_{o_2}}{\partial w^1_{11}} = (out_{o_2} - target_{o_2}) \cdot \frac{\partial out_{o_2}}{\partial net_{o_2}} \cdot w^2_{12} \cdot \frac{\partial out_{h_1}}{\partial net_{h_1}} \cdot i_1$

Therefore:

$\frac{\partial E_{total}}{\partial w^1_{11}} = (out_{o_1} - target_{o_1}) \cdot \frac{\partial out_{o_1}}{\partial net_{o_1}} \cdot w^2_{11} \cdot \frac{\partial out_{h_1}}{\partial net_{h_1}} \cdot i_1 + (out_{o_2} - target_{o_2}) \cdot \frac{\partial out_{o_2}}{\partial net_{o_2}} \cdot w^2_{12} \cdot \frac{\partial out_{h_1}}{\partial net_{h_1}} \cdot i_1$

So the updated value of $w^1_{11}$ is:

$\tilde{w}^1_{11} = w^1_{11} - \eta \cdot \frac{\partial E_{total}}{\partial w^1_{11}}$
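The two-path gradient for the hidden-layer weight can be sketched as below; again the sigmoid stands in for $f$, and every numeric value is a hypothetical stand-in rather than a value from the text.

```python
# Hypothetical forward-pass results and parameters (not from the text).
i1 = 0.05
out_h1 = 0.6
out_o1, target_o1 = 0.75, 0.01
out_o2, target_o2 = 0.77, 0.99
w2_11, w2_12 = 0.40, 0.45
w1_11, eta = 0.15, 0.5

# With sigmoid as f, f'(net) = out * (1 - out).
d_o1 = out_o1 * (1 - out_o1)   # dout_o1 / dnet_o1
d_o2 = out_o2 * (1 - out_o2)   # dout_o2 / dnet_o2
d_h1 = out_h1 * (1 - out_h1)   # dout_h1 / dnet_h1

# The path through o1 plus the path through o2, as in the final formula.
grad_via_o1 = (out_o1 - target_o1) * d_o1 * w2_11 * d_h1 * i1
grad_via_o2 = (out_o2 - target_o2) * d_o2 * w2_12 * d_h1 * i1
grad_total = grad_via_o1 + grad_via_o2

# Gradient-descent update of the hidden-layer weight.
w1_11_new = w1_11 - eta * grad_total
print(grad_total, w1_11_new)
```

With these example values, the path through $o_1$ (output above target) pushes the weight down, while the path through $o_2$ (output below target) pushes it up; the update uses their sum.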

13.3.1.3 Discussion

As the formulas above show, the weight parameters are learned and corrected by propagating gradually from the output layer backward. But the deeper the network, the more times the activation function must be differentiated, so the activation function plays a crucial role in learning. If the activation function's derivative is close to 0, then $\frac{\partial E_{total}}{\partial w^1_{11}}$ will also be close to 0; from the update formula $\tilde{w}^1_{11} = w^1_{11} - \eta \cdot \frac{\partial E_{total}}{\partial w^1_{11}}$ we can see that $\tilde{w}^1_{11}$ will barely change. This calls for a further examination of the properties of activation functions.
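This saturation effect can be illustrated with the sigmoid, used here purely as an example of an activation whose derivative can approach 0:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # Derivative of sigmoid: s(x) * (1 - s(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1 - s)

# The derivative decays toward 0 as |x| grows (the neuron saturates).
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_deriv(x))

# A chain of n saturated layers multiplies n such small factors together,
# so the gradient reaching early layers vanishes rapidly with depth.
n = 10
print(sigmoid_deriv(5.0) ** n)
```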
