13.3.1 Derivation of the Backpropagation Algorithm
The figure below shows the structure of a neural network. Since this article focuses on the role of the activation function during backpropagation, we will not plug in numeric values; instead, we take two weights as a running example and derive how backpropagation updates the values of $w^2_{11}$ and $w^1_{11}$.
13.3.1.1 Forward Propagation
First, note that the inputs $i_1$, $i_2$ and all weights in the network are fixed values at this point; the weights were assigned randomly at initialization according to some probability distribution. The internal structure of $h_1$ is shown below.

Here, $net_{h_1}$ denotes the weighted sum, $out_{h_1}$ denotes the value obtained by passing the weighted sum through the activation function, and $target_{o_1}$ denotes the label value. They are computed as follows:
$$net_{h_1}=i_1*w^1_{11}+i_2*w^1_{21}+b_1$$

$$out_{h_1}=f(net_{h_1})$$
Similarly:

$$net_{o_1}=out_{h_1}*w^2_{11}+out_{h_2}*w^2_{21}+b_2$$

$$out_{o_1}=f(net_{o_1})$$
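As a concrete sketch of this forward pass, the equations above can be coded directly. The numeric values and the sigmoid activation below are illustrative assumptions; the derivation itself keeps $f$ generic:

```python
import math

def sigmoid(x):
    # One possible choice of activation f; the derivation keeps f generic.
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs, weights, and biases (the text fixes no numbers).
i1, i2 = 0.05, 0.10
w1_11, w1_21, b1 = 0.15, 0.25, 0.35   # layer-1 weights into h1, plus bias
w2_11, w2_21, b2 = 0.40, 0.50, 0.60   # layer-2 weights into o1, plus bias
out_h2 = 0.60                         # assume h2's output is already computed

net_h1 = i1 * w1_11 + i2 * w1_21 + b1    # weighted sum into h1
out_h1 = sigmoid(net_h1)                 # activation of h1

net_o1 = out_h1 * w2_11 + out_h2 * w2_21 + b2   # weighted sum into o1
out_o1 = sigmoid(net_o1)                         # activation of o1
```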
Therefore, the error at output $o_1$ can be computed as follows:

$$E_{total}=E_{o_1}+E_{o_2}$$

$$E_{o_1}=\frac{1}{2}(target_{o_1}-out_{o_1})^2$$
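These error terms are straightforward to evaluate. The output and target values below are purely illustrative:

```python
# Illustrative outputs and labels (assumed, not values from the text).
out_o1, out_o2 = 0.75, 0.77
target_o1, target_o2 = 0.01, 0.99

E_o1 = 0.5 * (target_o1 - out_o1) ** 2   # squared error for o1
E_o2 = 0.5 * (target_o2 - out_o2) ** 2   # squared error for o2
E_total = E_o1 + E_o2                    # total error over both outputs
```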
13.3.1.2 Backward Propagation

1. Updating $w^2_{11}$

First, we work out how to update the weight $w^2_{11}$.
We use $\frac{\partial E_{total}}{\partial w^2_{11}}$ to measure how much the parameter $w^2_{11}$ contributes to the final error. Since $w^2_{11}$ does not affect $E_{o_2}$:

$$\frac{\partial E_{total}}{\partial w^2_{11}}=\frac{\partial E_{o_1}}{\partial w^2_{11}}+\frac{\partial E_{o_2}}{\partial w^2_{11}}=\frac{\partial E_{o_1}}{\partial w^2_{11}}$$
Then, by the chain rule, $\frac{\partial E_{o_1}}{\partial w^2_{11}}$ expands to:

$$\frac{\partial E_{o_1}}{\partial w^2_{11}}=\frac{\partial E_{o_1}}{\partial out_{o_1}}*\frac{\partial out_{o_1}}{\partial net_{o_1}}*\frac{\partial net_{o_1}}{\partial w^2_{11}}$$
The figure below illustrates this derivation; the orange arrows mark the derivation path.

Next, we compute each derivative. From the forward pass, the error is:
$$E_{o_1}=\frac{1}{2}(target_{o_1}-out_{o_1})^2$$
Then:

$$\frac{\partial E_{o_1}}{\partial out_{o_1}}=2*\frac{1}{2}*(target_{o_1}-out_{o_1})*(0-1)=out_{o_1}-target_{o_1}$$
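A quick finite-difference check confirms this derivative (the values below are illustrative):

```python
# Illustrative values; any out/target pair works.
out_o1, target_o1 = 0.75, 0.01

E = lambda out: 0.5 * (target_o1 - out) ** 2   # the error as a function of out
analytic = out_o1 - target_o1                  # derivative from the formula

eps = 1e-6
# Central difference approximation of dE/dout at out_o1
numeric = (E(out_o1 + eps) - E(out_o1 - eps)) / (2 * eps)
```

Because $E$ is quadratic in $out_{o_1}$, the central difference matches the analytic derivative up to floating-point noise.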
Since $out_{o_1}=f(net_{o_1})$, the value of $\frac{\partial out_{o_1}}{\partial net_{o_1}}$ depends on the specific activation function. We will not discuss it here and keep it in generic form.
Since

$$net_{o_1}=out_{h_1}*w^2_{11}+out_{h_2}*w^2_{21}+b_2$$

we can derive:

$$\frac{\partial net_{o_1}}{\partial w^2_{11}}=out_{h_1}+0+0=out_{h_1}$$
Therefore, the influence of $w^2_{11}$ on the total error is:

$$\frac{\partial E_{total}}{\partial w^2_{11}}=\frac{\partial E_{o_1}}{\partial w^2_{11}}=(out_{o_1}-target_{o_1})*\frac{\partial out_{o_1}}{\partial net_{o_1}}*out_{h_1}$$
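To make this formula concrete, the sketch below instantiates $f$ as a sigmoid (whose standard identity is $\frac{\partial out_{o_1}}{\partial net_{o_1}}=out_{o_1}(1-out_{o_1})$) and checks the analytic gradient against a finite difference. All numbers are assumptions for illustration:

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Illustrative fixed quantities from a forward pass.
out_h1, out_h2 = 0.59, 0.60
w2_11, w2_21, b2 = 0.40, 0.50, 0.60
target_o1 = 0.01

def loss(w):
    # E_o1 viewed as a function of w2_11 alone; everything else is fixed.
    out_o1 = sigmoid(out_h1 * w + out_h2 * w2_21 + b2)
    return out_o1, 0.5 * (target_o1 - out_o1) ** 2

out_o1, _ = loss(w2_11)
# (out_o1 - target_o1) * f'(net_o1) * out_h1, with sigmoid's f'
grad = (out_o1 - target_o1) * out_o1 * (1.0 - out_o1) * out_h1

eps = 1e-6
# Central difference on E_o1 with respect to w2_11
numeric = (loss(w2_11 + eps)[1] - loss(w2_11 - eps)[1]) / (2 * eps)
```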
In the formula above, $out_{o_1}$, $target_{o_1}$, and $out_{h_1}$ are all fixed values, so $\frac{\partial out_{o_1}}{\partial net_{o_1}}$ is the only variable term affecting the result. It depends on the activation function, and different activation functions yield different derivatives.
Let $\eta$ be the learning rate; the update rule is then:

$$\tilde{w}^2_{11}=w^2_{11}-\eta*\frac{\partial E_{total}}{\partial w^2_{11}}$$
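The update itself is a single gradient-descent step; $\eta$ and the gradient value below are assumptions for illustration:

```python
eta = 0.5        # assumed learning rate
w2_11 = 0.40     # current weight value
grad = 0.08      # suppose this is the computed dE_total/dw2_11

# One gradient-descent step on w2_11
w2_11_new = w2_11 - eta * grad
```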
2. Updating $w^1_{11}$

As the figure below shows, the weight $w^1_{11}$ affects both $o_1$ and $o_2$. We therefore compute $\frac{\partial E_{total}}{\partial w^1_{11}}$ to obtain the influence of $w^1_{11}$ on the total loss.
Since:

$$E_{total}=E_{o_1}+E_{o_2}$$

the sum rule for derivatives gives:

$$\frac{\partial E_{total}}{\partial w^1_{11}}=\frac{\partial E_{o_1}}{\partial w^1_{11}}+\frac{\partial E_{o_2}}{\partial w^1_{11}}$$
This splits the problem into two independent paths: we compute the weight's effect on each output's error separately, then add the two values to obtain its effect on the overall error.

First, we solve for $\frac{\partial E_{o_1}}{\partial w^1_{11}}$:
$$\frac{\partial E_{o_1}}{\partial w^1_{11}}=\frac{\partial E_{o_1}}{\partial out_{o_1}}*\frac{\partial out_{o_1}}{\partial net_{o_1}}*\frac{\partial net_{o_1}}{\partial out_{h_1}}*\frac{\partial out_{h_1}}{\partial net_{h_1}}*\frac{\partial net_{h_1}}{\partial w^1_{11}}$$
Solving each factor in turn:

$$\frac{\partial E_{o_1}}{\partial out_{o_1}}=out_{o_1}-target_{o_1};$$
$\frac{\partial out_{o_1}}{\partial net_{o_1}}$ is a variable term determined by the activation function;
since $net_{o_1}=w^2_{11}*out_{h_1}+w^2_{21}*out_{h_2}+b_2$, we have:

$$\frac{\partial net_{o_1}}{\partial out_{h_1}}=w^2_{11}+0+0=w^2_{11};$$
$\frac{\partial out_{h_1}}{\partial net_{h_1}}$ is a variable term whose form is determined by the activation function;
since $net_{h_1}=w^1_{11}*i_1+w^1_{21}*i_2+b_1$, we have:

$$\frac{\partial net_{h_1}}{\partial w^1_{11}}=i_1.$$
Hence:

$$\frac{\partial E_{o_1}}{\partial w^1_{11}}=(out_{o_1}-target_{o_1})*\frac{\partial out_{o_1}}{\partial net_{o_1}}*w^2_{11}*\frac{\partial out_{h_1}}{\partial net_{h_1}}*i_1$$
Similarly:

$$\frac{\partial E_{o_2}}{\partial w^1_{11}}=(out_{o_2}-target_{o_2})*\frac{\partial out_{o_2}}{\partial net_{o_2}}*w^2_{12}*\frac{\partial out_{h_1}}{\partial net_{h_1}}*i_1$$
Therefore:

$$\frac{\partial E_{total}}{\partial w^1_{11}}=(out_{o_1}-target_{o_1})*\frac{\partial out_{o_1}}{\partial net_{o_1}}*w^2_{11}*\frac{\partial out_{h_1}}{\partial net_{h_1}}*i_1+(out_{o_2}-target_{o_2})*\frac{\partial out_{o_2}}{\partial net_{o_2}}*w^2_{12}*\frac{\partial out_{h_1}}{\partial net_{h_1}}*i_1$$
So the update for $w^1_{11}$ is:

$$\tilde{w}^1_{11}=w^1_{11}-\eta*\frac{\partial E_{total}}{\partial w^1_{11}}$$
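The two-path formula can be verified numerically. The sketch below assumes a sigmoid activation and illustrative parameters (names like `w1_12`, the weight from $i_1$ into $h_2$, follow our own naming convention), and compares the analytic gradient with a central finite difference:

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
dsig = lambda out: out * (1.0 - out)   # sigmoid's f'(net), written via out = f(net)

# Illustrative parameters (assumed, not from the text).
i1, i2 = 0.05, 0.10
w1_21, w1_12, w1_22 = 0.25, 0.20, 0.30
w2_11, w2_21, w2_12, w2_22 = 0.40, 0.50, 0.45, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

def total_error(w1_11):
    # Full forward pass, with E_total viewed as a function of w1_11 alone.
    out_h1 = sigmoid(i1 * w1_11 + i2 * w1_21 + b1)
    out_h2 = sigmoid(i1 * w1_12 + i2 * w1_22 + b1)
    out_o1 = sigmoid(out_h1 * w2_11 + out_h2 * w2_21 + b2)
    out_o2 = sigmoid(out_h1 * w2_12 + out_h2 * w2_22 + b2)
    E = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
    return out_h1, out_o1, out_o2, E

w1_11 = 0.15
out_h1, out_o1, out_o2, _ = total_error(w1_11)

# The two chain-rule paths from the derivation, summed.
path1 = (out_o1 - t1) * dsig(out_o1) * w2_11 * dsig(out_h1) * i1
path2 = (out_o2 - t2) * dsig(out_o2) * w2_12 * dsig(out_h1) * i1
grad = path1 + path2

eps = 1e-6
numeric = (total_error(w1_11 + eps)[3] - total_error(w1_11 - eps)[3]) / (2 * eps)
```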
13.3.1.3 Discussion
As the formulas above show, weight corrections propagate gradually from the output layer backward. The deeper the network, the more times the activation function must be differentiated along the way, so the activation function plays a crucial role in learning. If the activation function's derivative is close to 0, then $\frac{\partial E_{total}}{\partial w^1_{11}}$ is also close to 0, and the update rule $\tilde{w}^1_{11}=w^1_{11}-\eta*\frac{\partial E_{total}}{\partial w^1_{11}}$ shows that $\tilde{w}^1_{11}$ will barely change. This calls for a closer examination of the properties of activation functions.
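As a concrete illustration of this effect: the sigmoid's derivative $f'(net)=out(1-out)$ never exceeds $0.25$, so a chain of such factors shrinks geometrically with depth. A minimal sketch, assuming a hypothetical 10-layer chain:

```python
# Upper bound on each activation-derivative factor for a sigmoid:
# f'(net) = out * (1 - out) <= 0.25 (maximized at out = 0.5).
max_dsig = 0.25

# Product of 10 such factors, as would appear in a 10-layer
# chain-rule expansion back to an early-layer weight.
chain_bound = max_dsig ** 10
# chain_bound is already below 1e-6, so the gradient reaching early
# layers (and hence the weight update) is vanishingly small.
```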