VGG-16 Forward and Backward Propagation: Formula Derivation

VGG-16 consists of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. Dropout and L2 regularization are applied to the first two fully connected layers to prevent overfitting, and the network is trained with batch gradient descent + Momentum, using cross-entropy as the loss function.

Notation

  • $n^l$: the number of nodes (convolution kernels) in layer $l$;

  • $k_{p,q}^l$: the convolution kernel connecting channel $p$ of layer $l$ to channel $q$ of layer $l-1$;

  • $b_p^l$: the bias of node (channel) $p$ in layer $l$;

  • $W^l$: the weight matrix of fully connected layer $l$;

  • $z^l$: the forward input of layer $l$, before the activation function;

  • $a^l$: the forward output of layer $l$, after the activation function.

Forward Propagation

ll层卷积操作公式:
zpl(i,j)=q=1nl1u=11v=11aql1(iu,jv)kp,ql(u,v)+bplapl(i,j)=ReLU(zpl(i,j)) z_{p}^{l}(i,j)=\sum\limits_{q=1}^{{{n}^{l-1}}}{\sum\limits_{u=-1}^{1}{\sum\limits_{v=-1}^{1}{a_{q}^{l-1}(i-u,j-v)k_{p,q}^{l}(u,v)}}}+b_{p}^{l} \\ a_{p}^{l}(i,j)=ReLU\left( z_{p}^{l}(i,j) \right)
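As a concrete illustration, here is a minimal NumPy sketch of this layer, written literally from the sums above (true convolution with $3\times3$ kernels, zero padding 1 and stride 1, as VGG uses; the function name and the `(channels, H, W)` array layout are illustrative assumptions, not part of the derivation):

```python
import numpy as np

def conv_forward(a_prev, k, b):
    """One VGG conv layer, transcribed from the formula.

    a_prev : (n_prev, H, W)      activations a^{l-1}, one 2-D map per channel q
    k      : (n_l, n_prev, 3, 3) kernels k^l_{p,q}(u,v) with u, v in {-1, 0, 1}
    b      : (n_l,)              biases b^l_p
    """
    n_prev, H, W = a_prev.shape
    n_l = k.shape[0]
    a_pad = np.pad(a_prev, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps H x W
    z = np.zeros((n_l, H, W))
    for p in range(n_l):
        for q in range(n_prev):
            for u in (-1, 0, 1):
                for v in (-1, 0, 1):
                    # a^{l-1}_q(i-u, j-v) k^l_{p,q}(u,v), vectorized over all (i, j)
                    z[p] += a_pad[q, 1 - u:1 - u + H, 1 - v:1 - v + W] * k[p, q, u + 1, v + 1]
        z[p] += b[p]
    return z, np.maximum(z, 0.0)                      # z^l and a^l = ReLU(z^l)
```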
ll层最大池化公式:
zpl(i,j)=max(apl1(2iu,2jv))u,v{0,1} z_{p}^{l}(i,j)=\max \left( a_{p}^{l-1}(2i-u,2j-v) \right)u,v\in \left\{ 0,1 \right\}
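A matching sketch of the $2\times2$, stride-2 max pooling. It also records a boolean argmax mask, which the upsampling step derived in the backward pass below will reuse (the blocked reshape is an implementation choice, not part of the formula):

```python
import numpy as np

def maxpool_forward(a_prev):
    """2x2, stride-2 max pooling over (channels, H, W); H and W assumed even.

    Returns the pooled map z and a boolean mask marking where, inside each
    2x2 window, the maximum sat -- the backward pass reuses this mask.
    """
    n, H, W = a_prev.shape
    blocks = a_prev.reshape(n, H // 2, 2, W // 2, 2)   # expose the 2x2 windows
    z = blocks.max(axis=(2, 4))                        # max over u, v in {0, 1}
    mask = blocks == z[:, :, None, :, None]            # argmax positions
    return z, mask
```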
After the first 18 layers of convolution and pooling, a $7\times7\times512$ feature map is obtained. It must be flattened into a 25,088-dimensional vector to serve as the input of the fully connected layers; denoting the flattening operation by $F$, this step outputs $a^{18}$:
$$a^{18}=F\left(\left\{z_p^{18}\right\}_{p=1,2,\cdots,512}\right)$$
The first two fully connected layers use dropout with keep probability $d$. The connectivity of the nodes in layer $l$ is described by $r^l$, which follows a Bernoulli distribution:
$$r^{l}\sim \mathrm{Bernoulli}(d)$$
The forward pass is then:
$$\tilde{a}^{l}=r^{l}\odot a^{l}$$
$$z^{l+1}=W^{l+1}\tilde{a}^{l}+b^{l+1}$$
$$a^{l+1}=\mathrm{ReLU}(z^{l+1})$$
where $\odot$ is the Hadamard product, i.e., element-wise multiplication.
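These three equations translate directly into NumPy; a minimal sketch assuming the paper's non-inverted dropout (at test time the activations would instead be scaled by $d$), with `rng` a NumPy random generator:

```python
import numpy as np

def fc_dropout_forward(a, W_next, b_next, d, rng):
    """One dropout + fully connected step, literal to the formulas above."""
    r = (rng.random(a.shape) < d).astype(a.dtype)  # r^l ~ Bernoulli(d), keep prob. d
    a_tilde = r * a                                # a~^l = r^l ⊙ a^l
    z_next = W_next @ a_tilde + b_next             # z^{l+1} = W^{l+1} a~^l + b^{l+1}
    a_next = np.maximum(z_next, 0.0)               # a^{l+1} = ReLU(z^{l+1})
    return a_next, (r, a_tilde, z_next)            # cache r^l and z^{l+1} for backprop
```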

The activation function of the output layer is the softmax:
$$a_{i}^{L}=\mathrm{softmax}(z_{i}^{L})=\frac{e^{z_{i}^{L}}}{\sum_{k=1}^{n^{L}}e^{z_{k}^{L}}}$$
Cross-entropy is used as the loss function:
$$L=-\sum_{i=1}^{n^{L}}y_{i}\log a_{i}^{L}$$
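In code, the softmax is usually evaluated after subtracting the largest logit, which leaves the formula unchanged but avoids overflow; a small sketch (the `1e-12` floor inside the log is a numerical-safety assumption, not part of the derivation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift by max(z): same result, no overflow
    return e / e.sum()

def cross_entropy(a, y):
    # L = -sum_i y_i log a_i, with y a one-hot label vector
    return -np.sum(y * np.log(a + 1e-12))
```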

Backward Propagation

Introduce the intermediate variable $\delta^l$, the error of layer $l$, defined as the gradient of the loss with respect to the forward input $z^l$ of layer $l$, i.e., $\delta^l=\frac{\partial L}{\partial z^l}$.

The partial derivatives of the softmax function are computed as follows.

i=ji=j时,
zj(ezjk=1nezk)=ezjk=1nezk(ezj)2(k=1nezk)2=aj(1aj) \frac{\partial }{\partial {{z}_{j}}}\left( \frac{{{e}^{{{z}_{j}}}}}{\sum\nolimits_{k=1}^{n}{{{e}^{{{z}_{k}}}}}} \right)=\frac{{{e}^{{{z}_{j}}}}\sum\nolimits_{k=1}^{n}{{{e}^{{{z}_{k}}}}}-{{\left( {{e}^{{{z}_{j}}}} \right)}^{2}}}{{{\left( \sum\nolimits_{k=1}^{n}{{{e}^{{{z}_{k}}}}} \right)}^{2}}} ={{a}_{j}}\left( 1-{{a}_{j}} \right)
iji\ne j时,
zj(ezik=1nezk)=eziezj(k=1nezk)2=aiaj \frac{\partial }{\partial {{z}_{j}}}\left( \frac{{{e}^{{{z}_{i}}}}}{\sum\nolimits_{k=1}^{n}{{{e}^{{{z}_{k}}}}}} \right)=\frac{-{{e}^{{{z}_{i}}}}{{e}^{{{z}_{j}}}}}{{{\left( \sum\nolimits_{k=1}^{n}{{{e}^{{{z}_{k}}}}} \right)}^{2}}}=-{{a}_{i}}{{a}_{j}}
The error of the $j$-th node of the output layer is:
$$\begin{aligned} \delta _{j}^{L}&=\frac{\partial L}{\partial z_{j}^{L}} \\ &=\sum_{i=1}^{n^{L}}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}} \\ &=\frac{\partial L}{\partial a_{j}^{L}}\frac{\partial a_{j}^{L}}{\partial z_{j}^{L}}+\sum_{i\ne j}\frac{\partial L}{\partial a_{i}^{L}}\frac{\partial a_{i}^{L}}{\partial z_{j}^{L}} \\ &=-\frac{y_{j}}{a_{j}^{L}}a_{j}^{L}(1-a_{j}^{L})+\sum_{i\ne j}-\frac{y_{i}}{a_{i}^{L}}(-a_{i}^{L}a_{j}^{L}) \\ &=-y_{j}(1-a_{j}^{L})+a_{j}^{L}\sum_{i\ne j}y_{i} \\ &=a_{j}^{L}-y_{j} \end{aligned}$$
where the last step uses the fact that the one-hot label satisfies $\sum_{i}y_{i}=1$.
In vector form, the backpropagation error of the output layer is:
$$\delta^{L}=a^{L}-y$$
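The compact result $\delta^{L}=a^{L}-y$ is easy to verify numerically by perturbing each logit with a central finite difference; a self-contained check (sizes, seed, and the class index are arbitrary choices):

```python
import numpy as np

# Finite-difference check of delta^L = a^L - y for softmax + cross-entropy.
rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.eye(5)[2]                                    # one-hot target, class 2

def loss(z):
    a = np.exp(z - z.max()); a /= a.sum()           # softmax
    return -np.sum(y * np.log(a))                   # cross-entropy

a = np.exp(z - z.max()); a /= a.sum()
analytic = a - y                                    # delta^L from the derivation
eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(5)[j]) - loss(z - eps * np.eye(5)[j])) / (2 * eps)
                    for j in range(5)])
print(np.max(np.abs(analytic - numeric)))           # tiny, confirming delta^L = a^L - y
```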
ll个隐藏层第jj个节点的反向传播误差为:
δjl=Lzjl=i=1nl+1Lzil+1zil+1a~jla~jlajlajlzjl=i=1nl+1δil+1Wi,jl+1rjlReLU(zjl)=(W:,jl+1)Tδl+1rjlReLU(zjl) \begin{aligned} \delta _{j}^{l}&=\frac{\partial L}{\partial z_{j}^{l}}\\ &=\sum\limits_{i=1}^{{{n}^{l+1}}}{\frac{\partial L}{\partial z_{i}^{l+1}}\frac{\partial z_{i}^{l+1}}{\partial \tilde{a}_{j}^{l}}\frac{\partial \tilde{a}_{j}^{l}}{\partial a_{j}^{l}}\frac{\partial a_{j}^{l}}{\partial z_{j}^{l}}}\\ &=\sum_{i=1}^{n^{l+1}}{\delta^{l+1}_iW^{l+1}_{i,j}r^l_jReLU(z^l_j)'}\\ &={{\left( W_{:,j}^{l+1} \right)}^{T}}{{\delta }^{l+1}}r_{j}^{l}ReLU(z_{j}^{l}{)}' \end{aligned}
Therefore, the backpropagation error of fully connected layer $l$ is:
$$\delta^{l}=\left( W^{l+1} \right)^{T}\delta^{l+1}\odot r^{l}\odot \mathrm{ReLU}'(z^{l})$$
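Each factor is one line of NumPy; a sketch where `delta_next`, `W_next`, `r`, and `z` stand for $\delta^{l+1}$, $W^{l+1}$, $r^{l}$, and $z^{l}$:

```python
import numpy as np

def fc_backward(delta_next, W_next, r, z):
    """delta^l = (W^{l+1})^T delta^{l+1} ⊙ r^l ⊙ ReLU'(z^l)."""
    relu_grad = (z > 0).astype(z.dtype)   # ReLU'(z) is 1 where z > 0, else 0
    return (W_next.T @ delta_next) * r * relu_grad
```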
The backpropagation error from the fully connected layers back to the pooling layer is:
$$\delta^{18}=F^{-1}\left( (W^{19})^{T}\delta^{19} \right)$$
where $\delta^{18}$ is a $7\times7\times512$ tensor and $F^{-1}$ undoes the flattening $F$.

When deriving the backpropagation error of convolutional layer $l$ from $\delta^{l+1}$ of pooling layer $l+1$: for max pooling, we upsample $\delta^{l+1}$, placing each element of every channel at the position where the maximum was taken during the forward pass and setting all other elements to zero:
$$\begin{aligned} \delta _{p}^{l}&=\frac{\partial L}{\partial z_{p}^{l}} \\ &=\frac{\partial L}{\partial a_{p}^{l}}\frac{\partial a_{p}^{l}}{\partial z_{p}^{l}} \\ &=\mathrm{upsample}(\delta _{p}^{l+1})\odot \mathrm{ReLU}'(z_{p}^{l}) \end{aligned}$$
Therefore, the formula for computing a convolutional layer's backpropagation error from the pooling layer's error is:
$$\delta^{l}=\mathrm{upsample}(\delta^{l+1})\odot \mathrm{ReLU}'(z^{l})$$
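Using the argmax mask recorded by the pooling forward sketch earlier, the upsampling is a single broadcast (assuming `mask` keeps the blocked shape `(channels, H/2, 2, W/2, 2)` produced there; ties inside a window would split the error across positions, which real implementations usually resolve to one winner):

```python
import numpy as np

def upsample(delta_next, mask):
    """Route delta^{l+1} back through 2x2 max pooling: each error value goes
    to the position of the forward-pass maximum, zeros everywhere else."""
    n, h, w = delta_next.shape
    spread = mask * delta_next[:, :, None, :, None]  # broadcast into the 2x2 windows
    return spread.reshape(n, h * 2, w * 2)

# delta^l = upsample(delta_next, mask) * (z > 0), per the formula above.
```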
When deriving the backpropagation error of convolutional layer $l$ from $\delta^{l+1}$ of convolutional layer $l+1$, recall that
$$z_{p}^{l+1}=\sum_{q=1}^{n^{l}}a_{q}^{l}*k_{p,q}^{l+1}+b_{p}^{l+1}$$

$$\delta _{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\sum_{p=1}^{n^{l+1}}\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}\frac{\partial a_{q}^{l}}{\partial z_{q}^{l}}$$

$$\frac{\partial L}{\partial z_{p}^{l+1}}\frac{\partial z_{p}^{l+1}}{\partial a_{q}^{l}}=\delta _{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1})$$

Therefore,
$$\delta _{q}^{l}=\frac{\partial L}{\partial z_{q}^{l}}=\left[ \sum_{p=1}^{n^{l+1}}\delta _{p}^{l+1}*\mathrm{rot180}(k_{p,q}^{l+1}) \right]\odot \mathrm{ReLU}'(z_{q}^{l})$$
When layer $l$ is a pooling layer, $\mathrm{ReLU}'(z_{q}^{l})=1$.
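A literal NumPy sketch of this step: pad $\delta^{l+1}$, convolve each channel with the rotated kernel, sum over $p$, and gate by $\mathrm{ReLU}'$; the $3\times3$ / padding-1 geometry and array layout are the same illustrative assumptions as in the forward sketch:

```python
import numpy as np

def conv_backward_delta(delta_next, k_next, z):
    """delta^l_q = [sum_p delta^{l+1}_p * rot180(k^{l+1}_{p,q})] ⊙ ReLU'(z^l_q).

    delta_next : (n_next, H, W), k_next : (n_next, n_l, 3, 3), z : (n_l, H, W)
    """
    n_next, H, W = delta_next.shape
    n_l = k_next.shape[1]
    d_pad = np.pad(delta_next, ((0, 0), (1, 1), (1, 1)))  # same padding as forward
    delta = np.zeros((n_l, H, W))
    for q in range(n_l):
        for p in range(n_next):
            kr = k_next[p, q, ::-1, ::-1]                 # rot180 of the kernel
            for u in range(3):
                for v in range(3):
                    # convolution: delta(i-u, j-v) kr(u, v), offsets u, v in {-1, 0, 1}
                    delta[q] += kr[u, v] * d_pad[p, 2 - u:2 - u + H, 2 - v:2 - v + W]
        delta[q] *= (z[q] > 0)                            # ⊙ ReLU'(z^l_q)
    return delta
```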

Gradient Computation

Given the backpropagation error $\delta^l$ of fully connected layer $l$, the gradients $\frac{\partial L}{\partial W^l}$ and $\frac{\partial L}{\partial b^l}$ are computed as:
$$\frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l}}=\delta^{l}\left( a^{l-1} \right)^{T}$$

$$\frac{\partial L}{\partial b^{l}}=\frac{\partial L}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l}}=\delta^{l}$$
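Both fully connected gradients are one NumPy expression each; a sketch with `a_prev` standing for $a^{l-1}$ (or $\tilde{a}^{l-1}$ when the previous layer used dropout):

```python
import numpy as np

def fc_grads(delta, a_prev):
    """dL/dW^l = delta^l (a^{l-1})^T (an outer product); dL/db^l = delta^l."""
    return np.outer(delta, a_prev), delta
```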

Once the average gradients are obtained, the L2 regularization term with coefficient $\gamma$ is added; the dropout masks restrict the penalty to the weights between nodes that were active:
$$\frac{\partial L}{\partial W^{l}}\leftarrow \frac{\partial L}{\partial W^{l}}+\gamma \left( r^{l}\left( r^{l-1} \right)^{T} \right)\odot W^{l}$$
Given the backpropagation error $\delta^l$ of convolutional layer $l$, the gradients $\frac{\partial L}{\partial k^l_{p,q}}$ and $\frac{\partial L}{\partial b^l_p}$ are computed as:
$$\frac{\partial L}{\partial k_{p,q}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial k_{p,q}^{l}}=\delta _{p}^{l}*a_{q}^{l-1}$$

$$\frac{\partial L}{\partial b_{p}^{l}}=\frac{\partial L}{\partial z_{p}^{l}}\frac{\partial z_{p}^{l}}{\partial b_{p}^{l}}=\sum_{i}\sum_{j}\delta _{p}^{l}(i,j)$$
where the sum runs over all spatial positions $(i,j)$ of the channel.
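A literal sketch of both convolutional gradients; the input is padded exactly as in the forward pass so that $a^{l-1}(i-u,j-v)$ is defined at the borders (array names and layouts are the same illustrative assumptions as before):

```python
import numpy as np

def conv_grads(delta, a_prev):
    """Gradients of one conv layer from delta^l.

    delta  : (n_l, H, W)    backprop error delta^l
    a_prev : (n_prev, H, W) activations a^{l-1}, padded by 1 as in the forward pass
    """
    n_l, H, W = delta.shape
    n_prev = a_prev.shape[0]
    a_pad = np.pad(a_prev, ((0, 0), (1, 1), (1, 1)))
    dk = np.zeros((n_l, n_prev, 3, 3))
    for p in range(n_l):
        for q in range(n_prev):
            for u in range(3):
                for v in range(3):
                    # dL/dk(u,v) = sum_{i,j} delta(i,j) a^{l-1}(i-u, j-v)
                    dk[p, q, u, v] = np.sum(delta[p] * a_pad[q, 2 - u:2 - u + H, 2 - v:2 - v + W])
    db = delta.sum(axis=(1, 2))    # dL/db_p = sum_{i,j} delta_p(i,j)
    return dk, db
```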

Parameter Update

After the average gradients of a batch are obtained, the parameters are updated with batch gradient descent + Momentum. The velocities are:
$$\begin{aligned} v_{dk^{l}}&=\beta v_{dk^{l}}+(1-\beta )\frac{\partial L}{\partial k^{l}} \\ v_{db^{l}}&=\beta v_{db^{l}}+(1-\beta )\frac{\partial L}{\partial b^{l}} \\ v_{dW^{l}}&=\beta v_{dW^{l}}+(1-\beta )\frac{\partial L}{\partial W^{l}} \end{aligned}$$
The parameter update is:
$$\begin{aligned} k^{l}&=k^{l}-\alpha v_{dk^{l}} \\ b^{l}&=b^{l}-\alpha v_{db^{l}} \\ W^{l}&=W^{l}-\alpha v_{dW^{l}} \end{aligned}$$
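A compact in-place sketch of this update rule, applicable to any mix of kernels, biases, and weight matrices; the velocities start as zero arrays of the same shapes as the parameters, and the values of $\alpha$ and $\beta$ are arbitrary examples:

```python
import numpy as np

def momentum_step(params, grads, vels, alpha=0.01, beta=0.9):
    """v = beta * v + (1 - beta) * dL/dtheta, then theta -= alpha * v, in place."""
    for theta, g, v in zip(params, grads, vels):
        v *= beta
        v += (1.0 - beta) * g
        theta -= alpha * v
```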
