深度之眼 Watermelon Book — Neural Network Notes



Neural Networks

Lei_ZM
2019-10-07



1. The Perceptron

1.1. Definition

Suppose the input space is $\mathcal{X}\subseteq \mathbb{R}^{n}$ and the output space is $\mathcal{Y}=\{1,0\}$. An input $\boldsymbol{x}\in\mathcal{X}$ is the feature vector of an instance, corresponding to a point in the input space; an output $y\in\mathcal{Y}$ is the class of the instance. The following function from the input space to the output space:

$$
f(\boldsymbol{x})=\operatorname{sgn}\left(\boldsymbol{w}^{T} \boldsymbol{x}+b\right)
$$

is called the perceptron, where $\boldsymbol{w}$ and $b$ are the model parameters and $\operatorname{sgn}$ is the step function:

$$
\operatorname{sgn}(z)=\left\{\begin{array}{ll} 1, & z \geqslant 0 \\ 0, & z<0 \end{array}\right.
$$
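As a quick illustration, here is a minimal NumPy sketch of this decision function (the parameter and input values are made up for demonstration):

```python
import numpy as np

def sgn(z):
    # Step function: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    # f(x) = sgn(w^T x + b)
    return sgn(w @ x + b)

# Hypothetical parameters and input, for illustration only
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])
print(perceptron(x, w, b))  # w^T x + b = 2.0 - 0.5 - 0.5 = 1.0 >= 0, so prints 1
```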



1.2. Geometric Interpretation of the Perceptron

The linear equation $\boldsymbol{w}^{T} \boldsymbol{x}+b=0$ corresponds to a hyperplane $S$ in the feature space (input space) $\mathbb{R}^{n}$, where $\boldsymbol{w}$ is the normal vector of the hyperplane and $b$ is its intercept. This hyperplane divides the feature space into two parts, and the points (feature vectors) on the two sides are classified as positive and negative, respectively. The hyperplane $S$ is therefore called the separating hyperplane, as shown in the figure:

[Figure: the separating hyperplane $S$]



1.3. Learning Strategy

Assuming the training set is linearly separable, the goal of perceptron learning is to find a hyperplane that completely and correctly separates the positive and negative instances. To find such a hyperplane $S$, i.e., to determine the model parameters $\boldsymbol{w}$ and $b$, we need a learning strategy: define a loss function and minimize it. A natural choice of loss function is the total number of misclassified points; however, such a loss is not a continuous, differentiable function of $\boldsymbol{w}$ and $b$ and is hard to optimize. The perceptron therefore uses the total distance from the misclassified points to the hyperplane as its loss.

The distance from a point $\boldsymbol{x}_{0}$ in the input space $\mathbb{R}^{n}$ to the hyperplane $S$ is:

$$
\frac{\left|\boldsymbol{w}^{T} \boldsymbol{x}_{0}+b\right|}{\|\boldsymbol{w}\|}
$$

where $\|\boldsymbol{w}\|$ denotes the $L_{2}$ norm of $\boldsymbol{w}$, i.e., its length.

If $b$ is viewed as the weight of a dummy node with constant input 1, i.e., absorbed into the weight vector as $\hat{\boldsymbol{w}}=(\boldsymbol{w}; b)$ with $\hat{\boldsymbol{x}}=(\boldsymbol{x}; 1)$, the distance becomes:

$$
\frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{0}\right|}{\|\hat{\boldsymbol{w}}\|}
$$
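A small numeric sketch of this distance formula (the hyperplane and point below are made-up values):

```python
import numpy as np

def distance_to_hyperplane(w_hat, x_hat):
    # |w_hat^T x_hat| / ||w_hat||, with b already folded into w_hat
    return abs(w_hat @ x_hat) / np.linalg.norm(w_hat)

# Hyperplane 3*x1 + 4*x2 - 5 = 0 written as w_hat = (3, 4, -5);
# the point (1, 1) becomes x_hat = (1, 1, 1) with the dummy input 1
w_hat = np.array([3.0, 4.0, -5.0])
x_hat = np.array([1.0, 1.0, 1.0])
print(distance_to_hyperplane(w_hat, x_hat))  # |3 + 4 - 5| / sqrt(50) ≈ 0.283
```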

Let $M$ be the set of misclassified points. The total distance from all misclassified points to the hyperplane $S$ is:

$$
\sum_{\hat{\boldsymbol{x}}_{i}\in M} \frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right|}{\|\hat{\boldsymbol{w}}\|}
$$

Moreover, for any misclassified point $\hat{\boldsymbol{x}}_{i}\in M$ we have:

$$
\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}>0
$$

where $\hat{y}_{i}$ is the current output of the perceptron. (If $\hat{\boldsymbol{x}}_{i}$ is misclassified, then either $y_{i}=1$, $\hat{y}_{i}=0$ and $\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i}<0$, or $y_{i}=0$, $\hat{y}_{i}=1$ and $\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i}\geqslant 0$; in both cases $(\hat{y}_{i}-y_{i})\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i}=\left|\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i}\right|$, which removes the absolute value.) Hence the total distance from all misclassified points to the hyperplane $S$ is:

$$
\sum_{\hat{\boldsymbol{x}}_{i} \in M} \frac{\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}}{\|\hat{\boldsymbol{w}}\|}
$$

After training finishes there are no misclassified points, so the loss is 0 regardless of the denominator $\|\hat{\boldsymbol{w}}\|$; the factor $\frac{1}{\|\hat{\boldsymbol{w}}\|}$ can therefore be dropped. This yields the perceptron loss function:

$$
L\left(\hat{\boldsymbol{w}}\right)=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}
$$

Clearly, the loss function $L\left(\hat{\boldsymbol{w}}\right)$ is non-negative, and it equals 0 when there are no misclassified points. The fewer the misclassified points, and the closer they lie to the hyperplane, the smaller the loss. For a misclassified point the loss is a linear function of the parameters $\hat{\boldsymbol{w}}$, and a correctly classified point contributes 0. Hence, for a given training set, $L\left(\hat{\boldsymbol{w}}\right)$ is continuous and differentiable in $\hat{\boldsymbol{w}}$.
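A direct transcription of this loss into NumPy (a sketch; the data below are made-up toy values):

```python
import numpy as np

def perceptron_loss(w_hat, X_hat, y):
    # L(w_hat) = sum over misclassified points of (y_hat_i - y_i) * w_hat^T x_hat_i
    z = X_hat @ w_hat
    y_hat = np.where(z >= 0, 1, 0)
    mis = y_hat != y                       # mask of misclassified points
    return np.sum((y_hat[mis] - y[mis]) * z[mis])

# Made-up toy data: rows are x_hat = (x1, x2, 1), labels in {0, 1}
X_hat = np.array([[1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([0, 1])
w_hat = np.array([1.0, 1.0, 0.0])
print(perceptron_loss(w_hat, X_hat, y))    # both points misclassified: 3 + 2 = 5
```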



1.4. Algorithm

The perceptron learning algorithm solves the following optimization problem. Given a training set:

$$
T=\left\{\left(\hat{\boldsymbol{x}}_{1}, y_{1}\right),\left(\hat{\boldsymbol{x}}_{2}, y_{2}\right), \cdots,\left(\hat{\boldsymbol{x}}_{N}, y_{N}\right)\right\}
$$

where $\hat{\boldsymbol{x}}_{i}\in \mathbb{R}^{n+1}$ and $y_{i}\in \{0, 1\}$, find the parameter vector $\hat{\boldsymbol{w}}$ that minimizes the following loss function:

$$
L\left(\hat{\boldsymbol{w}}\right)=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}
$$

where $M$ is the set of misclassified points.

The perceptron learning algorithm is misclassification-driven and uses stochastic gradient descent. First, an arbitrary initial hyperplane $\hat{\boldsymbol{w}}_{0}^{T} \hat{\boldsymbol{x}}=0$ is chosen; the loss $L\left(\hat{\boldsymbol{w}}\right)$ is then minimized by gradient descent. Rather than taking one step using all misclassified points in $M$ at once, each step picks a single misclassified point at random. The gradient of the loss is:

$$
\begin{aligned}
\nabla L(\hat{\boldsymbol{w}})=\frac{\partial L(\hat{\boldsymbol{w}})}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right] \\
&=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left[\left(\hat{y}_{i}-y_{i}\right) \frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)\right] \\
&\qquad \text{matrix calculus identity: } \frac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\boldsymbol{a} \\
&=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i}
\end{aligned}
$$

Picking one misclassified point $\hat{\boldsymbol{x}}_{i}$ at random and taking a gradient step gives the update rule for $\hat{\boldsymbol{w}}$:

$$
\begin{aligned}
\hat{\boldsymbol{w}} &\leftarrow \hat{\boldsymbol{w}}+\Delta\hat{\boldsymbol{w}} \\
&\qquad \Delta\hat{\boldsymbol{w}}=-\eta \nabla L(\hat{\boldsymbol{w}}) \\
&\leftarrow \hat{\boldsymbol{w}}-\eta \nabla L(\hat{\boldsymbol{w}}) \\
&\qquad \text{for a single misclassified point: } \nabla L(\hat{\boldsymbol{w}})=\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i} \\
&\leftarrow \hat{\boldsymbol{w}}-\eta\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i}=\hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right) \hat{\boldsymbol{x}}_{i}
\end{aligned}
$$

That is:

$$
\hat{\boldsymbol{w}}\leftarrow \hat{\boldsymbol{w}}+\Delta\hat{\boldsymbol{w}}=\hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right)\hat{\boldsymbol{x}}_{i}
\quad\Rightarrow\quad
\Delta\hat{\boldsymbol{w}}=\eta\left(y_{i}-\hat{y}_{i}\right)\hat{\boldsymbol{x}}_{i} \tag{Watermelon Book Eq. 5.2}
$$
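Putting the update rule to work, here is a minimal SGD training loop in NumPy (a sketch, not a reference implementation; the AND data set is a toy example):

```python
import numpy as np

def train_perceptron(X, y, eta=0.5, max_updates=1000):
    """SGD perceptron training: w_hat += eta * (y_i - y_hat_i) * x_hat_i.
    X: (N, n) inputs, y: (N,) labels in {0, 1}; returns w_hat of length n + 1."""
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # dummy input 1 absorbs b
    w_hat = np.zeros(X_hat.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(max_updates):
        y_hat = (X_hat @ w_hat >= 0).astype(int)
        mis = np.flatnonzero(y_hat != y)               # misclassified points
        if mis.size == 0:                              # separating hyperplane found
            break
        i = rng.choice(mis)                            # one random misclassified point
        w_hat += eta * (y[i] - y_hat[i]) * X_hat[i]    # Watermelon Book Eq. 5.2
    return w_hat

# Toy example: learn logical AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w_hat = train_perceptron(X, y)
X_hat = np.hstack([X, np.ones((4, 1))])
print((X_hat @ w_hat >= 0).astype(int))               # expected: [0 0 0 1]
```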




2. Neural Networks

2.1. Model Structure

The structure of a single-hidden-layer feedforward network is as follows:

[Figure: structure of a single-hidden-layer feedforward network]

where:

  • $D=\{(\boldsymbol{x}_{1}, \boldsymbol{y}_{1}), (\boldsymbol{x}_{2}, \boldsymbol{y}_{2}), \cdots, (\boldsymbol{x}_{m}, \boldsymbol{y}_{m})\}$, $\boldsymbol{x}_{i}\in \mathbb{R}^{d}$, $\boldsymbol{y}_{i}\in \mathbb{R}^{l}$: the training set

  • $d$: number of input neurons, i.e., the number of attributes describing an input example

  • $l$: number of output neurons, i.e., the dimension of the real-valued output vector

  • $q$: number of hidden-layer neurons

  • $\theta_{j}$: threshold of the $j$-th output-layer neuron

  • $\gamma_{h}$: threshold of the $h$-th hidden-layer neuron

  • $v_{ih}$: connection weight between the $i$-th input-layer neuron and the $h$-th hidden-layer neuron

  • $w_{hj}$: connection weight between the $h$-th hidden-layer neuron and the $j$-th output-layer neuron

  • $\alpha_{h}=\sum_{i=1}^{d} v_{ih} x_{i}$: input received by the $h$-th hidden-layer neuron

  • $\beta_{j}=\sum_{h=1}^{q} w_{hj} b_{h}$: input received by the $j$-th output-layer neuron

  • $b_{h}$: output of the $h$-th hidden-layer neuron (a forward-pass sketch follows this list)
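Here is a minimal forward-pass sketch of this network in NumPy (the shapes follow the notation above; the sizes and values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, gamma, W, theta):
    """x: (d,) input; V: (d, q) weights v_ih; gamma: (q,) hidden thresholds;
    W: (q, l) weights w_hj; theta: (l,) output thresholds."""
    alpha = x @ V                  # alpha_h = sum_i v_ih * x_i
    b = sigmoid(alpha - gamma)     # b_h = f(alpha_h - gamma_h), hidden outputs
    beta = b @ W                   # beta_j = sum_h w_hj * b_h
    y_hat = sigmoid(beta - theta)  # y_hat_j = f(beta_j - theta_j)
    return b, y_hat

# Illustrative sizes: d = 3 inputs, q = 4 hidden units, l = 2 outputs
rng = np.random.default_rng(0)
d, q, l = 3, 4, 2
V, gamma = rng.normal(size=(d, q)), rng.normal(size=q)
W, theta = rng.normal(size=(q, l)), rng.normal(size=l)
b, y_hat = forward(rng.normal(size=d), V, gamma, W, theta)
print(y_hat.shape)  # (2,)
```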



2.2. The Standard BP Algorithm

Given a training sample $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$, suppose the network's output is $\hat{\boldsymbol{y}}_{k}=(\hat{y}_{1}^{k}, \hat{y}_{2}^{k}, \cdots, \hat{y}_{l}^{k})$, i.e.:

$$
\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right) \tag{Watermelon Book Eq. 5.3}
$$

where $f$ is the sigmoid function. The mean squared error of the network on the training sample $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$ is then:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2} \tag{Watermelon Book Eq. 5.4}
$$
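In code this per-sample error is a one-liner (a sketch; `y_hat` and `y` are length-$l$ NumPy vectors with made-up values):

```python
import numpy as np

def mse_single(y_hat, y):
    # E_k = (1/2) * sum_j (y_hat_j - y_j)^2   (Watermelon Book Eq. 5.4)
    return 0.5 * np.sum((y_hat - y) ** 2)

print(mse_single(np.array([0.8, 0.3]), np.array([1.0, 0.0])))  # 0.5*(0.04+0.09) = 0.065
```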

If the model parameters are updated by gradient descent, the update rules are:

$$
\begin{aligned}
w_{h j} \leftarrow w_{h j}+\Delta w_{h j} &=w_{h j}-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\
\theta_{j} \leftarrow \theta_{j}+\Delta \theta_{j} &=\theta_{j}-\eta \frac{\partial E_{k}}{\partial \theta_{j}} \\
v_{i h} \leftarrow v_{i h}+\Delta v_{i h} &=v_{i h}-\eta \frac{\partial E_{k}}{\partial v_{i h}} \\
\gamma_{h} \leftarrow \gamma_{h}+\Delta \gamma_{h} &=\gamma_{h}-\eta \frac{\partial E_{k}}{\partial \gamma_{h}}
\end{aligned}
$$



2.2.1. Updating $w_{hj}$

The functional chain from $E_{k}$ to $w_{hj}$ is $E_{k} \to \hat{y}_{j}^{k} \to \beta_{j} \to w_{hj}$, with:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}, \qquad \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right), \qquad \beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}
$$

Hence:

$$
\frac{\partial E_{k}}{\partial w_{h j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}}
$$

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} &=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}} \\
&=\frac{1}{2} \times 2 \times\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \times 1 \\
&=\hat{y}_{j}^{k}-y_{j}^{k}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} &=\frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \beta_{j}} \\
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times 1 \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\beta_{j}-\theta_{j}\right) \times\left[1-f\left(\beta_{j}-\theta_{j}\right)\right] \\
&\qquad \text{by Watermelon Book Eq. 5.3: } \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right) \\
&=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \beta_{j}}{\partial w_{h j}} &=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial w_{h j}} \\
&=b_{h}
\end{aligned}
$$

Now define $g_{j}$ as:

$$
\begin{aligned}
g_{j} &=-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \\
&=-\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) f^{\prime}\left(\beta_{j}-\theta_{j}\right) \\
&=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right)
\end{aligned} \tag{Watermelon Book Eq. 5.10}
$$

Therefore:

$$
\begin{aligned}
\Delta w_{h j} &=-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\
&=-\eta \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}} \\
&=\eta g_{j} b_{h}
\end{aligned} \tag{Watermelon Book Eq. 5.11}
$$



2.2.2. Updating $\theta_{j}$

The functional chain from $E_{k}$ to $\theta_{j}$ is $E_{k} \to \hat{y}_{j}^{k} \to \theta_{j}$, with:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}, \qquad \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)
$$

Hence:

$$
\frac{\partial E_{k}}{\partial \theta_{j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}}
$$

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} &=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}} \\
&=\frac{1}{2} \times 2 \times\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \times 1 \\
&=\hat{y}_{j}^{k}-y_{j}^{k}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}} &=\frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \theta_{j}} \\
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times(-1) \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\beta_{j}-\theta_{j}\right) \times\left[1-f\left(\beta_{j}-\theta_{j}\right)\right] \times(-1) \\
&\qquad \text{by Watermelon Book Eq. 5.3: } \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right) \\
&=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \times(-1)
\end{aligned}
$$

Therefore:

$$
\begin{aligned}
\Delta \theta_{j} &=-\eta \frac{\partial E_{k}}{\partial \theta_{j}} \\
&=-\eta \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}} \\
&=-\eta\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \times(-1) \\
&=-\eta\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \cdot \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \\
&\qquad \text{by Watermelon Book Eq. 5.10: } g_{j}=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \\
&=-\eta g_{j}
\end{aligned} \tag{Watermelon Book Eq. 5.12}
$$
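A quick numeric transcription of Eqs. 5.10–5.12 (the forward-pass values below are made up for illustration):

```python
import numpy as np

# Made-up forward-pass results for q = 4 hidden units, l = 2 outputs
y     = np.array([1.0, 0.0])              # target for the current sample
y_hat = np.array([0.8, 0.3])              # network outputs (Eq. 5.3)
b     = np.array([0.5, 0.9, 0.2, 0.6])    # hidden outputs b_h
eta   = 0.1

g = y_hat * (1 - y_hat) * (y - y_hat)     # Eq. 5.10
delta_W = eta * np.outer(b, g)            # Eq. 5.11: delta w_hj = eta * g_j * b_h
delta_theta = -eta * g                    # Eq. 5.12: delta theta_j = -eta * g_j
print(g)  # [0.032, -0.063]: push y_hat_1 up toward 1, y_hat_2 down toward 0
```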

2.2.3. Updating $v_{ih}$

The functional chain from $E_{k}$ to $v_{ih}$ is $E_{k} \to \hat{y}_{j}^{k} \to \beta_{j} \to b_{h} \to \alpha_{h} \to v_{ih}$, with:

$$
\begin{gathered}
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}, \qquad \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right), \qquad \beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}, \\
b_{h}=f\left(\alpha_{h}-\gamma_{h}\right), \qquad \alpha_{h}=\sum_{i=1}^{d} v_{i h} x_{i}
\end{gathered}
$$

Hence:

$$
\frac{\partial E_{k}}{\partial v_{i h}}=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}}
$$

Here $v_{ih}$ appears in every output $\hat{y}_{j}$, so there are $l$ such chains, which are summed over $j$.

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} &=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}} \\
&=\frac{1}{2} \times 2 \times\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \times 1 \\
&=\hat{y}_{j}^{k}-y_{j}^{k}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} &=\frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \beta_{j}} \\
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times 1 \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\beta_{j}-\theta_{j}\right) \times\left[1-f\left(\beta_{j}-\theta_{j}\right)\right] \\
&\qquad \text{by Watermelon Book Eq. 5.3: } \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right) \\
&=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \beta_{j}}{\partial b_{h}} &=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial b_{h}} \\
&=w_{h j}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial b_{h}}{\partial \alpha_{h}} &=\frac{\partial\left[f\left(\alpha_{h}-\gamma_{h}\right)\right]}{\partial \alpha_{h}} \\
&=f^{\prime}\left(\alpha_{h}-\gamma_{h}\right) \times 1 \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\alpha_{h}-\gamma_{h}\right) \times\left[1-f\left(\alpha_{h}-\gamma_{h}\right)\right] \\
&=b_{h}\left(1-b_{h}\right)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \alpha_{h}}{\partial v_{i h}} &=\frac{\partial\left(\sum_{i=1}^{d} v_{i h} x_{i}\right)}{\partial v_{i h}} \\
&=x_{i}
\end{aligned}
$$

Now define $e_{h}$ as:

$$
\begin{aligned}
e_{h} &=-\frac{\partial E_{k}}{\partial \alpha_{h}} \\
&=-\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \\
&=\sum_{j=1}^{l}\left(-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}\right) \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \\
&=\sum_{j=1}^{l} g_{j} \cdot w_{h j} \cdot b_{h}\left(1-b_{h}\right) \\
&=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j}
\end{aligned} \tag{Watermelon Book Eq. 5.15}
$$

Therefore:

$$
\begin{aligned}
\Delta v_{i h} &=-\eta \frac{\partial E_{k}}{\partial v_{i h}} \\
&=-\eta \sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}} \\
&=\eta\left(-\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}}\right) \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}} \\
&=\eta e_{h} x_{i}
\end{aligned} \tag{Watermelon Book Eq. 5.13}
$$

2.2.4. Updating $\gamma_{h}$

The functional chain from $E_{k}$ to $\gamma_{h}$ is $E_{k} \to \hat{y}_{j}^{k} \to \beta_{j} \to b_{h} \to \gamma_{h}$, with:

$$
E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}, \qquad \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right), \qquad \beta_{j}=\sum_{h=1}^{q} w_{h j} b_{h}, \qquad b_{h}=f\left(\alpha_{h}-\gamma_{h}\right)
$$

Hence:

$$
\frac{\partial E_{k}}{\partial \gamma_{h}}=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}}
$$

Here $\gamma_{h}$ appears in every output $\hat{y}_{j}$, so there are $l$ such chains, which are summed over $j$.

where:

$$
\begin{aligned}
\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} &=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}} \\
&=\frac{1}{2} \times 2 \times\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \times 1 \\
&=\hat{y}_{j}^{k}-y_{j}^{k}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} &=\frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \beta_{j}} \\
&=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times 1 \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\beta_{j}-\theta_{j}\right) \times\left[1-f\left(\beta_{j}-\theta_{j}\right)\right] \\
&\qquad \text{by Watermelon Book Eq. 5.3: } \hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right) \\
&=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial \beta_{j}}{\partial b_{h}} &=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial b_{h}} \\
&=w_{h j}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial b_{h}}{\partial \gamma_{h}} &=\frac{\partial\left[f\left(\alpha_{h}-\gamma_{h}\right)\right]}{\partial \gamma_{h}} \\
&=f^{\prime}\left(\alpha_{h}-\gamma_{h}\right) \times(-1) \\
&\qquad \text{using } f^{\prime}(x)=f(x)(1-f(x)) \\
&=f\left(\alpha_{h}-\gamma_{h}\right) \times\left[1-f\left(\alpha_{h}-\gamma_{h}\right)\right] \times(-1) \\
&=b_{h}\left(1-b_{h}\right) \times(-1)
\end{aligned}
$$

Therefore:

$$
\begin{aligned}
\Delta \gamma_{h} &=-\eta \frac{\partial E_{k}}{\partial \gamma_{h}} \\
&=-\eta \sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}} \\
&=\eta \sum_{j=1}^{l}\left(-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}\right) \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}} \\
&=\eta \sum_{j=1}^{l} g_{j} \cdot w_{h j} \cdot b_{h}\left(1-b_{h}\right) \times(-1) \\
&=-\eta b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j} \\
&\qquad \text{by Watermelon Book Eq. 5.15: } e_{h}=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j} \\
&=-\eta e_{h}
\end{aligned} \tag{Watermelon Book Eq. 5.14}
$$
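Putting Eqs. 5.10–5.15 together, the whole standard-BP step for one sample fits in a few lines. The following NumPy sketch (the network sizes, seed, and XOR toy data are my own illustrative choices, not from the book) performs a forward pass and then applies all four updates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, V, gamma, W, theta, eta=0.1):
    """One standard-BP update on a single sample (x, y), in place on the parameters.
    Shapes: x (d,), y (l,), V (d, q), gamma (q,), W (q, l), theta (l,)."""
    # Forward pass
    b = sigmoid(x @ V - gamma)              # hidden outputs b_h
    y_hat = sigmoid(b @ W - theta)          # outputs y_hat_j (Eq. 5.3)
    # Output-layer gradient term (Eq. 5.10)
    g = y_hat * (1 - y_hat) * (y - y_hat)
    # Hidden-layer gradient term (Eq. 5.15): e_h = b_h(1-b_h) * sum_j w_hj * g_j
    e = b * (1 - b) * (W @ g)
    # Parameter updates (Eqs. 5.11-5.14)
    W += eta * np.outer(b, g)               # delta w_hj   = eta * g_j * b_h
    theta += -eta * g                       # delta theta_j = -eta * g_j
    V += eta * np.outer(x, e)               # delta v_ih   = eta * e_h * x_i
    gamma += -eta * e                       # delta gamma_h = -eta * e_h
    return 0.5 * np.sum((y_hat - y) ** 2)   # E_k before the update (Eq. 5.4)

# Toy run: fit a 2-4-1 network to XOR by repeated single-sample updates
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
V, gamma = rng.normal(size=(2, 4)), rng.normal(size=4)
W, theta = rng.normal(size=(4, 1)), rng.normal(size=1)
for epoch in range(10000):
    for k in range(4):
        bp_step(X[k], Y[k], V, gamma, W, theta, eta=0.5)
print(sigmoid(sigmoid(X @ V - gamma) @ W - theta).round(2))
# roughly [[0.], [1.], [1.], [0.]] if training converged (depends on the initialization)
```

Note that, exactly as in the derivation, the hidden-layer term `e` reuses the output-layer term `g`: this backward reuse of gradients is what makes the algorithm "backpropagation".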