Andrew Ng Machine Learning Exercise 4: Neural Network Learning (Backpropagation)
In this section, we use the backpropagation algorithm to compute the gradient of the neural network's cost function.
(1) Sigmoid Gradient
Recall from the previous sections that the sigmoid function is:
$g(z) = \frac{1}{1 + e^{-z}}$
Differentiating the sigmoid function gives its gradient:
$g'(z) = g(z)\left(1 - g(z)\right)$
The sigmoidGradient function is then:
function g = sigmoidGradient(z)
% Element-wise gradient of the sigmoid; z may be a scalar, vector, or matrix
g = sigmoid(z).*(1-sigmoid(z));
end
Testing the function:
>> g = sigmoidGradient(-200)
g =
1.3839e-87
>> g = sigmoidGradient(200)
g =
0
>> g = sigmoidGradient(0)
g =
0.2500
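Beyond these spot checks, the analytical gradient can also be compared against a central-difference approximation. A minimal sketch (the test point and tolerance below are arbitrary choices, not part of the exercise scripts):
z = 0.5;% arbitrary test point
e = 1e-6;% finite-difference step
numeric = (sigmoid(z + e) - sigmoid(z - e)) / (2 * e);% central difference
analytic = sigmoidGradient(z);
assert(abs(numeric - analytic) < 1e-9);% the two should agree closely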
(2) Random Initialization of the Weight Parameters
When training a neural network, random initialization of the parameters is important for symmetry breaking: if all weights start out identical, every hidden unit computes the same function and remains identical through gradient descent. An effective strategy is to draw each weight uniformly at random from the interval $[-\epsilon_{init}, \epsilon_{init}]$; here $\epsilon_{init}$ is chosen to be 0.12.
Another strategy is to choose $\epsilon_{init}$ based on the number of units in the network:
$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}} + \sqrt{L_{out}}}$
Complete the function W = randInitializeWeights(L_in, L_out). Here L_in is the number of input units and L_out the number of output units of the layer whose weights are being initialized; W is the resulting weight matrix, of dimension L_out×(L_in+1), where the extra column corresponds to the bias terms.
For example, the network in this exercise has three layers: an input layer of 400 units, a hidden layer of 25 units, and an output layer of 10 units.
To initialize Theta1, call Theta1 = randInitializeWeights(400, 25); the resulting Theta1 is 25×401.
To initialize Theta2, call Theta2 = randInitializeWeights(25, 10); the resulting Theta2 is 10×26.
The completed function:
function W = randInitializeWeights(L_in, L_out)
% Scale epsilon_init by the number of units on either side of the layer
epsilon_init = sqrt(6)/(sqrt(L_out)+sqrt(L_in));
% Draw each weight uniformly from [-epsilon_init, epsilon_init]
W = rand(L_out,1+L_in)*2*epsilon_init-epsilon_init;
end
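In the training script, both weight matrices are initialized this way and then unrolled into a single parameter vector for the optimizer. A sketch following the course's ex4 script (with input_layer_size = 400, hidden_layer_size = 25, num_labels = 10):
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);% 25*401
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);% 10*26
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];% unrolled column vector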
(3) Backpropagation
Given a training example, first run forward propagation to compute the activation of every unit (including the output), then propagate backwards to compute the error of each unit in every layer except the input layer.
The error of the third (output) layer:
$\delta^{(3)} = a^{(3)} - y$
The error of the second (hidden) layer:
$\delta^{(2)} = \left((\Theta^{(2)})^T \delta^{(3)}\right) \odot g'(z^{(2)})$
One small point to note: the first row of delta2, which corresponds to the bias unit, must be discarded before multiplying by the sigmoid gradient.
The gradient of the cost function is accumulated over all examples:
$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T, \quad \frac{\partial J}{\partial \Theta^{(l)}} = \frac{1}{m}\Delta^{(l)}$
Complete the function nnCostFunction:
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
T1_squ = sum(Theta1.*Theta1);% column sums of squared weights; the bias column is excluded below
T2_squ = sum(Theta2.*Theta2);
X = [ones(m,1),X];% prepend a bias column of ones to the inputs, 5000*401
A1 = X';% first-layer activations (the inputs), 401*5000
Z2 = Theta1*A1;% second-layer weighted inputs, 25*5000
A2 = [ones(1,m);sigmoid(Z2)];% second-layer activations plus bias row, 26*5000
Z3 = Theta2*A2;% third-layer weighted inputs, 10*5000
p = sigmoid(Z3);% network outputs, 10*5000
y_label = zeros(num_labels,m);% recode y as one-hot columns matching the output format
for i = 1:m
y_label(y(i),i) = 1;
end
J = sum(sum(y_label.*log(p)+(1-y_label).*log(1-p)))/(-m)...
+(sum(T1_squ(2:end))+sum(T2_squ(2:end)))*lambda/(2*m);
delta3 = p - y_label;% third-layer error, 10*5000
delta2 = Theta2'*delta3;
delta2 = delta2(2:end,:);% drop the bias row
delta2 = delta2.*sigmoidGradient(Z2);% second-layer error, 25*5000
Delta1 = delta2*A1';% accumulated gradient for Theta1, 25*401
Delta2 = delta3*A2';% accumulated gradient for Theta2, 10*26
Theta1_grad = Delta1./m;
Theta2_grad = Delta2./m;
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
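The code above is fully vectorized. The course handout describes the same computation as a loop over training examples; an equivalent per-example sketch (assuming X already has the bias column and y_label is the one-hot matrix built above):
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
a1 = X(t,:)';% 401*1
z2 = Theta1*a1;% 25*1
a2 = [1; sigmoid(z2)];% 26*1
a3 = sigmoid(Theta2*a2);% 10*1
d3 = a3 - y_label(:,t);% output-layer error
d2 = Theta2'*d3;% 26*1
d2 = d2(2:end).*sigmoidGradient(z2);% drop the bias row, 25*1
Delta1 = Delta1 + d2*a1';
Delta2 = Delta2 + d3*a2';
end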
Run the function checkNNGradients. By default lambda = 0, i.e. the gradient is not regularized. The result:
Checking Backpropagation...
-0.0093 -0.0093
0.0089 0.0089
-0.0084 -0.0084
0.0076 0.0076
-0.0067 -0.0067
-0.0000 -0.0000
0.0000 0.0000
-0.0000 -0.0000
0.0000 0.0000
-0.0000 -0.0000
-0.0002 -0.0002
0.0002 0.0002
-0.0003 -0.0003
0.0003 0.0003
-0.0004 -0.0004
-0.0001 -0.0001
0.0001 0.0001
-0.0001 -0.0001
0.0002 0.0002
-0.0002 -0.0002
0.3145 0.3145
0.1111 0.1111
0.0974 0.0974
0.1641 0.1641
0.0576 0.0576
0.0505 0.0505
0.1646 0.1646
0.0578 0.0578
0.0508 0.0508
0.1583 0.1583
0.0559 0.0559
0.0492 0.0492
0.1511 0.1511
0.0537 0.0537
0.0471 0.0471
0.1496 0.1496
0.0532 0.0532
0.0466 0.0466
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your backpropagation implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 2.45374e-11
Program paused. Press enter to continue.
The relative difference is far below 1e-9, so the backpropagation code is correct.
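Internally, checkNNGradients builds a small debug network and compares the analytical gradient against a central-difference approximation. A minimal sketch of that numerical check (following the course's computeNumericalGradient, which perturbs one parameter at a time with e = 1e-4):
e = 1e-4;
numgrad = zeros(size(nn_params));
perturb = zeros(size(nn_params));
for i = 1:numel(nn_params)
perturb(i) = e;
loss1 = nnCostFunction(nn_params - perturb, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
loss2 = nnCostFunction(nn_params + perturb, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
numgrad(i) = (loss2 - loss1)/(2*e);% central difference
perturb(i) = 0;
end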
(4) Regularized Neural Network
For the regularized network, the gradient gains an extra $\frac{\lambda}{m}\Theta^{(l)}_{ij}$ term for $j \ge 1$; the first column ($j = 0$, the bias weights) is not regularized. Modify the function as follows:
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
% Vectorized computation:
T1_squ = sum(Theta1.*Theta1);
T2_squ = sum(Theta2.*Theta2);
X = [ones(m,1),X];% prepend a bias column of ones to the inputs, 5000*401
A1 = X';% first-layer activations (the inputs), 401*5000
Z2 = Theta1*A1;% second-layer weighted inputs, 25*5000
A2 = [ones(1,m);sigmoid(Z2)];% second-layer activations plus bias row, 26*5000
Z3 = Theta2*A2;% third-layer weighted inputs, 10*5000
p = sigmoid(Z3);% network outputs, 10*5000
y_label = zeros(num_labels,m);% recode y as one-hot columns matching the output format
for i = 1:m
y_label(y(i),i) = 1;
end
J = sum(sum(y_label.*log(p)+(1-y_label).*log(1-p)))/(-m)...
+(sum(T1_squ(2:end))+sum(T2_squ(2:end)))*lambda/(2*m);
delta3 = p - y_label;% third-layer error, 10*5000
delta2 = Theta2'*delta3;
delta2 = delta2(2:end,:);% drop the bias row
delta2 = delta2.*sigmoidGradient(Z2);% second-layer error, 25*5000
Delta1 = delta2*A1';% accumulated gradient for Theta1, 25*401
Delta2 = delta3*A2';% accumulated gradient for Theta2, 10*26
Theta1_grad = Delta1./m + Theta1.*lambda./m;% regularized gradient, 25*401
Theta1_grad(:,1) = Delta1(:,1)./m;% the bias column is not regularized
Theta2_grad = Delta2./m + Theta2.*lambda./m;% regularized gradient, 10*26
Theta2_grad(:,1) = Delta2(:,1)./m;% the bias column is not regularized
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Run:
lambda = 3;
checkNNGradients(lambda);
The result:
Checking Backpropagation (w/ Regularization) ...
-0.0093 -0.0093
0.0089 0.0089
-0.0084 -0.0084
0.0076 0.0076
-0.0067 -0.0067
-0.0168 -0.0168
0.0394 0.0394
0.0593 0.0593
0.0248 0.0248
-0.0327 -0.0327
-0.0602 -0.0602
-0.0320 -0.0320
0.0249 0.0249
0.0598 0.0598
0.0386 0.0386
-0.0174 -0.0174
-0.0576 -0.0576
-0.0452 -0.0452
0.0091 0.0091
0.0546 0.0546
0.3145 0.3145
0.1111 0.1111
0.0974 0.0974
0.1187 0.1187
0.0000 0.0000
0.0337 0.0337
0.2040 0.2040
0.1171 0.1171
0.0755 0.0755
0.1257 0.1257
-0.0041 -0.0041
0.0170 0.0170
0.1763 0.1763
0.1131 0.1131
0.0862 0.0862
0.1323 0.1323
-0.0045 -0.0045
0.0015 0.0015
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your backpropagation implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 2.35158e-11
Cost at (fixed) debugging parameters (w/ lambda = 3.000000): 0.576051
(for lambda = 3, this value should be about 0.576051)
This confirms that the nnCostFunction code is implemented correctly.
(5) Training the Neural Network with fmincg
options = optimset('MaxIter', 50);
lambda = 1;
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
With the maximum number of iterations set to 50 and lambda set to 1, train the network with fmincg; the returned nn_params is the optimized (unrolled) weight vector and cost records the cost values reached over the iterations.
The output:
Iteration 1 | Cost: 3.395712e+00
Iteration 2 | Cost: 3.230467e+00
Iteration 3 | Cost: 3.180124e+00
Iteration 4 | Cost: 2.839368e+00
Iteration 5 | Cost: 2.438647e+00
Iteration 6 | Cost: 2.340916e+00
Iteration 7 | Cost: 2.031691e+00
Iteration 8 | Cost: 1.829924e+00
Iteration 9 | Cost: 1.662148e+00
Iteration 10 | Cost: 1.480869e+00
Iteration 11 | Cost: 1.364458e+00
Iteration 12 | Cost: 1.281460e+00
Iteration 13 | Cost: 1.223385e+00
Iteration 14 | Cost: 1.135034e+00
Iteration 15 | Cost: 1.096922e+00
Iteration 16 | Cost: 1.045772e+00
Iteration 17 | Cost: 9.922327e-01
Iteration 18 | Cost: 9.492212e-01
Iteration 19 | Cost: 8.940602e-01
Iteration 20 | Cost: 8.764712e-01
Iteration 21 | Cost: 8.618610e-01
Iteration 22 | Cost: 8.353473e-01
Iteration 23 | Cost: 8.189515e-01
Iteration 24 | Cost: 8.088698e-01
Iteration 25 | Cost: 7.940161e-01
Iteration 26 | Cost: 7.831477e-01
Iteration 27 | Cost: 7.681306e-01
Iteration 28 | Cost: 7.457756e-01
Iteration 29 | Cost: 7.199646e-01
Iteration 30 | Cost: 6.876955e-01
Iteration 31 | Cost: 6.786129e-01
Iteration 32 | Cost: 6.755323e-01
Iteration 33 | Cost: 6.538856e-01
Iteration 34 | Cost: 6.214977e-01
Iteration 35 | Cost: 6.031376e-01
Iteration 36 | Cost: 5.983392e-01
Iteration 37 | Cost: 5.921359e-01
Iteration 38 | Cost: 5.900204e-01
Iteration 39 | Cost: 5.831726e-01
Iteration 40 | Cost: 5.648102e-01
Iteration 41 | Cost: 5.430304e-01
Iteration 42 | Cost: 5.335807e-01
Iteration 43 | Cost: 5.308768e-01
Iteration 44 | Cost: 5.096503e-01
Iteration 45 | Cost: 5.050734e-01
Iteration 46 | Cost: 4.993757e-01
Iteration 47 | Cost: 4.968139e-01
Iteration 48 | Cost: 4.931216e-01
Iteration 49 | Cost: 4.910791e-01
Iteration 50 | Cost: 4.895258e-01
At the 50th iteration, the cost has dropped to 4.895258e-01.
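After training, the returned weight vector is reshaped back into the two weight matrices, mirroring the reshape at the top of nnCostFunction:
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));% 25*401
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));% 10*26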
(6) Visualizing the Hidden Layer
Inspect the character-image features captured by the hidden layer.
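In the course script this is done with the provided displayData helper, which renders each row of Theta1 (excluding the bias weight) as a 20×20 image:
displayData(Theta1(:, 2:end));% one image per hidden unit, 25 in total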
(7) Training Accuracy
The predict function is nearly identical to the one from the previous sections.
function p = predict(Theta1, Theta2, X)
m = size(X, 1);
num_labels = size(Theta2, 1);
p = zeros(size(X, 1), 1);
h1 = sigmoid([ones(m, 1) X] * Theta1');% hidden-layer activations, 5000*25
h2 = sigmoid([ones(m, 1) h1] * Theta2');% output-layer activations, 5000*10
[dummy, p] = max(h2, [], 2);% predicted label = index of the largest output
end
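Accuracy is then obtained by comparing the predictions against the labels, as in the course script (assuming y holds the true labels):
pred = predict(Theta1, Theta2, X);
fprintf('Training Set Accuracy: %f\n', mean(double(pred == y)) * 100);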
The output:
Training Set Accuracy: 95.100000
The training accuracy is 95.1%.
(8) Varying the lambda Parameter and the Number of Iterations
When lambda is changed to 0 and the number of iterations is raised to 200, the training accuracy reaches 100%. The change to the training call is sketched below.
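A minimal sketch, reusing the setup from section (5):
options = optimset('MaxIter', 200);% raise the iteration cap
lambda = 0;% disable regularization
costFunction = @(p) nnCostFunction(p, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
Rerunning prediction then reports: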
Training Set Accuracy: 100.000000
But this is actually overfitting: the hidden layer has captured too many features that are irrelevant to generalization.
The corresponding hidden-layer visualization is shown below: