Andrew Ng Machine Learning Exercise 4: Neural Network Learning (Backpropagation)
In this section, we use the backpropagation algorithm to compute the gradient of the neural network's cost function.
(1) Sigmoid Gradient
Recall from the previous sections that the sigmoid function is:
$g(z) = \frac{1}{1 + e^{-z}}$
Differentiating the sigmoid function gives its gradient:
$g'(z) = g(z)\left(1 - g(z)\right)$
The sigmoidGradient function is then:
function g = sigmoidGradient(z)
% Element-wise gradient of the sigmoid; z may be a scalar, vector, or matrix
g = sigmoid(z).*(1-sigmoid(z));
end
Testing the function:
>> g = sigmoidGradient(-200)
g =
1.3839e-87
>> g = sigmoidGradient(200)
g =
0
>> g = sigmoidGradient(0)
g =
0.2500
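Beyond these spot checks, the analytical gradient can also be compared against a central-difference approximation. A minimal sketch (the test point and tolerance below are arbitrary choices, not part of the exercise scripts):
z = 0.5;% arbitrary test point
e = 1e-6;% finite-difference step
numeric = (sigmoid(z + e) - sigmoid(z - e)) / (2 * e);% central difference
analytic = sigmoidGradient(z);
assert(abs(numeric - analytic) < 1e-9);% the two should agree closely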
(2) Random Initialization of the Weight Parameters
When training a neural network, random initialization of the parameters is important for symmetry breaking: if all weights start out identical, every hidden unit computes the same function and remains identical through gradient descent. An effective strategy is to draw each weight uniformly at random from the interval $[-\epsilon_{init}, \epsilon_{init}]$; here $\epsilon_{init}$ is chosen to be 0.12.
Another strategy is to choose $\epsilon_{init}$ based on the number of units in the network:
$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}} + \sqrt{L_{out}}}$
Complete the function W = randInitializeWeights(L_in, L_out). Here L_in is the number of input units and L_out the number of output units of the layer whose weights are being initialized; W is the resulting weight matrix, of dimension L_out×(L_in+1), where the extra column corresponds to the bias terms.
For example, the network in this exercise has three layers: an input layer of 400 units, a hidden layer of 25 units, and an output layer of 10 units.
To initialize Theta1, call Theta1 = randInitializeWeights(400, 25); the resulting Theta1 is 25×401.
To initialize Theta2, call Theta2 = randInitializeWeights(25, 10); the resulting Theta2 is 10×26.
The completed function:
function W = randInitializeWeights(L_in, L_out)
% Scale epsilon_init by the number of units on either side of the layer
epsilon_init = sqrt(6)/(sqrt(L_out)+sqrt(L_in));
% Draw each weight uniformly from [-epsilon_init, epsilon_init]
W = rand(L_out,1+L_in)*2*epsilon_init-epsilon_init;
end
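In the training script, both weight matrices are initialized this way and then unrolled into a single parameter vector for the optimizer. A sketch following the course's ex4 script (with input_layer_size = 400, hidden_layer_size = 25, num_labels = 10):
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);% 25*401
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);% 10*26
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];% unrolled column vector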
(3) Backpropagation
Given a training example, first run forward propagation to compute the activation of every unit (including the output), then propagate backwards to compute the error of each unit in every layer except the input layer.
The error of the third (output) layer:
$\delta^{(3)} = a^{(3)} - y$
The error of the second (hidden) layer:
$\delta^{(2)} = \left((\Theta^{(2)})^T \delta^{(3)}\right) \odot g'(z^{(2)})$
One small point to note: the first row of delta2, which corresponds to the bias unit, must be discarded before multiplying by the sigmoid gradient.
The gradient of the cost function is accumulated over all examples:
$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T, \quad \frac{\partial J}{\partial \Theta^{(l)}} = \frac{1}{m}\Delta^{(l)}$
Complete the function nnCostFunction:
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
T1_squ = sum(Theta1.*Theta1);% column sums of squared weights; the bias column is excluded below
T2_squ = sum(Theta2.*Theta2);
X = [ones(m,1),X];% prepend a bias column of ones to the inputs, 5000*401
A1 = X';% first-layer activations (the inputs), 401*5000
Z2 = Theta1*A1;% second-layer weighted inputs, 25*5000
A2 = [ones(1,m);sigmoid(Z2)];% second-layer activations plus bias row, 26*5000
Z3 = Theta2*A2;% third-layer weighted inputs, 10*5000
p = sigmoid(Z3);% network outputs, 10*5000
y_label = zeros(num_labels,m);% recode y as one-hot columns matching the output format
for i = 1:m
y_label(y(i),i) = 1;
end
J = sum(sum(y_label.*log(p)+(1-y_label).*log(1-p)))/(-m)...
+(sum(T1_squ(2:end))+sum(T2_squ(2:end)))*lambda/(2*m);
delta3 = p - y_label;% third-layer error, 10*5000
delta2 = Theta2'*delta3;
delta2 = delta2(2:end,:);% drop the bias row
delta2 = delta2.*sigmoidGradient(Z2);% second-layer error, 25*5000
Delta1 = delta2*A1';% accumulated gradient for Theta1, 25*401
Delta2 = delta3*A2';% accumulated gradient for Theta2, 10*26
Theta1_grad = Delta1./m;
Theta2_grad = Delta2./m;
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
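The code above is fully vectorized. The course handout describes the same computation as a loop over training examples; an equivalent per-example sketch (assuming X already has the bias column and y_label is the one-hot matrix built above):
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
a1 = X(t,:)';% 401*1
z2 = Theta1*a1;% 25*1
a2 = [1; sigmoid(z2)];% 26*1
a3 = sigmoid(Theta2*a2);% 10*1
d3 = a3 - y_label(:,t);% output-layer error
d2 = Theta2'*d3;% 26*1
d2 = d2(2:end).*sigmoidGradient(z2);% drop the bias row, 25*1
Delta1 = Delta1 + d2*a1';
Delta2 = Delta2 + d3*a2';
end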
Run the function checkNNGradients. By default lambda = 0, i.e. the gradient is not regularized. The result:
Checking Backpropagation...
-0.0093 -0.0093
0.0089 0.0089
-0.0084 -0.0084
0.0076 0.0076
-0.0067 -0.0067
-0.0000 -0.0000
0.0000 0.0000
-0.0000 -0.0000
0.0000 0.0000
-0.0000 -0.0000
-0.0002 -0.0002
0.0002 0.0002
-0.0003 -0.0003
0.0003 0.0003
-0.0004 -0.0004
-0.0001 -0.0001
0.0001 0.0001
-0.0001 -0.0001
0.0002 0.0002
-0.0002 -0.0002
0.3145 0.3145
0.1111 0.1111
0.0974 0.0974
0.1641 0.1641
0.0576 0.0576
0.0505 0.0505
0.1646 0.1646
0.0578 0.0578
0.0508 0.0508
0.1583 0.1583
0.0559 0.0559
0.0492 0.0492
0.1511 0.1511
0.0537 0.0537
0.0471 0.0471
0.1496 0.1496
0.0532 0.0532
0.0466 0.0466
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your backpropagation implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 2.45374e-11
Program paused. Press enter to continue.
The relative difference is far below 1e-9, so the backpropagation code is correct.
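Internally, checkNNGradients builds a small debug network and compares the analytical gradient against a central-difference approximation. A minimal sketch of that numerical check (following the course's computeNumericalGradient, which perturbs one parameter at a time with e = 1e-4):
e = 1e-4;
numgrad = zeros(size(nn_params));
perturb = zeros(size(nn_params));
for i = 1:numel(nn_params)
perturb(i) = e;
loss1 = nnCostFunction(nn_params - perturb, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
loss2 = nnCostFunction(nn_params + perturb, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
numgrad(i) = (loss2 - loss1)/(2*e);% central difference
perturb(i) = 0;
end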
(4) Regularized Neural Network
For the regularized network, the gradient gains an extra $\frac{\lambda}{m}\Theta^{(l)}_{ij}$ term for $j \ge 1$; the first column ($j = 0$, the bias weights) is not regularized. Modify the function as follows:
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
% Setup some useful variables
m = size(X, 1);
% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
% Vectorized computation:
T1_squ = sum(Theta1.*Theta1);
T2_squ = sum(Theta2.*Theta2);
X = [ones(m,1),X];% prepend a bias column of ones to the inputs, 5000*401
A1 = X';% first-layer activations (the inputs), 401*5000
Z2 = Theta1*A1;% second-layer weighted inputs, 25*5000
A2 = [ones(1,m);sigmoid(Z2)];% second-layer activations plus bias row, 26*5000
Z3 = Theta2*A2;% third-layer weighted inputs, 10*5000
p = sigmoid(Z3);% network outputs, 10*5000
y_label = zeros(num_labels,m);% recode y as one-hot columns matching the output format
for i = 1:m
y_label(y(i),i) = 1;
end
J = sum(sum(y_label.*log(p)+(1-y_label).*log(1-p)))/(-m)...
+(sum(T1_squ(2:end))+sum(T2_squ(2:end)))*lambda/(2*m);
delta3 = p - y_label;% third-layer error, 10*5000
delta2 = Theta2'*delta3;
delta2 = delta2(2:end,:);% drop the bias row
delta2 = delta2.*sigmoidGradient(Z2);% second-layer error, 25*5000
Delta1 = delta2*A1';% accumulated gradient for Theta1, 25*401
Delta2 = delta3*A2';% accumulated gradient for Theta2, 10*26
Theta1_grad = Delta1./m + Theta1.*lambda./m;% regularized gradient, 25*401
Theta1_grad(:,1) = Delta1(:,1)./m;% the bias column is not regularized
Theta2_grad = Delta2./m + Theta2.*lambda./m;% regularized gradient, 10*26
Theta2_grad(:,1) = Delta2(:,1)./m;% the bias column is not regularized
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Run:
lambda = 3;
checkNNGradients(lambda);
The result:
Checking Backpropagation (w/ Regularization) ...
-0.0093 -0.0093
0.0089 0.0089
-0.0084 -0.0084
0.0076 0.0076
-0.0067 -0.0067
-0.0168 -0.0168
0.0394 0.0394
0.0593 0.0593
0.0248 0.0248
-0.0327 -0.0327
-0.0602 -0.0602
-0.0320 -0.0320
0.0249 0.0249
0.0598 0.0598
0.0386 0.0386
-0.0174 -0.0174
-0.0576 -0.0576
-0.0452 -0.0452
0.0091 0.0091
0.0546 0.0546
0.3145 0.3145
0.1111 0.1111
0.0974 0.0974
0.1187 0.1187
0.0000 0.0000
0.0337 0.0337
0.2040 0.2040
0.1171 0.1171
0.0755 0.0755
0.1257 0.1257
-0.0041 -0.0041
0.0170 0.0170
0.1763 0.1763
0.1131 0.1131
0.0862 0.0862
0.1323 0.1323
-0.0045 -0.0045
0.0015 0.0015
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)
If your backpropagation implementation is correct, then
the relative difference will be small (less than 1e-9).
Relative Difference: 2.35158e-11
Cost at (fixed) debugging parameters (w/ lambda = 3.000000): 0.576051
(for lambda = 3, this value should be about 0.576051)
This confirms that the nnCostFunction code is implemented correctly.
(5) Training the Neural Network with fmincg
options = optimset('MaxIter', 50);
lambda = 1;
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
With the maximum number of iterations set to 50 and lambda set to 1, train the network with fmincg; the returned nn_params is the optimized (unrolled) weight vector and cost records the cost values reached over the iterations.
The output:
Iteration 1 | Cost: 3.395712e+00
Iteration 2 | Cost: 3.230467e+00
Iteration 3 | Cost: 3.180124e+00
Iteration 4 | Cost: 2.839368e+00
Iteration 5 | Cost: 2.438647e+00
Iteration 6 | Cost: 2.340916e+00
Iteration 7 | Cost: 2.031691e+00
Iteration 8 | Cost: 1.829924e+00
Iteration 9 | Cost: 1.662148e+00
Iteration 10 | Cost: 1.480869e+00
Iteration 11 | Cost: 1.364458e+00
Iteration 12 | Cost: 1.281460e+00
Iteration 13 | Cost: 1.223385e+00
Iteration 14 | Cost: 1.135034e+00
Iteration 15 | Cost: 1.096922e+00
Iteration 16 | Cost: 1.045772e+00
Iteration 17 | Cost: 9.922327e-01
Iteration 18 | Cost: 9.492212e-01
Iteration 19 | Cost: 8.940602e-01
Iteration 20 | Cost: 8.764712e-01
Iteration 21 | Cost: 8.618610e-01
Iteration 22 | Cost: 8.353473e-01
Iteration 23 | Cost: 8.189515e-01
Iteration 24 | Cost: 8.088698e-01
Iteration 25 | Cost: 7.940161e-01
Iteration 26 | Cost: 7.831477e-01
Iteration 27 | Cost: 7.681306e-01
Iteration 28 | Cost: 7.457756e-01
Iteration 29 | Cost: 7.199646e-01
Iteration 30 | Cost: 6.876955e-01
Iteration 31 | Cost: 6.786129e-01
Iteration 32 | Cost: 6.755323e-01
Iteration 33 | Cost: 6.538856e-01
Iteration 34 | Cost: 6.214977e-01
Iteration 35 | Cost: 6.031376e-01
Iteration 36 | Cost: 5.983392e-01
Iteration 37 | Cost: 5.921359e-01
Iteration 38 | Cost: 5.900204e-01
Iteration 39 | Cost: 5.831726e-01
Iteration 40 | Cost: 5.648102e-01
Iteration 41 | Cost: 5.430304e-01
Iteration 42 | Cost: 5.335807e-01
Iteration 43 | Cost: 5.308768e-01
Iteration 44 | Cost: 5.096503e-01
Iteration 45 | Cost: 5.050734e-01
Iteration 46 | Cost: 4.993757e-01
Iteration 47 | Cost: 4.968139e-01
Iteration 48 | Cost: 4.931216e-01
Iteration 49 | Cost: 4.910791e-01
Iteration 50 | Cost: 4.895258e-01
At the 50th iteration, the cost has dropped to 4.895258e-01.
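After training, the returned weight vector is reshaped back into the two weight matrices, mirroring the reshape at the top of nnCostFunction:
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));% 25*401
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));% 10*26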
(6) Visualizing the Hidden Layer
Inspect the character-image features captured by the hidden layer.
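In the course script this is done with the provided displayData helper, which renders each row of Theta1 (excluding the bias weight) as a 20×20 image:
displayData(Theta1(:, 2:end));% one image per hidden unit, 25 in total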
(7) Training Accuracy
The predict function is nearly identical to the one from the previous sections.
function p = predict(Theta1, Theta2, X)
m = size(X, 1);
num_labels = size(Theta2, 1);
p = zeros(size(X, 1), 1);
h1 = sigmoid([ones(m, 1) X] * Theta1');% hidden-layer activations, 5000*25
h2 = sigmoid([ones(m, 1) h1] * Theta2');% output-layer activations, 5000*10
[dummy, p] = max(h2, [], 2);% predicted label = index of the largest output
end
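Accuracy is then obtained by comparing the predictions against the labels, as in the course script (assuming y holds the true labels):
pred = predict(Theta1, Theta2, X);
fprintf('Training Set Accuracy: %f\n', mean(double(pred == y)) * 100);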
The output:
Training Set Accuracy: 95.100000
The training accuracy is 95.1%.
(8) Varying the lambda Parameter and the Number of Iterations
When lambda is changed to 0 and the number of iterations is raised to 200, the training accuracy reaches 100%. The change to the training call is sketched below.
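A minimal sketch, reusing the setup from section (5):
options = optimset('MaxIter', 200);% raise the iteration cap
lambda = 0;% disable regularization
costFunction = @(p) nnCostFunction(p, input_layer_size, ...
hidden_layer_size, num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
Rerunning prediction then reports: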
Training Set Accuracy: 100.000000
But this is actually overfitting: the hidden layer has captured too many features that are irrelevant to generalization.
The corresponding hidden-layer visualization is shown below: