12.4 Homework #4: Deep Learning
This is an assignment from my Advanced Data Science and Architecture course, which also happens to be part of my CNN face-recognition project. Out of laziness I edited it in ****'s markdown, so I'm just posting it here as-is. The code is not included.
Option B: Use Deep Learning for analysis of your project data.
Part A - Deep Learning model (40 points)
- For this project, we applied a Convolutional Neural Network (CNN) to make the machine recognize W's face and distinguish it from other people's.
- The dataset contains 11,500 face pictures of W, captured and processed using the Python libraries Dlib and OpenCV. The noise (non-W) pictures are mainly contributed by UMass Amherst and the University of Science and Technology of China.
- The method is a Convolutional Neural Network implemented with TensorFlow-GPU, which is developed by Google.
- The best accuracy so far exceeds 96%.
For the adjustments below, unless otherwise noted, we use a baseline model with the following configuration: ReLU as the activation function, softmax_cross_entropy as the loss function, 20 epochs, Adam as the gradient-estimation optimizer, and random_normal initialization of the parameters. The basic architecture is three pairs of convolution (3x3 filters, stride [1,1,1,1]) and pooling layers (max pooling, 2x2 window, stride [1,2,2,1]), followed by two fully connected layers, where the output of the last layer classifies yes vs. no (W's face or not).
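Since the code is not posted, here is a minimal sketch of this baseline in the TensorFlow 1.x API. The input size (64x64x3), the channel counts (32/64/64), and the width of the first fully connected layer (512) are assumptions, not values stated above.

```python
import tensorflow as tf

def conv_pool(x, out_channels, name):
    """One convolution (3x3 filter, stride [1,1,1,1]) + max-pooling
    (2x2 window, stride [1,2,2,1]) pair, as described above."""
    in_channels = x.get_shape().as_list()[-1]
    w = tf.Variable(tf.random_normal([3, 3, in_channels, out_channels],
                                     stddev=0.01), name=name + '_w')
    b = tf.Variable(tf.random_normal([out_channels], stddev=0.01),
                    name=name + '_b')
    conv = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1],
                                   padding='SAME') + b)
    return tf.nn.max_pool(conv, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

x = tf.placeholder(tf.float32, [None, 64, 64, 3])  # input faces (size assumed)
y = tf.placeholder(tf.float32, [None, 2])          # one-hot label: W / not W

h = conv_pool(x, 32, 'conv1')   # three convolution + pooling pairs
h = conv_pool(h, 64, 'conv2')
h = conv_pool(h, 64, 'conv3')

flat = tf.reshape(h, [-1, 8 * 8 * 64])             # 64 -> 32 -> 16 -> 8
w_fc1 = tf.Variable(tf.random_normal([8 * 8 * 64, 512], stddev=0.01))
b_fc1 = tf.Variable(tf.random_normal([512], stddev=0.01))
fc1 = tf.nn.relu(tf.matmul(flat, w_fc1) + b_fc1)   # first fully connected layer

w_fc2 = tf.Variable(tf.random_normal([512, 2], stddev=0.01))
b_fc2 = tf.Variable(tf.random_normal([2], stddev=0.01))
logits = tf.matmul(fc1, w_fc2) + b_fc2             # binary output: W or not

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
```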
Part B - Activation function (10 points)
This part shows how the activation function affects accuracy and training time (time to plateau). It contains an accuracy table and plots, as shown below.
Activation Function | Accuracy(%) |
---|---|
ReLU | 96.20 |
ELU | 95.05 |
TanH | 52.00 |
Sigmoid | 51.60 |
Softplus | 51.40 |
[Plots: accuracy curves over training for each activation function.]
Accuracy: From the plot and table above, ReLU gives the highest accuracy, 96.20%, although it is close to that of ELU (95.05%). Meanwhile, TanH, Sigmoid, and Softplus are not suitable for our task, since their accuracy is close to that of the naive rule (50%).
Plateauing time: From the plots, ReLU also plateaus very fast.
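For reference, the five activations above all exist as ops in the TF 1.x API, so the experiment amounts to swapping one function in the model-building code. A hedged sketch (the original code is not shown):

```python
import tensorflow as tf

# The five activation functions compared above, as TF 1.x ops. Swapping
# `act` into each convolution and fully connected layer of the Part A
# sketch reproduces this experiment.
activations = {
    'ReLU':     tf.nn.relu,
    'ELU':      tf.nn.elu,
    'TanH':     tf.nn.tanh,
    'Sigmoid':  tf.nn.sigmoid,
    'Softplus': tf.nn.softplus,
}

# e.g. inside conv_pool:
#   conv = act(tf.nn.conv2d(x, w, [1, 1, 1, 1], 'SAME') + b)
```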
Part C - Cost function (10 points)
For this part we change the loss function as follows.
Loss Function | Accuracy(%) |
---|---|
softmax_cross_entropy | 95.35 |
cosine_distance | 48.00 |
hinge | 93.65 |
sigmoid_cross_entropy | 94.55 |
mean_squared_error | 90.95 |
Accuracy:
From the table and plot above, we can conclude that:
- Except for cosine distance, the loss functions all produce high accuracy, among which softmax_cross_entropy and sigmoid_cross_entropy produce the highest accuracy under the given conditions.
Plateauing time:
From the plots above, except for cosine distance, the networks with the remaining loss functions all show a plateauing trend, and sigmoid_cross_entropy takes relatively little time to plateau.
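The names in the table match functions in the TF 1.x `tf.losses` module. Below is a sketch of how each could be attached to the logits of the Part A model; feeding the one-hot labels `y` as hinge/cosine/MSE targets is an assumption about the original setup.

```python
import tensorflow as tf

def make_loss(name, y, logits):
    """Return the TF 1.x loss op matching each row of the table above."""
    if name == 'softmax_cross_entropy':
        return tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits)
    if name == 'cosine_distance':
        return tf.losses.cosine_distance(labels=y,
                                         predictions=tf.nn.softmax(logits),
                                         axis=1)
    if name == 'hinge':
        return tf.losses.hinge_loss(labels=y, logits=logits)
    if name == 'sigmoid_cross_entropy':
        return tf.losses.sigmoid_cross_entropy(multi_class_labels=y,
                                               logits=logits)
    if name == 'mean_squared_error':
        return tf.losses.mean_squared_error(labels=y,
                                            predictions=tf.nn.softmax(logits))
    raise ValueError(name)
```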
Part D - Epochs (10 points)
For this part we again examine accuracy and plateauing time. We pick epoch counts of 1, 3, 5, 10, 20, 50, and 100 (batch size 200) and generate the accuracy table and plot below.
Number of epoch | Accuracy(%) |
---|---|
1 | 85.00 |
3 | 93.00 |
5 | 91.80 |
10 | 92.80 |
20 | 93.00 |
50 | 96.76 |
100 | 96.12 |
Accuracy:
From the table and plot above, we can conclude that:
- More epochs bring higher accuracy.
- Early in training, each epoch brings a large improvement in accuracy, while later in training each epoch brings a relatively small improvement.
Plateauing time:
Using the exact per-epoch data, we apply early stopping: if we stop when the accuracy improvement from a single epoch is less than 0.1%, this network plateaus at epoch #31.
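A minimal sketch of that stopping rule, assuming we have the per-epoch validation accuracies as a list of fractions (the per-epoch numbers themselves are not reproduced here):

```python
def plateau_epoch(accuracies, tol=0.001):
    """Return the first epoch at which the accuracy improvement over the
    previous epoch falls below `tol` (0.1%), i.e. where we would stop early.
    `accuracies` holds per-epoch validation accuracy as fractions."""
    for epoch in range(1, len(accuracies)):
        if accuracies[epoch] - accuracies[epoch - 1] < tol:
            return epoch + 1   # epochs are numbered from 1
    return len(accuracies)     # never plateaued within the run

# With our recorded per-epoch accuracies this returns 31 (epoch #31, as above).
```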
Part E - Gradient estimation (10 points)
For this part, we tried different gradient-estimation optimizers: Adadelta, Adam, GradientDescent, Adagrad, RMSProp, and Momentum. The accuracy table and plots are shown below.
Optimizer | Accuracy(%) |
---|---|
Adadelta | 59.00 |
Adam | 94.35 |
GradientDescent | 56.00 |
Adagrad | 59.00 |
RMSProp | 89.45 |
Momentum | 49.00 |
Accuracy:
From the results above, Adam (Adaptive Moment Estimation) and RMSProp bring the highest accuracy, 94.35% and 89.45% respectively, under a learning rate of 0.01, while Momentum, Adagrad, GradientDescent, and Adadelta lead to low accuracy.
Plateauing time:
Adam plateaus fastest and RMSProp is second, while the Momentum, Adagrad, GradientDescent, and Adadelta curves are essentially horizontal lines.
Also, even for Adam, if the learning rate is a bit high, say 0.08, the network does not plateau within the given number of epochs (20). In general, a smaller learning rate gives better accuracy; for example, a learning rate of 0.005 yields very high accuracy (98.35%).
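All six optimizers are available in the TF 1.x `tf.train` module, so the comparison again amounts to swapping one constructor. A sketch follows; the momentum coefficient (0.9) is an assumption, since it is not stated above.

```python
import tensorflow as tf

lr = 0.01   # the learning rate used for the table above
optimizers = {
    'Adadelta':        tf.train.AdadeltaOptimizer(lr),
    'Adam':            tf.train.AdamOptimizer(lr),
    'GradientDescent': tf.train.GradientDescentOptimizer(lr),
    'Adagrad':         tf.train.AdagradOptimizer(lr),
    'RMSProp':         tf.train.RMSPropOptimizer(lr),
    'Momentum':        tf.train.MomentumOptimizer(lr, momentum=0.9),
}

# train_op = optimizers[name].minimize(loss)
```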
Part F - Network Architecture (10 points)
We have three pairs of convolution and pooling layers. Here we keep the filter stride at [1,1,1,1] and the pooling stride at [1,2,2,1]. To change the architecture of this network, we vary the kernel/filter sizes (first table and plot) and the width of the fully connected layers (second table and plot). In the naming below, architecture_357 means 3x3, 5x5, and 7x7 filters in the first, second, and third convolution layers, respectively.
Filter sizes (50 epochs) | Accuracy(%) |
---|---|
architecture_777 | 86.10 |
architecture_555 | 90.96 |
architecture_357 | 95.28 |
architecture_355 | 92.54 |
architecture_333 | 97.72 |
architecture_222 | 95.18 |
Fully connected width | Accuracy(%) |
---|---|
256 | 90.30 |
512 | 96.80 |
1024 | 96.55 |
2048 | 94.30 |
Accuracy:
For this part, we find that 3x3 filters in every layer (97.72%) and 3x3, 5x5, 7x7 filters in the successive layers (95.28%) give the best accuracy.
And with 3x3 filters, fully connected layers of width 512 and 1024 give the highest accuracy.
Plateauing time:
The pattern is similar: 3x3 filters in every layer and 3x3, 5x5, 7x7 in the successive layers plateau fastest, and with 3x3 filters, a fully connected layer of width 512 takes the least time to plateau.
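A sketch of how the convolution stack can be parameterized by filter size, so that one function covers all six rows of the first table; the channel counts are assumptions, as in the Part A sketch.

```python
import tensorflow as tf

def build_conv_stack(x, sizes=(3, 3, 3), channels=(32, 64, 64)):
    """Three conv+pool pairs with per-layer filter sizes: sizes=(3, 5, 7)
    corresponds to architecture_357 above, sizes=(3, 3, 3) to architecture_333."""
    h = x
    for k, c in zip(sizes, channels):
        in_c = h.get_shape().as_list()[-1]
        w = tf.Variable(tf.random_normal([k, k, in_c, c], stddev=0.01))
        b = tf.Variable(tf.random_normal([c], stddev=0.01))
        h = tf.nn.relu(tf.nn.conv2d(h, w, strides=[1, 1, 1, 1],
                                    padding='SAME') + b)
        h = tf.nn.max_pool(h, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1], padding='SAME')
    return h
```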
Part G - Network initialization (10 points)
For this part we again examine accuracy and plateauing time. We pick the initialization methods zeros, random_uniform, random_gamma, and random_normal (Gaussian, with different standard deviations). The accuracy numbers and plateauing plots are below.
initialization | Accuracy(%) |
---|---|
random_uniform | 54.00 |
zeros | 49.00 |
random_gamma | 57.40 |
random_normal_0.01 | 95.70 |
random_normal_0.015 | 85.80 |
random_normal_0.008 | 92.85 |
Accuracy:
From the results above, random_normal (Gaussian) initialization brings the highest accuracy, over 90% given 20 epochs, while the other methods lead to relatively low accuracy in the same number of epochs.
Plateauing time:
Among these methods, Gaussian is the only one that plateaus within the given number of epochs; the others might plateau given more epochs. We also find that a standard deviation around 0.01 works very well for this network.
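A sketch of the initializers compared above, as TF 1.x ops; the uniform range and the gamma shape parameter are assumptions, since they are not stated above.

```python
import tensorflow as tf

def init_weights(shape, method='random_normal_0.01'):
    """The initialization schemes compared in the table above."""
    if method == 'zeros':
        return tf.Variable(tf.zeros(shape))
    if method == 'random_uniform':
        return tf.Variable(tf.random_uniform(shape, -0.05, 0.05))  # range assumed
    if method == 'random_gamma':
        return tf.Variable(tf.random_gamma(shape, alpha=1.0))      # alpha assumed
    if method.startswith('random_normal_'):
        stddev = float(method.rsplit('_', 1)[-1])   # e.g. 0.01, 0.015, 0.008
        return tf.Variable(tf.random_normal(shape, stddev=stddev))
    raise ValueError(method)
```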