Coursera Machine Learning (Andrew Ng), Week 3 Notes
Evaluating a Learning Algorithm
Evaluating a Hypothesis
Once we have done some troubleshooting for errors in our predictions, we can try:
- Getting more training examples: fixes high variance
- Trying smaller sets of features: fixes high variance
- Trying additional features: fixes high bias
- Trying polynomial features: fixes high bias
- Increasing λ: fixes high variance
- Decreasing λ: fixes high bias
A hypothesis may already achieve very low error on the training set yet still be inaccurate on new data, because it is overfitting. To evaluate a hypothesis, we therefore split the dataset into two parts: a training set (70%) and a test set (30%).
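A minimal sketch of that 70/30 split (the toy dataset here is hypothetical, made up purely for illustration):

```python
import numpy as np

# Hypothetical toy dataset: 10 examples, 1 feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=10)

# Shuffle first, then take 70% for training and 30% for testing.
idx = rng.permutation(len(X))
split = int(0.7 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]
```

Shuffling before splitting matters: if the data are sorted (e.g. by label), a contiguous split would give unrepresentative sets.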
Model Selection and Train/Validation/Test Sets
One way to break down our dataset into the three sets is:
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%
We can now calculate three separate error values for the three different sets using the following method:
- Optimize the parameters in Θ using the training set for each polynomial degree.
- Find the polynomial degree d with the least error using the cross validation set.
- Estimate the generalization error using the test set with J_test(Θ^(d)), where d is the degree of the polynomial with the lowest cross-validation error.
This way, the degree of the polynomial d has not been trained using the test set.
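The three steps above can be sketched as follows (the quadratic toy data and the degree range 1–5 are assumptions for illustration; `np.polyfit` stands in for "optimize Θ on the training set"):

```python
import numpy as np

# Hypothetical data generated from a quadratic with small noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=100)

# 60/20/20 split into training / cross-validation / test sets.
idx = rng.permutation(len(x))
tr, cv, te = idx[:60], idx[60:80], idx[80:]

def mse(theta, x, y):
    # Mean squared error of a fitted polynomial (coefficients from polyfit).
    return np.mean((np.polyval(theta, x) - y) ** 2)

# 1) Optimize the parameters Θ on the training set for each degree d.
models = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 6)}
# 2) Pick the degree d with the lowest cross-validation error.
best_d = min(models, key=lambda d: mse(models[d], x[cv], y[cv]))
# 3) Estimate generalization error on the test set, which never influenced d.
test_error = mse(models[best_d], x[te], y[te])
```

Because d was chosen on the cross-validation set, the test error is an unbiased estimate of generalization error.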
Bias vs Variance
Diagnosing Bias vs. Variance
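The diagnostic rule from the lectures: high bias (underfitting) shows both J_train and J_cv high and close together; high variance (overfitting) shows J_train low with J_cv much higher. A minimal sketch of that rule (the numeric thresholds `high` and `gap` are hypothetical; in practice you compare against the error level you consider acceptable):

```python
def diagnose(j_train, j_cv, high=0.5, gap=0.3):
    # `high` and `gap` are illustrative thresholds, not part of the course.
    if j_train >= high and j_cv >= high:
        return "high bias"      # underfitting: both errors high, close together
    if j_train < high and j_cv - j_train >= gap:
        return "high variance"  # overfitting: low train error, much higher CV error
    return "ok"
```

For example, `diagnose(0.8, 0.9)` flags high bias, while `diagnose(0.05, 0.6)` flags high variance.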
Regularization and Bias/Variance
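λ is chosen the same way as the polynomial degree: train Θ for each candidate λ, measure the cross-validation error without the regularization term, and keep the λ that minimizes it. A sketch under assumed toy data, using the regularized normal equation for linear regression (the bias term is left unregularized-free here only for brevity):

```python
import numpy as np

# Hypothetical data: 5 features, sparse true weights, small noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=60)

tr, cv = np.arange(40), np.arange(40, 60)

def ridge_fit(X, y, lam):
    # Regularized normal equation: Theta = (X'X + lam*I)^-1 X'y.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cv_error(theta, X, y):
    # The cross-validation error is computed WITHOUT the regularization term.
    return np.mean((X @ theta - y) ** 2) / 2

lambdas = [0, 0.01, 0.1, 1, 10, 100]
errors = {lam: cv_error(ridge_fit(X[tr], y[tr], lam), X[cv], y[cv])
          for lam in lambdas}
best_lam = min(errors, key=errors.get)
```

Large λ over-shrinks the weights (high bias), λ = 0 removes the penalty entirely (risking high variance); the CV error curve is U-shaped between the two.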
Learning Curves
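A learning curve plots J_train and J_cv as functions of the training set size m: with tiny m the model fits the training points exactly (J_train ≈ 0) while J_cv is high; as m grows the two curves converge. A sketch with a hypothetical linear dataset:

```python
import numpy as np

# Hypothetical data: a line plus noise; last 50 points held out for CV.
rng = np.random.default_rng(3)
x_all = rng.uniform(-1, 1, 200)
y_all = 1 + 2 * x_all + rng.normal(scale=0.2, size=200)
x_cv, y_cv = x_all[150:], y_all[150:]

def errs(m):
    # Fit a line on the first m training examples; return (J_train, J_cv).
    theta = np.polyfit(x_all[:m], y_all[:m], 1)
    j = lambda x, y: np.mean((np.polyval(theta, x) - y) ** 2) / 2
    return j(x_all[:m], y_all[:m]), j(x_cv, y_cv)

curve = {m: errs(m) for m in (2, 5, 10, 50, 150)}
```

If both curves plateau at a high error, more data will not help (high bias); if a large gap persists between them, more data likely will (high variance).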
Diagnosing Neural Networks
- A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
- A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.
Building a Spam Classifier
Prioritizing What to Work On
- Collect lots of data (for example, via a “honeypot” project, though this doesn’t always work)
- Develop sophisticated features (for example: using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam).
Error Analysis
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
Handling Skewed Data
On skewed classes (e.g. very few positives), accuracy is misleading, so we use precision P = TP/(TP+FP) and recall R = TP/(TP+FN), combined into the F1 Score: F1 = 2PR/(P+R).
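A minimal sketch computing these three metrics from confusion-matrix counts (the function name and example counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP): of the examples we flagged positive, how many were.
    p = tp / (tp + fp)
    # Recall = TP/(TP+FN): of the actual positives, how many we caught.
    r = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return p, r, 2 * p * r / (p + r)
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, precision, recall, and F1 all come out to 0.8.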