Coursera Machine Learning (Andrew Ng), Week 1 Notes
What is Machine Learning?
Arthur Samuel described it as: “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.
Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Example: playing checkers.
- E = the experience of playing many games of checkers.
- T = the task of playing checkers.
- P = the probability that the program will win the next game.
Machine learning algorithms:
- Supervised learning
In supervised learning, every example in the data set (the training set) comes with the correct answer, and the algorithm makes predictions based on these labeled examples.
Supervised problems fall into two main categories: regression and classification.
Regression: map the input variables to a continuous function, so the output is also continuous.
Classification: map the input variables into discrete categories.
- Unsupervised learning
Unsupervised learning lets us approach problems even when we have little idea what the results should look like; we can derive structure by clustering the data based on relationships among the variables.
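As a hedged illustration of clustering, here is a toy k-means in plain NumPy (the function and the toy data are my own, not from the course):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """A toy k-means: group unlabeled points into k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # Move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# Two well-separated blobs of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

With no labels given, the algorithm still recovers the two blobs as separate clusters.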
Gradient Descent
θj := θj − α · (∂/∂θj) J(θ0, θ1)
Use simultaneous updates: θ0 and θ1 must be updated at the same time, not θ0 first and then θ1. The sequential order may sometimes still reach the right answer, but gradient descent is defined by the simultaneous update.
α is the learning rate; it controls how large a step we take when updating θj.
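The update rule above, applied to univariate linear regression with cost J(θ0, θ1) = (1/2m)·Σ(h(x)−y)², can be sketched as follows (a minimal illustration; the function name and toy data are my own):

```python
import numpy as np

def gradient_descent_step(theta0, theta1, x, y, alpha):
    """One simultaneous update of (theta0, theta1) for univariate
    linear regression, J = (1/2m) * sum((h(x) - y)^2)."""
    m = len(x)
    error = theta0 + theta1 * x - y      # h(x_i) - y_i
    # Compute both gradients from the *old* parameters first...
    grad0 = error.sum() / m
    grad1 = (error * x).sum() / m
    # ...then update both at once (the simultaneous update).
    return theta0 - alpha * grad0, theta1 - alpha * grad1

# Fit y = 1 + 2x on noiseless toy data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = gradient_descent_step(theta0, theta1, x, y, alpha=0.1)
```

After enough iterations the parameters approach θ0 = 1 and θ1 = 2.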
Multivariate Linear Regression
Feature Scaling:
Get every feature into approximately a −1 ≤ xi ≤ 1 range.
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
Mean normalization:
involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
To implement both of these techniques, adjust your input values as shown in this formula:
xi:=(xi−μi)/si
Where μi is the average of all the values for feature (i) and si is the range of values (max - min), or si is the standard deviation.
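The formula xi := (xi − μi)/si can be sketched as follows (a minimal NumPy illustration; the toy data, loosely modeled on house sizes and bedroom counts, is my own):

```python
import numpy as np

def mean_normalize(X):
    """x_i := (x_i - mu_i) / s_i, with s_i = max - min (the range).
    Using the standard deviation for s_i also works."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

# Two features on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled = mean_normalize(X)
```

Each scaled feature now has mean 0 and falls inside the [−1, 1] range, so gradient descent converges faster.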
Polynomial Regression
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
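One hedged way to implement this: expand the single feature x into polynomial features and then fit with ordinary least squares, since the model is still linear in the parameters (the helper name and toy data are my own):

```python
import numpy as np

def poly_features(x, degree):
    """Map one feature x to [1, x, x^2, ..., x^degree] so that
    plain linear regression can fit a polynomial curve."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

# Fit the quadratic y = 2 + 3x + x^2 with least squares.
x = np.linspace(-2.0, 2.0, 20)
y = 2.0 + 3.0 * x + x ** 2
X = poly_features(x, degree=2)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that with polynomial features the ranges (x, x^2, x^3, ...) differ wildly, which is exactly when the feature scaling above matters for gradient descent.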
Normal Equation
There is no need to do feature scaling with the normal equation.
The following is a comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| O(kn^2) | O(n^3), need to calculate inverse of X^T X |
| Works well when n is large | Slow if n is very large |
With the normal equation, computing the inversion has complexity O(n^3). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.
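The normal equation θ = (X^T X)^(-1) X^T y can be sketched as follows (using the pseudoinverse so a non-invertible X^T X is also handled; the toy data is my own):

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X^T X) @ X^T @ y, solved in one shot,
    with no learning rate and no iterations."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Fit y = 1 + 2x directly, without gradient descent.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # prepend the bias column
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)
```

The same answer gradient descent converges to iteratively is obtained here in a single linear-algebra step, and no feature scaling is needed.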
Classification
Decision Boundary
Logistic regression cost function
Simplified Cost Function
Gradient Descent
Advanced Optimization
Multiclass Classification: One-vs-all
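The headings above outline logistic regression; as a rough sketch (my own code, not the course's), the sigmoid hypothesis and the simplified cost function J(θ) = −(1/m)·Σ[y·log h + (1−y)·log(1−h)] look like:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Simplified logistic regression cost:
    J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h)).sum() / m

# With theta = 0, h = 0.5 for every example, so J = -log(0.5) = log(2).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
```

For one-vs-all, this binary classifier is simply trained once per class, each time treating that class as y = 1 and all others as y = 0.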
The problem of overfitting
Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
There are two main options to address overfitting:
- Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm (studied later in the course).
- Regularization
- Keep all the features, but reduce the magnitude of parameters θj.
- Regularization works well when we have a lot of slightly useful features.
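Regularization in this sense adds an L2 penalty on the parameters to the cost function; a minimal sketch for linear regression (λ and the toy data are my own illustrative choices):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Linear regression cost with an L2 penalty that shrinks
    theta_1..theta_n; theta_0 (the bias) is conventionally not penalized."""
    m = len(y)
    error = X @ theta - y
    penalty = lam * (theta[1:] ** 2).sum()
    return (error @ error + penalty) / (2 * m)

# theta = [1, 2] fits y = 1 + 2x exactly, so the squared-error term is 0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])
# With lam = 0 the cost is 0; a larger lam charges for the size of theta_1.
```

Increasing λ trades a little training error for smaller, smoother parameters, which is what counters the "unnecessary curves and angles" of an overfit hypothesis.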