Machine Learning Andrew Ng - 2. Linear regression with one variable
2.1 Model representation (模型描述)
In supervised learning, we have a data set and this data set is called a training set (训练集).
$(x, y)$ : one training example
$(x^{(i)}, y^{(i)})$ : the $i$-th training example
Hypothesis (假设函数) : $h_\theta(x) = \theta_0 + \theta_1 x$
How do we go about implementing this model ?
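As a minimal sketch, the hypothesis can be written as a plain Python function (the parameter values below are illustrative, not fitted):

```python
def h(theta0, theta1, x):
    """Hypothesis for one-variable linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 0.5 and theta1 = 1.5, the model predicts 3.5 for x = 2.0:
print(h(0.5, 1.5, 2.0))  # 3.5
```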
2.2 Cost function
How to fit the best possible straight line to our data ?
With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypotheses, different hypothesis functions.
In linear regression we have a training set; what we want to do is come up with values for the parameters $\theta_0$ and $\theta_1$, so that the straight line we get out of this corresponds to a straight line that somehow fits the data well.
How do we come up with values $\theta_0$ and $\theta_1$ ?
In linear regression, what we're going to do is solve a minimization problem : we try to minimize the squared difference between the output of the hypothesis and the actual price of the house, i.e. the squared difference between the predicted price of the house and the price it will actually sell for.
m is the size of the training set.
Define the cost function :
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
The cost function is also called the squared error function (平方误差函数), or sometimes the squared error cost function (平方误差代价函数).
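A small Python sketch of the squared error cost (the training set here is a made-up example):

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost: J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return squared_errors / (2 * m)

# A line that passes through every training point has zero cost:
print(compute_cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
```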
2.3 Cost function intuition I
With the simplified hypothesis $h_\theta(x) = \theta_1 x$ (i.e. $\theta_0 = 0$) and the lecture's example training set $(1,1), (2,2), (3,3)$ :
$\theta_1 = 1$ , we have $J(1) = 0$
$\theta_1 = 0.5$ , we have $J(0.5) \approx 0.58$
$\theta_1 = 0$ , we have $J(0) \approx 2.33$
For different values of $\theta_1$, we can compute a range of values of $J(\theta_1)$, and get something like this :
Each value of $\theta_1$ corresponds to a different hypothesis, or to a different straight-line fit on the left.
For each value of $\theta_1$ we can then derive a different value of $J(\theta_1)$.
We want to choose the value of $\theta_1$ that minimizes $J(\theta_1)$ ; this was our objective function for linear regression.
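The sweep over $\theta_1$ can be reproduced in a few lines of Python, assuming the lecture's simplified hypothesis $h_\theta(x) = \theta_1 x$ and its example training set $(1,1), (2,2), (3,3)$:

```python
def J(theta1, xs, ys):
    """Cost for the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, round(J(theta1, xs, ys), 3))
# The minimum is at theta1 = 1.0, where J(1.0) = 0.0
```

Plotting these $(\theta_1, J(\theta_1))$ pairs gives the bowl-shaped curve described above.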
2.4 Cost function intuition II
When we have two parameters, it turns out the cost function also has a similar bowl shape. In fact, depending on the training set, we might get a cost function that looks something like this :
This is a 3-D surface plot, where the axes are labeled $\theta_0$ and $\theta_1$. As you vary $\theta_0$ and $\theta_1$, the two parameters, you get different values of the cost function $J(\theta_0, \theta_1)$, and the height of the surface above a particular point $(\theta_0, \theta_1)$ indicates the value of $J(\theta_0, \theta_1)$.
Contour plots (等高线图), also called contour figures.
The axes are $\theta_0$ and $\theta_1$. Each of these ovals (椭圆形), each of these ellipses, shows a set of points that take on the same value of $J(\theta_0, \theta_1)$.
2.5 Gradient descent (梯度下降)
Gradient descent is used not only in linear regression. It’s actually used all over the place in machine learning.
Gradient descent for minimizing some arbitrary function $J(\theta_0, \theta_1)$.
Problem : have some function $J(\theta_0, \theta_1)$ ; want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
A property (性质) of gradient descent : starting at one point, we find a local optimum (局部最优) ; but if we had started just a little bit to the side, at a slightly different location, we would have wound up at a very different local optimum.
The notation $a := b$ denotes assignment (赋值). What it means in a computer is : take the value in $b$ and use it to overwrite whatever the value of $a$ is ; that is, set $a$ to be equal to the value of $b$.
By contrast, $a = b$ is a truth assertion (真假判定) : it asserts that the value of $a$ equals the value of $b$.
$\alpha$ is called the learning rate. What $\alpha$ does is basically control how big a step we take downhill with gradient descent. If $\alpha$ is very large, that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill. And if $\alpha$ is very small, then we're taking little baby steps downhill.
How to set $\alpha$ ? We will discuss this later…
Simultaneously update $\theta_0$ and $\theta_1$. The update rule is
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$$
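A minimal Python sketch of one simultaneous update step (the data and $\alpha$ below are illustrative). The key point is that both derivatives are computed from the current parameters before either is overwritten:

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    # Compute BOTH partial derivatives from the current parameters...
    d0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    # ...then update both simultaneously.
    return theta0 - alpha * d0, theta1 - alpha * d1

# At the optimum for this toy data, both derivatives are zero, so nothing moves:
print(gradient_step(0.0, 1.0, [1, 2, 3], [1, 2, 3], 0.1))  # (0.0, 1.0)
```

Updating `theta0` first and then using the new value inside the `d1` computation would be the incorrect, non-simultaneous variant.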
2.6 Gradient descent intuition
In order to convey these intuitions, we use a slightly simpler example where we want to minimize a function of just one parameter : $J(\theta_1)$, with $\theta_1 \in \mathbb{R}$.
What if the parameter $\theta_1$ is already at a local minimum ?
A local minimum is where the derivative is equal to zero, so the update $\theta_1 := \theta_1 - \alpha \cdot 0$ leaves $\theta_1$ unchanged : gradient descent stays put.
2.7 Gradient descent for linear regression
Put together gradient descent with our cost function, and that will give us an algorithm for linear regression for fitting a straight line to our data.
$$
\begin{align*}
\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)
&= \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \\
&= \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0+\theta_1 x^{(i)}-y^{(i)}\right)^2
\end{align*}
$$
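Carrying out the differentiation for $j = 0$ and $j = 1$ gives the terms plugged into the update rule (a standard result; the $\frac{1}{2m}$ factor cancels the 2 from the chain rule):

$$
\begin{align*}
j = 0 : \quad \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right) \\
j = 1 : \quad \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}
\end{align*}
$$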
It turns out that the cost function for linear regression is always going to be a bowl-shaped function, which is called a convex function (凸函数).
A convex function doesn't have any local optima, except for the one global optimum.
We get this:
"Batch" Gradient Descent Algorithm
“Batch” means that each step of gradient descent uses all the training examples.
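The whole batch algorithm fits in a short Python sketch (the data, $\alpha$, and iteration count are illustrative choices, not from the lecture):

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x:
    every step sums the error over ALL m training examples."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        d0 = sum(errors) / m                             # dJ/d(theta0)
        d1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/d(theta1)
        # Simultaneous update of both parameters:
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

theta0, theta1 = batch_gradient_descent([1, 2, 3], [1, 2, 3])
print(theta0, theta1)  # converges close to (0, 1), i.e. the line y = x
```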
Normal equations method (正规方程组法) : solving for the minimum of the cost function directly, without needing an iterative (迭代) algorithm like gradient descent.
Gradient descent scales better to large data sets than the normal equations method.
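For one variable, the normal-equation solution reduces to a closed form obtained by setting both partial derivatives of $J$ to zero (a standard result; the data below is illustrative):

```python
def fit_closed_form(xs, ys):
    """Solve for (theta0, theta1) directly, with no iteration:
    theta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    theta0 = y_bar - theta1 * x_bar
    """
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

print(fit_closed_form([1, 2, 3], [1, 2, 3]))  # (0.0, 1.0)
```

Unlike gradient descent, this needs no learning rate and no iterations, but the general matrix form becomes expensive for very large feature counts, which is why gradient descent scales better.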