Machine Learning Andrew Ng - 2. Linear regression with one variable
2.1 Model representation (模型描述)
In supervised learning, we have a data set and this data set is called a training set (训练集).
$(x, y)$ : one training example
$(x^{(i)}, y^{(i)})$ : the $i$-th training example
Hypothesis (假设函数) : $h_\theta(x) = \theta_0 + \theta_1 x$
How do we go about implementing this model ?
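As a minimal sketch, the hypothesis can be written as a plain Python function (the parameter values below are illustrative, not fitted):

```python
def h(theta0, theta1, x):
    """Hypothesis for one-variable linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 0.5 and theta1 = 1.5, the model predicts 3.5 for x = 2.0:
print(h(0.5, 1.5, 2.0))  # 3.5
```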
2.2 Cost function
How to fit the best possible straight line to our data ?
With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypotheses, different hypothesis functions.
In linear regression we have a training set; what we want to do is come up with values for the parameters $\theta_0$ and $\theta_1$, so that the straight line we get out of this corresponds to a straight line that somehow fits the data well.
How do we come up with values $\theta_0$ and $\theta_1$ ?
In linear regression, what we're going to do is solve a minimization problem : we try to minimize the squared difference between the output of the hypothesis and the actual price of the house, i.e. the squared difference between the predicted price of the house and the price it will actually sell for.
m is the size of the training set.
Define the cost function :
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
The cost function is also called the squared error function (平方误差函数), or sometimes the squared error cost function (平方误差代价函数).
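A small Python sketch of the squared error cost (the training set here is a made-up example):

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost: J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return squared_errors / (2 * m)

# A line that passes through every training point has zero cost:
print(compute_cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
```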
2.3 Cost function intuition I
With the simplified hypothesis $h_\theta(x) = \theta_1 x$ (i.e. $\theta_0 = 0$) and the lecture's example training set $(1,1), (2,2), (3,3)$ :
$\theta_1 = 1$ , we have $J(1) = 0$
$\theta_1 = 0.5$ , we have $J(0.5) \approx 0.58$
$\theta_1 = 0$ , we have $J(0) \approx 2.33$
For different values of $\theta_1$, we can compute a range of values of $J(\theta_1)$, and get something like this :
Each value of $\theta_1$ corresponds to a different hypothesis, or to a different straight-line fit on the left.
For each value of $\theta_1$ we can then derive a different value of $J(\theta_1)$.
We want to choose the value of $\theta_1$ that minimizes $J(\theta_1)$ ; this was our objective function for linear regression.
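The sweep over $\theta_1$ can be reproduced in a few lines of Python, assuming the lecture's simplified hypothesis $h_\theta(x) = \theta_1 x$ and its example training set $(1,1), (2,2), (3,3)$:

```python
def J(theta1, xs, ys):
    """Cost for the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, round(J(theta1, xs, ys), 3))
# The minimum is at theta1 = 1.0, where J(1.0) = 0.0
```

Plotting these $(\theta_1, J(\theta_1))$ pairs gives the bowl-shaped curve described above.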
2.4 Cost function intuition II
When we have two parameters, it turns out the cost function also has a similar bowl shape. In fact, depending on the training set, we might get a cost function that looks something like this :
This is a 3-D surface plot, where the axes are labeled $\theta_0$ and $\theta_1$. As you vary $\theta_0$ and $\theta_1$, the two parameters, you get different values of the cost function $J(\theta_0, \theta_1)$, and the height of the surface above a particular point $(\theta_0, \theta_1)$ indicates the value of $J(\theta_0, \theta_1)$.
Contour plots (等高线图), also called contour figures.
The axes are $\theta_0$ and $\theta_1$. Each of these ovals (椭圆形), each of these ellipses, shows a set of points that take on the same value of $J(\theta_0, \theta_1)$.
2.5 Gradient descent (梯度下降)
Gradient descent is used not only in linear regression. It’s actually used all over the place in machine learning.
Gradient descent for minimizing some arbitrary function $J(\theta_0, \theta_1)$.
Problem : have some function $J(\theta_0, \theta_1)$ ; want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
A property (性质) of gradient descent : starting at one point, we find a local optimum (局部最优) ; but if we had started just a little bit to the side, at a slightly different location, we would have wound up at a very different local optimum.
The notation $a := b$ denotes assignment (赋值). What it means in a computer is : take the value in $b$ and use it to overwrite whatever the value of $a$ is ; that is, set $a$ to be equal to the value of $b$.
By contrast, $a = b$ is a truth assertion (真假判定) : it asserts that the value of $a$ equals the value of $b$.
$\alpha$ is called the learning rate. What $\alpha$ does is basically control how big a step we take downhill with gradient descent. If $\alpha$ is very large, that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill. And if $\alpha$ is very small, then we're taking little baby steps downhill.
How to set $\alpha$ ? We will discuss this later…
Simultaneously update $\theta_0$ and $\theta_1$. The update rule is
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$$
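A minimal Python sketch of one simultaneous update step (the data and $\alpha$ below are illustrative). The key point is that both derivatives are computed from the current parameters before either is overwritten:

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    # Compute BOTH partial derivatives from the current parameters...
    d0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    # ...then update both simultaneously.
    return theta0 - alpha * d0, theta1 - alpha * d1

# At the optimum for this toy data, both derivatives are zero, so nothing moves:
print(gradient_step(0.0, 1.0, [1, 2, 3], [1, 2, 3], 0.1))  # (0.0, 1.0)
```

Updating `theta0` first and then using the new value inside the `d1` computation would be the incorrect, non-simultaneous variant.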
2.6 Gradient descent intuition
In order to convey these intuitions, we use a slightly simpler example where we want to minimize a function of just one parameter : $J(\theta_1)$, with $\theta_1 \in \mathbb{R}$.
What if the parameter $\theta_1$ is already at a local minimum ?
A local minimum is where the derivative is equal to zero, so the update $\theta_1 := \theta_1 - \alpha \cdot 0$ leaves $\theta_1$ unchanged : gradient descent stays put.
2.7 Gradient descent for linear regression
Put together gradient descent with our cost function, and that will give us an algorithm for linear regression for fitting a straight line to our data.
$$
\begin{align*}
\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)
&= \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \\
&= \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0+\theta_1 x^{(i)}-y^{(i)}\right)^2
\end{align*}
$$
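Carrying out the differentiation for $j = 0$ and $j = 1$ gives the terms plugged into the update rule (a standard result; the $\frac{1}{2m}$ factor cancels the 2 from the chain rule):

$$
\begin{align*}
j = 0 : \quad \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right) \\
j = 1 : \quad \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}
\end{align*}
$$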
It turns out that the cost function for linear regression is always going to be a bowl-shaped function, which is called a convex function (凸函数).
A convex function doesn't have any local optima, except for the one global optimum.
We get this:
"Batch" Gradient Descent Algorithm
“Batch” means that each step of gradient descent uses all the training examples.
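The whole batch algorithm fits in a short Python sketch (the data, $\alpha$, and iteration count are illustrative choices, not from the lecture):

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x:
    every step sums the error over ALL m training examples."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        d0 = sum(errors) / m                             # dJ/d(theta0)
        d1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/d(theta1)
        # Simultaneous update of both parameters:
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

theta0, theta1 = batch_gradient_descent([1, 2, 3], [1, 2, 3])
print(theta0, theta1)  # converges close to (0, 1), i.e. the line y = x
```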
Normal equations method (正规方程组法) : solving for the minimum of the cost function directly, without needing an iterative (迭代) algorithm like gradient descent.
Gradient descent scales better to large data sets than the normal equations method.
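For one variable, the normal-equation solution reduces to a closed form obtained by setting both partial derivatives of $J$ to zero (a standard result; the data below is illustrative):

```python
def fit_closed_form(xs, ys):
    """Solve for (theta0, theta1) directly, with no iteration:
    theta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    theta0 = y_bar - theta1 * x_bar
    """
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

print(fit_closed_form([1, 2, 3], [1, 2, 3]))  # (0.0, 1.0)
```

Unlike gradient descent, this needs no learning rate and no iterations, but the general matrix form becomes expensive for very large feature counts, which is why gradient descent scales better.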