Machine Learning Andrew Ng - 2. Linear regression with one variable

2.1 Model representation (模型描述)

In supervised learning, we have a data set and this data set is called a training set (训练集).

$(x, y)$ : one training example

$(x^{(i)}, y^{(i)})$ : the $i^{th}$ training example

$x^{(1)} = 2104$

$x^{(2)} = 1416$

$y^{(1)} = 460$

$y^{(2)} = 232$

Hypothesis (假设函数) : $h_\theta(x) = \theta_0 + \theta_1 x$

How do we go about implementing this model?

2.2 Cost function

How do we fit the best possible straight line to our data?

With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypothesis functions.

In linear regression we have a training set, and we want to choose values for the parameters $\theta_0$ and $\theta_1$ so that the resulting straight line fits the data well.

How do we come up with values for $\theta_0$ and $\theta_1$?

In linear regression, we solve a minimization problem: we try to minimize the squared difference between the output of the hypothesis and the actual price of the house.

$$\min_{\theta_0, \theta_1} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

That is, we minimize the sum of squared errors between the predicted price of each house and the price it actually sells for.

$m$ is the size of the training set.

Define a cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

$$\underset{\theta_0, \theta_1}{\text{minimize}} \; J(\theta_0, \theta_1)$$

This cost function is also called the squared error function (平方误差函数), or sometimes the squared error cost function (平方误差代价函数).
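As a minimal sketch, the definition of $J$ can be written directly in Python. The function names `hypothesis` and `cost` are illustrative choices, not part of the course material, and the data are just the two training examples listed in section 2.1.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# The two training examples from section 2.1: x values and y values.
xs = [2104, 1416]
ys = [460, 232]

print(cost(0.0, 0.0, xs, ys))  # cost when every prediction is 0
print(cost(0.0, 0.2, xs, ys))  # cost for a line through the origin with slope 0.2
```

A smaller value of `cost` means the line defined by `theta0` and `theta1` is closer to the training examples.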

2.3 Cost function intuition I

To build intuition, use a simplified hypothesis with $\theta_0 = 0$, so that $h_\theta(x) = \theta_1 x$ and the cost $J(\theta_1)$ depends on a single parameter. The training set has $m = 3$ examples: $(1, 1)$, $(2, 2)$ and $(3, 3)$.

For $\theta_1 = 1$, we have

$$J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum_{i=1}^{m}(\theta_1 x^{(i)} - y^{(i)})^2 = \frac{1}{2m}(0^2 + 0^2 + 0^2) = 0$$

For $\theta_1 = 0.5$, we have

$$J(0.5) = \frac{1}{2m}[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2] = \frac{1}{2 \times 3} \cdot 3.5 = \frac{3.5}{6} \approx 0.58$$

For $\theta_1 = 0$, we have

$$J(0) = \frac{1}{2m}[(0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2] = \frac{1}{2 \times 3} \cdot 14 = \frac{14}{6} \approx 2.33$$

For different values of $\theta_1$, we can compute the corresponding values of $J(\theta_1)$. Plotted against $\theta_1$, they trace out a bowl-shaped curve with its minimum at $\theta_1 = 1$.

Each value of $\theta_1$ corresponds to a different hypothesis, that is, a different straight-line fit to the data.

For each value of $\theta_1$ we can then derive a different value of $J(\theta_1)$.

We want to choose the value of $\theta_1$ that minimizes $J(\theta_1)$; this is our objective function for linear regression.
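The sweep over $\theta_1$ can be reproduced with a few lines of Python. This is only a sketch: the name `cost_one_param` is made up for illustration, and the data set is assumed to be $(1,1)$, $(2,2)$, $(3,3)$, which is consistent with the terms in the $J(0.5)$ calculation.

```python
def cost_one_param(theta1, xs, ys):
    """J(theta1) for the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data set, matching the worked J(0.5) example.
xs = [1, 2, 3]
ys = [1, 2, 3]

# Evaluate the cost at several slopes on both sides of the minimum.
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1:.1f}  J = {cost_one_param(theta1, xs, ys):.2f}")
```

The printed costs fall as $\theta_1$ approaches 1 and rise again on the other side, which is the bowl shape described above.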

2.4 Cost function intuition II

When we have two parameters, it turns out the cost function also has a similar bowl shape, whose exact form depends on the training set.

It can be drawn as a 3-D surface plot whose horizontal axes are labeled $\theta_0$ and $\theta_1$. As you vary the two parameters, you get different values of the cost function $J(\theta_0, \theta_1)$, and the height of the surface above a particular point $(\theta_0, \theta_1)$ indicates the value of $J(\theta_0, \theta_1)$.

Contour plots (等高线图), also called contour figures

The axes are $\theta_0$ and $\theta_1$. Each of the ovals (椭圆形), or ellipses, shows a set of points that take on the same value of $J(\theta_0, \theta_1)$.
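To make the idea of a contour concrete, the sketch below evaluates $J$ at two different parameter pairs that lie on the same contour line. The data set $(1,1)$, $(2,2)$, $(3,3)$ and the two parameter pairs are assumptions chosen for illustration.

```python
import math


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data set.
xs = [1, 2, 3]
ys = [1, 2, 3]

# Two different lines with the same cost: both points sit on one contour.
a = cost(0.6, 0.8, xs, ys)
b = cost(-0.6, 1.2, xs, ys)
print(a, b, math.isclose(a, b))
```

The two hypotheses are different straight lines, yet they fit the data equally well in the squared error sense, which is exactly what one ellipse in the contour plot represents.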

2.5 Gradient descent (梯度下降)

Gradient descent is used not only in linear regression. It’s actually used all over the place in machine learning.

Here we use gradient descent to minimize some arbitrary function $J$.

A property (性质) of gradient descent: starting from one point, we reach a local optimum (局部最优), but starting from a slightly different location, we may end up at a very different local optimum.

The gradient descent algorithm: repeat until convergence

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$$

The notation $:=$ denotes assignment (赋值): $a := b$ means take the value of $b$ and use it to overwrite the value of $a$, that is, set $a$ equal to the value of $b$.

By contrast, $a = b$ is a truth assertion (真假判定): it claims that the values of $a$ and $b$ are equal.

$\alpha$ is called the learning rate. It controls how big a step we take downhill with gradient descent. If $\alpha$ is very large, gradient descent is very aggressive and takes huge steps downhill. If $\alpha$ is very small, it takes tiny baby steps downhill.

How do we set $\alpha$? We will discuss this later.

Update $\theta_0$ and $\theta_1$ simultaneously: compute both new values from the old parameters before overwriting either one.
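The difference between a simultaneous and a non-simultaneous update is easiest to see in code. The sketch below assumes the usual update rule $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)$. The function names and the example function $J(t_0, t_1) = t_0^2 + t_0 t_1 + t_1^2$ are hypothetical, chosen only so that each partial derivative depends on both parameters.

```python
def simultaneous_step(theta0, theta1, alpha, grad0, grad1):
    """Correct update: both derivatives are evaluated at the OLD parameters."""
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1


def sequential_step(theta0, theta1, alpha, grad0, grad1):
    """Incorrect variant: theta1's update sees the already-updated theta0."""
    theta0 = theta0 - alpha * grad0(theta0, theta1)
    theta1 = theta1 - alpha * grad1(theta0, theta1)
    return theta0, theta1


# Partial derivatives of the hypothetical J(t0, t1) = t0**2 + t0*t1 + t1**2.
def grad0(t0, t1):
    return 2 * t0 + t1


def grad1(t0, t1):
    return t0 + 2 * t1


print(simultaneous_step(1.0, 2.0, 0.1, grad0, grad1))
print(sequential_step(1.0, 2.0, 0.1, grad0, grad1))
```

The two functions return different values for `theta1`, because the sequential version mixes an updated `theta0` into the derivative for `theta1`.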

2.6 Gradient descent intuition

To convey these intuitions, we use a slightly simpler example in which we minimize a function of just one parameter:

$$\min_{\theta_1} J(\theta_1), \quad \theta_1 \in \mathbb{R}$$

What if the parameter $\theta_1$ is already at a local minimum?

At a local minimum the derivative is equal to zero, so the update $\theta_1 := \theta_1 - \alpha \cdot 0$ leaves $\theta_1$ unchanged.
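A one-parameter experiment shows both effects: the parameter stays put at a minimum, and the learning rate decides whether the steps shrink toward the minimum or overshoot it. This is a sketch on a hypothetical cost $J(\theta_1) = (\theta_1 - 3)^2$, not the linear regression cost, and the names `dJ` and `descend` are made up for illustration.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 3)**2."""
    return 2 * (theta1 - 3)


def descend(theta1, alpha, steps):
    """Apply theta1 := theta1 - alpha * dJ(theta1) a fixed number of times."""
    for _ in range(steps):
        theta1 = theta1 - alpha * dJ(theta1)
    return theta1


print(descend(3.0, 0.1, 10))  # starts at the minimum: derivative is 0, theta1 does not move
print(descend(0.0, 0.1, 50))  # small alpha: theta1 approaches the minimum at 3
print(descend(0.0, 1.1, 50))  # alpha too large: each step overshoots farther from 3
```

For this particular cost each step multiplies the distance to the minimum by $(1 - 2\alpha)$, so the distance shrinks for $\alpha = 0.1$ and grows for $\alpha = 1.1$.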

2.7 Gradient descent for linear regression

Putting gradient descent together with our cost function gives us an algorithm for linear regression, that is, for fitting a straight line to our data.

To apply gradient descent we need the partial derivatives of the cost function:

$$\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$$

$$j = 0 \ (\theta_0): \quad \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})$$

$$j = 1 \ (\theta_1): \quad \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$
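The step from the general expression to the two results is an application of the chain rule, sketched here with $h_\theta(x) = \theta_0 + \theta_1 x$:

$$\frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^{m} 2 \, (h_\theta(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial\theta_j} h_\theta(x^{(i)})$$

$$\frac{\partial}{\partial\theta_0} h_\theta(x^{(i)}) = 1, \qquad \frac{\partial}{\partial\theta_1} h_\theta(x^{(i)}) = x^{(i)}$$

The factor 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost function, which is why the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.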

It turns out that the cost function for linear regression is always a bowl-shaped function, called a convex function (凸函数).

A convex function has no local optima other than the single global optimum.

Applying gradient descent to this cost function, each step moves $(\theta_0, \theta_1)$ closer to the global minimum, and the corresponding straight line fits the data better.

"Batch" Gradient Descent Algorithm

“Batch” means that each step of gradient descent uses all the training examples.
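Putting the pieces together, batch gradient descent for one-variable linear regression can be sketched as follows. The function name, the default learning rate and iteration count, and the data (generated from the hypothetical line $y = 1 + 2x$) are all assumptions for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the squared error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # "Batch": every step uses the errors on ALL m training examples.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3]
ys = [3, 5, 7]

theta0, theta1 = batch_gradient_descent(xs, ys)
print(theta0, theta1)  # should be close to the intercept 1 and slope 2 of the generating line
```

A fixed iteration count keeps the sketch short; a practical version would more likely stop when the cost or the parameters change by less than some tolerance.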

Normal equations method (正规方程组法) : solves for the minimum of the cost function $J$ without needing an iterative (迭代) algorithm like gradient descent.

Gradient descent scales better to larger data sets than the normal equations method.
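For a single feature, setting both partial derivatives of $J$ to zero can be solved by hand, which gives the one-variable case of the normal equations. The sketch below uses that closed form; the function name `fit_closed_form` and the data (again from the hypothetical line $y = 1 + 2x$) are assumptions for illustration.

```python
def fit_closed_form(xs, ys):
    """Solve dJ/dtheta0 = 0 and dJ/dtheta1 = 0 directly for one-variable linear regression."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope: sum of co-deviations of x and y divided by sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through the point (mean_x, mean_y).
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x.
print(fit_closed_form([1, 2, 3], [3, 5, 7]))
```

No learning rate and no iterations are needed here, but the function requires at least two distinct `x` values, since otherwise `sxx` is zero.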