Machine Learning 06 - Support Vector Machine

I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, to review and consolidate the material.
My knowledge is limited, so if you find mistakes or omissions, or have ideas to share, please bear with me and point them out.

6.1 Large Margin Classification

6.1.1 Optimization objective

Here we introduce the last supervised learning algorithm: the Support Vector Machine (SVM).

Hypothesis :

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
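As a minimal sketch (NumPy assumed; the function name `predict` is mine), the hypothesis simply thresholds θᵀx at zero:

```python
import numpy as np

def predict(theta, x):
    # SVM hypothesis: output 1 when theta^T x >= 0, otherwise 0.
    return 1 if np.dot(theta, x) >= 0 else 0
```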

Cost function :

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \operatorname{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \operatorname{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$

where cost1 is the cost when y=1 and cost0 is the cost when y=0. An intuitive explanation is below :

(Figure: the cost_1 and cost_0 curves plotted against θᵀx.)
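To make the objective concrete, here is a rough sketch in Python (NumPy assumed). The hinge-style `cost1`/`cost0` below match the shape of the curves in the figure, though the exact slopes used in the lecture may differ, and all helper names are mine:

```python
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: zero once z <= -1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum_i [y_i*cost1(theta^T x_i) + (1-y_i)*cost0(theta^T x_i)] + 0.5*sum_{j>=1} theta_j^2
    # Assumes X already includes a leading column of ones, so theta[0] is the intercept.
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)   # do not regularize theta_0
    return data_term + reg_term
```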

Decision boundary :

(Figure: linearly separable data with the large-margin decision boundary.)

The SVM finds the decision boundary with the largest margin to the data. The effect of the regularization parameter C is shown intuitively below:

(Figure: how the decision boundary changes with the regularization parameter C.)

6.1.2 Concept of kernels

In this part, in order to fit a non-linear decision boundary, we adapt the hypothesis to

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \cdots \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

(1) Polynomial

$$f_i = x_j^k \quad (i, j = 1, 2, \dots)$$

It can fit the dataset very well, but we don't know in advance which features to add, and it is computationally expensive.
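For illustration, a degree-2 polynomial expansion can be generated with scikit-learn's `PolynomialFeatures` (a sketch only; the course does not prescribe a particular library):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])                    # two examples with features x1, x2

# Degree-2 terms x1, x2, x1^2, x1*x2, x2^2 play the role of the f_i above.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)                            # (2, 5)
```

Even at degree 2 the number of features grows quadratically with the number of inputs, which is the computational concern mentioned above.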

(2) Gaussian Kernel
First, choose some landmarks $l^{(i)}$ $(i = 1, 2, \dots)$.

Second, define $f_i$ $(i = 1, 2, \dots)$, for example with the Gaussian kernel:

$$f_i = \exp\!\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right) = \operatorname{sim}(x, l^{(i)})$$

It measures the similarity between the two points:

  • If $x \approx l^{(i)}$: $f_i \approx 1$;
  • If $x$ is far from $l^{(i)}$: $f_i \approx 0$.

The parameter σ acts like a scale on the distance between the two points:

(Figure: the Gaussian kernel for different values of σ².)
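A small numeric check of this similarity (NumPy assumed; `gaussian_kernel` is my own helper name):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # sim(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l = np.array([1.0, 2.1])
print(gaussian_kernel(x, l, sigma=1.0))                       # near 1: x is close to l
print(gaussian_kernel(x, np.array([5.0, -3.0]), sigma=1.0))   # near 0: x is far from l
print(gaussian_kernel(x, np.array([5.0, -3.0]), sigma=5.0))   # larger sigma decays more slowly
```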

Finally, an example of what it predicts is:

(Figure: an example non-linear decision boundary produced with these features.)

6.1.3 SVM with kernels

(1) Choose landmarks

Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \dots, l^{(m)} = x^{(m)}$.

(2) Define kernels

We define the feature vector $f^{(i)}$ using the Gaussian kernel (with $f_0^{(i)} = 1$ as the intercept term):

$$f^{(i)} = \begin{bmatrix} f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ \operatorname{sim}(x^{(i)}, l^{(1)}) \\ \vdots \\ \operatorname{sim}(x^{(i)}, l^{(m)}) \end{bmatrix}, \quad i = 1, 2, \dots, m$$
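A sketch of this feature construction (NumPy assumed; `build_features` is my name for the helper), using every training example as a landmark and setting $f_0^{(i)} = 1$:

```python
import numpy as np

def build_features(X, sigma):
    # Map each x^(i) to f^(i) = [1, f_1^(i), ..., f_m^(i)],
    # where f_j^(i) = sim(x^(i), l^(j)) and l^(j) = x^(j).
    m = X.shape[0]
    F = np.ones((m, m + 1))                      # column 0 is the intercept f_0 = 1
    for i in range(m):
        for j in range(m):
            diff = X[i] - X[j]
            F[i, j + 1] = np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
    return F
```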

(3) Training

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \operatorname{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \operatorname{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$

Use a minimization algorithm to solve it.
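In practice the minimization is usually left to an SVM library. As one possible sketch, scikit-learn's `SVC` with an RBF kernel corresponds to this Gaussian-kernel setup: its `gamma` parameter plays the role of 1/(2σ²), though its internal objective is a closely related rather than identical formulation, and the toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two features, labels in {0, 1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

sigma = 0.5
clf = SVC(C=1.0, kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))   # expected: [0 1]
```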

(4) Evaluation

  • Large C: lower bias, higher variance.
  • Small C: higher bias, lower variance.
  • Large σ²: higher bias, lower variance (f is "smoother").
  • Small σ²: lower bias, higher variance. (See the sketch after this list.)
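One common way to navigate these bias/variance trade-offs is to pick C and σ² by cross-validation. A hedged sketch with scikit-learn's `GridSearchCV` (the parameter grid and data here are made up for illustration; `gamma` = 1/(2σ²), so a small `gamma` corresponds to a large σ²):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linear boundary

grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```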

(5) Note

  • Perform feature scaling before using the Gaussian kernel (see the pipeline sketch after this list).
  • Not all similarity functions make valid kernels. (They need to satisfy Mercer's Theorem so that SVM packages run correctly.)
  • Other kernels: polynomial kernel, string kernel, …
  • Multi-class classification: use the one-vs-all method.
  • If n is large relative to m, use logistic regression or an SVM without a kernel; if n is small and m is intermediate, use an SVM with a Gaussian kernel; if n is small and m is large, create more features and then fall back to the first case. A neural network is likely to work well for most of these settings.
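As a sketch of the feature-scaling note above (library choice and data are mine): scale the inputs first so no single feature dominates ‖x − l‖², then fit the kernelized SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features on very different scales (e.g. x1 around 1-10, x2 around 10-200).
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([0, 0, 1, 1])

# StandardScaler is fit on the training data and re-applied automatically at prediction time.
model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
model.fit(X, y)
print(model.predict([[1.5, 190.0], [8.5, 15.0]]))   # expected: [0 1]
```

For the n-versus-m guidance, swapping in `kernel='linear'` (or plain logistic regression) covers the "SVM without a kernel" case.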