Machine Learning 06 - Support Vector Machine

I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, to review and consolidate the material.
My knowledge is limited, so if you find mistakes or omissions, or have ideas to share, please bear with me and point them out.

6.1 Large Margin Classification

6.1.1 Optimization objective

Here we introduce the last supervised learning algorithm: the Support Vector Machine (SVM).

Hypothesis :

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
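As a minimal sketch (NumPy assumed; the function name `predict` is mine), the hypothesis simply thresholds θᵀx at zero:

```python
import numpy as np

def predict(theta, x):
    # SVM hypothesis: output 1 when theta^T x >= 0, otherwise 0.
    return 1 if np.dot(theta, x) >= 0 else 0
```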

Cost function :

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \operatorname{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \operatorname{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$

where cost1 is the cost when y=1 and cost0 is the cost when y=0. An intuitive explanation is below :

(Figure: the cost_1 and cost_0 curves plotted against θᵀx.)
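To make the objective concrete, here is a rough sketch in Python (NumPy assumed). The hinge-style `cost1`/`cost0` below match the shape of the curves in the figure, though the exact slopes used in the lecture may differ, and all helper names are mine:

```python
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: zero once z <= -1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum_i [y_i*cost1(theta^T x_i) + (1-y_i)*cost0(theta^T x_i)] + 0.5*sum_{j>=1} theta_j^2
    # Assumes X already includes a leading column of ones, so theta[0] is the intercept.
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)   # do not regularize theta_0
    return data_term + reg_term
```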

Decision boundary :

(Figure: linearly separable data with the large-margin decision boundary.)

The SVM finds the decision boundary with the largest margin to the data. The effect of the regularization parameter C is shown intuitively below:

(Figure: how the decision boundary changes with the regularization parameter C.)

6.1.2 Concept of kernels

In this part, in order to fit a non-linear decision boundary, we adapt the hypothesis to

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \cdots \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

(1) Polynomial

$$f_i = x_j^k \quad (i, j = 1, 2, \dots)$$

It can fit the dataset very well, but we don't know in advance which features to add, and it is computationally expensive.
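For illustration, a degree-2 polynomial expansion can be generated with scikit-learn's `PolynomialFeatures` (a sketch only; the course does not prescribe a particular library):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])                    # two examples with features x1, x2

# Degree-2 terms x1, x2, x1^2, x1*x2, x2^2 play the role of the f_i above.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)                            # (2, 5)
```

Even at degree 2 the number of features grows quadratically with the number of inputs, which is the computational concern mentioned above.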

(2) Gaussian Kernel
First, choose some landmarks $l^{(i)}$ $(i = 1, 2, \dots)$.

Second, define $f_i$ $(i = 1, 2, \dots)$, for example with the Gaussian kernel:

$$f_i = \exp\!\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right) = \operatorname{sim}(x, l^{(i)})$$

It measures the similarity between the two points:

  • If $x \approx l^{(i)}$: $f_i \approx 1$;
  • If $x$ is far from $l^{(i)}$: $f_i \approx 0$.

The parameter σ acts like a scale on the distance between the two points:

(Figure: the Gaussian kernel for different values of σ².)
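A small numeric check of this similarity (NumPy assumed; `gaussian_kernel` is my own helper name):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # sim(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l = np.array([1.0, 2.1])
print(gaussian_kernel(x, l, sigma=1.0))                       # near 1: x is close to l
print(gaussian_kernel(x, np.array([5.0, -3.0]), sigma=1.0))   # near 0: x is far from l
print(gaussian_kernel(x, np.array([5.0, -3.0]), sigma=5.0))   # larger sigma decays more slowly
```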

Finally, an example of what it predicts is:

(Figure: an example non-linear decision boundary produced with these features.)

6.1.3 SVM with kernels

(1) Choose landmarks

Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \dots, l^{(m)} = x^{(m)}$.

(2) Define kernels

We define the feature vector $f^{(i)}$ using the Gaussian kernel (with $f_0^{(i)} = 1$ as the intercept term):

$$f^{(i)} = \begin{bmatrix} f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ \operatorname{sim}(x^{(i)}, l^{(1)}) \\ \vdots \\ \operatorname{sim}(x^{(i)}, l^{(m)}) \end{bmatrix}, \quad i = 1, 2, \dots, m$$
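A sketch of this feature construction (NumPy assumed; `build_features` is my name for the helper), using every training example as a landmark and setting $f_0^{(i)} = 1$:

```python
import numpy as np

def build_features(X, sigma):
    # Map each x^(i) to f^(i) = [1, f_1^(i), ..., f_m^(i)],
    # where f_j^(i) = sim(x^(i), l^(j)) and l^(j) = x^(j).
    m = X.shape[0]
    F = np.ones((m, m + 1))                      # column 0 is the intercept f_0 = 1
    for i in range(m):
        for j in range(m):
            diff = X[i] - X[j]
            F[i, j + 1] = np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
    return F
```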

(3) Training

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \operatorname{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \operatorname{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$

Use a minimization algorithm to solve it.
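In practice the minimization is usually left to an SVM library. As one possible sketch, scikit-learn's `SVC` with an RBF kernel corresponds to this Gaussian-kernel setup: its `gamma` parameter plays the role of 1/(2σ²), though its internal objective is a closely related rather than identical formulation, and the toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two features, labels in {0, 1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

sigma = 0.5
clf = SVC(C=1.0, kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))   # expected: [0 1]
```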

(4) Evaluation

  • Large C: lower bias, higher variance.
  • Small C: higher bias, lower variance.
  • Large σ²: higher bias, lower variance (f is "smoother").
  • Small σ²: lower bias, higher variance. (See the sketch after this list.)
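One common way to navigate these bias/variance trade-offs is to pick C and σ² by cross-validation. A hedged sketch with scikit-learn's `GridSearchCV` (the parameter grid and data here are made up for illustration; `gamma` = 1/(2σ²), so a small `gamma` corresponds to a large σ²):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linear boundary

grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```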

(5) Note

  • Perform feature scaling before using the Gaussian kernel (see the pipeline sketch after this list).
  • Not all similarity functions make valid kernels. (They need to satisfy Mercer's Theorem so that SVM packages run correctly.)
  • Other kernels: polynomial kernel, string kernel, …
  • Multi-class classification: use the one-vs-all method.
  • If n is large relative to m, use logistic regression or an SVM without a kernel; if n is small and m is intermediate, use an SVM with a Gaussian kernel; if n is small and m is large, create more features and then fall back to the first case. A neural network is likely to work well for most of these settings.
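As a sketch of the feature-scaling note above (library choice and data are mine): scale the inputs first so no single feature dominates ‖x − l‖², then fit the kernelized SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features on very different scales (e.g. x1 around 1-10, x2 around 10-200).
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([0, 0, 1, 1])

# StandardScaler is fit on the training data and re-applied automatically at prediction time.
model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
model.fit(X, y)
print(model.predict([[1.5, 190.0], [8.5, 15.0]]))   # expected: [0 1]
```

For the n-versus-m guidance, swapping in `kernel='linear'` (or plain logistic regression) covers the "SVM without a kernel" case.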