Chapter 7 Support Vector Machines (Large Margin Classifiers)
1 Optimization Objective
Hypothesis: $h_\theta(x)=\begin{cases}1 & \text{if } \theta^Tx \ge 0\\0 & \text{otherwise}\end{cases}$
Cost Function: $J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$, where $C$ plays a role similar to $\frac{1}{\lambda}$.
Goal: $\mathop{\text{minimize}}\limits_{\theta}\ J(\theta)$
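A minimal NumPy sketch of this objective. It assumes the hinge-style surrogates used in the lectures, $\mathrm{cost}_1(z)=\max(0,1-z)$ and $\mathrm{cost}_0(z)=\max(0,1+z)$; the function names, shapes, and the handling of $\theta_0$ are illustrative only.

```python
import numpy as np

def cost1(z):
    """Cost when y = 1: zero once theta^T x >= 1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost when y = 0: zero once theta^T x <= -1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """J(theta) = C * sum[y*cost1 + (1-y)*cost0] + 0.5 * sum(theta_j^2).

    X: (m, n) feature matrix, y: length-m array of 0/1 labels,
    theta: length-n parameter vector (theta_0 is not treated separately here).
    """
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1.0 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta ** 2)
    return data_term + reg_term
```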
2 Large Margin Intuition
If $y^{(i)}=1$, we want $\theta^Tx \ge 1$ (not just $\ge 0$).
If $y^{(i)}=0$, we want $\theta^Tx \le -1$ (not just $\le 0$).
3 Mathematics Behind Large Margin Classification
$u=\begin{bmatrix}u_1\\u_2\end{bmatrix},\quad v=\begin{bmatrix}v_1\\v_2\end{bmatrix}$
$u^Tv=p\cdot||u||=u_1v_1+u_2v_2$, where $p$ is the (signed) length of the projection of $v$ onto $u$.
$u^Tv=v^Tu$
$||u||=\sqrt{u_1^2+u_2^2}$
$\theta_0=0$ means the decision boundary must pass through the origin $(0,0)$.
The SVM minimizes $||\theta||$:
$$\begin{aligned}&\mathop{\text{minimize}}\limits_{\theta}\ \frac{1}{2}\sum_{j=1}^n\theta_j^2\\\text{s.t.}\ \ &p^{(i)}\cdot||\theta||\ge 1 &&\text{if } y^{(i)}=1\\&p^{(i)}\cdot||\theta||\le -1 &&\text{if } y^{(i)}=0\end{aligned}$$
where $p^{(i)}$ is the (signed) projection of $x^{(i)}$ onto $\theta$.
The projections $p^{(i)}$ must be large enough; this is exactly what produces a large margin.
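A quick numerical check of the projection identity $u^Tv=p\cdot||u||$; the two vectors below are arbitrary example values.

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

p = np.dot(u, v) / np.linalg.norm(u)   # signed length of the projection of v onto u
print(np.dot(u, v))                    # u^T v         -> 10.0
print(p * np.linalg.norm(u))           # p * ||u||     -> 10.0
print(u[0] * v[0] + u[1] * v[1])       # u1*v1 + u2*v2 -> 10.0
```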
4 Kernels
$h_\theta(x)=\theta_1f_1+\theta_2f_2+\cdots+\theta_nf_n$
Given $x$, compute new features $f$ depending on proximity to landmarks $l$.
Kernel: $f_i=\mathrm{similarity}(x,l^{(i)})$
Choose the landmarks at the training examples: $l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \cdots,\ l^{(m)}=x^{(m)}$
$$f^{(i)}=\begin{bmatrix}f_0^{(i)}=1\\f_1^{(i)}=\mathrm{sim}(x^{(i)},l^{(1)})\\f_2^{(i)}=\mathrm{sim}(x^{(i)},l^{(2)})\\\vdots\\f_i^{(i)}=\mathrm{sim}(x^{(i)},l^{(i)})=e^0=1\\\vdots\\f_m^{(i)}=\mathrm{sim}(x^{(i)},l^{(m)})\end{bmatrix}$$
4.1 Gaussian Kernel
$f_i=\mathrm{similarity}(x,l^{(i)})=\exp\left(-\frac{||x-l^{(i)}||^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2}\right)$
Despite the exponential form, this has no real connection to the normal distribution.
If $x\approx l^{(i)}$, then $f_i\approx 1$.
If $x$ is far from $l^{(i)}$, then $f_i\approx 0$.
Perform feature scaling before using the Gaussian kernel.
The parameter $\sigma^2$ must be chosen.
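A sketch of the Gaussian similarity and the landmark feature mapping described above; the function names and the example data are illustrative, and the landmarks are placed at the training examples as in the notes.

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

def landmark_features(x, landmarks, sigma2):
    """Map x to f = [1, sim(x, l^(1)), ..., sim(x, l^(m))]."""
    sims = [gaussian_similarity(x, l, sigma2) for l in landmarks]
    return np.concatenate(([1.0], sims))

# Landmarks are the training examples themselves.
X_train = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
f = landmark_features(X_train[0], X_train, sigma2=1.0)
print(f)   # f_0 = 1 and f_1 = sim(x^(1), l^(1)) = e^0 = 1
```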
4.2 Linear Kernel
That is, no kernel is used at all.
Predict “$y=1$” if $\theta^Tx \ge 0$.
5 SVM with Kernels
Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$.
Training: $\mathop{\text{min}}\limits_{\theta}\ C\sum_{i=1}^m\left[y^{(i)}\mathrm{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tf^{(i)})\right]+\theta^TM\theta$, where $M$ is a matrix that depends on the chosen kernel.
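Continuing the sketches above, evaluating this kernelized objective might look as follows, reusing `cost1`, `cost0`, and `landmark_features` from the earlier snippets. Taking $M$ as the identity is an assumption made only to keep the example simple; the exact $M$ depends on the kernel and the solver.

```python
import numpy as np

def svm_kernel_objective(theta, X, y, C, sigma2, M=None):
    """C * sum[y*cost1(theta^T f) + (1-y)*cost0(theta^T f)] + theta^T M theta."""
    # Build the (m, m+1) matrix whose i-th row is f^(i), with landmarks at the training examples.
    F = np.array([landmark_features(x, X, sigma2) for x in X])
    z = F @ theta                                  # theta^T f^(i) for every example
    data_term = C * np.sum(y * cost1(z) + (1.0 - y) * cost0(z))
    M = np.eye(theta.size) if M is None else M     # simplifying assumption: M = I
    return data_term + theta @ M @ theta
```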
6 Effect of the Parameters $C$ and $\sigma^2$
When $C$ is large (equivalent to a small $\lambda$), the model tends to overfit: high variance. When $C$ is small (equivalent to a large $\lambda$), the model tends to underfit: high bias.
When $\sigma^2$ is large, the features $f_i$ vary more smoothly: higher bias, lower variance. When $\sigma^2$ is small, the features $f_i$ vary less smoothly: lower bias, higher variance.
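A small numerical illustration of the $\sigma^2$ effect (the landmark and the two points are arbitrary example values): with a larger $\sigma^2$ the feature of a far-away point decays much more slowly, i.e. $f_i$ varies more smoothly.

```python
import numpy as np

l      = np.array([1.0, 1.0])   # a landmark
x_near = np.array([1.0, 1.0])   # a point at the landmark
x_far  = np.array([3.0, 3.0])   # a point far from the landmark

for sigma2 in (0.5, 5.0):
    f_near = np.exp(-np.sum((x_near - l) ** 2) / (2.0 * sigma2))
    f_far  = np.exp(-np.sum((x_far  - l) ** 2) / (2.0 * sigma2))
    print(sigma2, round(f_near, 3), round(f_far, 3))
# sigma2 = 0.5 -> f_far ~ 0.000 (features change sharply: high variance)
# sigma2 = 5.0 -> f_far ~ 0.449 (features change smoothly: high bias)
```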
7 Using an SVM
Use an SVM software package to solve for the parameters $\theta$.
Need to specify:
(1) Choice of parameter $C$
(2) Choice of kernel (similarity function)
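As a concrete example of using an off-the-shelf package, here is a sketch with scikit-learn rather than the course's own solver. `C` is the same trade-off parameter as above, and for the RBF (Gaussian) kernel `gamma` plays the role of $\frac{1}{2\sigma^2}$; the toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny synthetic 0/1 dataset with a non-linear boundary (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# kernel='rbf' is the Gaussian kernel; kernel='linear' would be the "no kernel" case.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)   # gamma = 1 / (2 * sigma^2) with sigma^2 = 1
clf.fit(X, y)
print(clf.score(X, y))                      # training accuracy
```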
7.1 Other Choices of Kernel
Not all similarity functions make valid kernels. A kernel needs to satisfy a technical condition called “Mercer’s Theorem” so that SVM packages’ optimizations run correctly and do not diverge.
Many off-the-shelf kernels:
(1) Polynomial kernel
(2) String kernel
(3) Chi-square kernel
(4) Histogram intersection kernel
7.2 Multi-class Classification
Train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\cdots,K$, obtaining $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^Tx$.
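A minimal sketch of that one-vs-rest decision rule; the parameter vectors and the input below are made-up example values, each row of `Theta` stands for one $\theta^{(i)}$, and $x$ already includes the bias term.

```python
import numpy as np

Theta = np.array([[ 1.0, -0.5,  0.2],    # theta^(1)
                  [-0.3,  0.8, -0.1],    # theta^(2)
                  [ 0.1,  0.1,  0.9]])   # theta^(3)
x = np.array([1.0, 2.0, -1.0])

scores = Theta @ x                 # (theta^(i))^T x for each class i
predicted = np.argmax(scores) + 1  # classes numbered 1..K as in the notes
print(scores, predicted)
```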
8 Logistic Regression vs. Support Vector Machines
$n$: number of features; $m$: number of training examples

| $n$, $m$ | Choice |
| --- | --- |
| $n \gg m$ | Logistic regression, or SVM without a kernel |
| $n$ small, $m$ intermediate | SVM with a Gaussian kernel |
| $n$ small, $m$ large | Create/add more features, then logistic regression or SVM without a kernel |

A neural network is likely to be very slow to train in the settings above.
A key property of the SVM is that its cost function is convex, so there are no local minima.