Chapter 7 Support Vector Machines (Large Margin Classifiers)
1 Optimization Objective
Hypothesis: $h_\theta(x)=\begin{cases}1 & \text{if } \theta^Tx \ge 0\\0 & \text{otherwise}\end{cases}$
Cost Function: $J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$, where $C$ plays a role similar to $\frac{1}{\lambda}$.
Goal: $\mathop{\text{minimize}}\limits_{\theta}\ J(\theta)$
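A minimal NumPy sketch of this objective. It assumes the hinge-style surrogates used in the lectures, $\mathrm{cost}_1(z)=\max(0,1-z)$ and $\mathrm{cost}_0(z)=\max(0,1+z)$; the function names, shapes, and the handling of $\theta_0$ are illustrative only.

```python
import numpy as np

def cost1(z):
    """Cost when y = 1: zero once theta^T x >= 1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost when y = 0: zero once theta^T x <= -1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """J(theta) = C * sum[y*cost1 + (1-y)*cost0] + 0.5 * sum(theta_j^2).

    X: (m, n) feature matrix, y: length-m array of 0/1 labels,
    theta: length-n parameter vector (theta_0 is not treated separately here).
    """
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1.0 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta ** 2)
    return data_term + reg_term
```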
2 Large Margin Intuition
If $y^{(i)}=1$, we want $\theta^Tx \ge 1$ (not just $\ge 0$).
If $y^{(i)}=0$, we want $\theta^Tx \le -1$ (not just $\le 0$).
3 Mathematics Behind Large Margin Classification
$u=\begin{bmatrix}u_1\\u_2\end{bmatrix},\quad v=\begin{bmatrix}v_1\\v_2\end{bmatrix}$
$u^Tv=p\cdot||u||=u_1v_1+u_2v_2$, where $p$ is the (signed) length of the projection of $v$ onto $u$.
$u^Tv=v^Tu$
$||u||=\sqrt{u_1^2+u_2^2}$
$\theta_0=0$ means the decision boundary must pass through the origin $(0,0)$.
The SVM minimizes $||\theta||$:
$$\begin{aligned}&\mathop{\text{minimize}}\limits_{\theta}\ \frac{1}{2}\sum_{j=1}^n\theta_j^2\\\text{s.t.}\ \ &p^{(i)}\cdot||\theta||\ge 1 &&\text{if } y^{(i)}=1\\&p^{(i)}\cdot||\theta||\le -1 &&\text{if } y^{(i)}=0\end{aligned}$$
where $p^{(i)}$ is the (signed) projection of $x^{(i)}$ onto $\theta$.
The projections $p^{(i)}$ must be large enough; this is exactly what produces a large margin.
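A quick numerical check of the projection identity $u^Tv=p\cdot||u||$; the two vectors below are arbitrary example values.

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

p = np.dot(u, v) / np.linalg.norm(u)   # signed length of the projection of v onto u
print(np.dot(u, v))                    # u^T v         -> 10.0
print(p * np.linalg.norm(u))           # p * ||u||     -> 10.0
print(u[0] * v[0] + u[1] * v[1])       # u1*v1 + u2*v2 -> 10.0
```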
4 Kernels
$h_\theta(x)=\theta_1f_1+\theta_2f_2+\cdots+\theta_nf_n$
Given $x$, compute new features $f$ depending on proximity to landmarks $l$.
Kernel: $f_i=\mathrm{similarity}(x,l^{(i)})$
Choose the landmarks at the training examples: $l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \cdots,\ l^{(m)}=x^{(m)}$
$$f^{(i)}=\begin{bmatrix}f_0^{(i)}=1\\f_1^{(i)}=\mathrm{sim}(x^{(i)},l^{(1)})\\f_2^{(i)}=\mathrm{sim}(x^{(i)},l^{(2)})\\\vdots\\f_i^{(i)}=\mathrm{sim}(x^{(i)},l^{(i)})=e^0=1\\\vdots\\f_m^{(i)}=\mathrm{sim}(x^{(i)},l^{(m)})\end{bmatrix}$$
4.1 Gaussian Kernel
$f_i=\mathrm{similarity}(x,l^{(i)})=\exp\left(-\frac{||x-l^{(i)}||^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2}\right)$
Despite the exponential form, this has no real connection to the normal distribution.
If $x\approx l^{(i)}$, then $f_i\approx 1$.
If $x$ is far from $l^{(i)}$, then $f_i\approx 0$.
Perform feature scaling before using the Gaussian kernel.
The parameter $\sigma^2$ must be chosen.
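A sketch of the Gaussian similarity and the landmark feature mapping described above; the function names and the example data are illustrative, and the landmarks are placed at the training examples as in the notes.

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

def landmark_features(x, landmarks, sigma2):
    """Map x to f = [1, sim(x, l^(1)), ..., sim(x, l^(m))]."""
    sims = [gaussian_similarity(x, l, sigma2) for l in landmarks]
    return np.concatenate(([1.0], sims))

# Landmarks are the training examples themselves.
X_train = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
f = landmark_features(X_train[0], X_train, sigma2=1.0)
print(f)   # f_0 = 1 and f_1 = sim(x^(1), l^(1)) = e^0 = 1
```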
4.2 Linear Kernel
That is, no kernel is used at all.
Predict “$y=1$” if $\theta^Tx \ge 0$.
5 SVM with Kernels
Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$.
Training: $\mathop{\text{min}}\limits_{\theta}\ C\sum_{i=1}^m\left[y^{(i)}\mathrm{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tf^{(i)})\right]+\theta^TM\theta$, where $M$ is a matrix that depends on the chosen kernel.
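Continuing the sketches above, evaluating this kernelized objective might look as follows, reusing `cost1`, `cost0`, and `landmark_features` from the earlier snippets. Taking $M$ as the identity is an assumption made only to keep the example simple; the exact $M$ depends on the kernel and the solver.

```python
import numpy as np

def svm_kernel_objective(theta, X, y, C, sigma2, M=None):
    """C * sum[y*cost1(theta^T f) + (1-y)*cost0(theta^T f)] + theta^T M theta."""
    # Build the (m, m+1) matrix whose i-th row is f^(i), with landmarks at the training examples.
    F = np.array([landmark_features(x, X, sigma2) for x in X])
    z = F @ theta                                  # theta^T f^(i) for every example
    data_term = C * np.sum(y * cost1(z) + (1.0 - y) * cost0(z))
    M = np.eye(theta.size) if M is None else M     # simplifying assumption: M = I
    return data_term + theta @ M @ theta
```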
6 Effect of the Parameters $C$ and $\sigma^2$
When $C$ is large (equivalent to a small $\lambda$), the model tends to overfit: high variance. When $C$ is small (equivalent to a large $\lambda$), the model tends to underfit: high bias.
When $\sigma^2$ is large, the features $f_i$ vary more smoothly: higher bias, lower variance. When $\sigma^2$ is small, the features $f_i$ vary less smoothly: lower bias, higher variance.
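A small numerical illustration of the $\sigma^2$ effect (the landmark and the two points are arbitrary example values): with a larger $\sigma^2$ the feature of a far-away point decays much more slowly, i.e. $f_i$ varies more smoothly.

```python
import numpy as np

l      = np.array([1.0, 1.0])   # a landmark
x_near = np.array([1.0, 1.0])   # a point at the landmark
x_far  = np.array([3.0, 3.0])   # a point far from the landmark

for sigma2 in (0.5, 5.0):
    f_near = np.exp(-np.sum((x_near - l) ** 2) / (2.0 * sigma2))
    f_far  = np.exp(-np.sum((x_far  - l) ** 2) / (2.0 * sigma2))
    print(sigma2, round(f_near, 3), round(f_far, 3))
# sigma2 = 0.5 -> f_far ~ 0.000 (features change sharply: high variance)
# sigma2 = 5.0 -> f_far ~ 0.449 (features change smoothly: high bias)
```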
7 Using an SVM
Use an SVM software package to solve for the parameters $\theta$.
Need to specify:
(1) Choice of parameter $C$
(2) Choice of kernel (similarity function)
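As a concrete example of using an off-the-shelf package, here is a sketch with scikit-learn rather than the course's own solver. `C` is the same trade-off parameter as above, and for the RBF (Gaussian) kernel `gamma` plays the role of $\frac{1}{2\sigma^2}$; the toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny synthetic 0/1 dataset with a non-linear boundary (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# kernel='rbf' is the Gaussian kernel; kernel='linear' would be the "no kernel" case.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)   # gamma = 1 / (2 * sigma^2) with sigma^2 = 1
clf.fit(X, y)
print(clf.score(X, y))                      # training accuracy
```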
7.1 Other Choices of Kernel
Not all similarity functions make valid kernels. A kernel needs to satisfy a technical condition called “Mercer’s Theorem” so that SVM packages’ optimizations run correctly and do not diverge.
Many off-the-shelf kernels:
(1) Polynomial kernel
(2) String kernel
(3) Chi-square kernel
(4) Histogram intersection kernel
7.2 Multi-class Classification
Train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\cdots,K$, obtaining $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^Tx$.
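A minimal sketch of that one-vs-rest decision rule; the parameter vectors and the input below are made-up example values, each row of `Theta` stands for one $\theta^{(i)}$, and $x$ already includes the bias term.

```python
import numpy as np

Theta = np.array([[ 1.0, -0.5,  0.2],    # theta^(1)
                  [-0.3,  0.8, -0.1],    # theta^(2)
                  [ 0.1,  0.1,  0.9]])   # theta^(3)
x = np.array([1.0, 2.0, -1.0])

scores = Theta @ x                 # (theta^(i))^T x for each class i
predicted = np.argmax(scores) + 1  # classes numbered 1..K as in the notes
print(scores, predicted)
```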
8 Logistic Regression vs. Support Vector Machines
$n$: number of features; $m$: number of training examples

| $n$, $m$ | Choice |
| --- | --- |
| $n \gg m$ | Logistic regression, or SVM without a kernel |
| $n$ small, $m$ intermediate | SVM with a Gaussian kernel |
| $n$ small, $m$ large | Create/add more features, then logistic regression or SVM without a kernel |

A neural network is likely to be very slow to train in the settings above.
A key property of the SVM is that its cost function is convex, so there are no local minima.