[Machine Learning] 7 Support Vector Machines

1 Optimization Objective

  • Hypothesis:
    $h_\theta(x)=\begin{cases} 1 & \text{if } \theta^Tx \geq 0 \\ 0 & \text{otherwise} \end{cases}$
  • Cost Function:
    $J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$
    $C$ plays a role similar to $\frac{1}{\lambda}$; a runnable sketch of this objective follows the list.
  • Goal:
    $\mathop{\text{minimize}}\limits_{\theta} J(\theta)$
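As a concrete illustration of this objective, here is a minimal sketch assuming the common hinge-shaped choices for $\text{cost}_1$ and $\text{cost}_0$ (the function names are illustrative, not from any course code):

```python
import numpy as np

# Hinge-style surrogate costs, a common concrete choice for the
# piecewise-linear cost1/cost0 curves drawn in the course.
def cost1(z):  # cost when y = 1: zero once z >= 1
    return np.maximum(0, 1 - z)

def cost0(z):  # cost when y = 0: zero once z <= -1
    return np.maximum(0, 1 + z)

def svm_objective(theta, X, y, C):
    """J(theta) = C * sum of example costs + (1/2) * sum_j theta_j^2.

    X: (m, n+1) design matrix with a leading column of ones;
    y: (m,) labels in {0, 1}; theta[0] is excluded from regularization.
    """
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)
    return data_term + reg_term
```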

2 Large Margin Intuition

  • $\begin{aligned} &\text{if } y^{(i)}=1 \text{, we want } \theta^Tx \geq 1 \text{ (not just } \geq 0\text{)}\\ &\text{if } y^{(i)}=0 \text{, we want } \theta^Tx \leq -1 \text{ (not just } \leq 0\text{)} \end{aligned}$

3 Mathematical Principles

$u=\begin{bmatrix} u_1\\u_2\end{bmatrix}, v=\begin{bmatrix} v_1\\v_2\end{bmatrix}$

  • $u^Tv = p \cdot \|u\| = u_1v_1 + u_2v_2$
    $p$: signed length of the projection of $v$ onto $u$
  • $u^Tv = v^Tu$
  • $\|u\| = \sqrt{u_1^2 + u_2^2}$
  • $\theta_0 = 0$ means the decision boundary must pass through the origin $(0,0)$
  • The SVM minimizes $\|\theta\|$:
    $\begin{aligned} &\mathop{\text{minimize}}\limits_{\theta} \frac{1}{2}\sum_{j=1}^n\theta_j^2\\ \text{s.t. }\ &p^{(i)} \cdot \|\theta\| \geq 1 &&\text{if } y^{(i)}=1\\ &p^{(i)} \cdot \|\theta\| \leq -1 &&\text{if } y^{(i)}=0 \end{aligned}$
  • Since the constraints require $p^{(i)} \cdot \|\theta\| \geq 1$ while $\|\theta\|$ is being minimized, the projections $p^{(i)}$ must be large, which is exactly what a large margin means (see the numeric check below)
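A quick numeric check of the identity $u^Tv = p \cdot \|u\|$, with made-up values:

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

# Signed length of the projection of v onto u.
p = (u @ v) / np.linalg.norm(u)

# u^T v equals p * ||u|| (both evaluate to 10.0 here).
assert np.isclose(u @ v, p * np.linalg.norm(u))
```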

4 Kernels

  • $h_\theta(x)=\theta_1f_1+\theta_2f_2+\cdots+\theta_nf_n$
  • Given $x$, compute new features $f$ depending on proximity to landmarks $l$
  • Kernel: $f_i=\text{similarity}(x,l^{(i)})$
  • Choose the landmarks: $l^{(1)}=x^{(1)}, l^{(2)}=x^{(2)}, \cdots, l^{(m)}=x^{(m)}$
    $f^{(i)}=\begin{bmatrix} f_0^{(i)}=1\\ f_1^{(i)}=\text{sim}(x^{(i)},l^{(1)})\\ f_2^{(i)}=\text{sim}(x^{(i)},l^{(2)})\\ \vdots\\ f_i^{(i)}=\text{sim}(x^{(i)},l^{(i)})=e^0=1\\ \vdots\\ f_m^{(i)}=\text{sim}(x^{(i)},l^{(m)}) \end{bmatrix}$

4.1 Gaussian Kernel

  • $f_i=\text{similarity}(x,l^{(i)})=\exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2}\right)$
    Despite the name, this has no real connection to the normal distribution
  • if $x \approx l^{(i)}$, then $f_i \approx 1$
    if $x$ is far from $l^{(i)}$, then $f_i \approx 0$
  • Perform feature scaling before using the Gaussian kernel
  • You need to choose $\sigma^2$; a sketch of computing these features follows
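A minimal sketch of computing the kernel features $f$ with NumPy, assuming the landmarks are the training examples as above (function names are illustrative):

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

def kernel_features(x, landmarks, sigma2):
    """Map x to [1, f_1, ..., f_m] using the given landmarks
    (here the training examples themselves, as in Section 4)."""
    f = [gaussian_similarity(x, l, sigma2) for l in landmarks]
    return np.array([1.0] + f)

# Landmarks = training set; feature-scale X beforehand (see above).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
f = kernel_features(X[0], X, sigma2=1.0)  # f[1] == 1 since x == l^(1)
```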

4.2 Linear Kernel

  • Do not use kernels; the features are just $x$ itself
  • Predict “$y=1$” if $\theta^Tx \geq 0$

5 SVM with Kernels

  • Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$. Predict “$y=1$” if $\theta^Tf \geq 0$
  • Training
    $\mathop{\text{min}}\limits_{\theta} C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tf^{(i)})\right]+\theta^TM\theta$
    where $M$ is a matrix that depends on the chosen kernel; in practice this optimization is handed to a library, as in the sketch below
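A minimal sketch assuming scikit-learn as the package (any SVM library works); its `gamma` parameter corresponds to $\frac{1}{2\sigma^2}$ in the Gaussian kernel above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data; in practice, feature-scale X before a Gaussian kernel.
X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
y = np.array([0, 0, 1, 1])

sigma2 = 0.5
# scikit-learn parameterizes the RBF kernel as exp(-gamma * ||x - l||^2),
# so gamma = 1 / (2 * sigma^2) matches the Gaussian kernel above.
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma2))
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))  # expected: class 1
```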

6 Effects of the Parameters $C$ and $\sigma^2$

  • When $C$ is large (similar to a small $\lambda$), the model may overfit: high variance
    When $C$ is small (similar to a large $\lambda$), the model may underfit: high bias
  • When $\sigma^2$ is large, the features $f_i$ vary more smoothly: higher bias, lower variance
    When $\sigma^2$ is small, the features $f_i$ vary less smoothly: lower bias, higher variance
    Both parameters are usually chosen on validation data, as sketched below
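A hedged sketch of choosing $C$ and $\sigma^2$ by cross-validation with scikit-learn's GridSearchCV; `X_train` and `y_train` are hypothetical placeholders:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Pick C and sigma^2 on held-out data rather than by intuition alone:
# large C / small sigma^2 push toward high variance, and vice versa.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [1.0 / (2 * s2) for s2 in (0.1, 0.5, 1.0, 5.0)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: your scaled data
# search.best_params_            # the chosen (C, gamma) pair
```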

7 Using an SVM

  • Use an SVM software package to solve for the parameters $\theta$
  • Need to specify:
    (1) Choice of parameter CC
    (2) Choice of kernel (similarity function)

7.1 Other Choices of Kernel

  • Not all similarity functions make valid kernels
    They need to satisfy a technical condition called “Mercer’s Theorem” so that SVM packages’ optimizations run correctly and do not diverge
  • Many off-the-shelf kernels:
    (1) Polynomial kernel
    (2) String kernel
    (3) Chi-square kernel
    (4) Histogram intersection kernel

7.2 Multi-class classification

  • Train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\cdots,K$, and get $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^Tx$ (see the sketch below)

  • Or use a package’s built-in multi-class functionality
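A minimal sketch of the one-vs-rest scheme, assuming scikit-learn's LinearSVC (most packages, including scikit-learn, already provide this internally):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_train(X, y, K, C=1.0):
    """Train K binary SVMs; the k-th separates y == k from the rest."""
    return [LinearSVC(C=C).fit(X, (y == k).astype(int)) for k in range(K)]

def one_vs_rest_predict(models, X):
    """Pick the class whose SVM gives the largest decision value."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```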

8 Logistic Regression vs. Support Vector Machines

$n$: number of features
$m$: number of training examples

Choosing by $n$ and $m$:

  • $n \gg m$: logistic regression, or an SVM without a kernel
  • $n$ small, $m$ intermediate: an SVM with a Gaussian kernel
  • $n$ small, $m$ large: create/add more features, then use logistic regression or an SVM without a kernel
  • A neural network may be very slow to train in all of the above cases
  • A key advantage of the SVM is that its cost function is convex, so there are no local minima

9 References

Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning Notes (机器学习笔记)