Machine Learning Foundations Notes (2): Learning to Answer Yes/No

Lecture 2: Learning to Answer Yes/No

Perceptron Hypothesis Set

For $x = (x_1, x_2, \cdots, x_d)$, the 'features of the sample', compute a weighted 'score' $\sum_{i=1}^{d}w_ix_i$,
and approve credit if $\sum_{i=1}^{d}w_ix_i > threshold$, deny credit if $\sum_{i=1}^{d}w_ix_i < threshold$, and ignore the equals case.
$Y: \{+1, -1\}$:
$$h(x)=sign\left(\left(\sum_{i=1}^{d}w_ix_i\right)-threshold\right) \xlongequal{w_0 = -threshold,\ x_0 = +1} sign\left(\left(\sum_{i=1}^{d}w_ix_i\right)+w_0x_0\right) = sign\left(\sum_{i=0}^{d}w_ix_i\right) = sign(w^Tx)$$
called ‘perceptron’ hypothesis historically
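
A minimal sketch of this hypothesis in Python (not from the lecture; the function name `perceptron_h` and the toy numbers are mine), with the constant feature $x_0 = +1$ prepended so the threshold is absorbed into $w_0$:

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x); x already contains x_0 = 1."""
    return 1 if np.dot(w, x) > 0 else -1   # treat the rare w^T x == 0 case as -1

# toy example: threshold 0.5 folded into w_0 = -0.5
w = np.array([-0.5, 0.3, 0.7])   # [w_0, w_1, w_2]
x = np.array([1.0, 0.2, 0.9])    # [x_0 = 1, x_1, x_2]
print(perceptron_h(w, x))        # +1, since w^T x = -0.5 + 0.06 + 0.63 > 0
```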

Perceptrons in $R^2$


perceptrons ⇔ linear (binary) classifiers


Fun Time

Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal  \checkmark
3. machine, learning, statistics, textbook
4. national, *, university, coursera

Explanation
Spam messages often contain words such as "free", "deal", and "good news".

Perceptron Learning Algorithm (PLA)

The PLA algorithm looks for a suitable perceptron within the perceptron hypothesis set (Perceptrons). Since a perceptron is determined by $w$, this means finding a $w$ that separates the samples completely.

Cyclic PLA

The Cyclic PLA algorithm is as follows:
(1) Initialize $w$ to $w_0$
(2) Update $w$:
  For t = 0, 1, ...
    find a mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$:
      $sign(w_t^Tx_{n(t)}) \ne y_{n(t)}$

    correct the mistake by
      $w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}$

  ... until no more mistakes
  return last $w$ (called $w_{PLA}$) as $g$

The essence of the PLA algorithm: find a point that the current perceptron misclassifies, then use that point to correct the perceptron, until all sample points are classified correctly.
Correction rule: $w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}$
Correction illustration: (figure omitted)
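
A runnable sketch of Cyclic PLA, assuming (my convention, not the lecture's) that the samples are stored in a NumPy array `X` with $x_0 = 1$ already prepended and labels `y` in $\{+1, -1\}$; the name `cyclic_pla` and the `max_epochs` safety cap are mine:

```python
import numpy as np

def cyclic_pla(X, y, max_epochs=1000):
    """Cyclic PLA: visit the samples in order, correct every mistake found,
    and stop once a full pass over the data produces no mistakes."""
    n, d = X.shape
    w = np.zeros(d)                              # w_0: start from the zero vector
    for _ in range(max_epochs):
        mistake_found = False
        for i in range(n):
            if np.sign(X[i] @ w) != y[i]:        # sign(w_t^T x_{n(t)}) != y_{n(t)}
                w = w + y[i] * X[i]              # w_{t+1} <- w_t + y_{n(t)} x_{n(t)}
                mistake_found = True
        if not mistake_found:                    # no more mistakes: halt
            break
    return w                                     # w_PLA, returned as g
```

On linearly separable data this halts with a perfect separator; on non-separable data the corrections would never stop, which is why the `max_epochs` cap (or the Pocket variant later in these notes) is needed.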

Fun Time

Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
$$sign(w_t^Tx_{n}) \ne y_{n},\quad w_{t+1} \leftarrow w_t + y_{n}x_{n}$$
1. $w_{t+1}^Tx_{n} = y_{n}$
2. $sign(w_{t+1}^Tx_{n}) = y_{n}$
3. $y_{n}w_{t+1}^Tx_{n} \geq y_{n}w_{t}^Tx_{n}$  \checkmark
4. $y_{n}w_{t+1}^Tx_{n} < y_{n}w_{t}^Tx_{n}$

Explanation
How $w_t$ changes into $w_{t+1}$ after an update with $(x_{n}, y_{n})$:
Since the perceptron correction is only a relative adjustment ($\Delta w = y_{n}x_{n}$), the point $(x_{n}, y_{n})$ is not necessarily classified correctly after the update.

$$y_{n}w_{t+1}^Tx_{n} - y_{n}w_{t}^Tx_{n} = y_{n}(w_{t+1}-w_{t})^Tx_{n} = y_{n}(y_{n}x_{n})^Tx_{n} = y_{n}^2\,x_{n}^Tx_{n} \xlongequal{y_{n}^2 = +1} x_{n}^Tx_{n} \ge 0$$
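
A tiny numeric check of this inequality (toy numbers of my own): the correction raises $y_nw^Tx_n$ by exactly $x_n^Tx_n$, yet the point can remain misclassified.

```python
import numpy as np

w_t = np.array([0.0, -2.0, -3.0])
x_n = np.array([1.0, 1.0, 1.0])
y_n = +1

assert np.sign(w_t @ x_n) != y_n        # (x_n, y_n) is a mistake: w_t^T x_n = -5
w_t1 = w_t + y_n * x_n                  # one PLA correction

print(y_n * (w_t  @ x_n))               # -5.0
print(y_n * (w_t1 @ x_n))               # -2.0: larger by x_n^T x_n = 3, yet still a mistake
```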


Guarantee of PLA


Linear Separability

If PLA halts (i.e. no more mistakes), then as a necessary condition $D$ allows some $w$ to make no mistake; call such $D$ linear separable.

linear separable $D$ ⇔ exists a perfect $\mathbf{w_f}$ such that $sign(w_f^Tx_{n}) = y_{n}$

Perceptron Correction

In the Fun Time above we derived that every correction gives $y_{n}w_{t+1}^Tx_{n} \ge y_{n}w_{t}^Tx_{n}$, so $w_t$ should be getting closer to $w_f$.
To double-check, compute how $w_f^Tw_{t}$ changes after an update:
$$\because w_f \ is \ perfect \ for \ every \ (x_n, y_n) \\ \therefore y_{n(t)}w_f^Tx_{n(t)} \ge \mathop{min}\limits_n \, y_{n}w_f^Tx_{n} > 0 \\ \mathbf{w_f^Tw_{t} \uparrow} \ by \ updating \ with \ any \ (x_{n(t)}, y_{n(t)}): \\ \begin{aligned} \mathbf{w}_{f}^{T} \mathbf{w}_{t+1} &= \mathbf{w}_{f}^{T}\left(\mathbf{w}_{t}+y_{n(t)} \mathbf{x}_{n(t)}\right) \\ & \geq \mathbf{w}_{f}^{T} \mathbf{w}_{t}+\min _{n} y_{n} \mathbf{w}_{f}^{T} \mathbf{x}_{n} \\ &>\mathbf{w}_{f}^{T} \mathbf{w}_{t}+0 \end{aligned}$$

After every correction $\mathbf{w}_{f}^{T} \mathbf{w}_{t+1} > \mathbf{w}_{f}^{T} \mathbf{w}_{t}$, so $w_t$ is moving toward $w_f$.

But the closeness we want is about the angle between $w_t$ and $w_f$, not just a growing inner product (which could also come simply from $\|w_t\|$ growing). So compute the change in $\|w_t\|^2$:

$$\begin{aligned} \left\|\mathbf{w}_{t+1}\right\|^{2} &=\left\|\mathbf{w}_{t}+y_{n(t)} \mathbf{x}_{n(t)} \right\|^{2} \\ &=\left\|\mathbf{w}_{t}\right\|^{2}+2 y_{n(t)} \mathbf{w}_{t}^{T} \mathbf{x}_{n(t)}+\left\|y_{n(t)} \mathbf{x}_{n(t)}\right\|^{2} \\ & \leq\left\|\mathbf{w}_{t}\right\|^{2}+0+\left\|y_{n(t)} \mathbf{x}_{n(t)}\right\|^{2} \quad \left(\text{since } \operatorname{sign}\left(\mathbf{w}_{t}^{T} \mathbf{x}_{n(t)}\right) \neq y_{n(t)} \Leftrightarrow y_{n(t)} \mathbf{w}_{t}^{T} \mathbf{x}_{n(t)} \leq 0\right) \\ & \leq\left\|\mathbf{w}_{t}\right\|^{2}+\max _{n}\left\|y_{n} \mathbf{x}_{n}\right\|^{2} \\ & \leq\left\|\mathbf{w}_{t}\right\|^{2}+\max _{n}\left\| \mathbf{x}_{n}\right\|^{2} \end{aligned}$$
So each correction increases $\|w_t\|^2$ by at most $\max\limits_{n}\left\|\mathbf{x}_{n}\right\|^{2}$.

In fact, starting from $w_0 = 0$, after $T$ corrections we have $\cos\theta=\frac{\mathbf{w}_{f}^{T}}{\left\|\mathbf{w}_{f}\right\|} \frac{\mathbf{w}_{T}}{\left\|\mathbf{w}_{T}\right\|} \geq \sqrt{T} \cdot constant$, where $\theta$ is the angle between $w_f$ and $w_T$.
Proof:
$$R^2 = \max _{n} \|\mathbf{x}_{n}\|^{2} \quad \rho = \frac{\min \limits _{n} y_{n} \mathbf{w}_{f}^{T} \mathbf{x}_n}{\left\|\mathbf{w_f}\right\|} \\ \because \mathbf{w_0 = 0} \quad \mathbf{w}_{f}^{T}\mathbf{w}_{T} \geq T\min \limits _{n} y_{n} \mathbf{w}_{f}^{T} \mathbf{x}_{n} \quad \left\|\mathbf{w}_{T}\right\|^{2} \leq T\max \limits _{n}\left\| \mathbf{x}_{n}\right\|^{2} \\ \cos\theta=\frac{\mathbf{w}_{f}^{T}}{\left\|\mathbf{w}_{f}\right\|} \frac{\mathbf{w}_{T}}{\left\|\mathbf{w}_{T}\right\|} = \frac{1}{\left\|\mathbf{w}_{f}\right\|} \frac{\mathbf{w}_{f}^{T} \mathbf{w}_{T}}{\left\|\mathbf{w}_{T}\right\|} \geq \frac{1}{\left\|\mathbf{w}_{f}\right\|} \frac{T\min \limits _{n} y_{n} \mathbf{w}_{f}^{T} \mathbf{x}_{n}}{\sqrt{T\max \limits _{n}\left\| \mathbf{x}_{n}\right\|^{2}}} = \sqrt{T}\,\frac{\rho}{R} \\ \therefore \cos\theta=\frac{\mathbf{w}_{f}^{T}}{\left\|\mathbf{w}_{f}\right\|} \frac{\mathbf{w}_{T}}{\left\|\mathbf{w}_{T}\right\|} \geq \sqrt{T} \cdot constant, \quad \text{where } constant = \frac{\rho}{R}$$
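
As a sanity check of this bound (not part of the lecture), the short script below plants a perfect separator $w_f$, generates a small linearly separable dataset, runs mistake-driven PLA while counting the corrections $T$, and compares $T$ with $R^2/\rho^2$; the data recipe and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w_f = np.array([-0.3, 1.0, -2.0])                    # a planted "perfect" separator
X = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, size=(200, 2))])   # x_0 = 1 prepended
y = np.sign(X @ w_f)
X, y = X[y != 0], y[y != 0]                          # drop points exactly on the boundary

w, T = np.zeros(3), 0                                # w_0 = 0, T counts corrections
while True:
    mistakes = np.flatnonzero(np.sign(X @ w) != y)
    if mistakes.size == 0:                           # data is separable, so this loop halts
        break
    i = mistakes[0]
    w, T = w + y[i] * X[i], T + 1                    # the usual PLA correction

R2 = np.max(np.sum(X ** 2, axis=1))                  # R^2 = max_n ||x_n||^2
rho = np.min(y * (X @ w_f)) / np.linalg.norm(w_f)    # rho = min_n y_n w_f^T x_n / ||w_f||
print(T, R2 / rho ** 2)                              # T should not exceed R^2 / rho^2
```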


Fun Time

Let’s upper-bound T, the number of mistakes that PLA ‘corrects’.
$$Define \; R^2 = \max _{n} \|\mathbf{x}_{n}\|^{2} \quad \rho = \frac{\min \limits _{n} y_{n} \mathbf{w}_{f}^{T} \mathbf{x}_n}{\left\|\mathbf{w_f}\right\|}$$

We want to show that $T \leq \square$. Express the upper bound $\square$ by the two terms above.
1. $\frac{R}{\rho}$
2. $\frac{R^2}{\rho^2}$  \checkmark
3. $\frac{R}{\rho^2}$
4. $\frac{\rho^2}{R^2}$


Explanation
$$\sqrt{T}\frac{\rho}{R} \leq \frac{\mathbf{w}_{f}^{T}}{\left\|\mathbf{w}_{f}\right\|} \frac{\mathbf{w}_{T}}{\left\|\mathbf{w}_{T}\right\|} = \cos\theta \leq 1 \\ \therefore T \leq \frac{R^2}{\rho^2}$$


Non-Separable Data

More about PLA

Guarantee
If the data is linearly separable, there exists a perceptron that separates it completely.

Process
The perceptron's parameters are updated through one correction after another.

Result
Complete separation is eventually reached, but we do not know in advance how many updates that takes.

Pocket PLA

Suppose the data contains noise and is therefore not linearly separable; we then want a perceptron that classifies the data approximately correctly.
Pocket PLA: modify the PLA algorithm by keeping the best weights seen so far in a 'pocket'.
(1) Initialize $w$ to $w_0$, and the pocket weights $\hat{\mathbf{w}}$
(2) Update $w$:
  For t = 0, 1, ...
    find a (random) mistake of $w_t$, called $(x_{n(t)}, y_{n(t)})$:
      $sign(w_t^Tx_{n(t)}) \ne y_{n(t)}$

    (try to) correct the mistake by
      $w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}$

    if $w_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $w_{t+1}$
      $\hat{\mathbf{w}} \leftarrow w_{t+1}$

  ... until enough iterations
  return $\hat{\mathbf{w}}$ (called $w_{Pocket}$) as $g$

The Pocket PLA algorithm:
  Update $w_{t}$ in the same way as PLA, but keep a pocket $\hat{\mathbf{w}}$ that stores the perceptron with the fewest mistakes so far.
  After enough time or enough updates, return the current $\hat{\mathbf{w}}$.
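
A sketch of Pocket PLA under the same data conventions as the `cyclic_pla` sketch above (`X` with $x_0 = 1$ prepended, labels in $\{+1, -1\}$); the names `pocket_pla`, `num_mistakes` and the `num_updates` budget are mine.

```python
import numpy as np

def num_mistakes(w, X, y):
    """Count how many samples w misclassifies."""
    return int(np.sum(np.sign(X @ w) != y))

def pocket_pla(X, y, num_updates=100, seed=0):
    """Pocket PLA: run PLA-style corrections on randomly chosen mistakes, but
    always keep the best weights seen so far (fewest mistakes) in the pocket."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_pocket, best = w.copy(), num_mistakes(w, X, y)
    for _ in range(num_updates):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:               # data happens to be separable: done
            return w
        i = rng.choice(mistakes)             # find a (random) mistake of w_t
        w = w + y[i] * X[i]                  # same correction rule as PLA
        m = num_mistakes(w, X, y)
        if m < best:                         # better than the pocket weights?
            w_pocket, best = w.copy(), m
    return w_pocket                          # w_Pocket, returned as g
```

The per-update call to `num_mistakes` over the whole dataset is the main reason Pocket PLA is slower than plain PLA on separable data (see the Fun Time below).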


Fun Time

Should we use pocket or PLA?
Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?
1 pocket on D is slower than PLA  \checkmark
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA

Explanation
If the data is linearly separable, PLA and Pocket PLA give the same kind of result: a perceptron that classifies the data completely (no mistakes).
But Pocket PLA is slower, mainly because in every round it has to count the mistakes over the whole dataset; comparing with, and possibly replacing, the current best perceptron also takes some time.

Summary

Lecture summary

Perceptron Hypothesis Set
  The perceptron hypothesis set consists of hyperplanes in a multi-dimensional space.

Perceptron Learning Algorithm (PLA)
  When $sign(w_t^Tx_{n(t)}) \ne y_{n(t)}$, update $w_{t+1} \leftarrow w_t + y_{n(t)}x_{n(t)}$
  $g \leftarrow w_T$

Guarantee of PLA
  If the dataset is linearly separable, the PLA algorithm finds a perceptron that classifies it completely.

Non-Separable Data
  If the dataset is not linearly separable, use the Pocket PLA algorithm to find a reasonably good perceptron.
  Pocket PLA updates $w_{t}$ in the same way as PLA, but keeps a pocket $\hat{\mathbf{w}}$ that stores the perceptron with the fewest mistakes so far.
  $g \leftarrow \hat{\mathbf{w}}$

PLA $\mathcal{A}$ takes a linear separable $D$ and perceptrons $\mathcal{H}$ to get hypothesis $g$

References

《Machine Learning Foundations》(机器学习基石)—— Hsuan-Tien Lin (林轩田)