Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$, the 'features of a sample', compute a weighted 'score' $\sum_{i=1}^{d}w_ix_i$: approve credit if the score exceeds the threshold, deny credit if it falls below the threshold, and ignore the case of equality. With $\mathcal{Y}: \{+1, -1\}$:

$$h(\mathbf{x}) = \mathrm{sign}\Big(\big(\sum_{i=1}^{d}w_ix_i\big) - \text{threshold}\Big) \overset{w_0 = -\text{threshold},\ x_0 = +1}{=} \mathrm{sign}\Big(\sum_{i=0}^{d}w_ix_i\Big) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$$
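As an illustration, a minimal sketch of this hypothesis in Python (the weights and sample below are hypothetical):

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x).

    w and x are augmented vectors: x[0] = +1 and w[0] = -threshold,
    so the threshold is folded into the inner product.
    """
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([-0.5, 0.3, 0.9])  # w[0] = -threshold = -0.5
x = np.array([1.0, 0.2, 0.7])   # x[0] = +1, features (0.2, 0.7)
print(perceptron_h(w, x))       # score 0.19 > 0, so output +1 (approve)
```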
called ‘perceptron ’ hypothesis historically
Perceptrons in $R^2$
perceptrons ⇔ linear (binary) classifiers: in $R^2$, $h(\mathbf{x}) = \mathrm{sign}(w_0 + w_1x_1 + w_2x_2)$, so the decision boundary is a line.
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal ✓
3. machine, learning, statistics, textbook
4. national, *, university, coursera
Explanation
Spam messages often contain words like "free", "discount", and "good news".
Perceptron Learning Algorithm (PLA)
The PLA algorithm searches the perceptron hypothesis set for a suitable perceptron. A perceptron is determined by its weights $\mathbf{w}$, so the task is to find a $\mathbf{w}$ that separates the samples perfectly.
Cyclic PLA
The Cyclic PLA algorithm is as follows:

(1) Initialize $\mathbf{w}$ to $\mathbf{w}_0$
(2) Update $\mathbf{w}$:
For t = 0, 1, ...
find a mistake of $\mathbf{w}_t$, called $(\mathbf{x}_{n(t)}, y_{n(t)})$:
$\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_{n(t)}) \ne y_{n(t)}$
correct the mistake by
$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}$
... until no more mistakes
return last $\mathbf{w}$ (called $\mathbf{w}_{PLA}$) as $g$
The essence of PLA: find a point that the current perceptron misclassifies, then use that point to correct the perceptron, repeating until all samples are classified correctly.
Correction rule: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}$
(Figure: illustration of one correction step, omitted.)
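A minimal runnable sketch of Cyclic PLA, assuming (hypothetically) that the input matrix `X` already carries the constant coordinate $x_0 = 1$ and that the data is linearly separable:

```python
import numpy as np

def cyclic_pla(X, y):
    """Cyclic PLA: scan the samples in a fixed order, correct the first
    mistake found, and stop after one full pass with no mistakes.
    Loops forever if (X, y) is not linearly separable.
    """
    w = np.zeros(X.shape[1])                 # (1) initialize w -> w_0
    while True:
        mistaken = False
        for x_n, y_n in zip(X, y):           # cycle through the samples
            if np.sign(w @ x_n) != y_n:      # found a mistake of w_t
                w = w + y_n * x_n            # correct it
                mistaken = True
        if not mistaken:                     # no more mistakes
            return w                         # w_PLA, used as g

# hypothetical toy data: 4 points in R^2, augmented with x_0 = 1
X = np.array([[1., 0., 1.], [1., 1., 2.], [1., 2., 0.], [1., 3., 1.]])
y = np.array([+1, +1, -1, -1])
print(cyclic_pla(X, y))
```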
Fun Time
Let's try to think about why PLA may work. Let n = n(t). According to the rule of PLA below, which formula is true?

$$\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_n) \ne y_n, \quad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n\mathbf{x}_n$$
1. $\mathbf{w}_{t+1}^T\mathbf{x}_n = y_n$
2. $\mathrm{sign}(\mathbf{w}_{t+1}^T\mathbf{x}_n) = y_n$
3. $y_n\mathbf{w}_{t+1}^T\mathbf{x}_n \geq y_n\mathbf{w}_t^T\mathbf{x}_n$ ✓
4. $y_n\mathbf{w}_{t+1}^T\mathbf{x}_n < y_n\mathbf{w}_t^T\mathbf{x}_n$
Explanation
Updating with $(\mathbf{x}_n, y_n)$ changes $\mathbf{w}_t$ into $\mathbf{w}_{t+1}$.
Because the correction is only relative ($\Delta\mathbf{w} = y_n\mathbf{x}_n$), one update does not necessarily classify $(\mathbf{x}_n, y_n)$ correctly afterwards. However,

$$y_n\mathbf{w}_{t+1}^T\mathbf{x}_n - y_n\mathbf{w}_t^T\mathbf{x}_n = y_n(\mathbf{w}_{t+1} - \mathbf{w}_t)^T\mathbf{x}_n = y_n^2\,\mathbf{x}_n^T\mathbf{x}_n \overset{y_n^2 = +1}{=} \mathbf{x}_n^T\mathbf{x}_n \ge 0$$
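A quick numerical confirmation of this inequality, using arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
w_t = rng.normal(size=3)
x_n = rng.normal(size=3)
y_n = -np.sign(w_t @ x_n)    # choose the label so (x_n, y_n) is a mistake

w_next = w_t + y_n * x_n     # one PLA correction
# the agreement score y_n * w^T x_n never decreases after an update,
# since the difference equals y_n^2 * x_n^T x_n = ||x_n||^2 >= 0
assert y_n * (w_next @ x_n) >= y_n * (w_t @ x_n)
print(y_n * (w_t @ x_n), "->", y_n * (w_next @ x_n))
```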
Guarantee of PLA
Linear Separability
If PLA halts (i.e., makes no more mistakes), then necessarily $D$ allows some $\mathbf{w}$ to make no mistakes; call such a $D$ linear separable.
linear separable $D$ ⇔ exists a perfect $\mathbf{w}_f$ such that $\mathrm{sign}(\mathbf{w}_f^T\mathbf{x}_n) = y_n$
Correcting the Perceptron
In the Fun Time above we derived that every correction satisfies $y_n\mathbf{w}_{t+1}^T\mathbf{x}_n \ge y_n\mathbf{w}_t^T\mathbf{x}_n$; we now argue that $\mathbf{w}_t$ actually approaches $\mathbf{w}_f$.
Let us compute $\mathbf{w}_f^T\mathbf{w}_{t+1}$ to confirm this.

$$\because \mathbf{w}_f \text{ is perfect for every } \mathbf{x}_n, \quad \therefore\ y_{n(t)}\mathbf{w}_f^T\mathbf{x}_{n(t)} \ge \min_n y_n\mathbf{w}_f^T\mathbf{x}_n > 0$$

so $\mathbf{w}_f^T\mathbf{w}_t \uparrow$ by updating with any $(\mathbf{x}_{n(t)}, y_{n(t)})$:

$$\begin{aligned}
\mathbf{w}_f^T\mathbf{w}_{t+1} &= \mathbf{w}_f^T\left(\mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}\right) \\
&\ge \mathbf{w}_f^T\mathbf{w}_t + \min_n y_n\mathbf{w}_f^T\mathbf{x}_n \\
&> \mathbf{w}_f^T\mathbf{w}_t + 0
\end{aligned}$$
With each correction, $\mathbf{w}_f^T\mathbf{w}_{t+1} > \mathbf{w}_f^T\mathbf{w}_t$: $\mathbf{w}_t$ is moving toward $\mathbf{w}_f$.
But the convergence we want is in the angle between the vectors, not merely a growing inner product, which could also come from $\|\mathbf{w}_t\|$ simply getting longer. So we also track how $\|\mathbf{w}_t\|^2$ grows.
$$\begin{aligned}
\left\|\mathbf{w}_{t+1}\right\|^2 &= \left\|\mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}\right\|^2 \\
&= \left\|\mathbf{w}_t\right\|^2 + 2y_{n(t)}\mathbf{w}_t^T\mathbf{x}_{n(t)} + \left\|y_{n(t)}\mathbf{x}_{n(t)}\right\|^2 \\
&\le \left\|\mathbf{w}_t\right\|^2 + 0 + \left\|y_{n(t)}\mathbf{x}_{n(t)}\right\|^2 \qquad \left(\text{since } \mathrm{sign}\left(\mathbf{w}_t^T\mathbf{x}_{n(t)}\right) \ne y_{n(t)} \Leftrightarrow y_{n(t)}\mathbf{w}_t^T\mathbf{x}_{n(t)} \le 0\right) \\
&\le \left\|\mathbf{w}_t\right\|^2 + \max_n\left\|y_n\mathbf{x}_n\right\|^2 \\
&\le \left\|\mathbf{w}_t\right\|^2 + \max_n\left\|\mathbf{x}_n\right\|^2
\end{aligned}$$
So each update grows $\|\mathbf{w}_t\|^2$ by at most $\max\limits_n\left\|\mathbf{x}_n\right\|^2$.
In fact, starting from $\mathbf{w}_0 = \mathbf{0}$, after $T$ corrections

$$\cos\theta = \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|}\frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \sqrt{T}\cdot\text{constant}$$

Proof: define

$$R^2 = \max_n\|\mathbf{x}_n\|^2, \qquad \rho = \frac{\min\limits_n y_n\mathbf{w}_f^T\mathbf{x}_n}{\|\mathbf{w}_f\|}$$

Because $\mathbf{w}_0 = \mathbf{0}$, iterating the two bounds above over $T$ updates gives

$$\mathbf{w}_f^T\mathbf{w}_T \ge T\min_n y_n\mathbf{w}_f^T\mathbf{x}_n, \qquad \|\mathbf{w}_T\|^2 \le T\max_n\|\mathbf{x}_n\|^2$$

Therefore

$$\cos\theta = \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|}\frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} = \frac{1}{\|\mathbf{w}_f\|}\frac{\mathbf{w}_f^T\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \frac{1}{\|\mathbf{w}_f\|}\frac{T\min\limits_n y_n\mathbf{w}_f^T\mathbf{x}_n}{\sqrt{T}\sqrt{\max\limits_n\|\mathbf{x}_n\|^2}} = \sqrt{T}\,\frac{\rho}{R}$$

$$\therefore \cos\theta = \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|}\frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \sqrt{T}\cdot\text{constant}, \quad \text{where constant} = \frac{\rho}{R}$$
Fun Time
Let's upper-bound $T$, the number of mistakes that PLA 'corrects'. Define

$$R^2 = \max_n\|\mathbf{x}_n\|^2, \qquad \rho = \frac{\min\limits_n y_n\mathbf{w}_f^T\mathbf{x}_n}{\|\mathbf{w}_f\|}$$
We want to show that $T \leq \square$. Express the upper bound $\square$ by the two terms above.
1. $\frac{R}{\rho}$
2. $\frac{R^2}{\rho^2}$ ✓
3. $\frac{R}{\rho^2}$
4. $\frac{\rho^2}{R^2}$

Explanation

$$\sqrt{T}\,\frac{\rho}{R} \le \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|}\frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} = \cos\theta \le 1 \quad \therefore\ T \le \frac{R^2}{\rho^2}$$
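The bound can be sanity-checked empirically. A sketch under synthetic assumptions: we generate separable data from a known $\mathbf{w}_f$ (in practice $\mathbf{w}_f$ is unknown, so $\rho$ is not computable from the data alone), run PLA, and compare the update count with $R^2/\rho^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
w_f = np.array([-0.3, 1.0, -0.5])                    # known perfect separator
X = np.hstack([np.ones((100, 1)), rng.uniform(-1, 1, size=(100, 2))])
y = np.where(X @ w_f > 0, 1, -1)                     # labels produced by w_f

R2 = np.max(np.sum(X ** 2, axis=1))                  # R^2 = max_n ||x_n||^2
rho = np.min(y * (X @ w_f)) / np.linalg.norm(w_f)    # rho as defined above

# run cyclic PLA, counting updates
w, updates, mistaken = np.zeros(3), 0, True
while mistaken:
    mistaken = False
    for x_n, y_n in zip(X, y):
        if np.sign(w @ x_n) != y_n:
            w += y_n * x_n
            updates += 1
            mistaken = True

print(f"PLA updates: {updates}, bound R^2/rho^2: {R2 / rho**2:.1f}")
```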
Non-Separable Data
More about PLA
Guarantee
If the data is linearly separable, there exists a perceptron that separates it perfectly.
Process
Correct the perceptron's weights one mistake at a time.
Result
PLA eventually separates the data perfectly, but the number of updates is not known in advance.
Pocket PLA
Suppose the data contains noise and is therefore not linearly separable; we then settle for a perceptron that classifies the data approximately well. Pocket PLA: modify the PLA algorithm by keeping the best weights seen so far in a 'pocket'.
(1) Initialize $\mathbf{w}$ to $\mathbf{w}_0$ (and the pocket weights $\hat{\mathbf{w}}$)
(2) Update $\mathbf{w}$:
For t = 0, 1, ...
find a (random) mistake of $\mathbf{w}_t$, called $(\mathbf{x}_{n(t)}, y_{n(t)})$:
$\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_{n(t)}) \ne y_{n(t)}$
(try to) correct the mistake by
$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}$
if $\mathbf{w}_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$
... until enough iterations
return $\hat{\mathbf{w}}$ (called $\mathbf{w}_{POCKET}$) as $g$
The Pocket PLA algorithm:
it updates $\mathbf{w}_t$ exactly as PLA does, but keeps a pocket $\hat{\mathbf{w}}$ storing the perceptron with the fewest mistakes so far;
after a preset amount of time or number of updates, it returns the current $\hat{\mathbf{w}}$.
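A minimal Pocket PLA sketch along the same lines (the iteration budget and data layout are assumptions, as above):

```python
import numpy as np

def pocket_pla(X, y, max_updates=1000, seed=0):
    """Pocket PLA: PLA-style updates on randomly chosen mistakes, while
    keeping the weights with the fewest total mistakes 'in the pocket'.
    X is assumed to carry the constant coordinate x_0 = 1.
    """
    rng = np.random.default_rng(seed)
    def n_mistakes(w):
        return int(np.sum(np.sign(X @ w) != y))

    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), n_mistakes(w)    # the pocket
    for _ in range(max_updates):             # ... until enough iterations
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                  # separable case: done early
            return w
        n = rng.choice(wrong)                # a (random) mistake
        w = w + y[n] * X[n]                  # PLA-style correction
        m = n_mistakes(w)                    # the costly full mistake count
        if m < best:                         # better than the pocket?
            w_hat, best = w.copy(), m
    return w_hat                             # w_POCKET, used as g
```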
Fun Time
Should we use pocket or PLA? Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?
1. pocket on D is slower than PLA ✓
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA
Explanation
If the data is linearly separable, PLA and Pocket PLA reach the same kind of result: a perceptron that classifies the data perfectly (no mistakes).
But Pocket PLA is slower, mainly because after each update it must count the mistakes over the whole dataset; comparing against, and possibly replacing, the current best perceptron also takes time.
Summary
Lecture summary
Perceptron Hypothesis Set
The perceptron hypothesis set consists of hyperplanes in a multi-dimensional space.
Perceptron Learning Algorithm (PLA)
If $\mathrm{sign}(\mathbf{w}_t^T\mathbf{x}_{n(t)}) \ne y_{n(t)}$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)}\mathbf{x}_{n(t)}$
$g \leftarrow \mathbf{w}_T$
Guarantee of PLA
If the dataset is linearly separable, PLA finds a perceptron that classifies it perfectly.
Non-Separable Data
If the dataset is not linearly separable, Pocket PLA finds a reasonably good perceptron.
Pocket PLA updates $\mathbf{w}_t$ just as PLA does, but keeps a pocket $\hat{\mathbf{w}}$ storing the perceptron with the fewest mistakes so far.
$g \leftarrow \hat{\mathbf{w}}$
PLA $\mathcal{A}$ takes a linear separable $D$ and perceptrons $\mathcal{H}$ to get hypothesis $g$.
References
*Machine Learning Foundations* (机器学习基石), by Hsuan-Tien Lin (林轩田)