深度之眼 Watermelon Book (西瓜书) Notes — Linear Models



Linear Models

Lei_ZM
2019-09-10



1. Univariate Linear Regression

Derivation roadmap for solving the bias $b$ and weight $w$:

  1. Derive the loss function $E(w, b)$ from the least squares method

  2. Prove that the loss function $E(w, b)$ is convex in $w$ and $b$

  3. Take the first-order partial derivatives of $E(w, b)$ with respect to $b$ and $w$

  4. Set both partial derivatives to zero and solve for $b$ and $w$


1.1. Deriving the Loss Function $E(w, b)$ from Least Squares

$$\begin{aligned} E_{(w, b)} &=\sum_{i=1}^{m}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2} \end{aligned} \tag{Watermelon Book Eq. 3.4}$$



1.2. Proving the Loss Function $E(w, b)$ Is Convex

1.2.1. Convexity criterion for a function of two variables:

Suppose $f(x, y)$ has continuous second-order partial derivatives on a region $D$, and write $A=f_{xx}^{\prime \prime}(x, y)$, $B=f_{xy}^{\prime \prime}(x, y)$, $C=f_{yy}^{\prime \prime}(x, y)$. Then:

  1. DD上恒有A>0A>0,且ACB20AC-B^{2}\geq 0时,f(x,y)f(x, y)在区域DD上是凸函数
  2. DD上恒有A<0A<0,且ACB20AC-B^{2}\geq 0时,f(x,y)f(x, y)在区域DD上是凹函数

1.2.2. Extrema of convex (concave) functions of two variables:

Let $f(x, y)$ be a convex (or concave) function with continuous partial derivatives on an open region $D$. If $(x_{0}, y_{0})\in D$ satisfies $f_{x}^{\prime}(x_{0}, y_{0})=0$ and $f_{y}^{\prime}(x_{0}, y_{0})=0$, then $f(x_{0}, y_{0})$ is the minimum (or maximum) of $f(x, y)$ on $D$.


1.2.3. Proof

To prove that the loss function $E(w, b)$ is convex in $w$ and $b$ (here $w$ plays the role of $x$ and $b$ the role of $y$), first compute $A=f_{xx}^{\prime \prime}(x, y)$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot\left(-x_{i}\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned} \tag{Watermelon Book Eq. 3.5}$$

Hence:

$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial w^{2}} &=\frac{\partial}{\partial w}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\ &=\frac{\partial}{\partial w}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial w}\left[2 w \sum_{i=1}^{m} x_{i}^{2}\right] \\ &=2 \sum_{i=1}^{m} x_{i}^{2} \end{aligned}$$

This is exactly $A=f_{xx}^{\prime \prime}(x, y)$.

Next, compute $B=f_{xy}^{\prime \prime}(x, y)$:

$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial w \partial b} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial b}\left[-2 \sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right] \\ &=\frac{\partial}{\partial b}\left(-2 \sum_{i=1}^{m} y_{i} x_{i}+2 \sum_{i=1}^{m} b x_{i}\right) \\ &=\frac{\partial}{\partial b}\left(2 \sum_{i=1}^{m} b x_{i}\right) \\ &=2 \sum_{i=1}^{m} x_{i} \end{aligned}$$

This is exactly $B=f_{xy}^{\prime \prime}(x, y)$.

Finally, compute $C=f_{yy}^{\prime \prime}(x, y)$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot(-1) \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned} \tag{Watermelon Book Eq. 3.6}$$

Hence:

$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial b^{2}} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial b}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\right] \\ &=\frac{\partial}{\partial b}(2 m b) \\ &=2 m \end{aligned}$$

This is exactly $C=f_{yy}^{\prime \prime}(x, y)$.

Putting these together:

$$\left\{ \begin{aligned} &A=f_{xx}^{\prime \prime}(x, y)=2 \sum_{i=1}^{m} x_{i}^{2} \\ &B=f_{xy}^{\prime \prime}(x, y)=2 \sum_{i=1}^{m} x_{i} \\ &C=f_{yy}^{\prime \prime}(x, y)=2 m \end{aligned} \right.$$

Therefore:

$$\begin{aligned} A C-B^{2} &=2 m \cdot 2 \sum_{i=1}^{m} x_{i}^{2}-\left(2 \sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4 \cdot m \cdot \frac{1}{m} \cdot\left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &=4 m \sum_{i=1}^{m} x_{i}^{2}-4 m \cdot \bar{x} \cdot \sum_{i=1}^{m} x_{i} \\ &=4 m\left(\sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m} x_{i} \bar{x}\right) \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}\right) \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+x_{i} \bar{x}\right) \\ &\qquad \sum_{i=1}^{m} x_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} x_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} x_{i}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2} \\ &=4 m \sum_{i=1}^{m}\left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+\bar{x}^{2}\right) \\ &=4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2} \end{aligned}$$

Hence:

$$AC-B^{2} = 4 m \sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2} \geq 0$$

That is, the loss function $E(w, b)$ is convex in $w$ and $b$. QED.
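
As a quick numerical sanity check, here is a short NumPy sketch (illustrative code with made-up data, not part of the original notes) that compares the closed-form $A$, $B$, $C$ above against finite differences of $E(w, b)$ and confirms $AC-B^{2}=4m\sum_{i=1}^{m}(x_i-\bar{x})^2 \geq 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                       # made-up sample inputs
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=20)
m = len(x)

def E(w, b):
    """Least-squares loss E(w, b) = sum_i (y_i - w*x_i - b)^2."""
    return np.sum((y - w * x - b) ** 2)

# Closed-form second derivatives from the proof above.
A = 2 * np.sum(x ** 2)    # d2E/dw2
B = 2 * np.sum(x)         # d2E/dwdb
C = 2 * m                 # d2E/db2

# Central finite differences at an arbitrary point (w0, b0);
# E is quadratic, so these are exact up to rounding.
w0, b0, h = 0.7, -0.2, 1e-4
A_fd = (E(w0 + h, b0) - 2 * E(w0, b0) + E(w0 - h, b0)) / h**2
C_fd = (E(w0, b0 + h) - 2 * E(w0, b0) + E(w0, b0 - h)) / h**2
B_fd = (E(w0 + h, b0 + h) - E(w0 + h, b0 - h)
        - E(w0 - h, b0 + h) + E(w0 - h, b0 - h)) / (4 * h**2)
assert np.allclose([A, B, C], [A_fd, B_fd, C_fd], atol=1e-3)

# AC - B^2 equals 4m * sum_i (x_i - x_bar)^2, hence non-negative.
assert np.isclose(A * C - B**2, 4 * m * np.sum((x - x.mean()) ** 2))
```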



1.3. First-Order Partial Derivatives of $E(w, b)$ with Respect to $b$ and $w$

Partial derivative of the loss function $E(w, b)$ with respect to $b$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot(-1) \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned} \tag{Watermelon Book Eq. 3.6}$$

Partial derivative of the loss function $E(w, b)$ with respect to $w$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot\left(-x_{i}\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned} \tag{Watermelon Book Eq. 3.5}$$



1.4. Setting the Partial Derivatives to Zero and Solving for $b$ and $w$

Setting the partial derivative with respect to $b$ to zero and solving for $b$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial b} &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) =0 \\ &\Rightarrow m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)=0 \\ &\Rightarrow b=\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right) =\frac{1}{m}\sum_{i=1}^{m} y_{i} - w \cdot \frac{1}{m}\sum_{i=1}^{m} x_{i} =\bar{y}-w\bar{x} \end{aligned} \tag{Watermelon Book Eq. 3.8}$$

Setting the partial derivative with respect to $w$ to zero and solving for $w$:

$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial w} &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) =0 \\ &\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}=0 \\ &\Rightarrow w \sum_{i=1}^{m} x_{i}^{2} = \sum_{i=1}^{m}y_{i} x_{i} - \sum_{i=1}^{m} b x_{i} \\ &\qquad b=\bar{y}-w\bar{x} \\ &\Rightarrow w \sum_{i=1}^{m} x_{i}^{2}=\sum_{i=1}^{m} y_{i} x_{i}-\sum_{i=1}^{m}(\bar{y}-w \bar{x}) x_{i} \\ &\Rightarrow w \sum_{i=1}^{m} x_{i}^{2} =\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}+w \bar{x} \sum_{i=1}^{m} x_{i} \\ &\Rightarrow w\left(\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}\right)=\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i} \\ &\Rightarrow w = \frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{y} \sum_{i=1}^{m} x_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}} \\ &\qquad \bar{y} \sum_{i=1}^{m} x_{i} = \frac{1}{m}\sum_{i=1}^{m} y_{i} \sum_{i=1}^{m} x_{i} = \bar{x} \sum_{i=1}^{m} y_{i} , \qquad \bar{x}\sum_{i=1}^{m} x_{i} = \frac{1}{m} \left(\sum_{i=1}^{m} x_{i}\right)^{2} \\ &\Rightarrow w =\frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{x} \sum_{i=1}^{m} y_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} =\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \end{aligned} \tag{Watermelon Book Eq. 3.7}$$

ww向量化,有:

$$\begin{aligned} w &=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \\ &\qquad \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2} = \left(\frac{1}{m} \sum_{i=1}^{m} x_{i}\right) \sum_{i=1}^{m} x_{i} = \bar{x} \sum_{i=1}^{m} x_{i} = \sum_{i=1}^{m} x_{i} \bar{x} \\ &=\frac{\sum_{i=1}^{m} \left(y_{i} x_{i}-y_{i} \bar{x}\right)}{\sum_{i=1}^{m} \left(x_{i}^{2}-x_{i} \bar{x}\right)} \\ &=\frac{\sum_{i=1}^{m} \left(y_{i} x_{i}-y_{i} \bar{x}-y_{i} \bar{x}+y_{i} \bar{x}\right)}{\sum_{i=1}^{m} \left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+x_{i} \bar{x}\right)} \\ &\qquad \sum_{i=1}^{m} y_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} y_{i}=\frac{1}{m} \sum_{i=1}^{m} x_{i} \sum_{i=1}^{m} y_{i}=\sum_{i=1}^{m} x_{i} \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} y_{i}=\sum_{i=1}^{m} x_{i} \bar{y} \\ &\qquad \sum_{i=1}^{m} y_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} y_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} y_{i}=m \bar{x} \bar{y}=\sum_{i=1}^{m} \bar{x} \bar{y} \\ &\qquad \sum_{i=1}^{m} x_{i} \bar{x}=\bar{x} \sum_{i=1}^{m} x_{i}=\bar{x} \cdot m \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} x_{i}=m \bar{x}^{2}=\sum_{i=1}^{m} \bar{x}^{2} \\ &=\frac{\sum_{i=1}^{m} \left(y_{i} x_{i}-y_{i} \bar{x}-x_{i} \bar{y}+\bar{x}\bar{y}\right)}{\sum_{i=1}^{m} \left(x_{i}^{2}-x_{i} \bar{x}-x_{i} \bar{x}+\bar{x}^{2}\right)} \\ &=\frac{\sum_{i=1}^{m} \left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{m} \left(x_{i}-\bar{x}\right)^{2}} \\ &\qquad \boldsymbol{x}=\left(x_{1},x_{2},\cdots, x_{m}\right)^{T} \\ &\qquad \boldsymbol{y}=\left(y_{1},y_{2},\cdots,y_{m}\right)^{T} \\ &\qquad \boldsymbol{x}_{d}=\left(x_{1}-\bar{x},x_{2}-\bar{x},\cdots,x_{m}-\bar{x}\right)^{T} \\ &\qquad \boldsymbol{y}_{d}=\left(y_{1}-\bar{y},y_{2}-\bar{y},\cdots,y_{m}-\bar{y}\right)^{T} \\ &=\frac{\boldsymbol{x}_{d}^{T} \boldsymbol{y}_{d}}{\boldsymbol{x}_{d}^{T} \boldsymbol{x}_{d}} \end{aligned}$$
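
A minimal NumPy sketch of these closed-form solutions (illustrative code, not from the original notes; the toy data are assumptions): it computes $w = \boldsymbol{x}_{d}^{T}\boldsymbol{y}_{d} / \boldsymbol{x}_{d}^{T}\boldsymbol{x}_{d}$ and $b=\bar{y}-w\bar{x}$ and checks them against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)                 # assumed toy inputs
y = 2.5 * x - 4.0 + rng.normal(scale=0.5, size=50)

# Vectorized closed form: w = x_d^T y_d / x_d^T x_d, b = y_bar - w * x_bar.
x_d = x - x.mean()
y_d = y - y.mean()
w = (x_d @ y_d) / (x_d @ x_d)                   # Watermelon Book Eq. 3.7
b = y.mean() - w * x.mean()                     # Watermelon Book Eq. 3.8

# Cross-check against NumPy's least-squares polynomial fit.
w_ref, b_ref = np.polyfit(x, y, deg=1)
assert np.allclose([w, b], [w_ref, b_ref])
print(f"w = {w:.4f}, b = {b:.4f}")
```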




2. Multivariate Linear Regression

Derivation roadmap for solving the weight vector $\hat{w}$:

  1. Derive the loss function $E_{\hat{w}}$ from the least squares method

  2. Prove that the loss function $E_{\hat{w}}$ is convex in $\hat{w}$

  3. Take the first-order derivative of $E_{\hat{w}}$ with respect to $\hat{w}$

  4. Set the derivative to zero and solve for $\hat{w}^{*}$


2.1. Combining $w$ and $b$ into $\hat{w}$

$$\begin{aligned} f\left(\boldsymbol{x}_{i}\right) &=\boldsymbol{w}^{T} \boldsymbol{x}_{i}+b \\ &=\left(\begin{array}{cccc} {w_{1}} & {w_{2}} & {\dots} & {w_{d}}\end{array}\right) \left(\begin{array}{c}{x_{i 1}} \\ {x_{i 2}} \\ {\vdots} \\ {x_{i d}}\end{array}\right)+b \\ &=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+b \\ &\qquad w_{d+1}=b \\ &=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+w_{d+1} \cdot 1 \\ &=\left(\begin{array}{ccccc} {w_{1}} & {w_{2}} & {\dots} & {w_{d}} & {w_{d+1}}\end{array}\right) \left(\begin{array}{c}{x_{i 1}} \\ {x_{i 2}} \\ {\vdots} \\ {x_{i d}} \\ 1\end{array}\right) \\ &=\hat{\boldsymbol{w}}^{T}\hat{\boldsymbol{x}}_{i} \end{aligned}$$
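
The same trick in code (an illustrative sketch with made-up dimensions, not the notes' own code): appending a constant 1 to each input lets a single vector $\hat{w}$ carry both the weights and the bias:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 5
w = rng.normal(size=d)                     # weights (assumed values)
b = 1.5                                    # bias
X = rng.normal(size=(m, d))                # m samples, d features

w_hat = np.append(w, b)                    # w_hat = (w; b)
X_hat = np.column_stack([X, np.ones(m)])   # append a column of ones

# w^T x_i + b equals w_hat^T x_hat_i for every sample.
assert np.allclose(X @ w + b, X_hat @ w_hat)
```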



2.2. Deriving the Loss Function $E_{\hat{w}}$ from Least Squares

$$\begin{aligned} E_{\hat{\boldsymbol{w}}} &=\sum_{i=1}^{m}\left(y_{i}-f\left(\hat{\boldsymbol{x}}_{i}\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)^{2} \\ &\qquad \begin{aligned} &\mathbf{X} =\left(\begin{array}{ccccc} {x_{11}} & {x_{12}} & {\dots} & {x_{1 d}} & {1} \\ {x_{21}} & {x_{22}} & {\dots} & {x_{2 d}} & {1} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & {\vdots} \\ {x_{m 1}} & {x_{m 2}} & {\dots} & {x_{m d}} & {1} \end{array}\right) =\left(\begin{array}{cc} {\boldsymbol{x}_{1}^{T}} & {1} \\ {\boldsymbol{x}_{2}^{T}} & {1} \\ {\vdots} & {\vdots} \\ {\boldsymbol{x}_{m}^{T}} & {1} \end{array}\right) =\left(\begin{array}{c} {\hat{\boldsymbol{x}}_{1}^{T}} \\ {\hat{\boldsymbol{x}}_{2}^{T}} \\ {\vdots} \\ {\hat{\boldsymbol{x}}_{m}^{T}} \end{array}\right) \\ &\boldsymbol{y}=\left(y_{1},y_{2},\cdots,y_{m}\right)^{T} \end{aligned} \\ &=\left(y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}\right)^{2} + \left(y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}\right)^{2} + \cdots + \left(y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}\right)^{2} \\ &=\left(\begin{array}{cccc} {y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}} & {y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}} & {\cdots} & {y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}} \end{array}\right) \left(\begin{array}{c} {y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}} \\ {y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}} \\ {\vdots} \\ {y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}} \end{array}\right) \\ &\qquad \left(\begin{array}{c} {y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}} \\ {y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}} \\ {\vdots} \\ {y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}} \end{array}\right) =\left(\begin{array}{c} {y_{1}} \\ {y_{2}} \\ {\vdots} \\ {y_{m}} \end{array}\right) -\left(\begin{array}{c} {\hat{\boldsymbol{x}}_{1}^{T} \hat{\boldsymbol{w}}} \\ {\hat{\boldsymbol{x}}_{2}^{T} \hat{\boldsymbol{w}}} \\ {\vdots} \\ {\hat{\boldsymbol{x}}_{m}^{T} \hat{\boldsymbol{w}}} \end{array}\right) =\left(\begin{array}{c} {y_{1}} \\ {y_{2}} \\ {\vdots} \\ {y_{m}} \end{array}\right) -\left(\begin{array}{c} {\hat{\boldsymbol{x}}_{1}^{T}} \\ {\hat{\boldsymbol{x}}_{2}^{T}} \\ {\vdots} \\ {\hat{\boldsymbol{x}}_{m}^{T}} \end{array}\right) \hat{\boldsymbol{w}} =\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}} \\ &\qquad \left(\begin{array}{cccc} {y_{1}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{1}} & {y_{2}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{2}} & {\cdots} & {y_{m}-\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{m}} \end{array}\right) =\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T} \\ &=\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right)^{T}\left(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}}\right) \end{aligned}$$



2.3. Proving the Loss Function $E_{\hat{w}}$ Is Convex in $\hat{w}$

Definition (convex set):

Let $D\subseteq \mathbb{R}^{n}$. If for any $x, y\in D$ and any $a\in [0,1]$ we have $ax+(1-a)y\in D$, then $D$ is called a convex set.

Geometric meaning of a convex set:

If two points belong to the set, then every point on the line segment joining them also belongs to the set.


Definition (gradient):

If an $n$-variable function $f(\boldsymbol{x})$ has partial derivatives $\frac{\partial f(\boldsymbol{x})}{\partial x_{i}}$ $\left(i=1,2,\cdots,n\right)$ with respect to every component $x_{i}$ of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$, then $f(\boldsymbol{x})$ is said to be first-order differentiable at $\boldsymbol{x}$, and the vector

$$\nabla f(\boldsymbol{x}) =\left(\begin{array}{c} {\frac{\partial f(\boldsymbol{x})}{\partial x_{1}}} \\ {\frac{\partial f(\boldsymbol{x})}{\partial x_{2}}} \\ {\vdots} \\ {\frac{\partial f(\boldsymbol{x})}{\partial x_{n}}}\end{array}\right)$$

is called the first derivative, or gradient, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla f(\boldsymbol{x})$ (a column vector).

Definition (Hessian matrix): If an $n$-variable function $f(\boldsymbol{x})$ has second-order partial derivatives $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}}$ $\left(i, j=1,2,\cdots,n\right)$ with respect to the components of $\boldsymbol{x}=\left(x_{1}, x_{2}, \cdots, x_{n}\right)^{T}$, then $f(\boldsymbol{x})$ is said to be twice differentiable at $\boldsymbol{x}$, and the matrix

$$\nabla^{2} f(\boldsymbol{x}) =\left[\begin{array}{cccc} {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1}^{2}}} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{n}}} \\ {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{1}}} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2}^{2}}} & {\cdots} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{n}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{1}}} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n}^{2}}} \end{array}\right]$$

is called the second derivative, or Hessian matrix, of $f(\boldsymbol{x})$ at $\boldsymbol{x}$, written $\nabla^{2} f(\boldsymbol{x})$. If all second-order partial derivatives of $f(\boldsymbol{x})$ are continuous, then $\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{i} \partial x_{j}}=\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{j} \partial x_{i}}$, and $\nabla^{2} f(\boldsymbol{x})$ is a symmetric matrix.
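
To make the definitions concrete, here is a small illustrative check (an assumed example, not from the notes): for the quadratic $f(\boldsymbol{x})=\boldsymbol{x}^{T}\mathbf{B}\boldsymbol{x}$ with symmetric $\mathbf{B}$, the gradient is $2\mathbf{B}\boldsymbol{x}$ and the Hessian is the constant matrix $2\mathbf{B}$, which finite differences confirm:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
B = (B + B.T) / 2                       # symmetrize so grad f = 2 B x
f = lambda x: x @ B @ x                 # f(x) = x^T B x

x0, h = rng.normal(size=4), 1e-5
I = np.eye(4)

# Forward-difference gradient: df/dx_i ~ (f(x + h e_i) - f(x)) / h.
grad_fd = np.array([(f(x0 + h * I[i]) - f(x0)) / h for i in range(4)])
assert np.allclose(grad_fd, 2 * B @ x0, atol=1e-4)

# Central-difference Hessian: constant 2B for a quadratic.
hess_fd = np.array([[(f(x0 + h*I[i] + h*I[j]) - f(x0 + h*I[i] - h*I[j])
                      - f(x0 - h*I[i] + h*I[j]) + f(x0 - h*I[i] - h*I[j]))
                     / (4 * h * h) for j in range(4)] for i in range(4)])
assert np.allclose(hess_fd, 2 * B, atol=1e-3)
```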

Theorem (convexity criterion for multivariate real-valued functions):

Let $D\subset \mathbb{R}^{n}$ be a nonempty open convex set and $f: D \to \mathbb{R}$ be twice continuously differentiable on $D$. If the Hessian matrix $\nabla^{2} f(\boldsymbol{x})$ of $f(\boldsymbol{x})$ is positive definite on $D$, then $f(\boldsymbol{x})$ is strictly convex on $D$.

Theorem (optimality condition for convex functions):

Let $f: \mathbb{R}^{n} \to \mathbb{R}$ be a convex, continuously differentiable function. Then $\boldsymbol{x}^{*}$ is a global minimizer if and only if $\nabla f(\boldsymbol{x}^{*})=0$, where $\nabla f(\boldsymbol{x})$ is the first derivative (gradient) of $f(\boldsymbol{x})$ with respect to $\boldsymbol{x}$.



2.4. First-Order Derivative of $E_{\hat{w}}$ with Respect to $\hat{w}$

$$\begin{aligned} \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{T}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\left(\boldsymbol{y}^{T}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T}\right)(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\boldsymbol{y}^{T} \boldsymbol{y}-\boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}+\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[-\boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}+\hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}\right] \\ &=-\frac{\partial \boldsymbol{y}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}+\frac{\partial \hat{\boldsymbol{w}}^{T} \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}} \\ &\qquad \frac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{a}^{T} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a} \\ &\qquad \frac{\partial \boldsymbol{x}^{T} \mathbf{B} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{B}+\mathbf{B}^{T}\right) \boldsymbol{x} \\ &=-\mathbf{X}^{T} \boldsymbol{y}-\mathbf{X}^{T} \boldsymbol{y}+\left(\mathbf{X}^{T} \mathbf{X}+\mathbf{X}^{T} \mathbf{X}\right) \hat{\boldsymbol{w}} \\ &=2\mathbf{X}^{T}\left(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y}\right) \end{aligned} \tag{Watermelon Book Eq. 3.10}$$

Therefore:

$$\begin{aligned} \frac{\partial^{2} E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}} \partial \hat{\boldsymbol{w}}^{T}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}\right) \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}\right) \\ &=2 \mathbf{X}^{T} \mathbf{X} \end{aligned} \tag{Hessian matrix}$$
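
By the convexity criterion above, $E_{\hat{w}}$ is convex because its Hessian $2\mathbf{X}^{T}\mathbf{X}$ is always positive semi-definite: $\hat{\boldsymbol{w}}^{T}\left(2\mathbf{X}^{T}\mathbf{X}\right)\hat{\boldsymbol{w}}=2\|\mathbf{X}\hat{\boldsymbol{w}}\|^{2}\geq 0$ (positive definite when $\mathbf{X}$ has full column rank). A quick illustrative eigenvalue check on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(size=(100, 3)), np.ones(100)])  # augmented X

H = 2 * X.T @ X                      # Hessian of E_w_hat
eigvals = np.linalg.eigvalsh(H)      # symmetric matrix, so use eigvalsh
print(eigvals)                       # all >= 0 => positive semi-definite
assert np.all(eigvals >= -1e-10)     # allow tiny numerical noise
```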



2.5. Setting the Derivative to Zero and Solving for $\hat{w}^{*}$

$$\begin{aligned} &\quad \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} =2 \mathbf{X}^{T}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})=0 \\ &\Rightarrow 2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{T} \boldsymbol{y}=0 \\ &\Rightarrow 2 \mathbf{X}^{T} \mathbf{X} \hat{\boldsymbol{w}}=2 \mathbf{X}^{T} \boldsymbol{y} \\ &\Rightarrow \hat{\boldsymbol{w}}^{*} = \left(\mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \boldsymbol{y} \end{aligned} \tag{Watermelon Book Eq. 3.11}$$

The last step assumes $\mathbf{X}^{T}\mathbf{X}$ is invertible, i.e. $\mathbf{X}$ has full column rank; otherwise the normal equations admit multiple solutions.
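
An illustrative NumPy sketch (toy data assumed): in practice one solves the normal equations $\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{w}}=\mathbf{X}^{T}\boldsymbol{y}$ with a linear solver or `lstsq` rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 200, 4
X_raw = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X_raw @ w_true + 0.7 + rng.normal(scale=0.1, size=m)

X = np.column_stack([X_raw, np.ones(m)])        # augment with a 1s column

# Closed form (Watermelon Book Eq. 3.11); solve() is preferred over inv().
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the numerically robust least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
print(w_hat)   # ~ [1.0, -2.0, 0.5, 3.0, 0.7] (weights, then bias)
```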




3. Generalized Linear Models

3.1. Exponential Family Distributions

The exponential family is a class of distributions whose probability mass function (or probability density function) takes the general form:

$$p(y ; \eta)=b(y) \exp \left(\eta^{T} T(y)-a(\eta)\right)$$

Here $\eta$ is called the natural parameter of the distribution; $T(y)$ is the sufficient statistic, which depends on the particular distribution and is usually the random variable $y$ itself; $a(\eta)$ is the log-partition function; and $b(y)$ is a function of the random variable $y$. The familiar Bernoulli and normal distributions both belong to the exponential family.

Proof that the Bernoulli distribution belongs to the exponential family:

The Bernoulli probability mass function is:

$$p(y)=\phi^{y}(1-\phi)^{1-y}$$

where $y\in\{0,1\}$ and $\phi$ is the probability that $y=1$, i.e. $p(y=1)=\phi$. Rewriting this expression:

$$\begin{aligned} p(y) &=\phi^{y}(1-\phi)^{1-y} \\ &=\exp \left(\ln \left(\phi^{y}(1-\phi)^{1-y}\right)\right) \\ &=\exp \left(\ln \phi^{y}+\ln(1-\phi)^{1-y}\right) \\ &=\exp (y \ln \phi+(1-y) \ln (1-\phi)) \\ &=\exp (y \ln \phi+\ln (1-\phi)-y \ln (1-\phi)) \\ &=\exp (y(\ln \phi-\ln (1-\phi))+\ln (1-\phi)) \\ &=\exp \left(y \ln \left(\frac{\phi}{1-\phi}\right)+\ln (1-\phi)\right) \end{aligned}$$

Comparing with the general exponential-family form $p(y;\eta)=b(y)\exp\left(\eta^{T}T(y)-a(\eta)\right)$, the Bernoulli distribution corresponds to the parameters:

$$\begin{aligned} b(y)&=1 \\ \eta&=\ln\left(\frac{\phi}{1-\phi}\right) \\ T(y)&=y \\ a(\eta)&=-\ln(1-\phi)=\ln\left(1+e^{\eta}\right) \end{aligned}$$
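
A small numeric sanity check of this decomposition (an illustrative sketch; the value of $\phi$ is an arbitrary assumption):

```python
import numpy as np

phi = 0.3                                    # assumed Bernoulli parameter
eta = np.log(phi / (1 - phi))                # natural parameter
a = np.log(1 + np.exp(eta))                  # log-partition a(eta)

for y in (0, 1):
    p_direct = phi**y * (1 - phi)**(1 - y)   # phi^y (1-phi)^(1-y)
    p_expfam = 1 * np.exp(eta * y - a)       # b(y) exp(eta T(y) - a(eta))
    assert np.isclose(p_direct, p_expfam)
```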



3.2. The Three Assumptions of Generalized Linear Models

  1. Given $\boldsymbol{x}$, the random variable $y$ is assumed to follow some exponential family distribution

  2. Given $\boldsymbol{x}$, the goal is a model $h(\boldsymbol{x})$ that predicts the expected value of $T(y)$

  3. The natural parameter $\eta$ of that exponential family distribution is assumed to be linear in $\boldsymbol{x}$, i.e. $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$




4. Logistic Regression

Logistic (log-odds) regression models a binary classification problem in which the target random variable $y$ takes the value 0 or 1, so it is natural to assume that $y$ follows a Bernoulli distribution. If we want a linear model that predicts $y$ given $\boldsymbol{x}$, we can build it within the generalized linear model framework.

4.1. Deriving Logistic Regression as a Generalized Linear Model

Since $y$ follows a Bernoulli distribution, which belongs to the exponential family, the first GLM assumption is satisfied. Then, by the second GLM assumption, the expression for the model $h(\boldsymbol{x})$ is:

$$h(\boldsymbol{x})=E[T(y|\boldsymbol{x})]$$

Since the Bernoulli sufficient statistic is $y$ itself, i.e. $T(y|\boldsymbol{x})=y|\boldsymbol{x}$, we have:

$$h(\boldsymbol{x})=E[y|\boldsymbol{x}]$$

Moreover, $E[y|\boldsymbol{x}]=1\times p(y=1|\boldsymbol{x})+0\times p(y=0|\boldsymbol{x})=p(y=1|\boldsymbol{x})=\phi$, so:

$$h(\boldsymbol{x})=\phi$$

From the earlier proof that the Bernoulli distribution belongs to the exponential family, we know:

$$\begin{aligned} &\eta=\ln \left(\frac{\phi}{1-\phi}\right) \\ &\Rightarrow e^{\eta}=\frac{\phi}{1-\phi} \\ &\Rightarrow e^{-\eta}=\frac{1-\phi}{\phi} \\ &\Rightarrow e^{-\eta}=\frac{1}{\phi}-1 \\ &\Rightarrow 1+e^{-\eta}=\frac{1}{\phi} \\ &\Rightarrow \phi=\frac{1}{1+e^{-\eta}} \end{aligned}$$

ϕ=11+eη\phi=\frac{1}{1+e^{-\eta}}代入h(x)h(\boldsymbol{x})的表达式可得:

$$h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\eta}}$$

By the third GLM assumption, $\eta=\boldsymbol{w}^{T}\boldsymbol{x}$, so $h(\boldsymbol{x})$ finally becomes:

$$h(\boldsymbol{x})=\phi=\frac{1}{1+e^{-\boldsymbol{w}^{T}\boldsymbol{x}}}=p(y=1|\boldsymbol{x}) \tag{Watermelon Book Eq. 3.23}$$

This is the logistic regression model.
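
As a minimal illustration (assumed weights and query points, not the notes' own code), the model simply pushes a linear score through the sigmoid to obtain $p(y=1|\boldsymbol{x})$:

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + e^{-z}); fine for moderate z values.
    return 1 / (1 + np.exp(-z))

w = np.array([0.8, -1.2, 0.3])        # assumed learned weights
X = np.array([[1.0, 0.5, 2.0],
              [-0.3, 1.1, 0.0]])      # two query points

p = sigmoid(X @ w)                    # p(y=1 | x) for each row
y_pred = (p >= 0.5).astype(int)       # threshold at 0.5
print(p, y_pred)
```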



4.2. Maximum Likelihood Estimation

Let the population have probability density function (or probability mass function) $f(y, w_{1}, w_{2}, \cdots, w_{k})$, and let $y_{1}, y_{2}, \ldots, y_{m}$ be a sample drawn from it. Since $y_{1}, y_{2}, \ldots, y_{m}$ are independent and identically distributed, their joint density (or joint probability) is:

$$L\left(y_{1}, y_{2}, \ldots, y_{m} ; w_{1}, w_{2}, \ldots, w_{k}\right)=\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)$$

where $w_{1}, w_{2}, \ldots, w_{k}$ are regarded as fixed but unknown parameters. Once a sample $y_{1}, y_{2}, \ldots, y_{m}$ has been observed, an intuitive way to estimate the unknown parameters is: whichever set of parameters makes the observed sample most probable is likely the true one, so we take it as the estimate. This is maximum likelihood estimation.

The maximum likelihood procedure:

We usually write $L\left(y_{1}, y_{2}, \ldots, y_{m} ; w_{1}, w_{2}, \ldots, w_{k}\right)=L(\boldsymbol{w})$ and call it the likelihood function. Finding the maximum likelihood estimate of $\boldsymbol{w}$ then reduces to finding the maximizer of $L(\boldsymbol{w})$. Since the logarithm is monotonically increasing:

$$\begin{aligned} \ln L(\boldsymbol{w}) &=\ln \left(\prod_{i=1}^{m} f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)\right) \\ &=\sum_{i=1}^{m} \ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right) \end{aligned}$$

L(w)L(w)有相同的最大值点,而在许多情况下,求lnL(w)\ln L(w)的最大值点比较简单,于是,我们就将求L(w)L(w)的最大值点转化为了求lnL(w)\ln L(w)的最大值点,通常称lnL(w)\ln L(w)为对数似然函数。

Maximum likelihood for logistic regression:

The probabilities that the random variable $y$ takes the values 1 and 0 are:

$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}} \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{w}^{T} \boldsymbol{x}+b}} \end{aligned}$$

β=(w;b)\boldsymbol{\beta}=(w;b)x^=(x;1)\hat{\boldsymbol{x}}=(\boldsymbol{x}; 1),则wTx+bw^{T}\boldsymbol{x}+b可简化为βTx^\boldsymbol{\beta}^{T}\hat{x},于是上式可化简为:

$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} \end{aligned}$$

Write:

$$\begin{aligned} &p(y=1 | \boldsymbol{x})=\frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}=p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \\ &p(y=0 | \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}=p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) \end{aligned}$$

A small trick then yields a single expression for the probability mass function of $y$:

$$p(y | \boldsymbol{x} ; \boldsymbol{w}, b) =y \cdot p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})+(1-y) \cdot p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta}) \tag{Watermelon Book Eq. 3.26}$$

or, equivalently:

$$p(y | \boldsymbol{x} ; \boldsymbol{w}, b) =\left[p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{y} \left[p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{1-y}$$
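
Both expressions agree on $y\in\{0,1\}$, as a quick illustrative check shows (the value of $p_1$ is an arbitrary assumption):

```python
import numpy as np

p1 = 0.73                   # assumed p(y=1|x)
p0 = 1 - p1

for y in (0, 1):
    additive = y * p1 + (1 - y) * p0          # Watermelon Book Eq. 3.26
    multiplicative = p1**y * p0**(1 - y)
    assert np.isclose(additive, multiplicative)
```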



4.3. Parameter Estimation for Logistic Regression

From the definition of the log-likelihood function:

$$\ln L(\boldsymbol{w}) =\sum_{i=1}^{m} \ln f\left(y_{i}, w_{1}, w_{2}, \ldots, w_{k}\right)$$

Since $y$ is discrete here, we simply replace the probability density in the log-likelihood with the probability mass function, giving:

$$\ell(\boldsymbol{w},b) :=\ln L(\boldsymbol{w},b) =\sum_{i=1}^{m} \ln p\left(y_{i} | \boldsymbol{x}_{i}; \boldsymbol{w},b\right) \tag{Watermelon Book Eq. 3.25}$$

p(yx;w,b)=yp1(x^;β)+(1y)p0(x^;β)p(y | \boldsymbol{x} ; \boldsymbol{w}, b)=y \cdot p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})+(1-y) \cdot p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})代入对数似然函数可得:

$$\begin{aligned} \ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m} \ln \left(y_{i} p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)+\left(1-y_{i}\right) p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right) \\ &\qquad p_{1}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} , \qquad p_{0}(\hat{\boldsymbol{x}};\boldsymbol{\beta}) = \frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}}} \\ &=\sum_{i=1}^{m} \ln \left(\frac{y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}+\frac{1-y_{i}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}\right) \\ &=\sum_{i=1}^{m} \ln \left(\frac{y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-y_{i}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}\right) \\ &=\sum_{i=1}^{m}\left(\ln \left(y_{i} e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-y_{i}\right)-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \\ &\qquad y_{i}\in \{0,1\} \\ &\qquad y_{i}=0: \quad \ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(\ln \left(0 \cdot e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-0\right)-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right)=\sum_{i=1}^{m}\left(-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \\ &\qquad y_{i}=1: \quad \ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(\ln \left(1 \cdot e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}+1-1\right)-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right)=\sum_{i=1}^{m}\left(\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \\ &=\sum_{i=1}^{m}\left(y_{i} \boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \end{aligned} \tag{cf. Watermelon Book Eq. 3.27}$$

p(yx;w,b)=[p1(x^;β)]y[p0(x^;β)]1yp(y | \boldsymbol{x} ; \boldsymbol{w}, b)=\left[p_{1}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{y}\left[p_{0}(\hat{\boldsymbol{x}} ; \boldsymbol{\beta})\right]^{1-y},将其代入对数似然函数可得:

$$\begin{aligned} \ell(\boldsymbol{\beta}) &=\sum_{i=1}^{m} \ln \left(\left[p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{y_{i}}\left[p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{1-y_{i}}\right) \\ &=\sum_{i=1}^{m}\left[\ln \left(\left[p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{y_{i}}\right)+\ln \left(\left[p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right]^{1-y_{i}}\right)\right] \\ &=\sum_{i=1}^{m}\left[y_{i} \ln \left(p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)+\left(1-y_{i}\right) \ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right] \\ &=\sum_{i=1}^{m}\left\{y_{i}\left[\ln \left(p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)-\ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right]+\ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right\} \\ &=\sum_{i=1}^{m}\left[y_{i}\ln \frac{p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)}{p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)}+\ln \left(p_{0}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right)\right] \\ &\qquad p_{1}(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}} , \qquad p_{0}(\hat{\boldsymbol{x}}_{i};\boldsymbol{\beta}) = \frac{1}{1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}} \\ &=\sum_{i=1}^{m}\left[y_{i}\ln \left(e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right) - \ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right] \\ &=\sum_{i=1}^{m}\left(y_{i} \boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{T} \hat{\boldsymbol{x}}_{i}}\right)\right) \end{aligned}$$
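
To close, an illustrative check (random data assumed) that the simplified formula $\ell(\boldsymbol{\beta})=\sum_{i}\left(y_{i}\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}-\ln\left(1+e^{\boldsymbol{\beta}^{T}\hat{\boldsymbol{x}}_{i}}\right)\right)$ matches summing the log of the Bernoulli pmf directly; `np.logaddexp(0, z)` computes $\ln(1+e^{z})$ stably:

```python
import numpy as np

rng = np.random.default_rng(6)
m, d = 50, 3
X_hat = np.column_stack([rng.normal(size=(m, d)), np.ones(m)])
beta = rng.normal(size=d + 1)
y = rng.integers(0, 2, size=m)            # arbitrary 0/1 labels

z = X_hat @ beta                          # beta^T x_hat_i for each i

# Simplified form: sum_i (y_i z_i - ln(1 + e^{z_i})).
ll_formula = np.sum(y * z - np.logaddexp(0, z))

# Direct form: sum_i ln(p1^{y_i} * p0^{1-y_i}).
p1 = 1 / (1 + np.exp(-z))
ll_direct = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

assert np.allclose(ll_formula, ll_direct)
print(ll_formula)
```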