Machine Learning Basics 07: Logistic Regression

1 What Is Logistic Regression?

1.1 Introduction

Logistic regression is generally used to solve classification problems, and in its basic form it can only handle binary (two-class) classification.

It relates a sample's features to the probability that the sample belongs to the positive class. Because that probability is a continuous number, the method can also be viewed as a regression problem.

In linear regression, $\hat{y} = f(x) = \theta^{\intercal} \cdot x_b$, where $\theta$ is the coefficient vector and $x_b$ is the feature vector augmented with a constant feature $x_0 \equiv 1$. The range of $\hat{y}$ is $(-\infty, \infty)$, while a probability $\hat{p}$ must lie in $[0, 1]$, so the Sigmoid function is used to map the value of $y$ into the range of $p$.

The Sigmoid function: $\sigma(t) = \frac{1}{1+e^{-t}}$, with $\begin{cases} p > 0.5 & \text{if } t > 0 \\ p < 0.5 & \text{if } t < 0 \end{cases}$

Function plot:

(figure: the Sigmoid function curve)
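
A minimal sketch that reproduces this curve (assuming numpy and matplotlib are installed; the helper name sigmoid is mine, not from the module below):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    # the Sigmoid function defined above
    return 1. / (1. + np.exp(-t))

t = np.linspace(-10, 10, 500)
plt.plot(t, sigmoid(t))
plt.axhline(0.5, linestyle='--', color='gray')  # p = 0.5 corresponds to t = 0
plt.show()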

Let $\hat{p} = \sigma(\theta^{\intercal} \cdot x_b) = \frac{1}{1+e^{-\theta^{\intercal} \cdot x_b}}$; the final classification is $\hat{y} = \begin{cases} 1, & \hat{p} \ge 0.5 \\ 0, & \hat{p} < 0.5 \end{cases}$
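
As a quick numeric illustration (the values of theta and x_b below are made up for demonstration):

import numpy as np

theta = np.array([-1.0, 2.0])        # hypothetical parameters; theta[0] is the intercept
x_b = np.array([[1.0, 0.3],          # each row is [1, x1]
                [1.0, 0.9]])
p_hat = 1. / (1. + np.exp(-x_b.dot(theta)))  # predicted probabilities, approx. [0.40, 0.69]
y_hat = np.array(p_hat >= 0.5, dtype=int)    # threshold at 0.5 -> [0, 1]
print(p_hat, y_hat)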

The problem:

Given a sample set $X$ with labels $y$, how do we find the parameters $\theta$ so that, modeled this way, the samples in $X$ are mapped to their observed classification output $y$ with the greatest possible probability?

2 The Loss Function of Logistic Regression

$$
cost=\begin{cases} -\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases}
\;\Rightarrow\;
cost=-y\log(\hat{p})-(1-y)\log(1-\hat{p})
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right],\quad
\hat{p}^{(i)}=\sigma\left(X_b^{(i)}\theta\right)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$
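
A small sketch of how this loss can be evaluated for a candidate theta (toy data; vectorized with numpy and mirroring the J helper used in the module below):

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def J(theta, X_b, y):
    """Average cross-entropy loss over the m samples."""
    p_hat = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# toy data: 3 samples, one feature plus the constant column
X_b = np.array([[1., 0.5], [1., 1.5], [1., 2.5]])
y = np.array([0., 0., 1.])
print(J(np.zeros(2), X_b, y))  # = -log(0.5) ≈ 0.693 when all predicted probabilities are 0.5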

3 Using Gradient Descent to Find the $\theta$ That Minimizes $J(\theta)$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right],\quad
\hat{p}^{(i)}=\sigma\left(X_b^{(i)}\theta\right)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\sigma\left(X_b^{(i)}\theta\right)\right)+\left(1-y^{(i)}\right)\log\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\right]
$$

First, two auxiliary derivatives of the Sigmoid function:

$$
\sigma(t)=\frac{1}{1+e^{-t}}=(1+e^{-t})^{-1},\qquad \sigma'(t)=(1+e^{-t})^{-2}\,e^{-t}
$$

$$
[\log\sigma(t)]'=\frac{\sigma'(t)}{\sigma(t)}=\frac{(1+e^{-t})^{-2}\,e^{-t}}{(1+e^{-t})^{-1}}=\frac{e^{-t}}{1+e^{-t}}=\frac{1+e^{-t}-1}{1+e^{-t}}=1-\sigma(t)
$$

$$
[\log(1-\sigma(t))]'=\frac{-\sigma'(t)}{1-\sigma(t)}=-\frac{(1+e^{-t})^{-2}\,e^{-t}}{\frac{e^{-t}}{1+e^{-t}}}=-\frac{1+e^{-t}}{e^{-t}}\,(1+e^{-t})^{-2}\,e^{-t}=-(1+e^{-t})^{-1}=-\sigma(t)
$$

Applying these to the two terms of $J(\theta)$ for a single sample:

$$
\frac{\partial\left(y^{(i)}\log\left(\sigma\left(X_b^{(i)}\theta\right)\right)\right)}{\partial\theta_j}=y^{(i)}\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)X_j^{(i)},\qquad
\frac{\partial\left(\left(1-y^{(i)}\right)\log\left(1-\sigma\left(X_b^{(i)}\theta\right)\right)\right)}{\partial\theta_j}=-\left(1-y^{(i)}\right)\sigma\left(X_b^{(i)}\theta\right)X_j^{(i)}
$$

Adding the two pieces gives $\left[y^{(i)}-\sigma\left(X_b^{(i)}\theta\right)\right]X_j^{(i)}$, so

$$
\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)X_j^{(i)}
$$

$$
\nabla J(\theta)=\begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix}
=\frac{1}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)X_0^{(i)} \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)X_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\sigma\left(X_b^{(i)}\theta\right)-y^{(i)}\right)X_n^{(i)} \end{pmatrix}
=\frac{1}{m}\begin{pmatrix} \sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_0^{(i)} \\ \sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_n^{(i)} \end{pmatrix}
=\frac{1}{m}X_b^{\intercal}\left[\sigma(X_b\theta)-y\right]
$$

(Note that this has the same structure as the gradient of linear regression, with $\hat{p}^{(i)}$ in place of $\hat{y}^{(i)}$.)
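
The vectorized form $\nabla J(\theta)=\frac{1}{m}X_b^{\intercal}[\sigma(X_b\theta)-y]$ is exactly what the dJ function in the next section implements. One way to sanity-check the derivation is to compare it against a finite-difference gradient on toy data (a sketch reusing the sigmoid and J helpers from above):

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def J(theta, X_b, y):
    p_hat = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def dJ(theta, X_b, y):
    """Analytic gradient derived above."""
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(X_b)

def dJ_debug(theta, X_b, y, eps=1e-6):
    """Central-difference approximation of the gradient."""
    grad = np.empty(len(theta))
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += eps
        theta_minus[j] -= eps
        grad[j] = (J(theta_plus, X_b, y) - J(theta_minus, X_b, y)) / (2 * eps)
    return grad

X_b = np.array([[1., 0.5], [1., 1.5], [1., 2.5]])
y = np.array([0., 0., 1.])
theta = np.array([0.1, -0.2])
print(np.allclose(dJ(theta, X_b, y), dJ_debug(theta, X_b, y)))  # True if the derivation is right

Unlike linear regression, this loss has no closed-form (normal-equation) solution, which is why the implementation below relies on gradient descent.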

4 Implementing Logistic Regression in Python

The logistic regression module:

import numpy as np
from sklearn.metrics import accuracy_score

class LogisticRegression:

    def __init__(self):
        """初始化Logistic Regression模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def _sigmoid(self, t):
        return 1./(1.+np.exp(-t))

    def fit(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Logistic Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            """目标函数"""
            y_hat = self._sigmoid(X_b.dot(theta))
            try:
                return -np.sum(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) / len(y)
            except:
                # if the loss cannot be evaluated numerically (e.g. log(0)), treat it as infinite
                return float('inf')

        def dJ(theta, X_b, y):
            """梯度"""
            return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(
            X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict_proba(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果概率向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return self._sigmoid(X_b.dot(self._theta))

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        proba = self.predict_proba(X_predict)
        return np.array(proba >= 0.5, dtype='int')

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "LogisticRegression()"

Using the module above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Prepare the data
iris=datasets.load_iris()

X=iris.data
y=iris.target

# Keep only the first two classes (and the first two features)
X=X[y<2,:2]
y=y[y<2]

X.shape # (100, 2)
# Plot the two classes
plt.scatter(X[y==0,0],X[y==0,1],color='r')
plt.scatter(X[y==1,0],X[y==1,1],color='b')
plt.show()

(figure: scatter plot of the two classes over the first two iris features)

from sklearn.model_selection import train_test_split
import LogisticRegression  # the module defined above, saved as LogisticRegression.py
# Split the data set
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=666)

log_reg=LogisticRegression.LogisticRegression()
log_reg.fit(X_train,y_train)
# Model accuracy on the test set
log_reg.score(X_test,y_test) # 1.0
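
For reference, scikit-learn's built-in implementation can be run on the same split (it adds L2 regularization by default, so its coefficients may differ slightly from the module above; the alias avoids clashing with our own LogisticRegression):

from sklearn.linear_model import LogisticRegression as SkLogisticRegression

sk_log_reg = SkLogisticRegression()
sk_log_reg.fit(X_train, y_train)
sk_log_reg.score(X_test, y_test)  # test accuracy; these two iris classes are linearly separable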

5 The Decision Boundary

$$
\begin{cases}
\hat{p}=\sigma(\theta^{\intercal}\cdot x_b)=\frac{1}{1+e^{-\theta^{\intercal}\cdot x_b}},\quad
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5 \\ 0, & \hat{p}<0.5 \end{cases} \\
p=\sigma(t)=\frac{1}{1+e^{-t}}\;\Rightarrow\;
\begin{matrix} t>0, & p>0.5 \\ t<0, & p<0.5 \end{matrix}
\end{cases}
\;\Rightarrow\;
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5, & \theta^{\intercal}\cdot x_b\ge 0 \\ 0, & \hat{p}<0.5, & \theta^{\intercal}\cdot x_b<0 \end{cases}
\;\Rightarrow\;
\theta^{\intercal}\cdot x_b=0
$$

The decision boundary is: $\theta^{\intercal}\cdot x_b=0$

If $X$ has two features: $\theta_0+\theta_1 x_1+\theta_2 x_2=0 \;\Rightarrow\; x_2=\frac{-\theta_0-\theta_1 x_1}{\theta_2}$

Plotting demo:

def x2(x1):
    """x2 on the boundary, solved from theta0 + theta1*x1 + theta2*x2 = 0"""
    return (-log_reg.coef_[0]*x1-log_reg.intercept_)/log_reg.coef_[1]
    
x1_plot=np.linspace(4,8,1000)
x2_plot=x2(x1_plot)

plt.scatter(X[y==0,0],X[y==0,1],color='r')
plt.scatter(X[y==1,0],X[y==1,1],color='b')
plt.plot(x1_plot,x2_plot)
plt.show()

(figure: the fitted decision boundary drawn over the scatter plot)
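
The boundary above is drawn directly from the fitted coefficients, which only works for a linear boundary over two features. A more general sketch (a hypothetical helper, not part of the original post) is to evaluate predict on a grid and draw a filled contour:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(model, axis):
    """Evaluate model.predict on a grid covering axis = [x1_min, x1_max, x2_min, x2_max]."""
    x1, x2 = np.meshgrid(np.linspace(axis[0], axis[1], 200),
                         np.linspace(axis[2], axis[3], 200))
    grid = np.c_[x1.ravel(), x2.ravel()]
    y_grid = model.predict(grid).reshape(x1.shape)
    plt.contourf(x1, x2, y_grid, cmap=ListedColormap(['#EF9A9A', '#90CAF9']))

plot_decision_boundary(log_reg, axis=[4, 8, 1.5, 4.5])
plt.scatter(X[y==0, 0], X[y==0, 1], color='r')
plt.scatter(X[y==1, 0], X[y==1, 1], color='b')
plt.show()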