

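  • Data classification and model selection
  • Overview of generalized linear models
  • Logistic regression model
  • Log-linear model
  • Computing general linear models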

1 Data classification and model selection

1.1 Variable types

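Response variable $y$:

$$y \in \left\{\begin{matrix} \text{continuous variable} \\ \text{binary variable} \\ \text{ordinal variable} \\ \text{multi-category variable} \\ \text{continuous variable with censoring} \end{matrix}\right.$$

Explanatory variable $x$:

$$x \in \left\{\begin{matrix} \text{continuous variable} \\ \text{categorical variable} \\ \text{ordinal (graded) variable} \end{matrix}\right.$$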
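The type of the response largely determines which model family is worth considering. The sketch below shows one way to inspect a response vector in R. `suggest_model` is a hypothetical helper, and the mapping it returns reflects common practice rather than anything stated in these notes. Censoring cannot be detected from the values alone, so that case is left out.

```r
# Hypothetical helper: suggest a model family from the type of the response.
# The mapping is a common convention, not a fixed rule.
suggest_model <- function(y) {
  y <- stats::na.omit(y)
  if (is.ordered(y)) {
    "ordinal response: ordinal logistic regression"
  } else if (is.factor(y) || is.character(y) || is.logical(y)) {
    if (length(unique(y)) == 2) {
      "binary response: logistic regression"
    } else {
      "multi-category response: multinomial or log-linear model"
    }
  } else if (is.numeric(y)) {
    if (all(y %in% c(0, 1))) {
      "binary response: logistic regression"
    } else {
      "continuous response: general linear model"
    }
  } else {
    "unknown type"
  }
}

suggest_model(c(1.2, 3.4, 2.2))
suggest_model(factor(c("yes", "no", "yes")))
suggest_model(factor(c("low", "mid", "high"), ordered = TRUE))
```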


1.2 Example

1.2.1 Fitting a Poisson log-linear model


glm(formula, family=poisson(link = log), data, …)

d5.2 = read.table("clipboard", header = T)   # read the data copied to the clipboard
loglin = glm(y~x1+x2, family=poisson, data=d5.2)  # log-linear model

summary(loglin)                        # test results
Deviance Residuals: 
      1        2        3        4        5        6  
-10.784   14.444   -8.468   -2.620    4.960   -3.142  

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  6.15687    0.14196  43.371  < 2e-16 ***
x1           0.12915    0.04370   2.955  0.00312 ** 
x2          -1.12573    0.08262 -13.625  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 662.84  on 5  degrees of freedom
Residual deviance: 437.97  on 3  degrees of freedom
AIC: 481.96

Number of Fisher Scoring iterations: 5

Analyze $p_1, p_2, p_3$.
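As a reading aid, the numbers in the summary output can be turned into more interpretable quantities. The sketch below copies the coefficients and the residual deviance from the output above. Treating the residual deviance as approximately chi-square distributed is an assumption of this sketch, not something the notes state.

```r
# Coefficients copied from the summary output above
b <- c(intercept = 6.15687, x1 = 0.12915, x2 = -1.12573)

# With a log link, exp(coefficient) is a multiplicative effect on the expected count
rate_ratio <- exp(b[c("x1", "x2")])
rate_ratio

# Residual deviance as a rough goodness-of-fit check (chi-square approximation)
res_dev <- 437.97
res_df  <- 3
p_fit <- pchisq(res_dev, df = res_df, lower.tail = FALSE)
p_fit

# A ratio far above 1 suggests the Poisson model fits these data poorly
res_dev / res_df
```

Under these assumptions, both coefficients are significant, yet the large residual deviance relative to its degrees of freedom indicates that the fit itself is weak.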


2 Generalized linear models

(response $y$ no longer continuous) $\rightarrow$ exponential family of distributions
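For reference, the exponential family and the link between the mean and the linear predictor can be written as follows. The notation ($\theta$, $\phi$, $a$, $b$, $c$, $g$) is the standard textbook one and does not appear in these notes. The Poisson case is included as a worked instance.

```latex
% Exponential-family density in canonical form (standard notation, not from the notes)
f(y;\theta,\phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\},
\qquad E(y) = \mu = b'(\theta), \qquad g(\mu) = X\beta

% Poisson example: P(y) = e^{-\mu}\mu^{y}/y!
\theta = \ln\mu, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\ln y!
```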


2.1 The generalized linear model function glm()

glm(formula, family = gaussian, data, …)

Options for family:

$$\left\{\begin{matrix} \text{normal distribution (gaussian)} \\ \text{binomial distribution (binomial)} \\ \text{Poisson distribution (poisson)} \\ \text{gamma distribution (Gamma)} \end{matrix}\right.$$
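Each family comes with a default link function, which can be checked directly in R. The sketch below builds the four family objects and prints their default links. Note that the gamma family is spelled `Gamma` in R.

```r
# Family objects from the stats package, created with their default links
fams <- list(
  gaussian = gaussian(),
  binomial = binomial(),
  poisson  = poisson(),
  Gamma    = Gamma()
)

# Extract the name of the default link of each family
links <- sapply(fams, function(f) f$link)
links
```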

2.2 Note: the logistic model

$$\operatorname{Logit}(y) = \ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = X\beta$$
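The logit transform and its inverse are easy to write out by hand, which makes the formula concrete. In the sketch below, `logit_fn` and `inv_logit` are illustrative names; base R already provides the same functions as `qlogis()` and `plogis()`.

```r
# Logit: maps a probability in (0, 1) to the whole real line
logit_fn <- function(p) log(p / (1 - p))

# Inverse logit: maps a linear predictor back to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p   <- c(0.1, 0.5, 0.9)
eta <- logit_fn(p)
eta

# Round trip back to the original probabilities
inv_logit(eta)

# Agreement with the built-in versions
all.equal(eta, qlogis(p))
all.equal(inv_logit(eta), plogis(eta))
```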

2.3 Example

d5.1 = read.table("clipboard", header = T)   # read the data
logit <- glm(y~x1 + x2 + x3, family = binomial, data = d5.1)  # logistic model
summary(logit)   # test results
glm(formula = y ~ x1 + x2 + x3, family = binomial, data = d5.1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5636  -0.9131  -0.7892   0.9637   1.6000  

             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.597610   0.894831   0.668   0.5042  
x1          -1.496084   0.704861  -2.123   0.0338 *
x2          -0.001595   0.016758  -0.095   0.9242  
x3           0.315865   0.701093   0.451   0.6523  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62.183  on 44  degrees of freedom
Residual deviance: 57.026  on 41  degrees of freedom
AIC: 65.026

Number of Fisher Scoring iterations: 4

logit.step = step(logit)   # stepwise selection by AIC
Start:  AIC=65.03
y ~ x1 + x2 + x3

       Df Deviance    AIC
- x2    1   57.035 63.035
- x3    1   57.232 63.232
<none>      57.026 65.026
- x1    1   61.936 67.936

Step:  AIC=63.03
y ~ x1 + x3

       Df Deviance    AIC
- x3    1   57.241 61.241
<none>      57.035 63.035
- x1    1   61.991 65.991

Step:  AIC=61.24
y ~ x1

       Df Deviance    AIC
<none>      57.241 61.241
- x1    1   62.183 64.183
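The AIC column in the step() tables can be reproduced by hand. The sketch below assumes ungrouped 0/1 responses, for which the deviance equals minus twice the log-likelihood and so AIC = deviance + 2k, with k the number of estimated coefficients. The values in the table are consistent with that assumption. The deviances are copied from the first table.

```r
# Deviances copied from the first step() table; k counts estimated coefficients
dev <- c("- x2" = 57.035, "- x3" = 57.232, "<none>" = 57.026, "- x1" = 61.936)
k   <- c("- x2" = 3,      "- x3" = 3,      "<none>" = 4,      "- x1" = 3)

# AIC = deviance + 2k under the stated assumption
aic <- dev + 2 * k
aic

# step() removes the term whose deletion gives the smallest AIC
names(which.min(aic))
```

Dropping x2 gives the smallest AIC, which is why the second step starts from y ~ x1 + x3.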

summary(logit.step)   # summary of the model selected by step()
glm(formula = y ~ x1, family = binomial, data = d5.1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4490  -0.8782  -0.8782   0.9282   1.5096  

            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.6190     0.4688   1.320   0.1867  
x1           -1.3728     0.6353  -2.161   0.0307 *
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62.183  on 44  degrees of freedom
Residual deviance: 57.241  on 43  degrees of freedom
AIC: 61.241

Number of Fisher Scoring iterations: 4
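The final model can be read on the probability scale. The sketch below copies the estimates and deviances from the output above and assumes that x1 is coded 0/1, which the notes do not state. If x1 is coded differently, the two probabilities do not apply, although the odds ratio per unit of x1 still does.

```r
# Estimates copied from the summary of the selected model
b0 <- 0.6190
b1 <- -1.3728

# Odds ratio for a one-unit increase in x1
odds_ratio <- exp(b1)

# Fitted probabilities of y = 1, assuming x1 takes the values 0 and 1
p_x0 <- plogis(b0)
p_x1 <- plogis(b0 + b1)
round(c(odds_ratio = odds_ratio, p_x0 = p_x0, p_x1 = p_x1), 3)

# Likelihood-ratio test of x1: drop in deviance from the null model
lr_stat <- 62.183 - 57.241
lr_p    <- pchisq(lr_stat, df = 44 - 43, lower.tail = FALSE)
lr_p
```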

2.4 General linear model: completely randomized design model

d5.3 = read.table("clipboard", header = T)
anova(lm(Y~factor(A), data = d5.3))
Analysis of Variance Table

Response: Y
          Df   Sum Sq  Mean Sq F value   Pr(>F)    
factor(A)  2 0.122233 0.061117  40.534 8.94e-07 ***
Residuals 15 0.022617 0.001508                     
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
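The F value in such a table is the ratio of the two mean squares (in the table above, 0.061117 / 0.001508, about 40.5). Since the original data come from the clipboard, the sketch below uses simulated data with the same layout (3 groups of 6 observations) to show the same computation. The data frame `sim` and its group means are made up for illustration.

```r
set.seed(1)  # simulated data, for illustration only
sim <- data.frame(
  A = rep(1:3, each = 6),
  Y = rnorm(18, mean = rep(c(0.2, 0.3, 0.4), each = 6), sd = 0.04)
)

# One-way ANOVA for a completely randomized design
tab <- anova(lm(Y ~ factor(A), data = sim))
tab

# F statistic recomputed by hand from the table's mean squares
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
c(from_table = tab[["F value"]][1], by_hand = f_manual)
```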

2.5 Randomized block design model

$$y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \quad i = 1, 2, \ldots, G, \quad j = 1, 2, \ldots, n$$

d5.4 = read.table("clipboard", header = T);d5.4    # read and print the data
anova(lm(Y~factor(A)+factor(B),data = d5.4))
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value Pr(>F)
factor(A)  3  15759    5253  0.4306 0.7387
factor(B)  2  22385   11192  0.9174 0.4491
Residuals  6  73198   12200  
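The degrees of freedom in this table follow from the design: G - 1 for treatments, n - 1 for blocks, and (G - 1)(n - 1) for the residual. The sketch below reproduces that layout with simulated data, assuming A is the treatment factor with G = 4 levels and B is the block factor with n = 3 levels. The notes do not say which factor plays which role, and the effect sizes are made up.

```r
set.seed(2)  # simulated data, for illustration only
G <- 4  # number of treatments (assumed to be factor A)
n <- 3  # number of blocks (assumed to be factor B)

# One observation per treatment-block combination
sim <- expand.grid(A = 1:G, B = 1:n)
sim$Y <- 100 + 5 * sim$A + 10 * sim$B + rnorm(nrow(sim), sd = 8)

# Two-way additive ANOVA for the randomized block design
tab <- anova(lm(Y ~ factor(A) + factor(B), data = sim))
tab

# Degrees of freedom implied by the design
c(treatment = G - 1, block = n - 1, residual = (G - 1) * (n - 1))
```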


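2.6 Basic assumptions of the multiple linear regression model


  1. What are the basic assumptions of the multiple linear regression model?
  2. Why must a multiple linear regression model satisfy these basic assumptions?
  3. What happens to the regression model when these assumptions are not met?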


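  • The explanatory variables $X_i$ are deterministic, not random; they are not correlated with one another (no multicollinearity)
  • The random error term has zero mean and constant variance (homoscedasticity)
  • The random error terms are not serially correlated
  • The random error term is uncorrelated with the explanatory variables
  • The random error term follows a normal distribution with zero mean and constant variance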

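3. If these assumptions are not satisfied, building such a multiple linear regression model is unjustified: the least-squares estimates and the tests that rely on the assumptions lose their basis, the resulting model cannot properly be used for fitting, and building it serves no purpose.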
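Several of these assumptions can be checked numerically after fitting. The sketch below uses simulated data and base R only. The variable names and the choice of checks (a hand-computed variance inflation factor, the Shapiro-Wilk test, lag-1 residual autocorrelation) are illustrative, not a complete diagnostic procedure.

```r
set.seed(3)  # simulated data, for illustration only
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Multicollinearity: variance inflation factor for x1, VIF = 1 / (1 - R^2)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

# Zero mean: OLS residuals average to zero when the model has an intercept
mean_res <- mean(res)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(res)

# Serial correlation: lag-1 autocorrelation of the residuals
r1 <- cor(res[-1], res[-n])

# OLS residuals are uncorrelated with each regressor by construction
cor_x1 <- cor(res, x1)

round(c(vif_x1 = vif_x1, mean_res = mean_res,
        shapiro_p = sw$p.value, lag1 = r1, cor_x1 = cor_x1), 4)
```

Constant variance is usually judged from a plot of the residuals against the fitted values, for example `plot(fitted(fit), res)`.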