Data Mining with R: Regression Analysis
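Regression analysis is at the core of statistics. It generally refers to using one or more predictor variables to predict a response variable.
Regression analysis is also commonly used to pick out the variables related to the response variable as explanatory variables and to describe the relationship between them. It can also produce an equation that explains the response variable in terms of the explanatory variables.
R provides the lm() function for fitting both simple (one predictor) and multiple (several predictors) linear regression models.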
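Since the rest of this post only fits models with a single predictor, here is a minimal sketch of how the same lm() call extends to several predictors. It assumes the built-in mtcars dataset, which is not part of the original example and is used purely for illustration; the names fit_single and fit_multi are likewise made up.

```r
# Assumption: mtcars is used only as a convenient built-in dataset;
# it does not appear in the original post.
data(mtcars)

# Simple regression: one predictor on the right-hand side of ~
fit_single <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: predictors are joined with +
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept
coef(fit_single)
coef(fit_multi)
```

Calling summary() on either object gives the same kind of report shown for the women data later on.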
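The symbols used in R model formulas are as follows: ~ separates the response variable (left) from the predictors (right), + separates multiple predictors, and I() makes the expression inside it be evaluated arithmetically, for example I(x^2) for a squared term. The example below uses the built-in women dataset, which holds 15 observations of height (in inches) and weight (in pounds).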
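# Load the built-in women dataset
data(women)
# Fit height as a linear function of weight
fit <- lm(women$height ~ women$weight, data = women)
# Print the coefficients, residual summary, and goodness-of-fit statistics
summary(fit)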
Call:
lm(formula = women$height ~ women$weight, data = women)

Residuals:
     Min       1Q   Median       3Q      Max
-0.83233 -0.26249  0.08314  0.34353  0.49790

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.723456   1.043746   24.64 2.68e-12 ***
women$weight  0.287249   0.007588   37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.44 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
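The coefficient table above corresponds to a fitted line of roughly height = 25.72 + 0.287 × weight. As a minimal sketch of how these estimates might be used, the snippet below refits the same model with bare column names (an assumption made here so that predict() can match the columns of newdata; the original call uses women$height and women$weight) and predicts the height for a hypothetical weight of 130 lb. The names fit_hw, new_obs, pred, and manual are illustrative only.

```r
data(women)

# Same model as above, written with bare column names so that
# predict() can look up "weight" in newdata
fit_hw <- lm(height ~ weight, data = women)

# Named vector: "(Intercept)" and "weight" (the slope)
b <- coef(fit_hw)

# Hypothetical new observation: a weight of 130 lb
new_obs <- data.frame(weight = 130)
pred <- predict(fit_hw, newdata = new_obs)

# The same prediction computed by hand from the coefficients
manual <- b[["(Intercept)"]] + b[["weight"]] * 130

print(pred)
print(manual)
```

Both values should agree, since predict() simply evaluates the fitted equation at the new weight.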
fitted(fit)
       1        2        3        4        5        6
58.75712 59.33162 60.19336 61.05511 61.91686 62.77861
       7        8        9       10       11       12
63.64035 64.50210 65.65110 66.51285 67.66184 68.81084
      13       14       15
69.95984 71.39608 72.83233
residuals(fit)
          1           2           3           4           5
-0.75711680 -0.33161526 -0.19336294 -0.05511062  0.08314170
          6           7           8           9          10
 0.22139402  0.35964634  0.49789866  0.34890175  0.48715407
         11          12          13          14          15
 0.33815716  0.18916026  0.04016335 -0.39608278 -0.83232892
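The residuals listed above are negative at both ends and positive in the middle, which hints that a straight line misses some curvature in the data. The following sketch, using the same model as above, checks that each residual is simply the observed value minus the fitted value and draws a residual plot. The variable names res and manual_res are made up for this illustration.

```r
data(women)
fit <- lm(women$height ~ women$weight, data = women)

res <- residuals(fit)

# A residual is the observed response minus the fitted value
manual_res <- women$height - fitted(fit)
print(all.equal(unname(res), unname(manual_res)))

# With an intercept in the model, least-squares residuals sum to (numerically) zero
print(sum(res))

# Residuals against fitted values; a curved pattern suggests a missing nonlinear term
plot(fitted(fit), res, xlab = "Fitted height", ylab = "Residual")
abline(h = 0, lty = 2)
```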
Polynomial Regression
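Adding a quadratic term (X^2) can improve the predictive accuracy of the regression. Note that in the model below weight is the response and height is the predictor, the reverse of the first model.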
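Written out, the quadratic model being fitted takes the following form. The symbols β and ε are standard notation introduced here, not taken from the original post:

```latex
\text{weight}_i = \beta_0 + \beta_1\,\text{height}_i + \beta_2\,\text{height}_i^{2} + \varepsilon_i
```

The model is nonlinear in height but still linear in the coefficients β0, β1, and β2, which is why lm() can estimate it by ordinary least squares.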
# Regress weight on height plus a squared height term; I() keeps ^ as arithmetic squaring
fit <- lm(women$weight ~ women$height + I(women$height^2), data = women)
summary(fit)
Call:
lm(formula = women$weight ~ women$height + I(women$height^2),
    data = women)

Residuals:
     Min       1Q   Median       3Q      Max
-0.50941 -0.29611 -0.00941  0.28615  0.59706

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       261.87818   25.19677  10.393 2.36e-07 ***
women$height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(women$height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
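The results show that all regression coefficients are highly significant (p < 0.001), and the proportion of variance explained by the model has increased to about 99.9% (Multiple R-squared: 0.9995).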
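One way to check whether the quadratic term is worth keeping is to compare the straight-line and quadratic models directly. The sketch below is not part of the original analysis; it assumes both models are refitted with bare column names, and fit_lin, fit_quad, and cmp are hypothetical names chosen for the example.

```r
data(women)

# Straight-line model and quadratic model for the same response
fit_lin  <- lm(weight ~ height, data = women)
fit_quad <- lm(weight ~ height + I(height^2), data = women)

# F test for nested models: does adding height^2 significantly reduce the residual sum of squares?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Adjusted R-squared of each model, side by side
print(c(linear    = summary(fit_lin)$adj.r.squared,
        quadratic = summary(fit_quad)$adj.r.squared))
```

A small p-value in the second row of the anova() table supports keeping the squared term.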
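We can also visualize the fit:
# Scatter plot of the data, then overlay the fitted quadratic curve
plot(women$height, women$weight)
lines(women$height, fitted(fit))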
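The curve above is drawn only through the 15 observed heights. As a rough sketch, a smoother curve can be obtained by predicting on a finer grid of heights. This assumes the quadratic model is refitted with bare column names so that predict() accepts new data; fit_quad, grid, and weight_hat are illustrative names, and the axis labels are additions.

```r
data(women)
fit_quad <- lm(weight ~ height + I(height^2), data = women)

# 100 evenly spaced heights across the observed range (no extrapolation)
grid <- data.frame(
  height = seq(min(women$height), max(women$height), length.out = 100)
)
grid$weight_hat <- predict(fit_quad, newdata = grid)

# Observed points with the fitted curve on top
plot(women$height, women$weight,
     xlab = "Height (in)", ylab = "Weight (lb)")
lines(grid$height, grid$weight_hat)
```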