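Regression and Correlation
1. Simple linear regression
Linear regression describes the relationship between two variables. The function lm (linear model) performs the regression analysis.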
> library(ISwR)   # the thuesen data set ships with the ISwR package
> attach(thuesen)
> lm(short.velocity~blood.glucose)

Call:
lm(formula = short.velocity ~ blood.glucose)

Coefficients:
  (Intercept)  blood.glucose
      1.09781        0.02196
> summary(lm(short.velocity~blood.glucose))

Call:
lm(formula = short.velocity ~ blood.glucose)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40141 -0.14760 -0.02202  0.03001  0.43490

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.09781    0.11748   9.345 6.26e-09 ***
blood.glucose  0.02196    0.01045   2.101   0.0479 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2167 on 21 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1737,    Adjusted R-squared:  0.1343
F-statistic: 4.414 on 1 and 21 DF,  p-value: 0.0479
> plot(blood.glucose,short.velocity)
> abline(lm(short.velocity~blood.glucose))
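The same workflow can be written without attach by passing a data frame through the data argument of lm. The following is only a minimal sketch on simulated data: the data frame d, its columns x and y, and the true coefficients 1 and 0.5 are assumptions for illustration, not the thuesen data.

```r
# Simulated data (illustrative only; not the thuesen data set)
set.seed(1)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 1 + 0.5 * d$x + rnorm(50, sd = 0.3)

# Passing data = d avoids attach() and keeps the variable lookup explicit
fit <- lm(y ~ x, data = d)

coef(fit)                      # intercept and slope estimates
summary(fit)$coefficients      # estimates, standard errors, t values, p-values
summary(fit)$r.squared         # proportion of variance explained
```

Calling plot(d$x, d$y) followed by abline(fit) would draw the scatter plot and fitted line in the same way as the commands above.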
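2. Residuals and fitted values
The extractor functions fitted and resid return, respectively, the fitted values and the residuals (observed values minus fitted values).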
> lm.velo <- lm(short.velocity~blood.glucose)
> fitted(lm.velo)
       1        2        3        4        5
1.433841 1.335010 1.275711 1.526084 1.255945
       6        7        8        9       10
1.214216 1.302066 1.341599 1.262534 1.365758
      11       12       13       14       15
1.244964 1.212020 1.515103 1.429449 1.244964
      17       18       19       20       21
1.190057 1.324029 1.372346 1.451411 1.389916
      22       23       24
1.205431 1.291085 1.306459
> resid(lm.velo)
           1            2            3
 0.326158532  0.004989882 -0.005711308
           4            5            6
-0.056084062  0.014054962  0.275783754
           7            8            9
 0.007933665 -0.251598875 -0.082533795
          10           11           12
-0.145757649  0.005036223 -0.022019994
          13           14           15
 0.434897199 -0.149448964  0.275036223
          17           18           19
-0.070057471  0.045971143 -0.182346406
          20           21           22
-0.401411486 -0.069916424 -0.175431237
          23           24
-0.171085074  0.393541161
> qqnorm(resid(lm.velo))
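The straightness of the Q-Q plot can also be used to check whether the residuals are approximately normally distributed.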
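To make the definitions concrete, the sketch below checks on simulated data that resid equals the observed values minus fitted, and summarises the straightness of the Q-Q plot by the correlation of its coordinates. The variables x and y and the coefficients 2 and 3 are assumptions for illustration, and using the Q-Q correlation as a numeric summary is one possible choice rather than a method taken from the text.

```r
# Simulated data with normally distributed errors (illustrative only)
set.seed(2)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)

# Residuals are observed values minus fitted values (difference is ~0)
max(abs(resid(fit) - (y - fitted(fit))))

# With an intercept in the model, least-squares residuals sum to zero
sum(resid(fit))

# Q-Q coordinates without drawing; a correlation near 1 means a nearly straight plot
qq <- qqnorm(resid(fit), plot.it = FALSE)
cor(qq$x, qq$y)
```

To draw the plot instead, qqnorm(resid(fit)) followed by qqline(resid(fit)) adds a reference line through the quartiles.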
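3. Prediction and confidence bands
Regression lines are often shown together with uncertainty bands. The narrow band, called the confidence band, reflects the uncertainty about the line itself. The wide band, called the prediction band, also includes the uncertainty about future observations.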
> predict(lm.velo)
       1        2        3        4        5
1.433841 1.335010 1.275711 1.526084 1.255945
       6        7        8        9       10
1.214216 1.302066 1.341599 1.262534 1.365758
      11       12       13       14       15
1.244964 1.212020 1.515103 1.429449 1.244964
      17       18       19       20       21
1.190057 1.324029 1.372346 1.451411 1.389916
      22       23       24
1.205431 1.291085 1.306459
> predict(lm.velo,int = "c")
        fit      lwr      upr
1  1.433841 1.291371 1.576312
2  1.335010 1.240589 1.429431
3  1.275711 1.169536 1.381887
4  1.526084 1.306561 1.745607
5  1.255945 1.139367 1.372523
6  1.214216 1.069315 1.359118
7  1.302066 1.205244 1.398889
8  1.341599 1.246317 1.436881
9  1.262534 1.149694 1.375374
10 1.365758 1.263750 1.467765
11 1.244964 1.121641 1.368287
12 1.212020 1.065457 1.358583
13 1.515103 1.305352 1.724854
14 1.429449 1.290217 1.568681
15 1.244964 1.121641 1.368287
17 1.190057 1.026217 1.353898
18 1.324029 1.230050 1.418008
19 1.372346 1.267629 1.477064
20 1.451411 1.295446 1.607377
21 1.389916 1.276444 1.503389
22 1.205431 1.053805 1.357057
23 1.291085 1.191084 1.391086
24 1.306459 1.210592 1.40232
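Called without extra arguments, predict returns the vector of predicted values. Adding the interval argument also returns the lower and upper bounds: int = "c" is an abbreviation of interval = "confidence", and interval = "prediction" gives the prediction band.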
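The difference between the two bands can be seen by evaluating both on the same grid of new x values. This is a sketch under stated assumptions: the simulated data frame d, the grid new, and the coefficients 1 and 0.5 are made up for illustration and are not the thuesen data.

```r
# Simulated data (illustrative only)
set.seed(3)
d <- data.frame(x = runif(30, 0, 10))
d$y <- 1 + 0.5 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)

# Evaluate both bands on an evenly spaced grid of new x values
new <- data.frame(x = seq(0, 10, by = 2.5))
conf <- predict(fit, newdata = new, interval = "confidence")
pred <- predict(fit, newdata = new, interval = "prediction")

# Band widths: the prediction band is wider at every x
cbind(x = new$x,
      conf.width = conf[, "upr"] - conf[, "lwr"],
      pred.width = pred[, "upr"] - pred[, "lwr"])
```

For a picture, plot(d$x, d$y) followed by matlines(new$x, cbind(conf, pred[, -1])) overlays the fitted line and both bands. The newdata frame must use the same column name as the predictor in the model formula.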
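4. Correlation
Correlation is a symmetric, scale-invariant measure of the association between two random variables, ranging from -1 to 1. It is negative when large values of one variable tend to go with small values of the other, and positive when the two variables tend to increase or decrease together.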
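These properties are easy to check numerically. The sketch below uses simulated vectors x and y (assumed for illustration) and the Pearson correlation computed by cor.

```r
# Simulated, positively related variables (illustrative only)
set.seed(4)
x <- rnorm(25)
y <- x + rnorm(25)

r <- cor(x, y)
r
cor(y, x)                  # symmetric: same value with the arguments swapped
cor(10 * x + 3, y)         # unchanged by a positive rescaling and shift of x
cor(-x, y)                 # sign flips when one variable is reversed
```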
4.1 Pearson correlation coefficient
The function cor computes the correlation coefficient between two or more vectors.
> cor(blood.glucose,short.velocity,use = "complete.obs")
[1] 0.4167546
> cor.test(blood.glucose,short.velocity)

	Pearson's product-moment correlation

data:  blood.glucose and short.velocity
t = 2.101, df = 21, p-value = 0.0479
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.005496682 0.707429479
sample estimates:
      cor
0.4167546
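The t statistic (2.101) and p-value (0.0479) reported here match those of the slope in the summary(lm(...)) output of section 1, and 0.4167546 squared is about 0.1737, the Multiple R-squared. The sketch below checks this link on simulated data, using t = r * sqrt(n - 2) / sqrt(1 - r^2); the vectors x and y are assumptions for illustration.

```r
# Simulated data (illustrative only)
set.seed(5)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

ct  <- cor.test(x, y)              # Pearson test
fit <- summary(lm(y ~ x))          # regression of y on x

r <- unname(ct$estimate)
n <- length(x)

# t statistic computed directly from the correlation coefficient
t.from.r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# All three t values agree
c(cor.test = unname(ct$statistic),
  from.r   = t.from.r,
  lm.slope = fit$coefficients["x", "t value"])

# R-squared of the simple regression equals the squared correlation
c(r.squared = fit$r.squared, r.sq = r^2)
```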
4.2 Spearman's rank correlation coefficient (nonparametric test)
> cor.test(blood.glucose,short.velocity,method = "spearman")

	Spearman's rank correlation rho

data:  blood.glucose and short.velocity
S = 1380.4, p-value = 0.1392
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho
0.318002

Warning message:
In cor.test.default(blood.glucose, short.velocity, method = "spearman") :
  Cannot compute exact p-value with ties
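Spearman's rho is the Pearson correlation of the ranks, and the warning above appears because the data contain tied values. A minimal sketch on simulated, tie-free data (the vectors x and y are assumptions for illustration) shows both points.

```r
# Simulated data with a monotone but nonlinear relationship (illustrative only)
set.seed(6)
x <- rnorm(20)
y <- x^3 + rnorm(20, sd = 0.5)

rho <- cor(x, y, method = "spearman")
rho
cor(rank(x), rank(y))      # Pearson correlation of the ranks: same value

# Continuous simulated values have no ties, so an exact p-value is
# computed and no warning is raised
cor.test(x, y, method = "spearman")$p.value
```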
4.3 Kendall's rank correlation coefficient (based on counting concordant and discordant pairs)
> cor.test(blood.glucose,short.velocity,method = "kendall")

	Kendall's rank correlation tau

data:  blood.glucose and short.velocity
z = 1.5604, p-value = 0.1187
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.2350616

Warning message:
In cor.test.default(blood.glucose, short.velocity, method = "kendall") :
  Cannot compute exact p-value with ties
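The pair counting behind Kendall's tau can be written out directly. The sketch below assumes simulated data without ties (the vectors x and y are made up for illustration); with ties, the denominator needs a correction that this sketch omits.

```r
# Simulated, tie-free data (illustrative only)
set.seed(7)
x <- rnorm(15)
y <- x + rnorm(15)
n <- length(x)

# Compare every pair (i, j) with i < j
idx <- combn(n, 2)
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

concordant <- sum(s > 0)   # pairs ordered the same way in x and y
discordant <- sum(s < 0)   # pairs ordered in opposite ways

# Without ties, tau = (concordant - discordant) / number of pairs
tau <- (concordant - discordant) / choose(n, 2)
c(manual = tau, builtin = cor(x, y, method = "kendall"))
```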