简单的朴素贝叶斯算法应用于一个简单的数据
朴素贝叶斯在一些情况下分类效果还是不错的,一般常用于文本分析,如垃圾邮件分类等。
本文不介绍朴素贝叶斯算法的理论部分,直接给出一个数据集的案例分析。
数据集由几个医学预测变量和一个目标变量组成Outcome。预测变量包括患者的怀孕次数,BMI,胰岛素水平,年龄等。
- 数据集的目的是基于数据集中包括的某些诊断测量来诊断性地预测患者是否患有糖尿病。从较大的数据库中选择这些实例存在一些限制。特别是,这里的所有患者都是至少21岁的皮马印第安人遗产的女性。
本文基于朴素贝叶斯算法,使用R语言建模
目的是建立一个预测模型,以预测新科目的糖尿病。
数据将分为训练和测试数据集75/25分流比。
评估将在测试数据集上进行。
其中,将结果变量更改为一个因子:真(有糖尿病)和假(没有糖尿病)。
-
先加载要使用的包,若包未下载,使用
install.packages()
下载即可。
library(tidyverse)
library(ggplot2)
library(caret)
library(caretEnsemble)
library(psych)
library(Amelia)
library(mice)
library(GGally)
library(rpart)
- 读入数据,路径自设,此处省略
Xdata<- read.csv("....../diabetes.csv")
#将结果变量Outcome转化为因子变量
Xdata$Outcome <- factor(Xdata$Outcome, levels = c(0,1), labels = c("False", "True"))
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.1.0 √ purrr 0.2.5
√ tibble 2.1.1 √ dplyr 0.8.0.1
√ tidyr 0.8.3 √ stringr 1.3.1
√ readr 1.3.1 √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
载入需要的程辑包:lattice
载入程辑包:‘caret’
The following object is masked from ‘package:purrr’:
lift
Warning message:
程辑包‘caret’是用R版本3.5.3 来建造的
载入程辑包:‘caretEnsemble’
The following object is masked from ‘package:ggplot2’:
autoplot
Warning message:
程辑包‘caretEnsemble’是用R版本3.5.3 来建造的
载入程辑包:‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
Warning message:
程辑包‘psych’是用R版本3.5.3 来建造的
载入需要的程辑包:Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.7.5, built: 2018-05-07)
## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
Warning message:
程辑包‘Amelia’是用R版本3.5.3 来建造的
载入程辑包:‘mice’
The following object is masked from ‘package:tidyr’:
complete
The following objects are masked from ‘package:base’:
cbind, rbind
Warning message:
程辑包‘mice’是用R版本3.5.3 来建造的
载入程辑包:‘GGally’
The following object is masked from ‘package:dplyr’:
nasa
Warning message:
程辑包‘GGally’是用R版本3.5.3 来建造的
Warning message:
程辑包‘rpart’是用R版本3.5.3 来建造的
#观察一下数据
str(Xdata)
head(Xdata)
describe(Xdata)
'data.frame': 768 obs. of 9 variables:
$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
$ Age : int 50 31 32 21 33 30 26 29 53 54 ...
$ Outcome : Factor w/ 2 levels "False","True": 2 1 2 1 2 1 2 1 2 2 ...
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | False |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | True |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | True |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | False |
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pregnancies | 1 | 768 | 3.85 | 3.37 | 3.00 | 3.46 | 2.97 | 0.00 | 17.00 | 17.00 | 0.90 | 0.14 | 0.12 |
Glucose | 2 | 768 | 120.89 | 31.97 | 117.00 | 119.38 | 29.65 | 0.00 | 199.00 | 199.00 | 0.17 | 0.62 | 1.15 |
BloodPressure | 3 | 768 | 69.11 | 19.36 | 72.00 | 71.36 | 11.86 | 0.00 | 122.00 | 122.00 | -1.84 | 5.12 | 0.70 |
SkinThickness | 4 | 768 | 20.54 | 15.95 | 23.00 | 19.94 | 17.79 | 0.00 | 99.00 | 99.00 | 0.11 | -0.53 | 0.58 |
Insulin | 5 | 768 | 79.80 | 115.24 | 30.50 | 56.75 | 45.22 | 0.00 | 846.00 | 846.00 | 2.26 | 7.13 | 4.16 |
BMI | 6 | 768 | 31.99 | 7.88 | 32.00 | 31.96 | 6.82 | 0.00 | 67.10 | 67.10 | -0.43 | 3.24 | 0.28 |
DiabetesPedigreeFunction | 7 | 768 | 0.47 | 0.33 | 0.37 | 0.42 | 0.25 | 0.08 | 2.42 | 2.34 | 1.91 | 5.53 | 0.01 |
Age | 8 | 768 | 33.24 | 11.76 | 29.00 | 31.54 | 10.38 | 21.00 | 81.00 | 60.00 | 1.13 | 0.62 | 0.42 |
Outcome* | 9 | 768 | 1.35 | 0.48 | 1.00 | 1.31 | 0.00 | 1.00 | 2.00 | 1.00 | 0.63 | -1.60 | 0.02 |
- 数据检查/清洁
- 如上所示,葡萄糖,血压,Skinthickness,胰岛素和BMI的最小值均为零,这是不可能的,因此我将这些值视为缺失的观察结果。
- 我们将这些值更改为NA并使用缺失贴图进行可视化
#将讨论的变量更改为NA
Xdata[, 2:7][Xdata[, 2:7] == 0] <- NA
#可视化丢失的数据
missmap(Xdata)
- 图中显示出含有缺失值的观测点比不含缺失值的观测点多。
- NA中的模式看起来是随机的。
- 由于我们的数据集很小,我们无法采取删除观察结果这种方式。
- 将NA的值用众数/中值/均值填补这种方式将引入偏差。
- 我们唯一的选择是使用多重插补。
#使用 mice 包来预测缺失值
mice_mod <- mice(Xdata[, c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")], method='rf')
mice_complete <- complete(mice_mod)
#将预测的缺失值传输到主数据集中
Xdata$Glucose <- mice_complete$Glucose
Xdata$BloodPressure <- mice_complete$BloodPressure
Xdata$SkinThickness <- mice_complete$SkinThickness
Xdata$Insulin<- mice_complete$Insulin
Xdata$BMI <- mice_complete$BMI
#检查效果
missmap(Xdata)
iter imp variable
1 1 Glucose BloodPressure SkinThickness Insulin BMI
1 2 Glucose BloodPressure SkinThickness Insulin BMI
1 3 Glucose BloodPressure SkinThickness Insulin BMI
1 4 Glucose BloodPressure SkinThickness Insulin BMI
1 5 Glucose BloodPressure SkinThickness Insulin BMI
2 1 Glucose BloodPressure SkinThickness Insulin BMI
2 2 Glucose BloodPressure SkinThickness Insulin BMI
2 3 Glucose BloodPressure SkinThickness Insulin BMI
2 4 Glucose BloodPressure SkinThickness Insulin BMI
2 5 Glucose BloodPressure SkinThickness Insulin BMI
3 1 Glucose BloodPressure SkinThickness Insulin BMI
3 2 Glucose BloodPressure SkinThickness Insulin BMI
3 3 Glucose BloodPressure SkinThickness Insulin BMI
3 4 Glucose BloodPressure SkinThickness Insulin BMI
3 5 Glucose BloodPressure SkinThickness Insulin BMI
4 1 Glucose BloodPressure SkinThickness Insulin BMI
4 2 Glucose BloodPressure SkinThickness Insulin BMI
4 3 Glucose BloodPressure SkinThickness Insulin BMI
4 4 Glucose BloodPressure SkinThickness Insulin BMI
4 5 Glucose BloodPressure SkinThickness Insulin BMI
5 1 Glucose BloodPressure SkinThickness Insulin BMI
5 2 Glucose BloodPressure SkinThickness Insulin BMI
5 3 Glucose BloodPressure SkinThickness Insulin BMI
5 4 Glucose BloodPressure SkinThickness Insulin BMI
5 5 Glucose BloodPressure SkinThickness Insulin BMI
Pregnancies Glucose BloodPressure SkinThickness
Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
Mean : 3.845 Mean :121.6 Mean : 72.36 Mean :28.92
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:36.00
Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
Insulin BMI DiabetesPedigreeFunction Age
Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
1st Qu.: 76.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
Median :120.0 Median :32.15 Median :0.3725 Median :29.00
Mean :148.5 Mean :32.44 Mean :0.4719 Mean :33.24
3rd Qu.:182.0 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
Outcome
False:500
True :268
#可视化图 1
ggplot(Xdata, aes(Age, colour = Outcome)) +
geom_freqpoly(binwidth = 1) + labs(title="Age Distribution by Outcome")
#可视化图 2
c <- ggplot(Xdata, aes(x=Pregnancies, fill=Outcome, color=Outcome)) +
geom_histogram(binwidth = 1) + labs(title="Pregnancy Distribution by Outcome")
c + theme_bw()
#可视化图 3
P <- ggplot(Xdata, aes(x=BMI, fill=Outcome, color=Outcome)) +
geom_histogram(binwidth = 1) + labs(title="BMI Distribution by Outcome")
P + theme_bw()
#可视化图 4
ggplot(Xdata, aes(Glucose, colour = Outcome)) +
geom_freqpoly(binwidth = 1) + labs(title="Glucose Distribution by Outcome")
#可视化图 5
ggpairs(Xdata)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#将数据拆分为训练和测试数据集,比例为3:1
indxTrain <- createDataPartition(y = Xdata$Outcome,p = 0.75,list = FALSE)
training <- Xdata[indxTrain,]
testing <- Xdata[-indxTrain,]
#检查结果变量拆分的比例
prop.table(table(Xdata$Outcome)) * 100
prop.table(table(training$Outcome)) * 100
prop.table(table(testing$Outcome)) * 100
False True
65.10417 34.89583
False True
65.10417 34.89583
False True
65.10417 34.89583
#这部分是为了抑制警告信息,使分析在美学上令人愉悦
options(warn = -1)
#创建保存预测变量的对象x和保存响应变量的y
x = training[,-9]
y = training$Outcome
#加载“e1071”包
library(e1071)
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))
model
Naive Bayes
576 samples
8 predictor
2 classes: 'False', 'True'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 518, 519, 518, 519, 518, 518, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.7223533 0.3878780
TRUE 0.7518754 0.4648232
Tuning parameter 'fL' was held constant at a value of 0
Tuning
parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust
= 1.
- 在下一节中,我们将使用混淆矩阵检查测试数据集的结果
#预测测试集
Predict <- predict(model,newdata = testing )
#获取混淆矩阵以查看准确度值和其他参数值
confusionMatrix(Predict, testing$Outcome )
Confusion Matrix and Statistics
Reference
Prediction False True
False 104 21
True 21 46
Accuracy : 0.7812
95% CI : (0.716, 0.8376)
No Information Rate : 0.651
P-Value [Acc > NIR] : 6.137e-05
Kappa : 0.5186
Mcnemar's Test P-Value : 1
Sensitivity : 0.8320
Specificity : 0.6866
Pos Pred Value : 0.8320
Neg Pred Value : 0.6866
Prevalence : 0.6510
Detection Rate : 0.5417
Detection Prevalence : 0.6510
Balanced Accuracy : 0.7593
'Positive' Class : False
#画图观察各变量的表现
X <- varImp(model)
plot(X)
- 如上所示,Glucuse在确定新受试者的糖尿病方面具有最高的预测效用。朴素的贝叶斯模型对于包含看不见的数据的平衡准确性达到了0.76,留下了改进的空间。
本文使用的数据集链接:
链接:https://pan.baidu.com/s/1iyvSle8y8TsLTZdMEcQeXA
提取码:r73n