A Simple Naive Bayes Algorithm Applied to a Simple Dataset

Naive Bayes classifies reasonably well in many situations and is commonly used for text analysis, such as spam filtering.
This article does not cover the theory behind the Naive Bayes algorithm; instead, it goes straight to a case study on a dataset.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictors include the patient's number of pregnancies, BMI, insulin level, age, and so on.

  • The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

This article builds the model in R using the Naive Bayes algorithm.

The goal is to build a predictive model that predicts diabetes for new subjects.

The data will be split into training and test datasets with a 75/25 split ratio.
Evaluation will be carried out on the test dataset.

Along the way, the outcome variable is recoded as a factor: True (has diabetes) and False (does not have diabetes).

  • First, load the packages we will use; if a package has not been installed yet, install it with install.packages().
library(tidyverse) 
library(ggplot2)
library(caret)
library(caretEnsemble)
library(psych)
library(Amelia)
library(mice)
library(GGally)
library(rpart)
  • Read in the data; set the path yourself (omitted here).
Xdata <- read.csv("....../diabetes.csv") 

# Convert the outcome variable Outcome into a factor
Xdata$Outcome <- factor(Xdata$Outcome, levels = c(0,1), labels = c("False", "True"))

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.1.0       √ purrr   0.2.5  
√ tibble  2.1.1       √ dplyr   0.8.0.1
√ tidyr   0.8.3       √ stringr 1.3.1  
√ readr   1.3.1       √ forcats 0.3.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

Warning message:
package ‘caret’ was built under R version 3.5.3 
Attaching package: ‘caretEnsemble’

The following object is masked from ‘package:ggplot2’:

    autoplot

Warning message:
package ‘caretEnsemble’ was built under R version 3.5.3 

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Warning message:
package ‘psych’ was built under R version 3.5.3 

Loading required package: Rcpp
## 
## Amelia II: Multiple Imputation
## (Version 1.7.5, built: 2018-05-07)
## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
## 
Warning message:
package ‘Amelia’ was built under R version 3.5.3 

Attaching package: ‘mice’

The following object is masked from ‘package:tidyr’:

    complete

The following objects are masked from ‘package:base’:

    cbind, rbind

Warning message:
package ‘mice’ was built under R version 3.5.3 

Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa

Warning message:
package ‘GGally’ was built under R version 3.5.3 

Warning message:
package ‘rpart’ was built under R version 3.5.3 
# Take a look at the data
str(Xdata)
head(Xdata)
describe(Xdata)
'data.frame':	768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : Factor w/ 2 levels "False","True": 2 1 2 1 2 1 2 1 2 2 ...
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50    True
2           1      85            66            29       0 26.6                    0.351  31   False
3           8     183            64             0       0 23.3                    0.672  32    True
4           1      89            66            23      94 28.1                    0.167  21   False
5           0     137            40            35     168 43.1                    2.288  33    True
6           5     116            74             0       0 25.6                    0.201  30   False
                         vars   n   mean     sd median trimmed   mad   min    max  range  skew kurtosis   se
Pregnancies                 1 768   3.85   3.37   3.00    3.46  2.97  0.00  17.00  17.00  0.90     0.14 0.12
Glucose                     2 768 120.89  31.97 117.00  119.38 29.65  0.00 199.00 199.00  0.17     0.62 1.15
BloodPressure               3 768  69.11  19.36  72.00   71.36 11.86  0.00 122.00 122.00 -1.84     5.12 0.70
SkinThickness               4 768  20.54  15.95  23.00   19.94 17.79  0.00  99.00  99.00  0.11    -0.53 0.58
Insulin                     5 768  79.80 115.24  30.50   56.75 45.22  0.00 846.00 846.00  2.26     7.13 4.16
BMI                         6 768  31.99   7.88  32.00   31.96  6.82  0.00  67.10  67.10 -0.43     3.24 0.28
DiabetesPedigreeFunction    7 768   0.47   0.33   0.37    0.42  0.25  0.08   2.42   2.34  1.91     5.53 0.01
Age                         8 768  33.24  11.76  29.00   31.54 10.38 21.00  81.00  60.00  1.13     0.62 0.42
Outcome*                    9 768   1.35   0.48   1.00    1.31  0.00  1.00   2.00   1.00  0.63    -1.60 0.02
  • Data checking / cleaning
  • As shown above, the minimum values of Glucose, BloodPressure, SkinThickness, Insulin and BMI are all zero, which is impossible, so I treat those zeros as missing observations.
  • We change these values to NA and visualize them with a missing-data map.
# Set the zeros in the variables discussed above to NA
Xdata[, 2:7][Xdata[, 2:7] == 0] <- NA

# Visualize the missing data
missmap(Xdata)

[Figure: missingness map of Xdata produced by missmap()]

  • The map shows that observations containing missing values outnumber the complete observations.
  • The pattern of NAs looks random (a quick numeric check is sketched below).
  • Because our dataset is small, we cannot afford to simply drop the incomplete observations.
  • Filling the NAs with the mode/median/mean would introduce bias.
  • Our only reasonable option is multiple imputation.
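
Before imputing, it can help to quantify the missingness. A minimal optional check (an addition, not part of the original post), using base R and the already-loaded mice package:

# Count missing values per variable and tabulate the missingness pattern
colSums(is.na(Xdata))   # number of NAs in each column
md.pattern(Xdata)       # which combinations of variables tend to be missing together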
# Use the mice package to impute the missing values (random-forest method)
mice_mod <- mice(Xdata[, c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")], method='rf')
mice_complete <- complete(mice_mod)

# Transfer the imputed values back into the main dataset
Xdata$Glucose <- mice_complete$Glucose
Xdata$BloodPressure <- mice_complete$BloodPressure
Xdata$SkinThickness <- mice_complete$SkinThickness
Xdata$Insulin<- mice_complete$Insulin
Xdata$BMI <- mice_complete$BMI

# Check the result
missmap(Xdata)
summary(Xdata)   # re-examine the data summary after imputation
 iter imp variable
  1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  3   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  3   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  3   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  3   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  3   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  4   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  4   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  4   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  4   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  4   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  5   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  5   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  5   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  5   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  5   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI

  Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
 Mean   : 3.845   Mean   :121.6   Mean   : 72.36   Mean   :28.92  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:36.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
 1st Qu.: 76.0   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
 Median :120.0   Median :32.15   Median :0.3725           Median :29.00  
 Mean   :148.5   Mean   :32.44   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:182.0   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
  Outcome   
 False:500  
 True :268  

[Figure: missingness map of Xdata after imputation]

# Visualization 1
ggplot(Xdata, aes(Age, colour = Outcome)) +
  geom_freqpoly(binwidth = 1) + labs(title="Age Distribution by Outcome")

# Visualization 2
c <- ggplot(Xdata, aes(x=Pregnancies, fill=Outcome, color=Outcome)) +
  geom_histogram(binwidth = 1) + labs(title="Pregnancy Distribution by Outcome")
c + theme_bw()

# Visualization 3
P <- ggplot(Xdata, aes(x=BMI, fill=Outcome, color=Outcome)) +
  geom_histogram(binwidth = 1) + labs(title="BMI Distribution by Outcome")
P + theme_bw()

# Visualization 4
ggplot(Xdata, aes(Glucose, colour = Outcome)) +
  geom_freqpoly(binwidth = 1) + labs(title="Glucose Distribution by Outcome")

# Visualization 5
ggpairs(Xdata)

[Figure: Age Distribution by Outcome]
[Figure: Pregnancy Distribution by Outcome]
[Figure: BMI Distribution by Outcome]
[Figure: Glucose Distribution by Outcome]

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
                                                                    

[Figure: ggpairs scatterplot matrix of Xdata]

# Split the data into training and test datasets at a 3:1 ratio
indxTrain <- createDataPartition(y = Xdata$Outcome,p = 0.75,list = FALSE)
training <- Xdata[indxTrain,]
testing <- Xdata[-indxTrain,]
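
createDataPartition() draws a stratified random sample, so the rows assigned to training and testing change from run to run. An optional tweak (not in the original post) is to fix the random seed before the call above, so that the split and the downstream results are reproducible:

# Optional: make the 75/25 split reproducible (the seed value is arbitrary)
set.seed(123)
indxTrain <- createDataPartition(y = Xdata$Outcome, p = 0.75, list = FALSE)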

# Check that the outcome-class proportions are preserved in each split
prop.table(table(Xdata$Outcome)) * 100
prop.table(table(training$Outcome)) * 100
prop.table(table(testing$Outcome)) * 100

   False     True 
65.10417 34.89583 

   False     True 
65.10417 34.89583 

   False     True 
65.10417 34.89583 
# Suppress warning messages to keep the analysis output clean
options(warn = -1)

# Create x to hold the predictor variables and y to hold the response variable
x = training[,-9]
y = training$Outcome

# Load the 'e1071' package (note: caret's 'nb' method also relies on the klaR package under the hood)
library(e1071)
model = train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 10))
model

Naive Bayes 

576 samples
  8 predictor
  2 classes: 'False', 'True' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 518, 519, 518, 519, 518, 518, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.7223533  0.3878780
   TRUE      0.7518754  0.4648232

Tuning parameter 'fL' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust
 = 1.
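
The output above lists three tuning parameters: fL (Laplace smoothing), usekernel (kernel density estimate vs. Gaussian for the numeric predictors), and adjust (kernel bandwidth); only usekernel was actually varied. As a minimal sketch of how these could be tuned explicitly (the grid values here are illustrative, not from the original post):

# Tune the Naive Bayes hyper-parameters over an explicit grid
nb_grid <- expand.grid(fL = c(0, 0.5, 1),
                       usekernel = c(TRUE, FALSE),
                       adjust = c(0.5, 1, 1.5))
model_tuned <- train(x, y, method = 'nb',
                     trControl = trainControl(method = 'cv', number = 10),
                     tuneGrid = nb_grid)
model_tuned$bestTune   # best combination found by cross-validation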
  • In the next section, we check the results on the test dataset with a confusion matrix.
# Predict on the test set
Predict <- predict(model, newdata = testing)
# Get the confusion matrix to see the accuracy and other metrics
confusionMatrix(Predict, testing$Outcome)

Confusion Matrix and Statistics

          Reference
Prediction False True
     False   104   21
     True     21   46
                                         
               Accuracy : 0.7812         
                 95% CI : (0.716, 0.8376)
    No Information Rate : 0.651          
    P-Value [Acc > NIR] : 6.137e-05      
                                         
                  Kappa : 0.5186         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.8320         
            Specificity : 0.6866         
         Pos Pred Value : 0.8320         
         Neg Pred Value : 0.6866         
             Prevalence : 0.6510         
         Detection Rate : 0.5417         
   Detection Prevalence : 0.6510         
      Balanced Accuracy : 0.7593         
                                         
       'Positive' Class : False          
                                 
# Plot the importance of each predictor variable
X <- varImp(model)
plot(X)

[Figure: variable importance plot from varImp(model)]

  • As shown above, Glucose has the highest predictive utility for determining diabetes in new subjects. The Naive Bayes model reached a balanced accuracy of about 0.76 on the unseen test data, which leaves room for improvement (one possible next step is sketched below).
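
One way to dig further is to look at predicted class probabilities instead of hard labels, for example via an ROC curve. A minimal sketch, assuming the pROC package is installed (it is not used anywhere in the original post):

# Predicted probability of the 'True' (diabetic) class on the test set
library(pROC)
prob <- predict(model, newdata = testing, type = "prob")
roc_obj <- roc(response = testing$Outcome, predictor = prob$True,
               levels = c("False", "True"))
auc(roc_obj)    # area under the ROC curve as an additional performance measure
plot(roc_obj)   # visualize the sensitivity/specificity trade-off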

Dataset download link used in this article:

Link: https://pan.baidu.com/s/1iyvSle8y8TsLTZdMEcQeXA
Extraction code: r73n