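Research and Implementation of a Naive Bayes Chinese Text Classifier (1) [original by 88250]

Please retain the author information when reposting:

Author: 88250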

Blog：http:/blog.****.net/DL88250

MSN & Gmail & QQ：[email protected]

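Introduction

Techniques for assigning text to predefined categories date back to the 1960s. Over the past ten years, however, the digitization of text has produced volumes of data too large to organize by hand, so automatic text classification has attracted wide attention and developed rapidly.

Research suggests that machine learning is an effective way to tackle this problem: a general inductive learning process builds an automatic classifier from the features of a set of pre-classified documents. Compared with the earlier approach based on knowledge engineering and expert systems, in which domain experts define the classifier by hand, machine-learning-based classification improves on both classification quality and flexibility, saves a great deal of expert labor, and ports easily to different domains.

At present, most automatic text classification algorithms are based on statistical models, for example Bayesian classifiers (Naive Bayes, Bayesian networks), support vector machines (SVM), maximum entropy models, and k-nearest neighbors (KNN). This article discusses the Naive Bayes classification algorithm and, following the theory, builds a Naive Bayes classifier in Java. The experiment suggests that the algorithm is simple and can give good classification results.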

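Main Text

I. Overview of Bayesian Theory and Chinese Text Classification

1. Basic Concepts

Conditional probability

Definition: let A and B be two events with P(A) > 0. Then

P(B|A) = P(AB) / P(A)

is called the conditional probability of event B given that event A has occurred.

Multiplication rule

If P(A) > 0, then

P(AB) = P(B|A) P(A)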

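Law of total probability and Bayes' formula

Definition: let S be the sample space of an experiment E, and let B1, B2, …, Bn be a set of events of E. If

BiBj = ∅, i ≠ j, i, j = 1, 2, …, n;

B1 ∪ B2 ∪ … ∪ Bn = S

then B1, B2, …, Bn is called a partition of the sample space S.

Theorem: let S be the sample space of an experiment E, let A be an event of E, and let B1, B2, …, Bn be a partition of S with P(Bi) > 0 (i = 1, 2, …, n). Then

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + … + P(A|Bn)P(Bn)

This is called the law of total probability.

Theorem: let S be the sample space of an experiment E, let A be an event of E with P(A) > 0, and let B1, B2, …, Bn be a partition of S with P(Bi) > 0 (i = 1, 2, …, n). Then

P(Bi|A) = P(A|Bi)P(Bi) / ∑ P(A|Bj)P(Bj) = P(A|Bi)P(Bi) / P(A)

This is called Bayes' formula.

Note: i and j are subscripts, and the sum runs over j from 1 to n.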
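To make the two theorems concrete, here is a minimal sketch that evaluates them on made-up numbers. The class name, method names, and the probabilities are all hypothetical and chosen only for illustration.

```java
// Illustrative sketch only: BayesRuleDemo and its numbers are hypothetical.
class BayesRuleDemo {

    /** Law of total probability: P(A) = sum over j of P(A|Bj) P(Bj). */
    static double totalProbability(double[] prior, double[] likelihood) {
        double pA = 0.0;
        for (int j = 0; j < prior.length; j++) {
            pA += likelihood[j] * prior[j];
        }
        return pA;
    }

    /** Bayes' formula: P(Bi|A) = P(A|Bi) P(Bi) / P(A). */
    static double posterior(int i, double[] prior, double[] likelihood) {
        return likelihood[i] * prior[i] / totalProbability(prior, likelihood);
    }

    public static void main(String[] args) {
        // A partition B1, B2, B3 whose priors sum to 1 (made-up values).
        double[] prior = {0.5, 0.3, 0.2};
        // P(A|B1), P(A|B2), P(A|B3) (made-up values).
        double[] likelihood = {0.1, 0.4, 0.5};

        System.out.println("P(A) = " + totalProbability(prior, likelihood));
        for (int i = 0; i < prior.length; i++) {
            System.out.println("P(B" + (i + 1) + "|A) = "
                    + posterior(i, prior, likelihood));
        }
    }
}
```

Because B1, …, Bn form a partition, the posteriors P(Bi|A) always sum to 1, which is a handy sanity check.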

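2. The Naive Bayes Assumption and Text Feature Variables

Text feature variables

A text's feature variables can be described as attributes formed by the characters or words in the text. For example, given the text:

Ding Liang is a programmer.

we obtain the feature variable set: {Ding, Liang, is, a, programmer.}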
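For whitespace-delimited text such as this example, the feature set can be obtained with a plain split. The sketch below assumes splitting on whitespace and keeping punctuation attached to the word, which is what the set above shows; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: FeatureDemo is a hypothetical helper.
class FeatureDemo {

    /** Splits on runs of whitespace and keeps each distinct token once, in order. */
    static Set<String> features(String text) {
        return new LinkedHashSet<String>(Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // prints [Ding, Liang, is, a, programmer.]
        System.out.println(features("Ding Liang is a programmer."));
    }
}
```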

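Naive Bayes is a simple text classification model that nevertheless performs well. To simplify the computation, it assumes that the feature variables of the text to be classified are mutually independent; this is the "Naive Bayes assumption". Independence means that the feature variables carry no information about one another. In the example above, the feature variables [Ding] and [Liang] are treated as entirely unrelated.

This conditional independence assumption is often poorly satisfied in practice, or does not hold at all. Even so, the simplified Bayesian classifier achieves good accuracy in many real applications.

The example text above is English. Chinese has no natural delimiter between words (such as spaces), so obtaining the feature vector of a Chinese text first requires Chinese word segmentation.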

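3. Chinese Word Segmentation

For the concept, significance, and an overview of the algorithms of Chinese word segmentation, see this article :-)

Several practical tools for Chinese word segmentation already exist:

ICTCLAS

ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) from the Chinese Academy of Sciences is probably the best Chinese word segmentation system currently available. However, it became paid software after version 1.0, and it is packaged as a Windows DLL, which makes it hard to port.

ICTCLAS4J

An open-source Java version of ICTCLAS. Its developers gave little thought to cross-platform use, so configuring it on non-Windows platforms is tedious. In addition, its interface is not very friendly and its dictionary dependencies are rather complicated.

海量 (Hailiang) word segmentation component

The intelligent Chinese word segmentation component from 海量信息 is free to use, but it too ships as a DLL and is not portable across platforms.

极易 (JE) Chinese word segmentation component

The JE Chinese word segmentation component from 极易软件 is free to use, provides a Lucene interface, is cross-platform, and performs reliably.

The classifier here is written in Java, so the JE Chinese word segmentation component was chosen as the basic segmentation tool.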

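4. Derivation of Naive Bayes Classification

From the joint probability formula (the law of total probability)

M: the number of keywords in the training text set after preprocessing that removes useless words.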
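The steps of the derivation are not reproduced in this section. As a reference, the standard textbook route to the decision rule is sketched below; it is consistent with the formulas used in the class documentation later in this article, but it is not necessarily the exact sequence of steps the author followed. Here n denotes the number of feature variables of the text, and N, M, and V are as defined in this article.

```latex
% Bayes' formula for a class c_j and a feature vector (x_1, ..., x_n)
P(c_j \mid x_1, \dots, x_n)
  = \frac{P(x_1, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, \dots, x_n)}

% Naive assumption: the features are independent given the class
P(x_1, \dots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)

% The denominator does not depend on c_j, so it can be dropped when
% comparing classes, which gives the decision rule
c_{NB} = \arg\max_{c_j} \; P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)

% Estimates from the training corpus
P(c_j) = \frac{N(C = c_j)}{N}, \qquad
P(x_i \mid c_j) = \frac{N(X = x_i,\, C = c_j) + 1}{N(C = c_j) + M + V}
```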

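II. Building a Naive Bayes Chinese Text Classifier in Java

The previous part introduced Bayesian theory and Chinese word segmentation. Now let us put them into practice!

1. Development Environment and Tools

OS: Ubuntu 7.10 GNU/Linux

IDE: NetBeans 6.0.1

JDK: 1.6.0_03-b05

Lucene: lucene-core-2.3.0.jar

Segmentation tool: je-analysis-1.5.3.jar (JE Chinese word segmentation component 1.5.3)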

2. Naive Bayes Classifier Design

The project is divided into two packages: bayes and util

Package cn.edu.ynu.sei.classifier.bayes

Class Summary
BayesClassifier	Naive Bayes classifier
ClassifyResult	Classification result entity

cn.edu.ynu.sei.classifier.bayes
Class BayesClassifier

Naive Bayes classifier

c_NB = argmax_{c_j} P(c_j) ∏_i P(x_i|c_j)

Constructor Summary
`BayesClassifier()` Default constructor; initializes the training corpus path

Method Summary
`java.lang.String`	`classify(java.lang.String text)` Classifies the given text

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

BayesClassifier

Default constructor; initializes the training corpus path

Method Detail

classify

Classifies the given text

Parameters: text - the given text
Returns: the classification result

cn.edu.ynu.sei.classifier.bayes
Class ClassifyResult

Classification result entity

Field Summary
`java.lang.String`	`classification` Category
`float`	`p` Probability

Constructor Summary
`ClassifyResult()`

Method Summary

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

classification

Category

p

Probability

Constructor Detail

ClassifyResult

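Package cn.edu.ynu.sei.classifier.util

Class Summary
ChineseSpliter	Chinese word segmenter; a thin wrapper around the JE Chinese word segmentation component
ClassConditionalProbability	Class-conditional probability calculation
KeySearcher	Keyword searcher; a thin wrapper around the JE Chinese word segmentation component
PriorProbability	Prior probability calculation
TrainingDataManager	Training corpus searcher

cn.edu.ynu.sei.classifier.util
Class ChineseSpliter

Chinese word segmenter; a thin wrapper around the JE Chinese word segmentation component

Constructor Summary
`ChineseSpliter()`

Method Summary
`static java.lang.String`	`split(java.lang.String text, java.lang.String splitToken)` Chinese word segmentation

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

ChineseSpliter

Method Detail

split

Chinese word segmentation

Parameters: text - the given text; splitToken - the token used as the separator
Returns: the segmented text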

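cn.edu.ynu.sei.classifier.util
Class ClassConditionalProbability

Class-conditional probability calculation

Class-conditional probability

P(x_i|c_j) = (N(X=x_i, C=c_j) + 1) / (N(C=c_j) + M + V)
where N(X=x_i, C=c_j) is the number of training texts in category c_j that contain attribute x_i; N(C=c_j) is the number of training texts in category c_j; M is a value used to avoid the problems caused when N(X=x_i, C=c_j) is too small; and V is the total number of categories.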
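As a worked instance of this estimate, the sketch below plugs in hypothetical counts: 10 training texts per category, 10 categories, and M = 0. The class name and the counts are assumptions for illustration; only the formula itself comes from the documentation above.

```java
// Illustrative sketch only: SmoothedEstimateDemo and its counts are hypothetical.
class SmoothedEstimateDemo {

    /** (Nxc + 1) / (Nc + M + V), the estimate given in the formula above. */
    static float calculatePxc(int nxc, int nc, float m, int v) {
        return (nxc + 1) / (nc + m + v);
    }

    public static void main(String[] args) {
        // A word that appears in none of the 10 texts of a category still
        // gets a non-zero estimate: (0 + 1) / (10 + 0 + 10).
        System.out.println(calculatePxc(0, 10, 0F, 10));
        // A word that appears in 3 of the 10 texts: (3 + 1) / (10 + 0 + 10).
        System.out.println(calculatePxc(3, 10, 0F, 10));
    }
}
```

The "+ 1" in the numerator is what keeps a single unseen word from forcing the whole product for a category to zero.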

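Conditional probability

Definition: let A and B be two events with P(A) > 0. Then
P(B|A) = P(AB) / P(A)
is called the conditional probability of event B given that event A has occurred.

Constructor Summary
`ClassConditionalProbability()`

Method Summary
`static float`	`calculatePxc(java.lang.String x, java.lang.String c)` Calculates the class-conditional probability

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

ClassConditionalProbability

Method Detail

calculatePxc

Calculates the class-conditional probability

Parameters: x - the given text attribute; c - the given category
Returns: the class-conditional probability under the given conditions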

cn.edu.ynu.sei.classifier.util
Class PriorProbability

Prior probability calculation

P(c_j) = N(C=c_j) / N
where N(C=c_j) is the number of training texts in category c_j, and N is the total number of texts in the training set.

Constructor Summary
`PriorProbability()`

Method Summary
`static float`	`calculatePc(java.lang.String c)` Prior probability

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

PriorProbability

Method Detail

calculatePc

Prior probability

Parameters: c - the given category
Returns: the prior probability of the given category

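cn.edu.ynu.sei.classifier.util
Class TrainingDataManager

Training corpus searcher

Constructor Summary
`TrainingDataManager()` Default constructor
`TrainingDataManager(java.lang.String traningDataDirPath)` Constructor with a parameter

Method Summary
`int`	`getCountContainKeyOfClassification(java.lang.String classification, java.lang.String key)` Returns the number of training texts in the given category that contain the keyword
`java.lang.String[]`	`getFilesPath(java.lang.String classification)` Returns the full paths of all training texts in the given category
`static java.lang.String`	`getText(java.lang.String filePath)` Returns the content of the text file at the given path
`int`	`getTrainingFileCount()` Returns the total number of texts in the training set
`int`	`getTrainingFileCountOfClassification(java.lang.String classification)` Returns the number of training texts in the given category
`java.lang.String[]`	`getTraningClassifications()` Returns the training text categories; each category is a directory name

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

TrainingDataManager

Constructor with a parameter

Parameters: traningDataDirPath - root directory of the training corpus

TrainingDataManager

Default constructor

Method Detail

getTraningClassifications

Returns the training text categories; each category is a directory name

Returns: the training text categories

getFilesPath

Returns the full paths of all training texts in the given category

Parameters: classification - the given category
Returns: the full paths of all files in the given category

getText

Returns the content of the text file at the given path

Parameters: filePath - path of the given text file
Returns: the text content
Throws: java.io.FileNotFoundException; java.io.IOException

getTrainingFileCount

Returns the total number of texts in the training set

Returns: the total number of texts in the training set

getTrainingFileCountOfClassification

Returns the number of training texts in the given category

Parameters: classification - the given category
Returns: the number of training texts in the given category

getCountContainKeyOfClassification

Returns the number of training texts in the given category that contain the keyword

Parameters: classification - the given category; key - the given keyword
Returns: the number of training texts in the given category that contain the keyword

That is the classifier design at the current stage. To explain it further, the key source code is listed below :-)

3. Key Source Code Listing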

/*
 * @(#)TrainingDataManager.java
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Library General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */
package cn.edu.ynu.sei.classifier.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Training corpus searcher
 * @author 88250
 * @version 1.0.0.0, Feb 18, 2008
 */
public class TrainingDataManager {

    private String[] traningFileClassifications;
    private File traningTextDir;

    /**
     * Constructor with a parameter
     * @param traningDataDirPath root directory of the training corpus
     */
    public TrainingDataManager(String traningDataDirPath) {
        traningTextDir = new File(traningDataDirPath);
        if (!traningTextDir.isDirectory()) {
            throw new IllegalArgumentException("Training corpus not found! [" +
                    traningDataDirPath + "]");
        }
        this.traningFileClassifications = traningTextDir.list();
    }

    /**
     * Default constructor
     */
    public TrainingDataManager() {
        try {
            Properties properties = new Properties();
            FileInputStream inputFile;
            inputFile = new FileInputStream("/home/daniel/TempData/BayesTextClassifySystem/Training.properties");
            try {
                properties.load(inputFile);
            } finally {
                inputFile.close();
            }
            String traningDataDirPath = properties.getProperty("path");
            traningTextDir = new File(traningDataDirPath);
            if (!traningTextDir.isDirectory()) {
                throw new IllegalArgumentException("Training corpus not found! [" +
                        traningDataDirPath + "]");
            }
            this.traningFileClassifications = traningTextDir.list();
        } catch (IOException ex) {
            Logger.getLogger(TrainingDataManager.class.getName()).
                    log(Level.SEVERE, null, ex);
        }
    }

    /**
     * Returns the training text categories; each category is a directory name
     * @return the training text categories
     */
    public String[] getTraningClassifications() {
        return this.traningFileClassifications;
    }

    /**
     * Returns the full paths of all training texts in the given category
     * @param classification the given category
     * @return the full paths of all files in the given category
     */
    public String[] getFilesPath(String classification) {
        File classDir = new File(traningTextDir.getPath() +
                File.separator +
                classification);
        String[] ret = classDir.list();
        for (int i = 0; i < ret.length; i++) {
            ret[i] = traningTextDir.getPath() +
                    File.separator +
                    classification +
                    File.separator +
                    ret[i];
        }
        return ret;
    }

    /**
     * Returns the content of the text file at the given path
     * @param filePath path of the given text file
     * @return the text content
     * @throws java.io.FileNotFoundException
     * @throws java.io.IOException
     */
    public static String getText(String filePath) throws FileNotFoundException,
            IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), "GBK"));
        StringBuilder sb = new StringBuilder();
        try {
            String aline;
            while ((aline = reader.readLine()) != null) {
                sb.append(aline).append(" ");
            }
        } finally {
            // Closing the outermost reader also closes the wrapped streams.
            reader.close();
        }
        return sb.toString();
    }

    /**
     * Returns the total number of texts in the training set
     * @return the total number of texts in the training set
     */
    public int getTrainingFileCount() {
        int ret = 0;
        for (int i = 0; i < traningFileClassifications.length; i++) {
            ret += getTrainingFileCountOfClassification(traningFileClassifications[i]);
        }
        return ret;
    }

    /**
     * Returns the number of training texts in the given category
     * @param classification the given category
     * @return the number of training texts in the given category
     */
    public int getTrainingFileCountOfClassification(String classification) {
        File classDir = new File(traningTextDir.getPath() +
                File.separator +
                classification);
        return classDir.list().length;
    }

    /**
     * Returns the number of training texts in the given category that
     * contain the keyword
     * @param classification the given category
     * @param key the given keyword
     * @return the number of training texts in the given category that
     *         contain the keyword
     */
    public int getCountContainKeyOfClassification(String classification,
            String key) {
        int ret = 0;
        try {
            // Every call re-reads every training file of the category.
            String[] filePath = getFilesPath(classification);
            for (int j = 0; j < filePath.length; j++) {
                String text = getText(filePath[j]);
                if (text.contains(key)) {
                    ret++;
                }
            }
        } catch (FileNotFoundException ex) {
            Logger.getLogger(TrainingDataManager.class.getName()).
                    log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(TrainingDataManager.class.getName()).
                    log(Level.SEVERE, null, ex);
        }
        return ret;
    }
}

/*
 * @(#)PriorProbability.java
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Library General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */
package cn.edu.ynu.sei.classifier.util;

/**
 * Prior probability calculation
 * <h3>Prior probability calculation</h3>
 * P(cj) = N(C=cj) / N
 * where N(C=cj) is the number of training texts in category cj, and
 * N is the total number of texts in the training set.
 * @author 88250
 * @version 1.0.0.0, Feb 19, 2008
 */
public class PriorProbability {

    private static TrainingDataManager tdm =
            new TrainingDataManager();

    /**
     * Prior probability
     * @param c the given category
     * @return the prior probability of the given category
     */
    public static float calculatePc(String c) {
        float ret = 0F;
        float Nc = tdm.getTrainingFileCountOfClassification(c);
        float N = tdm.getTrainingFileCount();
        ret = Nc / N;
        return ret;
    }
}

/*
 * @(#)ClassConditionalProbability.java
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Library General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */
package cn.edu.ynu.sei.classifier.util;

/**
 * Class-conditional probability calculation
 *
 * <h3>Class-conditional probability</h3>
 * P(xi|cj) = (N(X=xi, C=cj) + 1) / (N(C=cj) + M + V)
 * where N(X=xi, C=cj) is the number of training texts in category cj that
 * contain attribute xi; N(C=cj) is the number of training texts in
 * category cj; M is a value used to avoid the problems caused when
 * N(X=xi, C=cj) is too small; and V is the total number of categories.
 *
 * <h3>Conditional probability</h3>
 * Definition: let A and B be two events with P(A) &gt; 0. Then
 * <tt>P(B|A) = P(AB) / P(A)</tt>
 * is called the conditional probability of event B given that event A
 * has occurred.
 * @author 88250
 * @version 1.0.0.0, Feb 19, 2008
 */
public class ClassConditionalProbability {

    private static TrainingDataManager tdm =
            new TrainingDataManager();
    private static final float M = 0F;

    /**
     * Calculates the class-conditional probability
     * @param x the given text attribute
     * @param c the given category
     * @return the class-conditional probability under the given conditions
     */
    public static float calculatePxc(String x, String c) {
        float ret = 0F;
        float Nxc = tdm.getCountContainKeyOfClassification(c, x);
        float Nc = tdm.getTrainingFileCountOfClassification(c);
        float V = tdm.getTraningClassifications().length;
        ret = (Nxc + 1) / (Nc + M + V);
        return ret;
    }
}

/*
 * @(#)BayesClassifier.java
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Library General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 */
package cn.edu.ynu.sei.classifier.bayes;

import cn.edu.ynu.sei.classifier.util.ChineseSpliter;
import cn.edu.ynu.sei.classifier.util.ClassConditionalProbability;
import cn.edu.ynu.sei.classifier.util.PriorProbability;
import cn.edu.ynu.sei.classifier.util.TrainingDataManager;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Naive Bayes classifier
 *
 * cNB = argmax P(cj) ∏ P(xi|cj)
 *
 * @author 88250
 * @version 1.0.0.0, Feb 19, 2008
 */
public class BayesClassifier {

    private TrainingDataManager tdm;

    /**
     * Default constructor; initializes the training corpus path
     */
    public BayesClassifier() {
        try {
            Properties properties = new Properties();
            FileInputStream inputFile;
            inputFile = new FileInputStream("/home/daniel/TempData/BayesTextClassifySystem/Training.properties");
            try {
                properties.load(inputFile);
            } finally {
                inputFile.close();
            }
            tdm = new TrainingDataManager(properties.getProperty("path"));
        } catch (IOException ex) {
            Logger.getLogger(BayesClassifier.class.getName()).
                    log(Level.SEVERE, null, ex);
        }
    }

    /**
     * Combines the class-conditional probabilities
     * (<code>ClassConditionalProbability</code>) of the given text attribute
     * vector <code>X</code> in the given category <code>Cj</code>
     * @param X the given text attribute vector
     * @param Cj the given category
     * @return the score for the category; the formula it is meant to
     *         compute is P(cj) ∏ P(xi|cj)
     * @see cn.edu.ynu.sei.classifier.util.ClassConditionalProbability
     */
    float calcProd(String[] X, String Cj) {
        float ret = 0F;
        // Accumulate the class-conditional probabilities.
        // Note: ret starts at 0 and is updated with +=, so the terms are
        // added here; they are not multiplied as the ∏ in the formula is.
        for (int i = 0; i < X.length; i++) {
            String Xi = X[i];
            ret += ClassConditionalProbability.calculatePxc(Xi, Cj);
        }
        // Then multiply by the prior probability
        ret *= PriorProbability.calculatePc(Cj);
        return ret;
    }

    /**
     * Classifies the given text
     * @param text the given text
     * @return the classification result
     */
    public String classify(String text) {
        // Segment the text with a space as separator, then split on whitespace.
        String[] X = ChineseSpliter.split(text, " ").split("\\s");
        String[] C = tdm.getTraningClassifications();
        float p = 0F;
        List<ClassifyResult> crs = new ArrayList<ClassifyResult>();
        for (int i = 0; i < C.length; i++) {
            String Ci = C[i];
            p = calcProd(X, Ci);
            ClassifyResult cr = new ClassifyResult();
            cr.classification = Ci;
            cr.p = p;
            System.out.println("In process....");
            System.out.println(Ci + "：" + p);
            crs.add(cr);
        }
        // Sort by score, highest first.
        Collections.sort(crs, new Comparator<ClassifyResult>() {

            public int compare(ClassifyResult m1, ClassifyResult m2) {
                return Float.compare(m2.p, m1.p);
            }
        });
        return crs.get(0).classification;
    }
}
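The decision rule is stated as a product of probabilities. Multiplying a few hundred values below 1 underflows a float, so a common remedy is to add logarithms instead, which leaves the arg max unchanged. The sketch below illustrates that idea on made-up numbers. It is not part of the listing above, and the class name, method names, and probabilities are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only: LogScoreDemo is hypothetical and not part of
// the classifier listed above.
class LogScoreDemo {

    /** log P(c) + sum of log P(x_i|c); same arg max as P(c) * product of P(x_i|c). */
    static double logScore(double prior, double[] conditionals) {
        double score = Math.log(prior);
        for (double p : conditionals) {
            score += Math.log(p);
        }
        return score;
    }

    /** Index of the largest score; on a tie the earlier index wins. */
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 200 features, each with estimate 0.05 (made-up values).
        double[] many = new double[200];
        Arrays.fill(many, 0.05);

        // The direct float product underflows to 0.0 ...
        float product = 1F;
        for (double p : many) {
            product *= (float) p;
        }
        System.out.println("float product = " + product);

        // ... while the log-space score stays finite and comparable.
        System.out.println("log score     = " + logScore(0.1, many));
    }
}
```

With log scores, the category is picked by calling argMax over one score per category, exactly as the sorted list is used in classify.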

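III. Training Corpus and Classification Test

For testing I used the text classification data from Sogou Labs, downloading both the mini version and the reduced version.

The mini version has 10 categories and 100 articles in total, with a total size of 284.7 KB.

The reduced version has 9 categories and 17,910 articles in total, with a total size of 48.6 MB.

Test machine configuration:

Pentium M 750 (1.86 GHz), 1.0 GB RAM

For the given text: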

[ 微软公司提出以446亿美元的价格收购雅虎

中国网2月1日报道美联社消息，微软公司提出以446亿美元现金加股票的价格收购搜索网站雅虎公司。

微软提出以每股31美元的价格收购雅虎。微软的收购报价较雅虎1月31日的收盘价19.18美元溢价62%。微软公司称雅虎公司的股东可以选择以现金或股票进行交易。

微软和雅虎公司在2006年底和2007年初已在寻求双方合作。而近两年，雅虎一直处于困境：市场份额下滑、

运营业绩不佳、股价大幅下跌。对于力图在互联网市场有所作为的微软来说，收购雅虎无疑是一条捷径，因为双方具有非常强的互补性。(小桥)

]

Test results with the mini version:

init:

deps-jar:

compile:

compile-test-single:

Testsuite: cn.edu.ynu.sei.classifier.bayes.BayesClassifierTest

classify

In process....

文化：0.70500064

In process....

健康：0.9200004

In process....

旅游：0.8250006

In process....

招聘：0.89000064

In process....

汽车：1.1150006

In process....

教育：0.82000035

In process....

体育：0.7400004

In process....

IT：1.1150006

In process....

财经：1.0150005

In process....

军事：0.76500034

属于[ IT ]

Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 4.572 sec

As can be seen, for this text the score for "汽车" (Automobile) is the same as the score for "IT".

Test results with the reduced version:

Compiling 1 source file to /home/daniel/Work/Sources/Java/BayesTextClassifier/build/test/classes

compile-test-single:

Testsuite: cn.edu.ynu.sei.classifier.bayes.BayesClassifierTest

classify

In process....

文化：0.7996222

In process....

健康：0.691568

In process....

旅游：0.6269246

In process....

招聘：0.95492196

In process....

教育：0.64076483

In process....

体育：0.41798678

In process....

IT：1.2663554

In process....

财经：1.1997666

In process....

军事：0.9136235

属于[ IT ]

Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 594.815 sec

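On evaluating the classifier

With the mini training corpus, classification took 4.572 seconds;

with the reduced training corpus, classification took 594.815 seconds, that is, more than 9 minutes.

The test article is only 261 characters long. With no dimensionality reduction at all, performance problems appear as soon as a somewhat larger corpus is used.

In addition, there are dedicated models and methods for evaluating classifiers, which are not covered here.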

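Afterword

This is the first stage of the experiment, which focuses on studying and implementing "pure" Naive Bayes theory, so classification efficiency is likely to be low. Building on this classifier, the next stage will optimize it. Optimizations considered so far:

Reduce the dimensionality of the text first, specifically by removing useless words (for example, particles such as 的, 吗, 么, and so on).

Save the prior data computed from the training corpus once it has been processed, so that it can be read back directly the next time.

Find a multi-core machine and use multithreading :-)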
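One way the second idea might look in code is sketched below: scan the training texts once, and keep for each category a map from word to the number of texts containing it. This is an assumption-laden illustration, not the planned implementation. The class name is hypothetical, the input is assumed to be already segmented with spaces between words, and it matches whole words, whereas the current listing tests substrings of the raw text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: CountCacheDemo is hypothetical.
class CountCacheDemo {

    /** category -> (word -> number of training texts in that category containing it) */
    private final Map<String, Map<String, Integer>> docFreq =
            new HashMap<String, Map<String, Integer>>();

    /** Scans every segmented training text exactly once. */
    CountCacheDemo(Map<String, List<String>> segmentedTextsByCategory) {
        for (Map.Entry<String, List<String>> e : segmentedTextsByCategory.entrySet()) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String text : e.getValue()) {
                // A set, so a word repeated within one text is counted once.
                Set<String> words =
                        new HashSet<String>(Arrays.asList(text.trim().split("\\s+")));
                for (String w : words) {
                    Integer old = counts.get(w);
                    counts.put(w, old == null ? 1 : old + 1);
                }
            }
            docFreq.put(e.getKey(), counts);
        }
    }

    /** Map lookup instead of re-reading the files of the category. */
    int getCountContainKeyOfClassification(String classification, String key) {
        Map<String, Integer> counts = docFreq.get(classification);
        if (counts == null) {
            return 0;
        }
        Integer n = counts.get(key);
        return n == null ? 0 : n;
    }

    public static void main(String[] args) {
        Map<String, List<String>> corpus = new HashMap<String, List<String>>();
        corpus.put("IT", Arrays.asList("微软 收购 雅虎", "雅虎 股价 下跌 雅虎"));
        corpus.put("体育", Arrays.asList("比赛 结果"));
        CountCacheDemo cache = new CountCacheDemo(corpus);
        System.out.println(cache.getCountContainKeyOfClassification("IT", "雅虎"));
    }
}
```

After the one-time scan, each count needed by the class-conditional estimate becomes a hash lookup, and the map could also be serialized to disk to avoid rebuilding it on the next run.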

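That is all for this article. The next installment is titled "Research and Implementation of a Naive Bayes Chinese Text Classifier (2) [original by 88250]". Stay tuned!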
