Naive Bayes with Apache Spark MLlib
Question:
I am using Naive Bayes with Apache Spark MLlib for text classification, following this tutorial: http://avulanov.blogspot.com/2014/08/text-classification-with-apache-spark.html
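import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
/* instantiate Spark context (not needed when running inside the Spark shell) */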
val sc = new SparkContext("local", "test")
/* word to vector space converter, limit to 10000 words */
val htf = new HashingTF(10000)
/* load positive and negative sentences from the dataset */
/* let 1 - positive class, 0 - negative class */
/* tokenize sentences and transform them into vector space model */
val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
  .map { text => new LabeledPoint(1, htf.transform(text.split(" "))) }
val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
  .map { text => new LabeledPoint(0, htf.transform(text.split(" "))) }
/* split the data 60% for training, 40% for testing */
val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
/* union train data with positive and negative sentences */
val training = posSplits(0).union(negSplits(0))
/* union test data with positive and negative sentences */
val test = posSplits(1).union(negSplits(1))
/* Multinomial Naive Bayesian classifier */
val model = NaiveBayes.train(training)
/* predict */
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
/* metrics */
val metrics = new MulticlassMetrics(predictionAndLabels)
/* output F1-measure for all labels (0 and 1, negative and positive) */
metrics.labels.foreach(l => println(metrics.fMeasure(l)))
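But what do I do after training? If I want to know whether the sentence "Have a nice day" is positive or negative, how do I do that? Thanks.
Answer:
In general, you need two things to make a prediction on raw data: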
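- Apply the same transformations you used for the training data. If some transformers require fitting (such as IDF, normalization, or encoding), you must use the ones that were fitted on the training data. Since your approach is very simple, all you need here is:
  val testData = htf.transform("Have a nice day".split(" "))
- Use the predict method of the trained model:
  model.predict(testData)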
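
The point about fitted transformers can be sketched as follows for the IDF case. This is only an illustration under stated assumptions, not part of the original tutorial: the four-sentence corpus is made up, the names `raw`, `idfModel`, `sentence`, `features` and `prediction` are hypothetical, and the per-vector `idfModel.transform(v)` call assumes Spark 1.3 or later. The IDF model is fitted once on the training vectors and then reused, unchanged, on the new sentence.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// Local context for the sketch (skip this line inside the Spark shell).
val sc = new SparkContext("local", "predict-raw-sentence")

// Made-up toy corpus, for illustration only: (label, sentence).
val raw = sc.parallelize(Seq(
  (1.0, "a wonderful and moving film"),
  (1.0, "have a great time with this one"),
  (0.0, "a dull and boring mess"),
  (0.0, "a waste of time and money")
))

// Same hashing configuration as in the question.
val htf = new HashingTF(10000)

// Term-frequency vectors for the training sentences.
val tf = raw.map { case (_, text) => htf.transform(text.split(" ")) }.cache()

// Fit IDF on the training data only; keep the fitted model for later reuse.
val idfModel = new IDF().fit(tf)

// Build the training set with the fitted IDF model applied to each vector.
val training = raw.map { case (label, text) =>
  LabeledPoint(label, idfModel.transform(htf.transform(text.split(" "))))
}
val model = NaiveBayes.train(training)

// New raw sentence: same tokenization, same HashingTF, same fitted IDF model.
val sentence = "Have a nice day"
val features = idfModel.transform(htf.transform(sentence.split(" ")))
val prediction = model.predict(features)
println(s"Predicted label for '$sentence': $prediction")

sc.stop()
```

Fitting a second IDF model on the new sentence alone would produce weights unrelated to those the classifier was trained on, which is why the fitted `idfModel` is carried over rather than recreated.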