Text Mining: Reading Notes on "Text-mining-based Fake News Detection Using Ensemble Methods"


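Preface

These notes record my impressions of a text-mining paper: its main methods, my assessment of the work, the application background, and a summary with conclusions.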

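I. Data mining: what is text mining?

Data mining has attracted a lot of attention in recent years, and text mining is one branch of it. Put simply, it means finding the information you need in massive amounts of data; in text mining, that data is text.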

II. Related work

1. The paper compares several fake news detection approaches:

a) Shu et al.[2] classify fake news detection models into news content models and social context models.
b) Conroy et al.[8] propose operational guidelines for designing a system for verification of news.
c) Gilda: TF-IDF of bi-grams with a stochastic gradient descent model identified fake news with an accuracy of 77.2%.
d) Ruchansky et al.[11] proposed a model with three modules: capture, score, and integrate. 89.2% (Twitter) - 95.3% (W
e) Buntain and Golbeck[12] used structural, content-based, user and temporal features to design a system to detect fake news in popular Twitter threads. 65.29% (BuzzFeed); the approach is limited to highly retweeted threads of Twitter conversations.
f) Krishnan and Chen used a combination of textual and user features. 80.68% (Hurricane Sandy dataset)
g) Jin et al.[14] made one of the significant attempts to use images for verification of news, using visual features like clarity and coherence score, and statistical features of images like count and image ratio. 83.6% (more than 7% higher than other approaches that use non-image features only)
h) Yang et al.[15] used both text and image information to train a model named the text and image information based convolutional neural network (TI-CNN).

2. The paper proposes and explains its own method: classifying news using stylometric features of the text (i.e., features based on the style of writing) together with word vector representations of the text.

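The paper's method:
a) Extract stylometric features of the article, e.g., the number of uppercase letters and the number of quotes;
b) Use ensemble methods: RF (random forest), SGD, extra trees classifier;
c) Represent the text as word vectors: TF-IDF vectors and skip-gram Word2Vec.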
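
To make item a) concrete, here is a minimal sketch of how such style counts could be computed. It is not the paper's feature extractor: the function name `stylometric_features`, the rule that two quote marks make one quote, and the extra `word_count` field are assumptions for illustration only.

```python
import re


def stylometric_features(text):
    """Return a few simple writing-style counts for one article (illustrative)."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: count straight and curly double-quote characters,
    # and treat every pair of marks as one quoted passage.
    quote_marks = sum(text.count(q) for q in ('"', "\u201c", "\u201d"))
    return {
        "uppercase_letters": sum(1 for ch in text if ch.isupper()),
        "quotes": quote_marks // 2,
        "word_count": len(words),  # hypothetical extra feature, not from the notes
    }


sample = 'BREAKING: The mayor said "we will rebuild" and left!'
features = stylometric_features(sample)
print(features)
```

Each article would yield one such dictionary, and the values would form one row of the stylometric feature matrix passed to a classifier.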

Experimental workflow:
a) Datasets (FakeNewsNet dataset and McIntire dataset)
b) Data preparation (the training set contains 5405 news articles and the test set contains 1352 news articles)
c) Feature extraction (stylometric features and word vector features: BOW, BOW TF-IDF, CBOW, SG; 2 methods are ultimately carried forward)
d) Feature selection (keep only the features that bear on whether a text is real or fake)
e) Classifiers (RF, NB, SVM, LR, KNN, bagging with a general bagging classifier and an extra trees classifier, AdaBoost, stochastic gradient boosting [GB])
f) Combining stylometric features with word vectors (using bagging, boosting and voting)
g) Evaluation metrics (accuracy, precision, recall and F-score)
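
The four metrics in item g) can be sketched in a few lines. This is an illustration rather than the paper's evaluation code, and it assumes binary labels with 1 meaning "fake" as the positive class and the F-score taken as F1 (the harmonic mean of precision and recall). The name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy labels (made up for illustration): 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the real and fake classes are imbalanced, since accuracy alone can hide a classifier that rarely predicts the minority class.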

3. Analysis and discussion of the experimental results for the different combinations:

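a) Apply the NB and RFC classifiers to the stylometric (text) features to decide which stylometric feature set to use from here on (feature set 3 is chosen).
b) Analyze the chosen stylometric feature set and pick out its important, representative features (e.g., uppercase letters, number of quotes); 50 stylometric features are selected.
c) Apply classifiers to the 50 stylometric features (results for RF, NB, SVM, LR and KNN are shown in Table 5; results using ensemble methods are tabulated in Table 6).
d) Apply the various classifiers to the different word vector features and compare which combination of word vector representation and classifier performs best.
e) With BOW count vectors and BOW TF-IDF vectors as the word vector features and LR, NB and RF as the classifiers, four settings are compared: no reduction of the vocabulary dimension (Table 7); reduction by text preprocessing (lemmatization and stemming) (Table 8); reduction by the chi-square test plus lemmatization and stemming (Table 9); and reduction by the chi-square test only (Table 10). (The results show that, under each of the three dimensionality-reduction methods and with the TF-IDF representation, the LR classifier performs best.)
f) With the same classifiers (LR, RF, NB), compare the two methods of feeding the previously mentioned word vector features (CBOW and skip-gram) into a classifier (m1: the average of the token vectors is assigned to each token of a news article; m2: the average embedding vector of the j-th word is added to row i, column j if…). (The results in Table 11 and Table 12 show that the second method works better.)
g) Use ensemble methods (bagging, boosting and voting) to combine the stylometric features and the word vector features. For bagging, setting the number of random subsets to 15 works well (Table 13); boosting uses gradient boosting and AdaBoost, with results in Table 14; voting covers two sets of experiments (Table 15: voting on feature set 3 + TF-IDF (post feature selection); Table 16: voting on feature set 3 + word vector features (WV)).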
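
The voting step in item g) can be illustrated with a small majority-vote sketch. The notes do not say whether the paper uses hard or soft voting, so hard (majority) voting is assumed here; the function name `hard_vote`, the three model names, and their predictions are all hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("every model must predict the same number of samples")

    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Made-up predictions (1 = fake, 0 = real) from three hypothetical models,
# e.g. one trained on stylometric features and two on word vector features.
stylometric_model = [1, 0, 1, 0]
tfidf_model = [1, 1, 0, 0]
embedding_model = [0, 1, 1, 0]

combined = hard_vote([stylometric_model, tfidf_model, embedding_model])
print(combined)
```

Using an odd number of voters avoids ties for binary labels. With an even number, this sketch would fall back on the label that appears first among the tied votes, which is an arbitrary choice.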

Summary

An accuracy of 95.49% is obtained by applying the boosting method to the combination of stylometric and word vector features (Table 17).