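Text [Multi-Level] Classification with many outputs
Problem statement:
To classify a text document into the category it belongs to, and also to classify it at up to two levels of that category.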
Sample training set:
Description                                          Category          Level1    Level2
The gun shooting that happened in Vegas killed two   Crime | High      Crime     High
Donald Trump elected as President of America         Politics | High   Politics  High
Rian won in football qualifier                       Sports | Low      Sports    Low
Brazil won in football final                         Sports | High     Sports    High
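Initial attempt:
I tried to create a classifier model that classifies the Category using the random forest method, and it gave me 90% overall accuracy.
Code1: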
import codecs
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
stop = stopwords.words('english')
data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv(data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))
train_data, test_data, train_label, test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1,3), sublinear_tf = True)
data_features = vectorizer.fit_transform(train_data)
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except Exception:
            # Skip rows that cannot be formatted or encoded
            continue
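Problem:
I want to add two more levels to the model, Level1 and Level2. The reason for adding them is that when I ran the classification for Level1 alone I got 96% accuracy. I am stuck at splitting the training and test datasets and at training a model that has three classifications.
Is it possible to create one model with three classifications, or should I create three models? How do I split the train and test data?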
EDIT1:
import string
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
stop = stopwords.words('english')
data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv(data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1,3), sublinear_tf = True)
data_features = vectorizer.fit_transform(train_data)
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print test_data_feature
Output_predict = RF.predict(test_data_feature)
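# test_label is a DataFrame, so compare against its underlying array;
# axis=0 gives one accuracy per target column
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values, axis=0)))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    # Iterating over a DataFrame yields column names, so zip over its rows instead
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except Exception:
            # Skip rows that cannot be formatted or encoded
            continue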
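scikit-learn's random forest classifier natively supports multi-output classification (see this example), so you do not need to create three separate models.
From the RandomForestClassifier.fit documentation, the inputs to the fit function are: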
X : array-like or sparse matrix of shape = [n_samples, n_features]
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
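So you need to pass an array y (your labels) of shape N x 3 to your RandomForestClassifier. To split your training and test sets, you can do:
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)
Your train_label and test_label should then be arrays of shape N x 3, which you can use to fit your model and to compare against your predictions (NB: I have not tested this here; you might need to do some conversions).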
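As a rough illustration of the multi-output approach described above, here is a minimal sketch. It assumes the column names from the question (Description, Category, Level1, Level2) and uses the four sample rows repeated as toy data in place of the real 70k file; the names `target_columns`, `forest`, `matches`, and the accuracy variables are made up for this example. The labels are converted to a plain object array up front, which is one possible form of the "conversions" mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data: the four sample rows from the question, repeated so that a
# train/test split is possible. This is NOT the real 70k dataset.
rows = [
    ("The gun shooting that happened in Vegas killed two", "Crime | High", "Crime", "High"),
    ("Donald Trump elected as President of America", "Politics | High", "Politics", "High"),
    ("Rian won in football qualifier", "Sports | Low", "Sports", "Low"),
    ("Brazil won in football final", "Sports | High", "Sports", "High"),
]
data = pd.DataFrame(rows * 5, columns=["Description", "Category", "Level1", "Level2"])

target_columns = ["Category", "Level1", "Level2"]
# One row per document, one column per target: shape (n_samples, 3).
labels = data[target_columns].to_numpy(dtype=object)

train_data, test_data, train_label, test_label = train_test_split(
    data["Description"], labels, test_size=0.3, random_state=100
)

vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)

# A single forest fitted on a 2-D y predicts all three targets at once.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(train_features, train_label)
predictions = forest.predict(test_features)  # shape (n_test, 3)

# Accuracy per target column, plus the share of rows with all three correct.
matches = predictions == test_label
per_target_accuracy = matches.mean(axis=0)
all_correct_accuracy = matches.all(axis=1).mean()

for name, accuracy in zip(target_columns, per_target_accuracy):
    print("{}: {:.2f}".format(name, accuracy))
print("All three correct: {:.2f}".format(all_correct_accuracy))
```

Reporting accuracy per column keeps the numbers comparable with the single-target runs from the question, while the "all three correct" figure is stricter because one wrong column fails the whole row.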
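I will check this with my program and let you know – The6thSense
@The6thSense did it work? – nbeuchat
I am very sorry, I have not tried it yet; I am not near my system. I will definitely check it tomorrow and let you know as soon as possible. Thanks – The6thSense
Can you clarify what the two levels should be? In the sample training set you provided, your category looks like "Crime | High", and your levels are then just the first and second words of the category (so they provide no new information). Also, just to make sure: does the category always consist of two words? –
@MiriamFarber yes, the category always consists of two words separated by a pipe. The reason for adding level1 and level2 is that I am getting higher accuracy for level1, so even if the category is wrong, it still reduces the downstream processing. – The6thSense
OK, just to make sure: when you run a model with a single target, you get 90% accuracy if the target is the Category column and 96% accuracy if the target is the Level1 column, and you want to build one model with 3 targets (these three columns corresponding to Description, Level1 and Level2), right? –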
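Since the thread establishes that a category is always two words separated by a pipe, the two levels can also be recovered from a Category prediction by splitting it. The sketch below shows that idea on hypothetical predictions (the series values and the helper `split_levels` are invented for illustration, not taken from the asker's data), and shows how Level1 can still be right when the full category is wrong.

```python
import pandas as pd

# Hypothetical Category predictions and ground truth in the
# "Level1 | Level2" format from the question.
predicted_category = pd.Series(["Crime | High", "Sports | Low", "Sports | High"])
actual_category = pd.Series(["Crime | High", "Sports | High", "Sports | High"])


def split_levels(category):
    """Split 'Level1 | Level2' strings into two whitespace-trimmed columns."""
    parts = category.str.split("|", n=1, expand=True)
    parts = parts.apply(lambda column: column.str.strip())
    parts.columns = ["Level1", "Level2"]
    return parts


predicted_levels = split_levels(predicted_category)
actual_levels = split_levels(actual_category)

# Share of rows where the full category matches.
category_accuracy = (predicted_category == actual_category).mean()
# One accuracy per level; a wrong category can still have a correct Level1.
level_accuracy = (predicted_levels == actual_levels).mean()

print("Category accuracy: {:.2f}".format(category_accuracy))
print(level_accuracy)
```

In this toy case the second prediction gets the category wrong but Level1 right, which matches the asker's observation that Level1 accuracy is higher than Category accuracy.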