Python: how to turn word-count lists into a format suitable for CountVectorizer

Question:

I have ~100,000 lists of strings of the following form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'], etc.
Together these basically make up my corpus. Each list contains the words in one document and their counts.

How can I put this corpus into a form that I can then feed into CountVectorizer?

Is there a faster way than converting each list into a string in which every word is repeated as many times as its count?

Answer

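Assuming that what you want is the vectorized corpus in sparse matrix format, together with a ready-to-use vectorizer, you can simulate the vectorization process without repeating the data: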

from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]


# Prepare a vocabulary for the vectorizer (term -> column index)
vocabulary = {item.split(':')[0] for document in corpus for item in document}
indexed_vocabulary = {term: index for index, term in enumerate(vocabulary)}
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)

# Vectorize the corpus using the coordinates known to the vectorizer.
# Each LIL row keeps its column indices sorted, so sort the (column, count) pairs first.
X = lil_matrix((len(corpus), len(vocabulary)))
for i, document in enumerate(corpus):
    pairs = sorted((indexed_vocabulary[item.split(':')[0]], int(item.split(':')[1]))
                   for item in document)
    X.rows[i] = [column for column, _ in pairs]
    X.data[i] = [count for _, count in pairs]

# Convert the matrix to csr format to be compatible with vectorizer.transform output
X = X.tocsr()

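In this example, X.toarray() gives the output below. The column order follows the vocabulary index, which comes from set iteration order and can differ between runs; here the columns are in, of, jungle, to, the, is, king: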

[[ 168. 216. 0. 159. 652. 145. 0.] 
[ 0. 16. 110. 0. 400. 0. 20.]]

The same vectorizer can then be used to vectorize more documents:

vectorizer.transform(['jungle kid is programming', 'the jungle machine learning jungle'])

which yields (shown as a dense array):

[[0 0 1 0 0 1 0] 
[0 0 2 0 1 0 0]]
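As a quick sanity check, the directly built matrix can be compared with what CountVectorizer produces from the expanded text that the question wanted to avoid. This is only a minimal sketch under some assumptions: the helper `parse` and the names `X_direct` and `X_naive` are made up for illustration, the vocabulary is sorted so the column order is reproducible, and the matrix is built from coordinate triples with `csr_matrix` instead of filling a `lil_matrix`. The two results should agree only when every term is lowercase and at least two word characters long, because the default CountVectorizer tokenizer lowercases text and drops single-character tokens.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]


def parse(item):
    """Split 'word: count' into ('word', count). Hypothetical helper."""
    term, count = item.split(':')
    return term.strip(), int(count)


# Sorted vocabulary so that column indices do not depend on set ordering
terms = sorted({parse(item)[0] for document in corpus for item in document})
indexed_vocabulary = {term: index for index, term in enumerate(terms)}
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)

# Direct construction from (row, column, count) triples
data, row_idx, col_idx = [], [], []
for row, document in enumerate(corpus):
    for item in document:
        term, count = parse(item)
        row_idx.append(row)
        col_idx.append(indexed_vocabulary[term])
        data.append(count)
X_direct = csr_matrix((data, (row_idx, col_idx)),
                      shape=(len(corpus), len(terms)))

# Naive construction: repeat each word `count` times and let the vectorizer count
expanded = [' '.join(' '.join([term] * count)
                     for term, count in map(parse, document))
            for document in corpus]
X_naive = vectorizer.transform(expanded)

same = bool((X_direct.toarray() == X_naive.toarray()).all())
print(same)
```

On a corpus of ~100,000 documents the naive branch is the expensive one, so it is probably only worth running on a small sample.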

This is awesome, thank you. – Unstack

If I want to remove stop words in the process, would it be better to do that before or after building the vocabulary? – Unstack

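That's a tricky question, because the vectorization of the corpus is not done by the vectorizer itself. One simple approach I can think of is to pass 'stop_words' to the vectorizer constructor. Later, when building 'data' and 'rows', add a filter against the vectorizer's stop words ('vectorizer.get_stop_words()'). This approach keeps the vectorization fast and still supports transforming future documents. – Elisha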
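The stop-word suggestion above could look roughly like the following. It is a sketch, not a definitive implementation: it assumes scikit-learn's built-in 'english' list is an acceptable stop-word list, it removes stop words from the vocabulary as well as from the counts, and it builds the matrix from coordinate triples with `csr_matrix` rather than by filling a `lil_matrix`. The names `terms`, `row_idx` and `col_idx` are illustrative only.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]

# Assumption: the built-in English list stands in for your own stop words
stop_words = CountVectorizer(stop_words='english').get_stop_words()

# Vocabulary without stop words, sorted so the column order is reproducible
terms = sorted({item.split(':')[0] for document in corpus for item in document}
               - stop_words)
indexed_vocabulary = {term: index for index, term in enumerate(terms)}

# The vectorizer gets the same stop words, so later transform() calls agree
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary, stop_words='english')

# Build the counts, skipping stop words while collecting data, rows and columns
data, row_idx, col_idx = [], [], []
for row, document in enumerate(corpus):
    for item in document:
        term, count = item.split(':')
        if term in stop_words:
            continue
        row_idx.append(row)
        col_idx.append(indexed_vocabulary[term])
        data.append(int(count))

X = csr_matrix((data, (row_idx, col_idx)),
               shape=(len(corpus), len(indexed_vocabulary)))
print(terms)
print(X.toarray())
```

Filtering the vocabulary first also keeps the matrix narrower, since no columns are reserved for words that are never counted.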
