stri_replace_all_fixed is slow on a big data set - is there an alternative?

Problem description:

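I am trying to stem ~4000 documents in R using the stri_replace_all_fixed function. However, it is very slow, because my stemming dictionary contains approx. 300k words. I am doing this because the documents are in Danish, so the Porter Stemmer Algorithm is not useful (it is too aggressive).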

I have posted the code below. Does anyone know of an alternative way of doing this?

Logic: look at each word in each document -> if the word equals a word from the voc table, replace it with the corresponding tran word.

##Read in the dictionary 
voc <- read.table("danish.csv", header = TRUE, sep=";") 
#Using the library 'stringi' to make the stemming ('tm' provides Corpus, VectorSource and tm_map)
library(stringi)
library(tm)
#Split the voc corpus and put the word and stem column into different corpus 
word <- Corpus(VectorSource(voc))[1] 
tran <- Corpus(VectorSource(voc))[2] 
#Using stri_replace_all_fixed to stem words 
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ## 
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))

Structure of the "voc" data frame:

       Word           Stem
1      abandonnere    abandonner
2      abandonnerede  abandonner
3      abandonnerende abandonner
...
313273 åsyns          åsyn

Answer:

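To make dictionary-based stemming fast, you need to implement a clever data structure such as a prefix tree. Running 300,000 separate search-and-replace passes simply does not scale.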
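To illustrate the idea of a prefix tree, here is a minimal sketch in R built from nested environments. The helper names (trie_new, trie_insert, trie_lookup) are made up for this example and do not come from any package. It only shows the data structure on a few entries from the voc table; it is not tuned for 300k words.

```r
# Hypothetical helpers (not part of any package): a prefix tree made of
# nested environments that maps each dictionary word to its stem.
trie_new <- function() new.env(hash = TRUE, parent = emptyenv())

trie_insert <- function(root, word, stem) {
  node <- root
  for (ch in strsplit(word, "", fixed = TRUE)[[1]]) {
    key <- paste0("c_", ch)  # the "c_" prefix keeps child keys apart from "stem"
    if (!exists(key, envir = node, inherits = FALSE)) {
      assign(key, trie_new(), envir = node)
    }
    node <- get(key, envir = node, inherits = FALSE)
  }
  assign("stem", stem, envir = node)  # marks the end of a complete word
  invisible(root)
}

trie_lookup <- function(root, word) {
  node <- root
  for (ch in strsplit(word, "", fixed = TRUE)[[1]]) {
    key <- paste0("c_", ch)
    # No such path in the tree: the word is not in the dictionary.
    if (!exists(key, envir = node, inherits = FALSE)) return(word)
    node <- get(key, envir = node, inherits = FALSE)
  }
  if (exists("stem", envir = node, inherits = FALSE)) {
    get("stem", envir = node, inherits = FALSE)
  } else {
    word  # only a prefix of a dictionary word, so leave it unchanged
  }
}

root <- trie_new()
trie_insert(root, "abandonnere", "abandonner")
trie_insert(root, "abandonnerede", "abandonner")
trie_insert(root, "abandonnerende", "abandonner")

trie_lookup(root, "abandonnerede")  # a dictionary word: returns its stem
trie_lookup(root, "abandon")        # only a prefix: returned unchanged
```

Each lookup walks at most one node per character of the word, so its cost depends on the word length and not on the number of dictionary entries.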

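I don't think this will be efficient in R, though; you will need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you if you try to do this in pure R.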
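If writing an extension is not an option, one way to limit the interpreter overhead is to hand the many tiny operations to vectorised built-ins that already run in compiled code. The sketch below is an assumption-laden illustration, not the approach described above: it swaps the prefix tree for exact whole-word matching with base R's match(), called once over all tokens. The names stem_texts and docs_text are hypothetical, the toy voc stands in for the real dictionary, matching is case-sensitive, and punctuation is dropped because stri_extract_all_words keeps only the words.

```r
library(stringi)

# Toy dictionary standing in for the `voc` data frame from the question.
voc <- data.frame(
  Word = c("abandonnere", "abandonnerede", "abandonnerende"),
  Stem = c("abandonner", "abandonner", "abandonner"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace every dictionary word in `texts` by its stem.
stem_texts <- function(texts, voc) {
  words <- as.character(voc$Word)
  stems <- as.character(voc$Stem)

  # Tokenise all documents at once; documents without words give character(0).
  tokens <- stri_extract_all_words(texts, omit_no_match = TRUE)
  flat   <- as.character(unlist(tokens))

  # One lookup for every token of every document.
  idx <- match(flat, words)
  hit <- !is.na(idx)
  flat[hit] <- stems[idx[hit]]

  # Reassemble the tokens into one string per document.
  doc_id <- factor(rep(seq_along(tokens), lengths(tokens)),
                   levels = seq_along(tokens))
  by_doc <- split(flat, doc_id)
  vapply(by_doc, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

docs_text <- c("de abandonnerede planen", "abandonnerende folk")
stem_texts(docs_text, voc)
```

Whether this is fast enough for ~4000 documents and a 300k-word dictionary would have to be measured. If the original spacing and punctuation must be preserved, a different tokenisation step would be needed.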
