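JavaScript: filter words out of a string using a dictionary?

Question:

I need to filter several hundred "stop" words out of a string. Because there are so many stop words, I don't think it would be a good idea to do something like this: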

sentence.replace(/\b(?:the|it is|we all|an?|by|to|you|[mh]e|she|they|we...)\b/ig, '');

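How can I create a hash map to store the stop words? In this map the key is the stop word itself and the value doesn't matter. Filtering would then amount to checking whether a word is absent from the stop-word map. What data structure should I use to build such a map?

Related: [John Resig's dictionary lookup series](http://ejohn.org/blog/revised-javascript-dictionary-search/) – 2012-02-22 22:01:57

How many of these "stop words" do you have? The answer could matter. – ChaosPandion 2012-02-22 22:03:40

Possible duplicate of [JavaScript code to filter out common words in a string](http://*.com/questions/6686718/javascript-code-to-filter-out-common-words-in-a-string). Note that the solution there builds the dictionary from a string. If you start with a dictionary, you can skip that part. – outis 2012-02-22 22:06:52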

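Answer

Nothing beats regular expressions for this kind of job. However, they have two problems: they are hard to maintain (as you pointed out in your post), and very large ones have performance issues. I don't know how many alternatives a single regexp can handle, but I think up to 20-30 will be fine in any case.

So you need some code that builds the regexps dynamically from some data structure (an array or a single string). Personally, I prefer a string, since it's the easiest to maintain.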

// taken from http://www.ranks.nl/resources/stopwords.html 
var stops = ""
+"a about above after again against all am an and any are aren't as " 
+"at be because been before being below between both but by can't " 
+"cannot could couldn't did didn't do does doesn't doing don't down " 
+"during each few for from further had hadn't has hasn't have  " 
+"haven't having he he'd he'll he's her here here's hers herself  " 
+"him himself his how how's i i'd i'll i'm i've if in into is isn't " 
+"it it's its itself let's me more most mustn't my myself no nor  " 
+"not of off on once only or other ought our ours ourselves out  " 
+"over own same shan't she she'd she'll she's should shouldn't so " 
+"some such than that that's the their theirs them themselves then " 
+"there there's these they they'd they'll they're they've this  " 
+"those through to too under until up very was wasn't we we'd we'll " 
+"we're we've were weren't what what's when when's where where's  " 
+"which while who who's whom why why's with won't would wouldn't  " 
+"you you'd you'll you're you've your yours yourself yourselves  " 

// how many words to put into each regexp
var reSize = 20;

// sort longest first so that e.g. "he's" is removed before "he",
// then build one regexp per chunk of reSize words
var regexes = [];
stops = stops.match(/\S+/g).sort(function(a, b) { return b.length - a.length; });
for (var n = 0; n < stops.length; n += reSize)
    regexes.push(new RegExp("\\b(" + stops.slice(n, n + reSize).join("|") + ")\\b", "gi"));

Once you have that, the rest is obvious:

regexes.forEach(function(r) { 
    text = text.replace(r, '') 
})
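
The length-descending sort in the build step is worth a closer look: `\b` treats an apostrophe as a word boundary, so a short word such as "he" can match inside "he's" if it is tried first. The sketch below shows the difference on a hypothetical three-word list and sample sentence; `buildRegexes` and `strip` are illustrative helper names, not part of the answer above.

```javascript
// Hypothetical mini stop list and input, for illustration only.
var words = ["he", "he's", "the"];
var sample = "He's sure the cat saw he who waits";

// Same chunked build as in the answer, wrapped in a helper.
function buildRegexes(list, size) {
    var out = [];
    for (var n = 0; n < list.length; n += size)
        out.push(new RegExp("\\b(" + list.slice(n, n + size).join("|") + ")\\b", "gi"));
    return out;
}

// Apply every regexp in turn, removing whatever matches.
function strip(text, regexes) {
    regexes.forEach(function (r) { text = text.replace(r, ""); });
    return text;
}

// Unsorted: "he" is tried first, so "He's" loses only "He" and "'s" is left.
var unsorted = strip(sample, buildRegexes(words, 20));

// Sorted longest first: "he's" is tried before "he" and goes away whole.
var sorted = words.slice().sort(function (a, b) { return b.length - a.length; });
var cleaned = strip(sample, buildRegexes(sorted, 20));

console.log(JSON.stringify(unsorted));
console.log(JSON.stringify(cleaned));
```

Both results still contain the whitespace that surrounded the removed words, so a follow-up such as `text.replace(/\s+/g, " ").trim()` may be wanted.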

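You need to experiment with the reSize value to find the best balance between the length of each regexp and the total number of regexps. If performance is critical, you can also run the generating part once and then cache the result (i.e. the generated regexps) somewhere.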
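
One way to cache the generated regexps is to build them inside a closure and hand back a function that reuses them. The following is a minimal sketch under that assumption; `makeStopFilter`, `timeFilter`, the short `demoStops` list and the sample sentence are all hypothetical, and the `Date.now()` timing is coarse and only meant for comparing chunk sizes against each other.

```javascript
// Illustrative helper: builds the chunked regexps once and returns a
// function that reuses them on every call.
function makeStopFilter(stopList, reSize) {
    var sorted = stopList.match(/\S+/g)
        .sort(function (a, b) { return b.length - a.length; });
    var regexes = [];
    for (var n = 0; n < sorted.length; n += reSize)
        regexes.push(new RegExp("\\b(" + sorted.slice(n, n + reSize).join("|") + ")\\b", "gi"));

    return function (text) {
        regexes.forEach(function (r) { text = text.replace(r, ""); });
        return text;
    };
}

// Rough timing loop: runs the cached filter `runs` times, returns elapsed ms.
function timeFilter(stopList, reSize, text, runs) {
    var filter = makeStopFilter(stopList, reSize);
    var start = Date.now();
    for (var i = 0; i < runs; i++) filter(text);
    return Date.now() - start;
}

var demoStops = "a an the it is it's of to";   // hypothetical short list
[2, 4, 8].forEach(function (size) {
    var ms = timeFilter(demoStops, size, "It's the end of a story", 1000);
    console.log("reSize=" + size + ": " + ms + " ms");
});
```

With a realistic stop list and representative input text, the same loop can be run over a wider range of sizes to see where the trade-off settles.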

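Would looking words up in a hash map be slower than using regexps? Can a plain JS object be treated as a hash map, where property names are the map keys and property values are the map values? – dokondr 2012-02-23 09:46:09

It depends on what you're doing, but in most cases replacing 20 words at a time will be faster than locating and eliminating them one by one. And yes, JavaScript objects are hash maps. – georg 2012-02-23 10:57:26

OK, but how is the 20-word regexp replacement implemented in the JS runtime? IMO there will be two arrays: 'sentence' and 'stopWords'. In any case, every word in the sentence will be compared with every word in stopWords. That doesn't look efficient to me. – dokondr 2012-02-23 11:15:50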
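
For comparison, the hash-map lookup asked about in the question and these comments could look roughly like the sketch below. It is not taken from the answer; `makeStopMap`, `filterWithMap` and the short stop list are hypothetical, and it assumes words are separated by whitespace and carry no attached punctuation.

```javascript
// Build a lookup table whose keys are the stop words; the values are unused.
// Object.create(null) avoids inherited keys such as "constructor".
function makeStopMap(stopList) {
    var map = Object.create(null);
    stopList.match(/\S+/g).forEach(function (w) { map[w.toLowerCase()] = true; });
    return map;
}

// Split on whitespace and keep only the words that are not keys of the map.
function filterWithMap(text, map) {
    return text.split(/\s+/).filter(function (word) {
        return word !== "" && !(word.toLowerCase() in map);
    }).join(" ");
}

var stopMap = makeStopMap("a an the it is it's of to");   // hypothetical list
var result = filterWithMap("It's the end of a story", stopMap);
console.log(result);
```

Each word costs one property lookup rather than a comparison against every stop word. The trade-off is that multi-word entries such as "it is" from the regexp in the question cannot be matched this way, and a token like "story." would need its punctuation stripped first. In current engines a `Set` would serve the same purpose as the plain object.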
