How do I find rows that contain words from a given list of words? Not one particular word: any word in the list counts

Problem description:

I have a given list of words, for example:

words <- c("breast","cancer","chemotherapy")

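I also have a very large data frame with 1 variable and more than 10,000 entries (rows).

I want to select all rows that contain any of the words in `words`. Not just one particular word: any word in `words` matters. Rows that contain several words from `words` matter too.

If I knew what the words were, I could run a stringr extraction several times. However, the words change every time and are not known in advance. Is there a direct way to do this?

Also, is it possible to select all rows that contain two or more of the words in `words`? For example, a row containing only "cancer" does not count, but a row containing both "breast" and "cancer" does. Again, the words change every time and are not known in advance. Is there a direct way?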

Answer

Some fake data:

words <- c("breast","cancer","chemotherapy") 
df <- data.frame(v1 = c("there was nothing found","the chemotherapy is effective","no cancer no chemotherapy","the breast looked normal","something"))

You can use a combination of grepl, sapply and rowSums:

df[rowSums(sapply(words, grepl, df$v1)) > 0, , drop = FALSE]

This gives:

                             v1
2 the chemotherapy is effective
3     no cancer no chemotherapy
4      the breast looked normal
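To see why the one-liner works, it can help to look at its intermediate objects. The sketch below repeats the fake data and splits the expression into two steps; the names `hits` and `n_matched` are introduced here for illustration only and are not part of the original answer.

```r
words <- c("breast", "cancer", "chemotherapy")
df <- data.frame(v1 = c("there was nothing found",
                        "the chemotherapy is effective",
                        "no cancer no chemotherapy",
                        "the breast looked normal",
                        "something"))

# sapply() calls grepl(word, df$v1) once per word and binds the results
# into a logical matrix: one row per entry of df$v1, one column per word.
hits <- sapply(words, grepl, df$v1)

# rowSums() treats TRUE as 1, so this is the number of distinct words
# from `words` found in each row.
n_matched <- rowSums(hits)

# Keep the rows where at least one word matched.
df[n_matched > 0, , drop = FALSE]
```

Note that `sapply()` only returns a matrix here because `df$v1` has more than one element; for a single-row data frame it would simplify to a vector and `rowSums()` would fail.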

If you only want to select rows that contain at least two of the words, then:

df[rowSums(sapply(words, grepl, df$v1)) > 1, , drop = FALSE]

Result:

                         v1
3 no cancer no chemotherapy
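Since the word list changes every time, both cases could be wrapped in one small function. The following is only a sketch: `rows_with_words`, `min_words` and `whole_word` are hypothetical names that do not appear in the original answer. It also illustrates one caveat of `grepl`: by default it matches substrings, so "breast" is also found inside "breastfeeding". Wrapping each word in `\\b` is one way to require whole-word matches, assuming the words contain no regex metacharacters.

```r
# Hypothetical helper, not part of the original answer.
rows_with_words <- function(df, col, words, min_words = 1, whole_word = FALSE) {
  text <- as.character(df[[col]])
  # Optionally require word boundaries around each word (words are used as regex).
  patterns <- if (whole_word) paste0("\\b", words, "\\b") else words
  # One logical vector per word; adding them gives the number of distinct
  # words matched in each row.
  n_matched <- Reduce(`+`, lapply(patterns, grepl, x = text))
  df[n_matched >= min_words, , drop = FALSE]
}

words <- c("breast", "cancer", "chemotherapy")
df <- data.frame(v1 = c("there was nothing found",
                        "the chemotherapy is effective",
                        "no cancer no chemotherapy",
                        "the breast looked normal",
                        "breastfeeding advice"))

any_word  <- rows_with_words(df, "v1", words)                     # rows 2, 3, 4, 5
two_words <- rows_with_words(df, "v1", words, min_words = 2)      # row 3
whole     <- rows_with_words(df, "v1", words, whole_word = TRUE)  # rows 2, 3, 4
```

Using `Reduce` over a list instead of `rowSums` over a matrix keeps the function working when the data frame has a single row.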

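Note: you need drop = FALSE because your data frame has only one variable (column); without it, the subset is simplified to a plain vector. If your data frame has more than one variable (column), drop = FALSE is not needed.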
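A small illustration of the `drop = FALSE` behaviour, using a made-up one-column data frame (the names `keep`, `as_vector` and `as_frame` are only for this example):

```r
df <- data.frame(v1 = c("a", "b", "c"), stringsAsFactors = FALSE)
keep <- c(TRUE, FALSE, TRUE)

# With a single column, row subsetting drops the data frame structure
# and returns the column values as a vector.
as_vector <- df[keep, ]

# drop = FALSE keeps the result as a one-column data frame.
as_frame <- df[keep, , drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
```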
