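Regex pattern - get the number before a specific word - gsub
Problem description:
I have just started learning regular expressions and I am stuck on a problem. I was given a dataset containing movie award information.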
**Award**
Won 2 Oscars. Another 7 wins & 37 nominations.
6 wins& 30 nominations
5 wins
Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.
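I want to pull out the numbers before "wins" and "nominations" and add a column for each. For example, for the first row that would be 6 in the wins column and 37 in the nominations column.
The pattern I used is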
df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards)
None of my attempts gives the result I want. I don't know how to write the pattern for "wins". :( Can anyone please help?
Thanks a lot!
Answer
We can extract the numbers into a list, pad with NAs the cases where there is only a single element, and then rbind the result.
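# Extract every number that is followed by " win", " wins" or " nominations"
lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)",
                                      df2$Award, perl = TRUE))
# Pad each match vector with NA to a common length, convert to numeric, bind as rows
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`,
                                                       max(lengths(lst))), as.numeric))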
df2
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
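To see what the padding step does, the pipeline can be taken apart one call at a time. The following is a minimal sketch that rebuilds the sample data as a plain character vector named `award` (an assumed name; the answer works on `df2$Award`). It also adds one hypothetical input, "3 nominations", which is not in the question's data, to show that this approach fills the columns by position rather than by keyword.

```r
# Sample data copied from the question
award <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
           "6 wins& 30 nominations",
           "5 wins",
           "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

# Lookahead: digits that are followed by a space and win/wins/nominations.
# The lookahead itself is not part of the match, so only the digits come back.
pat <- "\\d+(?= \\b(wins?|nominations)\\b)"
lst <- regmatches(award, gregexpr(pat, award, perl = TRUE))

# Number of matches per row; "5 wins" yields a single match
n_found <- lengths(lst)

# `length<-` extends a shorter vector with NA up to the requested length
padded <- lapply(lst, `length<-`, max(n_found))

# Convert to numeric and stack the rows into a two-column matrix
mat <- do.call(rbind, lapply(padded, as.numeric))

# Hypothetical input (not in the original data): nominations but no wins
extra <- "3 nominations"
extra_hits <- regmatches(extra, gregexpr(pat, extra, perl = TRUE))[[1]]
extra_row <- as.numeric(`length<-`(extra_hits, 2))
```

In `extra_row` the 3 sits in the first position, so if new1 is read as the wins column, a nominations-only row would be mislabeled. For the four rows in the question this situation does not occur.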
Answer
We can use str_extract with a regular expression to get the values.
library(stringr)
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
"6 wins& 30 nominations",
"5 wins",
"Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)
df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)")
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)")
> df
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
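Note that str_extract returns character vectors, so value1 and value2 above are strings (hence the `<NA>` in the printout). If integer columns are wanted and stringr is not available, the same per-keyword lookahead can be sketched in base R. The helper name `extract_before` and the column names `wins` and `nominations` are assumptions made for this illustration, not part of the original answers.

```r
# Hypothetical helper: return the integer that directly precedes `word`
# in each string, or NA when `word` does not occur after a number.
extract_before <- function(x, word) {
  pattern <- paste0("\\d+(?=\\s", word, ")")
  m <- regexpr(pattern, x, perl = TRUE)      # first match per string, -1 if none
  out <- rep(NA_integer_, length(x))
  hit <- !is.na(m) & m > 0
  out[hit] <- as.integer(regmatches(x, m))   # regmatches() drops the non-matches
  out
}

text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
          "6 wins& 30 nominations",
          "5 wins",
          "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text, stringsAsFactors = FALSE)

# One column per keyword, so a missing keyword gives NA in its own column
df$wins <- extract_before(df$text, "win")
df$nominations <- extract_before(df$text, "nomination")
```

Because each keyword is searched separately, a row that mentions only nominations would get NA for wins instead of shifting its number into the wrong column.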
Sorry, for the first one the value in the wins column should be 7.