Regex pattern - get the number before a specific word - gsub

Problem description:

I have just started learning regular expressions and I am stuck on a problem. I was given a dataset that contains information about movie awards.

**Award** 
    Won 2 Oscars. Another 7 wins & 37 nominations. 
    6 wins& 30 nominations 
    5 wins 
    Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations. 

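I want to pull out the numbers that come before "wins" and "nominations" and add a column for each. For example, for the first row this would be 6 in the wins column and 37 in the nominations column.

The pattern I used is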

df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards) 

The result is not what I want. I don't know how to write the pattern for "wins". :( Can anyone please help?

Thanks a lot!


Sorry, for the first row the value in the wins column should be 7.

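We can extract the numbers into a list, pad the elements with NA for the cases where only a single number was found, and then rbind the result.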

lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)", 
       df2$Award, perl = TRUE)) 
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`, 
          max(lengths(lst))), as.numeric)) 
df2 
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
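
One thing to watch with the positional approach above: a row that mentions only nominations would put its number in the first column. A minimal base R sketch that matches each keyword separately is shown here; `extract_before` is a hypothetical helper name, and the last row of the sample data is made up to illustrate the nominations-only case.

```r
# Hypothetical helper: return the number directly before `word`,
# or NA when no number is followed by that word.
extract_before <- function(x, word) {
  # Lookahead keeps only the digits in the match; \\s* tolerates missing spaces.
  pattern <- paste0("\\d+(?=\\s*", word, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  hit <- !is.na(m) & m > 0
  # regmatches() returns only the matched substrings, in order.
  out[hit] <- as.integer(regmatches(x, m))
  out
}

df2 <- data.frame(
  Award = c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.",
            "2 nominations"),  # made-up row, not from the original data
  stringsAsFactors = FALSE
)

df2$wins        <- extract_before(df2$Award, "win")
df2$nominations <- extract_before(df2$Award, "nomination")
```

With this layout the made-up fifth row gets NA for wins and 2 for nominations, instead of the 2 landing in the wins column.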

We can use str_extract with a regular expression to get the values.

library(stringr) 
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.", 
      "6 wins& 30 nominations", 
      "5 wins", 
      "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.") 
df <- data.frame(text = text) 

df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)") 
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)") 

> df
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
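
Note that str_extract returns character values, so the columns above hold strings. For anyone who prefers to stay close to the gsub attempt in the question, the same one-column-per-keyword idea can probably be written in base R with sub() and a capture group. This is only a sketch; `number_before` is a hypothetical helper name, and it converts the result to integers.

```r
# Hypothetical helper: capture the digits before `word` with sub(),
# returning NA where the pattern does not occur.
number_before <- function(x, word) {
  has_match <- grepl(paste0("\\d+\\s*", word), x, perl = TRUE)
  # The lazy .*? lets the capture group take the whole number, not just its last digit.
  captured <- sub(paste0(".*?(\\d+)\\s*", word, ".*"), "\\1", x, perl = TRUE)
  as.integer(ifelse(has_match, captured, NA_character_))
}

text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
          "6 wins& 30 nominations",
          "5 wins",
          "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

wins <- number_before(text, "win")         # 7, 6, 5, 1
noms <- number_before(text, "nomination")  # 37, 30, NA, 3
```

The grepl() check matters because sub() returns the input unchanged when nothing matches, which would otherwise turn into a coercion warning instead of a clean NA.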