R - grepl on over 7 million observations - how to improve efficiency?

Problem description:

I have run into a problem with some R code I wrote, and I thought you might know how to make the whole thing feasible, in the sense of improving its efficiency.

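So, what I am trying to do is the following:

I have a Twitter dataset of ~7 million observations. At the moment I am not interested in the tweets or any other metadata, only in the "location" field, so I have extracted this data into a new data.frame that contains the location variable (a string) and a new, currently empty "isRelevant" variable (logical). In addition, I have a vector containing text in the following format: "Placename(1)|Placename(2)[...]|Placename(i)". What I want to do is grepl each row of the location variable to see whether it matches the place names vector and, if so, return "TRUE" in the isRelevant variable, otherwise "FALSE".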

To do this, I wrote some R code, which basically boils down to this line:

locations.df$isRelevant <- sapply(locations.df$locations, function(s) grepl(grep_places, s, ignore.case = TRUE)) 

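Here grep_places is the list of possible matches separated by the "|" character, which lets R know it can match any element of the vector. I am running this on a remote high-capacity computer with more than 2 TB of RAM, using RStudio (R 3.2.0), and I run it with 'pbsapply', which gives me a progress bar. It turns out this takes ridiculously long. So far it has completed about 45% (I started over a week ago), and it says it needs more than 270 additional hours to finish. That is obviously not a workable situation, since I will have to run similar code on much larger datasets in the future. Do you have any ideas on how I could get this done in a more acceptable time, perhaps a day or so (keeping the very powerful computer in mind)?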
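For readers following along, a combined pattern of the shape described above can be assembled with `paste()`. This is only a sketch: the three-name vector `places` is hypothetical and stands in for the much longer real list.

```r
# Hypothetical short list of place names; the real list is far longer.
places <- c("Acworth NH", "Albany NH", "Alexandria NH")

# Join the names with "|" so the regex matches any one of them.
grep_places <- paste(places, collapse = "|")
grep_places

# Check a single location string against the combined pattern.
single_hit <- grepl(grep_places, "albany nh usa", ignore.case = TRUE)
single_hit
```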

Edit

Here is some semi-simulated data showing roughly what I am working with:

print(grep_places) 
> grep_places 
"Acworth NH|Albany NH|Alexandria NH|Allenstown NH|Alstead NH|Alton NH|Amherst NH|Andover NH|Antrim NH|Ashland NH|Atkinson NH|Auburn NH|Barnstead NH|Barrington NH|Bartlett NH|Bath NH|Bedford NH|Belmont NH|Bennington NH|Benton NH|Berlin NH|Bethlehem NH|Boscawen NH|Bow NH|Bradford NH|Brentwood NH|Bridgewater NH|Bristol NH|*field NH|*line NH|Campton NH|Canaan NH|Candia NH|Canterbury NH|Carroll NH|CenterHarbor NH|Charlestown NH|Chatham NH|Chester NH|Chesterfield NH|Chichester NH|Claremont NH|Clarksville NH|Cole* NH|Columbia NH|Concord NH|Conway NH|Cornish NH|Croydon NH|Dalton NH|Danbury NH|Danville NH|Deerfield NH|Deering NH|Derry NH|Dorchester NH|Dover NH|Dublin NH|Dummer NH|Dunbarton NH|Durham NH|EastKingston NH|Easton NH|Eaton NH|Effingham NH|Ellsworth NH|Enfield NH|Epping NH|Epsom NH|Errol NH|Exeter NH|Farmington NH|Fitzwilliam NH|Francestown NH|Franconia NH|Franklin NH|Freedom NH|Fremont NH|Gilford NH|Gilmanton NH|Gilsum NH|Goffstown NH|Gorham NH|Goshen NH|Grafton NH|Grantham NH|Greenfield NH|Greenland NH|Greenville NH|Groton NH|Hampstead NH|Hampton NH|HamptonFalls NH|Hancock NH|Hanover NH|Harrisville NH|Hart'sLocation NH|Haverhill NH|Hebron NH|Henniker NH|Hill NH|Hillsborough NH|Hinsdale NH|Holderness NH|Hollis NH|Hooksett NH|Hopkinton NH|Hudson NH|Jackson NH|Jaffrey NH|Jefferson NH|Keene NH|Kensington NH|Kingston NH|Laconia NH|Lancaster NH|Landaff NH|Langdon NH|Lebanon NH|Lee NH|Lempster NH|Lincoln NH|Lisbon NH|Litchfield NH|Littleton NH|Londonderry NH|Loudon NH|Lyman NH|Lyme NH|Lyndeborough NH|Madbury NH|Madison NH|Manchester NH|Marlborough NH|Marlow NH|Mason NH|Meredith NH|Merrimack NH|Middleton NH|Milan NH|Milford NH|Milton NH|Monroe NH|MontVernon NH|Moultonborough NH|Nashua NH|Nelson NH|NewBoston NH|NewCastle NH|NewDurham NH|NewHampton NH|NewIpswich NH|NewLondon NH|Newbury NH|Newfields NH|Newington NH|Newmarket NH|Newport NH|Newton NH|NorthHampton NH|Northfield NH|Northumberland NH|Northwood NH|Nottingham NH|Orange NH|Orford NH|Ossipee NH|Pelham NH|Pembroke NH|Peterborough NH|Piermont NH|Pittsburg NH|Pittsfield NH|Plainfield NH|Plaistow NH|Plymouth NH|Portsmouth NH|Randolph NH|Raymond NH|Richmond NH|Rindge NH|Rochester NH|Rollinsford NH|Roxbury NH|Rumney NH|Rye NH|Salem NH|Salisbury NH|Sanbornton NH|Sandown NH|Sandwich NH|Sea* NH|Sharon NH|Shelburne NH" 


head(location.df, n=20) 
>      location isRelevant 
1      London   NA 
2  Orleans village VT USA   NA 
3     The World   NA 
4    D M V Towson   NA 
5 Playa del Sol Solidaridad   NA 
6 Beautiful Downtown Burbank   NA 
7      <NA>   NA 
8       US   NA 
9    Gaithersburg Md   NA 
10      <NA>   NA 
11    California   NA 
12      Indy   NA 
13     Florida   NA 
14    exsnaveen com   NA 
15     Houston TX   NA 
16     Tweaking   NA 
17    Phoenix AZ   NA 
18    Malibu Ca USA   NA 
19   Hermosa Beach CA   NA 
20    California USA   NA 

Thanks in advance, everyone; I would sincerely appreciate any help with this.

+1

This is a reasonable question as it stands, but it would be better if you provided a little (simulated) data as a [reproducible example](http://*.com/questions/5963269/how-to-make-a-great-r-reproducible-example)... –

+0

Hi Ben. Sorry about that. I have now added some data. Cheers! – nikUoM

+0

You might have better luck with some of the functions in the 'stringi' package, which tends to outperform other regex functions. – nrussell
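A minimal sketch of that suggestion, assuming the stringi package is installed. The short `locations` vector is made up for illustration, and whether this is actually faster on the real 7 million rows has not been measured here.

```r
# Assumes stringi is available: install.packages("stringi")
library(stringi)

# Made-up sample of location strings, including a missing value.
locations <- c("London", "Gaithersburg Md", NA, "phoenix az", "Houston TX")
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Vectorized over the whole character vector; no sapply needed.
is_relevant <- stri_detect_regex(locations, grep_places, case_insensitive = TRUE)

# stri_detect_regex returns NA for missing input (grepl returns FALSE),
# so map NA to FALSE if a strict TRUE/FALSE column is wanted.
is_relevant[is.na(is_relevant)] <- FALSE
is_relevant
```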

grepl is a vectorized function, so there should be no need to wrap it in a loop. Have you tried:

#dput(location.df)  
location.df<-structure(list(location = structure(c(12L, 14L, 17L, 5L, 16L, 
      2L, 1L, 19L, 8L, 1L, 3L, 11L, 7L, 6L, 10L, 18L, 15L, 13L, 9L, 
     4L), .Label = c("<NA>", "Beautiful Downtown Burbank", "California", 
      "California USA", "D M V Towson", "exsnaveen com", "Florida", 
      "Gaithersburg Md", "Hermosa Beach CA", "Houston TX", "Indy", 
      "London", "Malibu Ca USA", "Orleans village VT USA", "Phoenix AZ", 
      "Playa del Sol Solidaridad", "The World", "Tweaking", "US"), class = "factor"), 
      isRelevant = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("location", 
      "isRelevant"), row.names = c(NA, -20L), class = "data.frame") 

#grep_places with places in the test data 
grep_places<-"Gaithersburg Md|Phoenix AZ" 

location.df$isRelevant[grepl(grep_places, location.df$location, ignore.case = TRUE)]<-TRUE 

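Or a slightly faster implementation, following David Arenburg's comment (this version also sets non-matching rows to FALSE instead of leaving them NA):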

location.df$isRelevant <- grepl(grep_places, location.df$location, ignore.case = TRUE)
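If many rows share the same location string (an assumption about the data that the question does not confirm), one further option is to run the regex once per distinct value and map the result back to every row. The sketch below uses simulated data; `pool`, `uniq`, and `hit_uniq` are illustrative names, not part of the original code.

```r
# Simulated stand-in for the location column (hypothetical data, not the real tweets).
set.seed(1)
pool <- c("London", "Gaithersburg Md", "Phoenix AZ", "phoenix az", "Houston TX", NA)
locations <- sample(pool, 10000, replace = TRUE)
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Match each distinct value once, then map the result back to every row.
uniq <- unique(locations)
hit_uniq <- grepl(grep_places, uniq, ignore.case = TRUE)
is_relevant <- hit_uniq[match(locations, uniq)]

# Compare against the fully vectorized call on all rows.
direct <- grepl(grep_places, locations, ignore.case = TRUE)
identical(is_relevant, direct)
```

The saving depends entirely on how repetitive the column is: with mostly unique strings, the plain vectorized `grepl` call above is the simpler choice.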