Scraping data from GroupGrid

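Question:

I'd like to scrape and analyze the input to a when2meet grid.

Here's an example: http://www.when2meet.com/?4474391-IBuBA

The grid gives a quick visual overview of the group members' availability; I'd like to pull this into R to do some analysis, but I haven't gotten very far.

Not far at all, actually; I've only extracted the main page element. The output is (to me) gibberish: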

library(rvest)

url <- "http://www.when2meet.com/?4474391-IBuBA"

# read_html() is used here because html() is deprecated in rvest
grid <- read_html(url) %>% html_nodes(xpath = '//*[@id="GroupGrid"]')

grid looks like this:

<div style="font-size:0px;vertical-align:top;"><div id="GroupTime279816300" onmouseover="ShowSlot(279816300);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[ 
Col[TimeOfSlot.indexOf(279816300)] = 0; 
Row[TimeOfSlot.indexOf(279816300)] = 23; 
]]></script></div> 
<div id="GroupTime279902700" onmouseover="ShowSlot(279902700);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #8ac56d;"><script><![CDATA[ 
Col[TimeOfSlot.indexOf(279902700)] = 1; 
Row[TimeOfSlot.indexOf(279902700)] = 23; 
]]></script></div> 
<div id="GroupTime279989100" onmouseover="ShowSlot(279989100);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[ 
Col[TimeOfSlot.indexOf(279989100)] = 2; 
Row[TimeOfSlot.indexOf(279989100)] = 23; 
]]></script>

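I can make basically nothing of what I'm seeing here; it may as well be Urdu. And I couldn't find anything on Google or SO about scraping a GroupGrid table.

Does anyone have any ideas how to proceed?

Ideally, I'd end up with output in the form of a data.table (data.frame, if I must):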

output 
#  id slot available 
# 1: user_1 M 9:00  TRUE 
# 2: user_1 T 9:30  FALSE 
# 3: user_1 W 10:00  TRUE 
# 4: user_1 R 10:30  TRUE 
# 5: user_2 M 9:00  TRUE 
# 6: user_2 T 9:30  FALSE 
# 7: user_2 W 10:00  TRUE 
# 8: user_2 R 10:30  FALSE

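(The exact format of the slot column doesn't matter, nor does it need to be a single column; it could, if that's easier, be split into day and time.)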

Answer:

You could do something like this:

library(rvest)
library(data.table)

# Grab the inline <script> that defines PeopleNames, TimeOfSlot, etc.
# (read_html() is used here because html() is deprecated in rvest)
script <- read_html("http://www.when2meet.com/?4474391-IBuBA") %>%
    html_nodes("script:contains('PeopleNames')") %>% html_text()

# Extract every match of `regex` and return its capture groups as columns
f <- function(regex) {
    m <- regmatches(script, gregexpr(regex, script))[[1]]
    # faster than transposing with `t`
    setDT(transpose(lapply(regmatches(m, regexec(regex, m)), "[", -1)))[]
}
slots <- f("TimeOfSlot\\[(\\d+)\\]=(\\d+);")
users <- f("PeopleNames\\[(\\d+)\\] = '([^']+)';PeopleIDs\\[\\d+\\] = (\\d+);")
avails <- f("AvailableAtSlot\\[(\\d+)\\]\\.push\\((\\d+)\\);")

# One row per user/slot pair; pairs with no push() call become FALSE
DT <- melt(dcast(avails, V2 ~ V1,
                 fun.aggregate = function(x) length(x) > 0,
                 value.var = "V2"),
           id.vars = "V2",
           variable.name = "timeslot", value.name = "available")

# Attach user names and human-readable slot times
DT[users, id := i.V2, on = c(V2 = "V3")]
DT[slots, time := format(as.POSIXct(as.integer(i.V2), origin = "1970-01-01",
                                    tz = "GMT"), "%a %H:%M"),
   on = c(timeslot = "V1")]

DT[ , c("V2", "timeslot") := NULL]

DT[time == "Mon 11:00" & available] 
# available  id  time 
# 1:  TRUE user_1 Mon 11:00 
# 2:  TRUE user_2 Mon 11:00 
# 3:  TRUE user_3 Mon 11:00 
# 4:  TRUE user_4 Mon 11:00 
# 5:  TRUE user_5 Mon 11:00 
# 6:  TRUE user_7 Mon 11:00 
# 7:  TRUE user_10 Mon 11:00 

DT[time == "Mon 11:00" & !available] 
# available  id  time 
# 1:  FALSE user_6 Mon 11:00 
# 2:  FALSE user_8 Mon 11:00 
# 3:  FALSE user_9 Mon 11:00
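
The dcast/melt step is what turns "only the available pairs are listed" into an explicit TRUE/FALSE for every user and slot. A minimal sketch on made-up data (the three rows of `avails` below are an assumption for illustration, not values from the page):

```r
library(data.table)

# Toy stand-in for `avails`: one row per (slot index, user ID) pair that
# was pushed as available. These values are invented for illustration.
avails <- data.table(V1 = c("0", "0", "1"), V2 = c("11", "22", "11"))

# Wide: one row per user, one column per slot. Pairs that never occur
# are filled by calling the aggregate on a zero-length vector, i.e. FALSE.
wide <- dcast(avails, V2 ~ V1,
              fun.aggregate = function(x) length(x) > 0,
              value.var = "V2")

# Long again, now with an explicit row for every user/slot combination.
long <- melt(wide, id.vars = "V2",
             variable.name = "timeslot", value.name = "available")
print(long)
```

User "22" never appears with slot "1" in the input, so that combination comes out as available = FALSE rather than being absent.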

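Thanks for the improved screenshot. I won't have a chance to confirm this works until tomorrow (it looks great!). How did you know the RHS of 'TimeOfSlot' could be converted to 'POSIXct'? Did you just guess and verify the origin, time zone, etc.? – MichaelChirico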

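All the data is in the page's JavaScript; I searched for "user_1" and there it was. Yes, converting the timestamps to date-time objects was trial and error (by default the result was in CET, an hour ahead, so I set GMT explicitly; the origin is a common one, so it was my first guess). The screenshot shows that the result for "Mon, 11 am" matches the output of the `subset`. :-) – lukeA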
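
The conversion discussed here can be checked on one of the slot IDs visible in the question's GroupGrid markup. This is only a sketch, and it assumes (as the answer's code does) that the number is seconds since 1970-01-01:

```r
# Slot ID taken from the GroupGrid markup in the question; treating it as
# seconds since 1970-01-01 is an assumption carried over from the answer.
ts <- 279816300

slot <- as.POSIXct(ts, origin = "1970-01-01", tz = "GMT")

# The weekday abbreviation depends on the locale; the time part is 14:45.
format(slot, "%a %H:%M")

# Formatting the same instant in CET shifts it by one hour (15:45).
format(slot, "%a %H:%M", tz = "CET")
```

The next two IDs in the markup (279902700 and 279989100) are each 86400 seconds later, which is consistent with the same time of day on consecutive days.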

Awesome! The 'regexec' part is especially neat. Thanks. I edited it to be more in my style (and, I believe, faster) – MichaelChirico
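
For readers unfamiliar with the gregexpr/regexec combination mentioned here, a small sketch of what the helper `f` does, run on a made-up string shaped like the page's JavaScript (the string itself is an assumption, not copied from the page):

```r
library(data.table)

# Invented snippet in the shape the answer's TimeOfSlot regex expects.
script <- "TimeOfSlot[0]=279816300;TimeOfSlot[1]=279902700;"
regex <- "TimeOfSlot\\[(\\d+)\\]=(\\d+);"

# gregexpr() locates every full match in the string...
m <- regmatches(script, gregexpr(regex, script))[[1]]

# ...and regexec() re-matches each one to expose its capture groups.
# "[", -1 drops the first element (the full match) and keeps the groups.
groups <- lapply(regmatches(m, regexec(regex, m)), "[", -1)

# transpose() turns the list of rows into a list of columns for setDT().
slots <- setDT(transpose(groups))[]
print(slots)
```

Each capture group ends up as one character column (V1, V2, ...), which is why the answer wraps the timestamp in as.integer() before converting it.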
