Scraping with R: how to extract a JavaScript var
Question:
require(httr)
require(XML)
basePage <- "http://bet.hkjc.com/"
h <- handle(basePage)
GET(handle = h)
res <- GET(handle = h, path = "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2")
resXML <- htmlParse(content(res, as = "text"))
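I'm using the code above to scrape an .aspx page. It returns a large block of text, but I only want "var infoDivideByRace" and "var scratchList". How can I extract these two variables and convert them into column data? Thanks! Part of the returned output is shown below: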
var poolSellStatus = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@');
var poolSellStatus_bak = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@');
var winOddsByRace = '[email protected]@@@@@WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0;5=29=0;6=9.4=0;7=4.6=0;8=11=0;9=52=0;10=82=0;11=52=0;12=8.6=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0;4=3.5=0;5=6.2=0;6=2.6=0;7=2.0=0;8=4.2=0;9=7.9=0;10=11=0;11=8.4=0;12=2.5=0'.split('@@@');
var multiRacePoolsStr = '@@@DBL#;1,2;2,3;3,4;4,5;5,6;6,7;7,[email protected]@@TBL#;6,7,[email protected]@@D-T#;3,4;6,[email protected]@@T-T#;4,5,[email protected]@@6UP#;3,4,5,6,7,8';
var fieldSize = 'HV;12;12;12;12;12;12;12;12';
var fieldSizeWithReserve = 'HV;12;12;12;12;12;12;12;12';
var reserveList = 'HV';
var scratchList = 'HV';
Answer
The simplest, or most appropriate, approach is to use PhantomJS or Selenium. Failing that, regex together with rvest works as a workaround.
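library(rvest)
library(stringr)  # provides str_extract(), str_replace_all() and regex()
basePage <- "http://bet.hkjc.com/"
# path has to be defined before it is pasted onto basePage
path <- "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2"
ss <- paste0(basePage, path)
scripts <- read_html(ss, encoding = 'utf8') %>%
  html_nodes("script") %>% html_text(trim = TRUE)
# keep only the script blocks that mention either variable
new <- scripts[grepl('var scratchList =|var infoDivideByRace = ', scripts)]
# take the 4th space-separated token (e.g. "'HV';") and strip quotes and semicolons;
# note that a value containing spaces would be cut off at the first space
value1 <- str_replace_all(strsplit(str_extract(new, regex('var scratchList = (.*?);')), split = ' ')[[1]][4], ";|'", '')
value2 <- str_replace_all(strsplit(str_extract(new, regex('var infoDivideByRace = (.*?);')), split = ' ')[[1]][4], ";|'", '')
value1
#[1] "HV"
value2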
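The regex step can be tried offline without hitting the site. Below is a minimal sketch in base R, assuming each variable is assigned as `var name = '...';` with no escaped quotes inside the value, as in the question's excerpt. `extract_js_var` is a hypothetical helper written for this illustration, not a function from rvest or stringr.

```r
# Hypothetical helper: return the single-quoted string assigned to a JavaScript var.
# Assumes the form  var <name> = '<value>'  with no escaped quotes in the value.
extract_js_var <- function(script_text, name) {
  pattern <- paste0("var ", name, " = '([^']*)'")
  m <- regmatches(script_text, regexec(pattern, script_text))
  # A successful match has two pieces: the full match and the capture group.
  hits <- Filter(function(x) length(x) == 2, m)
  if (length(hits) == 0) return(NA_character_)
  hits[[1]][2]
}

# Sample lines copied from the question's excerpt.
sample_script <- c(
  "var fieldSize = 'HV;12;12;12;12;12;12;12;12';",
  "var reserveList = 'HV';",
  "var scratchList = 'HV';"
)

extract_js_var(sample_script, "scratchList")
# [1] "HV"
extract_js_var(sample_script, "fieldSize")
# [1] "HV;12;12;12;12;12;12;12;12"
```

Capturing everything between the quotes avoids the reliance on the value being the fourth space-separated token, so values that contain spaces are not truncated.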
Using the V8 package
Answer
An alternative option:
library(rvest)
library(stringi)
library(purrr)
library(V8)
Fetch the page you specified, which contains your target variables:
pg <- read_html("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2", encoding = "UTF-8")
Extract the script tag, convert it to text, split it into a character vector of lines, and keep only the var lines:
html_nodes(pg, xpath=".//script[contains(., 'infoDivideByRace')]") %>%
html_text() %>%
stri_split_lines() %>%
flatten_chr() %>%
keep(stri_detect_regex, "^var") -> script_txt
Initialize the V8 JavaScript engine:
ctx <- v8()
Have it parse the JavaScript and create the data:
ctx$eval(script_txt)
Retrieve the data (infoDivideByRace has 2 blank array elements, so we ignore them):
grep("^$", ctx$get('infoDivideByRace'), value=TRUE, invert=TRUE)
## [1] *'S SPAM PROTECTION WON'T LET ME PASTE THIS CONTENT
ctx$get('scratchList')
## [1] "HV"
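To get from the retrieved strings to column data, the delimiters can be split in base R. The following is a rough sketch using the WIN fragment of `winOddsByRace` from the question's excerpt. The column names `horse`, `odds`, and `flag` are guesses at what the `=`-separated fields mean, and `infoDivideByRace` may well be laid out differently.

```r
# WIN portion of winOddsByRace as shown in the question (the text before "#PLA").
win_str <- "WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0;5=29=0;6=9.4=0;7=4.6=0;8=11=0;9=52=0;10=82=0;11=52=0;12=8.6=0"

# Split on ';' and drop the leading pool label ("WIN").
entries <- strsplit(win_str, ";", fixed = TRUE)[[1]][-1]

# Each entry looks like "1=3.6=1"; split on '=' and bind the pieces as rows.
parts <- do.call(rbind, strsplit(entries, "=", fixed = TRUE))

# Column names are assumptions; the site does not document these fields.
win_odds <- data.frame(
  horse = as.integer(parts[, 1]),
  odds  = as.numeric(parts[, 2]),
  flag  = as.integer(parts[, 3])
)

head(win_odds, 3)
```

The same split-then-bind pattern should carry over to the other semicolon-delimited variables once their field layout is known.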
The above doesn't work... it returns: Error in flatten_chr(.) : could not find function "flatten_chr" –
I forgot `library(purrr)` (I've added it to the post) – hrbrmstr