Scraping with R: how to extract a JavaScript var

Problem description:

require(httr) 
require(XML) 
basePage <- "http://bet.hkjc.com/" 
h <- handle(basePage) 
GET(handle = h) 
res <- GET(handle = h, path = "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2") 
resXML <- htmlParse(content(res, as = "text"))

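I use the code above to scrape an .aspx page. It returns a large amount of text. However, I only want "var infoDivideByRace" and "var scratchList". How can I extract these two variables and convert them into column data? Thanks! Part of the returned output is shown below: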

var poolSellStatus = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var poolSellStatus_bak = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var winOddsByRace = '[email protected]@@@@@WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0;5=29=0;6=9.4=0;7=4.6=0;8=11=0;9=52=0;10=82=0;11=52=0;12=8.6=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0;4=3.5=0;5=6.2=0;6=2.6=0;7=2.0=0;8=4.2=0;9=7.9=0;10=11=0;11=8.4=0;12=2.5=0'.split('@@@'); 
var multiRacePoolsStr = '@@@DBL#;1,2;2,3;3,4;4,5;5,6;6,7;7,[email protected]@@TBL#;6,7,[email protected]@@D-T#;3,4;6,[email protected]@@T-T#;4,5,[email protected]@@6UP#;3,4,5,6,7,8'; 
var fieldSize = 'HV;12;12;12;12;12;12;12;12'; 
var fieldSizeWithReserve = 'HV;12;12;12;12;12;12;12;12'; 
var reserveList = 'HV'; 
var scratchList = 'HV';

Answer

The simplest or most appropriate approach would be PhantomJS or Selenium. If those are not an option, regex plus rvest works as a workaround.

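library(rvest) 
library(stringr)  # provides str_extract() and str_replace_all() used below

basePage <- "http://bet.hkjc.com/" 

# path must be defined before it is pasted onto the base URL
path <- "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2" 

ss <- paste0(basePage, path) 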

scripts <- read_html(ss, encoding = 'utf8') %>% 
    html_nodes("script") %>% html_text(trim=TRUE) 

new <- scripts[grepl('var scratchList =|var infoDivideByRace = ',scripts)] 

value1 <- str_replace_all(strsplit(str_extract(new,regex('var scratchList = (.*?);')), split=' ')[[1]][4],";|'",'')  
value2 <- str_replace_all(strsplit(str_extract(new,regex('var infoDivideByRace = (.*?);')),split=' ')[[1]][4],";|'",'') 

value1 
#[1] "HV" 

value2
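
The regex step above can also be sketched in base R without touching the network. This is a minimal illustration, not the answerer's code: `extract_js_var` is a hypothetical helper name, and `sample_script` is a short excerpt copied from the output in the question. It only handles values written as a single-quoted literal; it does not evaluate trailing calls such as `.split('@@@')`.

```r
# Hypothetical helper: return the single-quoted value assigned to a JavaScript var,
# or NA if the variable is not present in the text.
extract_js_var <- function(script_text, name) {
  pattern <- paste0("var[[:space:]]+", name, "[[:space:]]*=[[:space:]]*'([^']*)'")
  m <- regmatches(script_text, regexec(pattern, script_text))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Excerpt of the page's script text, taken from the question
sample_script <- paste(
  "var fieldSize = 'HV;12;12;12;12;12;12;12;12';",
  "var fieldSizeWithReserve = 'HV;12;12;12;12;12;12;12;12';",
  "var reserveList = 'HV';",
  "var scratchList = 'HV';",
  sep = "\n"
)

scratch <- extract_js_var(sample_script, "scratchList")
field   <- extract_js_var(sample_script, "fieldSize")
absent  <- extract_js_var(sample_script, "infoDivideByRace")  # not in the excerpt
```

Because the pattern requires `=` directly after the name (ignoring whitespace), asking for `fieldSize` does not accidentally pick up `fieldSizeWithReserve`.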

Using the V8 package

Answer

An alternative option:

library(rvest) 
library(stringi) 
library(purrr) 
library(V8)

Fetch the page you specified, which contains your target variables:

pg <- read_html("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2", encoding = "UTF-8")

Extract the script tag, convert it to text, split it into a character vector of lines, and keep only the lines that start with var:

html_nodes(pg, xpath=".//script[contains(., 'infoDivideByRace')]") %>% 
    html_text() %>% 
    stri_split_lines() %>% 
    flatten_chr() %>% 
    keep(stri_detect_regex, "^var") -> script_txt

Initialize the V8 JavaScript engine:

ctx <- v8()

Have it parse the JavaScript and create the data:

ctx$eval(script_txt)

Retrieve the data (infoDivideByRace has 2 blank array elements, so we drop them):

grep("^$", ctx$get('infoDivideByRace'), value=TRUE, invert=TRUE) 
## [1] *'S SPAM PROTECTION WON'T LET ME PASTE THIS CONTENT 
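
For readers who want to see the evaluate-then-filter step in isolation, here is a small offline sketch. It assumes the V8 package is installed; the variable name `demo` and its contents are made up for illustration and are not taken from the page. A string that starts with the separator yields an empty first element after `.split('@@@')`, which is one way blank elements like the ones mentioned above can arise.

```r
library(V8)

ctx_demo <- v8()

# Made-up JavaScript line in the same style as the page's var statements
ctx_demo$eval("var demo = '@@@WIN@@@PLA'.split('@@@');")

# V8 hands the JavaScript array back as an R character vector
parts <- ctx_demo$get("demo")

# Drop empty strings, as done for infoDivideByRace
non_blank <- grep("^$", parts, value = TRUE, invert = TRUE)
```

`parts[nzchar(parts)]` would be an equivalent base R way to drop the empty elements.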

ctx$get('scratchList') 
## [1] "HV"
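
The question also asks how to turn the extracted strings into column data. The sketch below shows one possible way with base R, using shortened values copied from the output in the question. The helper name `parse_pool`, the column names, and the reading of `fieldSize` as "venue code followed by one entry per race" are all assumptions, since the page does not document what the fields mean.

```r
# Hypothetical helper: turn one pool segment such as "WIN;1=3.6=1;2=4.7=0"
# into a data frame. Column names are guesses, not documented by the site.
parse_pool <- function(segment) {
  parts  <- strsplit(segment, ";", fixed = TRUE)[[1]]
  fields <- do.call(rbind, strsplit(parts[-1], "=", fixed = TRUE))
  data.frame(
    pool  = parts[1],
    horse = as.integer(fields[, 1]),
    odds  = as.numeric(fields[, 2]),
    flag  = as.integer(fields[, 3]),
    stringsAsFactors = FALSE
  )
}

# First four entries of the WIN segment shown in the question
win_odds <- parse_pool("WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0")

# fieldSize from the question: assumed to be the venue, then one value per race
field_size <- strsplit("HV;12;12;12;12;12;12;12;12", ";", fixed = TRUE)[[1]]
field_df <- data.frame(
  venue   = field_size[1],
  race    = seq_along(field_size[-1]),
  runners = as.integer(field_size[-1]),
  stringsAsFactors = FALSE
)
```

The same split-then-bind pattern should carry over to other variables once their separators are known; for a full `winOddsByRace` value, splitting on `#` first would give one segment per pool.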

The above does not work... It returns: Error in flatten_chr(.): could not find function "flatten_chr"

I forgot `library(purrr)` (I have now added it to the post) – hrbrmstr
