Scraping a data table that does not exist in the page source

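Question:

I inspected the source of this page, and the table is not present in the page source.

I then watched the network activity while refreshing the site, and it looks like the data table is fetched by sending a POST request to this URL:

http://datacenter.mep.gov.cn:8099/ths-report/report!list.action

I then tried sending a POST request myself, but got nothing useful back, only a status 500.

Is there any way to scrape this table using R?

Thanks.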

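Answer

Nice sleuthing!

For me, the page is making a GET request rather than a POST. The following seems to do the trick. It also tries to pick out the appropriate report URLs for you: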

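library(httr)
library(rvest)
library(stringi)

# Fetch the menu page that lists the available reports
pg <- read_html("http://datacenter.mep.gov.cn/index!MenuAction.action?name=259206fe260c4cf7882462520e1e3ada")

# Take the onclick attribute of every <div> that has one, then strip the
# leading load(" and everything from "," onward, leaving the bare report URL
html_nodes(pg, "div[onclick]") %>%
    html_attr("onclick") %>%
    stri_replace_first_fixed('load("', "") %>%
    stri_replace_last_regex('",".*$', "") -> report_urls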

head(report_urls) 
## [1] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462849093743" 
## [2] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462764947052" 
## [3] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1465594312346" 
## [4] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462844293531" 
## [5] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462844935563" 
## [6] "http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1462845592195" 

# Fetch the first report with a plain GET and take the second table on the page
rpt_pg <- read_html(report_urls[1])
html_table(rpt_pg)[[2]]
# SO won't let me paste the table
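
The URL-extraction step above can be tried offline. Here is a minimal sketch, assuming each `onclick` value has the form `load("<url>","<title>")`. That shape is only inferred from the two replacement calls, and the `example.com` markup below is made up for illustration, not taken from the real site.

```r
library(rvest)
library(stringi)

# Hypothetical markup imitating the menu page; the real attributes may differ.
sample_html <- '<html><body>
  <div onclick=\'load("http://example.com/report!list.action?xmlname=111","Report A")\'>A</div>
  <div onclick=\'load("http://example.com/report!list.action?xmlname=222","Report B")\'>B</div>
  <div>no handler here</div>
</body></html>'

# read_html() also accepts a literal HTML string, so no network access is needed
pg <- read_html(sample_html)

# Keep only the <div> elements that carry an onclick attribute
onclick <- html_attr(html_nodes(pg, "div[onclick]"), "onclick")

# Drop the leading load(" and then everything from "," to the end of the string
urls <- stri_replace_first_fixed(onclick, 'load("', "")
urls <- stri_replace_last_regex(urls, '",".*$', "")

print(urls)
```

The `div` without an `onclick` attribute is skipped by the `div[onclick]` selector, so only the two bare URLs should remain. If the real attribute has a different shape, the two replacement patterns would need adjusting.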
