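Information missing when retrieving data

Problem description:

I want to use R to scrape all the news related to AlphaGo on XXX (title, URL and text). The page URL is http://www.xxxxxx.com/search/?q=AlphaGo. Here is my code: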

library(RCurl)    # getCurlHandle(), getURL(), debugGatherer()
library(XML)      # htmlParse(), xpathSApply()
library(stringr)  # str_c()

url <- "http://www.xxxxxx.com/search/?q=AlphaGo"
info <- debugGatherer()
handle <- getCurlHandle(cookiejar = "",
                        # follow HTTP redirects
                        followlocation = TRUE,
                        autoreferer = TRUE,
                        debugfunc = info$update,
                        verbose = TRUE,
                        httpheader = list(
                          from = "[email protected]",
                          'user-agent' = str_c(R.version$version.string,
                                               ",", R.version$platform)
                        ))
html <- getURL(url, curl = handle, header = TRUE)
parsedpage <- htmlParse(html)

However, when I run

xpathSApply(parsedpage,"//h3//a",xmlGetAttr,"href") 

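to check whether I had located the target nodes, I found that all the relevant news content was missing. After pressing F12 (I use Chrome), I saw that the DOM elements contain the information I want, while the page source contains none of it (the source is very messy, with all the elements piled together). So I changed my code to: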

parsed_page <- htmlTreeParse(file = url, asTree = TRUE)

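hoping to get the DOM tree. However, information was missing again. This time, what was missing was everything that appears collapsed in the DOM elements view (I have never run into this before).

Any idea why this happens, and how I can fix it?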

What is your desired output? A list of URLs, or the text of each page? –

Both of them. Is there something wrong with my code? – exteralvictor

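You are violating item 3 of the CNN ToC. Please make sure you tell others when you are asking them to help you with unethical behavior that could expose them to fines or jail time. – hrbrmstr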

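Using the idea @Colin provided, I tried to continue from my original code. I read the dynamic content from the JSON file with the RJSONIO package, as follows: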

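library(RJSONIO)

url <- "https://search.xxxxxx.io/content?q=AlphaGo"
content <- fromJSON(url)
content1 <- content$result
# One row per result: source, publish date, headline, body, URL
content_result <- matrix(NA_character_, length(content1), 5)
for (i in seq_along(content1)) {
    item <- content1[[i]]
    # A missing headline is NULL, which c() would silently drop
    headline <- if (is.null(item$headline)) NA_character_ else item$headline
    content_result[i, ] <- c("CNN", item$firstPublishDate, headline,
                             item$body, item$url)
}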

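The problem does not come from your code. The results page is generated dynamically, so the links and text are not available in the plain HTML of the results page (as you can see by viewing the page source).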
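To make the point concrete, here is a minimal sketch of why a static parse comes back empty. The HTML string is a hypothetical stand-in for the search page (not its real markup): the server sends an empty container and a script fills it in once a browser loads the page.

```r
library(rvest)

# Hypothetical stand-in for a search results page: the server sends an empty
# container, and a script adds the result links after the page loads.
static_html <- '<html><body>
  <div id="results"></div>
  <script>
    var h3 = document.createElement("h3");
    var a = document.createElement("a");
    a.href = "/story-1.html";
    a.textContent = "Story 1";
    h3.appendChild(a);
    document.getElementById("results").appendChild(h3);
  </script>
</body></html>'

page <- read_html(static_html)

# read_html() only parses the markup; it never runs the script, so the
# <h3><a> nodes a browser shows in its DOM inspector do not exist here.
links <- page %>%
  html_nodes(xpath = "//h3//a") %>%
  html_attr("href")

# The empty container itself is present in the static HTML.
container <- html_nodes(page, xpath = '//*[@id="results"]')

length(links)
length(container)
```

The XPath from the question matches nothing even though the container is there, which is the same symptom: the browser's DOM view shows nodes that were never in the downloaded HTML.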

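There are only 10 results, so I suggest you build the list of URLs by hand.

I don't know the packages you are using in this code, but I suggest you try rvest, which looks simpler to use than the ones you have.

For example: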

url <- "http://money.cnn.com/2017/05/25/technology/alphago-china-ai/index.html" 

library(rvest) 
library(tidyverse) 

url %>% 
    read_html() %>% 
    html_nodes(xpath = '//*[@id="storytext"]/p') %>% 
    html_text() 

[1] " A computer system that Google engineers trained to play the game Go beat the world's best human player Thursday in China. The victory was AlphaGo's second this week over Chinese professional Ke Jie, clinching the best-of-three series at the Future of Go Summit in Wuzhen. "         
[2] " Afterward, Google engineers said AlphaGo estimated that the first 50 moves -- by both players -- were virtually perfect. And the first 100 moves were the best anyone had ever played against AlphaGo's master version. "                       
[3] " Related: Google's man-versus-machine showdown is blocked in China "                                                             
[4] " \"What an amazing and complex game! Ke Jie pushed AlphaGo right to the limit,\" said DeepMind CEO Demis Hassabis on Twitter. DeepMind is a British artificial intelligence company that developed AlphaGo and was purchased by Google in 2014. "                  
[5] " DeepMind made a stir in January 2016 when it first announced it had used artificial intelligence to master Go, a 2,500-year-old game. Computer scientists had struggled for years to get computers to excel at the game. "                       
[6] " In Go, two players alternate placing white and black stones on a grid. The goal is to claim the most territory. To do so, you surround your opponent's pieces so that they're removed from the board. "                            
[7] " The board's 19-by-19 grid is so vast that it allows a near infinite combination of moves, making it tough for machines to comprehend. Games such as chess have come quicker to machines. "                               
[8] " Related: Elon Musk's new plan to save humanity from AI "                                                                
[9] " The Google engineers at DeepMind rely on deep learning, a trendy form of artificial intelligence that's driving remarkable gains in what computers are capable of. World-changing technologies that loom on the horizon, such as autonomous vehicles, rely on deep learning to effectively see and drive on roads. " 
[10] " AlphaGo's achievement is also a reminder of the steady improvement of machines' ability to complete tasks once reserved for humans. As machines get smarter, there are concerns about how society will be disrupted, and if all humans will be able to find work. "             
[11] " Historically, mankind's development of tools has always created new jobs that never existed before. But the gains in artificial intelligence are coming at a breakneck pace, which will likely accentuate upheaval in the short term. "                    
[12] " Related: Google uses AI to help diagnose breast cancer "                                                                
[13] " The 19-year-old Ke and AlphaGo will play a third match Saturday morning. The summit will also feature a match Friday in which five human players will team up against AlphaGo. "  
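With a hand-made list of URLs, the same pipeline can be wrapped in a function and applied to each page. This is only a sketch: `extract_story_text` is a name made up for illustration, and the inline HTML strings are hypothetical placeholders so it runs offline. It also assumes every article keeps its text under `id="storytext"`, which may not hold for all pages.

```r
library(rvest)

# Pull the article paragraphs out of one page. `page` is anything
# read_html() accepts: a URL, a file path, or a literal HTML string.
extract_story_text <- function(page) {
  page %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="storytext"]/p') %>%
    html_text() %>%
    trimws()
}

# Hypothetical inline pages; in practice this would be the vector of
# article URLs collected by hand.
pages <- c(
  first  = '<html><body><div id="storytext"><p> One. </p><p> Two. </p></div></body></html>',
  second = '<html><body><div id="storytext"><p> Three. </p></div></body></html>'
)

# lapply() keeps the names, so each element holds one page's paragraphs.
stories <- lapply(pages, extract_story_text)
lengths(stories)
```

Replacing `pages` with the real URLs gives one character vector of paragraphs per article.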

Best,

Colin

Thanks, it works. – exteralvictor

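I thought carefully about your approach. It seems this would be a huge project if we started from the page I provided, even with 'rvest', because all you are doing here is parsing the HTML of each news page, which is the simple part. What if we need to scrape the URLs rather than build the list ourselves? – exteralvictor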