Parsing an RSS feed in R using the XML package

Question:

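I would like to scrape and parse the following RSS feed: http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml. I have looked at other questions about R and XML but have not been able to make any progress on my problem. The XML for each entry looks like this: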

 <item> 
    <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title> 
    <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link> 
    <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description> 
    <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid> 
    <pubDate>2012-11-15T12:56:09-05:00</pubDate> 
    <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source> 
    </item> 

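For each entry/post I would like to record the date (pubDate), the title (title) and the description (full text, cleaned). I have tried to use the XML package in R, but I admit I am a bit of a newbie (little to no experience with XML, but some R experience). The code I have been working from, and getting nowhere with, is: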

library(XML) 

xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml" 

# Use the xmlTreeParse function to parse the XML file directly from the web 

xmlfile <- xmlTreeParse(xml.url) 

# Use the xmlRoot-function to access the top node 

xmltop = xmlRoot(xmlfile) 

xmlName(xmltop) 

names(xmltop[[ 1 ]]) 

        title          link   description      language     copyright 
      "title"        "link" "description"    "language"   "copyright" 
     category     generator          docs          item          item 
   "category"   "generator"        "docs"        "item"        "item" 

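However, whenever I try to manipulate the "title" or "description" information, I keep getting errors. Any help fixing this code would be greatly appreciated.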

Thanks, Thomas

I use the excellent RCurl library together with xpathSApply.

This script gives you three character vectors (titles, pubdates and descriptions):

library(RCurl) 
library(XML) 
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml" 
script <- getURL(xml.url) 
doc  <- xmlParse(script) 
titles <- xpathSApply(doc,'//item/title',xmlValue) 
descriptions <- xpathSApply(doc,'//item/description',xmlValue) 
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue) 
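The question also asks for a cleaned description and one record per entry. As a minimal sketch (not part of the original answer), the three vectors can be combined into a data frame, with the HTML inside each description stripped by parsing it with htmlParse and the timestamps converted to POSIXct. The inline feed, the helper name strip_html and the variable names are assumptions made for illustration; the inline feed is only modelled on the item shown in the question so that the example runs without network access.

```r
library(XML)

# Hypothetical two-item feed modelled on the <item> in the question.
rss <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title><![CDATA[First headline]]></title>
    <description><![CDATA[<a href="http://example.com/a">A report</a> says <b>five</b> rockets.]]></description>
    <pubDate>2012-11-15T12:56:09-05:00</pubDate>
  </item>
  <item>
    <title><![CDATA[Second headline]]></title>
    <description><![CDATA[Plain text only.]]></description>
    <pubDate>2012-11-15T13:10:00-05:00</pubDate>
  </item>
</channel></rss>'

doc <- xmlParse(rss, asText = TRUE)

titles       <- xpathSApply(doc, "//item/title", xmlValue)
descriptions <- xpathSApply(doc, "//item/description", xmlValue)
pubdates     <- xpathSApply(doc, "//item/pubDate", xmlValue)

# Hypothetical helper: parse an HTML fragment and keep only its text nodes.
# Assumes the fragment is a non-empty string.
strip_html <- function(fragment) {
  html <- htmlParse(fragment, asText = TRUE, encoding = "UTF-8")
  paste(xpathSApply(html, "//text()", xmlValue), collapse = "")
}

# The feed writes offsets as "-05:00"; drop the colon so %z can read them.
offsets_fixed <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", pubdates)

feed <- data.frame(
  date        = as.POSIXct(offsets_fixed, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC"),
  title       = titles,
  description = vapply(descriptions, strip_html, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(feed)
```

With the real feed, replacing the inline string by the result of getURL(xml.url) should work the same way, assuming every item carries a title, a description and a pubDate so that the three vectors have equal length.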

For more information: xpathSApply is part of the XML library.
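To make clearer what xpathSApply does here, the sketch below compares it with the two-step form: getNodeSet collects the nodes matching an XPath expression, and sapply then applies xmlValue to each of them. The small inline document and the variable names are assumptions for illustration only.

```r
library(XML)

# Hypothetical minimal feed, inlined so the example needs no network access.
rss <- '<rss version="2.0"><channel>
  <item><title><![CDATA[First headline]]></title></item>
  <item><title><![CDATA[Second headline]]></title></item>
</channel></rss>'

doc <- xmlParse(rss, asText = TRUE)

# One step: evaluate the XPath and apply xmlValue to every match.
titles_one_step <- xpathSApply(doc, "//item/title", xmlValue)

# Two steps: collect the matching nodes first, then extract their text.
title_nodes     <- getNodeSet(doc, "//item/title")
titles_two_step <- sapply(title_nodes, xmlValue)

print(titles_one_step)
print(titles_two_step)
```

The two-step form can be handy when the nodes themselves are needed, for example to read attributes with xmlGetAttr before extracting the text.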