Scrapy RSS scraper
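Question:
I am trying to scrape an RSS feed from Yahoo (their open Company RSS Feed | https://developer.yahoo.com/finance/company.html).
I want to scrape the following URL: https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX
For some reason, my spider isn't working. I think it may be related to the generated XPath; if not, there may be a problem with how parse_item is defined.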
import scrapy
from scrapy.spiders import CrawlSpider
from YahooScrape.items import YahooScrapeItem

class Spider(CrawlSpider):
    name= "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = EmperyscraperItem()
        item['title'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for title
        item['link'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for link
        item['description'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract() #define XPath for description
        return item
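What is wrong with the code? If the code is not the problem, what is the correct XPath for extracting the title, description, and link? I am new to Scrapy and just need a little help to get this working!
Edit: I have updated my spider and converted it to an XMLFeedSpider, as shown below: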
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX') #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = YahooScrapeItem()
        item['title'] = node.xpath('item/title/text()',).extract() #define XPath for title
        item['link'] = node.xpath('item/link/text()').extract()
        item['pubDate'] = node.xpath('item/link/pubDate/text()').extract()
        item['description'] = node.xpath('item/category/text()').extract() #define XPath for description
        return item

#Yahoo RSS feeds http://finance.yahoo.com/rss/headline?s=BPMX,APPL
Now I get the following error:
2017-06-13 11:25:57 [scrapy.core.engine] ERROR: Error while obtaining start requests
Any idea why this error occurs? My HTML path looks correct.
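Answer:
From what I can see, CrawlSpider only works for HTML responses, so I suggest building on the simpler scrapy.Spider or the more specialized XMLFeedSpider.
Also, the XPaths you use in parse_item appear to have been built from the HTML that your browser renders from the XML/RSS feed. The feed itself contains no *[@id="collapsible"] element and no <div>s.
Look at view-source:https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX instead: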
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
    <channel>
        <copyright>Copyright (c) 2017 Yahoo! Inc. All rights reserved.</copyright>
        <description>Latest Financial News for BPMX</description>
        <image>
            <height>45</height>
            <link>http://finance.yahoo.com/q/h?s=BPMX</link>
            <title>Yahoo! Finance: BPMX News</title>
            <url>http://l.yimg.com/a/i/brand/purplelogo/uh/us/fin.gif</url>
            <width>144</width>
        </image>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 /PRNewswire/ -- BioPharmX Corporation (NYSE MKT: BPMX), a specialty pharmaceutical company focusing on dermatology, today announced that it will release its financial results ...</description>
            <guid isPermaLink="false">f56d5bf8-f278-37fd-9aa5-fe04b2e1fa53</guid>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>
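To see why paths have to be written relative to each <item>, here is a minimal sketch that parses a trimmed copy of the excerpt above. It uses the standard library's xml.etree.ElementTree rather than Scrapy selectors, so it is only an approximation of what parse_node receives; the FEED string and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-copied version of the feed excerpt (one <item> only).
FEED = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
  <channel>
    <description>Latest Financial News for BPMX</description>
    <item>
      <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
      <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
      <title>BioPharmX to Report First Quarter Financial Results</title>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED.encode("utf-8"))

# The browser-derived selector matches nothing: no element has id="collapsible".
browser_matches = root.findall(".//*[@id='collapsible']")

# Paths are evaluated relative to each <item>, so there is no "item/" prefix.
rows = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "pubDate": item.findtext("pubDate"),
        # <pubDate> is a sibling of <link>, not a child, so this finds nothing.
        "nested_pubDate": item.findtext("link/pubDate"),
    }
    for item in root.iter("item")
]

print(browser_matches)
print(rows)
```

The same reasoning should carry over to the spider: inside parse_node the node is already the <item>, so 'title/text()' is the relative path, while 'item/title/text()' would look for an <item> nested inside the <item>.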
Working spider example:
import scrapy
from scrapy.spiders import XMLFeedSpider
#from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',) #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = {}
        # node is already the <item> element, so paths are relative to it
        item['title'] = node.xpath('title/text()').extract_first() #define XPath for title
        item['link'] = node.xpath('link/text()').extract_first()
        # <pubDate> is a sibling of <link> in the feed, not a child of it
        item['pubDate'] = node.xpath('pubDate/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first() #define XPath for description
        return item
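One difference from the edited spider in the question is worth pointing out: there, start_urls = ('https://...') has no trailing comma, so it is a plain string rather than a tuple. Scrapy loops over start_urls to build the first requests, so this is a likely cause of "Error while obtaining start requests", although the quoted log line does not include the traceback that would confirm it. The sketch below is plain Python with no Scrapy involved, and the variable names are made up for illustration.

```python
FEED_URL = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX"

without_comma = (FEED_URL)   # parentheses only group: this is still a str
with_comma = (FEED_URL,)     # the trailing comma makes a one-element tuple

# Iterating a str yields single characters, so a loop over start_urls
# would see "h", "t", "t", "p", ... instead of one complete URL.
first_from_str = next(iter(without_comma))
first_from_tuple = next(iter(with_comma))

print(type(without_comma).__name__, repr(first_from_str))
print(type(with_comma).__name__, repr(first_from_tuple))
```

Writing start_urls as a list, for example ['https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX'], avoids the issue altogether, because a list does not depend on a trailing comma.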
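I switched to XMLFeedSpider, and I think I now have the correct path syntax. For some reason, I cannot get start_requests to work properly. Maybe I am missing something? – Friezan
What happens if you remove the "item/" prefix from your XPaths? –
Unfortunately, same problem. Any ideas? – Friezan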