Delayed requests in scrapy
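Problem description:
I would like to repeatedly scrape the same URL with different delays. After researching the issue, it seemed that the appropriate solution was to use something like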
    nextreq = scrapy.Request(url, dont_filter=True)
    d = defer.Deferred()
    delay = 1
    reactor.callLater(delay, d.callback, nextreq)
    yield d
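in parse.
However, I have been unable to get this to work. I get the error message ERROR: Spider must return Request, BaseItem, dict or None, got 'Deferred'
I am not familiar with Twisted, so I hope I am just missing something obvious.
Is there a better way of achieving my goal that does not fight the framework so much?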
Answer:
I finally found an answer in an old PR:
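    import scrapy
    from twisted.internet import reactor

    def parse(self, response):
        req = scrapy.Request(...)
        delay = 0
        # Hand the request to the engine once `delay` seconds have passed.
        reactor.callLater(delay, self.crawler.engine.schedule, request=req, spider=self)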
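However, the spider may exit too early because it goes idle before the delayed request is scheduled. Based on the outdated middleware https://github.com/ArturGaspar/scrapy-delayed-requests, this can be remedied with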
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider


    class ImmortalSpiderMiddleware(object):

        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            # Call spider_idle whenever the spider runs out of requests.
            crawler.signals.connect(s.spider_idle, signal=signals.spider_idle)
            return s

        @classmethod
        def spider_idle(cls, spider):
            # Always veto the shutdown, so the spider never closes on idle.
            raise DontCloseSpider()
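Because this handler raises DontCloseSpider unconditionally, the spider is never closed for being idle. A narrower rule is to veto the shutdown only while delayed requests are still outstanding. The following is a minimal standard-library sketch of that bookkeeping; `PendingTracker` and `FakeSpider` are hypothetical names used for illustration and are not part of Scrapy.

```python
from weakref import WeakKeyDictionary


class PendingTracker:
    """Counts delayed requests per spider (hypothetical helper, not a Scrapy API)."""

    def __init__(self):
        # Weak keys: an entry disappears once its spider is garbage collected.
        self._pending = WeakKeyDictionary()

    def added(self, spider):
        # A delayed request was handed to the timer.
        self._pending[spider] = self._pending.get(spider, 0) + 1

    def done(self, spider):
        # The timer fired and the request was scheduled.
        self._pending[spider] -= 1

    def should_stay_open(self, spider):
        # True while at least one delayed request has not been scheduled yet.
        return self._pending.get(spider, 0) > 0


class FakeSpider:
    """Stand-in for a scrapy.Spider instance, for illustration only."""


tracker = PendingTracker()
spider = FakeSpider()

tracker.added(spider)
stay_open_while_pending = tracker.should_stay_open(spider)
tracker.done(spider)
stay_open_after_done = tracker.should_stay_open(spider)
```

In a real idle handler, `should_stay_open` would decide whether to raise DontCloseSpider.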
The last option is to update the middleware by ArturGaspar, which leads to:
    from weakref import WeakKeyDictionary

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider, IgnoreRequest
    from twisted.internet import reactor


    class DelayedRequestsMiddleware(object):
        # Number of delayed requests still waiting, per spider.
        requests = WeakKeyDictionary()

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
            return ext

        @classmethod
        def spider_idle(cls, spider):
            # Keep the spider open only while delayed requests are pending.
            if cls.requests.get(spider):
                spider.log("delayed requests pending, not closing spider")
                raise DontCloseSpider()

        def process_request(self, request, spider):
            delay = request.meta.pop('delay_request', None)
            if delay:
                self.requests.setdefault(spider, 0)
                self.requests[spider] += 1
                # Re-schedule a copy later and drop the current request.
                reactor.callLater(delay, self.schedule_request, request.copy(),
                                  spider)
                raise IgnoreRequest()

        def schedule_request(self, request, spider):
            spider.crawler.engine.schedule(request, spider)
            self.requests[spider] -= 1
which can be used in parse like:
    yield Request(..., meta={'delay_request': 5})
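Since the class implements process_request, it has to be enabled as a downloader middleware before the delay_request meta key has any effect. A minimal settings sketch follows, assuming the class lives in a module named `myproject.middlewares` (a hypothetical path) and using 543 as an arbitrary order value.

```python
# settings.py (sketch). "myproject.middlewares" is a hypothetical module path;
# replace it with wherever DelayedRequestsMiddleware is actually defined.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DelayedRequestsMiddleware": 543,
}
```

The re-scheduled copy goes through the scheduler again, so when the same URL is requested repeatedly it will likely also need dont_filter=True, as in the question, to avoid being dropped by the duplicate filter.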