Scrapy: run multiple spiders from a main spider?
Problem description:
I have two spiders that need URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data to the two spiders. Here is my approach:
import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):
    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None):
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here


class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)
        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        process = CrawlerProcess()
        novelSpider = LightnovelSpider()
        # process.start() tries to start a second reactor while the main
        # spider's reactor is already running
        process.crawl(novelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")
The main error I can see is a "twisted.internet.error.ReactorAlreadyRunning". Is there a better way to run multiple spiders from another spider, and/or how can I prevent this error?
Answer:
After some research, I was able to solve this problem by retrieving the main spider's data with the @property decorator, like this:
class initCrawler(scrapy.Spider):
    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter
Then I used CrawlerRunner like this:
from spiders.lightnovel import chapterSpider, lightnovelSpider, initCrawler
from scrapy.crawler import CrawlerProcess, CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
import logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(lightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()
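The reason this works where the original approach failed: CrawlerProcess starts its own Twisted reactor, which raises ReactorAlreadyRunning when called inside a spider because Scrapy's reactor is already running, whereas CrawlerRunner leaves reactor management to the caller, so all three crawls are chained inside the single reactor started by reactor.run().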
Answer:
Wow, I didn't know something like that could work; I've never tried it.
What I do instead, when multiple scraping stages have to work together, is one of these two options:
Option 1 - Use a database
When the scraper is supposed to run in continuous mode, re-scanning the site and so on, I simply have the scraper push its results into a database (via an item pipeline), and the follow-up spiders pull the data they need (in your case, the novel URLs, for example) from that same database.
A scheduler or cron job then keeps everything running, and the spiders work hand in hand.
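A minimal sketch of such a pipeline, assuming MongoDB via pymongo (your fromMongo attribute suggests Mongo is already in play); the settings keys, database name, and collection name here are made up for the example:

import pymongo


class MongoWritePipeline:
    """Push each scraped item into MongoDB so later spiders can read it."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from the project settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "lightnovel"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert by URL so repeated scans don't create duplicates.
        self.db["novels"].update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item

The follow-up spider's start_requests can then simply iterate over the URLs stored in that collection.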
Option 2 - Merge everything into one spider
This is what I choose when everything needs to run as a single script: I create one spider that chains multiple requests together across several steps.
import scrapy
from scrapy import Request


class LightnovelSpider(scrapy.Spider):
    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelScraper.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            yield Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass
(The code is untested; it just shows the basic concept.)
If things get too complex, I pull some elements out and move them into mixin classes.
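For illustration, a tiny hypothetical sketch of that mixin idea (the class and method names are mine, not from any real project):

import scrapy


class ChapterParsingMixin:
    """Chapter callbacks shared by every spider that needs them."""

    def parse_chapters(self, response):
        # shared chapter-extraction logic would live here
        pass


class NovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novels_mixed_in"

    def parse(self, response):
        # spider-specific logic; hand chapter pages to the shared callback
        yield scrapy.Request(response.url, callback=self.parse_chapters, dont_filter=True)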
In your case, I would most likely lean towards option 2.