Scrapy - Getting a spider variable inside a downloader middleware's __init__
Problem description:
I am working on a Scrapy project, and I wrote a downloader middleware for it to avoid making requests for URLs that are already in the database.
DOWNLOADER_MIDDLEWARES = {
'imobotS.utilities.RandomUserAgentMiddleware': 400,
'imobotS.utilities.DupFilterMiddleware': 500,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
The idea is to connect in __init__ and load a distinct list of all the URLs currently stored in the database, then raise IgnoreRequest if the scraped item is already in the DB.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):
    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
So, instead of hard-coding WEBSITE_NAME, I would like to restrict the URL list loaded at initialization to the current site. Is there a way to identify the current spider's name inside the downloader middleware's __init__ method?
Answer:
You can move the loading of the URL set into process_request and check whether you have already fetched it for that spider.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):
    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_sets = {}

    def process_request(self, request, spider):
        # Lazily load the URL set for this spider's site the first time
        # a request from that spider passes through the middleware.
        if not self.url_sets.get(spider._site_name):
            self.url_sets[spider._site_name] = self.db.ad.find({'site': spider._site_name}).distinct('url')
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None