Deduplicating crawled URLs in Scrapy with a custom class

Previously we solved URL deduplication by keeping a set inside the parse function.
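That approach looked roughly like this (a sketch, not the exact original code; visited_urls is a name assumed here):

import scrapy

class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    start_urls = ["https://dig.chouti.com/"]
    visited_urls = set()  # dedup state lives inside the spider itself

    def parse(self, response):
        print(response.url)
        for href in response.xpath('//a[contains(@href, "/all/hot/recent/")]/@href').getall():
            url = response.urljoin(href)
            if url in self.visited_urls:  # already scheduled, skip it
                continue
            self.visited_urls.add(url)
            yield scrapy.Request(url=url, callback=self.parse)

The drawback is that the dedup logic is tangled into every spider's parse. Scrapy lets us pull it out into a dedicated dupefilter class instead.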

First, create a new duplication.py file in the project root and add from scrapy.dupefilters import RFPDupeFilter (the older scrapy.dupefilter module still works but is deprecated, as the warning in the crawl output below shows). Then, from the RFPDupeFilter source, copy the BaseDupeFilter class into the new duplication.py as a starting point.
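For reference, the skeleton being copied looks roughly like this (paraphrased from the Scrapy source; exact details vary by version):

class BaseDupeFilter(object):
    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return a deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

Filling in this skeleton gives the finished RepeatFilter: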

class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        # class method Scrapy calls to build the filter; returns a RepeatFilter() instance
        return cls()

    def request_seen(self, request):
        # called for every scheduled request; True means "duplicate, drop it"
        if request.url in self.visited_set:
            return True
        else:
            self.visited_set.add(request.url)
            return False

    def open(self):
        # called when the spider starts
        print("--- Crawl started ---")

    def close(self, reason):
        # called when the spider finishes
        print("--- Crawl finished ---")

    def log(self, request, spider):
        # called to log a filtered (duplicate) request
        pass

The URL filtering itself is written in the request_seen method.

The execution order is:

1. from_settings
2. __init__
3. open
4. log
5. close

with request_seen called for every scheduled request between open and close; log fires each time a duplicate is caught, as the quick check below demonstrates.
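A quick standalone sanity check of the filter (outside Scrapy; FakeRequest is a stand-in introduced here just to carry a url attribute, since the scheduler normally passes real Request objects):

from duplication import RepeatFilter

class FakeRequest:
    def __init__(self, url):
        self.url = url

f = RepeatFilter.from_settings(settings=None)  # 1. from_settings -> 2. __init__
f.open()                                       # 3. open
print(f.request_seen(FakeRequest("https://dig.chouti.com/")))  # False: first visit
print(f.request_seen(FakeRequest("https://dig.chouti.com/")))  # True: duplicate
f.close(reason="finished")                     # 5. close

Note that a request built with dont_filter=True bypasses request_seen entirely.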

Finally, don't forget to add DUPEFILTER_CLASS = "shan.duplication.RepeatFilter" to settings.py.

The default is DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter".
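Side by side in settings.py (the commented-out line is the built-in default, shown for comparison):

# settings.py
DUPEFILTER_CLASS = "shan.duplication.RepeatFilter"        # our custom filter
# DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"   # Scrapy's built-in default

With that in place, running the crawl gives: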

(venv) D:\shan>scrapy crawl chouti --nolog
D:\shan\shan\spiders\chouti.py:9: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
--- Crawl started ---
https://dig.chouti.com/
https://dig.chouti.com/all/hot/recent/2
https://dig.chouti.com/all/hot/recent/3
https://dig.chouti.com/all/hot/recent/8
https://dig.chouti.com/all/hot/recent/5
https://dig.chouti.com/all/hot/recent/7
https://dig.chouti.com/all/hot/recent/6
https://dig.chouti.com/all/hot/recent/10
https://dig.chouti.com/all/hot/recent/9
https://dig.chouti.com/all/hot/recent/4
--- Crawl finished ---