How do I modify a URL before following it in Scrapy?

Question:

I'm new to Scrapy, and this is my second spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class SitenameScrapy(scrapy.Spider):
    name = "sitename"
    allowed_domains = ['www.sitename.com', 'sitename.com']
    # Note: rules are only applied by CrawlSpider; a plain scrapy.Spider ignores them.
    rules = [Rule(LinkExtractor(unique=True), follow=True)]

    def start_requests(self):
        urls = ['http://www.sitename.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_cat)

    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            if '/category/' in link.url:
                yield response.follow(link, self.parse_cat)
            if '/product/' in link.url:
                yield response.follow(link, self.parse_prod)

    def parse_prod(self, response):
        pass

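My problem is that sometimes I get a link like http://sitename.com/path1/path2/?param1=value1&param2=value2, and param1 doesn't matter to me, so I want to remove it from the URL before calling response.follow. I think I could do this with a regex, but I'm not sure whether that is the "right way" in Scrapy. Maybe I should use some kind of rule for this?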

I think you can use the url_query_cleaner function from the w3lib library. Something like:

from w3lib.url import url_query_cleaner
...
    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            # Keep only param2; every other query parameter (e.g. param1) is dropped.
            url = url_query_cleaner(link.url, ('param2',))
            if '/category/' in url:
                yield response.follow(url, self.parse_cat)
            if '/product/' in url:
                yield response.follow(url, self.parse_prod)

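I'd like to add the other way around: specify which parameters to remove from the query string and leave all the others in place: `url = url_query_cleaner(link.url, ('param1',), remove=True)`.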
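To make the difference between the two call styles concrete (keep only the listed parameters versus remove the listed parameters), here is a minimal standard-library sketch that mimics that behaviour. It is not w3lib's implementation, and the helper name `filter_query` is made up for illustration. It is only meant to show why parsing the query string tends to be more robust than a regex.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query(url, names, remove=False):
    """Hypothetical helper: keep only `names`, or drop them when remove=True."""
    parts = urlsplit(url)
    # Parse the query string into (key, value) pairs, preserving order.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if remove:
        kept = [(k, v) for k, v in pairs if k not in names]
    else:
        kept = [(k, v) for k, v in pairs if k in names]
    # urlencode re-encodes the remaining pairs; the rest of the URL is untouched.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "http://sitename.com/path1/path2/?param1=value1&param2=value2"

# Keep-list style: only param2 survives.
keep_mode = filter_query(url, ("param2",))

# Remove-list style: param1 is dropped, everything else stays.
remove_mode = filter_query(url, ("param1",), remove=True)

print(keep_mode)
print(remove_mode)
```

For this sample URL both calls produce the same result. They differ once a link carries extra parameters: the keep-list style drops anything not listed, while the remove-list style lets unknown parameters through.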