How to modify a URL before following it in Scrapy?
Question:
I'm new to Scrapy, and this is my second spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class SitenameScrapy(scrapy.Spider):
    name = "sitename"
    allowed_domains = ['www.sitename.com', 'sitename.com']
    rules = [Rule(LinkExtractor(unique=True), follow=True)]

    def start_requests(self):
        urls = ['http://www.sitename.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_cat)

    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            if '/category/' in link.url:
                yield response.follow(link, self.parse_cat)
            if '/product/' in link.url:
                yield response.follow(link, self.parse_prod)

    def parse_prod(self, response):
        pass
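My problem is that sometimes I get a link like http://sitename.com/path1/path2/?param1=value1&param2=value2, and for me param1 is not important, so I want to remove it from the URL before calling response.follow. I think I could do this with a regex, but I'm not sure whether that is the "right way" in Scrapy. Maybe I should use some kind of rule for this?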
Answer:
I think you could use the url_query_cleaner function from the w3lib library. Something like:
from w3lib.url import url_query_cleaner
...
....
    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            # keep only param2 in the query string; every other parameter is dropped
            url = url_query_cleaner(link.url, ('param2',))
            if '/category/' in url:
                yield response.follow(url, self.parse_cat)
            if '/product/' in url:
                yield response.follow(url, self.parse_prod)
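I'd like to add the other option: you can instead specify which parameters to remove from the query string and keep all the others:
url = url_query_cleaner(link.url, ('param1',), remove=True)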
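If it helps to see what "remove these parameters, keep the rest" amounts to compared with a regex, here is a rough standard-library sketch of the same idea. This is not how w3lib implements it; drop_query_params is a hypothetical helper name, and the extra page parameter is only there to show that unlisted parameters survive.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, unwanted):
    """Return url with every query parameter named in `unwanted` removed."""
    parts = urlsplit(url)
    # Parse the query into (key, value) pairs, keeping order and blank values.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in unwanted]
    # Note: urlencode re-encodes the values, so unusual escaping in the
    # original URL may come out normalized rather than byte-for-byte identical.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "http://sitename.com/path1/path2/?param1=value1&param2=value2&page=3"
cleaned = drop_query_params(url, {"param1"})
print(cleaned)
```

Parsing the query string this way avoids the edge cases a regex has to handle by hand, such as the parameter being first, last, or the only one in the query.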