Scrapy and Scrapy Spiders
Problem description:
I want to scrape a website, extracting the list of article links, the author names, and the dates. I watched some Scrapy spider videos and was able to work out three shell commands that pull the required data from the site. First, open the Scrapy shell on the target page:
scrapy shell https://www.cato.org/research/34/commentary
Dates:
response.css('span.date-display-single::text').extract()
Authors:
response.css('p.text-sans::text').extract()
Article links on the page:
response.css('p.text-large.experts-more-h > a::text').extract()
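For readers without a live Scrapy shell, here is a rough stdlib-only sketch of what those three selectors pick out, run against a made-up HTML fragment (the markup below is an assumption for illustration, not the site's actual source; it also shows the difference between taking an anchor's text and taking its href attribute):

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the page structure the selectors target.
HTML = """
<span class="date-display-single">April 1, 2019</span>
<p class="text-sans">By Jane Doe</p>
<p class="text-large experts-more-h"><a href="/commentary/example">Example Title</a></p>
"""

class MiniExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.dates, self.authors, self.links = [], [], []
        self._target = None  # which list the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get('class', '')
        if tag == 'span' and 'date-display-single' in cls:
            self._target = self.dates      # like span.date-display-single::text
        elif tag == 'p' and 'text-sans' in cls:
            self._target = self.authors    # like p.text-sans::text
        elif tag == 'a' and 'href' in attrs:
            # like a::attr(href): take the link target rather than the link text
            self.links.append(attrs['href'])

    def handle_data(self, data):
        if self._target is not None and data.strip():
            self._target.append(data.strip())
            self._target = None

parser = MiniExtractor()
parser.feed(HTML)
print(parser.dates)    # ['April 1, 2019']
print(parser.authors)  # ['By Jane Doe']
print(parser.links)    # ['/commentary/example']
```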
I tried to get the same data through Python, but to no avail, since there are multiple items per page.
Here is my Python code:
import scrapy

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass
Answer
This should work. All you need to do is run this command: scrapy runspider cato.py -o out.json
But from what I can see, the links part is wrong: that selector gives you only the link text, not the href.
import scrapy

class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        date = response.css('span.date-display-single::text').extract()
        author = response.css('p.text-sans::text').extract()
        # ::attr(href) returns the link target rather than the link text
        links = response.css('p.text-large.experts-more-h > a::attr(href)').extract()
        for d, a, l in zip(date, author, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item
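One caveat with this parse method: it relies on zip to pair three parallel lists item by item, and zip stops at the shortest input. If one selector misses a single element on the page, rows silently fall out of alignment. A quick plain-Python illustration (sample data invented):

```python
dates = ['2019-04-01', '2019-03-28', '2019-03-25']
authors = ['Jane Doe', 'John Roe', 'Ann Poe']
links = ['/commentary/a', '/commentary/b']  # one link missing from the page

rows = list(zip(dates, authors, links))
print(rows)
# zip truncates to the shortest input: only two rows come out, and the
# third date/author pair is silently dropped.
```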
Don't use 'css' for this; 'xpath' is better. – AndMar
I am trying to build a module whose task is to click each article link and extract the date, author, and article title, and to do this for all the article links on the page (cato.org/research/34/commentary). Please help. – Shad
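In Scrapy terms that is a two-stage crawl: parse yields a follow-up request (e.g. via response.follow) for each listing link, and a second callback extracts the fields from the article page itself. Below is a library-free sketch of just that control flow, with plain dicts standing in for pages and direct calls standing in for Scrapy's request scheduling (all names and page data here are made up for illustration):

```python
# Fake "site": a listing page with links, plus the article pages behind them.
PAGES = {
    '/research/34/commentary': {'links': ['/commentary/a', '/commentary/b']},
    '/commentary/a': {'title': 'Article A', 'author': 'Jane Doe', 'date': '2019-04-01'},
    '/commentary/b': {'title': 'Article B', 'author': 'John Roe', 'date': '2019-03-28'},
}

def parse(listing):
    # Stage 1: like the spider's parse(), hand each listing link to the
    # article callback (Scrapy would do this via response.follow + callback).
    for link in listing['links']:
        yield from parse_article(PAGES[link])

def parse_article(page):
    # Stage 2: like a parse_article() callback, extract fields from the article page.
    yield {'title': page['title'], 'author': page['author'], 'date': page['date']}

items = list(parse(PAGES['/research/34/commentary']))
print(items)
```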