Scrapy and Scrapy Spiders
Problem description:
I want to scrape a website, extracting the list of article links, the author names, and the dates. I watched some Scrapy spider videos and was able to work out three shell commands that pull the required data from the site. First, open the Scrapy shell on the target page:
scrapy shell https://www.cato.org/research/34/commentary
Dates:
response.css('span.date-display-single::text').extract()
Authors:
response.css('p.text-sans::text').extract()
Article links on the page:
response.css('p.text-large.experts-more-h > a::text').extract()
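For readers without a live Scrapy shell, here is a rough stdlib-only sketch of what those three selectors pick out, run against a made-up HTML fragment (the markup below is an assumption for illustration, not the site's actual source; it also shows the difference between taking an anchor's text and taking its href attribute):

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the page structure the selectors target.
HTML = """
<span class="date-display-single">April 1, 2019</span>
<p class="text-sans">By Jane Doe</p>
<p class="text-large experts-more-h"><a href="/commentary/example">Example Title</a></p>
"""

class MiniExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.dates, self.authors, self.links = [], [], []
        self._target = None  # which list the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get('class', '')
        if tag == 'span' and 'date-display-single' in cls:
            self._target = self.dates      # like span.date-display-single::text
        elif tag == 'p' and 'text-sans' in cls:
            self._target = self.authors    # like p.text-sans::text
        elif tag == 'a' and 'href' in attrs:
            # like a::attr(href): take the link target rather than the link text
            self.links.append(attrs['href'])

    def handle_data(self, data):
        if self._target is not None and data.strip():
            self._target.append(data.strip())
            self._target = None

parser = MiniExtractor()
parser.feed(HTML)
print(parser.dates)    # ['April 1, 2019']
print(parser.authors)  # ['By Jane Doe']
print(parser.links)    # ['/commentary/example']
```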
I tried to get the same data through Python, but to no avail, since there are multiple items per page.
Here is my Python code:
import scrapy

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass
Answer
This should work. All you need to do is run this command: scrapy runspider cato.py -o out.json
But from what I can see, the links part is wrong: that selector gives you only the link text, not the href.
import scrapy

class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()

class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        date = response.css('span.date-display-single::text').extract()
        author = response.css('p.text-sans::text').extract()
        # ::attr(href) returns the link target rather than the link text
        links = response.css('p.text-large.experts-more-h > a::attr(href)').extract()
        for d, a, l in zip(date, author, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item
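One caveat with this parse method: it relies on zip to pair three parallel lists item by item, and zip stops at the shortest input. If one selector misses a single element on the page, rows silently fall out of alignment. A quick plain-Python illustration (sample data invented):

```python
dates = ['2019-04-01', '2019-03-28', '2019-03-25']
authors = ['Jane Doe', 'John Roe', 'Ann Poe']
links = ['/commentary/a', '/commentary/b']  # one link missing from the page

rows = list(zip(dates, authors, links))
print(rows)
# zip truncates to the shortest input: only two rows come out, and the
# third date/author pair is silently dropped.
```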
Don't use 'css' for this; 'xpath' is better. – AndMar
I am trying to build a module whose task is to click each article link and extract the date, author, and article title, and to do this for all the article links on the page (cato.org/research/34/commentary). Please help. – Shad
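In Scrapy terms that is a two-stage crawl: parse yields a follow-up request (e.g. via response.follow) for each listing link, and a second callback extracts the fields from the article page itself. Below is a library-free sketch of just that control flow, with plain dicts standing in for pages and direct calls standing in for Scrapy's request scheduling (all names and page data here are made up for illustration):

```python
# Fake "site": a listing page with links, plus the article pages behind them.
PAGES = {
    '/research/34/commentary': {'links': ['/commentary/a', '/commentary/b']},
    '/commentary/a': {'title': 'Article A', 'author': 'Jane Doe', 'date': '2019-04-01'},
    '/commentary/b': {'title': 'Article B', 'author': 'John Roe', 'date': '2019-03-28'},
}

def parse(listing):
    # Stage 1: like the spider's parse(), hand each listing link to the
    # article callback (Scrapy would do this via response.follow + callback).
    for link in listing['links']:
        yield from parse_article(PAGES[link])

def parse_article(page):
    # Stage 2: like a parse_article() callback, extract fields from the article page.
    yield {'title': page['title'], 'author': page['author'], 'date': page['date']}

items = list(parse(PAGES['/research/34/commentary']))
print(items)
```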