Scraping a website with Scrapy
Problem description:
Hello, I am using Scrapy to crawl news from a website, but I get an error when I run it. The site has many news pages, and each news URL looks like www.example.com/34223. I am trying to find a way to solve this problem. Here is my code. My Scrapy version is 1.4.0 and I am on macOS.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Example(scrapy.Spider):
    name = "example"
    allowed_domains = ["http://www.example.com"]
    start_urls = ["http://www.example.com"]

    rules = (
        #self.log('testing rules' + response.url)
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/*',), deny=(' ',))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['title'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[2]/text()').extract()
        item['img_url'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[3]/img').extract()
        item['description'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[5]/text()').extract()
        return item
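
For reference, a minimal corrected sketch of the spider above: in Scrapy, rules are only honored by CrawlSpider (a plain scrapy.Spider silently ignores them), allowed_domains expects bare domain names without a scheme, and a scrapy.Item must declare its fields before they can be assigned. The /\d+$ link pattern, the NewsItem field names, and the CSS selectors are assumptions based on the question and the answers below:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Fields must be declared before item['key'] = ... will work.
class NewsItem(scrapy.Item):
    title = scrapy.Field()
    img_url = scrapy.Field()
    description = scrapy.Field()

class ExampleCrawlSpider(CrawlSpider):      # rules require CrawlSpider
    name = "example_crawl"
    allowed_domains = ["example.com"]       # bare domain, no http:// scheme
    start_urls = ["http://www.example.com"]

    rules = (
        # Follow numeric news URLs such as /34223 (assumed pattern)
        # and hand each matching page to parse_item.
        Rule(LinkExtractor(allow=(r'/\d+$',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['title'] = response.css('.article-title h1::text').extract()
        item['img_url'] = response.css('.article img::attr(src)').extract()  # assumed selector
        item['description'] = response.css('.article-text p::text').extract()
        return item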
Answer
Thanks, it works, but now I need to crawl all of the site's news.
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.Example.com']
    start_urls = ['http://www.Example.com/1621305']

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
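
One detail worth knowing about the selectors used here: .extract() always returns a list of every match, so each field above is a list of strings. When one string per field is wanted, Scrapy selectors also offer .extract_first(), which returns the first match or None. A sketch of the same parse method using it:

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                # .extract_first() -> first match as a string (or None),
                # instead of the list that .extract() returns
                'title': article.css('.article-title h1::text').extract_first(),
                'time': article.css('.article-time time::text').extract_first(),
                # keep the list here: the article body usually spans many <p> tags
                'article': article.css('.article-text p::text').extract(),
            }

Either way, the scraped items can be saved with Scrapy's built-in feed export, for example: scrapy crawl example -o news.json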
Answer
I did fix the code and it works fine now. Yes, this is how I did it:
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for article in response.css('.main-news'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'another': article.css('.article-source::text').extract(),
                'section': response.xpath('/html/body/div[3]/div/div/div[1]/ol/li[2]/a//text()').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
        for next_page in response.css('a::attr(href)'):
            yield response.follow(next_page, self.parse)
When I run the code I get this error: ERROR: Spider error processing (referer: None) – Raed
Change allowed_domains = ["http://www.example.com"] to allowed_domains = ["www.example.com"] and see if it works –
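
That suggestion is consistent with how Scrapy filters requests: allowed_domains must hold plain domain names, and an entry that contains a scheme like http:// prevents the offsite filter from matching request hosts, so followed links get dropped. Independently of that, the loop over response.css('a::attr(href)') above follows every link on every page. A sketch of a narrower variant that follows only article-looking URLs (the /\d+$ pattern is an assumption based on the URLs in the question):

# -*- coding: utf-8 -*-
import re
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']   # bare domain: no scheme, no spaces
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for article in response.css('.main-news'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
        # Follow only hrefs that look like news articles, e.g. /34223
        # (assumed URL shape); everything else is skipped up front instead
        # of relying solely on the offsite filter.
        for href in response.css('a::attr(href)').extract():
            if re.search(r'/\d+$', href):
                yield response.follow(href, self.parse)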