Scraping a website with Scrapy
Problem description:
Hello, I am using Scrapy to crawl news from a website, but I get an error when I run it. The site has many news pages, and each news URL looks like www.example.com/34223. I am trying to find a way to solve this problem. Here is my code. My Scrapy version is 1.4.0 and I am on macOS.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Example(scrapy.Spider):
    name = "example"
    allowed_domains = ["http://www.example.com"]
    start_urls = ["http://www.example.com"]

    rules = (
        #self.log('testing rules' + response.url)
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/*',), deny=(' ',))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['title'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[2]/text()').extract()
        item['img_url'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[3]/img').extract()
        item['description'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[5]/text()').extract()
        return item
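
For reference, a minimal corrected sketch of the spider above: in Scrapy, rules are only honored by CrawlSpider (a plain scrapy.Spider silently ignores them), allowed_domains expects bare domain names without a scheme, and a scrapy.Item must declare its fields before they can be assigned. The /\d+$ link pattern, the NewsItem field names, and the CSS selectors are assumptions based on the question and the answers below:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Fields must be declared before item['key'] = ... will work.
class NewsItem(scrapy.Item):
    title = scrapy.Field()
    img_url = scrapy.Field()
    description = scrapy.Field()

class ExampleCrawlSpider(CrawlSpider):      # rules require CrawlSpider
    name = "example_crawl"
    allowed_domains = ["example.com"]       # bare domain, no http:// scheme
    start_urls = ["http://www.example.com"]

    rules = (
        # Follow numeric news URLs such as /34223 (assumed pattern)
        # and hand each matching page to parse_item.
        Rule(LinkExtractor(allow=(r'/\d+$',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['title'] = response.css('.article-title h1::text').extract()
        item['img_url'] = response.css('.article img::attr(src)').extract()  # assumed selector
        item['description'] = response.css('.article-text p::text').extract()
        return item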
Answer
Thanks, it works, but now I need to crawl all of the site's news.
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.Example.com']
    start_urls = ['http://www.Example.com/1621305']

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
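
One detail worth knowing about the selectors used here: .extract() always returns a list of every match, so each field above is a list of strings. When one string per field is wanted, Scrapy selectors also offer .extract_first(), which returns the first match or None. A sketch of the same parse method using it:

    def parse(self, response):
        for article in response.css('.article'):
            yield {
                # .extract_first() -> first match as a string (or None),
                # instead of the list that .extract() returns
                'title': article.css('.article-title h1::text').extract_first(),
                'time': article.css('.article-time time::text').extract_first(),
                # keep the list here: the article body usually spans many <p> tags
                'article': article.css('.article-text p::text').extract(),
            }

Either way, the scraped items can be saved with Scrapy's built-in feed export, for example: scrapy crawl example -o news.json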
Answer
I did fix the code and it works fine now. Yes, this is how I did it:
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for article in response.css('.main-news'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'another': article.css('.article-source::text').extract(),
                'section': response.xpath('/html/body/div[3]/div/div/div[1]/ol/li[2]/a//text()').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
        for next_page in response.css('a::attr(href)'):
            yield response.follow(next_page, self.parse)
When I run the code I get this error: ERROR: Spider error processing (referer: None) – Raed
Change allowed_domains = ["http://www.example.com"] to allowed_domains = ["www.example.com"] and see if it works –
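
That suggestion is consistent with how Scrapy filters requests: allowed_domains must hold plain domain names, and an entry that contains a scheme like http:// prevents the offsite filter from matching request hosts, so followed links get dropped. Independently of that, the loop over response.css('a::attr(href)') above follows every link on every page. A sketch of a narrower variant that follows only article-looking URLs (the /\d+$ pattern is an assumption based on the URLs in the question):

# -*- coding: utf-8 -*-
import re
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']   # bare domain: no scheme, no spaces
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for article in response.css('.main-news'):
            yield {
                'title': article.css('.article-title h1::text').extract(),
                'time': article.css('.article-time time::text').extract(),
                'article': article.css('.article-text p::text').extract(),
            }
        # Follow only hrefs that look like news articles, e.g. /34223
        # (assumed URL shape); everything else is skipped up front instead
        # of relying solely on the offsite filter.
        for href in response.css('a::attr(href)').extract():
            if re.search(r'/\d+$', href):
                yield response.follow(href, self.parse)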