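CrawlSpider rules not working
Problem description:
I want to build a spider that uses Python's Scrapy framework to scrape course data from the New York Institute of Technology. Below is my spider (nyitspider.py). Can someone tell me where I went wrong? The CrawlSpider rules are not working.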
from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses',),
        )),
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9] [0-9]/',),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        item["institute"] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        item['title'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a').extract()[0]
        item['id'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a').extract()[0]
        item['credits'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[3]').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]
        yield item
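Answer:
You must create the item correctly in the parse_item method (the question calls Course(), which is never imported, instead of NyitSampleItem()), and the method should return something. Here is a suggestion, but you will need to improve it: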
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        # Links under /academics/courses are passed to parse_item.
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses',),
        ), callback='parse_item'),
        # Individual course pages (pattern kept as posted).
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9] [0-9]/',),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        # Use the item class that is actually imported from items.py.
        item = NyitSampleItem()
        item['institute'] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        # string(...) yields the node's text content instead of its markup.
        item['title'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a)').extract()[0]
        item['id'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a)').extract()[0]
        item['credits'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[3])').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]
        return item
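A point that may explain why the second rule never seemed to fire in the question's version: per the Scrapy documentation, when several rules match the same link, the first one in definition order is used, and a Rule without a callback only follows links. The sketch below imitates that selection with the standard re module rather than Scrapy itself. The helper first_matching_rule and the course URL are hypothetical, and applying the patterns with re.search is an assumption about how the allow patterns are tested.

```python
import re

# (allow pattern, callback name) pairs mirroring the two rules in the question.
QUESTION_RULES = [
    (r".*/academics/courses", None),
    (r".*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9] [0-9]/", "parse_item"),
]


def first_matching_rule(url, rules):
    """Return (index, callback) of the first rule whose pattern is found in url."""
    for index, (pattern, callback) in enumerate(rules):
        if re.search(pattern, url):
            return index, callback
    return None


# Hypothetical course URL, used only for illustration.
url = "http://www.nyit.edu/academics/courses/abc-de-123/"

# The broad first pattern claims the link, so no callback is attached to it.
print(first_matching_rule(url, QUESTION_RULES))

# The second pattern, exactly as posted, contains a literal space and does not
# match this URL even when tested on its own.
print(re.search(QUESTION_RULES[1][0], url))
```

Under these assumptions the first rule wins for every course link, which is consistent with the answer attaching callback='parse_item' to the first rule as well.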
What can we conclude from this?
2017-03-17 07:20:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-03-17 07:20:59 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None) ['cached']
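First, you can remove the 'tbody' tag from all of your XPath expressions. It is added by the browser, and the response returned by the page does not contain it. Also try changing the regular expression in the second rule to r'\/academics\/courses\/(.*)' (you can also remove the first rule). – vold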
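Both suggestions in this comment can be checked without running a crawl. The following is a minimal sketch using only the standard library: xml.etree.ElementTree stands in for Scrapy's selectors (it supports just a subset of XPath), the markup is a made-up table shaped like the one the XPath expressions expect, and the course URL is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Made-up markup as a server might send it: no <tbody> inside the table.
SERVER_HTML = """
<html><body>
<table id="course_catalog_table">
  <tr><td><a>ABC 101</a></td><td><a>Sample Course</a></td><td>3</td></tr>
  <tr><td>Sample description.</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SERVER_HTML)

# A path copied from browser dev tools (with tbody) finds nothing here.
with_tbody = root.findall(".//*[@id='course_catalog_table']/tbody/tr[1]/td[2]/a")

# The same path without tbody finds the title link.
without_tbody = root.findall(".//*[@id='course_catalog_table']/tr[1]/td[2]/a")

print(len(with_tbody), [a.text for a in without_tbody])

# The simplified pattern from the comment, tried on a hypothetical course URL.
url = "http://www.nyit.edu/academics/courses/abc-de-123/"
match = re.search(r"\/academics\/courses\/(.*)", url)
print(match.group(1))
```

In a real spider the same idea would mean dropping /tbody from each response.xpath(...) call and checking the result in scrapy shell against the actual page.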