CrawlSpider rules not working

Problem description:

I am trying to build a spider with Python's Scrapy framework to scrape course data from the New York Institute of Technology. Below is my spider (nyitspider.py). Can someone tell me where I am going wrong? The CrawlSpider rules are not working.

from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses',),
        )),

        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9] [0-9]/',),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        item["institute"] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        item['title'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a').extract()[0]
        item['id'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a').extract()[0]
        item['credits'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[3]').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]

        yield item

What can we learn from this? 2017-03-17 07:20:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 2017-03-17 07:20:59 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None) ['cached'] –


First, you can remove the 'tbody' tag from all of your XPath expressions. It is added by the browser, and the response returned by the page does not contain it. Also try changing the regex in the second rule to r'\/academics\/courses\/(.*)' (you can then remove the first rule as well). – vold
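A minimal sketch of what that comment suggests, assuming the rest of the project stays as posted. The single relaxed rule and the tbody-free XPath are the commenter's suggestions, not something verified against the live site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NyitSpider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    # One relaxed rule, as suggested in the comment above. Note that when a
    # callback is set, follow defaults to False, so follow=True is needed to
    # keep following links from the matched pages.
    rules = (
        Rule(LinkExtractor(allow=(r'/academics/courses/(.*)',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # XPath without 'tbody': the raw HTML response does not contain the
        # tbody element that the browser inserts into the DOM.
        yield {
            'title': response.xpath(
                'string(//*[@id="course_catalog_table"]/tr[1]/td[2]/a)'
            ).extract_first(),
        }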

You have to declare the item correctly in the parse_item method, and the method should return something. Here is a suggestion, but you will have to improve it:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses',),
        ), callback='parse_item'),
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9] [0-9]/',),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = NyitSampleItem()
        item['institute'] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        item['title'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a)').extract()[0]
        item['id'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a)').extract()[0]
        item['credits'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[3])').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]
        return item
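
For the NyitSampleItem import to work, items.py in the nyit_sample project has to declare every field that parse_item assigns. The question does not show that file, so this is a sketch with the field set inferred from the spider above:

# nyit_sample/items.py
import scrapy


class NyitSampleItem(scrapy.Item):
    # One Field per key assigned in parse_item; Scrapy raises a KeyError
    # for any key that is not declared here.
    institute = scrapy.Field()
    site = scrapy.Field()
    title = scrapy.Field()
    id = scrapy.Field()
    credits = scrapy.Field()
    description = scrapy.Field()

With that in place, running scrapy crawl nyitspider -o courses.json from the project directory is a quick way to check whether the rules actually match any links and items are produced.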