Parsing the result of another request in Scrapy

Problem description:

I am trying to scrape lynda.com courses and store their information in a CSV file. This is my code:

# -*- coding: utf-8 -*- 
import scrapy 
import itertools 


class LyndadevSpider(scrapy.Spider): 
    name = 'lyndadev' 
    allowed_domains = ['lynda.com'] 
    start_urls = ['https://www.lynda.com/Developer-training-tutorials'] 

    def parse(self, response):
        #print(response.url)
        titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
        descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
        links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()

        for title, desc, link in itertools.izip(titles, descs, links):
            #print link
            categ = scrapy.Request(link, callback=self.parse2)
            yield {'desc': link, 'category': categ}

    def parse2(self, response):
        #getting categories by storing the navigation info
        item = response.xpath('//ol[@role="navigation"]').extract()
        return item

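What I want to do here is grab the titles and descriptions from the tutorial list, then navigate to each URL and grab the categories in parse2.

However, the result I get looks like this: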

category,desc 
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html 
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html 
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html 

How do I get the information I want?

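In the parse method, which handles the response for start_urls, you need to yield a scrapy.Request rather than a dict. Also, I would rather loop over the course items and extract the information for each course separately.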
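
To illustrate why per-course extraction is safer than zipping three page-wide lists, here is a small sketch. It uses the standard library's ElementTree as a stand-in for Scrapy selectors, and the markup is hypothetical (the second course deliberately has no description). It is written for Python 3, where zip replaces itertools.izip.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup: the second course has no description.
LISTING = """
<ul>
  <li><h3>SVN for Java Developers</h3><p>Version control basics</p><a href="/svn"/></li>
  <li><h3>WebSocket Programming</h3><a href="/websocket"/></li>
  <li><h3>GameMaker Platformer</h3><p>Physics with GML</p><a href="/gamemaker"/></li>
</ul>
""".strip()
root = ET.fromstring(LISTING)

# Page-wide extraction: three titles, but only two descriptions.
titles = [h3.text for h3 in root.iter("h3")]
descs = [p.text for p in root.iter("p")]
# zip pairs by position and stops at the shortest list, so the second
# title is paired with the third course's description.
zipped = list(zip(titles, descs))

# Per-course extraction: every field is looked up inside its own <li>.
per_course = [
    {
        "title": li.findtext("h3"),
        "desc": li.findtext("p"),  # None when the element is missing
        "link": li.find("a").get("href"),
    }
    for li in root.findall("li")
]

print(zipped)
print(per_course)
```

With the page-wide lists, one missing field silently shifts every later value and drops the last course. Looking fields up inside each course element keeps the fields of one course together.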

I'm not sure what you mean by categories. I suppose these are the tags you can see at the bottom of the course details page, but I may be wrong.

Try this code:

# -*- coding: utf-8 -*- 
import scrapy 

class LyndaSpider(scrapy.Spider): 
    name = "lynda" 
    allowed_domains = ["lynda.com"] 
    start_urls = ['https://www.lynda.com/Developer-training-tutorials'] 

    def parse(self, response):
        courses = response.css('ul#category-courses div.card-meta-data')
        for course in courses:
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            request = scrapy.Request(item['link'], callback=self.parse_course)
            request.meta['item'] = item
            yield request

    def parse_course(self, response):
        item = response.meta['item']
        #item['categories'] = response.css('div.tags a em::text').extract()
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item
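
As for why the original spider wrote <GET ...> into the category column: a Request object only describes a fetch. Scrapy downloads it and runs its callback only when the Request itself is yielded from a callback. A Request stored inside a dict is exported using its string form. The sketch below is a toy model of that behavior and not Scrapy's actual engine; the Request and Response classes, the crawl function, and the PAGES table are all hypothetical stand-ins so that it runs without Scrapy or network access.

```python
from collections import deque


class Request:
    """Toy stand-in for scrapy.Request: describes a fetch, not its result."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

    def __repr__(self):
        return "<GET %s>" % self.url


class Response:
    """Toy stand-in for a downloaded page."""

    def __init__(self, url, body, meta):
        self.url = url
        self.body = body
        self.meta = meta


# Hypothetical pages keyed by URL, used instead of real downloads.
PAGES = {
    "https://example.com/list": "course-1",
    "https://example.com/course-1": "Java",
}


def crawl(start_url, parse):
    """Toy engine: follow yielded Requests, collect everything else as items."""
    items = []
    queue = deque([Request(start_url, parse)])
    while queue:
        request = queue.popleft()
        response = Response(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)  # scheduled; its callback runs later
            else:
                items.append(result)  # exported as-is
    return items


def parse_wrong(response):
    # Mirrors the question: the Request sits inside the item and never runs.
    link = "https://example.com/" + response.body
    yield {"desc": link, "category": Request(link, parse_course)}


def parse_right(response):
    # Mirrors the answer: yield the Request, carry the partial item in meta.
    link = "https://example.com/" + response.body
    yield Request(link, parse_course, meta={"item": {"link": link}})


def parse_course(response):
    item = response.meta["item"]
    item["category"] = response.body
    yield item


wrong_items = crawl("https://example.com/list", parse_wrong)
right_items = crawl("https://example.com/list", parse_right)
print(wrong_items)
print(right_items)
```

In the first run the category field still holds the unexecuted Request, which is what ended up in the CSV. In the second run the item is completed inside parse_course and only then emitted.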

Hi, thanks for your answer. If you click on any course, you will see them in the top left corner.

You mean the breadcrumb navigation? I edited the answer to reflect your specification; it now extracts the text from the last navigation item.

Thanks. Anyway, I fixed it. You are a lifesaver. By the way, would you know how to do a two-step login?