Parsing the results of other requests in Scrapy
Question:
I am trying to scrape lynda.com courses and store their information in a CSV file. Here is my code:
# -*- coding: utf-8 -*-
import scrapy
import itertools


class LyndadevSpider(scrapy.Spider):
    name = 'lyndadev'
    allowed_domains = ['lynda.com']
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        #print(response.url)
        titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
        descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
        links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()
        for title, desc, link in itertools.izip(titles, descs, links):
            #print link
            categ = scrapy.Request(link, callback=self.parse2)
            yield {'desc': link, 'category': categ}

    def parse2(self, response):
        #getting categories by storing the navigation info
        item = response.xpath('//ol[@role="navigation"]').extract()
        return item
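What I am trying to do here is grab the titles and descriptions from the list of tutorials, then navigate to each URL and grab the categories in parse2.
However, the result I get looks like this: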
category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html
How do I get the information I want?
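Answer:
In the parse method, which handles the responses for start_urls, you need to yield a scrapy.Request instead of a dict. Also, I would rather loop over the course items and extract the information for each one separately.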
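I'm not sure what you mean by categories. I assume they are the tags you can see at the bottom of the course details page, but I might be wrong.
Try this code: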
# -*- coding: utf-8 -*-
import scrapy


class LyndaSpider(scrapy.Spider):
    name = "lynda"
    allowed_domains = ["lynda.com"]
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        # One selector per course card, so title, desc and link stay together
        courses = response.css('ul#category-courses div.card-meta-data')
        for course in courses:
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            # Carry the partially filled item along to the course page callback
            request = scrapy.Request(item['link'], callback=self.parse_course)
            request.meta['item'] = item
            yield request

    def parse_course(self, response):
        item = response.meta['item']
        #item['categories'] = response.css('div.tags a em::text').extract()
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item
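Hi. Thanks for your answer. If you click on any course, you will see in the top-left corner ...
Do you mean the breadcrumb navigation? I edited the answer to reflect your specification; it now extracts the text from the last navigation item.
Thanks. Anyway, I fixed it. You are a lifesaver. By the way, would you know how to do a two-stage login?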