Python crawlers: the Scrapy library (1)

This library will be a long-term learning topic in this series.
The MOOC example could not write its output to a txt file, so I found a related example online and adapted it.
I am working in VS Code to crawl the Top 100 American TV series data from meijutt.com.
In a terminal (cmd), type:

```
> scrapy startproject movie
> cd movie
> scrapy genspider meiju meijutt.com
```

Note that `genspider` takes both a spider name and a domain; the domain ends up in the spider's `allowed_domains`.
First comes items.py, where we add the `name` field:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
```
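A `scrapy.Item` is used with dict syntax, but it only accepts the fields declared on the class. As a rough, Scrapy-free sketch of that behavior (the `Field`/`Item` classes below are a toy stand-in, not Scrapy's real implementation):

```python
# Toy re-creation of the Item/Field pattern: declared Fields become the
# only keys the item will accept, and values use dict-style access.

class Field(dict):
    """Stand-in for scrapy.Field: just a metadata container."""

class Item:
    fields = {}

    def __init__(self):
        self._values = {}

    def __init_subclass__(cls):
        # Collect every Field declared on the subclass.
        cls.fields = {k: v for k, v in cls.__dict__.items()
                      if isinstance(v, Field)}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class MovieItem(Item):
    name = Field()

item = MovieItem()
item['name'] = 'Westworld'
print(item['name'])            # Westworld
# item['year'] = 2016 would raise KeyError, just like in Scrapy
```

This is why `item['name'] = each_movie` works in the spider below while a typo in the field name fails loudly.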
Next is meiju.py. Here a CSS selector pulls the `title` attribute out of each `a` tag, and `.extract()` returns the matches as a list of strings:
```python
# -*- coding: utf-8 -*-
import scrapy

from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        # Grab every <a title="..."> on the page; skip the first two,
        # which are navigation links rather than shows.
        movies = response.css('a::attr(title)').extract()[2:]
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie
            yield item
```
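To see what `response.css('a::attr(title)').extract()` is doing, here is the same extraction sketched with the stdlib `html.parser` instead of Scrapy's selectors. The sample HTML is invented for illustration; the real new100.html markup will differ:

```python
# Collect the title attribute of every <a> tag, the way the spider's
# CSS selector does. Sample markup is made up.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'title':
                    self.titles.append(value)

sample = '''
<a href="/" title="首页">home</a>
<a href="/new100.html" title="最新100部">new</a>
<a href="/detail/1.html" title="Westworld">link</a>
<a href="/detail/2.html" title="Stranger Things">link</a>
'''
parser = TitleCollector()
parser.feed(sample)

# The spider slices off the first two titles ([2:]) because the page's
# opening links are navigation, not shows.
print(parser.titles[2:])       # ['Westworld', 'Stranger Things']
```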
Then, in pipelines.py, we write the data to a file.
At first I opened the file in 'w' mode, but `process_item` runs once per item and 'w' truncates the file on every open, so only the last line survived and I ended up with a single record. Append mode 'a' keeps every line.
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):

    def process_item(self, item, spider):
        # 'a' appends; utf-8 keeps the Chinese titles intact
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item
```
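The 'w'-mode pitfall is easy to reproduce on its own: each `open(..., 'w')` truncates the file, and the pipeline opens the file once per item. A quick comparison with a temporary file (the file name and sample titles are made up):

```python
# 'w' truncates on every open; 'a' appends. Simulate one open() per
# item, the way process_item behaves.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
items = ['Westworld', 'Stranger Things', 'The Crown']

for name in items:                       # 'w': truncate each time
    with open(path, 'w') as fp:
        fp.write(name + '\n')
print(open(path).read().splitlines())    # ['The Crown']

os.remove(path)
for name in items:                       # 'a': append each time
    with open(path, 'a') as fp:
        fp.write(name + '\n')
print(open(path).read().splitlines())    # all three titles
```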
Finally, enable the pipeline in settings.py:

```python
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,
}
```
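Before running the whole crawl, the pipeline logic can be sanity-checked by calling `process_item` by hand. A plain dict stands in for the `MovieItem` (both support `item['name']`), and `None` for the spider; the sample titles are invented:

```python
# Exercise MoviePipeline.process_item directly, outside of Scrapy.
import os

if os.path.exists("my_meiju.txt"):
    os.remove("my_meiju.txt")            # start from a clean file

class MoviePipeline(object):
    def process_item(self, item, spider):
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item

pipeline = MoviePipeline()
for name in ['Westworld', 'Stranger Things']:
    pipeline.process_item({'name': name}, spider=None)

print(open("my_meiju.txt", encoding='utf-8').read().splitlines())
# ['Westworld', 'Stranger Things']
```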
Run `scrapy crawl meiju` inside the project directory; the crawl results end up in my_meiju.txt: