Python web scraping: the Scrapy library (1)

This library will be covered here as a long-term learning topic.
The example from the MOOC could not write its output to a txt file, so I found a related example online and adapted it.
I used the VS Code environment to crawl the Top 100 (T100) US TV series data.
In the terminal (cmd), type:

>scrapy startproject movie
>cd movie
>scrapy genspider meiju meijutt.com
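For orientation, startproject generates a standard skeleton (the exact file list varies slightly across Scrapy versions), and genspider then adds meiju.py under spiders/:

movie/
    scrapy.cfg            # deploy configuration
    movie/
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py      # item pipelines (edited below)
        settings.py       # project settings (edited below)
        spiders/
            __init__.py
            meiju.py      # created by genspider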

First comes items.py, where we add a name field:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
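An Item behaves like a dict, with the restriction that only declared fields can be assigned. A quick illustration (hypothetical snippet, not part of the project files):

item = MovieItem()
item['name'] = 'Westworld'   # fine: 'name' is a declared Field
item['year'] = 2016          # KeyError: 'year' was never declared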

Next is meiju.py. Here a CSS selector looks up the title attribute of the a tags, and .extract() returns the matches as a list:

# -*- coding: utf-8 -*-
import scrapy
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']
 
    def parse(self, response):
        # grab every <a title="..."> on the page; the first two matches
        # are not show entries, so slice them off
        movies = response.css('a::attr(title)').extract()[2:]
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie
            yield item
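As an aside, recent Scrapy versions prefer .getall() over .extract() (both return the same list of strings), and scoping the selector to the list's container element would make the [2:] slice unnecessary. A minimal sketch, assuming a hypothetical container selector since I have not re-inspected the page markup:

    def parse(self, response):
        # 'ul.top-list' is a placeholder; substitute the real container
        # from the page's HTML so only show links are matched
        for title in response.css('ul.top-list a::attr(title)').getall():
            item = MovieItem()
            item['name'] = title
            yield item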

In pipelines.py we write the data to a file.
At first I used 'w' mode, but because process_item reopens the file for every item, 'w' truncates it each time and only the last item survives, so I ended up with a single record. Appending with 'a' fixes that (see also the open-once variant sketched after the code).

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html 


class MoviePipeline(object):
    def process_item(self, item, spider):
        # 'a' appends so earlier items survive; utf-8 handles non-ASCII titles
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item  # return the item so later pipeline stages can see it
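A cleaner pattern is to open the file once per crawl via the open_spider/close_spider pipeline hooks; 'w' mode becomes safe again because the truncation happens only once, and the file is not reopened for every item. A minimal sketch:

class MoviePipeline(object):
    def open_spider(self, spider):
        # runs once when the crawl starts; truncating here is harmless
        self.fp = open("my_meiju.txt", 'w', encoding='utf-8')

    def close_spider(self, spider):
        # runs once when the crawl ends
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(str(item['name']) + '\n')
        return item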

Finally, enable the pipeline in settings.py:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,
}
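The value 100 is the pipeline's order: Scrapy accepts integers from 0 to 1000 here, and lower numbers run earlier, which only matters once several pipelines are enabled. With everything wired up, run the spider from the project root:

>scrapy crawl meiju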

Crawl result: the scraped show titles end up in my_meiju.txt, one per line.