Python crawlers: the Scrapy library (1)

This library will be a long-term learning topic in this series.
The MOOC example could not write its output to a txt file, so I found a related example online and adapted it.
I am working in VS Code to crawl the Top 100 American TV series data from meijutt.com.
In a terminal (cmd), type:

```
> scrapy startproject movie
> cd movie
> scrapy genspider meiju meijutt.com
```

Note that `genspider` takes both a spider name and a domain; the domain ends up in the spider's `allowed_domains`.
First comes items.py, where we add the `name` field:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
```
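A `scrapy.Item` is used with dict syntax, but it only accepts the fields declared on the class. As a rough, Scrapy-free sketch of that behavior (the `Field`/`Item` classes below are a toy stand-in, not Scrapy's real implementation):

```python
# Toy re-creation of the Item/Field pattern: declared Fields become the
# only keys the item will accept, and values use dict-style access.

class Field(dict):
    """Stand-in for scrapy.Field: just a metadata container."""

class Item:
    fields = {}

    def __init__(self):
        self._values = {}

    def __init_subclass__(cls):
        # Collect every Field declared on the subclass.
        cls.fields = {k: v for k, v in cls.__dict__.items()
                      if isinstance(v, Field)}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class MovieItem(Item):
    name = Field()

item = MovieItem()
item['name'] = 'Westworld'
print(item['name'])            # Westworld
# item['year'] = 2016 would raise KeyError, just like in Scrapy
```

This is why `item['name'] = each_movie` works in the spider below while a typo in the field name fails loudly.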
Next is meiju.py. Here a CSS selector pulls the `title` attribute out of each `a` tag, and `.extract()` returns the matches as a list of strings:
```python
# -*- coding: utf-8 -*-
import scrapy

from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        # Grab every <a title="..."> on the page; skip the first two,
        # which are navigation links rather than shows.
        movies = response.css('a::attr(title)').extract()[2:]
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie
            yield item
```
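To see what `response.css('a::attr(title)').extract()` is doing, here is the same extraction sketched with the stdlib `html.parser` instead of Scrapy's selectors. The sample HTML is invented for illustration; the real new100.html markup will differ:

```python
# Collect the title attribute of every <a> tag, the way the spider's
# CSS selector does. Sample markup is made up.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'title':
                    self.titles.append(value)

sample = '''
<a href="/" title="首页">home</a>
<a href="/new100.html" title="最新100部">new</a>
<a href="/detail/1.html" title="Westworld">link</a>
<a href="/detail/2.html" title="Stranger Things">link</a>
'''
parser = TitleCollector()
parser.feed(sample)

# The spider slices off the first two titles ([2:]) because the page's
# opening links are navigation, not shows.
print(parser.titles[2:])       # ['Westworld', 'Stranger Things']
```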
Then, in pipelines.py, we write the data to a file.
At first I opened the file in 'w' mode, but `process_item` runs once per item and 'w' truncates the file on every open, so only the last line survived and I ended up with a single record. Append mode 'a' keeps every line.
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):

    def process_item(self, item, spider):
        # 'a' appends; utf-8 keeps the Chinese titles intact
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item
```
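The 'w'-mode pitfall is easy to reproduce on its own: each `open(..., 'w')` truncates the file, and the pipeline opens the file once per item. A quick comparison with a temporary file (the file name and sample titles are made up):

```python
# 'w' truncates on every open; 'a' appends. Simulate one open() per
# item, the way process_item behaves.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
items = ['Westworld', 'Stranger Things', 'The Crown']

for name in items:                       # 'w': truncate each time
    with open(path, 'w') as fp:
        fp.write(name + '\n')
print(open(path).read().splitlines())    # ['The Crown']

os.remove(path)
for name in items:                       # 'a': append each time
    with open(path, 'a') as fp:
        fp.write(name + '\n')
print(open(path).read().splitlines())    # all three titles
```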
Finally, enable the pipeline in settings.py:

```python
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,
}
```
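Before running the whole crawl, the pipeline logic can be sanity-checked by calling `process_item` by hand. A plain dict stands in for the `MovieItem` (both support `item['name']`), and `None` for the spider; the sample titles are invented:

```python
# Exercise MoviePipeline.process_item directly, outside of Scrapy.
import os

if os.path.exists("my_meiju.txt"):
    os.remove("my_meiju.txt")            # start from a clean file

class MoviePipeline(object):
    def process_item(self, item, spider):
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item

pipeline = MoviePipeline()
for name in ['Westworld', 'Stranger Things']:
    pipeline.process_item({'name': name}, spider=None)

print(open("my_meiju.txt", encoding='utf-8').read().splitlines())
# ['Westworld', 'Stranger Things']
```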
Run `scrapy crawl meiju` inside the project directory; the crawl results end up in my_meiju.txt: