Web Scraping with Scrapy

Objective

1. Use Scrapy to write a web crawler

Requirements

1. Use Scrapy to write a web crawler

Procedure

Install the Scrapy library

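Open PyCharm, choose Settings from the File menu, and add the Python interpreter that ships with Anaconda.

Click Project Interpreter, choose Show All in the right-hand pane, click the + button, and browse to the Python interpreter in the Anaconda installation to add it.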

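After clicking OK, confirm that this interpreter is selected under Project Interpreter. Then click the + button below the package list, type scrapy in the search box, and install the scrapy package.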
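One way to confirm that the package is visible to the selected interpreter is a quick check from the Python console. The snippet below is a minimal sketch that uses only the standard library; the helper name is_installed is illustrative and not part of Scrapy or PyCharm.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the active interpreter can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once scrapy has been installed into this interpreter.
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False, the project is probably still using a different interpreter than the one the package was installed into.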

Create a project named stockstar

Open cmd and enter scrapy startproject stockstar.

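Open the stockstar project in PyCharm

Choose Open from the File menu, select the directory where stockstar is stored, and open it.

The directory structure of stockstar is shown below: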
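To inspect the generated layout directly, a small helper can print the tree. This is a sketch using only the standard library; the function name render_tree and the assumption that the script runs from the directory containing stockstar are illustrative.

```python
from pathlib import Path
from typing import List


def render_tree(root: Path, indent: str = "") -> List[str]:
    """Return one line per entry under root, directories first, children indented."""
    lines: List[str] = []
    for entry in sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name)):
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{indent}{entry.name}{suffix}")
        if entry.is_dir():
            lines.extend(render_tree(entry, indent + "    "))
    return lines


if __name__ == "__main__":
    project = Path("stockstar")  # assumed location of the generated project
    if project.is_dir():
        print("\n".join(render_tree(project)))
```

For a freshly generated project this typically lists scrapy.cfg plus an inner stockstar package containing items.py, middlewares.py, pipelines.py, settings.py, and a spiders directory.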

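In items.py, add fields for the stock code, stock abbreviation, latest price, percentage change, price change, 5-minute percentage change, trading volume, and turnover.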

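import scrapy
from scrapy.loader import ItemLoader
# Recent Scrapy releases provide this in itemloaders.processors instead
from scrapy.loader.processors import TakeFirst


class StockstarItemLoader(ItemLoader):
    # Loader imported by the spider; assumed to keep only the first
    # extracted value for each field
    default_output_processor = TakeFirst()


class StockstarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    code = scrapy.Field()  # stock code
    abbr = scrapy.Field()  # stock abbreviation
    last_trade = scrapy.Field()  # latest price
    chg_ratio = scrapy.Field()  # percentage change
    chg_amt = scrapy.Field()  # price change
    chg_ratio_5min = scrapy.Field()  # 5-minute percentage change
    volumn = scrapy.Field()  # trading volume
    turn_over = scrapy.Field()  # turnover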
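The TakeFirst processor imported in items.py matters because a selector always returns a list of matches, while each field here should hold a single value. The following is a pure-Python stand-in that mimics the documented behavior (return the first value that is neither None nor an empty string); take_first is a hypothetical name for illustration, not the Scrapy class itself.

```python
from typing import Any, Iterable, Optional


def take_first(values: Iterable[Any]) -> Optional[Any]:
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


if __name__ == "__main__":
    # A selector may yield several text nodes; only the first usable one is kept.
    extracted = ["", "000001"]  # made-up sample data
    print(take_first(extracted))
```

Without such an output processor, every field of the loaded item would be a list rather than a single string.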

Configure the crawler in settings.py: define a JSON Lines exporter that can display Chinese text, and set the delay between requests to 0.25 seconds.

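# By default, exported Chinese text appears as hard-to-read Unicode escapes.
from scrapy.exporters import JsonLinesItemExporter


# Define a subclass that keeps the original characters
# (pass ensure_ascii=False to the parent class).
class CustomJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(CustomJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)


# Enable the newly defined exporter class (a module-level setting, not a class attribute)
FEED_EXPORTERS = {
    'json': 'stockstar.settings.CustomJsonLinesItemExporter',
}

# Wait 0.25 seconds between requests
DOWNLOAD_DELAY = 0.25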
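The effect of ensure_ascii=False can be seen with the standard json module alone, which exposes the same option. This is a small illustration with a made-up sample row; it does not use Scrapy itself.

```python
import json

# Hypothetical sample row, for illustration only.
row = {"code": "000001", "abbr": "平安银行"}

escaped = json.dumps(row, ensure_ascii=True)    # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(row, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs, which is why the setting affects readability and nothing else.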

Write the crawler logic. On the command line, enter:

cd stockstar

scrapy genspider stock quote.stockstar.com

A file named stock.py appears in the stockstar/spiders directory; write the crawler logic in this file.

In StockSpider, define the spider name, the allowed domain, and the start URL.

# -*- coding: utf-8 -*-
import scrapy
from stockstar.items import StockstarItem, StockstarItemLoader


class StockSpider(scrapy.Spider):
    name = 'stock'  # spider name
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/ranklist_a_3_1_1.html']  # start URL

    def parse(self, response):
        # The page number is the last underscore-separated token of the URL
        page = int(response.url.split("_")[-1].split(".")[0])
        item_nodes = response.css('#datalist tr')
        for item_node in item_nodes:
            # Build one item per table row from its cells
            item_loader = StockstarItemLoader(item=StockstarItem(), selector=item_node)
            item_loader.add_css("code", "td:nth-child(1) a::text")
            item_loader.add_css("abbr", "td:nth-child(2) a::text")
            item_loader.add_css("last_trade", "td:nth-child(3) span::text")
            item_loader.add_css("chg_ratio", "td:nth-child(4) span::text")
            item_loader.add_css("chg_amt", "td:nth-child(5) span::text")
            item_loader.add_css("chg_ratio_5min", "td:nth-child(6) span::text")
            item_loader.add_css("volumn", "td:nth-child(7) ::text")
            item_loader.add_css("turn_over", "td:nth-child(8) ::text")
            stock_item = item_loader.load_item()
            yield stock_item
        # Keep requesting the next page until a page contains no rows
        if item_nodes:
            next_page = page + 1
            next_url = response.url.replace("{0}.html".format(page), "{0}.html".format(next_page))
            yield scrapy.Request(url=next_url, callback=self.parse)
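The pagination step in parse relies only on string handling, so it can be tried in isolation. The sketch below repeats the same split-and-replace logic outside Scrapy; the function name next_page_url is illustrative.

```python
def next_page_url(current_url: str) -> str:
    """Derive the next ranking page from a URL ending in _<page>.html."""
    # The page number is the last underscore-separated token, minus the extension.
    page = int(current_url.split("_")[-1].split(".")[0])
    return current_url.replace(f"{page}.html", f"{page + 1}.html")


if __name__ == "__main__":
    print(next_page_url("http://quote.stockstar.com/stock/ranklist_a_3_1_1.html"))
```

Because the spider only issues the next request when the current page contained rows, the crawl ends at the first page that has no rows.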

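Create a new file main.py that invokes the Scrapy command line from Python and runs the command scrapy crawl stock -o items.json, which saves all the scraped data to the file items.json.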

from scrapy.cmdline import execute

execute(["scrapy","crawl","stock","-o","items.json"])

After clicking Run, the console displays output such as the following.

The scraped data is saved in items.json.
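To check the result, the file can be read back with the standard library. This sketch assumes that items.json holds one JSON object per line, as a JSON Lines exporter writes it; if the file instead contains a single JSON array, json.load on the whole file would be the appropriate call. The function name load_items is illustrative.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_items(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON Lines file and return one dict per non-empty line."""
    items: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


if __name__ == "__main__":
    output = Path("items.json")  # file produced by the crawl
    if output.exists():
        rows = load_items(output)
        print(f"loaded {len(rows)} items")
```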
