How to Use the Scrapy Crawler Framework

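1. Installing Scrapy

Scrapy requires a Python environment, so install Python first and then install Scrapy.

Run: pip install scrapy

If Python was only just installed, this step may fail because the bundled pip is too old. In that case, upgrade pip first by running the following in cmd: python -m pip install --upgrade pip

After installation, run the command scrapy. If it prints the version information, the installation succeeded.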
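The same check can also be done from Python. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this section talks about
for name in ("pip", "scrapy"):
    print(name, installed_version(name) or "not installed")
```

If scrapy is reported as not installed, the pip command above did not install into the interpreter that is running this script.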

2. Common Scrapy Commands

crawl: runs a spider and starts scraping data. scrapy crawl xxx (spider name)

genspider: creates a spider file. scrapy genspider xxx (spider name) xxx.com (domain to crawl)

list: lists the names of all spiders in the current project.

startproject: creates a Scrapy project. scrapy startproject xxx (project name)

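3. Creating a Crawler Project

As an example, we scrape job postings from the Tencent recruitment site and store them in a JSON file.

Create the project: scrapy startproject test2_spider

Once created, the project directory contains scrapy.cfg and a test2_spider package with items.py, middlewares.py, pipelines.py, settings.py and a spiders directory.

Enter the project directory: cd test2_spider

Create the spider: scrapy genspider Tencent hr.tencent.com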
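The two creation commands can also be scripted. This is only a sketch, assuming Scrapy is installed for the current interpreter so that it can be started as python -m scrapy; the helper names scrapy_argv and create_project are invented for this example.

```python
import subprocess
import sys


def scrapy_argv(*args):
    """Build an argument list that runs Scrapy with the current interpreter."""
    return [sys.executable, "-m", "scrapy", *args]


def create_project(project, spider, domain, run=subprocess.run):
    """Create a project, then a spider inside it.

    `run` is injectable so the command sequence can be inspected without executing it.
    """
    run(scrapy_argv("startproject", project), check=True)
    # Run genspider inside the project so the spider lands in its spiders package
    run(scrapy_argv("genspider", spider, domain), cwd=project, check=True)
```

Calling create_project("test2_spider", "Tencent", "hr.tencent.com") would issue the same two commands as above, with cwd standing in for the cd step.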

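Analyze the structure of the target site. A browser XPath plugin is recommended for verifying the expressions.

Use XPath to inspect element values.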
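XPath expressions can also be tried offline against a saved fragment of the page. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for the browser plugin; it supports only a subset of XPath (no | union, for instance) and needs well-formed markup. The sample table is hypothetical and only imitates a listing with alternating "even" and "odd" rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the real site)
SAMPLE = """
<table>
  <tr class="h"><td>Title</td><td>Category</td></tr>
  <tr class="even"><td><a href="p1">Backend Engineer</a></td><td>Technology</td></tr>
  <tr class="odd"><td><a href="p2">Product Designer</a></td><td>Design</td></tr>
</table>
"""


def job_rows(markup):
    """Return (title, category) pairs for every row whose class is even or odd."""
    root = ET.fromstring(markup.strip())
    rows = [tr for tr in root.iter("tr") if tr.get("class") in ("even", "odd")]
    # td[1]/a and td[2] are relative paths, evaluated against each row in turn
    return [(tr.findtext("td[1]/a"), tr.findtext("td[2]")) for tr in rows]


print(job_rows(SAMPLE))
```

Relative paths evaluated per row are what keep the fields of one job posting together.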

Edit items.py to define the fields to scrape:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Test2SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Job title
    positionName = scrapy.Field()
    # Job category
    positionType = scrapy.Field()
    # Number of openings
    personNumber = scrapy.Field()
    # Work location
    workBase = scrapy.Field()
    # Publish date
    publishTime = scrapy.Field()
Write the spider:
# -*- coding: utf-8 -*-
import scrapy
from test2_spider.items import Test2SpiderItem

class TencentSpider(scrapy.Spider):
    # Spider name, used with "scrapy crawl"
    name = 'Tencent'
    # Domains the spider may crawl: bare domain names, not URLs
    allowed_domains = ['hr.tencent.com']
    # URL of the first page to crawl
    start_urls = ['https://hr.tencent.com/position.php?&start=#a0']

    def parse(self, response):
        # Job rows alternate between the "even" and "odd" classes
        node_list = response.xpath("//tr[@class='even']|//tr[@class='odd']")
        for node in node_list:
            item = Test2SpiderItem()
            # get(default=...) returns the first match, or the default when nothing matches
            item['positionName'] = node.xpath("./td[1]/a/text()").get(default='N/A')
            item['positionType'] = node.xpath("./td[2]/text()").get(default='N/A')
            item['personNumber'] = node.xpath("./td[3]/text()").get(default='N/A')
            item['workBase'] = node.xpath("./td[4]/text()").get(default='N/A')
            item['publishTime'] = node.xpath("./td[5]/text()").get(default='N/A')
            yield item
        # Follow the "next" link unless it is disabled (class "noactive") or missing.
        # When no request is yielded, the crawl ends on its own.
        next_href = response.xpath("//a[@id='next' and not(@class='noactive')]/@href").get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
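The next-page step comes down to one decision: stop when the link is disabled or absent, otherwise turn its href into an absolute URL. Here is a standalone sketch of that decision using only the standard library; the function name next_page_url and the sample href are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href, disabled):
    """Return the absolute URL of the next page, or None when the crawl should stop."""
    if disabled or not href:
        return None
    # urljoin resolves the href relative to the page it was found on
    return urljoin(page_url, href)


current = "https://hr.tencent.com/position.php?&start=#a0"
print(next_page_url(current, "position.php?start=10#a10", disabled=False))
print(next_page_url(current, "javascript:;", disabled=True))
```

Resolving against the current page URL avoids hard-coding the site prefix and also copes with hrefs that are already absolute.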
Edit pipelines.py to save each item passed in to a JSON file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
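import json


class Test2SpiderPipeline(object):
    def __init__(self):
        # utf-8 so that text written with ensure_ascii=False is stored correctly
        self.f = open("tencent.json", "w", encoding="utf-8")
        self.f.write("[\n")
        self.first = True

    def process_item(self, item, spider):
        # Put a comma between items only, so the file is one valid JSON array
        if not self.first:
            self.f.write(",\n")
        self.first = False
        self.f.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        # Close the array, then the file
        self.f.write("\n]\n")
        self.f.close()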
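To check the result, the file can be read back line by line. The loader below is a sketch that assumes one JSON object per line, with or without a trailing comma or enclosing brackets; the name load_items is made up for this example.

```python
import json


def load_items(path):
    """Read a file that holds one JSON object per line and return them as dicts."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Drop surrounding whitespace and a trailing item separator, if any
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            items.append(json.loads(line))
    return items
```

For example, len(load_items("tencent.json")) gives the number of postings that were saved.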
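Edit settings.py to set the relevant options.

ROBOTSTXT_OBEY controls whether the crawler obeys the target site's robots.txt rules. When it is enabled, Scrapy skips every URL those rules disallow, so some data cannot be scraped.

These three settings are commented out by default and need to be enabled. Once that is done, settings.py is complete.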
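The settings themselves are not reproduced in the text, so the excerpt below is a guess at what the edited part of settings.py looks like. It assumes the first option is ROBOTSTXT_OBEY and that the three commented-out lines are the ITEM_PIPELINES block from the generated template; the priority 300 is the template's default value.

```python
# settings.py (excerpt, assumed)

# Do not filter requests through the site's robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the pipeline; lower numbers run earlier when several are listed
ITEM_PIPELINES = {
    "test2_spider.pipelines.Test2SpiderPipeline": 300,
}
```

Without the ITEM_PIPELINES entry, Scrapy never calls the pipeline and no JSON file is written.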

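Run the spider: scrapy crawl Tencent

On a successful crawl, Scrapy prints its closing statistics and tencent.json is written to the directory the command was run from.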

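4. Summary

Writing a simple Scrapy crawler involves these main steps:

Create the project

Create the spider

Define the fields for the scraped data in items.py

Write the spider

Write the pipeline in pipelines.py

Edit settings.py

Run the spider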
