Scraping Taobao pages with scrapy-splash

  • Development environment
    • Windows 10
    • Python 3
    • VS Code
    • Docker
  • Installing Docker
    • Download Docker Toolbox
    • There are plenty of installation guides online, so the steps are not repeated here
  • Installing scrapy-splash
    • pip install scrapy-splash
    • Running Splash
      • Open the Docker Quickstart Terminal and run the following command:
      • docker run -d -p 8050:8050 hub-mirror.c.163.com/scrapinghub/splash
      • The image here is pulled from the 163 mirror, so make sure the image address matches the registry or mirror you actually use
      • Once the container starts without errors, open the Splash address in a browser to check that it is running; the right-hand panel lets you write and test your own Lua scripts, and a few sample scripts are bundled for reference
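A quick way to sanity-check the container is to open Splash's render.html endpoint in a browser. The small helper below (my own convenience function, not part of scrapy-splash) just builds such a URL; the host 192.168.99.100:8050 is the Docker Toolbox VM address used in this setup, so adjust it to your own:

```python
from urllib.parse import urlencode

def splash_render_url(splash_base, target_url, wait=2):
    """Build a Splash render.html URL that renders target_url after a short wait."""
    query = urlencode({'url': target_url, 'wait': wait})
    return splash_base.rstrip('/') + '/render.html?' + query

# Paste the result into a browser: if Splash is up, it returns the rendered page.
print(splash_render_url('http://192.168.99.100:8050', 'https://www.taobao.com'))
```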
  • Page structure analysis
    • The target is the Taobao consumer-electronics page at https://www.taobao.com/markets/3c/tbdc?spm=a21bo.2017.201867-main.12.5af911d9GQgDTx ;
    • Taobao loads product prices dynamically with JavaScript, so a purely static crawl cannot capture them.
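Why the static crawl fails can be sketched with a toy example: in the HTML the server sends, the price node is an empty placeholder that client-side JavaScript fills in later, so only the rendered DOM contains the text (the markup below is illustrative, not the real page structure):

```python
import re

# Hypothetical simplification: what the server sends vs. what JS produces.
static_html   = '<li><a href="/item/1"><p class="price"></p></a></li>'
rendered_html = '<li><a href="/item/1"><p class="price">4999</p></a></li>'

def price_text(html):
    """Return the text inside the price node (empty string if JS never ran)."""
    return re.search(r'<p class="price">([^<]*)</p>', html).group(1)

print(repr(price_text(static_html)))    # '' -- nothing to scrape statically
print(repr(price_text(rendered_html)))  # '4999'
```

This is exactly the gap Splash closes: it executes the JavaScript and hands the spider the rendered HTML.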
  • Code implementation
    • Create the project and a spider with the scrapy commands:
scrapy startproject taobao

cd taobao

scrapy genspider taobao_phone taobao.com
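Running those commands leaves the standard scrapy project layout, with taobao_phone.py as the generated spider stub that the code below fills in:

```
taobao/
├── scrapy.cfg
└── taobao/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── taobao_phone.py
```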
  • First, configure settings.py
    • Add the Splash server address (here, the Docker Toolbox VM's IP):
    • SPLASH_URL = "http://192.168.99.100:8050"
    • Add the Splash middlewares to DOWNLOADER_MIDDLEWARES; to make the crawler harder to detect, you can also set your own User-Agent;
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        'taobao.middlewares.TaobaoDownloaderMiddleware': 543,
    }
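The User-Agent override mentioned above can be set globally in settings.py; the value below is just an example desktop Chrome string, and any current browser UA works:

```python
# settings.py -- example User-Agent override (sample value, pick your own)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36')
```

Note that Scrapy also checks robots.txt by default; the ROBOTSTXT_OBEY setting controls that behaviour.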
    
    • Add SPIDER_MIDDLEWARES:
    SPIDER_MIDDLEWARES = {
        'taobao.middlewares.TaobaoSpiderMiddleware': 543,
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    
    • Add the Splash-aware DUPEFILTER_CLASS for deduplication, plus the HTTP-cache settings:
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 0
    HTTPCACHE_DIR = 'httpcache'
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
  • Full taobao_phone.py code
import scrapy

from scrapy_splash import SplashRequest


class TaobaoPhoneSpider(scrapy.Spider):
    name = 'taobao_phone'
    allowed_domains = ['taobao.com']
    start_urls = ['http://taobao.com/']


    def start_requests(self):
        with open('taobao.lua', encoding='utf-8') as f:
            script = f.read()
        url = 'https://www.taobao.com/markets/3c/tbdc?spm=a21bo.2017.201867-main.12.5af911d9GQgDTx'
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': script, 'url': url})

    def parse(self, response):
        # response.text is the Splash-rendered HTML
        print("body is ::::::::::::", response.text)
        results = response.css("div.parttwo-mid li")

        for item in results:
            print(item.css("a p::text").extract())
            print(item.css('a::attr(href)').extract_first())
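The two selectors in parse grab each item's caption text and link. Their effect can be sketched with the standard library against a made-up snippet of the div.parttwo-mid li structure (the markup below is illustrative, not the real page):

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="parttwo-mid">
  <li><a href="//detail.tmall.com/item1"><p>Phone A</p></a></li>
  <li><a href="//detail.tmall.com/item2"><p>Phone B</p></a></li>
</div>
"""

class ItemExtractor(HTMLParser):
    """Collects (href, text) pairs for <a><p>text</p></a> items."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.items.append((self._href, data.strip()))

parser = ItemExtractor()
parser.feed(SAMPLE)
print(parser.items)
```

The spider itself is run from the project directory with scrapy crawl taobao_phone, and taobao.lua must be present in the working directory since it is opened with a relative path.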

  • taobao.lua, the standalone Lua script
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0 Chrome/69.0.3497.100 Safari/537.36")
    splash:go(args.url)
    splash:wait(5)
    return {html=splash:html()}
end