- Development environment
  - Windows 10
  - Python 3
  - VS Code
  - Docker
- Installing Docker
  - Download Docker Toolbox.
  - There are plenty of installation tutorials online, so the steps are not repeated here.
- Installing scrapy-splash
  - `pip install scrapy-splash`
- Running Splash
  - Open Docker Quickstart Terminal and run:
    `docker run -d -p 8050:8050 hub-mirror.c.163.com/scrapinghub/splash`
  - The image above is pulled from the 163 mirror; adjust the image address to whichever registry you use.
  - If it starts without errors, open the Splash address in a browser to see it running. You can write your own Lua scripts in the panel on the right to test whether they do what you want; some sample Lua scripts are also bundled for reference.
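To sanity-check the service from a script instead of the browser, you can build a `render.html` URL (a standard Splash HTTP endpoint). This is just a sketch: the `192.168.99.100` host is Docker Toolbox's default VM IP and may differ on your machine, and `https://example.com` is a placeholder target.

```python
from urllib.parse import urlencode

# Host/port match the SPLASH_URL configured later in settings.py;
# adjust them to your own Docker setup.
SPLASH = "http://192.168.99.100:8050"
params = {"url": "https://example.com", "wait": 2}
check_url = f"{SPLASH}/render.html?{urlencode(params)}"
print(check_url)  # open this in a browser to see the rendered page
```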
- Page structure analysis
  - The target is the Taobao electronics page: https://www.taobao.com/markets/3c/tbdc?spm=a21bo.2017.201867-main.12.5af911d9GQgDTx
  - Taobao loads product prices dynamically, so a plain static crawl cannot capture them.
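The problem can be illustrated with a toy check (the HTML snippets and the `price` class below are made up for illustration, not the page's real markup): the statically downloaded HTML contains an empty price placeholder that JavaScript fills in later, which is exactly why the JS-rendered HTML from Splash is needed.

```python
import re

# Illustrative only: a static fetch sees an empty placeholder,
# while the JS-rendered page (what Splash returns) has the value.
static_html = '<li><a><p class="price"></p></a></li>'
rendered_html = '<li><a><p class="price">1999</p></a></li>'

def has_price(html):
    """Crude check: does the price node contain any digits?"""
    m = re.search(r'class="price">([^<]*)<', html)
    return bool(m and any(c.isdigit() for c in m.group(1)))

print(has_price(static_html))    # False: price is loaded dynamically
print(has_price(rendered_html))  # True: visible only after rendering
```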
- Implementation
  - Create the project and spider:

```
scrapy startproject taobao
cd taobao
scrapy genspider taobao_phone taobao.com
```
- First, configure `settings.py`:
  - Add the Splash server address (this is the Docker Toolbox VM's IP; adjust it to your own setup):
    `SPLASH_URL = "http://192.168.99.100:8050"`
  - Add the Splash middlewares to `DOWNLOADER_MIDDLEWARES`. To make the spider look less like a crawler, you can also set your own User-Agent:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'taobao.middlewares.TaobaoDownloaderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    'taobao.middlewares.TaobaoSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
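The `TaobaoDownloaderMiddleware` entry is where a custom User-Agent can be attached. A minimal sketch of such a middleware, assuming it lives in `taobao/middlewares.py` (the class name and UA strings here are illustrative, not the project's actual middleware):

```python
import random

# Illustrative UA pool; in practice use a longer, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) Chrome/69.0.3497.100 Safari/537.36",
]

class RandomUserAgentMiddleware:
    """Downloader middleware sketch: attach a random User-Agent per request."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let normal downloading continue
```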
- The spider, `taobao_phone.py`:
```python
import scrapy
from scrapy_splash import SplashRequest


class TaobaoPhoneSpider(scrapy.Spider):
    name = 'taobao_phone'
    allowed_domains = ['taobao.com']
    start_urls = ['http://taobao.com/']

    def start_requests(self):
        # Load the Lua script that Splash will execute (taobao.lua, below)
        with open('taobao.lua') as f:
            script = f.read()
        url = 'https://www.taobao.com/markets/3c/tbdc?spm=a21bo.2017.201867-main.12.5af911d9GQgDTx'
        yield SplashRequest(url, self.parse, endpoint='execute',
                            args={'lua_source': script, 'url': url})

    def parse(self, response):
        # Each <li> under div.parttwo-mid is one product entry
        results = response.css("div.parttwo-mid li")
        for item in results:
            print(item.css("a p::text").extract())            # product name / price text
            print(item.css('a::attr(href)').extract_first())  # product link
```
- The `taobao.lua` script:

```lua
function main(splash, args)
    -- Use a browser-like User-Agent before loading the page
    splash:set_user_agent("Mozilla/5.0 Chrome/69.0.3497.100 Safari/537.36")
    splash:go(args.url)
    -- Wait for the dynamically loaded prices to render
    splash:wait(5)
    return {html = splash:html()}
end
```
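Under the hood, a `SplashRequest` with `endpoint='execute'` posts the script and its arguments to Splash's `/execute` HTTP endpoint, and every key in `args` becomes available to the Lua `main` function via `args` (which is how `args.url` above gets its value). A rough sketch of that payload, with the Lua source abbreviated for illustration:

```python
import json

# Sketch of the JSON body scrapy-splash sends to http://<splash>:8050/execute.
# 'lua_source' carries the script; extra keys such as 'url' are exposed to
# Lua as fields of the args table (args.url).
payload = {
    "lua_source": "function main(splash, args) ... end",  # taobao.lua contents
    "url": "https://www.taobao.com/markets/3c/tbdc",
}
body = json.dumps(payload)
print(body)
```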