Introduction to Web Crawlers and a Simple Crawler Demo

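What is a crawler?

Simply put: a program that automatically fetches information from the internet.

The value of crawler technology:

We can use data from the internet for learning, or crawl valuable data and build it into a product that makes money. In short, as long as you stay within the law, what you do with the data is up to you.

In one sentence: internet data, put to my own use!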


Simple crawler architecture:


Run flow:


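URL manager:

Manages the set of URLs waiting to be crawled and the set of URLs already crawled
-- Prevents crawling the same page twice and prevents crawling in a loop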
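
The two sets described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the code from the course or the linked project: the class name UrlManager, its method names, and the example.com URLs are all assumptions made for this sketch.

```python
class UrlManager:
    """Hypothetical URL manager: one set of pending URLs, one set of crawled URLs."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def add_new_url(self, url):
        # Ignore empty values and anything already known, so a page is
        # never queued twice and link cycles cannot loop forever.
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(['http://example.com/a', 'http://example.com/b'])
first = manager.get_new_url()
manager.add_new_url(first)  # already crawled, so it is not queued again
print(len(manager.new_urls), len(manager.old_urls))
```

Python sets make the duplicate check cheap. A larger crawler might keep the same two collections in a database or a cache instead, so that progress survives a restart.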


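Web page downloader:

-- A tool that downloads the web page at a given URL from the internet to the local machine
-- Which web page downloaders are available in Python 3?
urllib (in Python 3.x, urllib2 was merged into urllib, and the download functions live in urllib.request) and requests (a third-party library)
-- Three ways to download a page with urllib
1) The simplest way: urllib.request.urlopen(url)
2) Assemble the url, data, and headers into a urllib.request.Request, then pass it to urllib.request.urlopen(request)

3) Add handlers for special scenarios, such as cookies (HTTPCookieProcessor)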

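import http.cookiejar
from urllib import request

url = 'http://www.baidu.com'

# Method 1: pass the URL straight to urlopen
print('Method 1')
response1 = request.urlopen(url)
print(response1.getcode())      # HTTP status code, 200 on success
print(len(response1.read()))    # length of the page body in bytes

# Method 2: build a Request object so a header can be attached
print('\nMethod 2')
req = request.Request(url)
req.add_header('user-agent', 'Mozilla/5.0')
response2 = request.urlopen(req)
print(response2.getcode())
print(response2.read()[:20])    # first 20 bytes of the body

# Method 3: install an opener that carries a cookie handler
print('\nMethod 3')
cj = http.cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))
request.install_opener(opener)  # later urlopen calls go through this opener
response3 = request.urlopen(url)
print(response3.getcode())
print(cj)                       # cookies the server set during the request
print(response3.read()[:20])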
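
The second method mentions data as well as headers, but the demo above only sets a header. As a rough sketch of how all three pieces fit together, the helper below assembles a Request with form data without sending it. The function name build_request, the example.com URL, and the form field q are assumptions made for illustration only.

```python
from urllib import parse, request


def build_request(url, form=None, headers=None):
    """Hypothetical helper: combine url, data, and headers into one Request."""
    # urlopen expects the request body as bytes, so the form fields are
    # URL-encoded first and then encoded to UTF-8.
    data = parse.urlencode(form).encode('utf-8') if form else None
    return request.Request(url, data=data, headers=headers or {})


req = build_request(
    'http://www.example.com/search',
    form={'q': 'crawler'},
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(req.get_method())  # the method urlopen would use for this request
print(req.data)          # the encoded request body
```

Supplying data makes urllib send the request as a POST instead of a GET. Passing req to request.urlopen would then perform the actual download.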

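I built a simple crawler project based on the above:

Full code: https://gitee.com/wangfuchao/python_simple_crawler.git

These notes were taken while following the imooc video course on developing a simple crawler in Python, taught by 疯狂的蚂蚁crazyant.

Course URL: https://www.imooc.com/learn/563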
