Installing Common Python Web-Scraping Libraries

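Enable QuickEdit mode in the Windows command prompt: right-clicking with text selected copies it, and right-clicking with nothing selected pastes.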

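urllib.request and re are part of Python's standard library and need no installation. requests is a third-party package and is installed with pip: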
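As a quick check that the standard-library pieces work without installing anything, here is a minimal sketch that pulls a page title out of HTML with re. The sample HTML string and the extract_title helper are made up for illustration; in a real crawler the string would come from urllib.request.urlopen.

```python
import re

# Hypothetical sample page. In a real crawler this text would come from
# urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = "<html><head><title>Example Page</title></head><body></body></html>"


def extract_title(html):
    """Return the text inside the first <title> tag, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
```

Regular expressions are fine for a one-off field like this; for anything structural, the parsing libraries installed later in this post are usually the better tool.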

pip install -i https://pypi.doubanio.com/simple requests

pip install -i https://pypi.doubanio.com/simple selenium

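When using selenium with Chrome, a browser version that is too new for the installed chromedriver can cause a "chromedriver.exe has stopped working" error. Either download a chromedriver release that matches the browser, or fall back to an older browser such as Chrome v49.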

>>> import selenium
>>> from selenium import webdriver
>>> driver = webdriver.Chrome() # requires the Chrome driver (after downloading, put the extracted exe file in the Python installation directory)
DevTools listening on ws://127.0.0.1:12332/devtools/browser/799d2590-befa-44cb-a
703-a0e6c32a2114
[0530/092619.091:ERROR:gl_surface_egl.cc(840)] eglInitialize D3D11 failed with e
rror EGL_NOT_INITIALIZED, trying next display type

>>>

Download, unzip, and configure the headless browser PhantomJS. It runs in the background and does not open a window.

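Add the PhantomJS bin directory to the user's PATH environment variable (it is not clear whether this step is strictly required):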

C:\Users\Administrator\AppData\Roaming\npm;D:\Python35-32\phantomjs-2.1.1-windows\bin
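One way to find out whether the PATH entry has taken effect is to ask Python where it would find the executable. The sketch below uses shutil.which from the standard library; the find_driver helper name is made up for illustration, and it assumes the executables are named phantomjs and chromedriver. Open a new command prompt after editing the environment variable, otherwise the old PATH is still in use.

```python
import shutil


def find_driver(name):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    # Executable names are assumptions; adjust them to match your downloads.
    for exe in ("phantomjs", "chromedriver"):
        path = find_driver(exe)
        print(exe, "->", path if path else "not found on PATH")
```

If a driver is reported as not found, the webdriver constructors will most likely fail to start it unless its location is supplied some other way.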

Test PhantomJS

>>> import selenium
>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> driver.get('http://www.baidu.com')
>>> driver.page_source
'<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;c
harset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content

="never" name="referrer"><title>百度一下，你就知道</title><style>html,body{heigh

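selenium drives a real browser and is mainly used for automated testing. Pages rendered by JavaScript cannot be fetched with requests alone, so selenium is used to let the browser render the page and then read the result.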

lxml parses web pages using XPath, which is convenient.

Install lxml.

Install beautifulsoup4 (pip install beautifulsoup4). It commonly uses lxml as its parser, so install lxml first.

pip install -i https://pypi.doubanio.com/simple beautifulsoup4

from bs4 import BeautifulSoup

Install pyquery, another HTML parsing library. Its syntax is similar to jQuery's.

pip install -i https://pypi.doubanio.com/simple pyquery

C:\Users\Administrator> python
>>> from pyquery import PyQuery as pq
>>>

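Install pymysql.

Install pymongo.

Install redis. (This is the client library for the Redis database. Redis can hold a shared crawl queue and is fairly efficient.)

Install flask. (flask is a web library. It can be used for proxy maintenance: running a proxy server, fetching proxies, storing proxies, and so on.) It pulls in several other libraries as dependencies.

Install django. (Used when maintaining a distributed crawler, to build a management system.)

Install jupyter. (Essentially a notebook that runs in the browser, where code can be run and debugged.) It depends on many other libraries.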

On Linux, installation is even simpler. Run: pip install requests selenium beautifulsoup4 pyquery pymysql pymongo redis flask django jupyter
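After installing everything, a short script can confirm which libraries are actually importable. This is a minimal sketch using importlib.util.find_spec from the standard library; the PACKAGES mapping and the check_installed helper are made up for illustration. Note that the pip name and the import name differ for some packages (beautifulsoup4 is imported as bs4), and jupyter_core is assumed here as the module to probe for jupyter.

```python
import importlib.util

# pip package name -> top-level module name to probe.
# The jupyter_core entry is an assumption about what "pip install jupyter" provides.
PACKAGES = {
    "requests": "requests",
    "selenium": "selenium",
    "lxml": "lxml",
    "beautifulsoup4": "bs4",
    "pyquery": "pyquery",
    "pymysql": "pymysql",
    "pymongo": "pymongo",
    "redis": "redis",
    "flask": "flask",
    "django": "django",
    "jupyter": "jupyter_core",
}


def check_installed(packages):
    """Map each pip name to True if its module can be located, else False."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    for pip_name, ok in check_installed(PACKAGES).items():
        print("%-15s %s" % (pip_name, "OK" if ok else "MISSING"))
```

find_spec only locates the module without importing it, so the check stays fast even for heavy packages such as django.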
