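BeautifulSoup is a very popular Python module. It makes it easy to parse a web page fetched with the Requests library, turning the page source into a Soup document from which data can be filtered and extracted.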
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
res = requests.get('http://bj.xiaozhu.com', headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')  # parse the page source into a Soup document
print(soup.prettify())  # print the document with standard indentation
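Output:
[Figure: Application of the BeautifulSoup library]
As the figure shows, the output looks similar to the page source returned by the requests library. The difference is that the Soup document produced by BeautifulSoup is printed with standard indentation, so it is structured data that is ready for filtering and extraction.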
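Besides the HTML parser in the Python standard library, BeautifulSoup also supports several third-party parsers. The table below lists the main parsers that BeautifulSoup supports, with their advantages and disadvantages.
[Figure: Application of the BeautifulSoup library]
The official BeautifulSoup documentation recommends lxml as the parser because it is faster.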
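Since lxml is a separate package that may not be installed, one way to follow that recommendation is to try lxml first and fall back to the built-in parser. The following is a minimal sketch, not the only way to do it; the helper name make_soup and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = "<html><body><p class='title'>Hello</p></body></html>"


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(SAMPLE_HTML)
print(soup.p.get_text())
```

The fallback keeps the script usable on machines without lxml, at the cost of possibly slower parsing.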
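Once the page has been parsed into a Soup document, the elements you need can be located with the find(), find_all() and select() methods. find() and find_all() are used in much the same way, with the following general form:
find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
The first two parameters (named name and attrs in the bs4 API) are the ones used most often. Once you are comfortable with them, you can extract most of the information you want from a page.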

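1. find_all() method
soup.find_all('div', 'item')  # find div tags with class="item"
soup.find_all('div', class_='item')  # same search; the keyword is class_ because class is reserved in Python

The attrs parameter takes a dictionary and searches for tags that carry the given attributes:

soup.find_all('div', attrs={'class': 'item'})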
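To check that the three spellings above select the same tags, here is a small offline sketch. The markup is invented for illustration and is not taken from the real site; the limit argument is included only to show one of the less common parameters.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the real site.
SAMPLE_HTML = """
<div class="item" id="a">first</div>
<div class="item" id="b">second</div>
<div class="other" id="c">third</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Three equivalent ways to ask for <div class="item"> tags.
by_position = soup.find_all("div", "item")
by_keyword = soup.find_all("div", class_="item")
by_attrs = soup.find_all("div", attrs={"class": "item"})

# limit stops the search after the given number of matches.
only_first = soup.find_all("div", limit=1)

ids = [tag["id"] for tag in by_keyword]
print(ids)
```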
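2. find() method
find() works like find_all(). The difference is the return value: find_all() returns every tag in the document that matches, as a collection (class 'bs4.element.ResultSet'), while find() returns a single Tag (class 'bs4.element.Tag').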
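The difference in return types can be seen in a short sketch on made-up markup. It also shows what each method gives back when nothing matches, which matters when you chain further calls onto the result.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical markup for illustration.
SAMPLE_HTML = '<ul><li class="item">one</li><li class="item">two</li></ul>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first = soup.find("li", "item")          # a single Tag (the first match)
all_items = soup.find_all("li", "item")  # a ResultSet holding every match

missing = soup.find("table")             # None when nothing matches
none_found = soup.find_all("table")      # an empty ResultSet when nothing matches

print(type(first), type(all_items))
print(missing, len(none_found))
```

Because find() can return None, it is safer to test the result before calling methods such as get_text() on it.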
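3. select() method
soup.select('div.item > a > h1')  # the selector string can be copied from Chrome
This method works like the path China > Hebei Province > Shijiazhuang: it narrows down from the largest scope to the smallest until it reaches the information you need. The selector can be copied from Chrome as follows:
(1) Point the mouse at the data you want to extract, right-click, and choose "Inspect" from the context menu.
(2) In the page source panel, right-click the highlighted element.
(3) In the context menu, choose Copy selector. This gives a selector such as:
#page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i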
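To see how a copied selector like this behaves without requesting the live page, here is a minimal sketch on a hand-written fragment. The markup and prices are invented and only modeled on the selector above, so the real page may differ. It also assumes a Beautiful Soup release whose select() supports :nth-child (4.7 or later is assumed here).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped to match the copied selector; the prices are made up.
SAMPLE_HTML = """
<div id="page_list">
  <ul>
    <li><div class="result_btm_con lodgeunitname"><div><span><i>398</i></span></div></div></li>
    <li><div class="result_btm_con lodgeunitname"><div><span><i>520</i></span></div></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

PRICE_PATH = "div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i"

# select() always returns a list, even when only one element matches.
first_only = soup.select("#page_list > ul > li:nth-child(1) > " + PRICE_PATH)

# Dropping :nth-child(1) from li matches the price in every list item.
every_item = soup.select("#page_list > ul > li > " + PRICE_PATH)

# select_one() returns the first match directly, or None if nothing matches.
first_tag = soup.select_one("#page_list > ul > li > " + PRICE_PATH)

print(first_only)
print([tag.get_text() for tag in every_item])
```

A selector copied from Chrome usually pins one specific element with :nth-child, so it often needs loosening like this before it matches a whole list.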
The following code uses the copied selector to get the price of a listing:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
res = requests.get('http://bj.xiaozhu.com', headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
# select() returns a list of the tags that match the CSS selector
price = soup.select('#page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i')
print(price)
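Output:
[Figure: Application of the BeautifulSoup library]
The result contains characters we do not need (the surrounding tags), while what we want is the data inside them. In this case, the get_text() method returns the text between the tags: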
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
res = requests.get('http://bj.xiaozhu.com', headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
# Without :nth-child(1) on li, the selector matches the price in every list item
prices = soup.select('#page_list > ul > li > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i')
for price in prices:
    print(price.get_text())  # text inside each matched tag
Output:
[Figure: Application of the BeautifulSoup library]
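The difference between printing a tag and printing its text can be reproduced offline. This is a small sketch on an invented fragment; the value 398 is made up, and converting it with int() is just one possible next step, assuming the text is a plain number.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the price value is made up for illustration.
SAMPLE_HTML = "<span>Price: <i> 398 </i></span>"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

i_tag = soup.select("span > i")[0]

as_markup = str(i_tag)                  # the tag with its markup, as print() would show it
raw_text = i_tag.get_text()             # the text inside the tag, whitespace kept
clean_text = i_tag.get_text(strip=True) # the same text with surrounding whitespace removed
price = int(clean_text)                 # assumes the text is a plain integer

print(as_markup)
print(repr(raw_text), repr(clean_text), price)
```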
