Python 3 Web Scraping from Scratch: Using XPath
Previously we extracted page information with regular expressions, but that approach is verbose and error-prone. XPath provides concise, readable path-selection expressions plus a large set of built-in functions, and can locate almost any node we might want.
Using XPath requires the lxml library (install it with pip install lxml).
Common rules
- nodename  selects all child nodes of this node
- /  selects direct children from the current node
- //  selects descendants from the current node
- .  selects the current node
- ..  selects the parent of the current node
- @  selects attributes
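The rules above can be sketched on a tiny document (the markup below is made up purely for illustration):

```python
from lxml import etree

# Hypothetical one-line document to demonstrate each rule
doc = etree.HTML('<div><ul><li class="a"><a href="x.html">x</a></li></ul></div>')

lis = doc.xpath('//li')       # // : li nodes anywhere below the root
a = lis[0].xpath('./a')       # .  : path relative to the current li node
parent = a[0].xpath('..')     # .. : step back up to the parent li
href = a[0].xpath('./@href')  # @  : select the href attribute's value
print(parent[0].tag, href)    # li ['x.html']
```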
Example 1:
from lxml import etree

# Note: the markup below is deliberately malformed (one <a> is never
# closed and one <li> is missing its closing tag) to show that lxml
# repairs broken HTML.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item<a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a><li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
html = etree.HTML(text)        # initialize with the HTML class
result = etree.tostring(html)  # serialize back to a bytes string
print(result.decode('utf-8'))
print(type(html))
print(type(result))
Result:
Analysis: as you can see, the etree module corrected the broken markup for us; that is, it automatically repairs malformed HTML text.
Note: in PyCharm, from lxml import etree may be flagged in red, but this has no effect; the code runs normally.
Example 2:
from lxml import etree

# Parse the file directly, using an HTML parser that tolerates broken markup
html = etree.parse('test.html', etree.HTMLParser())
print(type(html))
result = etree.tostring(html)
print(type(result))
print(result.decode('utf-8'))
The corresponding test.html:
Result:
Note: removing etree.HTMLParser() here raises an error:
namely, the broken <li> tag in test.html.
After fixing the tag, the output is correct:
In other words, etree.HTMLParser() plays the same role that etree.HTML() did earlier: it repairs the HTML. The function's exact behavior deserves further study.
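A minimal sketch of that difference, using an in-memory file instead of test.html so it runs on its own: without HTMLParser(), etree.parse falls back to a strict XML parser and raises XMLSyntaxError on broken markup.

```python
import io
from lxml import etree

broken = b'<div><ul><li class="item-0">first item</ul></div>'  # unclosed <li>

# The default parser is strict XML: malformed markup raises XMLSyntaxError
try:
    etree.parse(io.BytesIO(broken))
except etree.XMLSyntaxError as e:
    print('XML parser failed:', e)

# HTMLParser repairs the markup instead of failing
tree = etree.parse(io.BytesIO(broken), etree.HTMLParser())
print(etree.tostring(tree).decode('utf-8'))
```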
Example 3: selecting child nodes
from lxml import etree
html = etree.parse('test.html', etree.HTMLParser())
print("html type:", type(html))
result1 = html.xpath('//*')  # select all nodes
print("All nodes:", result1)
result = html.xpath('//li')  # select all li nodes anywhere in the document
print("All li nodes:", result)
print("Item [0] of the result:", result[0])  # individual elements can be pulled out by index
result2 = html.xpath('//li/a')  # select the direct a children of the li nodes
print("Direct a children of all li nodes:", result2)
Result:
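To make the / (direct children) versus // (all descendants) distinction concrete, here is a small sketch on made-up nested markup:

```python
from lxml import etree

# Hypothetical markup: one a is a direct child of li, another is nested deeper
html = etree.HTML('<ul><li><a>outer</a><span><a>inner</a></span></li></ul>')

direct = html.xpath('//li/a/text()')  # only a nodes that are direct children of li
all_a = html.xpath('//li//a/text()')  # a nodes anywhere beneath li
print(direct)  # ['outer']
print(all_a)   # ['outer', 'inner']
```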
Example 4: getting text and attributes
from lxml import etree
html = etree.parse('test.html', etree.HTMLParser())
result1 = html.xpath('//a[@href="link4.html"]')            # attribute matching
result2 = html.xpath('//a[@href="link4.html"]/../@class')  # step up to the parent node
print("result1:", result1)
print("result2:", result2)
result3 = html.xpath('//a[@href="link4.html"]/text()')     # text extraction
print("result3:", result3)
result4 = html.xpath('//a/@href')  # attribute extraction; note how this differs from attribute matching
print('result4:', result4)
Result:
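One nuance worth knowing when extracting text: text() returns only a node's own text, not the text inside nested child elements. A small sketch on made-up markup:

```python
from lxml import etree

# Hypothetical markup: the li's text is split around a nested <a>
html = etree.HTML('<ul><li class="item-0">before <a href="x.html">link</a> after</li></ul>')

print(html.xpath('//li/text()'))   # only the li's own text: ['before ', ' after']
print(html.xpath('//li//text()'))  # includes nested text: ['before ', 'link', ' after']
print(html.xpath('string(//li)'))  # one concatenated string: 'before link after'
```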
Example 5: multi-valued attribute matching
from lxml import etree
text = """
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="li li-first" name="item"><a href="link.html">second item</a></li>
"""
html = etree.HTML(text)
result1 = html.xpath('//li[@class="li"]/a/text()')           # no match: class holds several values
result2 = html.xpath('//li[@class="li li-first"]/a/text()')  # matches: the full attribute value
result3 = html.xpath('//li[contains(@class,"li")]/a/text()') # contains() handles multi-valued attributes
result4 = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')  # matching on several attributes
print("result1:", result1)
print("result2:", result2)
print("result3:", result3)
print("result4:", result4)
Result:
Example 6: selecting by order
from lxml import etree
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
result1 = html.xpath('//li[1]/a/text()')             # the first li
print('result1:', result1)
result2 = html.xpath('//li[last()]/a/text()')        # the last li
print('result2:', result2)
result3 = html.xpath('//li[position()<3]/a/text()')  # the first two li
print('result3:', result3)
result4 = html.xpath('//li[last()-2]/a/text()')      # the third li from the end
print('result4:', result4)
Result:
Besides last() and position(), more functions are listed at: http://www.w3school.com.cn/xpath/xpath_functions.asp#node
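A few of those functions in action on a made-up snippet; count(), starts-with(), string(), and normalize-space() are all standard XPath 1.0 functions supported by lxml:

```python
from lxml import etree

html = etree.HTML("""
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
""")

print(html.xpath('count(//li)'))                        # number of li nodes: 2.0
print(html.xpath('//li[starts-with(@class, "item")]'))  # both li elements
print(html.xpath('string(//li[1]/a)'))                  # 'first item'
print(html.xpath('normalize-space(//li[2]/a/text())'))  # 'second item'
```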
Example 7: node-axis selection
from lxml import etree
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
result1 = html.xpath('//li[1]/ancestor::*')    # ancestor axis: all ancestor nodes
print('result1:', result1)
result2 = html.xpath('//li[1]/ancestor::div')  # ancestor axis restricted to div ancestors
print('result2:', result2)
result3 = html.xpath('//li[1]/attribute::*')   # attribute axis: all attribute values
print('result3:', result3)
result4 = html.xpath('//li[1]/child::a[@href="link1.html"]')  # child axis with a predicate (same result without it: there is only one child here)
print('result4:', result4)
result5 = html.xpath('//li[1]/descendant::span')  # descendant axis, restricted to span descendants
print('result5:', result5)
result6 = html.xpath('//li[1]/following::*')      # following axis: every node after the current one
print('result6:', result6)
result7 = html.xpath('//li[1]/following::*[2]')   # following axis restricted by an index
print('result7:', result7)
result8 = html.xpath('//li[1]/following-sibling::*')  # following-sibling axis: all later siblings
print('result8:', result8)
Result:
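Two more axes not covered above, preceding-sibling and parent, sketched on made-up markup:

```python
from lxml import etree

html = etree.HTML("""
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-2"><a href="link3.html">third item</a></li>
</ul>
""")

# preceding-sibling axis: siblings that come before the current node
prev = html.xpath('//li[3]/preceding-sibling::li/@class')
print(prev)  # ['item-0', 'item-1']

# parent axis: the explicit spelling of ..
parent_cls = html.xpath('//a[@href="link2.html"]/parent::li/@class')
print(parent_cls)  # ['item-1']
```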
More on the Python lxml library: