Python Beautiful Soup: the most efficient way to find tags

Question:

I am using Python and BeautifulSoup to parse many large XML files. I frequently run into the following task:

<Section1> 
    <Report> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
    </Report> 
</Section1>

I want to collect all the Matrix elements and iterate over them. I use code like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html, 'xml')
matrices = soup.find("Section1").find_all("Matrix")
# Then I handle each matrix

Why can't I use a selector like this?

matrices = soup.find("Section1 Matrix")

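Is there a faster way to do this? Sometimes I am accessing nodes nested more deeply in the XML, and I need to make sure they are descendants, but not necessarily direct children, of several other nodes. The example given is a simplification. Any help would be appreciated.

Have you tried lxml? It will improve performance a lot. – giaosudau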

Answer

BeautifulSoup supports CSS selectors, but you need to pass your selector to the .select method rather than to .find:

In [1]: from bs4 import BeautifulSoup as BS 

In [2]: soup = BS("""<Section1> 
    ...:  <Report> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:  </Report> 
    ...: </Section1>""", "xml") 

In [3]: soup.select("Section1 Matrix") 
Out[3]: 
[<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>]
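
To make the descendant requirement from the question concrete, here is a minimal sketch contrasting the descendant combinator (a space) with the child combinator (>). The two-element document and the variable names are made up for illustration, and the "xml" parser assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the document from the question.
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
</Section1>"""

soup = BeautifulSoup(xml_doc, "xml")  # the "xml" parser requires lxml

# Descendant combinator (a space): Matrix at any depth below Section1.
descendants = soup.select("Section1 Matrix")

# Child combinator (>): only Matrix elements directly under Section1.
# Here every Matrix sits inside Report, so nothing matches.
direct_children = soup.select("Section1 > Matrix")

# find() treats its argument as a tag name, not a selector,
# so it looks for a tag literally named "Section1 Matrix".
not_a_selector = soup.find("Section1 Matrix")

print([m.text for m in descendants])
print(len(direct_children))
print(not_a_selector)
```

This is also why the call in the question does not work: .find matches on the tag name, while .select parses the string as a CSS selector.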

If all you want is every Matrix node in your document, you can use the CSSSelector class from lxml.cssselect.

In [3]: from lxml.etree import fromstring
    ...: from lxml.cssselect import CSSSelector

In [4]: xml_doc = '''<Section1>
    ...:  <Report>
    ...:   <Matrix>...</Matrix>
    ...:   <Matrix>...</Matrix>
    ...:   <Matrix>...</Matrix>
    ...:   <Matrix>...</Matrix>
    ...:  </Report>
    ...: </Section1>'''

In [5]: tree = fromstring(xml_doc)
    ...: sel = CSSSelector("Matrix")  # compile the selector once, reuse it on any tree

In [6]: matrix = [el for el in sel(tree)] 

In [7]: matrix 
Out[7]: 
[<Element Matrix at 0x7f84b5b8f388>, 
<Element Matrix at 0x7f84b5b8fc48>, 
<Element Matrix at 0x7f84b5b8fd88>, 
<Element Matrix at 0x7f84b5b8fdc8>]

You will need to install cssselect with pip if it is not already installed: pip install cssselect
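
If installing cssselect is not an option, lxml can find the same elements without it. The following is a sketch only, not part of the original answer: it uses lxml's built-in XPath, findall and iter calls on a shortened, made-up copy of the document, and the variable names are hypothetical.

```python
from lxml import etree

# Shortened, hypothetical version of the document from the question.
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
</Section1>"""

tree = etree.fromstring(xml_doc)  # tree is the root element, Section1

# XPath: Matrix elements at any depth below any Section1 in the document.
by_xpath = tree.xpath("//Section1//Matrix")

# ElementPath: Matrix elements at any depth below the current element.
by_findall = tree.findall(".//Matrix")

# iter() walks the subtree in document order, yielding matching tags.
by_iter = list(tree.iter("Matrix"))

print([el.text for el in by_xpath])
print(len(by_findall), len(by_iter))
```

All three calls return descendants at any depth, which matches the "descendant but not necessarily direct child" requirement in the question.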
