Finding elements by attribute with lxml

Question:

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here is an example of the document:

<root> 
    <articles> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
     <article type="info"> 
      <content>some text</content> 
     </article> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
    </articles> 
</root>

Here, I would like to get only the articles whose type is "news". What is the most efficient and elegant way to do this with lxml?

I tried the find method, but it is not very pretty:

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

Answer

You can use XPath, e.g. root.xpath("//article[@type='news']")

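This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over it to do whatever you want, or pass it on wherever you need it.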
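As a rough illustration of iterating over that list, here is a minimal sketch. The sample document and the names news_articles and news_texts are made up for this example (the content texts are changed so the filtering is visible); only the XPath expression comes from the answer.

```python
from lxml import etree

# Hypothetical sample document, shaped like the one in the question.
root = etree.fromstring("""
<root>
  <articles>
    <article type="news"><content>first</content></article>
    <article type="info"><content>second</content></article>
    <article type="news"><content>third</content></article>
  </articles>
</root>
""")

# xpath() returns a plain Python list of matching elements.
news_articles = root.xpath("//article[@type='news']")

# Iterate over the matches and read the text of each <content> child.
news_texts = [article.findtext("content") for article in news_articles]
print(news_texts)  # ['first', 'third']
```

The "info" article is skipped because the predicate [@type='news'] filters it out before the loop ever sees it.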

To get just the text content, you can extend the XPath like this:

from lxml import etree

root = etree.fromstring("""
<root> 
    <articles> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
     <article type="info"> 
      <content>some text</content> 
     </article> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
    </articles> 
</root> 
""") 

print(root.xpath("//article[@type='news']/content/text()"))

This prints ['some text', 'some text']. Or, if you just want the content elements, the expression would be "//article[@type='news']/content", and so on.
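To make the difference between the two expressions concrete, here is a small sketch that runs both against a made-up document (the sample texts and variable names are assumptions for illustration). One returns element objects, the other returns strings.

```python
from lxml import etree

# Hypothetical sample document with distinct texts per article.
root = etree.fromstring("""
<root>
  <articles>
    <article type="news"><content>alpha</content></article>
    <article type="info"><content>beta</content></article>
    <article type="news"><content>gamma</content></article>
  </articles>
</root>
""")

# Selecting the <content> elements keeps the element objects,
# so tags, attributes and children remain accessible.
content_elements = root.xpath("//article[@type='news']/content")
texts_from_elements = [el.text for el in content_elements]

# Appending /text() makes XPath hand back the strings directly.
texts_from_xpath = root.xpath("//article[@type='news']/content/text()")

print(texts_from_elements)  # ['alpha', 'gamma']
print(texts_from_xpath)     # ['alpha', 'gamma']
```

Which form is nicer probably depends on whether you still need the elements afterwards; if only the strings matter, the text() variant saves a step.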

Answer

Just for reference, you can achieve the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root> 
    <articles> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
     <article type="info"> 
      <content>some text</content> 
     </article> 
     <article type="news"> 
      <content>some text</content> 
     </article> 
    </articles> 
</root> 
""") 

articles = root.find("articles") 
article_list = articles.findall("article[@type='news']/content") 
for a in article_list:
    print(a.text)
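As a side note, findall also accepts a path starting with ".//", which searches all descendants, so the intermediate find("articles") step is optional. The sketch below uses a made-up document and hypothetical variable names, and compares the findall result with the equivalent xpath call.

```python
from lxml import etree

# Hypothetical sample document, shaped like the one in the question.
root = etree.fromstring("""
<root>
  <articles>
    <article type="news"><content>one</content></article>
    <article type="info"><content>two</content></article>
    <article type="news"><content>three</content></article>
  </articles>
</root>
""")

# ".//" makes findall search every descendant of root,
# so there is no need to fetch <articles> first.
via_findall = [el.text for el in root.findall(".//article[@type='news']/content")]

# The equivalent full XPath query, for comparison.
via_xpath = [el.text for el in root.xpath("//article[@type='news']/content")]

print(via_findall)               # ['one', 'three']
print(via_findall == via_xpath)  # True
```

findall only understands a subset of XPath (ElementPath), so for more complex predicates or functions the xpath method is the safer choice.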
