How to parse XML with nested tags in Python
Question:
I have the following XML. How can I parse XML with nested tags in Python?
<component name="QUESTIONS">
  <topic name="Chair">
    <state>active</state>
    <subtopic name="Wooden">
      <links>
        <link videoDuration="" youtubeId="" type="article">
          <label>Understanding Wooden Chair</label>
          <url>http://abcd.xyz.com/1111?view=app</url>
        </link>
        <link videoDuration="" youtubeId="" type="article">
          <label>How To Assemble Wooden CHair</label>
          <url>http://abcd.xyz.com/2222?view=app</url>
        </link>
        <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
          <label>Wooden Chair Tutorial</label>
          <url>/</url>
        </link>
        <link videoDuration="1:06" youtubeId="MSDVN235879" type="video">
          <label>How To Access Wood</label>
          <url>/</url>
        </link>
      </links>
    </subtopic>
  </topic>
  <topic name="Table">
    <state>active</state>
    <subtopic name="">
      <links>
        <link videoDuration="" youtubeId="" type="article">
          <label>Understanding Tables</label>
          <url>http://abcd.xyz.com/3333?view=app</url>
        </link>
        <link videoDuration="" youtubeId="" type="article">
          <label>Set-up Table</label>
          <url>http://abcd.xyz.com/4444?view=app</url>
        </link>
        <link videoDuration="" youtubeId="" type="article">
          <label>How To Change table</label>
          <url>http://abcd.xyz.com/5555?view=app</url>
        </link>
      </links>
    </subtopic>
  </topic>
</component>
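I am trying to parse this XML in Python and build a URL array that will contain:
1. All the HTTP URLs present in the XML.
2. For each link tag, if a YouTube ID is present, capture it, build the YouTube URL, and add it to the URL array.
I have the following code, but it does not give me the URLs and links.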
from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter():
    print node.tag, node.attrib.get('url')

for node in tree.iter('outline'):
    name = node.attrib.get('link')
    url = node.attrib.get('url')
    if name and url:
        print ' %s :: %s' % (name, url)
    else:
        print name
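How can I do this so that I get all the URLs?
Based on the answer below, I developed the following code. The problem is that it prints only one URL, not all of them.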
from xml.etree import ElementTree

def fetch_faq_urls():
    url_list = []
    with open('faq.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for link in tree.iter('link'):
        youtube = link.get('youtubeId')
        if youtube:
            print "https://www.youtube.com/watch?v=" + youtube
            video_url = "https://www.youtube.com/watch?v=" + youtube
            url_list.append(video_url)
            # print "youtubeId", link.find('label').text, '???'
        else:
            print link.find('url').text
            article_url = link.find('url').text
            url_list.append(article_url)
            # print 'url', link.find('label').text,
        return url_list

faqs = fetch_faq_urls()
print faqs
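Answer
The information you want is in the <link> elements, so just iterate over those. Use get() to read the YouTube ID attribute and find() to get the child <url> element.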
from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for link in tree.iter('link'):
    youtube = link.get('youtubeId')
    if youtube:
        print "youtube", link.find('label').text, '???'
    else:
        print 'url', link.find('label').text, link.find('url').text
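The difference between get() (attributes) and find() (child elements) may be easier to see on a small inline sample. The following is only a sketch in Python 3 syntax, unlike the Python 2 snippets in this thread. SAMPLE is a trimmed-down copy of the question's XML, and describe_links is a hypothetical helper name, not something from the original answer.

```python
from xml.etree import ElementTree

# Assumption: a shortened sample with the same shape as the question's XML.
SAMPLE = """
<component name="QUESTIONS">
  <topic name="Chair">
    <subtopic name="Wooden">
      <links>
        <link videoDuration="" youtubeId="" type="article">
          <label>Understanding Wooden Chair</label>
          <url>http://abcd.xyz.com/1111?view=app</url>
        </link>
        <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
          <label>Wooden Chair Tutorial</label>
          <url>/</url>
        </link>
      </links>
    </subtopic>
  </topic>
</component>
"""


def describe_links(xml_text):
    """Return one (kind, label, value) tuple per <link> element."""
    root = ElementTree.fromstring(xml_text)
    rows = []
    # iter('link') walks the whole tree, so the nesting depth does not matter.
    for link in root.iter('link'):
        youtube_id = link.get('youtubeId')   # attribute; empty string for articles
        label = link.findtext('label')       # text of the <label> child
        url = link.findtext('url')           # text of the <url> child
        if youtube_id:
            rows.append(('youtube', label, youtube_id))
        else:
            rows.append(('url', label, url))
    return rows


for row in describe_links(SAMPLE):
    print(row)
```

An empty youtubeId attribute is falsy, which is why a plain truth test is enough to tell articles from videos here.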
Answer
Take a look at xmltodict.
>>> import json
>>> import xmltodict
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
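Thanks a lot. I understand what I need to do now. Thanks, tdelaney.
I've updated my question with the code I developed. I don't understand why it pushes only one value to the array.
@in_learning_software - It's just a small indentation problem. Note that your 'return url_list' is inside the 'for' block, so it executes on the first pass through the loop. Simply dedent it one level. – tdelaney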
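The indentation fix described in the last comment could look roughly like the sketch below. It uses Python 3 syntax and, as an assumption for the sake of a self-contained example, takes the XML source as a parameter and feeds it an in-memory sample instead of faq.xml. The constant name YOUTUBE_WATCH_PREFIX is illustrative; the URL prefix itself is the one used in the question's code.

```python
import io
from xml.etree import ElementTree

# Prefix taken from the question's code; the constant name is illustrative.
YOUTUBE_WATCH_PREFIX = "https://www.youtube.com/watch?v="


def fetch_faq_urls(source):
    """Collect one URL per <link>; source is a filename or a file object."""
    tree = ElementTree.parse(source)
    url_list = []
    for link in tree.iter('link'):
        youtube = link.get('youtubeId')
        if youtube:
            # Video link: build the watch URL from the attribute.
            url_list.append(YOUTUBE_WATCH_PREFIX + youtube)
        else:
            # Article link: use the text of the <url> child.
            url_list.append(link.find('url').text)
    # The return is at function level, outside the for loop,
    # so every <link> is visited before the list is returned.
    return url_list


# Assumption: a small in-memory document standing in for faq.xml.
sample = io.BytesIO(b"""<links>
  <link videoDuration="" youtubeId="" type="article">
    <label>Understanding Tables</label>
    <url>http://abcd.xyz.com/3333?view=app</url>
  </link>
  <link videoDuration="1:06" youtubeId="MSDVN235879" type="video">
    <label>How To Access Wood</label>
    <url>/</url>
  </link>
</links>""")

faqs = fetch_faq_urls(sample)
print(faqs)
```

With a real file, calling fetch_faq_urls('faq.xml') should behave the same way, since ElementTree.parse accepts either a filename or a file object.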