Parsing a text/CSV file that contains XML entries in Python

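Problem description:

I have a CSV file that contains XML entries. Each XML entry starts with <entry> and ends with </entry>, and the file holds thousands of them. Each entry is made up of nested XML elements.

I need to extract some elements from each entry and save them to another file using Python. Below is an example of one XML entry. Suppose I want to extract particular elements from each entry. Can you tell me how to do this in Python? I am a beginner at Python programming.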

"<entry xmlns=""http://www.w3.org/2005/Atom"" xmlns:gnip=""http://www.gnip.com/schemas/2010""> 
    <id>tag:search.twitter.com,2005:157796632933576704</id> 
    <published>2012-01-13T12:10:23+00:00</published> 
    <updated>2012-01-13T12:10:23+00:00</updated> 
    <summary type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</summary> 
    <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/> 
    <source> 
    <link rel=""self"" type=""application/json"" href=""https://stream.twitter.com/1/statuses/filter.json""/> 
    <title>Twitter - Stream - Track</title> 
    <updated>2012-01-13T12:10:25Z</updated> 
    </source> 
    <service:provider xmlns:service=""http://activitystrea.ms/service-provider""> 
    <name>Twitter</name> 
    <uri>http://www.twitter.com/</uri> 
    <icon/> 
    </service:provider> 
    <contributor> 
    <name>Steve Price</name> 
    <uri>http://www.twitter.com/sprice54</uri> 
    </contributor> 
    <link rel=""via"" type=""text/html"" href=""http://twitter.com/sprice54/statuses/12736""/> 
    <title>George Cordivari shared: Steve Price posted a note on Twitter</title> 
    <category term=""StatusShared"" label=""Status Shared""/> 
    <category term=""NoteShared"" label=""Note Shared""/> 
    <activity:verb xmlns:activity=""http://activitystrea.ms/spec/1.0/"">http://activitystrea.ms/schema/1.0/share</activity:verb> 
    <activity:object xmlns:activity=""http://activitystrea.ms/spec/1.0/""> 
    <activity:object-type>http://activitystrea.ms/schema/1.0/note</activity:object-type> 
    <id>object:search.twitter.com,2005:157796632933576704</id> 
    <content type=""html"">RT @sprice54: If you rearrange the words ""Debit card"" you can spell ""Bad Credit""</content> 
    <link rel=""alternate"" type=""text/html"" href=""http://twitter.com/GCordivari/statuses/157796632933576704""/> 
    </activity:object> 
    <author> 
    <name>George Cordivari</name> 
    <uri>http://www.twitter.com/GCordivari</uri> 
    </author> 
    <activity:author xmlns:activity=""http://activitystrea.ms/spec/1.0/""> 
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type> 
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/> 
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/> 
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/> 
    <id>http://www.twitter.com/GCordivari</id> 
    </activity:author> 
    <activity:actor xmlns:activity=""http://activitystrea.ms/spec/1.0/""> 
    <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type> 
    <gnip:friends xmlns:gnip=""http://www.gnip.com/schemas/2010"" followersCount=""37"" followingCount=""61""/> 
    <gnip:stats xmlns:gnip=""http://www.gnip.com/schemas/2010"" activityCount=""370"" upstreamId=""id:twitter.com:427031045""/> 
    <link rel=""alternate"" type=""text/html"" length=""0"" href=""http://www.twitter.com/GCordivari""/> 
    <link rel=""avatar"" href=""http://a0.twimg.com/profile_images/1670548060/274805_1268643462_1179159089_n_normal.jpg""/> 
    <id>http://www.twitter.com/GCordivari</id> 
    <os:location xmlns:os=""http://ns.opensocial.org/2008/opensocial"">Drexel Hell</os:location> 
    <os:aboutMe xmlns:os=""http://ns.opensocial.org/2008/opensocial"">This is the way I live. #*cInMyCupIDGAF #CloudNine #FollowMeLikeTheLeader </os:aboutMe> 
    </activity:actor> 
    <gnip:twitter_entities xmlns:gnip=""http://www.gnip.com/schemas/2010""> 
    <user_mentions> 
     <user_mention start=""3"" end=""12""> 
     <id>255347428</id> 
     <name>Steve Price</name> 
     <screen_name>sprice54</screen_name> 
     </user_mention> 
    </user_mentions> 
    </gnip:twitter_entities> 
    <gnip:matching_rules> 
    <gnip:matching_rule rel=""inferred"">""debit card""</gnip:matching_rule> 
    </gnip:matching_rules> 
</entry>" 

The following example shows how to extract all elements with a given name, for example contributor, and export them to a new XML document.

import xml.dom.minidom as minidom 

#open the input csv/xml file 
inputPath = '/path/to/xml.csv' 
xml_csv = open(inputPath) 

#open an output file in write mode 
outputPath = '/path/to/contributors.xml' 
outxml = open(outputPath,'w') 

#create a new xml document and top level element 
impl = minidom.getDOMImplementation() 
newxml = impl.createDocument(None,'contributors',None) 
top = newxml.documentElement 

#loop through each line in the file, splitting on commas 
#(this assumes every field holds one complete XML document on a single line) 
for line in xml_csv: 
    xmlFields = line.split(',') 

    for fldxml in xmlFields: 
        #doubled double quotes caused the parser to choke, so replace them here 
        fldxml = fldxml.replace('""','"') 

        #parse the xml data from each field and 
        #find all contributor elements anywhere inside it 
        dom = minidom.parseString(fldxml) 
        contributors = dom.getElementsByTagName('contributor') 

        #add each contributor to the new xml document 
        for contributor in contributors: 
            top.appendChild(contributor) 

#write out the new xml contributors document in pretty XML 
outxml.write(newxml.toprettyxml()) 
outxml.close() 

Thanks, Tharen. I get the following error when I run the code: File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1924, in parseString return expatbuilder.parseString(string) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) xml.parsers.expat.ExpatError: syntax error – saghar 2012-02-10 21:15:08

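That error means the parser could not parse the string it was given. I suspect the XML is either not well-formed or is something the built-in parser cannot handle. Another parser might cope better. If you have no control over the XML, I would suggest trying a different one. – tharen 2012-02-10 23:08:36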
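One possible cause, which the thread does not confirm, is that the string handed to the parser is only a fragment of an entry. The sample entry is wrapped in double quotes and its id contains a comma, so splitting on commas would cut the XML apart. The sketch below uses a shortened, made-up line (`line` and `try_parse` are illustrative names, and Python 3 is assumed) to show how each fragment fails while the unquoted whole parses.

```python
import xml.dom.minidom as minidom
from xml.parsers.expat import ExpatError

# Shortened stand-in for one quoted CSV field; note the comma inside <id>.
line = '"<entry><id>tag:search.twitter.com,2005:1</id></entry>"'


def try_parse(text):
    """Return 'ok' if text parses as XML, else a short error description."""
    try:
        minidom.parseString(text)
        return "ok"
    except ExpatError as exc:
        return "ExpatError: %s" % exc


# Splitting on commas yields two fragments, neither of which is valid XML.
fragment_results = [try_parse(piece) for piece in line.split(',')]

# Removing the surrounding quotes and keeping the field whole parses fine.
whole_result = try_parse(line.strip('"'))

print(fragment_results)
print(whole_result)
```

Printing the failing string just before the `parseString` call in the real script is a quick way to check whether this is what is happening with the actual file.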

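You described the data as "a csv file containing XML entries", which I took to mean '[xmldata],[xmldata],...', where xmldata contains ... . If that is not correct, you will need to provide more context. – tharen 2012-02-10 23:36:37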

Use the csv module to parse the CSV, and something like ElementTree to parse the XML fields.
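A minimal sketch of that combination, assuming Python 3, a UTF-8 file, and that each quoted CSV cell holds one complete entry as in the sample. The helper names `extract_fields` and `convert`, and the choice of `id` and `published` as the elements to keep, are illustrative assumptions rather than anything stated in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Default namespace declared on the sample <entry> element.
NS = {"atom": "http://www.w3.org/2005/Atom"}


def extract_fields(csv_file, tags=("id", "published")):
    """Yield one dict per XML cell, mapping each tag name to its text."""
    # csv.reader undoes the doubled quotes and keeps multi-line cells intact.
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip()
            if not cell.startswith("<"):
                continue  # not an XML cell
            entry = ET.fromstring(cell)
            yield {tag: entry.findtext("atom:" + tag, namespaces=NS)
                   for tag in tags}


def convert(in_path, out_path, tags=("id", "published")):
    """Write the chosen elements of every entry to a plain CSV file."""
    # newline="" lets the csv module handle line breaks inside quoted cells.
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(tags))
        writer.writeheader()
        writer.writerows(extract_fields(src, tags))


# In-memory demonstration with a shortened version of the sample entry.
SAMPLE = (
    '"<entry xmlns=""http://www.w3.org/2005/Atom"">\n'
    '    <id>tag:search.twitter.com,2005:157796632933576704</id>\n'
    '    <published>2012-01-13T12:10:23+00:00</published>\n'
    '</entry>"\n'
)
rows = list(extract_fields(io.StringIO(SAMPLE)))
print(rows)
```

For a real file, something like `convert('/path/to/xml.csv', '/path/to/output.csv')` would be the entry point; the output path here is a placeholder. Because the elements live in the Atom namespace, a plain `find('id')` would return nothing, which is why the lookups go through the `atom:` prefix.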

If your XML data is RSS-compatible, take a look at feedparser.

Python has a lot of great XML parsing tools. BeautifulSoup is very popular because of its simple API. http://www.crummy.com/software/BeautifulSoup/doc/

lxml is a great library for very fast XML parsing, but it requires the libxml2 C library.

There are many online tutorials that walk through the basics of parsing XML with Python step by step. http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/
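As a small taste of those basics, here is a sketch of reading nested elements and attributes with the standard library's ElementTree, assuming Python 3. The `ENTRY` string is a trimmed copy of the sample entry with the CSV quoting already removed, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the namespaces used in the sample entry.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "activity": "http://activitystrea.ms/spec/1.0/",
    "gnip": "http://www.gnip.com/schemas/2010",
}

ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom" xmlns:gnip="http://www.gnip.com/schemas/2010">
    <contributor>
    <name>Steve Price</name>
    <uri>http://www.twitter.com/sprice54</uri>
    </contributor>
    <link rel="via" type="text/html" href="http://twitter.com/sprice54/statuses/12736"/>
    <activity:actor xmlns:activity="http://activitystrea.ms/spec/1.0/">
    <gnip:friends xmlns:gnip="http://www.gnip.com/schemas/2010" followersCount="37" followingCount="61"/>
    </activity:actor>
</entry>"""

entry = ET.fromstring(ENTRY)

# Text of a nested element: walk the path contributor -> name.
contributor_name = entry.findtext("atom:contributor/atom:name", namespaces=NS)

# Pick one of several sibling elements by attribute value, then read an attribute.
via_link = entry.find("atom:link[@rel='via']", NS)
via_href = via_link.get("href")

# Attributes come back as strings, so convert numbers explicitly.
friends = entry.find("activity:actor/gnip:friends", NS)
followers = int(friends.get("followersCount"))

print(contributor_name, via_href, followers)
```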