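Parsing a docx with the ElementTree module
I have this document that I need to parse into an XML equivalent. Basically I need an object of type ElementTree, but I can't get one. I have tried many different combinations and still haven't figured it out. Here is what I did: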
import xml.etree.ElementTree as ET
import zipfile as zf

z = zf.ZipFile("INTRODUCTION.docx")
doc_xml = z.read("word/document.xml")
print doc_xml  # type(doc_xml) is str
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14"><w:body><w:p w:rsidR="00470EEF" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r 
w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
Since doc_xml is a string, I used the following to get an Element:
rooted = ET.fromstring(doc_xml) #type(rooted) is 'Element'
type(rooted)
And this as well:
tree = ET.ElementTree(doc_xml) #type(tree) is 'ElementTree'
type(tree)
I thought this worked, but when I do:
for branch in tree.iter():
    print branch
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-83-d503315fb5e6> in <module>()
----> 1 for branch in tree.iter():
      2     print branch

C:\Anaconda\lib\xml\etree\ElementTree.pyc in iter(self, tag)
    671     def iter(self, tag=None):
    672         # assert self._root is not None
--> 673         return self._root.iter(tag)
    674
    675     # compatibility

AttributeError: 'str' object has no attribute 'iter'
The variable tree is of type ElementTree. How can I fix this?
With this line,
rooted = ET.fromstring(doc_xml)
you get an Element instance by parsing the XML document supplied as a string. You can iterate over this instance:
for branch in rooted.iter():
    print(branch)
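Printing each Element shows tags in the form {namespace-uri}name, because everything in word/document.xml is namespaced. As a rough sketch of what iterating the root can be used for, the following collects the text of each paragraph. The SAMPLE string and the paragraph_texts helper are made up for illustration; only the namespace URI comes from the xmlns:w declaration in the question's dump.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:w declaration in the question's dump.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, cut-down stand-in for the contents of word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def paragraph_texts(root):
    """Return one string per w:p element, joining the text of its w:t elements."""
    texts = []
    for para in root.iter("{%s}p" % W_NS):
        parts = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        texts.append("".join(parts))
    return texts


root = ET.fromstring(SAMPLE)
print(root.tag)               # the namespaced tag of the root element
print(paragraph_texts(root))  # one entry per paragraph
```

The tag passed to iter() has to include the namespace in braces; a bare "p" or "w:p" would not match anything.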
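When you do this,
tree = ET.ElementTree(doc_xml)
you create an ElementTree instance by passing a string as the argument. The first argument of ElementTree is expected to be the root Element, not XML text, so the string is simply stored as the root without being parsed. Here this does not produce an error message, but trying to iterate over the tree fails because it is not a "real" tree: iter() is called on the stored string, which is exactly what the traceback shows.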
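A small sketch of the difference, using a made-up two-element XML string in place of doc_xml. The helper name tags_or_error is hypothetical. Catching TypeError alongside AttributeError is an assumption made for portability, since the point at which the failure surfaces may depend on the Python version.

```python
import xml.etree.ElementTree as ET

XML_TEXT = "<a><b/></a>"  # hypothetical stand-in for doc_xml


def tags_or_error(first_arg):
    """Build ElementTree(first_arg) and list its tags, or name the exception raised."""
    try:
        tree = ET.ElementTree(first_arg)
        return [elem.tag for elem in tree.iter()]
    except (AttributeError, TypeError) as exc:
        # Assumption: which of the two is raised may vary between Python versions.
        return type(exc).__name__


print(tags_or_error(XML_TEXT))                 # unparsed string: an exception name
print(tags_or_error(ET.fromstring(XML_TEXT)))  # parsed root Element: its tags
```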
If you need an ElementTree instance, I suggest doing it like this:
import xml.etree.ElementTree as ET
import zipfile as zf

z = zf.ZipFile("INTRODUCTION.docx")
f = z.open("word/document.xml")  # a file-like object
tree = ET.parse(f)               # an ElementTree instance

for elem in tree.iter():
    print(elem)
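If the XML has already been read into a string, another option is to parse it with fromstring and wrap the resulting root in an ElementTree. The sketch below compares both routes on a tiny in-memory archive, so it runs without INTRODUCTION.docx. The helper names (make_docx_bytes, load_tree, tree_from_string) are invented for this example, and the archive holds only word/document.xml, so it is not a complete docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_docx_bytes():
    """Build a tiny zip holding only word/document.xml (hypothetical test data)."""
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
        '</w:body></w:document>' % W_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", xml)
    return buf.getvalue()


def load_tree(docx_file):
    """Parse word/document.xml straight from the archive with ET.parse."""
    with zipfile.ZipFile(docx_file) as z:
        with z.open("word/document.xml") as f:
            return ET.parse(f)


def tree_from_string(xml_bytes):
    """Parse XML text first, then wrap the root Element in an ElementTree."""
    return ET.ElementTree(ET.fromstring(xml_bytes))


data = make_docx_bytes()
tree_a = load_tree(io.BytesIO(data))
with zipfile.ZipFile(io.BytesIO(data)) as z:
    tree_b = tree_from_string(z.read("word/document.xml"))

# Both routes should yield the same sequence of element tags.
print([e.tag for e in tree_a.iter()] == [e.tag for e in tree_b.iter()])
```

Either way, what ends up inside the ElementTree is a parsed root Element rather than the raw string.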
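Thanks, that works. Can the ElementTree module help return the number of words in a particular colour in the docx? – 2014-09-14 08:42:19
You can use ElementTree to extract any information from an XML document, but if you need specific functionality such as a word count, you have to build it yourself. – mzjn 2014-09-14 08:53:16
After creating tree, add 'print type(tree)' and make sure it is not a string – gosom 2014-09-13 10:01:57
Yes, it shows type ElementTree – 2014-09-13 10:03:45
Can you write a standalone script and paste the full traceback? – 2014-09-13 10:09:49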
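On the word-count-by-colour question raised in the comments, here is one rough way such a function could be built by hand. It assumes the colour is set directly on the run as a w:color element inside w:rPr; colour inherited from a style or theme is not handled, and a word split across two runs would be counted twice. The SAMPLE document and the count_words_by_colour name are invented for this sketch.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}  # prefix map for find()/findall()

# Hypothetical document fragment with two red runs and one uncoloured run.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>two red </w:t></w:r>'
    '<w:r><w:t>plain words here </w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>words</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)


def count_words_by_colour(root, colour):
    """Count whitespace-separated words in runs whose direct w:color matches."""
    wanted = colour.upper()
    total = 0
    for run in root.iter("{%s}r" % W_NS):
        colour_el = run.find("w:rPr/w:color", NS)
        if colour_el is None:
            continue  # no direct colour on this run
        if colour_el.get("{%s}val" % W_NS, "").upper() != wanted:
            continue
        for t in run.findall("w:t", NS):
            total += len((t.text or "").split())
    return total


root = ET.fromstring(SAMPLE)
print(count_words_by_colour(root, "ff0000"))
```

Attribute names are namespaced too, which is why the value is read with the full {namespace}val key rather than "w:val".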