How do I parse a .txt file into an .xml file?

Problem description:

In File Name: C:\Users\naqushab\desktop\files\File 1.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2 
In File Size: Low: 22636 High: 0 
Total Process time: 1.859000 
Out File Size: Low: 77619 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 2.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 2.m2 
In File Size: Low: 20673 High: 0 
Total Process time: 3.094000 
Out File Size: Low: 94485 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 3.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 3.m2 
In File Size: Low: 66859 High: 0 
Total Process time: 3.516000 
Out File Size: Low: 217268 High: 0

I am trying to parse this into XML in the following format:

<?xml version='1.0' encoding='utf-8'?> 
<root> 
    <filedata> 
     <InFileName>File 1.m1</InFileName> 
     <OutFileName>File 1.m2</OutFileName> 
     <InFileSize>22636</InFileSize> 
     <OutFileSize>77619</OutFileSize> 
     <ProcessTime>1.859000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 2.m1</InFileName> 
     <OutFileName>File 2.m2</OutFileName> 
     <InFileSize>20673</InFileSize> 
     <OutFileSize>94485</OutFileSize> 
     <ProcessTime>3.094000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 3.m1</InFileName> 
     <OutFileName>File 3.m2</OutFileName> 
     <InFileSize>66859</InFileSize> 
     <OutFileSize>217268</OutFileSize> 
     <ProcessTime>3.516000</ProcessTime> 
    </filedata> 
</root>

Here is the code (I am using Python 2) with which I am trying to do this:

import re 
import xml.etree.ElementTree as ET 

rex = re.compile(r'''(?P<title>In File Name: 
         |Out File Name: 
         |In File Size: Low: 
         |Total Process time: 
         |Out File Size: Low: 
        ) 
        (?P<value>.*) 
        ''', re.VERBOSE) 

root = ET.Element('root') 
root.text = '\n' # newline before the celldata element 

with open('Performance.txt') as f: 
    celldata = ET.SubElement(root, 'filedata') 
    celldata.text = '\n' # newline before the collected element 
    celldata.tail = '\n\n' # empty line after the celldata element 
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging 
ET.dump(root) 

# Include the root element to the tree and write the tree 
# to the file. 
tree = ET.ElementTree(root) 
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

But I am getting empty values. Is it possible to parse this txt file into XML?

Where are you getting empty values? Could you please be clearer? –

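When a complete program *does not give the expected result*, just break it into smaller parts and try them separately. Here, you should first simply parse the input and print the parts you can find. Only then try to build an XML file. –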

Also, your regex and the sub-element names do not match! Is that intentional? –

Answer

A correction to your regex: it should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

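instead of what you have. Note that | has the lowest precedence in a regular expression, so In File Name|Out File Name does match either whole phrase; it does not mean In File Nam followed by e or O followed by ut File Name. The parentheses above only make the grouping explicit. What actually stops your pattern from matching is re.VERBOSE: with that flag the unescaped spaces in the pattern are ignored, so it looks for InFileName: and never finds it in the input.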
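As a quick check, the corrected pattern can be tried on a few lines from the question before any XML is built. This is only a minimal sketch: the parse_line helper and the sample_lines list are names made up for the illustration, and the print calls are written in Python 3 style.

```python
import re

# The pattern from the answer, compiled once. There is no re.VERBOSE here,
# so the spaces inside the pattern are matched literally.
PATTERN = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)

def parse_line(line):
    """Return (title, value) for a recognised line, or None otherwise."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group('title'), m.group('value').strip()

# Sample input copied from the question's text file (hypothetical subset).
sample_lines = [
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1 ',
    'In File Size: Low: 22636 High: 0 ',
    'Total Process time: 1.859000 ',
    '',
]

for line in sample_lines:
    print(parse_line(line))
```

The value still holds the full path or the trailing High: part, so the caller has to trim it further before writing it into an element.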

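Suggestion:

You can do this without using regular expressions. xml.dom.minidom can be used to prettify your XML string.

I have added inline comments for better understanding!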

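Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.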
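To see what toprettyxml does in isolation, here is a small sketch that builds a tiny tree by hand and pretty-prints it. The element names are borrowed from the question's target format, and the four-space indent is just an assumption for the demo.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Build a tiny tree by hand; the element names follow the question's target format.
root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring() gives one unindented line; minidom re-parses it and
# toprettyxml() adds the line breaks and indentation.
compact = ET.tostring(root)
pretty = minidom.parseString(compact).toprettyxml(indent='    ')

print(pretty)
```

Without the encoding argument the result is a text string. Passing encoding='utf-8' makes toprettyxml return an encoded byte string instead, which matters when writing it to a file.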

Edit

import itertools as it 
[line[0] for line in it.groupby(lines)] 
You can use groupby from the itertools package on the list lines, as above, to collapse consecutive duplicate lines into one.
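A small illustration of that groupby idiom, using a made-up list of lines with repeated blanks:

```python
import itertools as it

# Hypothetical input: runs of blank lines before, between and after the data.
lines = ['', '', 'In File Name: a', '', '', '', 'In File Name: b', '']

# groupby yields (key, group) pairs, one per run of equal consecutive items,
# so taking the key of each pair collapses every run into a single entry.
deduped = [key for key, _group in it.groupby(lines)]

print(deduped)
# ['', 'In File Name: a', '', 'In File Name: b', '']
```

Note that a single blank entry still remains at the start and at the end, and that identical non-blank lines that follow each other are merged as well.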

So,

import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file,
# collapse consecutive duplicate lines into one
for line, _ in it.groupby(lines):
    # an empty line marks the break between subelements
    if not line:
        # add the next subelement and use it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ':' to get the tag name
        tag = line.split(":")
        # format the tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()

        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # list() keeps this working on Python 3, where filter() is lazy
            splist = list(filter(None, line.split(" ")))
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1]
        el.text = tag

# prettify the XML
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()
# display for debugging
print(formatedXML)

# write formatedXML to the file; it is a byte string because encoding is given
with open("Performance.xml", "wb") as f:
    f.write(formatedXML)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?> 
<root> 
<filedata> 
    <InFileName>File 1.m1</InFileName> 
    <OutFileName>File 1.m2</OutFileName> 
    <InFileSize>22636</InFileSize> 
    <TotalProcesstime>1.859000</TotalProcesstime> 
    <OutFileSize>77619</OutFileSize> 
</filedata> 
<filedata> 
    <InFileName>File 2.m1</InFileName> 
    <OutFileName>File 2.m2</OutFileName> 
    <InFileSize>20673</InFileSize> 
    <TotalProcesstime>3.094000</TotalProcesstime> 
    <OutFileSize>94485</OutFileSize> 
</filedata> 
<filedata> 
    <InFileName>File 3.m1</InFileName> 
    <OutFileName>File 3.m2</OutFileName> 
    <InFileSize>66859</InFileSize> 
    <TotalProcesstime>3.516000</TotalProcesstime> 
    <OutFileSize>217268</OutFileSize> 
</filedata> 
</root>

Hope it helps!

Perfect! Just one thing: how do I check for multiple newlines, since the generated txt may have some blank lines at the beginning and end? – naqushab

itertools groupby should do the trick! I have added an edit for the same. –
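One possible way to make blank lines at the beginning and end harmless is to group the lines by whether they are blank, and start a filedata element only for runs of real lines. The sketch below is an alternative to the code in the answer, not the answerer's method; build_tree and sample are hypothetical names, and the file-name and file-size trimming is left out to keep it short.

```python
import itertools as it
import xml.etree.ElementTree as ET

def build_tree(lines):
    """Build <root> with one <filedata> per block of non-blank lines."""
    root = ET.Element('root')
    # Group by "is this line blank?" so each run of real lines is one record
    # and every run of blank lines, wherever it occurs, is simply skipped.
    for is_blank, block in it.groupby(lines, key=lambda s: not s.strip()):
        if is_blank:
            continue
        filedata = ET.SubElement(root, 'filedata')
        for line in block:
            # Text before the first ':' becomes the tag, the rest the value.
            name, _, value = line.partition(':')
            ET.SubElement(filedata, name.replace(' ', '')).text = value.strip()
    return root

# Hypothetical input with blank lines at the start, in the middle and at the end.
sample = ['', '  ', 'Total Process time: 1.859000 ', '', '',
          'Total Process time: 3.094000 ', '']
root = build_tree(sample)
print(len(root.findall('filedata')))
print([e.text for e in root.iter('TotalProcesstime')])
```

Because blank runs never create an element, no empty filedata entries end up in the output, and lines that contain only spaces are treated as blank too.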

Answer

From the documentation (emphasis mine):

re.VERBOSE
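This flag allows you to write regular expressions that look nicer. *Whitespace within the pattern is ignored*, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

Escape the spaces in your regex, or use the \s class.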
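Applied to the pattern from the question, escaping the spaces could look like the sketch below. The broken pattern is a shortened, unescaped variant in the style of the question's regex, included only to show the difference; the sample line is taken from the question's input.

```python
import re

# Same alternatives as in the question, but with every literal space escaped
# so that re.VERBOSE keeps it. The layout whitespace is still ignored.
rex = re.compile(r'''(?P<title>In\ File\ Name:
         |Out\ File\ Name:
         |In\ File\ Size:\ Low:
         |Total\ Process\ time:
         |Out\ File\ Size:\ Low:
        )
        (?P<value>.*)
        ''', re.VERBOSE)

# Shortened unescaped variant: under re.VERBOSE it is compiled as
# InFileName:|TotalProcesstime: and therefore cannot match the input.
broken = re.compile(r'''(?P<title>In File Name: | Total Process time:)
        (?P<value>.*)
        ''', re.VERBOSE)

line = 'Total Process time: 1.859000 '
m = rex.search(line)
print(m.group('title'), m.group('value').strip())
print(broken.search(line))
```

The escaped pattern finds the title and value, while the unescaped one returns None for the same line.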
