How do I parse a .txt file into an .xml file?
This is my txt file:
In File Name: C:\Users\naqushab\desktop\files\File 1.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size: Low: 22636 High: 0
Total Process time: 1.859000
Out File Size: Low: 77619 High: 0
In File Name: C:\Users\naqushab\desktop\files\File 2.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size: Low: 20673 High: 0
Total Process time: 3.094000
Out File Size: Low: 94485 High: 0
In File Name: C:\Users\naqushab\desktop\files\File 3.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size: Low: 66859 High: 0
Total Process time: 3.516000
Out File Size: Low: 217268 High: 0
I am trying to parse this into XML in the following format:
<?xml version='1.0' encoding='utf-8'?>
<root>
<filedata>
<InFileName>File 1.m1</InFileName>
<OutFileName>File 1.m2</OutFileName>
<InFileSize>22636</InFileSize>
<OutFileSize>77619</OutFileSize>
<ProcessTime>1.859000</ProcessTime>
</filedata>
<filedata>
<InFileName>File 2.m1</InFileName>
<OutFileName>File 2.m2</OutFileName>
<InFileSize>20673</InFileSize>
<OutFileSize>94485</OutFileSize>
<ProcessTime>3.094000</ProcessTime>
</filedata>
<filedata>
<InFileName>File 3.m1</InFileName>
<OutFileName>File 3.m2</OutFileName>
<InFileSize>66859</InFileSize>
<OutFileSize>217268</OutFileSize>
<ProcessTime>3.516000</ProcessTime>
</filedata>
</root>
Here is the code (I am using Python 2) with which I tried to do that:
import re
import xml.etree.ElementTree as ET
rex = re.compile(r'''(?P<title>In File Name:
|Out File Name:
|In File Size: Low:
|Total Process time:
|Out File Size: Low:
)
(?P<value>.*)
''', re.VERBOSE)
root = ET.Element('root')
root.text = '\n' # newline before the celldata element
with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n' # newline before the collected element
    celldata.tail = '\n\n' # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'
        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')
            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'
# Display for debugging
ET.dump(root)
# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)
But I get empty values. Is it possible to parse this txt into XML?
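A correction to your regular expression: it should be
m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)
instead of what you have. Your pattern is compiled with re.VERBOSE, so the unescaped spaces inside alternatives such as In File Name: are ignored, and the pattern effectively looks for InFileName:, OutFileName: and so on, which never occur in the input. The version above is used without re.VERBOSE, so its spaces are matched literally.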
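A quick way to check the corrected pattern is to run it against one line of the input. This is only a sketch: the sample line is copied from the question, and the variable names are illustrative.

```python
import re

# The corrected pattern, used without re.VERBOSE so that spaces are literal.
pattern = (r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
           r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)')

line = 'In File Size: Low: 22636 High: 0'
m = re.search(pattern, line)
title = m.group('title')            # the label part of the line
value = m.group('value').split()[0] # first token after the label
print('%s -> %s' % (title, value))
```

Note that the value group captures the rest of the line, so it still has to be trimmed before it is used as element text.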
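A suggestion:
You can do this without regular expressions. xml.dom.minidom can be used to prettify your XML string.
I have added inline comments for better understanding!
Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])
Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.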
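As a small illustration of toprettyxml, the sketch below builds a tiny tree with ElementTree and pretty-prints it through minidom. The element names are taken from the desired output in the question; the two-space indent is just an assumption for readability.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring gives a compact serialization; minidom re-parses and indents it.
compact = ET.tostring(root)
pretty = minidom.parseString(compact).toprettyxml(indent='  ', newl='\n')
print(pretty)
```

Passing encoding='utf-8' as well makes toprettyxml return an encoded byte string instead of text, which matters when writing the result to a file.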
Edit
You can use groupby from the itertools package on the list of lines to collapse consecutive duplicates:
import itertools as it
[line[0] for line in it.groupby(lines)]
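To see what this does, here is a minimal sketch on a made-up list of lines (the sample data is not from the question):

```python
import itertools as it

lines = ['a', '', '', 'b', 'b', '', 'c']
# groupby yields (key, group) pairs; keeping only the key drops consecutive repeats.
deduped = [key for key, _group in it.groupby(lines)]
print(deduped)
```

Runs of blank lines collapse to a single blank line, but so do identical data lines that happen to follow each other, which is worth keeping in mind if two records could ever contain the same line back to back.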
So:
import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')
with open('file1.txt') as f:
    lines = f.read().splitlines()
# add the first subelement
celldata = ET.SubElement(root, 'filedata')
# for every line in the input file,
# with consecutive duplicates (e.g. repeated blank lines) collapsed to one
for line, _group in it.groupby(lines):
    # an empty line marks the break between subelements
    if not line:
        # add the next subelement and keep it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ":" to get the tag name
        tag = line.split(":")
        # format the tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()
        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # split() drops empty strings, so this also works on Python 3
            splist = line.split()
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1]
        el.text = tag
# prettify the xml
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()
# Display for debugging
print(formatedXML.decode('utf-8'))
# write the formatted XML to file; with an encoding, toprettyxml
# returns an encoded byte string, so open the file in binary mode
with open("Performance.xml", "wb") as f:
    f.write(formatedXML)
Output: Performance.xml
<?xml version="1.0" encoding="utf-8"?>
<root>
<filedata>
<InFileName>File 1.m1</InFileName>
<OutFileName>File 1.m2</OutFileName>
<InFileSize>22636</InFileSize>
<TotalProcesstime>1.859000</TotalProcesstime>
<OutFileSize>77619</OutFileSize>
</filedata>
<filedata>
<InFileName>File 2.m1</InFileName>
<OutFileName>File 2.m2</OutFileName>
<InFileSize>20673</InFileSize>
<TotalProcesstime>3.094000</TotalProcesstime>
<OutFileSize>94485</OutFileSize>
</filedata>
<filedata>
<InFileName>File 3.m1</InFileName>
<OutFileName>File 3.m2</OutFileName>
<InFileSize>66859</InFileSize>
<TotalProcesstime>3.516000</TotalProcesstime>
<OutFileSize>217268</OutFileSize>
</filedata>
</root>
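Hope it helps!
Perfect! Just one thing: how can I check for multiple newlines, since the generated txt may have some blank lines at the beginning and end? – naqushab
itertools groupby should do the trick! I have added an edit for that. –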
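One caveat, as far as I can tell: collapsing repeated blank lines still leaves a single blank line at the start or end of the file, and that one would open an empty filedata element. A possible variation is to let groupby split the lines into blank and non-blank runs and keep only the non-blank ones. The helper name split_records and the sample lines below are hypothetical, just for illustration.

```python
import itertools as it

def split_records(lines):
    """Group consecutive non-blank lines into records, skipping blank runs."""
    records = []
    # The key is True for lines with content and False for blank lines,
    # so each group is either one record or one run of blank lines.
    for has_content, group in it.groupby(lines, key=lambda s: bool(s.strip())):
        if has_content:
            records.append(list(group))
    return records

sample = ['', '', 'In File Name: a.m1', 'Total Process time: 1.0', '', '',
          'In File Name: b.m1', 'Total Process time: 2.0', '']
records = split_records(sample)
print(len(records))
```

Each record can then be turned into one filedata element, so leading, trailing, and repeated blank lines never produce empty elements.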
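From the docs (emphasis mine):
re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
Escape the spaces in your regular expression or use the \s class.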
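A minimal sketch of what that could look like if you want to keep re.VERBOSE: the spaces inside the labels are escaped with a backslash and \s* absorbs the spacing after the colons. The exact layout of the pattern is an assumption, not the asker's final code.

```python
import re

# Unescaped spaces are dropped under re.VERBOSE, so this never matches the input.
broken = re.compile(r'In File Name:', re.VERBOSE)

rex = re.compile(r'''
    (?P<title>In\ File\ Name
             |Out\ File\ Name
             |In\ File\ Size:\s*Low
             |Total\ Process\ time
             |Out\ File\ Size:\s*Low
    ):\s*
    (?P<value>.*)
''', re.VERBOSE)

m = rex.search('Total Process time: 1.859000')
print('%s -> %s' % (m.group('title'), m.group('value')))
print(broken.search('In File Name: x'))
```

The second print shows None, because the broken pattern is really looking for the text InFileName: with no spaces.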
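Where do you get empty values? Could you please be clearer? –
When a complete program *does not give the expected result*, just split it into smaller parts and try them separately. Here, you should first simply parse the input and print the parts you can find. Only then try to build an XML file. –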
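For example, a first step along those lines could be a small function that parses a single line and prints the result, with no XML involved yet. The function name parse_line is hypothetical; the sample lines are copied from the question.

```python
def parse_line(line):
    """Return (label, value) for one 'Label: value' line, or None if blank."""
    if not line.strip():
        return None
    # Split on the first colon only, so the drive letter in paths is untouched.
    label, _, rest = line.partition(':')
    rest = rest.strip()
    if 'File Name' in label:
        rest = rest.split('\\')[-1]   # keep only the file name
    elif 'File Size' in label:
        rest = rest.split()[1]        # the token right after 'Low:'
    return label.strip(), rest

sample = [
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1',
    'In File Size: Low: 22636 High: 0',
    'Total Process time: 1.859000',
]
for line in sample:
    print(parse_line(line))
```

Once the printed pairs look right, building the XML elements from them is a separate and much smaller problem.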
Also, your regular expression and the sub-element names do not match! Is that intentional? –