How can I process this text file and parse out what I need?

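Question:

I'm trying to parse the output of the Python doctest module and store it in an HTML file. How can I process this text file and parse out what I need?

I have output similar to this: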

********************************************************************** 
File "example.py", line 16, in __main__.factorial 
Failed example: 
    [factorial(n) for n in range(6)] 
Expected: 
    [0, 1, 2, 6, 24, 120] 
Got: 
    [1, 1, 2, 6, 24, 120] 
********************************************************************** 
File "example.py", line 20, in __main__.factorial 
Failed example: 
    factorial(30) 
Expected: 
    25252859812191058636308480000000L 
Got: 
    265252859812191058636308480000000L 
********************************************************************** 
1 items had failures: 
    2 of 8 in __main__.factorial 
***Test Failed*** 2 failures.

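Each failure is preceded by a line of asterisks, which delimits the test failures from one another.

What I would like to do is strip out (extract) the filename and method that failed, as well as the expected and actual results. Then I would like to create an HTML document from this (or store it in a text file and then do a second round of parsing).

How can I do this using just Python or some combination of UNIX shell utilities?

Edit: I came up with the following shell script, which matches each block the way I want, but I'm not sure how to redirect each sed match to its own file.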

python example.py | sed -n '/.*/,/^\**$/p' > `mktemp error.XXX`

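If you strip out the file, method, expected result and actual result, what is left? – juanjux 2009-08-07 20:20:22

Well, I just can't get them parsed into separate pieces. So far I can only grab a whole block at once, not the individual fields. – samoz 2009-08-07 20:26:12

Answer

Here is a quick and dirty script that parses the output into tuples with the relevant information: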

import sys 
import re 

stars_re = re.compile('^[*]+$', re.MULTILINE) 
file_line_re = re.compile(r'^File "(.*?)", line (\d*), in (.*)$') 

doctest_output = sys.stdin.read() 
chunks = stars_re.split(doctest_output)[1:-1] 

for chunk in chunks: 
    chunk_lines = chunk.strip().splitlines() 
    m = file_line_re.match(chunk_lines[0]) 

    file, line, module = m.groups() 
    failed_example = chunk_lines[2].strip() 
    expected = chunk_lines[4].strip() 
    got = chunk_lines[6].strip() 

    print (file, line, module, failed_example, expected, got)
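The question also asks for an HTML document. As a rough sketch of that second step, the tuples produced above could be rendered as a table. The function name `failures_to_html` and the table layout are assumptions made for illustration, not part of the original answer. `html.escape` is in the standard library from Python 3.2 onward.

```python
import html


def failures_to_html(failures):
    """Render (file, line, module, example, expected, got) tuples as an HTML table."""
    headers = ("File", "Line", "Test", "Failed example", "Expected", "Got")
    rows = ["<tr>" + "".join("<th>%s</th>" % h for h in headers) + "</tr>"]
    for failure in failures:
        # Escape every field so that characters like < and & in the
        # doctest output do not break the generated markup.
        cells = "".join("<td>%s</td>" % html.escape(str(field)) for field in failure)
        rows.append("<tr>" + cells + "</tr>")
    return "<html><body><table>\n" + "\n".join(rows) + "\n</table></body></html>"


if __name__ == "__main__":
    sample = [("example.py", "16", "__main__.factorial",
               "[factorial(n) for n in range(6)]",
               "[0, 1, 2, 6, 24, 120]", "[1, 1, 2, 6, 24, 120]")]
    print(failures_to_html(sample))
```

To use it with the script above, collect each tuple into a list inside the loop instead of printing it, then write the returned string to a file.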

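Answer

You could write a Python program to pick this apart, but it may be better to modify doctest so that it outputs the report you want in the first place. From the documentation for doctest.DocTestRunner: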

    ... the display output
    can be also customized by subclassing DocTestRunner, and
    overriding the methods `report_start`, `report_success`,
    `report_unexpected_exception`, and `report_failure`.
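A minimal sketch of that approach, assuming Python 3, might look like the following. The names `CollectingRunner`, `collected` and `collect_failures` are hypothetical, and the `factorial` docstring contains a deliberately wrong expectation so that there is a failure to capture. Only `report_failure` is overridden here. Unexpected exceptions would still be printed in the default format unless `report_unexpected_exception` is overridden as well.

```python
import doctest


class CollectingRunner(doctest.DocTestRunner):
    """Collects failures as tuples instead of printing the default report."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.collected = []  # hypothetical attribute, not part of doctest

    def report_failure(self, out, test, example, got):
        # test.lineno may be None when doctest cannot locate the source;
        # example.lineno is the 0-based offset within the docstring.
        line = None
        if test.lineno is not None:
            line = test.lineno + example.lineno + 1
        self.collected.append(
            (test.filename, line, test.name,
             example.source.strip(), example.want.strip(), got.strip()))


def factorial(n):
    """Deliberately wrong expectation so that one example fails.

    >>> factorial(3)
    7
    """
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def collect_failures(obj, name):
    """Run the doctests found in obj and return the collected failure tuples."""
    runner = CollectingRunner(verbose=False)
    finder = doctest.DocTestFinder()
    # Pass globs explicitly so the examples can see the object under test.
    for test in finder.find(obj, name, globs={obj.__name__: obj}):
        runner.run(test)
    return runner.collected


if __name__ == "__main__":
    for failure in collect_failures(factorial, "factorial"):
        print(failure)
```

Because the fields arrive as separate arguments, no text parsing is needed. The collected tuples can be written straight into whatever HTML template you prefer.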

I'll definitely take a look at this! – samoz 2009-08-07 22:20:50

Answer

I wrote a quick parser in pyparsing to do this.

from pyparsing import * 

str = """ 
********************************************************************** 
File "example.py", line 16, in __main__.factorial 
Failed example: 
    [factorial(n) for n in range(6)] 
Expected: 
    [0, 1, 2, 6, 24, 120] 
Got: 
    [1, 1, 2, 6, 24, 120] 
********************************************************************** 
File "example.py", line 20, in __main__.factorial 
Failed example: 
    factorial(30) 
Expected: 
    25252859812191058636308480000000L 
Got: 
    265252859812191058636308480000000L 
********************************************************************** 
""" 

quote = Literal('"').suppress() 
comma = Literal(',').suppress() 
in_ = Keyword('in').suppress() 
block = OneOrMore("**").suppress() + \ 
     Keyword("File").suppress() + \ 
     quote + Word(alphanums + ".") + quote + \ 
     comma + Keyword("line").suppress() + Word(nums) + comma + \ 
     in_ + Word(alphanums + "._") + \ 
     LineStart() + restOfLine.suppress() + \ 
     LineStart() + restOfLine + \ 
     LineStart() + restOfLine.suppress() + \ 
     LineStart() + restOfLine + \ 
     LineStart() + restOfLine.suppress() + \ 
     LineStart() + restOfLine 

all = OneOrMore(Group(block)) 

result = all.parseString(str) 

for section in result: 
    print(section)

This gives:

['example.py', '16', '__main__.factorial', ' [factorial(n) for n in range(6)]', ' [0, 1, 2, 6, 24, 120]', ' [1, 1, 2, 6, 24, 120]'] 
['example.py', '20', '__main__.factorial', ' factorial(30)', ' 25252859812191058636308480000000L', ' 265252859812191058636308480000000L']

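Very nice work! I think I'll play around with this... – samoz 2009-08-07 22:21:24

Why does str have 3 " marks before and after the text? Sorry, my Python really isn't that good. – samoz 2009-08-08 20:23:13

The triple quotes just denote a string literal that can span multiple lines. – 2009-08-08 22:23:39

Answer

This is probably one of the least elegant Python scripts I've ever written, but it should have the framework to do what you want without resorting to UNIX utilities and separate scripts to create the HTML. It's untested, but it should only need minor tweaking to work.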

import os

# Create a list of all files in the current directory.
dirList = os.listdir('.')

# Ignore anything that isn't a .txt file.
#
# Read in text, then split it into a list.
for thisFile in dirList:
    if not thisFile.endswith(".txt"):
        continue

    infile = open(thisFile, 'r')
    rawText = infile.read()
    infile.close()

    yourList = rawText.split('\n')

    # Strings
    compiledText = ''
    htmlText = ''

    for i in yourList:

        # clunky way of seeing whether or not current line
        # should be included in compiledText

        if i.startswith("*****"):
            compiledText += "\n\n--- New Report ---\n"

        if i.startswith("File"):
            compiledText += i + '\n'

        if i.startswith("Fail"):
            compiledText += i + '\n'

        if i.startswith("Expe"):
            compiledText += i + '\n'

        if i.startswith("Got"):
            compiledText += i + '\n'

        if i.startswith(" "):
            compiledText += i + '\n'

    # insert your HTML template below

    htmlText = '<html>...\n <body> \n ' + compiledText + '</body>... </html>'

    # write out to file (the 'processed' directory must already exist)
    outfile = open('processed/' + thisFile + '.html', 'w')
    outfile.write(htmlText)
    outfile.close()
