Read a file line by line, but in reverse (last line first, then the second-to-last, etc.)

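Question:

I want to remove trailing blank lines from a file, if there are any. Currently I do this by reading the file into memory, removing the blank lines there, and overwriting the file. The file is large, though (30,000+ lines, and the lines are long), so this takes 2-3 seconds.

So I want to read the file line by line, but backwards, until I reach the first non-blank line. That is, I start with the last line, then the second-to-last, and so on. Then I would truncate the file instead of overwriting it.

What is the best way to read it in reverse? Right now I'm thinking of reading 64k chunks, looping through the string character by character until I have a line, and, when I run out of the 64k, reading another 64k and prepending it, and so on.

I assume there is no standard function or library for reading in reverse order?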

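How many blank lines do you expect to have? Thousands? Each one is probably just a single newline character, so I'd think even 64k bytes might be overkill. – Blckknght 2014-09-22 08:51:46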

It might be, but it's still a drastic optimization compared to reading everything into memory. – sashoalm 2014-09-22 08:53:09

There's no built-in function to do this; I had to write a class for it. I'll see if I can get permission to post it. – 2014-09-22 08:59:19

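Here's some code, a modified version of something I found elsewhere (probably here on Stack Overflow, actually...). I've extracted the two key methods that handle reading backwards.

The reversed_blocks iterator reads the file backwards in blocks of whatever size you like. The reversed_lines iterator splits each block into lines and holds back the first line of the block, since it may be incomplete. If the next block (the one just before it in the file) ends with a newline, the held-back line was complete and is returned as is; if not, it is appended to the last line of the new block, which completes the line that was split across the block boundary.

All of the state is maintained by Python's iterator machinery, so we don't have to store it anywhere ourselves. This also means you can read several files backwards at once if you need to, since the state is bound to each iterator.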

import os

# These are methods lifted out of a class; drop "self" to use them as plain
# functions. Open the file in binary mode ("rb") so that offsets are byte counts.

def reversed_lines(self, file):
    "Generate the lines of file (opened in binary mode) in reverse order."
    newline_chars = (b'\r', b'\n')
    tail = None  # None means "no block read yet"; b"" is a real empty line
    for block in self.reversed_blocks(file):
        if block:
            # First split the whole block into lines and reverse the list
            reversed_lines = block.splitlines()
            reversed_lines.reverse()

            if tail is not None:
                # If the last char of the block is not a newline, then the last line
                # crosses a block boundary, and the tail (possible partial line from
                # the previous block) should be added to it.
                if block[-1:] not in newline_chars:
                    reversed_lines[0] = reversed_lines[0] + tail

                # Otherwise, the block ended on a line boundary, and the tail is a
                # complete line itself (possibly an empty one). Caveat: a "\r\n"
                # pair split across two blocks is seen as two line breaks.
                else:
                    reversed_lines.insert(0, tail)

            # Within the current block, we can't tell if the first line is complete
            # or not, so we extract it and save it for the next go-round with a new
            # block. We yield instead of returning so all the internal state of this
            # iteration is preserved (how many lines returned, current tail, etc.).
            tail = reversed_lines.pop()

            for reversed_line in reversed_lines:
                yield reversed_line

    # We're out of blocks now; if there's a tail left over from the last block we read,
    # it's the very first line in the file. Yield that and we're done.
    if tail is not None:
        yield tail

def reversed_blocks(self, file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."

    # Jump to the end of the file, and save the file offset.
    file.seek(0, os.SEEK_END)
    here = file.tell()

    # When the file offset reaches zero, we've read the whole file.
    while 0 < here:
        # Compute how far back we can step; either there's at least one
        # full block left, or we've gotten close enough to the start that
        # we'll read whatever is left.
        delta = min(blocksize, here)

        # Back up to there and read the block; we yield it so that the
        # variable containing the file offset is retained.
        file.seek(here - delta, os.SEEK_SET)
        yield file.read(delta)

        # Move the pointer back by the amount we just handed out. If we've
        # read the last block, "here" will now be zero.
        here -= delta

reversed_lines is a generator, so you run it in a loop:

for line in self.reversed_lines(fh): 
    do_something_with_the_line(line) 

The comments may be superfluous, but they were useful to me while I was working out how the iterators do their job.
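
To tie this back to the question: because the lines are produced lazily, a loop can stop at the first non-blank line without touching the rest of the file. Below is a minimal, self-contained sketch of that idea. It uses a compact plain-function stand-in for the two methods above rather than the methods themselves; the helper name count_trailing_blank_lines is made up for illustration, and the sketch assumes a file opened in binary mode with "\n" line endings.

```python
import io
import os

def reversed_lines(f, blocksize=4096):
    """Yield the lines of binary file f from last to first, without line endings."""
    f.seek(0, os.SEEK_END)
    here = f.tell()
    tail = None  # first (possibly partial) line of the block read previously
    while here > 0:
        delta = min(blocksize, here)
        f.seek(here - delta)
        block = f.read(delta)
        here -= delta
        lines = block.splitlines()
        if tail is not None:
            if block.endswith(b"\n"):
                lines.append(tail)    # the tail was a complete line
            else:
                lines[-1] += tail     # the tail completes a line split across blocks
        tail = lines.pop(0)           # may be incomplete; hold it for the next block
        yield from reversed(lines)
    if tail is not None:
        yield tail                    # the very first line of the file

def count_trailing_blank_lines(f):
    """Hypothetical helper: count blank lines at the end of f, stopping early."""
    count = 0
    for line in reversed_lines(f):
        if line.strip():
            break                     # first non-blank line: nothing more is read
        count += 1
    return count

data = io.BytesIO(b"first\nsecond\n\n\n")
print(count_trailing_blank_lines(data))  # prints 2
```

Only the blocks needed to reach the first non-blank line are read, which is the point of going backwards in the first place.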

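import os

with open(filename, "r+b") as f:  # binary read/write mode, so truncate() is allowed
    size = os.stat(filename).st_size
    f.seek(max(size - 4096, 0))  # do not seek before the start of a small file
    block = f.read(4096)
    # Find amount to truncate
    f.truncate(...)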
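
One way the truncation step could be filled in is sketched below. This is only a sketch under stated assumptions, not a definitive implementation: the function name strip_trailing_blank_lines is made up, only truly empty lines count as blank (a line holding spaces is treated as content), and the terminator of the last non-empty line is kept. It walks backwards block by block so it also copes with more than 4096 bytes of trailing newlines.

```python
import os

def strip_trailing_blank_lines(path, blocksize=4096):
    """Truncate empty lines at the end of the file in place; return the new size."""
    with open(path, "r+b") as f:       # binary mode: offsets are plain byte counts
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        keep = 0                        # stays 0 if the file holds only newlines
        while pos > 0:
            delta = min(blocksize, pos)
            f.seek(pos - delta)
            stripped = f.read(delta).rstrip(b"\r\n")
            if stripped:
                # Offset just past the last byte that is not a line break.
                keep = pos - delta + len(stripped)
                break
            pos -= delta
        if keep > 0:
            # Keep the line terminator of the last non-empty line, if it has one.
            f.seek(keep)
            terminator = f.read(2)
            if terminator.startswith(b"\r\n"):
                keep += 2
            elif terminator[:1] in (b"\n", b"\r"):
                keep += 1
        f.truncate(keep)
        return keep
```

For a file whose blank lines all sit in the last block, this does a single read and a single truncate, so the cost no longer depends on the size of the file.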

By the way, you can use 'f.seek(-4096, 2)'. – sashoalm 2014-09-22 09:02:07

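So you do know how to read a file from the end? Or did I misunderstand your question? You can easily get the amount of data to truncate by computing '4096 - len(block.rstrip())'. – filmor 2014-09-22 10:45:46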

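This gives you the blocks in reverse, but not the lines. See my iterator-based answer for a neat trick for keeping track of block and line offsets, so you don't have to maintain them yourself. – 2014-09-22 18:14:36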