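Read a file line by line, but in reverse (last line first, then the second-to-last line, etc.)
Question:
I want to remove trailing blank lines from a file, if there are any. Currently I do this by reading the file into memory, removing the blank lines there, and overwriting the file. The file is large, though (30,000+ lines, and the lines are long), so this takes 2-3 seconds.
So I want to read the file line by line, but backwards, until I reach the first non-blank line. That is, I start with the last line, then the second-to-last, and so on. Then I would truncate the file instead of overwriting it.
What is the best way to read it in reverse? Right now I'm thinking of reading blocks of 64k, then looping through the string character by character until I have a line, and when I run out of the 64k, reading another 64k and prepending it, and so on.
I assume there is no standard function or library for reading in reverse order?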
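Answer
Here is a modified version of some code I found elsewhere (probably here on Stack Overflow, in fact...) - I've extracted the two key methods that handle reading backward.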
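The reversed_blocks iterator reads the file backwards in blocks of whatever size you like, and the reversed_lines iterator splits each block into lines, saving the first line of the block. If the next block ends with a newline, the saved line is returned as a complete line; if not, the saved partial line is appended to the last line of the new block, completing the line that was split across the block boundary.
All of the state is maintained by Python's iterator machinery, so we don't have to store state anywhere; this also means you can read several files backwards at once if you need to, since the state is bound to the iterator.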
    import os  # reversed_blocks uses os.SEEK_END and os.SEEK_SET

    def reversed_lines(self, file):
        "Generate the lines of file in reverse order."
        newline_char_set = set(['\r', '\n'])
        tail = ""
        for block in self.reversed_blocks(file):
            if block is not None and len(block) > 0:
                # First split the whole block into lines and reverse the list
                reversed_lines = block.splitlines()
                reversed_lines.reverse()
                # If the last char of the block is not a newline, then the last line
                # crosses a block boundary, and the tail (possible partial line from
                # the previous block) should be added to it.
                if block[-1] not in newline_char_set:
                    reversed_lines[0] = reversed_lines[0] + tail
                # Otherwise, the block ended on a line boundary, and the tail is a
                # complete line itself.
                elif len(tail) > 0:
                    reversed_lines.insert(0, tail)
                # Within the current block, we can't tell if the first line is complete
                # or not, so we extract it and save it for the next go-round with a new
                # block. We yield instead of returning so all the internal state of this
                # iteration is preserved (how many lines returned, current tail, etc.).
                tail = reversed_lines.pop()
                for reversed_line in reversed_lines:
                    yield reversed_line
        # We're out of blocks now; if there's a tail left over from the last block we read,
        # it's the very first line in the file. Yield that and we're done.
        if len(tail) > 0:
            yield tail

    def reversed_blocks(self, file, blocksize=4096):
        "Generate blocks of file's contents in reverse order."
        # Jump to the end of the file, and save the file offset.
        file.seek(0, os.SEEK_END)
        here = file.tell()
        # When the file offset reaches zero, we've read the whole file.
        while 0 < here:
            # Compute how far back we can step; either there's at least one
            # full block left, or we've gotten close enough to the start that
            # we'll read the whole file.
            delta = min(blocksize, here)
            # Back up to there and read the block; we yield it so that the
            # variable containing the file offset is retained.
            file.seek(here - delta, os.SEEK_SET)
            yield file.read(delta)
            # Move the pointer back by the amount we just handed out. If we've
            # read the last block, "here" will now be zero.
            here -= delta
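reversed_lines is an iterator, so you run it in a loop:

    for line in self.reversed_lines(fh):
        do_something_with_the_line(line)

The comments are probably overkill, but they were useful to me while I was working out how the iterators do their job.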
Answer
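    import os

    # Open for reading and writing in binary mode so that truncate() is allowed.
    with open(filename, "rb+") as f:
        size = os.stat(filename).st_size
        # Read the last 4096 bytes (or the whole file if it is shorter).
        f.seek(max(size - 4096, 0))
        block = f.read(4096)
        # Find amount to truncate
        f.truncate(...)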
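
One possible way to fill in the "find amount to truncate" step is sketched below. It is an illustration under stated assumptions rather than this answer's actual method: the function name strip_trailing_blank_lines and its blocksize parameter are made up here, a "blank" line is taken to mean one containing only spaces, tabs, "\r" or "\n", and the file is assumed to use an ASCII-compatible encoding such as UTF-8. A file that is blank throughout is truncated to zero bytes.

```python
import os


def strip_trailing_blank_lines(filename, blocksize=4096):
    """Truncate the file after its last non-blank line; return bytes removed."""
    blank = b" \t\r\n"  # assumption: a line made only of these bytes is blank
    with open(filename, "rb+") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()

        # Walk backwards block by block until a non-blank byte shows up.
        content_end = None  # offset just past the last non-blank byte
        end = size
        while end > 0 and content_end is None:
            start = max(end - blocksize, 0)
            f.seek(start)
            stripped = f.read(end - start).rstrip(blank)
            if stripped:
                content_end = start + len(stripped)
            end = start

        if content_end is None:
            new_size = 0  # nothing but blank bytes in the file
        else:
            # Keep the rest of that line, up to and including its newline.
            new_size = size
            f.seek(content_end)
            pos = content_end
            while pos < size:
                chunk = f.read(blocksize)
                if not chunk:
                    break
                newline = chunk.find(b"\n")
                if newline != -1:
                    new_size = pos + newline + 1
                    break
                pos += len(chunk)

        f.truncate(new_size)
        return size - new_size
```

In the common case only the last block is read and the file is shortened in place, so the cost no longer depends on the total file size.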
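How many blank lines do you expect? Thousands? Each one is probably just a single newline character, so I think even 64k bytes might be overkill. – Blckknght 2014-09-22 08:51:46
It might be, but compared with reading everything into memory it is still a pretty drastic optimization. – sashoalm 2014-09-22 08:53:09
There is no built-in function to do this, but I had to write a class for it. I'll see if I can get permission to publish it. – 2014-09-22 08:59:19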