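Python - How to read specific lines in a text file?

Question:

I have a huge text file (12 GB). The lines are tab-delimited, and the first column contains an ID. For each ID I want to do something. My plan is therefore to start at the first line and walk through the first column line by line until the next ID is reached.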

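import linecache

start_line = b  # b: current 1-based line number (defined elsewhere in the original script)
num_lines = 377763316

while b < num_lines:
    # Fetch the previous and the current line and split them into columns
    plasmid1 = linecache.getline("Result.txt", b - 1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")

    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")

    # The ID in the first column changed: a new block starts at line b
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        # do something

    b += 1  # advance to the next line (not shown in the original post)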

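The code works, but the problem is that linecache seems to reload the txt file every time. Unless I improve the performance, the code will run for years.

I would appreciate your help if you have a good idea for solving this or know an alternative approach!

Thanks, Philipp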

The lines are tab-delimited? Sounds like columns to me? – RuDevel

Please show all of the code. What is 'linecache'? – eguaio

@eguaio: https://docs.python.org/3/library/linecache.html – cdarke

Answer

You should open the file only once and iterate over the lines one by one.

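with open('Result.txt', 'r') as f:
    aline = f.next()  # Python 2 file method; raises AttributeError on Python 3
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        # Split only at the first tab; the ID is the first field
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid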

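You get the idea: this is just plain Python. Only one line is read per iteration. The extra 1 argument to split makes it split only at the first tab, which improves performance. No specialized library will give you better performance; only a plain C implementation could beat this approach.

If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.X (see the question io-textiowrapper-object). Try this version instead: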

with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()  # returns '' at end of file, which ends the loop
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
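
The same single-pass idea can also be written with itertools.groupby from the standard library. This is a different technique from the loop above, not part of the original answer. The sketch assumes that all lines belonging to one ID are adjacent in the file, as described in the question; the names first_column and iter_id_groups and the sample data are made up for illustration.

```python
import io
from itertools import groupby


def first_column(line):
    """Return the ID, i.e. the text before the first tab."""
    return line.rstrip('\n').split('\t', 1)[0]


def iter_id_groups(lines):
    """Yield (id, list_of_lines) for each run of consecutive lines sharing an ID."""
    for current_id, group in groupby(lines, key=first_column):
        # Materialize the group so the caller can use it after iteration moves on
        yield current_id, list(group)


# Hypothetical sample data standing in for Result.txt
sample = io.StringIO("p1\ta\np1\tb\np2\tc\np3\td\np3\te\n")
for current_id, group in iter_id_groups(sample):
    print(current_id, len(group))
```

With a real file, the open file object can be passed in place of the sample, since groupby reads it lazily one line at a time. Each group is held in memory as a list, so this fits best when a single ID does not span a very large number of lines.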

Thanks for your comment! I get the following error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'. Any ideas? – Philipp

That is a Python 2 vs. 3 incompatibility. – eguaio

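Answer

I think numpy.loadtxt() is the way to go. It is also a good idea to pass the usecols argument to specify which columns of the file you actually need. NumPy is a solid, high-performance library.

After calling loadtxt(), you get back an ndarray.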

Answer

You can use itertools:

from itertools import takewhile

class EqualityChecker(object):
    """Predicate that is true while a line's first column equals the given ID."""
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        current_id = current_line.split('\t')[0]
        return self.id == current_id


with open('hugefile.txt', 'r') as f:
    # ids and do_stuff are assumed to be defined elsewhere
    for id in ids:
        checker = EqualityChecker(id)
        # Iterate over the file object directly; f.xreadlines() exists only in Python 2
        for line in takewhile(checker, f):
            do_stuff(line)

In the outer loop, id can actually be taken from the first line whose ID does not match the previous value.
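
One way to read that remark: takewhile consumes the first line that fails the predicate, so that line has to be remembered in order to start the next group. The following is a sketch under that assumption, not the answerer's code; line_id, RememberingChecker, process_groups and the sample lines are hypothetical names used only for illustration.

```python
from itertools import takewhile


def line_id(line):
    """Return the ID, i.e. the text before the first tab."""
    return line.rstrip('\n').split('\t', 1)[0]


class RememberingChecker:
    """Predicate for takewhile that keeps the first line it rejects."""

    def __init__(self, id):
        self.id = id
        self.rejected = None  # first line whose ID differs, if any

    def __call__(self, line):
        if line_id(line) == self.id:
            return True
        self.rejected = line
        return False


def process_groups(lines, do_stuff):
    """Call do_stuff(id, line) for every line, deriving each ID from the data."""
    it = iter(lines)
    first = next(it, None)
    while first is not None:
        checker = RememberingChecker(line_id(first))
        do_stuff(checker.id, first)
        for line in takewhile(checker, it):
            do_stuff(checker.id, line)
        # The rejected line (None at end of input) starts the next group
        first = checker.rejected


# Hypothetical sample lines standing in for the huge file
seen = []
process_groups(["p1\ta\n", "p1\tb\n", "p2\tc\n"],
               lambda id, line: seen.append((id, line.rstrip('\n'))))
print(seen)
```

Because the IDs come from the data itself, no separate ids list is needed, and no line is lost between groups.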
