Python: operating only on specific positions in a text file

Problem description:

AA 331    
line1 ... 
line2 ...  
% information here  
AA 332 
line1 ...  
line2 ...  
line3 ... 
%information here  
AA 1021 
line1 ... 
line2 ... 
% information here  
AA 1022  
line1 ... 
% information here  
AA 1023  
line1 ...  
line2 ...  
% information here

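I want to operate only on the data that follows the lines with the smallest integer, up to the "information" line. In the sample above, that means the data after "AA 331" and "AA 1021", while skipping the data after "AA 332", "AA 1022" and "AA 1023".

P.S. This is a large file; the above is only sample data.

In the code below I try to parse the text file and collect the integers that follow "AA" into a list, list1. In the second function I group those integers and store the smallest value of each group in list2. This returns integers like [331, 1021, ...]. So my idea was to extract the lines that follow "AA 331" and then perform the operation, but I don't know how to continue.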

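from itertools import groupby

def getlineindex(textfile):
    """Return the integer that follows "AA" on every header line."""
    list1 = []
    with open(textfile) as infile:
        for line in infile:
            if line.startswith("AA"):
                # Convert to int so the ids can be grouped and compared numerically.
                intid = int(line[3:])
                list1.append(intid)
    return list1

def minimalinteger(list1):
    """Return the smallest id of each run of adjacent ids that share the same tens."""
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2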

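list2 contains the smallest integers that follow "AA": [331, 1021, ...]

I think your question could use some clarification. What is the 'smallest integer' after the lines you specify? On which line does it occur, and is that position consistent/reliable? Also, how did you come up with 'AA 331' and 'AA 1021' as the markers for the data you want to process? Is that something you expect to accept as human input, or is there a computational way to determine it? – bmhkim 2015-02-06 21:49:37

I mean 331 – Danira 2015-02-06 21:53:32

Sure, you'll notice 331 – bmhkim 2015-02-06 22:00:27

Answer

OK, here is my solution. At a high level, I go through the file line by line, watching for the AA lines so I know when I have found the start/end of a data block, and tracking what I call the run number to decide whether the next block should be processed. Then I have a subroutine that handles any given block: it reads all the relevant lines and processes them if needed. That subroutine watches for the next AA line so it knows when it is done.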

import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None  # Last run number, necessary so we know if there's been a gap or if we're in a new block of ten.
    line = next(fileHandle, None)  # None is a special value indicating we've hit the end of the file.
    while line is not None:
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber is None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber // 10
                runNumberTens = runNumber // 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                # Gap in the sequence, so this run starts a new group.
                processData = True

            # Always remember where we were.
            lastNumber = runNumber

            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = next(fileHandle)
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = next(fileHandle)
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = next(fileHandle)
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None

    if process:
        # Data processing call here
        # processData(runData)
        pass

    # Return line so we don't lose it!
    return line

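A few notes for you. First, I agree with 吉米莲 that you should use a regular expression to match the AA lines.

Second, the logic that decides which data we should process lives in processFile. Specifically, these lines: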

processData = False
match = runIdRegex.match(line)
if match:
    runNumber = int(match.group(1))
    if lastNumber is None:
        # Startup/first iteration
        processData = True
    elif runNumber - lastNumber == 1:
        # Continuation, see if the tens are the same.
        lastNumberTens = lastNumber // 10
        runNumberTens = runNumber // 10
        if lastNumberTens != runNumberTens:
            processData = True
    else:
        processData = True

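I start by assuming we do not want to process the data, and then determine when we do. Logically, you could do the opposite: assume you want to process the data, then determine when you don't. Next, we need to store the number of the most recent run so that we know whether this run's data needs processing (and note the first-run edge case). We know we want to process the data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know we want to process the data when the sequence moves into the next block of ten, which is what my integer division by 10 detects.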
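
The decision rule described above can be isolated into a small function and checked against the run numbers from the sample data. This is only a sketch: the name should_process is hypothetical and does not appear in the answer's code.

```python
def should_process(last_number, run_number):
    """Decide whether the block for run_number should be processed.

    last_number is the previous run number, or None for the first block.
    """
    if last_number is None:
        return True  # first block in the file
    if run_number - last_number != 1:
        return True  # sequence broken, so a new group starts here
    # Consecutive numbers: process only when the tens change (e.g. 339 -> 340).
    return run_number // 10 != last_number // 10

numbers = [331, 332, 1021, 1022, 1023]  # headers from the sample data
selected = []
last = None
for n in numbers:
    if should_process(last, n):
        selected.append(n)
    last = n

print(selected)
```

With the sample headers, only 331 and 1021 are selected, matching what the question asks for.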

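Third, note that dataBlock returns a line. If you don't do this, you lose the AA line that caused dataBlock to stop iterating, and processFile needs that line to determine whether the next data block should be processed.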
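
To see why the header line has to be handed back, here is a minimal sketch on an in-memory iterator. The helper read_block and the sample lines are assumptions made for illustration; they are not part of the answer's code.

```python
import re

HEADER = re.compile(r"AA (\d+)")

def read_block(lines):
    """Collect lines up to the next header.

    Returns (block, header), where header is the AA line that ended the
    block, or None at end of input. The iterator has already consumed that
    header, so returning it is the only way the caller can still see it.
    """
    block = []
    for line in lines:
        if HEADER.match(line):
            return block, line
        block.append(line)
    return block, None

lines = iter(["AA 331", "a", "b", "AA 332", "c"])
header = next(lines)                      # "AA 331"
first_block, header = read_block(lines)   # stops at "AA 332" and returns it
second_block, end = read_block(lines)     # runs to the end of the input
```

If read_block returned only the block, the caller would never learn that run 332 had started.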

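Finally, I chose to advance the file handle with next and to use exception handling to recognize when I have reached the end of the file. But don't think this is the only way to do it. :)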
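
One alternative, sketched here under the same selection rule, is a plain for loop inside a generator, which needs no explicit StopIteration handling. The name selected_blocks and the in-memory sample are hypothetical.

```python
import io
import re

HEADER = re.compile(r"AA (\d+)")

def selected_blocks(file_handle):
    """Yield (run_number, lines) for every block that should be processed."""
    last = None             # previous run number
    keep = False            # whether the current block is selected
    number, block = None, []
    for line in file_handle:
        m = HEADER.match(line)
        if m:
            if keep:
                yield number, block  # finish the previous selected block
            number = int(m.group(1))
            # Same rule as processFile: first run, gap, or new block of ten.
            keep = (last is None
                    or number - last != 1
                    or number // 10 != last // 10)
            last = number
            block = []
        elif keep:
            block.append(line.rstrip("\n"))
    if keep:
        yield number, block  # the last block has no following header

sample = "AA 331\nline1\nline2\n% info\nAA 332\nline1\nAA 1021\nx\nAA 1022\ny\n"
result = list(selected_blocks(io.StringIO(sample)))
print(result)
```

With a real file, the same generator can be driven by an open file object in place of io.StringIO.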

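Let me know in the comments if you have any questions.

A perfect explanation of my problem, and a good solution. I will try it on my sample data and let you know. In any case, this is what I wanted, so I am accepting it. Thank you very much for your time :-) – Danira 2015-02-07 21:53:22

Answer

You can use something like: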

import re

matcher = re.compile(r"AA (\d+)")
already_was = set()  # tens groups (number // 10) seen so far
good_block = False

# filename and do_action are placeholders to be supplied by the caller.
with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        if m:
            v = int(m.group(1)) // 10
            # Only the first header seen in each tens group opens a good block.
            good_block = v not in already_was
            already_was.add(v)
        elif good_block:
            do_action()

This code only works if the first value in each group is the smallest one.
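
If that ordering cannot be relied on, one possible workaround is to make two passes: first find the smallest number in each group, then collect the lines that follow those headers. The sketch below assumes a group means all numbers sharing the same value of number // 10 anywhere in the file; the function names and sample lines are hypothetical.

```python
import re

HEADER = re.compile(r"AA (\d+)")

def minimal_ids(lines):
    """First pass: smallest run number in each tens group (number // 10)."""
    smallest = {}
    for line in lines:
        m = HEADER.match(line)
        if m:
            n = int(m.group(1))
            key = n // 10
            smallest[key] = min(n, smallest.get(key, n))
    return set(smallest.values())

def blocks_for(lines, wanted):
    """Second pass: map each wanted run number to the lines that follow it."""
    result = {}
    current = None  # run number of the block being collected, if any
    for line in lines:
        m = HEADER.match(line)
        if m:
            n = int(m.group(1))
            current = n if n in wanted else None
            if current is not None:
                result[current] = []
        elif current is not None:
            result[current].append(line)
    return result

# Here 332 appears before 331, so the first value is not the smallest.
sample = ["AA 332", "a", "AA 331", "b", "c", "AA 1021", "d", "AA 1022", "e"]
wanted = minimal_ids(sample)
blocks = blocks_for(sample, wanted)
print(blocks)
```

For a large file, the two passes can be made by reading the file twice rather than holding all lines in memory.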

Yes, I edited my question after your answer. Thanks a lot for the help :-) – Danira 2015-02-07 21:55:05
