Slicing lines and saving parameters to different files

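Problem description:

I have a g.out file (pasted below).

The file contains several FINAL OPTIMIZED geometries that I want to extract.

For a given FINAL OPTIMIZED GEOMETRY, these highlighted values are what I want to extract:

With the program below I have managed to extract the first three: VOLUME, A and B:

My code: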

import os 
import sys 
import re 

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3$' 
middle_pattern = '^ CRYSTALLOGRAPHIC CELL ' 
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$' 


VOLUMES = [] 
P0 = [] 
P2 = [] 
atomic_number = [] 
coord_x = [] 
coord_y = [] 
coord_z = [] 

with open('g.out') as file: 
    for line in file: 
     if re.match(initial_pattern, line): 
      print file.next() 
      print file.next() 
      print file.next() 

      volume_line = file.next() 
      print volume_line 
      aux = volume_line.split() 
      each_volume = aux[7] 
      print each_volume 
      VOLUMES.append(each_volume) 

     if re.match(middle_pattern, line): 
      print line 

      print file.next() 
      parameters_line = file.next() 
      aux = parameters_line.split() 
      p0 = aux[0] 
      p1 = aux[1] 
      p2 = aux[2] 
      p3 = aux[3] 
      p4 = aux[4] 
      p5 = aux[5] # 

      print p0 
      print p2 

      P0.append(p0) 
      P2.append(p2) 

      print file.next() 
      print file.next() 
      print file.next() 
      print file.next() 

      first_coord_line = file.next() 
      print first_coord_line 

     if re.match(end_pattern, line): 
      end_pattern = line 
      print end_pattern 
      all_coordinates = [first_coord_line:end_pattern] 
      for line in all_coordinates: 
       del('F ')    # delete those that contain 'F ' 
       aux2 = line.split() 
       coords = [] 


sys.exit() 
#Template = 
""" 
some stuff 
other stuff 
p0  p2 
3 
A B  C   D 
E F  G   H 
I J  K   L 
other stuff 
some other stuff 
"""

I am not able to extract the COORDINATES, because I cannot find a way to slice the lines from first_coord_line to end_pattern, as in this pseudo-code:

if re.match(end_pattern, line): 
    end_pattern = line 
    print end_pattern 
    all_coordinates = [first_coord_line:end_pattern] 
    for line in all_coordinates: 
     del('F ')    # delete those that contain 'F ' 
     aux2 = line.split() # split lines 
     atomic_number = aux2[2] 
     coord_x = aux2[4] 
     coord_y = aux2[5] 
     coord_z = aux2[6]

Is there a way to implement this pseudo-code?

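In my code, VOLUMES, P0, P2, atomic_number, coord_x, coord_y and coord_z are initialized as lists before the for loop, because I would like to save this information in different files, named "VOLUME.inp", like this: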

#Template = 
""" 
some stuff 
other stuff 
p0  p2 
3 
A B  C   D 
E F  G   H 
I J  K   L 
other stuff 
some other stuff 
"""

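where p0 and p2 are replaced by the values extracted in my code (the second and third highlighted values in the screenshot), and A - L are the atomic_number and coord_x, coord_y, coord_z.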

Is there a way to do this?

The g.out file:

more lines 
more lines 
more lines 

FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3 
(NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500) 
******************************************************************************* 
LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM 
PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3 
     A    B    C   ALPHA  BETA  GAMMA 
    6.28373604  6.28373604  6.28373604 46.646397 46.646397 46.646397 
******************************************************************************* 
ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10 
    ATOM     X/A     Y/B     Z/C 
******************************************************************************* 
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
     2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01 
     3 T 6 C  2.500000000000E-01 2.500000000000E-01 2.500000000000E-01 
     4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01 
     5 T 8 O -4.924094276183E-01 -7.590572381674E-03 2.500000000000E-01 
     6 F 8 O  2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03 
     7 F 8 O -7.590572381674E-03 2.500000000000E-01 -4.924094276183E-01 
     8 F 8 O  4.924094276183E-01 7.590572381674E-03 -2.500000000000E-01 
     9 F 8 O -2.500000000000E-01 4.924094276183E-01 7.590572381674E-03 
    10 F 8 O  7.590572381674E-03 -2.500000000000E-01 4.924094276183E-01 

TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL 
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000 

******************************************************************************* 
CRYSTALLOGRAPHIC CELL (VOLUME=  359.47009054) 
     A    B    C   ALPHA  BETA  GAMMA 
    4.97568007  4.97568007 16.76591397 90.000000 90.000000 120.000000 

COORDINATES IN THE CRYSTALLOGRAPHIC CELL 
    ATOM     X/A     Y/B     Z/C 
******************************************************************************* 
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
     2 F 20 CA -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01 
     3 T 6 C  3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02 
     4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02 
     5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02 
     6 F 8 O  3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02 
     7 F 8 O  7.574276095166E-02 4.090760942850E-01 -8.333333333333E-02 
     8 F 8 O  4.090760942850E-01 3.333333333333E-01 8.333333333333E-02 
     9 F 8 O -3.333333333333E-01 7.574276095166E-02 8.333333333333E-02 
    10 F 8 O -7.574276095166E-02 -4.090760942850E-01 8.333333333333E-02 

T = ATOM BELONGING TO THE ASYMMETRIC UNIT 
INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE 

more lines 
more lines 
more lines 

FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3 
(NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500) 
******************************************************************************* 
LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM 
PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3 
     A    B    C   ALPHA  BETA  GAMMA 
    6.32229536  6.32229536  6.32229536 46.436583 46.436583 46.436583 
******************************************************************************* 
ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10 
    ATOM     X/A     Y/B     Z/C 
******************************************************************************* 
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
     2 F 20 CA 5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01 
     3 T 6 C  2.500000000000E-01 2.500000000000E-01 2.500000000000E-01 
     4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01 
     5 T 8 O -4.927088991116E-01 -7.291100888437E-03 2.500000000000E-01 
     6 F 8 O  2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03 
     7 F 8 O -7.291100888437E-03 2.500000000000E-01 -4.927088991116E-01 
     8 F 8 O  4.927088991116E-01 7.291100888437E-03 -2.500000000000E-01 
     9 F 8 O -2.500000000000E-01 4.927088991116E-01 7.291100888437E-03 
    10 F 8 O  7.291100888437E-03 -2.500000000000E-01 4.927088991116E-01 

TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL 
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000 

******************************************************************************* 
CRYSTALLOGRAPHIC CELL (VOLUME=  363.43040599) 
     A    B    C   ALPHA  BETA  GAMMA 
    4.98494429  4.98494429 16.88768068 90.000000 90.000000 120.000000 

COORDINATES IN THE CRYSTALLOGRAPHIC CELL 
    ATOM     X/A     Y/B     Z/C 
******************************************************************************* 
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
     2 F 20 CA -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01 
     3 T 6 C  3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02 
     4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02 
     5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02 
     6 F 8 O  3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02 
     7 F 8 O  7.604223244490E-02 4.093755657782E-01 -8.333333333333E-02 
     8 F 8 O  4.093755657782E-01 3.333333333333E-01 8.333333333333E-02 
     9 F 8 O -3.333333333333E-01 7.604223244490E-02 8.333333333333E-02 
    10 F 8 O -7.604223244490E-02 -4.093755657782E-01 8.333333333333E-02 

T = ATOM BELONGING TO THE ASYMMETRIC UNIT 
INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE 

more lines 
more lines 
more lines

UPDATED CODE: Based on @nos's flag approach, the code below is able to extract the information. VOLUMES is a list of 2 elements. The results are listed below:

VOLUMES = ['119.823364', '121.143469'] 
P0 = ['4.97568007', '4.98494429'] 
P2 = ['16.76591397', '16.88768068'] 
Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01'] 
Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01'] 
Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02'] 
ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']

The second part of this post is about writing this information (P0, P2, ATOMIC_NUMBERS, Xs, Ys, Zs) into two VOLUME.inp files. In other words, like this:

V_119.823364.inp file:

some stuff 
other stuff 
4.97568007 4.98494429 
3 
20 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
6 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02 
8 -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02 
other stuff

V_121.143469.inp file:

some stuff 
other stuff 
4.97568007 4.98494429 
3 
20 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 
6 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02 
8 -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02 
other stuff

Following @nos's suggestion of atoms_per_frame and atoms_all_frames, I tried the code below. I am having difficulty writing the elements to the files element-wise:

import os 
import sys 
import re 
import glob 

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3$' 
middle_pattern = '^ CRYSTALLOGRAPHIC CELL ' 
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$' 

global N_atom_irreducible_unit 
N_atom_irreducible_unit = 3 

VOLUMES = [] 
P0 = [] 
P2 = [] 
ATOMIC_NUMBERS = [] 
Xs = [] 
Ys = [] 
Zs = [] 

with open('g.out') as file: 
    passed_mid_point = False 
    for line in file: 
     if re.match(initial_pattern, line): 
      print file.next() 
      print file.next() 
      print file.next() 

      volume_line = file.next() 
      print volume_line 
      aux = volume_line.split() 
      each_volume = aux[7] 
      print each_volume 
      VOLUMES.append(each_volume) 

     if re.match(middle_pattern, line): 
      print line 

      print file.next() 
      parameters_line = file.next() 
      aux = parameters_line.split() 
      p0 = aux[0] 
      p1 = aux[1] 
      p2 = aux[2] 
      p3 = aux[3] 
      p4 = aux[4] 
      p5 = aux[5] # 

      print p0 
      print p2 

      P0.append(p0) 
      P2.append(p2) 

      print file.next() 
      print file.next() 
      print file.next() 
      print file.next() 

     if re.match(middle_pattern, line): 
      passed_mid_point = True 
      print 'line = ', line 

     if re.match(end_pattern, line): 
      passed_mid_point = False 

     elif passed_mid_point: 
      # parse the coordinates 
      print 'line2 =', line 
      terms = line.split() 
      print 'terms =', terms 

      if terms and terms[1] == 'T': 
       print terms[1] 
       atomic_number = terms[2] 
       print 'atomic_number = ', atomic_number 
       ATOMIC_NUMBERS.append(atomic_number) 

       x = terms[4] 
       print 'x =', x 
       Xs.append(x) 

       y = terms[5] 
       print 'y = ', y 
       Ys.append(y) 

       z = terms[6] 
       print 'z = ', z 
       Zs.append(z) 

print 'VOLUMES = ', VOLUMES 
print 'P0 = ', P0 
print 'P2 = ', P2 
print 'Xs = ', Xs 
print 'Ys = ', Ys 
print 'Zs = ', Zs 
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS 

# create the empty list of lists: 
atoms_all_frames = [[] for _ in xrange(len(VOLUMES))] 
print atoms_all_frames 

for index_vol in range(len(VOLUMES)): 
    for index in range(len(ATOMIC_NUMBERS)): 
        atoms_per_frame = [ATOMIC_NUMBERS[index], Xs[index], Ys[index], Zs[index]] 
        atoms_all_frames[index_vol].append(atoms_per_frame) 

# "atoms_all_frames" would be an appropriate list for looping 
print atoms_all_frames 

# Remove any existing V*.inp files, to clean first: 
for f in glob.glob("V*.inp"): 
    os.remove(f) 

# create the files: 
for V in VOLUMES: 
    filename = "V_{}.d12".format(V) 
    print filename 

    # open them: 
    with open(filename,"a") as f: 

    # the following is a pseudo-code, because I cannot manage to 
    # find the way to write element-wise each string to the files: 
    for p0, p2, atoms_all_frames: 

     f.write("""some stuff 
other stuff 
%s  %s 
%s 
%s %s  %s   %s 
%s %s  %s   %s 
%s %s  %s   %s 
other stuff 
some other stuff\n""" % p0 % p2 %N_atom_irreducible_unit %atoms_all_frames)

Too much code and text... –

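I guess you are parsing some results for each (time) frame, and each frame has a volume and possibly several atoms with their coordinates. In that case, first create a list (e.g. 'atoms_all_frames = []') to hold all the atom results. Then, while parsing the file, create a list of atom coordinates for each frame (e.g. 'atoms_per_frame = []') and append each atom's (x, y, z) coordinates to it. Then append 'atoms_per_frame' to 'atoms_all_frames'. That way, your list of volumes and your list of coordinates will have the same size, namely the number of frames. – nos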

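@nos Thank you for the suggestion. I have adopted this approach, but I cannot manage to write the elements to the files element-wise. Please see the updated post –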

Answer

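There are many ways to do this. The key is to know whether you have already passed mid_pattern, because the same coordinate pattern appears both before and after it, and you only want the coordinates that come after it.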

For example, you can set a flag so that we know mid_pattern has been matched, and reset it once end_pattern is matched:

passed_mid_point = False 
... 
if re.match(middle_pattern, line): 
    passed_mid_point = True 
    # do what you need 
    ... 
if re.match(end_pattern, line): 
    passed_mid_point = False # so you can process a new frame 
    # do what you need after end pattern is matched 
    ... 
elif passed_mid_point: 
    # parse the coordinates 
    terms = line.split() 
    if terms and terms[1] == 'T': 
     x = float(terms[4]) 
     y = float(terms[5]) 
     z = float(terms[6])

Alternatively, you can combine the flag with a pattern match on the coordinate lines, like this:

passed_mid_point = False 
# optional leading whitespace, the atom index, then the 'T' flag 
coord_pattern = r'\s*\d+\s+T\s' 
... 
if re.match(middle_pattern, line): 
    passed_mid_point = True 
    # do what you need 
    ... 
if re.match(end_pattern, line): 
    passed_mid_point = False # so you can process a new frame 
    # do what you need after end pattern is matched 
    ... 
if passed_mid_point and re.match(coord_pattern, line): 
    # parse the coordinates 
    terms = line.split() 
    if terms and terms[1] == 'T': 
        x = float(terms[4]) 
        y = float(terms[5]) 
        z = float(terms[6])

The coordinate matching can also be done entirely with a regular expression:

import re 

sci_num = r'-?\d+\.\d*E[+\-]\d+' 
coord_pattern = r'\s+\d+\sT\s+\d+\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)' % (sci_num, sci_num, sci_num) 
coord_re = re.compile(coord_pattern) 
# match() returns a match object (or None); the groups live on that object 
m = coord_re.match(line) 
if m: 
    x = float(m.group(1)) 
    y = float(m.group(2)) 
    z = float(m.group(3))
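To see the regular-expression route work on real input, here is a small self-contained sketch. It is written for Python 3 (the code in this thread is Python 2), the helper name parse_coord_line is made up for illustration, and the pattern is slightly loosened (`\s*` and `\s+` around the index and the T flag) on the assumption that the column spacing in g.out can vary. It also captures the atomic number, which the snippet above does not.

```python
import re

# Scientific-notation number as printed in g.out, e.g. -4.090760942850E-01
sci_num = r'-?\d+\.\d*E[+\-]\d+'
# index, 'T' flag, atomic number (captured), element symbol, then x, y, z
coord_re = re.compile(
    r'\s*\d+\s+T\s+(\d+)\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)'
    % (sci_num, sci_num, sci_num))


def parse_coord_line(line):
    """Return (atomic_number, x, y, z) for a 'T' atom line, else None."""
    m = coord_re.match(line)
    if m is None:
        return None
    return (int(m.group(1)), float(m.group(2)),
            float(m.group(3)), float(m.group(4)))


t_line = '     5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02 '
f_line = '     6 F 8 O  3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02 '
print(parse_coord_line(t_line))  # a tuple for the 'T' atom
print(parse_coord_line(f_line))  # None, because the flag is 'F'
```

Lines flagged F, header lines and blank lines simply fail to match, so no separate check on the split terms is needed with this variant.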

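For recording the data, it would be better to keep track of which frame the atomic coordinates belong to. For example, you can create an atom_frames list at the beginning and keep appending lists of atomic coordinates to it, where each list corresponds to one frame. Overall, it looks like this: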

atom_frames = [] 
for i in range(50): # here I assume 50 frames 
    current_frame = [] 
    for a in atoms_in_this_frame: 
     current_frame.append(a) # a could be (x, y, z) of an atom 
    atom_frames.append(current_frame)

Here I just loop over the number of frames. In your case, you can create current_frame = [] when you hit mid_pattern, and do atom_frames.append(current_frame) when you hit end_pattern. Hope it makes sense.
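Putting the flag and the per-frame list together, a minimal sketch could look like the following. It is written for Python 3, and several things in it are assumptions rather than part of the original answer: the function name parse_frames, the dict layout of a frame, the loosened patterns, and the abridged SAMPLE text (cut down from the g.out in the question). Values are kept as strings, as in the question's lists.

```python
import re

# Patterns are assumptions based on the g.out excerpt in the question.
initial_re = re.compile(r'\s*FINAL OPTIMIZED GEOMETRY')
volume_re = re.compile(r'\s*PRIMITIVE CELL .*VOLUME=\s*(\S+)')
middle_re = re.compile(r'\s*CRYSTALLOGRAPHIC CELL ')
end_re = re.compile(r'\s*T = ATOM BELONGING TO THE ASYMMETRIC UNIT')


def parse_frames(lines):
    """Return one dict per optimized geometry: volume, p0, p2, atoms."""
    frames = []
    current = None       # frame being filled, or None outside a frame
    in_cryst = False     # True between the middle and end patterns
    want_params = False  # True until the numeric A B C ... line is read
    for line in lines:
        vol = volume_re.match(line)
        if initial_re.match(line):
            current = {'volume': None, 'p0': None, 'p2': None, 'atoms': []}
            in_cryst = False
        elif current is None:
            continue
        elif vol:
            current['volume'] = vol.group(1)
        elif middle_re.match(line):
            in_cryst = True
            want_params = True
        elif end_re.match(line):
            frames.append(current)
            current = None
            in_cryst = False
        elif in_cryst:
            terms = line.split()
            if want_params and len(terms) == 6 and terms[0] != 'A':
                current['p0'], current['p2'] = terms[0], terms[2]
                want_params = False
            elif len(terms) == 7 and terms[1] == 'T':
                # (atomic number, x, y, z) of an asymmetric-unit atom
                current['atoms'].append(
                    (terms[2], terms[4], terms[5], terms[6]))
    return frames


# Abridged sample, shortened from the g.out shown in the question.
SAMPLE = """\
more lines
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3
     A    B    C   ALPHA  BETA  GAMMA
    6.28373604  6.28373604  6.28373604 46.646397 46.646397 46.646397
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
 CRYSTALLOGRAPHIC CELL (VOLUME=  359.47009054)
     A    B    C   ALPHA  BETA  GAMMA
    4.97568007  4.97568007 16.76591397 90.000000 90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
     2 F 20 CA -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
     3 T 6 C  3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
     5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
more lines
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM  3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3
 CRYSTALLOGRAPHIC CELL (VOLUME=  363.43040599)
     A    B    C   ALPHA  BETA  GAMMA
    4.98494429  4.98494429 16.88768068 90.000000 90.000000 120.000000
     1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
     3 T 6 C  3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
     5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
"""

frames = parse_frames(SAMPLE.splitlines())
for frame in frames:
    print(frame['volume'], frame['p0'], frame['p2'], len(frame['atoms']))
```

With a real file you would pass the open file object instead of SAMPLE.splitlines(). Because each frame carries its own volume, cell parameters and atoms, there is no need to keep parallel lists in step.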

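Thank you for your answer. This flag approach is very interesting. I applied this principle in my code. However, when the 'if terms[1] == 'T':' statement is reached, there is a 'list index out of range' error. Please see the **UPDATED CODE** to reproduce this problem. The 'if terms[1] == 'T':' statement looks fine to me and I do not understand where the problem is –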

Oh, that is because there are blank lines, see the updated code – nos
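The reason is that calling split() on a blank line returns an empty list, so terms[1] has nothing to index. A short Python 3 sketch of the guard follows; the helper name is_asymmetric_atom is made up for illustration. It checks len(terms) > 1 rather than just terms, on the assumption that one-word lines (such as the row of asterisks) can also reach this test.

```python
def is_asymmetric_atom(line):
    """True if the second whitespace-separated field of the line is 'T'."""
    terms = line.split()
    # A blank line gives [], and a one-word line has no terms[1] either.
    return len(terms) > 1 and terms[1] == 'T'


samples = [
    '     5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02 ',
    '     6 F 8 O  3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02 ',
    '',
    '*******************************************************************************',
]
for line in samples:
    print(repr(line.split()[:2]), is_asymmetric_atom(line))
```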

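Thank you for the clarification, and thanks again for your help. I am having some difficulty with the part that saves the information to files. Please see the **UPDATED CODE**. –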
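For the file-writing part, one possible sketch (Python 3) is to loop over frames and write each file piece by piece instead of filling one large format string. Several things here are assumptions: the data is grouped per frame as a list of dicts (a hypothetical layout, filled in below with the values printed in the question), the function name write_inputs is made up, and the "p0  p2" line uses each frame's own A and C values as in the template. The files go to a temporary directory so that nothing in the working directory is touched.

```python
import os
import tempfile

TEMPLATE_HEAD = "some stuff\nother stuff\n"
TEMPLATE_TAIL = "other stuff\nsome other stuff\n"


def write_inputs(frames, out_dir):
    """Write one V_<volume>.inp file per frame and return the file paths."""
    paths = []
    for frame in frames:
        path = os.path.join(out_dir, "V_{}.inp".format(frame['volume']))
        # One "atomic_number x y z" line per atom of this frame only
        atom_lines = ["{} {} {} {}\n".format(*atom) for atom in frame['atoms']]
        with open(path, "w") as f:
            f.write(TEMPLATE_HEAD)
            f.write("{} {}\n".format(frame['p0'], frame['p2']))
            f.write("{}\n".format(len(frame['atoms'])))
            f.writelines(atom_lines)
            f.write(TEMPLATE_TAIL)
        paths.append(path)
    return paths


# Values copied from the results listed in the question, grouped per frame.
frames = [
    {'volume': '119.823364', 'p0': '4.97568007', 'p2': '16.76591397',
     'atoms': [
         ('20', '0.000000000000E+00', '0.000000000000E+00', '0.000000000000E+00'),
         ('6', '3.333333333333E-01', '-3.333333333333E-01', '-8.333333333333E-02'),
         ('8', '-4.090760942850E-01', '-3.333333333333E-01', '-8.333333333333E-02'),
     ]},
    {'volume': '121.143469', 'p0': '4.98494429', 'p2': '16.88768068',
     'atoms': [
         ('20', '0.000000000000E+00', '0.000000000000E+00', '0.000000000000E+00'),
         ('6', '3.333333333333E-01', '-3.333333333333E-01', '-8.333333333333E-02'),
         ('8', '-4.093755657782E-01', '-3.333333333333E-01', '-8.333333333333E-02'),
     ]},
]

out_dir = tempfile.mkdtemp()
for path in write_inputs(frames, out_dir):
    print(path)
```

Opening each file with mode "w" rather than "a" means a rerun overwrites the previous output, so the glob-and-remove cleanup step is not needed in this variant.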
