Grouping the unique values of a column in Python

Question:

I have a two-column dataset and I need to change it from this format:

10 1
10 5
10 3
11 5
11 4
12 6
12 2

to this:

10 1 5 3 
11 5 4 
12 6 2

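I need each unique value in the first column to be on its own row.

I am a beginner with Python, and beyond reading in my text file I am at a loss for how to proceed.

What is the field separator in the file? – RomanPerekhrest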

Answer

You can use a pandas DataFrame.

import pandas as pd 

df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]}) 
print(df)

Output:

    A  B
0  10  1
1  10  5
2  10  3
3  11  5
4  11  4
5  12  6
6  12  2

Let's use groupby and join:

df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))

Output:

A
10    1 5 3
11      5 4
12      6 2
Name: B, dtype: object

If this answer helped you, would you [accept](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work?answertab=active#tab-top) this answer? –

Accepting answers is what makes this site popular. Thanks. –

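Answer

An example using only itertools.groupby; everything here is in the Python standard library (although the pandas version is much more concise!).

Assuming the keys you want to group are adjacent, this can all be done lazily (without needing all of your data in memory at any one time):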

from io import StringIO 
from itertools import groupby 

text = '''10 1 
10 5 
10 3 
11 5 
11 4 
12 6 
12 2''' 

# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}

    # generator: lines are split one at a time, only when groupby asks for them
    data = (line.strip().split() for line in file)

    # groupby only merges rows whose keys are adjacent
    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]

print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']} (Python 3.7+ keeps insertion order)

# write grouped data:
with StringIO() as out_file:
    for key in keys:
        # '{:3s}' pads each field to a width of 3 characters
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2

You can then replace with StringIO(text) as file: with something like with open('infile.txt', 'r') as file: so that the program reads your actual file (and similarly for the output file, with open('outfile.txt', 'w')).
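
As a rough sketch of that substitution, the same read-then-write steps could be wrapped in two small functions that take file paths. The names read_groups and write_groups are made up for illustration, and the demo writes its sample input to a temporary directory so that it can run without an existing infile.txt. It assumes Python 3.7+ (so a plain dict keeps the key order), whitespace-separated columns, and adjacent keys:

```python
import os
import tempfile
from itertools import groupby


def read_groups(path):
    """Hypothetical helper: group column 2 by column 1 (keys must be adjacent)."""
    groups = {}
    with open(path, 'r') as file:
        # skip blank lines, split the rest on whitespace
        rows = (line.split() for line in file if line.strip())
        for key, group in groupby(rows, key=lambda row: row[0]):
            groups[key] = [row[1] for row in group]
    return groups


def write_groups(path, groups):
    """Hypothetical helper: write one line per key, values separated by spaces."""
    with open(path, 'w') as out_file:
        for key, values in groups.items():
            out_file.write(key + ' ' + ' '.join(values) + '\n')


# demo with a temporary stand-in for infile.txt / outfile.txt
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'infile.txt')
out_path = os.path.join(workdir, 'outfile.txt')
with open(in_path, 'w') as f:
    f.write('10 1\n10 5\n10 3\n11 5\n11 4\n12 6\n12 2\n')

write_groups(out_path, read_groups(in_path))
with open(out_path) as f:
    print(f.read())
```

This variant joins the fields with single spaces instead of padding them with '{:3s}'; either formatting works for the task in the question.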

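Again: of course you could also write directly to the output file each time a key is found; that way you never need to have all of the data in memory at once: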

with StringIO(text) as file, StringIO() as out_file:

    data = (line.strip().split() for line in file)

    # each group is written out as soon as it is complete
    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')

    print(out_file.getvalue())
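
The "adjacent keys" assumption matters: itertools.groupby starts a new group every time the key changes, so a key that reappears later ends up in a second group. A small sketch with made-up rows (group_rows is an illustrative helper, not part of the answer above) shows the difference, and that sorting by key first restores the expected grouping, at the cost of holding all rows in memory:

```python
from itertools import groupby

# made-up sample where the key '10' is NOT adjacent
rows = [('10', '1'), ('11', '5'), ('10', '3')]


def group_rows(rows):
    """Return (key, values) pairs exactly as groupby produces them."""
    return [(key, [value for _, value in group])
            for key, group in groupby(rows, key=lambda row: row[0])]


unsorted_result = group_rows(rows)
# sorted() is stable, so values keep their original order within each key
sorted_result = group_rows(sorted(rows, key=lambda row: row[0]))

print(unsorted_result)  # [('10', ['1']), ('11', ['5']), ('10', ['3'])]
print(sorted_result)    # [('10', ['1', '3']), ('11', ['5'])]
```

Note that the keys here are strings, so the sort is lexicographic; if numeric order matters, a key such as lambda row: int(row[0]) would be one option.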

Answer

Using the collections.defaultdict subclass:

import collections

with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k, v in (line.split() for line in f.read().splitlines()):  # processing each line
        d[k].append(v)  # accumulating values for the same 1st column
    for k, v in sorted(d.items()):  # outputting grouped sequences
        print('%s %s' % (k, ' '.join(v)))

Output:

10 1 5 3 
11 5 4 
12 6 2

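Question: I tend to avoid 'defaultdict' and use 'dict.setdefault' instead, as in 'd.setdefault(k, []).append(v)'. Do you have an opinion on that? –

@hiroprotagonist, from the Python docs: 'This technique is simpler and faster than an equivalent technique using dict.setdefault():' https://docs.python.org/3.6/library/collections.html?highlight=defaultdict#defaultdict-examples – RomanPerekhrest

Ah! Learned something. Thanks! –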
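
For reference, the two variants discussed in these comments can be put side by side. This is only a sketch on an in-memory list of lines (the sample data is taken from the question, the variable names are made up), and it says nothing about speed; it just shows that both build the same mapping:

```python
import collections

lines = ['10 1', '10 5', '10 3', '11 5', '11 4', '12 6', '12 2']

with_default = collections.defaultdict(list)  # missing keys start as []
with_setdefault = {}                          # plain dict

for line in lines:
    k, v = line.split()
    with_default[k].append(v)
    with_setdefault.setdefault(k, []).append(v)

# both approaches produce the same grouping
assert dict(with_default) == with_setdefault
print(with_setdefault)
# {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}
```

One practical difference: merely looking up a missing key in a defaultdict (for example with_default['99']) inserts an empty list for it, whereas the plain dict raises KeyError.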

Answer

Using pandas may be easier. You can use the read_csv function to read a txt file in which the columns are separated by one or more spaces.

import pandas as pd 

df = pd.read_csv("input.txt", header=None, delimiter=r"\s+")  # raw string: \s+ matches any run of whitespace
# setting column names 
df.columns = ['col1', 'col2'] 
df

This gives the following dataframe:

   col1  col2
0    10     1
1    10     5
2    10     3
3    11     5
4    11     4
5    12     6
6    12     2

After reading the txt file into a dataframe, and similar to the apply used in the other answer above, you can also use aggregate and join:

df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index() 
df_combine

Output:

   col1   col2
0    10  1 5 3
1    11    5 4
2    12    6 2

Answer

I found this solution using dictionaries:

with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            # key already seen: append the value to its string
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            # first time this key appears
            dic[list1[0]] = list1[1] + ' '

    for k, v in dic.items():
        print(k, v)

OUTPUT

Something more functional:

def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k, v in dic.items():
        v = v.split()  # turn the space-separated string back into a list
        print(k, ':', v)

getdata("data.txt")

OUTPUT

11 : ['5', '4']
12 : ['6', '2']
10 : ['1', '5', '3']
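
A possible variation on the same try/except idea, sketched here rather than taken from the answer: collect the values in lists from the start (so no string has to be split again later) and return the dictionary instead of printing inside the function. The name group_lines and the rule of skipping lines with fewer than two fields are assumptions made for this example:

```python
def group_lines(lines):
    """Hypothetical helper: map each first-column value to a list of second-column values."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: ignore blank or incomplete lines
        key, value = fields[0], fields[1]
        try:
            groups[key].append(value)
        except KeyError:
            groups[key] = [value]  # first time this key appears
    return groups


sample = ['10 1\n', '10 5\n', '10 3\n', '\n', '11 5\n', '11 4\n', '12 6\n', '12 2\n']
for key, values in group_lines(sample).items():
    print(key, ' '.join(values))
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, for example group_lines(open("data.txt", encoding='utf-8')). On Python 3.7+ the keys come out in the order they first appear in the input.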
