Tiling data to create a pandas DataFrame

Problem description:

I am new to Python pandas, and this is a quick, simple question. Suppose I have two columns, "weeks" and "machine":

weeks = [1, 3, 5]
machine = ['M1', 'M1', 'M2', 'M2']

My plan is to turn these lists into a DataFrame, but I get "ValueError: arrays must all be same length". I am looking for the following output:

final_weeks = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
final_machine = ['M1', 'M1', 'M1', 'M1', 'M1', 'M2', 'M2', 'M2', 'M2', 'M2']

tempDict = {'weeks': final_weeks, 'machine': final_machine} 

I can build these two lists, but not the DataFrame. Why am I getting the ValueError? Here is what I have done so far:

maxWeek = df["weeks"].max() 
uniqueMachine = set(df.machine) 

unionWeekList = [item for item in range(1, maxWeek+1)]
# Output = [1, 2, 3, 4, 5] 

final_weeks = unionWeekList * len(uniqueMachine) 
# [1,2,3,4,5,1,2,3,4,5] 

machines = [[item]* maxWeek for item in uniqueMachine] 
# Output: [[M1,M1,M1,M1,M1], [M2,M2,M2,M2,M2]] 

final_machines = list(itertools.chain.from_iterable(machines)) 
# Flattened list = [M1,M1,M1,M1,M1,M2,M2,M2,M2,M2] 

tmpDict = {'week': final_weeks, 'machine': final_machines} 

# new dataframe 
newdf = pd.DataFrame.from_records(tmpDict) 

# ValueError: arrays must all be same length 

Can you print `len(final_weeks)` and `len(final_machines)`? – Wen

They have the same number of records, so that is not the problem. – SalN85

OK, I have updated. – Wen
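
For reference, the error in the question can be reproduced with the two original lists, which have lengths 3 and 4. This is only a minimal sketch of the length check suggested in the comments, not a diagnosis of the asker's actual df; the names raised and lengths are introduced here for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

weeks = [1, 3, 5]
machine = ['M1', 'M1', 'M2', 'M2']

# Columns of different lengths cannot be combined into one DataFrame,
# so the constructor raises a ValueError.
try:
    pd.DataFrame({'weeks': weeks, 'machine': machine})
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Printing the lengths up front makes any mismatch visible.
lengths = {'weeks': len(weeks), 'machine': len(machine)}
print(lengths)
```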

You can use the DataFrame constructor together with numpy.repeat and numpy.tile:

#unique machines 
uniq = np.sort(np.unique(np.array(machine))) 
#repeated range 
rng = np.arange(min(weeks), max(weeks)+1) 

df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)), 
        'week':np.tile(rng, len(uniq))}, columns=['week','machine']) 

print (df) 
    week machine 
0  1  M1 
1  2  M1 
2  3  M1 
3  4  M1 
4  5  M1 
5  1  M2 
6  2  M2 
7  3  M2 
8  4  M2 
9  5  M2 
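
To see why the two functions line up, it may help to look at them on a smaller input. The sketch below uses a shortened week range (1 to 3) purely for illustration; np.repeat repeats each element in place, while np.tile repeats the whole array end to end.

```python
import numpy as np

uniq = np.array(['M1', 'M2'])
rng = np.arange(1, 4)  # shortened range, for illustration only

# np.repeat repeats each element in turn: M1 M1 M1 M2 M2 M2
repeated = np.repeat(uniq, len(rng))

# np.tile repeats the whole array end to end: 1 2 3 1 2 3
tiled = np.tile(rng, len(uniq))

print(repeated.tolist())
print(tiled.tolist())
```

Both results have length len(uniq) * len(rng), which is why they can be placed side by side as DataFrame columns.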

Comparison with cᴏʟᴅsᴘᴇᴇᴅ's solution:

weeks = [1, 3, 5, 8, 13, 15, 17, 23, 24, 26] 
machine = ['M{}'.format(x) for x in range(1, 51)] 
print (machine) 

In [29]: %%timeit 
    ...: uniq = np.sort(np.unique(np.array(machine))) 
    ...: #repeated range 
    ...: rng = np.arange(min(weeks), max(weeks)+1) 
    ...: 
    ...: df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)), 
    ...:     'week':np.tile(rng, len(uniq))}, columns=['week','machine']) 
    ...: 
1000 loops, best of 3: 636 µs per loop 

In [30]: %%timeit 
    ...: uniq_machine = sorted(set(machine)) 
    ...: df = pd.DataFrame(np.repeat(np.array(uniq_machine)\ 
    ...:       .reshape(1, len(uniq_machine)), max(weeks), 0), 
    ...:     index=range(1, max(weeks) + 1)) 
    ...: 
    ...: out = df.unstack().reset_index(level=0, drop=True) 
    ...: out = out.reset_index() 
    ...: out.columns = ['week', 'machine'] 
    ...: 
1000 loops, best of 3: 1.46 ms per loop 

This works well too. Thanks, jezrael :) – SalN85

Glad I could help! Have a nice day! – jezrael

Try this. I think it gives you what you need (PS: to get exactly the output you asked for, follow cᴏʟᴅsᴘᴇᴇᴅ's answer)

weeks = [1,3,5] 
machine = ['M1', 'M1', 'M2', 'M2'] 
newdf = pd.DataFrame(machine) 
newdf.groupby(0).apply(lambda x : (x.reindex(range(1,max(weeks)+1)).ffill().bfill())) 
Out[364]: 
     0 
0  
M1 1 M1 
    2 M1 
    3 M1 
    4 M1 
    5 M1 
M2 1 M2 
    2 M2 
    3 M2 
    4 M2 
    5 M2 

One option, using np.repeat and df.unstack:

weeks = [1, 3, 5] 
machine = ['M1', 'M1', 'M2', 'M2']

uniq_machine = sorted(set(machine)) 

df = pd.DataFrame(np.repeat(np.array(uniq_machine)\ 
          .reshape(1, len(uniq_machine)), max(weeks), 0), 
        index=range(1, max(weeks) + 1)) 

out = df.unstack().reset_index(level=0, drop=True) 
print(out) 

1 M1 
2 M1 
3 M1 
4 M1 
5 M1 
1 M2 
2 M2 
3 M2 
4 M2 
5 M2 
dtype: object 

This is a pd.Series object, but you can call .reset_index to get two columns:

out = out.reset_index() 
out.columns = ['week', 'machine'] 
print(out) 

    week machine 
0  1  M1 
1  2  M1 
2  3  M1 
3  4  M1 
4  5  M1 
5  1  M2 
6  2  M2 
7  3  M2 
8  4  M2 
9  5  M2 

Yes, it does work! :) Can you explain the purpose of `reshape`? – SalN85

@SalN85: reshape converts it into a 2D array so that I can repeat it along the desired axis. It is an implementation detail. ;-) –
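
The effect of the reshape can be checked by comparing array shapes. In this small sketch, n_weeks stands in for max(weeks) from the answer, and row, grid and flat are names introduced here for illustration only.

```python
import numpy as np

uniq_machine = ['M1', 'M2']
n_weeks = 5  # plays the role of max(weeks) in the answer

# reshape turns the 1-D array into a single row: shape (1, 2)
row = np.array(uniq_machine).reshape(1, len(uniq_machine))

# repeating that row along axis 0 stacks it n_weeks times: shape (5, 2)
grid = np.repeat(row, n_weeks, 0)

# without the reshape, the same call just lengthens the 1-D array: shape (10,)
flat = np.repeat(np.array(uniq_machine), n_weeks, 0)

print(row.shape, grid.shape, flat.shape)
```

The 2D grid has one column per machine and one row per week, which is the layout that df.unstack() then flattens into a single column.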

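Can we also add these headers (week, machine) together with the data in the same code? I tried using the columns parameter after defining the list of columns. – SalN85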
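
One possible way to name both columns within the same chain is sketched below. It is not part of the original answer: it assumes that Series.rename_axis and the name argument of Series.reset_index are acceptable substitutes for assigning out.columns afterwards.

```python
import numpy as np
import pandas as pd

weeks = [1, 3, 5]
machine = ['M1', 'M1', 'M2', 'M2']
uniq_machine = sorted(set(machine))

# Same 2D construction as in the answer: one column per machine, one row per week
df = pd.DataFrame(
    np.repeat(np.array(uniq_machine).reshape(1, len(uniq_machine)), max(weeks), 0),
    index=range(1, max(weeks) + 1),
)

out = (
    df.unstack()
      .reset_index(level=0, drop=True)  # drop the column-label level
      .rename_axis('week')              # name the remaining index
      .reset_index(name='machine')      # name the values column
)
print(out)
```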