Tiling data to create a pandas DataFrame
Question:
I'm new to pandas in Python. Just a quick, simple question. Suppose I have two columns, "weeks" and "machine":
weeks = [1,3,5]
machine = ['M1', 'M1', 'M2', 'M2']
My plan is to turn these lists into a DataFrame, but I get "ValueError: arrays must all be same length". This is the output I'm looking for:
final_weeks = [1,2,3,4,5,1,2,3,4,5]
final_machine = ['M1', 'M1', 'M1', 'M1', 'M1', 'M2', 'M2', 'M2', 'M2', 'M2']
tempDict = {'weeks': final_weeks, 'machine': final_machine}
I do get these two lists, but not the DataFrame. Why do I get the ValueError? Here is what I have done so far:
import itertools
import pandas as pd

maxWeek = df["weeks"].max()
uniqueMachine = set(df.machine)
unionWeekList = [item for item in range(1, maxWeek+1)]
# Output = [1, 2, 3, 4, 5]
final_weeks = unionWeekList * len(uniqueMachine)
# [1,2,3,4,5,1,2,3,4,5]
machines = [[item] * maxWeek for item in uniqueMachine]
# Output: [['M1','M1','M1','M1','M1'], ['M2','M2','M2','M2','M2']]
final_machines = list(itertools.chain.from_iterable(machines))
# Flattened list = ['M1','M1','M1','M1','M1','M2','M2','M2','M2','M2']
tmpDict = {'week': final_weeks, 'machine': final_machines}
# new dataframe
newdf = pd.DataFrame.from_records(tmpDict)
# ValueError: arrays must all be same length
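For reference, here is a minimal self-contained sketch of the same list-based approach, starting from the two plain lists in the question rather than from an existing df (that starting point is an assumption, since the original df is not shown). It checks the list lengths before building the frame, because unequal lengths are what make the DataFrame constructor raise a "same length" ValueError. The names max_week, unique_machines and lengths are illustrative only.

```python
import itertools

import pandas as pd

weeks = [1, 3, 5]
machine = ["M1", "M1", "M2", "M2"]

max_week = max(weeks)
unique_machines = sorted(set(machine))  # sorted so the row order is deterministic

# Every week from 1 to max_week, repeated once per machine.
final_weeks = list(range(1, max_week + 1)) * len(unique_machines)
# Each machine repeated max_week times, then flattened into one list.
final_machines = list(
    itertools.chain.from_iterable([m] * max_week for m in unique_machines)
)

# Check the lengths before building the frame; unequal lengths are what
# trigger the "same length" ValueError in the DataFrame constructor.
lengths = {"week": len(final_weeks), "machine": len(final_machines)}
assert len(set(lengths.values())) == 1, lengths

newdf = pd.DataFrame({"week": final_weeks, "machine": final_machines})
print(newdf)
```

If the assertion fails, the printed dictionary shows which list is shorter, which is usually enough to locate the problem.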
Answer
You can use the DataFrame constructor with numpy.repeat and numpy.tile:
import numpy as np
import pandas as pd

# unique machines (np.unique already returns them sorted)
uniq = np.sort(np.unique(np.array(machine)))
# repeated range
rng = np.arange(min(weeks), max(weeks)+1)
df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)),
                   'week': np.tile(rng, len(uniq))}, columns=['week','machine'])
print(df)
   week machine
0     1      M1
1     2      M1
2     3      M1
3     4      M1
4     5      M1
5     1      M2
6     2      M2
7     3      M2
8     4      M2
9     5      M2
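The two NumPy functions play different roles here, which a small sketch may make clearer. The week range is shortened to 1..3 purely for illustration; the variable names repeated and tiled are not from the answer.

```python
import numpy as np

uniq = np.array(["M1", "M2"])
rng = np.arange(1, 4)  # weeks 1..3, shortened for illustration

# np.repeat repeats each element in place: M1 M1 M1 M2 M2 M2
repeated = np.repeat(uniq, len(rng))
# np.tile repeats the whole array end to end: 1 2 3 1 2 3
tiled = np.tile(rng, len(uniq))

print(repeated.tolist())
print(tiled.tolist())
```

Pairing the two arrays element by element gives every (week, machine) combination exactly once, and both arrays always have length len(uniq) * len(rng), so the constructor never sees mismatched lengths.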
Comparison with cᴏʟᴅsᴘᴇᴇᴅ's solution:
weeks = [1, 3, 5, 8, 13, 15, 17, 23, 24, 26]
machine = ['M{}'.format(x) for x in range(1, 51)]
print (machine)
In [29]: %%timeit
    ...: uniq = np.sort(np.unique(np.array(machine)))
    ...: # repeated range
    ...: rng = np.arange(min(weeks), max(weeks)+1)
    ...:
    ...: df = pd.DataFrame({'machine': np.repeat(uniq, len(rng)),
    ...:                    'week': np.tile(rng, len(uniq))}, columns=['week','machine'])
    ...:
1000 loops, best of 3: 636 µs per loop
In [30]: %%timeit
    ...: uniq_machine = sorted(set(machine))
    ...: df = pd.DataFrame(np.repeat(np.array(uniq_machine)\
    ...:                   .reshape(1, len(uniq_machine)), max(weeks), 0),
    ...:                   index=range(1, max(weeks) + 1))
    ...:
    ...: out = df.unstack().reset_index(level=0, drop=True)
    ...: out = out.reset_index()
    ...: out.columns = ['week', 'machine']
    ...:
1000 loops, best of 3: 1.46 ms per loop
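A timing comparison is only meaningful if both snippets produce the same rows. A rough sketch of such a check on the benchmark data follows; the wrapper names tile_repeat and unstack_based are hypothetical and exist only to make the two approaches callable.

```python
import numpy as np
import pandas as pd

weeks = [1, 3, 5, 8, 13, 15, 17, 23, 24, 26]
machine = ["M{}".format(x) for x in range(1, 51)]


def tile_repeat(weeks, machine):
    # Hypothetical wrapper around the repeat/tile approach.
    uniq = np.unique(np.array(machine))
    rng = np.arange(min(weeks), max(weeks) + 1)
    return pd.DataFrame(
        {"week": np.tile(rng, len(uniq)), "machine": np.repeat(uniq, len(rng))}
    )


def unstack_based(weeks, machine):
    # Hypothetical wrapper around the repeat/unstack approach.
    uniq_machine = sorted(set(machine))
    wide = pd.DataFrame(
        np.repeat(np.array(uniq_machine).reshape(1, len(uniq_machine)), max(weeks), 0),
        index=range(1, max(weeks) + 1),
    )
    out = wide.unstack().reset_index(level=0, drop=True).reset_index()
    out.columns = ["week", "machine"]
    return out


a = tile_repeat(weeks, machine)
b = unstack_based(weeks, machine)
# Compare plain Python values so that dtype differences do not matter.
same = (
    a["week"].tolist() == b["week"].tolist()
    and a["machine"].tolist() == b["machine"].tolist()
)
print(len(a), len(b), same)
```

Note that the first approach starts the range at min(weeks) while the second starts at 1, so the two only agree when the smallest week is 1, as it is here.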
Answer
Try this. I think I've got what you need (PS: to get exactly what you want, follow cᴏʟᴅsᴘᴇᴇᴅ's answer)
weeks = [1,3,5]
machine = ['M1', 'M1', 'M2', 'M2']
newdf = pd.DataFrame(machine)
newdf.groupby(0).apply(lambda x : (x.reindex(range(1,max(weeks)+1)).ffill().bfill()))
Out[364]:
       0
0
M1 1  M1
   2  M1
   3  M1
   4  M1
   5  M1
M2 1  M2
   2  M2
   3  M2
   4  M2
   5  M2
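The reindex approach above fills the gaps from whichever original row labels happen to fall inside the target week range. A sketch that builds the machine/week combinations directly, without relying on the original row labels, could use pd.MultiIndex.from_product. This technique is not part of the answer above; it is offered only as an alternative under the same input assumptions.

```python
import pandas as pd

weeks = [1, 3, 5]
machine = ["M1", "M1", "M2", "M2"]

# Cartesian product of the unique machines and the full week range.
idx = pd.MultiIndex.from_product(
    [sorted(set(machine)), range(1, max(weeks) + 1)],
    names=["machine", "week"],
)
# to_frame(index=False) turns the index levels into ordinary columns.
out = idx.to_frame(index=False)[["week", "machine"]]
print(out)
```

Because the product is computed from the unique values and the range, the result always has one row per combination regardless of how the input lists are ordered.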
Answer
One option, using np.repeat and df.unstack:
weeks = [1, 3, 5]
machine = ['M1', 'M1', 'M2', 'M2']
uniq_machine = sorted(set(machine))
df = pd.DataFrame(np.repeat(np.array(uniq_machine)\
                  .reshape(1, len(uniq_machine)), max(weeks), 0),
                  index=range(1, max(weeks) + 1))
out = df.unstack().reset_index(level=0, drop=True)
print(out)
1    M1
2    M1
3    M1
4    M1
5    M1
1    M2
2    M2
3    M2
4    M2
5    M2
dtype: object
This is a pd.Series object, but you can call .reset_index to get 2 columns:
out = out.reset_index()
out.columns = ['week', 'machine']
print(out)
   week machine
0     1      M1
1     2      M1
2     3      M1
3     4      M1
4     5      M1
5     1      M2
6     2      M2
7     3      M2
8     4      M2
9     5      M2
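To see what each step of the unstack approach does, it can help to inspect the intermediate objects. A small sketch, with the week range shortened to 3 for illustration and with the names wide, stacked and by_week chosen here rather than taken from the answer:

```python
import numpy as np
import pandas as pd

uniq_machine = ["M1", "M2"]
max_week = 3  # shortened for illustration

# One row per week, one column per machine; every cell holds the machine name.
wide = pd.DataFrame(
    np.repeat(np.array(uniq_machine).reshape(1, len(uniq_machine)), max_week, 0),
    index=range(1, max_week + 1),
)

# unstack() on a frame with a flat index returns a Series whose index has
# two levels: the column label (outer) and the week (inner).
stacked = wide.unstack()
print(stacked.index.nlevels)

# Dropping level 0 discards the column label and keeps only the week.
by_week = stacked.reset_index(level=0, drop=True)
print(by_week.index.tolist())
```

The final reset_index() call in the answer then moves the remaining week index into a regular column, which is why the result has two columns to rename.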
Can you print len(final_weeks) and len(final_machines)? – Wen
They have the same number of records, no problem. – SalN85
OK, I've updated my answer. – Wen