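pandas hierarchical columns and CSV functions
Question:
Is it possible to round-trip a DataFrame through CSV in a way that respects its hierarchical column structure? In other words, if I have the following DataFrame: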
>>> cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
...                                   ["a", "b", "c", "d"]])
>>> df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)
then the following round trip fails:
>>> df.to_csv("df.csv", index_label="index")
>>> df_new = pd.read_csv("df.csv", index_col="index")
>>> assert df.columns == df_new.columns
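Is there some option for the CSV save/read steps that I am missing?
Answer:
In the special case where the columns are a MultiIndex but the index is a simple one, you can transpose the DataFrame and use index_label and index_col as follows: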
import numpy as np
import pandas as pd
cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                  ["a", "b", "c", "d"]])
df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)
(df.T).to_csv('/tmp/df.csv', index_label=['first','second'])
df_new = pd.read_csv('/tmp/df.csv', index_col=['first','second']).T
assert np.all(df.columns.values == df_new.columns.values)
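One caveat with the transposition approach is that the assert only compares the column labels. The sketch below, which assumes a recent pandas version and uses an in-memory io.StringIO buffer as a stand-in for '/tmp/df.csv', suggests what else may need restoring: the original row labels pass through the CSV header and come back as strings, and the column levels pick up the names given in index_label.

```python
import io

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                  ["a", "b", "c", "d"]])
df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)

buf = io.StringIO()  # in-memory stand-in for the CSV file
df.T.to_csv(buf, index_label=["first", "second"])
buf.seek(0)
df_new = pd.read_csv(buf, index_col=["first", "second"]).T

# The row labels were written as CSV header cells, so they are read back as text.
labels_before = list(df_new.index)

# Restore integer row labels and drop the level names added by index_label.
df_new.index = df_new.index.astype(int)
df_new.columns = df_new.columns.set_names([None, None])
```

Whether these extra steps matter depends on how the reloaded DataFrame is used afterwards.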
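Unfortunately, this raises the question of what to do if both the index and the columns are MultiIndexes.
Here is a hackish workaround. It assumes that to_csv writes each column label as a tuple string such as "('foo', 'a')" in a single header row, which ast.literal_eval then parses back into a tuple: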
import numpy as np
import pandas as pd
import ast
cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                  ["a", "b", "c", "d"]])
df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)
print(df)
df.to_csv('/tmp/df.csv', index_label='index')
df_new = pd.read_csv('/tmp/df.csv', index_col='index')
columns = pd.MultiIndex.from_tuples([ast.literal_eval(item) for item in df_new.columns])
df_new.columns = columns
df_new.index.name = None
print(df_new)
assert np.all(df.columns.values == df_new.columns.values)
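In pandas versions where to_csv writes one header row per column level instead of tuple strings, the tuple-parsing step is not needed: read_csv accepts a list of header rows and a list of index columns. The following is a minimal sketch under that assumption. The io.StringIO buffers stand in for files on disk, and the level names and row labels in the second case are illustrative, not taken from the question.

```python
import io

import numpy as np
import pandas as pd

# Case 1: the DataFrame from the question (MultiIndex columns, simple index).
cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                  ["a", "b", "c", "d"]])
df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)

buf = io.StringIO()  # in-memory stand-in for a CSV file
df.to_csv(buf)       # writes one header row per column level
buf.seek(0)
# header=[0, 1]: the first two rows together form the column MultiIndex.
# index_col=0: the first column holds the row labels.
df_new = pd.read_csv(buf, header=[0, 1], index_col=0)

assert list(df_new.columns) == list(df.columns)
assert np.allclose(df_new.values, df.values)

# Case 2: MultiIndex on both axes (level names here are illustrative).
cols2 = cols.set_names(["first", "second"])
rows2 = pd.MultiIndex.from_product([["x", "y"], [1, 2]],
                                   names=["outer", "inner"])
df2 = pd.DataFrame(np.random.randn(4, 4), index=rows2, columns=cols2)

buf2 = io.StringIO()
df2.to_csv(buf2)
buf2.seek(0)
# Two header rows for the columns, two leading columns for the row index.
df2_new = pd.read_csv(buf2, header=[0, 1], index_col=[0, 1])

assert list(df2_new.columns) == list(df2.columns)
assert list(df2_new.index) == list(df2.index)
assert np.allclose(df2_new.values, df2.values)
```

The checks compare labels and values only; floats are compared with np.allclose because text serialization is not guaranteed to be bit-exact.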
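Of course, if you just want to store the DataFrame in a file and the format does not matter, then pickling with df.to_pickle and pd.read_pickle (exposed as df.save and pd.load in older pandas releases) is a more comfortable solution:
import numpy as np
import pandas as pd
cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"],
                                  ["a", "b", "c", "d"]])
df = pd.DataFrame(np.random.randn(5, 4), index=range(5), columns=cols)
# Pickling keeps the index and column objects intact, including a MultiIndex.
df.to_pickle('/tmp/df.df')
df_new = pd.read_pickle('/tmp/df.df')
assert np.all(df.columns.values == df_new.columns.values)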
This is an open issue: https://github.com/pydata/pandas/issues/1651 – Jeff 2013-05-05 22:38:17