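pandas memory consumption when grouping HDF files
Problem description:
I wrote the script below, but I have a memory consumption problem: pandas allocates more than 30 GB of RAM, while the data files add up to only about 18 GB.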
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import time
mean_wo = pd.DataFrame()
mean_w = pd.DataFrame()
std_w = pd.DataFrame()
std_wo = pd.DataFrame()
start_time=time.time() #taking current time as starting time
data_files=['2012.h5','2013.h5','2014.h5','2015.h5', '2016.h5', '2008_2011.h5']
for data_file in data_files:
    print(data_file)
    df = pd.read_hdf(data_file)
    grouped = df.groupby('day')
    # per-day mean and standard deviation for both significance columns
    mean_wo_tmp = grouped['Significance_without_muons'].agg([np.mean])
    mean_w_tmp = grouped['Significance_with_muons'].agg([np.mean])
    std_wo_tmp = grouped['Significance_without_muons'].agg([np.std])
    std_w_tmp = grouped['Significance_with_muons'].agg([np.std])
    # append this file's results to the running totals
    mean_wo = pd.concat([mean_wo, mean_wo_tmp])
    mean_w = pd.concat([mean_w, mean_w_tmp])
    std_w = pd.concat([std_w, std_w_tmp])
    std_wo = pd.concat([std_wo, std_wo_tmp])
    mean_wo.info()
    mean_w.info()
    del df, grouped, mean_wo_tmp, mean_w_tmp, std_w_tmp, std_wo_tmp
std_wo=std_wo.reset_index()
std_w=std_w.reset_index()
mean_wo=mean_wo.reset_index()
mean_w=mean_w.reset_index()
#setting the field day as date
std_wo['day']= pd.to_datetime(std_wo['day'], format='%Y-%m-%d')
std_w['day']= pd.to_datetime(std_w['day'], format='%Y-%m-%d')
mean_w['day']= pd.to_datetime(mean_w['day'], format='%Y-%m-%d')
mean_wo['day']= pd.to_datetime(mean_wo['day'], format='%Y-%m-%d')
So, does anyone have an idea how to reduce the memory consumption?
Cheers,
Answer
I would do something like this.
Solution
import pandas as pd

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5']
cols = ['Significance_without_muons', 'Significance_with_muons']

def agg(data_file):
    # the raw frame is only a temporary here; just the small per-day summary is returned
    return pd.read_hdf(data_file).groupby('day')[cols].agg(['mean', 'std'])

# columns of big_df have three levels: (file, column, statistic)
big_df = pd.concat([agg(fn) for fn in data_files], axis=1, keys=data_files)
mean_wo_tmp = big_df.xs(('Significance_without_muons', 'mean'), axis=1, level=[1, 2])
mean_w_tmp = big_df.xs(('Significance_with_muons', 'mean'), axis=1, level=[1, 2])
std_wo_tmp = big_df.xs(('Significance_without_muons', 'std'), axis=1, level=[1, 2])
std_w_tmp = big_df.xs(('Significance_with_muons', 'std'), axis=1, level=[1, 2])
del big_df
Setup
import numpy as np
import pandas as pd

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5']
cols = ['Significance_without_muons', 'Significance_with_muons']
np.random.seed([3,1415])
# random test data with a 'day' grouping column
data_df = pd.DataFrame(np.random.rand(1000, 2), columns=cols)
data_df['day'] = np.random.choice(list('ABCDEFG'), 1000)
# write the same test frame to every file name
for fn in data_files:
    data_df.to_hdf(fn, key='day', append=False)
Run the solution above
Then
mean_wo_tmp
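The output of that last expression is not reproduced here. As a rough illustration of the shape to expect, the sketch below repeats the same concat/xs steps on small in-memory frames instead of HDF files. The helper make_frame and its contents are made up for the example, and the file names are used only as labels; nothing is read from disk.

```python
import numpy as np
import pandas as pd

cols = ['Significance_without_muons', 'Significance_with_muons']
data_files = ['2012.h5', '2013.h5']  # labels only; no files are opened


def make_frame(seed):
    """Hypothetical stand-in for pd.read_hdf(data_file)."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame(rng.rand(100, 2), columns=cols)
    df['day'] = ['ABC'[i % 3] for i in range(100)]  # three distinct "days"
    return df


def agg(df):
    # One small summary frame per input, columns labelled (column, statistic).
    return df.groupby('day')[cols].agg(['mean', 'std'])


frames = [agg(make_frame(i)) for i in range(len(data_files))]
big_df = pd.concat(frames, axis=1, keys=data_files)

# Column levels are now (file, column, statistic); select on levels 1 and 2.
mean_wo_tmp = big_df.xs(('Significance_without_muons', 'mean'), axis=1, level=[1, 2])
print(mean_wo_tmp.shape)
print(list(mean_wo_tmp.columns))
```

With this layout the result should have one row per day and one column per file, which differs from the question's original loop, where the per-file results were stacked under each other.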
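Many thanks, piRSquared! I will try your approach. For now I have added a 'gc.collect()' at the end of the for loop, and I managed to run it within a 25 GB threshold. I will let you know if your way is better :) Thanks again!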
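The workaround mentioned in this comment, an explicit garbage collection at the end of each loop iteration, might look roughly like the sketch below. It is only an illustration under assumptions: load_frame is a hypothetical stand-in for pd.read_hdf that builds a small in-memory frame, and the per-file results are kept in a list so that pd.concat is called once after the loop rather than once per file.

```python
import gc

import numpy as np
import pandas as pd

cols = ['Significance_without_muons', 'Significance_with_muons']


def load_frame(seed):
    """Hypothetical stand-in for pd.read_hdf(data_file)."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame(rng.rand(1000, 2), columns=cols)
    df['day'] = ['2016-01-%02d' % (i % 5 + 1) for i in range(1000)]
    return df


parts = []
for seed in range(3):  # one iteration per (pretend) file
    df = load_frame(seed)
    # Keep only the small aggregated result; the raw frame is not needed afterwards.
    parts.append(df.groupby('day')[cols].agg(['mean', 'std']))
    del df
    gc.collect()  # explicit collection at the end of each iteration, as in the comment

# Stack the per-file results under each other, as the question's loop does.
stats = pd.concat(parts)
mean = stats.xs('mean', axis=1, level=1).reset_index()
print(mean.head())
```

Whether the explicit collection makes a difference depends on the workload and the pandas version; the comment only reports that it was enough to stay within 25 GB in that particular case.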