Missing values in a .csv file written from a pandas DataFrame

Problem description:

I have a .csv file with the following contents:

20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1 
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1 
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1 
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1 
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1 

I normalize it using a pandas DataFrame, but the resulting .csv file has missing values:

.703280701968,0.867283950617,,,,0.0971635485818,-0.132770066385,,0.318518516666,-inf,-0.742913580247,-0.74703196347,-0.779350940252,-0.659592176966,-0.483438485804,0.565758716954,,,-inf,-0.274046377081,0.705774765311,-0.281481481478,-0.596841230258,,,1 
0.104027493068,-0.0493827160494,,,,0.0199155099578,-0.0175015087508,,0.318518516666,-inf,-0.401580246914,-0.392694063927,-0.331530968381,-0.401165210674,-0.337539432177,0.426956186355,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1 
0.104027493068,-0.132716049383,,,,-0.2494467914,0.254878294116,,0.318518516666,-inf,-0.0620246913541,-0.0547945205479,0.00470906912955,0.0370370365169,-0.183753943218,0.0159880797389,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1 
0.104027493068,-0.132716049383,,,,-0.281231140616,0.286662643331,,0.318518516666,-inf,-0.0229135802474,-0.0164383561644,0.0392144605923,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1 
0.104027493068,-0.132716049383,,,,-0.566083283042,0.571514785757,,0.318518516666,-inf,0.201086419753,0.199086757991,0.184362139917,0.104452766854,-0.160094637224,-0.0472377828174,,,-inf,-0.373755558635,-0.294225234689,0.518518518522,-0.232751454697,,,1 

My code:

import pandas as pd 


df = pd.read_csv('pooja.csv',index_col=False) 
df_norm = (df.ix[:, 1:-1] - df.ix[:, 1:-1].mean())/(df.ix[:, 1:-1].max() - df.ix[:, 1:-1].min()) 
rslt = pd.concat([df_norm, df.ix[:,-1]], axis=1) 
rslt.to_csv('example.csv',index=False,header=False) 

What is wrong with the code? Why are values missing in the .csv file?

Why not print the DataFrame after each step of the code? That way you can identify which line is responsible. – Spinor8

I printed df_norm, and it shows 'nan' for all the missing values, but why? –

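So the rest is irrelevant. When you write the frame to the csv file, NaN is converted to an empty field. I don't have access to your csv data file, but I suspect your denominator may be zero. Why not split the df_norm computation into two parts, df_numerator and df_denominator, and check whether df_denominator is zero for your data? – Spinor8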
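The check suggested in the comment above could be sketched as follows. This is only an illustration: the small frame reuses the first five columns of three rows from the question's data, and the names `numerator`, `denominator` and `zero_range_cols` are made up for this example.

```python
import pandas as pd

# Illustrative data: first five columns of three rows from the question
df = pd.DataFrame([
    [20376.65, 22398.29, 4.80, 0.0, 1.0],
    [7048.842, 8421.754, 1.44, 0.0, 1.0],
    [3716.89, 4927.62, 0.12, 0.0, 1.0],
])

numerator = df - df.mean()         # per-column centring
denominator = df.max() - df.min()  # per-column range (a Series)

# Columns whose range is exactly zero cannot be normalized by division
zero_range_cols = denominator[denominator == 0].index.tolist()
print(zero_range_cols)
```

Any column label that shows up in `zero_range_cols` is a constant column, and dividing by its range is what produces the missing values.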

You get a lot of NaN values because you divide 0 by 0. See broadcasting behaviour. A better explanation is here.
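As a minimal sketch of the arithmetic involved (the Series `s` is invented for illustration), floating-point division by zero in pandas does not raise an error; it yields NaN for 0/0 and a signed infinity for a nonzero numerator:

```python
import pandas as pd

s = pd.Series([-1.0, 0.0, 1.0])

# Element-wise division by zero: no exception is raised
result = s / 0.0
print(result)
```

This may also account for the `-inf` entries in the question's output: if floating-point rounding leaves a tiny negative numerator in a column whose range is exactly zero, the division gives `-inf` instead of NaN. That is a guess about the data, not something verified against the original file.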

I used the code from your previous question, because I think slicing with df.ix[:, 1:-1] is unnecessary. After normalizing with the slice, I got an empty DataFrame.

import pandas as pd 
import numpy as np 
import io 

temp=u"""20376.65,22398.29,4.8,0.0,1.0,2394.0,6.1,89.1,0.0,4.027,9.377,0.33,0.28,0.36,51364.0,426372.0,888388.0,0.0,2040696.0,57.1,21.75,25.27,0.0,452.0,1046524.0,1046524.0,1 
7048.842,8421.754,1.44,0.0,1.0,2394.0,29.14,69.5,0.0,4.027,9.377,0.33,0.28,0.36,51437.6,426964.0,684084.0,0.0,2040696.0,57.1,12.15,14.254,3.2,568.8,1046524.0,1046524.0,1 
3716.89,4927.62,0.12,0.0,1.0,2394.0,26.58,73.32,0.0,4.027,9.377,0.586,1.056,3.544,51456.0,427112.0,633008.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1 
3716.89,4927.62,0.0,0.0,1.0,2394.0,17.653333333,82.346666667,0.0,4.027,9.377,0.84066666667,1.796,5.9346666667,51487.2,427268.0,481781.6,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1 
3716.89,4927.62,0.0,0.0,1.0,2394.0,16.6,83.4,0.0,4.027,9.377,0.87,1.88,6.18,51492.0,427292.0,458516.0,0.0,2040696.0,57.1,9.75,11.5,4.0,598.0,1046524.0,1046524.0,1""" 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col=None, header=None)
# print(df)
# keep only the first 5 columns for testing
df = df.iloc[:, :5]
print(df)
           0          1     2  3  4
0  20376.650  22398.290  4.80  0  1
1   7048.842   8421.754  1.44  0  1
2   3716.890   4927.620  0.12  0  1
3   3716.890   4927.620  0.00  0  1
4   3716.890   4927.620  0.00  0  1

# get the max value of each column
print(df.max())
0    20376.65
1    22398.29
2        4.80
3        0.00
4        1.00
dtype: float64

# get the min value of each column
print(df.min())
0    3716.89
1    4927.62
2       0.00
3       0.00
4       1.00
dtype: float64
# the difference (range) is 0 for columns 3 and 4
print(df.max() - df.min())
0    16659.76
1    17470.67
2        4.80
3        0.00
4        0.00
dtype: float64

print(df - df.mean())
            0           1      2  3  4
0  12661.4176  13277.7092  3.528  0  0
1   -666.3904   -698.8268  0.168  0  0
2  -3998.3424  -4192.9608 -1.152  0  0
3  -3998.3424  -4192.9608 -1.272  0  0
4  -3998.3424  -4192.9608 -1.272  0  0

# you get NaN in columns 3 and 4: the centred values there are all 0,
# and the range (max - min) at index 3 and 4 is also 0, so the division is 0/0
df_norm = (df - df.mean())/(df.max() - df.min())
print(df_norm)
      0     1      2   3   4
0  0.76  0.76  0.735 NaN NaN
1 -0.04 -0.04  0.035 NaN NaN
2 -0.24 -0.24 -0.240 NaN NaN
3 -0.24 -0.24 -0.265 NaN NaN
4 -0.24 -0.24 -0.265 NaN NaN
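One possible way to avoid the NaN values, which the original answer does not propose, is to substitute 1 for any zero range before dividing, so that constant columns come out as 0. The sketch below assumes that this is an acceptable result for constant columns; `col_range` and `safe_range` are hypothetical names, and the data is a reduced copy of the question's values.

```python
import pandas as pd

# Illustrative data: two varying columns and two constant columns
df = pd.DataFrame([
    [20376.65, 4.80, 0.0, 1.0],
    [7048.842, 1.44, 0.0, 1.0],
    [3716.89, 0.12, 0.0, 1.0],
])

col_range = df.max() - df.min()

# Assumption: a constant column should normalize to 0 rather than NaN,
# so replace a zero range with 1 before dividing.
safe_range = col_range.replace(0, 1)

df_norm = (df - df.mean()) / safe_range
print(df_norm)
```

Dropping the constant columns before normalizing would be another reasonable choice, since they carry no information after mean normalization.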

Finally, when you write the result with to_csv, each NaN becomes "" (an empty field), because the parameter na_rep has the default value "".

print(df_norm.to_csv(index=False, header=False, na_rep=""))
0.76,0.76,0.735,, 
-0.04,-0.04,0.035,, 
-0.24,-0.24,-0.24,, 
-0.24,-0.24,-0.265,, 
-0.24,-0.24,-0.265,, 

If you change the value:

# change na_rep to * for testing
print(df_norm.to_csv(index=False, header=False, na_rep="*"))
0.76,0.76,0.735,*,* 
-0.04,-0.04,0.035,*,* 
-0.24,-0.24,-0.24,*,* 
-0.24,-0.24,-0.265,*,* 
-0.24,-0.24,-0.265,*,*
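If the file is meant to be read back later, an explicit marker such as "NaN" can make the missing values easier to spot than empty fields. The following is a small sketch with a made-up two-row frame; it relies on `read_csv` treating both "NaN" and empty fields as missing by default.

```python
import io

import numpy as np
import pandas as pd

# Made-up frame with one fully missing column
df_norm = pd.DataFrame([[0.76, np.nan], [-0.04, np.nan]])

# Write missing values with an explicit marker instead of an empty field
csv_text = df_norm.to_csv(index=False, header=False, na_rep="NaN")
print(csv_text)

# Reading the text back restores the missing values as NaN
df_back = pd.read_csv(io.StringIO(csv_text), header=None)
missing_count = int(df_back.isna().sum().sum())
print(missing_count)
```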