Pandas error when trying to convert a string to an integer

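Question:

Requirement:

One particular column in a DataFrame is of "mixed" type. It can hold values like "123456" or "ABC12345".

This DataFrame is written to Excel using xlsxwriter.

For values like "123456", somewhere down the line pandas converts them to 123456.0 (making them look like floats).

When the value is fully numeric, we need to write it to the XLSX file as 123456 (i.e. as a positive integer).

Effort:

The code snippet is shown below: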

import pandas as pd 
import numpy as np 
import xlsxwriter 
import os 
import datetime 
import sys 
excel_name = str(input("Please Enter Spreadsheet Name :\n").strip()) 

print("excel entered : " , excel_name) 
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias', 
     'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription', 
     'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description', 
     'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID', 
     'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV', 
      'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID', 
      'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage'] 
first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType', 
            'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID', 
            'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) 
# We need to keep integer AltID as is 

first_pass_drop_duplicate.loc[first_pass_drop_duplicate['AltID']] = first_pass_drop_duplicate['AltID'].apply(lambda x : str(int(x)) if str(x).isdigit() == True else x)

I have tried:

1. Using `dataframe.astype(int).astype(str)`: works only as long as no value is alphanumeric.
2. Importing re and using pure Python `re.compile()` and `replace()`: does not work.
3. Reading the DataFrame row by row in a for loop: this kills the machine, as the DataFrame can have 300k+ records.

Every time, the error I get is:

raise KeyError('%s not in index' % objarr[mask])
KeyError: '[ 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 102711. 102711. 102711. 102711. 102711. 102711. 102711. 102711.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 5337. 5337. 5337. 5337. 5337. 5337.\n 5337. 5337. 2124. 2124. 2124. 2124. 2124. 2124.\n 2124. 2124. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.\n 6643. 6643. 6643. 6643. 6643. 6643. 6643. 6643.] not in index'

I am new to Python/pandas; any help or solution is much appreciated.
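For reference, here is a minimal sketch of why the `.loc[...]` line in the snippet raises `KeyError`. It assumes the column already holds floats such as 102711.0, as the values in the error message suggest. The frame `df` and the helper `as_int_text` are made up for illustration and are not part of the original script.

```python
import pandas as pd

# Hypothetical stand-in for first_pass_drop_duplicate.
df = pd.DataFrame({"AltID": [102711.0, "ABC12345", 5337.0]})

# .loc[<column values>] looks for *row labels* 102711.0, "ABC12345", ...
# The index only has the labels 0, 1, 2, so pandas raises KeyError.
try:
    df.loc[df["AltID"]]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Note: str(102711.0) is "102711.0", and "102711.0".isdigit() is False,
# so an isdigit() test on str(x) skips float values anyway.

def as_int_text(x):
    """Hypothetical helper: 102711.0 -> '102711'; everything else unchanged."""
    if isinstance(x, float) and x.is_integer():
        return str(int(x))
    return x

# Assign to the column itself instead of indexing rows by its values.
df["AltID"] = df["AltID"].map(as_int_text)
print(df["AltID"].tolist())  # ['102711', 'ABC12345', '5337']
```

Element-wise `map` is still a single pass over the column, so it should cope with a few hundred thousand rows far better than an explicit row-by-row loop over the DataFrame.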

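So you need to convert only the numeric values to `float` and leave the non-numeric ones alone? – jezrael

I need to ensure that a positive integer is treated as TEXT/STRING and that no .0 (decimal point) is appended when it is actually displayed in Excel. – SanBan

So you need to convert all values to type `string`? Is the problem that `Excel` parses `int` values converted to `string` as `float`? – jezrael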

Answer

I think you need to_numeric:

df = pd.DataFrame({'AltID':['123456','ABC12345','123456'], 
        'B':[4,5,6]}) 

print (df) 
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

# .ix was removed in pandas 1.0; .loc accepts the same boolean mask and column label
df.loc[df.AltID.str.isdigit(), 'AltID'] = pd.to_numeric(df.AltID, errors='coerce')

print (df) 
      AltID  B
0    123456  4
1  ABC12345  5
2    123456  6

print (df['AltID'].apply(type))
0    <class 'float'>
1      <class 'str'>
2    <class 'float'>
Name: AltID, dtype: object
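Since `pd.to_numeric(..., errors='coerce')` produces floats here (the coerced `NaN` forces a float dtype), a possible variation is to convert only the all-digit entries with `int`. This is a sketch under assumptions, not the answer's own code: the explicit `dtype=object` and the name `is_digits` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    # dtype=object keeps the column able to hold both str and int values.
    "AltID": pd.Series(["123456", "ABC12345", "123456"], dtype=object),
    "B": [4, 5, 6],
})

# True where the entry consists only of digits; astype(str) guards against
# entries that are not strings to begin with.
is_digits = df["AltID"].astype(str).str.isdigit()

# Convert only the masked rows. The alphanumeric IDs stay strings and the
# all-digit ones become integers, so no ".0" is introduced.
df.loc[is_digits, "AltID"] = df.loc[is_digits, "AltID"].map(int)

print(df["AltID"].map(type))
```

If the value must end up as text rather than as a number, replacing `int` with `lambda v: str(int(v))` in the `map` call keeps the same structure.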

Awesome! It didn't work for my series because the fourth element was already an `int`: `pd.Series([1], dtype=object).str.isdigit()` returns `NaN`. I had to do this instead: `s.ix[s.str.isdigit().fillna(False)] = pd.to_numeric(s, errors='coerce')`, and it works perfectly. – piRSquared

And! This is almost certainly faster. – piRSquared

@piRSquared - Thanks. Another solution is `df.ix[df.AltID.astype(str).str.isdigit(), 'AltID'] = pd.to_numeric(df.AltID, errors='coerce')` – jezrael

Answer

Use apply and pd.to_numeric with the parameter errors='ignore'

Consider the pd.Series s

s = pd.Series(['12345', 'abc12', '456', '65hg', 54, '12-31-2001']) 

s.apply(pd.to_numeric, errors='ignore') 

0         12345
1         abc12
2           456
3          65hg
4            54
5    12-31-2001
dtype: object

Notice the types

s.apply(pd.to_numeric, errors='ignore').apply(type) 

0    <type 'numpy.int64'>
1            <type 'str'>
2    <type 'numpy.int64'>
3            <type 'str'>
4            <type 'int'>
5            <type 'str'>
dtype: object
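As far as I know, `errors='ignore'` is deprecated in recent pandas releases (2.2 and later), so the call above may emit a warning or stop working there. A minimal sketch of the same element-wise idea with an explicit try/except follows; the helper name `to_number_or_keep` is made up for illustration.

```python
import pandas as pd

s = pd.Series(['12345', 'abc12', '456', '65hg', 54, '12-31-2001'])

def to_number_or_keep(value):
    """Return the numeric form of value, or value itself if it cannot be parsed."""
    try:
        return pd.to_numeric(value)
    except (ValueError, TypeError):
        return value

converted = s.apply(to_number_or_keep)
print(converted.apply(type))
```

Entries such as 'abc12' and '12-31-2001' stay strings, while '12345', '456' and 54 come back as numbers.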

Answer

Finally, it worked by using the "converters" option of pandas read_excel:

df_w02 = pd.read_excel(excel_name, names = df_header,converters = {'AltID':str,'RatingReason' : str}).fillna("")

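Converters can "cast" a column to the type defined by my function/value, and they keep an integer stored as a string without adding a decimal point.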
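To illustrate the effect without needing an Excel file, here is a small sketch that uses `pd.read_csv`, which also accepts a `converters` argument, in place of `read_excel`. The inline CSV text and the variable names are assumptions made up for this example.

```python
import io
import pandas as pd

# Hypothetical two-row input standing in for the spreadsheet.
csv_text = "AltID,RatingReason\n102711,V\n5337,L\n"

# Default parsing infers a numeric dtype for the all-digit column.
default = pd.read_csv(io.StringIO(csv_text))

# With a converter, each raw cell is passed through str and kept as text.
as_text = pd.read_csv(io.StringIO(csv_text), converters={"AltID": str})

print(default["AltID"].dtype)
print(as_text["AltID"].tolist())  # ['102711', '5337']
```

Because the values never become numbers, there is no float conversion step that could later append ".0".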
