Efficiently cleaning data when importing a CSV file with pandas

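Question:

I'm importing a dataset with Python's pandas that unfortunately needs some cleaning. After the import, I need to remove all quotation marks and spaces from two columns (alpha2 and alpha3). This is how I currently do it: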

import pandas as pd

# Add alpha2 country codes to custom dataset to normalize data
country_codes = pd.read_csv('datasets/country_codes.csv').rename(columns={'Alpha-2 code': 'alpha2', 'Alpha-3 code': 'alpha3'})
# Remove quotation marks and spaces from both code columns
country_codes['alpha2'] = country_codes['alpha2'].str.replace('"', '')
country_codes['alpha2'] = country_codes['alpha2'].str.replace(' ', '')
country_codes['alpha3'] = country_codes['alpha3'].str.replace('"', '')
country_codes['alpha3'] = country_codes['alpha3'].str.replace(' ', '')

In my opinion this is a bit ugly, because I need five lines for a few simple operations. Can this be done more efficiently, with less code?

Answer

You can use df.replace with a regex, as follows:

country_codes[['alpha2', 'alpha3']].replace(r'"|\s', '', regex=True, inplace=True)

The complete code looks like this:

import pandas as pd

country_codes = pd.read_csv('datasets/country_codes.csv').rename(columns={'Alpha-2 code': 'alpha2', 'Alpha-3 code': 'alpha3'})
country_codes[['alpha2', 'alpha3']].replace(r'"|\s', '', regex=True, inplace=True)

However, as @Jeff points out in the comment below, it is better not to use inplace=True here and to assign the result back instead:

country_codes[['alpha2', 'alpha3']] = country_codes[['alpha2', 'alpha3']].replace(r'"|\s', '', regex=True)

For more details, see the pandas documentation.
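To try the assignment form without the original file, here is a minimal sketch. The sample rows are made up for illustration and stand in for datasets/country_codes.csv; only the replace call comes from the answer above.

```python
import pandas as pd

# Hypothetical sample rows standing in for datasets/country_codes.csv
country_codes = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "alpha2": [' "AF"', ' "AL"'],
    "alpha3": [' "AFG"', ' "ALB"'],
})

cols = ["alpha2", "alpha3"]

# Strip every double quote and whitespace character from both columns,
# then assign the cleaned columns back to the original frame.
country_codes[cols] = country_codes[cols].replace(r'"|\s', "", regex=True)

print(country_codes)
```

The pattern `"|\s` matches either a double quote or any whitespace character, so a single call covers both cleanup steps for both columns. Columns not listed in `cols` are left as they were.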

Using inplace=True in a chained expression is not idiomatic, and it may only work some of the time; simply return the new value instead. – Jeff
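One way to follow this advice while keeping everything in a single chain is to put the cleanup in a small function that returns a new frame and call it through pipe. This is only a sketch: the helper name strip_quotes_and_spaces and the sample rows are hypothetical and not part of the original answer.

```python
import pandas as pd


def strip_quotes_and_spaces(df, columns):
    """Return a cleaned copy of df; the input frame is left untouched."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].replace(r'"|\s', "", regex=True)
    return cleaned


# Hypothetical sample rows; the real data would come from pd.read_csv
raw = pd.DataFrame({
    "Alpha-2 code": [' "AF"', ' "AL"'],
    "Alpha-3 code": [' "AFG"', ' "ALB"'],
})

# Each step returns a new DataFrame, so nothing relies on inplace=True
country_codes = (
    raw.rename(columns={"Alpha-2 code": "alpha2", "Alpha-3 code": "alpha3"})
       .pipe(strip_quotes_and_spaces, ["alpha2", "alpha3"])
)

print(country_codes)
```

Because every step returns a value, the result does not depend on whether the column selection happens to be a view or a copy.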
