ch7_01 Pandas Data Cleaning and Preparation


import pandas as pd
import numpy as np

7.1 Handling Missing Data

  • For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call it a sentinel value because it can be easily detected:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
  • Python's built-in None value is also treated as NA:
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
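  • Note that when None appears in a numeric Series, pandas stores it as NaN and the dtype stays float64. A minimal sketch (the toy Series `numeric` is introduced here just for illustration):
numeric = pd.Series([1.0, None, 3.0])
numeric           # None is stored as NaN
numeric.isnull()  # and is detected the same way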
  • Table 7-1 lists some functions related to handling missing data.
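  • The table image is not reproduced here; as a quick sketch, these are the core NA handling methods it covers, shown on a small made-up Series:
s = pd.Series([1.0, np.nan, 3.0])
s.isnull()    # boolean mask, True where a value is NA
s.notnull()   # the complement of isnull
s.dropna()    # return a filtered version with NA values removed
s.fillna(0)   # return a copy with NA values filled in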

Filtering Out Missing Data: dropna()

string_data.dropna()
1    artichoke
3      avocado
dtype: object
  • The code above is equivalent to:
string_data[string_data.notnull()]
1    artichoke
3      avocado
dtype: object
  • With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, dropna drops any row containing a missing value:
from numpy import nan as NA
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
data.dropna()
     0    1    2
0  1.0  6.5  3.0
  • If you want to drop only rows that are entirely NA, pass how='all':
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
  • To drop columns that are entirely NA instead, additionally pass axis=1:
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(how='all',axis=1)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
# Note: the operations above do not modify the original data
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
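  • If you want to keep the filtered result, assign it to a new variable (the name cleaned below is just for illustration):
cleaned = data.dropna(how='all', axis=1)
cleaned  # data itself is left unchanged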
  • A related way of filtering DataFrame rows comes up with time series data. Suppose you want to keep only rows containing at least a certain number of observations; you can indicate this with the thresh argument:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
          0         1         2
0  1.736760       NaN       NaN
1 -2.129930       NaN       NaN
2  1.437452       NaN -1.604753
3  0.826910       NaN  2.472819
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
df.dropna()
          0         1         2
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
df.dropna(thresh=2)  # keep only rows with at least 2 non-NA values
          0         1         2
2  1.437452       NaN -1.604753
3  0.826910       NaN  2.472819
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
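  • To see why thresh=2 keeps exactly these rows, you can count the non-NA values per row yourself; only rows with at least 2 non-NA values survive:
df.notnull().sum(axis=1)  # number of non-NA values in each row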

Filling In Missing Data: fillna()

df.fillna(0)
          0         1         2
0  1.736760  0.000000  0.000000
1 -2.129930  0.000000  0.000000
2  1.437452  0.000000 -1.604753
3  0.826910  0.000000  2.472819
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
  • If you pass a dict, you can use a different fill value for each column:
df.fillna({1:0.05, 2:0.01})
          0         1         2
0  1.736760  0.050000  0.010000
1 -2.129930  0.050000  0.010000
2  1.437452  0.050000 -1.604753
3  0.826910  0.050000  2.472819
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
  • The methods above return new objects, but you can pass inplace=True to modify the existing object in place:
df.fillna(0,inplace=True)
df
          0         1         2
0  1.736760  0.000000  0.000000
1 -2.129930  0.000000  0.000000
2  1.437452  0.000000 -1.604753
3  0.826910  0.000000  2.472819
4  1.298570 -0.334513  1.318870
5 -0.640984 -1.344805  0.497007
6 -0.878085 -1.819722  0.341758
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:,1] = NA
df.iloc[4:,2] = NA
df
          0         1         2
0  0.773045 -1.022993  0.657842
1  0.541367 -0.003357 -2.031984
2 -0.041549       NaN -0.678426
3  0.586369       NaN -1.340845
4 -0.367919       NaN       NaN
5 -0.917897       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0  0.773045 -1.022993  0.657842
1  0.541367 -0.003357 -2.031984
2 -0.041549 -0.003357 -0.678426
3  0.586369 -0.003357 -1.340845
4 -0.367919 -0.003357 -1.340845
5 -0.917897 -0.003357 -1.340845
df.fillna(method='ffill',limit=1)
          0         1         2
0  0.773045 -1.022993  0.657842
1  0.541367 -0.003357 -2.031984
2 -0.041549 -0.003357 -0.678426
3  0.586369       NaN -1.340845
4 -0.367919       NaN -1.340845
5 -0.917897       NaN       NaN
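  • Note: in recent pandas releases, fillna(method=...) is deprecated in favor of the dedicated ffill/bfill methods, which accept the same limit argument. An equivalent sketch:
df.ffill()         # same as fillna(method='ffill')
df.ffill(limit=1)  # same as fillna(method='ffill', limit=1)
df.bfill()         # backward fill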
  • You can also fill with the mean or the median:
data = pd.Series([1,NA,3.5,7])
data
0    1.0
1    NaN
2    3.5
3    7.0
dtype: float64
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    7.000000
dtype: float64
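  • The same idea extends column-by-column to a DataFrame: fillna accepts a Series of per-column fill values, so passing df.mean() fills each column with its own mean (the median works the same way). A sketch using df and data from above:
df.fillna(df.mean())        # fill each column with its column mean
data.fillna(data.median())  # fill with the median instead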
  • Common fillna arguments:
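  • The argument table images are not reproduced here; as a brief sketch, the most common fillna arguments used above are:
# value:   scalar value or dict-like object to fill NAs with
# method:  interpolation method, e.g. 'ffill' or 'bfill'
# axis:    axis to fill on (default 0)
# inplace: modify the calling object instead of returning a copy
# limit:   for ffill/bfill, maximum number of consecutive values to fill
df.fillna(value=0)  # equivalent to df.fillna(0)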

7.2 Data Transformation

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
  • The duplicated method returns a boolean Series indicating whether each row is a duplicate (its values have been observed in a previous row):
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
  • drop_duplicates() returns a DataFrame with the duplicate rows dropped:
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
  • Both of these methods consider all of the columns by default; alternatively, you can specify a subset of columns to use for detecting duplicates. Suppose we add another column of values and only want to filter duplicates based on the k1 column:
data['v'] = range(7)
data
    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
5  two   4  5
6  two   4  6
data.drop_duplicates(['k1'])
    k1  k2  v
0  one   1  0
1  two   1  1
  • duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1','k2'],keep='last')
    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
6  two   4  6
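  • You can also pass keep=False to drop every member of a duplicated group, keeping none of them. A small sketch:
data.drop_duplicates(['k1', 'k2'], keep=False)  # both rows with ('two', 4) are dropped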

Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
# Suppose you want to add a column indicating the type of animal that each food came from.
# Start by writing a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
'''The Series map method accepts a function or a dict-like object containing a mapping,
but here some of the meats are capitalized while others are not. So we also need to
convert each value to lowercase using the Series str.lower method:'''
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
# You can also pass a function
data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Replacing Values

  • Filling in missing data with the fillna method can be seen as a special case of value replacement. While map can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Consider this Series:
data = pd.Series([1,-999,2.,-999.,-1000.,3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
# The -999 values might be sentinels for missing data; replace them with NA
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
# To replace multiple values at once, pass a list
data.replace([-999,-1000],np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
# To use a different replacement for each value, pass two lists
data.replace([-999,-1000],[np.nan,0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
# The argument can also be a dict
data.replace({-999:np.nan, -1000:0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
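  • Note: the data.replace method is distinct from element-wise string substitution, which is done with Series.str.replace. A quick illustration on a made-up string Series:
s = pd.Series(['foo', 'bar'])
s.replace('foo', 'baz')   # replaces whole values only
s.str.replace('o', '0')   # substitutes within each string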

Renaming Axis Indexes

  • Like values in a Series, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index=['Ohio', 'Colorado', 'New York'],
                   columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
  • If you want to create a transformed version of the dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
  • rename can be used together with a dict-like object to provide new values for a subset of the axis labels; to modify the data in place, just pass inplace=True (see the sketch after the example below):
data.rename(index={'OHIO':'INDIANA'},columns={'three':'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
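  • A sketch of the in-place variant mentioned above (unlike the calls before, this one does modify data):
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data  # now has INDIANA in place of OHIO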