10 Minutes to pandas (Part 2)
Merge
Concat
pandas provides various facilities for easily combining Series, DataFrames, and Panel objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.
See the Merging section.
Concatenating pandas objects together:
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
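Beyond splitting a frame and stacking the pieces back together, concat can also label each piece. A minimal sketch (the frame and key names below are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['x', 'y'])
pieces = [df[:2], df[2:4], df[4:]]

# keys= adds an outer index level recording which piece each row came from
result = pd.concat(pieces, keys=['first', 'second', 'third'])
print(result.loc['second'])  # just the rows contributed by the middle piece
```

Selecting by the outer key then recovers exactly the rows of that piece.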
Join
SQL-style merges. See the Database-style joining section.
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2
In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5
In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
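The example above is a many-to-many join on a single shared key. When the key columns do not fully overlap, the `how` parameter controls which keys survive; a small sketch with hypothetical data:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # default inner join: only 'foo' matches
outer = pd.merge(left, right, on='key', how='outer')  # keeps 'bar' and 'baz', filled with NaN
print(outer)
```

`how='left'` and `how='right'` similarly keep all keys from one side only.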
Append
Append rows to a DataFrame. See the Appending section.
In [82]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [83]: df
Out[83]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
In [84]: s = df.iloc[3]
In [85]: df.append(s, ignore_index=True)
Out[85]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
8 1.453749 1.208843 -0.080952 -0.264610
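Note that append() copies the full frame on every call, so appending row by row in a loop is quadratic; the usual idiom is to collect the pieces and concatenate once (and in pandas 2.0, DataFrame.append was removed outright). A minimal sketch:

```python
import numpy as np
import pandas as pd

# Build each chunk separately ...
rows = [pd.DataFrame(np.random.randn(1, 4)) for _ in range(8)]

# ... then concatenate once, instead of calling append() inside the loop.
df = pd.concat(rows, ignore_index=True)
print(df.shape)  # (8, 4)
```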
Grouping
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section.
In [86]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
   ....:
In [87]: df
Out[87]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
Grouping and then applying the sum function to the resulting groups:
In [88]: df.groupby('A').sum()
Out[88]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
Grouping by multiple columns forms a hierarchical index, to which we then apply the function:
In [89]: df.groupby(['A', 'B']).sum()
Out[89]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
Reshaping
Stack
In [90]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
....: 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two',
....: 'one', 'two', 'one', 'two']]))
....:
In [91]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [92]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [93]: df2 = df[:4]
In [94]: df2
Out[94]:
A B
first second
bar one 0.029399 -0.542108
two 0.282696 -0.087302
baz one -1.575170 1.771208
two 0.816482 1.100230
The stack() method "compresses" a level in the DataFrame's columns.
In [95]: stacked = df2.stack()
In [96]: stacked
Out[96]:
first  second
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [97]: stacked.unstack()
Out[97]:
A B
first second
bar one 0.029399 -0.542108
two 0.282696 -0.087302
baz one -1.575170 1.771208
two 0.816482 1.100230
In [98]: stacked.unstack(1)
Out[98]:
second one two
first
bar A 0.029399 0.282696
B -0.542108 -0.087302
baz A -1.575170 0.816482
B 1.771208 1.100230
In [99]: stacked.unstack(0)
Out[99]:
first bar baz
second
one A 0.029399 -1.575170
B -0.542108 1.771208
two A 0.282696 0.816482
B -0.087302 1.100230
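Since unstack() is the inverse of stack(), a stack/unstack round trip restores the original frame; a quick sketch:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('bar', 'one'), ('bar', 'two')],
                                  names=['first', 'second'])
df2 = pd.DataFrame(np.random.randn(2, 2), index=index, columns=['A', 'B'])

# stack() then unstack() reproduces the original frame
round_trip = df2.stack().unstack()
print(round_trip.equals(df2))  # True
```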
Pivot Tables
See the section on Pivot Tables.
In [100]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
   .....:
In [101]: df
Out[101]:
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115
We can produce pivot tables from this data very easily:
In [102]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[102]:
C bar foo
A B
one A -0.773723 1.418757
B -0.029716 -1.879024
C -1.146178 0.314665
three A 1.006160 NaN
B NaN -1.035018
C 0.648740 NaN
two A NaN 0.100900
B -1.170653 NaN
C NaN 0.536826
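pivot_table aggregates with the mean by default; the aggfunc parameter swaps in another reduction. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'bar'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# aggfunc replaces the default mean with another reduction
table = pd.pivot_table(df, values='D', index=['A'], columns=['C'], aggfunc='sum')
print(table)
```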
Time Series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
In [103]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [104]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [105]: ts.resample('5Min', how='sum')
Out[105]:
2012-01-01    25083
Freq: 5T, dtype: int32
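The how= keyword shown above was removed in later pandas versions; resample now returns an intermediate object and the aggregation is chosen by method chaining. A sketch of the newer spelling (with deterministic data so the total is checkable):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='s')
ts = pd.Series(np.arange(100), index=rng)

# newer API: pick the aggregation by chaining rather than how='sum'
result = ts.resample('5min').sum()
print(result)
```

100 seconds of data fit in a single 5-minute bucket, so the result has one row.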
Time zone representation:
In [106]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [107]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [108]: ts
Out[108]:
2012-03-06 0.464000
2012-03-07 0.227371
2012-03-08 -0.496922
2012-03-09 0.306389
2012-03-10 -2.290613
Freq: D, dtype: float64
In [109]: ts_utc = ts.tz_localize('UTC')
In [110]: ts_utc
Out[110]:
2012-03-06 00:00:00+00:00 0.464000
2012-03-07 00:00:00+00:00 0.227371
2012-03-08 00:00:00+00:00 -0.496922
2012-03-09 00:00:00+00:00 0.306389
2012-03-10 00:00:00+00:00 -2.290613
Freq: D, dtype: float64
Converting to another time zone:
In [111]: ts_utc.tz_convert('US/Eastern')
Out[111]:
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64
Converting between time span representations:
In [112]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [113]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [114]: ts
Out[114]:
2012-01-31 -1.134623
2012-02-29 -1.561819
2012-03-31 -0.260838
2012-04-30 0.281957
2012-05-31 1.523962
Freq: M, dtype: float64
In [115]: ps = ts.to_period()
In [116]: ps
Out[116]:
2012-01 -1.134623
2012-02 -1.561819
2012-03 -0.260838
2012-04 0.281957
2012-05 1.523962
Freq: M, dtype: float64
In [117]: ps.to_timestamp()
Out[117]:
2012-01-01 -1.134623
2012-02-01 -1.561819
2012-03-01 -0.260838
2012-04-01 0.281957
2012-05-01 1.523962
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
In [118]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [119]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [120]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [121]: ts.head()
Out[121]:
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64
Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the Categorical introduction and the API documentation.
In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
Convert the raw grades to a categorical data type:
In [123]: df["grade"] = df["raw_grade"].astype("category")
In [124]: df["grade"]
Out[124]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!):
In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default):
In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                               "good", "very good"])
In [127]: df["grade"]
Out[127]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per order in the categories, not lexical order:
In [128]: df.sort("grade")
Out[128]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
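df.sort was likewise renamed sort_values() in later pandas versions; either way, categorical sorting follows the declared category order rather than lexical order. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'c']).astype('category')
s = s.cat.set_categories(['c', 'b', 'a'])  # declare a custom, non-lexical order

ordered = s.sort_values()
print(list(ordered))  # ['c', 'b', 'a']
```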
Grouping by a categorical column also shows empty categories:
In [129]: df.groupby("grade").size()
Out[129]:
grade
very bad       1
bad          NaN
medium       NaN
good           2
very good      3
dtype: float64
Plotting
See the Plotting docs.
In [130]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [131]: ts = ts.cumsum()
In [132]: ts.plot()
Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0xb02091ac>
On a DataFrame, plot() is a convenience to plot all of the columns with labels:
In [133]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....:
In [134]: df = df.cumsum()
In [135]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[135]: <matplotlib.legend.Legend at 0xb01c9cac>
Getting Data In/Out
CSV
Writing to a csv file:
In [136]: df.to_csv('foo.csv')
Reading from a csv file:
In [137]: pd.read_csv('foo.csv')
Out[137]:
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3    2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4    2000-01-05   0.578117   0.511371  0.103552  -2.428202
5    2000-01-06   0.478344   0.449933 -0.741620  -1.962409
6    2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
..          ...        ...        ...       ...        ...
993  2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
994  2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
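A quick round-trip sketch (using an in-memory buffer as a stand-in for 'foo.csv', purely for illustration): writing with the index and reading back with index_col=0 restores the frame:

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))

buf = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buf)
buf.seek(0)

back = pd.read_csv(buf, index_col=0)  # index_col=0 re-reads the saved index
print(np.allclose(back.values, df.values))  # True
```

Without index_col=0, the saved index comes back as the "Unnamed: 0" column seen above.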
HDF5
Reading and writing to HDFStores.
Writing to an HDF5 store:
In [138]: df.to_hdf('foo.h5','df')
Reading from an HDF5 store:
In [139]: pd.read_hdf('foo.h5', 'df')
Out[139]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
Excel
Reading and writing to MS Excel.
Writing to an excel file:
In [140]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file:
In [141]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[141]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
Gotchas
If you are attempting an operation like the following, you might see an exception like:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
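The fix is to reduce the boolean Series to a single boolean explicitly before branching; a minimal sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])

# reduce to one boolean before branching
if s.any():        # True: at least one element is True
    outcome = 'some were true'
if not s.all():    # True: not every element is True
    outcome += ', but not all'
print(outcome)  # some were true, but not all
```

s.empty similarly answers "is there any data at all?" as a single boolean.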
See Comparisons for an explanation and what to do.
See Gotchas as well.