pandas Data Processing in Practice, Part 4 (Time Series with date_range, Binning with cut, Grouping with GroupBy)
Time series:
Key function
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
Parameters:
start: str or datetime-like, optional
end: str or datetime-like, optional
periods: int, optional
freq: str or DateOffset, default 'D' (calendar daily)
tz: str or tzinfo, optional
normalize: bool, default False
name: str, default None
closed: {None, 'left', 'right'}, optional (newer pandas releases use `inclusive` instead)
**kwargs
Returns a fixed-frequency DatetimeIndex.
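To make the parameter list more concrete, here is a minimal sketch of how start, end, periods and freq combine. The dates and variable names are chosen only for illustration.

```python
import pandas as pd

# Give start plus either end or periods; freq falls back to daily ('D').
by_periods = pd.date_range(start="2018-01-01", periods=5)       # 5 consecutive days
by_end = pd.date_range(start="2018-01-01", end="2018-01-05")    # the same 5 days

# A weekly frequency anchored on Mondays (2018-01-01 is a Monday).
weekly = pd.date_range(start="2018-01-01", periods=4, freq="W-MON")

print(by_periods.equals(by_end))
print(list(weekly.strftime("%Y-%m-%d")))
```

Both of the first two calls describe the same index, so which pair of arguments to use is mostly a matter of what is known up front: the end date or the number of points.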
Several ways to build a time series, and resampling:
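In [104]: from datetime import datetime # standard-library datetime type
...: import numpy as np
...: import pandas as pd
...: from pandas import Series # needed by the Series(...) calls below
...: t1 = datetime(2009,10,20) # define a single timestamp directly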
In [105]: t1
Out[105]: datetime.datetime(2009, 10, 20, 0, 0)
In [106]: # build from a list of datetime objects
...: date_list = [
...: datetime(2018,10,1),
...: datetime(2018,10,2),
...: datetime(2018,10,5),
...: datetime(2018,10,7)
...: ]
In [107]: date_list
Out[107]:
[datetime.datetime(2018, 10, 1, 0, 0),
datetime.datetime(2018, 10, 2, 0, 0),
datetime.datetime(2018, 10, 5, 0, 0),
datetime.datetime(2018, 10, 7, 0, 0)]
In [108]: s1 = Series(np.random.randn(4),index=date_list) # random values indexed by the dates
In [109]: s1
Out[109]:
2018-10-01 0.433032
2018-10-02 -1.180358
2018-10-05 -1.583058
2018-10-07 -1.200917
dtype: float64
In [110]: s1.values
Out[110]: array([ 0.43303189, -1.1803582 , -1.58305798, -1.20091707])
In [111]: s1.index
Out[111]: DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-05', '2018-10-07'], dtype='datetime64[ns]', freq=None)
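In [112]: # generate a time index quickly with pd.date_range
In [113]: data_list_new = pd.date_range('2018-01-01',periods=100,freq='H') # 100 hourly timestamps starting at 2018-01-01 00:00 (recent pandas releases prefer the lowercase alias 'h')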
In [114]: len(data_list_new)
Out[114]: 100
In [115]: s2 = Series(np.random.rand(100),index=data_list_new)
In [116]: s2.head()
Out[116]:
2018-01-01 00:00:00 0.891556
2018-01-01 01:00:00 0.953536
2018-01-01 02:00:00 0.321705
2018-01-01 03:00:00 0.150378
2018-01-01 04:00:00 0.180122
Freq: H, dtype: float64
In [117]: t_range = pd.date_range('20180101','20181231')
In [118]: t_range
Out[118]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
'2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
'2018-12-30', '2018-12-31'],
dtype='datetime64[ns]', length=365, freq='D')
In [119]: s1 = Series(np.random.randn(len(t_range)),index=t_range)
In [120]: s1.head()
Out[120]:
2018-01-01 0.442134
2018-01-02 1.726818
2018-01-03 -1.157719
2018-01-04 1.179449
2018-01-05 0.974630
Freq: D, dtype: float64
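In [121]: # resample the time series
In [122]: s1['2018-01'].mean() # partial-string indexing selects all of January 2018
Out[122]: 0.03117062119001378
In [123]: s1_month = s1.resample('M').mean() # downsample to monthly means (pandas 2.2+ spells this alias 'ME')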
In [124]: s1_month.index
Out[124]:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
'2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
'2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
dtype='datetime64[ns]', freq='M')
In [125]: s1.resample('H').bfill().head() # upsample to hourly; bfill fills each new slot with the next known value
Out[125]:
2018-01-01 00:00:00 0.442134
2018-01-01 01:00:00 1.726818
2018-01-01 02:00:00 1.726818
2018-01-01 03:00:00 1.726818
2018-01-01 04:00:00 1.726818
Freq: H, dtype: float64
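Because the session above uses random values, the effect of resampling is hard to check by eye. The sketch below repeats the same steps on a small deterministic series; the offset objects `pd.offsets.MonthEnd()` and `pd.offsets.Hour(12)` are used in place of the string aliases only so that the example does not depend on which alias spelling a given pandas version accepts.

```python
import pandas as pd

# Deterministic daily series: values 1, 2, 3, ... for January and February 2018.
days = pd.date_range("2018-01-01", "2018-02-28")
s = pd.Series(range(1, len(days) + 1), index=days, dtype="float64")

# Partial-string indexing: every row that falls in January 2018.
january_mean = s.loc["2018-01"].mean()

# Downsampling: many daily values collapse into one value per month.
monthly_mean = s.resample(pd.offsets.MonthEnd()).mean()

# Upsampling: new 12-hourly slots appear and have to be filled.
forward = s.resample(pd.offsets.Hour(12)).ffill()    # copy the previous known value
backward = s.resample(pd.offsets.Hour(12)).bfill()   # copy the next known value

noon = pd.Timestamp("2018-01-01 12:00")
print(january_mean, monthly_mean.tolist())
print(forward.loc[noon], backward.loc[noon])
```

The noon slot on 1 January does not exist in the daily data, so `ffill` gives it the value of 1 January and `bfill` the value of 2 January, which is the same pattern visible in Out[125].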
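Data binning:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
pd.cut converts scattered numeric values into discrete intervals. For example, student scores from 0 to 100 can be split into (0, 59], (59, 79], (79, 89] and (89, 100], and ages can be binned the same way. The result has one entry per input value, each replaced by the interval it falls into, and the intervals can be given custom labels. Note that intervals are closed on the right by default (right=True) and include_lowest defaults to False, so a value equal to the lowest edge (a score of 0 here) falls outside every bin and becomes NaN. Examples follow.
Binning student scores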
In [1]: import numpy as np
...: import pandas as pd
...: from pandas import Series,DataFrame
...:
...:
In [2]: score_list = np.random.randint(0,100,size=100) # 100 random student scores; randint excludes the upper bound, so they range from 0 to 99
In [3]: score_list
Out[3]:
array([56, 80, 89, 3, 45, 56, 65, 48, 12, 20, 13, 37, 1, 85, 64, 50, 72,
43, 8, 15, 9, 16, 63, 41, 68, 98, 2, 18, 78, 83, 54, 90, 81, 64,
98, 48, 52, 67, 1, 7, 24, 98, 83, 57, 57, 36, 90, 48, 59, 72, 4,
8, 2, 26, 16, 91, 26, 9, 66, 92, 22, 3, 91, 72, 90, 28, 74, 88,
89, 79, 13, 91, 57, 98, 63, 68, 63, 73, 33, 33, 99, 55, 18, 87, 60,
53, 24, 77, 85, 70, 57, 58, 75, 86, 88, 43, 52, 4, 71, 16])
In [4]: bins = [0,59,79,89,100] # bin edges, giving (0, 59], (59, 79], (79, 89], (89, 100]
In [5]: score_cut = pd.cut(score_list,bins) # split the scores according to bins
In [18]: len(score_cut) # still 100 entries, one per score; each is now an interval and can be given a label
Out[18]: 100
In [6]: score_cut # the result is a pandas Categorical
Out[6]:
[(0, 59], (79, 89], (79, 89], (0, 59], (0, 59], ..., (0, 59], (0, 59], (0, 59], (59, 79], (0, 59]]
Length: 100
Categories (4, interval[int64]): [(0, 59] < (59, 79] < (79, 89] < (89, 100]]
In [7]: type(score_cut)
Out[7]: pandas.core.arrays.categorical.Categorical
In [8]: pd.value_counts(score_cut) # number of students in each interval (newer pandas: pd.Series(score_cut).value_counts())
Out[8]:
(0, 59] 54
(59, 79] 22
(89, 100] 12
(79, 89] 12
dtype: int64
# groundwork for the processing that follows
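The boundary behaviour of pd.cut is easy to get wrong, so here is a small sketch that bins only the edge values. The `scores` list is made up for illustration; the bins are the ones used above.

```python
import pandas as pd

scores = [0, 59, 60, 79, 80, 89, 90, 100]   # values sitting on or next to the bin edges
bins = [0, 59, 79, 89, 100]

# Default: right-closed intervals (0, 59], (59, 79], (79, 89], (89, 100].
default = pd.cut(scores, bins)

# include_lowest=True also closes the first interval on the left, so 0 is kept.
with_lowest = pd.cut(scores, bins, include_lowest=True)

# codes holds the bin number of each value; -1 marks a value that fell in no bin.
print(default.codes.tolist())       # [-1, 0, 1, 1, 2, 2, 3, 3]
print(with_lowest.codes.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]

# Number of values in each interval (the unbinned 0 is not counted).
counts = pd.Series(default).value_counts()
print(counts.sum())
```

With the default settings a score of exactly 0 is silently turned into NaN, which is why passing `include_lowest=True` is usually the safer choice when the lowest edge is a legal value.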
Binning DataFrame data
We reuse the data from above.
In [9]: df = DataFrame() # create an empty DataFrame
In [10]: df['score_list'] = score_list # add the scores as a column
In [11]: df.head() # show the first 5 rows
Out[11]:
score_list
0 56
1 80
2 89
3 3
4 45
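In [12]: df['name'] = [pd.util.testing.rands(3) for i in range(100)]
...: # pd.util.testing.rands() returns a random string, used here as a student name; pandas.util.testing was deprecated in pandas 1.0, so this line may only run on older versions
In [13]: df.head() # show the first 5 students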
Out[13]:
score_list name
0 56 puk
1 80 VUL
2 89 cwz
3 3 uVb
4 45 sRN
In [15]: # add the binning result as a column, labelling the score bands low, ok, good, great
In [16]: df['Categories'] = pd.cut(df['score_list'],bins,labels=['low','ok','good','great'])
In [17]: df.head(10)
Out[17]:
score_list name Categories
0 56 puk low
1 80 VUL good
2 89 cwz good
3 3 uVb low
4 45 sRN low
5 56 3vM low
6 65 wp8 ok
7 48 lSF low
8 12 AkT low
9 20 tgb low
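Since `pd.util.testing.rands` is not available in recent pandas releases, the following sketch rebuilds the same kind of table with the standard library. `random_name` is a hypothetical helper written for this example, not a pandas function, and the column names `score` and `grade` are likewise chosen here for illustration.

```python
import random
import string

import numpy as np
import pandas as pd

random.seed(0)                  # seeded so repeated runs give the same table
rng = np.random.default_rng(0)

def random_name(length=3):
    """Hypothetical stand-in for rands(): a random alphanumeric string."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

df = pd.DataFrame({
    "score": rng.integers(0, 101, size=100),        # 0..100 inclusive
    "name": [random_name() for _ in range(100)],
})

bins = [0, 59, 79, 89, 100]
labels = ["low", "ok", "good", "great"]
# include_lowest=True keeps a score of exactly 0 inside the 'low' band.
df["grade"] = pd.cut(df["score"], bins, labels=labels, include_lowest=True)

# The labels form an ordered categorical, so grouping follows the bin order.
summary = df.groupby("grade", observed=False)["score"].agg(["count", "min", "max"])
print(summary)
```

Grouping by the new column gives a quick sanity check: the minimum and maximum score in each band should stay inside the edges given in `bins`.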
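Grouping with GroupBy
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
This function splits the rows of a DataFrame into groups according to one or more keys, so that the features we are interested in can be processed separately for each group. Take the following data as an example: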
date city temperature wind
0 03/01/2016 BJ 8 5
1 17/01/2016 BJ 12 2
2 31/01/2016 BJ 19 2
3 14/02/2016 BJ -3 3
4 28/02/2016 BJ 19 2
5 13/03/2016 BJ 5 3
6 27/03/2016 SH -4 4
7 10/04/2016 SH 19 3
8 24/04/2016 SH 20 3
9 08/05/2016 SH 17 3
10 22/05/2016 SH 4 2
11 05/06/2016 SH -10 4
12 19/06/2016 SH 0 5
13 03/07/2016 SH -9 5
14 17/07/2016 GZ 10 2
15 31/07/2016 GZ -1 5
16 14/08/2016 GZ 1 5
17 28/08/2016 GZ 25 4
18 11/09/2016 SZ 20 1
19 25/09/2016 SZ -10 4
The data holds weather records for four cities. In this flat form it is awkward to compute per-city statistics (mean, maximum, minimum) or to plot each city, so we group the rows by 'city' and work with the resulting GroupBy object, which offers many methods for aggregation as well as plotting. A concrete session follows:
In [59]: import numpy as np
...: import pandas as pd
...: from pandas import Series,DataFrame
...:
...:
In [60]: df = pd.read_csv('city_weather.csv')
In [61]: df.head()
Out[61]:
date city temperature wind
0 03/01/2016 BJ 8 5
1 17/01/2016 BJ 12 2
2 31/01/2016 BJ 19 2
3 14/02/2016 BJ -3 3
4 28/02/2016 BJ 19 2
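In [62]: gb = df.groupby(df['city']) # group by city: BJ, GZ, SH, SZ
gb.<tab> # tab completion lists the available attributes; the listing below looks copied from a session on another DataFrame, since gb.gender, gb.height and gb.weight are not columns of df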
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
In [65]: gb.groups # group keys and the row indices that belong to each group
Out[65]:
{'BJ': Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'),
'GZ': Int64Index([14, 15, 16, 17], dtype='int64'),
'SH': Int64Index([6, 7, 8, 9, 10, 11, 12, 13], dtype='int64'),
'SZ': Int64Index([18, 19], dtype='int64')}
In [67]: gb.get_group('BJ').mean() # mean temperature and wind for BJ (pandas 2.0+ needs .mean(numeric_only=True) because of the string columns)
Out[67]:
temperature 10.000000
wind 2.833333
dtype: float64
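In [69]: gb.max() # per-group maximum of every column; date is still a string here, so its maximum is lexicographic, not chronological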
Out[69]:
date temperature wind
city
BJ 31/01/2016 19 5
GZ 31/07/2016 25 5
SH 27/03/2016 20 5
SZ 25/09/2016 20 4
gb.plot() # draw a plot for each group (requires matplotlib)
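The session above depends on city_weather.csv, so here is a self-contained sketch using a few rows copied from the table. It also parses the date column first, assuming the day/month/year layout that the values suggest, so that the per-group maximum is chronological.

```python
import pandas as pd

# A handful of rows copied from the weather table above.
df = pd.DataFrame({
    "date": ["03/01/2016", "17/01/2016", "31/01/2016",
             "27/03/2016", "10/04/2016", "11/09/2016", "25/09/2016"],
    "city": ["BJ", "BJ", "BJ", "SH", "SH", "SZ", "SZ"],
    "temperature": [8, 12, 19, -4, 19, 20, -10],
    "wind": [5, 2, 2, 4, 3, 1, 4],
})
# Assumption: dates are day/month/year; parse them so comparisons are by time.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

gb = df.groupby("city")

# Several aggregations of one column in a single call.
stats = gb["temperature"].agg(["mean", "max", "min"])
# Latest record per city, now a true chronological maximum.
latest = gb["date"].max()
# Number of rows in each group.
sizes = gb.size().to_dict()

print(stats)
print(latest)
print(sizes)
```

Selecting a column before aggregating (`gb["temperature"]`) keeps string columns out of the computation, which avoids both misleading results and type errors on mixed-type frames.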
For other features, see the official pandas documentation.