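ch11 Time Series
11.1 Date and Time Data Types and Tools
- The Python standard library includes data types for date and time data, as well as calendar-related functionality. We will mainly use the datetime, time, and calendar modules.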
from datetime import datetime
now = datetime.now()
now
datetime.datetime(2018, 12, 25, 9, 25, 16, 517966)
now.year, now.month, now.day
(2018, 12, 25)
- datetime stores both the date and the time down to the microsecond. timedelta represents the difference in time between two datetime objects:
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta
datetime.timedelta(926, 56700)
delta.days
926
delta.seconds
56700
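As a quick check of what the timedelta above holds, the following sketch reads its normalized fields and converts it to a single number of seconds. It uses only the standard library; the names `parts` and `total` are illustrative.

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# A timedelta is normalized into days, seconds and microseconds.
parts = (delta.days, delta.seconds, delta.microseconds)

# total_seconds() folds all three fields into one float.
total = delta.total_seconds()

print(parts)
print(total)
```

Note that `delta.seconds` is only the remainder within the last day, so `total_seconds()` is the safer choice when the full duration is needed.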
- You can add (or subtract) a timedelta, or a multiple of one, to a datetime object; this yields a new, shifted object:
from datetime import timedelta
start = datetime(2011,1,7)
start + timedelta(12, 20)  # days, seconds
datetime.datetime(2011, 1, 19, 0, 0, 20)
start - 2 * timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)
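Because the positional arguments of timedelta are easy to misread, a keyword-based version of the same arithmetic may be clearer. This is a minimal sketch using only the standard library; `shifted` and `earlier` are illustrative names.

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)

# Keyword arguments make the units explicit: 12 days and 20 seconds.
shifted = start + timedelta(days=12, seconds=20)

# A timedelta can be scaled by an integer before it is subtracted.
earlier = start - 2 * timedelta(days=12)

print(shifted)
print(earlier)
```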
The main data types in the datetime module are date, time, datetime, timedelta, and tzinfo.
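Converting Between String and Datetime
- datetime objects and pandas Timestamp objects (introduced later) can be formatted as strings using str or the strftime method, which takes a format specification: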
stamp = datetime(2011,1,3)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
'2011-01-03'
- datetime.strptime uses the same format codes to convert strings to dates:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(value,'%m/%d/%Y') for value in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
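strptime raises ValueError when a string does not match the given format. The sketch below wraps that behavior in a small helper; `parse_us_date` is a hypothetical name introduced here for illustration, not part of the standard library.

```python
from datetime import datetime


def parse_us_date(text):
    """Parse 'month/day/year' text; return None if it does not match.

    Hypothetical helper for illustration only.
    """
    try:
        return datetime.strptime(text, '%m/%d/%Y')
    except ValueError:
        return None


good = parse_us_date('7/6/2011')
bad = parse_us_date('2011-07-06')  # wrong layout for this format

# strftime with the same format codes goes back to a string (zero-padded).
round_trip = good.strftime('%m/%d/%Y')

print(good, bad, round_trip)
```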
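- datetime.strptime is the best way to parse a date with a known format. However, writing a format specification every time is tedious, so for common date formats you can use the parser.parse method of the third-party dateutil package (installed automatically alongside pandas):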
from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)
parse("Jan 31, 1997 10:45 PM")
datetime.datetime(1997, 1, 31, 22, 45)
- In international locales, the day often appears before the month; pass dayfirst=True to indicate this:
parse("6/12/2011",dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
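To see why dayfirst matters, the sketch below parses the same ambiguous string both ways. It assumes the dateutil package is installed (it is a pandas dependency); `month_first` and `day_first` are illustrative names.

```python
from dateutil.parser import parse

text = "6/12/2011"

# Without dayfirst, the first number is read as the month.
month_first = parse(text)

# With dayfirst=True, the first number is read as the day.
day_first = parse(text, dayfirst=True)

print(month_first.date())
print(day_first.date())
```

Since dateutil accepts both readings without complaint, strptime with an explicit format is the safer choice when the layout of the input is known.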
- pandas is generally used to work with arrays of dates; to_datetime parses a whole list of date strings at once:
import pandas as pd
datestrs = ["2011-07-06 12:00:00","2011-08-06 00:00:00"]
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
# It also handles missing values
idx = pd.to_datetime(datestrs + [None])
idx
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
idx[2]  # NaT (Not a Time) is pandas's null value for timestamp data.
NaT
pd.isnull(idx)
array([False, False, True])
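Real input often contains strings that are not dates at all. As a hedged sketch, the `errors='coerce'` option of `pd.to_datetime` turns such entries into NaT instead of raising; the sample list `raw` is made up for illustration.

```python
import pandas as pd

# Made-up input: one valid timestamp, one unparseable string, one missing value.
raw = ['2011-07-06 12:00:00', 'not a date', None]

# errors='coerce' converts anything that cannot be parsed into NaT.
idx = pd.to_datetime(raw, errors='coerce')

# isna marks the NaT entries.
missing = pd.isna(idx)

print(idx)
print(missing)
```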
11.2 Time Series Basics
- The most basic kind of time series object in pandas is a Series indexed by timestamps:
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
datetime(2011, 1, 7), datetime(2011, 1, 8),
datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02 0.319960
2011-01-05 1.431469
2011-01-07 -1.651676
2011-01-08 -1.302452
2011-01-10 -0.284987
2011-01-12 0.565406
dtype: float64
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
# datetime objects can also be used as slice bounds
ts[datetime(2011,1,7) :]
2011-01-07 -1.651676
2011-01-08 -1.302452
2011-01-10 -0.284987
2011-01-12 0.565406
dtype: float64
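# Arithmetic between differently indexed time series aligns on the dates (ts[::2] keeps every second element)
ts + ts[::2]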
2011-01-02 0.639919
2011-01-05 NaN
2011-01-07 -3.303352
2011-01-08 NaN
2011-01-10 -0.569973
2011-01-12 NaN
dtype: float64
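The NaN entries above come from index alignment: dates present in only one operand have no partner to add. A small sketch with fixed values (instead of random ones) makes this easier to verify; the values are made up.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07'])
ts = pd.Series([1.0, 2.0, 3.0], index=dates)

# Keep every second observation: 2011-01-02 and 2011-01-07.
every_other = ts.iloc[::2]

# Addition aligns on the timestamps; 2011-01-05 has no match, so it becomes NaN.
total = ts + every_other

print(total)
```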
# pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:
ts.index.dtype
dtype('<M8[ns]')
# Scalar values from a DatetimeIndex are pandas Timestamp objects:
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
Indexing, Selection, Subsetting
- Time series behave like any other pandas.Series when you index and select data based on labels:
stamp = ts.index[2]
ts[stamp]
-1.6516760222106173
# As a convenience, you can also pass a string that can be interpreted as a date
ts['1/10/2011']
-0.2849867054501697
ts['20110110']
-0.2849867054501697
- For longer time series, you only need to pass a year, or a year and month, to select a slice of the data:
longer_ts = pd.Series(np.random.randn(1000), index = pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01 0.323078
2000-01-02 -0.192916
2000-01-03 0.161027
2000-01-04 1.042233
2000-01-05 1.344387
2000-01-06 -0.764185
2000-01-07 -0.141419
2000-01-08 0.297445
2000-01-09 0.623654
2000-01-10 0.584203
2000-01-11 0.087188
2000-01-12 -0.110279
2000-01-13 0.209217
2000-01-14 -0.915065
2000-01-15 -0.713069
2000-01-16 -0.836166
2000-01-17 0.295419
2000-01-18 0.288559
2000-01-19 -0.084119
2000-01-20 -0.413960
2000-01-21 -0.120220
2000-01-22 0.453401
2000-01-23 -2.301278
2000-01-24 -0.253605
2000-01-25 -1.404243
2000-01-26 1.409910
2000-01-27 0.959088
2000-01-28 -2.079919
2000-01-29 -1.176011
2000-01-30 -0.356094
...
2002-08-28 -1.037953
2002-08-29 0.936959
2002-08-30 -0.991882
2002-08-31 -1.012418
2002-09-01 -0.333391
2002-09-02 -0.562380
2002-09-03 -1.936792
2002-09-04 0.086965
2002-09-05 -0.751722
2002-09-06 0.874634
2002-09-07 -0.694940
2002-09-08 -1.155072
2002-09-09 -0.266088
2002-09-10 -0.412032
2002-09-11 0.032159
2002-09-12 -0.569722
2002-09-13 -0.769999
2002-09-14 -0.540141
2002-09-15 0.380193
2002-09-16 -0.834590
2002-09-17 -0.105814
2002-09-18 -0.509613
2002-09-19 -0.464820
2002-09-20 -0.369378
2002-09-21 -0.588090
2002-09-22 -1.452517
2002-09-23 1.517069
2002-09-24 -0.177512
2002-09-25 -1.207979
2002-09-26 0.575119
Freq: D, Length: 1000, dtype: float64
longer_ts['2001']
2001-01-01 -0.710368
2001-01-02 -0.493213
2001-01-03 0.011035
2001-01-04 -0.188882
2001-01-05 -0.275450
2001-01-06 -1.397614
2001-01-07 -0.050230
2001-01-08 0.995234
2001-01-09 0.144589
2001-01-10 1.399901
2001-01-11 -0.230674
2001-01-12 -0.921200
2001-01-13 -0.125920
2001-01-14 -0.398851
2001-01-15 -1.369030
2001-01-16 -1.083224
2001-01-17 1.703383
2001-01-18 1.481350
2001-01-19 0.721221
2001-01-20 -0.555076
2001-01-21 -0.164058
2001-01-22 0.616386
2001-01-23 -0.614457
2001-01-24 0.624650
2001-01-25 -0.141876
2001-01-26 0.491621
2001-01-27 0.434586
2001-01-28 0.030046
2001-01-29 1.141433
2001-01-30 2.319519
...
2001-12-02 -1.371908
2001-12-03 -0.947667
2001-12-04 -1.169943
2001-12-05 3.115463
2001-12-06 -0.796079
2001-12-07 -0.287574
2001-12-08 -0.775596
2001-12-09 0.473937
2001-12-10 0.353532
2001-12-11 -1.696697
2001-12-12 -0.250758
2001-12-13 -0.395799
2001-12-14 -0.565465
2001-12-15 0.035062
2001-12-16 0.086432
2001-12-17 0.069176
2001-12-18 -0.834662
2001-12-19 0.415141
2001-12-20 -0.433074
2001-12-21 0.731880
2001-12-22 -0.831124
2001-12-23 0.194700
2001-12-24 -0.051128
2001-12-25 -0.379829
2001-12-26 -1.756667
2001-12-27 -0.581870
2001-12-28 1.144978
2001-12-29 1.232212
2001-12-30 -1.354103
2001-12-31 -0.929930
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
2001-05-01 -1.271589
2001-05-02 -0.351115
2001-05-03 -0.895262
2001-05-04 -0.713803
2001-05-05 -0.572470
2001-05-06 0.388224
2001-05-07 -0.415884
2001-05-08 -0.149180
2001-05-09 -1.331999
2001-05-10 0.417673
2001-05-11 -0.633069
2001-05-12 1.277451
2001-05-13 0.350078
2001-05-14 -0.477254
2001-05-15 0.331342
2001-05-16 -0.844850
2001-05-17 1.931488
2001-05-18 -0.291305
2001-05-19 0.066933
2001-05-20 0.516700
2001-05-21 -0.472930
2001-05-22 -1.264003
2001-05-23 -0.222774
2001-05-24 -0.633053
2001-05-25 -1.627209
2001-05-26 0.206001
2001-05-27 0.929017
2001-05-28 -0.386632
2001-05-29 1.769678
2001-05-30 -0.250572
2001-05-31 -0.815622
Freq: D, dtype: float64
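The year and year-month selections above can be checked on a series with predictable values. This sketch uses `.loc` for the partial-string lookups and the integers 0 to 999 instead of random data; the variable names are illustrative.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=1000)
longer = pd.Series(range(1000), index=index)

# A bare year selects every observation in that year.
year_2001 = longer.loc['2001']

# A year-month string selects a single month.
may_2001 = longer.loc['2001-05']

print(len(year_2001), len(may_2001))
```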
- Range queries (the slice bounds do not have to be present in the index):
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
datetime(2011, 1, 7), datetime(2011, 1, 8),
datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
2011-01-02 0.372253
2011-01-05 -0.746129
2011-01-07 -0.702319
2011-01-08 0.140512
2011-01-10 0.248298
2011-01-12 0.392128
dtype: float64
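# Slicing in this manner produces a view on the source time series: no data is copied,
# and modifications to the slice are reflected in the original data
# (pandas releases with copy-on-write enabled no longer propagate such modifications).
ts['1/6/2011':'1/11/2011']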
2011-01-07 -0.702319
2011-01-08 0.140512
2011-01-10 0.248298
dtype: float64
# An equivalent instance method, truncate, slices a Series between two dates
ts.truncate(after='1/9/2011')  # keep the observations up to and including January 9
2011-01-02 0.372253
2011-01-05 -0.746129
2011-01-07 -0.702319
2011-01-08 0.140512
dtype: float64
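To compare the two range-query styles side by side, the sketch below applies a label slice and `truncate` with both bounds to the same series. The values 0 to 5 are made up so the result is easy to check.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(range(6), index=dates)

# Neither bound is in the index; both ends of a label slice are inclusive.
window = ts.loc['2011-01-06':'2011-01-11']

# truncate drops everything before `before` and after `after`.
same = ts.truncate(before='2011-01-06', after='2011-01-11')

print(window)
```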
dates = pd.date_range('1/1/2000', periods=100, freq = 'W-WED')
long_df = pd.DataFrame(np.random.randn(100,4), index = dates, columns = ['Colorado', 'Texas', 'New York', 'Ohio'])
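# Select rows with .loc; recent pandas versions no longer accept a date string in long_df[...] for row selection
long_df.loc['2001-05']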
| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | 0.228299 | 0.916194 | 1.478840 | -0.715889 |
| 2001-05-09 | 0.401658 | 1.397341 | 0.505877 | 1.921401 |
| 2001-05-16 | -1.370897 | -0.493776 | 0.300839 | -0.520820 |
| 2001-05-23 | -0.565485 | 0.820914 | 0.056647 | 0.890600 |
| 2001-05-30 | 0.092271 | -0.752676 | 0.585210 | 0.873675 |
Time Series with Duplicate Indices
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000','1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
dup_ts.index.is_unique
False
dup_ts['1/2/2000']
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int32
dup_ts['1/3/2000']
4
# To aggregate data with non-unique timestamps, one approach is to use groupby and pass level=0:
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
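Before aggregating, it can help to see which timestamps are actually repeated. The sketch below flags them with `Index.duplicated` and then repeats the level-0 groupby; the variable names are illustrative.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(range(5), index=dates)

# keep=False marks every member of a duplicated group, not just the repeats.
duplicated = dup_ts.index.duplicated(keep=False)

# Grouping on index level 0 collapses each timestamp to one row.
means = dup_ts.groupby(level=0).mean()
counts = dup_ts.groupby(level=0).count()

print(duplicated)
print(means)
print(counts)
```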
11.3 Date Ranges, Frequencies, and Shifting
ts
2011-01-02 0.372253
2011-01-05 -0.746129
2011-01-07 -0.702319
2011-01-08 0.140512
2011-01-10 0.248298
2011-01-12 0.392128
dtype: float64
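# resample('D') prepares a conversion of the irregular series to a fixed daily frequency ('D' means daily)
resampler = ts.resample('D')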
resampler
DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
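The resampler object is lazy: nothing is computed until a method is called on it. As a minimal sketch, `asfreq()` materializes the daily series and leaves NaN on days that had no observation; the two-point series here is made up for illustration.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(['2011-01-02', '2011-01-05']))

# Convert to a fixed daily frequency; days without data become NaN.
daily = ts.resample('D').asfreq()

print(daily)
```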
- Generating date ranges
- pandas.date_range generates a DatetimeIndex of a specified length according to a particular frequency; by default it produces daily timestamps.
index = pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
- If you pass only a start or an end date, you must also pass the number of periods to generate:
pd.date_range(end='2012-06-01',periods=20)
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
'2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
'2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
'2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
'2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
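# To generate a date index containing the last business day of each month,
# pass the "BM" frequency (business end of month; Table 11-4 lists the frequencies).
# Only dates that fall on or inside the date interval and match the frequency are included.
# Note: recent pandas releases rename this alias to "BME".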
pd.date_range('2000-01-01','2000-12-01',freq='BM')
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
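Since the string alias for business month end differs between pandas versions, one version-independent option is to pass the offset object itself. The sketch below assumes `pd.offsets.BMonthEnd` as the equivalent of the "BM" alias and checks that every generated date is a weekday.

```python
import pandas as pd

# BMonthEnd is the offset object behind the business-month-end alias.
idx = pd.date_range('2000-01-01', '2000-12-01', freq=pd.offsets.BMonthEnd())

# weekday is 0 for Monday through 6 for Sunday.
all_weekdays = bool((idx.weekday < 5).all())

print(idx)
print(all_weekdays)
```

April 2000 ends on a Sunday, which is why the corresponding entry is 2000-04-28 rather than 2000-04-30.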
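- Base time series frequencies
# By default, date_range preserves the time of day (if any) of the start or end timestamp
pd.date_range('2012-05-02 12:56:31',periods=5)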
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
# normalize=True normalizes the generated timestamps to midnight
pd.date_range('2012-05-02 12:56:31',periods=5, normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
频率和日期偏移量
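Frequencies and Date Offsets
- Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, such as "M" for monthly or "H" for hourly. For each base frequency, there is a corresponding object called a date offset.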
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
<Hour>
# Pass an integer to define a multiple of an offset
four_hours = Hour(4)
four_hours
<4 * Hours>
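# You rarely need to create these objects explicitly; a string alias such as "H" or "4H" is enough. Putting an integer before the base frequency creates a multiple: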
pd.date_range('2000-01-01','2000-01-03 23:59', freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
# Offsets can be combined by addition
Hour(2) + Minute(30)
<150 * Minutes>
# Similarly, you can pass a frequency string such as "1h30min"
pd.date_range('2001-01-01',periods=10, freq='1h30min')
DatetimeIndex(['2001-01-01 00:00:00', '2001-01-01 01:30:00',
'2001-01-01 03:00:00', '2001-01-01 04:30:00',
'2001-01-01 06:00:00', '2001-01-01 07:30:00',
'2001-01-01 09:00:00', '2001-01-01 10:30:00',
'2001-01-01 12:00:00', '2001-01-01 13:30:00'],
dtype='datetime64[ns]', freq='90T')
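The offset-object form and the string form should describe the same spacing. The following sketch builds the index both ways and measures the gaps between consecutive timestamps; `step`, `from_offset`, and `from_string` are illustrative names.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Combine two fixed-length offsets into a single 90-minute step.
step = Hour(1) + Minute(30)

from_offset = pd.date_range('2001-01-01', periods=4, freq=step)
from_string = pd.date_range('2001-01-01', periods=4, freq='1h30min')

# Differences between neighbouring timestamps.
gaps = from_offset[1:] - from_offset[:-1]

print(from_offset)
print(gaps)
```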
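- Some frequencies describe points in time that are not evenly spaced. For example, "M" (calendar month end) and "BM" (last business day of the month) depend on the number of days in the month and, in the latter case, on whether the month ends on a weekend. For lack of a better term, I call these anchored offsets.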
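The uneven spacing of a month-end frequency can be seen by measuring the gaps between consecutive dates. This sketch passes the `pd.offsets.MonthEnd` object directly rather than a string alias, since the alias spelling differs between pandas versions.

```python
import pandas as pd

# Four calendar month ends starting in January 2000 (a leap year).
month_ends = pd.date_range('2000-01-01', periods=4,
                           freq=pd.offsets.MonthEnd())

# Days between consecutive month ends; these are not constant.
gaps_in_days = [(later - earlier).days
                for earlier, later in zip(month_ends[:-1], month_ends[1:])]

print(month_ends)
print(gaps_in_days)
```

A fixed-length frequency such as "4h" always yields identical gaps, while the gaps here follow the lengths of February, March, and April.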