Using pandas Series.rolling with a DateOffset
Problem description:
Python, pandas, data analysis
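What I'm trying to do is find the busiest 60-minute interval in a large set of Apache server logs. I have already extracted the timestamps from the logs into a list.
time_received is a list of values like this: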
[
1995-07-01T00:01:18-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:46-04:00,
1995-07-01T00:13:47-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:50-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:14:11-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:18-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:23-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:26-04:00,
1995-07-01T00:14:27-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:31-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:36-04:00,
]
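My goal is to walk along this list of timestamps and, for any of those points, get the number of entries in the 60-minute interval starting there. Once I have the rolling window, I think I can handle the rest.
In the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html I found the following about the window parameter: "window : int or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If it is an offset, then this will be the time period of each window. Each window will be a variable size based on the observations included in the time period. This is only valid for datetime-like indexes. New in 0.19.0."
I am using pandas 0.19.2, and the option of a variable-sized window based on the observations within a time period sounds like exactly what I want. So I tried to implement it: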
import pandas as pd
from pandas.tseries.offsets import DateOffset

def busiest_timeframe(data, timeframe=60):
    time_window = DateOffset(minutes=60)
    print(type(time_window))
    series = pd.Series(data)
    series.rolling(time_window).count()
    return series

busiest_tf = busiest_timeframe(time_received)
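I get the following error: raise ValueError("window must be an integer")
ValueError: window must be an integer
Is there some other offset object I should be using? Does this pandas feature not work? Have I misunderstood the documentation?
Thanks a lot for your help and suggestions!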
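Answer
Unfortunately I don't know how to use series.rolling. It looks like you didn't set the timestamps as the index, which is why it didn't work. Even after doing that I still got errors, so here is an alternative (maybe a really ugly one). If someone else has a better approach, it would be good to hear it.
So yes, it uses boolean indexing. Feel free to play with the code (add lots of print statements), and maybe change >= and <= to > and <.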
liste=[
"1995-07-01T00:01:18-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:46-04:00",
"1995-07-01T00:13:47-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:50-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:14:11-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:18-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:23-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:26-04:00",
"1995-07-01T00:14:27-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:31-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:36-04:00"
]
import pandas as pd

def busiest_timeframe(data, timeframe=1):
    # Parse the ISO 8601 strings; utc=True gives one consistent tz-aware dtype.
    series = pd.to_datetime(pd.Series(data), utc=True)
    df = series.to_frame(name="time")
    window = pd.Timedelta(seconds=timeframe)  # change seconds to minutes or whatever you want
    # For each timestamp x, count the rows that fall inside [x, x + window].
    df["count"] = [((series >= x) & (series <= x + window)).sum() for x in series]
    # Label of the first row whose window contains the most timestamps.
    highest_index = df["count"].idxmax()
    start = df.loc[highest_index, "time"]
    # Return every row inside the busiest window.
    df2 = df[(df["time"] >= start) & (df["time"] <= start + window)]
    return df2

print(busiest_timeframe(liste))
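For comparison, time-based rolling does appear to work once the timestamps are the index of the Series rather than its values. The following is a minimal sketch, not part of the original answer, assuming a pandas version in which rolling accepts an offset string such as "60min". The function name rolling_counts and the sample timestamps are made up for illustration. Note that each window looks backwards: the count at time t covers the 60 minutes ending at t, not the 60 minutes starting at t.

```python
import pandas as pd

def rolling_counts(timestamps, window="60min"):
    """Count events in the trailing `window` ending at each timestamp."""
    # Time-based rolling needs a sorted, datetime-like index.
    index = pd.to_datetime(timestamps, utc=True)
    events = pd.Series(1, index=index).sort_index()
    return events.rolling(window).count()

# Made-up sample data, not taken from the log above.
sample = [
    "1995-07-01T00:40:00-04:00",
    "1995-07-01T00:50:00-04:00",
    "1995-07-01T01:05:00-04:00",
    "1995-07-01T01:10:00-04:00",
    "1995-07-01T03:00:00-04:00",
]
counts = rolling_counts(sample)
busiest_end = counts.idxmax()  # end of the busiest trailing window
print(counts)
print(busiest_end, counts.max())
```

To get the window that starts at a given timestamp, as in the question, subtract the window length from busiest_end, or keep the boolean-indexing approach above.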
So the first argument has to be an integer. – DyZ
You may be looking for a resampler rather than a window: series.resample('60min').count(). However, a resampler does not roll; it just splits your series into 60-minute groups. – DyZ
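DyZ, the pandas docs say: "If it is an offset, then this will be the time period of each window. Each window will be based on the observations included in the time period" –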
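The difference DyZ describes can be seen in a small sketch. It uses made-up timestamps (not the log data above) and assumes the "60min" alias for a 60-minute frequency. resample cuts the series into fixed hourly bins, while rolling slides a 60-minute window along the index, so a burst that straddles a bin boundary is only caught by the latter.

```python
import pandas as pd

# Made-up timestamps, purely for illustration.
index = pd.to_datetime([
    "1995-07-01 00:40", "1995-07-01 00:50",
    "1995-07-01 01:05", "1995-07-01 01:10",
])
events = pd.Series(1, index=index)

# Fixed, non-overlapping hourly bins: 00:00-01:00 and 01:00-02:00.
binned = events.resample("60min").count()
# Trailing 60-minute window ending at each event.
rolled = events.rolling("60min").count()

print(binned.max())  # busiest fixed bin
print(rolled.max())  # busiest rolling window
```

Here the four events fall within 30 minutes of each other, yet each hourly bin sees only two of them, while the rolling window ending at the last event sees all four.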