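Pandas: vectorize instead of looping
Question:
I have a DataFrame of paths. The task is to get the last modification time of each folder, using something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')), into a separate column.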
import pandas as pd
import numpy as np
import os
from datetime import datetime
df1 = pd.DataFrame({'Path' : ['C:\\Path1' ,'C:\\Path2', 'C:\\Path3']})
# For an MCVE, use the commented-out code below. WARNING: this WILL create directories on your machine.
# for path in df1['Path']:
#     os.mkdir(r'PUT_YOUR_PATH_HERE\\' + os.path.basename(path))
I can get the last modification time of each folder with the loop below, but it is slow when I have many folders:
for each_path in df1['Path']:
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path))
How can I vectorize this process to improve speed? os.path.getmtime cannot accept a Series. I am looking for something like:
df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))
Answer
I will compare three methods, assuming 100 paths. I think method 3 is preferable.
import os
from datetime import datetime
import pandas as pd

# Data initialisation:
paths100 = ['a_whatever_path_here'] * 100
df = pd.DataFrame(columns=['paths', 'time'])
df['paths'] = paths100

def fun1():
    # Naive for loop. High readability, slow.
    # Builds a boolean mask over the whole column on every iteration.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    # Assigns one row at a time by index label instead of using a mask.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun3():
    # List comprehension. High readability, high speed.
    # Builds all values first, then assigns the whole column once.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]
%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
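To try the list-comprehension approach without touching real folders, here is a minimal sketch that builds a few throwaway directories with tempfile. The helper name add_mtime_column and its parameter names are made up for illustration; only os.path.getmtime and datetime.fromtimestamp come from the question.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def add_mtime_column(df, path_col="Path", out_col="Last Modification Time"):
    """Return a copy of df with the modification time of each path in path_col.

    Hypothetical helper: one os.path.getmtime call per row, one column assignment.
    """
    out = df.copy()
    out[out_col] = [
        datetime.fromtimestamp(os.path.getmtime(p)) for p in out[path_col]
    ]
    return out


# Create three temporary folders so the example runs anywhere.
with tempfile.TemporaryDirectory() as root:
    paths = []
    for name in ("Path1", "Path2", "Path3"):
        folder = os.path.join(root, name)
        os.mkdir(folder)
        paths.append(folder)

    demo = add_mtime_column(pd.DataFrame({"Path": paths}))
```

This is still a Python-level loop, since each path needs its own file-system call; the gain over fun1 and fun2 comes from assigning the column once instead of row by row.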
Answer
You can use groupby with transform (so that the expensive call is made only once per group):
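g = df1.groupby("Path")["Path"]
# x.name is the group key, i.e. the path itself, so getmtime runs once per distinct path.
# unit="s" because getmtime returns seconds since the epoch; the result is in UTC, not local time.
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name)), unit="s")
df1["Last Modification Time"] = s  # putting this on two lines so it looks nicer...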
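The same "one call per distinct path" idea can also be sketched without groupby, by building a lookup over the unique paths and mapping it back. This is an alternative illustration, not the answerer's code; the function name mtime_per_unique_path is hypothetical, and it uses datetime.fromtimestamp so the values are in local time as in the question.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def mtime_per_unique_path(paths):
    """Map each path in a Series to its modification time.

    Hypothetical helper: os.path.getmtime is called once per distinct path,
    then Series.map spreads the results back over the repeated rows.
    """
    lookup = {
        p: datetime.fromtimestamp(os.path.getmtime(p)) for p in paths.unique()
    }
    return paths.map(lookup)


with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "Path1")
    os.mkdir(sub)

    # Two distinct paths repeated over five rows.
    repeated = pd.DataFrame({"Path": [root, sub, root, sub, root]})
    repeated["Last Modification Time"] = mtime_per_unique_path(repeated["Path"])
```

This only helps when paths repeat; if every row holds a different path, it does the same number of file-system calls as the list comprehension.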
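`df1['Path'].apply(lambda x: datetime.fromtimestamp(os.path.getmtime(x)))`? – Dark
If `os.path.getmtime` cannot accept a Series, then broadcasting cannot be done, so I don't think you can get a vectorized solution. – Dark
@Bharathshetty, the apply method *was* faster in my short test: about 300 ms per loop. Unfortunately, I was afraid a vectorized solution would not be possible. – MattR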
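For anyone who wants to repeat the apply comparison outside IPython, here is a rough sketch using the standard timeit module. The function names, the 100 repeated paths, and the 20 runs are arbitrary choices for illustration; timings depend on the machine and file system, so no numbers are claimed here.

```python
import os
import tempfile
import timeit
from datetime import datetime

import pandas as pd


def with_apply(paths):
    # The variant suggested in the comments: Series.apply with a lambda.
    return paths.apply(lambda p: datetime.fromtimestamp(os.path.getmtime(p)))


def with_comprehension(paths):
    # The list-comprehension variant, wrapped in a Series with the same index.
    values = [datetime.fromtimestamp(os.path.getmtime(p)) for p in paths]
    return pd.Series(values, index=paths.index)


with tempfile.TemporaryDirectory() as root:
    paths = pd.Series([root] * 100, name="Path")

    # Both variants should produce the same timestamps.
    same_result = list(with_apply(paths)) == list(with_comprehension(paths))

    # Total wall time for 20 runs of each variant.
    t_apply = timeit.timeit(lambda: with_apply(paths), number=20)
    t_comp = timeit.timeit(lambda: with_comprehension(paths), number=20)

print(f"apply: {t_apply:.4f} s, comprehension: {t_comp:.4f} s (20 runs each)")
```

Both variants call os.path.getmtime once per row, so any difference between them reflects pandas overhead rather than file-system access.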