Pandas vectorization instead of looping

Problem description:

I have a DataFrame of paths. The task is to get each path's last modification time with something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')) into a separate column, using pandas vectorization rather than a loop.

import os
from datetime import datetime

import pandas as pd
import numpy as np


df1 = pd.DataFrame({'Path': ['C:\\Path1', 'C:\\Path2', 'C:\\Path3']})

# For an MCVE, use the commented-out code below. WARNING: this WILL create directories on your machine.
# for path in df1['Path']:
#     os.mkdir(r'PUT_YOUR_PATH_HERE\\' + os.path.basename(path))

I can get the last modification time for each folder with the loop below, but it is slow when there are many folders:

for each_path in df1['Path']: 
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path)) 

How can I vectorize this process to improve speed? os.path.getmtime cannot accept the Series. I am looking for something like:

df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))


'df1['Path'].apply(lambda x: datetime.fromtimestamp(os.path.getmtime(x)))'?? – Dark


If 'os.path.getmtime' cannot accept the Series, then broadcasting cannot be done, so I don't think you can get a vectorized solution. – Dark


@Bharathshetty, the apply method *was* faster in my quick test, about 300 ms per loop. Unfortunately, I'm afraid a vectorized solution may not be possible – MattR
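
For reference, here is a minimal, self-contained sketch of the apply approach suggested in the comments above (the example paths are placeholders and must exist on your machine for getmtime to succeed):

import os
from datetime import datetime

import pandas as pd

# Hypothetical example data; substitute directories that actually exist.
df1 = pd.DataFrame({'Path': ['C:\\Path1', 'C:\\Path2', 'C:\\Path3']})

# getmtime needs a single path string, so it still runs once per row;
# apply simply keeps that per-row loop tidy and inside pandas.
df1['Last Modification Time'] = df1['Path'].apply(
    lambda p: datetime.fromtimestamp(os.path.getmtime(p))
)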

I am going to compare three methods using 100 paths. I think method 3 is preferable.

# Data initialisation:
paths100 = ['a_whatever_path_here'] * 100
df = pd.DataFrame(columns=['paths', 'time'])
df['paths'] = paths100


def fun1():
    # Naive for loop. High readability, slow.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun3():
    # List comprehension. High readability, high speed.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]


%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

#3 works for me, and it was the fastest in my testing – MattR


Interestingly, I used the same kind of logic to test other functions similar to this question's. #3 was only faster for *this* particular scenario; the apply method @Bharath shetty mentioned in the comments was fastest in the other scenarios – MattR

You can use a groupby transform (which lets you make the expensive call only once per group):

g = df1.groupby("Path")["Path"] 
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name))) 
df1["Last Modification Time"] = s # putting this on two lines so it looks nicer... 

This only saves time if the Path column has duplicates... –


I won't have duplicate paths, but this will certainly come in handy for other problems. As a side note: 'datetime.fromtimestamp()' had to be added around 'os.path.getmtime' or the values weren't correct – MattR


@AndyHayden because the OP has a unique path for each folder – Dark
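
To make the duplicates point above concrete, here is a small sketch (not from the original answers) that makes the expensive getmtime call once per unique path and maps the result back onto every row, which is where the savings come from when paths repeat:

import os
from datetime import datetime

import pandas as pd

# Hypothetical frame with a repeated path; replace with directories that exist on your machine.
df1 = pd.DataFrame({'Path': ['C:\\Path1', 'C:\\Path1', 'C:\\Path2']})

# One getmtime call per *unique* path, then a cheap map back onto every row.
mtime_by_path = {p: datetime.fromtimestamp(os.path.getmtime(p))
                 for p in df1['Path'].unique()}
df1['Last Modification Time'] = df1['Path'].map(mtime_by_path)

With unique paths, as in the question, this performs the same number of getmtime calls as the list comprehension, so the benefit only appears when paths repeat.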