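Python multiprocessing: Pool.map() doesn't seem to call the function at all

Question:

I am very new to multithreading, so I apologize if this is basic. I have a function that OCRs image files, and I want to parallelize that task. The function doesn't return anything; it only saves the text of the OCR'ed files. The code is as follows: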

start_time = time.time() 
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test' 
listfiles = os.listdir(path) 

filterfiles = [p for p in listfiles if p[-4:] == '.tif'] 

pool = Pool(processes=2) 

result = pool.map(OCRimage,filterfiles) 

pool.close() 
pool.join() 

print("--- %s seconds ---" % (time.time() - start_time)) 

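When I run the code, it seems to get stuck on pool.map(). I let it run for 30 minutes, which is longer than the serial version takes, and it did not produce a single output file. I tested my function OCRimage, and it doesn't seem to be called even once (I used print(1) as the first line of OCRimage to check). I was wondering if anyone could help me. Thanks,

Cameron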

Edit (adding the OCRimage function):

The OCRimage function is as follows:

def OCRimage(f): 
    #This runs the magick bash script which splits a multi-image tif into multiple single image tiffs 
    process = subprocess.Popen(["magick", path + "\\" + f, path + "\\temp\\%d.tif"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 
    print(process.communicate()[0]) 

    #finds the number of pages for each tiff file (this might not be necassary but the all files in directory python command could access files randomly) 
    max1 = -1 
    for filename in os.listdir(path+'\\temp'):  
     if (max1 < int(filename[0:-4])): 
      max1 = int(filename[0:-4]) 
    max1 = max1 + 1 

    text = "" 
    for each in range(0,max1): 
     im = Image.open(path + "\\temp\\"+ str(each) + ".tif") 
     text = text + pytesseract.image_to_string(im) 
    with open(path + "\\result\\OCR-"+f[0:-4]+".txt", 'w') as file: 
     file.write(text)  

    for f in os.listdir(path+'\\temp'): 
     os.remove(path + '\\temp\\' + f) 

EDIT2: Here are all the imports:

import time 
import subprocess 
import os 
import pytesseract 
from PIL import Image 

from multiprocessing import Pool 
import multiprocessing 
countcpus = multiprocessing.cpu_count() 

EDIT3:

Running OCRimage(f) by itself works fine. In place of the multiprocessing code, I just use this:

path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test' 
for p in os.listdir(path): 
    OCRimage(p) 

Instead of printing to stdout, try printing to an output file :) – alfasin

Are you suggesting that printing to stdout won't work for some reason? – cfen

The rest of the code doesn't write the OCR text files to the output folder either. – cfen

Here is a Minimal, Complete, and Verifiable Example (MCVE), which seems to suggest that the problem must be in your OCRimage function (see the Windows section below for the real problem):

from multiprocessing import Pool 

def OCRimage(file_name): 
    print("file_name = %s" % file_name)

filterfiles = ["image%03d.tif" % n for n in range(5)] 

pool = Pool(processes=2) 
result = pool.map(OCRimage, filterfiles) 

pool.close() 
pool.join() 

Output:

file_name = image000.tif 
file_name = image001.tif 
file_name = image002.tif 
file_name = image003.tif 
file_name = image004.tif 

I recommend starting with these changes to OCRimage:

def OCRimage(file_name):
    print("file_name = %s" % file_name)
    # os.path.join takes the path components as separate arguments, not a list
    src = os.path.join(path, file_name)
    dst = os.path.join(path, 'temp', '%d.tif')
    command_list = ['magick', src, dst]
    # This runs the magick command, which splits a multi-image tif into
    # multiple single-image tiffs
    process = subprocess.Popen(command_list,
                               shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output, errors = process.communicate()
    if process.returncode != 0:
        print("Image processing failed for %s: %s" % (file_name, errors))
        return
    # The rest of your code goes here

It is important to verify that the return code from the subprocess is zero. If it is not zero, you really want to look at the errors string.
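
As a minimal sketch of that return-code check in isolation: the helper name run_command is hypothetical, and a child Python process that fails on purpose stands in for the magick call, so the snippet can be tried without ImageMagick installed. It omits shell=True because the command is passed as a list.

```python
import subprocess
import sys


def run_command(command_list):
    """Run a command and return (returncode, stdout, stderr) as text.

    Hypothetical helper for illustration; not part of the original code.
    """
    process = subprocess.Popen(command_list,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    output, errors = process.communicate()
    return process.returncode, output, errors


# Stand-in for the magick call: a child Python process that writes to
# stderr and exits with a non-zero status.
failing_command = [sys.executable, "-c",
                   "import sys; sys.stderr.write('boom'); sys.exit(3)"]

returncode, output, errors = run_command(failing_command)
if returncode != 0:
    print("Command failed with code %d: %s" % (returncode, errors))
```

Capturing stderr separately from stdout keeps the error message readable when the command fails.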

Windows

When I run the MCVE on Windows, I get this exception:

RuntimeError: 
      Attempt to start a new process before the current process 
      has finished its bootstrapping phase. 

      This probably means that you are on Windows and you have 
      forgotten to use the proper idiom in the main module: 

       if __name__ == '__main__': 
        freeze_support() 
        ... 

      The "freeze_support()" line can be omitted if the program 
      is not going to be frozen to produce a Windows executable. 
Traceback (most recent call last): 
    File "<string>", line 1, in <module> 
    File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main 

When I changed the MCVE to this, it worked:

from multiprocessing import Pool 

def OCRimage(file_name): 
    print("file_name = %s" % file_name)

def main(): 
    filterfiles = ["image%03d.tif" % n for n in range(5)] 
    pool = Pool(processes=2) 
    result = pool.map(OCRimage, filterfiles) 
    pool.close() 
    pool.join() 

if __name__ == '__main__': 
    main() 
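
The guard matters because on Windows each worker process imports the main module again, so any Pool created at module level would be created again inside every worker. Below is a slightly extended sketch of the same idiom, assuming a hypothetical worker named describe that returns a value, so the parent can confirm the function really ran in the workers.

```python
from multiprocessing import Pool, freeze_support


def describe(file_name):
    # Runs in a worker process. Returning a value (instead of only printing)
    # lets the parent verify that the function was actually called.
    return "processed %s" % file_name


def main():
    file_names = ["image%03d.tif" % n for n in range(3)]
    pool = Pool(processes=2)
    try:
        # pool.map returns the results in the same order as the inputs.
        results = pool.map(describe, file_names)
    finally:
        pool.close()
        pool.join()
    for line in results:
        print(line)


if __name__ == '__main__':
    # freeze_support() only has an effect in frozen Windows executables;
    # it is harmless otherwise.
    freeze_support()
    main()
```

Everything that starts processes lives inside main(), so importing the module in a worker only defines the functions and does nothing else.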

So the problem is that OCRimage works fine when I'm not using multiprocessing – cfen

So, at least, my problem is that 'result = pool.map(OCRimage,filterfiles)' doesn't work, even when I reduce the function to 'OCRimage(f): return f ** 2'. I'm using Python 2.7 – cfen

Did you run the [mcve] at the top of my answer? Does it produce the expected output? –