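Python multiprocessing: Pool.map() doesn't seem to call the function at all
Question:
I'm very new to multithreading, so I apologize if this is basic. I have a function that OCRs image files, and I want to multithread the task. The function doesn't return anything; it only saves the OCR text to a file. The code is as follows: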
start_time = time.time()
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
listfiles = os.listdir(path)
filterfiles = [p for p in listfiles if p[-4:] == '.tif']
pool = Pool(processes=2)
result = pool.map(OCRimage,filterfiles)
pool.close()
pool.join()
print("--- %s seconds ---" % (time.time() - start_time))
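When I run the code, it appears to get stuck at pool.map(). I let it run for 30 minutes, which is longer than the serial version takes, and it did not produce a single output file. I tested my function OCRimage, and it doesn't seem to be called even once (I put print(1) as the first line of my OCRimage code). I was wondering if someone could help me. Thanks,
Cameron
Edit (adding the OCRimage function):
The OCRimage function is as follows: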
def OCRimage(f):
    #This runs the magick bash script which splits a multi-image tif into multiple single image tiffs
    process = subprocess.Popen(["magick", path + "\\" + f, path + "\\temp\\%d.tif"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(process.communicate()[0])
    #finds the number of pages for each tiff file (this might not be necessary but the all files in directory python command could access files randomly)
    max1 = -1
    for filename in os.listdir(path+'\\temp'):
        if (max1 < int(filename[0:-4])):
            max1 = int(filename[0:-4])
    max1 = max1 + 1
    text = ""
    for each in range(0,max1):
        im = Image.open(path + "\\temp\\"+ str(each) + ".tif")
        text = text + pytesseract.image_to_string(im)
    with open(path + "\\result\\OCR-"+f[0:-4]+".txt", 'w') as file:
        file.write(text)
    for f in os.listdir(path+'\\temp'):
        os.remove(path + '\\temp\\' + f)
EDIT2: Here are all the imports:
import time
import subprocess
import os
import pytesseract
from PIL import Image
from multiprocessing import Pool
import multiprocessing
countcpus = multiprocessing.cpu_count()
EDIT3:
Running OCRimage(f) by itself works fine. Instead of the multithreading code, I just used this:
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
for p in os.listdir(path):
    OCRimage(p)
Answer
Here is a Minimal, Complete, and Verifiable Example (MCVE), which seems to indicate that the problem must be in your OCRimage function (see the Windows section below for the real issue):
from multiprocessing import Pool
def OCRimage(file_name):
    print("file_name = %s" % file_name)
filterfiles = ["image%03d.tif" % n for n in range(5)]
pool = Pool(processes=2)
result = pool.map(OCRimage, filterfiles)
pool.close()
pool.join()
Output
file_name = image000.tif
file_name = image001.tif
file_name = image002.tif
file_name = image003.tif
file_name = image004.tif
I recommend these changes to the beginning of OCRimage:
def OCRimage(file_name):
    print("file_name = %s" % file_name)
    # os.path.join takes the path components as separate arguments, not a list
    src = os.path.join(path, file_name)
    dst = os.path.join(path, 'temp', '%d.tif')
    command_list = ['magick', src, dst]
    # This runs the magick bash script which splits a multi-image tif into
    # multiple single image tiffs
    process = subprocess.Popen(command_list,
                               shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output, errors = process.communicate()
    if process.returncode != 0:
        print("Image processing failed for %s: %s" % (file_name, errors))
        return
    # The rest of your code goes here
It is important to verify that the return code from the subprocess is zero. If it is not zero, you really want to look at the errors string.
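The return-code check can also be pulled into a small helper so that every external command is handled the same way. This is only a minimal sketch, not part of the original answer: `run_command` is a hypothetical name, Python 3 is assumed, and the command is run without `shell=True` (an assumption; the call above uses `shell=True`). A child Python process stands in for the `magick` call so the sketch runs anywhere.

```python
import subprocess
import sys


def run_command(command_list):
    """Run a command and return (ok, output, errors) as decoded text."""
    process = subprocess.Popen(command_list,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    output, errors = process.communicate()
    # A zero return code is the conventional signal for success.
    return process.returncode == 0, output, errors


if __name__ == '__main__':
    # Hypothetical stand-in for the magick call: a child process that fails
    # and writes its reason to stderr.
    ok, output, errors = run_command(
        [sys.executable, '-c', 'import sys; sys.exit("boom")'])
    if not ok:
        print("Command failed: %s" % errors.strip())
```

Returning the captured stderr alongside the success flag lets the caller decide whether to log the message, skip the file, or stop the whole run.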
Windows
When I ran the MCVE on Windows, I got this exception:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
When I changed the MCVE to this, it worked:
from multiprocessing import Pool
def OCRimage(file_name):
    print("file_name = %s" % file_name)

def main():
    filterfiles = ["image%03d.tif" % n for n in range(5)]
    pool = Pool(processes=2)
    result = pool.map(OCRimage, filterfiles)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
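The guard matters because of how worker processes are started. On Windows, multiprocessing uses the "spawn" start method: each worker starts a new interpreter and imports the main module, so any unguarded module-level code (including the Pool creation) runs again in every child. The sketch below, which assumes Python 3.4 or later where `multiprocessing.get_start_method()` is available, reports which method is in effect. `needs_main_guard` is a hypothetical helper name, not a library function.

```python
import multiprocessing


def needs_main_guard(start_method):
    """Return True when workers import the main module on startup."""
    # With 'spawn' and 'forkserver' the workers do not inherit the parent's
    # state and import the main module themselves; 'fork' copies the parent
    # process instead.
    return start_method in ('spawn', 'forkserver')


if __name__ == '__main__':
    method = multiprocessing.get_start_method()
    print("start method: %s" % method)
    print("main guard required: %s" % needs_main_guard(method))
```

Keeping the guard in place on every platform is a reasonable habit, since it makes the same script behave the same way regardless of the start method.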
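Instead of printing to stdout, try printing to an output file :) – alfasin
Are you suggesting that printing to stdout won't work for some reason? – cfen
The rest of the code doesn't write the OCR text files to the output files either. – cfen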