How do I write this program using Python's multiprocessing module?

Problem description:

# -*- coding: utf-8 -*- 
from __future__ import print_function 
import os, codecs, re, string 
import mysql.connector 

'''Reading files with txt extension''' 
y_ = "" 
for root, dirs, files in os.walk("/Users/Documents/source-document/part1"): 
    for file in files: 
        if file.endswith(".txt"): 
            with codecs.open(os.path.join(root, file), "r", "utf-8-sig") as x_: 
                y_ = y_ + x_.read() 

'''Tokenizing sentences of the text files''' 

from nltk.tokenize import sent_tokenize 
tokenized_docs = sent_tokenize(y_) 

'''Removing stop words''' 

from nltk.corpus import stopwords 
stopset = stopwords.words("english") 
stopword_removed_sentences = [] 
for sentence in tokenized_docs: 
    filtered = ' '.join([word for word in sentence.split() if word not in stopset]) 
    stopword_removed_sentences.append(filtered) 

'''Removing punctuation marks''' 

regex = re.compile('[%s]' % re.escape(string.punctuation))  # see documentation here: http://docs.python.org/2/library/string.html 
nw = [] 
for review in stopword_removed_sentences: 
    nw.append(regex.sub(u'', review)) 

'''Lowercasing letters after removing punctuation marks.''' 

lw = [i.lower() for i in nw]  # lw stands for lowercase words. 

'''Replacing numbers with a dummy symbol''' 

nr = [] 
for j in lw: 
    # let "#" be the dummy symbol 
    regex = r'[^\[\]]+(?=\])' 
    nr.append(re.sub(regex, '#', j)) 

nrfinal = [] 
for j in nr: 
    rem = 0 
    outr = '' 
    for i in j: 
        if '0' <= i <= '9':  # digit: collapse the whole run into a single '#' 
            rem += 1 
            if rem == 1: 
                outr = outr + '#' 
        else: 
            rem = 0 
            outr = outr + i 
    nrfinal.append(outr) 

'''Inserting into database''' 

def connect(): 
    # Open one connection and reuse it instead of reconnecting for every row. 
    conn = mysql.connector.connect(user='root', password='', 
                                   unix_socket="/tmp/mysql.sock", database='Thesis') 
    cursor = conn.cursor() 
    for j in nrfinal: 
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", 
                       (cursor.lastrowid, j)) 
    conn.commit() 
    conn.close() 

if __name__ == '__main__': 
    connect() 

I don't get any errors from this code, and it works fine on the text files. The only problem is execution time: I have a lot of text files (close to 6 GB), so the program takes far too long. On inspection I found it is CPU bound, so multiprocessing is needed to solve it. Please help me rewrite my code with the multiprocessing module so the work can be done in parallel. Thank you all.


Before you use multiprocessing, refactor your code and look for its weak points. – Merlin


Since the code works, post it on http://codereview.stackexchange.com/ – Merlin


@Merlin The OP is asking about code that has not been written yet, so it is off-topic on CR. – Heslacher

There is an example in the Python docs that demonstrates the use of multiprocessing:

from multiprocessing import Pool 

def f(x): 
    return x*x 

if __name__ == '__main__': 
    with Pool(5) as p: 
        print(p.map(f, [1, 2, 3])) 

You can adapt this to your code. Once you have read the text files, you can use the map function to run the rest in parallel. You will have to define a function that encapsulates the code you want to run on multiple cores, as in the sketch below.
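For example, here is a minimal sketch of that idea. The clean_text() function is hypothetical, just a condensed stand-in for the CPU-bound cleaning steps in the question (the real worker would also tokenize, drop stop words and replace numbers), and the input list is placeholder data:

from multiprocessing import Pool
import re, string

_punct = re.compile('[%s]' % re.escape(string.punctuation))

def clean_text(text):
    # Hypothetical worker: condensed stand-in for the cleaning pipeline
    # in the question (punctuation removal + lowercasing only).
    return _punct.sub('', text).lower()

if __name__ == '__main__':
    texts = ["Sample document ONE.", "Sample document TWO!"]  # placeholder input
    with Pool() as pool:
        cleaned = pool.map(clean_text, texts)  # each document is cleaned in a worker process
    print(cleaned)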

However, reading the files in parallel is likely to reduce performance, and adding records to the database asynchronously may not work either. So you probably want to keep both of those tasks in the main process anyway, as in the sketch below.
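A rough sketch of that layout, under loose assumptions: the directory path, connection parameters and INSERT statement are copied from the question, the table's sentence_id handling is left exactly as the OP wrote it, and clean_text() is the same hypothetical condensed worker as in the previous sketch. File reading and the database writes stay in the main process; only the cleaning runs in the pool:

from multiprocessing import Pool
import os, codecs, re, string
import mysql.connector

_punct = re.compile('[%s]' % re.escape(string.punctuation))

def clean_text(text):
    # Condensed stand-in for the cleaning pipeline (see the previous sketch).
    return _punct.sub('', text).lower()

def read_texts(top):
    # I/O stays in the main process: read every .txt file under `top`.
    texts = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(".txt"):
                with codecs.open(os.path.join(root, name), "r", "utf-8-sig") as f:
                    texts.append(f.read())
    return texts

if __name__ == '__main__':
    texts = read_texts("/Users/Documents/source-document/part1")
    conn = mysql.connector.connect(user='root', password='',
                                   unix_socket="/tmp/mysql.sock", database='Thesis')
    cursor = conn.cursor()
    with Pool() as pool:
        # Only the CPU-bound cleaning runs in worker processes; rows are
        # written to the database from the main process as results arrive.
        for cleaned in pool.imap(clean_text, texts):
            cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",
                           (cursor.lastrowid, cleaned))
    conn.commit()
    conn.close()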