如何将python代码转换为Spark兼容代码(pyspark)?
问题描述:
我有一个pyspark代码,可以从文本片段中提取所需的名称。此代码为我提供了结果,但是由于它的某些部分比较pythonic,因此需要很多时间处理大量数据。请求您的帮助将其转换成更pyspark的方式来提高效率(新火花环境)如何将python代码转换为Spark兼容代码(pyspark)?
articles=sc.textFile("file:///home//XXX//articles.csv").map(lambda line: line.split(","))
articles_ls=list(articles.map(lambda x: [x[0].lower(),x[1].lower(),x[2].lower(),x[3].lower().strip()]).collect())
#Function which needs to be optimized to run faster
def mapper(f):
article_list=[]
list1=[]
list2=[]
list3=[]
list4=[]
list5=[]
list6=[]
list7=[]
for i in range(len(articles_ls)):
for j in range(len(articles_ls[i])-1):
comment=re.split(r'\W+', f.lower().strip())
if articles_ls[i][j] in comment:
if articles_ls[i][j]:
if articles_ls[i][3] == 'typea':
if articles_ls[i][j] not in list1:
list1.append(articles_ls[i][0])
if articles_ls[i][3] == 'typeb':
if articles_ls[i][j] not in list2:
list2.append(articles_ls[i][0])
if articles_ls[i][3] == 'typec':
if articles_ls[i][j] not in list3:
list3.append(articles_ls[i][0])
if articles_ls[i][3] == 'typed':
if articles_ls[i][j] not in list4:
list4.append(articles_ls[i][0])
if articles_ls[i][3] == 'typee':
if articles_ls[i][j] not in list5:
list5.append(articles_ls[i][0])
if articles_ls[i][3] == 'typef':
if articles_ls[i][j] not in list6:
list6.append(articles_ls[i][0])
if articles_ls[i][3] == 'typeg':
if articles_ls[i][j] not in list7:
list7.append(articles_ls[i][0])
list1 = list(set(list1))
list2 = list(set(list2))
list3 = list(set(list3))
list4 = list(set(list4))
list5 = list(set(list5))
list6 = list(set(list6))
list7 = list(set(list7))
article_list.append([("ProductA:".split())+list1]+[("ProductB:".split())+list2]+[("ProductC:".split())+list3]+\
[("ProductD:".split())+list4]+[("ProductE:".split())+list5]+[("ProductF:".split())+list6]+\
[("ProductG:".split())+list7])
return article_list
lines = sc.textFile("file:///home//XXX//data.csv").map(lambda line: line.split(",")).map(lambda x: (x[0],x[1],x[2].encode("ascii", "ignore")))
articles_all = (lines.map(lambda x: (x[0],x[1],x[2],(mapper(x[2].lower())))))
答
我发现火花,当我装我的数据集成星火数据帧,而不是将其追加到列表的运行非常快。然后,我会赞成更多的for循环功能样式。