How to apply pos_tag_sents() to a pandas DataFrame efficiently

Problem description:

In situations where you want to POS tag a column of text stored in a pandas DataFrame, with one sentence per row, most implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example, and if so, is the change as simple as swapping pos_tag for pos_tag_sents, or does NLTK mean paragraphs of text sources?

As mentioned in the comments, pos_tag_sents() is intended to reduce the loading of the tagger each time, but the question is how to do this and still produce a column in a pandas DataFrame?

Link to Sample Dataset 20kRows


你有多少行? – alvas


20,000 rows would be the row count – mobcdi


That's not a problem. Just extract the column as a list of strings, process it, then add the column back to the DataFrame. – alvas

Input

$ cat test.csv 
ID,Task,label,Text 
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat 
2,New Credit,no response,active married expensive soccer Mr. Chang flat 
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat 
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical 
5,Collect Information,response,cozily single practical badminton Mr. Brown flat 

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents 
>>> import pandas as pd 
>>> df = pd.read_csv('test.csv', sep=',') 
>>> df['Text'] 
0 cozily married practical athletics Mr. Brown flat 
1  active married expensive soccer Mr. Chang flat 
2 healthy single expensive badminton Mrs. Green ... 
3 cozily married practical soccer Mr. Brown hier... 
4  cozily single practical badminton Mr. Brown flat 
Name: Text, dtype: object 
>>> texts = df['Text'].tolist() 
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts)) 
>>> tagged_texts 
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]] 

>>> df['POS'] = tagged_texts 
>>> df 
    ID     Task  label \ 
0 1 Collect Information no response 
1 2   New Credit no response 
2 3 Collect Information  response 
3 4 Collect Information  response 
4 5 Collect Information  response 

               Text \ 
0 cozily married practical athletics Mr. Brown flat 
1  active married expensive soccer Mr. Chang flat 
2 healthy single expensive badminton Mrs. Green ... 
3 cozily married practical soccer Mr. Brown hier... 
4 cozily single practical badminton Mr. Brown flat 

               POS 
0 [(cozily, RB), (married, JJ), (practical, JJ),... 
1 [(active, JJ), (married, VBD), (expensive, JJ)... 
2 [(healthy, JJ), (single, JJ), (expensive, JJ),... 
3 [(cozily, RB), (married, JJ), (practical, JJ),... 
4 [(cozily, RB), (single, JJ), (practical, JJ), ... 

In long:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist() 

Then you can apply the word_tokenize function:

map(word_tokenize, texts) 

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize) 

Then you cast the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist() 

Then you can use pos_tag_sents:

pos_tag_sents(df['Text'].apply(word_tokenize).tolist()) 

Then add the column back to the DataFrame:

df['POS'] = pos_tag_sents(df['Text'].apply(word_tokenize).tolist()) 

By applying pos_tag on each row, the perceptron model is loaded every time (a costly operation, since it reads a pickle from disk).

If you instead take all the rows and send them to pos_tag_sents (which takes a list(list(str))), the model is loaded once and used for all rows.

See the source.


Can you provide an example using 'pos_tag_sents()' with a pandas DataFrame column as both source and destination, so that the sentences and the tagged output end up on the same row? – mobcdi


I'll take a stab in the dark since I'm not familiar with pandas. Maybe something like 'pos_tag_sents(map(word_tokenize, dfData['SourceText'].values()))'. –

To assign to your new column, instead:

dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())