How to apply pos_tag_sents() to a pandas dataframe efficiently
In situations where you want to POS-tag a column of text stored in a pandas dataframe, with one sentence per row, most implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))
The NLTK documentation recommends pos_tag_sents() for efficiently tagging more than one sentence.

Does that apply to this example, and if so, is the change as simple as swapping pos_tag for pos_tag_sents, or does NLTK mean paragraphs of text?

As mentioned in the comments, pos_tag_sents() aims to avoid reloading the tagger for every row, but the question is how to do that and still produce a column in a pandas dataframe.
Input
$ cat test.csv
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
TL;DR
>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]
>>> df['POS'] = tagged_texts
>>> df
ID Task label \
0 1 Collect Information no response
1 2 New Credit no response
2 3 Collect Information response
3 4 Collect Information response
4 5 Collect Information response
Text \
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
POS
0 [(cozily, RB), (married, JJ), (practical, JJ),...
1 [(active, JJ), (married, VBD), (expensive, JJ)...
2 [(healthy, JJ), (single, JJ), (expensive, JJ),...
3 [(cozily, RB), (married, JJ), (practical, JJ),...
4 [(cozily, RB), (single, JJ), (practical, JJ), ...
In long:
First, you can extract the Text column to a list of strings:
texts = df['Text'].tolist()
Then you can apply the word_tokenize function:
map(word_tokenize, texts)
Note that @Boud's suggestion is almost the same, using df.apply:
df['Text'].apply(word_tokenize)
Then you convert the tokenized text into a list of lists of strings:
df['Text'].apply(word_tokenize).tolist()
Then you can use pos_tag_sents:
pos_tag_sents(df['Text'].apply(word_tokenize).tolist())
Then add the column back to the dataframe:
df['POS'] = pos_tag_sents(df['Text'].apply(word_tokenize).tolist())
By applying pos_tag to each row, the perceptron model is loaded every time (a costly operation, since it reads a pickle from disk). If you instead gather all the rows and send them to pos_tag_sents (which takes a list(list(str))), the model is loaded once and used for all rows.
See the source.
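The load-once advantage can be sketched without NLTK at all; the toy tagger below is hypothetical and only counts how often the "model" gets loaded, mimicking pos_tag (a load per call) versus pos_tag_sents (a single load for the whole batch):

```python
# Toy illustration (not NLTK): count how often an expensive "model load"
# happens when tagging row by row versus in one batch.
load_count = 0

def load_model():
    """Stand-in for un-pickling the perceptron tagger from disk."""
    global load_count
    load_count += 1
    return lambda tokens: [(tok, "NN") for tok in tokens]

def tag_one(tokens):
    """Like pos_tag: the model is loaded on every call."""
    return load_model()(tokens)

def tag_many(sents):
    """Like pos_tag_sents: the model is loaded once for all sentences."""
    model = load_model()
    return [model(toks) for toks in sents]

sents = [["a", "b"], ["c"], ["d", "e"]]

for s in sents:
    tag_one(s)
per_row_loads = load_count      # one load per row

load_count = 0
tag_many(sents)
batched_loads = load_count      # a single load for all rows
```

With 20,000 rows, that is 20,000 pickle reads versus one, which is where the speedup comes from.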
Could you provide an example using pos_tag_sents() with a pandas dataframe column as both the source and the destination, so that the sentences and the tagged output end up on the same row? – mobcdi
I'll take a stab in the dark since I'm not familiar with pandas. Perhaps something like pos_tag_sents(map(word_tokenize, dfData['SourceText'].values())). –
Assign to your new column instead:
dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())
How many rows do you have? – alvas
It will be about 20,000 rows – mobcdi
That's not a problem. Just extract the column as a list of strings, process it, then add the column back to the dataframe. – alvas
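The workflow from that last comment can be sketched in a few lines; batch_tag here is a hypothetical stand-in for pos_tag_sents(map(word_tokenize, texts)), kept NLTK-free so the sketch runs on its own:

```python
import pandas as pd

def batch_tag(texts):
    """Placeholder for pos_tag_sents: maps a list of strings to a list
    of per-row results in one pass (here, just whitespace tokens)."""
    return [t.split() for t in texts]

df = pd.DataFrame({"Text": ["a b c", "d e"]})
texts = df["Text"].tolist()    # 1. extract the column as a list of strings
df["POS"] = batch_tag(texts)   # 2. process it once, 3. add the column back
```

Any function that maps a list of strings to an equal-length list of results slots into the same pattern, regardless of row count.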