Why can't the NLTK lemmatizer handle some plural words?
Problem description:
I am trying to lemmatize some words from the Quran, but some of them do not get lemmatized. Why can't the NLTK lemmatizer handle some plural words?
Here is one of my sentences:
sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
That sentence is part of my txt dataset. You can see it contains "surahs", the plural form of "surah". Here is my code:
def lemmatize(self, ayat):
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []
    for word in ayat:  # ayat is a list of tokens
        result.append(wordnet_lemmatizer.lemmatize(word, 'v'))
    return result
When I run it and print the result, I get:
['bring', 'ten', 'surahs', 'like', u'invent', 'call', 'upon', 'assistance', 'whomever', 'besides', 'Allah', 'truthful']
The "surahs" is not turned into "surah".
Can anyone tell me why? Thanks.
Answer:
See:
- Stemming some plurals with wordnet lemmatizer doesn't work
- Python NLTK Lemmatization of the word 'further' with wordnet
For most non-standard English words, the WordNet lemmatizer won't be much help in getting the correct lemma; try a stemmer instead:
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('surahs')
u'surah'
Also, try lemmatize_sent from earthy (an NLTK wrapper package; shameless plug):
>>> from earthy.nltk_wrappers import lemmatize_sent
>>> sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
>>> lemmatize_sent(sentence)
[('Then', 'Then', 'RB'), ('bring', 'bring', 'VBG'), ('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'), ('like', 'like', 'IN'), ('it', 'it', 'PRP'), ('that', 'that', 'WDT'), ('have', 'have', 'VBP'), ('been', u'be', 'VBN'), ('invented', u'invent', 'VBN'), ('and', 'and', 'CC'), ('call', 'call', 'VB'), ('upon', 'upon', 'NN'), ('for', 'for', 'IN'), ('assistance', 'assistance', 'NN'), ('whomever', 'whomever', 'NN'), ('you', 'you', 'PRP'), ('can', 'can', 'MD'), ('besides', 'besides', 'VB'), ('Allah', 'Allah', 'NNP'), ('if', 'if', 'IN'), ('you', 'you', 'PRP'), ('should', 'should', 'MD'), ('be', 'be', 'VB'), ('truthful', 'truthful', 'JJ')]
>>> words, lemmas, tags = zip(*lemmatize_sent(sentence))
>>> lemmas
('Then', 'bring', 'ten', 'surahs', 'like', 'it', 'that', 'have', u'be', u'invent', 'and', 'call', 'upon', 'for', 'assistance', 'whomever', 'you', 'can', 'besides', 'Allah', 'if', 'you', 'should', 'be', 'truthful')
>>> from earthy.nltk_wrappers import pywsd_lemmatize
>>> pywsd_lemmatize('surahs')
'surahs'
>>> from earthy.nltk_wrappers import porter_stem
>>> porter_stem('surahs')
u'surah'
There is nothing wrong with the WordNetLemmatizer itself; it just cannot handle irregular words well enough. You can try this 'hack': https://stackoverflow.com/questions/22333392/stemming-some-plurals-with-wordnet-lemmatizer-doesnt-work –
I tried that hack, but it didn't return anything, just [] – sang