Python Natural Language Processing: Categorizing and Tagging Words, 5.4
Automatic Tagging
This section looks at different ways of automatically adding part-of-speech tags to text. A word's tag depends both on the word itself and on its context within the sentence.
First, load the data we will be working with:
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
The Default Tagger
The simplest possible tagger assigns the same tag to every token. To get the best result from such a crude approach, we should use the most likely tag. The following shows how to find out which tag that is, i.e. the part of speech that occurs most often:
>>> tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'
Now create a tagger that tags every word as NN:
>>> raw = 'I do not like green eggs and ham , I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
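That score is no accident: a tagger that answers NN for every token is correct exactly as often as NN occurs in the gold standard, so its accuracy equals the relative frequency of NN in the tags list computed above. A quick sanity check (our addition, not part of the original session):
>>> nltk.FreqDist(tags).freq('NN')
0.13089484257215028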
Unsurprisingly, it fails to assign the correct tag to most words.
The default tagger assigns its tag to every single word, even words it has never encountered before. As it happens, once we have processed several thousand words of English text, most new word types turn out to be nouns. This means that a default tagger can help improve the robustness of a language processing system.
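To make that claim concrete, here is a minimal sketch (our own, not from the original text) that treats the first 20,000 tokens of the news category as already-processed text and inspects the gold tags of word types that only appear afterwards; the cut-off of 20,000 is an arbitrary choice:

import nltk
from nltk.corpus import brown

tagged = brown.tagged_words(categories='news')
seen = {w for (w, _) in tagged[:20000]}     # vocabulary of the text seen so far
new = [(w, t) for (w, t) in tagged[20000:] if w not in seen]
tag_dist = nltk.FreqDist(t for (_, t) in new)
print(tag_dist.max())        # most common tag among previously unseen words
print(tag_dist.freq('NN'))   # share of unseen tokens tagged plain NN

Noun-like tags (NN, NP and their variants) should dominate the resulting distribution.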
The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past tense of a verb, and any word ending in 's is a possessive noun.
>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.1914195357718241
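nltk.RegexpTagger tries the patterns in order and uses the first one that matches, which is why the catch-all (r'.*', 'NN') must stay last. The following sketch (our own, reusing the patterns list defined above) mimics that behaviour with nothing but Python's re module:

import re

def first_match(word, patterns):
    # Return the tag of the first pattern that matches,
    # the way nltk.RegexpTagger walks its pattern list.
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag

print(first_match('running', patterns))  # 'VBG'
print(first_match('houses', patterns))   # 'VBZ', not 'NNS': the es rule comes first
print(first_match('42.5', patterns))     # 'CD'
print(first_match('jury', patterns))     # 'NN': only the catch-all matches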
The Lookup Tagger
Many high-frequency words do not take the tag NN. Let us find the hundred most frequent words, store each one's most likely tag, and use that table as the model for a lookup tagger (an NLTK UnigramTagger):
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.most_common(100)
>>> likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
Many words are assigned the tag None, because they are not among the 100 most frequent words. In these cases we want to fall back on the default tag NN: first consult the lookup table, and if it fails to assign a tag, use the default tagger. This process is known as backoff, and we set it up by passing one tagger as a parameter to the other, as in the line of the performance function below that constructs baseline_tagger.
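Continuing the interactive session above, the backoff version of the lookup tagger is constructed like this:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
...                                      backoff=nltk.DefaultTagger('NN'))

Evaluating this tagger should give a noticeably higher score, since every token that previously received None is now tagged NN instead.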
import nltk
from nltk.corpus import brown

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    # Back off to the default tagger for words missing from the lookup table.
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

The resulting plot should show performance rising steeply for small model sizes and then reaching a plateau, where large increases in model size yield little further improvement.