Why does gensim Doc2Vec give different vectors for the same sentence?

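Question:

I am using Doc2Vec (from gensim.models.doc2vec import Doc2Vec) to train on two exactly identical sentences (documents), and when I inspect the vector for each sentence, they are completely different. Does the neural network use a different random initialization for each one?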

# imports
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import Doc2Vec
from gensim import utils

# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are names of identical documents (diff gives nothing)
inp = LabeledDocs({"play": "play", "play3": "play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)

# after this, model.docvecs["play"] is very different from
# model.docvecs["play3"]

Why is that? Both play and play3 contain:

foot ball is a sport 
played with a ball where 
teams of 11 each try to 
score on different goals 
and play with the ball

Answer

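Yes, each sentence vector is initialized differently.

Specifically, this happens in the reset_weights method. The code that randomly initializes the sentence vectors looks like this: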

for i in xrange(length): 
    # construct deterministic seed from index AND model seed 
    seed = "%d %s" % (model.seed, self.index_to_doctag(i)) 
    self.doctag_syn0[i] = model.seeded_vector(seed)

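Here you can see that each sentence vector is initialized from the model's random seed combined with the sentence's tag. So it makes sense that play and play3 in your example start out with different vectors, even though their text is identical.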
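To make the effect of tag-dependent seeding concrete, here is a minimal standard-library sketch. It is not gensim's implementation: the seeded_vector function below is a hypothetical stand-in, and the choice of zlib.crc32 as the hash and random.Random as the generator are assumptions made only so the example runs on its own.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    """Hypothetical stand-in for a seeded initializer (not gensim's code).

    The same seed string always yields the same small random vector.
    """
    # Assumption: crc32 is used only as a stable string hash for this sketch.
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    rng = random.Random(seed)
    # Small values centred on zero, scaled by the vector size.
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


model_seed = 1  # assumed model-level seed
vec_play = seeded_vector("%d %s" % (model_seed, "play"), 20)
vec_play3 = seeded_vector("%d %s" % (model_seed, "play3"), 20)
vec_play_again = seeded_vector("%d %s" % (model_seed, "play"), 20)

print(vec_play == vec_play_again)  # same seed string, same start vector
print(vec_play == vec_play3)       # different tag, different start vector
```

The seed string contains the tag, so two documents with identical text but different tags begin training from different points.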

However, if you train the model properly, I would expect the two vectors to end up very close to each other.
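One way to check how close the two vectors ended up is cosine similarity. The helper below is a small sketch; the two example vectors are made-up placeholders standing in for model.docvecs["play"] and model.docvecs["play3"], not real training output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder values (assumed for illustration), standing in for the two
# trained document vectors.
vec_play = [0.12, -0.40, 0.33]
vec_play3 = [0.10, -0.38, 0.35]

similarity = cosine_similarity(vec_play, vec_play3)
print(similarity)
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are essentially unrelated.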
