Why does gensim Doc2Vec give different vectors for the same sentence?
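Question:
I am using Doc2Vec from gensim (from gensim.models.doc2vec import Doc2Vec).
I train it on two sentences (documents) that are exactly identical, but when I inspect the vector for each one, they are completely different. Does the neural network use a different random initialization for each sentence? Why does gensim Doc2Vec give different vectors for the same sentence?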
# imports
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import Doc2Vec
from gensim import utils

# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that the prefixes (the dict values) are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                #print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are names of identical documents (diff gives nothing)
inp = LabeledDocs({"play":"play", "play3":"play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)
# after this, model.docvecs["play"] is very different from
# model.docvecs["play3"]
Why is that? Both play and play3 contain:
foot ball is a sport
played with a ball where
teams of 11 each try to
score on different goals
and play with the ball
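Answer:
Yes, each sentence vector is initialized differently.
Specifically, this happens in the reset_weights method. The code that randomly initializes the sentence vectors looks like this: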
for i in xrange(length):
    # construct deterministic seed from index AND model seed
    seed = "%d %s" % (model.seed, self.index_to_doctag(i))
    self.doctag_syn0[i] = model.seeded_vector(seed)
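Here you can see that each sentence vector is initialized from the model's random seed combined with the sentence's tag. So it makes sense that, in your example, play and play3 result in different vectors.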
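To illustrate the idea of tag-dependent seeding, here is a minimal standalone sketch. It does not call gensim; the seeded_vector function below is a hypothetical stand-in that I wrote for illustration, and the CRC32 hash, the model_seed value and the scaling are my assumptions, not necessarily what the library does internally.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    """Hypothetical stand-in: build a small pseudo-random vector from a string.

    The string is hashed to an integer (CRC32 is an assumption made here so
    that the result is stable across Python runs) and that integer seeds a
    private random generator.
    """
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    rng = random.Random(seed)
    # Small values centred on zero, scaled down by the vector size.
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


model_seed = 1  # assumed value, standing in for model.seed

# Same seed string format as in the snippet above: "<model seed> <doc tag>"
vec_play = seeded_vector("%d %s" % (model_seed, "play"), 20)
vec_play3 = seeded_vector("%d %s" % (model_seed, "play3"), 20)
vec_play_again = seeded_vector("%d %s" % (model_seed, "play"), 20)

print(vec_play == vec_play_again)  # same tag, same starting vector
print(vec_play == vec_play3)       # different tag, different starting vector
```

The point is only that the starting vector is a function of the tag, so two documents with identical text but different tags begin training from different points.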
However, if you train the model properly, I would expect the two vectors to end up very close to each other.
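One way to check how close two document vectors are is cosine similarity. The helper below is a plain-Python sketch of that measure; the function name and the example vectors are made up for illustration, and in practice you would pass in the two vectors you read from model.docvecs.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up example vectors: the second points in the same direction as the
# first, the third is orthogonal to it.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
v3 = [3.0, 0.0, -1.0]

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

A value near 1 means the two vectors point in almost the same direction, which is what I would expect for play and play3 after proper training.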