TensorFlow actually already ships with word2vec example code, but the first time I read it I was rather lost and could not really follow it. On top of that, the official documentation only covers the skip-gram implementation of word2vec, so I googled around and found these two good articles. Since I could not find a Chinese version, and in the spirit of learning, I decided to translate them: partly to deepen my own understanding, and partly to make things easier for others. This is my first translation, so if anything is off, feel free to point it out.
Original articles:
Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram)
Word2Vec (Part 2): NLP With Deep Learning with Tensorflow (CBOW)
The previous article, covering the skip-gram model, is here.
Now on to the CBOW model:
What is CBOW?
So what is CBOW? Its full name is continuous bag-of-words. Its architecture is essentially the skip-gram model turned around: in the skip-gram model we predict the context from the target word, whereas in the CBOW model we predict the target word from its context.
Why use the CBOW model?
Since we already have the skip-gram model, why learn CBOW as well? The reason is that CBOW tends to perform better, partly because its inputs are richer. In other words, given the sentence the dog barked at the mailman, the skip-gram model trains on pairs such as (input: 'dog', output: 'barked'), whereas the CBOW model trains on (input: ['the', 'barked', 'at'], output: 'dog'). So in CBOW, 'dog' is predicted only when the words [the, barked, at] actually appear together, whereas skip-gram can only tell us that 'dog' is likely to appear near 'barked'. The small sketch below contrasts the two kinds of training pairs.
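The following toy snippet is only an illustration of the two pairing schemes described above (the real batch-generation code comes later); it builds both kinds of (input, output) pairs for the example sentence with one context word on each side:

# Illustrative sketch only: build skip-gram and CBOW style training pairs
# for the example sentence, with a window of 1 word on each side.
sentence = "the dog barked at the mailman".split()
window = 1

skip_gram_pairs = []  # (target word, one context word)
cbow_pairs = []       # (all context words, target word)

for i in range(window, len(sentence) - window):
    context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
    target = sentence[i]
    for c in context:
        skip_gram_pairs.append((target, c))
    cbow_pairs.append((context, target))

print(skip_gram_pairs[:4])  # [('dog', 'the'), ('dog', 'barked'), ('barked', 'dog'), ('barked', 'at')]
print(cbow_pairs[0])        # (['the', 'barked'], 'dog')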
The CBOW model
The conceptual model of CBOW looks like the skip-gram model turned upside down. Despite the resemblance, CBOW and skip-gram are not symmetric. The architecture diagram is shown below.
[Figure: conceptual CBOW model]
Note that the implementation diagram is not shown here, because it is almost identical to the skip-gram one. To turn the conceptual model into an implementation, all we have to do is generate batches of (input, output) pairs. In other words, for each context position we process b words at a time (b being the batch size), i.e. b x word[t-2], b x word[t-1], b x word[t+1], b x word[t+2].
The idea behind CBOW is to use the average of all the input word vectors as the input to the learning model.
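For intuition, that averaging step is nothing more than an element-wise mean over the context word vectors. Here is a toy NumPy example with made-up numbers, independent of the TensorFlow code below:

import numpy as np

# Toy example: 4 context words, embedding dimension 3 (made-up values).
context_vectors = np.array([
    [0.2, -0.1,  0.4],   # word[t-2]
    [0.0,  0.3, -0.2],   # word[t-1]
    [0.5,  0.1,  0.0],   # word[t+1]
    [0.1, -0.3,  0.2],   # word[t+2]
])

# The CBOW input is the element-wise mean of the context vectors.
avg_vector = context_vectors.mean(axis=0)
print(avg_vector)  # -> [0.2  0.   0.1]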
Generating the data
The data-generation function needs a few small changes for the CBOW model. Here is the modified code:
import collections
import numpy as np

def generate_batch(batch_size, skip_window):
    # skip_window is the number of words we look at on each side of a given word
    # creates a single batch; relies on the global corpus `data` and cursor `data_index`
    global data_index
    assert skip_window % 2 == 1

    span = 2 * skip_window + 1  # [ skip_window target skip_window ]

    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    # e.g. if skip_window = 2 then span = 5
    # span is the length of the whole frame we consider for a single word (left + word + right)
    # skip_window is the length of one side

    # queue that appends at the end and drops the oldest element once full
    buffer = collections.deque(maxlen=span)

    # fill the buffer with the first `span` words
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size):
        target = skip_window  # target label is at the center of the buffer

        # copy every word in the span except the center word into the batch row;
        # we only need the words around a given word, not the word itself
        col_idx = 0
        for j in range(span):
            if j == span // 2:
                continue  # skip the target word itself
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]

        # slide the window one word forward
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    assert batch.shape[0] == batch_size and batch.shape[1] == span - 1
    return batch, labels
Notice that the shape of batch is now (b x span-1), whereas before it was (b x 1), and that num_skips has been removed, because we now use every word in the span. Intuitively, entry (i, j) of batch holds the context word at offset j - skip_window (when j < skip_window) or j - skip_window + 1 (when j >= skip_window) from the i-th word of labels. For example, with skip_window = 1 and the input sentence the dog barked at the mailman, we get:
batch: [['the','barked'],['dog','at'],['barked','the'],['at','mailman']]
labels: ['dog','barked','at','the']
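If you want to verify the new batches yourself, a quick check along these lines works (it assumes the data and reverse_dictionary variables built by the preprocessing code from the skip-gram post):

# Sanity check of generate_batch; assumes `data` and `reverse_dictionary`
# from the preprocessing step of the skip-gram tutorial already exist.
batch, labels = generate_batch(batch_size=8, skip_window=1)
for row, label in zip(batch, labels[:, 0]):
    context_words = [reverse_dictionary[idx] for idx in row]
    print(context_words, '->', reverse_dictionary[label])
# e.g. ['the', 'barked'] -> dog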
Training the model
The training stage also needs a few adjustments, but nothing complicated: we resize the data placeholder and add the right symbolic operations to average the multiple inputs. Since the training procedure is the important part, I will break the code into small snippets and explain the key ones.
Variable initialization
First, we change the train_dataset placeholder to shape (b x 2*skip_window) (remember that span-1 = 2*skip_window). Everything else stays the same.
if __name__ == '__main__':
    batch_size = 128
    embedding_size = 128  # Dimension of the embedding vector.
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label.
    valid_size = 16       # Random set of words to evaluate similarity on.
    valid_window = 100    # Only pick dev samples in the head of the distribution.
    # pick 16 samples from 100
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, random.sample(range(1000, 1000+valid_window), valid_size//2))
    num_sampled = 64      # Number of negative examples to sample.

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):

        # Input data.
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2*skip_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Variables.
        # embedding, vector for each word in the vocabulary
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                          stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
Embedding lookup and averaging
Here we make a bigger change: we rewrite the embedding lookup so that the looked-up vectors are averaged correctly. In short, for each column of train_dataset (of size b x 2*skip_window) we look up the embedding vector of every word ID in that column and keep the result in a temporary tensor (embedding_i). These column tensors are concatenated into a composite tensor embeds of size (b x D x 2*skip_window), and we then take the reduce mean over the last axis (axis 2). The result, for every batch of data, is the average vector of the context surrounding the corresponding word in train_labels.
        # Model.
        embeds = None
        for i in range(2*skip_window):
            embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
            print('embedding %d shape: %s' % (i, embedding_i.get_shape().as_list()))
            emb_x, emb_y = embedding_i.get_shape().as_list()
            if embeds is None:
                embeds = tf.reshape(embedding_i, [emb_x, emb_y, 1])
            else:
                embeds = tf.concat(2, [embeds, tf.reshape(embedding_i, [emb_x, emb_y, 1])])

        assert embeds.get_shape().as_list()[2] == 2*skip_window
        print("Concat embedding size: %s" % embeds.get_shape().as_list())
        avg_embed = tf.reduce_mean(embeds, 2, keep_dims=False)
        print("Avg embedding size: %s" % avg_embed.get_shape().as_list())
Loss function and optimization
Compared with the skip-gram model, the only difference is that sampled_softmax_loss now receives the averaged vector. The code itself barely changes.
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, avg_embed,
                                                         train_labels, num_sampled, vocabulary_size))
        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

        # We use the cosine distance:
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
Running the program
Finally, we let TensorFlow run.
    num_steps = 100001  # total number of training steps (the run below goes up to step 100,000)

    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()
        print('Initialized')
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, skip_window)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 2000 == 0:
                if step > 0:
                    average_loss = average_loss / 2000
                # The average loss is an estimate of the loss over the last 2000 batches.
                print('Average loss at step %d: %f' % (step, average_loss))
                average_loss = 0
            # note that this is expensive (~20% slowdown if computed every 500 steps)
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
        final_embeddings = normalized_embeddings.eval()
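Once final_embeddings has been evaluated, nearest neighbours can also be queried offline with plain NumPy. The helper below is only a usage sketch (nearest_words is a name I made up; dictionary and reverse_dictionary come from the preprocessing step):

import numpy as np

# Offline nearest-neighbour query on the trained, L2-normalised embeddings.
# `dictionary` maps word -> id and `reverse_dictionary` maps id -> word,
# as built in the preprocessing code of the skip-gram post.
def nearest_words(word, top_k=8):
    vec = final_embeddings[dictionary[word]]   # (D,) query vector
    sims = final_embeddings.dot(vec)           # cosine similarity, since rows are unit length
    nearest = (-sims).argsort()[1:top_k + 1]   # skip index 0, which is the word itself
    return [reverse_dictionary[i] for i in nearest]

print(nearest_words('he'))  # e.g. ['she', 'it', 'they', ...]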
Results
The final output looks like this:
Average loss at step 0: 7.687360
Nearest to he: annoying, menachem, publicize, unwise, skinny, attractors, devastating, declination,
Nearest to is: iarc, agrarianism, revoluci, bachman, distinguish, schliemann, carbons, ne,
Nearest to some: routed, oscillations, reverence, collaborating, invitational, murderous, mortimer, migratory,
Nearest to only: walkway, loud, today, headshot, foundational, asceticism, tracked, hare,
...
Nearest to i: intermediates, backed, techs, duly, inefficiencies, ibadi, creole, poured,
Nearest to bbc: mprp, catching, slavic, mol, dorian, mining, inactivity, applet,
Nearest to cost: cakes, voltages, halter, disappeared, poking, buttocks, talents, salle,
Nearest to proposed: prisoners, ecuador, sorghum, complying, saturdays, positioned, probing, observables,

Average loss at step 100000: 2.422888
Nearest to he: she, it, they, there, who, eventually, neighbors, theses,
Nearest to is: was, has, became, remains, be, becomes, seems, cetacean,
Nearest to some: many, several, certain, most, any, all, both, these,
Nearest to only: settling, orchids, commutation, until, either, first, alcohols, rabba,
...
Nearest to i: we, you, ii, iii, iv, they, t, lm,
Nearest to bbc: news, corporation, coffers, inactivity, mprp, formatted, cara, pedestrian,
Nearest to cost: cakes, length, completion, poking, measure, enforcers, parody, figurative,
Nearest to proposed: introduced, discovered, foreground, suggested, dismissed, argued, ecuador, builder,
The full code can be downloaded here: 5_word2vec_cbow.py