Notes on Question Answering with Subgraph Embeddings
Source: EMNLP 2014
Original paper
Motivation
Goal:
Our main motivation is to provide a system for open QA able to be trained as
long as it has access to: (1) a training set of questions paired with answers and
(2) a KB providing a structure among answers.
Assumption:
The answers to questions are entities in the knowledge graph, and questions themselves mention entities from the knowledge graph. When no entity can be identified, a plain string matching method is used.
Datasets
- WebQuestions (This dataset is built using Freebase as the KB and contains 5,810 question-answer pairs.) serves as the evaluation benchmark. Because this dataset is small, data from other sources is also used for training.
- Freebase
Freebase [3] is a huge and freely available database of general facts; data is organized as triplets (subject, type1.type2.predicate, object), where two entities subject and object (identified by mids) are connected by the relation type type1.type2.predicate.
Triples are converted into questions of the form “What is the predicate of the type2 subject?” (see the sketch after this list). Not all of the KB is used: only facts whose entities appear in WebQuestions or ClueWeb are kept, and entities occurring fewer than 5 times are removed as well.
An example is “What is the nationality of the person barack obama ?” (united states).
- ClueWeb: “we also created questions using ClueWeb extractions provided by [10]. Using string matching, we ended up with 2M extractions structured as (subject, “text string”, object) with both subject and object linked to Freebase.”
- Paraphrases. A dataset in which questions are grouped into clusters of paraphrases (questions in the same cluster are rephrasings of each other).
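As an illustration of the template above, here is a minimal sketch of the triple-to-question conversion; `triple_to_question` is a hypothetical helper for illustration, not the authors' code:

```python
# Hypothetical sketch: turn a Freebase triple into a template question.
def triple_to_question(subject: str, relation: str, obj: str) -> tuple[str, str]:
    type1, type2, predicate = relation.split(".")   # e.g. people.person.nationality
    question = f"What is the {predicate.replace('_', ' ')} of the {type2} {subject} ?"
    return question, obj

# -> ("What is the nationality of the person barack obama ?", "united states")
print(triple_to_question("barack obama", "people.person.nationality", "united states"))
```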
Method
Let $q$ denote a question and $a$ a candidate answer. The score function $S(q, a)$ generates a high score if $a$ is the correct answer to the question $q$, and a low score otherwise. Note that both $q$ and $a$ are represented as a combination of the embeddings of their individual words and/or symbols; hence, learning $S$ essentially involves learning these embeddings:

$$S(q, a) = f(q)^{\top} g(a)$$
Concretely, there is a matrix $W \in \mathbb{R}^{k \times N}$, where $k$ is the embedding size and $N = N_W + N_S$ is the dictionary size, with $N_W$ the total number of words and $N_S$ the total number of entities and relation types. The $i$-th column of $W$ is the embedding of the $i$-th element (word, entity or relation type) in the dictionary.
The function $f(q) = W\phi(q)$ maps a question into the embedding space $\mathbb{R}^k$, where $\phi(q) \in \mathbb{N}^N$ is a sparse indicator vector whose entries count how many times each word appears in the question (semantic relations between words and word order are ignored; how much does this hurt?). In effect, $q$ is represented as the sum of the embeddings of its words.
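A minimal numpy sketch of this bag-of-words embedding, with toy dimensions and made-up word ids:

```python
import numpy as np

# f(q) = W @ phi(q): the question embedding is the sum of its words'
# embedding columns. Sizes and word ids below are toy values.
k, N = 64, 10_000                         # embedding size, dictionary size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, N))    # one column per word/entity/relation

def embed_question(word_ids):
    """f(q) = W @ phi(q), with phi(q) a bag-of-words count vector."""
    phi = np.zeros(N)
    for i in word_ids:
        phi[i] += 1.0                     # word order is lost here
    return W @ phi                        # == sum of the selected columns

f_q = embed_question([12, 87, 403])       # toy word ids for a question
```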
Representing Candidate Answers
The answer is represented by $g(a) = W\psi(a)$, where $\psi(a) \in \mathbb{N}^N$ is a sparse vector computed in one of three ways:
1. Single Entity. The answer is represented as a single entity from Freebase; $\psi(a)$ is a 1-of-$N_S$ coded vector.
2. Path Representation. The answer is represented as a path from the entity mentioned in the question to the answer entity. The authors only consider 1- or 2-hop paths, so $\psi(a)$ is a 3-of-$N_S$ or 4-of-$N_S$ coded vector (intermediate entities on the path are not represented).
3. Subgraph Representation. Same as 2, but additionally encodes the subgraph of the answer entity: the $C$ entities and $D$ relation types connected to it. To keep path symbols distinguishable from subgraph symbols, the entity/relation dictionary is doubled, i.e. $N_S = 2 \times (\#\text{entities} + \#\text{relation types})$, with one copy used for the path and the other for the subgraph (sketched below). $\psi(a)$ is then a $(3+C+D)$- or $(4+C+D)$-of-$N_S$ coded vector.
This choice rests on the hypothesis:
Our hypothesis is that including more information about the answer in its representation will lead to improved results.
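A sketch of how the three $\psi(a)$ encodings could be built under the dictionary-doubling scheme above; sizes and ids are toy values, not the authors' code:

```python
import numpy as np

# First half of the entity/relation dictionary holds "path" symbols,
# the second half holds "subgraph" symbols.
N_W = 10_000                  # number of words
half = 5_000                  # entities + relation types (one copy)
N_S = 2 * half                # doubled dictionary
N = N_W + N_S

def psi_single_entity(answer_id):
    psi = np.zeros(N)
    psi[N_W + answer_id] = 1.0            # 1-of-N_S
    return psi

def psi_path(path_ids):
    """path_ids = [question entity, relation type(s), answer entity]: 3 or 4 ids."""
    psi = np.zeros(N)
    for i in path_ids:
        psi[N_W + i] = 1.0                # 3-of-N_S or 4-of-N_S
    return psi

def psi_subgraph(path_ids, subgraph_ids):
    """Path symbols plus the answer's C neighbor entities and D relation types."""
    psi = psi_path(path_ids)
    for i in subgraph_ids:
        psi[N_W + half + i] = 1.0         # second (parallel) copy of the dictionary
    return psi
```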
Training and Loss Function
The parameters to be learned are exactly $W$, i.e. the embeddings of the words, entities and relation types. Training minimizes a margin-based ranking loss of the form $\sum_i \max\{0,\, m - S(q_i, a_i) + S(q_i, \hat{a}_i)\}$, where the hatted $\hat{a}_i$ denotes a negative sample. Where do the negatives come from? They are constructed: half of the time as another path connected to the question entity, and the other half sampled at random.
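A sketch of this ranking loss and the 50/50 negative sampling described above; the margin value and the helpers are assumptions for illustration:

```python
import numpy as np

def ranking_loss(f_q, g_pos, g_neg, m=0.1):
    """max(0, m - S(q, a) + S(q, a_neg)) with S(q, a) = f(q) . g(a);
    m is a hyperparameter (a small value like 0.1 assumed here)."""
    return max(0.0, m - float(f_q @ g_pos) + float(f_q @ g_neg))

def sample_negative(paths_from_question_entity, all_answers, rng):
    # Half the time: another (incorrect) path from the question entity;
    # the other half: an answer drawn at random from the whole KB.
    if rng.random() < 0.5 and paths_from_question_entity:
        return paths_from_question_entity[rng.integers(len(paths_from_question_entity))]
    return all_answers[rng.integers(len(all_answers))]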
Multitask Training of Embeddings
The authors also perform multitask training with the Paraphrases dataset mentioned above, using the same procedure but scoring question pairs with $S_{prp}(q_1, q_2) = f(q_1)^{\top} f(q_2)$. The goal is to make questions in the same paraphrase cluster score higher similarity.
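A corresponding sketch of the paraphrase term, assuming inputs are embedding vectors produced by `embed_question` from the earlier sketch:

```python
# Two questions in the same paraphrase cluster should outscore a question
# drawn from another cluster, under the same margin ranking loss.
def paraphrase_loss(f_q1, f_q2, f_q_neg, m=0.1):
    return max(0.0, m - float(f_q1 @ f_q2) + float(f_q1 @ f_q_neg))
```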
Inference
At test time, the answer to a question is obtained straightforwardly via:

$$\hat{a} = \arg\max_{a' \in \mathcal{A}(q)} S(q, a')$$

where $\mathcal{A}(q)$ is the candidate answer set.
Each question in the test set contains exactly one identifiable Freebase entity. All entities directly connected to that entity form a first candidate answer set ($C_1$ in the paper). Since considering every entity within 2 hops of the question entity yields far too many candidates, the authors use a beam search: only the top-10 1-hop candidate paths are expanded to their 2-hop entities. The resulting candidate set is denoted $C_2$ and is used by default.
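A sketch of candidate generation and argmax inference as described above; `two_hop_from` and `embed_answer` are hypothetical stand-ins for KB access and $g(a)$:

```python
import numpy as np

def build_candidates(one_hop, two_hop_from, f_q, embed_answer, beam=10):
    """All 1-hop neighbors, plus 2-hop expansions of the top-`beam` 1-hop candidates."""
    ranked = sorted(one_hop, key=lambda a: float(f_q @ embed_answer(a)), reverse=True)
    cands = list(one_hop)
    for a in ranked[:beam]:               # expand only the best-scoring 1-hop candidates
        cands.extend(two_hop_from(a))
    return cands

def predict(f_q, candidates, embed_answer):
    """a_hat = argmax over a' in A(q) of S(q, a')."""
    scores = [float(f_q @ embed_answer(a)) for a in candidates]
    return candidates[int(np.argmax(scores))]
```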
Experiment
Omitted.
Our results also verify our hypothesis of Section 3.1, that a richer representation for answers (using the local subgraph) can store more pertinent information.
Conclusions and Thoughts
The paper achieved state-of-the-art results at the time while requiring almost no hand-crafted features, and without relying on lexical mapping tables, part-of-speech tagging, or dependency parses.
Contribution: enriching the representation of answer information, which greatly improved the performance of deep-learning-based knowledge base question answering.
Thought: if a richer answer representation improves results, would a richer question representation improve results as well?