Summary of the paper "Aspect Level Sentiment Classification with Deep Memory Network"
Aspect Level Sentiment Classification with Deep Memory Network
Paper source: Tang, D., Qin, B., & Liu, T. (2016). Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.
Original link: http://blog.****.net/rxt2012kc/article/details/73770408
advantages
Neural models are of growing interest for their capacity to learn text representation from data without careful engineering of features, and to capture semantic relations between aspect and context words in a more scalable way than feature based SVM.
disadvantage
Despite these advantages, conventional neural models like long short-term memory (LSTM) (Tang et al., 2015a) capture context information in an implicit way, and are incapable of explicitly exhibiting important context clues of an aspect.
Standard LSTM works in a sequential way and manipulates each context word with the same operation, so that it cannot explicitly reveal the importance of each context word.
cross-entropy
As every component is differentiable, the entire model could be efficiently trained end-to-end with gradient descent, where the loss function is the cross-entropy error of sentiment classification.
The aspect above is a single word; if the aspect consists of multiple words, it is handled as follows:
For the case where aspect is a multi-word expression like "battery life", aspect representation is an average of its constituting word vectors (Sun et al., 2015).
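As a small illustration of that averaging, here is a minimal numpy sketch; the `embedding` lookup table and the dimension d = 300 are hypothetical stand-ins, not the paper's actual embeddings:

```python
import numpy as np

# Hypothetical embedding lookup: word -> d-dimensional vector (random stand-ins).
d = 300
rng = np.random.default_rng(0)
embedding = {w: rng.standard_normal(d) for w in ["battery", "life", "food"]}

def aspect_vector(aspect_words):
    """Single-word aspect: its own embedding; multi-word aspect: mean of word embeddings."""
    return np.mean([embedding[w] for w in aspect_words], axis=0)

v_food = aspect_vector(["food"])                      # single-word aspect
v_battery_life = aspect_vector(["battery", "life"])   # multi-word aspect, averaged
```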
dataset
- laptop and restaurant datasets
We apply the proposed approach to laptop and restaurant datasets from SemEval 2014 (Pontiki et al., 2014).
steps
- input
Given a sentence s = {w_1, w_2, …, w_i, …, w_n} and the aspect word w_i, we map each word into its embedding vector. These word vectors are separated into two parts, aspect representation and context representation. If aspect is a single word like "food" or "service", aspect representation is the embedding of aspect word.
Context word vectors {e_1, e_2, …, e_{i−1}, e_{i+1}, …, e_n} are stacked and regarded as the external memory m ∈ R^{d×(n−1)}, where n is the sentence length.
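A minimal sketch of this input step on the running example sentence; the embeddings are random stand-ins, and the code only illustrates how the context columns might be stacked into the external memory:

```python
import numpy as np

d = 300
rng = np.random.default_rng(1)
sentence = ["great", "food", "but", "the", "service", "was", "dreadful"]
embedding = {w: rng.standard_normal(d) for w in sentence}

aspect_index = 1                       # position of the aspect word "food"
v_aspect = embedding[sentence[aspect_index]]

# Stack the remaining context word vectors column-wise as the external memory.
context = [embedding[w] for i, w in enumerate(sentence) if i != aspect_index]
m = np.stack(context, axis=1)          # shape: (d, n - 1)
```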
- step1
In the first computational layer (hop 1), we regard aspect vector as the input to adaptively select important evidences from memory m through attention layer.
The output of attention layer and the linear transformation of aspect vector are summed and the result is considered as the input of next layer (hop 2).
It is helpful to note that the parameters of attention and linear layers are shared in different hops. Therefore, the model with one layer and the model with nine layers have the same number of parameters.
attention model
The basic idea of attention mechanism is that it assigns a weight/importance to each lower position when computing an upper level representation (Bahdanau et al., 2015).
In this work, we use attention model to compute the representation of a sentence with re- gard to an aspect.
Furthermore, the importance of a word should be different if we focus on different aspects. Let us again take the example of "great food but the service was dreadful!". The context word "great" is more important than "dreadful" for aspect "food". On the contrary, "dreadful" is more important than "great" for aspect "service".
Each word's weight is computed via g_i = tanh(W_att[m_i; v_aspect] + b_att), where g_i is a 1×1 scalar. Collecting the scores gives [g_1, g_2, …, g_k], a 1×k vector, which is then passed through a softmax to obtain the weight of each context word. Each memory column is multiplied by its weight and the results are summed, yielding a d×1 vector that serves as the output of the attention model.
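A minimal numpy sketch of this scoring and pooling, with hypothetical small dimensions; W_att, b_att and the memory are random placeholders rather than trained parameters:

```python
import numpy as np

d, k = 4, 5                               # embedding size, number of context words
rng = np.random.default_rng(0)
m = rng.standard_normal((d, k))           # external memory: one column per context word
v_aspect = rng.standard_normal(d)         # aspect vector
W_att = rng.standard_normal((1, 2 * d))   # scores are computed from [m_i; v_aspect]
b_att = np.zeros(1)

# g_i = tanh(W_att [m_i; v_aspect] + b_att): one scalar score per context word.
g = np.array([np.tanh(W_att @ np.concatenate([m[:, i], v_aspect]) + b_att)[0]
              for i in range(k)])

# Softmax over the 1 x k scores gives each word's attention weight.
alpha = np.exp(g - g.max())
alpha /= alpha.sum()

# Weighted sum of memory columns: a d-dimensional attention output.
output = m @ alpha
```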
location attention
Such location information is helpful for an attention model because intuitively a context word closer to the aspect should be more important than a farther one.
In this work, we define the location of a context word as its absolute distance with the aspect in the original sentence sequence.
v_i = 1 − l_i/n, where l_i is the location of the context word (its absolute distance from the aspect, as defined above) and n is the sentence length.
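A small sketch of how such location weights could be applied, assuming (as in one of the paper's location models) that each context memory column is simply scaled by its weight v_i; the sentence and positions are only illustrative:

```python
import numpy as np

sentence = ["great", "food", "but", "the", "service", "was", "dreadful"]
n = len(sentence)
aspect_index = 4                               # aspect word "service"

d = 4
rng = np.random.default_rng(0)
e = rng.standard_normal((d, n))                # word embeddings, one column per word

context_idx = [i for i in range(n) if i != aspect_index]
l = np.abs(np.array(context_idx) - aspect_index)   # absolute distance to the aspect
v_loc = 1.0 - l / n                                # v_i = 1 - l_i / n: closer words weigh more

# Location-weighted external memory: each context column scaled by its v_i.
m = e[:, context_idx] * v_loc
```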
the need for multiple hops
Multiple computational layers allow the deep memory network to learn representations of text with multiple levels of abstraction. Each layer/hop retrieves important context words, and transforms the representation at previous level into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions of sentence representation towards an aspect can be learned.
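Putting the pieces together, a hedged end-to-end sketch of stacking hops: the attention step from above is restated inline, the weights are random placeholders, and the same W_att, b_att, W_lin are reused in every hop, so the parameter count does not grow with depth:

```python
import numpy as np

d, n_hops = 300, 3
rng = np.random.default_rng(2)
m = rng.standard_normal((d, 6))            # external memory (6 context words)
v = rng.standard_normal(d)                 # aspect vector is the input of hop 1
W_att = rng.standard_normal((1, 2 * d)) * 0.01
b_att = np.zeros(1)
W_lin = np.eye(d)                          # linear layer applied to the running vector

for _ in range(n_hops):                    # the same W_att, b_att, W_lin are shared across hops
    v_rep = np.tile(v[:, None], (1, m.shape[1]))             # repeat current vector per column
    scores = np.tanh(W_att @ np.vstack([m, v_rep]) + b_att).ravel()
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # attention weights over context words
    v = m @ alpha + W_lin @ v              # attended memory + linear transform of v

# v is the final representation of the sentence w.r.t. the aspect, fed to a softmax classifier.
```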
cross entropy
The model is trained in a supervised manner by minimizing the cross-entropy error of sentiment classification.
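A minimal sketch of that objective for a single example, assuming the final hop output goes through a softmax over three sentiment classes; the classifier weights and the gold label are placeholders:

```python
import numpy as np

d, n_classes = 300, 3                      # e.g. negative / neutral / positive
rng = np.random.default_rng(3)
v_final = rng.standard_normal(d)           # output of the last hop
W_s = rng.standard_normal((n_classes, d)) * 0.01
b_s = np.zeros(n_classes)

logits = W_s @ v_final + b_s
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax over sentiment classes

gold = 2                                   # index of the gold sentiment label
loss = -np.log(probs[gold])                # cross-entropy error for this example
```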