Notes on the Memory4Dialog series

I remember reading GLMP in the first half of the year and telling people that memory networks were all hype; my attitude has since become more level-headed: any cat that catches mice is a good cat. Below I walk through this series of papers.
1) LEARNING END-TO-END GOAL-ORIENTED DIALOG
2) Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems
3) GLOBAL-TO-LOCAL MEMORY POINTER NETWORKS FOR TASK-ORIENTED DIALOGUE
For the querying process: does the query read the memory iteratively, or attend to every memory entry in parallel?
1) only demonstrates that memory networks can be used for task-oriented dialog.
2) Mem2Seq adds a pointer network:
ptr_{i}=\left\{\begin{array}{ll}{\max (z)} & {\text { if } \exists z \text { s.t. } y_{i}=u_{z}} \\ {n+l+1} & {\text { otherwise }}\end{array}\right.
That is, a word is either copied from the context + KB memory or, when the pointer lands on the sentinel, generated from the vocabulary distribution, where $u_z \in U$ is the input sequence and $n+l+1$ is the sentinel position index (a small sketch of building these labels follows after the next point).
With multi-hop reading, the query computes an attention score against every memory entity, forms a weighted summary, and then repeats this several times, i.e., iterative querying.
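To make the pointer supervision in 2) concrete, here is a minimal sketch (my own illustration, not the Mem2Seq code; make_pointer_labels and the toy tokens are made up). It uses 0-based indexing, so the sentinel is simply position n+l:

```python
# Sketch of Mem2Seq-style pointer labels: point at the LAST position where the
# gold word y_i occurs in the memory (dialog history + KB), else at the sentinel.
def make_pointer_labels(memory_tokens, target_tokens):
    sentinel = len(memory_tokens)                      # the "n+l+1" slot (0-based)
    labels = []
    for y in target_tokens:
        positions = [z for z, u in enumerate(memory_tokens) if u == y]
        labels.append(max(positions) if positions else sentinel)
    return labels

mem = ["i", "need", "coffee", "starbucks", "address", "123_main_st"]
print(make_pointer_labels(mem, ["starbucks", "is", "at", "123_main_st"]))
# -> [3, 6, 6, 5]: copy "starbucks" and "123_main_st", generate "is"/"at" from vocab
```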
As for 3), GLMP:
The algorithm in detail:
Knowledge read and write:
C=\left(C^{1}, \ldots, C^{K+1}\right), \text { where } C^{k} \in \mathbb{R}^{|V| \times d_{emb}} are trainable embedding matrices used to represent the external knowledge M=[B ; X]=\left(m_{1}, \dots, m_{n+l}\right) (n is the number of KB triples, l is the length of the dialog history), and K is the number of hops.
p_{i}^{k}=\operatorname{Softmax}\left(\left(q^{k}\right)^{T} c_{i}^{k}\right)
c_{i}^{k}=B\left(C^{k}\left(m_{i}\right)\right) \in \mathbb{R}^{d_{emb}} (B(·) is the bag-of-words function: it sums the embeddings of the words in memory item m_i), and p_i^k is the attention score.
o^{k}=\sum_{i} p_{i}^{k} c_{i}^{k+1}, \quad q^{k+1}=q^{k}+o^{k}. Note that p and c carry different hop superscripts: this is the adjacent weight-tying scheme from end-to-end memory networks, where the output embedding of hop k doubles as the input embedding of hop k+1.
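A minimal PyTorch sketch of this read, under my own naming (ExternalKnowledge, load_memory and read are illustrative, not GLMP's released code); it assumes each memory item is a padded sequence of word ids and that padding id 0 embeds to zero:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalKnowledge(nn.Module):
    def __init__(self, vocab_size, d_emb, hops):
        super().__init__()
        self.hops = hops
        # C^1 ... C^{K+1}: one embedding matrix per hop (K+1 in total).
        self.C = nn.ModuleList(
            [nn.Embedding(vocab_size, d_emb, padding_idx=0) for _ in range(hops + 1)])

    def load_memory(self, memory):
        # memory: (batch, n+l, max_words). c_i^k = B(C^k(m_i)) = bag-of-words sum.
        self.m = [self.C[k](memory).sum(dim=2) for k in range(self.hops + 1)]

    def read(self, q):
        # q: (batch, d_emb). Iterate K hops; return the final query and the
        # last hop's attention over the n+l memory positions.
        for k in range(self.hops):
            scores = torch.einsum("bd,bnd->bn", q, self.m[k])   # (q^k)^T c_i^k
            p = F.softmax(scores, dim=-1)                        # p_i^k
            o = torch.einsum("bn,bnd->bd", p, self.m[k + 1])     # o^k = Σ_i p_i^k c_i^{k+1}
            q = q + o                                            # q^{k+1} = q^k + o^k
        return q, p
```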

The encode + decode process:
1. Writing the dialog history into the external knowledge:
c_{i}^{k}=c_{i}^{k}+h_{m_{i}}^{e} \quad \text { if } \quad m_{i} \in X \text { and } \forall k \in[1, K+1]. That is, C already exists as a parameter matrix; once encoding is done, the encoder hidden state of each history word is added on top of the corresponding memory embedding.
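Continuing the sketch above (again my own illustration; add_history_states and hist_slice are made-up names), this write step looks roughly like:

```python
def add_history_states(ek, enc_hidden, hist_slice):
    # enc_hidden: (batch, l, d_emb) context-RNN hidden states over the l history
    # words; hist_slice marks which of the n+l memory positions belong to X.
    for k in range(ek.hops + 1):
        ek.m[k][:, hist_slice, :] = ek.m[k][:, hist_slice, :] + enc_hidden
```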
2. Generating the global memory pointer:
G=\left(g_{1}, \dots, g_{n+l}\right) is a binary vector; the attention for each entity is computed independently (a sigmoid per position rather than a softmax over positions).
The last encoder hidden state h_n^e is used as the query for the multi-hop read; the training target for the last hop's attention vector is whether each memory word appears in the target sentence or not.
G^{label}=\left(g_{1}^{l}, \ldots, g_{n+l}^{l}\right)
g_{i}=\operatorname{Sigmoid}\left(\left(q^{K}\right)^{T} c_{i}^{K}\right), \quad g_{i}^{l}=\left\{\begin{array}{ll}{1} & {\text { if Object }\left(m_{i}\right) \in Y} \\ {0} & {\text { otherwise }}\end{array}\right.
\text {Loss}_{g}=-\sum_{i=1}^{n+l}\left[g_{i}^{l} \times \log g_{i}+\left(1-g_{i}^{l}\right) \times \log \left(1-g_{i}\right)\right]
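A sketch of this pointer and its loss (illustrative only; global_pointer_loss, memory_objects and gold_response are my own names), reusing the ExternalKnowledge sketch: run the hops with h_n^e as the query, sigmoid the last-hop scores, and supervise each position with whether its Object word appears in the gold response Y:

```python
def global_pointer_loss(ek, h_enc_last, memory_objects, gold_response):
    # h_enc_last: (batch, d_emb) = h^e_n; memory_objects: per-example list of the
    # Object word of every memory item; gold_response: set of words in Y.
    q = h_enc_last
    for k in range(ek.hops):
        scores = torch.einsum("bd,bnd->bn", q, ek.m[k])
        p = F.softmax(scores, dim=-1)
        q = q + torch.einsum("bn,bnd->bd", p, ek.m[k + 1])
    g = torch.sigmoid(scores)                                  # g_i = σ((q^K)^T c_i^K)
    g_label = torch.tensor([[1.0 if obj in gold_response else 0.0 for obj in objs]
                            for objs in memory_objects])       # G^label
    return g, F.binary_cross_entropy(g, g_label)               # Loss_g
```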

3. Generating the local memory pointer:
1) At each time step t, the global memory pointer G first modifies the global contextual representation using its attention weights, i.e., only the parts of the external knowledge relevant to the target sentence stay active; this is an MLE-style training idea.
c_{i}^{k}=c_{i}^{k} \times {g}_{i}, \quad \forall i \in[1, n+l] \text { and } \forall k \in[1, K+1]
and then the sketch RNN hidden state h_t^d queries the external knowledge.
The extracted vector (the memory attention in the last hop) is the corresponding local memory pointer L_t, which is the memory distribution at time step t.
MLE training of the local pointer:
L_{t}^{label}=\left\{\begin{array}{ll}{\max (z)} & {\text { if } \exists z \text { s.t. } y_{t}=\operatorname{Object}\left(m_{z}\right)} \\ {n+l+1} & {\text { otherwise }}\end{array}\right.
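A sketch of this step at one decoding step (again with illustrative names; it assumes a null token has been appended as the last memory item so the sentinel slot exists):

```python
def local_pointer_step(ek, g, h_dec_t, memory_objects, y_t):
    # g: (batch, n+l) global pointer; h_dec_t: (batch, d_emb) sketch-RNN state h^d_t.
    gated = [m * g.unsqueeze(-1) for m in ek.m]          # c_i^k = c_i^k * g_i
    q = h_dec_t
    for k in range(ek.hops):
        scores = torch.einsum("bd,bnd->bn", q, gated[k])
        L_t = F.softmax(scores, dim=-1)                  # last hop's attention = L_t
        q = q + torch.einsum("bn,bnd->bd", L_t, gated[k + 1])
    # Label: the last memory position whose Object(m_z) equals the gold word y_t,
    # otherwise the null/sentinel slot at the end of the memory.
    positions = [z for z, obj in enumerate(memory_objects) if obj == y_t]
    L_label = max(positions) if positions else len(memory_objects) - 1
    return L_t, L_label
```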
4. How the final word is generated:
\hat{y}_{t}=\left\{\begin{array}{ll}{\arg \max \left(P_{t}^{vocab}\right)} & {\text { if } \arg \max \left(P_{t}^{vocab}\right) \notin ST} \\ {\operatorname{Object}\left(m_{\arg \max \left(L_{t} \odot R\right)}\right)} & {\text { otherwise }}\end{array}\right.
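A sketch of this decision (generate_word, id2word and sketch_tag_ids are illustrative names; R is the 0/1 record mask from the paper that blocks positions that have already been copied):

```python
def generate_word(p_vocab_t, L_t, R, memory_objects, id2word, sketch_tag_ids):
    # p_vocab_t: (|V|,) sketch-RNN vocab distribution; L_t, R: tensors over memory slots.
    top = p_vocab_t.argmax().item()
    if top not in sketch_tag_ids:                 # ordinary word: keep the vocab argmax
        return id2word[top]
    slot = (L_t * R).argmax().item()              # else copy Object(m_{argmax(L_t ⊙ R)})
    return memory_objects[slot]
```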
5. My understanding:
The global pointer makes sure that the memory entries in the external knowledge that actually appear in the target sentence receive the largest weights; the local pointer is what picks the concrete word at each decoding step.
6. To revisit later; I don't feel I fully understand it yet.