Evaluation Methods for Text Summarization: Rouge-1, Rouge-2, Rouge-L, Rouge-S

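About Rouge

Rouge (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It compares an automatically generated summary or translation against a set of reference summaries (usually written by humans) and produces a score that measures how similar the generated text is to the references.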

Rouge-1, Rouge-2, Rouge-N

Definition of Rouge-N

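$$\text{Rouge-N} = \frac{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count(gram_n)}$$
The denominator is the total number of n-grams in the reference summaries, and the numerator is the number of n-grams shared by the reference summaries and the automatic summary. In other words, Rouge-N is a recall.
The N in Rouge-N is the n-gram size: Rouge-1 uses 1-grams, Rouge-2 uses 2-grams, Rouge-3 uses 3-grams.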

Example:
Automatic summary Y (machine-generated): the cat was found under the bed.
Reference summary X: the cat was under the bed.

#	candidate 1-gram	reference 1-gram	candidate 2-gram	reference 2-gram
1	the	the	the cat	the cat
2	cat	cat	cat was	cat was
3	was	was	was found	was under
4	found	under	found under	under the
5	under	the	under the	the bed
6	the	bed	the bed
7	bed
count	7	6	6	5

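$\text{Rouge-1}(X,Y) = 6/6 = 1.0$. The numerator is the number of 1-grams that appear in both the summary under evaluation and the reference summary (all 6 reference 1-grams also occur in Y), and the denominator is the number of 1-grams in the reference summary. (The denominator could instead be the number of 1-grams in the summary under evaluation, which would give precision rather than recall; of the two we care more about recall, which is consistent with the Rouge-N definition above.) Likewise, $\text{Rouge-2}(X,Y) = 4/5 = 0.8$: of the 5 reference 2-grams, only "was under" is missing from Y.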
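
The worked example above can be checked with a few lines of Python. This is only a minimal sketch for a single reference summary: the function names `ngrams` and `rouge_n` are made up for illustration, tokenization is assumed to be a plain whitespace split, and the trailing period is dropped from both sentences.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, candidate, n):
    """Recall-oriented Rouge-N for one reference and one candidate summary."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Clipped match count: a reference n-gram is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


reference = "the cat was under the bed"        # X
candidate = "the cat was found under the bed"  # Y

print(rouge_n(reference, candidate, 1))  # 1.0
print(rouge_n(reference, candidate, 2))  # 0.8
```

Dividing by the number of candidate n-grams instead of `total` would turn the same overlap count into precision.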

Rouge-L

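The L stands for LCS (longest common subsequence), because Rouge-L is built on the longest common subsequence. Rouge-L is computed as:
$$R_{lcs} = \frac{LCS(X,Y)}{m}, \qquad P_{lcs} = \frac{LCS(X,Y)}{n}, \qquad F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$
where $LCS(X,Y)$ is the length of the longest common subsequence of X and Y, m and n are the lengths of the reference summary and the automatic summary (usually the number of words), and $R_{lcs}$ and $P_{lcs}$ are recall and precision. $F_{lcs}$ is what we call Rouge-L. In DUC, $\beta$ is set to a very large number, so Rouge-L considers almost only $R_{lcs}$, which matches the earlier remark that recall is usually what matters.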
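
As a rough illustration, the sketch below computes $R_{lcs}$, $P_{lcs}$ and $F_{lcs}$ for the cat example used earlier. The names `lcs_length` and `rouge_l`, the whitespace tokenization, and the default `beta=1.0` are assumptions made for this example, not part of any official Rouge tool.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] is the LCS length of x[:i] and y[:j].
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l(reference, candidate, beta=1.0):
    """Return (recall, precision, F) based on the LCS of the two summaries."""
    x, y = reference.split(), candidate.split()
    if not x or not y:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(x, y)
    recall = lcs / len(x)       # R_lcs = LCS(X, Y) / m
    precision = lcs / len(y)    # P_lcs = LCS(X, Y) / n
    if lcs == 0:
        return 0.0, 0.0, 0.0
    f = ((1 + beta ** 2) * recall * precision
         / (recall + beta ** 2 * precision))
    return recall, precision, f


reference = "the cat was under the bed"        # X, m = 6
candidate = "the cat was found under the bed"  # Y, n = 7

print(rouge_l(reference, candidate))            # beta = 1: balanced F
print(rouge_l(reference, candidate, beta=100))  # large beta: F is close to recall
```

Here the LCS is the whole reference (6 words), so recall is 6/6 and precision is 6/7. With `beta=1.0` the F value is 12/13, and with a large `beta` it approaches the recall of 1.0.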

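Rouge-W: an improved version of Rouge-L

Rouge-W was proposed to address a weakness of Rouge-L.
[Figure: a reference summary X and two candidate summaries $Y_1$ and $Y_2$]
In the figure above, X is the reference summary and $Y_1, Y_2$ are two summaries under evaluation. $Y_1$ is clearly better than $Y_2$, because its matches with the reference X are consecutive, yet $\text{Rouge-L}(X,Y_1) = \text{Rouge-L}(X,Y_2)$, since LCS only counts the matched words and ignores whether they are adjacent. To fix this, the authors of the paper proposed the weighted longest common subsequence (WLCS). See the paper [3] for details on Rouge-W.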
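
To make the difference concrete, here is a hedged sketch of a weighted LCS score. The three sequences are the example given in Lin's ROUGE paper and are assumed, not confirmed, to be what the figure showed. The function name `wlcs` and the weighting function $f(k) = k^2$ are illustrative choices; any function that rewards longer consecutive runs would do, and the normalization into recall, precision and F is left out.

```python
def wlcs(x, y, f=lambda k: k * k):
    """Weighted LCS score; f(k) is the weight of a run of k consecutive matches."""
    m, n = len(x), len(y)
    # c[i][j]: best weighted score for x[:i] and y[:j].
    # w[i][j]: length of the consecutive match ending at x[i-1], y[j-1].
    c = [[0] * (n + 1) for _ in range(m + 1)]
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                # Extending a run from length k to k + 1 adds f(k + 1) - f(k).
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0  # the run is broken
    return c[m][n]


X = list("ABCDEFG")   # reference
Y1 = list("ABCDHIK")  # four consecutive matches: A B C D
Y2 = list("AHBKCID")  # the same four matches, but scattered

# With a linear weight f(k) = k the score is the plain LCS length.
print(wlcs(X, Y1, f=lambda k: k), wlcs(X, Y2, f=lambda k: k))  # 4 4
# With f(k) = k^2 the consecutive matches in Y1 are rewarded.
print(wlcs(X, Y1), wlcs(X, Y2))  # 16 4
```

Plain LCS cannot tell the two candidates apart, while the weighted version scores $Y_1$ higher.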

Rouge-S

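Rouge-S uses skip-grams: when matching the reference summary against the summary under evaluation, the words of a gram do not have to be consecutive and may "skip" over a few words. With skip-bigrams, for example, up to two words may be skipped when generating the grams. The skip-bigrams of "cat in the hat" are "cat in, cat the, cat hat, in the, in hat, the hat".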
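
The skip-bigram listing above can be reproduced with a short sketch. The names `skip_bigrams` and `rouge_s` and the parameter `max_skip` are hypothetical, and `rouge_s` shows only a recall-style score for a single reference summary under whitespace tokenization.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_skip=2):
    """Ordered word pairs with at most max_skip words between them."""
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i - 1 <= max_skip]


def rouge_s(reference, candidate, max_skip=2):
    """Share of reference skip-bigrams that also occur in the candidate."""
    ref_counts = Counter(skip_bigrams(reference.split(), max_skip))
    cand_counts = Counter(skip_bigrams(candidate.split(), max_skip))
    # Counter intersection keeps the minimum count of each shared pair.
    overlap = sum((ref_counts & cand_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


print(skip_bigrams("cat in the hat".split()))
# [('cat', 'in'), ('cat', 'the'), ('cat', 'hat'),
#  ('in', 'the'), ('in', 'hat'), ('the', 'hat')]
```

Setting `max_skip=0` reduces the pairs to ordinary bigrams, so the same function covers both cases.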

References:
How-rouge-works-for-evaluation
文本摘要测评方法 (Evaluation Methods for Text Summarization)