[Paper Translation] Clustering by Passing Messages Between Data Points

Paper title: Clustering by Passing Messages Between Data Points
Paper source: Clustering by Passing Messages Between Data Points
Translator: [email protected] Lab

Clustering by Passing Messages Between Data Points

Brendan J. Frey* and Delbert Dueck

Abstract

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

Main Text

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular $k$-centers clustering technique begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. $k$-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

Fig. 1. How affinity propagation works.
(A) Affinity propagation is illustrated for two-dimensional data points, where negative Euclidean distance (squared error) was used to measure similarity. Each point is colored according to the current evidence that it is a cluster center (exemplar). The darkness of the arrow directed from point i to point k corresponds to the strength of the transmitted message that point i belongs to exemplar point k.
(B) “Responsibilities” r(i,k) are sent from data points to candidate exemplars and indicate how strongly each data point favors the candidate exemplar over other candidate exemplars.
(C) “Availabilities” a(i,k) are sent from candidate exemplars to data points and indicate to what degree each candidate exemplar is available as a cluster center for the data point.
(D) The effect of the value of the input preference (common for all data points) on the number of identified exemplars (number of clusters) is shown. The value that was used in (A) is also shown, which was computed from the median of the pairwise similarities.

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity $s(i,k)$ indicates how well the data point with index $k$ is suited to be the exemplar for data point $i$. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): for points $x_i$ and $x_k$, $s(i,k) = -||x_i - x_k||^2$. Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, $s(i,k)$ can be set to the log-likelihood of data point $i$ given that its exemplar is point $k$. Alternatively, when appropriate, similarities may be set by hand.
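As a concrete illustration (not from the paper), the negative-squared-error similarity matrix can be computed in a few lines of NumPy; the function name `negative_squared_distances` and the toy points are assumptions for this sketch:

```python
import numpy as np

def negative_squared_distances(X):
    """Pairwise s(i,k) = -||x_i - x_k||^2 for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]   # shape (N, N, d) via broadcasting
    return -(diff ** 2).sum(axis=-1)

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
S = negative_squared_distances(X)
# S[0, 1] == -25.0, since ||(0,0) - (3,4)||^2 = 9 + 16 = 25
```

Note that the diagonal entries come out as 0 here; affinity propagation later overwrites them with the input preferences.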

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number $s(k,k)$ for each data point $k$ so that data points with larger values of $s(k,k)$ are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori all data points are equally suitable as exemplars, the preferences should be set to a common value; this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).
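For instance, setting a shared preference from the median or minimum of the off-diagonal similarities can be sketched as follows (the toy similarity values are an assumption for illustration):

```python
import numpy as np

# toy similarity matrix (negative squared distances); values are illustrative
S = -np.array([[0.0, 1.0, 4.0],
               [1.0, 0.0, 1.0],
               [4.0, 1.0, 0.0]])
off = S[~np.eye(S.shape[0], dtype=bool)]      # pairwise similarities, excluding s(k,k)

S_moderate = S.copy()
np.fill_diagonal(S_moderate, np.median(off))  # median -> moderate number of clusters

S_few = S.copy()
np.fill_diagonal(S_few, off.min())            # minimum -> small number of clusters
```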

There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” $r(i,k)$, sent from data point $i$ to candidate exemplar point $k$, reflects the accumulated evidence for how well-suited point $k$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$ (Fig. 1B). The “availability” $a(i,k)$, sent from candidate exemplar point $k$ to point $i$, reflects the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar (Fig. 1C). $r(i,k)$ and $a(i,k)$ can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: $a(i,k) = 0$. Then, the responsibilities are computed using the rule
$$ r(i,k) \leftarrow s(i,k) - \max \limits_{k' \; s.t. \; k' \neq k} \{ a(i,k') + s(i,k') \} \qquad (1) $$

In the first iteration, because the availabilities are zero, $r(i,k)$ is set to the input similarity between point $i$ and point $k$ as its exemplar, minus the largest of the similarities between point $i$ and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities $s(i,k')$ in the above rule, removing the corresponding candidate exemplars from competition. For $k = i$, the responsibility $r(k,k)$ is set to the input preference that point $k$ be chosen as an exemplar, $s(k,k)$, minus the largest of the similarities between point $i$ and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point $k$ is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.
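A vectorized sketch of update rule (1) in NumPy (the function name and test values are assumptions; the maximum over $k' \neq k$ is obtained from the row-wise largest and second-largest entries of $a + s$):

```python
import numpy as np

def update_responsibilities(S, A):
    """Eq. 1: r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]."""
    AS = A + S
    idx = np.arange(S.shape[0])
    best = AS.argmax(axis=1)               # strongest competitor in each row
    max1 = AS[idx, best].copy()
    AS[idx, best] = -np.inf                # mask it to expose the runner-up
    max2 = AS.max(axis=1)
    R = S - max1[:, None]                  # subtract the strongest competitor...
    R[idx, best] = S[idx, best] - max2     # ...except where k itself is strongest
    return R

S = np.array([[-1.0, -2.0],
              [-3.0, -4.0]])
R = update_responsibilities(S, np.zeros((2, 2)))   # first iteration: a(i,k) = 0
```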

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$$ a(i,k) \leftarrow \min \left\{ 0,\, r(k,k) + \sum_{i' \; s.t. \; i' \notin \{i,k\}} \max \{ 0, r(i',k) \} \right\} \qquad (2) $$

The availability $a(i,k)$ is set to the self-responsibility $r(k,k)$ plus the sum of the positive responsibilities candidate exemplar $k$ receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility $r(k,k)$ is negative (indicating that point $k$ is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point $k$ as an exemplar can be increased if some other points have positive responsibilities for point $k$ being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” $a(k,k)$ is updated differently:

$$ a(k,k) \leftarrow \sum_{i' \; s.t. \; i' \neq k} \max \{ 0, r(i',k) \} \qquad (3) $$

This message reflects accumulated evidence that point $k$ is an exemplar, based on the positive responsibilities sent to candidate exemplar $k$ from other points.
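Update rules (2) and (3) can likewise be written as one vectorized pass over columns (a sketch; the helper name and test matrix are assumptions):

```python
import numpy as np

def update_availabilities(R):
    """Eq. 2 off the diagonal, eq. 3 on it: pool positive responsibilities per column."""
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) itself inside each column sum
    A = Rp.sum(axis=0)[None, :] - Rp       # drop the receiving point's own contribution
    dA = A.diagonal().copy()               # eq. 3: self-availability, not thresholded
    A = np.minimum(A, 0)                   # eq. 2: cap the off-diagonal entries at zero
    np.fill_diagonal(A, dA)
    return A

R = np.array([[2.0, -1.0],
              [1.0,  3.0]])
A = update_availabilities(R)
```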

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point $i$, the value of $k$ that maximizes $a(i,k) + r(i,k)$ either identifies point $i$ as an exemplar if $k = i$, or identifies the data point that is the exemplar for point $i$. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to $\lambda$ times its value from the previous iteration plus $1 - \lambda$ times its prescribed updated value, where the damping factor $\lambda$ is between 0 and 1. In all of our experiments (3), we used a default damping factor of $\lambda = 0.5$, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.
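Putting steps (i) to (iii) together, a minimal self-contained sketch of the damped loop (the function name, toy data, and iteration constants are assumptions, not the authors' reference code):

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=500, conv_iter=10):
    """Damped message passing on a similarity matrix S whose diagonal holds preferences."""
    N = S.shape[0]
    idx = np.arange(N)
    A = np.zeros((N, N))
    R = np.zeros((N, N))
    last, stable = None, 0
    for _ in range(max_iter):
        # (i) responsibilities, eq. 1
        AS = A + S
        best = AS.argmax(axis=1)
        max1 = AS[idx, best].copy()
        AS[idx, best] = -np.inf
        max2 = AS.max(axis=1)
        Rnew = S - max1[:, None]
        Rnew[idx, best] = S[idx, best] - max2
        R = damping * R + (1 - damping) * Rnew          # damped update
        # (ii) availabilities, eqs. 2 and 3
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew.diagonal().copy()
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew          # damped update
        # (iii) combine messages and stop once decisions are stable
        labels = (A + R).argmax(axis=1)
        if last is not None and np.array_equal(labels, last):
            stable += 1
            if stable >= conv_iter:
                break
        else:
            stable = 0
        last = labels
    return labels

# toy example: two well-separated one-dimensional clusters
X = np.array([0.0, 1.0, 2.0, 20.0, 21.0, 22.0])
S = -(X[:, None] - X[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(6, dtype=bool)]))  # shared median preference
labels = affinity_propagation(S)   # each point labeled with its exemplar's index
```

On this toy data the middle point of each group ends up as the exemplar, and each point's label is the index of that exemplar.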

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message-passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and $k$-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of $k$-centers clustering (Fig. 2A), which took about the same amount of computer time. We asked whether a huge number of random restarts of $k$-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of $k$-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

Fig. 2. Clustering faces.
Exemplars minimizing the standard squared error measure of similarity were identified from 900 normalized face images (3). For a common preference of −600, affinity propagation found 62 clusters, and the average squared error was 108. For comparison, the best of 100 runs of k-centers clustering with different random initializations achieved a worse average squared error of 119.
(A) The 15 images with highest squared error under either affinity propagation or k-centers clustering are shown in the top row. The middle and bottom rows show the exemplars assigned by the two methods, and the boxes show which of the two methods performed better for that image, in terms of squared error. Affinity propagation found higher-quality exemplars.
(B) The average squared error achieved by a single run of affinity propagation and 10,000 runs of k-centers clustering, versus the number of clusters. The colored bands show different percentiles of squared error, and the number of exemplars corresponding to the result from (A) is indicated.
(C) The above procedure was repeated using the sum of absolute errors as the measure of similarity, which is also a popular optimization criterion.

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “nonexon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of $-\infty$, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between points $i$ and $k$ if $s(i,k) = -\infty$, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27%, or 15 million) of data point pairs.

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and $k$-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas $k$-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true-positive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best $k$-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

Fig. 3. Detecting genes.
Affinity propagation was used to detect putative exons (data points) comprising genes from mouse chromosome 1. Here, squared error is not appropriate as a measure of similarity, but instead similarity values were derived from a cost function measuring proximity of the putative exons in the genome and coexpression of the putative exons across 12 tissue samples (3).
(A) A small portion of the data and the emergence of clusters during each iteration of affinity propagation are shown. In each picture, the 100 boxes outlined in black correspond to 100 data points (from a total of 75,066 putative exons), and the 12 colored blocks in each box indicate the transcription levels of the corresponding DNA segment in 12 tissue samples. The box on the far left corresponds to an artificial data point with infinite preference that is used to account for nonexon regions (e.g., introns). Lines connecting data points indicate potential assignments, where gray lines indicate assignments that currently have weak evidence and solid lines indicate assignments that currently have strong evidence.
(B) Performance on minimizing the reconstruction error of genes, for different numbers of detected clusters. For each number of clusters, affinity propagation took 6 min, whereas 10,000 runs of k-centers clustering took 208 hours on the same computer. In each case, affinity propagation achieved a significantly lower reconstruction error than k-centers clustering.
(C) A plot of true-positive rate versus false-positive rate for detecting exons [using labels from RefSeq (5)] shows that affinity propagation also performs better at detecting biologically verified exons than k-centers clustering.

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric-space clustering techniques such as $k$-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., $s(i,k) \neq s(k,i)$] and to problems where the similarities do not satisfy the triangle inequality [i.e., $s(i,k) < s(i,j) + s(j,k)$]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence $i$ to sentence $k$ based on the cost of encoding the words in sentence $i$ using the words in sentence $k$. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using $\lambda = 0.8$) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.
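To make the asymmetry concrete, here is a toy encoding-cost similarity between sentences (the scoring rule below is a simplification invented for illustration, not the manuscript’s actual cost function):

```python
def word_cost_similarity(sent_i, sent_k):
    """Toy s(i,k): negative count of sentence i's words that sentence k cannot supply."""
    words_i = set(sent_i.lower().split())
    words_k = set(sent_k.lower().split())
    return -len(words_i - words_k)

a = "affinity propagation passes messages between data points"
b = "messages between points"
s_ab = word_cost_similarity(a, b)   # -4: four words of a are missing from b
s_ba = word_cost_similarity(b, a)   #  0: every word of b appears in a
```

Because `s_ab != s_ba`, this similarity is not symmetric, which affinity propagation accepts directly.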

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity $s(i,k)$ was set to the negative time it takes to travel from city $i$ to city $k$ by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs $i$ and $k$, there was a third city $j$ such that the triangle inequality was violated, because the trip from $i$ to $k$ included a long stopover delay in city $j$, so it took longer than the sum of the durations of the trips from $i$ to $j$ and $j$ to $k$. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

Fig. 4. Identifying key sentences and air-travel routing.
Affinity propagation can be used to explore the identification of exemplars on the basis of nonstandard optimization criteria.
(A) Similarities between pairs of sentences in a draft of this manuscript were constructed by matching words. Four exemplar sentences were identified by affinity propagation and are shown.
(B) Affinity propagation was applied to similarities derived from air-travel efficiency (measured by estimated travel time) between the 456 busiest commercial airports in Canada and the United States—the travel times for both direct flights (shown in blue) and indirect flights (not shown), including the mean transfer time of up to a maximum of one stopover, were used as negative similarities (3).
(C)Seven exemplars identified by affinity propagation are color-coded, and the assignments of other cities to these exemplars is shown. Cities located quite near to exemplar cities may be members of other more distant exemplars due to the lack of direct flights between them (e.g., Atlantic City is 100 km from Philadelphia, but is closer in flight time to Atlanta).
(D) The inset shows that the Canada-USA border roughly divides the Toronto and Philadelphia clusters, due to a larger availability of domestic flights compared to international flights. However, this is not the case on the west coast as shown in (E), because extraordinarily frequent airline service between Vancouver and Seattle connects Canadian cities in the northwest to Seattle.

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of $N$ hidden labels, $c_1, \ldots, c_N$, corresponding to the $N$ data points. Each label indicates the exemplar to which the point belongs, so that $s(i, c_i)$ is the similarity of data point $i$ to its exemplar. $c_i = i$ is a special case indicating that point $i$ is itself an exemplar, so that $s(i, c_i)$ is the input preference for point $i$. Not all configurations of the labels are valid; a configuration $c$ is valid when, for every point $i$, if some other point $i'$ has chosen $i$ as its exemplar (i.e., $c_{i'} = i$), then $i$ must be an exemplar (i.e., $c_i = i$). The energy of a valid configuration is $E(c) = -\sum_{i=1}^{N} s(i, c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard $k$-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).
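The energy and the validity constraint translate directly into code (a sketch; the function name and toy matrix are assumptions):

```python
import numpy as np

def energy(S, c):
    """E(c) = -sum_i s(i, c_i), defined only for valid configurations."""
    c = np.asarray(c)
    if not np.all(c[c] == c):   # anyone chosen as an exemplar must label itself
        raise ValueError("invalid configuration")
    return -S[np.arange(len(c)), c].sum()

S = np.array([[-1.0, -2.0],
              [-3.0, -4.0]])    # diagonal entries act as preferences
e0 = energy(S, [0, 0])          # -(s(0,0) + s(1,0)) = 4.0
e1 = energy(S, [1, 1])          # -(s(0,1) + s(1,1)) = 6.0
# energy(S, [1, 0]) would raise: point 0 chooses 1, but c_1 = 0, so 1 is not an exemplar
```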

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if $s(1,2) = s(2,1)$ and $s(1,1) = s(2,2)$, then the solutions $c_1 = c_2 = 1$ and $c_1 = c_2 = 2$ both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.
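A tiny jitter of the similarities, as suggested above, is enough to break such ties (a sketch; the scale `1e-9` and the seeded generator are arbitrary choices):

```python
import numpy as np

S = np.array([[-5.0, -2.0],
              [-2.0, -5.0]])    # degenerate: s(1,2) = s(2,1) and s(1,1) = s(2,2)
rng = np.random.default_rng(0)
S_jit = S + 1e-9 * np.abs(S).max() * rng.standard_normal(S.shape)
# the tie is broken while the similarities are essentially unchanged
```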

在某些退化情况下,能量函数可能有多个极小值,对应于更新规则的多个不动点,这可能阻碍收敛。例如,如果$s(1,2)=s(2,1)$且$s(1,1)=s(2,2)$,则解$c_1=c_2=1$和$c_1=c_2=2$具有相同的能量。在这种情况下,亲和力传播可能产生振荡,两个数据点在范例与非范例之间交替。在实践中,我们发现,通过在相似度上添加微小的噪声以避免退化情形,或者增大阻尼因子,总能避免振荡。
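The noise-plus-damping remedy can be sketched in pure Python. This is our own toy implementation of the damped message updates (the responsibility and availability rules themselves are the ones given earlier in the paper); the four points on a line, the preference value of -5, and the noise magnitude are all made-up example choices. Each damped update mixes the newly computed message with its previous value, and the tiny noise added to the similarities breaks the exact within-pair symmetry that would otherwise cause oscillation:

```python
import random

def affinity_propagation(S, damping=0.5, iters=200):
    # minimal sketch of damped affinity propagation; S[i][i] holds preferences
    N = len(S)
    R = [[0.0] * N for _ in range(N)]  # responsibilities r(i,k)
    A = [[0.0] * N for _ in range(N)]  # availabilities  a(i,k)
    mix = lambda old, new: damping * old + (1 - damping) * new
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k'!=k} [ a(i,k') + s(i,k') ]
        for i in range(N):
            for k in range(N):
                m = max(A[i][kp] + S[i][kp] for kp in range(N) if kp != k)
                R[i][k] = mix(R[i][k], S[i][k] - m)
        for k in range(N):
            pos = [max(0.0, R[ip][k]) for ip in range(N)]
            for i in range(N):
                if i == k:   # a(k,k) = sum_{i'!=k} max(0, r(i',k))
                    new = sum(pos) - pos[k]
                else:        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
                    new = min(0.0, R[k][k] + sum(pos) - pos[i] - pos[k])
                A[i][k] = mix(A[i][k], new)
    # each point's exemplar maximizes a(i,k) + r(i,k)
    return [max(range(N), key=lambda k: A[i][k] + R[i][k]) for i in range(N)]

# two well-separated symmetric pairs on a line; similarity = -squared distance,
# so each pair is exactly the degenerate tie described above
x = [0.0, 1.0, 10.0, 11.0]
S = [[-(xi - xk) ** 2 for xk in x] for xi in x]
for i in range(4):
    S[i][i] = -5.0  # shared preference, chosen so each pair shares one exemplar
random.seed(0)
S = [[s + 1e-3 * random.random() for s in row] for row in S]  # break the tie
labels = affinity_propagation(S)
```

With these toy values the two pairs should emerge as two clusters (`labels[0] == labels[1]` and `labels[2] == labels[3]`), with the noise deciding which member of each pair serves as its exemplar.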

Affinity propagation has several advantages over related techniques. Methods such as $k$-centers clustering (1), $k$-means clustering (1), and the expectation-maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation's advantage of considering many possible solutions all at once.

与相关技术相比,亲和力传播有几个优点。$k$-centers聚类(1)、$k$-means聚类(1)和期望最大化(EM)算法(11)等方法在每一步只存储一个相对较小的估计聚类中心集合。从大量簇开始、然后对其进行修剪的方法(12)改进了这些技术,但它们仍然依赖随机抽样,且修剪决策一旦做出便无法挽回。相反,通过同时将所有数据点视为候选中心并逐步识别簇,亲和力传播能够避免许多由糟糕的初始化和生硬的决策导致的劣解。马尔可夫链蒙特卡罗技术(13)随机搜索好的解,但不具备亲和力传播同时考虑多个可能解的优势。

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

层次凝聚聚类(14)和谱聚类(15)解决的是一个相当不同的问题:通过递归比较点对来寻找数据的划分。这些技术并不要求簇内所有点都与某个单一中心相似,因此不适合许多任务。特别是,不应处于同一簇中的两个点,可能会因一系列不巧的成对合并而被归为一组。

In (8), it was shown that the related metric $k$-median problem could be relaxed to form a linear program with a constant-factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).

在(8)中,相关的度量k-median问题被证明可以松弛为具有常数因子近似保证的线性规划。那里假设输入是度量的,即非负、对称且满足三角不等式。相反,亲和力传播可以接受一般的非度量相似性作为输入。亲和力传播还提供了一种概念上全新的、在实践中表现良好的方法。线性规划松弛难以求解,需要使用复杂的软件包(例如CPLEX),而亲和力传播使用直观的消息更新,只需几行代码即可实现(2)。

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a "loopy graph" has been used to approach Shannon's limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

亲和力传播在思想上与最近在一些相当不同的学科中取得破纪录成果的技术相关(16)。在“循环图”中递归传播消息(17)的方法已被用于在纠错解码中逼近香农极限(18,19)、求解规模增大一个数量级的随机可满足性问题(20)、求解NP难的二维相位展开问题的实例(21),以及高效地从立体图像对中估计深度(22)。然而,据我们所知,亲和力传播是第一个利用这一思想来解决古老而基本的数据聚类问题的方法。鉴于其简单性、普遍适用性和性能,我们相信亲和力传播将在科学和工程中展现出广泛的价值。