【论文翻译】Clustering by Passing Messages Between Data Points

【论文题目】:Clustering by Passing Messages Between Data Points
【论文来源】:Clustering by Passing Messages Between Data Points
【翻译人】:[email protected]实验室

Clustering by Passing Messages Between Data Points

通过在数据点之间传递消息进行聚类

Brendan J. Frey* and Delbert Dueck

Abstract

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

摘要

通过识别代表性示例的子集来对数据进行聚类,对于处理感官信号和检测数据中的模式非常重要。这样可以通过随机选择数据点的初始子集,然后进行迭代细化来找到“示例”,但这仅在初始选择接近于一个好的解决方案的情况下才有效。我们设计了一种称为“亲和力传播”的方法,该方法将输入的数据点对之间的相似性作为输入度量。在数据点之间交换实值消息,直到高质量的示例集和相应的簇逐渐出现。我们使用亲和力传播对人脸图像进行聚类,检测微阵列数据中的基因,确定该手稿中的代表性句子以及确定可通过航空旅行有效访问的城市。亲和传播发现群集的错误率比其他方法低得多,并且这样做的时间少于一百分之一。

正文

clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

基于相似性度量的数据聚类是科学数据分析和工程系统中的关键步骤。一种常见的方法是使用数据来学习一组中心,这样数据点与其最近的中心之间的平方误差之和很小。当中心从实际数据点中选择时,它们被称为“样本”。流行的k-中心聚类技术(1)从随机选择的初始样本集开始,迭代地细化该集合,以减少平方误差之和。k-centers聚类对样本的初始选择非常敏感,因此通常会用不同的初始化方法多次重新运行,试图找到一个好的解决方案。然而,只有当集群数量较少且至少有一个随机初始化接近一个好的解决方案的可能性很高时,这种方法才有效。我们采用了一种完全不同的方法,并引入了一种同时将所有数据点视为潜在范例的方法。通过将每个数据点视为网络中的一个节点,我们设计了一种沿着网络边缘递归传输实值消息的方法,直到出现一组好的样本和相应的簇。如后文所述,在简单公式的基础上更新消息,这些公式搜索适当选择的能量函数的最小值。在任何时间点,每个消息的大小都反映了一个数据点选择另一个数据点作为其示例的当前亲和力,因此我们将我们的方法称为“AP聚类算法”。图1A说明了在消息传递过程中集群是如何逐渐出现的。

【论文翻译】Clustering by Passing Messages Between Data Points

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): For points x i and xk , s(i,k) = −||x i − xk || 2 . Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.

相似性传播将数据点之间的实值相似性集合作为输入,其中相似度s(i,k)表示索引为k的数据点适合作为数据点i的样本。当目标是最小化平方误差时,每个相似性都被设置为负平方误差(欧几里得距离):对于xi和xk,s(i, k) =−||xi − xk||^2。实际上,当优化标准更一般时,这里描述的方法可以应用。随后我们将描述在这些任务中,从成对的图像、成对的微阵列测量、成对的英语句子和成对的城市中得出相似性。当样本相关概率模型可用时,s(i, k)可以设置为数据点i的对数似然,因为它的样本是k点。或者,在适当的情况下可以手动设置相似性。

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common valuethis value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).

与要求预先指定簇的数目不同,相似性传播将每个数据点k的实数s(k, k)作为输入,以便更有可能选择s(k, k)值较大的数据点作为示例。这些值被称为“首选项”。已识别的样本数(聚类的数量)受输入首选项值的影响,但也来自消息传递过程。如果这是先验的(即所有的数据点都同样适合作为范例),那么首选项应该设置为一个共同的值-这个值可以改变以产生不同数量的聚类。共享值可以是输入相似性的中间值(产生适度数量的聚类)或其最小值(产生少量聚类)。

There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule

【论文翻译】Clustering by Passing Messages Between Data Points

In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.

在数据点之间有两种类型的消息交换,每种类型都考虑到不同类型的竞争。可以在任何阶段组合消息,以确定哪些点是示例,对于其他每个点,可以确定它们属于哪个示例。从数据点i发送到候选示例点k的“责任” r(i, k)反映了考虑到的其他潜在示例,积累的证据表明适合点k充当点i的示例。点i(图1B)。从候选示例点k发送到点i的“可用性” a(i, k)反映了累积的证据,表明考虑到其他点的支持,点i选择点k作为示例是多么合适。该点k应该是一个示例(图1C)。r(i, k)和a(i, k)可以看作是对数概率比。首先将可用性初始化为零:a(i, k)=0。然后使用规则计算

【论文翻译】Clustering by Passing Messages Between Data Points

在第一次迭代中,由于可用性为零,r(i,k)设为点i和点k之间的输入相似度作为其样本,减去点i与其他候选样本之间的最大相似度。这个竞争性的更新是数据驱动的,没有考虑到有多少其他点有利于每个候选范例。在以后的迭代中,当一些点被有效地分配给其他示例时,它们的可用性将下降到零以下,正如下面的更新规则所规定的那样。这些负有效性将降低上述规则中某些输入相似性s(i,k’)的有效值,从而从竞争中删除相应的候选样本。对于k=i,责任r(k,k)被设置为输入偏好,即选择点k作为样本,s(k,k),减去点i和所有其他候选样本之间的最大相似度。这种“自我责任”反映了累积的证据,即k点是一个范例,基于它的输入偏好,它被分配给另一个范例有多不合适。

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:

【论文翻译】Clustering by Passing Messages Between Data Points

鉴于上述职责更新使所有候选样例都能争夺数据点的所有权,而以下可用性更新会从数据点收集证据,以证明每个候选样例是否都能成为一个好的样例:

【论文翻译】Clustering by Passing Messages Between Data Points

The availability a(i,k) is set to the selfresponsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the selfresponsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:

【论文翻译】Clustering by Passing Messages Between Data Points

This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.

可用性a(i,k)设置为自负r(k,k)加上候选样例k从其他点收到的积极责任之和。仅添加传入职责的积极部分,因为只有一个好的榜样才需要很好地解释某些数据点(积极职责),而不管它对其他数据点的解释(负面职责)有多糟糕。如果自负r(k,k)为负(表示点k当前更适合属于另一个示例而不是示例本身),则如果其他一些点具有正值,则可以增加点k作为示例的可用性 k点的责任就是他们的榜样。为了限制强大的积极主动责任的影响,对总和设置阈值,以使总和不能超过零。“自我可用性” a(k,k)的更新方式有所不同:

【论文翻译】Clustering by Passing Messages Between Data Points

该消息基于从其他点发送到候选样例k的积极责任,反映了点k是样例的累积证据。

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to l times its value from the previous iteration plus 1 – l times its prescribed updated value, where the damping factor l is between 0 and 1. In all of our experiments (3), we used a default damping factor of l = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.

上面的更新规则只需要简单的、易于实现的局部计算(2),并且消息只需要在具有已知相似性的点对之间交换。在亲和性传播过程中的任何时候,可用性和责任都可以组合起来以识别样本。对于点i,最大化a(i,k)+r(i,k)的k值要么将点i标识为示例(如果k=i),要么标识作为点i示例的数据点。消息传递过程可以在经过固定次数的迭代后终止,在消息的更改降到阈值以下时,或者在局部决策保持一定数量的迭代之后。在更新消息时,重要的是要对消息进行阻尼,以避免在某些情况下出现数值振荡。将每条信息设置为λ乘以上一次迭代的值加上1–λ乘以其规定的更新值,其中阻尼系数λ在0和1之间。在我们的所有实验(3)中,我们使用默认阻尼因子λ=0.5,并且AP聚类的每次迭代包括(i)更新给定可用性的所有责任,(ii)更新给定责任的所有可用性,以及(iii)结合可用性和责任来监视示例决策,并在这些决策在10次迭代中没有变化时终止算法。

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the messagepassing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

图1A显示了应用于25个二维数据点(3)的相似性传播动力学,使用负平方误差作为相似性。AP聚类算法的一个优点是不需要预先指定样本的数量。相反,消息传递方法中会出现适当数量的示例,并取决于输入示例首选项。这使得自动模型选择成为可能,这是基于对每个点作为一个范例的优先程度的预先说明。图1D显示了公共输入首选项的值对集群数量的影响。这个关系几乎与通过精确最小化平方误差(2)得到的关系相同。

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time. We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

我们接着研究了用标准的平方误差优化准则对人脸图像进行聚类的问题。我们使用AP聚类和k-中心聚类来识别从奥利维蒂人脸数据库(3)中提取的900幅灰度图像样本。AP聚类算法发现样本的平方误差远低于k-中心聚类(图2A)100次中最好的一次,后者花费了大约相同的计算机时间。我们询问大量随机重启k-中心聚类是否可以获得相同的平方误差。图2B显示了一次AP聚类算法所获得的误差,以及10000次k-中心聚类所获得的误差分布,并根据簇的数量绘制。AP聚类在两个数量级以上的时间内均匀地获得了更低的误差。另一个流行的优化标准是绝对像素差的总和(它可以更好地容忍离群的像素强度),因此我们使用这个误差度量重复上述过程。AP聚类再次均匀地获得较低的误差(图2C)。

【论文翻译】Clustering by Passing Messages Between Data Points

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “nonexon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of − ∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = − ∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.

许多任务需要在稀疏相关数据中识别示例,这些数据要么大多数相似性未知,要么数据量巨大并且质量较差。为了检查在这种情况下的亲和力传播,我们使用从微阵列数据得出并在(4)中报道的稀疏相似矩阵,解决了将推定外显子聚类以寻找基因的任务。在这项工作中,从小鼠染色体1的基因组中提取了与假定的外显子相对应的75,066个DNA片段(长60个碱基)。在12个组织样本中测量了它们的转录水平,并且每对假定的外显子之间都具有相似性(数据点)计算。推定外显子之间相似性的度量是基于它们在基因组中的接近程度以及它们在12个组织中转录水平的协调程度。为了说明不是外显子的推定外显子(例如内含子),我们加入了一个额外的人工样本,并使用整个数据集的统计数据确定了每个数据点与此“非外显子样本”的相似性。所得的75,067×75,067相似性矩阵(3)由99.73%的相似性组成-∞值,对应于可能不是同一基因一部分的遥远DNA片段。我们将亲和力传播应用于此相似性矩阵,但是因为如果s(i,k)= −∞时不需要在点i和k之间交换消息,所以亲和力传播的每次迭代都只需要在很小的子集(0.27%或15百万)之间交换消息的数据点对。

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and kcenters clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the truepositive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

图3A说明了基因簇的鉴定以及一些数据点对非外显子样本的分配。在图3B中比较了亲和力传播和kcenters聚类的重构误差。对于每个数量的群集,亲和力传播运行一次,耗时6分钟,而k中心群集运行10,000次,耗时208小时。为了解决这些方法在检测善意基因片段中表现如何的问题,图3C使用RefSeq数据库中提供的标记绘制了真阳性(TP)率与假阳性(FP)率的关系图(5)。亲和力传播实现了更高的TP速率,尤其是在低FP速率下,这对生物学家而言最重要。在FP率为3%时,亲和力传播的TP率为39%,而最佳k中心聚类结果为17%。为了进行比较,在相同的FP速率下,用于等级团聚聚类(2)的最佳TP率为19%,而(4)中描述的工程工具(它提供了更多的生物学知识)的TP率为43%。

【论文翻译】Clustering by Passing Messages Between Data Points

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metricspace clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i,j) + s( j,k)]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using l = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

AP聚类基于非标准优化准则的能力使得它适合于使用不寻常的相似性度量进行探索性数据分析。与度量空间聚类技术(如k-means聚类(1))不同,AP聚类可以应用于数据不在连续空间中的问题。实际上,它可以应用于相似性不对称的问题[即s(i,k)≠s(k,i)]和相似性不满足三角形不等式[即s(i,k)<s(i,j)+s(j,k)]。为了找出本文草稿中总结其他句子的少量句子,我们将每个句子视为“单词包”(6),并根据使用k句子中的单词对第i句中的单词进行编码的成本,计算出第i句与第k句的相似度。我们发现97%的相似性(2,3)不是对称的。调整偏好以识别(使用λ=0.8)不同数量的代表性例句(2),四个句子的解决方案如图4A所示。

【论文翻译】Clustering by Passing Messages Between Data Points

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j so it took longer than the sum of the durations of the trips from i to j and j to k. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

我们还应用AP聚类来探索确定有限数量的加拿大和美国城市的问题,这些城市最容易被其他大城市的子集所访问,根据估计的商业航空旅行时间。每个数据点都是一个城市,相似性s(i,k)被设置为乘飞机从城市i到城市k所需的负时间,包括估计的中途停留延误(3)。由于逆风,在许多情况下,运输时间因旅行方向不同而不同,因此36%的相似性是不对称的。此外,对于97%的城市对i和k,存在第三个城市j,从而违反了三角不等式,因为从i到k的行程包括城市j的长时间停留延迟,因此它比从i到j和j到k的行程持续时间总和要长。当“最容易到达的城市”的数量被限制为7个(通过适当地调整输入偏好),识别出图4中B至E所示的城市。有趣的是,几个主要城市没有被选中,要么是因为大量的国际旅行使得它们不适合作为容易到达的国内目的地(如纽约市、洛杉矶),要么是因为它们的社区可以通过其他目的地更有效地到达(如亚特兰大、费城,明尼阿波利斯是芝加哥的目的地,同时避免了潜在的机场延误)。

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, c1 ,…,cN , corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that s(i,ci ) is the similarity of data point i to its exemplar. c i = i is a special case indicating that point i is itself an exemplar, so that s(i,ci ) is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., c i′ = i), then i must be an exemplar (i.e., c i = i). The energy of a valid configuration is E(c) = − ∑ i=1 N s(i,ci ). Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).

AP聚类可以看作是一种搜索能量函数(7)的极小值的方法,该能量函数依赖于与NN数据点相对应的一组NN隐藏标签c1,…,cN。每个标签都表示该点所属的范例,因此s(i,ci)是数据点i与其范例的相似性。ci=i是一个特例,表明点i本身就是一个范例,因此s(i,ci)是点i的输入偏好。不是所有的标签配置都是有效的;配置c是有效的,当对每个点i,如果其他点i’选择i作为其范例(即,ci’=i),那么i必须是一个范例(即ci=i。有效配置的能量为。精确地最小化能量在计算上是困难的,因为这个最小化问题的一个特例是NP难k-中值问题(8)。然而,亲合传播的更新规则对应于最小化Bethe*能(9)近似的不动点递归。AP聚类作为因子图(10)中描述标签约束和能量函数(2)的最大和算法的实例最容易导出。

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c 1 = c 2 = 1 and c 1 = c 2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.

在某些退化情形下,能量函数可能具有多个极小值,且更新规则的多个不动点可能会阻止收敛。例如,如果s(1,2)=s(2,1)和s(1,1)=s(2,2),则解c1=c2=1和c1=c2=2都获得相同的能量。在这种情况下,亲和性传播可能会振荡,两个数据点在样本和非样本之间交替。在实践中,我们发现通过在相似点上添加少量的噪声来防止退化情况,或者通过增加阻尼因子来避免振荡。

Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering(1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques(13) randomly search for good solutions, but do not share affinity propagation's advantage of considering many possible solutions all at once.

与相关技术相比,AP聚类有几个优点。像k-中心聚类(1)、k-均值聚类(1)和期望最大化(EM)算法(11)这样的方法在每个步骤存储一个相对较小的估计聚类中心集。这些技术通过从大量的簇开始,然后对它们进行修剪的方法得到了改进(12),但是它们仍然依赖于随机抽样,并且做出无法恢复的硬修剪决策。相反,通过同时考虑所有数据点作为候选中心并逐步识别簇,AP聚类能够避免由于不吉利的初始化和艰难的决策而导致的许多糟糕的解决方案。马尔可夫链蒙特卡罗技术(13)随机搜索好的解决方案,但不共享AP聚类的优点,即一次考虑许多可能的解决方案。

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

层次聚集聚类(14)和谱聚类(15)解决了递归比较点对以找到数据分区的完全不同的问题。这些技术并不要求一个簇内的所有点都类似于一个中心,因此不太适合许多任务。特别是,不应该在同一个簇中的两个点可能会被一个不幸的成对分组序列组合在一起。

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).

在(8)中,我们证明了相关的度量k-中值问题可以用常数因子近似来松弛成线性规划。在这里,假设输入是度量的,即非负的,对称的,满足三角不等式的。相反,AP聚类可以将一般非度量相似性作为输入。AP聚类还提供了一种概念上新的方法,在实践中效果良好。虽然线性规划松弛很难解决,并且需要应用复杂的软件包(例如,CPLEX),但AP聚类利用了可以在几行代码中实现的直观消息更新(2)。

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

AP聚类在精神上与最近在不同学科中获得破纪录结果的技术有关(16)。在“循环图”中递归传播消息(17)的方法已用于逼近纠错解码(18,19)中的香农极限,解决大小增加一个数量级的随机可满足性问题(20),解决NP难二维相位展开问题(21),并有效地估计立体图像对的深度(22)。然而,据我们所知,AP聚类是利用这种思想来解决古老的、基本的数据聚类问题的第一种方法。由于它的简单性,普遍适用性和性能,我们相信AP聚类将被证明在科学和工程中具有广泛的价值。