Notes on Introduction to Information Retrieval, Chapter 17 (in English)
Hierarchical clustering
Abstract
The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic.
Hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.
The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM.
There are differences between the applications of flat and hierarchical clustering in IR: we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering is a concern.
Hierarchical agglomerative clustering
HAC (hierarchical agglomerative clustering) is bottom-up hierarchical clustering, which is frequently used in IR.
- HAC is visualized as a dendrogram
The merge operation is monotonic: if s1, s2, …, sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ … ≥ sK−1 holds.
- Cutting point
- Cut at a prespecified level of similarity.
- Cut the dendrogram where the gap between two successive combination similarities is largest
- Apply the equation K = argmin_{K'} [RSS(K') + λ·K'],
where K' refers to the cut of the hierarchy that results in K' clusters, RSS is the residual sum of squares, and λ is a penalty for each additional cluster.
- Pseudocode (a sketch follows this list)
- The four variants of HAC differ in their merge criteria: single-link, complete-link, group-average, and centroid clustering.
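The book's pseudocode is not reproduced here; the following is a minimal Python sketch of naive HAC, assuming length-normalized document vectors (so cosine similarity is the dot product). For readability it recomputes cluster similarities from scratch in every iteration, so it is even slower than the Θ(N^3) matrix-based formulation; the merge criterion is passed in as a function (see the single-link and complete-link examples in the next section).

```python
import numpy as np

def naive_hac(docs, cluster_sim):
    """Naive HAC sketch: repeatedly merge the two most similar clusters.

    docs: (N, M) NumPy array of length-normalized document vectors.
    cluster_sim: function (docs, members_a, members_b) -> similarity.
    Returns the list of merges as (members_a, members_b, similarity).
    """
    clusters = [[i] for i in range(len(docs))]   # start with one cluster per document
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the highest similarity
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(docs, clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), s))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return merges
```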
Single-link and complete-link clustering
In single-link clustering, the similarity of two clusters is the similarity of their most similar members. This merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other, while other, more distant parts of the clusters and the clusters' overall structure are not taken into account.
In complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar members. This merge criterion is nonlocal: the entire structure of the clustering can influence merge decisions. The result is a preference for compact clusters with small diameters over long, straggly clusters, but also a sensitivity to outliers.
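As a concrete illustration (not the book's code), the two merge criteria can be written as follows and plugged into the `naive_hac` sketch above; both assume length-normalized vectors so that cosine similarity is the dot product.

```python
import numpy as np

def single_link_sim(docs, a, b):
    """Single link: similarity of the two MOST similar members, one from each cluster."""
    return max(float(np.dot(docs[i], docs[j])) for i in a for j in b)

def complete_link_sim(docs, a, b):
    """Complete link: similarity of the two LEAST similar members, one from each cluster."""
    return min(float(np.dot(docs[i], docs[j])) for i in a for j in b)
```

For example, `naive_hac(docs, single_link_sim)` yields the sequence of single-link merges, and `naive_hac(docs, complete_link_sim)` the complete-link merges.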
- Comparison
Left: The single-link similarity of the two upper two-point clusters is the similarity of d2 and d3 (solid line), which is greater than the single-link similarity of the two left two-point clusters (dashed line).
Right: The complete-link similarity of the two upper two-point clusters is the similarity of d1 and d4 (dashed line), which is smaller than the complete-link similarity of the two left two-point clusters (solid line).
In addition to the single-link dendrogram above, there is a corresponding dendrogram for complete-link clustering:
Note:
Graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ sk; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ sk.
Single-link clustering can produce long, straggling clusters; this effect is called chaining.
Complete-link clustering pays too much attention to outliers; as a result, it sometimes cannot find the most intuitive cluster structure.
Time complexity
Naive HAC has time complexity Θ(N^3).
A more efficient alternative is the priority-queue algorithm, which has time complexity Θ(N^2 log N).
- The priority-queue algorithm works with all four similarity measures.
- Single-link optimization: a next-best-merge (NBM) array brings single-link clustering down to Θ(N^2) time (see the sketch below).
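Below is a rough Python sketch of a Θ(N^2)-style single-link algorithm with a next-best-merge array; names such as `single_link_nbm` are illustrative, not the book's. Single link admits this optimization because, after a merge, the similarity of the merged cluster to any other cluster is simply the maximum of the two old similarities, so best-merge candidates never have to be recomputed from the documents.

```python
import numpy as np

def single_link_nbm(docs):
    """Single-link HAC using a similarity matrix and a next-best-merge (NBM) array.

    docs: (N, M) NumPy array of length-normalized document vectors.
    Returns the merges as (representative_a, representative_b, similarity).
    """
    N = len(docs)
    C = (docs @ docs.T).astype(float)     # pairwise cosine similarities
    np.fill_diagonal(C, -np.inf)          # a cluster is never merged with itself
    active = list(range(N))               # representatives of the current clusters
    nbm = {k: int(np.argmax(C[k])) for k in active}   # best merge partner per cluster
    merges = []
    for _ in range(N - 1):
        # choose the cluster whose next-best merge has the highest similarity
        i = max(active, key=lambda k: C[k, nbm[k]])
        j = nbm[i]
        merges.append((i, j, float(C[i, j])))
        # single link: similarity of the merged cluster (kept under index i)
        # to any other cluster m is the max of the two old similarities
        for m in active:
            if m not in (i, j):
                C[i, m] = C[m, i] = max(C[i, m], C[j, m])
        C[j, :] = -np.inf                 # deactivate cluster j entirely
        C[:, j] = -np.inf
        active.remove(j)
        # refresh best-merge pointers that belonged to i or pointed at i or j
        for m in active:
            if len(active) > 1 and (m == i or nbm[m] in (i, j)):
                nbm[m] = int(np.argmax(C[m]))
    return merges
```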
Group-average agglomerative clustering
GAAC evaluates cluster quality based on all similarities between documents; it is also called group-average clustering and average-link clustering.
GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster, but self-similarities are not included in the average:

SIM-GA(ωi, ωj) = 1/((Ni + Nj)(Ni + Nj − 1)) · Σ_{dm ∈ ωi ∪ ωj} Σ_{dn ∈ ωi ∪ ωj, dn ≠ dm} dm · dn

where Ni and Nj are the numbers of documents in ωi and ωj, and the dm are length-normalized document vectors.
- Compute SIM-GA efficiently
Because the documents are length-normalized, the dot product of the sum vector of ωi ∪ ωj with itself counts every pairwise similarity plus the Ni + Nj self-similarities of 1.0, so we have:

SIM-GA(ωi, ωj) = 1/((Ni + Nj)(Ni + Nj − 1)) · [(Σ_{dm ∈ ωi ∪ ωj} dm)² − (Ni + Nj)]

where the square denotes the dot product of the sum vector with itself (a code sketch follows the list of requirements below).
Note:
GAAC Requires:
- documents represented as vectors
- length normalization of vectors, so that self-similarities are 1.0
- the dot product for computing the similarity between vectors and sums of vectors.
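Given these requirements, SIM-GA can be evaluated from the clusters' sum vectors alone; here is a small sketch (the function name `sim_ga` and its argument layout are illustrative).

```python
import numpy as np

def sim_ga(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of two clusters computed from their vector sums.

    sum_i, sum_j: sums of the length-normalized document vectors in each cluster.
    n_i, n_j: number of documents in each cluster.
    """
    s = sum_i + sum_j                    # sum vector of the merged cluster
    n = n_i + n_j
    # s.s sums dm.dn over every ordered pair (m, n), including the n
    # self-similarities, which equal 1.0 for length-normalized vectors
    return (float(np.dot(s, s)) - n) / (n * (n - 1))
```

When two clusters are merged, the sum vector of the new cluster is simply sum_i + sum_j, so candidate merges can be scored without touching individual documents again.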
Centroid clustering
In centroid clustering, the similarity of two clusters is defined as the similarity (dot product) of their centroids: SIM-CENT(ωi, ωj) = μ(ωi) · μ(ωj).
The difference between GAAC and centroid clustering is that GAAC considers all pairs of documents when computing average pairwise similarity, whereas centroid similarity amounts to averaging only over pairs of documents from different clusters.
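For illustration (assuming length-normalized NumPy vectors, not the book's code), centroid similarity can be computed directly from the two centroids:

```python
import numpy as np

def sim_cent(docs_i, docs_j):
    """Centroid similarity: dot product of the two cluster centroids."""
    mu_i = np.mean(docs_i, axis=0)       # centroid of cluster i
    mu_j = np.mean(docs_j, axis=0)       # centroid of cluster j
    return float(np.dot(mu_i, mu_j))
```

Expanding the dot product shows that this equals the average similarity over pairs of documents drawn from different clusters, which is exactly how it differs from SIM-GA.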
- Example
Three iterations of centroid clustering. Each iteration merges the two clusters whose centroids are closest.
Note:
Centroid clustering is not monotonic; the intersection (an inversion) is circled in the dendrogram.
- Similarity measure
Centroid similarity is conceptually simpler than the average of all pairwise similarities in GAAC.
Optimality of hierarchical agglomerative clustering
Define the combination similarity COMB-SIM of a clustering as the smallest combination similarity of any of its clusters:
Centroid clustering is not optimal because inversions can occur.
- single-link
The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:
- complete link
The combination similarity of a cluster ω is the smallest similarity of any two points in ω:
- GAAC:
The combination similarity of a cluster ω is the average of all pairwise similarities in ω
Note:
If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.
- Comparison of HAC algorithms
GAAC is recommended for document clustering because it is generally the method that produces clusterings with the best properties for applications: it does not suffer from chaining, from sensitivity to outliers, or from inversions.
- first story detection or novelty detection
- One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents.
For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001, form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space – and not global structure – that is important in this case.
- an approach to duplicate detection on the web
Cluster labeling
Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
Three labeling methods can be compared on a K-means clustering: the highest-weighted terms of the cluster centroid, terms selected by mutual information, and the title of the document closest to the centroid.
Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method.
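A minimal sketch of this cluster-internal method (the inputs `docs`, `titles`, and `member_ids` are assumed, and length-normalized vectors are again presupposed):

```python
import numpy as np

def title_label(docs, titles, member_ids):
    """Label a cluster with the title of its document closest to the centroid.

    docs: (N, M) NumPy array of length-normalized document vectors.
    titles: list of N document titles.
    member_ids: list of indices of the documents in the cluster.
    """
    members = docs[member_ids]
    centroid = members.mean(axis=0)
    # with length-normalized vectors, the largest dot product marks the closest document
    closest = member_ids[int(np.argmax(members @ centroid))]
    return titles[closest]
```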
Implementation notes
Most problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering.
Centroids, however, tend to be dense, which makes computing dot products with them expensive. To address this problem, we can use:
- truncating centroids (keeping only highly weighted terms)
- representing clusters by means of sparse medoids instead of dense centroids.
These optimizations can also be applied to GAAC and centroid clustering.
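As a small sketch of the first optimization (the cutoff of 50 terms is an arbitrary illustration, not a value from the book):

```python
import numpy as np

def truncate_centroid(centroid, k=50):
    """Keep only the k highest-weighted terms of a centroid; zero out the rest."""
    truncated = np.zeros_like(centroid)
    top = np.argsort(centroid)[-k:]      # indices of the k largest weights
    truncated[top] = centroid[top]
    return truncated
```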
We can employ an HAC algorithm to compute seeds of high quality for K-means. If the HAC algorithm is applied to a document subset of size √N, then the overall runtime of K-means cum HAC seed generation is Θ(N), because applying a quadratic algorithm to a sample of size √N has overall complexity Θ(N). An appropriate adjustment can be made for a Θ(N^2 log N) algorithm to guarantee linearity. This combination is called the Buckshot algorithm.
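A rough sketch of the Buckshot idea using scikit-learn's stock implementations as stand-ins (the helper name `buckshot` and the use of group-average HAC on the sample are illustrative assumptions; the sketch also assumes √N ≥ K):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def buckshot(docs, K, seed=0):
    """Buckshot-style clustering: HAC on a sqrt(N) sample seeds K-means on all docs.

    docs: (N, M) NumPy array of document vectors.
    K: desired number of clusters (must not exceed sqrt(N)).
    """
    rng = np.random.default_rng(seed)
    N = len(docs)
    sample_ids = rng.choice(N, size=int(np.sqrt(N)), replace=False)
    sample = docs[sample_ids]
    # quadratic HAC is affordable here because it only sees sqrt(N) documents
    labels = AgglomerativeClustering(n_clusters=K, linkage="average").fit_predict(sample)
    seeds = np.stack([sample[labels == k].mean(axis=0) for k in range(K)])
    # K-means over all N documents, seeded with the HAC cluster centroids
    return KMeans(n_clusters=K, init=seeds, n_init=1).fit_predict(docs)
```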