Notes on Introduction to Information Retrieval, Chapter 17 (in English)
Hierarchical clustering
Abstract
The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic.
Hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.
The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM.
There are differences between the applications of flat and hierarchical clustering in IR: we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering is a concern.
Hierarchical agglomerative clustering
HAC (hierarchical agglomerative clustering) is bottom-up hierarchical clustering, which is frequently used in IR.
- HAC is visualized as a dendrogram
The merge operation is monotonic: if s1, s2, …, sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ … ≥ sK−1 holds.
- Cutting point
- Cut at a prespecified level of similarity.
- Cut the dendrogram where the gap between two successive combination similarities is largest
- Apply the equation K = argmin_{K'} [RSS(K') + λ·K'],
where K' refers to the cut of the hierarchy that results in K' clusters, RSS is the residual sum of squares, and λ is a penalty for each additional cluster.
- Pseudocode (a sketch follows this list)
- The four variants of HAC differ in their merge criteria: single-link, complete-link, group-average, and centroid clustering.
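The book's pseudocode is not reproduced here; the following is a minimal Python sketch of naive HAC, assuming length-normalized document vectors (so cosine similarity is the dot product). For readability it recomputes cluster similarities from scratch in every iteration, so it is even slower than the Θ(N^3) matrix-based formulation; the merge criterion is passed in as a function (see the single-link and complete-link examples in the next section).

```python
import numpy as np

def naive_hac(docs, cluster_sim):
    """Naive HAC sketch: repeatedly merge the two most similar clusters.

    docs: (N, M) NumPy array of length-normalized document vectors.
    cluster_sim: function (docs, members_a, members_b) -> similarity.
    Returns the list of merges as (members_a, members_b, similarity).
    """
    clusters = [[i] for i in range(len(docs))]   # start with one cluster per document
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the highest similarity
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(docs, clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), s))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return merges
```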
Single-link and complete-link clustering
In single-link clustering, the similarity of two clusters is the similarity of their most similar members. This merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other, while other, more distant parts of the clusters and the clusters' overall structure are not taken into account.
In complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar members. This merge criterion is nonlocal: the entire structure of the clustering can influence merge decisions. The result is a preference for compact clusters with small diameters over long, straggly clusters, but also a sensitivity to outliers.
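As a concrete illustration (not the book's code), the two merge criteria can be written as follows and plugged into the `naive_hac` sketch above; both assume length-normalized vectors so that cosine similarity is the dot product.

```python
import numpy as np

def single_link_sim(docs, a, b):
    """Single link: similarity of the two MOST similar members, one from each cluster."""
    return max(float(np.dot(docs[i], docs[j])) for i in a for j in b)

def complete_link_sim(docs, a, b):
    """Complete link: similarity of the two LEAST similar members, one from each cluster."""
    return min(float(np.dot(docs[i], docs[j])) for i in a for j in b)
```

For example, `naive_hac(docs, single_link_sim)` yields the sequence of single-link merges, and `naive_hac(docs, complete_link_sim)` the complete-link merges.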
- Comparison
Left: The single-link similarity of the two upper two-point clusters is the similarity of d2 and d3 (solid line), which is greater than the single-link similarity of the two left two-point clusters (dashed line).
Right: The complete-link similarity of the two upper two-point clusters is the similarity of d1 and d4 (dashed line), which is smaller than the complete-link similarity of the two left two-point clusters (solid line).
In addition to the single-link dendrogram above, there is a corresponding dendrogram for complete-link clustering:
Note:
Graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ sk; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ sk.
Single-link clustering can produce long, straggling clusters; this effect is called chaining.
Complete-link clustering pays too much attention to outliers; as a result, it sometimes cannot find the most intuitive cluster structure.
Time complexity
Naive HAC has time complexity Θ(N^3).
A more efficient alternative is the priority-queue algorithm, which has time complexity Θ(N^2 log N).
- The priority-queue algorithm works with all four similarity measures.
- Single-link optimization: a next-best-merge (NBM) array brings single-link clustering down to Θ(N^2) time (see the sketch below).
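Below is a rough Python sketch of a Θ(N^2)-style single-link algorithm with a next-best-merge array; names such as `single_link_nbm` are illustrative, not the book's. Single link admits this optimization because, after a merge, the similarity of the merged cluster to any other cluster is simply the maximum of the two old similarities, so best-merge candidates never have to be recomputed from the documents.

```python
import numpy as np

def single_link_nbm(docs):
    """Single-link HAC using a similarity matrix and a next-best-merge (NBM) array.

    docs: (N, M) NumPy array of length-normalized document vectors.
    Returns the merges as (representative_a, representative_b, similarity).
    """
    N = len(docs)
    C = (docs @ docs.T).astype(float)     # pairwise cosine similarities
    np.fill_diagonal(C, -np.inf)          # a cluster is never merged with itself
    active = list(range(N))               # representatives of the current clusters
    nbm = {k: int(np.argmax(C[k])) for k in active}   # best merge partner per cluster
    merges = []
    for _ in range(N - 1):
        # choose the cluster whose next-best merge has the highest similarity
        i = max(active, key=lambda k: C[k, nbm[k]])
        j = nbm[i]
        merges.append((i, j, float(C[i, j])))
        # single link: similarity of the merged cluster (kept under index i)
        # to any other cluster m is the max of the two old similarities
        for m in active:
            if m not in (i, j):
                C[i, m] = C[m, i] = max(C[i, m], C[j, m])
        C[j, :] = -np.inf                 # deactivate cluster j entirely
        C[:, j] = -np.inf
        active.remove(j)
        # refresh best-merge pointers that belonged to i or pointed at i or j
        for m in active:
            if len(active) > 1 and (m == i or nbm[m] in (i, j)):
                nbm[m] = int(np.argmax(C[m]))
    return merges
```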
Group-average agglomerative clustering
GAAC evaluates cluster quality based on all similarities between documents; it is also called group-average clustering and average-link clustering.
GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster, but self-similarities are not included in the average:

SIM-GA(ωi, ωj) = 1/((Ni + Nj)(Ni + Nj − 1)) · Σ_{dm ∈ ωi ∪ ωj} Σ_{dn ∈ ωi ∪ ωj, dn ≠ dm} dm · dn

where Ni and Nj are the numbers of documents in ωi and ωj, and the dm are length-normalized document vectors.
- Compute SIM-GA efficiently
Because the documents are length-normalized, the dot product of the sum vector of ωi ∪ ωj with itself counts every pairwise similarity plus the Ni + Nj self-similarities of 1.0, so we have:

SIM-GA(ωi, ωj) = 1/((Ni + Nj)(Ni + Nj − 1)) · [(Σ_{dm ∈ ωi ∪ ωj} dm)² − (Ni + Nj)]

where the square denotes the dot product of the sum vector with itself (a code sketch follows the list of requirements below).
Note:
GAAC Requires:
- documents represented as vectors
- length normalization of vectors, so that self-similarities are 1.0
- the dot product for computing the similarity between vectors and sums of vectors.
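Given these requirements, SIM-GA can be evaluated from the clusters' sum vectors alone; here is a small sketch (the function name `sim_ga` and its argument layout are illustrative).

```python
import numpy as np

def sim_ga(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of two clusters computed from their vector sums.

    sum_i, sum_j: sums of the length-normalized document vectors in each cluster.
    n_i, n_j: number of documents in each cluster.
    """
    s = sum_i + sum_j                    # sum vector of the merged cluster
    n = n_i + n_j
    # s.s sums dm.dn over every ordered pair (m, n), including the n
    # self-similarities, which equal 1.0 for length-normalized vectors
    return (float(np.dot(s, s)) - n) / (n * (n - 1))
```

When two clusters are merged, the sum vector of the new cluster is simply sum_i + sum_j, so candidate merges can be scored without touching individual documents again.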
Centroid clustering
In centroid clustering, the similarity of two clusters is defined as the similarity (dot product) of their centroids: SIM-CENT(ωi, ωj) = μ(ωi) · μ(ωj).
The difference between GAAC and centroid clustering is that GAAC considers all pairs of documents when computing average pairwise similarity, whereas centroid similarity amounts to averaging only over pairs of documents from different clusters.
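For illustration (assuming length-normalized NumPy vectors, not the book's code), centroid similarity can be computed directly from the two centroids:

```python
import numpy as np

def sim_cent(docs_i, docs_j):
    """Centroid similarity: dot product of the two cluster centroids."""
    mu_i = np.mean(docs_i, axis=0)       # centroid of cluster i
    mu_j = np.mean(docs_j, axis=0)       # centroid of cluster j
    return float(np.dot(mu_i, mu_j))
```

Expanding the dot product shows that this equals the average similarity over pairs of documents drawn from different clusters, which is exactly how it differs from SIM-GA.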
- Example
Three iterations of centroid clustering. Each iteration merges the two clusters whose centroids are closest.
Note:
Centroid clustering is not monotonic; the intersection (an inversion) is circled in the dendrogram.
- Similarity measure
Centroid similarity is conceptually simpler than the average of all pairwise similarities in GAAC.
Optimality of hierarchical agglomerative clustering
Define the combination similarity COMB-SIM of a clustering as the smallest combination similarity of any of its clusters:
Centroid clustering is not optimal because inversions can occur.
- single-link
The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:
- complete link
The combination similarity of a cluster ω is the smallest similarity of any two points in ω:
- GAAC:
The combination similarity of a cluster ω is the average of all pairwise similarities in ω
Note:
If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.
- Comparison of HAC algorithms
GAAC is recommended for document clustering because it is generally the method that produces clusterings with the best properties for applications: it does not suffer from chaining, from sensitivity to outliers, or from inversions.
- first story detection or novelty detection
- One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents.
For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001, form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space – and not global structure – that is important in this case.
- an approach to duplicate detection on the web
Cluster labeling
Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
Three labeling methods can be compared on a K-means clustering: the highest-weighted terms of the cluster centroid, terms selected by mutual information, and the title of the document closest to the centroid.
Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method.
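A minimal sketch of this cluster-internal method (the inputs `docs`, `titles`, and `member_ids` are assumed, and length-normalized vectors are again presupposed):

```python
import numpy as np

def title_label(docs, titles, member_ids):
    """Label a cluster with the title of its document closest to the centroid.

    docs: (N, M) NumPy array of length-normalized document vectors.
    titles: list of N document titles.
    member_ids: list of indices of the documents in the cluster.
    """
    members = docs[member_ids]
    centroid = members.mean(axis=0)
    # with length-normalized vectors, the largest dot product marks the closest document
    closest = member_ids[int(np.argmax(members @ centroid))]
    return titles[closest]
```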
Implementation notes
Most problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering.
Centroids, however, tend to be dense, which makes computing dot products with them expensive. To address this problem, we can use:
- truncating centroids (keeping only highly weighted terms)
- representing clusters by means of sparse medoids instead of dense centroids.
These optimizations can also be applied to GAAC and centroid clustering.
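As a small sketch of the first optimization (the cutoff of 50 terms is an arbitrary illustration, not a value from the book):

```python
import numpy as np

def truncate_centroid(centroid, k=50):
    """Keep only the k highest-weighted terms of a centroid; zero out the rest."""
    truncated = np.zeros_like(centroid)
    top = np.argsort(centroid)[-k:]      # indices of the k largest weights
    truncated[top] = centroid[top]
    return truncated
```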
We can employ an HAC algorithm to compute seeds of high quality for K-means. If the HAC algorithm is applied to a document subset of size √N, then the overall runtime of K-means cum HAC seed generation is Θ(N), because applying a quadratic algorithm to a sample of size √N has overall complexity Θ(N). An appropriate adjustment can be made for a Θ(N^2 log N) algorithm to guarantee linearity. This combination is called the Buckshot algorithm.
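A rough sketch of the Buckshot idea using scikit-learn's stock implementations as stand-ins (the helper name `buckshot` and the use of group-average HAC on the sample are illustrative assumptions; the sketch also assumes √N ≥ K):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def buckshot(docs, K, seed=0):
    """Buckshot-style clustering: HAC on a sqrt(N) sample seeds K-means on all docs.

    docs: (N, M) NumPy array of document vectors.
    K: desired number of clusters (must not exceed sqrt(N)).
    """
    rng = np.random.default_rng(seed)
    N = len(docs)
    sample_ids = rng.choice(N, size=int(np.sqrt(N)), replace=False)
    sample = docs[sample_ids]
    # quadratic HAC is affordable here because it only sees sqrt(N) documents
    labels = AgglomerativeClustering(n_clusters=K, linkage="average").fit_predict(sample)
    seeds = np.stack([sample[labels == k].mean(axis=0) for k in range(K)])
    # K-means over all N documents, seeded with the HAC cluster centroids
    return KMeans(n_clusters=K, init=seeds, n_init=1).fit_predict(docs)
```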