Introduction to Information Retrieval, Chapter 16 Notes (in English)

Flat clustering

Abstract

  • Cluster

Clustering algorithms group a set of documents into subsets or clusters. The algorithms’ goal is to create clusters that are coherent internally.

  • Unsupervised Learning

No supervision means that there is no human expert who has assigned documents to classes.

  • Flat Clustering

Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.

  • Example of a data set with a clear cluster structure

[Figure: a data set with a clear cluster structure]

  • Hard clustering & Soft clustering

Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft
clustering algorithms is soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.

This chapter covers K-means, a hard clustering algorithm, and the expectation-maximization (EM) algorithm, a soft clustering algorithm.

Clustering in information retrieval

  • Cluster hypothesis

Documents in the same cluster behave similarly with respect to relevance to information needs.

  • Applications of clustering in IR

[Table: typical applications of clustering in IR]

Search result clustering

By default, users scan the ranked result list from top to bottom until they have found the information they are looking for. Search result clustering instead clusters the search results, so that similar documents appear together.

This is particularly useful if a search term has different word senses.

  • Example

[Figure: clustered search results for the query "jaguar"]

None of the top hits cover the animal sense of jaguar, but users can easily access it by clicking on the Cat Cluster in the Clustered Results panel on the left.

Scatter-Gather

Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found.

For example:

[Figure: an example of a Scatter-Gather session]

Problem statement

  • Given
  1. a set of documents D
  2. a desired number of clusters K
  3. an objective function that evaluates the quality of a clustering

The objective function is often defined in terms of similarity or distance between documents.

For documents, the type of similarity we want is usually topic similarity or high values on the same dimensions in the vector space model.

When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters by language, for example English documents (in which "the" occurs frequently and "la" infrequently) from French documents (in which "the" occurs infrequently and "la" frequently).

  • Cardinality

A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K.

Often K is nothing more than a good guess based on experience or domain knowledge.

Because our goal is to optimize an objective function, clustering is essentially a search problem. The brute-force solution of enumerating all possible clusterings and picking the best is intractable, so most flat clustering algorithms instead refine an initial partitioning iteratively; if the starting point is bad, we may miss the global optimum. Finding a good starting point is therefore another important problem we have to solve in flat clustering.

Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intracluster similarity (documents within a cluster are similar) and low intercluster similarity (documents from different clusters are dissimilar).

  • Four external criteria
  1. Purity is a simple and transparent evaluation measure.
  2. Normalized mutual information can be information-theoretically interpreted.
  3. The Rand index penalizes both false-positive and false-negative decisions during clustering.
  4. The F measure in addition supports differential weighting of these two types of errors.
  • Purity

To compute purity, each cluster is assigned to the class that is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.

$$\text{purity}(\Omega, C) = \frac{1}{N}\sum_{k}\max_{j}\left|\omega_k \cap c_j\right|$$

where $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, \ldots, c_J\}$ is the set of classes, interpreting $\omega_k$ as the set of documents in cluster $k$ and $c_j$ as the set of documents in class $j$.

for example:

[Figure: example clustering used to illustrate purity and the other external criteria]

High purity is easy to achieve when the number of clusters is large – in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
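A minimal Python sketch of the purity computation, using hypothetical cluster and class label lists (not from the chapter):

```python
from collections import Counter

def purity(clusters, classes):
    """Assign each cluster to its most frequent class, count the
    correctly assigned documents, and divide by N."""
    correct = 0
    for k in set(clusters):
        members = [c for c, w in zip(classes, clusters) if w == k]  # classes in cluster k
        correct += Counter(members).most_common(1)[0][1]            # majority class count
    return correct / len(classes)

# hypothetical labels: three clusters, three classes
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ['x', 'x', 'o', 'o', 'o', 'd', 'd', 'x']
print(purity(clusters, classes))  # (2 + 2 + 2) / 8 = 0.75
```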

  • Normalized mutual information

$$\text{NMI}(\Omega, C) = \frac{I(\Omega; C)}{\left[H(\Omega) + H(C)\right]/2}$$

I is mutual information:

$$I(\Omega; C) = \sum_{k}\sum_{j} P(\omega_k \cap c_j)\log\frac{P(\omega_k \cap c_j)}{P(\omega_k)P(c_j)} = \sum_{k}\sum_{j}\frac{|\omega_k \cap c_j|}{N}\log\frac{N\,|\omega_k \cap c_j|}{|\omega_k|\,|c_j|}$$

H is entropy:

$$H(\Omega) = -\sum_{k} P(\omega_k)\log P(\omega_k) = -\sum_{k}\frac{|\omega_k|}{N}\log\frac{|\omega_k|}{N}$$

and analogously for $H(C)$.

MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better. The normalization by the denominator [H(Ω) + H(C)]/2 fixes this problem, because entropy tends to increase with the number of clusters; H(Ω) reaches its maximum when each document gets its own cluster, so the penalty grows as the clustering becomes more fine-grained. Since this normalizer bounds the mutual information, NMI is always a number between 0 and 1.
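A small Python sketch of NMI computed from cluster and class labels, estimating all probabilities by counting; the helper names are my own, not the book's:

```python
import math
from collections import Counter

def entropy(labels):
    # H(X) = -sum_k P(x_k) log P(x_k), probabilities estimated from counts
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(clusters, classes):
    # I(Omega; C) = sum_k sum_j P(w_k & c_j) log [P(w_k & c_j) / (P(w_k) P(c_j))]
    n = len(classes)
    wk = Counter(clusters)                   # cluster sizes |w_k|
    cj = Counter(classes)                    # class sizes |c_j|
    wkcj = Counter(zip(clusters, classes))   # intersection sizes |w_k & c_j|
    return sum((nkj / n) * math.log(n * nkj / (wk[k] * cj[j]))
               for (k, j), nkj in wkcj.items())

def nmi(clusters, classes):
    # NMI = I(Omega; C) / ((H(Omega) + H(C)) / 2)
    return mutual_information(clusters, classes) / ((entropy(clusters) + entropy(classes)) / 2)
```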

  • Rand Index

An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection: a true positive (TP) decision assigns two similar documents to the same cluster, a true negative (TN) assigns two dissimilar documents to different clusters, a false positive (FP) assigns two dissimilar documents to the same cluster, and a false negative (FN) assigns two similar documents to different clusters. The Rand index measures the percentage of decisions that are correct:

$$\text{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$

  • F measures

The Rand index gives equal weight to false positives and false negatives. Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster, so we can use the F measure to penalize FNs more strongly than FPs by selecting a value β > 1, thus giving more weight to recall.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_\beta = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}$$

In IR, evaluating clustering with F has the advantage that the measure is already familiar to the research community.
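Both RI and F_β can be computed from the four pair counts. A sketch in plain Python, with hypothetical helper names:

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Classify every document pair: TP = same cluster & same class,
    FP = same cluster & different class, FN = different cluster & same class,
    TN = different cluster & different class."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(clusters, classes):
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)

def f_beta(clusters, classes, beta=1.0):
    tp, fp, fn, _ = pair_counts(clusters, classes)
    p = tp / (tp + fp)   # pairwise precision
    r = tp / (tp + fn)   # pairwise recall
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```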

K-Means

K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid μ of the documents in a cluster ω:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|}\sum_{\vec{x}\in\omega}\vec{x}$$

  • Pseudo code

[Figure: K-means pseudocode]
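As a rough illustration of the pseudocode, here is a NumPy sketch of K-means; the random seed selection, the fixed iteration cap, and the handling of empty clusters are simplifying assumptions, not the book's exact formulation:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means on an N x M document matrix X: pick K random seeds,
    then alternate reassignment and centroid recomputation until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]        # initial seeds
    for _ in range(max_iter):
        # reassignment: each document goes to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # recomputation: each centroid becomes the mean of its documents
        new_centroids = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):                   # stopping criterion
            break
        centroids = new_centroids
    rss = dists.min(axis=1).sum()   # residual sum of squares of the final assignment
    return assign, centroids, rss
```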

  • RSS

RSS stands for residual sum of squares: the squared distance of each vector from its centroid, summed over all vectors.

$$\text{RSS}_k = \sum_{\vec{x}\in\omega_k}\left|\vec{x} - \vec{\mu}(\omega_k)\right|^2, \qquad \text{RSS} = \sum_{k=1}^{K}\text{RSS}_k$$

RSS is the objective function in K-means and our goal is to minimize it. Because N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.

  • Process of K-means

[Figure: the K-means clustering process on an example data set]

The outcome of K-means depends on the randomly selected initial seeds, and a bad choice of seeds can produce a poor clustering. This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS.

  • Time complexity

Most of the time is spent on computing vector distances, and one such operation costs Θ(M). The reassignment step computes KN distances, so its complexity is Θ(KNM); the recomputation step adds each of the N vectors to a centroid once, so it costs Θ(NM). For a fixed number of iterations I, the overall complexity is therefore Θ(IKNM).

  • Cluster cardinality in K-means
  1. Heuristic (knee) method

The first method inspects the estimated minimal residual sum of squares (RSS_min) as a function of the number of clusters K and looks for points where the curve flattens (the "knees").

There are two such points where successive decreases in RSS_min become noticeably smaller in the figure below: one at K = 4, where the gradient flattens slightly, and a clearer flattening at K = 9. This is typical: there is seldom a single best number of clusters. We still need to employ an external constraint to choose from a number of possible values of K (four and nine in this case).

[Figure: estimated minimal residual sum of squares RSS_min as a function of the number of clusters K]
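A brief sketch of how such a curve could be estimated, assuming the kmeans function sketched above: run several random restarts per K, keep the minimum RSS, and look for where the successive decreases flatten out.

```python
def rss_min_curve(X, k_values, restarts=10):
    """Estimate RSS_min(K) as the best RSS over several random restarts for each K."""
    return {K: min(kmeans(X, K, seed=s)[2] for s in range(restarts)) for K in k_values}
```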

  2. Impose a penalty for each new cluster

Combining the distortion RSS_min(K) with a penalty λ for each additional cluster gives this selection criterion for K:

$$K = \arg\min_{K}\left[\text{RSS}_{\min}(K) + \lambda K\right]$$

We still need to determine λ, however.

A theoretical justification for the above equation is the Akaike information criterion (AIC), an information-theoretic measure that trades off distortion against model complexity.

$$\text{AIC:}\quad K = \arg\min_{K}\left[-2L(K) + 2q(K)\right]$$

where −L(K), the negative maximum log-likelihood of the data for K clusters, is a measure of distortion and q(K), the number of parameters of a model with K clusters, is a measure of model complexity.

Note:
AIC provides a theoretical justification for one particular way of weighting these two factors, distortion
and model complexity, when selecting a model.

For K-Means:

$$\text{AIC:}\quad K = \arg\min_{K}\left[\text{RSS}_{\min}(K) + 2MK\right]$$

Here each of the K centroids has M coordinates, so q(K) = KM, and the criterion is the earlier one with λ = 2M.
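A sketch of this criterion in code, reusing the hypothetical kmeans and rss_min_curve helpers above:

```python
def choose_k_aic(X, k_values, restarts=10):
    """Pick the K that minimizes RSS_min(K) + 2*M*K."""
    M = X.shape[1]                                   # dimensionality of the vector space
    curve = rss_min_curve(X, k_values, restarts)     # estimated RSS_min for each K
    return min(k_values, key=lambda K: curve[K] + 2 * M * K)
```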

Model-based clustering

Model-based clustering assumes that the data were generated by a model and tries to recover that model from the data. The algorithm used here is the EM algorithm, which can be applied to a larger variety of document representations and distributions than K-means.

More generally, the maximum likelihood criterion is to select the parameters Θ that maximize the log-likelihood of generating the data D:

$$\Theta = \arg\max_{\Theta} L(D|\Theta) = \arg\max_{\Theta}\log\prod_{n=1}^{N} P(d_n|\Theta) = \arg\max_{\Theta}\sum_{n=1}^{N}\log P(d_n|\Theta)$$

EM clustering is an iterative algorithm that maximizes L(D|Θ).

L(D|Θ) is the objective function that measures the goodness of the clustering
The model assumed here is a mixture of multivariate Bernoulli distributions: a document d is generated by first picking a cluster ω_k with prior probability α_k, and then generating the terms of d, where q_mk = P(U_m = 1 | ω_k) is the probability that a document of cluster ω_k contains term t_m:

$$P(d \mid \omega_k; \Theta) = \Big(\prod_{t_m \in d} q_{mk}\Big)\Big(\prod_{t_m \notin d} (1 - q_{mk})\Big)$$

EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids; the parameters of the instance of EM in this section are the α_k and q_mk.

  1. Maximization step:

$$q_{mk} = \frac{\sum_{n=1}^{N} r_{nk}\, I(t_m \in d_n)}{\sum_{n=1}^{N} r_{nk}}, \qquad \alpha_k = \frac{\sum_{n=1}^{N} r_{nk}}{N}$$

where $I(t_m \in d_n) = 1$ if $t_m \in d_n$ and 0 otherwise, and $r_{nk}$ is the soft assignment of document $d_n$ to cluster $k$ computed in the preceding expectation step.

  2. Expectation step:

$$r_{nk} = \frac{\alpha_k \big(\prod_{t_m \in d_n} q_{mk}\big)\big(\prod_{t_m \notin d_n} (1 - q_{mk})\big)}{\sum_{k'=1}^{K} \alpha_{k'} \big(\prod_{t_m \in d_n} q_{mk'}\big)\big(\prod_{t_m \notin d_n} (1 - q_{mk'})\big)}$$

Note:
Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM. Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then "soften up."
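A compact NumPy sketch of this EM instance for a mixture of multivariate Bernoulli models over a binary term-document matrix; the random initialization of the soft assignments, the smoothing constant eps, and the fixed number of iterations are my assumptions:

```python
import numpy as np

def em_bernoulli(X, K, n_iter=50, eps=1e-4, seed=0):
    """Soft clustering with EM. X is an N x M binary matrix with
    X[n, m] = 1 iff term t_m occurs in document d_n."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    r = rng.dirichlet(np.ones(K), size=N)            # soft assignments r_nk (rows sum to 1)
    for _ in range(n_iter):
        # maximization step: re-estimate alpha_k and q_mk from the soft assignments
        alpha = r.sum(axis=0) / N                                    # alpha_k = sum_n r_nk / N
        q = (r.T @ X + eps) / (r.sum(axis=0)[:, None] + 2 * eps)     # q[k, m], smoothed
        # expectation step: r_nk proportional to alpha_k * prod_m q^x (1 - q)^(1 - x), in log space
        log_p = X @ np.log(q).T + (1 - X) @ np.log(1 - q).T + np.log(alpha)
        log_p -= log_p.max(axis=1, keepdims=True)    # for numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
    return r, alpha, q
```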