Notes on Chapter 16 of Introduction to Information Retrieval (in English)
Flat clustering
Abstract
- Cluster
Clustering algorithms group a set of documents into subsets or clusters. The algorithms’ goal is to create clusters that are coherent internally.
- Unsupervised Learning
No supervision means that there is no human expert who has assigned documents to classes.
- Flat Clustering
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.
- Example of a data set with a clear cluster structure
- Hard clustering & Soft clustering
Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft
clustering algorithms is soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.
This chapter introduces K-means, a hard clustering algorithm, and the expectation-maximization (EM) algorithm, a soft clustering algorithm.
Clustering in information retrieval
- Cluster hypothesis
Documents in the same cluster behave similarly with respect to relevance to information needs.
- Applications of clustering in IR
Search result clustering
The default presentation of search results is a ranked list: users scan the list from top to bottom until they have found the information they are looking for. Search result clustering instead clusters the search results, so that similar documents appear together.
This is particularly useful if a search term has different word senses.
- Example
None of the top hits cover the animal sense of jaguar, but users can easily access it by clicking on the Cat Cluster in the Clustered Results panel on the left.
Scatter-gather
Scatter-gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found.
Problem statement
- Objective function
- a set of documents D
- a desired number of clusters K
- an objective function that evaluates the quality of a clustering
The objective function is often defined in terms of similarity or distance between documents.
For documents, the type of similarity we want is usually topic similarity or high values on the same dimensions in the vector space model.
When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters of English documents (in which *the* occurs frequently and *la* infrequently) from clusters of French documents (in which *the* occurs infrequently and *la* frequently).
- Cardinality
A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K.
Often K is nothing more than a good guess based on experience or domain knowledge.
Because our goal is to optimize an objective function, clustering is essentially a search problem. Most flat clustering algorithms refine an initial partitioning iteratively; finding a good starting point is therefore another important problem we have to solve in flat clustering.
Evaluation of clustering
Typical objective functions in clustering formalize the goal of attaining high intracluster similarity (documents within a cluster are similar) and low intercluster similarity (documents from different clusters are dissimilar).
- Four external criteria
- Purity is a simple and transparent evaluation measure.
- Normalized mutual information can be information-theoretically interpreted.
- The Rand index penalizes both false-positive and false-negative decisions during clustering.
- The F measure in addition supports differential weighting of these two types of errors.
- Purity
Each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N (formula and a small worked example below).
High purity is easy to achieve when the number of clusters is large – in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
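Written out, with Ω = {ω_1, …, ω_K} the set of clusters and C = {c_1, …, c_J} the set of classes:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N}\sum_{k}\max_{j}\,|\omega_k \cap c_j|$$

As a small illustrative example (numbers chosen here for illustration, not taken from a figure): if N = 17 documents fall into three clusters whose majority classes account for 5, 4, and 3 documents respectively, then purity = (5 + 4 + 3)/17 ≈ 0.71.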
- Normalized mutual information
I is mutual information and H is entropy; both are written out at the end of this item.
MI has the same problem as purity:
It does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.
The normalization fixes this problem; entropy tends to increase with the number of clusters.
Thus, NMI is always a number between 0 and 1.
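For reference, the quantities above in their standard forms (Ω the clusters, C the classes):

$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}$$

$$I(\Omega; C) = \sum_{k}\sum_{j} P(\omega_k \cap c_j)\,\log\frac{P(\omega_k \cap c_j)}{P(\omega_k)\,P(c_j)}, \qquad H(\Omega) = -\sum_{k} P(\omega_k)\,\log P(\omega_k)$$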
- Rand Index
An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection.
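Each pair decision can be a true positive (two similar documents placed in the same cluster), a true negative (two dissimilar documents placed in different clusters), a false positive (two dissimilar documents in the same cluster), or a false negative (two similar documents in different clusters). The Rand index is then simply accuracy over these decisions:

$$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$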
- F measures
We can use the F measure to penalize FNs more strongly than FPs by selecting a value β > 1, thus giving more weight to
recall.
In IR, evaluating clustering with F has the advantage that the measure is already familiar to the research community.
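In terms of the same pair decisions, precision, recall, and the F measure are:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_{\beta} = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$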
K-Means
K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid μ(ω) of the documents in a cluster ω:
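In symbols (standard definition), the centroid is the vector average of the documents in the cluster:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|}\sum_{\vec{x}\in\omega}\vec{x}$$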
- Pseudo code
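The chapter's pseudocode figure is not reproduced in these notes. Below is a minimal Python/NumPy sketch of the same loop (random seed selection, reassignment, recomputation); the function name and the fixed-iteration stopping rule are illustrative choices, not the book's.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch.

    X: (N, M) array of document vectors; K: number of clusters.
    Illustrative only: seeds are K documents chosen uniformly at random,
    and the loop stops when assignments no longer change or after n_iter rounds.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Seed selection: K random documents become the initial centroids.
    centroids = X[rng.choice(N, size=K, replace=False)].astype(float)
    assign = np.full(N, -1)
    for _ in range(n_iter):
        # Reassignment step: move each document to its closest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no document changed cluster: converged
        assign = new_assign
        # Recomputation step: each centroid becomes the mean of its members.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:  # keep old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return assign, centroids

# Toy usage: two obvious groups of 2-D points.
if __name__ == "__main__":
    pts = np.array([[0.0, 0.1], [0.2, 0.0], [4.0, 4.1], [3.9, 4.0]])
    labels, centers = kmeans(pts, K=2)
    print(labels, centers)
```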
- RSS
residual sum of squares.
RSS is the objective function in K-means and our goal is to minimize it. Because N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.
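Written out, RSS is the sum over all clusters of the squared distances of each document from its centroid:

$$\mathrm{RSS}_k = \sum_{\vec{x}\in\omega_k}\left|\vec{x}-\vec{\mu}(\omega_k)\right|^2, \qquad \mathrm{RSS} = \sum_{k=1}^{K}\mathrm{RSS}_k$$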
- Process of K-means
K-means is usually initialized by selecting K randomly chosen documents as the seeds, and a bad choice of seeds can lead to a suboptimal clustering. This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS.
- Time complexity
Most of the time is spent on computing vector distances. One such operation costs Θ(M). The reassignment step computes KN distances, so its complexity is Θ(KNM). In the recomputation step, each vector gets added to a centroid once, so that step is Θ(NM). For a fixed number of iterations I, the overall complexity is therefore Θ(IKNM).
- Cluster cardinality in K-means
- heuristic method
Plot the estimated minimal residual sum of squares (RSSmin) as a function of the number of clusters K and look for the "knee" of the curve, the point where successive decreases in RSSmin become noticeably smaller. In the chapter's example plot there are two such points, one at K = 4, where the gradient flattens slightly, and a clearer flattening at K = 9. This is typical: there is seldom a single best number of clusters. We still need to employ an external constraint to choose from a number of possible values of K (four and nine in this case).
- impose a penalty for each new cluster
A second approach creates a generalized objective function that combines distortion (RSSmin in the case of K-means) with a penalty for each new cluster. We then get this selection criterion for K:
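In symbols, with λ weighting the per-cluster penalty against the distortion:

$$K = \arg\min_{K}\,[\,\mathrm{RSS}_{\min}(K) + \lambda K\,]$$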
The difficulty with this criterion is that we need to determine λ.
A theoretical justification for the above criterion is the Akaike information criterion (AIC), an information-theoretic measure that trades off distortion against model complexity.
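In its general form (the standard statement of AIC for model selection):

$$\mathrm{AIC}:\quad K = \arg\min_{K}\,[-2L(K) + 2q(K)]$$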
where −L(K), the negative maximum log-likelihood of the data for K clusters, is a measure of distortion and q(K), the number of parameters of a model with K clusters, is a measure of model complexity.
Note:
AIC provides a theoretical justification for one particular way of weighting these two factors, distortion
and model complexity, when selecting a model.
For K-Means:
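For K-means, the distortion can be identified with RSSmin and a model with K clusters has on the order of KM parameters (M is the dimensionality of the vector space), which gives:

$$K = \arg\min_{K}\,[\,\mathrm{RSS}_{\min}(K) + 2MK\,]$$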
Model-based clustering
The EM algorithm can be applied to a larger variety of document representations and distributions than K-means.
More generally, the maximum likelihood criterion is to select the parameters θ that maximize the log-likelihood of generating the data D:
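Written out, with D = {d_1, …, d_N}:

$$\Theta = \arg\max_{\Theta} L(D\,|\,\Theta) = \arg\max_{\Theta}\,\log\prod_{n=1}^{N}P(d_n\,|\,\Theta) = \arg\max_{\Theta}\sum_{n=1}^{N}\log P(d_n\,|\,\Theta)$$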
EM clustering is an iterative algorithm that maximizes L(D|Θ).
L(D|Θ) is the objective function that measures the goodness of the clustering.
EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids, the parameters of the instance of EM in this section are the αk and qmk.
- Maximization step:
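For the multivariate Bernoulli mixture model of this section, writing r_nk for the soft assignment of document d_n to cluster k and I(t_m ∈ d_n) for the indicator that term t_m occurs in d_n (a reconstruction of the standard updates; the r_nk notation is assumed here), the parameters are re-estimated as:

$$q_{mk} = \frac{\sum_{n=1}^{N} r_{nk}\, I(t_m \in d_n)}{\sum_{n=1}^{N} r_{nk}}, \qquad \alpha_k = \frac{\sum_{n=1}^{N} r_{nk}}{N}$$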
- Expectation step:
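With the same notation, the soft assignments are recomputed from the current parameters:

$$r_{nk} = \frac{\alpha_k \left(\prod_{t_m \in d_n} q_{mk}\right)\left(\prod_{t_m \notin d_n} (1 - q_{mk})\right)}{\sum_{k'=1}^{K} \alpha_{k'} \left(\prod_{t_m \in d_n} q_{mk'}\right)\left(\prod_{t_m \notin d_n} (1 - q_{mk'})\right)}$$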
Note:
Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM. Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then "soften up."