Notes on Chapter 14 of Introduction to Information Retrieval (in English)

Vector space classification

Abstract

This chapter adopts a different representation for text classification: the vector space model.

This chapter introduces a number of classification methods that operate on real-valued vectors.

  • CONTIGUITY HYPOTHESIS
    Documents in the same class form a contiguous region and regions of different classes do not overlap.

For example, documents in the class China tend to have high values on dimensions like Chinese, Beijing, and Mao, whereas documents in the class UK tend to have high values for London, British, and Queen. Documents of the two classes therefore form distinct contiguous regions, as shown in the figure below:

[Figure: documents from the classes China and UK forming two distinct contiguous regions in the vector space.]

We can therefore draw boundaries that separate the two classes and use them to classify new documents.

  • We should use weighted representations

A term with five occurrences in a document should get a higher weight than a term with one occurrence, but a weight five times larger would give too much emphasis to the term. Unweighted and unnormalized counts should therefore not be used in vector space classification.
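A minimal sketch (mine, not the book's) of the kind of weighting argued for above: sublinear (1 + log) term-frequency scaling combined with idf, followed by L2 length normalization. The toy corpus and function name are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy tf-idf: sublinear tf (1 + log tf) times idf, then L2 length normalization."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

docs = [["chinese", "beijing", "chinese"],
        ["london", "british", "queen"],
        ["chinese", "mao"]]
# A term occurring 5 times would get tf weight 1 + log 5 (about 2.6), not 5 times the weight.
print(tfidf_vectors(docs)[0])
```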

  • Two vector space classification methods
  1. Rocchio
  2. kNN (k nearest neighbors)

Note:
Nonlinear models have more parameters to fit on a limited amount of training data and are more likely to make mistakes for small and noisy data sets.

Document representations and measures of relatedness in vector spaces

Decisions of many vector space classifiers are based on a notion of distance.

In vector space classification, it rarely matters whether
the relatedness of two documents is expressed in terms of similarity or distance.

We are mostly concerned with small local regions when computing the similarity between a document and a centroid, and the smaller the region the more similar the behavior of the three measures is.
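A small sketch (my illustration, not the book's) of why the choice between similarity and distance rarely matters: for length-normalized vectors, Euclidean distance is a monotone function of cosine similarity (since ||x - y||^2 = 2(1 - cos)), so both rank neighbors identically.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5)
docs = rng.random((4, 5))

# Length-normalize everything to unit vectors.
x /= np.linalg.norm(x)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

cos = docs @ x                            # cosine similarity (higher = closer)
dist = np.linalg.norm(docs - x, axis=1)   # Euclidean distance (lower = closer)

# Same ranking: most similar by cosine == nearest by distance.
assert (np.argsort(-cos) == np.argsort(dist)).all()
print(np.argsort(-cos), np.argsort(dist))
```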

Rocchio classification

Use centroids to define the boundaries.
The centroid of a class c is computed as the vector average or center of mass of its members:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where Dc is the set of documents in D whose class is c.

The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids.

[Figure: Rocchio classification: class centroids and the boundaries between classes.]

So we define the boundary as the set of points $\vec{x}$ that satisfy:

$$|\vec{\mu}(c_1) - \vec{x}| = |\vec{\mu}(c_2) - \vec{x}|$$

  • Pseudo code

[Pseudocode: Rocchio training (compute one centroid per class) and application (assign a test document to the class of its nearest centroid).]
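The pseudocode image above did not survive; here is a rough NumPy sketch of the same idea, assuming documents are already weighted vectors. The function names are mine, not the book's.

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (vector average) per class from training vectors X."""
    classes = sorted(set(y))
    centroids = np.array([X[np.array(y) == c].mean(axis=0) for c in classes])
    return classes, centroids

def apply_rocchio(classes, centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return classes[int(np.argmin(dists))]

X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = ["China", "China", "UK", "UK"]
classes, centroids = train_rocchio(X, y)
print(apply_rocchio(classes, centroids, np.array([0.7, 0.3])))  # -> "China"
```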

  • MULTIMODAL CLASS

[Figure: a multimodal class "a" made up of two separate clusters.]

We cannot represent the "a" class well with a single prototype because it is multimodal: it consists of two separate clusters. Rocchio often misclassifies this type of multimodal class.

In two-class problems, the complement class is often spread over the rest of the space rather than forming a compact region. Most two-class classification problems therefore require a modified decision rule of the form:

Assign $d$ to class $c$ iff
$$|\vec{\mu}(c) - \vec{v}(d)| < |\vec{\mu}(\bar{c}) - \vec{v}(d)| - b$$
for some positive constant $b$.
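A hedged sketch of the modified rule above: assign d to c only if its centroid is closer by more than a margin b; otherwise fall back to the complement class. The function name and margin value are illustrative.

```python
import numpy as np

def rocchio_with_margin(mu_c, mu_not_c, x, b=0.1):
    """Assign x to c only if it is closer to mu_c than to mu_not_c by more than b."""
    return np.linalg.norm(mu_c - x) < np.linalg.norm(mu_not_c - x) - b

mu_china, mu_rest = np.array([0.85, 0.15]), np.array([0.3, 0.7])
print(rocchio_with_margin(mu_china, mu_rest, np.array([0.8, 0.2])))  # True -> class "China"
```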

  • Time Complexity of Rocchio

Training: $\Theta(|D| L_{ave} + |C||V|)$
Testing: $\Theta(L_a + |C| M_a) = \Theta(|C| M_a)$

Lave is the average number of tokens per training document; La and Ma are the numbers of tokens and types in the test document, respectively; |D| is the number of training documents, |C| the number of classes, and |V| the size of the vocabulary.

K nearest neighbor

For kNN, we assign each document to the majority class of its k closest neighbors, where k is a parameter.

Rationale:
Based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.

For k = 1 (1NN), the classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for k > 1 is more robust: it assigns documents to the majority class of their k closest neighbors, with ties broken randomly.

The parameter k in kNN is often chosen based on experience or knowledge about the classification problem at hand. It is desirable for k to be odd to make ties less likely. k = 3 and k = 5 are common choices, but much larger values, between 50 and 100, are also used.

  • Pseudo code

[Pseudocode: kNN training (preprocess and store the training documents) and classification (find the k nearest neighbors of the test document and take their majority class).]
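The kNN pseudocode image is missing as well; below is a small NumPy sketch of the majority-vote version described above. Function and variable names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the k training documents closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = ["China", "China", "China", "UK", "UK"]
print(knn_classify(X_train, y_train, np.array([0.75, 0.25]), k=3))  # -> "China"
```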

  • Votes by cosine similarity

$$\text{score}(c, d) = \sum_{d' \in S_k(d)} I_c(d')\,\cos(\vec{v}(d'), \vec{v}(d))$$

where $S_k(d)$ is the set of $d$'s $k$ nearest neighbors and $I_c(d') = 1$ iff $d'$ is in class $c$ and $0$ otherwise.
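A short sketch of this weighted variant, where each of the k neighbors votes with its cosine similarity to d rather than with a unit vote (my implementation of the formula, with illustrative names and data).

```python
import numpy as np
from collections import defaultdict

def knn_cosine_vote(X_train, y_train, x, k=3):
    """score(c, d) = sum of cos(d', d) over the k nearest neighbors d' that belong to c."""
    # Cosine similarity; assumes no zero vectors.
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(-sims)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += sims[i]
    return max(scores, key=scores.get)

X_train = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
y_train = ["China", "UK", "UK"]
# -> "UK": the two UK neighbors' combined cosine votes outweigh the single closer China neighbor.
print(knn_cosine_vote(X_train, y_train, np.array([0.6, 0.4]), k=3))
```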

  • Time complexity

Training a kNN classifier simply consists of determining k and preprocessing documents.
In fact, if we preselect a value for k and do not preprocess, then kNN requires no training at all.

[Table: training and test time complexity of kNN, with and without preprocessing of the training set.]

In kNN classification, we do not perform any estimation of parameters as we do in Rocchio classification (centroids) or in Naive Bayes (priors and conditional probabilities). kNN simply memorizes all examples in the training set and then compares the test document to them.

The search time in an inverted index is a function of the length of the postings lists of the terms in the query. Postings lists grow sublinearly with the length of the collection since the vocabulary increases according to Heaps’ law – if the probability of occurrence of some terms increases, then the probability of occurrence of others must decrease.

Linear versus nonlinear classifiers

In two dimensions, a linear classifier is a line.

These lines have the functional form $w_1 x_1 + w_2 x_2 = b$; we assign a document to $c$ if $w_1 x_1 + w_2 x_2 > b$ and to $\bar{c}$ if $w_1 x_1 + w_2 x_2 \le b$.

The corresponding algorithm for linear classification in M dimensions is shown as follows:

[Pseudocode: apply a linear classifier in $M$ dimensions by computing the score $\vec{w}^{T}\vec{x} = \sum_{i=1}^{M} w_i x_i$ and assigning class $c$ iff the score exceeds the threshold $b$.]
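A minimal sketch of that M-dimensional rule: compute the dot product w . x and compare it with the threshold b. The weights, threshold, and class labels are illustrative placeholders, not the book's figure.

```python
import numpy as np

def apply_linear_classifier(w, b, x):
    """Assign x to the positive class iff w . x > b."""
    return "c" if np.dot(w, x) > b else "not-c"

w = np.array([0.6, -0.4, 0.2])   # one weight per term dimension
b = 0.1                          # decision threshold
print(apply_linear_classifier(w, b, np.array([0.5, 0.1, 0.3])))  # score 0.32 > 0.1 -> "c"
```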

  • class boundary

[Figure: the linear class boundary between two classes, with some noise documents on the wrong side.]

We call this separating line the class boundary. It is the “true” boundary of the two classes and we distinguish it from the decision boundary that the learning method computes to approximate the class boundary.

Noise documents are one reason why training a linear classifier is hard. If we pay too much attention to noise documents when choosing the decision hyperplane of the classifier, then it will be inaccurate on new data. More
fundamentally, it is usually difficult to determine which documents are noise documents and therefore potentially misleading.

  • Nonlinear classifier

The nonlinearity of the kNN classifier is intuitively clear when looking at examples like the figure below:

[Figure: the nonlinear decision boundary of kNN.]

Classification with more than two classes

Solving an any-of classification task with linear classifiers is straightforward:

  1. Build a classifier for each class, where the training set consists of the documents in the class (positive labels) and its complement (negative labels).
  2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers, so a document can be assigned to any number of classes.
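A hedged sketch of this any-of (multilabel) procedure: one binary linear classifier per class, each applied independently, so a document can receive several labels or none. The (w, b) pairs below are hard-coded stand-ins for trained classifiers.

```python
import numpy as np

# One (w, b) pair per class, each applied independently to the same document vector.
classifiers = {
    "China":  (np.array([0.9, 0.0, 0.1]), 0.3),
    "UK":     (np.array([0.0, 0.8, 0.2]), 0.3),
    "sports": (np.array([0.1, 0.1, 0.9]), 0.3),
}

def any_of(classifiers, x):
    """Return every class whose classifier accepts x (possibly several, possibly none)."""
    return [c for c, (w, b) in classifiers.items() if np.dot(w, x) > b]

print(any_of(classifiers, np.array([0.7, 0.6, 0.1])))  # -> ['China', 'UK']
```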

  • One-of classification

The classes are mutually exclusive: each document must belong to exactly one of the classes.

We can state this algorithm for one-of classification with linear classifiers as follows:

  1. Build a classifier for each class, where the training set consists of the documents in the class (positive labels) and its complement (negative labels).
  2. Given the test document, apply each classifier separately.
  3. Assign the document to the class with the maximum score (e.g., the maximum confidence value or the maximum probability).
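A sketch of this one-of (single-label) variant: apply every per-class classifier and assign the document to the class with the maximum score, so exactly one label is returned. It reuses the same kind of hard-coded (w, b) stand-ins as in the any-of sketch above.

```python
import numpy as np

classifiers = {
    "China": (np.array([0.9, 0.0, 0.1]), 0.3),
    "UK":    (np.array([0.0, 0.8, 0.2]), 0.3),
}

def one_of(classifiers, x):
    """Assign x to the single class with the highest score w . x - b."""
    return max(classifiers, key=lambda c: np.dot(classifiers[c][0], x) - classifiers[c][1])

print(one_of(classifiers, np.array([0.7, 0.6, 0.1])))  # scores: China 0.34, UK 0.20 -> 'China'
```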

The bias-variance tradeoff

There is no universally optimal learning method.

  1. many nonlinear models subsume linear models as a special case.

  2. there are nonlinear models that are less complex than linear models.

  3. the complexity of learning is not really a property of the classifier, because there are many aspects of learning that make a learning method either more powerful or less powerful without affecting the type of classifier that is the final result of learning, regardless of whether that classifier is linear or nonlinear.

  • Simplify the calculations

We look at how well the classifier estimates the conditional probability P(c|d) of a document being in a class.

Our goal in text classification then is to find a classifier γ such that, averaged over documents d, γ(d) is as close as possible to the true probability P(c|d). We measure this using mean squared error:

$$\text{MSE}(\gamma) = E_d \left[ \gamma(d) - P(c|d) \right]^2$$

For learning methods, our goal is to find a learning method Γ that, averaged over training sets, learns classifiers γ with minimal MSE. We can formalize this as minimizing the learning error:

$$\text{learning-error}(\Gamma) = E_D \left[ \text{MSE}(\Gamma(D)) \right]$$

where $\Gamma(D)$ is the classifier learned from training set $D$.

Transforming the above equation (by adding and subtracting $E_D\,\Gamma_D(d)$ and expanding the square) yields the bias–variance decomposition:

$$\text{learning-error}(\Gamma) = E_d \left[ \text{bias}(\Gamma, d) + \text{variance}(\Gamma, d) \right]$$
$$\text{bias}(\Gamma, d) = \left[ P(c|d) - E_D\,\Gamma_D(d) \right]^2$$
$$\text{variance}(\Gamma, d) = E_D \left[ \Gamma_D(d) - E_D\,\Gamma_D(d) \right]^2$$

where $\Gamma_D(d)$ is the prediction on $d$ of the classifier learned from training set $D$.

Bias is the squared difference between $P(c|d)$, the true conditional probability of $d$ being in $c$, and $\Gamma_D(d)$, the prediction of the learned classifier, averaged over training sets. Variance is the variation of that prediction: the average squared difference between $\Gamma_D(d)$ and its average $E_D\,\Gamma_D(d)$ across training sets.
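A toy numerical sketch (entirely my own construction, not the book's) of the decomposition above: draw many training sets, fit a very simple estimator of P(c|d) on each, and estimate bias squared and variance of its predictions at a fixed test point. The printout shows that bias^2 + variance equals the learning error at d.

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.7          # assumed true P(c|d) at a fixed test document d
n_train, n_sets = 20, 5000

# Toy "learning method": estimate P(c|d) by the label frequency in a small training sample.
estimates = np.array([rng.binomial(n_train, p_true) / n_train for _ in range(n_sets)])

bias_sq = (estimates.mean() - p_true) ** 2    # squared difference of the average prediction
variance = estimates.var()                    # spread of predictions across training sets
mse = ((estimates - p_true) ** 2).mean()      # learning error at d

print(f"bias^2={bias_sq:.5f}  variance={variance:.5f}  "
      f"bias^2+variance={bias_sq + variance:.5f}  MSE={mse:.5f}")
```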

Note:
Linear learning methods have high bias for nonlinear problems but low variance, because most randomly drawn training sets produce similar decision hyperplanes.
Nonlinear methods like kNN have low bias and high variance.

We can also think of variance as the model complexity or, equivalently, memory capacity of the learning method – how detailed a characterization of the training set it can remember and then apply to new data. This capacity corresponds to the number of independent parameters available to fit the training set.

All in all, we have to weigh the respective merits of bias and variance in our application and choose accordingly. This tradeoff is called the bias–variance tradeoff.

Linear models in high-dimensional spaces are quite powerful despite their linearity. Even more powerful nonlinear learning methods can model decision boundaries that are more complex than a hyperplane, but they are also more sensitive to noise in the training data. Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.