Notes on Introduction to Information Retrieval, Chapter 2 (in English)

Abstract

Chapter 2 addresses four main problems, corresponding to the major steps of inverted index construction. First, it makes clear how the basic unit of a document is defined and how the character sequence it comprises is determined. Next, it examines in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing. After that, it turns to the crucial issue of extending the postings list data structure to support faster querying. Finally, the last section covers building postings data structures suitable for handling phrase and proximity queries.

Introduction

Character sequence decoding

Typically a file is stored as bytes, both on disk and in memory; to make it readable, we have to decode those bytes into characters using the correct encoding. It is therefore important to know a document's encoding before any further processing steps. The encoding is generally stored in the document's metadata.
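Decoding can be sketched as follows; the specific byte string here is just an illustration:

```python
# Bytes as they would be stored on disk; the text is recoverable
# only if we decode with the encoding it was written in.
raw = "信息索引".encode("utf-8")

text = raw.decode("utf-8")    # correct encoding: characters come back intact
print(text)

# Using the wrong encoding either garbles the text or raises an error:
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("wrong encoding")
```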

Document unit & Index granularity

The choice of index granularity is also important. If the granularity is too large, precision suffers. For example, if you search for the word "information" and the index granularity is an entire book, then any book in which "information" appears anywhere is returned, which is very imprecise.

This is because we really only want to know which paragraph is talking about "information".

Tokenization

Basic concepts:

  1. Token: each instance of a character sequence produced by tokenization
  2. Type: the class of all tokens containing the same character sequence; in other words, the result of putting the tokens into a set
  3. Term: a normalized type, i.e., a type that is included in the IR system's dictionary

For instance, consider the sentence:

i have a apple and a bag.

| tokens | types | terms |
| ------ | ----- | ----- |
| 7      | 6     | 3     |

explanation:

  • 7 words give 7 tokens
  • the two occurrences of "a" collapse into one type (7 → 6)
  • dropping the stop words i, a, and leaves 3 terms (6 → 3)
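The counts above can be reproduced with a short sketch; the stop word list here is an assumption chosen to match the example:

```python
sentence = "i have a apple and a bag"   # punctuation already stripped

# Tokens: each instance produced by tokenization.
tokens = sentence.split()
print(len(tokens))    # 7

# Types: distinct character sequences (the two "a" collapse into one).
types = set(tokens)
print(len(types))     # 6

# Terms: types left after dropping stop words (assumed stop list).
stop_words = {"i", "a", "and"}
terms = types - stop_words
print(len(terms))     # 3
```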

So, what is a stop word, and what happens if we don't delete them?

stop word

Stop words are words that occur extremely frequently but have little value in helping select documents.

For example: this, a, an, of, etc.

As common sense suggests, these words appear everywhere, in nearly every document. If they are included in the inverted index construction, we get huge postings lists.

Attention:

  1. We can't simply delete all stop words.

For example: "to be or not to be."
This sentence is made up almost entirely of stop words, so dropping them all would destroy its meaning.

  2. When and where to drop stop words:

  • when building the inverted index, stop words are not considered
  • when querying, the search runs after stop words are dropped

  3. Shortcomings:

Dropping stop words is not a completely beneficial strategy.

Possible shortcomings:

  • misunderstanding: after stop word removal, "president in US" and "president and US" both reduce to "president US" and can no longer be distinguished

Equivalence classing of terms

what is an equivalence class/set

  • {A, B, C}
  • searching for A is equivalent to searching for B, which is equivalent to searching for C

This method is applied to queries and documents simultaneously.
The way to realize equivalence classing of terms is normalization.

Common methods:

  • Case folding
  • Add query expansion dicts
  • Expand postings lists
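The first two methods can be sketched as follows; the expansion dictionary's contents are assumptions for illustration:

```python
# Case folding: map every token to lower case, so "Information",
# "INFORMATION" and "information" fall into one equivalence class.
def normalize(token: str) -> str:
    return token.lower()

# A toy query-expansion dictionary: a query term is expanded to its
# whole equivalence class before the postings lists are consulted.
expansion = {"usa": ["usa", "u.s.a.", "united states"]}

query = "USA"
variants = expansion.get(normalize(query), [normalize(query)])
print(variants)
```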

Attention:

  • Normalization also does harm
  • It may cause semantic loss and deviation

stemming and lemmatization

  • stemming

Crude chopping off of prefixes and suffixes

Stemming can increase recall but may reduce precision. For example, respectability and respective are both converted to respect, though their meanings differ; so when you query respectability, stemming converts it to respect, and the results may then include documents that are really about respective.

What's more, stemming can reduce the size of the dictionary.

Porter algorithm

An effective stemming algorithm for English.
First, Porter defines some suffix-rewriting rules:
SSES -> SS
TION -> (dropped)
IES  -> I
Rule 1: when more than one suffix matches, the longest match is applied.
Rule 2: when the measure m > 1, EMENT is dropped (e.g., replacement -> replac).
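Rule 1 (longest match wins) can be sketched with the step-1a suffix rules of the real Porter stemmer; this is only a fragment, not the full algorithm:

```python
# Suffix rules ordered longest first, so the first match is the longest match.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step1a("caresses"))   # caress  (SSES -> SS)
print(step1a("ponies"))     # poni    (IES  -> I)
print(step1a("cats"))       # cat     (S    -> dropped)
```

Note the "ss" -> "ss" rule exists only to stop words like "caress" from losing their final "s".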

  • Lemmatization

Removing affixes through dictionary analysis

The lemma is defined as the word obtained from word-form analysis

Attention:

Stemming and lemmatization are applied both to documents and to query words. So at indexing time, both respective and respectability are parsed into respect. More specifically, the dictionary entry is respect, and its postings list stores the union of the docIDs of the two original words. If the query word is respective, it is turned into respect and looked up in the inverted index.
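A sketch of that indexing step (the stemming map and the toy documents are assumptions):

```python
from collections import defaultdict

# Both surface forms map to the same dictionary term.
stem = {"respective": "respect", "respectability": "respect"}

# Toy collection: docID -> tokens.
docs = {
    1: ["respective"],
    2: ["respectability"],
    3: ["respective", "cat"],
}

# The postings list of "respect" is the UNION of the docIDs
# in which either surface form occurs.
index = defaultdict(set)
for doc_id, words in docs.items():
    for w in words:
        index[stem.get(w, w)].add(doc_id)

print(sorted(index["respect"]))   # [1, 2, 3]
```

A query for "respective" would likewise be stemmed to "respect" before the lookup.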

Extensions to postings lists data structures

In order to increase the efficiency of using postings lists, we have to improve our methods and data structures.

In other words, we want to do better than the list-walking merge from Chapter 1, which takes O(m+n) operations.

  1. Skip pointers

By including skip pointers in the postings lists, we can skip docIDs that are certainly not in the result; this only applies to AND (intersection) queries.

Skip Pointer Diagram

Usually, a skip pointer is placed every √P entries, where P is the length of the postings list.

Attention:
this approach applies to inverted indexes that are relatively static

Skip pointer search example

First, we need to define how the skip pointers are stored.
Here we will analyze an example from the book:

  • input: search Brutus AND Caesar
  • Process:
    1. start with i = 2, j = 4
    2. if listB[i] == listC[j]:
    • ans.append(listB[i])
    • i += 1, j += 1
    3. if listB[i] < listC[j]:
    • then, if skip(i) exists and listB[skip(i)] <= listC[j]:
      • i = skip(i)

According to the above process, we save the time of visiting 16 and 19 and comparing them, which makes sense.

The complete algorithmic process is given in the book as IntersectWithSkips.
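A Python sketch of a skip-pointer intersection, with skip pointers stored as an assumed dict that maps a list index to the index it can jump to (the docID lists below are also assumptions):

```python
def intersect_with_skips(p1, p2, skips1, skips2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips while they do not overshoot p2[j]; else step once.
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 16, 19, 23, 28, 43]
skips1 = {0: 3, 3: 6}          # 2 can jump to 16, 16 can jump to 28
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
skips2 = {0: 4, 4: 8}
print(intersect_with_skips(p1, p2, skips1, skips2))   # [2, 8]
```

When the walk reaches 16 against 41, the skip pointer jumps straight to 28, so 19 and 23 are never compared.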

  2. Support phrase queries and their combinations efficiently

Think of the real queries that happen in our daily life: we seldom search for a single word, but more often for a phrase.

So a simple postings list no longer meets our needs.

biword inverted index

Treat each pair of consecutive words as one item; the two words together form one dictionary entry.

e.g.

  • "invert and revert"
  • is parsed into:
  • "invert and" and "and revert"
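Building the biword entries can be sketched as:

```python
def biwords(tokens):
    # Each consecutive pair of tokens becomes one dictionary entry.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["invert", "and", "revert"]))
# ['invert and', 'and revert']
```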

shortcomings:
(1) not suitable for single-word queries
(2) the postings lists become too large
(3) the query result is sometimes incorrect, so post-filtering is required

positional inverted index

Add the positions at which the word occurs after each docID in its postings list

e.g.

  • search fruit
  • ans:
  • (1, 3: <100, 220, 300>;
  • 2, 1: <1>)
  • format: (docID, frequency: <pos1, pos2, ...>)

Attention:
The positional index can perform proximity searches, whereas the biword index cannot

Shortcomings:
the postings lists become too large

Application:
handles proximity (near) search

Complexity:
merging with a positional index needs O(T) operations, where T stands for the total number of tokens in the document collection.

The book gives the full algorithm as PositionalIntersect.
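A simplified sketch of the idea (the book's version walks the two position lists with a window for efficiency; this version just checks every position pair, and the toy postings below are assumptions):

```python
def positional_intersect(p1, p2, k):
    """Find docs where the two terms occur within k positions of each other.

    p1, p2 : lists of (docID, [positions]) pairs, sorted by docID
    Returns (docID, pos_in_p1, pos_in_p2) triples.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            for a in positions1:
                for b in positions2:
                    if abs(a - b) <= k:
                        answer.append((doc1, a, b))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer

p1 = [(1, [3, 7]), (2, [1])]
p2 = [(1, [4]), (3, [2])]
print(positional_intersect(p1, p2, 1))   # [(1, 3, 4)]
```

With k = 1 this answers a phrase-like query: the two terms must be adjacent (in either order).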

How to Compromise

The original text has a very good summary of what makes a good phrase to include in the phrase index; I found it very inspiring.
“Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare. Adding Britney Spears as a phrase index entry may only give a speedup factor to that query of about three, because most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1,000. Hence, having the latter is more desirable, even if it is a relatively less common query.”

Following Professor Manning's book, I'll summarize these compromise strategies in three points:

  • use the biword index for phrases whose individual words appear very often in the document set but whose frequency drops significantly once they are combined into a phrase
  • build biword index entries for phrases often searched by users
  • for most of the rest, use the positional index if you have no further references or rules

These are my own notes and summary, drawing on Manning's original book and two CSDN bloggers:

Please forgive any errors in my English.
Let's keep learning together~~