信息检索导论第十章笔记(英文)

XML Retrieval

Abstract

There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language.

信息检索导论第十章笔记(英文)

In the chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.

  • Problems
  1. Unranked retrieval methods suffer from low recall.
  2. Most users are also notoriously bad at precisely stating strctural constraints.
  3. Users are mostly unfamiliar with structured search and advanced search interfaces or unwilling touse them.

This is why we try to recommend the XML Retrieval.

  • where to use the XML Retrieval
    digital libraries
    patent databases
    blogs
    text in which entities like persons and locations have been tagged

just like the examples in the raw book:
信息检索导论第十章笔记(英文)

but the RDB search is not suitable to the above three situation. There are three main problems which is mentioned in the abstract.

In order to solve that, we adapt the ranked retrieval methods.

Basic concepts

This is a XML document:

信息检索导论第十章笔记(英文)

An xml document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag.

And one element can have one or more XML attributes.
like:

信息检索导论第十章笔记(英文)

This is a structured tree figure of an XML object:

信息检索导论第十章笔记(英文)

  • DOM

the XML DOM reffers to the XML document object model, like the above figure.

  • XPath

XPath is a standard for enumerating paths in an XML documents collection. Only a small subset of XPath is needed for our purposes.

For example,
title#“Macbeth” selects all titles containing
the term Macbeth.

  • SCHEMA

A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute.

  • NEXI

Narrowed Extended XPath I.

In particular, //section is embedded under //article.

信息检索导论第十章笔记(英文)

The two yr conditions are relational attribute constraints.

We can representb querues as trees in the same way.

In this figure:

信息检索导论第十章笔记(英文)

q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war.

Challenges in XML retrieval

Challenge 1

If we are looking for Macbeth’s castle, the search should return the scene, act or the whole play.

Users hope get a part of the document(the XML element), not the entire document like unstructured retrieval.

  • Structured doocument retrieval principle

Choose the best matched part of the document.
The system should always return the most explict and suitable answers.

According to this principle, it motivateds a retrieval strategy thatreturns the smallest unit that contains the information sought.

Challenge 2

how to define the indexing unit.

In structured retrieval, there are a series of different methods.

  1. group nodes into nonvverlapping pseudodocuments.

信息检索导论第十章笔记(英文)

pseudodocuments may not make sence to the users because they are not coherent units

  1. use one of the largest elements as the indexing unit

this two-stage retrieval process fails to return the best subelement for many qyeries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it

  1. search all leaves, select the most relevant ones, and extend them to larger units in postprocessing(bottom up)

However, it has a similar problem as the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements in which it is contained.

  1. index all elements

Indexing all elements means that search results will be highly redundant.

Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies
include:

信息检索导论第十章笔记(英文)

Challenge 3

a challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking.

In order to solve that, we compute a corresponding idf for each XML context respectively.

A vector space model for XML retrieval

For each dimension of vector space, the word and its position in the XML tree are considered simultaneously

How to do that: Mapping XML documents to lexicalized subtrees.

信息检索导论第十章笔记(英文)

Therefore, queries and documents can be represented as vectors in the lexicalized readme space, and the similarity can be calculated according to the previous vector similarity formula

  • the difference between structured and unstructured method

The main difference between the two is that in the former, each dimension of the vector corresponds to a term, while in the latter, it corresponds to a lexicalized subtree.

There is a tradeoff between the dimension of vector space and the precision of query,
Index all paths that end up with a single term, in other words, all XML contexts.

A simple measure of the similarity of a path cq in a query and a path cd in a document is the following
context context resemblance function Cr:

信息检索导论第十章笔记(英文)

For example,

信息检索导论第十章笔记(英文)

The final score for a document is computed as a variant of the cosine measure:

信息检索导论第十章笔记(英文)

Note:
SimNoMerge is not a real cosine function, because it may exceed 1.0.

  • the algorithm

信息检索导论第十章笔记(英文)

Evaluation of XML retrieval

  • INEX

The first evaluation platform in XML Retrieval Research.

INEX 2002 collections statistics

信息检索导论第十章笔记(英文)

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance.

component coverage

The component coverage dimension evaluates whether the element retrieved is “structurally” correct, that is, neither too low nor too high in the tree.

Which has four cases:
Exact coverage(E)
Too small(S)
Too large(L)
No coverage(N)

Topic relevance

Topic relevance dimension also has four levels:

highly relevant(3)
fairly relevant(2)
marginally relevant(1)
nonrelevant(0)


the relevance-coverage combinations are quantized as folls:

信息检索导论第十章笔记(英文)


The number of relevant components in a retrieved set A of components can then be computed as:

信息检索导论第十章笔记(英文)

Note:
As an approximation, the standard definitions of precision, recall and F from Chapter 8 can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments.

  • Disadvantages

The overlap is not accounted for.
This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result.

Text-centric versus data-centric XML retrieval

Text-centric

we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by

(i) long text fields (e.g., sections of a document) (ii) inexact matching
(iii) relevanceranked results. Relational databases do not deal well with this use case.

Data-centric

Data-centric in contrast, data-centric XML mainly encodes numerical and nontext XML attribute-value data. When querying data-centric XML, we want to impose
exact match conditions in most cases.
This puts the emphasis on the structural aspects of XML documents and queries.