信息检索导论第十章笔记(英文)
XML Retrieval
Abstract
There are fundamental differences between IR and database systems in terms of retrieval model, data structures, and query language.
In the chapter, we look at how ranked retrieval methods can be adapted to structured documents to address these problems.
- Problems
- Unranked retrieval methods suffer from low recall.
- Most users are also notoriously bad at precisely stating strctural constraints.
- Users are mostly unfamiliar with structured search and advanced search interfaces or unwilling touse them.
This is why we try to recommend the XML Retrieval.
- where to use the XML Retrieval
digital libraries
patent databases
blogs
text in which entities like persons and locations have been tagged
just like the examples in the raw book:
but the RDB search is not suitable to the above three situation. There are three main problems which is mentioned in the abstract.
In order to solve that, we adapt the ranked retrieval methods.
Basic concepts
This is a XML document:
An xml document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag.
And one element can have one or more XML attributes.
like:
This is a structured tree figure of an XML object:
- DOM
the XML DOM reffers to the XML document object model, like the above figure.
- XPath
XPath is a standard for enumerating paths in an XML documents collection. Only a small subset of XPath is needed for our purposes.
For example,
title#“Macbeth” selects all titles containing
the term Macbeth.
- SCHEMA
A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute.
- NEXI
Narrowed Extended XPath I.
In particular, //section is embedded under //article.
The two yr conditions are relational attribute constraints.
We can representb querues as trees in the same way.
In this figure:
q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war.
Challenges in XML retrieval
Challenge 1
If we are looking for Macbeth’s castle, the search should return the scene, act or the whole play.
Users hope get a part of the document(the XML element), not the entire document like unstructured retrieval.
- Structured doocument retrieval principle
Choose the best matched part of the document.
The system should always return the most explict and suitable answers.
According to this principle, it motivateds a retrieval strategy thatreturns the smallest unit that contains the information sought.
Challenge 2
how to define the indexing unit.
In structured retrieval, there are a series of different methods.
- group nodes into nonvverlapping pseudodocuments.
pseudodocuments may not make sence to the users because they are not coherent units
- use one of the largest elements as the indexing unit
this two-stage retrieval process fails to return the best subelement for many qyeries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it
- search all leaves, select the most relevant ones, and extend them to larger units in postprocessing(bottom up)
However, it has a similar problem as the last one: the relevance of a leaf element is often not a good predictor of the relevance of elements in which it is contained.
- index all elements
Indexing all elements means that search results will be highly redundant.
Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned. Restriction strategies
include:
Challenge 3
a challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking.
In order to solve that, we compute a corresponding idf for each XML context respectively.
A vector space model for XML retrieval
For each dimension of vector space, the word and its position in the XML tree are considered simultaneously
How to do that: Mapping XML documents to lexicalized subtrees.
Therefore, queries and documents can be represented as vectors in the lexicalized readme space, and the similarity can be calculated according to the previous vector similarity formula
- the difference between structured and unstructured method
The main difference between the two is that in the former, each dimension of the vector corresponds to a term, while in the latter, it corresponds to a lexicalized subtree.
There is a tradeoff between the dimension of vector space and the precision of query,
Index all paths that end up with a single term, in other words, all XML contexts.
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following
context context resemblance function Cr:
For example,
The final score for a document is computed as a variant of the cosine measure:
Note:
SimNoMerge is not a real cosine function, because it may exceed 1.0.
- the algorithm
Evaluation of XML retrieval
- INEX
The first evaluation platform in XML Retrieval Research.
INEX 2002 collections statistics
INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance.
component coverage
The component coverage dimension evaluates whether the element retrieved is “structurally” correct, that is, neither too low nor too high in the tree.
Which has four cases:
Exact coverage(E)
Too small(S)
Too large(L)
No coverage(N)
Topic relevance
Topic relevance dimension also has four levels:
highly relevant(3)
fairly relevant(2)
marginally relevant(1)
nonrelevant(0)
the relevance-coverage combinations are quantized as folls:
The number of relevant components in a retrieved set A of components can then be computed as:
Note:
As an approximation, the standard definitions of precision, recall and F from Chapter 8 can be applied to this modified definition of relevant items retrieved, with some subtleties because we sum graded as opposed to binary relevance assessments.
- Disadvantages
The overlap is not accounted for.
This problem is worse in XML retrieval because of the problem of multiple nested elements occurring in a search result.
Text-centric versus data-centric XML retrieval
Text-centric
we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by
(i) long text fields (e.g., sections of a document) (ii) inexact matching
(iii) relevanceranked results. Relational databases do not deal well with this use case.
Data-centric
Data-centric in contrast, data-centric XML mainly encodes numerical and nontext XML attribute-value data. When querying data-centric XML, we want to impose
exact match conditions in most cases.
This puts the emphasis on the structural aspects of XML documents and queries.