XML Retrieval with slides of C. Manning and H. Schütze - PowerPoint PPT Presentation

Description:

Title: L3S Overview - Visit in Sweden Author: Nejdl Created Date: 3/9/2000 9:55:31 AM Document presentation format: Benutzerdefiniert Company: L3S Other titles – PowerPoint PPT presentation

Slides: 51

Provided by: Nejdl

Transcript and Presenter's Notes

1
XML Retrieval
with slides of C. Manning and H. Schütze
2
Outline

What is XML?
Challenges in XML retrieval
Vector space model for XML retrieval
Evaluation of text-centric XML retrieval

3
What is XML?

eXtensible Markup Language
A framework for defining markup languages
No fixed collection of markup tags
Each XML language targeted for application
All XML languages share features
Enables building of generic tools

4
XML Example
<chapter id="cmds">
<chaptitle>FileCab</chaptitle>
<para>This chapter describes the commands
that manage the <tm>FileCab</tm>inet
application.</para>
</chapter>
5
Basic Structure

An XML document is an ordered, labeled tree
Character data: leaf nodes contain the actual data
(text strings)
Element nodes are each labeled with
a name (often called the element type), and
a set of attributes, each consisting of a name
and a value,
and can have child nodes
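
The tree view can be made concrete with a short sketch using Python's standard xml.etree.ElementTree module. The sample string is modelled on the chapter example from the previous slide; the helper name walk is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the chapter example from the slides.
XML_TEXT = (
    '<chapter id="cmds">'
    '<chaptitle>FileCab</chaptitle>'
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para>'
    '</chapter>'
)


def walk(node, depth=0):
    """Yield (depth, element name, attributes, leading text) in document order."""
    text = (node.text or "").strip()
    yield depth, node.tag, dict(node.attrib), text
    for child in node:  # ElementTree keeps children in document order
        yield from walk(child, depth + 1)


root = ET.fromstring(XML_TEXT)
rows = list(walk(root))
for depth, tag, attrs, text in rows:
    print("  " * depth + f"{tag} {attrs} {text!r}")
```

Note that ElementTree stores text that follows a child element (here "inet application." after the tm element) in that child's tail attribute rather than as a separate leaf node, so it does not appear in the rows above.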

6
Elements

Elements are denoted by markup tags
<foo attr1="value"> thetext </foo>
Element start tag: foo
Attribute: attr1
The character data: thetext
Matching element end tag: </foo>

7
Why Use XML?

Represent semi-structured data:
data that are structured, but don't fit the
relational model
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free

8
XML Schemas

Schema: syntax definition of an XML language
Schema language: formal language for expressing
XML schemas
Examples:
Document Type Definition (DTD)
XML Schema (W3C)
Relevance for XML IR:
Our job is much easier if we have a (one) schema

9
Challenges in XML Retrieval
04/12/2008
10
Data vs. Text-centric XML

Data-centric XML: used for messaging between
enterprise applications
Mainly a recasting of relational data
Numerical and non-text data dominate
Mostly stored in databases
Text-centric XML: used for annotating content
Rich in text
Demands good integration of text retrieval
functionality
Queries are user information needs, e.g., "give me
the Section (element) of the document that tells
me how to change a brake light"

11
Text search in RDB

Highly structured text search problems are most
efficiently handled by a relational database.
Information need: find all employees who are
involved with invoicing.
SQL query:
select lastname from employees where job_desc
like 'invoic%'
This satisfies the information need with high
precision and recall.

id lastname job_desc salary
invoicing
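
A minimal sketch of this kind of query, run against an in-memory database with Python's built-in sqlite3 module. The column names follow the table header above; the three rows are invented for illustration. It also shows why the wildcard matters: a pattern with a trailing % matches only descriptions that begin with "invoic".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, lastname TEXT, "
    "job_desc TEXT, salary INTEGER)"
)
# Hypothetical rows, invented for this illustration.
conn.executemany(
    "INSERT INTO employees (lastname, job_desc, salary) VALUES (?, ?, ?)",
    [
        ("Smith", "invoicing and billing", 50000),
        ("Jones", "sends invoices to customers", 48000),
        ("Brown", "marketing", 52000),
    ],
)

# 'invoic%' matches descriptions that START with "invoic".
prefix_hits = [row[0] for row in conn.execute(
    "SELECT lastname FROM employees "
    "WHERE job_desc LIKE 'invoic%' ORDER BY lastname"
)]

# '%invoic%' also matches the word in the middle of a description.
anywhere_hits = [row[0] for row in conn.execute(
    "SELECT lastname FROM employees "
    "WHERE job_desc LIKE '%invoic%' ORDER BY lastname"
)]

print(prefix_hits, anywhere_hits)
```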

12
Structured Retrieval Data

Many structured data sources containing text are
best modeled as structured documents rather than
relational data.
Search over structured documents is called
structured retrieval.
Queries in structured retrieval can be either
structured or unstructured (here we assume that
the collection consists only of structured
documents).
Applications of structured retrieval include
digital libraries, patent databases, text with
tagged entities (e.g. persons and locations),
application output as marked up text.

13
Structured Retrieval Query

Queries combine textual criteria with structural
criteria.
Information need: articles about sightseeing
tours of the Vatican and the Coliseum.
Usability of structured queries:
knowledge of the data structure
sightseeing AND (COUNTRY:Vatican OR
LANDMARK:Coliseum)
sightseeing AND (STATE:Vatican OR
BUILDING:Coliseum)
syntax of the query language
(XQuery)
low recall in Boolean model

for a in doc()//au let
14
IR XML Challenges: Indexing Units, Term
Statistics

Indexing unit: there is no document unit in XML
Book? Chapter?
How do we compute tf and idf?
Global tf/idf over all text context is useless
Indexing granularity
Structured document retrieval principle: a system
should always retrieve the most specific part of
a document answering the query.

15
Partitioning an XML document into non-overlapping
indexing units
16
IR XML Challenges: Schemas

Schema heterogeneity
Many schemas
Schemas not known in advance
Schemas change
Users don't understand schemas
Need to identify similar elements in different
schemas
Example: name, surname, family name

17
IR XML Challenges: UI

Help user find relevant nodes in schema
Author, editor, contributor, from/sender
What is the query language you expose to the
user?
Specific XML query language? No.
Forms? Parametric search?
A textbox?
In general: design a layer between XML and the user

18
Vector Spaces and XML
19
Vector spaces and XML

Vector spaces: tried and tested framework for
keyword retrieval
Other bag-of-words applications in text
classification, clustering
For text-centric XML retrieval, can we make use
of vector space ideas?
Challenge: capture the structure of an XML
document in the vector space.

20
Vector spaces and XML

For instance, distinguish between the following
two cases:

[Figure: two Book trees. One has Title "Microsoft" and Author "Bill Gates"; the other has Title "The Pearly Gates" and Author "Bill Wulf".]
21
Content-rich XML representation
[Figure: the same two Book trees with the text split into lexicon terms as leaves: Title (Microsoft) and Author (Bill, Gates); Title (The, Pearly, Gates) and Author (Bill, Wulf).]
22
Encoding the Gates differently

What are the axes of the vector space?
In text retrieval, there would be a single axis
for Gates
Here we must separate out the two occurrences,
under Author and Title
Thus, axes must represent not only terms, but
something about their position in an XML tree

23
Queries

Before addressing this, let us consider the kinds
of queries we want to handle

[Figure: example query trees rooted at Book, with Title and Author children over the terms Microsoft, Bill and Gates.]
24
Subtrees and structure

Consider all subtrees of the document that
include at least one lexicon term

[Figure: the subtrees of the Book document, e.g. the single terms Microsoft, Bill and Gates; Title over Microsoft; Author over Bill and/or Gates; and Book over these.]
25
Structural terms

Call each of the resulting (8, in the previous
slide) subtrees a structural term
Note that structural terms might occur multiple
times in a document
Create one axis in the vector space for each
distinct structural term
Weights based on frequencies for number of
occurrences (just as we had tf)
All the usual issues with terms (stemming? Case
folding?) remain
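
A minimal sketch of tf counting over structural terms. It makes a simplifying assumption that is not in the slides: instead of enumerating every subtree, it uses only pairs of a root-to-element path and a word found directly inside that element. The function name structural_terms and the sample documents (modelled on the Book and Play examples in the slides) are illustrative; case folding is applied, stemming is not.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_terms(xml_text):
    """Count (element path, lowercased word) pairs in one XML document.

    Simplification: one axis per (path, word) pair, not per subtree.
    """
    counts = Counter()

    def visit(node, path):
        path = path + (node.tag,)
        # Only text directly at the start of this element is counted;
        # tail text after child elements is ignored in this sketch.
        for word in (node.text or "").split():
            counts[("/".join(path), word.lower())] += 1
        for child in node:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts


doc_a = "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>"
doc_b = "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>"
play = "<Play><Act>To be or not to be</Act></Play>"

vec_a = structural_terms(doc_a)
vec_b = structural_terms(doc_b)
vec_play = structural_terms(play)
print(vec_a)
print(vec_b)
print(vec_play)
```

With this encoding, "Gates" under Author and "Gates" under Title fall on different axes, and repeated words simply get a higher count on their axis.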

26
Example of tf weighting
[Figure: five Play → Act trees. One contains the full text "To be or not to be"; the others contain the single terms to, be, or, not.]

Here the structural terms containing "to" or "be"
would have more weight than those that don't

Exercise: How many axes are there in this example?
27
Structural terms: docs and queries

The notion of structural terms is independent of
any schema/DTD for the XML documents
Well-suited to a heterogeneous collection of XML
documents
Each document becomes a vector in the space of
structural terms
A query tree can likewise be factored into
structural terms
And represented as a vector
Allows weighting portions of the query

28
The catch remains

This is all very promising, but
How big is this vector space?
Can be exponentially large in the size of the
document
Cannot hope to build such an index
And in any case, still fails to answer queries
like

[Query tree: Book, with Gates somewhere underneath]
29
Descendants
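[Figure: the query tree Author → Gates, and two documents. In one, Book → Author contains "Bill Gates" directly; in the other, Book → Author has children FirstName "Bill" and LastName "Gates".]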
No known DTD. Query seeks Gates under Author.
30
Handling descendants in the vector space

Devise a match function that yields a score in
[0,1] between structural terms
E.g., when the structural terms are paths,
measure overlap
The greater the overlap, the higher the match
score
Can adjust match for where the overlap occurs

[Figure: matching the path Book → Bill in Book → Author → Bill vs. Book → Author → LastName → Bill]
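
One possible match function for path-shaped structural terms, sketched under assumptions. The slides only say "measure overlap"; the formula below is the context resemblance measure from chapter 10 of Manning, Raghavan and Schütze's Introduction to Information Retrieval, used here as one option among many. Paths are written as slash-separated strings, and the names path_match and is_subsequence are illustrative.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting nodes."""
    remaining = iter(long)
    # `step in remaining` consumes the iterator, so order is respected.
    return all(step in remaining for step in short)


def path_match(query_path, doc_path):
    """Score in [0, 1]: (1 + |q|) / (1 + |d|) if q fits into d, else 0."""
    q = query_path.split("/")
    d = doc_path.split("/")
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))


print(path_match("Book/Bill", "Book/Bill"))                  # identical paths
print(path_match("Book/Bill", "Book/Author/Bill"))           # one extra node
print(path_match("Book/Bill", "Book/Author/LastName/Bill"))  # two extra nodes
print(path_match("Book/Title", "Book/Author"))               # no match
```

The score falls as more nodes have to be inserted into the query path to reach the document path, which is one way to express that a looser structural fit should count for less.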
31
How do we use this in retrieval?

First enumerate structural terms in the query
Measure each for match against the dictionary of
structural terms
Just like a postings lookup, except not Boolean
(does the term exist)
Instead, produce a score that says "80% close to
this structural term", etc.
Then, retrieve docs with that structural term,
compute cosine similarities, etc.

32
Example of a retrieval step
Index
ST1: Doc1 (0.7), Doc4 (0.3), Doc9 (0.2)
ST5: Doc3 (1.0), Doc6 (0.8), Doc9 (0.6)
ST = Structural Term
Now rank the Docs by cosine similarity, e.g.,
Doc9 scores 0.578.
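
The retrieval step can be sketched as score accumulation over postings. The postings are the ones listed on the slide; the match scores of the query's structural term against ST1 and ST5 are invented, so the result is not meant to reproduce the 0.578 quoted above. The scores are plain weighted sums; the vector-length normalisation needed for a true cosine is left out of this sketch.

```python
from collections import defaultdict

# Postings from the slide: structural term -> {document: weight}.
INDEX = {
    "ST1": {"Doc1": 0.7, "Doc4": 0.3, "Doc9": 0.2},
    "ST5": {"Doc3": 1.0, "Doc6": 0.8, "Doc9": 0.6},
}

# HYPOTHETICAL match scores between the query's structural term and
# the dictionary entries (assumed values, not taken from the slides).
QUERY_MATCH = {"ST1": 0.5, "ST5": 0.8}


def score(index, query_match):
    """Accumulate match * weight per document and rank descending."""
    scores = defaultdict(float)
    for structural_term, match in query_match.items():
        for doc, weight in index.get(structural_term, {}).items():
            scores[doc] += match * weight
    # Sort by score (highest first), then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = score(INDEX, QUERY_MATCH)
for doc, value in ranking:
    print(doc, round(value, 3))
```

Doc9 is the only document that appears in both postings lists, so it is the only one whose score sums two contributions.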
33
Closing technicalities

But what exactly is a Doc?
In a sense, an entire corpus can be viewed as an
XML document

Corpus
Doc1
Doc2
Doc3
Doc4
34
What are the Docs in the index?

Anything we are prepared to return as an answer
Could be nodes, some of their children

35
What are queries we can't handle using vector
spaces?

Find figures that describe the Corba architecture
and the paragraphs that refer to those figures
Requires JOIN between 2 tables
Retrieve the titles of articles published in the
Special Feature section of the journal IEEE Micro
Depends on order of sibling nodes.
Requires <article> elements to appear after a
specific <sec> node.
Query tree cannot express relations that depend
on node ordering.

<journal><title></title>
<sec1><title></title></sec1>
<article></article>
<article></article>
</journal>
36
Can we do IDF?

Yes, but it doesn't make sense to do it corpus-wide
Can do it, for instance, within all text under a
certain element name, say Chapter
Yields a tf-idf weight for each lexicon term
under an element
Issue: how do we propagate contributions to
higher level nodes?
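
A minimal sketch of element-scoped idf, under stated assumptions: every element with a given name is treated as one "document" for that name, an element's text includes the text of all its descendants, and idf is log10(N / df). The function name element_idf and the three sample documents are invented for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_idf(xml_docs):
    """Return {element name: {term: idf}} computed per element name."""
    n_elements = defaultdict(int)               # element name -> count
    df = defaultdict(lambda: defaultdict(int))  # element name -> term -> df
    for xml_text in xml_docs:
        for elem in ET.fromstring(xml_text).iter():
            n_elements[elem.tag] += 1
            # Join with spaces so words of sibling elements stay separate.
            words = " ".join(elem.itertext()).split()
            for term in {w.lower() for w in words}:
                df[elem.tag][term] += 1
    return {
        tag: {term: math.log10(n_elements[tag] / count)
              for term, count in term_df.items()}
        for tag, term_df in df.items()
    }


# Hypothetical mini-collection, invented for this illustration.
DOCS = [
    "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>",
    "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>",
    "<Book><Title>Garden Gates</Title><Author>Ann Smith</Author></Book>",
]

idf = element_idf(DOCS)
print(idf["Author"]["gates"], idf["Title"]["gates"], idf["Book"]["gates"])
```

In this toy collection "gates" is rare under Author but occurs in every Book, so its idf is high in one context and zero in the other. This is exactly the tension behind the propagation question.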

37
Example

Say Gates has high IDF under the Author element
How should it be tf-idf weighted for the Book
element?
Should we use the idf for Gates in Author or that
in Book?

[Figure: Book → Author → Bill, Gates]
38
INEX: a benchmark for text-centric XML retrieval
39
INEX

Benchmark for the evaluation of XML retrieval
Analog of TREC (recall IIR8)
Consists of
Set of XML documents
Collection of retrieval tasks

40
INEX

Each engine indexes docs
Engine team converts retrieval tasks into queries
In XML query language understood by engine
In response, the engine retrieves not docs, but
elements within docs
Engine ranks retrieved elements

41
INEX assessment

For each query, each retrieved element is
human-assessed on two measures:
Relevance: how relevant is the retrieved element?
Coverage: is the retrieved element too specific,
too general, or just right?
E.g., if the query seeks a definition of the Fast
Fourier Transform, do I get the equation (too
specific), the chapter containing the definition
(too general) or the definition itself?
These assessments are turned into composite
precision/recall measures

42
INEX corpus (Ad Hoc Track)

Articles from IEEE Computer Society publications
Wikipedia collection 2009
others

43
INEX topics

Each topic is an information need, one of two
kinds:
Content Only (CO): free text queries
Content and Structure (CAS): explicit structural
constraints, e.g., containment conditions.

44
Sample INEX CO topic

<Title> computational biology </Title>
<Keywords> computational biology, bioinformatics,
genome, genomics, proteomics, sequencing, protein
folding </Keywords>
<Description> Challenges that arise, and
approaches being explored, in the
interdisciplinary field of computational
biology </Description>
<Narrative> To be relevant, a document/component
must either talk in general terms about the
opportunities at the intersection of computer
science and biology, or describe a particular
problem and the ways it is being attacked.
</Narrative>

45
INEX assessment

Each engine formulates the topic as a query
E.g., use the keywords listed in the topic.
Engine retrieves one or more elements and ranks
them.
Human evaluators assign to each retrieved element
relevance and coverage scores.

46
Assessments

Relevance assessed on a scale from Irrelevant
(scoring 0) to Highly Relevant (scoring 3)
Coverage assessed on a scale with four levels:
No Coverage (N: the query topic does not match
anything in the element)
Too Large (L: the topic is only a minor theme of
the element retrieved)
Too Small (S: the element is too small to provide
the information required)
Exact Coverage (E)
So every element returned by each engine has
ratings from {0,1,2,3} × {N,S,L,E}
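
A small sketch of how such a rating pair might be represented and collapsed into a single binary judgement. The strict rule used here (count an element only if it is Highly Relevant with Exact Coverage) is an assumption chosen for illustration; it is one of several possible ways to combine the two scales, and the name strict_score is hypothetical.

```python
RELEVANCE_LEVELS = {0, 1, 2, 3}          # Irrelevant .. Highly Relevant
COVERAGE_LEVELS = {"N", "S", "L", "E"}   # No, Too Small, Too Large, Exact


def strict_score(relevance, coverage):
    """Return 1 only for a (3, 'E') assessment, 0 for anything else.

    ASSUMPTION: a deliberately strict combination rule for illustration.
    """
    if relevance not in RELEVANCE_LEVELS:
        raise ValueError(f"unknown relevance level: {relevance!r}")
    if coverage not in COVERAGE_LEVELS:
        raise ValueError(f"unknown coverage level: {coverage!r}")
    return 1 if (relevance, coverage) == (3, "E") else 0


# Hypothetical assessments for three retrieved elements.
assessments = [(3, "E"), (3, "L"), (1, "S")]
judgements = [strict_score(rel, cov) for rel, cov in assessments]
print(judgements)
```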

47
Combining the relevance/coverage assessments