Title: XML Retrieval, with slides of C. Manning and H. Schütze
1 XML Retrieval, with slides of C. Manning and H. Schütze
2 Outline
- What is XML?
- Challenges in XML retrieval
- Vector space model for XML retrieval
- Evaluation of text-centric XML retrieval
3 What is XML?
- eXtensible Markup Language
- A framework for defining markup languages
- No fixed collection of markup tags
- Each XML language is targeted at a particular application
- All XML languages share features
- Enables building of generic tools
4 XML Example
  <chapter id="cmds">
    <chaptitle>FileCab</chaptitle>
    <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
  </chapter>
5 Basic Structure
- An XML document is an ordered, labeled tree
- Character data leaf nodes contain the actual data (text strings)
- Element nodes are each labeled with
  - a name (often called the element type), and
  - a set of attributes, each consisting of a name and a value
- Element nodes can have child nodes
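The tree view above can be made concrete with a short sketch. It assumes Python's standard `xml.etree.ElementTree` parser and uses the chapter example from the previous slide; the helper name `walk` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The chapter example from the "XML Example" slide, written on one logical line.
XML = (
    '<chapter id="cmds">'
    '<chaptitle>FileCab</chaptitle>'
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para>'
    '</chapter>'
)

def walk(elem, depth=0):
    """Return (depth, kind, label) triples for the tree, in document order."""
    rows = [(depth, "element", elem.tag)]
    # Character data directly inside the element, before its first child.
    if elem.text and elem.text.strip():
        rows.append((depth + 1, "text", elem.text.strip()))
    for child in elem:
        rows.extend(walk(child, depth + 1))
        # Character data that follows a child still belongs to the parent.
        if child.tail and child.tail.strip():
            rows.append((depth + 1, "text", child.tail.strip()))
    return rows

root = ET.fromstring(XML)
rows = walk(root)
for depth, kind, label in rows:
    print("  " * depth + f"{kind}: {label}")
```

Element nodes carry the name and attributes (`root.attrib` holds `id="cmds"`), while the text leaves carry the actual data, in the order they appear in the document.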
6 Elements
- Elements are denoted by markup tags
- <foo attr1="value"> thetext </foo>
- Element start tag: <foo>
- Attribute: attr1
- The character data: thetext
- Matching element end tag: </foo>
7 Why Use XML?
- Represent semi-structured data
  - data that are structured, but don't fit the relational model
- XML is more flexible than DBs
- XML is more structured than simple IR
- You get a massive infrastructure for free
8 XML Schemas
- Schema = syntax definition of an XML language
- Schema language = formal language for expressing XML schemas
- Examples
  - Document Type Definition (DTD)
  - XML Schema (W3C)
- Relevance for XML IR
  - Our job is much easier if we have a (one) schema
9 Challenges in XML Retrieval
04/12/2008
10 Data-centric vs. Text-centric XML
- Data-centric XML: used for messaging between enterprise applications
  - Mainly a recasting of relational data
  - Numerical and non-text data dominate
  - Mostly stored in databases
- Text-centric XML: used for annotating content
  - Rich in text
  - Demands good integration of text retrieval functionality
  - Queries are user information needs, e.g., "give me the Section (element) of the document that tells me how to change a brake light"
11 Text search in RDB
- Highly structured text search problems are most efficiently handled by a relational database
- Information need: find all employees who are involved with invoicing
- SQL query:
  select lastname from employees where job_desc like 'invoic%'
- This satisfies the information need with high precision and recall.
(Table employees: columns id, lastname, job_desc, salary; example job_desc value: invoicing)
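As a minimal sketch of the query above, the following uses Python's built-in `sqlite3` module with an in-memory database. The column names come from the slide; the three rows are hypothetical and exist only to show the prefix match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, lastname TEXT, job_desc TEXT, salary INTEGER)"
)
# Hypothetical rows, invented for illustration only.
conn.executemany(
    "INSERT INTO employees (lastname, job_desc, salary) VALUES (?, ?, ?)",
    [
        ("Smith", "invoicing", 50000),
        ("Jones", "invoice processing", 48000),
        ("Brown", "marketing", 52000),
    ],
)

# The trailing % makes LIKE a prefix match: invoicing, invoice, invoices, ...
rows = conn.execute(
    "SELECT lastname FROM employees "
    "WHERE job_desc LIKE 'invoic%' ORDER BY lastname"
).fetchall()
matches = [lastname for (lastname,) in rows]
print(matches)
```

Because `job_desc` is a short, well-defined field, a simple pattern is enough here; the same approach breaks down once the text lives in long, nested documents.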
12 Structured Retrieval: Data
- Many structured data sources containing text are best modeled as structured documents rather than relational data.
- Search over structured documents = structured retrieval.
- Queries in structured retrieval can be either structured or unstructured (here we assume that the collection consists only of structured documents).
- Applications of structured retrieval include digital libraries, patent databases, text with tagged entities (e.g. persons and locations), and application output as marked-up text.
13 Structured Retrieval: Query
- Queries combine textual criteria with structural criteria.
- Information need: articles about sightseeing tours of the Vatican and the Coliseum.
- Usability of structured queries depends on
  - knowledge of the data structure
    - sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum)
    - sightseeing AND (STATE:Vatican OR BUILDING:Coliseum)
  - syntax of the query language (XQuery)
    - for $a in doc()//au let ...
  - low recall in the Boolean model
14 IR XML Challenges: Indexing Units, Term Statistics
- Indexing unit: there is no document unit in XML
  - Book? Chapter?
- How do we compute tf and idf?
  - Global tf/idf over all text context is useless
  - Indexing granularity
- Structured document retrieval principle: a system should always retrieve the most specific part of a document answering the query.
15 Partitioning an XML document into non-overlapping indexing units
16 IR XML Challenges: Schemas
- Schema heterogeneity
  - Many schemas
  - Schemas not known in advance
  - Schemas change
  - Users don't understand schemas
- Need to identify similar elements in different schemas
  - Example: name, surname, family name
17 IR XML Challenges: UI
- Help user find relevant nodes in schema
  - Author, editor, contributor, from/sender
- What is the query language you expose to the user?
  - Specific XML query language? No.
  - Forms? Parametric search?
  - A textbox?
- In general: design a layer between XML and the user
18 Vector Spaces and XML
19 Vector spaces and XML
- Vector spaces: a tried and tested framework for keyword retrieval
  - Other "bag of words" applications in text classification, clustering
- For text-centric XML retrieval, can we make use of vector space ideas?
- Challenge: capture the structure of an XML document in the vector space.
20 Vector spaces and XML
- For instance, distinguish between the following two cases
  (Figure, tree 1: Book → Title: "Microsoft"; Author: "Bill Gates")
  (Figure, tree 2: Book → Title: "The Pearly Gates"; Author: "Bill Wulf")
21 Content-rich XML representation
  (Figure, tree 1: Book → Title → Microsoft; Author → Bill, Gates)
  (Figure, tree 2: Book → Title → The, Pearly, Gates; Author → Bill, Wulf)
  The leaves are lexicon terms.
22 Encoding the Gates differently
- What are the axes of the vector space?
- In text retrieval, there would be a single axis for Gates
- Here we must separate out the two occurrences, under Author and Title
- Thus, axes must represent not only terms, but something about their position in an XML tree
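One simple way to make axes position-aware is to key each axis on a (path, term) pair instead of on the term alone. The sketch below is a deliberate simplification: it uses root-to-leaf paths rather than general subtrees, lower-cases terms, and ignores text that follows a child element. The function name `structural_axes` is hypothetical, and the two documents encode the Book trees from the earlier slide.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def structural_axes(xml_text):
    """Count (path, term) pairs; each distinct pair is one axis."""
    counts = Counter()

    def visit(elem, path):
        path = path + (elem.tag,)
        # Every word directly under this element gets the element's path.
        for term in (elem.text or "").split():
            counts[("/".join(path), term.lower())] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts

doc1 = "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>"
doc2 = "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>"

v1 = structural_axes(doc1)
v2 = structural_axes(doc2)
print(sorted(v1))
print(sorted(v2))
```

With these axes, "gates" under `Book/Author` and "gates" under `Book/Title` no longer collide, so the two books get different vectors even though they share the word.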
23 Queries
- Before addressing this, let us consider the kinds of queries we want to handle
  (Figure: two example query trees rooted at Book, built from Title and Author nodes and the terms Bill, Gates, and Microsoft)
24 Subtrees and structure
- Consider all subtrees of the document that include at least one lexicon term
  (Figure: example subtrees of the Book document, ranging from the single lexicon terms Bill, Microsoft, and Gates, through Title and Author subtrees, up to the whole Book tree)
25 Structural terms
- Call each of the resulting (8, in the previous slide) subtrees a structural term
- Note that structural terms might occur multiple times in a document
- Create one axis in the vector space for each distinct structural term
- Weights based on frequencies for number of occurrences (just as we had tf)
- All the usual issues with terms (stemming? case folding?) remain
26 Example of tf weighting
  (Figure: the document Play → Act → "To be or not to be", and the structural terms Play → Act → to, Play → Act → be, Play → Act → or, Play → Act → not)
- Here the structural terms containing "to" or "be" would have more weight than those that don't
Exercise: How many axes are there in this example?
27 Structural terms: docs and queries
- The notion of structural terms is independent of any schema/DTD for the XML documents
  - Well-suited to a heterogeneous collection of XML documents
- Each document becomes a vector in the space of structural terms
- A query tree can likewise be factored into structural terms
  - And represented as a vector
  - Allows weighting portions of the query
28 The catch remains
- This is all very promising, but ...
- How big is this vector space?
  - Can be exponentially large in the size of the document
  - Cannot hope to build such an index
- And in any case, still fails to answer queries like
  (Figure: Book → (somewhere underneath) → Gates)
29 Descendants
  (Figure: the query Author → Gates, matched against Book → Author → "Bill Gates" vs. Book → Author → FirstName: Bill; LastName: Gates)
No known DTD. Query seeks Gates under Author.
30 Handling descendants in the vector space
- Devise a match function that yields a score in [0,1] between structural terms
- E.g., when the structural terms are paths, measure overlap
- The greater the overlap, the higher the match score
  - Can adjust match for where the overlap occurs
  (Figure: three paths compared for overlap: Book → Bill, Book → Author → Bill, and Book → Author → LastName → Bill)
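The slide leaves the match function open. One concrete choice, sketched here, is the context resemblance function described in Chapter 10 of IIR: if the query path can be turned into the document path by inserting extra nodes, the score is (1 + |query path|) / (1 + |document path|), otherwise 0. Representing paths as tuples that end in the lexicon term, and the function names, are assumptions made for this example.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting nodes."""
    remaining = iter(long)
    # `node in remaining` advances the iterator, so order is respected.
    return all(node in remaining for node in short)

def context_resemblance(query_path, doc_path):
    """Score in [0, 1]; 1.0 only when the two paths are identical."""
    if not is_subsequence(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))

query = ("Book", "Bill")
print(context_resemblance(query, ("Book", "Author", "Bill")))
print(context_resemblance(query, ("Book", "Author", "LastName", "Bill")))
print(context_resemblance(query, ("Article", "Bill")))
```

The score drops as more nodes have to be inserted, which matches the intuition that greater overlap should mean a higher match score. This particular function does not look at where in the path the overlap occurs.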
31 How do we use this in retrieval?
- First enumerate structural terms in the query
- Measure each for match against the dictionary of structural terms
  - Just like a postings lookup, except not Boolean (does the term exist?)
  - Instead, produce a score that says "80% close to this structural term", etc.
- Then, retrieve docs with that structural term, compute cosine similarities, etc.
32 Example of a retrieval step
Index (ST = Structural Term):
  ST1: Doc1 (0.7), Doc4 (0.3), Doc9 (0.2)
  ST5: Doc3 (1.0), Doc6 (0.8), Doc9 (0.6)
Now rank the Docs by cosine similarity, e.g., Doc9 scores 0.578.
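The retrieval step can be sketched as score accumulation over the postings of each matched structural term. The postings below are the ones on the slide. The query match scores (0.8 for ST1, 0.5 for ST5) are invented for illustration, and no length normalization is applied, so the numbers are not expected to reproduce the 0.578 quoted above.

```python
from collections import defaultdict

# Postings from the slide: structural term -> {doc: weight}.
index = {
    "ST1": {"Doc1": 0.7, "Doc4": 0.3, "Doc9": 0.2},
    "ST5": {"Doc3": 1.0, "Doc6": 0.8, "Doc9": 0.6},
}

# Hypothetical: how closely the query's structural terms match ST1 and ST5.
query_matches = {"ST1": 0.8, "ST5": 0.5}

def score(index, query_matches):
    """Accumulate match * weight per document and sort by score."""
    scores = defaultdict(float)
    for st, match in query_matches.items():
        for doc, weight in index.get(st, {}).items():
            scores[doc] += match * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = score(index, query_matches)
for doc, s in ranking:
    print(f"{doc}: {s:.3f}")
```

Doc9 appears in both postings lists, so it collects a contribution from each structural term. A full implementation would also divide by vector lengths to obtain a true cosine.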
33 Closing technicalities
- But what exactly is a Doc?
- In a sense, an entire corpus can be viewed as an XML document
  (Figure: Corpus → Doc1, Doc2, Doc3, Doc4)
34 What are the Docs in the index?
- Anything we are prepared to return as an answer
- Could be nodes, some of their children
35 What are queries we can't handle using vector spaces?
- Find figures that describe the Corba architecture and the paragraphs that refer to those figures
  - Requires JOIN between 2 tables
- Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
  - Depends on order of sibling nodes.
  - Requires <articles> to appear after a specific <sec> node.
  - Query tree cannot express relations that depend on node ordering.
  <journal><title></title> <sec1><title></title></sec1> <article></article> <article></article> </journal>
36 Can we do IDF?
- Yes, but it doesn't make sense to do it corpus-wide
- Can do it, for instance, within all text under a certain element name, say Chapter
- Yields a tf-idf weight for each lexicon term under an element
- Issue: how do we propagate contributions to higher-level nodes?
37 Example
- Say Gates has high IDF under the Author element
- How should it be tf-idf weighted for the Book element?
- Should we use the idf for Gates in Author or that in Book?
  (Figure: Book → Author → Bill, Gates)
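The question can be made concrete by computing idf separately per element name. This is only a sketch: the function name `context_idf` is hypothetical, idf is taken as ln(N / df) over elements with the given tag, and the four-book collection is invented for illustration.

```python
import math
import xml.etree.ElementTree as ET

def context_idf(docs, tag, term):
    """idf of `term`, counting only elements named `tag` as 'documents'."""
    elements = [el for d in docs for el in ET.fromstring(d).iter(tag)]
    # An element contains the term if it occurs anywhere in its text.
    df = sum(
        1 for el in elements
        if term in " ".join(el.itertext()).lower().split()
    )
    if df == 0:
        return None  # idf is undefined for a term that never occurs
    return math.log(len(elements) / df)

# Hypothetical mini-collection.
docs = [
    "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>",
    "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>",
    "<Book><Title>Garden Gates</Title><Author>Ann Smith</Author></Book>",
    "<Book><Title>Iron Gates</Title><Author>Tom Jones</Author></Book>",
]

idf_author = context_idf(docs, "Author", "gates")
idf_title = context_idf(docs, "Title", "gates")
idf_book = context_idf(docs, "Book", "gates")
print(idf_author, idf_title, idf_book)
```

In this toy collection "gates" is rare under Author, common under Title, and present in every Book, so its idf at the Book level is zero. That is exactly the propagation problem: the discriminating power seen at the Author level disappears if the Book-level statistic is used.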
38 INEX: a benchmark for text-centric XML retrieval
39 INEX
- Benchmark for the evaluation of XML retrieval
  - Analog of TREC (recall IIR Chapter 8)
- Consists of
  - Set of XML documents
  - Collection of retrieval tasks
40 INEX
- Each engine indexes docs
- Engine team converts retrieval tasks into queries
  - In XML query language understood by engine
- In response, the engine retrieves not docs, but elements within docs
  - Engine ranks retrieved elements
41 INEX assessment
- For each query, each retrieved element is human-assessed on two measures
  - Relevance: how relevant is the retrieved element?
  - Coverage: is the retrieved element too specific, too general, or just right?
    - E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general), or the definition itself?
- These assessments are turned into composite precision/recall measures
42 INEX corpus (Ad Hoc Track)
- Articles from IEEE Computer Society publications
- Wikipedia collection 2009
- others
43 INEX topics
- Each topic is an information need, one of two kinds
  - Content Only (CO): free text queries
  - Content and Structure (CAS): explicit structural constraints, e.g., containment conditions.
44 Sample INEX CO topic
- <Title> computational biology </Title>
- <Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding </Keywords>
- <Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology </Description>
- <Narrative> To be relevant, a document/component must either talk in general terms about the opportunities at the intersection of computer science and biology, or describe a particular problem and the ways it is being attacked. </Narrative>
45 INEX assessment
- Each engine formulates the topic as a query
  - E.g., use the keywords listed in the topic.
- Engine retrieves one or more elements and ranks them.
- Human evaluators assign to each retrieved element relevance and coverage scores.
46 Assessments
- Relevance assessed on a scale from Irrelevant (scoring 0) to Highly Relevant (scoring 3)
- Coverage assessed on a scale with four levels
  - No Coverage (N): the query topic does not match anything in the element
  - Too Large (L): the topic is only a minor theme of the element retrieved
  - Too Small (S): the element is too small to provide the information required
  - Exact Coverage (E)
- So every element returned by each engine has ratings from {0,1,2,3} × {N,S,L,E}
47 Combining the relevance/coverage assessments
48 The Q-values
- Scalar measure of the goodness of retrieved elements
- Can compute Q-values for varying numbers of retrieved elements: 10, 20, etc.
- Means for comparing engines.
49 From Q-values to ?
- INEX provides a method for turning these into precision-recall curves
- Standard issue: only elements returned by some participant engine are assessed
- Lots more commentary (and proceedings from previous INEX bakeoffs)
  - http://www.inex.otago.ac.nz
50 Resources
- Chapter 10 of IIR
- Resources at http://ifnlp.org/ir
- INEX: http://www.inex.otago.ac.nz/