XML Retrieval, with slides of C. Manning and H. Schütze

Title: L3S Overview - Visit in Sweden Author: Nejdl Created Date: 3/9/2000 9:55:31 AM Document presentation format: Custom Company: L3S

Provided by: Nejdl
1
XML Retrieval, with slides of C. Manning and H. Schütze
2
Outline
  • What is XML?
  • Challenges in XML retrieval
  • Vector space model for XML retrieval
  • Evaluation of text-centric XML retrieval

3
What is XML?
  • eXtensible Markup Language
  • A framework for defining markup languages
  • No fixed collection of markup tags
  • Each XML language targeted for application
  • All XML languages share features
  • Enables building of generic tools

4
XML Example
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
5
Basic Structure
  • An XML document is an ordered, labeled tree
  • Character data leaf nodes contain the actual data
    (text strings)
  • Element nodes are each labeled with
  • a name (often called the element type), and
  • a set of attributes, each consisting of a name
    and a value
  • Element nodes can have child nodes

6
Elements
  • Elements are denoted by markup tags
  • <foo attr1="value"> thetext </foo>
  • Element start tag: foo
  • Attribute: attr1
  • The character data: thetext
  • Matching element end tag: </foo>
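The parts of an element listed above can be inspected with a standard XML parser. The following is a minimal sketch using Python's built-in `xml.etree.ElementTree`; the attribute value `"value"` is quoted here because well-formed XML requires it.

```python
import xml.etree.ElementTree as ET

# Parse the one-element example from the slide.
elem = ET.fromstring('<foo attr1="value"> thetext </foo>')

print(elem.tag)     # the element name (element type)
print(elem.attrib)  # the attributes as a name -> value dictionary
print(elem.text)    # the character data between start and end tag
```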

7
Why Use XML?
  • Represent semi-structured data
  • data that are structured, but don't fit the
    relational model
  • XML is more flexible than DBs
  • XML is more structured than simple IR
  • You get a massive infrastructure for free

8
XML Schemas
  • Schema: syntax definition of an XML language
  • Schema language: formal language for expressing
    XML schemas
  • Examples:
  • Document Type Definition (DTD)
  • XML Schema (W3C)
  • Relevance for XML IR:
  • Our job is much easier if we have a (one) schema

9
Challenges in XML Retrieval
04/12/2008
10
Data vs. Text-centric XML
  • Data-centric XML: used for messaging between
    enterprise applications
  • Mainly a recasting of relational data
  • Numerical and non-text data dominate
  • Mostly stored in databases
  • Text-centric XML: used for annotating content
  • Rich in text
  • Demands good integration of text retrieval
    functionality
  • Queries are user information needs, e.g., "give me
    the Section (element) of the document that tells
    me how to change a brake light"

11
Text search in RDB
  • Highly structured text search problems are most
    efficiently handled by a relational database
  • Information need: find all employees who are
    involved with invoicing
  • SQL query:
  • select lastname from employees where job_desc
    like 'invoic%'
  • This satisfies the information need with high
    precision and recall.

[Table: employees, with columns id, lastname, job_desc, salary; the job_desc column holds text such as "invoicing"]
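The query above can be tried against an in-memory SQLite database. This is a minimal sketch: the table layout follows the column names on the slide, while the three rows are hypothetical and invented only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, lastname TEXT, job_desc TEXT, salary INTEGER)"
)
# Hypothetical rows, not taken from the slides.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Smith", "invoicing", 50000),
        (2, "Jones", "invoice processing", 52000),
        (3, "Brown", "marketing", 48000),
    ],
)

# The % wildcard matches any suffix, so 'invoic%' covers invoicing, invoice, ...
rows = conn.execute(
    "SELECT lastname FROM employees WHERE job_desc LIKE 'invoic%' ORDER BY id"
).fetchall()
lastnames = [row[0] for row in rows]
print(lastnames)
conn.close()
```

The prefix match works well here because job_desc is a short, well-defined field; the same approach degrades quickly on long free text.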

12
Structured Retrieval Data
  • Many structured data sources containing text are
    best modeled as structured documents rather than
    relational data.
  • Search over structured documents is called
    structured retrieval.
  • Queries in structured retrieval can be either
    structured or unstructured (here we assume that
    the collection consists only of structured
    documents).
  • Applications of structured retrieval include
    digital libraries, patent databases, text with
    tagged entities (e.g. persons and locations),
    application output as marked up text.

13
Structured Retrieval Query
  • Queries combine textual criteria with structural
    criteria.
  • Information need: articles about sightseeing
    tours of the Vatican and the Coliseum.
  • Usability issues of structured queries:
  • knowledge of the data structure
  • sightseeing AND (COUNTRY:Vatican OR
    LANDMARK:Coliseum)
  • sightseeing AND (STATE:Vatican OR
    BUILDING:Coliseum)
  • syntax of the query language (XQuery)
  • low recall in Boolean model

for a in doc()//au let
14
IR XML Challenges: Indexing Units and Term
Statistics
  • Indexing unit: there is no document unit in XML
  • Book? Chapter?
  • How do we compute tf and idf?
  • Global tf/idf over all text context is useless
  • Indexing granularity
  • Structured document retrieval principle: a system
    should always retrieve the most specific part of
    a document answering the query.
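The structured document retrieval principle can be illustrated with a small sketch. It assumes a simplification that is not in the slides: an element "answers" the query if its text (including descendants) contains every query term, and the function `most_specific` and the sample document are hypothetical. A real system would rank elements rather than use Boolean containment.

```python
import xml.etree.ElementTree as ET

def most_specific(elem, terms):
    """Return the deepest element whose text contains all query terms, or None."""
    tokens = " ".join(elem.itertext()).lower().split()
    if not all(term in tokens for term in terms):
        return None
    # Prefer a child that still contains every term; otherwise this element wins.
    for child in elem:
        hit = most_specific(child, terms)
        if hit is not None:
            return hit
    return elem

# Hypothetical document, invented for illustration.
doc = ET.fromstring(
    "<book><chapter><title>Brakes</title>"
    "<section>how to change a brake light</section>"
    "<section>checking brake fluid</section></chapter>"
    "<chapter><title>Engine</title></chapter></book>"
)
hit = most_specific(doc, ["brake", "light"])
print(hit.tag, "->", hit.text)
```

The whole book and the first chapter also contain both terms, but the section is returned because it is the most specific part that does.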

15
Partitioning an XML document into non-overlapping
indexing units
16
IR XML Challenges: Schemas
  • Schema heterogeneity
  • Many schemas
  • Schemas not known in advance
  • Schemas change
  • Users don't understand schemas
  • Need to identify similar elements in different
    schemas
  • Example: name, surname, family name

17
IR XML Challenges: UI
  • Help user find relevant nodes in schema
  • Author, editor, contributor, from/sender
  • What is the query language you expose to the
    user?
  • Specific XML query language? No.
  • Forms? Parametric search?
  • A textbox?
  • In general: design layer between XML and user

18
Vector Spaces and XML
19
Vector spaces and XML
  • Vector spaces: tried and tested framework for
    keyword retrieval
  • Other "bag of words" applications in text
    classification, clustering
  • For text-centric XML retrieval, can we make use
    of vector space ideas?
  • Challenge: capture the structure of an XML
    document in the vector space.

20
Vector spaces and XML
  • For instance, distinguish between the following
    two cases

[Figure: two Book trees. Book 1: Title "Microsoft", Author "Bill Gates". Book 2: Title "The Pearly Gates", Author "Bill Wulf".]
21
Content-rich XML representation
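[Figure: the same two Book trees with every word as its own leaf node: Microsoft under Title and Bill, Gates under Author; The, Pearly, Gates under Title and Bill, Wulf under Author. The leaves are the lexicon terms.]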
22
Encoding the Gates differently
  • What are the axes of the vector space?
  • In text retrieval, there would be a single axis
    for Gates
  • Here we must separate out the two occurrences,
    under Author and Title
  • Thus, axes must represent not only terms, but
    something about their position in an XML tree
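One simple way to make the axes position-aware is to key each axis by the enclosing element name together with the term. The sketch below assumes this (element name, term) encoding, which is only one possible choice, and assumes the two Book examples pair the title "Microsoft" with the author "Bill Gates" and the title "The Pearly Gates" with the author "Bill Wulf". The function name `structural_axes` is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def structural_axes(xml_text):
    """Count (element name, lowercased term) pairs for every element with text."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        for term in (elem.text or "").lower().split():
            counts[(elem.tag, term)] += 1
    return counts

book1 = "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>"
book2 = "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>"
v1 = structural_axes(book1)
v2 = structural_axes(book2)

# "gates" now lands on different axes in the two books.
print(v1[("Author", "gates")], v1[("Title", "gates")])
print(v2[("Author", "gates")], v2[("Title", "gates")])
```

With a single axis for "gates" the two books would look alike on that dimension; with the pair encoding they no longer overlap there.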

23
Queries
  • Before addressing this, let us consider the kinds
    of queries we want to handle

[Figure: example query trees over the elements Book, Title, and Author with the terms Bill, Gates, and Microsoft.]
24
Subtrees and structure
  • Consider all subtrees of the document that
    include at least one lexicon term

[Figure: a Book document with Title "Microsoft" and Author "Bill Gates", shown together with examples of its subtrees, e.g., the single leaves Bill, Microsoft, and Gates, Title over Microsoft, Author over Bill, and Author over Gates.]
25
Structural terms
  • Call each of the resulting (8, in the previous
    slide) subtrees a structural term
  • Note that structural terms might occur multiple
    times in a document
  • Create one axis in the vector space for each
    distinct structural term
  • Weights based on frequencies for number of
    occurrences (just as we had tf)
  • All the usual issues with terms (stemming? Case
    folding?) remain

26
Example of tf weighting
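[Figure: a Play element containing an Act with the text "To be or not to be", and the resulting structural terms, each a Play/Act path ending in one of the words to, be, or, not.]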
  • Here the structural terms containing "to" or "be"
    would have more weight than those that don't
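The tf weighting in this example can be sketched in a few lines. The code assumes a simplified notion of structural term, namely a root-to-leaf path plus one word, rather than arbitrary subtrees; the function name `path_term_tf` is hypothetical, and case folding is applied so that "To" and "to" share an axis.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def path_term_tf(xml_text):
    """Term frequencies of (path, term) structural terms, with case folding."""
    tf = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        for term in (elem.text or "").lower().split():
            tf[(path, term)] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return tf

tf = path_term_tf("<Play><Act>To be or not to be</Act></Play>")
for (path, term), count in sorted(tf.items()):
    print(path, term, count)
```

Under this simplification the structural terms ending in "to" and "be" get tf 2, and those ending in "or" and "not" get tf 1.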

Exercise: How many axes are there in this example?
27
Structural terms: docs and queries
  • The notion of structural terms is independent of
    any schema/DTD for the XML documents
  • Well-suited to a heterogeneous collection of XML
    documents
  • Each document becomes a vector in the space of
    structural terms
  • A query tree can likewise be factored into
    structural terms
  • And represented as a vector
  • Allows weighting portions of the query

28
The catch remains
  • This is all very promising, but
  • How big is this vector space?
  • Can be exponentially large in the size of the
    document
  • Cannot hope to build such an index
  • And in any case, still fails to answer queries
    like

[Figure: query tree with Book at the root and Gates somewhere underneath]
29
Descendants
[Figure: the query Author over Gates, compared with two documents: Book → Author → "Bill Gates" vs. Book → Author → (FirstName "Bill", LastName "Gates")]
No known DTD. Query seeks Gates under Author.
30
Handling descendants in the vector space
  • Devise a match function that yields a score in
    [0,1] between structural terms
  • E.g., when the structural terms are paths,
    measure overlap
  • The greater the overlap, the higher the match
    score
  • Can adjust match for where the overlap occurs

[Figure: the query path Book → Bill matched in Book → Author → Bill vs. Book → Author → LastName → Bill]
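A path-overlap match function of this kind can be sketched as follows. The slides do not fix a formula; the sketch borrows the context resemblance function described in Chapter 10 of IIR, which scores (1 + |query path|) / (1 + |document path|) when the query path can be turned into the document path by inserting nodes, and 0 otherwise. Treating the term as the last path node is an assumption made here for brevity.

```python
def is_subsequence(query_path, doc_path):
    """True if query_path can be turned into doc_path by inserting extra nodes."""
    remaining = iter(doc_path)
    # "node in remaining" advances the iterator, so order is respected.
    return all(node in remaining for node in query_path)

def context_resemblance(query_path, doc_path):
    """Match score in [0, 1]; shorter detours between nodes score higher."""
    if not is_subsequence(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))

query = ["Book", "Bill"]
print(context_resemblance(query, ["Book", "Bill"]))
print(context_resemblance(query, ["Book", "Author", "Bill"]))
print(context_resemblance(query, ["Book", "Author", "LastName", "Bill"]))
print(context_resemblance(["Author", "Bill"], ["Book", "Title", "Bill"]))
```

An exact match scores 1.0, one inserted node gives 0.75, two give 0.6, and a path that lacks a query node scores 0.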
31
How do we use this in retrieval?
  • First enumerate structural terms in the query
  • Measure each for match against the dictionary of
    structural terms
  • Just like a postings lookup, except not Boolean
    (does the term exist?)
  • Instead, produce a score that says "80% close to
    this structural term", etc.
  • Then, retrieve docs with that structural term,
    compute cosine similarities, etc.

32
Example of a retrieval step
Index
ST1: Doc1 (0.7), Doc4 (0.3), Doc9 (0.2)
ST5: Doc3 (1.0), Doc6 (0.8), Doc9 (0.6)
ST = Structural Term
Now rank the Docs by cosine similarity; e.g.,
Doc9 scores 0.578.
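The retrieval step can be sketched as score accumulation over the postings of every dictionary structural term that matches the query. The postings below are the ones shown in the index; the match scores 0.6 and 0.8 are hypothetical, because the values used on the slide did not survive in the transcript, so this sketch does not reproduce the 0.578 figure. It also assumes the posting weights are already length-normalized and that the query consists of a single structural term with weight 1.

```python
from collections import defaultdict

# Postings from the slide: structural term -> {document: weight}.
index = {
    "ST1": {"Doc1": 0.7, "Doc4": 0.3, "Doc9": 0.2},
    "ST5": {"Doc3": 1.0, "Doc6": 0.8, "Doc9": 0.6},
}

# Hypothetical match scores between the query's structural term and ST1 / ST5.
match = {"ST1": 0.6, "ST5": 0.8}

scores = defaultdict(float)
for st, match_score in match.items():
    for doc, weight in index[st].items():
        scores[doc] += match_score * weight

ranking = sorted(scores, key=scores.get, reverse=True)
for doc in ranking:
    print(doc, round(scores[doc], 3))
```

Doc9 appears in both postings lists, so it collects a contribution from each matched structural term.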
33
Closing technicalities
  • But what exactly is a Doc?
  • In a sense, an entire corpus can be viewed as an
    XML document

[Figure: a Corpus root node with the children Doc1, Doc2, Doc3, Doc4]
34
What are the Docs in the index?
  • Anything we are prepared to return as an answer
  • Could be nodes, some of their children

35
What are queries we can't handle using vector
spaces?
  • Find figures that describe the Corba architecture
    and the paragraphs that refer to those figures
  • Requires JOIN between 2 tables
  • Retrieve the titles of articles published in the
    Special Feature section of the journal IEEE Micro
  • Depends on order of sibling nodes.
  • Requires <articles> to appear after a specific
    <sec> node.
  • Query tree cannot express relations that depend
    on node ordering.

<journal><title></title> <sec1><title></title></sec1>
<article></article> <article></article>
</journal>
36
Can we do IDF?
  • Yes, but it doesn't make sense to do it corpus-wide
  • Can do it, for instance, within all text under a
    certain element name, say Chapter
  • Yields a tf-idf weight for each lexicon term
    under an element
  • Issue: how do we propagate contributions to
    higher-level nodes?
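Element-scoped IDF can be sketched by counting document frequency per element name instead of per corpus. The function `idf_by_element` and the three sample books are hypothetical; the sketch uses log10(N / df), where N is the number of elements with a given name and df is the number of those elements whose text (including descendants) contains the term.

```python
import math
from collections import defaultdict
import xml.etree.ElementTree as ET

def idf_by_element(docs):
    """Map (element name, term) to log10(N_tag / df) within that element name."""
    n_elems = defaultdict(int)
    df = defaultdict(int)
    for xml_text in docs:
        for elem in ET.fromstring(xml_text).iter():
            n_elems[elem.tag] += 1
            # itertext() includes descendant text, so Book sees its whole content.
            for term in set(" ".join(elem.itertext()).lower().split()):
                df[(elem.tag, term)] += 1
    return {key: math.log10(n_elems[key[0]] / count) for key, count in df.items()}

# Hypothetical mini-collection, invented for illustration.
docs = [
    "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>",
    "<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>",
    "<Book><Title>Compilers</Title><Author>Bill Wulf</Author></Book>",
]
idf = idf_by_element(docs)
print(idf[("Author", "gates")], idf[("Book", "gates")])
```

In this toy collection "gates" is rarer under Author than under Book, so the two scopes give different IDF values, which is exactly the propagation question raised above.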

37
Example
  • Say Gates has high IDF under the Author element
  • How should it be tf-idf weighted for the Book
    element?
  • Should we use the idf for Gates in Author or that
    in Book?

[Figure: Book → Author → "Bill Gates"]
38
INEX: a benchmark for text-centric XML retrieval
39
INEX
  • Benchmark for the evaluation of XML retrieval
  • Analog of TREC (recall IIR 8)
  • Consists of:
  • Set of XML documents
  • Collection of retrieval tasks

40
INEX
  • Each engine indexes docs
  • Engine team converts retrieval tasks into queries
  • In XML query language understood by engine
  • In response, the engine retrieves not docs, but
    elements within docs
  • Engine ranks retrieved elements

41
INEX assessment
  • For each query, each retrieved element is
    human-assessed on two measures:
  • Relevance: how relevant is the retrieved element?
  • Coverage: is the retrieved element too specific,
    too general, or just right?
  • E.g., if the query seeks a definition of the Fast
    Fourier Transform, do I get the equation (too
    specific), the chapter containing the definition
    (too general) or the definition itself?
  • These assessments are turned into composite
    precision/recall measures

42
INEX corpus (Ad Hoc Track)
  • Articles from IEEE Computer Society publications
  • Wikipedia collection 2009
  • others

43
INEX topics
  • Each topic is an information need, one of two
    kinds:
  • Content Only (CO): free text queries
  • Content and Structure (CAS): explicit structural
    constraints, e.g., containment conditions.

44
Sample INEX CO topic
  • <Title> computational biology </Title>
  • <Keywords> computational biology, bioinformatics,
    genome, genomics, proteomics, sequencing, protein
    folding </Keywords>
  • <Description> Challenges that arise, and
    approaches being explored, in the
    interdisciplinary field of computational
    biology </Description>
  • <Narrative> To be relevant, a document/component
    must either talk in general terms about the
    opportunities at the intersection of computer
    science and biology, or describe a particular
    problem and the ways it is being attacked.
    </Narrative>

45
INEX assessment
  • Each engine formulates the topic as a query
  • E.g., use the keywords listed in the topic.
  • Engine retrieves one or more elements and ranks
    them.
  • Human evaluators assign to each retrieved element
    relevance and coverage scores.

46
Assessments
  • Relevance assessed on a scale from Irrelevant
    (scoring 0) to Highly Relevant (scoring 3)
  • Coverage assessed on a scale with four levels:
  • No Coverage (N): the query topic does not match
    anything in the element
  • Too Large (L): the topic is only a minor theme of
    the element retrieved
  • Too Small (S): the element is too small to provide
    the information required
  • Exact Coverage (E)
  • So every element returned by each engine has
    ratings from {0,1,2,3} × {N,S,L,E}

47
Combining the relevance/coverage assessments
  • Define scores
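The score definition from this slide is not preserved in the transcript. As a stand-in, the sketch below assumes a strict quantization, in which an element counts as 1 only if it is rated highly relevant (3) with exact coverage (E) and as 0 otherwise; graded schemes that give partial credit to other combinations are also possible. The function names and the sample assessments are hypothetical.

```python
def strict_score(relevance, coverage):
    """1.0 only for highly relevant (3) elements with exact coverage (E), else 0.0."""
    return 1.0 if (relevance, coverage) == (3, "E") else 0.0

def score_at_k(assessments, k):
    """Sum of scores over the top-k retrieved elements (assessments in rank order)."""
    return sum(strict_score(rel, cov) for rel, cov in assessments[:k])

# Hypothetical (relevance, coverage) ratings for one engine's ranked result list.
ranked = [(3, "E"), (2, "L"), (3, "E"), (0, "N"), (1, "S")]
print(score_at_k(ranked, 1), score_at_k(ranked, 3), score_at_k(ranked, 5))
```

Computing the sum at several cutoffs gives one number per cutoff that can be compared across engines.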

48
The Q-values
  • Scalar measure of goodness of a set of retrieved
    elements
  • Can compute Q-values for varying numbers of
    retrieved elements: 10, 20, etc.
  • Means for comparing engines.

49
From Q-values to ?
  • INEX provides a method for turning these into
    precision-recall curves
  • Standard issue: only elements returned by some
    participant engine are assessed
  • Lots more commentary (and proceedings from
    previous INEX bakeoffs):
  • http://www.inex.otago.ac.nz

50
Resources
  • Chapter 10 of IIR
  • Resources at http://ifnlp.org/ir
  • INEX: http://www.inex.otago.ac.nz/