1
Integrating Keyword Search into XML Query Processing
  • XML Query Language (XML-QL)
  • Extending XML-QL with Keyword Search
  • Extended XML-QL Implementation Using RDBMS
  • Presentation by Alex Kremer and Ariel Rosenblatt

2
Bibliography (well-formed, but invalid)
  • Bibliography
  • Article elements are from different sources
  • Same information, but using different XML Schemas
    / DTDs (Document Type Definitions)

3
XML Queries
  • XML is becoming the Data Storage and Exchange
    Format of choice in many applications
  • Handling of XML data requires a rich and powerful
    Query Language
  • Allow for querying the content and structure of
    an XML document
  • Varying or unknown structures can make
    formulating queries very difficult

4
XML Queries: Why not SQL/OQL?
  • XML is not rigidly structured
  • In XML the schema can exist with the data as tag
    names
  • If a DTD is not available, the schema is built
    while the document is parsed
  • Missing elements or multiple occurrences of the
    same element
  • This flexibility is crucial for EDI (Electronic
    Data Interchange)

5
XML Query Requirements: W3C Working Group
  • Goals
  • Support different usage scenarios
  • Define the data model and query operators
  • Define query language syntax
  • Interoperate with other XML working groups

6
XML Query Requirements: Usage Scenarios
  • Human-readable documents
  • Manuals, books, articles
  • Data-oriented documents
  • XML representation of database data, object data, ...
  • The XML representation might be either physical
    or virtual

7
XML Query Requirements: Usage Scenarios (contd.)
  • Mixed-model documents
  • Hybrid of document-oriented and data-oriented
  • Catalogues, patient health records, ...
  • Administrative data
  • Configuration files, user profiles,
    administrative logs

8
XML Query Requirements: Usage Scenarios (contd.)
  • Filtering streams
  • On-line filtering / extracting / transforming /
    routing, of XML data streams
  • Logs of email messages, Network packets, Stock
    market data, Newswire feeds
  • Document Object Model (DOM)
  • Perform queries on DOM structures to return sets
    of nodes that meet the specified criteria

9
XML Query Requirements: Usage Scenarios (contd.)
  • Multiple syntactic environments for queries
  • Queries may be embedded in a URL, XML, JSP or ASP
    pages, or a string in a general-purpose
    programming language

10
XML Query Requirements: Interoperability
  • Results must be returned in a DOM compatible
    manner
  • XPath (used in XPointer and XSLT)
  • XPath expressibility and search facilities should
    be used in query syntax
  • Usage of XML Schema (XSDL) and/or DTD

11
XML Query Languages: Proposals to W3C
  • XQL (heavily based on XPath)
  • XML-QL

12
XML-QL
  • It is declarative
  • It is relationally complete; in particular, it
    can express joins
  • Simple enough to enable optimizations
  • It can extract data from existing XML documents
    and construct new documents (transformations)

13
XML-QL Syntax
    WHERE     ( xml-pattern ELEMENT_AS elem_var ) IN url, ( predicate )
    CONSTRUCT xml-pattern variable
  • WHERE clause specifies how to filter data from
    the input XML dataset
  • CONSTRUCT clause specifies how to assemble the
    query results in XML

14
XML-QL Example 1
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like "Florescu"
    CONSTRUCT <result> $E </result>
  • Yields the following result

15
XML-QL Explained: The Data Model
  • A set of XML documents must be represented (XML
    data set)
  • XML elements in a data set can be partitioned
    according to their types
  • Need to represent information in a lossless
    manner (the original data set must be recreatable
    from the representation)

16
XML-QL Explained: Data Model Representation
[Figure: graph representation of bibliography.xml. The root node ID00 (Bibliography) has four article edges leading to nodes ID01, ID04, ID08 and ID14. Further edges are labeled id, link, title, date, author, name and source; leaves hold data values such as "Daniela Florescu", "W3C" and 20000815.]

17
XML-QL Explained: Data Model Representation
  • Dataset D is represented as a graph G_D
  • Nodes
  • Element e → node N_e, uniquely labeled ID_e
  • Data value v → leaf L_v, uniquely labeled v
  • Edges
  • (N_e, N_e') labeled with the tag of e', if e' is
    directly nested within e (<e><e'></e'></e>)
  • (N_e, L_v) labeled with […], if v is directly
    contained within e (<e>v</e>)
  • (N_e, L_v) labeled with attribute name a, if v is
    the value of attribute a of element e (<e
    a="v"></e>)
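
The node and edge rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document, the ID numbering scheme, and the `#content` label used for element-to-text edges are all hypothetical and are not taken from the slides.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample data, loosely modeled on the bibliography example.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""

CONTENT = "#content"  # assumed label for edges from an element to its text


def build_graph(xml_text):
    """Return (root_id, edges); each edge is (source_id, label, target)."""
    ids = count()
    edges = []

    def visit(elem):
        node_id = f"ID{next(ids):02d}"            # one node per element
        for attr, value in elem.attrib.items():
            edges.append((node_id, attr, value))  # attribute edge to a leaf
        text = (elem.text or "").strip()
        if text:
            edges.append((node_id, CONTENT, text))  # content edge to a leaf
        for child in elem:
            child_id = visit(child)
            edges.append((node_id, child.tag, child_id))  # nesting edge
        return node_id

    return visit(ET.fromstring(xml_text)), edges


root_id, edges = build_graph(SAMPLE)
for edge in edges:
    print(edge)
```

Leaves are identified by their value, so the two occurrences of the same author name end up pointing at the same leaf, as the "uniquely labeled v" rule requires.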

18
XML-QL Explained: Query Processing
  • An XML pattern can also be modeled by a graph
  • Some labels in the graph are now variables
  • The result of evaluating query q on the input D
    is the set of all mappings from the graph G_q to
    the graph G_D which preserve the constant labels
  • Each mapping induces a substitution of the
    variables in the query by constant values
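
A small sketch of this idea, assuming a tree-shaped pattern and a hypothetical sample document: the pattern is matched against each article, and every successful mapping yields a substitution for the variables. The nested-tuple pattern encoding, the function names, and the substring test standing in for `like` are illustrative choices, not the authors' algorithm.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""

# Pattern node: (tag, body). The body is a list of sub-patterns,
# a "$variable", or a constant string.
PATTERN = ("article", [("author", [("name", "$N")]), ("title", "$T")])


def match(elem, pattern, binding):
    """Yield every extension of `binding` that maps `pattern` onto `elem`."""
    tag, body = pattern
    if elem.tag != tag:              # constant labels must be preserved
        return
    if isinstance(body, str):
        text = (elem.text or "").strip()
        if body.startswith("$"):     # variable: bind it, or check an old binding
            if binding.get(body, text) == text:
                yield {**binding, body: text}
        elif body == text:           # constant: must be equal
            yield binding
        return

    def extend(i, current):
        """Match sub-pattern i onwards against the children of elem."""
        if i == len(body):
            yield current
            return
        for child in elem:
            for extended in match(child, body[i], current):
                yield from extend(i + 1, extended)

    yield from extend(0, binding)


def evaluate(xml_text):
    root = ET.fromstring(xml_text)
    found = []
    for article in root.iter("article"):
        for binding in match(article, PATTERN, {}):
            if "Florescu" in binding["$N"]:   # stands in for the like predicate
                found.append((article.get("id"), binding))
    return found


results = evaluate(SAMPLE)
print(results)
```

The second article is not returned because its author name is stored as an attribute, so the `name` sub-pattern has nothing to map onto.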

19
XML-QL Explained: A Query Graph for Example 1
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like "Florescu"
    CONSTRUCT <result> $E </result>
[Figure: query graph for Example 1. The article node has an author edge followed by a name edge ending in the leaf Florescu, and a title edge ending in the variable T.]

20
XML-QL Explained: Query Processing, Example 1
[Figure: the query graph of Example 1 matched against the data graph of bibliography.xml. Annotations on the data graph: "No <author>"; "No <name>: name is an attribute"; "Match! Add ID08 to results: E = ID08, T = Integrating Keyword Search".]

21
XML-QL Advanced Queries: Example 2 (More Florescu)
    WHERE <article>
            <*><author><name>$N</name></author></*>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like "Florescu"
    CONSTRUCT <result> $E </result>
    union
    WHERE <article>
            <*><author><_ name=$N></_></author></*>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like "Florescu"
    CONSTRUCT <result> $E </result>
  • We now also look for articles where the author
    name is an attribute

22
XML-QL Disadvantages
  • We need to know the XML structure in order to
    query
  • We can still formulate queries that retrieve all
    the information available, but
  • These queries can easily grow very complex, as
    seen previously

23
XML-QL Keyword Search Extension
  • Addition of a special predicate called contains
    to XML-QL
  • Tests the existence of a given word within an XML
    element
  • Works on partially known or unknown XML
    structure
  • Allows querying several XML documents with
    different structure

24
Extended XML-QL: The contains Predicate
  • The contains predicate has 4 arguments: (E,
    word, depth, location)
  • E is an XML element variable
  • word is the word we are searching for
  • depth is an integer expression limiting the depth
    at which the word is found within the element
  • location is a boolean expression over the set of
    constants tag_name, attribute_name, content,
    attribute_value
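
To make the four arguments concrete, here is a minimal in-memory sketch of such a predicate. It assumes a particular depth convention (the element's own tag is at depth 0, its attribute names and text at depth 1, its attribute values at depth 2, and each nesting level adds 1), treats `location` as a set of allowed constants rather than a boolean expression, and matches whole whitespace-separated words only. The sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""


def contains(elem, word, depth, locations):
    """True if `word` occurs within `depth` levels of `elem` at an allowed location."""
    if depth < 0:
        return False
    if "tag_name" in locations and elem.tag == word:
        return True
    if depth >= 1:
        if "attribute_name" in locations and word in elem.attrib:
            return True
        # Text directly inside elem: its leading text plus the tails of its children.
        texts = [elem.text or ""] + [child.tail or "" for child in elem]
        if "content" in locations and any(word in t.split() for t in texts):
            return True
    if depth >= 2 and "attribute_value" in locations:
        if any(word in value.split() for value in elem.attrib.values()):
            return True
    # Otherwise look one level deeper with a smaller depth budget.
    return any(contains(child, word, depth - 1, locations) for child in elem)


first, second = list(ET.fromstring(SAMPLE))
where = {"content", "attribute_value"}
print(contains(first, "Florescu", 3, where), contains(second, "Florescu", 3, where))
```

With `where` allowing both content and attribute values, both articles qualify at depth 3, regardless of whether the author name is an element or an attribute.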

25
Extended XML-QL: Example 3
  • We can use the extended XML-QL to formulate a
    query which yields the same result as Example 2

    WHERE <article>
            <author></author> ELEMENT_AS $A
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          contains($A, "Florescu", 3, content or attribute_value)
    CONSTRUCT <result> $E </result>
26
Extended XML-QL: Example 4
  • We are able to query unstructured data (full-text
    search) within a set of articles

    WHERE <article></article> ELEMENT_AS $E IN "bibliography.xml",
          contains($E, "Florescu", 3, any)
    CONSTRUCT <result> $E </result>
Yielding the result
27
Implementing the contains predicate
  • The authors suggest an implementation of the
    XML-QL extension on top of a commercial RDBMS
  • For example Oracle 8, IBM DB2, MS-SQL, ...

28
Implementation Using RDBMS
  • Reasons
  • Easy to implement an extended XML query processor
  • Universally available
  • An RDBMS allows mixing XML data with other
    (relational) data
  • Very good performance over large volumes of data

29
Relational Support for Full-text Indexing
  • Use of extended Inverted Files to implement
  • The contains predicate
  • Finding of relevant XML data sources (URLs) in a
    distributed environment
  • We will use RDBMS to implement Inverted Files

30
Inverting Files
  • For our needs the inverted file will contain
    tuples of the following format
  • <word, elID, depth, location>
  • Examples from bibliography.xml
  • <article, elID01, 0, tag>
  • <id, elID01, 1, attr>
  • <Requirements, elID01, 2, value>
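
The tuple format can be illustrated by generating such tuples from a tiny document. This is a sketch under assumptions: the sample document and its title text are hypothetical, element IDs are assigned in document order, and the depth convention (own tag at 0, attribute names and text at 1, attribute values at 2, plus 1 per nesting level) is inferred from the three example tuples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-article document.
SAMPLE = """<bibliography>
  <article id="1">
    <title>Requirements for XML queries</title>
  </article>
</bibliography>"""


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for every word at or below `elem`."""
    yield elem.tag, depth, "tag"
    for attr, value in elem.attrib.items():
        yield attr, depth + 1, "attr"
        for word in value.split():
            yield word, depth + 2, "value"
    # Text directly inside elem: leading text plus the tails of its children.
    texts = [elem.text or ""] + [child.tail or "" for child in elem]
    for text in texts:
        for word in text.split():
            yield word, depth + 1, "value"
    for child in elem:
        yield from occurrences(child, depth + 1)


def inverted_file(xml_text):
    """Return a list of (word, elID, depth, location) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for n, elem in enumerate(root.iter()):   # document order
        el_id = f"elID{n:02d}"
        rows.extend((word, el_id, d, loc) for word, d, loc in occurrences(elem))
    return rows


rows = inverted_file(SAMPLE)
for row in rows:
    print(row)
```

Each word is recorded once per enclosing element, with the depth measured relative to that element, which is what lets a depth-limited lookup be answered from the index alone.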

31
Storing Inverted Files in RDBMS: Unique Internal elIDs
  • Unique element IDs are modeled as records
    containing
  • Document locators (URLs)
  • Element locators within the document
  • Using absolute positions (start, end)
  • Using unique identifiers specified by DTD
    (explicit id attribute)
  • Why not XPointer?

32
Storing Inverted Files in RDBMS: Unique elID Schema
  • After normalization the authors propose the
    following schema
  • Elements(elID, docid, start_pos, end_pos, type,
    id_val)
  • Documents(docid, URL)
  • From this point on, elID can be used as an
    internal key for faster processing
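
As a rough illustration, the two relations could be declared as follows, here using SQLite through Python's sqlite3 module. The column types, key constraints, and sample rows are assumptions; the slides only give the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    docid INTEGER PRIMARY KEY,
    URL   TEXT NOT NULL UNIQUE          -- document locator
);
CREATE TABLE Elements (
    elID      INTEGER PRIMARY KEY,      -- internal key used by the index
    docid     INTEGER NOT NULL REFERENCES Documents(docid),
    start_pos INTEGER,                  -- absolute position of the element
    end_pos   INTEGER,
    type      TEXT NOT NULL,            -- tag name
    id_val    TEXT                      -- explicit id attribute, if any
);
""")

# Hypothetical rows, for illustration only.
conn.execute("INSERT INTO Documents VALUES (1, 'bibliography.xml')")
conn.execute("INSERT INTO Elements VALUES (1, 1, 15, 120, 'article', '1')")

# Resolve an internal elID back to the document that holds the element.
(url,) = conn.execute(
    "SELECT d.URL FROM Elements e JOIN Documents d ON d.docid = e.docid "
    "WHERE e.elID = ?", (1,)
).fetchone()
print(url)
```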

33
Storing Inverted Files in RDBMS
  • The natural way is to use the schema
  • contains(elID, word, depth, location)
  • Huge! We partition it into word tables, one for
    each keyword <word> in the dataset
  • <word>(elID, depth, location)
  • Virtually all IR (Information Retrieval) systems
    use partitioning by word

34
Storing Inverted Files in RDBMS: Further Partitioning
  • We use further partitioning to optimize the query
    processing
  • The type (tag) of the element is usually known at
    predicate evaluation time, by looking at the XML
    pattern of the query
  • We further partition the individual <word> tables
    by the type of the element they are in
  • <word>-<type>(elID, depth, location)
  • Table examples: Name-author, Florescu-name
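
The partitioning step can be sketched with plain dictionaries standing in for the per-word, per-type tables. The input rows, which carry the element type alongside the inverted-file fields, are hypothetical.

```python
from collections import defaultdict

# Hypothetical inverted-file rows: (word, elID, element type, depth, location).
ROWS = [
    ("Florescu", "elID03", "name", 1, "value"),
    ("Florescu", "elID02", "author", 2, "value"),
    ("name", "elID02", "author", 1, "tag"),
    ("Florescu", "elID06", "author", 2, "value"),
]


def partition(rows):
    """Group rows into tables named <word>-<type> holding (elID, depth, location)."""
    tables = defaultdict(list)
    for word, el_id, elem_type, depth, location in rows:
        tables[f"{word}-{elem_type}"].append((el_id, depth, location))
    return dict(tables)


tables = partition(ROWS)
for table_name, contents in tables.items():
    print(table_name, contents)
```

A query whose pattern already fixes the element type, such as a contains predicate over an author element, then only has to read the one small table for that word and type.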

35
Implementation: Extended XML-QL Query Processing
  • Two Ways
  • Replicating the whole XML data in an RDBMS
  • XML-QL processing is entirely performed in an
    RDBMS
  • Distributed XML Query Processing
  • only index (contains) is stored in an RDBMS

36
Replicating the XML Data in an RDBMS
  • The binary table approach
  • For each type (tag name or attribute name), a
    table is built with the following schema
  • <type>(parent, element, value)
  • The parent element contains the element of type
    <type>
  • element is null if a <type> has no sub-elements
    or if <type> is an attribute name (in that case
    we are usually interested in the value)
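
A minimal sketch of shredding a document into such binary tables, with Python dictionaries standing in for the relations. The sample document and the ID numbering are hypothetical, and the sketch makes one assumption beyond the slide: an element that has attributes but no child elements still gets a non-null element ID, so that its attribute rows can be joined to it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical sample data.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""


def shred(xml_text):
    """Return {type: [(parent, element, value), ...]} for one document."""
    ids = count()
    tables = defaultdict(list)

    def visit(elem, parent_id):
        elem_id = next(ids)
        is_leaf = len(elem) == 0 and not elem.attrib
        text = (elem.text or "").strip() or None
        # Leaves carry only a value; inner elements carry only an ID.
        tables[elem.tag].append((parent_id, None if is_leaf else elem_id, text))
        for attr, value in elem.attrib.items():
            tables[attr].append((elem_id, None, value))   # attribute row
        for child in elem:
            visit(child, elem_id)

    visit(ET.fromstring(xml_text), None)
    return dict(tables)


tables = shred(SAMPLE)
for type_name, contents in tables.items():
    print(type_name, contents)
```

Note that the `name` table ends up holding both the `name` element and the `name` attribute, since one table is built per tag name or attribute name.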

37
Replicating the XML Data in an RDBMS: XML-QL Queries
  • Every XML-QL query can be translated into an
    equivalent SQL query
  • The SQL query will process the binary tables of
    the replicated XML Data

38
XML-QL to SQL: Example 5 (from Example 1)
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like "Florescu"
    CONSTRUCT <result> $E </result>

    SELECT article.element
    FROM article, author, name, title
    WHERE article.element = author.parent
      AND author.element = name.parent
      AND article.element = title.parent  /* title exists */
      AND name.value like 'Florescu'
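
The translated query can be tried out against a toy set of binary tables, here in SQLite via Python's sqlite3 module. The table contents are hypothetical, and the `%` wildcards in the LIKE pattern are an assumption made so that the pattern matches a full author name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("article", "author", "name", "title"):
    conn.execute(f"CREATE TABLE {table} (parent INTEGER, element INTEGER, value TEXT)")

# Hypothetical data: article 1 is by Florescu, article 5 is not.
conn.executemany("INSERT INTO article VALUES (?, ?, ?)", [(0, 1, None), (0, 5, None)])
conn.executemany("INSERT INTO author VALUES (?, ?, ?)", [(1, 2, None), (5, 6, None)])
conn.executemany("INSERT INTO name VALUES (?, ?, ?)",
                 [(2, None, "Daniela Florescu"), (6, None, "A. N. Other")])
conn.executemany("INSERT INTO title VALUES (?, ?, ?)",
                 [(1, None, "Integrating Keyword Search"), (5, None, "Another Title")])

QUERY = """
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent   /* title exists */
  AND name.value LIKE '%Florescu%'
"""
matches = [row[0] for row in conn.execute(QUERY)]
print(matches)
```

Each nesting step in the XML pattern becomes one equi-join between a parent table's `element` column and a child table's `parent` column.
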
39
Extended XML-QL to SQL: Keyword Search
  • Processing the contains predicate involves usage
    of inverted file tables
  • The word-type table has to be joined with the
    previous result
  • The word-type table is the resulting table of the
    word by type partitioning

40
Extended XML-QL to SQL: Example 6
    WHERE <article>
            <author></author> ELEMENT_AS $A
            <title>$Ttext</title> ELEMENT_AS $T
          </article> ELEMENT_AS $E IN "bibliography.xml",
          contains($A, "Florescu", 3, any),
          contains($T, "Integrating", 3, any)
    CONSTRUCT <result> $Ttext </result>

    SELECT title.value
    FROM article, author, title,
         "Florescu-author", "Integrating-title"
    WHERE article.element = author.parent
      AND author.element = "Florescu-author".elID
      AND article.element = title.parent
      AND title.element = "Integrating-title".elID
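
A runnable sketch of this join, again using SQLite through Python's sqlite3 module. All rows are hypothetical. Two assumptions are made so that the query runs as intended: the hyphenated word-type table names are quoted, and the title rows carry a non-null element ID so that they can be joined with the word-type table. As on the slide, the depth limit of the contains predicate is not checked in the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("article", "author", "title"):
    conn.execute(f"CREATE TABLE {table} (parent INTEGER, element INTEGER, value TEXT)")
for table in ("Florescu-author", "Integrating-title"):
    conn.execute(f'CREATE TABLE "{table}" (elID INTEGER, depth INTEGER, location TEXT)')

# Hypothetical binary tables for two articles.
conn.executemany("INSERT INTO article VALUES (?, ?, ?)", [(0, 1, None), (0, 5, None)])
conn.executemany("INSERT INTO author VALUES (?, ?, ?)", [(1, 2, None), (5, 6, None)])
conn.executemany("INSERT INTO title VALUES (?, ?, ?)",
                 [(1, 4, "Integrating Keyword Search"), (5, 7, "Another Title")])

# Hypothetical word-type tables: only author 2 and title 4 contain the keywords.
conn.execute("""INSERT INTO "Florescu-author" VALUES (2, 2, 'value')""")
conn.execute("""INSERT INTO "Integrating-title" VALUES (4, 1, 'value')""")

QUERY = """
SELECT title.value
FROM article, author, title, "Florescu-author", "Integrating-title"
WHERE article.element = author.parent
  AND author.element = "Florescu-author".elID
  AND article.element = title.parent
  AND title.element = "Integrating-title".elID
"""
titles = [row[0] for row in conn.execute(QUERY)]
print(titles)
```
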
41
Distributed XML Query Processing
  • XML data can be indexed in an RDBMS, but
  • The XML data itself cannot be stored in the RDBMS
  • Reasons: volume (the entire WWW) or legal
    constraints
  • The mediator (query interface)
  • Uses inverted files in the RDBMS, but
  • Accesses the data sources to compute the full
    query result (expensive!)
  • Loads relevant documents/elements into the RDBMS
    and processes the query as described before
    (XML-QL to SQL)

42
Distributed XML Query Processing: Element Retrieval
  • Use of Inverted Files for the retrieval of
    relevant documents/elements
  • Evaluate contains predicates to disqualify
    irrelevant elements
  • Further reduce the dataset needed to process the
    remaining basic XML-QL query
  • This is an optimization since retrieval of remote
    data is expensive
  • Load the relevant documents/elements

43
Distributed XML Query Processing: Reducing Retrieval
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $T like "XML"
    CONSTRUCT <result> $N </result>
  • Get the intersection of the elID sets from the
    word-type tables
  • author-article
  • name-article
  • title-article
  • XML-article
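
The pruning step amounts to a set intersection over the four word-type tables. In this sketch the table contents are hypothetical and each table is reduced to its set of elIDs.

```python
# Hypothetical elID sets taken from the four word-type tables.
WORD_TYPE_TABLES = {
    "author-article": {1, 5, 9},
    "name-article": {1, 5},
    "title-article": {1, 5, 9},
    "XML-article": {5, 9},
}


def candidate_elements(tables, table_names):
    """Return the elIDs that appear in every one of the named tables."""
    return set.intersection(*(tables[name] for name in table_names))


candidates = candidate_elements(
    WORD_TYPE_TABLES,
    ["author-article", "name-article", "title-article", "XML-article"],
)
print(candidates)
```

Only the articles in the resulting set need to be fetched from their remote sources; every other article lacks at least one tag or keyword the query requires.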

44
Conclusions
  • XML-QL can be extended to support keyword search
  • Use of RDBMS
  • Inverted Files can be stored and queried using an
    RDBMS
  • XML data itself can be replicated and queried in
    the RDBMS
  • Keyword search and overall XML query processing
    can be carried out very efficiently
  • Data structure influence
  • The more structure is known, the faster a query
    will be executed
  • Totally unstructured queries can be executed very
    fast
  • The more structure is known, the higher the
    quality of the query results