Overview of XML Data Management Research at Cornell

Slides: 53

Provided by: jayavelsha

Transcript and Presenter's Notes

1
Overview of XML Data Management Research at
Cornell

Jayavel Shanmugasundaram
Cornell University

2
Why XML?

Internet data exchange
XML emerging as dominant standard for data
interactions over the Internet (e.g., SOAP)
Consequently, web application developers deal
with XML (e.g., WebSphere)
Captures structured and unstructured data
Content management
Semi-structured data

3
Outline

XML for data exchange
XML for structured and unstructured data
Conclusion

4
XML for Data Exchange
Cars R Us
Tires R Us
5
Key Challenges

Publishing relational data as XML
For the foreseeable future, most business data
will continue to be stored in relational databases
Need to publish relational data as XML
Storing XML using relational database systems
Need to manage XML documents being transferred
across the wire (Purchase orders for auditing
etc.)
Do we need to build a specialized XML database?
Or can we leverage relational technology?

6
Outline

XML for data exchange
Publishing relational data as XML
Querying XML using relational databases
XML for structured and unstructured data
Conclusion

7
Example Relational Data
order
item
payment
8
XML View for Users
<order id="10">
  <customer> Smith Construction </customer>
  <items>
    <item description="generator">
      <cost> 8000 </cost>
    </item>
    <item description="backhoe">
      <cost> 24000 </cost>
    </item>
  </items>
  <payments>
    <payment due="1/10/01">
      <amount> 20000 </amount>
    </payment>
    <payment due="6/10/01">
      <amount> 12000 </amount>
    </payment>
  </payments>
</order>
9
Allow Users to Query View
Get all orders of customer Smith
for $order in view("orders")
where $order/customer/text() like "Smith%"
return $order
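
The idea of exposing relational rows as a nested XML view, and then answering the "orders of customer Smith" query against it, can be sketched in Python. This is only an illustration under stated assumptions: the slide's order, item and payment tables are not shown in the transcript, so the table layout, column names and the helper `order_view` are hypothetical, and the sketch issues one sub-query per nested element rather than using any optimized publishing technique.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: column names are assumptions chosen to match the XML view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE item    (order_id INTEGER, description TEXT, cost INTEGER);
    CREATE TABLE payment (order_id INTEGER, due TEXT, amount INTEGER);
    INSERT INTO orders  VALUES (10, 'Smith Construction'), (11, 'Jones Paving');
    INSERT INTO item    VALUES (10, 'generator', 8000), (10, 'backhoe', 24000),
                               (11, 'roller', 5000);
    INSERT INTO payment VALUES (10, '1/10/01', 20000), (10, '6/10/01', 12000);
""")

def order_view(customer_pattern):
    """Build one <order> element per order whose customer matches a LIKE pattern."""
    results = []
    orders = conn.execute(
        "SELECT id, customer FROM orders WHERE customer LIKE ? ORDER BY id",
        (customer_pattern,)).fetchall()
    for order_id, customer in orders:
        order = ET.Element("order", id=str(order_id))
        ET.SubElement(order, "customer").text = customer
        # Nested <items>: one sub-query per order.
        items = ET.SubElement(order, "items")
        for desc, cost in conn.execute(
                "SELECT description, cost FROM item "
                "WHERE order_id = ? ORDER BY rowid", (order_id,)):
            item = ET.SubElement(items, "item", description=desc)
            ET.SubElement(item, "cost").text = str(cost)
        # Nested <payments>: another sub-query per order.
        payments = ET.SubElement(order, "payments")
        for due, amount in conn.execute(
                "SELECT due, amount FROM payment "
                "WHERE order_id = ? ORDER BY rowid", (order_id,)):
            pay = ET.SubElement(payments, "payment", due=due)
            ET.SubElement(pay, "amount").text = str(amount)
        results.append(order)
    return results

smith_orders = order_view("Smith%")
print(ET.tostring(smith_orders[0], encoding="unicode"))
```

Pushing the customer predicate into the SQL `WHERE` clause, instead of materializing the whole view and filtering afterwards, is the kind of translation a view-based query processor would aim for.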
10
// First prepare all the SQL statements to be executed and create cursors for them
Exec SQL Prepare CustStmt From
    "select cust.id, cust.name from Customer cust where cust.name = 'Jack'"
Exec SQL Declare CustCursor Cursor For CustStmt
Exec SQL Prepare AcctStmt From
    "select acct.id, acct.acctnum from Account acct where acct.custId = ?"
Exec SQL Declare AcctCursor Cursor For AcctStmt
Exec SQL Prepare PorderStmt From
    "select porder.id, porder.acct, porder.date from PurchOrder porder where porder.custId = ?"
Exec SQL Declare PorderCursor Cursor For PorderStmt
Exec SQL Prepare ItemStmt From
    "select item.id, item.desc from Item item where item.poId = ?"
Exec SQL Declare ItemCursor Cursor For ItemStmt
Exec SQL Prepare PayStmt From
    "select pay.id, pay.desc from Payment pay where pay.poId = ?"
Exec SQL Declare PayCursor Cursor For PayStmt

// Now execute SQL statements in nested order of XML document result.
// Start with customer
XMLResult = ""
Exec SQL Open CustCursor
while (CustCursor has more rows) {
    Exec SQL Fetch CustCursor Into custId, custName
    XMLResult += "<customer id=" + custId + "><name>" + custName + "</name><accounts>"

    // For each customer, issue sub-query to get account information
    // and add to custAccts
    Exec SQL Open AcctCursor Using custId
    while (AcctCursor has more rows) {
        Exec SQL Fetch AcctCursor Into acctId, acctNum
        XMLResult += "<account id=" + acctId + ">" + acctNum + "</account>"
    }
    XMLResult += "</accounts><porders>"

    // For each customer, issue sub-query to get purchase order information
    // and add to custPorders
    Exec SQL Open PorderCursor Using custId
    while (PorderCursor has more rows) {
        Exec SQL Fetch PorderCursor Into poId, poAcct, poDate
        XMLResult += "<porder id=" + poId + " acct=" + poAcct + "><date>" + poDate + "</date><items>"

        // For each purchase order, issue a sub-query to get item information
        // and add to porderItems
        Exec SQL Open ItemCursor Using poId
        while (ItemCursor has more rows) {
            Exec SQL Fetch ItemCursor Into itemId, itemDesc
            XMLResult += "<item id=" + itemId + ">" + itemDesc + "</item>"
        }
        XMLResult += "</items><payments>"

        // For each purchase order, issue a sub-query to get payment information
        // and add to porderPays
        Exec SQL Open PayCursor Using poId
        while (PayCursor has more rows) {
            Exec SQL Fetch PayCursor Into payId, payDesc
            XMLResult += "<payment id=" + payId + ">" + payDesc + "</payment>"
        }
        XMLResult += "</payments></porder>"
    } // End of looping over all purchase orders associated with a customer
    XMLResult += "</porders></customer>"
    Return XMLResult as one result row; reset XMLResult
} // loop until all customers are tagged and output
11
Previous Work

SQL extensions for publishing relational data as
XML
Shanmugasundaram et al., VLDB 2000
Prototyped in DB2
Input into SQL/X working group
XML publishing using XQuery
Shanmugasundaram et al., VLDB 2001
Initially XPERANTO prototype
Now XTables product initiative

12
Updates

Updating XML views of relational data
Extend XQuery with update semantics
Translate XQuery updates to SQL updates (when
possible!) efficiently!

for $order in view("orders")
where $order/customer/text() = "Smith"
update $order/cost = $order/cost + 100.00
13
Recursion

// queries are very popular
Navigational recursion
XQuery functions allow structural recursion
Part hierarchies
Nested catalogs
How can we evaluate them using a relational
database system?
View composition
Fix-point recursion in SQL
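
As a rough illustration of the last bullet, a descendant (`//`) query over a part hierarchy can be evaluated with a recursive common table expression. The sketch below uses SQLite through Python's `sqlite3` module; the `part` table, its columns and the sample rows are assumptions made up for the example, not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical part hierarchy: each row points at its parent part.
    CREATE TABLE part (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO part VALUES
        (1, NULL, 'car'), (2, 1, 'engine'), (3, 2, 'piston'),
        (4, 2, 'valve'),  (5, 1, 'wheel'),  (6, 5, 'tire');
""")

def descendants(root_name):
    """Names of all parts strictly below root_name, like root_name//* in XPath."""
    rows = conn.execute("""
        WITH RECURSIVE sub(id, name, depth) AS (
            -- Base case: the starting part itself.
            SELECT id, name, 0 FROM part WHERE name = ?
            UNION ALL
            -- Recursive step: children of anything already found.
            SELECT p.id, p.name, s.depth + 1
            FROM part p JOIN sub s ON p.parent_id = s.id
        )
        SELECT name FROM sub WHERE depth > 0 ORDER BY id
    """, (root_name,))
    return [name for (name,) in rows]

print(descendants("engine"))
print(descendants("car"))
```

The database iterates the recursive step until no new rows appear, which is the fix-point evaluation the slide refers to.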

14
Outline

XML for data exchange
Publishing relational data as XML
Querying XML using relational databases
XML for structured and unstructured data
Conclusion

15
Native XML Documents
<PurchaseOrder Buyer="Excavation Corp." Date="1 Jan 2000">
  <Items>
    <Item ItemId="10" Price="10000"/>
    <Item ItemId="20" Price="6000"/>
  </Items>
  <Payments>
    <Payment CreditCard="8342398432" ChargeAmt="8000.00"/>
    <Payment CreditCard="3474324934" ChargeAmt="2000.00"/>
  </Payments>
</PurchaseOrder>
16
Querying Native XML Documents

Native XML database systems
Specialized for XML document processing
Extend relational (or object-oriented) database
systems
Leverage > 30 years of research and development
Harness sophisticated functionality, tools

17
Previous Work [Shanmugasundaram et al., VLDB '99] [Shanmugasundaram et al., SIGMOD Record '01]
XML Translation Layer
Relational Database System
18
Query Workload

Different XML shredding techniques can have a
dramatic influence on performance
How can we choose appropriate shredding based on
XML query and update workload?
SMART for XML
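
To make "shredding" concrete, here is a minimal sketch of one simple technique, a generic edge table with one row per element or attribute. It is only one of several possible mappings and is not claimed to be the scheme used in the cited work; the table name `edge`, its columns and the helper `shred` are assumptions for illustration.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<PurchaseOrder Buyer="Excavation Corp." Date="1 Jan 2000">
  <Items>
    <Item ItemId="10" Price="10000"/>
    <Item ItemId="20" Price="6000"/>
  </Items>
</PurchaseOrder>"""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE edge (
    id INTEGER PRIMARY KEY, parent_id INTEGER, ordinal INTEGER,
    kind TEXT, name TEXT, value TEXT)""")
ids = itertools.count(1)  # node ids assigned in document order

def shred(elem, parent_id=None, ordinal=0):
    """Store an element, its attributes and its subtree as rows of the edge table."""
    node_id = next(ids)
    text = (elem.text or "").strip() or None
    conn.execute("INSERT INTO edge VALUES (?, ?, ?, 'elem', ?, ?)",
                 (node_id, parent_id, ordinal, elem.tag, text))
    for i, (name, value) in enumerate(elem.attrib.items()):
        conn.execute("INSERT INTO edge VALUES (?, ?, ?, 'attr', ?, ?)",
                     (next(ids), node_id, i, name, value))
    for i, child in enumerate(elem):
        shred(child, node_id, i)

shred(ET.fromstring(DOC))

# //Item/@Price becomes a self-join between element rows and attribute rows.
prices = [int(v) for (v,) in conn.execute("""
    SELECT a.value FROM edge e JOIN edge a ON a.parent_id = e.id
    WHERE e.kind = 'elem' AND e.name = 'Item'
      AND a.kind = 'attr' AND a.name = 'Price'
    ORDER BY e.id""")]
row_count = conn.execute("SELECT COUNT(*) FROM edge").fetchone()[0]
print(prices, row_count)
```

A schema-specific mapping (for example, one table per element type) would answer the same query without the self-join, which is why the choice of shredding can matter so much for a given workload.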

19
Typing

XML Schemas have many sophisticated constraints
Min occurs, max occurs
Structural constraints
How can we preserve these in relational database
systems in presence of updates?
Relational constraints?
Materialized views?

20
Outline

XML for data exchange
XML for structured and unstructured data
Conclusion

21
30,000-foot View of Data Management Today

Essentially two camps
Structured camp: Relational database systems
Highly structured data
Precise and sophisticated queries over this data
Unstructured camp: Information retrieval systems
Unstructured data
Keyword search queries returning ranked results

22
Traditional Data Management Landscape
[Figure: 2x2 grid, quadrants numbered 1 to 4. Query axis: Ranked Keyword Search vs. Complex and Structured. Data axis: Structured vs. Unstructured. Systems shown: Information Retrieval Systems; (Relational) Database Systems.]
23
Primary Advantages of Ranked Keyword Search
John Ithaca

Simple
As witnessed by popularity of keyword search over
the Internet
Facilitates information discovery
Ranks results in order of importance
Users do not need to know the schema of the
underlying data (if there is any)

24
Ranked Keyword Search over Structured and
Unstructured Data

Content management
Semi-structured data
Scientific documents, Shakespeare's plays, ...
Mix of structured and unstructured data
Database with date and time of accident
(structured data) and accident description
(unstructured data)
Support flexible keyword search interface

25
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
26
Key Challenges

Generalize ranked keyword query semantics
Should work as usual for unstructured data
Should generalize to structured data too!
Allows users to query across both forms of data
Generalized inverted lists
Indexing mix of structured and unstructured data
More on this soon!

27
Traditional Data Management Landscape
[Figure: 2x2 grid, quadrants numbered 1 to 4. Query axis: Ranked Keyword Search vs. Complex and Structured. Data axis: Structured vs. Unstructured. Systems shown: Information Retrieval Systems; (Relational) Database Systems.]
28
Complex Queries over Semi-structured Data
Motivation 1

Document repositories do not typically conform to
a rigid schema
Scientific documents
PowerPoint presentations
Push to publish these in XML form
Complex queries over such heterogeneous
collections (in conjunction with structured data)
Find all documents on XML authored by Almaden
employees

29
Complex Queries over Semi-structured Data
Motivation 2

Even structured data can have a widely varying
structure
Example: electronic parts marketplace
2 million parts each having about 10-15
attributes
A total of 5000 distinct attributes
Structure changes very often (schema chaos)
New parts are added every day
May be better to treat this data as
semi-structured
But still need to ask structured queries
Find capacitors with capacitance between 10 and 20

30
Indexing and Query Processing

Index both schema and data
/book[author/name = "Jane"]
Treat schema as a data value
Benefits
Can capture arbitrarily heterogeneous schema
Easy schema evolution
Can implement it using a relational database
system! (using regular B-trees)
Supports wildcard (//) queries
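
One way to picture "treat schema as a data value" is a single relational table keyed on (path, value), backed by an ordinary B-tree index. The sketch below is an assumption-laden illustration, not the indexing scheme from the talk: the table `path_index`, the helper names and the sample document are all hypothetical, and the wildcard lookup uses a simple suffix `LIKE` (case-insensitive for ASCII in SQLite) rather than an index-friendly encoding.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<library>
  <book><title>XML Basics</title><author><name>Jane</name></author></book>
  <book><title>SQL Basics</title><author><name>John</name></author></book>
  <article><name>Jane</name></article>
</library>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE path_index (path TEXT, value TEXT, node_id INTEGER)")
# Regular B-tree index over schema (path) and data (value) together.
conn.execute("CREATE INDEX path_value ON path_index (path, value)")
ids = itertools.count(1)

def index_element(elem, prefix=""):
    """Record the root-to-node path of every element that carries text."""
    node_id = next(ids)
    path = prefix + "/" + elem.tag
    text = (elem.text or "").strip()
    if text:
        conn.execute("INSERT INTO path_index VALUES (?, ?, ?)",
                     (path, text, node_id))
    for child in elem:
        index_element(child, path)

index_element(ET.fromstring(DOC))

def lookup(path, value):
    """Exact path lookup, or a //name style wildcard via a suffix match."""
    if path.startswith("//"):
        sql = ("SELECT path FROM path_index "
               "WHERE path LIKE ? AND value = ? ORDER BY node_id")
        args = ("%/" + path[2:], value)
    else:
        sql = ("SELECT path FROM path_index "
               "WHERE path = ? AND value = ? ORDER BY node_id")
        args = (path, value)
    return [p for (p,) in conn.execute(sql, args)]

print(lookup("/library/book/author/name", "Jane"))
print(lookup("//name", "Jane"))
```

Because paths are just strings in a column, a document with a brand-new structure only adds new rows, with no schema change to the table itself.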

31
Order [Tatarinov et al., SIGMOD '02]

Shakespeare's plays marked up as XML
Acts ordered one after the other
Cannot view this as an unordered set
XQuery queries support ordered predicates
Find acts after Hamlet said "to be or not to be"
Again, treat order as a data value
Order encoding methods
Can be implemented in a relational database
system!
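
A minimal sketch of "order as a data value", assuming a global order encoding in which every node stores its document-order position in a column. The table `node`, its columns and the toy play markup are assumptions for illustration; other order encodings are possible and are not shown here.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Toy markup: one speech per act, just enough to exercise the order predicate.
PLAY = """<play>
  <act n="1"><speech speaker="BERNARDO">Who's there?</speech></act>
  <act n="2"><speech speaker="POLONIUS">Brevity is the soul of wit</speech></act>
  <act n="3"><speech speaker="HAMLET">To be, or not to be</speech></act>
  <act n="4"><speech speaker="CLAUDIUS">When sorrows come</speech></act>
  <act n="5"><speech speaker="HAMLET">The rest is silence</speech></act>
</play>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (pos INTEGER PRIMARY KEY, tag TEXT, label TEXT, text TEXT)")
positions = itertools.count(1)

def load(elem):
    """Store each node with its position in document order (global order)."""
    label = elem.get("n") or elem.get("speaker")
    conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                 (next(positions), elem.tag, label, (elem.text or "").strip()))
    for child in elem:
        load(child)

load(ET.fromstring(PLAY))

# "Acts after Hamlet said 'to be or not to be'" becomes a comparison on pos.
acts_after = [label for (label,) in conn.execute("""
    SELECT a.label FROM node a
    WHERE a.tag = 'act'
      AND a.pos > (SELECT s.pos FROM node s
                   WHERE s.tag = 'speech' AND s.label = 'HAMLET'
                     AND s.text LIKE 'To be, or not to be%')
    ORDER BY a.pos""")]
print(acts_after)
```

A global position makes order comparisons cheap but forces renumbering on inserts, which is the kind of trade-off that distinguishes order encoding methods.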

32
Traditional Data Management Landscape
[Figure: 2x2 grid, quadrants numbered 1 to 4. Query axis: Ranked Keyword Search vs. Complex and Structured. Data axis: Structured vs. Unstructured. Systems shown: Information Retrieval Systems; (Relational) Database Systems.]
33
Unifying IR and Database Systems Motivation

The Internet is opening the door to ad-hoc
queries by end users
E.g., Used car marketplace
Find all bright red Ford Mustangs that cost
less than 20% of the average price of cars in their
class
Characteristics of queries
Ranked keyword search (for ease of use)
Complex query operations (information synthesis)
Want to see ranked results!

34
Main Challenge
Find bright red Ford Mustangs that cost
less than 20% of the average price of cars in their
class

Integrate ranking with structured query
operations
Developing XQuery framework
Build ranking into core of language
Both keyword search and structured operators
Open question
Will we be able to extend relational databases
for this purpose?

35
Outline

XML for data exchange
XML for structured and unstructured data
Overview
Ranked keyword search over XML documents
Conclusion

36
XRANK: Ranked Keyword Search over XML Documents

Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram

37
Traditional Data Management Landscape
[Figure: 2x2 grid, quadrants numbered 1 to 4. Query axis: Ranked Keyword Search vs. Complex and Structured. Data axis: Structured vs. Unstructured. Systems shown: Information Retrieval Systems; (Relational) Database Systems.]
38
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
39
Design Principles

Return most specific element containing the query
keywords

40
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
41
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements

42
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
43
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements
Two-dimensional keyword proximity
Height of result XML tree
Width of result XML tree

44
System Architecture
[Figure: system architecture. Input: Keyword query. Output: Ranked Results. Components: Input XML Documents, Standing Computation, XML Elements with Standings, Hybrid Dewey Inverted List, Data access, Query Evaluator.]
45
Naïve Approach

One main difference between document and XML
keyword search is result granularity
Treat each element as a document
Build regular inverted list index structures over
elements
Drawbacks
Space overhead (depth of document)
Ranking (two-dimensional proximity)
Spurious query results
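
The drawbacks listed above can be seen in a small sketch of the naive approach, where every element is treated as its own "document" containing all the words in its subtree. The sample document and the function `naive_index` are assumptions for illustration; element paths double as element ids only because they happen to be unique in this toy document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<workshop>
  <title>XML and Information Retrieval</title>
  <proceedings>
    <paper>
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
    </paper>
  </proceedings>
</workshop>"""

def naive_index(root):
    """Inverted list mapping each word to every element whose subtree contains it."""
    index = defaultdict(set)

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        words = set((elem.text or "").lower().split())
        for child in elem:
            words |= visit(child, path)
            words |= set((child.tail or "").lower().split())
        for word in words:
            index[word].add(path)  # the word is repeated for every ancestor
        return words

    visit(root, "")
    return index

index = naive_index(ET.fromstring(DOC))

# Space overhead: one posting per ancestor, i.e. proportional to depth.
xql_postings = sorted(index["xql"], key=len)
# Spurious results: every ancestor of the <paper> also "matches" both keywords.
matches = sorted(index["xql"] & index["ricardo"], key=len)
print(xql_postings)
print(matches)
```

Only the deepest match, the `paper` element, is the most specific answer; the `proceedings` and `workshop` entries are exactly the spurious results the slide warns about.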

46
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
47
Main Problem with Naïve Approach

Decouples representation of ancestors and
descendants
Space overhead
Spurious query results
Dewey encoding for ids
General knowledge classification (1870s)
LDAP, Ordered XML, ...

48
Dewey Encoding
0  <workshop>
    0.0  date: 28 July ...
    0.1  <title>: XML and ...
    0.2  <editors>: David Carmel ...
    0.3  <proceedings>
        0.3.0  <paper>
            0.3.0.0  <title>: XQL and ...
            0.3.0.1  <author>: Ricardo ...
        0.3.1  <paper>
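
The numbering in the figure can be reproduced with a short recursive walk. This is a sketch under assumptions: the function names are hypothetical, attributes are numbered before child elements because the figure gives `date` the id 0.0, and the sample document drops the `id` attribute of `paper` so that the ids line up with the figure.

```python
import xml.etree.ElementTree as ET

DOC = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval</title>
  <editors>David Carmel</editors>
  <proceedings>
    <paper><title>XQL and Proximal Nodes</title><author>Ricardo Baeza-Yates</author></paper>
    <paper/>
  </proceedings>
</workshop>"""

def assign_dewey(elem, dewey="0", out=None):
    """Return (dewey_id, name) pairs; a child's id extends its parent's id."""
    out = [] if out is None else out
    out.append((dewey, elem.tag))
    position = 0
    for attr_name in elem.attrib:          # attributes first, as in the figure
        out.append((f"{dewey}.{position}", attr_name))
        position += 1
    for child in elem:                     # then child elements, in order
        assign_dewey(child, f"{dewey}.{position}", out)
        position += 1
    return out

def is_ancestor(a, b):
    """True if the node with Dewey id a is a proper ancestor of the node with id b."""
    return b.startswith(a + ".")

ids = assign_dewey(ET.fromstring(DOC))
for dewey_id, name in ids:
    print(dewey_id, name)
```

Since an id embeds the ids of all ancestors, ancestor and descendant checks reduce to a prefix test, which is what lets a single index entry stand in for a whole chain of enclosing elements.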
49
Dewey Inverted List (DIL)
Keyword   Dewey Id    Standing   Position List
XQL       5.0.3.0.0   85         32
          5.0.3.8.3   38         89, 91
          ...
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52
          ...
(Entries for each keyword are sorted by Dewey Id.)
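
Using the entries shown on this slide, the most specific element containing both keywords can be found from common prefixes of Dewey ids. The sketch below is a brute-force illustration, not the query-processing algorithm of the talk: it compares every combination of entries instead of merging the sorted lists, it ignores standings and positions, and the function names are hypothetical.

```python
from itertools import product

# Entries copied from the slide: (Dewey id, standing, position list).
DIL = {
    "xql":     [("5.0.3.0.0", 85, [32]), ("5.0.3.8.3", 38, [89, 91])],
    "ricardo": [("5.0.3.0.1", 82, [38]), ("8.2.1.4.2", 99, [52])],
}

def common_prefix(dewey_ids):
    """Longest shared prefix of Dewey ids, i.e. their deepest common ancestor."""
    prefix = []
    for components in zip(*(d.split(".") for d in dewey_ids)):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ".".join(prefix)

def most_specific(keywords):
    """Elements containing all keywords that have no descendant doing the same."""
    id_lists = [[entry[0] for entry in DIL[k]] for k in keywords]
    candidates = {common_prefix(combo) for combo in product(*id_lists)}
    candidates.discard("")  # no shared ancestor at all
    return sorted(c for c in candidates
                  if not any(other.startswith(c + ".") for other in candidates))

print(most_specific(["xql", "ricardo"]))
```

Because each list is sorted by Dewey id, a real evaluator could find these prefixes in a single merge pass over the lists instead of enumerating combinations.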

50
Query Processing