Current Research Directions - PowerPoint PPT Presentation

1
Current Research Directions
  • Jayavel Shanmugasundaram
  • Cornell University

2
Two Main Projects
  • XML
  • Internet data management
  • Peer-to-peer systems
  • Querying the deep web

3
Why XML?
  • Internet data exchange
  • XML emerging as dominant standard for data
    interactions over the Internet (e.g., SOAP)
  • Consequently, web application developers deal
    with XML (e.g., WebSphere)
  • Captures structured and unstructured data
  • Content management
  • Semi-structured data

4
Outline
  • XML for data exchange
  • XML for structured and unstructured data
  • Conclusion

5
XML for Data Exchange
[Figure: Cars R Us and Tires R Us exchanging data as XML]
6
Key Challenges
  • Publishing relational data as XML
  • For the foreseeable future, most business data
    will continue to be stored in relational databases
  • Need to publish relational data as XML
  • Storing XML using relational database systems
  • Need to manage XML documents being transferred
    across the wire (Purchase orders for auditing
    etc.)
  • Do we need to build a specialized XML database?
    Or can we leverage relational technology?

7
Outline
  • XML for data exchange
  • Publishing relational data as XML
  • Querying XML using relational databases
  • XML for structured and unstructured data
  • Conclusion

8
Example Relational Data
[Figure: three relational tables: order, item, payment]
9
XML View for Users
<order id="10">
  <customer> Smith Construction </customer>
  <items>
    <item description="generator">
      <cost> 8000 </cost>
    </item>
    <item description="backhoe">
      <cost> 24000 </cost>
    </item>
  </items>
  <payments>
    <payment due="1/10/01">
      <amount> 20000 </amount>
    </payment>
    <payment due="6/10/01">
      <amount> 12000 </amount>
    </payment>
  </payments>
</order>
10
Allow Users to Query View
Get all orders of customer Smith
for $order in view("orders")
where $order/customer/text() like "Smith%"
return $order
11
// First prepare all the SQL statements to be executed and create cursors for them
Exec SQL Prepare CustStmt From
    "select cust.id, cust.name from Customer cust where cust.name = 'Jack'"
Exec SQL Declare CustCursor Cursor For CustStmt
Exec SQL Prepare AcctStmt From
    "select acct.id, acct.acctnum from Account acct where acct.custId = ?"
Exec SQL Declare AcctCursor Cursor For AcctStmt
Exec SQL Prepare PorderStmt From
    "select porder.id, porder.acct, porder.date from PurchOrder porder where porder.custId = ?"
Exec SQL Declare PorderCursor Cursor For PorderStmt
Exec SQL Prepare ItemStmt From
    "select item.id, item.desc from Item item where item.poId = ?"
Exec SQL Declare ItemCursor Cursor For ItemStmt
Exec SQL Prepare PayStmt From
    "select pay.id, pay.desc from Payment pay where pay.poId = ?"
Exec SQL Declare PayCursor Cursor For PayStmt

// Now execute SQL statements in nested order of XML document result. Start with customer
XMLResult = ""
Exec SQL Open CustCursor
while (CustCursor has more rows) {
    Exec SQL Fetch CustCursor Into custId, custName
    XMLResult += "<customer id=" + custId + "><name>" + custName + "</name><accounts>"
    // For each customer, issue sub-query to get account information and add to custAccts
    Exec SQL Open AcctCursor Using custId
    while (AcctCursor has more rows) {
        Exec SQL Fetch AcctCursor Into acctId, acctNum
        XMLResult += "<account id=" + acctId + ">" + acctNum + "</account>"
    }
    XMLResult += "</accounts><porders>"
    // For each customer, issue sub-query to get purchase order information and add to custPorders
    Exec SQL Open PorderCursor Using custId
    while (PorderCursor has more rows) {
        Exec SQL Fetch PorderCursor Into poId, poAcct, poDate
        XMLResult += "<porder id=" + poId + " acct=" + poAcct + "><date>" + poDate + "</date><items>"
        // For each purchase order, issue a sub-query to get item information and add to porderItems
        Exec SQL Open ItemCursor Using poId
        while (ItemCursor has more rows) {
            Exec SQL Fetch ItemCursor Into itemId, itemDesc
            XMLResult += "<item id=" + itemId + ">" + itemDesc + "</item>"
        }
        XMLResult += "</items><payments>"
        // For each purchase order, issue a sub-query to get payment information and add to porderPays
        Exec SQL Open PayCursor Using poId
        while (PayCursor has more rows) {
            Exec SQL Fetch PayCursor Into payId, payDesc
            XMLResult += "<payment id=" + payId + ">" + payDesc + "</payment>"
        }
        XMLResult += "</payments></porder>"
    } // End of looping over all purchase orders associated with a customer
    XMLResult += "</customer>"
    Return XMLResult as one result row; reset XMLResult
} // loop until all customers are tagged and output
12
Previous Work
  • SQL extensions for publishing relational data as
    XML
  • Shanmugasundaram et al., VLDB 2000
  • Prototyped in DB2
  • Input into SQL/X working group
  • XML publishing using XQuery
  • Shanmugasundaram et al., VLDB 2001
  • Initially the XPERANTO prototype, now XTables

13
Updates
  • Updating XML views of relational data
  • Extend XQuery with update semantics
  • Translate XQuery updates to SQL updates (when
    possible!) efficiently!

for $order in view("orders")
where $order/customer/text() = "Smith"
update $order/cost = $order/cost + 100.00
14
Recursion
  • // queries are very popular
  • Navigational recursion
  • XQuery functions allow structural recursion
  • Part hierarchies
  • Nested catalogs
  • How can we evaluate them using a relational
    database system?
  • View composition
  • Fix-point recursion in SQL
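
As a small illustration of fix-point recursion in SQL, the sketch below evaluates a descendant-style (`//`) query over a part hierarchy with a recursive common table expression. The `part_of` table, its contents, and the function names are hypothetical, and SQLite is used only because it is easy to run; the talk does not say which system or formulation was intended.

```python
import sqlite3

def demo_connection():
    """Create an in-memory database with a tiny, made-up part hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE part_of (parent TEXT, child TEXT);
        INSERT INTO part_of VALUES
            ('car', 'engine'), ('car', 'wheel'),
            ('engine', 'piston'), ('wheel', 'tire');
    """)
    return conn

def subparts(conn, root):
    """Return all direct and indirect sub-parts of `root`, sorted by name."""
    rows = conn.execute("""
        WITH RECURSIVE descendants(name) AS (
            SELECT child FROM part_of WHERE parent = ?
            UNION
            SELECT p.child FROM part_of p JOIN descendants d ON p.parent = d.name
        )
        SELECT name FROM descendants ORDER BY name
    """, (root,))
    return [r[0] for r in rows]
```

Here `subparts(demo_connection(), "car")` walks the hierarchy to any depth, which is the relational counterpart of navigating with `//`.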

15
Outline
  • XML for data exchange
  • Publishing relational data as XML
  • Querying XML using relational databases
  • XML for structured and unstructured data
  • Conclusion

16
Native XML Documents
<PurchaseOrder Buyer="Excavation Corp." Date="1 Jan 2000">
  <Items>
    <Item ItemId="10" Price="10000"/>
    <Item ItemId="20" Price="6000"/>
  </Items>
  <Payments>
    <Payment CreditCard="8342398432" ChargeAmt="8000.00"/>
    <Payment CreditCard="3474324934" ChargeAmt="2000.00"/>
  </Payments>
</PurchaseOrder>
17
Querying Native XML Documents
  • Native XML database systems
  • Specialized for XML document processing
  • Extend relational (or object-oriented) database
    systems
  • Leverage > 30 years of research and development
  • Harness sophisticated functionality, tools

18
Previous Work [Shanmugasundaram et al., VLDB '99; Shanmugasundaram et al., SIGMOD Record '01]
[Figure: an XML Translation Layer on top of a Relational Database System]
19
Query Workload
  • Different XML shredding techniques can have a
    dramatic influence on performance
  • How can we choose appropriate shredding based on
    XML query and update workload?
  • SMART for XML
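
To make "shredding" concrete, here is a minimal sketch of one simple technique: every element and attribute becomes a row in a generic edge table, and an XML query becomes a SQL join. The edge-table layout, the function names, and the sample query are assumptions for illustration; they are not necessarily the schemes compared in the cited work, and other shreddings (for example one table per element type) would lead to different SQL and different performance.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Purchase order in the style of the native XML document shown earlier.
PURCHASE_ORDER = """
<PurchaseOrder Buyer="Excavation Corp." Date="1 Jan 2000">
  <Items>
    <Item ItemId="10" Price="10000"/>
    <Item ItemId="20" Price="6000"/>
  </Items>
</PurchaseOrder>
"""

INSERT = "INSERT INTO edge (parent, ordinal, tag, text) VALUES (?, ?, ?, ?)"

def shred(conn, xml_text):
    """Store each element (and each attribute, as '@name') as one edge row."""
    conn.execute("""CREATE TABLE IF NOT EXISTS edge (
        id INTEGER PRIMARY KEY, parent INTEGER, ordinal INTEGER, tag TEXT, text TEXT)""")

    def visit(elem, parent, ordinal):
        cur = conn.execute(INSERT, (parent, ordinal, elem.tag, (elem.text or "").strip()))
        row_id = cur.lastrowid
        for name, value in elem.attrib.items():
            conn.execute(INSERT, (row_id, None, "@" + name, value))
        for i, child in enumerate(elem):
            visit(child, row_id, i)

    visit(ET.fromstring(xml_text), None, 0)

def expensive_item_ids(conn, min_price):
    """ItemId values of <Item> elements whose Price exceeds min_price."""
    rows = conn.execute("""
        SELECT idattr.text
          FROM edge item
          JOIN edge price  ON price.parent  = item.id AND price.tag  = '@Price'
          JOIN edge idattr ON idattr.parent = item.id AND idattr.tag = '@ItemId'
         WHERE item.tag = 'Item' AND CAST(price.text AS REAL) > ?
         ORDER BY item.id
    """, (min_price,))
    return [r[0] for r in rows]
```

Even this small query needs two self-joins on the edge table, which hints at why the choice of shredding can matter for a given workload.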

20
Typing
  • XML Schemas have many sophisticated constraints
  • Min occurs, max occurs
  • Structural constraints
  • How can we preserve these in relational database
    systems in presence of updates?
  • Relational constraints?
  • Materialized views?

21
Outline
  • XML for data exchange
  • XML for structured and unstructured data
  • Conclusion

22
30000-foot view of Data Management Today
  • Essentially two camps
  • Structured camp: Relational database systems
  • Highly structured data
  • Precise and sophisticated queries over this data
  • Unstructured camp: Information retrieval systems
  • Unstructured data
  • Keyword search queries returning ranked results

23
Traditional Data Management Landscape
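[Figure: 2x2 grid with quadrants numbered 1 to 4. One axis is Queries (Ranked Keyword Search vs. Complex and Structured); the other is Data (Structured vs. Unstructured). Information Retrieval Systems sit at ranked keyword search over unstructured data; (Relational) Database Systems sit at complex and structured queries over structured data.]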
24
Primary Advantages of Ranked Keyword Search
Example query: John Ithaca
  • Simple
  • As witnessed by popularity of keyword search over
    the Internet
  • Facilitates information discovery
  • Ranks results in order of importance
  • Users do not need to know the schema of the
    underlying data (if there is any)

25
Ranked Keyword Search over Structured Data
Applications
  • Publishing databases on the Internet
  • Information discovery
  • Structured schema can be complex!
  • E.g., SAP creates 100s of tables under the
    covers
  • Can take users days to figure out the schema
  • Flexible query interface
  • Complex SQL queries rarely written by end-users
  • End-users thus have to work with existing queries
    or wait patiently for an expert

26
Example Relational Database
Customer
  Id   Name  Policy  SS
  200  Jane  354859  134-983-8348
  300  John  495493  324-978-2836

Accidents
  Day  Month  Year  Location  Cost     Cid
  10   June   1999  Ithaca    1000.00  200
  23   March  2001  Boston    2500.00  300
  3    May    2001  Boston     500.00  200 or 300
  15   July   2000  Boston    1100.00  200 or 300

Dependents
  Name      Age  Cid
  Mark      16   200
  Kim       17   200 or 300
  John Jr.  16   200 or 300
27
Ranked Keyword Search over Structured Data
Example query: John Ithaca
29
Key Challenges
  • Structured data is typically normalized
  • A single tuple may not contain all the query
    keywords
  • Promising initial work [Agrawal et al., Bhalotia
    et al., Hristidis et al.]
  • Ranking
  • Keyword search is just one aspect
  • Can we exploit hyperlink and/or tf-idf for
    ranking?
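
The normalization problem can be seen in a few lines of Python. In this sketch no single tuple contains both keywords, so the search has to combine tuples that join on the customer id. The data is a reduced, hypothetical subset of the example database, the function name is made up, and grouping by customer id is a deliberate simplification of real join-based keyword search; ranking is not modeled at all.

```python
def keyword_search(tables, keywords):
    """Return customer ids whose joined tuples together contain all keywords.

    tables: list of (rows, key) pairs, where `key` names the column that
    holds the customer id in that table.
    """
    words_by_cid = {}
    for rows, key in tables:
        for row in rows:
            bag = words_by_cid.setdefault(row[key], set())
            for value in row.values():
                bag.update(str(value).lower().split())
    wanted = {k.lower() for k in keywords}
    return sorted(cid for cid, bag in words_by_cid.items() if wanted <= bag)

# Reduced sample data (assumption: only a few rows and columns are kept).
CUSTOMER = [{"Id": 200, "Name": "Jane"}, {"Id": 300, "Name": "John"}]
ACCIDENTS = [
    {"Cid": 200, "Location": "Ithaca", "Year": 1999},
    {"Cid": 300, "Location": "Boston", "Year": 2001},
]
TABLES = [(CUSTOMER, "Id"), (ACCIDENTS, "Cid")]
```

With this data, `keyword_search(TABLES, ["Jane", "Ithaca"])` finds customer 200 even though "Jane" lives in Customer and "Ithaca" in Accidents.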

31
Ranked Keyword Search over Structured and
Unstructured Data
  • Content management
  • Semi-structured data
  • Scientific documents, Shakespeare's plays, ...
  • Mix of structured and unstructured data
  • Database with date and time of accident
    (structured data) and accident description
    (unstructured data)
  • Support flexible keyword search interface
  • Same advantages as for structured data

32
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
33
Key Challenges
  • Generalize ranked keyword query semantics
  • Should work as usual for unstructured data
  • Should generalize to structured data too!
  • Allows users to query across both forms of data
  • Generalized inverted lists
  • Indexing mix of structured and unstructured data
  • More on this later!

35
Complex Queries over Unstructured Data: Motivation
"The Department of Computer Science at Cornell
University, which was organized in 1965, is one
of the oldest departments of its kind in the
country. It has a full-time faculty of 30, ..."
Query: Find all computer science departments that were
founded between 1960 and 1970
36
Integrate Querying with Meta-data Extraction
  • Use Natural Language Processing to extract
    metadata
  • Talking to NLP group at Cornell
  • How can we integrate this information with other
    structured data?
  • Semantic maps? Ontologies?

38
Complex Queries over Semi-structured Data: Motivation 1
  • Document repositories do not typically conform to
    a rigid schema
  • Scientific documents
  • PowerPoint presentations
  • Push to publish these in XML form
  • Complex queries over such heterogeneous
    collections (in conjunction with structured data)
  • Find all documents on XML authored by Almaden
    employees

39
Complex Queries over Semi-structured Data: Motivation 2
  • Even structured data can have a widely varying
    structure [Agrawal et al.]
  • Example: electronic parts marketplace
  • 2 million parts each having about 10-15
    attributes
  • A total of 5000 distinct attributes
  • Structure changes very often (schema chaos)
  • New parts are added every day
  • May be better to treat this data as
    semi-structured
  • But still need to ask structured queries
  • Find capacitors with capacitance between 10 and 20

40
Indexing and Query Processing
  • Index both schema and data
  • Treat schema as a data value
  • Benefits
  • Can capture arbitrarily heterogeneous schema
  • Easy schema evolution
  • Can implement it using a relational database
    system! (using regular B-trees)
  • Supports wildcard (//) queries
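
A minimal sketch of "treat schema as a data value": instead of one column per attribute, each attribute name is stored as data next to its value, and an ordinary B-tree index on (name, value) answers range queries. The table and column names, the sample parts, and the function names are hypothetical; SQLite is used only as a convenient relational engine whose indexes are B-trees. Wildcard (`//`) support is not shown.

```python
import sqlite3

def build(conn, parts):
    """parts: dict part_id -> dict of attribute name -> numeric value."""
    conn.executescript("""
        CREATE TABLE attr (part_id INTEGER, name TEXT, value REAL);
        CREATE INDEX attr_name_value ON attr (name, value);
    """)
    conn.executemany(
        "INSERT INTO attr VALUES (?, ?, ?)",
        [(pid, name, value)
         for pid, attrs in parts.items()
         for name, value in attrs.items()])

def parts_in_range(conn, name, low, high):
    """Ids of parts whose attribute `name` lies in [low, high]."""
    rows = conn.execute(
        "SELECT DISTINCT part_id FROM attr "
        "WHERE name = ? AND value BETWEEN ? AND ? ORDER BY part_id",
        (name, low, high))
    return [r[0] for r in rows]

# Made-up parts with heterogeneous attributes.
PARTS = {
    1: {"capacitance": 15.0, "voltage": 50.0},
    2: {"capacitance": 47.0},
    3: {"resistance": 10.0},
}
```

Adding a part with a brand-new attribute is just another insert, so schema evolution needs no `ALTER TABLE`; the "capacitance between 10 and 20" query becomes `parts_in_range(conn, "capacitance", 10, 20)`.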

41
Order [Tatarinov et al., SIGMOD '02]
  • Shakespeare's plays marked up as XML
  • Acts ordered one after the other
  • Cannot view this as an unordered set
  • XQuery queries support ordered predicates
  • Find acts after Hamlet said "to be or not to be"
  • Again, treat order as a data value
  • Order encoding methods
  • Can be implemented in a relational database
    system!
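
One way to treat order as a data value is to store each element's position in document order as an ordinary column, so that "after" becomes a numeric comparison. The sketch below does this with a single global position number; this particular encoding, the toy play markup, and the function names are assumptions for illustration and are not claimed to be the encoding the cited paper recommends.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Heavily abbreviated toy markup of a play (hypothetical tag names).
PLAY = """
<PLAY>
  <ACT title="ACT II"><LINE>Words, words, words.</LINE></ACT>
  <ACT title="ACT III"><LINE>To be, or not to be</LINE></ACT>
  <ACT title="ACT IV"><LINE>How all occasions do inform against me</LINE></ACT>
  <ACT title="ACT V"><LINE>The rest is silence.</LINE></ACT>
</PLAY>
"""

def load_with_order(conn, xml_text):
    """Store every element with its position in document order."""
    conn.execute("CREATE TABLE node (pos INTEGER PRIMARY KEY, tag TEXT, label TEXT)")
    for pos, elem in enumerate(ET.fromstring(xml_text).iter()):
        label = elem.get("title") or (elem.text or "").strip()
        conn.execute("INSERT INTO node VALUES (?, ?, ?)", (pos, elem.tag, label))

def acts_after_line(conn, phrase):
    """Titles of acts that start after the first line containing `phrase`."""
    rows = conn.execute("""
        SELECT a.label FROM node a
         WHERE a.tag = 'ACT'
           AND a.pos > (SELECT MIN(l.pos) FROM node l
                         WHERE l.tag = 'LINE' AND l.label LIKE ?)
         ORDER BY a.pos
    """, ("%" + phrase + "%",))
    return [r[0] for r in rows]
```

The ordered predicate is evaluated entirely inside the relational engine; a global position is cheap to query but costly to maintain under inserts, which is one reason alternative encodings are worth studying.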

43
Unifying IR and Database Systems: Motivation
  • The Internet is enabling end-users to directly
    ask queries
  • E.g., used car marketplace
  • Find all bright red Ford Mustangs that cost
    less than 20% of the average price of cars in
    their class
  • Characteristics of queries
  • Ranked keyword search (for ease of use)
  • Complex query operations (information synthesis)
  • Want to see ranked results!

44
Main Challenge
Find bright red Ford Mustangs that cost
less than 20% of the average price of cars in
their class
  • Integrate ranking with structured query
    operations
  • Developing XQuery framework
  • Build ranking into core of language
  • Both keyword search and structured operators
  • Open question
  • Will we be able to extend relational databases
    for this purpose?

45
Outline
  • XML for data exchange
  • XML for structured and unstructured data
  • Overview
  • Ranked keyword search over XML documents
  • Conclusion

46
XRANK: Ranked Keyword Search over XML Documents
  • Lin Guo, Feng Shao, Chavdar Botev, Jayavel
    Shanmugasundaram

48
Semi-structured Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> A Query Language </cite>
    </paper>
53
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements
  • Two-dimensional keyword proximity
  • Height of result XML tree
  • Width of result XML tree

54
System Architecture
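[Figure: Input XML Documents pass through Standing Computation, producing XML Elements with Standings that are stored in the Hybrid Dewey Inverted List. The Query Evaluator takes a Keyword query, reads the index (Data access), and returns Ranked Results.]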
55
Naïve Approach
  • One main difference between document and XML
    keyword search is result granularity
  • Treat each element as a document
  • Build regular inverted list index structures over
    elements
  • Drawbacks
  • Space overhead (depth of document)
  • Ranking (two-dimensional proximity)
  • Spurious query results
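
The drawbacks of the naïve approach show up even on a tiny document. In this sketch each element is treated as a "document" containing all the text in its subtree, so every keyword is indexed once per ancestor, and a two-keyword query returns every ancestor of the real answer. The sample document, the tuple-based element paths, and the function names are assumptions for illustration; ranking is not modeled.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = (
    "<workshop><title>XML and Information Retrieval</title>"
    "<proceedings><paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper></proceedings></workshop>"
)

def naive_index(root):
    """Map each keyword to the paths of ALL elements whose subtree contains it."""
    index = defaultdict(set)

    def visit(elem, path):
        words = set((elem.text or "").lower().split())
        for i, child in enumerate(elem):
            words |= visit(child, path + (i,))
            words |= set((child.tail or "").lower().split())
        for w in words:
            index[w].add(path)
        return words

    visit(root, (0,))
    return index

def naive_search(index, keywords):
    """Elements that contain every keyword somewhere in their subtree."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []
```

For the query "XQL Ricardo" the search returns the paper element plus its proceedings and workshop ancestors; only the first is the most specific answer, and the single occurrence of "XQL" already costs four index entries.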

57
Main Problem with Naïve Approach
  • Decouples representation of ancestors and
    descendants
  • Space overhead
  • Spurious query results
  • Dewey encoding for ids
  • General knowledge classification (1850s)
  • LDAP, Ordered XML, ...

58
Dewey Encoding
0          <workshop>
0.0          date          28 July ...
0.1          <title>       XML and ...
0.2          <editors>     David Carmel ...
0.3          <proceedings>
0.3.0          <paper>
0.3.0.0          <title>   XQL and ...
0.3.0.1          <author>  Ricardo ...
0.3.1          <paper>
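
A Dewey ID is the path of sibling positions from the root, so an element's ID always has its ancestors' IDs as prefixes. The following sketch assigns such IDs to a small document. Numbering attributes before child elements is an assumption chosen so that the output matches the numbering in the figure; the sample document (with the `id` attribute of `paper` left out) and the function name are also illustrative.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<workshop date="28 July 2000">'
    "<title>XML and Information Retrieval</title>"
    "<editors>David Carmel</editors>"
    "<proceedings>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper>"
    "<paper/>"
    "</proceedings>"
    "</workshop>"
)

def dewey_ids(root):
    """Return (dewey_id, label) pairs in document order."""
    out = []

    def visit(elem, path):
        out.append((".".join(map(str, path)), "<%s>" % elem.tag))
        n = 0
        for name in elem.attrib:          # attributes are numbered first
            out.append((".".join(map(str, path + (n,))), name))
            n += 1
        for child in elem:                # then child elements, in order
            visit(child, path + (n,))
            n += 1

    visit(root, (0,))
    return out
```

Because ancestry is encoded in the ID itself, an index only needs to store the ID of the element that directly contains a keyword; the ancestors can be recovered as prefixes.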
59
Dewey Inverted List (DIL)
Keyword   Dewey Id    Standing  Position List
XQL       5.0.3.0.0   85        32
          5.0.3.8.3   38        89, 91
          ...
Ricardo   5.0.3.0.1   82        38
          8.2.1.4.2   99        52
          ...
(each keyword's list is sorted by Dewey Id)
60
Query Processing
  • Can answer XML keyword search queries in single
    pass over DIL
  • Space savings
  • Time savings (smaller inverted lists)
  • Ranking refinements
  • Ranked Dewey Inverted List (RDIL)
  • Hybrid Dewey Inverted List (HDIL)
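
The single-pass idea can be sketched with a stack that mirrors the current Dewey path: postings from all keyword lists are merged in Dewey order, and whenever the path changes, finished elements are popped and checked. This is a simplified sketch, not the XRANK implementation: standings and position lists are ignored, the function name is hypothetical, and the rule that an element containing all keywords is reported without passing its keywords up to its ancestors is my reading of "most specific element".

```python
import heapq

def dil_search(lists):
    """lists: dict keyword -> list of Dewey IDs (tuples), each sorted.

    Returns the Dewey IDs of the most specific elements that contain
    every keyword, found in one merged pass over the lists.
    """
    keywords = set(lists)
    merged = heapq.merge(*[[(d, k) for d in ids] for k, ids in lists.items()])
    stack = []      # entries are [path component, keywords seen below it]
    results = []

    def pop():
        comp, seen = stack.pop()
        path = tuple(c for c, _ in stack) + (comp,)
        if seen == keywords:
            results.append(path)          # report; do not propagate upward
        elif stack:
            stack[-1][1] |= seen          # pass partial matches to the parent

    for dewey, kw in merged:
        # Length of the common prefix between the stack and this posting.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:
            pop()
        for comp in dewey[lcp:]:
            stack.append([comp, set()])
        stack[-1][1].add(kw)
    while stack:
        pop()
    return results

# Dewey IDs taken from the Dewey Inverted List example.
EXAMPLE = {
    "xql": [(5, 0, 3, 0, 0), (5, 0, 3, 8, 3)],
    "ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
}
```

On the example lists, the only element containing both keywords is 5.0.3.0, the common parent of 5.0.3.0.0 and 5.0.3.0.1; its ancestors are not reported, which avoids the spurious results of the naïve approach.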

61
Outline
  • XML for data exchange
  • XML for structured and unstructured data
  • Conclusion

62
Conclusion
  • Two main uses of XML
  • Data exchange
  • Managing structured and unstructured data
  • Each of these gives rise to exciting data
    management opportunities
  • Pursuing these actively at Cornell
  • Also started a project on Internet data
    management
  • http://www.cs.cornell.edu/people/jai