Efficient Discovery of XML Data Redundancies

1
Efficient Discovery of XML Data Redundancies
  • Cong Yu and H. V. Jagadish
  • University of Michigan, Ann Arbor
  • -
  • VLDB 2006, Seoul, Korea
  • September 12th, 2006

2
Talk Outline
  • Motivating Example
  • A Comprehensive Notion of XML FD
  • XML Redundancy Discovery Algorithms
  • Experimental Evaluation
  • Conclusion

3
An Example XML Document
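[Figure: tree of the example XML document. The root warehouse has three state children. One state holds two stores, Borders and Amazon, and another state holds a second Borders store. Each store has a name and one book. Every book has ISBN 269 and title DB; the two Borders books have authors (au) R.R. and J.G. and price 59.9, while the Amazon book has price 51.1.]
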
4
Constraints on XML Data
  • An example constraint
  • For any two books, if they have the same ISBN,
    then they have the same title.
  • Similar to Equality Generating Dependencies
    (EGDs) [BV84] and Nested EGDs [YP04]

Target: book
Condition element(s): ISBN
Implication element(s): title

5
Data Redundancies
  • E.g., title is redundantly stored
  • Result of non-optimal design of the database
    schema in the presence of constraints
  • Lead to update anomalies and increased cost for
    data transfer and manipulation
  • Constraints are properties of the data and may
    not be known at the design phase

6
  • Goal
  • Efficiently Discover Redundancies From the XML
    Database By Discovering Satisfied Constraints

7
Main Contributions
  • A comprehensive notion of XML FD
  • Capturing a semantically richer set of XML
    constraints
  • Definition of XML data redundancy in terms of XML
    FDs and XML Keys
  • Efficient algorithms for discovering FDs and data
    redundancies from an XML database
  • Experimental Evaluation

8
Talk Outline
  • Motivating Example
  • A Comprehensive Notion of XML FD
  • XML Redundancy Discovery Algorithms
  • Experimental Evaluation
  • Conclusion

9
Backup slide: Example XML Constraints
  • Regular: condition and implication elements are
    children of the target


[Figure: the example XML document tree from slide 3.]

10
Example XML Constraints
  • Hierarchical: condition and/or implication
    elements can come from multiple hierarchies


[Figure: the example XML document tree from slide 3.]

11
Example XML Constraints, Cont'd
  • Set elements: condition and/or implication
    elements can involve set elements


[Figure: the example XML document tree from slide 3.]

12
Functional Dependencies (FDs)
  • FDs are used to describe constraints in
    relational databases
  • A similar notion of FD is needed for XML
  • Challenges
  • Target is difficult to specify due to the
    hierarchical structure
  • Set elements introduce new semantics

XML FDs need richer semantics!

13
Previous Notions
  • Path Based Notion [LLL02, VLL04]
  • Example: /warehouse/state/store/book/ISBN →
    /warehouse/state/store/book/title
  • Format: LHS → RHS
  • Semantics: for any two RHS nodes, same
    (associated) LHS indicates same RHS
  • Tree Tuple Based Notion [AL04]
  • A tree tuple is a data tree with exactly one
    data node for each schema element
  • Format: LHS → RHS
  • Semantics: for any two tree tuples, same LHS
    indicates same RHS

14
Previous Notions, Cont'd
  • Both capture hierarchical constraints
  • Neither can capture set constraints
  • /store/book/ISBN → /store/book/au
  • Violated in previous notions
  • Satisfied if the two au nodes are a single set
  • /store/book/title, /store/book/au →
    /store/book/ISBN
  • Undefined in previous notions
  • Intuitive if the au nodes are a single set

[Figure: a single store subtree. The store (name Borders) has one book with ISBN 269, title DB, price 59.9, and two au nodes, R.R. and J.G.]
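
The difference between comparing au nodes one at a time and comparing them as a single set can be sketched in a few lines. The data mirrors the two Borders books of the example document; the helper name `fd_holds` and the pair encoding are assumptions made for illustration and are not the formalism of the cited notions.

```python
from itertools import combinations

# Two books from the example document: same ISBN, same two authors.
books = [
    {"ISBN": "269", "au": ["R.R.", "J.G."]},
    {"ISBN": "269", "au": ["R.R.", "J.G."]},
]

def fd_holds(pairs):
    """True if equal LHS values always come with equal RHS values."""
    return all(r1 == r2
               for (l1, r1), (l2, r2) in combinations(pairs, 2)
               if l1 == l2)

# Node-at-a-time reading: every au node is compared on its own.
per_node = [(b["ISBN"], au) for b in books for au in b["au"]]

# Set reading: all au nodes under one book form a single value.
as_set = [(b["ISBN"], frozenset(b["au"])) for b in books]

print("ISBN -> au, per node:", fd_holds(per_node))
print("ISBN -> au, as a set:", fd_holds(as_set))
```

Under the per-node reading the dependency is violated because R.R. and J.G. share an ISBN; under the set reading both books carry the same author set, so it is satisfied.
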
15
A New Comprehensive Notion
  • Generalized Tree Tuple
  • A data tree constructed around a pivot data node
    (np)
  • Entire subtree rooted at np is kept
  • All ancestors of np and their attributes are
    kept
  • Tuple Class CP
  • The set of all generalized tree tuples, whose
    pivot nodes share the same path P (called pivot
    path)
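
A minimal sketch of how a generalized tree tuple might be built with Python's standard `xml.etree.ElementTree`. The document below is a cut-down version of the example, and the function name `generalized_tree_tuple` is hypothetical. One loud assumption: text-only children of an ancestor (such as a store's name) are treated as that ancestor's attributes and are kept; the slides do not spell out this detail.

```python
import copy
import xml.etree.ElementTree as ET

DOC = """<warehouse>
  <state>
    <store><name>Borders</name>
      <book><ISBN>269</ISBN><title>DB</title>
            <au>R.R.</au><au>J.G.</au><price>59.9</price></book>
    </store>
    <store><name>Amazon</name>
      <book><ISBN>269</ISBN><title>DB</title><price>51.1</price></book>
    </store>
  </state>
</warehouse>"""

def generalized_tree_tuple(root, pivot):
    """Copy the pivot's subtree together with its chain of ancestors."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Ancestors of the pivot, innermost first.
    chain = []
    node = pivot
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)

    result = copy.deepcopy(pivot)          # entire subtree rooted at the pivot
    for ancestor in chain:                 # wrap it in one ancestor at a time
        shell = ET.Element(ancestor.tag, dict(ancestor.attrib))
        for child in ancestor:
            # Assumption: text-only children count as ancestor attributes.
            if len(child) == 0 and child is not pivot:
                shell.append(copy.deepcopy(child))
        shell.append(result)
        result = shell
    return result

root = ET.fromstring(DOC)
# Tuple class for the pivot path /warehouse/state/store/book.
tuple_class = [generalized_tree_tuple(root, b)
               for b in root.findall("./state/store/book")]
for t in tuple_class:
    print(t.findtext("./state/store/name"),
          t.findtext("./state/store/book/price"))
```

Each resulting tree contains exactly one book, its full subtree, and a single store/state/warehouse chain above it.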

16
Example Generalized Tree Tuple
[Figure: the example XML document tree from slide 3, with one node labeled Pivot.]

17
Example Generalized Tree Tuple
[Figure: the example XML document tree from slide 3, with a different node labeled Pivot.]

18
XML FD
  • LHS → RHS w.r.t. CP
  • Semantics
  • For any two generalized tree tuples t1, t2 in CP,
    if they share the same LHS, they have the same
    RHS.
  • E.g., ./title, ./au → ./ISBN, w.r.t.
    C/warehouse/state/store/book
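
The semantics above can be checked pairwise, as in this rough sketch. Each generalized tree tuple of the book tuple class is flattened into a dictionary keyed by relative path, and repeatable elements are stored as a `frozenset` so that they compare as sets. The flattening, the name `satisfies`, and the choice to represent a book without authors as an empty set are assumptions for illustration.

```python
from itertools import combinations

# One dict per generalized tree tuple in the book tuple class.
C_book = [
    {"../name": "Borders", "./ISBN": "269", "./title": "DB",
     "./price": "59.9", "./au": frozenset({"R.R.", "J.G."})},
    {"../name": "Amazon", "./ISBN": "269", "./title": "DB",
     "./price": "51.1", "./au": frozenset()},
    {"../name": "Borders", "./ISBN": "269", "./title": "DB",
     "./price": "59.9", "./au": frozenset({"R.R.", "J.G."})},
]

def satisfies(tuples, lhs, rhs):
    """LHS -> RHS holds if no pair agrees on LHS but differs on RHS."""
    for t1, t2 in combinations(tuples, 2):
        same_lhs = all(t1[p] == t2[p] for p in lhs)
        same_rhs = all(t1[p] == t2[p] for p in rhs)
        if same_lhs and not same_rhs:
            return False
    return True

print(satisfies(C_book, ["./title", "./au"], ["./ISBN"]))
print(satisfies(C_book, ["./ISBN"], ["./price"]))
print(satisfies(C_book, ["../name", "./ISBN"], ["./price"]))
```

The pairwise check is quadratic in the number of tuples, which is acceptable for an illustration only.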

19
Repeatable Elements Are Special
[Figure: the example XML document tree from slide 3.]

20
Essential Tuple Classes
  • Definition
  • Tuple classes with pivot paths that correspond
    to repeatable schema elements
  • C/warehouse/state/store/book is essential
  • C/warehouse/state/store/name is not
  • Essential tuple classes can express the XML FDs
    that are expressible with non-essential tuple
    classes
  • See paper for detailed proof

21
Backup slide: Structurally Redundant XML FDs
  • Definition: FDs where none of the paths in LHS
    and RHS is a descendant of the pivot path
  • Their satisfaction on a data tree is mirrored by
    other FDs
  • I.e., they are satisfied if and only if some
    other FD is satisfied
  • See paper for detailed explanation

22
Backup slide: Interesting XML FD
  • RHS is not contained in LHS
  • CP is an essential tuple class
  • RHS is a descendant of the pivot node
  • See paper for details

23
XML Key and Data Redundancy
  • Let attribute @key uniquely identify each node in
    the entire data tree
  • LHS is an XML Key w.r.t. CP when the database
    satisfies the XML FD LHS → ./@key w.r.t. CP
  • Similar to the relative key notion proposed in
    [BDF01]
  • Data redundancy exists if the database
  • Satisfies the XML FD LHS → RHS w.r.t. CP,
  • But LHS is not an XML Key w.r.t. CP
  • ⇒ RHS is redundantly stored.
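
The redundancy test above reduces to two FD checks, sketched here on the three books of the example. The flattened tuple dictionaries and the function names `satisfies`, `is_key`, and `is_redundant` are hypothetical; only the definition itself comes from the slide.

```python
from itertools import combinations

# Book tuples from the example, each carrying its unique @key.
books = [
    {"@key": 6,  "../name": "Borders", "./ISBN": "269",
     "./title": "DB", "./price": "59.9"},
    {"@key": 13, "../name": "Amazon",  "./ISBN": "269",
     "./title": "DB", "./price": "51.1"},
    {"@key": 20, "../name": "Borders", "./ISBN": "269",
     "./title": "DB", "./price": "59.9"},
]

def satisfies(tuples, lhs, rhs):
    """LHS -> RHS holds if no pair agrees on LHS but differs on RHS."""
    return not any(
        all(t1[p] == t2[p] for p in lhs) and any(t1[p] != t2[p] for p in rhs)
        for t1, t2 in combinations(tuples, 2)
    )

def is_key(tuples, lhs):
    """LHS is a key when it determines the node identifier @key."""
    return satisfies(tuples, lhs, ["@key"])

def is_redundant(tuples, lhs, rhs):
    """RHS is redundantly stored: the FD holds but LHS is not a key."""
    return satisfies(tuples, lhs, rhs) and not is_key(tuples, lhs)

print(is_redundant(books, ["./ISBN"], ["./title"]))
print(is_redundant(books, ["../name", "./ISBN"], ["./price"]))
```

Here ISBN determines title without identifying a single book, so title is stored redundantly.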

24
Talk Outline
  • Motivating Example
  • A Comprehensive Notion of XML FD
  • XML Redundancy Discovery Algorithms
  • Experimental Evaluation
  • Conclusion

25
Strategy
  • Discover satisfied XML FDs and Keys
  • Data redundancies can then be discovered based on
    the definition
  • First, we need an efficient representation of the
    XML data

26
Hierarchical Representation of XML Data
  • Each essential tuple class → a relation
  • Similar to nested relations [OY87, MNE96]
  • All relations together form a hierarchy
  • Tree tuples can be reconstructed by joining @key
    with parent

R_state
@key  parent
2     root
3     root
18    root
...

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.
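
To illustrate the join between @key and parent, the four relations can be held as plain dictionaries and a book's tree tuple rebuilt by following parent pointers. This is only a sketch: the dictionary layout and the function `book_tuple` are assumptions, and the rows are the ones listed for R_state, R_store, R_book, and R_au.

```python
# Relations keyed by @key; every row stores the @key of its parent.
R_state = {2: {"parent": "root"}, 3: {"parent": "root"}, 18: {"parent": "root"}}
R_store = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book = {
    6:  {"parent": 4,  "ISBN": "269", "title": "DB", "price": "59.9"},
    13: {"parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    20: {"parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
}
R_au = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}

def book_tuple(book_key):
    """Rebuild one book tuple by joining upwards and downwards."""
    book = R_book[book_key]
    store_key = book["parent"]
    store = R_store[store_key]            # join R_book.parent = R_store.@key
    state_key = store["parent"]
    state = R_state[state_key]            # join R_store.parent = R_state.@key
    authors = frozenset(row["@text"] for row in R_au.values()
                        if row["parent"] == book_key)  # child join on R_au
    return {
        "state": state_key, "state_parent": state["parent"],
        "store": store_key, "../name": store["name"],
        "./ISBN": book["ISBN"], "./title": book["title"],
        "./price": book["price"], "./au": authors,
    }

print(book_tuple(20))
```
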
27
Intra-Relation FDs
  • ./ISBN → ./title, w.r.t.
    C/warehouse/state/store/book


[Figure: the example XML document tree from slide 3.]

28
Inter-Relation FDs
  • ../name, ./ISBN → ./price, w.r.t.
    C/warehouse/state/store/book


[Figure: the example XML document tree from slide 3, with annotations "Present in R_store" and "Present in R_book".]

29
Overview of the Discovery Process
  • Only interested in minimal FDs
  • Bottom-Up
  • At each relation
  • Discover intra-relation FDs and Keys
  • Discover inter-relation FDs and Keys involving
    descendant relations
  • Generate candidate inter-relation FDs and Keys
    for examination at the parent level
  • Attribute Partition as the basic data structure

30
Attribute Partition
  • Groups tuples according to the attribute value
  • Π_price for Cbook: {{t6, t20}, {t13}}
  • Π_@key for Cbook: {{t6}, {t20}, {t13}}
  • Π_{price, @key} for Cbook: {{t6}, {t20}, {t13}}
  • FD LHS → RHS w.r.t. CP is satisfied iff
    Π_{LHS ∪ RHS} = Π_LHS

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
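
The partition test can be tried directly on R_book. In this sketch a partition is a set of frozensets of @key values, and the functions `partition` and `fd_holds` are illustrative names rather than anything taken from the paper.

```python
# R_book keyed by @key.
R_book = {
    6:  {"parent": 4,  "ISBN": "269", "title": "DB", "price": "59.9"},
    13: {"parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    20: {"parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
}

def partition(relation, attrs):
    """Group tuple keys by their values on the given attributes."""
    groups = {}
    for key, row in relation.items():
        value = tuple(key if a == "@key" else row[a] for a in attrs)
        groups.setdefault(value, set()).add(key)
    return {frozenset(g) for g in groups.values()}

def fd_holds(relation, lhs, rhs):
    """LHS -> RHS holds iff adding RHS does not split any LHS group."""
    return partition(relation, lhs + rhs) == partition(relation, lhs)

print(partition(R_book, ["price"]))
print(fd_holds(R_book, ["ISBN"], ["title"]))
print(fd_holds(R_book, ["ISBN"], ["price"]))
```

Comparing partitions replaces the pairwise tuple comparison with one grouping pass per attribute set.
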
31
Set Attribute Partition
  • Generated through refinement
  • Step 1: initialize Π_au for R_book to be
    {{t6, t13, t20}}
  • Step 2: compute Π_@text for R_au,
    {{t10, t24}, {t11, t25}}, and convert to parent:
    {{t6, t20}, {t6, t20}}
  • Step 3: refine Π_au using the partitions in
    Π_@text, giving Π_au for R_book =
    {{t6, t20}, {t13}}
  • Π_au can then be used as a normal partition

R_au
@key  parent  @text
10    6       R.R.
11    6       J.G.
24    20      R.R.
25    20      J.G.

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9
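
The refinement can be sketched as follows, using the R_au rows and the three book keys from the example. The helper `refine` and the variable names are assumptions; the sketch also assumes plain set semantics, so duplicate author values under one book are not distinguished.

```python
R_au = {
    10: {"parent": 6,  "@text": "R.R."},
    11: {"parent": 6,  "@text": "J.G."},
    24: {"parent": 20, "@text": "R.R."},
    25: {"parent": 20, "@text": "J.G."},
}
book_keys = {6, 13, 20}

# Start with every book in one block.
pi_au = [frozenset(book_keys)]

# Partition R_au on @text, then map each au key to its parent book.
pi_text = {}
for key, row in R_au.items():
    pi_text.setdefault(row["@text"], set()).add(key)
converted = [frozenset(R_au[k]["parent"] for k in block)
             for block in pi_text.values()]

def refine(partition, group):
    """Split every block into the part inside and the part outside group."""
    refined = []
    for block in partition:
        for piece in (block & group, block - group):
            if piece:
                refined.append(piece)
    return refined

# Refine the book partition once per author value.
for group in converted:
    pi_au = refine(pi_au, group)

print(pi_au)
```

Two books end up in the same block exactly when they contain the same author values, which is the set equality the FD semantics needs.
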
32
Discovery Algorithms
  • DiscoverFD
  • Discover intra-relation FDs and Keys
  • Similar to existing relational algorithms
  • DiscoverXFD
  • Discover inter-relation FDs and Keys
  • Key component
  • Candidate inter-relation XML FD generation

33
Generating Candidate Inter-Relation FDs
  • Let P' be a parent relation of P
  • Parent satisfaction property
  • For LHS ∪ X → RHS w.r.t. CP to hold for any
    attribute set X in relation P', LHS ∪ {./parent} →
    RHS w.r.t. CP must hold
  • Child implication property
  • For LHS ∪ X → RHS w.r.t. CP to be a non-trivial FD
    for any attribute set X in relation P', LHS → RHS
    w.r.t. CP must not hold
  • An FD is a candidate inter-relation FD if it
    satisfies both properties
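
Both properties can be tested inside the child relation alone, since the parent column stands in for every attribute of the parent relation. The following sketch does this for R_book; the function names are illustrative assumptions.

```python
R_book = {
    6:  {"parent": 4,  "ISBN": "269", "title": "DB", "price": "59.9"},
    13: {"parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    20: {"parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
}

def partition(relation, attrs):
    """Group tuple keys by their values on the given attributes."""
    groups = {}
    for key, row in relation.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(key)
    return {frozenset(g) for g in groups.values()}

def fd_holds(relation, lhs, rhs):
    return partition(relation, lhs + rhs) == partition(relation, lhs)

def is_candidate(relation, lhs, rhs):
    # Parent satisfaction: the parent key is at least as selective as any
    # parent attribute set X, so if it cannot help, no X can.
    parent_satisfaction = fd_holds(relation, lhs + ["parent"], rhs)
    # Child implication: if LHS -> RHS already holds, adding X is pointless.
    child_implication = not fd_holds(relation, lhs, rhs)
    return parent_satisfaction and child_implication

print(is_candidate(R_book, ["ISBN"], ["price"]))
print(is_candidate(R_book, ["ISBN"], ["title"]))
```

On this data ./ISBN → ./price is a candidate to be completed with parent attributes, while ./ISBN → ./title is not because it already holds within R_book.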

34
Backup slide: Generating Partition Target
  • Example candidate FD
  • ./ISBN → ./price w.r.t. Cbook
  • We associate each FD with a Partition Target
    (PT)
  • Specifying inequalities that parent attribute
    partitions must satisfy

R_book
@key  parent  ISBN  title  price
6     4       269   DB     59.9
13    12      269   DB     51.1
20    19      269   DB     59.9

Π_ISBN: {{t6, t13, t20}}
Π_price: {{t6, t20}, {t13}}
PT: t4 ≠ t12, t19 ≠ t12

35
Backup slide: Checking Partition Target
  • Candidate FD
  • ./ISBN → ./price w.r.t. Cbook
  • We check each parent attribute partition against
    the PT to discover inter-relation FDs
  • We use various techniques to compactly represent
    PT
  • See analysis in paper

R_store
@key  parent  name
4     3       Borders
12    3       Amazon
19    18      Borders

PT: t4 ≠ t12, t19 ≠ t12
Π_name: {{t4, t19}, {t12}}
../name → ./price w.r.t. Cbook
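
A partition target can be pictured as a set of parent-key pairs that must land in different groups. The sketch below derives the PT for the candidate ./ISBN → ./price from R_book and checks it against attributes of R_store. The uncompressed pair representation and the function names are assumptions; the slides mention more compact representations that are not reproduced here.

```python
from itertools import combinations

R_book = {
    6:  {"parent": 4,  "ISBN": "269", "title": "DB", "price": "59.9"},
    13: {"parent": 12, "ISBN": "269", "title": "DB", "price": "51.1"},
    20: {"parent": 19, "ISBN": "269", "title": "DB", "price": "59.9"},
}
R_store = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}

def partition_target(relation, lhs, rhs):
    """Parent-key pairs that a parent attribute must tell apart."""
    target = set()
    for r1, r2 in combinations(relation.values(), 2):
        same_lhs = all(r1[a] == r2[a] for a in lhs)
        same_rhs = all(r1[a] == r2[a] for a in rhs)
        if same_lhs and not same_rhs:
            target.add(frozenset({r1["parent"], r2["parent"]}))
    return target

def meets_target(parent_relation, attr, target):
    """True if attr gives different values to both members of every pair."""
    return all(len({parent_relation[k][attr] for k in pair}) == 2
               for pair in target)

pt = partition_target(R_book, ["ISBN"], ["price"])
print(pt)
print(meets_target(R_store, "name", pt))
print(meets_target(R_store, "parent", pt))
```

The store name separates store 12 from stores 4 and 19, so it satisfies the PT; the store's own parent (the state) does not, because stores 4 and 12 share a state.
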
36
Talk Outline
  • Motivating Example
  • A Comprehensive Notion of XML FD
  • XML Redundancy Discovery Algorithms
  • Experimental Evaluation
  • Conclusion

37
Real Datasets
  • DBLP contains a fair amount of redundancy, as
    noted earlier in [AL04] as well
  • 10% redundancies in PIR (measured as # of
    redundant elements over total # of elements);
    schema modification reported to PIR

38
Scalability on XMark
  • Linear in terms of scale factor (# of elements)
    even though exponential in theory
  • Orders of magnitude faster than direct
    application of a state-of-the-art relational
    discovery algorithm
  • The latter takes over 3 hours to run on XMark
    scale factor 1

39
Related Work
  • XML Integrity Constraints (FDs and Keys)
  • [BDF01, LLL02, FS03]
  • XML Normal Form
  • [AL04, VLL04]
  • Nested Relation Normal Form
  • [OY87, MNE96]
  • Relational FD discovery
  • FUN, Dep-Miner, TANE, fdep, FastFDs

40
Backup slide: GORDIAN
  • Both use extensive pruning strategies based on
    the properties of FDs
  • E.g., singleton pruning is adopted in both
  • GORDIAN is more aggressive since it only looks
    for keys
  • Our algorithm is more comprehensive: it discovers
    satisfied FDs in addition to keys

41
Conclusion
  • A comprehensive notion of XML FDs and Keys,
    capturing set semantics
  • A system for detecting XML data redundancies
    through the discovery of FDs and Keys
  • The system is practical for real datasets and
    out-performs direct application of the best
    available relational algorithm by orders of
    magnitude.

42
Questions ?