1
Schema Summarization
  • Cong Yu and H. V. Jagadish
  • University of Michigan, Ann Arbor
  • -
  • VLDB 2006, Seoul, Korea
  • September 13th, 2006

2
Many Databases Are Complex
Number of elements: tables and columns (relational);
elements and attributes (XML)
3
Reactome Schema
4
What's the Problem?
  • Why are complex schemas difficult to deal with?
  • For data integration administrators (DIAs):
    difficult to grasp the major topics of a complex
    schema
  • For ordinary users: difficult to identify the
    small subset of relevant schema elements
  • Can we avoid them?
  • Probably not: scientific databases are in fact
    getting more and more complex; MiMI is an example

5
Existing Approaches
  • Ignore the schema: keyword-based search over
    relational and XML databases
  • Guess the schema: Schema-Free XQuery, FleXPath,
    etc.
  • Limitation: these approaches provide imprecise
    (and sometimes incorrect) answers
  • Limitation: they offer no help in understanding
    the schema (and the database) itself

6
Our Approach
  • Summarize the schema
  • Represent the original complex schema with a
    simpler schema, i.e., a summary of the original
    schema
  • Help users explore the schema via the summary
  • Illustrates the main topics of the database
  • Filters away irrelevant parts of the schema

Challenge: how to create a good summary?
7
Talk Outline
  • Motivation
  • Background Definitions
  • Desiderata of Schema Summary
  • Efficient Schema Summarization
  • Evaluation
  • Conclusion and Related Work

8
Schema
  • A labeled, directed graph
  • Nodes
  • Relational: table and column
  • Hierarchical: element and attribute
  • Links
  • Structural links: parent/child constraints
  • Value links: inclusion constraints (key / foreign
    key)

[Figure: example schema graph with the elements
warehouse, state, store, book, authors, author,
contact, isbn, price, title, @id, @name, @address]
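
The graph definition above can be sketched in a few lines of Python. This is only an illustration: the `SchemaGraph` class is a hypothetical helper, and the parent/child layout and the single value link are guesses based on the element names in the example figure, not the figure's actual edges.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Labeled, directed schema graph (hypothetical helper, not from the talk)."""
    nodes: set = field(default_factory=set)
    # Each link is (source, target, kind), kind in {"structural", "value"}.
    links: list = field(default_factory=list)

    def add_link(self, source, target, kind):
        if kind not in ("structural", "value"):
            raise ValueError(f"unknown link kind: {kind}")
        self.nodes.update((source, target))
        self.links.append((source, target, kind))

    def neighbors(self, node):
        """Elements directly connected to `node`, ignoring link direction."""
        found = set()
        for source, target, _ in self.links:
            if source == node:
                found.add(target)
            elif target == node:
                found.add(source)
        return found


# Assumed layout of part of the warehouse example (illustrative only).
g = SchemaGraph()
for parent, child in [
    ("warehouse", "state"), ("state", "store"), ("store", "book"),
    ("book", "isbn"), ("book", "title"), ("book", "price"),
    ("warehouse", "authors"), ("authors", "author"),
]:
    g.add_link(parent, child, "structural")

# Assumed value link: a book refers to an author through a key.
g.add_link("book", "author", "value")
```

Storing the link kind on each edge keeps structural and value links in one graph, so later computations can walk both.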
9
Schema Summary
  • A schema itself, but
  • With fewer elements, and therefore simpler
  • Contains abstract elements and links
  • Abstract element: represents a group of original
    elements
  • Abstract link: connects at least one abstract
    element

[Figure: example summary containing the elements
warehouse, author, and book]
11
What Makes a Good Schema Summary?
[Figure: the full warehouse schema beside two
candidate summaries, one with warehouse, store, and
book, the other with warehouse, author, and book]
  • Which one should be the summary?

12
What Information Do We Need?
  • A schema summary is not only a summary of the
    schema, but in fact also a summary of the
    database!

schema structure and data distribution
13
Desired Properties of Schema Summary
  • Small enough (in terms of number of elements) to
    comprehend: Summary Complexity
  • Show elements in which users are more likely to
    be interested: Summary Importance
  • Show elements that represent the entire database
    well: Summary Coverage
  • Importance and coverage calculations need to
    consider both schema structure and data
    distribution

14
Intuition Behind Importance
  • Not all schema elements are created equal!
  • First observation: more links, more important
    (schema structure)
  • Second observation: more popular, more important
    (data distribution)

[Figure: the example warehouse schema]
15
Compute Summary Importance
  • Schema Element Importance
  • W: Neighbor Weight, the percentage of e_j's
    information that flows into e, estimated using
    relative cardinalities
  • Summary Importance
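
The formulas on this slide did not survive in the transcript, so the following is only a sketch of one way to get steady-state importance values from neighbor weights, in the PageRank-like spirit the talk mentions later. The update rule, the retention parameter `p`, the function names, and the definition of summary importance as a share of total importance are all assumptions, not the authors' stated formulas.

```python
def element_importance(weights, initial, p=0.5, tol=1e-9, max_iter=1000):
    """Iterate importance values until they stop changing.

    weights[(src, dst)] is the share of src's information flowing into dst;
    the shares leaving each src are assumed to sum to 1.
    initial maps each element to a starting value (e.g. its cardinality).
    p is an assumed fraction of importance each element keeps per round.
    """
    importance = dict(initial)
    for _ in range(max_iter):
        updated = {}
        for e in importance:
            inflow = sum(
                w * importance[src]
                for (src, dst), w in weights.items()
                if dst == e
            )
            updated[e] = p * importance[e] + (1 - p) * inflow
        delta = max(abs(updated[e] - importance[e]) for e in importance)
        importance = updated
        if delta < tol:
            break
    return importance


def summary_importance(summary, importance):
    """Assumed definition: share of total importance held by the summary."""
    return sum(importance[e] for e in summary) / sum(importance.values())


# Toy chain a - b - c with made-up weights and starting values.
toy_weights = {
    ("a", "b"): 1.0,
    ("b", "a"): 0.5, ("b", "c"): 0.5,
    ("c", "b"): 1.0,
}
toy_importance = element_importance(toy_weights, {"a": 1.0, "b": 1.0, "c": 2.0})
```

In the toy chain the middle element ends up with half of the total importance, which matches the "more links, more important" intuition.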

16
Backup Slide: Neighbor Weight
  • Reflects the relative importance of an element
    to an element it directly connects to, among
    that element's neighbors
  • Captures both element connectivity and data
    cardinality

[Figure: the example warehouse schema with links
annotated by relative cardinalities (5/1, 2/1, 150/1,
100/1, 3/2, 3/1)]
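
As a rough illustration of turning relative cardinalities into neighbor weights, the sketch below normalizes each element's outgoing relative cardinalities so they sum to 1. The normalization, the `neighbor_weights` name, and the sample numbers are assumptions; the slide gives only the intuition.

```python
def neighbor_weights(rel_card):
    """Turn relative cardinalities into per-element weight shares.

    rel_card[e][n] is the (assumed) relative cardinality on the link
    from element e to its neighbor n. The result maps (e, n) to the
    fraction of e's information that flows to n.
    """
    weights = {}
    for element, neighbors in rel_card.items():
        total = sum(neighbors.values())
        for neighbor, cardinality in neighbors.items():
            weights[(element, neighbor)] = cardinality / total
    return weights


# Made-up numbers: a store links to many books and few contacts.
sample = neighbor_weights({"store": {"book": 6.0, "contact": 2.0}})
```

With these numbers, `book` receives three quarters of the weight leaving `store`, so connectivity and cardinality both shape the result.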
17
Intuition Behind Coverage
  • Important does not imply inclusion in the summary
  • Elements can be too close to each other
  • Two basic notions
  • Element Affinity
  • Element Coverage

[Figure: the example warehouse schema]
18
Intuition Behind Coverage, cont'd
  • Element Affinity
  • Fewer hops, higher affinity
  • Higher relative cardinality, lower affinity
  • Element Coverage: depends on element affinity and
    neighbor weight

[Figure: the example warehouse schema]
19
Compute Summary Coverage
  • Schema element affinity from e_a to e_b
  • Schema element coverage of e_b by e_a
  • Summary Coverage
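
The affinity and coverage formulas are not in the transcript. The sketch below shows one way to satisfy the two stated intuitions (fewer hops and lower relative cardinality both raise affinity): each hop multiplies affinity by `hop_decay / relative_cardinality`, and the best path wins. The per-hop factor, the `hop_decay` parameter, and the function name are assumptions, not the paper's definition.

```python
import heapq


def affinities_from(rel_card, source, hop_decay=0.5):
    """Best-path affinity from `source` to every reachable element.

    rel_card[u][v] is the relative cardinality on the hop u -> v.
    Each hop scales affinity by hop_decay / max(rel_card[u][v], 1), so
    longer paths and higher cardinalities both lower the result.
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated affinity
    while heap:
        negated, u = heapq.heappop(heap)
        current = -negated
        if current < best.get(u, 0.0):
            continue  # stale heap entry
        for v, cardinality in rel_card.get(u, {}).items():
            candidate = current * hop_decay / max(cardinality, 1.0)
            if candidate > best.get(v, 0.0):
                best[v] = candidate
                heapq.heappush(heap, (-candidate, v))
    return best


# Made-up cardinalities for a fragment of the warehouse example.
fragment = {
    "store": {"book": 100.0, "contact": 1.0},
    "book": {"isbn": 1.0},
}
aff = affinities_from(fragment, "store")
```

Because every hop factor is below 1, the best-first search is valid in the same way Dijkstra's algorithm is for non-negative edge weights.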

20
What Makes a Good Schema Summary?
[Diagram: schema structure and data distribution
feed into summary importance and summary coverage]
22
Overview
  • Input: database, schema, and summary size K
  • (1) Annotating Schema Graph (computing
    statistics)
  • (2.1) Calculating Importance (algorithm
    MaxImportance): list L of elements sorted by
    importance
  • (2.2) Calculating Coverage (algorithm
    MaxCoverage): set of K elements with high
    coverage, set S of coverage domination pairs
  • (3) Determine K summary elements (algorithm
    BalanceSummary)
  • (4) Cluster original schema elements
  • Output: balanced summary of size K
23
Algorithm MaxImportance
  • MaxImportance generates a summary of a given size
    k, maximizing summary importance
  • Step 1: compute steady-state element importance
    values
  • Step 2: sort and pick the top-k important elements
  • Step 3: compute assignments of the remaining
    elements
  • Complexity: O(N^2 + N log N)
  • Convergence is proved in [MGR02].
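
The sort-and-assign part of MaxImportance might look like the sketch below, given precomputed importance values. How the remaining elements are assigned is not stated on the slide; assigning each one to the summary element with the highest affinity to it is an assumption, as are the function name and the sample data.

```python
def max_importance_summary(importance, affinity, k):
    """Pick the k most important elements and group the rest under them.

    importance maps element -> steady-state importance value.
    affinity maps (summary_element, element) -> affinity; missing pairs
    count as 0. Returns {summary_element: [elements it represents]}.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    summary = ranked[:k]
    groups = {s: [s] for s in summary}
    for element in ranked[k:]:
        # Assumed rule: attach to the summary element with best affinity.
        representative = max(
            summary, key=lambda s: affinity.get((s, element), 0.0)
        )
        groups[representative].append(element)
    return groups


# Made-up importance and affinity values.
groups = max_importance_summary(
    importance={"book": 5.0, "author": 4.0, "isbn": 1.0,
                "title": 1.0, "contact": 0.5},
    affinity={("book", "isbn"): 0.9, ("book", "title"): 0.8,
              ("author", "contact"): 0.7},
    k=2,
)
```

Each group then corresponds to one abstract element in the summary.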

24
Algorithm MaxCoverage
  • MaxCoverage generates a summary of a given size
    k, maximizing summary coverage in a heuristic way
  • Step 1: compute coverage dominance (bottom up
    with A/D pairs)
  • Step 2: eliminate elements that are dominated
  • Step 3: compute summary coverage for all element
    sets of size k
  • Step 4: pick the set with the highest coverage
  • Complexity: O(kN^2 + n^k)
  • See paper for details on coverage dominance

25
Backup Slide: Calculating K Elements with Maximum
Combined Coverage
  • Problem
  • The K highest-coverage elements are not
    necessarily the set of K elements with the
    highest combined coverage
  • Need to consider all K-element sets
  • A pruning strategy
  • Heuristic approach
  • Coverage Domination Pair: e_a dominates e_b iff,
    for any summary that contains e_b but not e_a, we
    can replace e_b with e_a and obtain a summary
    with higher coverage
  • We eliminate e_b from consideration
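
The exhaustive search with pruning described above can be sketched as follows. The combined-coverage measure used here (each element is credited with its best coverage by any summary element, averaged over all elements) is an assumption, as are the function names; the `dominated` argument stands in for the elements removed through coverage domination pairs.

```python
from itertools import combinations


def combined_coverage(summary, coverage, elements):
    """Assumed measure: average best coverage of each element."""
    def cover(s, e):
        return 1.0 if s == e else coverage.get((s, e), 0.0)
    return sum(max(cover(s, e) for s in summary) for e in elements) / len(elements)


def max_coverage_summary(elements, coverage, k, dominated=()):
    """Try every k-element set of non-dominated elements; keep the best."""
    skip = set(dominated)
    candidates = [e for e in elements if e not in skip]
    return max(
        combinations(candidates, k),
        key=lambda s: combined_coverage(s, coverage, elements),
    )


# Made-up pairwise coverage values.
elems = ["a", "b", "c", "d"]
cov = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "a"): 0.2}
best = max_coverage_summary(elems, cov, k=2)
pruned = max_coverage_summary(elems, cov, k=2, dominated=("a",))
```

Pruning shrinks the candidate list before enumeration, which is where the saving comes from, since the number of k-element sets grows quickly with the number of candidates.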

26
Backup Slide: Coverage Domination Pair
  • Conditions
  • Let e_j be the elements that e_b covers better
    than e_a does, and let e_c be the element that
    covers e_a best other than e_a itself
  • Let C_a be the total coverage e_a has for all e_j
  • Let C_b be the total coverage e_b has for all e_j

27
Generate Balanced Summary
  • No single optimal criterion exists for balancing
    the two desired properties
  • A heuristic approach
  • Pick elements in the order of their importance
  • Ignore elements that are dominated by elements
    already in the summary
  • Works well in practice
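
The heuristic above is simple enough to sketch directly. The function name, the representation of domination pairs as a set of `(dominator, dominated)` tuples, and the sample data are illustrative assumptions.

```python
def balance_summary(ranked, dominates, k):
    """Greedy selection in importance order, skipping dominated elements.

    ranked: elements sorted from most to least important.
    dominates: set of (dominator, dominated) coverage domination pairs.
    k: desired summary size.
    """
    summary = []
    for element in ranked:
        if len(summary) == k:
            break
        # Skip anything dominated by an element already chosen.
        if any((chosen, element) in dominates for chosen in summary):
            continue
        summary.append(element)
    return summary


# Made-up ranking and domination pairs.
picked = balance_summary(
    ranked=["book", "title", "author", "isbn"],
    dominates={("book", "title"), ("book", "isbn")},
    k=2,
)
```

Here `title` is skipped even though it ranks second, because `book` already covers it, so the second slot goes to `author`.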

29
Evaluation Strategies
  • Observation
  • Comparing automatic summaries with summaries
    generated by human experts
  • In general, automatic summaries agree well with
    human-generated ones (about 80%)
  • An objective evaluation framework
  • Models query behavior based on schema exploration
  • Query discovery cost: the number of extra
    elements visited in order to construct a correct
    query from a query intention

30
Query Discovery Cost Example
  • Query intention: retrieve the ISBN of all books
  • Query: for $b in doc()/state/store/book return
    $b/isbn

[Figure: two versions of the example warehouse
schema, labeled Cost 3 and Cost 5]
31
Data Sets
32
Summary Benefits
33
Contributions of Schema Structure and Data Distribution
34
Impact of Balancing Importance and Coverage
Percentages in parentheses show the reduction in
savings
36
Related Work
  • First study on summarizing schemas
  • Related to ER model abstraction
  • Limitations of ER model abstraction
  • Does not reflect the data distribution
  • ER models may not be available and may be
    out of date
  • For most database schemas, structure or value
    links are semantics-free, so ER model abstraction
    methods are ineffective (tagging those links
    involves a significant amount of manual effort)

37
Related Work, cont'd
  • Summary element importance calculation is
    partially inspired by PageRank
  • Summary element affinity calculation (used in
    summary coverage) is partially inspired by
    similar measurements in social network analysis

38
Conclusions and Contributions
  • Introduced concept of schema summary
  • Defined summary importance and summary coverage
    as desiderata of schema summary
  • Emphasized both schema structure and data
    distribution as essential features for importance
    and coverage calculation
  • Designed and implemented efficient schema
    summarization algorithms
  • Proposed an objective evaluation framework

39
Questions?