Flexible Queries over Semistructured Data - PowerPoint PPT Presentation

About This Presentation
Title:

Flexible Queries over Semistructured Data

Description:

Flexible Queries over Semistructured Data, by Yaron Kanza and Yehoshua Sagiv (The Hebrew University)

Slides: 37
Provided by: Yaro60


1
Flexible Queries over Semistructured Data
  • Yaron Kanza
  • Yehoshua Sagiv
  • The Hebrew University

2
Overview of the Talk
  • New semantics for queries over semistructured
    data
  • New results for
      • Query evaluation
      • Query equivalence
      • Database equivalence (databases could be
        equivalent even if they are not identical!)
      • Transforming a database into a tree

3
Why is it Difficult to Formulate Queries over
Semistructured Data?
  • It is difficult to design queries
  • Data does not conform to a rigid schema
  • The structure of the database changes frequently
  • Queries should be rewritten frequently
  • Data is contributed by many users in a variety
    of designs
  • The query should deal with different structures
    of data
  • The description of the schema is large (e.g., a
    DTD of XML)
  • It is difficult to use the schema for formulating
    queries

4
A University Scenario
University Website Database
5
Database
  • Following OEM, the database is represented as a
    rooted labeled directed graph
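
A rooted labeled directed graph of this kind can be sketched in a few lines of Python. The class name LabeledGraph, its methods, and the small example fragment are assumptions made for illustration; they are not taken from the talk and do not reproduce the database on the next slide.

```python
from collections import defaultdict


class LabeledGraph:
    """Rooted, labeled, directed graph (hypothetical helper, not from the talk)."""

    def __init__(self, root):
        self.root = root
        self.edges = defaultdict(list)  # source node -> [(label, target node)]

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))
        self.edges[target]  # touch the target so that leaves are registered too

    def nodes(self):
        return set(self.edges)

    def children(self, node, label=None):
        """Targets of the edges leaving node, optionally filtered by label."""
        return [t for (l, t) in self.edges.get(node, []) if label is None or l == label]


# An invented fragment in the spirit of the university scenario.
db = LabeledGraph(root=1)
db.add_edge(1, "Course", 2)
db.add_edge(1, "Teacher", 3)
db.add_edge(2, "Teacher", 4)  # a teacher below a course
db.add_edge(3, "Course", 5)   # a course below a teacher
db.add_edge(2, "Title", 6)
```

Storing labels on edges, rather than on nodes, follows the OEM view stated above: a node is known only through the labels of its incoming edges.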

6
[Figure: the university database drawn as a rooted
labeled graph with root 1 (University) and nodes 2
to 15; edges are labeled Course, Teacher, Title, and
Name, and the values are Logic, OS, Databases,
Compilers, A. Cohen, B. Levi, and C. Katz.]
A teacher node can either be below or above a
course node
Thus, it is difficult to write a query that
looks for all the teachers and their courses
7
Queries
  • Queries are represented as rooted labeled
    directed graphs
  • The nodes of the graph are considered as variables

8
[Figure: query graph over the variables r (root,
labeled University), u, v, w, and y, with edges
labeled Course, Course, Teacher, Teacher, and Name.]
A query that finds all pairs of courses taught
by the same teacher
However, if courses are descendants of teachers
in the database, the query has to be
reformulated
Instead, we propose new ways of matching queries
to databases
9
A Rigid Matching
  • The query root is mapped to the db root
  • A query edge with label l is mapped to a db edge
    with label l (and, hence, a path is mapped to a
    path)
  • It is the usual semantics for queries
  • (e.g., Lorel, XML-QL, XQL, etc.)
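
The two conditions above can be checked one edge at a time. The following is a minimal sketch, assuming that queries and databases are given as sets of (source, label, target) triples; the function name is_rigid_matching and the tiny example data are hypothetical and are not taken from the talk.

```python
def is_rigid_matching(q_root, q_edges, db_root, db_edges, mapping):
    """Check a candidate mapping of query nodes to database nodes.

    q_edges and db_edges are sets of (source, label, target) triples.
    """
    # Condition 1: the query root is mapped to the database root.
    if mapping[q_root] != db_root:
        return False
    # Condition 2: every query edge is mapped to a database edge with the same label.
    return all((mapping[s], label, mapping[t]) in db_edges for (s, label, t) in q_edges)


# Invented database: a teacher below a course, and a course below a teacher.
db_edges = {(1, "Course", 2), (2, "Teacher", 4), (1, "Teacher", 3), (3, "Course", 5)}
# Query: a course directly below the root, with a teacher below it.
q_edges = {("r", "Course", "u"), ("u", "Teacher", "w")}

rigid_ok = is_rigid_matching("r", q_edges, 1, db_edges, {"r": 1, "u": 2, "w": 4})
rigid_bad = is_rigid_matching("r", q_edges, 1, db_edges, {"r": 1, "u": 5, "w": 3})
```

In this sketch the pair (course 5, teacher 3) is rejected because the database stores that course below its teacher, which is the kind of structural variation the rigid semantics cannot absorb.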

[Figure: the query root r is mapped to the database
root 1, and a query edge labeled l is mapped to a
database edge labeled l.]

10
[Figure: the query (variables u, v, w) drawn next
to the university database (nodes 1 to 15, edge
labels Course, Teacher, Title, and Name).]
A Rigid Matching Example
Another Rigid Matching
11
A Semiflexible Matching
  • The query root is mapped to the db root
  • A query node with an incoming label l is mapped
    to a db node with an incoming label l
  • The image of every query path is embedded in some
    database path
  • SCC is mapped to SCC

The last two conditions cannot be verified
locally, i.e., by considering one query edge at
a time
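
For the simplest case, a tree query over a tree database, the conditions can be sketched as follows. The sketch assumes that "embedded" means that all images of the nodes of a query path lie on one database path, in any order, which is how the teacher-above-or-below-course example reads; the SCC condition is vacuous here because trees have no cycles. All function names and the example data are hypothetical.

```python
def ancestors(parent, node):
    """Return node together with all of its ancestors (tree given as child -> parent)."""
    seen = {node}
    while node in parent:
        node = parent[node]
        seen.add(node)
    return seen


def on_one_path(parent, nodes):
    """True if the given tree nodes all lie on a single root-to-leaf path."""
    nodes = list(nodes)
    return all(
        a in ancestors(parent, b) or b in ancestors(parent, a)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    )


def root_to_leaf_paths(children, node):
    """List every path from node down to a leaf of the query tree."""
    kids = children.get(node, [])
    if not kids:
        return [[node]]
    return [[node] + rest for kid in kids for rest in root_to_leaf_paths(children, kid)]


def is_semiflexible_tree_matching(q_root, q_edges, db_root, db_edges, mapping):
    """Tree query over tree database only; edges are (source, label, target) triples."""
    if mapping[q_root] != db_root:
        return False
    db_parent = {t: s for (s, _, t) in db_edges}
    db_in_label = {t: label for (_, label, t) in db_edges}
    q_children = {}
    for (s, label, t) in q_edges:
        q_children.setdefault(s, []).append(t)
        # The incoming label of a query node must equal that of its image.
        if db_in_label.get(mapping[t]) != label:
            return False
    # The image of every root-to-leaf query path must lie on one database path.
    return all(
        on_one_path(db_parent, [mapping[n] for n in path])
        for path in root_to_leaf_paths(q_children, q_root)
    )


db_edges = {(1, "Course", 2), (2, "Teacher", 4), (1, "Teacher", 3), (3, "Course", 5)}
# The query asks for a course below a teacher.
q_edges = {("r", "Teacher", "t"), ("t", "Course", "c")}

sf_reversed = is_semiflexible_tree_matching("r", q_edges, 1, db_edges, {"r": 1, "t": 4, "c": 2})
sf_unrelated = is_semiflexible_tree_matching("r", q_edges, 1, db_edges, {"r": 1, "t": 3, "c": 2})
```

Here sf_reversed is accepted although teacher 4 sits below course 2 in the database, while sf_unrelated is rejected because nodes 3 and 2 lie on different branches. Enumerating root-to-leaf paths is adequate for a tree query but not for a dag query, where the number of paths may be exponential.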

[Figure: a query node with incoming label l is
mapped to a database node with incoming label l.]

12
[Figure: the query (variables u, v, w) drawn next
to the university database (nodes 1 to 15).]
A Semiflexible Matching Example
We get all the teacher-course pairs
13
[Figure: the query (variables u, v, x, w) drawn next
to a university database (nodes 1 to 15).]
Impossible to get this pair by means of a rigid
matching, since the query is a dag and the db is
a tree
Another Example of a Semiflexible Matching
The SF matching gives a pair of courses taught by
the same teacher
14
A Flexible Matching
  • The query root is mapped to the db root
  • A query node with an incoming label l is mapped
    to a db node with an incoming label l
  • An edge is mapped to two nodes on one path
  • Notice that a path in the query is not
    necessarily mapped to a path in the db
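
Unlike the semiflexible conditions, these conditions can be tested edge by edge. The sketch below assumes that "two nodes on one path" means that one image is reachable from the other, in either direction; the function names and the example data are hypothetical and are not taken from the talk.

```python
def reachable(adj, start):
    """All nodes reachable from start (including start) by a depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_flexible_matching(q_root, q_edges, db_root, db_edges, mapping):
    """Edges are (source, label, target) triples; the database may be any graph."""
    if mapping[q_root] != db_root:
        return False
    adj, in_labels = {}, {}
    for (s, label, t) in db_edges:
        adj.setdefault(s, set()).add(t)
        in_labels.setdefault(t, set()).add(label)
    for (s, label, t) in q_edges:
        a, b = mapping[s], mapping[t]
        # The image of the target must have an incoming edge with the same label.
        if label not in in_labels.get(b, set()):
            return False
        # The two images must lie on one path, in either order.
        if b not in reachable(adj, a) and a not in reachable(adj, b):
            return False
    return True


db_edges = {(1, "Teacher", 3), (3, "Course", 5), (3, "Course", 6), (1, "Course", 2)}
q_edges = {("r", "Course", "u"), ("u", "Teacher", "w"), ("w", "Course", "v")}

flex_ok = is_flexible_matching("r", q_edges, 1, db_edges, {"r": 1, "u": 5, "w": 3, "v": 6})
flex_bad = is_flexible_matching("r", q_edges, 1, db_edges, {"r": 1, "u": 2, "w": 3, "v": 6})
```

In flex_ok every query edge is mapped to two nodes on one database path, yet the images of the whole query path (1, 5, 3, 6) do not lie on a single path, since 5 and 6 are siblings. Under the assumptions of this sketch the mapping is therefore flexible without being semiflexible.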

[Figure: a query node with incoming label l is
mapped to a database node with incoming label l.]

15
[Figure: the query (variables u, v, x, w, y) drawn
next to a university database (nodes 1 to 15).]
A Flexible Matching Example
A query edge is mapped to two db nodes on one path
This flexible matching is neither a rigid
matching nor a semiflexible matching
16
Differences Between the Semiflexible and Flexible
Semantics
  • On a technical level, in flexible matchings
      • Query paths are not necessarily embedded in
        database paths
      • SCCs are not necessarily mapped to SCCs
  • On a conceptual level, in the semiflexible
    semantics, nodes are semantically related if
    they are on the same path, and hence
      • Query paths are embedded in database paths
  • In the flexible semantics, this condition is
    relaxed
      • Query edges are embedded in database paths

17
Inclusion
  • Proposition
      • R-MATQ(D) ⊆ SF-MATQ(D) ⊆ F-MATQ(D)
  • where
      • R-MATQ(D) is the set of rigid matchings
      • SF-MATQ(D) is the set of semiflexible
        matchings
      • F-MATQ(D) is the set of flexible
        matchings

18
Verifying that Mappings are Semiflexible Matchings
  • Is a given mapping of query nodes to database
    nodes a semiflexible matching?
  • Not as simple as for rigid matchings (no local
    test, i.e., need to consider paths rather than
    edges)
  • In a dag query, the number of paths may be
    exponential
  • Yet, verifying is in polynomial time
  • In a cyclic query, the number of paths may be
    infinite
  • Yet, verifying is in exponential time
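
The remark about dag queries can be made concrete with a chain of k "diamonds", which has 3k + 1 nodes but 2^k distinct paths from its first node to its last. The sketch below counts those paths with memoization, so the count is obtained without enumerating the paths. It only illustrates the blow-up; it is not the verification algorithm of the talk, and the names count_paths and diamond_chain are hypothetical.

```python
from functools import lru_cache


def count_paths(adj, source, sink):
    """Number of distinct source-to-sink paths in a DAG (node -> list of successors)."""

    @lru_cache(maxsize=None)
    def count(node):
        if node == sink:
            return 1
        return sum(count(nxt) for nxt in adj.get(node, ()))

    return count(source)


def diamond_chain(k):
    """Build k diamonds in a row: j0 -> {a0, b0} -> j1 -> {a1, b1} -> j2 ..."""
    adj = {}
    for i in range(k):
        top, left, right, bottom = f"j{i}", f"a{i}", f"b{i}", f"j{i + 1}"
        adj[top] = [left, right]
        adj[left] = [bottom]
        adj[right] = [bottom]
    return adj


paths_in_10_diamonds = count_paths(diamond_chain(10), "j0", "j10")  # 2 ** 10 paths
```

A test that examined each query path separately would already need 1024 checks for this 31-node query, which is why a polynomial bound for dag queries is not obvious.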

19
Verifying that a Mapping is a Semiflexible
Matching
Query / Database    Path Query   Tree Query   DAG Query   Cyclic Query
Path Database       PTIME        PTIME        PTIME       No matchings
Tree Database       PTIME        PTIME        PTIME       No matchings
DAG Database        PTIME        PTIME        PTIME       No matchings
Cyclic Database     PTIME        PTIME        coNP        coNP
20
Complexity of Query Evaluation
  • Not surprisingly, for both the semiflexible and
    flexible semantics
  • Data complexity is polynomial
  • Query complexity is exponential
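
One way to see where the two bounds come from is to count the candidate mappings that a naive evaluation would have to examine: a query with k variables over a database with n nodes admits n^k mappings, which is polynomial in n for a fixed query and exponential in k. The sketch below only enumerates candidates; the function name and the sizes used (4 variables, 15 nodes) are illustrative assumptions, not an algorithm from the talk.

```python
from itertools import product


def candidate_mappings(query_nodes, db_nodes):
    """Yield every assignment of database nodes to query nodes (n ** k in total)."""
    db_nodes = list(db_nodes)
    for images in product(db_nodes, repeat=len(query_nodes)):
        yield dict(zip(query_nodes, images))


# Hypothetical sizes: 4 query variables over a database with 15 nodes.
total_candidates = sum(1 for _ in candidate_mappings(["r", "u", "v", "w"], range(1, 16)))
```

Filtering these candidates with a matching test gives a correct but naive evaluation procedure, and it says nothing about whether the exponential cost is avoidable.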

But is it exponential because the result is
large or because the result is hard to compute?
21
Input-Output Complexity of Query Evaluation for
the Semiflexible Semantics
  • The input consists of both the query and the
    database
  • The input-output complexity is a function of the
    query, the database and the result
  • Next slide summarizes results about the
    input-output complexity
  • Polynomial for a dag query and a tree database
    (or simpler cases)
  • Rather difficult to prove, even when the query is
    a tree, since there is no local test for
    verifying that mappings are semiflexible
    matchings
  • Exponential lower bounds for other cases

22
I/O Complexity for SF Semantics (lower bounds
are for non-emptiness)
Query / Database    Path Query    Tree Query    DAG Query          Cyclic Query
Path Database       PTIME         PTIME         PTIME              Result is empty
Tree Database       PTIME         PTIME         PTIME              Result is empty
DAG Database        NP-Complete   NP-Complete   NP-Complete        Result is empty
Cyclic Database     NP-Complete   NP-Complete   NP-Hard (in Σ₂ᴾ)   NP-Hard (in Σ₂ᴾ)
23
I/O Complexity of Query Evaluation for the
Flexible Semantics
  • Results follow from a reduction to query
    evaluation under the rigid semantics
  • Tree query
      • Input-Output complexity is polynomial
  • DAG query
      • Testing for non-emptiness is NP-Complete

24
Query Containment
  • Q1 ⊆ Q2 if, for every database D, the set of
    matchings of Q1 w.r.t. D is contained in the
    set of matchings of Q2 w.r.t. D
  • We assume that
      • Both queries have the same set of variables, and
      • All variables are distinguished

25
Query Equivalence
  • Useful for optimization
  • Given a query, equivalent queries can be created
    by transformations

These two queries are equivalent under both the
flexible and semiflexible semantics
26
Database Equivalence
  • D1 and D2 are equivalent if, for every query Q,
    the set of matchings of Q w.r.t. D1 is equal to
    the set of matchings of Q w.r.t. D2
  • Both databases must have the same set of objects
    and the same root

27
Database Transformation
[Figure: database graphs with root 1 (University),
edges labeled Course and Teacher, nodes 2, 3, 4, 6,
and 8, and the values Logic, Compilers, Databases,
A. Cohen, and C. Katz.]
The databases are equivalent under both the
flexible and semiflexible semantics
A DAG has become a TREE!
28
Transforming a Database into a Tree
  • Reasons for transforming a database into an
    equivalent tree database
  • Evaluation of queries over a tree database is
    more efficient
  • In a graphical user interface, it is easier to
    present trees than dags or cyclic graphs
  • Storing the data in a serial form (e.g., XML)
    requires no references

29
Transformation into a Tree
  • There are algorithms for
      • Testing if a database can be transformed into an
        equivalent tree database, and
      • Performing the transformation
  • For the semiflexible semantics
      • The algorithms are polynomial
  • For the flexible semantics
      • The algorithms are exponential

30
[Figure: an example database with objects o0 to o6
and edge labels l1 to l6, followed by these lists
of objects:]
o0, o1, o2, o4, o5
o0, o1, o3, o4, o5
o0, o5, o6
31
Complexity Analysis
  • for Query Containment and Database Equivalence

32
Complexity of Query Containment
  • Under the semiflexible semantics, Q1 ⊆ Q2 iff the
    identity mapping is a semiflexible matching of Q1
    w.r.t. Q2
  • Thus, containment is
      • in coNP when Q1 is a cyclic graph and Q2 is
        either a dag or a cyclic graph
      • in polynomial time in all other cases
  • Under the flexible semantics, query containment
    is always in polynomial time

33
Complexity of Database Equivalence
  • For the semiflexible semantics, deciding
    equivalence of databases is
      • in polynomial time if both databases are dags
      • in coNP if one of the databases has cycles
  • For the flexible semantics, deciding equivalence
    of databases is polynomial in all cases

34
Conclusion
  • Flexible and semiflexible queries facilitate easy
    and intuitive querying of semistructured
    databases
      • Querying the database even when the user is
        oblivious to the structure of the database
      • Queries are insensitive to variations in the
        structure of the database

35
Conclusion (contd)
  • Compared to languages that use regular path
    expressions,
      • Less expressive power, but
      • Easier to formulate queries, and
      • More favorable complexities for
          • Query evaluation, and
          • Query optimization

36
Thank You
  • Questions?