Title: A Semantic Approach to Rewriting Queries for Integrated XML Data
1A Semantic Approach to Rewriting Queries for
Integrated XML Data
- Xia Yang1, Mong Li Lee1, Tok Wang Ling1, Gillian
Dobbie2 - 1School of Computing, National University of
Singapore - 2Department of Computer Science, The University
of Auckland, New Zealand
2The Problem
Return the title of the books in the integrated
database where the authors name is Tom and the
publication year is 2000
S1
S2
S12 Local schema S1, S2,
Integrated schema S12 .
3Outline
- Background
- Preliminary Definitions
- Query Rewriting
- Comparison with Related Work
- Conclusion
4Background
- In data integration, many systems construct a
global or mediated schema from numerous
heterogeneous data sources. - The local sources are XML repositories.
- When XML repositories are involved in data
integration, query rewriting algorithms will need
to take into consideration the hierarchical
structures of XML schemas. - XML schema languages such as DTD and XML Schema
lack the necessary semantic information. - Our approach utilizes the ORA-SS model which
provides the semantic information needed for the
query rewriting process.
5Preliminary Definitions
- XML Schema Model
- ORA-SS schema diagrams
- XML Query Language
- XQuery
- Integrating ORA-SS Schemas
- Mapping table
- Query allocation table
6ORA-SS model
- The ORA-SS model (Object-Relationship-Attribute
model for SemiStructured data)1 is a
semantically rich data model that has been
designed for semistructured data. - The ORA-SS model distinguishes between object
classes, relationship types and attributes. - The main contributions are that
- relationship type is expressed explicitly
- the degree of the relationship type expresses the
number of object classes involved in the
relationship type - the attributes are classified as attributes of
object class and attributes of relationship type.
- The semantically rich ORA-SS model makes it
possible to get the correct result in query
rewriting for integrated XML data. - 1. Tok Wang Ling, Mong Li Lee, and Gillian
Dobbie. Semistructured Database Design. Springer,
2005
7ORA-SS model example
- An example of ORA-SS schema diagram
8Integrating ORA-SS Schemas
- Integrating ORA-SS Schemas
- The algorithm to correctly integrate ORA-SS
schemas was presented in a previous ER
conference 2.
S1
S2
S12 Local schema S1, S2,
Integrated schema S12 2. X. Yang, M.L. Lee, T.W.
Ling. Resolving Structural Conflicts in the
Integration of XML Schemas A Semantic Approach.
ER 2003.
9Mapping table
- A mapping table is created during the process of
integrating the various local schemas S1, S2, ,
Sn into an integrated schema S. - The mapping table contains the mappings between
the object classes and attributes in the local
source schemas and the object classes and
attributes in the integrated schema.
10Mapping table example
S1 S2
S3 S4
- S5 Integrated Schema S12345
11Mapping table example
12Query allocation table
- The QAT is derived from the mapping table.
- A QAT is created for each query.
- The query allocation table (QAT) stores the
selection condition paths, the return result
paths, and the local schemas where the data for
these paths can be found. - The QAT has two parts. The first part contains
the selection conditions, and the second part
contains the return result paths. - The QAT is automatically generated by our
algorithm.
13Query allocation table example
S1 S2
S3 S4
S5 S12345 S12345 is the
integrated schema of local schemas S1, S2, S3,
S4, S5. Query Q3 for b in /book where
b/authorTom and b/year2000
return ltresultgt b/year/text()
b/title/text() lt/resultgt
14Query allocation table example
Query Allocation Table for Query Q3 Selection
Condition Table
Return Result Table
15Query Rewriting
- Step1 Build the query allocation table.
- Step 2 Group local schemas to form join groups
that answer the users query. - Step 3 Decompose the user query to subqueries on
the local sources. - Step 4 Compose the subqueries from local schemas
in a join group.
16Step1 Build the query allocation table.
- A query allocation table (QAT) consists of a
selection condition table and a return result
table. - The path of each selection condition and the
return result is inserted into the selection
condition table and the return result table
respectively. The associated schemas identified
from the mapping table are inserted into the
corresponding rows.
17Step 2 Group local schemas to form join groups
that answer the users query.
- The algorithm finds which local schemas must be
combined to get the expected results. - This is done by scanning the QAT to find the join
groups. - The local schemas in each join group must contain
all the paths required for the selection
condition but may be missing some of the paths
for the return result. This is possible because
the result of a query can be incomplete.
18Step 2 Example demonstrating finding the sources
from QAT that answer users query
- S1 S2
S3 S4
S5 S12345 - S12345 is the integrated schema of local schemas
S1, S2, S3, S4, S5. - Query Q3 for b in /book where b/authorTom
and b/year2000 - return ltresultgt b/year/text()
b/title/text() lt/resultgt
19Step 2 Example demonstrating finding the sources
from QAT that answer users query
Query Allocation Table for Query Q3 Selection
Condition Table
Return Result Table
- Output join group S1, S3 S1, S4 S2, S4
S3 - Note S2, S4 S3 are join groups, even though
they cannot offer full results of the query. We
allow the return results with missing information.
20Step 3 Decompose the user query to subqueries on
the local sources
- This step decomposes the user query into queries
on the local schema based on the join groups. - Since subqueries are later composed to compute
the answers in the same join group, the subquery
must return the data asked for in the user query
and also the data necessary to join the parts of
the answers from different local schemas
together. We call the classes necessary for
joining the parts of answers, join object classes.
21Step 3 Decompose the user query to subqueries on
the local sources
- The key of the join object class is used for
testing the equivalence when joining the
subqueries. - Finding join object classes
- Case 1 For a join group, if there are n paths in
the QAT from different local schemas with a
common ancestor in the user query, then the least
common ancestor in the user query is a join
object class. - There are two other cases that we will not
consider in this talk.
22Step 3 Example demonstrating decomposing the
user query to subqueries on the local sources
(case1)
S1 S2
S3 S4
S5 S12345 S12345 is the
integrated schema of local schemas S1, S2, S3,
S4, S5. Query Q3 for b in /book where
b/authorTom and b/year2000
return ltresultgt b/year/text()
b/title/text() lt/resultgt
23Step 3 Example demonstrating decomposing the
user query to subqueries on the local sources
(case1)
- Recall the join group S1, S3, S1 provides
/book/title, /book/author and S3 provides
/book/year, /book/author. - To answer the query Q3, the subqueries from S1
and S3 need to be composed using the key of their
least common ancestor i.e. the key isbn of the
join object class book.
24Step 4 Compose the subqueries from local schemas
in a join group
- When joining subqueries on local schemas in the
same join group, the identifier of the join
object classes must be tested for equivalence. - We start by considering the basic case where the
same object attributes are from different local
schemas. To compose subqueries from these local
schemas in join groups, the for, where, and
return clause are combined together with the join
condition equivalence test inserted in the where
clause. - We allow the return results to have missing
information. The parent object will not be
removed from the return result, if it has a
missing child. For each return object or
attribute, the join equivalence condition test
related to this return object or attribute is
nested in the appropriate part of the query.
25Step 4 Example of composing the subqueries from
local schemas in a join group
S1 S2
S3 S4
S5 S12345 S12345 is the
integrated schema of local schemas S1, S2, S3,
S4, S5. Query Q9 for b in /book where
b/authorTom and b/year2000
returnltresultgtb/year/text() b/title/text()
for p in b/publisher
where contains (b/publisher/location/text(),S
ingapore) returnltpublishergt
b/publisher/name lt/publishergt
lt/resultgt
26Step 4 Example of composing the subqueries from
local schemas in a join group
- The join groups are S1, S3, S5, S1, S4, S5,
S2, S4, S5 and S3, S5. - We show the query example for the join group S1,
S3, S5. The user query is decomposed into
subqueries on the local schemas S1, S3, and S5.
The join object class is book for these local
schemas.
27Step 4 Example of composing the subqueries from
local schemas in a join group
- The subqueries on S1, S3 and S5 are shown below
- Query Q9_S1
- for b in /book
- where b/authorTom
- return ltresultgt b/isbn/text()
b/title/text() lt/resultgt - Query Q9_S3
- for b in /book
- where b/authorTom and
b/year2000 - return ltresultgt b/isbn/text()
b/year/text() lt/resultgt - Query Q9_S5
- for b in /book
- where contains (b/publisher/address/te
xt(),Singapore) - returnltresultgtb/isbn/text()
- ltpublishergtb/publisher/name
lt/publishergtlt/resultgt
28Step 4 Example of composing the subqueries from
local schemas in a join group
- The composition of the subqueries for local
schemas S1, S3 and S5 are as follows - for b1 in doc(S1.xml)/book, b3 in
doc(S3.xml)/book - where b1/authorTom and b3/authorTom and
b3/year2000 - and b1/isbnb3/isbn
- return ltresultgtb3/year/text()
b1/title/text() - for b5 in doc(S5.xml)/book
- where contains (b5/publisher/address/text(),S
ingapore) and b5/isbnb1/isbn - returnltpublishergt b5/publisher/nameltpublishe
rgtlt/resultgt - Note that although the join object class for S1,
S3 and S5 is book, the equivalence tests are on
separate lines in the rewritten query. This is
because the parent information should be returned
even when a child object class is missing.
29Comparison with Related Work
- Amman et al. 3 propose a mediator architecture
for querying and integrating XML data sources. - Lakshmanan and Sadri 4 propose an
infrastructure for interoperability among XML
data sources. - Yu and Popa 5 introduce an algorithm for
answering queries via a target schema. - 3. B. Amann, C. Beeri, I. Fundulaki, M. Scholl.
Querying XML sources using an Ontology-based
Mediator. CoopIS, 2002. - 4. L.V.S. Lakshmanan, F. Sadri. Interoperability
on XML Data. ICSW, 2003. - 5. C. Yu, L. Popa. Constraint-Based XML Query
Rewriting for Data Integration. SIGMOD 2004.
30Comparison with Related Work
- The models that these authors use cannot
represent whether a relationship type is binary
or n-ary and do not distinguish between
attributes of object classes and attributes of
relationship types from the local XML sources.
The lack of such semantic information can lead to
the retrieval of wrong results.
31Comparison with Related Work
- S1 S2
S12 - S12 is the integrated schema of S1, S2
- Query Q1 for j in /project
- return ltprojectgt j/jno
- for p in j/part
- return ltpartgtp/pno
- for s in
p/supplier return s lt/partgt - lt/projectgt
32Comparison with Related Work (example)
33Comparison with Related Work (example)
34Comparison with Related Work
- The results returned by the query rewriting
method in related work contain the project with
jno j01 has part p01, which is supplied by
suppliers with sno s01 and s02. This violates
the local data sources X1 and X2, where the
project with jno j01 has part p01 is only
supplied by suppliers with sno s01.
35Comparison with Related Work
- This is because the methods in the related work
treat the relationship type between part and
supplier as a binary relationship type, instead
of the intended ternary relationship type
involving project, part, and supplier. They treat
the quantity as the attribute of part in S2, so
when they find the part with pno p01 has
quantity 100 in X1, and has quantity 200 in
X2, they will combine them to make the final
result. This leads to the wrong answer returned. - In contrast, our algorithm takes the XML
hierarchy structure into consideration and
retrieves the expected answer.
36Conclusion
- We develop an algorithm for rewriting queries,
using the ORA-SS schema diagram that takes the
semantic relationship between the source schemas
and the integrated schema into account. - This allows us to distinguish between binary and
n-ary relationship types, attributes of object
classes and attributes of relationship types, and
in turn treat these cases differently in our
rewriting algorithm.
37 38Step 2 Algorithm that finds the sources from QAT
that answer users query
39Step 3 Decompose the user query to subqueries on
the local sources
- Finding join object classes
- Case 2 For a join group, if the paths in the QAT
are from different local schemas, and there is an
object class that is the end of one path and the
start of the other path, then this common object
class is a join object class. - Case 3 For a join group, if two attributes of
the same relationship type in a user query are
from different local schemas, then all the object
classes involved in this relationship type are
join object classes.
40Step 3 Decompose the user query to subqueries on
the local sources
- We describe how to rewrite a user query for a
local schema where the subquery on the local
schema returns only one object class or
attribute. - Then we describe how to rewrite a user query for
a local schema where the subquery on the local
schema returns many object classes or attributes. - (For details, please refer to the
paper.)
41References
- 1. B. Amann, C. Beeri, I. Fundulaki, M. Scholl.
Querying XML sources using an Ontology-based
Mediator. CoopIS, 2002. - 2. P. Buneman, S. Davidson, W. Fan, C. Hara, W.C.
Tan. Keys for XML. WWW Conference, 2001. - 3. Y. Chen, T.W. Ling, M.L. Lee. Automatic
Generation of XQuery View Definitions from ORA-SS
Views. ER, 2003. - 4. S. Cluet, P. Veltri, D. Vodislav. Views in a
Large Scale XML Repository. VLDB, 2001. - 5. O.M. Duschka, M.R. Genesereth. Answering
Recursive Queries Using Views. ACM PODS, 1997. - 6. L. Haas, D. Kossmann, E. Wimmers, J. Yang.
Optimizing queries across diverse data sources.
VLDB, 1997. - 7. A. Halevy. Theory of Answering Queries Using
Views. ACM SIGMOD Record 29(4), 2000. - 8. L.V.S. Lakshmanan, F. Sadri. Interoperability
on XML Data. ICSW, 2003. - 9. M.L. Lee, T.W. Ling, W.L. Low. Designing
Functional Dependencies for XML. EDBT, 2002. - 10. A. Levy. Logic-Based Techniques in Data
Integration. Logic based artificial intelligence,
1999. - 11. T.W. Ling, M.L. Lee, G. Dobbie.
Semistructured Database Design, ISBN
0-387-23567-1, Springer, 2005. - 12. I. Manolescu, D. Florescu, D. Kossman.
Answering XML queries over heterogeneous data
sources. VLDB, 2001. - 13. H. Garcia-Molina, Y. Papakonstantinou, D.
Quass, et al. The TSIMMIS project Integration of
heterogeneous information sources. Journal of
Intelligent Information Systems, 1997. - 14. K. Passi, E. Chaudhry. A Global-to-Local
Rewriting Querying Mechanism using Semantic
Mapping for XML Schema Integration. ODBASE 2003. - 15. K. Passi, L. Lane, S.Madria, Bipin C.
Sakamuri, M. Mohania, S. Bhowmick. A Model for
XML Schema Integration. EC-Web, 2002. - 16. R. Pottinger, A. Levy. A Scalable Algorithm
for Answering Queries Using Views. VLDB, 2000. - 17. M. Stonebraker. Implementation of integrity
constraints and views by query modification. ACM
SIGMOD, 1975. - 18. Xyleme. A dynamic warehouse for XML Data of
the Web. IEEE Data Engineering Bulletin, 2001. - 19. H.Z. Yang, P.A. Larson. Query Transformation
for PSJ-queries. VLDB, 1987.