Title: Web Data Management with XML
1Web Data Management with XML
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla
- MO 65401
- madrias_at_umr.edu
- www.umr.edu/madrias
2Source and Acknowledgement
- This tutorial is based on material obtained from
different papers, web sites and students
presentation. Selected papers are solely based on
the speakers own decision. - It reflects speakers own understanding of the
material and their importance.
3XML V/S HTML
- HTML Text Presentation
- XML Data Structure
- HTML Fixed Tags, difficult to extract data,
good for human consumption - XML User defined tags, good for application,
easy to extract data from structure (data is
separated), self-describing, schema and data
together - XML slow (about 1) but more used in intranet
4XML - standards
- Schema (XML schema), Stylesheet (XSL), Resource
Description Framework (RDF...), XPATH, XPointer - WML, MathML, SMIL (multimedia), SportsML , RSS
(news),... - Xquery
5Some Upcoming Featureshttp//www.oasis-open.org/c
over/xml.html
- WS-Inspection specification - how an application
can discover an XML Web service description on a
Web server (IBM and Microsoft ) - Data Extraction Language - XML format for
describing data conversion processes from other
data formats to XML - VoiceXML
- Cascading Style Sheets (CSS2) specification for
mobile devices
6How is XML Used
- XML handles information content of diverse data
sources including - Structured and semi-structured documents
- Relational databases
- Legacy databases
- Object repositories
- Database views (as XML)
- Electronic data interchange (EDI)
- Facilitate B2B communication
- Data exchange between data sources
7XML Data Management- Main Issues
- XML Query Languages
- XML Data and Schema Integration
- XML Indexing
- XML views and storing XML in other Data Models
(Relational ) - XML Updates
8- XML Query Language Requirements
- Expressive power
- Standard relational language operations SPJ,
IDU - Restructuring operations reduction, merge,
- Formal Semantics
- Important for dealing with query transformation
and optimization. - Compositionality
- The output of a query should be in the same
language as the input
9XML Query Capabilities
- Query multiple documents
- Query distributed data stored in a variety of
formats such as Relational and OO databases. - XML query must be translatable to query facility
for underlying data. - Ability to create XML schemas from non-XML data
sources - Query should be usable on a document without a
schema - Presence of a schema should permit query
validation
10- XML Query Languages
- http//www.w3.org/TR/xquery/
- Alin Deutsch, Mary F. Fernandez, Daniela
Florescu, Alon Y. Levy, David Maier, Dan Suciu
Querying XML Data. IEEE Data Engineering Bulletin
22(3) 10-18 (1999) - A. Bonifati and S. Ceri. Comparative analysis of
five XML query languages. SIG-MOD Record, March
2000. - R. Goldman, J. Mchugh and J. Widom, From
Semi-structured Data to XML Migrating Lore to
XML Model and Query Language. In ACM SIGMOD WebDB
workshop99.
11Xquery (June, 2001)
- Derived from Quilt, which borrowed features from
several other languages. - From XPath and XQL -Path expression syntax
- From XML-QL - Notion of binding variables
- From SQL - SELECT-FROM-WHERE.
- From OQL - Different kinds of expressions that
can be nested with full generality. - Influenced by query languages such as Lorel and
YATL.
12XML
- ltdbgt
- ltlab ID"baselab manager"smith1"gt
- ltnamegtSeattle Bio Lablt/namegt
- ltlocationgt
- ltcitygtSeattlelt/citygt
- ltcountrygtUSAlt/countrygt
- lt/locationgt
- lt/labgt
- ltbiologist ID"smith1"gt
- ltlastnamegtSmithlt/lastnamegt
- lt/biologistgt
- lt/dbgt
13XQL Node-labeled tree
14Path Expression
- (Q) In the second chapter of the document named
"zoo.xml", find the figure(s) with caption "Tree
Frogs". - document("zoo.xml")//chapter2//figurecaption
"Tree Frogs" - First step locates the root node of a document.
- Second step locates the second chapter of the
document (ordinal number) - Third step finds figure elements occurring
anywhere within the chapter, but retains those
figure elements that have a caption with the
value "Tree Frogs."
15Element Constructors
- A query often needs to generate new elements. The
simplest way to generate a new element is to
embed the element directly in a query using XML
notation. - (Q) Generate an ltempgt element that has an "empid"
attribute and nested ltnamegt and ltjobgt elements. - ltemp empid "12345"gt
- ltnamegtJohn Smithlt/namegt
- ltjobgtAnthropologistlt/jobgt
- lt/empgt
16Element Constructors
- (Q) Generate an ltempgt element that has an "empid"
attribute. The value of the attribute and the
content of the element are specified by variables
that are bound in other parts of the query. - ltemp empid idgt
- name
- job
- lt/empgt
17FLWR Expressions
- A FLWR (pronounced "flower") expression is
constructed from FOR, LET, WHERE, and RETURN
clauses, which must appear in a specific order. - A FLWR expression binds values to one or more
variables and then uses these variables to
construct a result.
18FLWR Expressions
- (Q) List each publisher and the average price of
its books. - FOR p IN distinct(document("bib.xml")//publisher)
- LET a avg(document("bib.xml")//bookpublisher
p/price) - RETURN
- ltpublishergt
- ltnamegt p/text() lt/namegt
- ltavgpricegt a lt/avgpricegt
- lt/publishergt
19Sorting
- A sequence can be ordered by means of a SORTBY
clause that contains one or more "ordering
expressions." - (Q)List all books with price greater than 100,
in order by first author within each group of
books with the same first author, list the books
in order by title. - document("bib.xml")//bookprice gt 100 SORTBY
(author1, title)
20Operators and Expressions
- (Q) Prepare a "critical sequence" report
consisting of all elements that occur between the
first and second incision in the first procedure.
- ltcritical_sequencegt
-
- LET p //procedure1
- FOR e IN // AFTER (p//incision)1
- BEFORE (p//incision)2
- RETURN shallow(e)
-
- lt/critical_sequencegt
- The shallow function makes a shallow copy of a
node, including attributes but not including
subelements.
21Conditional Expressions
- (Q) Make a list of holdings, ordered by title.
For journals, include the editor, and for all
other holdings, include the author. - FOR h IN //holding
- RETURN
- ltholdinggt
- h/title,
- IF (h/_at_type "Journal")
- THEN h/editor
- ELSE h/author
-
- lt/holdinggt
- SORTBY (title)
22Quantified Expressions
- (Q) Find titles of books in which both sailing
and windsurfing are mentioned in the same
paragraph. - FOR b IN //book
- WHERE SOME p IN b//para SATISFIES (contains(p,
"sailing") AND contains(p, "windsurfing")) - RETURN b/title
23XML-QL Features
- Extraction How will data be extracted from
large XML documents? - Transformation How will XML data be exchanged
between user communities using different but
related DTD's? - Integration How will XML data from multiple XML
sources be integrated? - Conversion of data between relational or OO to XML
24BIB . DTD
- lt!ELEMENT book (author, title, publisher)gt
- lt!ATTLIST book year CDATAgt
- lt!ELEMENT article (author, title, year?,
(shortversionlongversion))gt - lt!ATTLIST article type CDATAgt
- lt!ELEMENT publisher (name, address)gt
- lt!ELEMENT author (firstname?, lastname)gt
25XML Data Model
root
Unique -id
book
book
(year1998)
(year1995)
author
title
author
title
publisher
author
publisher
Foundations for ...
lastname
lastname
An introduction
name
lastname
name
Date
Datwen
Addison-Wesley
Addison-Wesley
Date
26Selection/Extraction
- Find all the names of the authors whose publisher
is - Addison-Wesley
- WHERE ltbookgt
- ltpublishergtltnamegt Addison-Wesley lt/namegt
lt/publishergt - lttitlegt t lt/titlegt
- ltauthorgt a lt/authorgt
- lt/bookgt IN "www.a.b.c/bib.xml"
- CONSTRUCT a
- Result
- ltlastnamegt Date lt/lastnamegt
27Constructing new XML dataReduction Restructure
- WHERE ltbookgt
- ltpublishergt ltnamegt Addison-Wesley lt/gtlt/gt
- lttitlegt t lt/gt
- ltauthorgt a lt/gt
- lt/gt IN "www.a.b.c/bib.xml"
- CONSTRUCT ltresultgt
- ltauthorgt a lt/gt
- lttitlegt t lt/gt
- lt/gt
28 XML-QL Example Data
ltbibgt ltbook year1995gt lttitlegt An
Introduction to DB Systems lt/titlegt ltauthorgt
ltlastnamegt Date lt/lastnamegtlt/authorgt ltpublishergt
ltnamegt Addison-Wesleylt/namegt lt/publishergt lt/bookgt
ltbook year1995gt lttitlegt Foundations for
OR Databases lt/titlegt ltauthorgt ltlastnamegt Date
lt/lastnamegtlt/authorgt ltauthorgt ltlastnamegt
Darwen lt/lastnamegtlt/authorgt ltpublishergtltnamegt
Addison-Wesleylt/namegt lt/publishergt lt/bookgt lt/bibgt
29Grouping with Nested Queries Preserve structure
- Where ltbookgt p lt/bookgt IN "www.a.b.c/bib.xml ,
- ltpublishergt ltnamegtAddison-Wesley lt/gt lt/gt IN p ,
- lttitlegt t lt/gt IN p
- CONSTRUCT
- ltresultgt
- lttitlegt t lt/gt
- WHERE ltauthorgt a lt/gt IN p
- CONSTRUCT ltauthorgt a lt/gt
- lt/gt
30Reduction
- Where ltbookgt ltpublishergt ltnamegt
- Addition-wesleylt/gt lt/gt
- lttitlegtt lt/gt Element_As x
- ltauthorgt alt/gt Element_As y
- lt/gt lt/gt IN www.a.b.c/bib/xml
- Construct ltresultgt x y lt/gt
31Joining element by values
- WHERE ltarticlegt
- ltauthorgt
- ltfirstnamegt fn lt/gt -- firstname f
- ltlastnamegt ln lt/gt -- firstname l
- lt/gt
- lt/gt CONTENT_AS a IN "www.a.b.c/bib.xml,
- ltbook year y gt
- ltauthorgt
- ltfirstnamegt fn lt/gt -- join the same
firstname f - ltlastnamegt ln lt/gt -- join the same
lastname l - lt/gt
- lt/gt IN "www.a.b.c/bib.xml,
- y gt 1995
- CONSTRUCT ltarticlegt a lt/gt
32Element Identity, IDs, and ID Reference
- ltperson ID o123gt
- ltfirstnamegt John lt/firstnamegt
- ltlastnamegt Smith lt/lastnamegt
- lt/persongt
- ltperson ID o234gt
- ...
- lt/persongt
- ltarticle author o123 o234gt
- lttitlegt lt/titlegt
- ltyeargt 1995 lt/yeargt
- lt/articlegt
33XML graph including ID IDREFS
root
article
person
person
author
first name
first name
title
last name
year
last name
1995
John
Smith
34Writing queries using IDs
- Without IDs
- WHERE ltarticlegtltauthorgtltlastnamegt n lt/gtlt/gtlt/gt IN
abc.xml - Using IDREF (All last name, title pairs)
- WHERE ltarticle author igt
- lttitlegt lt/gt ELEMENT_AS t
- lt/gt,
- ltperson ID igt
- ltlastnamegt lt/gt ELEMENT_AS l
- lt/gt
- CONSTRUCT ltresultgt t l lt/gt
35Tag Variables, No schema required
WHERE lt p gt -- p can be article,
book lttitlegt t lt/gt ltyeargt1995 lt/gt --
referring attr. as an element lt e gtltlastnamegt
Date lt/gt lt/gt lt/gt IN "bib.xml", e IN
author, editor CONSTRUCT lt p gt
lttitlegt t lt/gt lt e gt Date lt/gt
lt/gt
All publications published in 1995 in which Date
is either an author, or an editor
36Regular Path Expressions
- XML data can specify nested and cyclic
structures, such as trees, directed acyclic
graphs, and arbitrary graphs. - The following DTD defines a self-recursive
element part - lt!ELEMENT part (name brand part)gt
- lt!ELEMENT name CDATAgt
- lt!ELEMENT brand CDATAgt
37Regular Path Expressions (cont)
- Here part is a regular path expression, and
- matches any sequence of edges, all of which
- are labeled part
- WHERE lt part gt
- ltnamegt r lt/gt
- ltbrandgt Ford lt/gt
- lt/gt IN "www.a.b.c/parts.xml"
- CONSTRUCT ltresultgt r lt/gt
38Regular Path Expressions (cont)
- The wildcard matches any tag and appear
wherever a tag is permitted - Example
- WHERE lt gt
- ltnamegt r lt/gt
- ltbrandgt Ford lt/gt
- lt/gt IN "www.a.b.c/parts.xml"
- CONSTRUCT ltresultgt r lt/gt
39Transforming XML data with Skolem function
- WHERE ltgt ltauthorgt
- ltfirstnamegt fn lt/gt
- ltlastnamegt ln lt/gt
- lt/gt
- lttitlegt t lt/gt
- lt/gt IN "www.a.b.c/bib.xml",
- CONSTRUCT ltperson IDPersonID(fn, ln)gt
- ltfirstnamegt fn lt/gt
- ltlastnamegt ln lt/gt
- ltpublicationtitlegt t lt/gt
- lt/gt
- lt!ELEMENT person (lastname, firstname,
- address?, phone?, publicationtitle)
40Integrating data from multiple XML sources
- WHERE ltpersongt
- ltnamegtlt/gt ELEMENT_AS n
- ltssngt ssn lt/gt
- lt/gt IN payroll.xml",
- lttaxpayergt
- ltssngt ssn lt/gt
- ltincomegtlt/gt ELEMENT_AS i
- lt/gt IN "taxpayers.xml
- CONSTRUCT ltresultgt n i lt/gt
41Integrating data from multiple XML sources
(Skolem function)
WHERE ltpersongt ltnamegt lt/gt ELEMENT_AS
n ltssngt ssn lt/gt lt/gt IN payroll.xml"
CONSTRUCT ltresult IDSSNID(ssn)gt n lt/gt
WHERE lttaxpayergt ltssngt ssn lt/gt ltincomegt lt/gt
ELEMENT_AS i lt/gt IN "taxpayers.xml" CONSTRUCT
ltresult IDSSNID(ssn)gt n i lt/gt
42Integrating data (cont)
All titles published in 95, in addition the
month of journal articles and the publishers for
books WHERE lt e gt lttitlegt t lt/gt ltyeargt
1995 lt/gt lt/gt CONTENT_AS p IN
"www.a.b.c/bib.xml" CONSTRUCT ltresult
IDResultID(p)gt lttitlegt t lt/gt lt/gt WHERE e
article", ltmonthgt m lt/gt IN p CONSTRUCT
ltresult IDResultID(p)gt ltmonthgt m lt/gt lt/gt
WHERE e "book", ltpublishergt q lt/gt IN p
CONSTRUCT ltresult IDResultID(p)gt ltpublishergt q
lt/gt lt/gt
43Functions definitions and DTDs
- function query()
-
- CONSTRUCT ltresultgt
- findDeclaredIncomes("taxpayers.xml","payroll.xml
") - lt/resultgt
-
- function findDeclaredIncome(Taxpayers,Employees)
-
- WHERE lttaxpayergtltssngt s lt/gt
- ltincomegt x lt/gtlt/gt IN Taxpayer,
- ltemployeegtltssngt s lt/gt
- ltnamegt n lt/gt lt/gt IN Employees
- CONSTRUCT ltresultgtltnamegt n lt/gtltincomegt x lt/gt
lt/gt -
44Embedding queries in data
- ltresultgt
- ltarticlesgt
- WHERE ltarticlegt lttitlegt t lt/gtltyeargt y lt/gt
- lt/gt IN www.a.b.c/bib.xml, y gt
1995 - CONSTRUCT lttitlegt t lt/gt
- lt/gt
- ltbooksgt
- WHERE ltbookgt lttitlegt t lt/gt ltyeargt y lt/gt
- lt/gt IN www.a.b.c/bib.xml, y gt1995
- CONSTRUCT lttitlegt t lt/gt
- lt/gt
- lt/gt
45Indexes for element
- XML support element-order variables.
- Example
- ltaigt lt/gt
- ltxjgt lt/gt
- here i and j are bind to an integer 0, 1, 2
that represent the index in the local order of
the edges.
46Indexes for element (graph)
root
book 0
book1
( 1 )
( 8 )
(year1998)
( 2 )
(year1995)
author3
title0
author2
( 14 )
title0
publisher1
author2
publisher1
( 9 )
( 12 )
( 10 )
( 6 )
Foundations for ...
lastname0
( 4)
( 3 )
An introduction
lastname0
name0
lastname0
( 13 )
name0
( 11 )
( 15 )
Date
Datwen
( 5 )
Addison-Wesley
( 7 )
Addison-Wesley
Date
47Indexes for element (cont.)
- Example
- Retrieves all the persons whose lastname
precedes the firstname - WHERE ltpersongt p lt/gt IN www.a.b.c/people.xml
- ltfirstname k gt x lt/gt IN p,
- ltlastnamej gt y lt/gt IN p,
- j lt k
- CONSTRUCT ltpersongt p lt/gt
48ORDER-BY
Reverse the order of all authors in a
publication WHERE ltpubgt p lt/gt IN
www.a.b.c/people.xml, CONSTRUCT ltpubgt WHERE
ltauthorkgt a lt/gt IN p ORDER-BY k
DESCENDING CONSTRUCT ltauthorgt a lt/gt WHERE lt e
gt v lt/gt IN p e ! author CONSTRUCT ltegt
v lt/gt lt/pubgt
49Web Warehousing
- http//www-rocq.inria.fr/verso/xyleme/
- www.xyleme.com
- www.cais.ntu.edu.sg/whoweda/
50Current Web
- Keywords to retrieve URLs Difficult to extract
data of interest - Query results cannot be directly processed
- Documents change on the Web expiration
- Sites are unavailableFile not found
- Wrappers
- Expensive
- Incomplete
- Short-lived, not adapted to the Web constant
changes -
51Web Warehouse Technology
- A warehouse of XML documents from the Web, Xyleme
- Stores huge quantities of data
- Warehouse is not a search engine (only index) or
a mediator (only virtual data) - Warehouse incorporates data evolution/changes
52Warehouse (Xylene) Challenges
- Data Acquisition and Maintenance Logical level
- discover data of interest and maintain it up to
date - Discover XML pages on the web that are of
interest for customers - For this crawl the web (HTMLXML, store only XML)
- Bounded resources
- Natix Repository Physical Level
- store this data and index it so that it can be
processed efficiently - Query Processing Logical level
- support efficiently an SQL-style query language
53- Semantic Integration Application Level
- Understand DTD and tags, partition the Web into
semantic domains, provide a simple view of each
domain - Change Control - Application Level
- Monitor the web and offer services such as Query
Subscription - Metadata management
54Page Scheduling
- Decide which page to read next
- discovery (read first) and refresh (read again)
- Based on
- Importance of the page
- read often important pages
- also used to order query results
- Change rate of the page
- dont read a page that is probably up-to-date
55Natix Repository
- Instead of storing each tree node in a separate
record, we store whole documents( or subtrees of
documents) together in one record. - Typical data trees may not fit on a single page.
So the data trees are distributed data over
several pages.
f1
Physical Tree
r1
p2
p1
Proxy object
h2
r3
h2
r2
Helper aggregate object
f7
f6
f5
f2
f3
f4
56Natix Repository
- Splitting a record
- A records subtree before a split
57The WHOWEDA Project
- WHOWEDA A WareHouse of WEb DAta
- To design and implement a web warehousing system
capable of effective extraction, management, and
processing of information on the World Wide Web - Data model WHOM (WareHouse Object Model)
handles XML and HTML - Change Management
- Knowledge Discovery and Mining
- Journal papers in WWW journal00, DKE01, TKDE
(under review), - Conference papers in ICDCS00, ICPADS00
DOLAP00, DASFAA99, DAWAK99, FODO98, ER98,
DEXA98, . - Under submission DKE, DPDS, CJ, KAIS
58Detecting Changes in Web Documents
- Gregory Cobena, Serge Abiteboul, Amélie Marian
Detecting Changes in XML Documents. ICDE 2002 - Sourav S. Bhowmick, Sanjay Kumar Madria, Wee
Keong Ng, Ee-Peng Lim Detecting and Representing
Relevant Web Deltas using Web Join. ICDCS 2000
246-253
59Change Control
60Overview of Whoweda Approach
- Step 1 Two snapshots of old and new relevant
data coupled from the Web using global web
coupling operation and materialized in two web
tables. - Step 2 Web join, left outer join and right outer
joined operations are performed on these two web
tables - Result is joined, left and right outer joined web
tables - Step 3 Delta web tables containing different
types of web deltas generated from these
resultant web tables.
61XML Schema Integration
- http//data.cs.washington.edu/integration/tukwila
- Denise Draper, Alon Y. Halevy, Daniel S. Weld
The Nimble XML Data Integration System. ICDE
2001 155-160 - MIX http//www.db.ucsd.edu/projects/MIX/
62 Data Integration
E-Commerce applications use data from different
sources and need to be integrated. A mediated
schema is created to represent a particular
application domain and data sources are mapped as
views over the mediated schema.
63XML Integration
- The modeling problem How to model the global
schema, the sources, and the relationships
between the two - The querying problem How to answer queries
expressed on the global schema - Query is expressed in terms of the global schema,
and the mediator reformulate the query in terms
of a set of queries at the sources - Query Plan, decomposex the query into a set of
subqueries to the sources - Subqueries are shipped to the sources, and the
results are assembled into the final answer
64 Different ways to integrate XML data
- Integrating XML documents
- Mapping of local schemas to global integrated
schema if the global schema is known or querying
the data to obtain the required global schema. - Integrating XML Schema
65Complexities in integrating XML Data
- Need to extract the schema from the document.
- Integrate the schemas obtained or perform mapping
from the individual schema documents to the
global schema if the global schema is already
present. - Parse the XML documents and integrate the data
according to the global schema. - Querying on XML documents can be done to obtain
the integrated document.
66Tukwila Data Integration System
- Uses a mediated schema to integrate data from
different sources. - User asks a query over the mediated schema and
the data Integration system reformulates the
query over the data sources and executes it. - Uses an Query Re-formulator and Optimizer to
query large amounts of data efficiently. - Map the query from the mediated schema to data
sources. - Uses an x-scan operator that can query streaming
XML data.
67Querying XML Stream of Data
- In many applications involving XML we must be
able to process queries over streams of incoming
XML data (whose content changes continuously)
without - first loading the data into a local repository
stored in some internal representation - Evaluating regular path expressions using either
index structures or join operations across the
tables or objects - X-scan matches regular path expression patterns
from the query, returning results in pipelined
fashion as the data streams across the network.
68(No Transcript)
69- WHERE ltdbgt ltlabgt
ltnamegtnlt/gt lt_gtltcitygtclt/gtlt/gt lt/gt
ELEMENT_AS l lt/gt IN "fig1.xml"
70 Tukwila x-scan operator
71 Tukwila x-scan operator contd..
72(No Transcript)
73 XML Schema Integration Model- Another Approach
- Automated integration of XML schemas -
- traditional forms of view integration and
database integration. - Integrated schema forms the basis for a valid
query over a particular set of XML documents. - Schemas to be integrated currently validate a
set of existing XML documents, data integrity and
continued document delivery are chief concerns of
the integration process.
74XSchema Integration Model
- Object-oriented data model called XSDM ( XML
Schema Data Model ) - Three-layered architecture consisting of
pre-integration, comparison and integration is
used for the integration. - A global schema meets the following criteria
completeness, minimality and understandability.
75 Three Phases of integration
Pre-Integration Element, attribute and datatype
definitions are extracted through parsing the
actual schema document. Comparison
Correspondences between elements and attributes
are determined either by using semantic learning
or using human interaction. Integration
Conflicts that exist between the corresponding
elements and/or attributes such as naming
conflicts, datatype conflicts and structural
conflicts are resolved.
76XML Schema Data Model (XSDM)
- Four structures are defined Node Object, Child
Object, Datatype Object and Attribute Object. - Node Object Represents an element, which may be
either non-terminal or terminal. - Each node represents another set of structures
that define the node Name, Namespace,
Attribute, Datatype, Substitution Group Name,
Child list and Node Type which has six types
terminal, sequence, choice, all, any or empty. - Child Object Represents an element, which is a
part of childList. Each child has structures that
define itself Name, namespace, Max Occurances,
and Min Occurances.
77XML Schema Data Model (XSDM) contd..
Datatype Object Represents datatype of elements
and attributes. The structures that define this
are Name, Variety(atomic, union, list),
Kind(simple and derived datatype), and
Constraining Facets. Attribute Object
Represents attributes associated with a
non-terminal or terminal element. The structures
that define an attribute Name, Namespace, Use,
DataType, and value(default value).
78Example
79Graphical Representation of XML Schemas
80Graphical representation of sample schema for GSE1
81Graphical representation of sample schema for GSE2
82Conflict Resolution
- Naming Conflicts
- Synonym Naming Conflict Different names but same
definition. Solved using substitution group
names. - Homonym Naming conflict Same name but different
structure. - Homonym conflicts at Non-terminals are called
structural conflicts and - at terminals they are called datatype conflicts.
83Conflict Resolution contd..
- Datatype Scale differences
- Disjoint or incompatible datatypes union
- E.g. String, integer
- Compatible datatypes scale adjustment
- E.g. Integer, float
- Enumerated datatype taking set of all the
enumerations - E.g. a,b, b,c gt a,b,c
- Scale differences constraint facet redefinition
84 Conflict Resolution contd..
- Structural Conflicts
- Type Conflicts Terminal in one schema and
non-terminal in another schema Add both to the
global schema. - Key conflicts
- If both schemas have their individual keys, then
the global schemas key should be a composite of
both the keys. - If an element is declared as key in one schema
and as a non-key in other schema, a complete
knowledge of the data present in the documents is
required. - If the same element is declared as key in both
the schemas, a prefix can be added to the keys to
make the key elements unique globally.
85 Integration phase
- Constructing correspondences table - contain the
information about the corresponding
elements/attributes. - Constructing dependencies table The
elements/attributes are integrated only after
their dependencies are integrated.
86Graphical representation of Global schema obtained
87Indexing XML Data
- Xset
- Region algebra
- Indexes for Arbitrary Semistructured Data
- T-indexes
- Index Fabric (VLDB01)
- Resources
- Index Structures for Path Expressions by T. Milo
and Suciu, in ICDT'99 - XSet description http//www.openhealth.org/XSet/
- Data on the Web Abiteboul, Buneman, Suciu
section 8.2
88The Data
- Semistructured data instance a large graph
89The queries
- Regular expressions (using Lorel-like syntax)
SELECT X fROM (Bib..author).(lastnamefirstname).
Abiteboul X
Select x from part._.supplier.name x
Requires to traverse data from root, return all
nodes x reachable by a path matching the given
path expression.
Select X From part._.supplier name X,
address Philadelphia
Need index on values to narrow search to parts of
the database that contain the string
Philadelphia.
90Analyzing the problem
- What kind of data
- tree data (XML) easier to index
- graph data used in more complex applications
- What kind of queries
- restricted regular expressions (e.g. XPath) may
be more efficient - arbitrary regular expressions rarely encountered
in practice
91XSet a simple index for XML
- Part of the Ninja project at Berkeley
- Example XML data
92XSet a simple index for XML
- Each node a hashtable
- Each entry list of pointers to data nodes (not
shown)
93XSet Efficient query evaluation
(R1) SELECT X FROM part.name X
-yes (R2) SELECT X FROM part.supplier.name X
-yes (R3) SELECT X FROM .supplier.name X
-maybe (R4) SELECT X FROM part..subpart.name X
-maybe
- To evaluate R1, look for part in the root hash
table h1, follow the link to table h2, then look
for name. - R4 following part leads to h2 traverse all
nodes in the index (corresponding to ), then
continue with the path subpart.name. - Thus, explore the entire subtree dominated by h2.
- Will be efficient if index is small and fits in
memory - R3 leading wild card forces to consider all
nodes in the index tree, resulting in less
efficient computation than for R4. - Can index the index itself.
- Retrieve all hash tables that contain a supplier
entry, continue a normal search from there.
94Region Algebras
- Structured text text with tags (like XML)
- Powerful indexing techniques
- Critical limitationordered data only (like text)
- Assume data given as an XML text file, and
implicit ordering in the file. - Less critical limitation restricted regular
expressions
95Region Algebras Definitions
- data sequence of characters c1c2c3
- region segment of the text in a file
- representation (x,y) cx,cx1, cy, x start
position, y end position of the region - example ltsectiongt lt/sectiongt
- region set a set of regions s.t. any two
regions are either disjoint or one included in
the other - example all ltsectiongt regions (may be nested)
- Tree data each node defines a region and each
set of nodes define a region set. - example region p2 consisting of text under p2,
set p2,s2,s1 is a region set with three regions
96Representation of a region set
- Example the ltsubpartgt region set
- region algebra operators on region set,
- s1 op s2 defines a new region set
97Region algebra some operators
- s1 intersect s2 r r? s1, r ?s2
- s1 included s2 r r?s1, ?r ? s2, r ? r
- s1 including s2 r r? s1, ?r ? s2, r ? r
- s1 parent s2 r r? s1, ?r ? s2, r is a
parent of r - s1 child s2 r r? s1, ?r ? s2, r is child of
r
Examples ltsubpartgt included ltpartgt s1, s2,
s3, s5 ltpartgt including ltsubpartgt p2,
p3 ltnamegt child ltpartgt n1, n3, n12
98Efficient computation of Region Algebra Operators
- Example s1 included s2
- s1 (x1,x1'), (x2,x2'),
- s2 (y1,y1'), (y2,y2'),
- (i.e. assume each consists of disjoint regions)
- Algorithm
- if xi lt yj then i i 1
- if xi' gt yj' then j j 1
- otherwise print (xi,xi'), do i i 1
- Can do in sub-linear time when one region is
very small
99From path expressions to region expressions
- Use region algebra operators to answer regular
path expressions - Only restricted forms of regular path expressions
can be translated into region algebra operators - expressions of the form R1.R2Rn, where each Ri
is either a label constant or the Kleene closure
.
part.name name child (part child
root) part.supplier.name name child (supplier
child (part child root)) .supplier.name
name child supplier part..subpart.name name
child (subpart included (part child root))
Region expressions correspond to simple XPath
expressions
100From path expressions to region expressions
- Answering more complex queries
- Translates into the following region algebra
expression - Philadelphia denotes a region set consisting of
all regions corresponding to the word
Philadelphia in the text. - Such a region can be computed dynamically using a
full text index. - Region expressions correspond to simple XPath
expressions
Select X From .subpart name X,
.supplier.address Philadelphia
Name child (subpart includes (supplier parent
(address intersect Philadelphia)))
101Indexes for Arbitrary Semistructured Data
- A semistructured data instance that is a DAG
102Indexes for Arbitrary Semistructured Data
- The data represents employees and projects in a
company. - Two kinds of employees programmers and
statisticians - Three kinds of links to projects leads,
workson, consultants - Index graph reduced graph that summarizes all
paths from root in the data graph
103- Example node p1 paths from root to p1 labeled
with the following five sequences - Project
- Employee.leads
- Employee.workson
- Programmer.employee.leads
- Programmer.employee.workson
- Node p2 paths from root to p2 labeled by same
five sequences - p1 and p2 are language-equivalent
104Indexes for Arbitrary Semistructured Data
- For each node x in the data graph,
- Lx w ? a path from the root to x
labeled w - ?x,y x ? y ? Lx Ly
-
- x y x ? y
- Nodes(I) x
x ? nodes(G) - I
- Edges(I) x
y x? ? x, y? ? y, x? y?
105Indexes for Arbitrary Semistructured Data
- We have the following equivalences
- e1 ? e2
- e3 ? e4 ? e5
- p1 ? p2
- p3 ? p4
- p5 ? p6 ? p7
106Indexes for Arbitrary Semistructured Data
- Computing path expression queries
- Compute query on I and obtain set of index nodes
- Compute union of all extents
- Returns nodes h8, h9.
- Their extents are p5, p6, p7 and p8,
respectively - result set p5, p6, p7, p8
- Always size(I) ? size(G)
- Efficient when I can be stored in main memory
- Checking x ? y is expensive.
Select X From statistician.employee.(leadsconsult
s) X
107T-Indexes
- Milo Suciu ICDT 99
- 1-index
- data graph
- arbitrary regular expressions
- 2-index, T-index for more complex queries,
consisting of more regular expressions.
108T-Indexes
- T-index template index
- Trades space for generality
- The class of paths associated with a given
T-index is specified by a path template - Example 1 x y. Here can
be replaced by any regular expression. - Example 2 (.Restaurant) x y. The first
regular expression is fixed this T-index takes
less space but is less general. - T-indexes can be generated efficiently.
- The size of a T-index associated to a single
regular expression is at most linear in that of
the database
P
P
P
P
1091-Indexes
- Database DB (V,E,Roots), V is finite set of
nodes, E is a set of labeled edges, R is a set of
root nodes. - Regular path expressions
- P ? ? ƒ (PP) (P.P) P. where ƒ
are formulas defined over predicates p1, p2,on
the set of data values. - A path expression p v0 ? v1 ? v2vn-1 ? vn
- Queries regular path expressions q(DB)
- A query path is an expression of the form
- P1 x1 P2 x2 Pn xn, xi variable names,
Pis path expressions - A query has the form
- Select x1, x2, , xn from P1 x1 P2 x2 Pn xn
a1
a2
an
1101-Indexes
- Path template t T1 x1 T2 x2 T3 x3, Ti a
regular expression or or - Instantiating query paths
- Query path q instantiating and
by regular path expression and some formula,
respectively, in template t - Example path template t (.Restaurant) x1
x2 Name x3 x4 - Query path instantiations
- q1 (.Restaurant) x1 x2 Name x3 Fridays
x4 - q2 (.Restaurant) x1 x2 Name x3 _ x4
( _ is a predicate with True) - q3 (.Restaurant) x1 ( ? _ ) x2 Name x3
Fridays x4
P
F
P
F
P
F
1111-Indexes
- Goal compute efficiently queries q ? inst(
x) - A first attempt
- ?u?V. Lu a1an v0 ? ? vn ?DB, v0?Root,
vnu - ?u,v?V. u ? v ? Lu Lv
- ?u?V. u v u ? v
P
a1
an
1121-Indexes
- Nodes(I) u u in nodes(DB)
- Edges(I) u ? u? ?u ? u, ?u? ? u?,
(u ? u?) ? Edges(DB) - Roots(I) r r ? roots(DB)
a
a
I
q(DB) u ? u? ? q(I), u ? u
Example
Inefficient construction cost
1131-Indexes
x ? y ? z, but x ?s y ?s z and x ?b y ?b z
114Analyzing1-Indexes
- Storing I-index
- Associate an oid s to each node in I
- Store graph I in standard form
- Store for each node s, extent(s)
- Extent(s) v s is an oid for v
- Always size(I) lt size(DB) (unlike Dataguide)
- Always can compute in O(nlogn) time nsize(DB)
- When DB is a tree ?b , ?s , ? coincide
- no penalty for ?b , ?s
- 1-index Dataguide XSet
115Analyzing1-Indexes
- Do we have size(I) ltlt size(DB) ? No. Two worst
cases - Facts
- in theory except for these two DBs, size(I) ltlt
size(DB) - in practice its a different story. Experiments
size(I) ? 1/3 size(DB)
116Evaluating Query Paths with 1-indexes
- Example query q t.a x
- The evaluation of q follows two paths t.a in I
rather than five in DB and unions their extents
7,13 ? 8,10,12 - The extents in strong data guide overlap, hence
storage may be larger
1172-Indexes
- Database DB (V, E, Roots)
- Queries select x1, x2 from x1 P x2, with P a
regular path expression - Template x1 x2.
- Find pairs of nodes (x1, x2)
- L(u,v) a1 an u ? ? v in DB
- (u,v) ? (u?,v?) ? L(u,v) L(u?,v?)
- Since computing ? is expensive, refinements ? are
used instead that satisfy - (v,u) ? (v?,u?) ? (v,u) ? (v?,u?)
P
a1
an
1182-Indexes
- Nodes(I) (u,v) u,v ? Nodes(DB)
- I2 Roots(I) (u,u) u ?
Nodes(DB) - Edges(I) (u,v) ? (u,v?) v ? v? ?
Edges(DB) - Storing I2
- The graph
- Extent(s) (v,u), for each node s representing
the equivalence class (v,u) - L(v,u)(DB) L(v,u)(I2),
- L(v,u)(DB) represents paths between v and u
- L(v,u)(I2) represents the paths in the 2-index
I2, between some root of the index and (v,u) - Query evaluation
- To compute select x, y from x P y, we compute
the query path P y on I2 and take the union of
the extents. - This saves the search, but may have to start at
several roots in I2, which is only one in case of
acyclic databases
a
a
1192-Index Example
- Cost size(I) ? O(n2)
- May be less in practice, similar to PAT trees
(Patricia tree) for text databases
120Querying and Storing XML Views of Relational Data
- eXcelon The XML application development
environment. - http//www.odi.com/excelon/main.htm.
- Object Design's eXcelon XML repository maps XML
documents to objects and supports queries over
them based on the XQL language. - Jayavel, Jerry, et al. Querying XML Views of
Relational Data, VLDB01) - www.microsoft.com/XML
- Oracle 8i and DB2 supports XML views
- Stored and Silkroute
121Motivation
- Most web data will continue to be stored in
relational databases (more than 90) - Need some way to convert relational data to XML
- XPERANTO (IBM) allows existing relational data to
be viewed and queried as XML
122Web Services Example
Supplier provides an XML View of its Data
XQuery over Catalog
Internet
Buyer
XQuery Result
XQuery
XQuery Result
Application CodeConvert XQuery toSQL Query
Application CodeConvert RelationalData to XML
Supplier
SQL Query
SQL Result
Relational Database
123High-Level Architecture
XQuery Query
Query Result
XPERANTO
XQuery to SQLConverter
Tagger
SQL Query
SQL Result
Relational Database
push data- and memory-intensive computationdown
to relational engine
124Example Relational Data
order
id custname custnum
10 Smith Construction 7734
9 Western Builders 7725
item
payment
oid desc cost
10 generator 8000
10 backhoe 24000
oid due amt
10 1/10/01 20000
10 6/10/01 12000
125Default XML View
ltdbgt ltordergt ltrowgt ltidgt10 lt/idgt
ltcustnamegt Smith Construction lt/custnamegt
lt/rowgt ltrowgt ltidgt 9 lt/idgt
ltcustnamegtWestern Builders lt/custnamegt lt/rowgt
lt/ordergt ltitemgt ltrowgt ltoidgt 10
lt/oidgt ltdescgt generator lt/descgt ltcostgt 8000
lt/costgt lt/rowgt ltrowgt ltoidgt 10 lt/oidgt
ltdescgt backhoe lt/descgt ltcostgt 24000 lt/costgt
lt/rowgt lt/itemgt ltpaymentgt
similar to ltordergt and ltitemgt
lt/paymentgt lt/dbgt
126XML View for Purchase Order
ltorder id10gt ltcustomergt Smith
Construction lt/customergt ltitemsgt
ltitem descriptiongenerator gt
ltcostgt 8000 lt/costgt lt/itemgt
ltitem descriptionbackhoegt
ltcostgt 24000 lt/costgt lt/itemgt
lt/itemsgt ltpaymentsgt
ltpayment due1/10/01gt
ltamountgt 20000 lt/amountgt lt/paymentgt
ltpayment due6/10/01gt
ltamountgt 12000 lt/amountgt
lt/paymentgt lt/paymentsgtlt/ordergt
127Creating an XPERANTO View
create view orders as ( for order in
view(default)/order/row return ltorder
idorder/idgt ltcustomergt
order/custname lt/customergt
ltitemsgt lt/itemsgt
ltpaymentsgt
lt/paymentsgt lt/ordergt)
for item in view(default)/item/row where
order/id item/oid return ltitem
descriptionitem/desc gt
ltcostgt item/cost lt/costgt lt/itemgt
for payment in
view(default)/item/row
where order/id payment/oid
return ltpayment duepayment/dategt
ltamountgt payment/amount lt/amountgt
lt/paymentgt
sortby(_at_due)
128Query for querying XML View
Get all orders of customer Smith
for order in view(orders)where
order/customer/text() like Smith return order
129Query Processing in XPERANTO
XQuery
Query Result
XPERANTO Query Engine
XQuery Parser
XQGM
Query Rewrite View Composition
XQGM
Computation Pushdown
TaggerRuntime
Tagger Graph
SQL Query
Tuples
RDBMS
130XQGM
- Intermediate representation
- General enough to capture semantics of a powerful
language such as XQuery - Easy translation to SQL
- XQGM based on
- Borrows from other work on XML algebras
- An extension of DB2s QGM
131XQGM (contd.)
- XQGM consists of
- Operators
- Functions (invoked inside operators)
- Operators capture manipulation of relationships
(similar to relational operators) - Functions capture manipulation of XML entities
(elements, attributes, etc.) - XML construction functions
- XML navigation functions
132for order in view(orders)where
order/customer/text() like
Smith return order
133Updating XML
- XML stored in relations relational data
published in XML. - Need to support updates of XML data stored in
relations
Igor Tatarinov, Ives, Halvey et al. , Updating
XML , ACM Sigmod 2001
134Example
- ltbookdbgt
- ltbook publisherphgt
- lttitlegtA First Course in Database
Systemslt/titlegt - ltauthorgt
- ltnamegtUllmanlt/namegt
- lt/authorgt
- ltauthorgt
- ltnamegtWidomlt/namegt
- lt/authorgt
- lt/bookgt
- ltpublisher codephgt
- ltnamegtPrentice Halllt/namegt
- lt/publishergt
- lt/bookdbgt
135bookdb
publisher
book
publisherph
codeph
IDREF
name
title
author
author
A First Course
Prentice Hall
name
name
Ullman
Widom
XML Data Tree
136XML Update Operations
- Update primitives
- Delete (child) (recursive)
- Insert (content)
- Replace (delete insert not efficient)
- Move (copy delete not always feasible)
- Multiple operations in a statement
- Nested updates
- Focus on semantics rather than syntax!
137Querying XML Data
- Xquery - Does not support updates
FOR b IN document(bookdb.xml)/book WHERE
b/author/name Bernstein RETURN b
- Entire book elements are retrieved!
138Multiple Update Operations
FOR book IN document(bookdb.xml)/book, pric
e IN book/price, WHERE price gt 100 UPDATE
book REPLACE price WITH price0.50 INSERT
ltcloseout/gt
- Need to DELETE/INSERT/REPLACE in a single update!
- Single UPDATE in SQL (when NULLs are used)
139Semantic Issues in XML Updates
- IDs and IDREFs
- Cant duplicate XML IDs
- Cant leave dangling references
- Non-deterministic updates
- There is more than one way (XPath expression) to
get to an XML element - Would like to detect it at compile time
140Deletion IDREFs
- Solutions
- Dont allow delete at all
- Remove incoming refs
- Delete entire referrers
- Reattach to new parent? (schema violation likely)
141Insertion (Copying) IDs IDREFs
- Cant duplicate IDs
- Elements with IDs cant be copied
- XML IDs are created by user/app gt cant generate
new IDs - OK to move (or copy into a different document)
- Can copy IDREFs
142Ambiguous Updates
- Multiple paths to the same element gt
non-deterministic update - Solution constrain the language
- Update operations can only modify the children of
the element being updated - Limit XPath expressions that can be used
Book/Price 10 Book/Price 0.8
book
IDREF
143Implementing XML updates over a relational store
- Map XML schema (DTD) to relations
- Shared Inlining method SGT99
- In our example
- Book(tuple_id, title, price)
- Author(parent_id, name)
- tuple_id and parent_id link child tuples to their
parents - Tuple ids are different from XML IDs!!
- Map XQuery updates to SQL
- Minimize the number of SQL statements
144Deletion Methods
- Trigger-based
- Using per-tuple triggers
- Using per-statement triggers
- Cascading delete
- Using Access Support Relations (ASRs)
145Per-tuple Trigger-based Delete
- Per-tuple trigger on table Book
- DELETE FROM Author
- WHERE parent_id deleted.tuple_id
- Can use index fast !
146Per-stmt Trigger-based Delete
- Per-statement trigger on Book
- DELETE FROM Author
- WHERE parent_id NOT IN
- (SELECT tuple_id FROM Book)
- Have to scan entire Author table - slow!
147Insertion Methods
- Copying is the hard (interesting) case
- Challenge assign new tuple ids so that the new
parent/child pairs are linked correctly
148Storing Semistructured Data with STORED
- Technique to store and manage semistructured data
with relational database systems - STORED - Semistructured TO Relational Data
- Apply STORED to XML data as an instance of
semistructured data
149Introduction
- Lorel and Tsimmis store data as graphs
- Strudel stores data as structured text externally
and as graph internally - XML data is stored along with tags that encode
the schema - Advantages quick new data load, seamless changes
of old data structures - Disadvantages space cost, processing time cost,
no use for commercial RDBMS
150STORED Usage
- Aggressive mapping from semistructured to
relational models (includes query mapping) - Overflow graphs store parts of semistructured
data not fit in the schema - Usage 1 to store and manage existing
semistructured data - Usage 2 to convert relational sources into
semistructured format (XML for instance)
151Contribution
- STORED storage mapping of semistructured to
relational data overflow graphs - Algorithm for constructing relational schema and
STORED mapping - Algorithm for generating overflow mapping for a
given relational mapping (can exploit DTD) - Query and update rewriting algorithm
152(No Transcript)
153(No Transcript)
154Simple Storage Queries
- M1 FROM Audit.taxpayer X
- name N,
- addr street S, zip Z
-
- STORE Tpr1(X, N, S, Z)
- M2 FROM Audit.taxpayer X
- name N,
- addr street S, zip Z ,
- OPTaudited A,
- OPTtaxamount T
-
- STORE Tpr3(X, N, S, Z, A, T)
- All variables (except intermediate) in the FROM
clause must be used in STORE clause - Optional attributes can also be stored
155Multiple Attributes
- M1 FROM Audit.taxpayer X
- name N,
- audited A1,
- OPT audited A2
-
- STORE Taxpr7(N, A1, A2)
- M2 FROM Audit.taxpayer X
- name1 N,
- audited1 A1,
- OPT audited2 A2
-
- STORE Taxpr7(N, A1, A2)
- Objects may have multiple occurrences of the same
attribute - STORED queries are rewritten with indexed
attribute names
156Label Variables
- PhoneBook
- John phone03-6540000, fax 09-7659900,
- Joe phone 06-6542311,fax 08-8648000
-
- FROM PhoneBook.L X
- phone P, fax F
-
- STORE R(X, L, P, F)
- Possible to store data as attributes
- Label variables in STORED are stored in relations
as values
157Overflow Queries
- M1 FROM Audit.taxpayer X
- name N,
- addr street S, zip Z, O _ ,
- OVERFLOW G(O)
-
- STORE Taxpr1(X,