Web data management and distribution XQuery processing presentation

About This Presentation

Transcript and Presenter's Notes

Title: Web data management and distribution XQuery processing

1
Web data management and distributionXQuery
processing
S.Abiteboul, I.Manolescu, P.Rigaux,
P.Senellart Gemo group, INRIA SaclayÎle-de-France
2
Context XML query processing

Several possible flows
Data comes from persistent (disk-based) storage
First load, then query
Query processing "at first sight"
Data is queried when it is first seen
Not our topic LMP02, FSC03, BCF03, FHK03

3
XML query processing scenarios
Loading
endElem("XML") ... beginElem("XML")
ltXML docgt
ltXML docgt
Query
In-memory data structures
Java / Lisp / CAML...
Querying
Query
Result
Result
Result
4
XML query processing scenarios (1/2)

"Persistent store"
Logging / archiving an ongoing activity
Clients, orders, products...
Structured text (documentation, news, image
annotations, scientific data...)
Warehousing XML

"At first sight"
Fast processing of incoming documents
Web service messages
Workflow coordination
Many small documents to process
In-memory, programming language approach feasible

5
XML query processing scenarios (2/2)

"Persistent store"
Heavier
Needs loading
All DBMS goodies
Set-at-a-time processing
Query optimization
Persistence
Transactions
Concurrence control
View-based management...

"At first sight"
Lighter
May blend easily into a programming framework
Data marshalling is a pain
In real life, there are not just databases

Streaming (stack-based) evaluation of tree
pattern queries

7
Streaming processing of tree pattern queries
XML document ltrgt lta1gtltb1gt
ltc1/gt ltc2/gt lt/b1gt
ltb2gt ltc3/gt
lt/b2gt lt/a1gt lta2gt ltc4/gt
ltb3/gt lt/a2gt lt/rgt
Query //a/b/c
Create 1 stack per query node Stacks are
connected following the query structure
8
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16
17 18 19 20
XML document ltrgt lta1gtltb1gt
ltc1/gt ltc2/gt lt/b1gt
ltb2gt ltc3/gt
lt/b2gt lt/a1gt lta2gt ltc4/gt
ltb3/gt lt/a2gt lt/rgt
Query //a/b/c
SAX traversal
Traverse the document sequentiallyand issue
events Start element x / end element x / text s
9
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16 17 18
19 20 21 22
Query //a/b/c
Traverse the document sequentiallyand issue
events Start element x / end element x / text s
10
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
When pushed, matches are open
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
11
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b1
12
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b1
c1
13
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
After end element, a match is closed
a1
b1
c1
14
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b1
c2
c1
15
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b1
c2
c1
16
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b1
c2
c1
17
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a1
b2
b1
c3
c2
c1
18
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc.
There is no open b match
a2
b2
c3
a1
c2
b1
c1
19
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
a2
b3
b2
c3
a1
c2
b1
c1
20
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc
No child in the c stack!
a2
b3
b2
c3
a1
c2
b1
c1
21
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc.
No child in the b stack!
a2
b2
c3
a1
c2
b1
c1
22
Streaming processing of tree pattern queries
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b/c
On begin element x If there is a stack for
x Then if the element appears in the
right context then push it on the
stack connect it to the
parent matchOn end element x If there
is a stack for x Then if x is on top of the
stack then if x lacks some required
children then pop x, possibly
some desc.
b2
c3
a1
c2
b1
c1
23
Variations capturing element content
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b
a1
ltb1gt
ltc1
/gt
ltc2
Sa
b1
/gt
lt/b1gt
Sb
24
Variations capturing element content
ltrgtlta1gtltb1gtltc1/gtltc2/gtlt/b1gtltb2gtltc3/gtlt/b2gtlt/a1gtlta2gtlt
c4/gtltb3/gtlt/a2gtlt/rgt 1 2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20
Query //a/b
a1
ltb1gt
ltc1
/gt
ltc2
Sa
b1
/gt
lt/b1gt
Sb
25
Other variations

Testing value predicates on nodes do not push
unless predicates are satisfied
// edges no change
-labeled query nodes use S, on which any node
of the right context can be pushed
Optional query nodes do not prune parent match
if a child match is lacking

26
Complexity

Time linear in the size of the document
Space
Number of stacks number of query nodes
Maximal stack height maximal depth of matches
which are ancestors of one another lt document
depth
If string results are returned, string buffers
may be large!
In some cases the string buffers can be flushed
as we go
Interesting for small-to-moderate documents, not
very selective patterns
Not interesting for large documents with highly
selective patterns

27
Processing XML queries in a persistent store
roadmap
28
Roadmap

Part 1 Query processing stages
Part 2 Storing XML documents in a persistent
database
Requirements
Storage models
Impact on query processing
Part 3 Some physical query execution methods
Part 4 An algebra for XQuery

29
Part 1 query processing stages
30
Query processing stages in a persistent database
First stages in query processing query parsing
optimization
Query parser
Query optimizer
Logical query plan
Logical query plan
XQuery
Abstract Syntax Tree
Physical query plan
Logical query plan
Logical query plan
31
Query processing stages in a persistent database
Last stage in query processing query execution
Execution engine
Physical query plan
Storage system
32
Comparison with relational databases
33
Part 2Storing XML documents in a persistent
database
34
2.0 Requirements for an XML storage method
35
Requirements for an XML storage method

Completeness
Must preserve all information content of the
document
Amenable to efficient processing
Navigation queries benefit from fragmentation
Reconstruction queries suffer from fragmentation
Flexible
Generic
Must not require precise schema information
Human-understandable

36
Sample XML document
auctions
item
item
id"item1"
id"item2"
name
description
comment
name
description
"Gold pin"
"Surveillance camera"
"Remarkable dragon brooch"
text
parlist
text
parlist
"8 channel"
listitem
listitem
"Art Nouveau gold pin"
listitem
parlist
text
text
listitem
listitem
"also pendant"
"Video recorder, 8 dome cameras"
text
text
"Large diamond"
"Circa 1900"
37
Nodes and node identity

7 types of nodes document, element, attribute,
text, namespace, processing instruction, comment
XQDM
Element, attribute, namespace, PI nodes have a
unique identity
(ElemID, attr name) determine attr. value ? key
issue is element identity
In-memory processing "the pointer is the ID"
Persistent stores must materialize some
persistent IDs (not necessarily for all elements)

38
Assigning persistent IDs to elements
1
auctions
item
item
id"item1"
id"item2"
2
16
17
15
name
description
comment
name
description
18
4
3
"Gold pin"
"Surveillance camera"
"Remarkable dragon brooch"
text
parlist
19
20
5
text
parlist
6
"8 channel"
7
listitem
"Art Nouveau gold pin"
listitem
13
21
listitem
parlist
8
text
14
text
22
9
11
listitem
listitem
"also pendant"
"Video recorder, 8 dome cameras"
10
text
12
text
"Large diamond"
"Circa 1900"
39
Data values
...

Text nodes
Level 0 bunch ofstrings
Level 1 strings, numbers,booleans
Level 2 bags of words, numbers, boolean
This is still a simplification

item
id"item2"
price
name
description
"200"
"Surveillance camera"
text
parlist
"8 channel"
listitem
text
"Video recorder, 8 dome cameras"
40
Document structure relationships among nodes

Level 0 store parent-child relationships
Given a node, it must be possible to find
Its children
Its parent
Parent-child relationships between elements
"Ownership" relationships between an element and
an attribute
"Text value" relationships between elements and
text
Elements may have several text children

41
Document structure relationships among nodes
1
auctions
item
item
id"item1"
id"item2"
2
16
17
15
name
description
comment
name
description
18
4
3
"Gold pin"
"Surveillance camera"
"Remarkable dragon brooch"
text
parlist
19
20
text
parlist
5
6

Element 1 is parent of elements 2 and 16
Element 2 has the attribute id"item1"
Element 3 has the text child "Gold pin"

42
Document structure order and names

Nodes in an XML document appear in a well-defined
total order
It must be possible to retrieve this order

...
item
2
Item name before item description Also node
names
15
name
description
comment
4
3
"Gold pin"
"Remarkable dragon brooch"
5
text
parlist
6
"Art Nouveau gold pin"
7
listitem
listitem
8
43
Storage completeness summary

Need to store
Node identity and order
(Typed) data values
Document structure invariants particular
instances
Many invariants is good (regular data)... but
they should remain small (to handle easily)
DTDs, XML Schemas are there, but do not express
all desirable constraints
Complex constraints require special care for
updates

44
Where (most) persistent storage systems stand

Provide selective data access
Provide set-at-a-time execution primitives, for
a language whose formal semantics is defined
tuple-at-a-time
Advantage performance
Selective data access
Set-at-a-time processing allows for better disk
access locality
Hash join vs. nested loops join
Naive navigation-based tuple-at-a-time path query
processors have complexity exponential in the
size of the query GKP02 set-at-a-time scales
better.

45
Where (most) persistent storage systems stand

Provide set-at-a-time execution primitives, for
a language whose formal semantics is defined
tuple-at-a-time
Disadvantages
Must make explicit effort to prove correctness
Some features may not be doable this way
But they do not seem to be the essential ones

46
2.1Persistent storage models for XML
47
Storage models for XML

They are determined by
data model tuples or trees
fragmentation strategy choice of invariant
Choose some property node name, node path,
Group together all tuples/trees that have the
same value for the same property E.g. table A
contains all A elements E.g. collection C1 has
all trees on path /A/B
Store each group in a separate structure

48
Problems related to the choice of fragmentation
(1)

Access path selection choosing the best way to
retrieve from disk a given data set
Parameters data model, available storage
structures
Relational scan vs. index-based access
Object navigation vs. scanning class extents

49
Problems related to the choice of fragmentation
(2)

View-based query rewriting reformulating a query
based on a set of (materialized) views
Parameters data model, view and query languages
Local-as-view data integration

Mediator schema
view definitions
Source schema 2
Source schema 1
50
Problems related to the choice of data
fragmentation (3)

Materialized view/index selection choosing a set
of views to materialize to speed up certain
queries (supposes view-based query rewriting)
Parameters query language, query workload,
space
Materialized view selection in a data warehouse
all sales of beer, or sales of beer in May,
or sales of beer in France by brand...
Supported in current-release relational DBMSs
For XML ?

51
Problems related to the choice of data
fragmentation (4)

Automatic index and view selection choosing a
set of views and indexes to speed up certain
queries
Parameters query language, query workload,
space
Index tuning wizards (SQL Server, DB2)

52
The simplest store no fragmentation(introduced
for OEM PGW95 )

OEM Object exchange model
Labeled, directed, unordered graph of objects
Objects have unique identity
Atomic objects values (simple atomic types)

53
Storing OEM objects in LORE MAG97

Objects clustered in pages in depth-first order,
including simple value leaves
Basic physical operator Scan(obj, path)

Auctions
item
item
open_auctions
object
object
comment
name
description
auction
description
auction
name
...
bids
bids
initial
initial
...
...
...
...
54
Navigation in a persistent graph

Navigation-based, tuple-at-a-time, pointer-chasing

Scan(Auctions, "item")
2 pages accessed
Auctions
item
item
open_auctions
object
object
comment
name
description
auction
description
auction
name
...
bids
bids
initial
initial
...
...
...
...
55
Navigation in a persistent graph

Scan(Auctions, "item.description")

4 pages accessed
4 pages accessed
Scan(Auctions, "open_auctions.auction.object")
Auctions
item
item
open_auctions
object
object
name
comment
description
auction
description
auction
name
...
bids
bids
initial
initial
...
...
...
...
56
Indexing objects in a graph MW97,MWA98,MW99a,MW9
9b

VIndex(l, o, pred) all objects o with an
incoming l-edge, satisfying pred
LIndex(o, l, p) all parents of o via an l-edge
"Reverse pointers"
BIndex(x, l, y) all edges labeled l

select X from Auction.open_auctions.auction
X where X.initial lt 10
tuple at a time
bulk access
57
Indexing objects in a graph MW97

PIndex(p, o) all objects o reachable by the path
p

select X from Auction.open_auctions.auction.initia
l X where X.initial lt 10
Set at a time
Tuple at a time
58
The idea behind path indexes DataGuides GW97
59
The idea behind path indexes DataGuides GW97

Graph-shaped summaries of graph data
Invariants extracted from the data ("a posteriori
schema")
Groups all nodes reachableby the same paths

Auctions
oid1
open_auctions
item
oid12
oid2,oid15
object
auction
comment
description
oid15,oid16
name
initial
bids
oid40,oid41
oid22,oid23
oid24,oid25
oid30,oid31
oid20,oid21
...
...
60
More on graph indexing

Graph indexing
Partition nodes into equivalence classes
Store the extent of each equivalence class, use
it as "pre-cooked" answer to some queries
Equivalence notions
Reachable by some common paths DataGuide MW97
Reachable by exactly the same paths 1-index
MS99 or, equivalently,indistinguishable by any
forward path expression
Indistinguishable by any (forward and backward)
path expression FB Index ABS99,KBN02
Indistinguishable by the (forward and backward)
path expressions in the set Q covering index
KBN02
Indistinguishable by any path expression of
length lt k A(k) index KSB02

61
FB index

Group together nodes reachable by exactly the
same paths
Path language
Navigate along one edge in both directions
Navigate along any number of edges, in both
directions
n1 n2 for any path expression p, either n1 and
n2 are in the answer of p, or neither are in the
answer of p.

1
a
a
a
a
Data graph
FB index
2
4
3
5
b
c
b
b
6
5
7
8
62
Summary persistent graph / tree storage and
indexing

Very simple storage models
Quite simple value indexing MWA98
Multiple graph schema/index structures
Identify invariants / regularity / interesting
node groups
Use interesting node groups
Simplify path queries
Basis for indexing
Store IDs of all nodes in an interesting group.
Access them directly (avoid navigation).
Formalism behind it bisimulation ABS99

63
2.2Persistent storage models for XMLthe
relational shredding option
64
Shredding XML documents in relational databases
XML document
Load in RDBMS
Recompose document

Query processing
Translate XQuery in SQL
Run SQL query
Serialize XML result

RDBMS
ACID properties
Performance

65
Relational storage models for XML

It must be possible to store any document
Documents may have an XML schema
Relational schemas seek to
Exploit / identify structural invariants
"Tables for fixed or frequent structures"
Factorize or avoid storage of labels
Factorize storage of values
Compared to graph storage, relational XML storage
is "already" partitioning / organizing the data
(even without index)

66
Classification of relational storage schemes for
XML documents

No XML schema assumption ( "Tree OEM")
Generic same relational schema for all
Derive relational schema from the data
Based on an XML schema
Based only on an XML schema
Based on schema and cost information
User-defined

67
Simplest relational model "Edge" FK99
auctions
1
item
2
id"item1"
4
description
3
comment
"Remarkable dragon brooch"
parlist
5

6
listitem
listitem
10
7
parlist
text
General (no schema or queries used) No regularity
assumed ID may reflect document order Index on
pID, and (name,target)
listitem
listitem
8
9
"also pendant"
text
text
"Large diamond"
"Circa 1900"
68
Path query processing on "Edge"
//item //comment/text() //item_at_id"item1"/
description
select target from Edge where name"item" select
target from Edge where name"comment" select
e3.target from Edge e1, Edge e2, Edge
e3 where e1.name"item" and
e1.targete2.pID and e2.name"id" and
e2.target"item1" and
e1.targete3.pID and e3.name"descr"
Actual query answering also requires
reconstructing the element... will discuss that
later
3-way join on the Edge table
69
Partitioned "Edge"

EdgeAuction(pID,ord,target)
EdgeItem(pID,ord,target)
EdgeID(pID,ord,target)
EdgeDescription(pID,ord,target)

Similar to graph index,but this is the storage
Interesting groups of nodes those with the same
label Store tags in schema, not in data Some code
on the side keeps the mapping between tags and
table names
3-way join on (much) smaller tables
70
Other storage schemes derived from "Edge"

Universal relation FK99EdgeAuction
EdgeItem EdgeID ...
Query-dictated materialized views over the
partitioned Edge tablesEdgeAuction
EdgeItemEdgeAuction EdgeID
Materialized view selection problem

71
Relational schemas based on frequent tree
structures DFS99

Analyze elements to find frequent patterns
Create one table for each type with enough
support in the databaseE.g. one relation for
articles with up to 3 author
Stores the remaining elements in overflow
tablesE.g. the remaining article authors
Materialized views over partitioned Edge, based
on pattern frequency
Performance benefits less joins for many queries
"Interesting node groups" are frequent ones
Most tags stored in the schema, not in the data

72
Path partitioning SKV00

1 relation per root-to-leaf data path Path(id1,
id2, ord)
1 relation per root-to-inner node data path
Val(id, val)

auctions
1
auctions (id1, id2, ord) auctions.item(id1, id2,
ord) auctions.item.id(id1, id2,
ord) auctions.item.name(id1, id2,
ord) ... auctions.item.id.val (id,
val) auction.item.name.val (id, val) ...
item
id"item1"
2
4
3
5
7
name
description
comment
15
6
"Gold pin"
"Remarkable dragon brooch"
16
10
text
parlist
8
11
listitem
listitem
"Art Nouveau gold pin"
12
9
73
Path partitioning SKV00

(Very) similar to graph indexes
IDs are stored once per descendent
Values stored once
No join required for linearpath expressions
wildcards
Dataguide is neededto find elementary paths for
a given path expression

auctions (id1, id2, ord) auctions.item(id1, id2,
ord) auctions.item.id(id1, id2,
ord) auctions.item.name(id1, id2,
ord) ... auctions.item.id.val (id,
val) auction.item.name.val (id, val) ...
74
Path partitioning SKV00

IDs are stored once per descendent
Values stored once
No join required for linearpath expressions
wildcards
Dataguide is neededto find elementary paths for
a given path expression

auctions
//listitem
item
id
name
description
comment
text
parlist
text
text
listitem
text
parlist
text
listitem
text
75
XRel path partitioning interval encoding
YAT01
Factorize path information in Path table (1,
"/auctions")(2, "/auctions/item")(3,
/auctions/item/name") Assign region IDs Start,
End to nodes N1 ancestor of N2 iffN1.Start lt
N2.Start andN1.End gt N2.End Children
relationship extra condition on paths

Path(PathID,PathExpr)
Element(DocID, PathID, Start, End, Ordinal)
Text(DocID, PathID, Start, End, Value)
Attribute(DocID, PathID, Start, End, Value)

auctions
76
XRel path partitioning interval encoding
YAT01

Path(PathID,PathExpr)
Element(DocID, PathID, Start, End, Ordinal)
Text(DocID, PathID, Start, End, Value)
Attribute(DocID, PathID, Start, End, Value)
Processing regular path expressions match PE
against Path.PathExpr (string pattern matching)
Thanks to region IDs, no joins required for
linear path expressions
Theta-joins still needed for branching path
expressions
match paths, then use indexes on Element.pathID

77
Classification of relational storage schemes for
XML documents

No XML schema assumption ( "Tree OEM")
Generic same relational schema for all
Derive relational schema from the data
Based on an XML schema
Based only on an XML schema
Based on schema and cost information
User-defined

78
DTD-derived relational storage schemes STH99

Use a DTD graph derived from the DTD

lt!ELEMENT a (b, f, d)gt lt!ELEMENT b
(c,d)gt lt!ATTLIST b bID REQUIRED IDgt lt!ELEMENT c
(e, g)gt lt!ELEMENT d PCDATAgt lt!ELEMENT e (b?)
gt lt!ATTLIST e bIDRef IDREFgt lt!ELEMENT
f PCDATAgt lt!ELEMENT g PCDATAgt
79
"Shared" model STH99

1 relation for every DTD graph node with
in-degree 0 or gt1, or under a or backpointer
Less relations, no redundancy
Requires joins for reconstruction
Order support is complicated by ID absence

a
A(a.ID, a.f.val, a.f.isRoot)B(b.ID, b.parentID,
b.ID.val)C(c.ID, c.parentID, c.e.val, c.e.
bIDRef.val, c.e.isRoot)D(d.ID, d.parentID,
d.parentCode, d.val) G(g.ID, g.parentID,
g.val)

b
f

?
c
d

e
g
80
"Hybrid" model STH99

1 relation for every DTD graph node with indegree
0, under a , or under a backpointer
Less relations, no redundancy
Requires joins for reconstruction
Order support is complicated by ID absence

a

A(a.ID, a.f.val, a.f.isRoot, a.d.val)B(b.ID,
b.parentID, b.ID.val, b.d.val)C(c.ID,
c.parentID, c.e.val, c.e. bIDRef.val,
c.e.isRoot)D(d.ID, d.parentID, d.val) G(g.ID,
g.parentID, g.val)
b
f

?
c
d

e
g
81
DTD-derived models

Labels stored in the schema only some IDs stored
E-R analysis of data in XML
"If every A has 1 D, and all Ds have A parents,
then D is a property of A otherwise, D is an
entity"
DTD (graph) must be kept and exploited to
translate queries, handle inlined elements, ...
Long path queries may be simplified, then matched
accessing few tables
To bind many branches, (several) joins still
needed

82
Classification of relational storage schemes for
XML documents

No XML schema assumption ( "Tree OEM")
Generic same relational schema for all
Derive relational schema from the data
Based on an XML schema
Based only on an XML schema
Based on schema and cost information
User-defined

83
Relational models based on a query workload
BFR02

Adapt relational schemas to documents (XML
schema), queries
Problem different XML schemas may accept the
same sets of documents, yet produce different
relational schemas !
Fixed storage for a given XML schema (shared)
Transformation rules for moving between
equivalent XML Schemas
Cost-based optimization in the space of possible
transformations
Impact of a given XML Schema transformation
estimated by the RDBMS's optimizer

84
Cost-based choice of a relational schema
XML schema
RDBMS optimizer
85
Schema transformation rules driven by workload
queries

Inlining / Outlining
type Ab Integer, C, d, type Ce String
equivalent totype Ab Integer, e String,
d
Inlining useful if C is always queried through
ancestor A
Union Factorization / Distribution
(a, (bc)) equivalent to (a, b) (a, c)
at1t2 equivalent to at1 at2
Useful to separate if at1 often queried
together, at2 rarely or never queried together

86
Schema transformation rules driven by workload
queries

Repetitions merge / split
a equivalent to (a, a)
If the first ltagt is isolated, it can be inlined
with parent
Wildcard rewritings
Ab String equivalent to a b (cd),
where ctag1String and d(! tag1)String
If a/b/tag1 often queried, a/b/other never
queried, separate them.

87
Cost-based choice of relational schema

Materialized view selection for a dataset and
workload
This is the storage
Optimizer estimates can be wrong, but the
optimizer will make the same mistake when
choosing the best plan
Search space explored
Node labels factorized in the schema
Schema management module needed to identify
pertinent relations
Various points in the search space vary the
number of unions and joins required by a query

88
Classification of relational storage schemes for
XML documents

No XML schema assumption ( "Tree OEM")
Generic same relational schema for all
Derive relational schema from the data
Based on an XML schema
Based only on an XML schema
Based on schema and cost information
User-defined

89
User-defined storage model for XML MFK01,DS03
Does each item have exactly one description ?
Does each item have exactly one price ?

Express (relational) storage by custom
(algebraic) expressions over the XML document
Relation materialized view over the XML
document
Finding useful tables requires view-based query
rewriting
R(y,z) - Auctions.item x, x._at_id.text() y,
x.price.text() z
S(u,v) - Auctions.item t, t._at_id.text() u,
t.description.text() v
for x in //item
return ltresgt x/price, x/description lt/resgt

Is Auctions.item the same as //item ?
Is _at_id a key for item ?
Not so fast.
90
User-defined storage model for XML MFK01,DS03
Does each item have exactly one description ?
Does each item have exactly one price ?

Express (relational) storage by custom
expressions over the XML document
Relation materialized view over the XML
document
Finding useful tables requires view-based query
rewriting

Is Auctions.item the same as //item ?

XPath containment
Functional dependency
Cardinality constraints

Is _at_id a key for item ?
Query containment /rewriting under
constraints Techniques based on the chase DS03
91
User-defined storage model for XML MFK01,DS03

Express (relational) storage by custom
expressions over the XML document
Relation materialized view over the XML
document
Other issue we would want to write
R(XID, y,z) - Auctions.item x, x.id XID,
x._at_id.text() y, x.price.text() z
S(XID, u,v) - Auctions.item t, x.id XID
t._at_id.text() u, t.description.text() v
x.id not in the document ! But, in the data model
XQDM

92
User-defined storage model for XML

Express (relational) storage by custom
expressions over the XML
Must check storage completeness
Most generic potential for good performance
(materialized views !)
Can also express non-relational storage models
Rewriting is complex.
Poor man's solution cut in flexibility (and
performance)
Less freedom in the mappings
Assign IDs to all elements
Map each element to a table...

93
Summary of relational storage models for XML

Relations alone only go that far
Many solutions around S-P-J materialized view
selection over partitioned Edge table
Flexible (or generic) storage requires view-based
query rewriting
Interesting performance advantages stem from
various encodings path, ID, ...
Fragmentation (horizontal/vertical) facilitates
navigation and complicates reconstruction

94
Part 2.3 Native XML storage. Conclusions and
examples
95
Native XML storage models

Native storage persistent trees JKC02,FHK02
also current XML-enabled relational DBMS
Oracle, IBM DB2, MS SQL Server
XML nodes split among disk pages
Logical XML document Physical
representation
Node representation optimized based on fixed page
size
Storage evolution on updates B-trees FHK02

96
Native XML storage and indexing models

Persistent trees only allow navigation ? OK for
small documents.
Approach taken in Timber (U. Michigan) and Natix
Attach region-based labels to XML elements
Use structural joins structural join or region
labels
Index labels by tag. This largely outperforms
navigation HBG03.
XQueC approach (Gemo U. Calabria MAB04 )
Attach region-based labels to XML elements
partition region labels by incoming path (as
storage)
Commercial products (IBM DB2, Oracle etc.)
Plain navigation XPath indices
More native encoding storage indexing schemes
WKF03, K03 based on binary tree encoding,
RM04,ZKO04

97
Native storage and indexing models summary

Overall transformation
Document -- logical encoding/labeling -- mapping
to physical store
Some labeling schemes reduce the number of joins
(ORDPATHS)
Navigation outperformed by index-based access
Persistent trees provide for fast serialization
Construction of new, complex result remains
complex
Likely to yield interesting encoding and storage
solutions

98
Part 3 query execution techniques
99
Plan

Labeling schemes and their desireable properties
Fundamental primitive for path queries
structural joins
XML element reconstruction

100
Why start with labeling schemes ?

Idea assign labels to XML elements
unique identifiers
useful information for query processing
Source of big performance improvements over
relational storage traditional joins
Many labeling schemes
trade-off between space occupancy, information
contents, and suitability to updates CKM02
most frequent one region-based ("pre-post")
shortcomings and alternatives
new ones still being produced

101
Region-based labeling schemes
ltbgt child of ltagt region of ltbgt included in region
of ltagt

Idea label elements to reflect nesting
(containment)
ltagt Some content here and then a ltbgt element lt/bgt
lt/agt
Label ltagt with 1, 2, and ltbgt with 2, 1
Add also nesting level
label ltagt with 1, 2, 1
label ltbgt with 2, 1, 2
Variant start and end counting in characters in
the file (recycled IR indexing technique)

102
Region-based structural identifiers
1,0,22
auctions
16,1,21
2,1,14
item
item
id"item1"
id"item2"
15,2,13
3,2,1
4,2,12
17,2,15
18,2,20
name
description
comment
name
description
"Gold pin"
"Surveillance camera"
"Remarkable dragon brooch"
19,3,16
20,3,19
6,3,11
text
parlist
text
parlist
5,3,2
"8 channel"
7,4,8
13,4,10
21,4,18
listitem
listitem
"Art Nouveau gold pin"
listitem
8,5,7
14,5,9
22,5,17
parlist
text
text
9,6,4
11,6,6
listitem
listitem
"also pendant"
"Video recorder, 8 dome cameras"
10,7,3
12,7,5
text
text
"Large diamond"
"Circa 1900"
103
The interest of region-based identifiers

Element e1 is an ancestor of element e2 iff
e1.pre lt e2.pre and e1.post gt e2.post
Element e1 is a parent of element e2 iffe1.pre
lt e2.pre and e1.post gt e2.post and
e1.depth1e2.depth
Establishing ancestor-descendent relationships
becomes as easy as establishing parent-child
relationships
Structural join (see next)

The end of navigation !
104
Query primitive structural join

Relationship established throughsimple
comparisons
select from Element e1, Element e2where e1.doc
e2.doc and e1.pre lt e2.pre and e1.post gt
e2.post
Non-equality join !
Relational implementations perform poorly
ZND01
Alternative efficient stack-based algorithms
AJK02
Algorithm StackTreeDesc AJK02, holistic twig
joins BKS02
Supporting indexes JLW03, join ordering
WPJ03

105
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g
2, 5, 2 3, 3, 3 7, 14, 2 11,12, 3
5, 2, 4 10, 7, 4 13,10, 4 14,11, 4
106
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g
3, 3, 3 7, 14, 2 11,12, 3
5, 2, 4 10, 7, 4 13,10, 4 14,11, 4
2, 5, 2
107
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

7, 14, 2 11,12, 3
5, 2, 4 10, 7, 4 13,10, 4 14,11, 4
3, 3, 3 2, 5, 2
108
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

7, 14, 2 11,12, 3
5, 2, 4 10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
3, 3, 3 2, 5, 2
109
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

7, 14, 2 11,12, 3
10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
3, 3, 3 2, 5, 2
110
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

7, 14, 2 11,12, 3
10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
2, 5, 2
111
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

7, 14, 2 11,12, 3
10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
112
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

11,12, 3
10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
7, 14, 2
113
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

10, 7, 4 13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2, 4
11,12,3 7, 14, 2
114
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2,
4 7, 14, 2, 10, 7, 4
11,12,3 7, 14, 2
115
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

13,10, 4 14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2,
4 7, 14, 2, 10, 7, 4
11,12,3 7, 14, 2
7 ,14, 2, 13,10,4 11,12,3, 13,10,4
116
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

14,11, 4
2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2,
4 7, 14, 2, 10, 7, 4
11,12,3 7, 14, 2
7, 14, 2, 13,10,4 11,12,3, 13,10,4
7, 14, 2, 14,11,4 11,12,3, 14,11,4
117
Algorithm StackTreeDesc
Result //b join //g
Stack
//b
//g

2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2,
4 7, 14, 2, 10, 7, 4
7, 14, 2, 13,10,4 11,12,3, 13,10,4
7, 14, 2, 14,11,4 11,12,3, 14,11,4
118
Remarks on StackTreeDesc
Result //b join //g

I/O cost in O(mn) stack
Does not compute //b//g (duplicates !)
Returns result in descendent order
Similar algorithm for result in ancestor order

2, 5, 2, 5, 2, 4 3, 3, 3, 5, 2,
4 7, 14, 2, 10, 7, 4
7, 14, 2, 13,10,4 11,12,3, 13,10,4
7, 14, 2, 14,11,4 11,12,3, 14,11,4
119
Region-based identifiers and updates

Possible solutions
Leave empty intervals LM01
Use real numbers JKC02
Multi-versioning CTZ02
Append another "discriminating" label to pre,
depth, post

16,2,21
item
id"item2"
17,3,15
18,3,20
name
description
...
19,4,16
18,4,16
text
parlist
"8 channel"
listitem
text
...
120
Other labeling scheme ORDPATHS NNP04
1

Implemented in MS SQL Server
Label each node by a sequence of integer numbers
Initial loading all odd numbers
This scheme records order depth
parentancestor-descendent relationships.

1.1
1.3
1.5
1.3.1
1.3.3
Insert first sibling decrease leftmost code by
2 Insert last sibling increase rightmost code by
2 Insert between siblings use even numbers (does
not affect node depth)
121
Internal representation of ORDPATHs
1

L0 O0 L1 O1 ... Li Oi ... where
Oi is the i-th number
Li encodes Oi's length
Li's must satisfy prefix property
Partition -max, max in intervals of length 2i
Build prefix tree on the partitioning

1.1
1.3
1.5
1.3.1
1.3.3
0
-21,-6 -5,-2 -1,0 1,1
2,3 4,7 8,23
If Oi?-21,-6, Li000, -21 is 0000, ..., -6 is
1111 If Oi?-5,-2, Li001, -5 is 00, ...,
-2 is 11 If Oi?-1,0, Li01, -1 is 0, 0
is 1 If Oi?1,1, Li10, Oi is not
represented If Oi?2,3, Li110, 2 is
0, 3 is 1 If Oi?4,7, Li1110, 4 is 00,
7 is 11 If Oi?8,23, Li1111, 8 is 0000, 23
is 1111
0
1
0
1
0
1
0
1
0
Moderate length on real-life documents MBF03
1
1
122
(Re)constructing XML elements

Query /Auctions/item
This concerns
/Auctions/item/Auctions/item/name/Auctions/item
/_at_id/Auctions/item/comment/Auctions/item/descrip
tion/Auctions/item/description/text/Auctions/ite
m/description/parlist/Auctions/item/description/p
arlist/listitem.............
Simplification item, comment, name

i1 - - i2 - n1 i3 c1 - i4 c2 n2 i5 c3
n3 i5 c3 n4 i5 c4 n3 i5 c4 n4
i1 - i2 - i3 c1 i4 c2 i5 c3 i5 c4
123
(Re)constructing XML elements

/Auctions/item

i1 i2 i3 c1 i4 c2 i5 c3 i5 c4
i1 i2 n1 i3 i4 n2 i5 n3 i5 n4
item i1,..i5
name n1,...n4
124
(Re)constructing XML elements

/Auctions/item

Two alternatives
Run P twice
Materialize intermediary results
Result of fragmentation
Same problem for complex returned results

i1 i2 i3 c1 i4 c2 i5 c3 i5 c4
i1 i2 n1 i3 i4 n2 i5 n3 i5 n4
Plan P
name n1,...n4
125
References
126
References

ABC04 A.Arion, A.Bonifati, G.Costa,
S.D'Aguanno, I.Manolescu and A.Pugliese.
"Efficient Query Evaluation over Compressed XML
Data", EDBT 2004
ABS99 S.Abiteboul, P.Buneman and D.Suciu. "Data
on the web from relations to semistructured data
and XML", Morgan Kauffmann, 1999.
AJK02 S.Al-Khalifa, H.V.Jagadish, N.Koudas,
J.Patel, D.Srivastava and Y.Wu. "Structural
Joins A Primitive for Efficient XML Query
Pattern Matching", ICDE 2002
BCF03 V.Benzaken, G.Castagna and A.Frisch.
"CDuce an XML-centric general-purpose language",
ICFP 2003

127
References

BFR02 P.Bohannon, J.Freire, P.Roy and
J.Simeon. "From XML Schema to Relations A
Cost-Based Approach to XML Storage", ICDE 2002
BKS02 N.Bruno, N.Koudas and D.Srivastava.
"Holistic Twig Joins Optimal XML Pattern
Matching", SIGMOD 2002
CKM02 E.Cohen, H.Kaplan and T.Milo. "Labeling
Dynamic XML", PODS 2002
CSF01 B.Cooper, N.Sample, M.Franklin,
G.Hjaltason and M.Shadmon. "A Fast Index for
Semistructured Data", VLDB 2001
CTZ01 S.Chien, V.Tsotras, C.Zaniolo and
D.Zhang. "Efficient complex query support for
multiversion XML documents", EDBT 2002
DFS99 A.Deutsch, M.Fernandez and D.Suciu.
"Storing Semistructured Data with STORED", SIGMOD
1999
DT03 A.Deutsch and V.Tannen. "MARS A System
for Publishing XML from Mixed and Redundant
Storage", VLDB 2003

128
References

FHK02 T.Fiebig, S.Helmer, C-C. Kanne,
G.Moerkotte, J.Neumann, R.Schiele and T.Westman.
"Anatomy of a native XML base management system",
VLDB Journal 2002.
FHK03 D.Florescu, C.Hillery, D.Kossmann,
P.Lucas, F.Riccardi, T.Westmann, M.Carey,
A.Sundararajan and G.Agrawal. "The BEA/XQRL
Streaming XQuery Processor", VLDB 2003
FK99 D.Florescu and D.Kossmann. "Storing and
Querying XML Using an RDBMS. IEEE Data
Engineering Bulletin, 1999.
FS98 M.Fernandez and D.Suciu. "Optimizing
Regular Path Expressions Using Graph Schemas",
ICDE 1998
FSC03 M.Fernandez and J.Simeon and B.Choi and
A.Marian and G.Sur. "Implementing XQuery 1.0 The
Galax Experience", VLDB 2003

129
References

G02 T.Grust. "Accelerating XPath location
steps", VLDB 2002
GKP02 G.Gottlob, C.Koch and R.Pichler.
"Efficient Algorithms for Processing XPath
Queries", VLDB 2002
GW97 R.Goldman and J.Widom. "DataGuides
Enabling Query Formulation and Optimization in
Semistructured Databases", VLDB 1997
HBG03 A.Halverson, J.Burger, L.Galanis,
A.Kini, R.Krishnamurthy, A.Rao, F.Tian, S.Viglas,
Y.Wang, J. Naughton and D.DeWitt. "Mixed Mode XML
Querey Processing", VLDB 2003
JLW01 H.Jiang, H.Lu, W.Wang and J.Yu. "Path
Materialization Revisited An Efficient Storage
Model for XML Data", AICE 2001

130
References

JKC02 H.V.Jagadish, S.Al-Khalifa, A.Chapman,
L.Lakshmanan, A.Nierman, S.Paparizos, J.Patel,
D.Srivastava, N.Wiwatwattana, Y.Wu and C.Yu.
"TIMBER a Native XML database", VLDB Journal
2002
JLW03 H.Jiang, H.Lu, W.Wang and B.Ooi.
"XR-Tree Indexing XML Data for Efficient
Structural Joins", ICDE 2003.
K03 C.Koch. "Efficient Processing of Expressive
Node-Selecting Queries on XML Data in Secondary
Storage A Tree Automata-based Approach", VLDB
2003
KBN02 R.Kaushik, P.Bohannon, J.Naughton and
H.Korth. "Covering Indexes for Branching Path
Queries", SIGMOD 2002
LM01 Q. Li and B.Moon. "Indexing and querying
XML data for regular path expressions", VLDB 2001
LMP02 B. Ludascher, P.Mukhopadhyay and
Y.Papakonstantinou. "A Transducer-Based XML Query
Processor", VLDB 2002

Web data management and distribution XQuery processing PowerPoint PPT Presentation