Complex Group-By Queries for XML - PowerPoint PPT Presentation

About This Presentation

Title:

Complex Group-By Queries for XML

Description:

Complex Group-By Queries for XML C. Gokhale+, N. Gupta#, P. Kumar+, L.V.S. Lakshmanan, R. Ng, and B.A. Prakash+ * University of British Columbia, Vancouver – PowerPoint PPT presentation

Number of Views:70

Avg rating:3.0/5.0

Slides: 38

Provided by: Nitin59

Category:

more less

Transcript and Presenter's Notes

Title: Complex Group-By Queries for XML

1
Complex Group-By Queries for XML

C. Gokhale, N. Gupta, P. Kumar, L.V.S.
Lakshmanan, R. Ng, and B.A. Prakash
University of British Columbia, Vancouver
Indian Institute of Technology, Bombay
Cornell University, Ithaca

2
Why this paper?

Of late, there is considerable interest in
analytical processing of XML data.
Even for data integration, there is a compelling
need for performing various group-by style
aggregate operations.
A core operator needed for analytics is the
group-by operator, which is widely used in
relational as well as OLAP database applications.

3
Why this paper?

XQuery requires group-by operations to be
simulated using nesting.
State of current research
Provide support for grouping at the logical or
physical level and recognize grouping operations
from nested queries and rewrite them with
grouping operations
Extend XQuery FLWOR expressions with explicit
constructs similar to the group-by, order-by and
having clauses in SQL
Direct algorithmic support for a group-by
operator is not explored.

4
Overview

Nested Aggregation
Expressing complex aggregation queries
Having clause
Moving Windows
Related Work
Algorithm MERGE-GB
Extensions for disk-based
Dealing with nesting
Dealing with having clause
Dealing with moving windows
Experimental results

5
Syntax
Median of Sold
Outer grouping attribute
Average of price
Inner grouping attribute
6
Syntax
7
Syntax

group //Book
by //Name return (
//Name, avg(/Price), count()
then by /Year return (
/Year, median(/Sold)
)
)

8
General Framework

a is an absolute XPath expression, while ßs and
?s are relative to a.
aggouts and aggins are aggregation operations
such as min(), count(), avg(), median().
Nested aggregations medianMax(), minCount()
Cons, AggConsout and AggConsin are sets of
conditions.
mwouts and mwins denote moving window
specifications.
mwi (width, step, winType, domType)

9
Nested Aggregation Features

Having Clause
aggi(?i) ?i ci, where ?i in , ?, gt,lt,, , and
ci is a constant.
Moving Windows
(width, step, winType, domType)
group //Book
by //Name
having count() 100 return (
//Name, avg(/Price), count()
then by /Year return (
/Year(10,5), median(/Sold)
)
)

10
Related Work

2, 3 give syntactic extensions to FLOWR
expressions
Discuss related analytics (moving window
aggregations and cube)
Do NOT discuss algorithms for directly computing
group-bys
Support for group-by at logical/physical level
6
4, 5, 9, 12 detect group-bys from nested
queries
Challenging and queries are hard to express and
understand.
Group-by and cube queries for relational data 8,
10.
Algorithms not directly applicable to
hierarchical data
Group-by elements involve combination of forward
and backward axes
Aggregations on values nested and occur at
multiple levels

11
Merge-GB

Overview
No support for nesting, having, moving windows
Three phase process
Initialization
Prune nodes other than a, ß and ? result in a
Canonical Tree
Counter Initialization Associate counter with
edge based on aggregation operator
Count() counter is a scalar with value 1
Spread() counter is a range of type min, max
Median() counter is a frequency table
Merge Phase
Answer Extraction
Running query
group //Book by //Year
return Year, Median(Sold), MedianCount(Author),
Spread(price), count()

12
Merge-GB Initialization
Catalogue
Subject
Subject
Music CD
Book
Name
Book
Book
Name
Book
PubInfo
Title Mechanics
Author Smith
Year 1999
Sold 5600
Price 60
Author Newton
PubInfo
Title Sound Waves
Year 2000
Sold 900
Price 96
Author McManus
PubInfo
Title Carbon Comps
Year 1999
Sold 8000
Price 25
Author Johnson
Name Kaufman
City NY
Name Wesley
City LA
Name Kaufman
City NY
13
Merge-GB Initialization
1
1
1
1
Book
Book
Book
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Year 1999
Sold
Price
Author
Year 2000
Sold
Price
Author
Year 1999
Sold
Price
Author
14
Merge-GB Initialization
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
15
Merge-GB Merge Phase

Repeated merging of nodes
Based on equality of associated values (eff
group-by labels)
Counters are updated
Child nodes are merged analogously

16
Merge-GB Merge Phase
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
17
Merge-GB Merge Phase
112
1
1
1
Book (2000)
Book (1999)
Book
Book (1999)
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
1
25,25
lt8000,1gt
Sold
Price
Author
Merge ( 25,25, 60,60 ) 25,60
18
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
lt8000,1gt
Sold
Author
Merge ( lt5600,1gt, lt8000,1gt ) (lt5600,1gtlt8000,1gt)
19
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
What about Author? Can these nodes be merged
analogously?
20
Merge-GB Merge Phase

Repeated merging of nodes
Based on equality of associated values (eff
group-by labels)
Counters are updated
Child nodes are merged analogously
Different types of merges
Siblings vs Non-siblings
Primarily to handle nested aggregations

21
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Count requires
simple addition
22
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
2
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Median requires a
frequency count
23
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
lt1,1gt lt2,1gt
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
24
Merge-GB Nesting

Processing ß nodes
Create group corresponding to each new value.
Inner ß is processed analogously create a group
with label ßout, ßin.
For group inversion, create a dummy a node and
update later.
Processing ? nodes/ updating counters
Holistic collect all ? values corresponding to
that ß.
In memory if possible otherwise dump to disk in
?-file
Otherwise compute aggregation on-the-fly
Siblings or non-siblings? Native XML structure
allows decidability using exactly one integer per
answer node!

25
Merge-GB Having clause

Anti-monotonic early pruning
Constraint remains false once violated
Helps in early pruning, for unnested queries
Nested Queries
Associate a flag with each ß group
? calculations can be avoided for groups with
negative flags
Processing of inner-nests can also be avoided
Inner block not processed until the having
clauses in all the outer blocks have been
processed
Separate ?files for each holistic calculation

26
Merge-GB Moving Windows

mw (width, step, winType, domType)
Repeated-aggregation strategy
Enumerate each group apriori
Aggregate for each group treating them
independently
Extend the notion of hash(ß) to update all
appropriate counters
Example
Year (WS 5, Step 2, Standard Domain, Incremental)
1990-1994
1992-1996
1994-1998

Val20
Val50
1993, Price30
1990, Price20
Val30
27
Merge-GB Moving Windows

mw (width, step, winType, domType)
Rolling-over strategy
Construct a window with step1, width1
Results in an intermediate answer tree
Compute moving window aggregate after computing
aggregate for first window, remove starting
elements and add elements for the subsequent
window

1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Val20
Val30
1990, Price20
1993, Price30
28
Merge-GB Additional Details

Multiple moving windows
Result in hyper-rectangular windows
Essentially performs a cartesian-product over all
involved attributes
Moving windows and Having clause
Repeated aggregation requires no change
Rolling-over with holistic aggregation requires
delay in processing of ?files.
Properties
Lemma 1 With the aforementioned setup, a value
in an inner block that does not appear in the
answer is not read after the value was written
into the appropriate gamma file.
Lemma 2 For each value of , the rolling-over
strategy guarantees that aggregation is done at
most once.

29
Performance Evaluation

Setup
2GHz CPU, 1GB RAM machine.
Datasets
XMARK (50-500MB)
DBLP (250-400MB)
Protein (13MB)
Comparison with Galax and Qizx
Evaluation Parameters
Size and number of groups
Fully vs. Partially specified paths
Increasing levels of nesting
Scalability

30
Performance Evaluation
Q1 Small number of large groups Q2 Large number
of small groups N-GB is not affected while Qizx
cannot handle a large number of groups
31
Performance Evaluation
Q5 Query with fully specified paths Q6 Query
with partially specified paths N-GB is not
affected while Qizx cannot handle partially
specified paths
32
Performance Evaluation
N-GB is linear as expected, whereas Qizx becomes
exponential with increasing levels of nesting
33
Performance Evaluation
Early pruning gives manifold improvement for
queries with having clause(s)
34
Performance Evaluation
Increase in window width/stop size results in
more overlapping computations and therefore gives
more gain
35
Performance Evaluation
Early pruning provides the maximum benefit,
while repeated aggregations degrades performance
36
Conclusions