1
Parallelizing the Data Cube
  • PhD Oral Defence
  • Todd Eavis
  • July 23, 2003

2
Overview
  • Motivation for Parallel, Relational OLAP
  • Core Algorithms and Methods
  • Primary Systems Contributions
  • Experimental Evaluation and Results
  • Conclusions and Future Work

4
Why study OLAP and the Data Cube?
  • On-line Analytical Processing: the foundation for
    a range of essential business applications
  • sales and marketing analysis, planning and
    budgeting
  • 4 billion dollar industry by 2005
  • Data Cube: a core OLAP construct, first proposed
    in 1996 by Gray et al. [GBLP], that supports
    sophisticated multi-dimensional data analysis
  • Relevance to the Research Community? Results of
    Citeseer queries:
  • OLAP: 797 papers
  • Data Cube: 362 papers
  • Our interest: Data Cube Generation and Querying

5
Scale of OLAP Data Warehouses
  • Average size of production data warehouses
    currently 700 GB [survey.com/Olap Report]
  • Expected to reach 4 TB by 2004
  • 1/3 currently < 50 GB. In two years, this number
    will drop to just 6%
  • Biggest data warehouses growing by a factor of 20
    [Winter Report]
  • Biggest expected to exceed 100 TB within 2 years
  • Our interest: Exploiting Parallel Algorithms

6
Fundamental Design Alternatives
  • MOLAP (Multi-dimensional OLAP)
  • Materialize data cube as a multi-dimensional
    array
  • In theory: implicit indexing. In practice: hybrid
    schemes for sparse and dense regions
  • Best for dense, low-dimensional spaces
  • ROLAP (Relational OLAP)
  • Store data as relational tables
  • Requires an explicit multi-dimensional index
  • Scales well to higher dimensions and higher
    cardinalities
  • Our interest: Highly Scalable ROLAP model

8
Computing the Full Cube in Parallel
  • Small number of previous projects [GC, LHL, MM,
    NWY]
  • Speedup quite limited
  • Our approach: Parallel Pipesort [DEHR2, DER3]
  • Model 2^d views as a task graph
  • Create Scan Pipelines [AADG] as Minimum Cost
    Spanning Tree using O(dn(m + n log n)) bipartite
    matching (n nodes, m edges)
  • Partition task graph into sub-trees with O(p3d
    p2d) augmented k-min-max [BSP] and distribute
    sub-trees to p processors
  • Use over-sampling (S × p sub-trees) to improve
    load balance. High-low pairing in S 1 rounds
    provides approximation to NP-Complete problem
  • Use performance optimized algorithms for sorting,
    scanning, and I/O to generate local views
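
The first step above, modelling the 2^d views as a task graph, can be sketched as follows. This is only an illustration: the function name `build_lattice` and the representation of a view as a frozenset of dimension names are assumptions made here, not details taken from the thesis. An edge (parent, child) means the child view can be computed from the parent by aggregating away one dimension.

```python
from itertools import combinations


def build_lattice(dims):
    """Return (views, edges) for the data cube lattice over `dims`.

    Each view is a frozenset of dimension names. An edge (parent, child)
    means `child` is `parent` with exactly one dimension removed.
    """
    # Enumerate all 2^d subsets, from the full view down to the empty view.
    views = [frozenset(c)
             for k in range(len(dims), -1, -1)
             for c in combinations(dims, k)]
    # Every view with k dimensions has k children in the lattice.
    edges = [(v, v - {a}) for v in views for a in v]
    return views, edges


views, edges = build_lattice(["A", "B", "C"])
print(len(views), "views,", len(edges), "edges")
```

A scheduling algorithm would then pick one parent per view from these edges; the sketch stops at constructing the graph.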

9
Computing Partial Cubes in Parallel
  • Important in practice for environments with
    higher dimensions and/or specific visualization
    needs
  • Little previous work, only partial solutions
    [BR, GC, SAG]
  • Our approach: Greedy algorithms for Schedule
    Tree construction [DER2, DER5]
  • Solution consists of algorithms for generating
    efficient Essential Trees (red) and
    algorithms for adding beneficial non-selected
    nodes (blue)
  • Greedy method: record state information in
    Plan Objects. Incrementally add nodes with
    maximum benefit
  • Pre-sorting candidate views by estimated size
    can reduce run-time from O(n^3) to O(n^2)
  • O(dn) heuristic extensions for higher
    dimensional space. A confidence factor β limits
    risk
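
The greedy idea above (incrementally add the non-selected view with maximum benefit) can be sketched as follows. This is a deliberately naive illustration under stated assumptions: the cost of computing a view is taken to be the estimated size of its cheapest available parent, and the names `tree_cost` and `greedy_schedule` are hypothetical. It does not reproduce the Plan Objects, the pre-sorting optimization, or the confidence factor.

```python
def tree_cost(nodes, size, root):
    """Total cost when every view is computed from its smallest proper
    superset among `nodes`; the root is charged one scan of the raw data,
    assumed here to cost size[root]."""
    total = 0
    for v in nodes:
        parents = [u for u in nodes if v < u]  # proper supersets of v
        total += min((size[u] for u in parents), default=size[root])
    return total


def greedy_schedule(selected, candidates, size, root):
    """Start from the selected views plus the root, then repeatedly add the
    candidate view that lowers the total cost the most."""
    nodes = set(selected) | {root}
    remaining = set(candidates) - nodes
    cost = tree_cost(nodes, size, root)
    while remaining:
        best = min(remaining, key=lambda c: tree_cost(nodes | {c}, size, root))
        new_cost = tree_cost(nodes | {best}, size, root)
        if new_cost >= cost:  # no candidate is beneficial any more
            break
        nodes.add(best)
        remaining.remove(best)
        cost = new_cost
    return nodes, cost


# Made-up sizes: the user wants views A and B; AB and AC are optional.
root = frozenset("ABC")
size = {root: 1000, frozenset("AB"): 100, frozenset("AC"): 500,
        frozenset("A"): 10, frozenset("B"): 10}
nodes, cost = greedy_schedule([frozenset("A"), frozenset("B")],
                              [frozenset("AB"), frozenset("AC")], size, root)
print(sorted("".join(sorted(v)) for v in nodes), cost)
```

In this toy input, materializing the non-selected view AB pays for itself because both A and B can then be computed from a much smaller parent, while AC is rejected.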

10
A Parallel Data Cube Query Engine
  • Views must be indexed prior to access
  • Related work: sequential r-trees for the data cube
    [RYR] and general purpose parallel r-trees [SL]
  • Our approach: Parallel RCUBE [DER1, DER6]
  • Records ordered as per Hilbert Space Filling
    Curve
  • P-processor round-robin record striping
  • Construct Partial r-tree indexes on each node,
    packing page blocks in Hilbert order
  • Parallel Query Engine
  • Combines indexing and OLAP post-processing (query
    transformation, parallel Sample Sort, record
    permutation, etc.)
  • Uses surrogate views to support Partial Cubes
  • Supports linear dimension hierarchies
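
The record ordering and striping steps above can be sketched as follows. This is a minimal illustration restricted to two dimensions on an n × n grid with n a power of two, which is an assumption made for brevity; the slides describe a general multi-dimensional setting. The function names `hilbert_index` and `stripe` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.
    n must be a power of two. Standard iterative conversion."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def stripe(records, p, n):
    """Sort records (x, y) in Hilbert order, then deal them round-robin
    to p processors so that each one receives every p-th record."""
    ordered = sorted(records, key=lambda r: hilbert_index(n, r[0], r[1]))
    return [ordered[i::p] for i in range(p)]


records = [(x, y) for x in range(4) for y in range(4)]
for rank, part in enumerate(stripe(records, p=4, n=4)):
    print(rank, part)
```

Because neighbouring records in Hilbert order land on different processors, a range query that touches a compact region tends to draw a similar number of records from each processor.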

11
The Virtual Data Cube
  • Motivation: Hide the complexity of Data Cube
    algorithms and implementation
  • Requires no knowledge of:
  • Format or extent of indexing
  • Degree of materialization (full or partial)
  • Representation of hierarchies
  • Physical order of view attributes
  • Degree of parallelism

13
Systems Overview
  • Full, robust Data Cube prototype [DER4]
  • Approximately 20,000 lines of code
  • C/C++, LEDA, MPI, STL
  • Template-based graph algorithms
  • Designed for, and evaluated on, contemporary
    parallel machines
  • Shared nothing Linux cluster (Dalhousie)
  • Shared disk SunFire multi-processor (HPCVL)
  • Supporting systems include:
  • Flex/Bison based data generator
  • Batch query generator
  • View Subset generator

14
Key Performance Issues
  • Dynamic selection of best sorting algorithm
  • Radix sort versus quicksort
  • Minimization of data movement
  • Use of horizontal and vertical indirection
  • New pipeline aggregation algorithm
  • Lazy aggregation
  • Streamlined I/O
  • I/O manager
  • Independent I/O and computation threads
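
The dynamic choice between radix sort and a comparison sort can be sketched as follows. The cost constants `cmp_cost` and `pass_cost` are made-up placeholders (in a real system they would be measured per machine), the function names are hypothetical, and Python's built-in `sorted` stands in for quicksort.

```python
import math


def radix_sort(keys, key_bits):
    """LSD radix sort on non-negative integers, one byte per pass."""
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def choose_sort(n, key_bits, cmp_cost=1.0, pass_cost=2.0):
    """Pick the algorithm with the lower estimated cost.
    Radix sort: linear work per byte of key. Comparison sort: n log n."""
    passes = math.ceil(key_bits / 8)
    radix_est = pass_cost * passes * n
    cmp_est = cmp_cost * n * math.log2(max(n, 2))
    return "radix" if radix_est < cmp_est else "comparison"


def dynamic_sort(keys):
    """Sort non-negative integer keys with whichever method looks cheaper."""
    key_bits = max(keys, default=0).bit_length() or 1
    if choose_sort(len(keys), key_bits) == "radix":
        return radix_sort(keys, key_bits)
    return sorted(keys)


print(choose_sort(1_000_000, 16), choose_sort(100, 64))
```

The sketch captures the trade-off only: short keys and large inputs favour the linear-time method, while long keys and small inputs favour the comparison-based one.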

15
Costing model
  • Sophisticated cost model, common to both full and
    partial cube [DEHR1]
  • Based upon view size estimator
  • Probabilistic counting technique
  • Experimentally supported metrics for:
  • Dynamic Sorting (linear time versus comparison
    based)
  • In-memory scanning and data movement
  • Machine specific Read and Write I/O
  • Dynamically considers impact of computation
    versus I/O
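
A view size estimator based on probabilistic counting can be sketched as follows. The slides do not say which variant is used, so this example assumes the classic Flajolet-Martin scheme with stochastic averaging; the function names, the choice of hash function, and the bucket count `m` are all illustrative assumptions.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def _hash64(value):
    """Deterministic 64-bit hash of an arbitrary Python value."""
    digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_distinct(rows, m=256):
    """Estimate the number of distinct rows in one pass using m bitmaps."""
    bitmaps = [0] * m
    for row in rows:
        h = _hash64(row)
        bucket, rest = h % m, h // m
        # Index of the lowest set bit of the remaining hash bits.
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        bitmaps[bucket] |= 1 << rho
    total = 0
    for bm in bitmaps:
        r = 0
        while bm & (1 << r):  # length of the run of low-order 1 bits
            r += 1
        total += r
    return (m / PHI) * 2 ** (total / m)


def estimate_view_size(rows, dims):
    """Estimated row count of the group-by over the column indexes `dims`."""
    return estimate_distinct(tuple(r[i] for i in dims) for r in rows)


rows = [(i % 20000, i % 3) for i in range(60000)]
print(round(estimate_view_size(rows, (0,))), round(estimate_view_size(rows, (0, 1))))
```

The estimate needs a single scan and a few hundred machine words of state, which is why this family of techniques suits costing many candidate views; accuracy degrades when a view has only a handful of distinct rows per bitmap.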

16
A Better Search Strategy
  • Standard r-tree search strategy employs Depth
    First Search
  • Our approach: Linear Breadth First Search
  • Map the search algorithm to the linearly ordered
    levels of the packed index
  • Resolve query with a left-to-right, top-to-bottom
    walk of the tree
  • Disk head never moves backwards
  • Resolution consists of a sequence of clustered
    scans
  • Degrades gracefully to a sequential scan of index
    + sequential scan of data
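
The level-by-level search described above can be sketched as follows. This is an in-memory illustration only: the packed index is assumed to be a list of levels, each a left-to-right list of bounding boxes, where node i covers entries i × fanout to (i + 1) × fanout − 1 of the level below. The names `build_packed` and `linear_bfs` are hypothetical, the points are assumed to arrive already in their packing order, and `fanout` must be at least 2.

```python
def bbox_of(boxes):
    """Smallest box (xmin, ymin, xmax, ymax) enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_packed(points, fanout):
    """Pack points bottom-up. levels[0] is the root level and levels[-1]
    holds one degenerate box per point, in the order given."""
    levels = [[(x, y, x, y) for x, y in points]]
    while len(levels[0]) > 1:
        below = levels[0]
        levels.insert(0, [bbox_of(below[i:i + fanout])
                          for i in range(0, len(below), fanout)])
    return levels


def linear_bfs(levels, fanout, query):
    """Resolve a range query one level at a time. Node indexes within each
    level are visited in ascending order, so access never moves backwards."""
    frontier = range(len(levels[0]))
    visited = []
    for depth, level in enumerate(levels):
        hits = [i for i in frontier if intersects(level[i], query)]
        visited.append(hits)
        if depth + 1 < len(levels):
            width = len(levels[depth + 1])
            frontier = [c for i in hits
                        for c in range(i * fanout, min((i + 1) * fanout, width))]
    return hits, visited


levels = build_packed([(i, i) for i in range(16)], fanout=4)
matches, visited = linear_bfs(levels, 4, (5, 5, 9, 9))
print(matches, visited)
```

If the query covers everything, every level is read in full from left to right, which corresponds to the graceful degradation to sequential scans noted above.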

18
Experimental Evaluation
  • Default test environment includes
  • 16 to 24 processors
  • 2 million records/80 MB
  • 4 to 14 dimensions
  • Random query batches
  • 24-node Linux cluster, 16-node SunFire MP (disk
    array)
  • Parallel Speedup approaching linear for all
    components
  • Efficiency between 80% and 95%

[Figure panels: Partial Cube, Full Cube, Query Processing]

19
Full Cube Evaluation
  • Shared Disk
  • 80-90% efficiency
  • Disk array is bottleneck
  • Optimized pipeline processing
  • Order of magnitude improvement
  • Over-sampling factor
  • SF = 2 consistently best

20
Partial Cube Evaluation
  • Tree pruning with confidence factor (on 14 d)
  • Can eliminate up to 60% of original tree
  • Virtually no reduction in tree quality
  • Using partial cube algorithms for full cube
  • All within 6% of best benchmark
  • Recursive algorithm with 0.1
  • Partial cube of 3 dimensions or less
  • Reductions over naïve method of 65-70%

21
Query Evaluation
  • Overhead of using surrogate views (1 to 16
    processors)
  • Run time on materialized views versus time when
    those views were unavailable
  • Record retrieval imbalance (16 processors)
  • Only 0.3% from optimal load balance
  • Ratio of blocks retrieved to required seeks
  • Random query batches
  • Up to 140:1 for large, sparse spaces

23
Thesis Conclusions
  • ROLAP is a viable alternative to MOLAP in a
    parallel setting
  • Partial cubes can be efficiently generated
  • ROLAP cubes can be efficiently indexed
  • Virtual cube abstraction can be efficiently
    supported

24
Research Highlights
  • First parallel ROLAP system in the Data Cube
    literature
  • A balanced approach to data cube research
  • Algorithm design
  • Systems engineering
  • Extensive performance analysis
  • Evaluated on contemporary parallel machines
  • Commodity-style shared nothing cluster
  • Shared disk architectures
  • Integration of three independent data cube
    research projects into a single cohesive OLAP
    framework: the Virtual Cube

25
Future Work
  • Automated partial cube specification
  • Extension of virtual cube
  • Parallel Query optimization
  • In addition to range queries or linear
    hierarchies
  • High volume query environments
  • OLAP visualization
  • New projects are building on the current base
  • Generation of Iceberg Cubes
  • Mining of association rules

26
Thank You!
References: The Data Cube Literature
  • [GBLP] J. Gray, A. Bosworth, A. Layman and H.
    Pirahesh, Data Cube: A Relational Aggregation
    Operator Generalizing Group-By, Cross-Tab, and
    Sub-Totals, ICDE, 1996.
  • [GC] S. Goil and A. Choudhary, A Parallel
    Scalable Infrastructure for OLAP and Data
    Mining, IDEAS, 1999.
  • [LHL] H. Lu, X. Huang and Z. Li, Computing Data
    Cubes Using Massively Parallel Processors,
    PCW '97, 1997.
  • [MM] S. Muto and M. Kitsuregawa, A Dynamic Load
    Balancing Strategy for Parallel Datacube
    Computation, WDW O, 1999.
  • [NWY] R. Ng, A. Wagner and Y. Yin, Iceberg-cube
    Computation with PC Clusters, SIGMOD, 2001.
  • [AADG] S. Agarwal, R. Agrawal, P. Deshpande, A.
    Gupta, J. Naughton, R. Ramakrishnan and S.
    Sarawagi, On the Computation of Multidimensional
    Aggregates, VLDB, 1996.
  • [BSP] R. Becker, S. Schach and Y. Perl, A
    shifting algorithm for min-max tree
    partitioning, Journal of the ACM, 1982.
  • [BR] K. Beyer and R. Ramakrishnan, Bottom-up
    computation of sparse and Iceberg CUBEs,
    SIGMOD, 1999.
  • [RYR] N. Roussopoulos, Y. Kotidis and M.
    Roussopolis, Cubetree: Organization of the bulk
    incremental updates on the data cube, SIGMOD,
    1997.
  • [SL] B. Schnitzer and S. Leutenegger,
    Master-client r-trees: a new parallel
    architecture, SSDM, 1999.
References: Our own Virtual Data Cube Research
  • [DEHR1] F. Dehne, T. Eavis, S. Hambrusch and A.
    Rau-Chaplin, Parallelizing The Data Cube,
    Parallel and Distributed Databases: An
    International Journal, 2001.
  • [DER1] F. Dehne, T. Eavis and A. Rau-Chaplin,
    Distributed Multi-dimensional ROLAP Indexing for
    the Data Cube, CCGrid, 2003.
  • [DER2] F. Dehne, T. Eavis and A. Rau-Chaplin,
    Computing Partial Data Cubes for Parallel Data
    Warehousing Applications, Euro PVM-MPI, 2001.
  • [DER3] F. Dehne, T. Eavis and A. Rau-Chaplin,
    Coarse Grained Parallel On-Line Analytical
    Processing (OLAP) For Data Mining, ICCS, 2001.
  • [DER4] F. Dehne, T. Eavis and A. Rau-Chaplin, A
    Cluster Architecture for Parallel Data
    Warehousing, CCGrid, 2001.
  • [DEHR2] F. Dehne, S. Hambrusch, T. Eavis and A.
    Rau-Chaplin, Parallelizing The Data Cube, ICDT,
    2001.
  • [CHER] Y. Chen, F. Dehne, T. Eavis and A.
    Rau-Chaplin, Parallel ROLAP Data Cube
    Construction on Shared Nothing Multi-Processors,
    IPDPS, 2003.
  • [DER5] F. Dehne, T. Eavis and A. Rau-Chaplin,
    Computing Partial Data Cubes, Submitted to HICCS,
    2003.
  • [DER6] F. Dehne, T. Eavis and A. Rau-Chaplin,
    RCUBE: Parallel Multi-Dimensional ROLAP Indexing,
    Submitted to Journal of Data Mining and
    Knowledge Discovery, 2003.