Parallel Star Join DataIndexes : Efficient Query Processing in Data Warehousing and OLAP presentation

About This Presentation

Transcript and Presenter's Notes

Title: Parallel Star Join DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

1
Parallel Star Join DataIndexes Efficient
Query Processing in Data Warehousing and OLAP

Anindya Datta
Debra VanderMeer
Krithi Ramamritham
Presented by
Ashutosh Joshi

2
Motivation

OLAP involves efficient retrieval of data from
data warehouses for decision-support purposes
Data Warehouses are extremely large and queries
are highly computationally expensive
DataIndex is a storage structure serving as both
index and data
Parallel Star Join (PSJ) is an efficient
algorithm for performing star join in parallel

3
The Road Map

A physical design principle for exploiting
parallelism
Parallel Star Join algorithm
Experiment results

4
The Star Schema
Dimension Table
PART
CUSTOMER
Fact Table
PartKey 4 Name 55 Mfgr 25 Brand 10 Type
25 Size 4 Others... 41 164
CustKey 4 Name 25 Address
40 Nation 25 Region 25 Phone
15 AcctBal 8 MktSegment 10 Comment 117
269
SALES
PartKey 4 SuppKey 4 CustKey
4 Quantity 8 ExtPrice 8 Discount
8 Tax 8 RetFlag 1 Status
1 ShipDate 2 CommitDate 2 ReceiptDate
2 ShipInstruct 25 ShipMode 10 Comment
44 137
200,000
150,000
SUPPLIER
SuppKey 4 Name 25 Address 40 Nation
25 Region 25 Phone 15 AcctBal 8 Comment
101 243
TIME
TimeKey 2 Alpha 10 Year 4 Month 4 Week
4 Day 4 28
6,000,000
2,557
10,000
5
A Physical Design Principle

DataIndexes
Serve as both index as well as data
Based on vertical partitioning of tables
Two types
Projection Index (PI)
Join Index (JI)

6
Projection Index
Base Table
CustKey
Qty
Discount
ExtPrice
CK1
Q1
D1
E1
CK2
Q2
D2
E2
CK3
Q3
D3
E3
CK4
Q4
D4
E4
PI
PI
PI
Discount
ExtPrice
Qty
D1
E1
Q1
D2
E2
Q2
D3
E3
Q3
D4
E4
Q4
7
Join Index
Base Fact Table
Base Dimension Table
Tax
Name
Address
Discount
ExtPrice
CustKey
CustKey
T1
N1
A1
D1
E1
CK1
CK1
T2
N2
A2
D2
E2
CK2
CK2
T3
N3
A3
D3
E3
CK3
CK3
T4
D4
E4
CK3
JI
PI
PI
PI
PI
PI
Tax
Name
Address
Discount
ExtPrice
CustKey
T1
N1
A1
D1
E1
CK1
T2
N2
A2
D2
E2
CK2
T3
N3
A3
D3
E3
CK3
T4
D4
E4
8
The Principle

Each foreign key column in the fact table is
stored as Join Index (JI)
Rest of the columns (for both dimension as well
as fact table) are stored as Projection Index (PI)

9
Parallel Star Join

Data placement strategy
Based on shared nothing architecture with N
processors
Assume a d dimensional data warehouse
Partition N processors into d1 groups
Assign to each group j, dimension table Dj and Jj
, the fact table join index
Assign metric PIs to the group d1

10
Processor Group Partitioning

Number of processors is governed by the size of
dimension table Dj
Size of jth processor group
Size of metric group

11
Physical Data Placement

Horizontally partition JIs across all processors
Replicate PIs on all processors
Use round-robin strategy for partitioning JIs

12
The Parallel Star Join Algorithm

A general k- dimensional star join query
Select AdP, AmP
from F, D1, , Dk
where Pjoin and Pselect
The algorithm has three phases
Local rowset generation
Global rowset synthesis
Output preparation

13
Local Rowset generation

Load PI fragment

Pc
P1
P2
PI fragment
PI fragment
PI fragment
Qty gt 10
PI fragment
Rowset fragment
14
Local Rowset Generation (contd)

Merge dimension rowset fragments
Distribute dimension rowset

Rowset fragment
P1
P2
P3
P4
OR
Rdim,i
15
Local Rowset Generation (contd)

Load JI fragment
Merge partial fact rowsets

Rfact,i
Rdim,i
JIi
16
Global Rowset Synthesis

Merge local fact rowsets
Distribute global rowset to groups participating
in the output phase

Rfact,2
G1
G2
Rfact,1
G3
G4
AND
Rglobal
17
Output Preparation

Distribute global rowset to individual processors
Load PI columns necessary for output
Merge output

JIi
Rglobal
PIi
18
Performance Comparison

The PSJ algorithm was compared with Bitmapped
Join Index algorithm and the Pipelined Hash join
algorithm
Two performance metrics used
Response time in block access (RTBA)
Aggregate Data Transmission (ADT)

19
Scalability Experiments

The curves rise as the scale factor and number of
processors increase
PSJ cost is much lower than BJI and HASH costs
At large memory sizes, PSJ approaches
near-perfect scalability

20
Scalability Experiments(contd)

Transmission costs for PSJ and BJI are the same
Both curves exhibit imperfect scalability
HASH has substantially higher transmission costs
than PSJ

21
Conclusion

DataIndex is a physical design strategy which
provides efficient partitioning of the schema
Parallel Star Join algorithm provides a means to
perform star join in parallel
PSJ algorithm performs better than BJI and HASH
algorithms in terms of I/O and transmission costs

Write a Comment

User Comments (0)

Parallel Star Join DataIndexes : Efficient Query Processing in Data Warehousing and OLAP PowerPoint PPT Presentation