Title: Parallel Star Join DataIndexes : Efficient Query Processing in Data Warehousing and OLAP
1Parallel Star Join DataIndexes Efficient
Query Processing in Data Warehousing and OLAP
- Anindya Datta
- Debra VanderMeer
- Krithi Ramamritham
- Presented by
- Ashutosh Joshi
2Motivation
- OLAP involves efficient retrieval of data from
data warehouses for decision-support purposes - Data Warehouses are extremely large and queries
are highly computationally expensive - DataIndex is a storage structure serving as both
index and data - Parallel Star Join (PSJ) is an efficient
algorithm for performing star join in parallel
3The Road Map
- A physical design principle for exploiting
parallelism - Parallel Star Join algorithm
- Experiment results
4The Star Schema
Dimension Table
PART
CUSTOMER
Fact Table
PartKey 4 Name 55 Mfgr 25 Brand 10 Type
25 Size 4 Others... 41 164
CustKey 4 Name 25 Address
40 Nation 25 Region 25 Phone
15 AcctBal 8 MktSegment 10 Comment 117
269
SALES
PartKey 4 SuppKey 4 CustKey
4 Quantity 8 ExtPrice 8 Discount
8 Tax 8 RetFlag 1 Status
1 ShipDate 2 CommitDate 2 ReceiptDate
2 ShipInstruct 25 ShipMode 10 Comment
44 137
200,000
150,000
SUPPLIER
SuppKey 4 Name 25 Address 40 Nation
25 Region 25 Phone 15 AcctBal 8 Comment
101 243
TIME
TimeKey 2 Alpha 10 Year 4 Month 4 Week
4 Day 4 28
6,000,000
2,557
10,000
5A Physical Design Principle
- DataIndexes
- Serve as both index as well as data
- Based on vertical partitioning of tables
- Two types
- Projection Index (PI)
- Join Index (JI)
6Projection Index
Base Table
CustKey
Qty
Discount
ExtPrice
CK1
Q1
D1
E1
CK2
Q2
D2
E2
CK3
Q3
D3
E3
CK4
Q4
D4
E4
PI
PI
PI
Discount
ExtPrice
Qty
D1
E1
Q1
D2
E2
Q2
D3
E3
Q3
D4
E4
Q4
7Join Index
Base Fact Table
Base Dimension Table
Tax
Name
Address
Discount
ExtPrice
CustKey
CustKey
T1
N1
A1
D1
E1
CK1
CK1
T2
N2
A2
D2
E2
CK2
CK2
T3
N3
A3
D3
E3
CK3
CK3
T4
D4
E4
CK3
JI
PI
PI
PI
PI
PI
Tax
Name
Address
Discount
ExtPrice
CustKey
T1
N1
A1
D1
E1
CK1
T2
N2
A2
D2
E2
CK2
T3
N3
A3
D3
E3
CK3
T4
D4
E4
8The Principle
- Each foreign key column in the fact table is
stored as Join Index (JI) - Rest of the columns (for both dimension as well
as fact table) are stored as Projection Index (PI)
9Parallel Star Join
- Data placement strategy
- Based on shared nothing architecture with N
processors - Assume a d dimensional data warehouse
- Partition N processors into d1 groups
- Assign to each group j, dimension table Dj and Jj
, the fact table join index - Assign metric PIs to the group d1
10Processor Group Partitioning
- Number of processors is governed by the size of
dimension table Dj - Size of jth processor group
- Size of metric group
11Physical Data Placement
- Horizontally partition JIs across all processors
- Replicate PIs on all processors
- Use round-robin strategy for partitioning JIs
12The Parallel Star Join Algorithm
- A general k- dimensional star join query
- Select AdP, AmP
- from F, D1, , Dk
- where Pjoin and Pselect
- The algorithm has three phases
- Local rowset generation
- Global rowset synthesis
- Output preparation
13Local Rowset generation
Pc
P1
P2
PI fragment
PI fragment
PI fragment
Qty gt 10
PI fragment
Rowset fragment
14Local Rowset Generation (contd)
- Merge dimension rowset fragments
- Distribute dimension rowset
Rowset fragment
P1
P2
P3
P4
OR
Rdim,i
15Local Rowset Generation (contd)
- Load JI fragment
- Merge partial fact rowsets
Rfact,i
Rdim,i
JIi
16Global Rowset Synthesis
- Merge local fact rowsets
- Distribute global rowset to groups participating
in the output phase
Rfact,2
G1
G2
Rfact,1
G3
G4
AND
Rglobal
17Output Preparation
- Distribute global rowset to individual processors
- Load PI columns necessary for output
- Merge output
JIi
Rglobal
PIi
18Performance Comparison
- The PSJ algorithm was compared with Bitmapped
Join Index algorithm and the Pipelined Hash join
algorithm - Two performance metrics used
- Response time in block access (RTBA)
- Aggregate Data Transmission (ADT)
19Scalability Experiments
- The curves rise as the scale factor and number of
processors increase - PSJ cost is much lower than BJI and HASH costs
- At large memory sizes, PSJ approaches
near-perfect scalability
20Scalability Experiments(contd)
- Transmission costs for PSJ and BJI are the same
- Both curves exhibit imperfect scalability
- HASH has substantially higher transmission costs
than PSJ
21Conclusion
- DataIndex is a physical design strategy which
provides efficient partitioning of the schema - Parallel Star Join algorithm provides a means to
perform star join in parallel - PSJ algorithm performs better than BJI and HASH
algorithms in terms of I/O and transmission costs