Joining Massive High-Dimensional Datasets

Slides: 38
Provided by: tubayavu
Learn more at: https://www.cise.ufl.edu

1
Joining Massive High-Dimensional Datasets
  • Tamer Kahveci
  • Christian A. Lang
  • Ambuj K. Singh
  • Department of Computer Science
  • University of California at Santa Barbara
  • http://www.cs.ucsb.edu/tamer

2
Motivation: Sample Queries
  • Join is a fundamental database primitive.
  • Spatial join: find all hotels in California that
    are within three miles of a recreation area.
  • Sequence join: find all pairs of companies from the
    New York Exchange and the Tokyo Exchange that have
    similar closing prices for one month.

3
Motivation
  • We assume limited buffer space.
  • Joining two datasets is expensive:
  • I/O cost
  • CPU cost
  • O(mn)

4
The Naive Solution: NLJ
Dataset 1 (m pages)
Dataset 2 (n pages)
Buffer: B = 4
min{m,n} + mn/(B-1) page reads
mn page comparisons
We do not need to compare all page pairs!
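
The page-read count on this slide can be checked with a short calculation. This is a minimal sketch, assuming the slide's formula reads min{m,n} + mn/(B-1) and that a partially filled last block still costs a full scan of the other dataset (hence the ceiling). The function names and the sample sizes are illustrative and do not come from the slides.

```python
from math import ceil


def nlj_page_reads(m: int, n: int, B: int) -> int:
    """Page reads for a block nested-loop join with B buffer pages.

    The smaller dataset is the outer one. It is read once, in blocks of
    B - 1 pages, and the larger dataset is scanned once per block.
    """
    if B < 2:
        raise ValueError("need at least 2 buffer pages")
    outer, inner = min(m, n), max(m, n)
    blocks = ceil(outer / (B - 1))
    return outer + blocks * inner


def nlj_page_comparisons(m: int, n: int) -> int:
    """Every page of one dataset is compared with every page of the other."""
    return m * n


# Illustrative sizes (not from the slides): m = 9, n = 12, B = 4.
reads = nlj_page_reads(9, 12, 4)            # 9 + (9 / 3) * 12 = 45
comparisons = nlj_page_comparisons(9, 12)   # 9 * 12 = 108
print(reads, comparisons)
```

When min{m,n} is a multiple of B - 1, as in this example, the result equals the closed form min{m,n} + mn/(B-1).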
5
Outline
  • Reducing search space: Prediction Matrix
  • Minimizing I/O cost by clustering
  • Square Cluster
  • Cost Cluster
  • Maximizing buffer reuse
  • Experimental results

6
PM-NLJ
  • Predict the candidate page pairs using the plane-sweep
    method on an index structure.

(Figure axes: Dataset 1, Dataset 2.)
7
Prediction of Join
8
Prediction of Join
9
PM-NLJ
  • Predict the candidate page pairs using the plane-sweep
    method on an index structure.
  • The final estimate is called the Prediction Matrix
    (PM).
  • Restrict NLJ to the marked entries of the PM.
  • We call this method PM-NLJ.

(Figure axes: Dataset 1, Dataset 2.)
10
PM-NLJ
  • Number of marked entries: e.
  • Performance improvement rate: mn/e.

(Figure axes: Dataset 1, Dataset 2.)
Is there a better read schedule?
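
A minimal sketch of the restricted join, not the authors' implementation. It assumes the prediction matrix is already available as a set of marked (i, j) page pairs and that pages are plain Python lists; how the matrix is predicted (the plane sweep) is not shown. The names `pm_nlj`, `improvement_rate` and the `match` predicate are hypothetical, as is the sample data.

```python
def pm_nlj(pages1, pages2, marked, match):
    """Compare only the page pairs marked in the prediction matrix."""
    result = []
    for i, j in sorted(marked):
        for a in pages1[i]:
            for b in pages2[j]:
                if match(a, b):
                    result.append((a, b))
    return result


def improvement_rate(m, n, e):
    """Ratio of page pairs compared by NLJ (m * n) to those compared by PM-NLJ (e)."""
    return (m * n) / e


# Hypothetical data: 3 pages per dataset, 1-D values, join distance 0.5.
pages1 = [[1.0, 2.0], [10.0, 11.0], [20.0]]
pages2 = [[1.5], [10.5, 30.0], [19.8]]
marked = {(0, 0), (1, 1), (2, 2)}          # e = 3 marked entries

pairs = pm_nlj(pages1, pages2, marked, lambda a, b: abs(a - b) <= 0.5)
rate = improvement_rate(len(pages1), len(pages2), len(marked))
print(len(pairs), rate)
```

Here only 3 of the 9 page pairs are compared, so the improvement rate mn/e is 3.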
11
Outline
  • Reducing search space: Prediction Matrix
  • Minimizing I/O cost by clustering
  • Square Cluster
  • Cost Cluster
  • Maximizing buffer reuse
  • Experimental results

12
Minimizing Number of I/Os: Square Clustering
  • PM-NLJ reads min{m,n} + e = 9 pages.
  • Let B = 6.
  • m + n = 6 page reads suffice.
  • Savings: e - max{m,n}.
  • Maximize e.
  • Minimize max{m,n}.
  • m + n ≤ B
  • m = n = B/2.

(Figure axes: Dataset 1, Dataset 2.)
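
The arithmetic on this slide can be checked directly. The sketch below assumes m = n = 3 and e = 6, which is one combination consistent with the totals 9 and 6 (the slide does not state these values), and reads the slide's expressions as min{m,n} + e, m + n and e - max{m,n}. All function names are illustrative.

```python
def pm_nlj_reads(m, n, e):
    """Page reads of PM-NLJ according to the slide: min{m,n} + e."""
    return min(m, n) + e


def cluster_reads(m, n, B):
    """Page reads when all m + n pages of a cluster fit in the buffer."""
    if m + n > B:
        raise ValueError("cluster does not fit in the buffer")
    return m + n


def savings(m, n, e):
    """Reads saved by the cluster: (min{m,n} + e) - (m + n) = e - max{m,n}."""
    return e - max(m, n)


m, n, e, B = 3, 3, 6, 6                      # assumed example values
before = pm_nlj_reads(m, n, e)               # 3 + 6 = 9
after = cluster_reads(m, n, B)               # 3 + 3 = 6
print(before, after, savings(m, n, e))       # savings: 6 - 3 = 3

# Among shapes that use the whole buffer (m + n = B), the square one
# minimizes max{m,n}.
shapes = [(rows, B - rows) for rows in range(1, B)]
best_shape = min(shapes, key=max)
print(best_shape)
```

For B = 6 the best shape is 3 by 3, matching m = n = B/2.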
13
Minimizing Number of I/Os: Square Clustering
O(e) space and time complexity
Can we reduce total I/O cost by reducing the
number of random seeks?
(Figure axes: Dataset 1, Dataset 2.)
14
Minimizing Random Seek Cost: Cost Clustering
  • The location of the pages is important as well as
    their number!

(Figure axes: Dataset 1, Dataset 2.)
15
Minimizing Random Seek Cost: Cost Clustering
  • O(e) space complexity
  • O(e^(3/2)) time complexity

(Figure axes: Dataset 1, Dataset 2.)
16
Outline
  • Reducing search space: Prediction Matrix
  • Minimizing I/O cost by clustering
  • Square Cluster
  • Cost Cluster
  • Maximizing buffer reuse
  • Experimental results

17
Maximizing Cache Reuse
B = 5 pages
  • Scenario 1
  • Cluster order: (C4,C1,C3,C5,C2)
  • 5 + 4 + 3 + 2 + 5 = 19 page reads.

(Figure: clusters C1 to C5 over Dataset 1 and Dataset 2.)
18
Maximizing Cache Reuse
Scenario 1 = 19 page reads
  • Scenario 2
  • Cluster order: (C4,C2,C1,C3,C5)
  • 5 + 2 + 3 + 3 + 2 = 15 page reads.

What is the best schedule?
(Figure: clusters C1 to C5 over Dataset 1 and Dataset 2.)
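
The two totals can be reproduced with a small calculation. The cluster sizes and the numbers of pages shared between clusters are not given in the transcript, so the values below are assumptions chosen to be consistent with the per-cluster reads of both scenarios (5, 4, 3, 2, 5 and 5, 2, 3, 3, 2). The sketch also assumes that only pages shared with the immediately preceding cluster are reused. `page_reads` is an illustrative name.

```python
def page_reads(order, size, shared):
    """Pages read per cluster when pages shared with the previous
    cluster are still in the buffer and need not be read again."""
    reads = []
    prev = None
    for cluster in order:
        reuse = 0
        if prev is not None:
            reuse = shared.get(frozenset((prev, cluster)), 0)
        reads.append(size[cluster] - reuse)
        prev = cluster
    return reads


# Assumed cluster sizes in pages (each fits in B = 5 buffer pages).
size = {"C1": 5, "C2": 5, "C3": 4, "C4": 5, "C5": 2}

# Assumed number of pages shared by pairs of clusters.
shared = {
    frozenset(("C4", "C2")): 3,
    frozenset(("C2", "C1")): 2,
    frozenset(("C4", "C1")): 1,
    frozenset(("C1", "C3")): 1,
}

scenario1 = page_reads(["C4", "C1", "C3", "C5", "C2"], size, shared)
scenario2 = page_reads(["C4", "C2", "C1", "C3", "C5"], size, shared)
print(scenario1, sum(scenario1))
print(scenario2, sum(scenario2))
```

Under these assumptions the same five clusters cost 19 or 15 page reads depending only on the order in which they are processed.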
19
Sharing Graph (SG)
(Figure: clusters C1 to C5 over Dataset 1 and Dataset 2.)
20
Finding Best Schedule
  • Each schedule is a path on SG.
  • Cache reuse = sum of the weights of the edges of the
    corresponding path on SG.
  • Equivalent to TSP.
  • NP-Complete.
  • Use a greedy heuristic to find a good path.

(Figure: sharing graph on clusters C1 to C5 with edge weights 1, 1, 1, 2, 3.)
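
The slides do not say which greedy heuristic is used. The sketch below shows one possibility, a nearest-neighbour style greedy: from every start cluster, repeatedly move to the unvisited cluster that shares the most pages with the current one, and keep the best path found. The edge weights 1, 1, 1, 2, 3 match the figure, but which clusters each edge connects is an assumption, and all function names are hypothetical.

```python
def weight(shared, a, b):
    """Pages shared by clusters a and b (0 if there is no edge)."""
    return shared.get(frozenset((a, b)), 0)


def greedy_path(nodes, shared, start):
    """Extend the path with the unvisited cluster of largest shared weight."""
    path = [start]
    remaining = set(nodes) - {start}
    total = 0
    while remaining:
        last = path[-1]
        # Largest weight first; ties are broken by cluster name.
        nxt = min(remaining, key=lambda c: (-weight(shared, last, c), c))
        total += weight(shared, last, nxt)
        path.append(nxt)
        remaining.remove(nxt)
    return path, total


def greedy_schedule(nodes, shared):
    """Try every start cluster and keep the path with the most reuse."""
    best_path, best_total = None, -1
    for start in sorted(nodes):
        path, total = greedy_path(nodes, shared, start)
        if total > best_total:
            best_path, best_total = path, total
    return best_path, best_total


# Assumed placement of the five edge weights on the sharing graph.
shared = {
    frozenset(("C4", "C2")): 3,
    frozenset(("C2", "C1")): 2,
    frozenset(("C4", "C1")): 1,
    frozenset(("C1", "C3")): 1,
    frozenset(("C4", "C5")): 1,
}
nodes = ["C1", "C2", "C3", "C4", "C5"]

schedule, reuse = greedy_schedule(nodes, shared)
print(schedule, reuse)
```

A greedy search like this runs in polynomial time but gives no optimality guarantee in general, which is the usual trade-off for a TSP-like problem.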
21
Outline
  • Reducing search space: Prediction Matrix
  • Minimizing I/O cost by clustering
  • Square Cluster
  • Cost Cluster
  • Maximizing buffer reuse
  • Experimental results

22
Experimental Setup: Datasets
  • Low dimensional data
  • 2-D road intersections of Long Beach (LBeach) and
    Montgomery County (MGCounty).
  • 53K and 39K vectors
  • High dimensional data
  • 60-D feature vectors for a satellite image database
    (landsat).
  • 275K vectors
  • Sequence data
  • Human chromosome 18 (HChr18) and mouse chromosome
    18 (MChr18)
  • 4.2 M and 2.3 M nucleotides

23
Experimental Setup: Compared Techniques
  • NLJ
  • Epsilon Grid Order (EGO) [BBKK01]
  • BFRJ [HJR97]
  • PM-NLJ
  • Random-SC
  • SC
  • CC

24
Experimental Setup
  • Three optimizations tested:
  • OPT 1: reducing the search space by using the PM.
  • OPT 2: clustering.
  • OPT 3: cluster scheduling.

25
Itemized Cost Analysis
Join on MGCounty and LBeach
26
Total Cost Analysis of Various Optimizations
Self-join on HChr18
Buffer Size (num pages)
27
Comparison of SC and CC
28
Total Cost Analysis
Join on landsat data
Buffer Size (num pages)
29
Scalability Analysis
Join on landsat data
Database Size (num vectors per database)
30
Discussion
  • We proposed three optimizations for the join
    operator:
  • Prediction matrix
  • Clustering
  • Buffer recycling
  • SC is 2 to 86 times faster than competing
    techniques for spatial databases, and 13 to 133
    times faster than competing techniques for
    sequence databases.
  • SC is very close to the optimal technique (CC).

31
Future Directions
  • The solution can be generalized to multi-way
    joins.
  • Similar optimizations can be applied to NN
    queries.
  • Can be applied to biological data.

32
Related Work
  • Join without index
  • Arge et al 1998
  • Blasgen et al 1977
  • Bohm et al 2001
  • Chan et al 1997
  • Graefe 1994
  • Koudas et al 1997
  • Koudas et al 2000
  • Orenstein 1986
  • Patel et al 1996
  • Shim et al 2002
  • Xiao et al 2001
  • Join with index
  • Bercken et al 2000
  • Bohm et al 2001
  • Brinkhoff et al 1993
  • Gurret et al 2000
  • Hjaltason et al 1998
  • Huang et al 1997
  • Lo et al 1994
  • Lo et al 1996

THANK YOU
33
Using Sharing Graph to Determine Cache Reuse
  • Scenario 1: Reuse = 1 + 1 = 2
  • Scenario 2: Reuse = 3 + 2 + 1 = 6

(Figure: sharing graph on clusters C1 to C5, shown once for each scenario.)
34
Spatial Join Example
Recreation areas
Hotels
35
Spatial Join Example
36
The Naive Solution: NLJ
Dataset 1 (m pages)
Dataset 2 (n pages)
Buffer: B = 4
min{m,n} + mn/(B-1) page reads
mn page comparisons
We do not need to compare all page pairs!
37
Reading Pages in a Better Order
1 seek + 4 page transfers
3 seeks + 3 page transfers
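
The two options on this slide (1 seek with 4 page transfers, 3 seeks with 3 page transfers) can be compared with a simple linear I/O cost model. The model and the timings below (10 ms per seek, 1 ms per page transfer) are assumptions for illustration only; the slides give no timings.

```python
def io_cost_ms(seeks, transfers, seek_ms=10.0, transfer_ms=1.0):
    """Linear I/O cost model: each seek and each page transfer has a fixed cost."""
    return seeks * seek_ms + transfers * transfer_ms


sequential = io_cost_ms(seeks=1, transfers=4)   # 1 * 10 + 4 * 1 = 14 ms
scattered = io_cost_ms(seeks=3, transfers=3)    # 3 * 10 + 3 * 1 = 33 ms
print(sequential, scattered)
```

With these assumed timings, reading one extra page sequentially is cheaper than paying for two additional seeks, which illustrates why the location of the pages matters as well as their number.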