Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. - PowerPoint PPT Presentation

About This Presentation
Title:

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.

Description:

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. Authors : ... Define a clustering by a vector of edge types & number ( 8,4;2,1 ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 28
Provided by: Itt88
Category:

less

Transcript and Presenter's Notes

Title: Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.


1
Snakes and Sandwiches Optimal Clustering
Strategies for a Data Warehouse.

Authors H. V. Jagadish Laks V. S.
Lakshmanan Divesh Srivastava Presented by
Ittay Freiman
2
The Problem
  • Disk I/O is a major parameter in the cost of a
    (select-where) query.
  • Physical layout of records impacts how much I/O
    needs to be done.
  • Fragmentation of results costs more.
  • No linear ordering of records can be optimal for
    any query.

3
Glossary
  • Grid query - a query according to hierarchies.

4
Linear Clustering Workload
  • Linear clustering - the ordering of records on
    disk.
  • Workload - frequencies on the different types
    (classes) of queries.

P1
P2
Hilbert
5
The Problem - Examples
Expected cost (lttotal cost of all
queriesgt/ltqueries gt
Relative costs (ltbest total costgt / ltworstgt)
6
Agenda
  • Introduction definitions.
  • Lattice, lattice paths.
  • Cost of a lattice path.
  • Algorithm to find an optimal lattice path.
  • Snaking.
  • Experiments.

7
Definitions
  • Query class - a vector (v1,,vk) of level numbers
    in a grid.
  • (0,..,0) - All queries that return a single cell.
  • (1,..,1) - All queries returning 2x2..x2 block
  • Etc.
  • For example
  • Q1 is from the (1,1) class
  • Q2 is from the (2,1) class

8
Definitions
  • (v1,..,vk) ltdef (u1,..,uk) ?? v1lt u1,.., vklt
    uk.
  • (1,0) lt (1,0) , (1,1) , (2,1)
  • (1,0) ? (0,1) - not known
  • (0,..,0) - lower bound, (u1max,..,ukmax) - upper
    bound.
  • D-successor - (u1,..,uk) is a d-successor of
    (v1,..,vk) if (v1,., vi1,..,vk) (u1,..,uk)
  • (1,1) is a d-successor of (1,0)

9
Lattice path
  • Sequence of vectors, starting from the lower
    bound and ending in the upper bound when each
    vector is a d-successor of its previous.

10
Example
  • A 2D grid with binary hierarchies of 3 levels
    each
  • (0,0),(0,1),(0,2),(1,2),(2,2)

example
example
11
Clustering via lattice paths
  • Go through the path. Each entry is a query class.
    For each query in the current entry, cluster the
    cells so theyll be continues.

1
0
2
12
Cost of a lattice path
  • The cost of a query (class) is the number of
    continues segments of the records answering it on
    disk.
  • The cost of a query that is in on the lattice
    path is 1.
  • More formally
  • is a d-successor of . the
    average fanout in dimension d, at level i (the
    dimension and level in which differ).
  • P a lattice path, any point on P.


13
Example
(Fanout 2 in each level)

(2,2)
dist((2,1)) 4
(1,2)
(2,1)
2
(2,0)
(0,2)
2
(1,1)
(1,0)
(0,1)
2
dist((1,0)) 2
(0,0)




14
Cost of a lattice path
  • The weighted average of the costs of each query
    class
  • Optimal lattice path lowest cost.

15
Algorithm
  • For a 2D grid.
  • If is optimal, so is
  • General description
  • Initialization
  • For (i,j) (m,n) to (0,0) do
  • Compute the cost for (i,j) as the min
  • when going through (i1,j)
  • when going through (i,j1)
  • Take the best path through (i,j) accordingly
  • In the end, the path through (0,0) is the
    optimal.

16
Algorithm
  • Time complexity O((m1)(n1))O(mn)
  • Space complexity O(mn)
  • The algorithm works without regard to the fanout
    (only the cost is influenced)
  • For unbalanced hierarchies we add dummy nodes.

17
Snaking
  • Reversing the clustering order of alternate
    queries in the clustering.
  • Note - every adjacent cells on disk differ only
    in one dimension

Revs.
Double Reversing
As is








Revs.




As is
Revs.
18
Improvement by Snaking
  • Snaked paths have no diagonal edges.
  • Edge - Two adjacent points in the clustering
    path.
  • Edge Type - The dimensions in which the
    end-points differ.
  • Non-Diagonal Edge - The points differ in one
    dimension only.
  • There is a snaked lattice path that has optimal
    cost over all possible clusterings.

19
Limits Guarantees
  • For any workload. The result of snaking the
    optimal lattice path lowers the cost by a factor
    of less than 2.
  • A optimal lattice path that has been snaked is
    at most twice worse than the global optimal
    clustering (for a specific workload).

20
Experiments
  • LineItem table from the TPC-D benchmark.
  • 3 dimensions parts, supplier and time.
  • Record size 125 bytes. Page size 8K.
  • The results counted the number of pages and seeks
    for a set of queries with a given workload.
  • The results were normalized by the minimum amount
    of pages needed.
  • 27 workloads were tested.
  • Comparison with the 6 row major strategies

21
Results
22
Results
23
Results
24
Conclusions (from the article)
  • Physical clustering is crucial to data warehouse
    performance
  • Lattice paths can improve performance.
  • With respect to the Hilbert clustering, lattice
    path clustering performs sometimes better.
  • In a follow up paper Hilbert is sandwiched
    between 2 snaked lattice paths, for a workload.
  • Snaking always reduces the cost.
  • Theres a snaked path which is globally optimal.
  • The snaked optimal lattice path is at most twice
    worst of the optimal clustering strategy.

25
Remarks
  • How can we apply this to grids which contain
    dynamic lists in their records?
  • Lack of clarity and formalism in the proofs (some
    not published yet).
  • More work needs to be done to get meaningful
    improvements.

26
Proofs Sketch
  • Define a clustering by a vector of edge types
    number (lt8,42,1gt).
  • Show constraints on such a vector. - consistent
    vector.
  • Define an ordering on these vectors (or any),
    with a minimal definition.
  • Show that every minimal vector (with no diagonal
    edges powers of two in the numbers is a snaked
    lattice path
  • Extend the definition of cost to arbitrary
    constrainded vectors.
  • Show that for every diagonal strategy there
    exists a non-diagonal constrained vector with
    lower cost.

27
Proofs Sketch
  • Show that for every workload there exists a
    snaked lattice path such that its cost is
    minimal.
  • Show that for any workload. The result of snaking
    the optimal lattice path lowers the cost by a
    factor of less than 2.
  • Show that for any workload the optimal snaked
    lattice path improves the snaked optimal lattice
    path by a factor of less than 2.
Write a Comment
User Comments (0)
About PowerShow.com