Set Containment Joins: The Good, The Bad and The Ugly - PowerPoint PPT Presentation

About This Presentation

Title:

Set Containment Joins: The Good, The Bad and The Ugly

Description:

Set Containment Joins: The Good, The Bad and The Ugly Karthikeyan Ramasamy Jointly With Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik ... – PowerPoint PPT presentation

Number of Views:189

Avg rating:3.0/5.0

Slides: 34

Provided by: kart194

Learn more at: https://research.cs.wisc.edu

Category:

more less

Transcript and Presenter's Notes

Title: Set Containment Joins: The Good, The Bad and The Ugly

1
Set Containment Joins The Good, The Bad and The
Ugly

Karthikeyan Ramasamy
Jointly With
Jignesh Patel, Jeffrey F. Naughton and Raghav
Kaushik

2
Introduction

Most data mining today takes place outside of any
DBMS
Unfortunate - many potential advantages arise
from using a DBMS
scalability
flexibility
consistency

3
Why dont people use RDBMS?

Too slow.
Difficult to express data mining algorithms in
SQL
Potential for improvement set-valued attributes.

4
Relational DBMS 101

Data model
tables with rows and columns
each individual entry is an atomic element (an
integer, a float, a character string.)
New extension set-valued attributes
individual entries of tables can be sets.

5
Sets and Data Mining

Canonical example customers and their
transactions.
No sets two tables,
customers(cid, name, address, )
transactions(cid, product, date,)
Sets one table,
customers(cid, name, address, trans, )

6
Open questions

How do you store the sets?
How do you implement operations on these
set-valued attributes?
Do they really help move data mining into SQL?

7
Set Containment Joins

Consider two relations
Containment is defined as
Computes pair of tuples one from R and the other
from S such that set from R tuple is contained or
equal to the set from S tuple

8
Set Containment Joins (Cont.)

Example
STUDENT (sid, courses-taken)
COURSES (cid, prereqs)
Find the set of courses that student is eligible
to take

9
Storage Representations

Nested internal.
Grouped and stored along with the rest of the
attributes in the tuple.
Unnested external.
Set instances are unnested and stored in a
separate relation.
Requires join to assemble elements.

10
Nested Internal Representation
Cardinality
Length
Tuple
Element 1
Element 2
A1
A2
A3
.
.
Element N
11
Unnested External - Good Old SQL
SELECT RS.i, SS.j FROM RS, SS WHERE RS.b
SS.d GROUP BY RS.i, SS.j HAVING COUNT() (
SELECT count()
FROM NRS RS
WHERE NRS.i RS.i )
12
SQL Approach - Pros and Cons

Pros.
Easy to add to an existing DBMS.
Cons
Requires extra joins for projecting other
attributes
Nested query must be evaluated for each group
Number of groups is RS

13
SQL Approach - Mitigation

Magic Sets Rewriting

Count Query INSERT INTO T1(i,counti) SELECT RS.i,
COUNT() FROM RS GROUP BY RS.i
Candidate Query INSERT INTO T2(i,j,countij) SELECT
RS.i, SS.j, COUNT() FROM RS, SS WHERE RS.b
SS.d GROUP BY RS.i, SS.j
Verify Query SELECT T2.i, T2.j FROM T2, T1 WHERE
T2.i T1.i AND T2.countij T1.counti
14
Signature Nested Loops (Sig-NL)

Applicable for Nested Internal Representation
Signatures
Signatures are bit vectors for approximating sets
Approximation leads to false drops
Three phases of the algorithm
Signature construction phase
Comparing signature for containment
Verification of actual subsets

15
Signature Nested Loops (Contd)

Signature Construction Phase
Take a bit vector
Apply a hash function M for each element and set
the corresponding bit
Comparison Phase
Necessary condition for subset satisfaction
and

16
Partition Algorithms

Reduce join execution time by partitioning the
problem into smaller sub-problems.
A partitioning function is used to partition the
problem.
An ideal partitioning function requires
Tuple r of R falls in one of the partitions Ri
Tuple s of S falls in one of Si
Join is accomplished by joining only Ri with Si

17
Partitioned Set Join Algorithm

Three phases of algorithm
Partitioning Phase
Joining Phase
Verification Phase

18
Partition Set Join Algorithm (PSJ)
R(1,2,3)
(3,0100001,OIDR)
(4,0100101,OIDS)
S(1,2,3,6)
19
PSJ Joining Phase

Any efficient algorithm for joining signatures
can be used.
Signature based partition algorithm
Partition R signatures based on randomly chosen
bit that is set.
Probe each S signature multiple times for each
bit set.
Outputs the result object id pairs (OIDR,OIDS).

20
PSJ Pros and Cons

Pros
Easy to implement similar to hash joins
Easily parallelizable
Issues
Determination of the number of partitions
Determination of the signature size

21
PSJ Number of Partitions

Large number of partitions leads to large
overhead
Smaller number of partitions leads to more join
cost
Using a detailed analytical model

22
PSJ Signature Size

Inversely related to number of partitions
Cyclic dependency. Solve simultaneously and use
bisection method

23
Set Distributions

Many degrees of freedom
Each degree can follow a distribution of its own.
Huge distribution space!

24
Classifying Set Distributions
Large, Large
Small, Large
Large
Set Cardinality
Small, Small
Large, Small
Small
Small
Large
Relation Cardinality
25
Performance Settings

Implementation in research version of Paradise
using extensible operator framework and Set Adt
Intel Pentium 333 MHz - Solaris 2.6
Main memory - 128 MB
Buffer pool size - 32 MB
Used raw disks of size 4 GB and I/O bandwidth of
6 MB/sec
Each experiment was run against cold database
Synthetic data set

26
Varying Relation Cardinality
Set Cardinality of 20
27
Cost Breakdown of Sig-NL
Set Cardinality of 20
28
Cost Breakdown of PSJ
Set Cardinality of 20
29
Effect of Signature Size
Relation Cardinality of 20000 and Set Cardinality
of 20
30
Effect of Increasing Partitions
Relation Cardinality of 20000 and Set Cardinality
of 120
31
Performance Space
PSJ
PSJ-1, PSJ
Large
Set Cardinality
Sig-NL, PSJ-1
PSJ
Small
Small
Large
Relation Cardinality
32
Conclusion