Set Containment Joins: The Good, The Bad and The Ugly - PowerPoint PPT Presentation

About This Presentation
Title:

Set Containment Joins: The Good, The Bad and The Ugly

Description:

Set Containment Joins: The Good, The Bad and The Ugly Karthikeyan Ramasamy Jointly With Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik ... – PowerPoint PPT presentation

Number of Views:189
Avg rating:3.0/5.0
Slides: 34
Provided by: kart194
Category:
Tags: bad | containment | good | joins | set | ugly

less

Transcript and Presenter's Notes

Title: Set Containment Joins: The Good, The Bad and The Ugly


1
Set Containment Joins The Good, The Bad and The
Ugly
  • Karthikeyan Ramasamy
  • Jointly With
  • Jignesh Patel, Jeffrey F. Naughton and Raghav
    Kaushik

2
Introduction
  • Most data mining today takes place outside of any
    DBMS
  • Unfortunate - many potential advantages arise
    from using a DBMS
  • scalability
  • flexibility
  • consistency

3
Why dont people use RDBMS?
  • Too slow.
  • Difficult to express data mining algorithms in
    SQL
  • Potential for improvement set-valued attributes.

4
Relational DBMS 101
  • Data model
  • tables with rows and columns
  • each individual entry is an atomic element (an
    integer, a float, a character string.)
  • New extension set-valued attributes
  • individual entries of tables can be sets.

5
Sets and Data Mining
  • Canonical example customers and their
    transactions.
  • No sets two tables,
  • customers(cid, name, address, )
  • transactions(cid, product, date,)
  • Sets one table,
  • customers(cid, name, address, trans, )

6
Open questions
  • How do you store the sets?
  • How do you implement operations on these
    set-valued attributes?
  • Do they really help move data mining into SQL?

7
Set Containment Joins
  • Consider two relations
  • Containment is defined as
  • Computes pair of tuples one from R and the other
    from S such that set from R tuple is contained or
    equal to the set from S tuple

8
Set Containment Joins (Cont.)
  • Example
  • STUDENT (sid, courses-taken)
  • COURSES (cid, prereqs)
  • Find the set of courses that student is eligible
    to take

9
Storage Representations
  • Nested internal.
  • Grouped and stored along with the rest of the
    attributes in the tuple.
  • Unnested external.
  • Set instances are unnested and stored in a
    separate relation.
  • Requires join to assemble elements.

10
Nested Internal Representation
Cardinality
Length
Tuple
Element 1
Element 2
A1
A2
A3
.
.
Element N
11
Unnested External - Good Old SQL
SELECT RS.i, SS.j FROM RS, SS WHERE RS.b
SS.d GROUP BY RS.i, SS.j HAVING COUNT() (
SELECT count()
FROM NRS RS
WHERE NRS.i RS.i )
12
SQL Approach - Pros and Cons
  • Pros.
  • Easy to add to an existing DBMS.
  • Cons
  • Requires extra joins for projecting other
    attributes
  • Nested query must be evaluated for each group
  • Number of groups is RS

13
SQL Approach - Mitigation
  • Magic Sets Rewriting

Count Query INSERT INTO T1(i,counti) SELECT RS.i,
COUNT() FROM RS GROUP BY RS.i
Candidate Query INSERT INTO T2(i,j,countij) SELECT
RS.i, SS.j, COUNT() FROM RS, SS WHERE RS.b
SS.d GROUP BY RS.i, SS.j
Verify Query SELECT T2.i, T2.j FROM T2, T1 WHERE
T2.i T1.i AND T2.countij T1.counti
14
Signature Nested Loops (Sig-NL)
  • Applicable for Nested Internal Representation
  • Signatures
  • Signatures are bit vectors for approximating sets
  • Approximation leads to false drops
  • Three phases of the algorithm
  • Signature construction phase
  • Comparing signature for containment
  • Verification of actual subsets

15
Signature Nested Loops (Contd)
  • Signature Construction Phase
  • Take a bit vector
  • Apply a hash function M for each element and set
    the corresponding bit
  • Comparison Phase
  • Necessary condition for subset satisfaction
  • and

16
Partition Algorithms
  • Reduce join execution time by partitioning the
    problem into smaller sub-problems.
  • A partitioning function is used to partition the
    problem.
  • An ideal partitioning function requires
  • Tuple r of R falls in one of the partitions Ri
  • Tuple s of S falls in one of Si
  • Join is accomplished by joining only Ri with Si

17
Partitioned Set Join Algorithm
  • Three phases of algorithm
  • Partitioning Phase
  • Joining Phase
  • Verification Phase

18
Partition Set Join Algorithm (PSJ)
R(1,2,3)
(3,0100001,OIDR)
(4,0100101,OIDS)
S(1,2,3,6)
19
PSJ Joining Phase
  • Any efficient algorithm for joining signatures
    can be used.
  • Signature based partition algorithm
  • Partition R signatures based on randomly chosen
    bit that is set.
  • Probe each S signature multiple times for each
    bit set.
  • Outputs the result object id pairs (OIDR,OIDS).

20
PSJ Pros and Cons
  • Pros
  • Easy to implement similar to hash joins
  • Easily parallelizable
  • Issues
  • Determination of the number of partitions
  • Determination of the signature size

21
PSJ Number of Partitions
  • Large number of partitions leads to large
    overhead
  • Smaller number of partitions leads to more join
    cost
  • Using a detailed analytical model

22
PSJ Signature Size
  • Inversely related to number of partitions
  • Cyclic dependency. Solve simultaneously and use
    bisection method

23
Set Distributions
  • Many degrees of freedom
  • Each degree can follow a distribution of its own.
  • Huge distribution space!

24
Classifying Set Distributions
Large, Large
Small, Large
Large
Set Cardinality
Small, Small
Large, Small
Small
Small
Large
Relation Cardinality
25
Performance Settings
  • Implementation in research version of Paradise
    using extensible operator framework and Set Adt
  • Intel Pentium 333 MHz - Solaris 2.6
  • Main memory - 128 MB
  • Buffer pool size - 32 MB
  • Used raw disks of size 4 GB and I/O bandwidth of
    6 MB/sec
  • Each experiment was run against cold database
  • Synthetic data set

26
Varying Relation Cardinality
Set Cardinality of 20
27
Cost Breakdown of Sig-NL
Set Cardinality of 20
28
Cost Breakdown of PSJ
Set Cardinality of 20
29
Effect of Signature Size
Relation Cardinality of 20000 and Set Cardinality
of 20
30
Effect of Increasing Partitions
Relation Cardinality of 20000 and Set Cardinality
of 120
31
Performance Space
PSJ
PSJ-1, PSJ
Large
Set Cardinality
Sig-NL, PSJ-1
PSJ
Small
Small
Large
Relation Cardinality
32
Conclusion
  • Developed a partition based algorithm for set
    containment joins
  • Performance study shows that PSJ works well on
    most data sets
  • The advantages of PSJ are
  • Simple
  • Effectiveness
  • Easily parallelizable

33
Future Work
  • Algorithm can be easily extended for set
    intersection joins
  • Investigate the applicability of nested
    algorithms for unnested external representations
Write a Comment
User Comments (0)
About PowerShow.com