Query Flocks - PowerPoint PPT Presentation

Slides: 33
Provided by: jeremyhi
Learn more at: http://web.cs.wpi.edu
1
Query Flocks
  • Umar Hammoud
  • Elizabeth Cash
  • March 25, 2003

2
Presentation Based On
  • Paper Title
  • Query Flocks: A Generalization of
    Association-Rule Mining
  • Authors: Dick Tsur, Jeffrey D. Ullman,
    Serge Abiteboul, Chris Clifton, Rajeev
    Motwani, Svetlozar Nestorov, Arnon Rosenthal

3
Association-Rule
  • The goal is to find sets of items that are
    associated
  • The fact of their association is called an
    association rule

4
Market Basket Mining
  • Understand the behavior of the customers when
    they shop to improve marketing
  • An attempt by retail store to learn what items
    its customers purchase together
  • A way to find items that tend to appear together
    in a market basket

5
Precise Measures of Association
  • Given a relation
  • baskets(BID, Item), where BID is the basket
    ID

1. Support: The items must appear in many
baskets.
2. Confidence: The probability of one item given
that the others are in the basket must be high.
3. Interest: That probability must be
significantly higher or lower than the expected
probability if items were purchased at random.
6
Examples: Measures of Association
  • People who buy milk often buy cereal: {cereal,
    milk}

1. High support means that many people buy both
cereal and milk.
2. High confidence means that a lot of people
who buy cereal also buy milk.
3. High interest means that if you buy cereal,
then you are much more likely to buy milk than
the general population.
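
The three measures can be computed directly from a list of baskets. This is a minimal sketch, not the presenters' code: the function name and sample data are made up, and interest is taken here as the ratio of confidence to the overall probability of the second item, which is only one way to quantify "significantly higher or lower than expected".

```python
def association_measures(baskets, antecedent, consequent):
    """Return (support, confidence, interest) for antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one basket.
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    with_consequent = [b for b in baskets if consequent in b]

    support = len(with_both)                            # baskets holding both items
    confidence = len(with_both) / len(with_antecedent)  # P(consequent | antecedent)
    expected = len(with_consequent) / len(baskets)      # P(consequent) overall
    interest = confidence / expected                    # assumed definition: a ratio
    return support, confidence, interest


# Hypothetical sample data: eight baskets.
sample_baskets = [
    {"cereal", "milk"}, {"cereal", "milk"}, {"cereal", "milk", "bread"},
    {"milk"}, {"bread"}, {"bread", "eggs"}, {"eggs"}, {"bread"},
]
measures = association_measures(sample_baskets, "cereal", "milk")
```

Here every cereal buyer also buys milk, while only half of all baskets contain milk, so cereal buyers are twice as likely as the general population to buy milk.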
7
Association-Rule Optimization
  • Can be optimized by taking advantage of many of
    the query optimization ideas (e.g. a-priori)

8
The A-Priori Optimization
Let S be a set of items that appears in at least n
baskets, and let S' be a subset of S. Then S' also
appears in at least n baskets.
  • Using this technique tuples can be eliminated
    before the join

9
A-Priori Generalization
  • Extended to provide efficient mining of very
    large databases, for many different kinds of
    patterns.
  • Can be used for
  • general-purpose mining systems
  • future generation of query optimizers.
  • Known as Query flocks

10
Query Flocks
  • A parameterized query with a filter condition to
    eliminate the uninteresting values of the
    parameters
  • Represented in Datalog

11
Mining Languages
  • Can SQL be used as a mining language?

 In principle it can, but the right optimizations
are not there.
12
SQL: What's the Problem?
  • The A-Priori trick has not been implemented by
    any conventional optimizer
  • SELECT i1.Item, i2.Item
  • FROM baskets i1, baskets i2
  • WHERE i1.Item < i2.Item AND
  • i1.BID = i2.BID
  • GROUP BY i1.Item, i2.Item
  • HAVING 20 <= COUNT(i1.BID)
  • Better performance can be achieved if the query
    is rewritten in the following way
  • First find those items that appeared in at least
    20 baskets
  • Join the set of these items with the baskets
    relation
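
The rewrite described above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the helper name, the sample rows, and the exact form of the rewritten query are illustrative, and each (BID, Item) row is assumed to occur only once.

```python
import sqlite3


def frequent_pairs(rows, threshold, use_apriori):
    """Return item pairs that share at least `threshold` baskets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")
    conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)
    if use_apriori:
        # First keep only the items that appear in enough baskets,
        # then join that reduced relation with itself.
        source = """(SELECT BID, Item FROM baskets
                     WHERE Item IN (SELECT Item FROM baskets
                                    GROUP BY Item
                                    HAVING COUNT(DISTINCT BID) >= :t))"""
    else:
        source = "baskets"
    sql = f"""
        SELECT i1.Item, i2.Item
        FROM {source} i1, {source} i2
        WHERE i1.Item < i2.Item AND i1.BID = i2.BID
        GROUP BY i1.Item, i2.Item
        HAVING COUNT(i1.BID) >= :t
        ORDER BY i1.Item, i2.Item
    """
    result = conn.execute(sql, {"t": threshold}).fetchall()
    conn.close()
    return result


# Hypothetical data: baskets 1-3 hold milk and cereal, basket 4 holds milk and bread.
rows = [(1, "milk"), (1, "cereal"), (2, "milk"), (2, "cereal"),
        (3, "milk"), (3, "cereal"), (4, "milk"), (4, "bread")]
naive = frequent_pairs(rows, 2, use_apriori=False)
rewritten = frequent_pairs(rows, 2, use_apriori=True)
```

Both variants return the same pairs; the rewritten form simply joins a smaller relation, because bread is dropped before the self-join.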

13
Mining with Flocks
  • Many data mining problems can benefit from the
    A-priori trick as a query optimization
  • The formalism of query flocks is an important
    tool for building better optimizers

14
Query Flocks
  • A family of queries, identical except for their
    parameter values, that are asked simultaneously
  • The answers to these queries are filtered
  • The queries whose answers pass the filter
    contribute their parameter values to the result

15
Query Flock Settings
  • Queries are parameterized by one or more
    parameters
  • Ability to express filter conditions about the
    results of the query

16
Query Flock Designation
  • One or more predicates that represent data stored
    as relations
  • A set of parameters with names starting with $
  • A query
  • A filter that specifies a condition

17
Language for Flocks
  • Conjunctive Queries augmented with arithmetic
    and with union
  • Datalog is used rather than SQL because it gives
    the following capabilities
  • The notion of safe query for Datalog figures
    into potential optimizations
  • The set of options for adapting the A-priori
    trick to arbitrary flocks is most easily
    expressed in Datalog
  • SQL is used for the filter language only

18
Market Basket as a Query Flock
QUERY
answer(B) :- baskets(B,$1) AND baskets(B,$2)
FILTER
COUNT(answer.B) >= 20
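
Read literally, the flock asks the same query once per assignment of values to the parameters and keeps the assignments whose answer passes the filter. The sketch below evaluates it that way, with no optimization; the function name and sample data are assumptions, and the threshold is lowered from 20 to 3 to suit the tiny data set.

```python
from itertools import product


def market_basket_flock(baskets, threshold):
    """baskets: set of (bid, item) tuples. Returns the accepted (p1, p2) pairs."""
    items = {item for _, item in baskets}
    accepted = set()
    for p1, p2 in product(items, repeat=2):       # every parameter assignment
        # answer(B) :- baskets(B, p1) AND baskets(B, p2)
        answer = ({b for b, i in baskets if i == p1}
                  & {b for b, i in baskets if i == p2})
        if len(answer) >= threshold:              # FILTER: COUNT(answer.B) >= threshold
            accepted.add((p1, p2))
    return accepted


# Hypothetical data: baskets 1-3 hold milk and cereal, basket 4 holds milk and bread.
sample = {(1, "milk"), (1, "cereal"), (2, "milk"), (2, "cereal"),
          (3, "milk"), (3, "cereal"), (4, "milk"), (4, "bread")}
flock_result = market_basket_flock(sample, 3)
```

Because this flock places no ordering constraint on the parameters, the result contains both orderings of a pair as well as each frequent item paired with itself.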
19
Language Extensions
  • To apply query optimizations proposed, extensions
    must be added
  • Negated subgoals
  • Arithmetic subgoals for variables and parameters

20
Extensions Usage
  • Add arithmetic extension to the previous query to
    restrict item pairs to appear in lexicographic
    order

answer(B) :- baskets(B,$1) AND baskets(B,$2)
AND $1 < $2
21
Extensions Usage
  • Given the following relations
  • diagnoses(Patient, Disease)
  • exhibits(Patient, Symptom)
  • treatments(Patient, Medicine)
  • causes(Disease, Symptom)
  • Find unexplained side effects

QUERY
answer(P) :- exhibits(P,$s)
AND treatments(P,$m) AND diagnoses(P,D)
AND NOT causes(D,$s)
FILTER
COUNT(answer.P) >= 20
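
A direct, unoptimized reading of the side-effect flock: for each (symptom, medicine) pair, collect the patients who exhibit the symptom, take the medicine, and have some diagnosed disease that does not cause the symptom. The function name, sample relations, and the threshold of 2 are assumptions for illustration.

```python
from itertools import product


def side_effect_flock(exhibits, treatments, diagnoses, causes, threshold):
    """Each relation is a set of pairs. Returns accepted (symptom, medicine) pairs."""
    symptoms = {s for _, s in exhibits}
    medicines = {m for _, m in treatments}
    accepted = set()
    for s, m in product(symptoms, medicines):
        answer = {
            p
            for p, d in diagnoses                  # diagnoses(P, D)
            if (p, s) in exhibits                  # exhibits(P, s)
            and (p, m) in treatments               # treatments(P, m)
            and (d, s) not in causes               # NOT causes(D, s)
        }
        if len(answer) >= threshold:               # FILTER on COUNT(answer.P)
            accepted.add((s, m))
    return accepted


# Hypothetical data for three patients.
diagnoses = {(1, "flu"), (2, "flu"), (3, "cold")}
exhibits = {(1, "rash"), (2, "rash"), (3, "rash"), (1, "fever"), (2, "fever")}
treatments = {(1, "drugA"), (2, "drugA"), (3, "drugB")}
causes = {("flu", "fever")}
side_effects = side_effect_flock(exhibits, treatments, diagnoses, causes, 2)
```

Fever is explained by the flu, so only the rash seen in the two drugA patients is reported.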
22
Generalizing A-Priori Techniques
  • Evaluate the less expensive query first

The answer lets us put an upper bound on the size
of the answer obtained with certain parameter values.
If the bound is less than the filter threshold,
eliminate those parameter values without further
consideration.
For query Q1 to put an upper bound on the size
of the result of query Q2, it must be provable
that the result of Q2 is a subset of the result
of Q1.
  • The containment-mapping theorem says
  • Q2 ⊆ Q1 can hold if Q1 is constructed from Q2
    by
  • Taking a subset of the subgoals of Q2, and
  • Splitting zero or more variables into several
    variables.

23
Safe Query Example
answer(B) :- baskets(B,$1) AND baskets(B,$2)
AND $1 < $2
 Two safe subqueries are formed by taking proper
subsets of the subgoals:
answer(B) :- baskets(B,$1)
and
answer(B) :- baskets(B,$2)
24
Safe Query Example cont.
If we take the first, we can ask:
  • For what values of $1 does the query
    answer(B) :- baskets(B,$1)
    produce a number of values of B that is over the
    threshold given in the filter?
  • Any other value of $1 can be eliminated as a
    member of a pair of items meeting the filter
    condition
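
The pruning step can be sketched as two passes: first evaluate the one-subgoal subquery to find the surviving item values, then test pairs drawn only from the survivors. The function name and data below are assumptions, and the threshold is 3 rather than 20.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs_with_pruning(baskets, threshold):
    """baskets: iterable of (bid, item). Returns (surviving items, frequent pairs)."""
    bids_by_item = defaultdict(set)
    for bid, item in baskets:
        bids_by_item[item].add(bid)

    # Subquery answer(B) :- baskets(B, item): keep items that pass the filter alone.
    ok_items = sorted(i for i, bids in bids_by_item.items()
                      if len(bids) >= threshold)

    pairs = set()
    # combinations() over a sorted list yields each pair once, in lexicographic order.
    for p1, p2 in combinations(ok_items, 2):
        if len(bids_by_item[p1] & bids_by_item[p2]) >= threshold:
            pairs.add((p1, p2))
    return ok_items, pairs


# Hypothetical data: baskets 1-3 hold milk and cereal, basket 4 holds milk and bread.
basket_rows = [(1, "milk"), (1, "cereal"), (2, "milk"), (2, "cereal"),
               (3, "milk"), (3, "cereal"), (4, "milk"), (4, "bread")]
survivors, pruned_pairs = frequent_pairs_with_pruning(basket_rows, 3)
```

Bread appears in only one basket, so it is dropped before any pair involving it is ever counted.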

25
Search for Optimal Query-Flock Evaluators
R(P) := FILTER(P,Q,C)
P: set of parameters
Q: query involving parameters P
R: relation whose tuples are values of the parameters P
C: condition on the result of the query Q
26
A Query Plan
okS($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20)
okM($m) := FILTER($m,
    answer(P) :- treatments(P,$m),
    COUNT(answer.P) >= 20)
ok($s,$m) := FILTER(($s,$m),
    answer(P) :-
        okS($s) AND
        okM($m) AND
        diagnoses(P,D) AND
        exhibits(P,$s) AND
        treatments(P,$m) AND
        NOT causes(D,$s),
    COUNT(answer.P) >= 20)
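
A possible Python rendering of this three-step plan, as a sketch only: okS and okM are computed first from the single-subgoal queries, and the full query is then evaluated just for the surviving symptom and medicine values. The function name, sample relations, and threshold of 2 are assumptions.

```python
def planned_side_effects(exhibits, treatments, diagnoses, causes, threshold):
    """Relations are sets of pairs. Returns (okS, okM, ok)."""
    patients_by_symptom = {}
    for p, s in exhibits:
        patients_by_symptom.setdefault(s, set()).add(p)
    patients_by_medicine = {}
    for p, m in treatments:
        patients_by_medicine.setdefault(m, set()).add(p)

    # Step 1: symptoms exhibited by enough patients.
    ok_s = {s for s, ps in patients_by_symptom.items() if len(ps) >= threshold}
    # Step 2: medicines taken by enough patients.
    ok_m = {m for m, ps in patients_by_medicine.items() if len(ps) >= threshold}

    # Step 3: the full query, restricted to surviving parameter values.
    ok = set()
    for s in ok_s:
        for m in ok_m:
            answer = {
                p for p, d in diagnoses
                if p in patients_by_symptom[s]
                and p in patients_by_medicine[m]
                and (d, s) not in causes
            }
            if len(answer) >= threshold:
                ok.add((s, m))
    return ok_s, ok_m, ok


# Hypothetical data for three patients.
plan_diagnoses = {(1, "flu"), (2, "flu"), (3, "cold")}
plan_exhibits = {(1, "rash"), (2, "rash"), (3, "rash"), (1, "fever"), (2, "fever")}
plan_treatments = {(1, "drugA"), (2, "drugA"), (3, "drugB")}
plan_causes = {("flu", "fever")}
ok_s, ok_m, ok = planned_side_effects(
    plan_exhibits, plan_treatments, plan_diagnoses, plan_causes, 2)
```

With this data drugB is eliminated in step 2, so the final step examines two (symptom, medicine) pairs instead of four.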

27
Is There a Rule for Generating Query Plans?
  • Consider only sequences of filter steps that
    satisfy these conditions
  • Steps must use the same filter condition as the
    original query flock
  • Each step must define a uniquely named relation
  • Each step is derived from the given query flock
    as follows
  • Start with the original query flock
  • Add in zero or more subgoals that are copies of
    the left side of the assignment (:=) in some
    previous filter step
  • Delete zero or more subgoals but, following the
    optimization principle for conjunctive queries,
    make sure that the resulting query is safe.
  • The final step must not delete any subgoals of
    the original query; it may have additional
    subgoals derived from previous steps, of course.

28
Exponential Search: Query Plan
  • Candidate for best possible
  • Long sequence of steps in which each uses the
    results of the previous step
  • How to restrict the search
  • Select sets of parameters
  • Select list of subsets of the subgoals of the
    original query that form safe queries.

29
Dynamic Selection of Filter Steps
  • We let the sizes of intermediate relations
    determine whether or not to apply filters
  • The important special case
  • When the set of parameters for a relation has not
    previously been encountered.
  • If the support threshold is low, then a filter is
    likely to be useful
  • If the support threshold is high, then a filter is
    unlikely to be useful

30
Possible Query Plan
temp1($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20
)
temp2(P,$s,$m) := (temp1($s) JOIN
    exhibits(P,$s)) JOIN treatments(P,$m)
temp3($s,$m) := FILTER(($s,$m),
    answer(P) :- temp2(P,$s,$m),
    COUNT(answer.P) >= 20
)
temp4(P,D,$s,$m) := ((temp3($s,$m) JOIN
    temp2(P,$s,$m))
    JOIN diagnoses(P,D)) JOIN
    (NOT causes(D,$s))
sideEffect($s,$m) := FILTER(($s,$m),
    answer(P) :- temp4(P,D,$s,$m),
    COUNT(answer.P) >= 20
)
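
The plan above interleaves joins with filter steps. A rough Python equivalent is sketched below; `filter_step` is a hypothetical helper that counts distinct patients per parameter value, and the sample relations and threshold of 2 are assumptions.

```python
from collections import defaultdict


def filter_step(relation, key, threshold):
    """Keep the key values that occur with at least `threshold` distinct patients.

    Each row is a tuple whose first field is the patient; key(row) extracts
    the parameter value(s) being filtered.
    """
    patients = defaultdict(set)
    for row in relation:
        patients[key(row)].add(row[0])
    return {k for k, ps in patients.items() if len(ps) >= threshold}


def dynamic_plan(exhibits, treatments, diagnoses, causes, threshold):
    # temp1: symptoms exhibited by enough patients.
    temp1 = filter_step(exhibits, lambda r: r[1], threshold)
    # temp2: (patient, symptom, medicine) for the surviving symptoms.
    temp2 = {(p, s, m) for p, s in exhibits if s in temp1
             for p2, m in treatments if p2 == p}
    # temp3: (symptom, medicine) pairs that still have enough patients.
    temp3 = filter_step(temp2, lambda r: (r[1], r[2]), threshold)
    # temp4: add diagnoses and drop rows whose disease explains the symptom.
    temp4 = {(p, d, s, m) for p, s, m in temp2 if (s, m) in temp3
             for p2, d in diagnoses if p2 == p and (d, s) not in causes}
    # sideEffect: the final filter on the full query.
    return filter_step(temp4, lambda r: (r[2], r[3]), threshold)


# Hypothetical data for three patients.
dyn_diagnoses = {(1, "flu"), (2, "flu"), (3, "cold")}
dyn_exhibits = {(1, "rash"), (2, "rash"), (3, "rash"), (1, "fever"), (2, "fever")}
dyn_treatments = {(1, "drugA"), (2, "drugA"), (3, "drugB")}
dyn_causes = {("flu", "fever")}
dyn_result = dynamic_plan(dyn_exhibits, dyn_treatments, dyn_diagnoses, dyn_causes, 2)
```

In a dynamic strategy, whether to run the temp3 filter would depend on the observed size of temp2; this sketch always applies it.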

31
Conclusions
  • It's a generate-and-test model for data-mining
    problems
  • Uses "parts of queries" constructively to prune
    answer sets for main queries
  • Provides a parameterized way to specify a set of
    queries, whose answer is the parameter(s)

32
So What Should Tim Tell His Mother?
  • In one sentence
  • Generalization of query optimization techniques
    to be used for data mining.
  • And Questions?