Frequent Pattern Queries with Constraints (Language & Algorithms) - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Frequent Pattern Queries with Constraints (Language & Algorithms)


1
Pisa KDD Laboratory - http://www-kdd.isti.cnr.it/
Frequent Pattern Queries with Constraints (Language & Algorithms)
Francesco Bonchi, Fosca Giannotti, Dino Pedreschi
Workshop on Inductive Databases and Constraint Based Mining, 12/03/04
2
Frequent Pattern Queries: Language and Optimizations (Ph.D. Thesis)
  • Supervisors: Dr. Fosca Giannotti, Prof. Dino Pedreschi
  • International Reviewers: Prof. Jean-Francois Boulicaut, Prof. Jiawei Han
  • Part 1: Data Mining Query Language
  • Frequent Pattern Queries (FPQs): definition
  • A language for FPQs
  • Part 2: Optimizations
  • Algorithms for FPQs (pushing monotone constraints)
  • Part 3: Conclusion
  • Optimized operational semantics for FPQs (putting together Parts 1 and 2)

3
Plan of the talk
  • Language for Frequent Pattern Queries
  • Algorithms for Constrained Frequent Pattern Mining
  • Adaptive Constraint Pushing [Bonchi et al., PKDD'03]
  • ExAnte preprocessing [Bonchi et al., PKDD'03]
  • Further exploiting the ExAnte property
  • ExAMiner (breadth-first) [Bonchi et al., ICDM'03]
  • FP-bonsai (depth-first) [Bonchi and Goethals, PAKDD'04]
  • Ongoing and future work: the P3D project

4
An interesting feature of all our algorithms
  • They provide the exact support for all solution itemsets.
  • This feature distinguishes our algorithms from those presented by Luc (this morning) and by Johannes (this afternoon).
  • In some sense, they solve different problems.
  • Analogy with classical frequent itemset mining without constraints:
  • our algorithms ↔ frequent itemset mining (Apriori)
  • their algorithms ↔ maximal frequent itemset mining (MaxMiner)

5
Language for Frequent Pattern Queries
6
Our Objective
  • To study Frequent Pattern Query optimizations in a Logic-based Knowledge Discovery Support Environment:
  • a flexible knowledge discovery system, capable of obtaining, maintaining, representing and using both induced and deduced knowledge in a unified framework.
  • The need for such a system is suggested by real-world mining applications [Bonchi et al., KDD'99; Giannotti et al., DMKD'99].
  • A Deductive Database can easily represent both extensional and intensional data.
  • Previous work by our group has shown that this capability makes it viable for a suitable representation of domain knowledge and for supporting the various steps of the KDD process.
  • LKDSE = Deductive Database + a set of inductive primitives.
  • LKDSE = Inductive Database where the DB component is a Deductive DB.

7
Logic-based Knowledge Discovery Support Environment
[Architecture diagram: a knowledge base of relational extensions (source data, background knowledge, extracted knowledge) feeds MINING through inductive rules; deductive rules drive PREPROCESSING (background knowledge integration) and POSTPROCESSING (reasoning on extracted knowledge); the closure principle ties the steps together.]
8
Logic-based Knowledge Discovery Support Environment
  • The main issue for a deductive approach:
  • how to choose a suitable representation for the inductive part? (In other words: how to define inductive mining queries?)
  • Manco's Ph.D. thesis:
  • inductive queries = user-defined aggregates
  • LDL-Mine:
  • LDL (deductive database language and system) + user-defined aggregates + external calls to implement mining primitives
  • Main drawback: the atomicity of the aggregate gives no optimization opportunity [Boulicaut and De Raedt, PKDD'02 tutorial].
  • Example: constraint-pushing techniques in frequent itemset mining would require the analyst to create her own aggregates for every different situation (conjunction of constraints).

9
Our Vision Declarative Mining
  • The analyst must have a high-level vision of the knowledge discovery system, without worrying about the details of the computational engine, in the very same way a database designer does not have to worry about query optimization.
  • She just needs to specify declaratively, in the inductive query, what the desired patterns should look like and which conditions they must satisfy (a set of constraints).
  • It is then up to the query optimizer to compose all the constraints and to produce the most efficient mining strategy (≈ execution plan) for the given inductive query.

10
Inductive Rule
  • INDUCTIVE RULE: a conjunction of sentences about the desired patterns.
  • H ← B1, ..., Bn
  • H is a relation representing the induced pattern.
  • The sentences B1, ..., Bn are taken from a restricted class of sentences.
  • The set of all allowed sentences is just some "syntactic sugar" on top of an algorithm (the inductive engine).
  • Each sentence can be defined over some relations.
  • Having a well-defined and restricted set of admitted sentences allows us to write highly optimized algorithms to compute inductive rules with any conjunction of sentences.
  • In particular, we focus on Frequent Pattern Queries.

11
Why Frequent Pattern Queries?
  • Frequency of a pattern is the most important interestingness measure.
  • Frequency has the right granularity to be a primitive in a DMQL:
  • it's a simple, low-level concept
  • complex and time-consuming computation
  • many different kinds of patterns are based on frequency:
  • frequent itemsets
  • frequent sequences (or sequential patterns)
  • frequent episodes
  • frequent substructures in graph data
  • it can be used to define a whole class of data mining tasks:
  • Association Rules, Correlation Rules, Causality Rules, Ratio Rules, ...
  • Iceberg Queries and Iceberg Rules, Partial Periodicity, Emerging Patterns, ...
  • Classification, Clustering

12
Research Path Followed
  • We tried to define a language for FPQs expressive enough to cover most interesting inductive queries, yet simple enough to be highly optimized.
  • FPQ Definition → identification of all the basic components of an FPQ
  • Syntax → syntactic sugar to express all the basic components of an FPQ
  • Safety → not all inductive rules derivable from the provided grammar are meaningful
  • Formal Semantics → by showing that there exists a unique mapping from each safe FPQ (inductive rule) of our language to a Datalog program (set of deductive rules) with user-defined aggregates (Manco's framework). Thanks to this mapping we can define the formal declarative semantics of an inductive rule as the iterated stable model of the corresponding Datalog program.
  • Expressiveness → by means of a suite of examples of interesting complex queries.

13
Inductive Query Example
Compute simple association rules, having exactly 2 items in the head and at least 3 items in the body, creating transactions by grouping tuples by day and customer, having support greater than 1000 and confidence more than 0.4, and spending at least 50 in toys (total sum of the prices of the items of type toy involved in the rule). A sketch of the grouping step follows after the rules.
Inductive rule:
interesting_set(Set,Sup,Card,T) ←
    Sup = freq(Set,X), X = ⟨I, ⟨D,C⟩⟩, sales(D,C,I,Q),
    Sup > 1000, Card = card(Set),
    J ⊆ Set, T = sum(P, product(J,P,toy)).

Deductive (LDL) rule:
interesting_rules(L,R,Sup,Conf) ←
    interesting_set(Set,Sup,Card,T), Card ≥ 5, T ≥ 50,
    interesting_set(R,S1,2,T1), subset(R,Set),
    difference(Set,R,L), Conf = Sup / S1, Conf > 0.4.
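As an illustration of the grouping step in the query above, here is a minimal Python sketch; the sales tuples and all names are hypothetical, not the thesis language:

    from collections import defaultdict

    # Hypothetical sales tuples: (day, customer, item, quantity).
    sales = [
        (1, "c1", "lego", 2), (1, "c1", "ball", 1),
        (1, "c2", "pen", 3), (2, "c1", "lego", 1),
    ]

    # Group tuples by (day, customer): one transaction per group.
    transactions = defaultdict(set)
    for day, customer, item, qty in sales:
        transactions[(day, customer)].add(item)

    print(dict(transactions))
    # e.g. {(1, 'c1'): {'lego', 'ball'}, (1, 'c2'): {'pen'}, (2, 'c1'): {'lego'}}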
14
Algorithms for Constrained Frequent Pattern
Mining
15
Why Constraints?
  • Frequent pattern mining usually produces too many solution patterns. This situation is harmful for two reasons:
  • Performance: mining is usually inefficient or, often, simply unfeasible.
  • Identification of the fragments of interesting knowledge, blurred within a huge quantity of small, mostly useless patterns, is a hard task.
  • Constraints are the solution to both these problems:
  • they can be pushed into the frequent pattern computation, exploiting them to prune the search space, thus reducing time and resource requirements;
  • they provide the user with guidance over the mining process and a way of focusing on the interesting knowledge.
  • With constraints we obtain fewer, more interesting patterns. Indeed, constraints are the way we define what is interesting.

16
Constrained Frequent Itemset Mining Problem
  • Notation:
  • We indicate the frequency constraint with Cfreq, without explicitly indicating the dataset and the min_sup.
  • Given a constraint C, let Th(C) = {X | C(X)} denote the set of all itemsets X that satisfy C.
  • The frequent itemset mining problem requires computing Th(Cfreq).
  • The constrained frequent itemset mining problem requires computing Th(Cfreq) ∩ Th(C) (a naive executable rendering follows below).

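A naive executable rendering of these definitions (it enumerates the whole powerset, so it is for illustration on toy data only; the dataset and the second constraint are made up):

    from itertools import combinations

    def support(itemset, tdb):
        return sum(1 for t in tdb if itemset <= t)

    def theory(items, constraint):
        # Th(C) = {X | C(X)}: all non-empty itemsets that satisfy C.
        return {frozenset(s)
                for r in range(1, len(items) + 1)
                for s in combinations(sorted(items), r)
                if constraint(frozenset(s))}

    tdb = [frozenset("abc"), frozenset("abd"), frozenset("bc")]
    items = {"a", "b", "c", "d"}
    c_freq = lambda X: support(X, tdb) >= 2   # Cfreq with min_sup = 2
    c_m = lambda X: len(X) >= 2               # an example constraint C

    print(theory(items, c_freq) & theory(items, c_m))   # Th(Cfreq) ∩ Th(C)
    # {frozenset({'a', 'b'}), frozenset({'b', 'c'})}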
17
Problem Definition: Anti-monotone Constraint
  • Frequency is an anti-monotone constraint.
  • "Apriori trick": if an itemset X does not satisfy Cfreq, then no superset of X can satisfy Cfreq (a candidate-generation sketch follows below).
  • Other examples of anti-monotone constraints:
  • sum(X.prices) ≤ 20 euro
  • |X| ≤ 5

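A sketch of the Apriori trick as a candidate-generation step, assuming level_k holds the frequent k-itemsets as frozensets:

    from itertools import combinations

    def apriori_gen(level_k):
        # Join: merge pairs of frequent k-itemsets differing in one item.
        candidates = {a | b for a in level_k for b in level_k
                      if len(a | b) == len(a) + 1}
        # Prune: keep a (k+1)-candidate only if ALL of its k-subsets are
        # frequent; any infrequent subset rules it out (anti-monotonicity).
        return {c for c in candidates
                if all(frozenset(s) in level_k
                       for s in combinations(c, len(c) - 1))}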
18
Problem Definition: Monotone Constraint
  • A constraint CM is monotone when, if an itemset X satisfies CM, then every superset of X satisfies CM as well (e.g., sum(X.prices) ≥ 45 euro).
19
Our Problem
To compute the itemsets which satisfy a conjunction of anti-monotone and monotone constraints.
  • Why monotone constraints?
  • They're the most useful for discovering local high-value patterns (for instance, very expensive or very large itemsets, which can be found only with a very small min_sup).
  • We have known how to exploit the other kinds of constraints (anti-monotone, succinct) since '98 [Ng et al., SIGMOD'98], while for monotone constraints the situation is more complex.

20
Search Space Characterization
21
Problem Characterization
22
Tradeoff between AM and M pruning
23
The two extremes: priority to AM
  • Strategy G&T (Generate & Test):
  • Apriori level-wise computation, followed by the test of CM (a sketch follows below).
  • Maximum possible anti-monotone pruning. No monotone pruning.

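A self-contained sketch of G&T under these assumptions (helper names are mine, not the paper's):

    from itertools import combinations

    def support(itemset, tdb):
        return sum(1 for t in tdb if itemset <= t)

    def g_and_t(items, tdb, min_sup, c_m):
        # Plain level-wise Apriori: maximum anti-monotone pruning.
        level = {frozenset([i]) for i in items
                 if support(frozenset([i]), tdb) >= min_sup}
        frequent = set()
        while level:
            frequent |= level
            cands = {a | b for a in level for b in level
                     if len(a | b) == len(a) + 1}
            cands = {c for c in cands
                     if all(frozenset(s) in level
                            for s in combinations(c, len(c) - 1))}
            level = {c for c in cands if support(c, tdb) >= min_sup}
        # CM is tested only at the very end: no monotone pruning at all.
        return {X for X in frequent if c_m(X)}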
24
The two extremes: priority to M
  • Strategy MCP (Monotone Constraint Pushing):
  • First find B(CM); then only itemsets over the border are generated as candidates to be tested for frequency.
  • Candidate generation function generate_over [Boulicaut and Jeudy, 2000]: add one item to each solution at the previous level (a sketch follows below).
  • Anti-monotone pruning is only partially possible.
  • Maximum possible monotone pruning. Little anti-monotone pruning.

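A sketch of the generate_over step only (computing the monotone border B(CM) itself is not shown; the function signature is an assumption):

    def generate_over(prev_level_solutions, items):
        # Extend each solution over the monotone border with one extra item:
        # every candidate still satisfies CM by monotonicity, but only one of
        # its subsets is known to be frequent, so the full Apriori subset
        # check (and thus most anti-monotone pruning) is unavailable.
        return {X | {i} for X in prev_level_solutions for i in items
                if i not in X}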
25
Strategies Analysis (w.r.t. frequency tests performed)
[Diagram: the search space of G&T and MCP partitioned into regions (solutions; not prunable; ideal; AM prunable; M prunable; AM and M prunable) around the negative border B-(Cfreq).]
Neither of the two strategies outperforms the other on every input problem.
26
Adaptive Constraint Pushing
27
Adaptive Constraint Pushing
  • ACP explores a portion of Th(¬CM) looking for infrequent itemsets (the negative border of frequency).
  • Each infrequent itemset found (region 5) helps AM-pruning the search space in Th(CM) (in particular region 3).
  • Each infrequent itemset missed (region 5) induces the exploration of a portion of region 6.
  • Each itemset found frequent (region 4) is just a useless frequency test performed.
  • Two questions:
  • Which candidates form a good portion?
  • How large should this portion be?
  • Good candidates: itemsets which have a higher probability of being found infrequent.
  • This is what the adaptivity of ACP is about.

28
Adaptivity Parameter
  • We introduce a parameter α ∈ [0,1] which represents the fraction of candidates under the monotone border to be chosen among all possible candidates under the monotone border.
  • It is initialized after the first scan of TDB using all the information available, for example:
  • the number of transactions in TDB
  • the total number of 1-itemsets (singleton items)
  • the number of frequent 1-itemsets
  • the number of 1-itemsets satisfying the monotone constraint
  • etc.
  • It is updated level by level with the newly collected knowledge.
  • By updating α, ACP adapts its behaviour to the given input TDB and constraints.
  • Extreme cases:
  • if α = 0 constantly, then ACP ≡ MCP
  • if α = 1 constantly, then ACP ≡ G&T

29
The Algorithm
  • Notation at iteration k (k-itemsets):
  • Pk: set of itemsets whose proper subsets do not satisfy CM and have not been found infrequent
  • Bk: subset of Pk containing the itemsets which satisfy CM (positive monotone border)
  • Ek = Pk \ Bk
  • CkO: candidates over the monotone border
  • CkU: candidates under the monotone border
  • Rk: solutions
  • Nk: itemsets under the monotone border found infrequent
  • N: union of all Nk

30
The Algorithm
[Dataflow diagram: generate_over maps Rk-1 to CkO, whose frequent elements enter Rk; Pk is split by the test "satisfies CM?" into Bk and Ek; the α-selected portion of Ek forms CkU, whose infrequent elements feed N; generate_apriori then produces Pk+1. A rough code skeleton follows below.]
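A rough Python skeleton of one ACP iteration under my reading of the dataflow above; every helper (sat_cm, is_frequent, alpha_select, generate_over, generate_apriori) is assumed, and details differ from the published algorithm:

    def acp_iteration(P_k, R_prev, alpha, items, sat_cm, is_frequent,
                      alpha_select, generate_over, generate_apriori):
        B_k = {X for X in P_k if sat_cm(X)}     # positive monotone border
        E_k = P_k - B_k                         # still under the border
        C_over = generate_over(R_prev, items) | B_k
        R_k = {X for X in C_over if is_frequent(X)}       # level-k solutions
        C_under = alpha_select(E_k, alpha)      # alpha-portion of E_k
        N_k = {X for X in C_under if not is_frequent(X)}  # feeds AM pruning
        P_next = generate_apriori(E_k, N_k)     # next level, pruned via N
        return R_k, N_k, P_next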
31
The Algorithm
32
α selection and adaptivity
  • How are candidates selected by α?
  • Among all itemsets in Ek, the α-portion with the lowest estimated support is selected to enter CkU.
  • Support is estimated using the real support values of the singleton items belonging to the itemset, balancing between complete independence (product of the values) and maximal correlation (minimum value).
  • How does α adapt itself?
  • According to the performance of the α-selection at the previous iteration:
  • α-focus = |Nk| / |CkU|
  • An α-focus very close to 1 → very good selection → probably α is selecting too few candidates → it risks losing some infrequent itemsets → the α value is raised accordingly.
  • A low α-focus → poor selection → α is selecting too many candidates → the α value is reduced accordingly (see the sketch below).

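A possible implementation of this update rule; the thresholds and the step size are invented for illustration:

    def update_alpha(alpha, n_k, c_under_k, lo=0.3, hi=0.9, step=0.1):
        # alpha-focus = |Nk| / |CkU|: how well the previous selection
        # targeted infrequent itemsets.
        if not c_under_k:
            return alpha
        focus = len(n_k) / len(c_under_k)
        if focus > hi:                       # very good selection: probably
            return min(1.0, alpha + step)    # too few candidates, raise alpha
        if focus < lo:                       # poor selection: too many
            return max(0.0, alpha - step)    # candidates, reduce alpha
        return alpha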
33
Experimental Results
34
ExAnte (a preprocessing algorithm)
35
AM vs. M
  • State of the art before ExAnte: when dealing with a conjunction of AM and M constraints, we face a tradeoff between AM and M pruning.
  • Tradeoff: pushing M constraints into the computation can help prune the search space, but at the same time it can reduce the AM pruning opportunities.
  • Our observation: this is true only if we focus exclusively on the search space of itemsets. Reasoning on the search space and the input TDB together, we can find the real synergy of AM and M pruning.
  • The real synergy: do not exploit M constraints directly to prune the search space, but use them to reduce the data, which in turn induces a much stronger pruning of the search space.
  • The real synergy of AM and M pruning lies in data reduction.

36
ExAnte μ-reduction
  • Definition (μ-reduction):
  • Given a transaction database TDB and a monotone constraint CM, we define the μ-reduction of TDB as the dataset resulting from pruning the transactions that do not satisfy CM (a sketch follows below).
  • Example: CM ≡ sum(X.price) ≥ 55

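A minimal sketch of μ-reduction, assuming transactions are sets and c_m is the monotone constraint as a predicate:

    def mu_reduce(tdb, c_m):
        # A transaction t violating CM cannot contain any solution itemset:
        # by monotonicity, every subset of t violates CM too. Drop t.
        return [t for t in tdb if c_m(t)]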
37
ExAnte Property
38
ExAnte α-reduction
  • α-reducing a transaction means deleting from the transaction the infrequent singleton items (or, more generally, the singleton items which do not satisfy a given anti-monotone constraint).
  • α-reducing a transaction database means α-reducing all the transactions in the database (a sketch follows below).
  • α-reducing a database is correct, i.e. it does not change the support of solution itemsets (trivial by anti-monotonicity).

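A minimal sketch of α-reduction under the same assumptions:

    from collections import Counter

    def alpha_reduce(tdb, min_sup):
        # Count singleton supports, then strip infrequent items from every
        # transaction: they cannot occur in any frequent solution itemset.
        counts = Counter(i for t in tdb for i in t)
        keep = {i for i, c in counts.items() if c >= min_sup}
        return [frozenset(t) & keep for t in tdb]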
39
A Fix-Point Computation
[Cycle diagram: pruning the transactions of TDB that violate CM leaves fewer frequent 1-itemsets; α-reduction then shortens the transactions, so fewer transactions satisfy CM, and so on, until a fix point is reached. A sketch of the loop follows below.]
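The loop itself, reusing the mu_reduce and alpha_reduce sketches above (again a sketch, not the paper's pseudocode):

    def exante(tdb, c_m, min_sup):
        tdb = [frozenset(t) for t in tdb]
        while True:
            reduced = alpha_reduce(mu_reduce(tdb, c_m), min_sup)
            if reduced == tdb:    # nothing changed: fix point reached
                return reduced
            tdb = reduced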
40
ExAnte Preprocessing Example
  • min_sup = 4
  • CM ≡ sum(X.price) ≥ 45
[Worked example: a small transaction table is repeatedly μ- and α-reduced; the per-transaction price totals and the crossed-out rows from the original slide are not recoverable here.]
41
Experimental Results: Data Reduction
42
Experimental Results: Items Reduction
43
Experimental Results: Search Space Reduction
44
Experimental Results: Runtime Comparison
45
Further Exploiting the ExAnte Property
  • ExAMiner:
  • ExAnte Miner (in contrast with the ExAnte preprocessor)
  • a miner which Exploits Anti-monotone and Monotone constraints together
  • Basic idea: generalize the ExAnte data reduction to all levels of a level-wise, Apriori-like computation.
  • Performs better on sparse datasets.
  • Breadth-first.
  • FP-bonsai: the art of growing and pruning small FP-trees
  • Basic idea: embed the ExAnte data reduction in the FP-growth computation.
  • Performs very well on both dense and sparse datasets.
  • Depth-first.
  • The ExAnte property works even in other pattern domains (sequences, trees, graphs).

46
P3D Project - http://www-kdd.isti.cnr.it/p3d/
An ISTI-C.N.R. internal curiosity-driven project:
Pisa KDD Laboratory
High Performance Computing Laboratory (DCI people: Salvatore Orlando, Raffaele Perego and others)
47
Activities
  • Patternist: devising a knowledge discovery support environment focused on frequent pattern discovery, which offers the repertoire of algorithms studied and implemented by the researchers participating in the project in the last few years.
  • PDQL: devising a highly expressive query language for frequent pattern discovery.
  • PPDM: devising privacy-preserving methods for frequent pattern discovery from sources that typically contain personal sensitive data.
  • Applications: devising some benchmarking test beds in the domain of biological data, developed within the above environment.
  • Other kinds of patterns: closed frequent itemsets, sequential patterns, graph-based frequent patterns, ...
48
What about constraint-based frequent itemset mining @ FIMI'04?