Parameter Tuning for Differential Mining of String Patterns

1
Parameter Tuning for Differential Mining of String Patterns

J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut

2
Tuning extraction parameters

Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings, ...
... under constraints (monotonic, anti-monotonic or neither; pattern shapes; occurrence properties; measures)
Constraints can select/focus ...
Where to look in the parameter space?
Often easy when there is a single threshold
But what about multiple constraints / multiple thresholds?

3
Two different kinds of tuning

1) Exploratory stage: find promising areas in the parameter space
2) Fine-grained tuning: a kind of greedy strategy, by small local explorations of the parameter space

4
Tools ?

The best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining?

5
Tools

GREP and word count
Method: a manual mix of
counting the extracted patterns
choosing points in the parameter space
random walk
trying a local greedy strategy
keeping in mind known properties of the constraints (when applicable) and domain knowledge

6
Tools

With several parameters or several thresholds, e.g., a minimal support on one dataset and a maximal support on another dataset:
perform a more exhaustive exploration of the pattern space
draw curves depicting the extraction landscape
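As a rough illustration of what such a landscape contains, the sketch below counts, by brute force, the substrings kept for every pair of thresholds. It is not the authors' tool. It assumes exact matching, takes the support of a pattern to be the number of strings that contain it, and assumes every minimal support is at least 1; the function names and the toy strings are hypothetical.

```python
from itertools import product

def supports(strings, size):
    """Map each substring of the given size to the number of strings containing it."""
    counts = {}
    for s in strings:
        # A set, so that a pattern is counted once per string.
        seen = {s[i:i + size] for i in range(len(s) - size + 1)}
        for p in seen:
            counts[p] = counts.get(p, 0) + 1
    return counts

def landscape(r1, r2, size, min_values, max_values):
    """Count the extracted patterns for every (min, max) threshold pair.

    Assumes min >= 1, so only patterns occurring in r1 need to be checked.
    """
    sup1 = supports(r1, size)
    sup2 = supports(r2, size)
    grid = {}
    for lo, hi in product(min_values, max_values):
        grid[(lo, hi)] = sum(
            1 for p, c in sup1.items()
            if c >= lo and sup2.get(p, 0) <= hi
        )
    return grid

# Toy data, for illustration only.
r1 = ["ACGTAC", "ACGGAC", "TTACGT"]
r2 = ["GGACGG", "TTTTTT"]
grid = landscape(r1, r2, size=3, min_values=[1, 2, 3], max_values=[0, 1])
for (lo, hi), n in sorted(grid.items()):
    print(f"min={lo} max={hi} -> {n} patterns")
```

Plotting the values of `grid` against the two thresholds gives one curve or surface of the landscape. On real data this exhaustive count is exactly the costly step that the rest of the talk tries to avoid.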

7
Tools / landscape

Examples

8
Obtaining extraction landscapes

Use a script: can need a lot of resources to execute, and too much time to explore a large parameter space (several parameters)
Use a global model of the presence of the local patterns to estimate the number of patterns
Reuse/adapt a model: not many exist
Develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
Incorporate knowledge (K) of the domain? A global analytical model is then even more complex to exhibit

9
What about sampling the pattern space ?

Sounds too naive, or needing complicated frameworks:
How to sample?
Size of the sample?
Number of patterns in the sample that satisfy the constraints?
Using domain knowledge?
How to estimate the value for the whole pattern space?

10
What about simple choices ?

Sampling: with replacement, among the patterns that satisfy the syntactic constraints (conjunction of constraints)
Number of patterns in the sample that satisfy the constraints: compute, for each pattern in the sample, the probability that it satisfies the constraints (incorporating K of the domain); this gives the approximate number of patterns that satisfy the constraints (in the sample)
Sample size: grow the sample up to convergence of the percentage of patterns satisfying the constraints
Estimate of the number of patterns in the pattern space that satisfy the constraints: percentage × number of patterns that satisfy the syntactic constraints

11
Whole process

1) build an initial sample of Psynt
2) compute an estimate of E(N) from the sample
3) add more patterns to the sample
4) compute an estimate of E(N) from the sample
5) if the estimate changes a lot, go to 3)
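The five steps above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the only syntactic constraint is the pattern size, so Psynt is every string of that size over the alphabet; `prob_sat` is a hypothetical user-supplied function returning the probability that a pattern satisfies the remaining constraints; "changes a lot" is read as a relative change above `tol`.

```python
import random

def estimate_pattern_count(alphabet, size, prob_sat, batch=1000,
                           tol=0.01, max_batches=100, seed=0):
    """Estimate E(N), the expected number of patterns satisfying the constraints."""
    rng = random.Random(seed)
    space_size = len(alphabet) ** size  # |Psynt| when the size is the only syntactic constraint
    total_prob = 0.0
    n = 0
    previous = None
    estimate = 0.0
    for _ in range(max_batches):
        # Steps 1 and 3: draw (more) patterns uniformly, with replacement.
        for _ in range(batch):
            p = "".join(rng.choice(alphabet) for _ in range(size))
            total_prob += prob_sat(p)
            n += 1
        # Steps 2 and 4: mean probability in the sample, scaled to the whole space.
        estimate = space_size * total_prob / n
        # Step 5: stop once the estimate no longer changes much.
        if previous is not None and abs(estimate - previous) <= tol * max(previous, 1e-12):
            return estimate
        previous = estimate
    return estimate

# Toy check: if every pattern satisfies the constraints with probability 0.25,
# the expected count is 0.25 * 4**3 = 16.
print(estimate_pattern_count("ACGT", 3, lambda p: 0.25))
```

The batch size, tolerance and cap on the number of batches are arbitrary choices made here for the sketch; the slides only state that the sample grows until the estimate stabilises.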

12
Using it in freq. substring mining

Two datasets R1 and R2 (two sets of strings)
Constraints
having size Z
appearing at least min times in R1
appearing no more than max times in R2
Consider exact and approximate matching
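To make the three constraints concrete, here is a small checker for a single pattern. It is only a sketch: the slides do not say which notion of approximate matching is used, so a bound on the Hamming distance is assumed here, and the support of a pattern is assumed to be the number of strings in which it occurs at least once. All function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(pattern, s, max_mismatches=0):
    """True if some window of s matches the pattern within the mismatch bound."""
    k = len(pattern)
    return any(hamming(pattern, s[i:i + k]) <= max_mismatches
               for i in range(len(s) - k + 1))

def support(pattern, strings, max_mismatches=0):
    """Number of strings containing at least one (approximate) occurrence."""
    return sum(occurs(pattern, s, max_mismatches) for s in strings)

def satisfies(pattern, r1, r2, size, min_sup, max_sup, max_mismatches=0):
    """Size Z, at least min_sup strings of R1, no more than max_sup strings of R2."""
    return (len(pattern) == size
            and support(pattern, r1, max_mismatches) >= min_sup
            and support(pattern, r2, max_mismatches) <= max_sup)

# Toy data, for illustration only.
r1 = ["ACGTAC", "ACGGAC", "TTACGT"]
r2 = ["GGACGG", "TTTTTT"]
print(satisfies("CGT", r1, r2, size=3, min_sup=2, max_sup=0))
print(satisfies("CGT", r1, r2, size=3, min_sup=2, max_sup=0, max_mismatches=1))
```

With exact matching the pattern never occurs in the second dataset; allowing one mismatch lets it match a window of the first string of `r2`, so the maximal-support constraint is then violated.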

13
Pattern space and K of domain

Strings over an alphabet of 4 or 8 symbols
Knowledge (K) of the domain as three models of symbol distribution:
Me - independent symbols with equal frequencies
Md - independent symbols with different frequencies
Mm - first-order Markov model
For a given pattern p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
From the binomial distribution, we have the probability that p satisfies the min and max support constraints
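The binomial step can be sketched as below for the Md model. Two simplifications are assumptions of this sketch and not taken from the slides: the probability of at least one occurrence in a string is approximated by treating all start positions as independent (which ignores overlaps between occurrences), and R1 and R2 are treated as independent draws from the same model, so the two support probabilities are multiplied. Function names are hypothetical.

```python
from math import comb

def binom_at_least(n, q, k):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(max(k, 0), n + 1))

def binom_at_most(n, q, k):
    """P(X <= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(0, min(k, n) + 1))

def position_prob_md(pattern, freqs):
    """Probability that the pattern occurs at one fixed position under Md."""
    prob = 1.0
    for symbol in pattern:
        prob *= freqs[symbol]
    return prob

def in_string_prob_approx(pattern, freqs, length):
    """Crude approximation of P(at least one occurrence in a string of this length)."""
    positions = length - len(pattern) + 1
    if positions <= 0:
        return 0.0
    return 1.0 - (1.0 - position_prob_md(pattern, freqs)) ** positions

def prob_sat(pattern, freqs, n1, len1, n2, len2, min_sup, max_sup):
    """P(support in R1 >= min_sup and support in R2 <= max_sup), R1 and R2 assumed independent."""
    q1 = in_string_prob_approx(pattern, freqs, len1)
    q2 = in_string_prob_approx(pattern, freqs, len2)
    return binom_at_least(n1, q1, min_sup) * binom_at_most(n2, q2, max_sup)

# Hypothetical symbol names for a 4-symbol Md model.
freqs = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(prob_sat("ACGTACGT", freqs, n1=100, len1=1000, n2=100, len2=1000,
               min_sup=10, max_sup=5))
```

A function such as `prob_sat` is what a sampling procedure would evaluate on each sampled pattern; an exact occurrence probability, or the Markov model Mm, would replace `in_string_prob_approx` without changing the binomial part.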

14
Example / random data

4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match
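Random data of this shape could be generated as in the sketch below. The slide gives only the four frequencies, so the symbol names A, C, G, T, the seeds, and the reading "100 strings in each dataset" are assumptions made for illustration.

```python
import random

def random_strings_md(n_strings, length, freqs, seed=None):
    """Draw strings whose symbols are independent with the given frequencies (model Md)."""
    rng = random.Random(seed)
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    return ["".join(rng.choices(symbols, weights=weights, k=length))
            for _ in range(n_strings)]

# Hypothetical symbol names; the frequencies are those on the slide.
freqs = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
r1 = random_strings_md(100, 1000, freqs, seed=1)
r2 = random_strings_md(100, 1000, freqs, seed=2)
print(len(r1), len(r1[0]))
```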

15
Example / random data

4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate match

16
Example / gene promoter seq.

4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match

17
Example / gene promoter seq.

Estimate vs. extraction

18
Conclusion

Drawing extraction landscapes for parameter tuning, in local pattern extraction, using pattern space sampling
seems possible
at least in some cases
using a simple framework
incorporating K of the domain (to some extent; there are many works on the probability that a given pattern satisfies constraints)
simpler than building a global analytical model
faster than running real extractions
Sufficient in the exploratory stage?
Companion software?

19
Example / random data