Mining Biological Data - PowerPoint PPT Presentation

1
Mining Biological Data
  • Jiong Yang, Ph. D.
  • Visiting Assistant Professor
  • UIUC
  • jioyang_at_cs.uiuc.edu

2
Data is Everywhere
3
Data Mining is a Powerful Tool
[Figure: Data → Data Mining → Knowledge]
  • Computational Biology
  • E-Commerce
  • Intrusion Detection
  • Multimedia Processing
  • Unstructured Data
  • . . .

4
Biological Data
  • Bioinformatics has become one of the most
    important application areas of data mining.
  • DNA sequences
  • Protein sequences
  • Protein folding
  • Microarray data

5
Outline
  • Approximate sequential pattern mining
  • Coherent cluster: clustering by pattern
    similarity in a large data set

6
Frequent Patterns
  • Model
  • A set of sequences of symbols, for example:
  • a1,a2,a4
  • a2,a3,a5
  • a1,a4,a5,a6,a7
  • If a pattern occurs more than a certain number of
    times, then this pattern is considered important.
  • For example, a1,a4
  • Widely studied
  • Frequent itemset mining: Agrawal and Srikant
    (IBM Almaden)
  • FP-growth: Han (UIUC)
  • Stream data: Motwani (Stanford)
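
The counting step in this model can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names are made up, and it assumes that a pattern occurs in a sequence when its symbols appear in order with gaps allowed, which the slide does not state explicitly.

```python
def is_subsequence(pattern, sequence):
    """True if the symbols of `pattern` appear in `sequence` in order.

    Assumption: gaps between matched symbols are allowed.
    """
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the first hit,
    # so later symbols can only match later positions.
    return all(symbol in remaining for symbol in pattern)


def support(pattern, sequences):
    """Number of sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in sequences)


# The three sequences from the slide.
sequences = [
    ["a1", "a2", "a4"],
    ["a2", "a3", "a5"],
    ["a1", "a4", "a5", "a6", "a7"],
]
print(support(["a1", "a4"], sequences))  # 2
```

Under this assumption, a1,a4 is supported by the first and third sequences.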

7
Apriori Property
  • Widely used in the data mining field
  • It holds for the support metric
  • All patterns form a lattice.
  • (a, b, d) is a super-pattern of (a, d) and
    a sub-pattern of (a, b, c, d).
  • The support metric defines a partial order on the
    lattice.
  • Support(a, b, d) ≤ min{Support(b, d),
    Support(a, d), Support(a, b)}
  • A level-wise search algorithm can be used
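
A level-wise search that exploits the Apriori property might look like the sketch below. To keep it short it treats patterns as unordered itemsets rather than sequences, which is a simplification of the slide's setting. The function names and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)


def level_wise(transactions, min_support):
    """Return {frequent itemset: support}, growing patterns one item at a time."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern, transactions)
        size = len(level[0]) + 1
        current = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: a candidate survives only if all of its
        # sub-patterns one item shorter are already known to be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in current
                        for sub in combinations(c, size - 1))
                 and support(c, transactions) >= min_support]
    return frequent


# Hypothetical toy data.
transactions = [{"a", "b", "d"}, {"a", "d"}, {"a", "b", "c", "d"}, {"b", "c"}]
frequent = level_wise(transactions, min_support=2)
print(frequent[frozenset("abd")])  # 2
```

In this toy data the candidate (a, b, c) is discarded without counting its support, because its sub-pattern (a, c) is infrequent.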

8
Shortcomings
  • Requires exact matches and fails to recognize
    possible substitutions among symbols
  • A protein may mutate without changing its
    function.
  • A sensor may make some mistakes.
  • Different web pages may have similar contents.
  • A word may have many synonyms.
  • How can symbol substitution be modeled?

9
Compatibility Matrix
Compatibility matrix of 5 symbols
10
Compatibility Matrix
  • The compatibility matrix serves as a bridge
    between the observation and the underlying
    substance.
  • Each observed symbol is interpreted as an
    occurrence of a set of symbols with various
    probabilities.
  • An observed symbol combination is treated as an
    occurrence of a set of patterns with various
    degrees.
  • Obtain the compatibility matrix through
  • empirical study
  • domain expert

11
Match
  • A new metric, match, is then proposed to quantify
    the importance of a pattern.
  • The match of a pattern P in a subsequence s (with
    the same length) is defined as the conditional
    probability Prob(P | s).
  • The match of a pattern P in a sequence S is
    defined as the maximal match of P over all
    distinct subsequences of S.
  • A dynamic programming technique is used to
    compute the match of P in a sequence S

12
Match
  • M(d1d2...di, s1s2...sj) is the maximum of
    M(d1d2...di, s1s2...sj-1) and
    M(d1d2...di-1, s1s2...sj-1) × C(di, sj)
  • The match of a pattern P in a set of sequences is
    defined as the sum of the matches of P in each
    sequence.
  • A pattern is called a frequent pattern if its
    match exceeds a user-specified threshold
    min_match.
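
The recurrence above can be sketched with a one-row dynamic-programming table. This is a minimal illustration, not the authors' implementation. The function names, the nested-dict layout `C[true_symbol][observed_symbol]`, the base cases (an empty pattern has match 1, a non-empty pattern in an empty sequence has match 0), and the two-symbol matrix are all assumptions made for the example.

```python
def match(pattern, sequence, C):
    """Match of `pattern` in `sequence` under compatibility matrix C.

    Assumption: C[d][s] is the probability that the true symbol is d
    given that s was observed.
    """
    # row[i] holds M(pattern[:i], prefix of the sequence seen so far).
    row = [1.0] + [0.0] * len(pattern)
    for s in sequence:
        # Go right to left so row[i - 1] still belongs to the previous prefix.
        for i in range(len(pattern), 0, -1):
            row[i] = max(row[i], row[i - 1] * C[pattern[i - 1]].get(s, 0.0))
    return row[-1]


def total_match(pattern, sequences, C):
    """Match in a set of sequences: the sum of the per-sequence matches."""
    return sum(match(pattern, s, C) for s in sequences)


# Hypothetical 2-symbol compatibility matrix (each observed column sums to 1).
C = {
    "d1": {"d1": 0.9, "d2": 0.1},
    "d2": {"d1": 0.1, "d2": 0.9},
}
print(match(["d1", "d2"], ["d1", "d1", "d2"], C))
print(total_match(["d1", "d2"], [["d1", "d2"], ["d2", "d1"]], C))
```

Only one row is kept because each entry depends on the previous sequence prefix alone, so memory grows with the pattern length rather than the sequence length.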

[Figure: dynamic-programming table for the match of a pattern p in a sequence S]
13
Challenges
  • Previous work focuses on short patterns.
  • Long patterns require a large number of scans
    through the input sequence.
  • Expensive I/O cost
  • Performance vs. Accuracy
  • Probabilistic Approach

14
Chernoff Bound
  • Let X be a random variable whose range is R.
    Suppose that we have n independent observations
    of X and the observed mean is μ. The Chernoff
    bound states that, with probability (1 - δ), the
    true mean of X is at least μ - ε, where
    ε = sqrt(R^2 ln(1/δ) / (2n)).
  • With probability (1 - δ), the true mean of X is
    at most μ + ε.
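
The error margin can be computed as in the sketch below. It writes the confidence as 1 - δ and the margin as ε, and it assumes the additive (Hoeffding-style) form ε = sqrt(R^2 ln(1/δ) / (2n)), which is how this bound is commonly stated in the sampling literature. The function name and the sample numbers are illustrative.

```python
import math


def chernoff_epsilon(value_range, n, delta):
    """Margin ε such that the true mean lies within ε of the observed
    mean on each side with probability 1 - delta.

    Assumption: ε = sqrt(R^2 * ln(1/δ) / (2n)).
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# The margin shrinks as the sample grows (R = 1, 99% confidence).
for n in (100, 1000, 10000):
    print(n, chernoff_epsilon(1.0, n, 0.01))
```

Because ε falls only with the square root of n, halving the margin takes four times as many sampled sequences.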

15
Approach
  • Three-stage approach to mine patterns with length
    l
  • Find the match of individual symbols and take a
    sample set of sequences
  • Pattern Discovery on Samples
  • Ambiguous Patterns Determination
  • Pattern Discovery on Samples
  • Sample size depending on memory size
  • Based on the samples, three types of patterns are
    determined.

16
Approach
  • Frequent pattern if match is greater than
    (min_match + ε)
  • Ambiguous pattern if match is between (min_match
    - ε) and (min_match + ε).
  • Infrequent pattern otherwise
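
The three-way labeling can be written as a small function. This is a sketch under assumptions: the margin is called `epsilon`, the match measured on the sample is taken to be on the same scale as `min_match`, and values exactly on a boundary are treated as ambiguous, a choice the slide leaves open.

```python
def classify(sample_match, min_match, epsilon):
    """Label a pattern from its match on the sample."""
    if sample_match > min_match + epsilon:
        return "frequent"
    if sample_match >= min_match - epsilon:
        # Too close to the threshold to decide from the sample alone.
        return "ambiguous"
    return "infrequent"


# Hypothetical values: threshold 0.30, margin 0.05.
for m in (0.40, 0.32, 0.20):
    print(m, classify(m, min_match=0.30, epsilon=0.05))
```

Only the ambiguous patterns need to be checked against the full data set.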

17
Ambiguous Patterns
  • Ambiguous Patterns
  • Too many
  • Border collapse
  • We have the negative and positive borders of
    significant patterns.
  • Our goal is to collapse the border as fast as
    possible.

18
Ambiguous Patterns
(d1,d2,d3,d4,d5)
(d1,d2,d3,d4)
(d1,d2,d3,d5)
(d1,d2,d4,d5)
(d1,d3,d4,d5)
(d1,d2,d3)
(d1,d2,d4)
(d1,d2,d5)
(d1,d3,d4)
(d1,d3,d5)
(d1,d4,d5)
(d1,d2)
(d1,d3)
(d1,d4)
(d1,d5)
(d1)
19
Ambiguous Patterns
20
Effects of 1 - δ
[Figure: two panels, without border collapse and with border collapse]
21
Approximate Pattern Mining
  • Reference
  • Mining long sequential patterns in a noisy
    environment, Proceedings of the ACM SIGMOD
    International Conference on Management of Data
    (SIGMOD), pp. 406-417, 2002.
  • Other Work
  • Periodic Patterns (KDD2000, ICDM2001)
  • Statistically significant Patterns (KDD2001, ICDM
    2002)

22
Outline
  • Approximate sequential pattern mining
  • Coherent cluster: clustering by pattern
    similarity in a large data set

23
Coherent Cluster
  • In many applications, data can be of very high
    dimensionality.
  • Gene expression data
  • Dozens to hundreds of conditions/samples
  • Customer evaluation
  • Thousands or more merchants
  • Objective: discover peer groups

[Figure: data matrix with objects o1 ... oi as rows and attributes a1 ... aj as columns; entry dij is the value of object oi on attribute aj]
24
17 conditions
40 genes
25
Coherent Cluster
26
40 genes
27
Coherent Cluster
Co-regulated genes
28
Coherent Cluster
  • Observations
  • If mapped to points in high dimensional space,
    they may not be close to each other.
  • Bias exists universally.
  • Only a subset of objects and a subset of
    attributes may participate.
  • Need to accommodate some degree of noise.
  • Solution: subspace cluster, bicluster, coherent
    cluster

29
Subspace cluster
  • CLIQUE: Agrawal et al. (IBM Almaden)
  • Find a subset of dimensions and a subset of
    objects such that the objects are close to each
    other on the subset of dimensions.
  • The clusters may overlap
  • PROCLUS: Aggarwal et al. (IBM T. J. Watson)
  • Does not allow overlap

30
Bicluster
  • Developed in 2000 by Cheng and Church
  • Uses the mean squared residue
  • After discovering one cluster, replace the
    cluster with random data and find another
  • Not efficient and not accurate

31
Coherent Cluster
  • Coherent cluster
  • Subspace clustering
  • Measure distance on mutual bias
  • pair-wise disparity
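  • For a 2×2 (sub)matrix consisting of objects x,
    y and attributes a, b, the disparity D is the
    difference between the mutual biases of a and b:
    D = |(dxa - dya) - (dxb - dyb)|

[Figure: objects x and y plotted over attributes a and b, with values dxa, dxb, dya, dyb; the mutual bias of attribute a is dxa - dya and the mutual bias of attribute b is dxb - dyb]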
32
Coherent Cluster
  • A 2×2 (sub)matrix is a δ-coherent cluster if its
    D value is less than or equal to δ.
  • An m×n matrix X is a δ-coherent cluster if every
    2×2 submatrix of X is a δ-coherent cluster.
  • A δ-coherent cluster is a maximum δ-coherent
    cluster if it is not a submatrix of any other
    δ-coherent cluster.
  • Objective: given a data matrix and a threshold δ,
    find all maximum δ-coherent clusters.
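
The definition can be checked directly by brute force, as in the sketch below. It assumes that the D value of a 2×2 submatrix is the absolute difference between the mutual biases of its two attributes, which is consistent with the observation on the later slide about object pairs. The function names are made up, and the sample rows are borrowed from objects o0 to o2 of the Two Way Pruning slide.

```python
from itertools import combinations


def disparity(d_xa, d_xb, d_ya, d_yb):
    """Assumed D value: difference between the mutual biases of a and b."""
    return abs((d_xa - d_ya) - (d_xb - d_yb))


def is_coherent_cluster(matrix, rows, cols, delta):
    """True if every 2x2 submatrix on `rows` x `cols` has D <= delta."""
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            d = disparity(matrix[x][a], matrix[x][b],
                          matrix[y][a], matrix[y][b])
            if d > delta:
                return False
    return True


# Rows are objects, columns are attributes.
data = [
    [1, 4, 2],
    [2, 5, 5],
    [3, 6, 5],
]
print(is_coherent_cluster(data, [0, 2], [0, 1, 2], delta=1))     # True
print(is_coherent_cluster(data, [0, 1, 2], [0, 1, 2], delta=1))  # False
```

This check touches every pair of rows and every pair of columns, so it suits verifying a candidate cluster rather than searching for one.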

33
Coherent Cluster
  • Challenges
  • Finding subspace clusters based on distance
    is already a difficult task due to the
    curse of dimensionality.
  • The (sub)set of objects and the (sub)set of
    attributes that form a cluster are unknown in
    advance and may not be adjacent to each other in
    the data matrix.
  • The actual values of the objects in a coherent
    cluster may be far apart from each other.
  • Each object or attribute in a coherent cluster
    may bear some relative bias (that is unknown in
    advance) and such bias may be local to the
    coherent cluster.

34
Coherent Cluster
  • Compute the maximum coherent attribute sets for
    each pair of objects
  • Compute the maximum coherent object sets for
    each pair of attributes
  • Two-way pruning
  • Construct the lexicographical tree
  • Post-order traverse the tree to find maximum
    coherent clusters
35
Coherent Cluster
  • Observation: given a pair of objects o1, o2 and
    a (sub)set of attributes a1, a2, ..., ak, the 2×k
    submatrix is a δ-coherent cluster iff the mutual
    biases (do1ai - do2ai) of the attributes ai do
    not differ from each other by more than δ.

If δ = 1.5, then a1,a2,a3,a4,a5 is a coherent
attribute set (CAS) of (o1,o2).
36
Coherent Cluster
  • Strategy: find the maximum coherent attribute
    sets for each pair of objects with respect to the
    given threshold δ.

δ = 1
The maximum coherent attribute sets define the
search space for maximum coherent clusters.
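
One way to realize this strategy is to sort the attributes by mutual bias and slide a window over the sorted list, as sketched below. This is an illustration of the idea rather than the authors' code: the function name, the `min_size` parameter, and the use of column indices are assumptions. The input rows are objects o0 and o2 from the Two Way Pruning slide, and the same function is reused on two columns to obtain coherent object sets.

```python
def maximal_coherent_sets(values_x, values_y, delta, min_size=2):
    """Maximal index sets whose mutual biases (x - y) differ by at most delta."""
    biases = sorted((x - y, j) for j, (x, y) in enumerate(zip(values_x, values_y)))
    result = []
    start = 0
    for end in range(len(biases)):
        # Shrink from the left until the window fits within delta.
        while biases[end][0] - biases[start][0] > delta:
            start += 1
        # The window is maximal if it cannot be extended to the right.
        is_last = end == len(biases) - 1
        if is_last or biases[end + 1][0] - biases[start][0] > delta:
            if end - start + 1 >= min_size:
                result.append(sorted(j for _, j in biases[start:end + 1]))
    return result


# Object pair (o0, o2) over attributes a0, a1, a2.
print(maximal_coherent_sets([1, 4, 2], [3, 6, 5], delta=1, min_size=3))
# [[0, 1, 2]]

# Attribute pair (a1, a2) over objects o0 ... o4.
print(maximal_coherent_sets([4, 5, 6, 200, 7], [2, 5, 5, 7, 6], delta=1, min_size=3))
# [[1, 2, 4], [0, 2, 4]]
```

After sorting, each pair costs a single linear pass, and the two results agree with the entries listed for (o0,o2) and (a1,a2) on the Two Way Pruning slide.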
37
Two Way Pruning
δ = 1, nc = 3, nr = 3

      a0    a1    a2
o0     1     4     2
o1     2     5     5
o2     3     6     5
o3     4   200     7
o4   300     7     6

MCAS
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)

MCOS
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
38
Coherent Cluster
  • High expressive power
  • The coherent cluster can capture many interesting
    and meaningful patterns overlooked by previous
    clustering methods.
  • Efficient and highly scalable
  • Wide applications
  • Gene expression analysis
  • Collaborative filtering

[Figure: traditional clustering vs. coherent clustering]
39
Coherent Cluster
  • References
  • Delta-cluster: capturing subspace correlation in
    a large data set, Proceedings of the 18th IEEE
    International Conference on Data Engineering
    (ICDE), pp. 517-528, 2002.
  • Clustering by pattern similarity in large data
    sets, Proceedings of the ACM SIGMOD International
    Conference on Management of Data (SIGMOD), pp.
    394-405, 2002.
  • Enhanced biclustering on expression data,
    Proceedings of the IEEE Symposium on
    Bioinformatics and Bioengineering (BIBE), 2003.
  • Other Work
  • STING (VLDB1997)
  • STING+ (ICDE1999, TKDE 2000)
  • CLUSEQ (CSB2002, ICDE2003)
  • Cluster Streams (ICDE2003)

40
Remarks
  • Similarity measure
  • Powerful in capturing high order statistics and
    dependencies
  • Efficient in computation
  • Robust to noise
  • Clustering algorithm
  • High accuracy
  • High adaptability
  • High scalability
  • High reliability