What Data Mining Methods May Help BioInformatics

1
What Data Mining Methods May Help Bio-Informatics?
  • Jiawei Han
  • Database Systems Research Lab
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign,
    U.S.A.
  • http://www.cs.uiuc.edu/~hanj

2
Bio-informatics and Data Mining
  • Data mining: the search for, or discovery of,
    patterns and knowledge hidden in data
  • Biomedical/DNA data mining
  • Biological data is abundant and information-rich
    (e.g., gene chips, bio-testing data)
  • It is critical to find correlations and linkages
    between diseases and gene sequences, and to
    perform classification, clustering, outlier
    detection, etc.
  • Many challenges remain and new techniques can be
    developed: a field yet to be explored

3
Biomedical Data Mining and DNA Analysis
  • DNA sequences
  • Four basic building blocks (nucleotides): adenine
    (A), cytosine (C), guanine (G), and thymine (T)
  • Gene: a sequence of hundreds of individual
    nucleotides arranged in a particular order
  • Humans have around 30,000 genes
  • There is a tremendous number of ways in which the
    nucleotides can be ordered and sequenced to form
    distinct genes
  • DNA micro-arrays and protein arrays have
    accumulated a tremendous amount of data related
    to patients and diseases

4
What Data Mining Methods May Help
Bio-Informatics?
  • Semantic integration of heterogeneous,
    distributed genome databases
  • Discovery of tandem repeats: BLAST and beyond
  • Similarity search in genome databases
  • Association, correlation, and linkage analysis
  • Fault-tolerant sequential and structured pattern
    mining
  • Advanced classification techniques
  • Cluster analysis and outlier detection
  • Multi-dimensional data mining environments
  • Visual data mining
  • Invisible data mining

5
Semantic Integration of Heterogeneous,
Distributed Genome Databases
  • Current situation: highly distributed,
    uncontrolled generation and use of a wide variety
    of DNA data
  • Semantic integration of different genome
    databases: a critical task
  • It is highly desirable to build Web-based,
    integrated, multi-dimensional genome databases
  • Data cleaning and data integration methods
    developed in data mining/data warehousing will
    help

7
Discovery and Comparison of DNA Sequences
  • Finding tandem repeats
  • Fault-tolerant sequential patterns (Is BLAST
    enough?)
  • Similarity search and comparison among DNA
    sequences
  • Compare the frequently occurring patterns of each
    class (e.g., diseased and healthy)
  • Query-based: identify gene sequence patterns that
    play roles in various diseases
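
As a rough illustration of the first bullet, the sketch below looks for exact tandem repeats (a short unit repeated back to back) with a regular expression. The function name, the parameters `min_unit`, `max_unit`, and `min_copies`, and the sample string are assumptions made for this example. It does not tolerate mismatches, so it is not the fault-tolerant search the slide asks about.

```python
import re

def find_tandem_repeats(dna, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each exact tandem repeat found.

    Hypothetical helper: only exact repeats over the alphabet A, C, G, T.
    """
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One unit of unit_len bases, followed by at least
        # (min_copies - 1) further copies of that same unit.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(dna):
            copies = len(m.group(0)) // unit_len
            hits.append((m.start(), m.group(1), copies))
    return hits

# Made-up sequence with the unit "AC" repeated four times from position 3.
print(find_tandem_repeats("TTGACACACACGGT"))
```

A longer run can also be reported under a multiple of its unit length (for example "ACAC" as well as "AC"), so a real tool would merge such hits.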

9
Similarity Search in Multimedia Data
  • Description-based retrieval systems
  • Build indices and perform object retrieval based
    on image descriptions, such as keywords,
    captions, size, and time of creation
  • Labor-intensive if performed manually
  • Results are typically of poor quality if
    automated
  • Content-based retrieval systems
  • Support retrieval based on the image content,
    such as color histogram, texture, shape, objects,
    and wavelet transforms

10
Approaches Based on Image Signature
  • Color histogram-based signature
  • The signature includes color histograms based on
    color composition of an image regardless of its
    scale or orientation
  • No information about shape, location, or texture
  • Two images with similar color composition may
    contain very different shapes or textures, and
    thus could be completely unrelated in semantics
  • Multifeature composed signature
  • Define different distance functions for color,
    shape, location, and texture, and subsequently
    combine them to derive the overall result.
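
A minimal sketch of a color histogram-based signature, assuming an image is given simply as a list of (r, g, b) pixels with values in 0..255. The bin count and the use of an L1 distance are choices made for this example, not something the slide specifies.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram of an iterable of (r, g, b) pixels."""
    step = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    n = 0
    for r, g, b in pixels:
        # Quantize each channel into a bin index, clamped to the last bin.
        rb = min(int(r / step), bins_per_channel - 1)
        gb = min(int(g / step), bins_per_channel - 1)
        bb = min(int(b / step), bins_per_channel - 1)
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
        n += 1
    return [h / n for h in hist] if n else hist

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

image_a = [(255, 0, 0)] * 3 + [(0, 0, 255)]
image_b = [(0, 0, 255)] + [(255, 0, 0)] * 3      # same colors, different layout
image_c = [(0, 255, 0)] * 4                      # entirely different colors
print(histogram_distance(color_histogram(image_a), color_histogram(image_b)))
print(histogram_distance(color_histogram(image_a), color_histogram(image_c)))
```

Because pixel positions are discarded, `image_a` and `image_b` get the same signature, which is exactly the limitation noted above: shape, location, and texture are lost.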

11
One Signature for the Entire Image?
  • WALRUS [NRS99] by Natsev, Rastogi, and Shim
  • Similar images may contain similar regions, but a
    region in one image could be a translation or
    scaling of a matching region in the other
  • Wavelet-based signature with region-based
    granularity
  • Define regions by clustering signatures of
    windows of varying sizes within the image
  • Signature of a region is the centroid of the
    cluster
  • Similarity is defined in terms of the fraction of
    the area of the two images covered by matching
    pairs of regions from two images

12
Similarity Search in Time-Series Analysis
  • Normal database query finds exact match
  • Similarity search finds data sequences that
    differ only slightly from the given query
    sequence
  • Two categories of similarity queries
  • Whole matching: find a sequence that is similar
    to the query sequence
  • Subsequence matching: find all pairs of similar
    sequences
  • Typical Applications
  • Financial market
  • Market basket data analysis
  • Scientific databases
  • Medical diagnosis

13
Similar time series analysis
14
Similar time series analysis
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups
16
Rule Measures: Support and Confidence
  • Find all the rules X ∧ Y ⇒ Z with minimum
    confidence and support
  • support, s: probability that a transaction
    contains X ∪ Y ∪ Z
  • confidence, c: conditional probability that a
    transaction having X ∪ Y also contains Z

[Venn diagram: customer buys diaper, customer buys beer, customer buys both]
  • Let minimum support be 50% and minimum confidence
    be 50%; we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)
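
The two measures can be computed directly from their definitions. The transaction table that the slide's numbers refer to is not part of this transcript, so the four transactions below are a hypothetical table chosen to be consistent with the quoted values for A ⇒ C and C ⇒ A.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transaction database (an assumption, not taken from the slide).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support(transactions, {"A", "C"}))        # support of A => C and of C => A
print(confidence(transactions, {"A"}, {"C"}))   # confidence of A => C
print(confidence(transactions, {"C"}, {"A"}))   # confidence of C => A
```

Both rules share the same support because support depends only on the combined itemset; confidence differs because the denominators (transactions containing A versus those containing C) differ.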

17
Association Rule Mining: A Road Map
  • Boolean vs. quantitative associations (based on
    the types of values handled)
  • buys(x, SQLServer) ∧ buys(x, DMBook) ⇒
    buys(x, DBMiner) [0.2%, 60%]
  • age(x, 30..39) ∧ income(x, 42..48K) ⇒
    buys(x, PC) [1%, 75%]
  • Single-dimensional vs. multi-dimensional
    associations (see the examples above)
  • Single-level vs. multiple-level analysis
  • What brands of beers are associated with what
    brands of diapers?
  • Various extensions
  • Correlation, causality analysis
  • Association does not necessarily imply
    correlation or causality
  • Max-patterns and closed itemsets
  • Constraints enforced
  • E.g., do small sales (sum < 100) trigger big buys
    (sum > 1,000)?

18
Construct FP-tree from a Transaction DB
TID   Items bought               (Ordered) frequent items
100   f, a, c, d, g, i, m, p     f, c, a, m, p
200   a, b, c, f, l, m, o        f, c, a, b, m
300   b, f, h, j, o              f, b
400   b, c, k, s, p              c, b, p
500   a, f, c, e, l, p, m, n     f, c, a, m, p
min_support = 0.5
  • Steps
  • Scan DB once, find frequent 1-itemset (single
    item pattern)
  • Order frequent items in frequency descending
    order
  • Scan DB again, construct FP-tree
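
The three steps can be sketched as below, using the five transactions from the table. The class and function names are made up for this example, and the header table and node links of a full FP-tree are left out. Ties in frequency are broken alphabetically here, so c is ordered before f (both occur four times); the slide lists f first, and either order gives a valid tree.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    n = len(transactions)
    # Step 1: first scan, count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c / n >= min_support}
    # Step 2: rank frequent items by descending frequency (ties: alphabetical).
    ordered_items = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered_items)}
    # Step 3: second scan, insert each ordered transaction as a path.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, frequent = build_fp_tree(db, min_support=0.5)
print(frequent)
```

Transactions that share a prefix of frequent items share a path in the tree, which is where the compression comes from.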

19
Classification of Constraints
Monotone
Antimonotone
Strongly convertible
Succinct
Convertible anti-monotone
Convertible monotone
Inconvertible
20
Association and Path Analysis in Bio-Medical and
DNA Data Mining
  • Association analysis: identification of
    co-occurring gene sequences
  • Most diseases are not triggered by a single gene
    but by a combination of genes acting together
  • Association analysis may help determine the kinds
    of genes that are likely to co-occur in target
    samples
  • Path analysis: linking genes to different disease
    development stages
  • Different genes may become active at different
    stages of the disease
  • Develop pharmaceutical interventions that target
    the different stages separately
  • Visualization tools and genetic data analysis

22
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
[Figure: a sequence database]
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
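
The subsequence relation and the support count can be sketched as follows. A sequence is modeled as a list of item sets; the helper names are assumptions for this example, and the database holds only the two sequences that are quoted in the text, since the full table is a figure that is not in the transcript.

```python
def seq(*elements):
    """Build a sequence from strings; each string is one element (a set of items)."""
    return [frozenset(e) for e in elements]

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a distinct sequence element, in order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)

database = [seq("ef", "ab", "df", "c", "b"), seq("a", "abc", "ac", "d", "cf")]
print(is_subsequence(seq("a", "bc", "d", "c"), database[1]))
print(support(seq("ab", "c"), database))
```

Note that <(ab)c> requires a and b to occur in the same element, so it is not contained in a sequence such as <abc> where they occur one after the other.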
23
Pair-wise Checking Using S-matrix
[Figures: the sequence database (SDB) and its S-matrix]
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
<aa> happens twice
<(ac)> happens once
<ac> happens 4 times
<ca> happens twice
All length-2 sequential patterns are found in the
S-matrix
24
Constraint-Based Sequential Pattern Mining
  • Constraint-based sequential pattern mining
  • Constraints: user-specified, for focused mining
    of desired patterns
  • How to explore efficient mining with constraints?
    Optimization
  • Classification of constraints
  • Anti-monotone: e.g., value_sum(S) < 150,
    min(S) > 10
  • Monotone: e.g., count(S) > 5,
    S ⊇ {PC, digital_camera}
  • Succinct: e.g., length(S) ≥ 10,
    S ∈ {Pentium, MS/Office, MS/Money}
  • Convertible: e.g., value_avg(S) < 25,
    profit_sum(S) > 160, max(S)/avg(S) < 2,
    median(S) − min(S) > 5
  • Inconvertible: e.g., avg(S) − median(S) = 0

25
From Sequential Patterns to Structured Patterns
  • Sets, sequences, trees and other structures
  • Transaction DB: sets of items
  • {{i1, i2, ..., im}, ...}
  • Seq. DB: sequences of sets
  • {<{i1, i2}, ..., {im, in, ik}>, ...}
  • Sets of sequences
  • {{<i1, i2>, ..., <im, in, ik>}, ...}
  • Sets of trees (each element being a tree)
  • {t1, t2, ..., tn}
  • Applications: mining structured patterns in XML
    documents

27
Classification Methods
  • Decision tree induction
  • Bayesian Classification
  • Classification by Neural Networks
  • Classification by Support Vector Machines (SVM)
  • Classification based on concepts from association
    rule mining
  • Other Classification Methods

28
Output: A Decision Tree for buys_computer
[Figure: decision tree rooted at age? with branches <30, 30..40, and >40; internal nodes student? (yes/no) and credit rating? (fair/excellent); leaves labeled yes or no]
29
Classification in MultiMediaMiner
30
Bayesian Belief Network: An Example
[Figure: Bayesian belief network over Family History, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC        0.8       0.5        0.7        0.1
~LC       0.2       0.5        0.3        0.9

The conditional probability table for the
variable LungCancer shows the conditional
probability for each possible combination of its
parents
31
Multi-Layer Perceptron
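[Figure: multi-layer perceptron. The input vector xi feeds the input nodes, which connect through weights wij to the hidden nodes and then to the output nodes, which produce the output vector]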
32
Linear Classification
  • Binary classification problem
  • The data above the red line belongs to class x
  • The data below the red line belongs to class o
  • Examples: SVM, Perceptron, Winnow, probabilistic
    classifiers
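
Of the examples listed, the perceptron is the simplest to sketch. The code below learns a separating line for two small, made-up point sets standing in for class x (label +1) and class o (label -1); the data, learning rate, and epoch limit are assumptions for this example.

```python
def train_perceptron(points, labels, epochs=1000, lr=1.0):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if predicted != y:
                # Move the line toward the misclassified point.
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
                errors += 1
        if errors == 0:          # every training point is on the right side
            break
    return w, b

def predict(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

class_x = [(1, 4), (2, 5), (0, 3)]   # above the line, label +1
class_o = [(4, 1), (5, 2), (3, 0)]   # below the line, label -1
points = class_x + class_o
labels = [1] * len(class_x) + [-1] * len(class_o)
w, b = train_perceptron(points, labels)
print(w, b)
```

The loop is only guaranteed to stop early when the two classes are linearly separable, as they are here; otherwise it runs for the full number of epochs.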

[Figure: scatter plot of x and o points separated by a red line]
33
SVM: Support Vector Machines
34
Association-Based Classification
  • Several methods for association-based
    classification
  • ARCS: quantitative association mining and
    clustering of association rules (Lent et al. '97)
  • It beats C4.5 in (mainly) scalability and also
    accuracy
  • Associative classification (Liu et al. '98)
  • It mines high-support and high-confidence rules
    of the form cond_set ⇒ y, where y is a
    class label
  • CAEP (Classification by Aggregating Emerging
    Patterns) (Dong et al. '99)
  • Emerging patterns (EPs): the itemsets whose
    support increases significantly from one class to
    another
  • Mine EPs based on minimum support and growth rate
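
To make the last two bullets concrete, the sketch below measures how much an itemset's support grows from one class to another, taking growth rate to be the ratio of the two supports. The gene labels g1 to g3 and both sample sets are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def growth_rate(itemset, from_class, to_class):
    """Support in to_class divided by support in from_class."""
    s_from = support(itemset, from_class)
    s_to = support(itemset, to_class)
    if s_from == 0:
        return float("inf") if s_to > 0 else 0.0
    return s_to / s_from

def emerging_patterns(candidates, from_class, to_class, min_support, min_growth):
    """Candidates that are frequent in to_class and grow by at least min_growth."""
    return [c for c in candidates
            if support(c, to_class) >= min_support
            and growth_rate(c, from_class, to_class) >= min_growth]

# Hypothetical samples: each set lists the genes observed as active.
healthy = [{"g1", "g2"}, {"g1"}, {"g2", "g3"}, {"g3"}]
diseased = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g1", "g2"}, {"g2"}]

print(growth_rate({"g1", "g2"}, healthy, diseased))
print(emerging_patterns([{"g1", "g2"}, {"g3"}], healthy, diseased,
                        min_support=0.5, min_growth=2.0))
```

A classifier in the spirit of CAEP would then aggregate the evidence of all emerging patterns that a new sample contains; that scoring step is not shown here.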

35
The k-Nearest Neighbor Algorithm
  • All instances correspond to points in the n-D
    space.
  • The nearest neighbors are defined in terms of
    Euclidean distance.
  • The target function could be discrete- or real-
    valued.
  • For discrete-valued target functions, k-NN
    returns the most common value among the k
    training examples nearest to xq.
  • Voronoi diagram: the decision surface induced by
    1-NN for a typical set of training examples.
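
A minimal sketch of the discrete-valued case, assuming training examples are (point, label) pairs; the data and the "+" / "-" labels are made up for this example.

```python
import math
from collections import Counter

def knn_classify(training, xq, k=3):
    """Return the most common label among the k training points nearest to xq."""
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
            ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(knn_classify(training, (1.5, 1.5), k=3))
print(knn_classify(training, (5.5, 5.5), k=3))
```

Sorting the whole training set is fine for a toy example; for large data sets a spatial index would normally replace the full sort.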

[Figure: query point xq among labeled training examples]

37
Cluster Analysis and Outlier Detection
  • Partitioning Methods
  • K-means and k-medoids algorithms
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Constraint-Based Clustering
  • Outlier Analysis

38
The K-Means Clustering Method
  • Example (K = 2)
  • Arbitrarily choose K objects as the initial
    cluster centers
  • Assign each object to the most similar center
  • Update the cluster means
  • Reassign the objects and update the cluster means
    again
[Figure: scatter plots on 0 to 10 axes showing each step]
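
The loop described on this slide can be sketched as follows. Taking the first K points as the "arbitrary" initial centers and stopping when the assignment no longer changes are choices made for this example, as are the sample points.

```python
import math

def k_means(points, k, max_iter=100):
    """Alternate between assigning points to centers and updating the means."""
    centers = [tuple(p) for p in points[:k]]      # arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:          # nothing was reassigned
            break
        assignment = new_assignment
        # Update the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
print(assignment)
print(centers)
```

With this data the first assignment is poor because both initial centers lie in the same group, and the reassignment step then corrects it.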
39
Typical k-medoids algorithm (PAM)
  • K = 2
  • Arbitrarily choose k objects as the initial
    medoids
  • Assign each remaining object to the nearest
    medoid
  • Randomly select a non-medoid object, O_random
  • Compute the total cost of swapping
  • Swap O and O_random if the quality is improved
  • Loop until no change
[Figure: scatter plots on 0 to 10 axes; panels labeled Total Cost = 20 and Total Cost = 26]
40
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

41
CF Tree
[Figure: a CF tree with B = 7 and L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, CF3, ...) with pointers to child nodes; leaf nodes hold CF entries and are chained together by prev and next pointers]
42
CURE (Clustering Using REpresentatives )
  • CURE: proposed by Guha, Rastogi & Shim, 1998
  • Stops the creation of a cluster hierarchy if a
    level consists of k clusters
  • Uses multiple representative points to evaluate
    the distance between clusters, adjusts well to
    arbitrarily shaped clusters, and avoids the
    single-link effect

43
Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partition → Final Clusters]
44
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
  • Relies on a density-based notion of cluster: a
    cluster is defined as a maximal set of
    density-connected points
  • Discovers clusters of arbitrary shape in spatial
    databases with noise
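
A compact sketch of the idea follows. The parameters `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size for a core point) come from the usual formulation of DBSCAN rather than from this slide, and the brute-force neighbor search and sample points are simplifications for this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Clusters grow by following chains of core points, which is why the result can have an arbitrary shape, and isolated points such as (50, 50) are reported as noise rather than forced into a cluster.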

45
[Figure: reachability plot; vertical axis: reachability-distance (may be undefined), horizontal axis: cluster-order of the objects]
46
Density-Based Cluster Analysis: OPTICS & Its
Applications
47
Clustering and Distribution Density Functions
Density Attractor
48
Center-Defined and Arbitrary Shaped
49
[Figure: grid of salary (in units of 10,000; ticks 0 to 7) versus age (ticks 20 to 60), annotated with a threshold of 3]
50
STING: A Statistical Information Grid Approach
  • Wang, Yang and Muntz (VLDB'97)
  • Each cell stores statistical distribution of
    measure at low level
  • Multi-level resolution

51
WaveCluster
  • G. Sheikholeslami, et al. (1998) Multiple wavelet
    transformation-based cluster analysis

52
Constraint-Based Clustering: Planning ATM
Locations
[Figure: spatial data with obstacles (a river with a bridge, and a mountain) and clusters C1 to C4 obtained by clustering without taking obstacles into consideration]
53
Clustering with Spatial Obstacles
Taking obstacles into account
Not Taking obstacles into account
55
Multidimensional Data and Data Cubes
  • Sales volume as a function of product, month, and
    region

Dimensions: Product, Location, Time
Hierarchical summarization paths:
  • Product: Industry > Category > Product
  • Location: Region > Country > City > Office
  • Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Product, Region, and Month]
56
Mining Multimedia Databases in
MultiMediaMiner
57
Mining and Explorative Analysis of Data Cubes
(and Multi-Dimensional Databases)
  • Efficient computation of data or iceberg cubes
  • Discovery-driven data cube analysis
  • Cube-gradient analysis
  • What are the changes of the average house value
    in Silicon Valley in 2001 compared with 2000?
  • Under what conditions did the average house value
    increase 10% per year in the Chicago area in the
    1990s?

59
Visual Data Mining Data Visualization
  • Integration of visualization and data mining
  • data visualization
  • data mining result visualization
  • data mining process visualization
  • interactive visual data mining
  • Data visualization
  • Data in a database or data warehouse can be
    viewed
  • at different levels of abstraction
  • as different combinations of attributes or
    dimensions
  • Data can be presented in various visual forms

60
Data Mining Result Visualization
  • Presentation of the results or knowledge obtained
    from data mining in visual forms
  • Examples
  • Scatter plots and boxplots (obtained from
    descriptive data mining)
  • Decision trees
  • Association rules
  • Clusters
  • Outliers
  • Generalized rules

61
Boxplots from Statsoft: Multiple Variable
Combinations
62
Visualization of Data Mining Results in SAS
Enterprise Miner: Scatter Plots

63
Visualization of Association Rules in SGI/MineSet
3.0
64
Visualization of a Decision Tree in SGI/MineSet
3.0
65
Visualization of Cluster Grouping in IBM
Intelligent Miner
66
Data Mining Process Visualization
  • Presentation of the various processes of data
    mining in visual forms so that users can see
  • Data extraction process
  • Where the data is extracted
  • How the data is cleaned, integrated,
    preprocessed, and mined
  • Method selected for data mining
  • Where the results are stored
  • How they may be viewed

67
Visualization of Data Mining Processes by
Clementine

See your solution discovery process clearly
Understand variations with visualized data
68
Interactive Visual Data Mining
  • Using visualization tools in the data mining
    process to help users make smart data mining
    decisions
  • Example
  • Display the data distribution in a set of
    attributes using colored sectors or columns
    (depending on whether the whole space is
    represented by either a circle or a set of
    columns)
  • Use the display to decide which sector should
    first be selected for classification and where a
    good split point for this sector may be

69
Interactive Visual Mining by Perception-Based
Classification (PBC)
70
Audio Data Mining
  • Uses audio signals to indicate the patterns of
    data or the features of data mining results
  • An interesting alternative to visual mining
  • An inverse task of mining audio (such as music)
    databases which is to find patterns from audio
    data
  • Visual data mining may disclose interesting
    patterns using graphical displays, but requires
    users to concentrate on watching patterns
  • Instead, transform patterns into sound and music
    and listen to pitches, rhythms, tune, and melody
    in order to identify anything interesting or
    unusual

72
Invisible Data Mining
  • Embed mining functions into information services
  • Web search engines (link analysis, authoritative
    pages, user profiles), adaptive web sites, etc.
  • Improvement of query processing using history
    data
  • Making services smart and efficient
  • Benefits from/to data mining research
  • Data mining research has produced many scalable,
    efficient, novel mining solutions
  • Applications feed new challenge problems to
    research
  • Can we make bio-informatics based data mining
    invisible?

73
Conclusions
  • Data mining and bio-informatics: both are young
    and promising disciplines
  • Data mining: a confluence of multiple
    disciplines (database, data warehouse, machine
    learning, statistics, high performance computing,
    bio-technology, etc.)
  • Lots of research issues need biologists and
    computer scientists working together

74
http://www.cs.uiuc.edu/~hanj
  • Thank you !!!