Title: Potential Data Mining Techniques for Flow Cytometry Data Analysis
1. Potential Data Mining Techniques for Flow Cytometry Data Analysis
2. Data Mining Functionalities
- Association analysis
- Classification and prediction
- Cluster analysis
- Evolution analysis
3. Flow Cytometry Data
- Samples over time points
- Flow cytometry data at each time point
  - Cell-marker intensity matrix
4. Data Preprocessing
- Data cleaning
  - Reduce noise and handle missing values
- Data transformation
  - Discretization: discretize marker values into ranges (gates?)
  - Normalization (see the sketch below)
    - Marker based
    - Cell based
    - Sample based
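Below is a minimal NumPy sketch of one normalization option, per-marker z-scoring of a hypothetical cell-by-marker intensity matrix; the matrix, its dimensions, and the function name are illustrative assumptions. Cell-based or sample-based normalization would standardize along the other axes instead.

import numpy as np

# Hypothetical data: 1000 cells x 8 markers, values in [0, 1).
intensities = np.random.default_rng(0).random((1000, 8))

def normalize_per_marker(matrix):
    # Z-score each marker (column) so markers become comparable.
    return (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)

normalized = normalize_per_marker(intensities)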
5. Potential Analyses
- Marker-based clustering
  - Cluster markers based on their expression patterns
- Cell-based clustering
  - Cluster cells based on their expression patterns
- Marker-based frequent itemset analysis
  - Find frequently co-elevated marker groups
- Sample-based clustering
  - Cluster samples based on their flow cytometry data
- Sample-based classification
  - Classify patients into pre-defined classes based on their flow cytometry data and other clinical data
- Sample-based time series analysis
  - Analyze how the flow cytometry data evolve over time
6. Marker-Based Clustering
- Plot each marker as a point in N-dimensional space (N = number of cells)
- Define a distance metric between every two marker points in the N-dimensional space (see the sketch below)
  - Euclidean distance
  - Pearson correlation
- Markers with a small distance share the same expression characteristics -> functionally related or similar?
- Clustering -> functionally related markers?
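A minimal sketch, assuming NumPy and the same hypothetical cell-by-marker matrix, of the two candidate distance metrics between marker points:

import numpy as np

# Hypothetical data: each column is one marker's profile over 1000 cells.
intensities = np.random.default_rng(0).random((1000, 8))

def euclidean_distance(a, b):
    # Straight-line distance between two marker profiles.
    return float(np.linalg.norm(a - b))

def pearson_distance(a, b):
    # 1 - Pearson correlation: near 0 when profiles co-vary strongly.
    return float(1.0 - np.corrcoef(a, b)[0, 1])

# Each marker is a point in N-dimensional space, N = number of cells.
m0, m1 = intensities[:, 0], intensities[:, 1]
print(euclidean_distance(m0, m1), pearson_distance(m0, m1))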
7. Cell-Based Clustering
- Plot each cell as a point in N-dimensional space (N = number of markers)
- Define a distance metric between every two cell points in the N-dimensional space
- Cells with a small distance share the same expression characteristics -> functionally related or similar?
- Clustering -> functionally related cells?
- Help with gating? (N-dimensional vs. 2-dimensional)
8. Clustering Techniques
- Partition-based: partition the data into a set of disjoint clusters
- Hierarchical: organize elements into a tree (dendrogram) representing a hierarchical series of nested clusters
  - Agglomerative: start with every element in its own cluster, and iteratively join clusters together
  - Divisive: start with one cluster and iteratively divide it into smaller clusters
- Graph-theoretical: represent the data in a proximity graph and solve graph-theoretical problems such as finding a minimum cut or maximal cliques
- Others
9. Partitioning Methods: K-Means Clustering
- Input: a set V consisting of n points and a parameter k
- Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X
- Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X
- Given a set of n data points V = {v_1, ..., v_n} and a set of k points X, define the squared error distortion as
  d(V, X) = (1/n) * sum_{1 <= i <= n} d(v_i, X)^2
10. K-Means Clustering: Lloyd's Algorithm
- Lloyd's algorithm (see the sketch below)
  - Arbitrarily assign the k cluster centers
  - While the cluster centers keep changing:
    - Assign each data point to the cluster C_i corresponding to the closest cluster center (1 <= i <= k)
    - Update each cluster center to the center of gravity of its cluster, that is, (sum_{v in C} v) / |C| for every cluster C
- This may lead to a merely locally optimal clustering
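A minimal NumPy sketch of Lloyd's algorithm as outlined above; the data, k, and the convergence handling are illustrative assumptions (empty clusters are kept at their old centers for brevity):

import numpy as np

def lloyd_kmeans(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Arbitrarily assign the k cluster centers (here: k random data points).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each data point to the closest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the center of gravity of its cluster.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped changing
            break
        centers = new_centers
    return centers, labels

points = np.random.default_rng(1).random((200, 2))  # hypothetical data
centers, labels = lloyd_kmeans(points, k=3)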
11. Some Discussion on k-Means Clustering
- May lead to a merely locally optimal clustering
- Works well when the clusters are compact clouds that are rather well separated from one another
- Not suitable for clusters with nonconvex shapes or clusters of very different sizes
- Sensitive to noise and outlier data points
- Users must specify k
12. Hierarchical Clustering
13. Hierarchical Clustering Algorithm
- HierarchicalClustering(d, n)
  - Form n clusters, each with one element
  - Construct a graph T by assigning one vertex to each cluster
  - While there is more than one cluster:
    - Find the two closest clusters C1 and C2
    - Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    - Compute the distance from C to all other clusters
    - Add a new vertex C to T and connect it to vertices C1 and C2
    - Remove the rows and columns of d corresponding to C1 and C2
    - Add a row and column to d corresponding to the new cluster C
  - Return T
- The algorithm takes an n x n matrix d of pairwise distances between points as input; different ways to define distances between clusters may lead to different clusterings (see the sketch below)
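A minimal sketch of agglomerative hierarchical clustering, assuming SciPy and hypothetical data; linkage builds the tree T, and its method parameter selects how cluster-to-cluster distances are defined (the choice that, as noted above, can change the resulting clusters):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.random.default_rng(0).random((50, 4))  # hypothetical 50 cells x 4 markers
tree = linkage(points, method="average", metric="euclidean")  # dendrogram as a merge table
labels = fcluster(tree, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters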
14. Graph-Theoretical Methods: Clique Graphs
- Turn the distance matrix into a distance graph (see the sketch below)
  - Cells are represented as vertices in the graph
  - Choose a distance threshold θ
  - If the distance between two vertices is below θ, draw an edge between them
- Transform the distance graph into a clique graph by adding or removing edges
- The resulting graph may contain cliques that represent clusters of closely located data points
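A minimal sketch of the distance-graph construction step, assuming NetworkX and SciPy; the data and the threshold value are illustrative assumptions, and maximal cliques of the resulting graph are read directly as candidate clusters:

import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.random.default_rng(0).random((30, 4))  # hypothetical cells
d = squareform(pdist(points))                      # pairwise distance matrix
theta = 0.4                                        # hypothetical distance threshold

g = nx.Graph()
g.add_nodes_from(range(len(points)))               # one vertex per cell
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        if d[i, j] < theta:                        # edge when distance is below theta
            g.add_edge(i, j)

cliques = list(nx.find_cliques(g))                 # maximal cliques as candidate clusters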
15. Marker Association Analysis
- Convert intensity values to present/absent
- The cell-marker intensity matrix can then be transformed into, for each cell, the list of its present markers
- Mine for frequent marker sets that are co-present in cells
- Potential association analysis for different gates (vs. just present or absent)
16. Frequent Pattern Mining Methods
- Three major approaches (see the Apriori sketch below)
  - Apriori (Agrawal & Srikant @ VLDB '94)
  - Frequent pattern growth (FP-growth; Han, Pei & Yin @ SIGMOD '00)
  - Vertical data format approach (CHARM; Zaki & Hsiao @ SDM '02)
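A minimal pure-Python sketch of the Apriori idea over hypothetical cell-to-present-marker transactions; the marker names and the support threshold are illustrative assumptions, and candidate generation is kept deliberately naive:

transactions = [                      # each set: markers present in one cell
    {"CD3", "CD4", "CD62L"},
    {"CD3", "CD4"},
    {"CD3", "CD8"},
    {"CD3", "CD4", "CD62L"},
]
min_support = 2                       # minimum number of cells containing the itemset

def apriori(transactions, min_support):
    items = {item for t in transactions for item in t}
    # Frequent 1-itemsets.
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Candidate k-itemsets from unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori(transactions, min_support))  # frequent co-present marker sets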
17. Sample-Based Classification
- Predict target attributes for patients based on a set of features, e.g., is the patient healthy? Will the patient reject the transplant?
18. Classification Model Construction
19. Feature Generation
- Potential features
  - Marker data
  - Microarray data
  - Clinical data
20. Marker Data Features
- Cell distribution for each marker (see the sketch below)
  - Histograms of cells for each range/gate (corresponds to what users currently plot for pairwise markers)
  - Min, max, average, and variance of the intensity levels
  - Distribution curves of cells for each intensity value
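A minimal NumPy sketch of these per-marker distribution features, computed from a hypothetical cell-by-marker intensity matrix; the bin count and value range are illustrative assumptions:

import numpy as np

intensities = np.random.default_rng(0).random((1000, 8))  # hypothetical data

def marker_features(column, bins=10):
    # Histogram of cells per range/gate plus summary statistics.
    hist, _ = np.histogram(column, bins=bins, range=(0.0, 1.0))
    return {
        "histogram": hist,
        "min": column.min(),
        "max": column.max(),
        "mean": column.mean(),
        "variance": column.var(),
    }

features = [marker_features(intensities[:, m]) for m in range(intensities.shape[1])]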
21. Cell Distribution for an Individual Marker (CD62L)
22. Question
- Is the cell distribution enough to represent the flow cytometry data?
- In other words, can we say two samples are similar or the same if they have the same cell distribution for each marker?
23. Cross-Marker Distribution
- Pairwise cell distribution?
- Can we use any results from the marker-based clustering, the cell-based clustering, and the marker-based association analysis?
- Others?
24. Feature Selection and Feature Extraction
- Problem: the curse of dimensionality
  - Limited data samples
  - Large set of features
- Technique: dimension reduction
  - Feature selection
  - Feature extraction
25. Feature Selection
- Select a subset of features from the feature space
- Theoretically, optimal feature selection requires an exhaustive search of all possible subsets of features
- Practically, we settle for a satisfactory set of features
- Approaches (see the sketch below)
  - Relevance analysis: remove redundant features based on correlation or mutual information (dependencies) among features
  - Greedy hill climbing: select an approximately best set of features using correlation and mutual information between features and class attributes
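A minimal NumPy sketch of a correlation-based relevance filter, one simple instance of the approaches above; the features, labels, and k are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((40, 100))       # hypothetical: 40 samples, 100 features
labels = rng.integers(0, 2, size=40)   # hypothetical binary class attribute

def top_k_by_correlation(features, labels, k=10):
    # Score each feature by absolute correlation with the class attribute.
    scores = np.array([
        abs(np.corrcoef(features[:, j], labels)[0, 1])
        for j in range(features.shape[1])
    ])
    return np.argsort(scores)[::-1][:k]  # indices of the k most relevant features

selected = top_k_by_correlation(features, labels)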
26. Feature Extraction
- Map the high-dimensional feature space to a low-dimensional feature space
- Approaches (see the sketch below)
  - Principal Component Analysis: a linear transformation that maps the projection of the data with the greatest variance to the first coordinate (the first principal component), and so on
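A minimal sketch of PCA, assuming scikit-learn and hypothetical feature vectors; the number of components is an illustrative assumption:

import numpy as np
from sklearn.decomposition import PCA

features = np.random.default_rng(0).random((40, 100))  # hypothetical samples
pca = PCA(n_components=5)
reduced = pca.fit_transform(features)       # 40 samples x 5 principal components
print(pca.explained_variance_ratio_)        # variance captured per component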
27. Classification Methods
- Decision tree induction
- Bayesian classification
- Neural network
- Support Vector Machines (SVM)
- Instance-based methods
28. Decision Tree
29. Decision Tree: Comments
- Relatively fast learning speed (compared with other classification methods)
- Convertible to simple, easy-to-understand classification rules (e.g., if age < 30 and IFM = 1, then healthy; see the sketch below)
- Can use SQL queries to access databases
- Classification accuracy comparable with other methods
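A minimal sketch of decision tree induction, assuming scikit-learn and hypothetical sample features; export_text prints the learned tree as readable if-then rules of the kind mentioned above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = rng.random((40, 5))          # hypothetical sample features
labels = rng.integers(0, 2, size=40)    # e.g., healthy vs. not healthy

tree = DecisionTreeClassifier(max_depth=3).fit(features, labels)
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))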
30. Probabilistic Learning (Bayesian Classification)
- Calculate explicit probabilities for each hypothesis
- Characteristics
  - Incremental
  - Standard
  - Computationally intractable
- Naïve Bayesian classifier
  - Assumes conditional independence of attributes
31. Naïve Bayesian Classifier: Comments
- Advantages
  - Easy to implement (see the sketch below)
  - Good results obtained in most cases
- Disadvantages
  - Assumes conditional independence
  - In practice, dependencies exist among variables and cannot be modeled by a naïve Bayesian classifier
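A minimal sketch of a naive Bayesian classifier, assuming scikit-learn and hypothetical data; GaussianNB treats each feature as conditionally independent given the class, which is exactly the assumption criticized above:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
features = rng.random((40, 5))          # hypothetical sample features
labels = rng.integers(0, 2, size=40)

model = GaussianNB().fit(features, labels)
print(model.predict_proba(features[:3]))  # explicit class probabilities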
32. Discriminative Analysis
- Learn a function of the inputs on which to base the classification decision
33. Neural Network
34. SVM
35. Discriminative Classifiers vs. Bayesian Classifiers
- Advantages
  - Prediction accuracy is generally high
  - Robust: works when training examples contain errors
  - Fast evaluation of the learned target function
- Criticisms
  - Long training time
  - The learned function (weights) is difficult to understand
  - Not easy to incorporate domain knowledge
36. Instance-Based Methods
- Instance-based learning
  - Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
- Typical approaches (see the k-NN sketch below)
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
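A minimal sketch of the k-nearest neighbor approach, assuming scikit-learn and hypothetical data; fitting just stores the training examples (lazy evaluation), and the work happens at prediction time:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
features = rng.random((40, 5))          # hypothetical sample features
labels = rng.integers(0, 2, size=40)

knn = KNeighborsClassifier(n_neighbors=3).fit(features, labels)
print(knn.predict(rng.random((2, 5))))  # classify two new instances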
37. Popular Implementations
- General
  - Weka: an open-source toolkit written in Java with implementations of many basic algorithms for classification, clustering, and association analysis; accessible through a GUI or a Java API
- Specialized
  - SVM-light: a simple implementation of SVM in C
  - KDnuggets: a good directory of data mining software (commercial as well as open source)
    - http://www.kdnuggets.com/software/index.html
38. Summary
- Potential adaptation and evaluation of a set of data mining techniques for flow cytometry data analysis
- Domain knowledge is important at each step