Potential Data Mining Techniques for Flow Cyt Data Analysis - PowerPoint PPT Presentation
1
Potential Data Mining Techniques for Flow Cyt
Data Analysis
  • Li Xiong

2
Data Mining Functionalities
  • Association analysis
  • Classification and prediction
  • Cluster analysis
  • Evolution analysis

3
Flow Cyt Data
  • Sample over time points
  • Flow cyt data at each time point
  • Cell × marker intensity matrix

4
Data Preprocessing
  • Data cleaning
  • Reduce noise and handle missing values
  • Data transformation
  • Discretization: discretize marker values into
    ranges (gates?)
  • Normalization
  • Marker based
  • Cell based
  • Sample based
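As a concrete (and purely illustrative) sketch of the normalization options, assuming a cell × marker intensity matrix and simple min-max / total-intensity scalings that are not prescribed by the slides:

```python
# Illustrative sketch: two normalization options for a cell x marker
# intensity matrix (rows = cells, columns = markers). The min-max and
# total-intensity scalings are assumptions, not from the slides.

def marker_normalize(matrix):
    """Marker-based: min-max scale each marker (column) to [0, 1]."""
    n_markers = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_markers)]
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(n_markers)] for row in matrix]

def cell_normalize(matrix):
    """Cell-based: scale each cell (row) by its total intensity."""
    out = []
    for row in matrix:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out
```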

5
Potential Analysis
  • Marker-based clustering
  • Cluster markers based on their expression
    patterns
  • Cell-based clustering
  • Cluster cells based on their expression patterns
  • Marker-based frequent itemsets analysis
  • Find frequent co-elevated marker groups
  • Sample-based clustering
  • Cluster samples based on their flow-cyt data
  • Sample-based classification
  • Classify patients based on their flow-cyt data and
    other clinical data into pre-defined classes
  • Sample-based time series analysis
  • Analyze how the flow cyt data evolves

6
Marker-Based Clustering
  • Plot each marker as a point in N-dimensional
    space (N = number of cells)
  • Define a distance metric between every two marker
    points in the N-dimensional space
  • Euclidean distance
  • Pearson correlation
  • Markers with a small distance share the same
    expression characteristics → functionally
    related or similar?
  • Clustering → functionally related markers?
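The two distance metrics listed above could be implemented along these lines (Pearson correlation is turned into a distance as 1 − r; the example vectors in the test are made up):

```python
import math

# Sketch of the two candidate distances between marker expression
# vectors (each marker viewed as a point in N-dimensional cell space).

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: near 0 for perfectly correlated markers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)
```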

7
Cell-Based Clustering
  • Plot each cell as a point in N-dimensional space
    (N = number of markers)
  • Define a distance metric between every two cell
    points in the N-dimensional space
  • Cells with a small distance share the same
    expression characteristics → functionally
    related or similar?
  • Clustering → functionally related cells?
  • Help with gating? (N-dimensional vs.
    2-dimensional)

8
Clustering Techniques
  • Partition-based: partition data into a set of
    disjoint clusters
  • Hierarchical: organize elements into a tree
    (dendrogram), representing a hierarchical series
    of nested clusters
  • Agglomerative: start with every element in its
    own cluster, and iteratively join clusters
    together
  • Divisive: start with one cluster and iteratively
    divide it into smaller clusters
  • Graph-theoretical: present data in a proximity
    graph and solve graph-theoretical problems such
    as finding a minimum cut or maximal cliques
  • Others

9
Partitioning Methods: K-Means Clustering
  • Input: a set V consisting of n points and a
    parameter k
  • Output: a set X consisting of k points (cluster
    centers) that minimizes the squared error
    distortion d(V,X) over all possible choices of X
  • Given a data point v and a set of points X,
    define the distance from v to X, d(v, X), as the
    (Euclidean) distance from v to the closest point
    in X. Given a set of n data points V = {v1, ..., vn}
    and a set of k points X, define the squared error
    distortion
  • d(V,X) = Σ_{1 ≤ i ≤ n} d(vi, X)² / n

10
K-Means Clustering: Lloyd Algorithm
  • Lloyd Algorithm
  • Arbitrarily assign the k cluster centers
  • while the cluster centers keep changing
  • Assign each data point to the cluster Ci
    corresponding to the closest cluster center
    (1 ≤ i ≤ k)
  • Update each cluster center to the center of
    gravity of its cluster, that is,
    Σ_{v ∈ C} v / |C| for every cluster C
  • This may lead to a merely locally optimal
    clustering.
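A minimal sketch of the Lloyd algorithm as outlined above: alternating assignment and center-of-gravity updates until the centers stop changing. Initializing the centers by sampling k input points is an implementation assumption.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points (tuples)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, seed=0):
    """Lloyd algorithm sketch; points is a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrary initial centers
    while True:
        # Assign each point to the closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # Move each center to the center of gravity of its cluster.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged (possibly local optimum)
            return centers, clusters
        centers = new_centers
```

On well-separated compact clouds this converges quickly, matching the discussion of k-means strengths on the next slide.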

11
Some Discussion on k-Means Clustering
  • May lead to a merely locally optimal clustering
  • Works well when the clusters are compact clouds
    that are rather well separated from one another
  • Not suitable for clusters with nonconvex shapes
    or clusters of very different sizes
  • Sensitive to noise and outlier data points
  • Requires users to specify k

12
Hierarchical Clustering
13
Hierarchical Clustering Algorithm
  • Hierarchical Clustering (d, n)
  • Form n clusters, each with one element
  • Construct a graph T by assigning one vertex
    to each cluster
  • while there is more than one cluster
  • Find the two closest clusters C1 and C2
  • Merge C1 and C2 into a new cluster C with
    |C1| + |C2| elements
  • Compute the distance from C to all other
    clusters
  • Add a new vertex C to T and connect it to
    vertices C1 and C2
  • Remove the rows and columns of d corresponding
    to C1 and C2
  • Add a row and column to d corresponding to
    the new cluster C
  • return T

The algorithm takes an n×n matrix d of
pairwise distances between points as
input. Different ways to define distances between
clusters may lead to different clusterings.
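The agglomerative procedure can be sketched compactly with single linkage (cluster distance = minimum pairwise distance); as noted above, other linkage definitions would give different clusterings:

```python
# Single-linkage agglomerative clustering sketch. The linkage choice
# is an assumption; the slide's algorithm leaves it open.

def single_linkage(d, n):
    """d: n x n symmetric distance matrix; returns the merge history
    as (cluster a, cluster b, new cluster id) triples."""
    clusters = {i: [i] for i in range(n)}  # cluster id -> member points
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Find the two closest clusters under single linkage.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda p: min(d[i][j] for i in clusters[p[0]]
                              for j in clusters[p[1]]),
        )
        merges.append((a, b, next_id))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

The merge history is exactly the dendrogram structure the slide describes: each triple records one internal node of the tree.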
14
Graph Theoretical Methods Clique Graphs
  • Turn the distance matrix into a distance graph
  • Cells are represented as vertices in the graph
  • Choose a distance threshold θ
  • If the distance between two vertices is below θ,
    draw an edge between them
  • Transform the distance graph into a clique graph by
    adding or removing edges
  • The resulting graph may contain cliques that
    represent clusters of closely located data
    points!
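A sketch of the first step above: thresholding the distance matrix into a distance graph and reading off connected components as candidate clusters. (The full method edits this graph into a clique graph; connected components are a simpler stand-in used here for illustration.)

```python
def distance_graph_components(d, theta):
    """Build the distance graph (edge iff distance < theta) and return
    its connected components as sorted vertex lists."""
    n = len(d)
    adj = {i: {j for j in range(n) if j != i and d[i][j] < theta}
           for i in range(n)}
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search to collect one component.
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components
```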

15
Marker Association Analysis
  • Convert intensity values to present or absent
  • The cell-marker intensity matrix can then be
    transformed into a list of present markers for
    each cell
  • Mine for frequent marker sets that are co-present
    in cells
  • Potential association analysis over different
    gates (vs. just present or absent)

16
Frequent Pattern Mining Methods
  • Three major approaches
  • Apriori (Agrawal & Srikant @VLDB'94)
  • Frequent pattern growth (FP-growth; Han, Pei &
    Yin @SIGMOD'00)
  • Vertical data format approach (CHARM; Zaki &
    Hsiao @SDM'02)
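For the marker-set mining described on the previous slide, a minimal Apriori sketch: each cell is represented as its set of present markers, and marker sets are kept only if their support meets a minimum threshold. The toy data and threshold in the test are illustrative.

```python
# Minimal Apriori sketch: level-wise candidate generation, pruning
# candidates below the minimum support.

def apriori(transactions, min_sup):
    """transactions: list of frozensets of present markers per cell."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_sup}
    frequent = set(level)
    k = 2
    while level:
        # Join frequent (k-1)-sets into k-set candidates, then prune.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k}
        level = {c for c in candidates if support(c) >= min_sup}
        frequent |= level
        k += 1
    return frequent
```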

17
Sample-Based Classification
  • Predict target attributes for patients based on a
    set of features, e.g., is the patient healthy?
    Will the patient reject the transplant?

18
Classification Model Construction
19
Feature Generation
  • Potential features
  • Marker data
  • Microarray data
  • Clinical data

20
Marker Data Features
  • Cell distribution for each marker
  • Histograms of cells for each range/gate
    (corresponds to what users are currently plotting
    for pair-wise markers)
  • Min, max, average, variance of the intensity
    levels
  • Distribution curves of cells for each
    intensity value

21
Cell Distribution for Individual Marker (CD62L)
22
Question
  • Is the cell distribution enough to represent the
    flow cyt data?
  • In other words, can we say two samples are
    similar or the same if they have the same cell
    distribution for each marker?

23
Cross-Marker Distribution
  • Pair-wise cell distribution?
  • Can we use any results from the marker-based
    clustering, cell-based clustering, and the
    marker-based association analysis?
  • Others?

24
Feature Selection and Feature Extraction
  • Problem: curse of dimensionality
  • Limited data samples
  • Large set of features
  • Techniques: dimension reduction
  • Feature selection
  • Feature extraction

25
Feature Selection
  • Select a subset of features from feature space
  • Theoretically - an optimal feature selection
    requires exhaustive search of all possible
    subsets of features
  • Practically - satisfactory set of features
  • Approaches
  • Relevance analysis: remove redundant features
    based on correlation or mutual information
    (dependencies) among features
  • Greedy hill climbing: select an approximately
    best set of features using correlation and
    mutual information between features and class
    attributes
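A sketch of the filter-style selection described above: rank features by their absolute Pearson correlation with the class attribute and keep the top k. The toy matrix in the test is illustrative, and this simple filter ignores redundancy among the selected features.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def select_features(X, y, k):
    """X: list of samples (rows of feature values); y: class values.
    Returns the indices of the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y))
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]
```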

26
Feature Extraction
  • Map high dimensional feature space to low
    dimensional feature space
  • Approaches
  • Principal Component Analysis: a linear
    transformation that maps the projection of the
    data with the greatest variance to the first
    coordinate (the first principal component), and
    so on
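A sketch of extracting just the first principal component (the direction of greatest variance) by power iteration on the sample covariance matrix; power iteration is an implementation choice for illustration, not something the slide prescribes.

```python
import math

def first_pc(X, iters=200):
    """Return a unit vector along the first principal component of X
    (list of samples), via power iteration on the covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```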

27
Classification Methods
  • Decision tree induction
  • Bayesian classification
  • Neural network
  • Support Vector Machines (SVM)
  • Instance based methods

28
Decision Tree
29
Decision Tree - Comments
  • Relatively faster learning speed (than other
    classification methods)
  • Convertible to simple and easy-to-understand
    classification rules (e.g., if age < 30 and
    IFM = 1 then healthy)
  • Can use SQL queries for accessing databases
  • Comparable classification accuracy with other
    methods

30
Probabilistic Learning (Bayesian Classification)
  • Calculate explicit probabilities for hypotheses
  • Characteristics
  • Incremental
  • Provides a standard of optimal decision making,
    even when computationally intractable
  • Naïve Bayesian Classifier
  • Assumes conditional independence of attributes

31
Naïve Bayesian Classifier - Comments
  • Advantages
  • Easy to implement
  • Good results obtained in most of the cases
  • Disadvantages
  • Assumption of conditional independence
  • In practice, dependencies exist among variables
    and cannot be modeled by the Naïve Bayesian
    Classifier
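A minimal categorical Naïve Bayesian classifier illustrating the points above: class priors multiplied by per-attribute conditional probabilities under the conditional-independence assumption. Laplace smoothing and the toy patient data in the test are added assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, X, y):
        self.classes = Counter(y)       # class -> count
        self.n = len(y)
        self.counts = defaultdict(int)  # (feature, value, class) -> count
        self.values = defaultdict(set)  # feature -> observed values
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(j, v, c)] += 1
                self.values[j].add(v)
        return self

    def predict(self, row):
        def log_post(c):
            # Log prior plus log conditionals, assuming the attributes
            # are conditionally independent given the class.
            lp = math.log(self.classes[c] / self.n)
            for j, v in enumerate(row):
                num = self.counts[(j, v, c)] + 1  # Laplace smoothing
                den = self.classes[c] + len(self.values[j])
                lp += math.log(num / den)
            return lp
        return max(self.classes, key=log_post)
```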

32
Discriminative Analysis
  • Learn a function of the inputs directly and base
    the classification decision on it

33
Neural Network
34
SVM
35
Discriminative Classifiers vs. Bayesian
Classifiers
  • Advantages
  • prediction accuracy is generally high
  • robust, works when training examples contain
    errors
  • fast evaluation of the learned target function
  • Criticism
  • long training time
  • difficult to understand the learned function
    (weights)
  • not easy to incorporate domain knowledge

36
Instance-Based Methods
  • Instance-based learning
  • Store training examples and delay the processing
    (lazy evaluation) until a new instance must be
    classified
  • Typical approaches
  • k-nearest neighbor approach
  • Instances represented as points in a Euclidean
    space.
  • Locally weighted regression
  • Constructs local approximation
  • Case-based reasoning
  • Uses symbolic representations and knowledge-based
    inference
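A minimal k-nearest-neighbor sketch of the lazy-evaluation idea above: store the training examples and, only at prediction time, vote among the k closest points in Euclidean space (k and the toy data in the test are illustrative):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point (tuple).
    Lazy evaluation: all work happens here, at classification time."""
    neighbors = sorted(
        train,
        key=lambda pl: math.dist(pl[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```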

October 26, 2005
37
Popular Implementations
  • General
  • Weka: an open-source toolkit written in Java with
    implementations of many basic algorithms for
    classification, clustering, and association
    analysis; can be accessed through a GUI or a
    Java API
  • Specialized ones
  • SVM-light: a simple implementation of SVM in C
  • KDnuggets: a good directory of data mining
    software (commercial as well as open source)
  • http://www.kdnuggets.com/software/index.html

38
Summary
  • Potential adaptation and evaluation of a set of
    data mining techniques for flow cyt data analysis
  • Domain knowledge is important for each of the
    steps