Transcript and Presenter's Notes

Title: Methods for Micro-Array Analysis Data Mining


1
Methods for Micro-Array Analysis Data Mining
Machine Learning Approaches
2
What is Micro-Array Analysis?
3
Analysis of Micro-Array Data
  • Challenges posed
  • A typical characteristic of micro-array data is
    the large number of variables relative to the
    number of observations.
  • The knowledge hidden in these data has to be
    discovered.
  • E.g. gene expression data from 72 leukemia
    patients (samples) with 7,070 genes (variables).
  • The study of the variability of gene expression
    patterns.
  • Problems:
  • How can micro-array data be analyzed with the
    following requirements met simultaneously?
  • Efficiency
  • Accuracy
  • Automation

4
Typical Micro-Array data set
  • Suppose that the identical micro-array experiment
    is repeated p times (e.g. colon cancer cells from
    p patients compared with p wild types). Then we
    obtain a data set (m_ij; i = 1, ..., G, j = 1,
    ..., p), in which m_ij is the expression ratio of
    gene i in the jth experiment.
  • Such experiments usually generate large data sets
    with expression values for thousands of genes
    (2,000-20,000) but no more than a few dozen
    samples.
  • For example:

m_ij |   1  |   2  |   3  |   4  |   5  |   6  |   7  | ... |   p
  1  | 1.04 | 1.17 | 1.08 | 1.06 | 1.14 | 1.09 | 1.07 | ... | 1.11
  2  | 1.02 | 1.01 | 1.15 | 1.15 | 1.01 | 1.06 | 1.08 | ... | 1.34
 ... |  ... |  ... |  ... |  ... |  ... |  ... |  ... | ... |  ...
  G  | 1.45 | 1.08 | 1.02 | 1.06 | 1.12 | 1.57 | 1.11 | ... | 1.06
Dataset | Number of genes | Samples | Reference
Leukemia (ALL versus AML) | 7129 | 47 + 25 bone marrow samples | Golub et al. (1999)
Lung cancer (malignant pleural mesothelioma (MPM) versus adenocarcinoma of the lung (ADCA)) | 12533 | 31 + 150 tissue samples | Gordon et al. (2002)
Prostate cancer (tumor versus normal classification) | 12600 | 77 + 59 prostate tumor and normal samples | Singh et al. (2002)
5
Main objectives in micro-array data analyses
  • 1. To find the genes that are differentially
    expressed (DE) in the two samples (e.g. the given
    colon cancer sample versus the wild-type cells,
    or cells that received a given treatment versus
    no treatment). Although biologists can discover
    DE genes even with p = 1, it has lately been
    realized that making independent replications is
    good practice.
  • Questions that could be asked:
  • - Which genes' expression is modified by the
    condition? (It has been reported that many
    diseases, especially tumors, are rarely caused by
    a single gene mutation but are the result of a
    series of gene changes.)
  • - Has the treatment changed the expression
    level of specific (target) genes / gene sequences
    to noticeably different levels? If so, is it
    important (i.e. has the patient's condition
    improved due to this change in expression levels)?
  • 2. To find genes that behave similarly in
    different conditions (i.e. clustering the row
    vectors) and to find subgroups of samples (or
    patients' tissues) that are similar to each
    other (i.e. clustering the column vectors).
  • - novel discovery of genes in a related
    biological pathway or having related functions
  • - clinically important subgroups of
    patients
  • 3. Classification
  • - For example, Golub et al. (1999)
    distinguished 2 types of leukemia based on the
    gene expression profile of each sample.
  • 4. Validation of the models, assessment of the
    robustness / predictive power of the classifiers
    (models)

6
Main objectives in micro-array data analyses
7
1. Finding differentially expressed genes.
Parametric methods: the t-test
  • Standard t-test:
  • H0 - there is no difference between the treatment
    and the control samples.
  • H1 - the treatment has an influence.
  • Knowing the probability distribution of the T
    statistic under H0 (Student's t distribution with
    p-1 degrees of freedom), the actual T is computed
    and compared to this distribution.
  • The smaller the p-value, the less likely it is
    that such an extreme difference arose by chance
    (see the sketch below).
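
A minimal sketch of this per-gene test, assuming a toy expression matrix and a treatment/control split (the variable names, toy data, and the SciPy implementation are ours, not the presentation's):

    # Per-gene two-sample t-tests over a toy expression matrix.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    expr = rng.normal(1.0, 0.1, size=(1000, 20))    # 1000 genes x 20 samples
    treated = np.array([True] * 10 + [False] * 10)  # treatment vs control columns

    # H0: for each gene, no difference between treatment and control means.
    t_stat, p_val = stats.ttest_ind(expr[:, treated], expr[:, ~treated], axis=1)

    # Small p-values flag candidate differentially expressed genes, but with
    # thousands of tests the multiple testing problem (next slide) applies.
    print((p_val < 0.01).sum(), "candidate DE genes at p < 0.01")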

8
1. Finding differentially expressed genes. t-test
  • Advantages: simple, and implemented in all
    commercial microarray analysis packages.
  • Disadvantages: distributional assumptions and
    the problem of multiple testing (due to the small
    number of samples, we cannot assume normality of
    the sample means) -> what is the false
    discovery rate? (A correction sketch follows.)
  • Alternatives: empirical Bayes and parametric
    Bayes.
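
One standard answer to the multiple testing problem is to control the false discovery rate. A minimal sketch of the Benjamini-Hochberg procedure, which the slides allude to but do not name (function and toy values are ours):

    # Benjamini-Hochberg FDR control over a vector of per-gene p-values.
    import numpy as np

    def benjamini_hochberg(p_vals, alpha=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level alpha."""
        m = len(p_vals)
        order = np.argsort(p_vals)
        thresholds = alpha * np.arange(1, m + 1) / m
        passed = p_vals[order] <= thresholds
        k = np.nonzero(passed)[0].max() + 1 if passed.any() else 0
        rejected = np.zeros(m, dtype=bool)
        rejected[order[:k]] = True      # reject the k smallest p-values
        return rejected

    p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])
    print(benjamini_hochberg(p))        # only the two smallest survive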

9
1. Finding differentially expressed genes.
Fold-change approach
  • The average expression level of each gene is
    examined.
  • If it changed by a certain number of folds, the
    gene is declared changed (on or off), as in the
    sketch below.
  • Disadvantage: does not reveal the desired
    correlation between the gene and its function.
    Does not find related genes.
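
A minimal sketch of the fold-change filter with a conventional 2-fold cutoff (the cutoff, toy data, and names are illustrative assumptions):

    # Flag genes whose mean expression changed by at least 2-fold.
    import numpy as np

    rng = np.random.default_rng(1)
    expr_t = rng.uniform(0.5, 4.0, size=1000)   # toy mean expression, treatment
    expr_c = rng.uniform(0.5, 4.0, size=1000)   # toy mean expression, control

    log2_fc = np.log2(expr_t / expr_c)          # per-gene log2 fold change
    changed = np.abs(log2_fc) >= 1.0            # |log2 FC| >= 1  <=>  2-fold
    print(changed.sum(), "genes pass the 2-fold cutoff")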

10
Data Mining

11
Gene Expression grouping and classification.
Overview of existing approaches
  • Micro-array analysis employs machine learning
    algorithms and techniques to mine useful data.
  • Unsupervised data analysis:
  • Principal Component Analysis (PCA)
  • Hierarchical Clustering
  • Non-Hierarchical Clustering
  • K-means
  • Self-organizing maps (a type of neural network)
  • Supervised data analysis:
  • Decision Trees - the C5.0 implementation
  • Artificial Neural Networks - the back-propagation
    algorithm
  • Two complementary techniques:
  • Cross-validation
  • Multi-model approaches (boosting, bagging,
    stacking)

12
Principal Component Analysis (PCA)
  • This is a technique for finding major
    combinations of data (i.e. genes that are
    regularly up- and down-regulated together).
  • Objectives:
  • Graphically summarize a large rectangular table
    of numbers, R, simplify its comprehension, and
    find pertinent features.
  • Reduce the dimensionality of the data set (e.g.
    to co-regulated genes).
  • Graphically summarize:
  • - The correlations between the
    variables.
  • - Find new meaningful underlying
    variables (dimensions) that summarize the initial
    variables, in such a way as to
  • MAXIMIZE THE INTER-CLASS VARIANCE
  • MINIMIZE THE INTRA-CLASS VARIANCE
  • - The proximities and the
    principal oppositions between the individuals.
  • Simple example:
  • Imagine a micro-array data set consisting of
    only 2 experiments (2 samples).
  • Graphically represent the data.

13
Principal Component Analysis (PCA)
  • Principal component analysis of a
    two-dimensional data cloud. The line shown is the
    direction of the first principal component, which
    gives an optimal (in the mean-square sense)
    linear reduction of dimension from 2 to 1.

14
Principal Component Analysis (PCA)
  • Principal component analysis (PCA) involves a
    mathematical procedure that transforms a number
    of (possibly) correlated variables into a
    (smaller) number of uncorrelated variables called
    principal components.
  • Illustration for the case of 2 samples, x and y,
    each measured over the G genes:
  • The variance of the sample x is given by
    $\mathrm{var}(x) = \frac{1}{G-1} \sum_{i=1}^{G} (x_i - \bar{x})^2$
  • The variance of the sample y is given by
    $\mathrm{var}(y) = \frac{1}{G-1} \sum_{i=1}^{G} (y_i - \bar{y})^2$
  • The covariance between x and y is
    $\mathrm{cov}(x, y) = \frac{1}{G-1} \sum_{i=1}^{G} (x_i - \bar{x})(y_i - \bar{y})$
  • Then we can write the covariance matrix
    $C = \begin{pmatrix} \mathrm{var}(x) & \mathrm{cov}(x,y) \\ \mathrm{cov}(x,y) & \mathrm{var}(y) \end{pmatrix}$
  • This matrix is square and symmetric, admits a
    characteristic polynomial, and is diagonalizable.
    It also admits a basis of orthogonal eigenvectors.

15
Principal Component Analysis (PCA)
  • Then there exists an orthogonal matrix U such that
    $U^{T} C U = \Lambda$, with $\Lambda$ diagonal.
  • The 2 eigenvectors are orthogonal and represent a
    new system of INDEPENDENT coordinates. The
    quantities $u_{11}$ and $u_{12}$ are the
    coordinates of the first new axis expressed in
    vectorial form; likewise $u_{21}$ and $u_{22}$
    for the second.
  • Each coefficient indicates the weight of a
    particular experiment within this component
    (how much this experiment contributes to the
    generation of this pattern)!
  • Altogether: a translation and a rotation of the
    coordinate system, sketched in code below.
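
A minimal sketch of this eigendecomposition view of PCA on a toy two-sample data set (the NumPy implementation and names are ours):

    # PCA via eigendecomposition of the 2x2 covariance matrix.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(0.0, 1.0, 500)
    y = 0.8 * x + rng.normal(0.0, 0.3, 500)   # correlated second sample
    data = np.column_stack([x, y])            # rows = genes, columns = samples

    centered = data - data.mean(axis=0)       # the "translation"
    C = np.cov(centered, rowvar=False)        # covariance matrix

    eig_vals, U = np.linalg.eigh(C)           # symmetric => orthogonal eigenvectors
    order = np.argsort(eig_vals)[::-1]        # sort by decreasing variance
    eig_vals, U = eig_vals[order], U[:, order]

    scores = centered @ U                     # the "rotation": project onto new axes
    print("variance explained:", eig_vals / eig_vals.sum())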

16
Principal Component Analysis (PCA)
  • The first principal component captures as much of
    the variability in the data as possible;
  • each succeeding component captures as much of the
    remaining variability as possible.
  • Imagine a cloud of data in many dimensions ->
    the benefits are clear!
  • The projection of a point A = (x, y) onto an axis
    u = (u1, u2) is obtained by taking the scalar
    product of the coordinates of the point with the
    vectorial coordinates of the axis: projection =
    x·u1 + y·u2.
  • Now, our points are the genes.
  • It is interesting to plot the eigenvalues, which
    express how the variability of the data is
    distributed in the new coordinate system: the
    relative sizes of the major and minor axes of the
    ellipse.

17
Principal Component Analysis (PCA)
  • Application to sporulation time-series data:
    observations of differential expression for
    thousands of genes across multiple conditions.
  • Usually, the first component has all positive
    coefficients, indicating a weighted average of
    all experiments.
  • The second principal component has negative
    values at early time points and positive values
    at the later time points, indicating a measure
    of change in expression.

m_ij |  t1  |  t2  |  t3  |  t4  |  t5  |  t6  |  t7  | ... |  t10
  1  | 1.04 | 1.17 | 1.08 | 1.06 | 1.14 | 1.09 | 1.07 | ... | 1.11
  2  | 1.02 | 1.01 | 1.15 | 1.15 | 1.01 | 1.06 | 1.08 | ... | 1.34
 ... |  ... |  ... |  ... |  ... |  ... |  ... |  ... | ... |  ...
  G  | 1.45 | 1.08 | 1.02 | 1.06 | 1.12 | 1.57 | 1.11 | ... | 1.06
18
Machine Learning for Micro-Array Analysis:
clustering
  • Cluster analysis:
  • Identification of new subgroups or classes of
    some biological entity (e.g. tumors)

19
Hierarchical Clustering
  • Hierarchical clustering methods differ in:
  • the distance measure selected
  • the manner in which distances are computed
    between the growing clusters and the remaining
    members of the data set
  • Single linkage. Disadvantage - produces loose
    clusters.
  • Complete linkage. Disadvantage - tends to produce
    compact clusters of very similar size.
  • Average linkage:
  • Unweighted pair-group method average (UPGMA): the
    two groups with the lowest average distance are
    joined to form a new cluster (see the sketch
    below).
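
A minimal sketch of average-linkage (UPGMA) clustering of gene expression rows with SciPy (the library choice and toy data are ours; the slides describe the method, not an implementation):

    # Agglomerative clustering of genes with average linkage (UPGMA).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(3)
    expr = rng.normal(1.0, 0.2, size=(50, 8))         # 50 genes x 8 samples

    Z = linkage(expr, method="average", metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut tree into 4 clusters
    print(labels)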

20
Hierarchical Clustering
  • Euclidean and Manhattan distances are sensitive
    to absolute expression levels; they reveal genes
    that have similar expression levels.
  • A and B have approximately the
    same expression levels.
  • The correlation coefficient with centering is
    sensitive to expression profiles; it reveals
    genes that have similar expression profiles (see
    the sketch below).
  • D and E are enhanced;
  • A and C are repressed.
  • The absolute correlation coefficient:
  • A, C, D, E may be involved in
    the same biological pathway.
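
A minimal sketch contrasting the centered correlation distance (1 - Pearson correlation, sensitive to profile shape) with the Euclidean distance (sensitive to absolute level); the toy vectors and names are ours:

    # Correlation distance vs Euclidean distance on two parallel profiles.
    import numpy as np

    def correlation_distance(a, b):
        a_c, b_c = a - a.mean(), b - b.mean()
        r = (a_c @ b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
        return 1.0 - r

    g1 = np.array([1.0, 2.0, 3.0, 4.0])
    g2 = g1 + 10.0                          # same profile, much higher level
    print(correlation_distance(g1, g2))     # ~0: similar profiles
    print(np.linalg.norm(g1 - g2))          # large: dissimilar absolute levels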

21
K-means Clustering
  • 1. Randomly assign the data to the clusters.
  • Suppose there are m genes per cluster.
  • 2. Calculate an average expression vector for
    each cluster i.
  • This corresponds to the centroid of the
    cluster.
  • 3. Calculate the mean intra-cluster distance
    between each point and the centroid, for each
    cluster.
  • 4. Move data from one class to another, with
    the aim of minimizing the overall intra-cluster
    distance measure.
  • ADVANTAGES: easy to implement (see the sketch
    below).
  • DISADVANTAGES: computationally intensive; the
    outcome is determined by such factors as the
    distance metric chosen.
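
A minimal sketch of k-means on gene expression rows, using scikit-learn as one possible implementation (the toy data and library are ours):

    # K-means clustering of genes; centroids are average expression vectors.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    expr = rng.normal(1.0, 0.2, size=(100, 8))   # 100 genes x 8 samples

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(expr)
    print(km.labels_[:10])                       # cluster assignment per gene
    print(km.cluster_centers_.shape)             # one centroid per cluster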

22
Non-parametric models
  • Models that rely heavily on the empirical
    analysis of large data sets rather than on prior
    domain knowledge.
  • Non-parametric approaches:
  • Decision trees, neural networks, genetic
    algorithms, and nearest-neighbor methods.
  • Fundamental assumption:
  • Consistently observed relationships or
    patterns in large data sets will recur in future
    observations.
  • Advantages:
  • They do not require a thorough understanding of
    the underlying system or problem.
  • They can be used to build arbitrarily complex
    models that are highly non-linear and not
    restricted by human comprehension.

23
Decision Tree
  • Strengths:
  • Clearly indicates which attributes are most
    important for prediction or classification (see
    the sketch below).
  • Weaknesses:
  • Limited ability to handle estimation or
    regression tasks, where the goal is to predict
    the value of a continuous variable.
  • Error-prone when the number of training examples
    per class is small.
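
A minimal sketch of a decision-tree classifier on toy two-class expression data; the slides cite the C5.0 implementation, and this scikit-learn CART tree merely stands in as an illustration:

    # A shallow decision tree exposes the most informative gene.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(5)
    X = rng.normal(1.0, 0.2, size=(60, 30))   # 60 samples x 30 genes
    y = np.array([0] * 30 + [1] * 30)         # e.g. tumor vs normal
    X[y == 1, 3] += 0.5                       # make gene 3 informative

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(tree.feature_importances_.argmax()) # most important gene (likely 3)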

24
Neural Networks
  • Strengths:
  • Ability to handle a range of problem tasks,
    including classification (discrete outputs) and
    estimation or regression (continuous outputs).
  • Provision of an indication (through sensitivity
    analysis) of which attributes are most important
    for prediction or classification.
  • Weaknesses:
  • The risk of premature convergence to an inferior
    solution (normally addressed by performing a
    sensible cross-validation procedure). A sketch
    follows.
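
A minimal sketch of a small feed-forward network trained by back-propagation, using scikit-learn's MLPClassifier on toy data (the library choice and data are assumptions, not from the slides):

    # A one-hidden-layer network for a two-class expression problem.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(6)
    X = rng.normal(1.0, 0.2, size=(60, 30))   # 60 samples x 30 genes
    y = np.array([0] * 30 + [1] * 30)
    X[y == 1, 3] += 0.5                       # one informative gene

    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=0).fit(X, y)
    print(mlp.score(X, y))   # training accuracy is optimistic; cross-validate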

25
Multi-Model Approaches
  • Problem with the regular (single-model) approach:
  • Instability of the prediction method:
  • sensitivity of the final model to small changes
    in the training set.
  • Unstable machine learning methods:
  • Decision trees
  • Neural networks
  • Stable methods:
  • k-nearest neighbor

Now, let us see an approach to address the
instability problem.
26
Machine Learning for Micro-Array Analysis
  • Cross-validation:
  • used to test the robustness of the classifier.
  • The choice of algorithm depends on:
  • the attributes
  • the ratio of training data (TP, TN): if TP is
    small, over-fitting occurs.
  • Combined approaches:
  • With a limited amount of training data, an
    individual classifier may not represent the true
    hypothesis.
  • A combined classifier may produce a good
    approximation to the true hypothesis. (A
    cross-validation sketch follows.)
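
A minimal sketch of k-fold cross-validation for assessing classifier robustness (scikit-learn and the toy data are our assumptions):

    # 5-fold cross-validation of a decision tree on toy expression data.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(7)
    X = rng.normal(1.0, 0.2, size=(60, 30))
    y = np.array([0] * 30 + [1] * 30)
    X[y == 1, 3] += 0.5

    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print(scores.mean(), "+/-", scores.std())    # mean accuracy across folds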

27
Multi-Model Approaches
  • Common methods for constructing multi-model
    systems:
  • Boosting, bagging, and stacking.
  • What do they do?
  • They create and combine multiple classifiers.
  • How do they differ from each other?
  • They differ in how the classifiers are trained
    and in how their outputs are combined.
  • How do they improve accuracy?
  • They improve accuracy by focusing the learning
    process on examples in the data that are harder
    to model than others.

28
Boosting
29
Boosting Algorithm
  • Step 1: Form the learning set and the validation
    set (with uniform sampling, without replacement).
  • Step 2: N different training-set replicas are
    sampled adaptively (with non-uniform sampling
    probabilities and with replacement).
  • Step 3: Build each classifier, f'i(x), based on
    its training set.
  • Step 4: Establish each classifier's performance
    by testing it against the learning set.
  • Step 5: Calculate a weight for each classifier
    based on its performance.
  • Step 6: Combine the models by means of a weighted
    voting scheme, where each individual prediction
    model carries a different weight (see the sketch
    below).
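
A minimal sketch of boosting with scikit-learn's AdaBoostClassifier, whose adaptive re-weighting and weighted voting follow the steps above (the library and toy data are our assumptions):

    # AdaBoost: adaptively re-weighted weak learners, combined by weighted vote.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(8)
    X = rng.normal(1.0, 0.2, size=(60, 30))
    y = np.array([0] * 30 + [1] * 30)
    X[y == 1, 3] += 0.5

    # The default weak learner is a depth-1 decision tree (a "stump").
    boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(boost.score(X, y))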