DNA Chips and Their Analysis - Comp. Genomics: Lecture 13 (based on many sources, primarily Zohar Yakhini)
1
DNA Chips and Their Analysis - Comp. Genomics
Lecture 13, based on many sources, primarily Zohar Yakhini
2
DNA Microarrays Basics
  • What are they?
  • Types of arrays (cDNA arrays, oligo arrays).
  • What is measured using DNA microarrays?
  • How are the measurements done?

3
DNA Microarrays Computational Questions
  • Design of arrays.
  • Techniques for analyzing experiments.
  • Detecting differential expression.
  • Similar expression: clustering.
  • Other analysis techniques (many).
  • Machine learning techniques, and applications for
    advanced diagnosis.

4
What is a DNA Microarray (I)
  • A surface (nylon, glass, or plastic).
  • Containing hundreds to thousands of pixels.
  • Each pixel has copies of a sequence of single-stranded DNA (ssDNA).
  • Each such sequence is called a probe.

5
What is a DNA Microarray (II)
  • An experiment with 500-10k elements.
  • A way to concurrently explore the function of multiple genes.
  • A snapshot of the expression levels of 500-10k genes under given test conditions.

6
Some Microarray Terminology
  • Probe: ssDNA printed on the solid substrate (nylon or glass). These are short substrings of the genes we are going to be testing.
  • Target: cDNA which has been labeled and is to be washed over the probe.

7
Back to Basics: Watson and Crick
James Watson and Francis Crick discovered the double helix structure of DNA in 1953.
From Zohar Yakhini
8
Watson-Crick Complementarity
A binds to T; C binds to G.
From Zohar Yakhini
9
Array Based Hybridization Assays (DNA Chips)
  • Array of probes
  • Thousands to millions of different probe
    sequences per array.

Unknown sequence or mixture (target). Many copies.
From Zohar Yakhini
10
Array Based Hyb Assays
  • Target hybs to WC-complementary probes only.
  • Therefore the fluorescence pattern is
    indicative of the target sequence.

From Zohar Yakhini
11
DNA Sequencing: the Sanger Method
  • Generate all A,C,G,T-terminated prefixes of the
    sequence, by a polymerase reaction with the
    corresponding terminating bases.
  • Run in four different gel lanes.
  • Reconstruct sequence from the information on the
    lengths of all A,C,G,T terminated prefixes.
  • The need for 4 different reactions is avoided by
    using differentially dye labeled terminating
    bases.

From Zohar Yakhini
12
Central Dogma of Molecular Biology (reminder)
Cells express different subsets of their genes in different tissues and under different conditions.
Gene (DNA)
From Zohar Yakhini
13
Expression Profiling on Microarrays
  • Differentially label the query sample and the control (1-3).
  • Mix and hybridize to an array.
  • Analyze the image to obtain expression level information.

From Zohar Yakhini
14
Microarrays: Two Types of Fabrication
  • cDNA arrays: deposition of DNA fragments
  • Deposition of PCR-amplified cDNA clones
  • Printing of already synthesized oligonucleotides
  • Oligo arrays: in situ synthesis
  • Photolithography
  • Ink jet printing
  • Electrochemical synthesis

From Steve Hookway's lecture and Sorin Draghici's book "Data Analysis Tools for DNA Microarrays"
15
cDNA Microarrays vs. Oligonucleotide Arrays: Probes and Cost
  • cDNA arrays: long sequences; spot unknown sequences; more variability; arrays cheaper.
  • Oligonucleotide arrays: short sequences; spot known sequences; more reliable data; arrays typically more expensive.

From Steve Hookway's lecture and Sorin Draghici's book "Data Analysis Tools for DNA Microarrays"
16
Photolithography (Affymetrix)
Photodeprotection
  • Similar to process used to generate VLSI circuits
  • Photolithographic masks are used to add each base
  • If base is present, there will be a hole in the
    corresponding mask
  • Can create high density arrays, but sequence
    length is limited

(Figure: a photolithographic mask opening allows base C to be added at the exposed pixels.)
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
17
Photolithography (Affymetrix)
From Zohar Yakhini
18
Ink Jet Printing
  • Four cartridges are loaded with the four
    nucleotides A, G, C, T.
  • As the printer head moves across the array, the
    nucleotides are deposited in pixels where they
    are needed.
  • This way (many copies of) a 20-60 base long oligo
    is deposited in each pixel.

From Steve Hookway's lecture and Sorin Draghici's book "Data Analysis Tools for DNA Microarrays"
19
Ink Jet Printing (Agilent)
The array is a stack of images in the colors A,
C, G, T.

From Zohar Yakhini
20
Inkjet Printed Microarrays
Inkjet head, squirting phosphoramidites
From Zohar Yakhini
21
Electrochemical Synthesis
  • Electrodes are embedded in the substrate to
    manage individual reaction sites
  • Electrodes are activated in necessary positions
    in a predetermined sequence that allows the
    sequences to be constructed base by base
  • Solutions containing specific bases are washed
    over the substrate while the electrodes are
    activated

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
22
Preparation of Samples
  • Use oligo(dT) on a separation column to extract mRNA from total cell populations.
  • Use oligo(dT)-initiated polymerase to reverse transcribe the RNA into fluorescently labeled cDNA. (RNA is unstable because of environmental RNA-digesting enzymes.)
  • Alternatively, use random priming for this purpose, generating a population of transcript subsequences.

From Zohar Yakhini
23
Expression Profiling on Microarrays
  • Differentially label the query sample and the control (1-3).
  • Mix and hybridize to an array.
  • Analyze the image to obtain expression level information.

From Zohar Yakhini
24
Expression Profiling: a Flash Demo
URL: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
25
Expression Profiling Probe Design Issues
  • Probe specificity and sensitivity.
  • Special designs for splice variations or other
    custom purposes.
  • Flat thermodynamics.
  • Generic and universal systems

From Zohar Yakhini
26
Hybridization Probes
  • Sensitivity: strong interaction between the probe and its intended target, under the assay's conditions. How much target is needed for the reaction to be detectable or quantifiable?
  • Specificity: no potential cross-hybridization.

From Zohar Yakhini
27
Specificity
  • Symbolic specificity
  • Statistical protection in the unknown part of the
    genome.

Methods, software and application in
collaboration with Peter Webb, Doron Lipson.
From Zohar Yakhini
28
Reading Results: Color Coding
Campbell & Heyer, 2003
  • Numeric tables are difficult to read
  • Data is presented with a color scale
  • Coding scheme:
  • Green: repressed (less mRNA) gene in experiment
  • Red: induced (more mRNA) gene in experiment
  • Black: no change (1:1 ratio)
  • Or:
  • Green: control condition (e.g. aerobic)
  • Red: experimental condition (e.g. anaerobic)
  • We usually use the ratio

29
Thermal Ink Jet Arrays, by Agilent Technologies
In-Situ synthesized oligonucleotide array. 25-60
mers.
cDNA array, Inkjet deposition
30
Application of Microarrays
  • We only know the function of about 30% of the 30,000 genes in the Human Genome
  • Gene exploration
  • Functional Genomics
  • First among many high-throughput genomic devices

http://www.gene-chips.com/sample1.html
From Steve Hookway's lecture and Sorin Draghici's book "Data Analysis Tools for DNA Microarrays"
31
A Data Mining Problem
  • On a given microarray, we test on the order of 10k elements at one time.
  • The number of microarrays used in a typical experiment is no more than 100.
  • Insufficient sampling.
  • Data is obtained faster than it can be processed.
  • High noise.
  • Algorithmic approaches to work through this large data set and make sense of the data are desired.

32
Informative Genes in a Two-Class Experiment
  • Differentially expressed in the two classes.
  • Identifying (statistically significant) informative genes:
  • - Provides biological insight
  • - Indicates promising research directions
  • - Reduces data dimensionality
  • - Suggests diagnostic assays

From Zohar Yakhini
33
Scoring Genes
Expression pattern and pathological diagnosis information (annotation) for a single gene:
(Figure: a +/- class annotation over the expression values a1, a2, ..., a15.)
Permute the annotation by sorting the expression pattern (ascending, say).
From Zohar Yakhini
34
Separation Score
  • Compute a Gaussian fit for each class: (μ1, σ1), (μ2, σ2).
  • The Separation Score is (μ1 - μ2) / (σ1 + σ2). (A code sketch follows below.)
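As a concrete illustration, a minimal Python sketch of this score (the function name and the NumPy-based Gaussian fit are illustrative choices, not from the lecture):

```python
import numpy as np

def separation_score(class1_values, class2_values):
    """Separation score (mu1 - mu2) / (sigma1 + sigma2) for one gene's
    expression values in two classes (e.g. tumor vs. normal)."""
    mu1, sigma1 = np.mean(class1_values), np.std(class1_values)
    mu2, sigma2 = np.mean(class2_values), np.std(class2_values)
    return (mu1 - mu2) / (sigma1 + sigma2)

# Toy example: a gene expressed higher in tumors than in normals.
tumors  = [5.1, 6.0, 5.7, 6.3]
normals = [2.0, 2.4, 1.8, 2.2]
print(separation_score(tumors, normals))   # large value => well-separated classes
```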

35
Threshold Error Rate (TNoM) Score
Find the threshold that best separates tumors from normals, and count the number of errors committed there.
(Example: a sorted +/- annotation pattern, split at the best-separating threshold.)
From Zohar Yakhini
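A brute-force sketch of the TNoM idea in Python: sort the samples by expression, try every threshold and both orientations, and keep the smallest error count. This is illustrative code, not the lecture's implementation:

```python
def tnom_score(expression, labels):
    """Threshold Number of Misclassifications for one gene.
    expression: expression values; labels: 0/1 class labels (e.g. normal/tumor)."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    y = [labels[i] for i in order]                 # labels sorted by expression value
    n, best = len(y), len(y)
    for t in range(n + 1):                         # threshold between positions t-1 and t
        errors_low_0 = sum(y[:t]) + sum(1 - v for v in y[t:])   # predict 0 below, 1 above
        errors_low_1 = (t - sum(y[:t])) + sum(y[t:])            # predict 1 below, 0 above
        best = min(best, errors_low_0, errors_low_1)
    return best

# A perfectly separating gene scores 0; a useless gene scores about half the samples.
print(tnom_score([1.2, 1.5, 2.0, 5.1, 5.5, 6.0], [0, 0, 0, 1, 1, 1]))   # -> 0
```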
36
p-Values
  • Relevance scores are more useful when we can
    compute their significance
  • p-value: the probability of finding a gene with a
    given score if the labeling is random
  • p-Values allow for higher level statistical
    assessment of data quality.
  • p-Values provide a uniform platform for comparing
    relevance, across data sets.
  • p-Values enable class discovery

From Zohar Yakhini
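One standard way to attach such a p-value is by permuting the class labels and asking how often a random labeling scores at least as well; a sketch, reusing the tnom_score function from the previous slide (lower TNoM = better):

```python
import random

def permutation_pvalue(expression, labels, score_fn, n_perm=1000, seed=0):
    """Estimate P(score at least as good under a random labeling).
    For TNoM, 'at least as good' means a score <= the observed one."""
    rng = random.Random(seed)
    observed = score_fn(expression, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_fn(expression, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one correction avoids a p-value of exactly 0
```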
37
BRCA1 Differential Expression
(Figure: heatmap of genes over-expressed in BRCA1 wildtype samples vs. genes over-expressed in BRCA1 mutants; the sporadic sample s14321 shows a BRCA1-mutant expression profile.)
Collaboration with NIH; NEJM 2001.
From Zohar Yakhini
38
Data Analysis: Leave One Out Cross Validation (LOOCV)
  • Repeat, for each tissue (tumor/normal)
  • Hide the label of the test tissue
  • Diagnose the test tissue based on the remaining
    data
  • Compare the diagnosis to the hidden label

From Zohar Yakhini
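A schematic of this LOOCV loop in Python, with a simple nearest-centroid classifier standing in for whatever diagnosis rule is actually used (illustrative only):

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out cross-validation with a nearest-centroid classifier.
    X: (n_tissues, n_genes) expression matrix; y: 0/1 class labels (normal/tumor)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    correct = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i                       # hide the label of tissue i
        centroids = [X[train & (y == c)].mean(axis=0) for c in (0, 1)]
        pred = int(np.argmin([np.linalg.norm(X[i] - c) for c in centroids]))
        correct += int(pred == y[i])                         # compare to the hidden label
    return correct / len(y)
```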
39
BRCA1 LOOCV Results
From Zohar Yakhini
40
Lung Cancer Informative Genes
Data from Naftali Kaminski's lab, at Sheba.
  • 24 tumors (various types and origins)
  • 10 normals (normal edges and normal lung pools)

From Zohar Yakhini
41
And Now: Global Analysis of Gene Expression Data
  • First (but not least): clustering,
  • either of genes or of experiments.

42
Example data: fold change (ratios)
What is the pattern?
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell & Heyer, 2003
43
Example data 2
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
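The second table is simply the first one on a log2 scale (for Gene C: log2(8) = 3, log2(12) ≈ 3.58, log2(16) = 4), which makes induction and repression symmetric around 0. The conversion in code:

```python
import numpy as np

fold_changes = [1, 8, 12, 16, 12, 8]     # Gene C, ratios relative to the 0-hour sample
log2_ratios = np.log2(fold_changes)
print(np.round(log2_ratios, 2))          # [0.   3.   3.58 4.   3.58 3.  ]
```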
44
Pearson Correlation Coefficient, r: values in the [-1, 1] interval
  • Gene expression over d experiments is a vector in R^d, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
  • Given two vectors X and Y that contain N elements, we calculate r as follows:
    r = (ΣXY - ΣX·ΣY/N) / sqrt( (ΣX² - (ΣX)²/N) · (ΣY² - (ΣY)²/N) )

Cho & Won, 2003
45
Example: Pearson Correlation Coefficient, r
  • X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3); Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
  • ΣXY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564
  • ΣX = 3 + 3.58 + 4 + 3.58 + 3 = 17.16
  • ΣX² = 3² + 3.58² + 4² + 3.58² + 3² = 59.6328
  • ΣY = 1.58 + 2 + 2 + 1.58 + 1 = 8.16
  • ΣY² = 1.58² + 2² + 2² + 1.58² + 1² = 13.9928
  • N = 6
  • ΣXY - ΣX·ΣY/N = 28.5564 - (17.16)(8.16)/6 = 5.2188
  • ΣX² - (ΣX)²/N = 59.6328 - (17.16)²/6 = 10.5552
  • ΣY² - (ΣY)²/N = 13.9928 - (8.16)²/6 = 2.8952
  • r = 5.2188 / sqrt((10.5552)(2.8952)) ≈ 0.944 (checked in the code sketch below)

46
Example data: Pearson correlation coefficients
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Campbell & Heyer, 2003
47
Example: Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell & Heyer, 2003
48
Spearman Rank Order Coefficient
  • Replace each entry xi by its rank in vector x.
  • Then compute the Pearson correlation coefficient of the rank vectors (see the sketch below).
  • Example: X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01); Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
  • Ranks(X) = (1, 2, 4, 6, 5, 3)
  • Ranks(Y) = (1, 3, 5, 6, 4, 2)
  • Ties should be taken care of: (1) they are rare; (2) randomize (small effect).
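A small sketch of the rank-then-Pearson recipe; SciPy's spearmanr does the same thing directly (and resolves ties with average ranks):

```python
from scipy.stats import rankdata, pearsonr, spearmanr

x = [0, 3.00, 3.41, 4, 3.58, 3.01]   # Gene C (values from this slide)
y = [0, 1.51, 2.00, 2.32, 1.58, 1]   # Gene D

rx, ry = rankdata(x), rankdata(y)    # ranks: [1, 2, 4, 6, 5, 3] and [1, 3, 5, 6, 4, 2]
print(pearsonr(rx, ry)[0])           # Pearson on the ranks = Spearman's rho
print(spearmanr(x, y)[0])            # the same value, computed directly
```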

49
Grouping and Reduction
  • Grouping: partition items into groups. Items in the same group should be similar; items in different groups should be dissimilar.
  • Grouping may help discover patterns in the data.
  • Reduction: reduce the complexity of the data by removing redundant probes (genes).

50
Unsupervised Grouping: Clustering
  • Pattern discovery via clustering: similarly expressed genes are grouped together.
  • Techniques most often used:
  • k-means clustering
  • Hierarchical clustering
  • Biclustering
  • Alternative methods: Self Organizing Maps (SOMs), plaid models, singular value decomposition (SVD), order-preserving submatrices (OPSM), ...

51
Clustering Overview
  • Different similarity measures in use
  • Pearson Correlation Coefficient
  • Cosine Coefficient
  • Euclidean Distance
  • Information Gain
  • Mutual Information
  • Signal to noise ratio
  • Simple matching coefficient (for nominal attributes)

52
Clustering Overview (cont.)
  • Different Clustering Methods
  • Unsupervised
  • k-means clustering
  • Hierarchical Clustering
  • Self-organizing map
  • Supervised
  • Support vector machine
  • Ensemble classifier
  • Data Mining

53
Clustering Limitations
  • Any data can be clustered, therefore we must be
    careful what conclusions we draw from our results
  • Clustering is often randomized, and can and will produce different results for different runs on the same data

54
K-means Clustering
  • Given a set of m data points in n-dimensional space and an integer k,
  • we want to find the set of k centers in n-dimensional space that minimizes the mean squared Euclidean distance from each data point to its nearest center.
  • No exact polynomial-time algorithms are known for this problem (no wonder: it is NP-hard).

"A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.
55
K-means Heuristic (Lloyd's Algorithm)
  • Has been shown to converge to a locally optimal solution.
  • But it can converge to a solution that is arbitrarily bad compared to the optimal solution.

(Figure: data points with the optimal centers vs. the heuristic's centers, k = 3.)
  • "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality" by Selim and Ismail
  • "A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.

56
Euclidean Distance
To find the distance between two points, say the origin and the point (3,4): d = sqrt((3-0)² + (4-0)²) = 5.
Simple and Fast! Remember this when we consider
the complexity!
57
Finding a Centroid
  • We use the following rule to find the n-dimensional centroid point (center of mass) amid k (n-dimensional) points: each coordinate of the centroid is the average of that coordinate over the k points.

Example: let's find the centroid of three 2D points, (2,4), (5,2) and (8,9): ((2+5+8)/3, (4+2+9)/3) = (5, 5).
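Both quantities from the last two slides, computed with NumPy on the slides' own numbers:

```python
import numpy as np

# Euclidean distance between the origin and (3, 4): sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(np.array([3, 4]) - np.array([0, 0])))   # 5.0

# Centroid (center of mass) of the three 2-D points (2,4), (5,2), (8,9)
points = np.array([[2, 4], [5, 2], [8, 9]])
print(points.mean(axis=0))                                    # [5. 5.]
```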
58
K-means Iterative Heuristic
  1. Choose k initial center points randomly.
  2. Cluster the data using Euclidean distance (or another distance metric).
  3. Calculate new center points for each cluster, using only the points within that cluster.
  4. Re-cluster all data using the new center points (this step could cause some data points to be placed in a different cluster).
  5. Repeat steps 3-4 until no data points are moved from one cluster to another (stabilization), or until some other convergence criterion is met.

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
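A compact Python sketch of the iterative heuristic just listed (random initial centers, assign, recompute, repeat until no point moves); in practice one would normally use a library implementation such as scikit-learn's KMeans:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's heuristic. Returns (centers, labels) for X of shape (n_points, n_dims)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # 1. random initial centers
    labels = None
    for _ in range(max_iter):
        # 2./4. assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                   # stabilization: no point moved
        labels = new_labels
        # 3. recompute each center as the mean of the points in its cluster
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels
```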
59
An example with 2 clusters
  1. We pick 2 centers at random
  2. We cluster our data around these center points

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
60
K-means example with k=2
  3. We recalculate centers based on our current clusters

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
61
K-means example with k=2
  4. We re-cluster our data around our new center points

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
62
K-means example with k=2
  5. We repeat the last two steps until no more data points are moved into a different cluster
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
63
Choosing k
  • Run the algorithm on the data with several different values of k.
  • Use advance knowledge about the characteristics of your test (e.g. cancerous vs. non-cancerous tissues, in case the experiments are being clustered).

64
Cluster Quality
  • Since any data can be clustered, how do we know our clusters are meaningful?
  • The size (diameter) of the cluster vs. the inter-cluster distance.
  • Distance between the members of a cluster and the cluster's center.
  • Diameter of the smallest sphere containing the cluster.

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
65
Cluster Quality Continued
(Figure: two clusters of diameter 5; in one case the distance to the nearest cluster is 5, in the other it is 20.)
The quality of a cluster is assessed by the ratio of the distance to the nearest cluster to the cluster diameter.
Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
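The ratio in the figure is easy to compute once "distance to the nearest cluster" is made concrete; the sketch below uses the distance between cluster centers, which is one reasonable choice (illustrative code, not from the book):

```python
import numpy as np

def diameter(cluster):
    """Largest pairwise Euclidean distance within a cluster (array of points)."""
    pts = np.asarray(cluster, dtype=float)
    return max(np.linalg.norm(p - q) for p in pts for q in pts)

def cluster_quality(cluster, nearest_cluster):
    """Ratio of the distance to the nearest cluster (here: center-to-center)
    to this cluster's diameter; larger values indicate a better-separated cluster."""
    d = np.linalg.norm(np.mean(cluster, axis=0) - np.mean(nearest_cluster, axis=0))
    return d / diameter(cluster)
```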
66
Cluster Quality Continued
Quality can be assessed simply by looking at the diameter of a cluster (alone?).
A cluster can be formed by the heuristic even
when there is no similarity between clustered
patterns. This occurs because the algorithm
forces k clusters to be created.
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
67
Characteristics of k-means Clustering
  • The random selection of initial center points
    creates the following properties
  • Non-Determinism
  • May produce clusters without patterns
  • One solution is to choose the centers randomly
    from existing patterns

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
68
Heuristic's Complexity
  • Linear in the number of data points, N.
  • Can be shown to have run time cN, where c does not depend on N, but rather on the number of clusters, k
  • (not sure about the dependence on the dimension, n?)
  • ⇒ the heuristic is efficient

From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
69
Hierarchical Clustering
  • a different clustering paradigm

Figure Reproduced From Data Analysis Tools for
DNA Microarrays by Sorin Draghici
70
Hierarchical Clustering (cont.)
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0 -0.75 0.48 -0.68 -0.41
Gene J 0 0.22 -0.24 -0.36
Gene K 0.11 0.07 -0.23
Gene L -0.94 -0.95
Gene M 0.94
Gene N
Campbell & Heyer, 2003
71
Hierarchical Clustering (cont.)
Gene C Gene D Gene E Gene F Gene G
Gene C 0.94 0.96 -0.40 0.95
Gene D 0.84 -0.10 0.94
Gene E -0.57 0.89
Gene F -0.35
Gene G
1 Gene D Gene F Gene G
1 0.89 -0.485 0.92
Gene D -0.10 0.94
Gene F -0.35
Gene G
  • Average similarity of cluster 1 (= C, E) to:
  • Gene D: (0.94 + 0.84)/2 = 0.89
  • Gene F: (-0.40 + (-0.57))/2 = -0.485
  • Gene G: (0.95 + 0.89)/2 = 0.92

(Figure: C and E, the most similar pair, are merged into cluster 1.)
72
Hierarchical Clustering (cont.)
1 Gene D Gene F Gene G
1 0.89 -0.485 0.92
Gene D -0.10 0.94
Gene F -0.35
Gene G
(Figure: D and G, the most similar remaining pair (0.94), are merged into cluster 2.)
73
Hierarchical Clustering (cont.)
1 2 Gene F
1 0.905 -0.485
2 -0.225
Gene F
(Figure: clusters 1 and 2 (average similarity 0.905) are merged into cluster 3.)
74
Hierarchical Clustering (cont.)
3 Gene F
3 -0.355
Gene F

(Figure: cluster 3 and gene F are merged into the final cluster 4, completing the tree.)
75
Hierarchical Clustering (cont.)
Does the algorithm look familiar?
Remember Neighbor-Joining!

(Figure: the final dendrogram over genes C, E, G, D and F.)
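The merge procedure worked through on the last few slides (join the most similar pair, then average the two previous similarity rows) corresponds to weighted average linkage; SciPy can reproduce the same dendrogram from the log2 expression vectors, using correlation distance = 1 - r so that averaging distances mirrors averaging similarities:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

names = ["C", "D", "E", "F", "G"]
data = np.array([
    [0, 3.00, 3.58,  4.00,  3.58,  3.00],   # Gene C (log2 values from the earlier table)
    [0, 1.58, 2.00,  2.00,  1.58,  1.00],   # Gene D
    [0, 2.00, 3.00,  3.00,  3.00,  3.00],   # Gene E
    [0, 0.00, 0.00, -2.00, -2.00, -3.32],   # Gene F
    [0, 1.00, 1.58,  2.00,  1.58,  1.00],   # Gene G
])

Z = linkage(data, method="weighted", metric="correlation")
dendrogram(Z, labels=names)    # merges C+E first, then D+G, then the two pairs; F joins last
plt.show()
```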
76
Clustering of entire yeast genome
Campbell & Heyer, 2003
77
Hierarchical Clustering: Yeast Gene Expression Data
Eisen et al., 1998
78
A SOFM Example With Yeast
"Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation" by Tamayo et al.
79
SOM Description
  • Each unit of the SOM has a weighted connection to
    all inputs
  • As the algorithm progresses, neighboring units
    are grouped by similarity

Output Layer
Input Layer
From Data Analysis Tools for DNA Microarrays by
Sorin Draghici
80
An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/matt/courses/soms/
81
Cluster Analysis of Microarray Expression Data
Matrices
  • Application of cluster analysis techniques in the elucidation of gene expression data

82
Function of Genes
The features of a living organism are governed principally by its genes. If we want to fully understand living systems, we must know the function of each gene. Once we know a gene's sequence, we can design experiments to find its function.
The Classical Approach of Assigning a Function to a Gene
(Figure: an organism before and after deletion of Gene X.)
Conclusion: Gene X is the "left eye" gene.
However, this approach is too slow to handle all the gene sequence information we have today (HGSP).
83
Microarray Analysis
Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions. Experiments are carried out on a
Physical Matrix like the one below
To facilitate computational analysis the physical
matrix which may contain 1000s of genes is
converted into a numerical matrix using image
analysis equipment.
Possible inference: if Gene X's activity (expression) is affected by Condition Y (extreme heat), then Gene X may be involved in protecting the cellular components from extreme heat. Each gene has its corresponding expression profile for a set of conditions. This expression profile may be thought of as a feature profile for that gene over that set of conditions (a condition feature profile).
84
Cluster Analysis
  • Cluster Analysis is an unsupervised procedure
    which involves grouping of objects based on their
    similarity in feature space.
  • In the Gene Expression context Genes are grouped
    based on the similarity of their Condition
    feature profile.
  • Cluster analysis was first applied to gene expression data from brewer's yeast (Saccharomyces cerevisiae) by Eisen et al. (1998).

Clustering
  • Two general conclusions can be drawn from these
    clusters
  • Genes clustered together may be related within a
    biological module/system.
  • If there are genes of known function within a cluster, these may help to classify this biological module/system.

85
From Data to Biological Hypothesis
(Figure: a gene expression microarray over conditions A-Z and genes 1-7 is clustered; Cluster C, with four genes, may represent a biological System C, e.g. a pathway from an external stimulus (Condition X) at the cell membrane through a regulator protein. Relating these genes aids in the elucidation of System C.)
86
Some Drawbacks of Clustering Biological Data
  1. Clustering works well over small numbers of
    conditions but a typical Microarray may have
    hundreds of experimental conditions. A global
    clustering may not offer sufficient resolution
    with so many features.
  2. As with other clustering applications, it may be
    difficult to cluster noisy expression data.
  3. Biological systems tend to be inter-related and may share numerous factors (genes); clustering enforces partitions which may not accurately represent these intimacies.
  4. Clustering Genes over all Conditions only finds
    the strongest signals in the dataset as a whole.
    More local signals within the data matrix may
    be missed.

87
How do we better model more complex systems?
  • One technique that allows detection of all
    signals in the data is biclustering.
  • Instead of clustering genes over all conditions
    biclustering clusters genes with respect to
    subsets of conditions.

This enables a better representation of such inter-related systems.
88
Biclustering
(Figure: an expression matrix over conditions A-H and genes 1-9.)
Clustering misses the local signal on genes (1,4,6,7,9) over the condition subset (B,E,F), present over only a subset of conditions.
Biclustering discovers local coherences over a subset of conditions.
  • Technique first described by J.A. Hartigan in
    1972 and termed Direct Clustering.
  • First introduced to microarray expression data by Cheng and Church (2000)

89
Approaches to Biclustering Microarray Gene
Expression
  • First applied to gene expression data by Cheng and Church (2000).
  • Used a sub-matrix scoring technique to locate biclusters.
  • Tanay et al. (2002)
  • Modelled the expression data on bipartite graphs and used graph techniques to find complete sub-graphs, or biclusters.
  • Lazzeroni and Owen
  • Used matrix reordering to represent different layers of signals (biclusters): Plaid Models represent multiple signals within the data.
  • Ben-Dor et al. (2002)
  • Biclusters depending on order relations (OPSM).

90
Bipartite Graph Modelling
  • First proposed in "Discovering statistically significant biclusters in gene expression data", Tanay et al., Bioinformatics 2002.

Within the graph modelling paradigm, biclusters are equivalent to complete bipartite sub-graphs. Tanay and colleagues used probabilistic models to determine the least probable sub-graphs (those showing the most order, and consequently the most surprising) and thereby identify biclusters.
91
The Cheng and Church Approach
The core element in this approach is the development of a scoring to prioritise sub-matrices. This scoring is based on the concept of the residue of an entry in a matrix. In the matrix (I, J), the residue score of element a_ij is given by
R(a_ij) = a_ij - a_iJ - a_Ij + a_IJ
where a_iJ is the mean of row i, a_Ij is the mean of column j, and a_IJ is the mean of the whole matrix.
92
The Cheng and Church Approach (2)
The mean squared residue score (H) for a matrix (I, J) is then calculated as
H(I, J) = (1 / (|I| |J|)) Σ over i in I, j in J of R(a_ij)²
This global H score gives an indication of how the data fits together within that matrix: whether it has some coherence or is random.
A high H value signifies that the data is uncorrelated. A matrix of equally spread random values over the range [a, b] has an expected H score of (b - a)²/12; e.g. for the range [0, 800], H(I, J) = 53,333.
A low H score means that there is correlation in the matrix. A score of H(I, J) = 0 would mean that the data in the matrix fluctuates in unison, i.e. the sub-matrix is a bicluster.
93
Worked example of the H score
Matrix (M): overall average 6.5; row averages 2, 5, 8, 11; column averages 5.4, 6.4, 7.4.
R(1) = 1 - 2 - 5.4 + 6.5 = 0.1;  R(2) = 2 - 2 - 6.4 + 6.5 = 0.1;  ...  R(12) = 12 - 11 - 7.4 + 6.5 = 0.1
H(M) = (0.01 x 12)/12 = 0.01
If the 5 were replaced with a 3, the score would change to H(M2) = 2.06. If the matrix were reshuffled randomly, the score would be around H(M3) ≈ (12 - 1)²/12 = 10.08.
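The residue and mean-squared-residue definitions translate directly into NumPy. The matrix below is a stand-in with a perfectly additive structure (so H = 0), not necessarily the exact matrix on the slide, which is why the numbers differ slightly:

```python
import numpy as np

def mean_squared_residue(A):
    """Cheng & Church H score of a (sub-)matrix A."""
    row_means = A.mean(axis=1, keepdims=True)
    col_means = A.mean(axis=0, keepdims=True)
    residue = A - row_means - col_means + A.mean()
    return float((residue ** 2).mean())

# A perfectly additive 4x3 matrix: every residue is 0, so H = 0 (an ideal bicluster).
M = np.arange(1, 13, dtype=float).reshape(4, 3)
print(mean_squared_residue(M))            # 0.0

# Disturbing a single entry breaks the coherence and raises H.
M2 = M.copy()
M2[1, 1] = 3.0                            # e.g. replace the 5 with a 3
print(round(mean_squared_residue(M2), 2)) # about 0.17
```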
94
The Cheng and Church Approach Node Deletion
Biclustering Algorithm
In order to find all possible biclusters in an
Expression Matrix all sub-matrices must be tested
using the H score.
Node Deletion
In a node deletion algorithm, all columns and rows are tested for deletion. If removing a row or column decreases the H score of the matrix, then it is removed.
This continues until it is not possible to
decrease the H score further. This low H score
coherent sub-matrix (bicluster) is then returned.
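A greedy Python sketch of the single-node-deletion loop described above, reusing the mean_squared_residue function from the previous slide's sketch (Cheng and Church's actual algorithm adds refinements such as multiple-node deletion and a target threshold delta):

```python
import numpy as np

def node_deletion(A):
    """Repeatedly remove the row or column whose removal most decreases the H score,
    until no single removal decreases it further. Returns (row_indices, col_indices)
    of the resulting low-H sub-matrix (bicluster)."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    current = mean_squared_residue(A[np.ix_(rows, cols)])
    while len(rows) > 2 and len(cols) > 2:
        candidates = []
        for r in rows:                                     # try deleting each row ...
            rr = [i for i in rows if i != r]
            candidates.append((mean_squared_residue(A[np.ix_(rr, cols)]), rr, cols))
        for c in cols:                                     # ... and each column
            cc = [j for j in cols if j != c]
            candidates.append((mean_squared_residue(A[np.ix_(rows, cc)]), rows, cc))
        best_h, best_rows, best_cols = min(candidates, key=lambda t: t[0])
        if best_h >= current:
            break                                          # no deletion decreases H: stop
        rows, cols, current = best_rows, best_cols, best_h
    return rows, cols
```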
95
The Cheng and Church Approach
Some results on lymphoma data (4026 genes x 96 conditions)
Biclusters found (no. of genes, no. of conditions):
(4, 96)    (10, 29)   (11, 25)
(103, 25)  (127, 13)  (13, 21)
(10, 57)   (2, 96)    (25, 12)
(9, 51)    (3, 96)    (2, 96)
96
  • Conclusions
  • High throughput Functional Genomics (Microarrays)
    requires Data Mining Applications.
  • Biclustering resolves Expression Data more
    effectively than single dimensional Cluster
    Analysis.
  • The Cheng and Church approach offers a good base for future work.
  • Future Research/Questions
  • Implement a simple H score program to facilitate study of the H score concept.
  • Are there other alternative scorings which would better apply to gene expression data?
  • Do unbiclustered genes have any significance? Horizontally transferred genes?
  • Implement full scale biclustering program and
    look at better adaptation to expression data sets
    and the biological context.

97
References
  • "Basic microarray analysis: grouping and feature reduction" by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman, Trends in Biotechnology, Vol. 19, No. 5, May 2001
  • "Self Organizing Maps", Tom Germano, http://davis.wpi.edu/matt/courses/soms
  • "Data Analysis Tools for DNA Microarrays" by Sorin Draghici, Chapman & Hall/CRC, 2003
  • "Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark" by A. Ultsch, C. Vetter, FG Neuroinformatik & Künstliche Intelligenz, Research Report 0994

98
References
  • "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation" by Tamayo et al.
  • "A Local Search Approximation Algorithm for k-Means Clustering" by Kanungo et al.
  • "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality" by Selim and Ismail