Transcript and Presenter's Notes

Title: Pictorial Demonstration


1
Pictorial Demonstration
Rescale features to minimize the LOO bound R^2/M^2
2
 
SVM Functional
To the SVM classifier we add extra scaling
parameters σ for feature selection.
The parameters α, b are computed by
maximizing the following functional, which is
equivalent to maximizing the margin:
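The functional itself was an equation image that did not survive the transcript. In the feature-selection setting the slides describe (cf. Weston et al., Feature Selection for SVMs), the scaled kernel and the margin functional are standardly written as:

\[
K_\sigma(x, z) = K(\sigma * x, \, \sigma * z)
\]
\[
W^2(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, K_\sigma(x_i, x_j),
\quad \text{subject to } \sum_i \alpha_i y_i = 0, \; \alpha_i \ge 0,
\]

where \(\sigma * x\) denotes elementwise scaling of the features.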
3
Radius Margin Bound
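The equation on this slide was lost in the transcript. The radius-margin bound named in the title is standardly stated (Vapnik) as:

\[
\text{LOO error} \le \frac{1}{\ell} \, \frac{R^2}{M^2} = \frac{1}{\ell} \, R^2 \, \lVert w \rVert^2,
\]

where \(\ell\) is the number of training points, R is the radius of the smallest sphere enclosing the training data in feature space, and \(M = 1 / \lVert w \rVert\) is the margin.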
4
Jaakkola-Haussler Bound
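This slide's equation was also lost. One standard statement of the Jaakkola-Haussler bound (as given, e.g., in Chapelle et al., Choosing Multiple Parameters for SVMs) is:

\[
\text{LOO error} \le \frac{1}{\ell} \sum_{p=1}^{\ell} \Theta\big( \alpha_p K(x_p, x_p) - y_p f(x_p) \big),
\]

where \(\Theta\) is the step function and f is the decision function trained on the full data set.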
5
Span Bound
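Likewise lost; the span bound of Vapnik and Chapelle is standardly written as:

\[
\text{LOO error} \le \frac{1}{\ell} \sum_{p=1}^{\ell} \Theta\big( \alpha_p S_p^2 - 1 \big),
\]

where \(S_p\) is the span of support vector \(x_p\).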
6
The Algorithm
7
Computing Gradients
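The derivation on this slide did not survive. For a kernel parameter \(\theta\) (such as a scaling factor \(\sigma_k\)), the gradients of the two factors of \(R^2 \lVert w \rVert^2\) follow from the envelope theorem and are standardly given (cf. Chapelle et al.) as:

\[
\frac{\partial \lVert w \rVert^2}{\partial \theta}
= - \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, \frac{\partial K_\theta(x_i, x_j)}{\partial \theta},
\qquad
\frac{\partial R^2}{\partial \theta}
= \sum_i \beta_i \, \frac{\partial K_\theta(x_i, x_i)}{\partial \theta}
- \sum_{i,j} \beta_i \beta_j \, \frac{\partial K_\theta(x_i, x_j)}{\partial \theta},
\]

where \(\alpha\) solves the SVM dual and \(\beta\) solves the quadratic program for the smallest enclosing sphere.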
8
Toy Data
Linear problem with 6 relevant dimensions out of 202
Nonlinear problem with 2 relevant dimensions out of 52
9
Face Detection
On the CMU test set consisting of 479 faces and
57,000,000 non-faces we compare ROC curves
obtained for different numbers of selected
features. We see that using more than 60 features
does not help.
10
 
Molecular Classification of Cancer
 
   
     
11
Morphology Classification
12
Outcome Classification
13
Outcome Classification
Error rates ignore temporal information such as
when a patient dies. Survival analysis takes
temporal information into account. The
Kaplan-Meier survival plots and statistics for
the above predictions show significance.
Lymphoma
Medulloblastoma
14
Part 4
Clustering Algorithms: Hierarchical Clustering
15
Hierarchical clustering
16
Hierarchical clustering (continued)
To transform the genes × experiments matrix into a
genes × genes matrix, use a gene similarity
metric (Eisen et al. 1998 PNAS 95:14863-14868).
The metric is exactly the same as Pearson's
correlation except that the underlined offset term
replaces the mean: G_i equals the (log-transformed)
primary data for gene G in condition i, for any two
genes X and Y observed over a series of N conditions,
and G_offset is set to 0, corresponding to a
fluorescence ratio of 1.0.
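The metric itself was shown as an image; Eisen et al.'s similarity score is standardly written as:

\[
s(X, Y) = \frac{1}{N} \sum_{i=1}^{N}
\left( \frac{X_i - X_{\text{offset}}}{\Phi_X} \right)
\left( \frac{Y_i - Y_{\text{offset}}}{\Phi_Y} \right),
\qquad
\Phi_G = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big( G_i - G_{\text{offset}} \big)^2 },
\]

and setting G_offset to the mean of G recovers Pearson's correlation exactly.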
17
Hierarchical clustering (continued)
Pearson's correlation example:
What if genome expression is clustered based on
negative correlation?
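As a concrete illustration (toy values, not from any paper), two genes with mirror-image profiles have a correlation near -1, so clustering on the absolute value of the correlation would group them together:

    import numpy as np

    # Toy log-ratio profiles over 5 conditions (illustrative values only)
    gene_x = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
    gene_y = np.array([-1.0, -0.6, 0.1, 0.4, 1.1])  # roughly mirrors gene_x

    r = np.corrcoef(gene_x, gene_y)[0, 1]
    print(f"Pearson r = {r:.2f}")  # close to -1: strong negative correlation

    # Clustering on |r| (or on 1 - |r| as a distance) would place such
    # anti-correlated genes in the same cluster; clustering on r would not.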
18
Hierarchical clustering (continued)
19
Part 5
Clustering Algorithms: k-means Clustering
20
K-means clustering
This method differs from hierarchical clustering
in several ways. In particular:
- There is no hierarchy; the data are partitioned.
You are presented only with the final cluster
membership for each case.
- There is no role for the dendrogram in k-means
clustering.
- You must supply the number of clusters (k) into
which the data are to be grouped.
21
K-means clustering (continued)
Step 1: Transform the n (genes) × m (experiments)
matrix into an n (genes) × n (genes) distance matrix
Step 2: Cluster genes based on a k-means
clustering algorithm
22
K-means clustering (continued)
To transform the n × m matrix into an n × n matrix,
use a similarity (distance) metric.
(Tavazoie et al. Nature Genetics. 1999
Jul;22(3):281-5)
Euclidean distance, where X and Y are any two genes
observed over a series of M conditions.
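The distance was shown as an equation image; the Euclidean distance over M conditions is:

\[
d(X, Y) = \sqrt{ \sum_{i=1}^{M} (X_i - Y_i)^2 }.
\]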
23
K-means clustering (continued)
24
K-means clustering algorithm
Step 1: Suppose the gene expression patterns are
positioned in a two-dimensional space based on the
distance matrix.
Step 2: The first cluster center (red) is chosen
randomly, and subsequent centers are chosen
by finding the data point farthest from the
centers already chosen. In this example, k = 3.
25
K-means clustering algorithm (continued)
Step 3: Each point is assigned to the
cluster associated with the closest
representative center.
Step 4: Minimize the within-cluster sum of
squared distances from the cluster mean by
moving the centroid (star points), that is,
computing a new cluster representative.
26
K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new
representatives.
Run steps 3, 4, and 5 until no further changes
occur; a sketch of the full loop follows below.
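A minimal sketch of steps 1-5 in Python (illustrative code, not the presenters'; it assumes the points are already embedded in a low-dimensional space, as in the illustration):

    import numpy as np

    def kmeans(points, k, seed=0):
        # Step 2: first center chosen randomly; each subsequent center is
        # the data point farthest from the centers already chosen.
        rng = np.random.default_rng(seed)
        centers = [points[rng.integers(len(points))]]
        for _ in range(k - 1):
            dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                           axis=0)
            centers.append(points[np.argmax(dists)])
        centers = np.array(centers)

        labels = None
        while True:
            # Step 3: assign each point to the closest representative center.
            d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            new_labels = d.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                return centers, labels  # Step 5: stop when nothing changes
            labels = new_labels
            # Step 4: move each centroid to the mean of its cluster
            # (empty clusters are not handled in this sketch).
            centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])

    # Example: three noisy groups in two dimensions, k = 3 as in the figure.
    rng = np.random.default_rng(1)
    pts = np.vstack([rng.normal(loc=off, size=(20, 2))
                     for off in ([0, 0], [5, 5], [0, 5])])
    centers, labels = kmeans(pts, k=3)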
27
Part 6
Clustering Algorithms: Principal Component Analysis
28
Principal component analysis (PCA)
PCA is a variable reduction procedure. It is
useful when you have obtained data on a large
number of variables and believe that there is
some redundancy among those variables.
29
PCA (continued)
30
PCA (continued)
31
PCA (continued)
- Items 1-4 are collapsed into a single new
variable that reflects the employees'
satisfaction with supervision, and items 5-7 are
collapsed into a single new variable that
reflects satisfaction with pay.
- General form of the formula to compute scores
on the first component:
C1 = b11(X1) + b12(X2) + ... + b1p(Xp)
where
C1 = the subject's score on principal component 1
b1p = the regression coefficient (or weight) for
observed variable p, as used in creating principal
component 1
Xp = the subject's score on observed variable p.
32
PCA (continued)
For example, you could determine each subject's
score on principal component 1 (satisfaction with
supervision) and principal component 2
(satisfaction with pay) by
C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)
These weights can be calculated using a special
type of equation called an eigenequation.
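A minimal numpy sketch of that eigenequation route (random illustrative data; the weights above come from the slide, not from this code):

    import numpy as np

    # Illustrative data: 100 subjects x 7 survey items (X1..X7).
    X = np.random.default_rng(0).standard_normal((100, 7))

    # Standardize each variable and form the correlation matrix R.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = (Z.T @ Z) / len(Z)

    # Solve the eigenequation R b = lambda b (eigh returns ascending order).
    eigvals, eigvecs = np.linalg.eigh(R)
    b = eigvecs[:, np.argsort(eigvals)[::-1]]  # weights, largest component first

    # Subject scores on principal component 1: C1 = b11*Z1 + ... + b1p*Zp
    C1 = Z @ b[:, 0]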
33
PCA (continued)
(Alter et al., PNAS, 2000, 97(18):10101-10106)
34
PCA (continued)
35
Part 7
Clustering Algorithms: Self-Organizing Maps
36
Clustering
  • Goals
  • Find natural classes in the data
  • Identify new classes / gene correlations
  • Refine existing taxonomies
  • Support biological analysis / discovery
  • Different Methods
  • Hierarchical clustering, SOMs, etc.

37
Self-organizing maps (SOM)
- A data visualization technique invented by
Professor Teuvo Kohonen that reduces the
dimensions of data through the use of
self-organizing neural networks.
- A method for producing ordered low-dimensional
representations of an input data space.
- Typically such input data is complex and
high-dimensional, with data elements related to
each other in a nonlinear fashion.
38
SOM (continued)
39
SOM (continued)
- The cerebral cortex of the brain is arranged as a
two-dimensional plane of neurons, and spatial
mappings are used to model complex data
structures.
- Topological relationships in external stimuli
are preserved, and complex multi-dimensional data
can be represented in a lower (usually two-)
dimensional space.
40
SOM (continued)
(Tamayo et al., 1999 PNAS 96:2907-2912)
- One chooses a geometry of "nodes", for example a
3 × 2 grid.
- The nodes are mapped into k-dimensional space,
initially at random, and then iteratively adjusted.
- Each iteration involves randomly selecting a data
point P and moving the nodes in the direction of P.
41
SOM (continued)
- The closest node N_P is moved the most, whereas
other nodes are moved by smaller amounts
depending on their distance from N_P in the
initial geometry.
- In this fashion, neighboring points in the
initial geometry tend to be mapped to nearby
points in k-dimensional space. The process
continues for 20,000-50,000 iterations.
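A minimal sketch of that update rule in Python (grid size, learning rate, and neighborhood schedule are illustrative assumptions, not those of Tamayo et al.):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((500, 5))     # 500 points in k = 5 dimensions

    # A 3 x 2 grid of nodes, mapped into data space at random.
    grid = np.array([(r, c) for r in range(3) for c in range(2)], dtype=float)
    nodes = rng.standard_normal((len(grid), data.shape[1]))

    n_iter = 20_000
    for t in range(n_iter):
        p = data[rng.integers(len(data))]    # randomly selected data point P
        winner = np.argmin(np.linalg.norm(nodes - p, axis=1))  # closest node N_P
        lr = 0.1 * (1 - t / n_iter)          # both schedules decay over time
        sigma = 1.0 * (1 - t / n_iter) + 0.1
        # Nodes close to N_P in the *grid* geometry are moved the most.
        grid_dist = np.linalg.norm(grid - grid[winner], axis=1)
        h = np.exp(-grid_dist**2 / (2 * sigma**2))
        nodes += lr * h[:, None] * (p - nodes)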
42
SOM (continued)
Yeast Cell Cycle SOM - The 828 genes that passed
the variation filter were grouped into 30
clusters.
43
  • SOM analysis of data of yeast gene expression
    during the diauxic shift [2]. Data were analyzed by
    a prototype of GenePoint software
  • a: Genes with a similar expression profile are
    clustered in the same neuron of a 16 × 16 matrix
    SOM, and genes with closely related profiles are
    in neighboring neurons. Neurons contain between
    10 and 49 genes
  • b: Magnification of four neurons similarly
    colored in a. The bar graph in each neuron
    displays the average expression of genes within
    the neuron at 2-h intervals during the diauxic
    shift
  • c: SOM modified with Sammon's mapping algorithm.
    The distance between two neurons corresponds to
    the difference in gene expression pattern between
    the two neurons, and the circle size to the number
    of genes included in the neuron. Neurons marked in
    green, yellow (upper left corner), red and blue
    are similarly colored in a and b

44
Result of SOM clustering of Dictyostelium
expression data with a 6 × 4 structure of
centroids. A 6 × 4 = 24 clusters is the minimum
number of centroids needed to resolve the three
clusters revealed by percolation clustering
(encircled; from top to bottom: down-regulated
genes, early upregulated genes, and late
upregulated genes). The remaining 21 clusters are
formed by forceful partitioning of the remaining
non-informative noisy data. Similarity of
expression within these 21 clusters is random,
and is biologically meaningless.
45
SOM clustering
  • SOM: self-organizing maps
  • Preprocessing
  • filter away genes with insufficient biological
    variation
  • normalize gene expression (across samples) to
    mean 0, st. dev 1, for each gene separately.
  • Run SOM for many iterations
  • Plot the results

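A sketch of that preprocessing in Python (the variation threshold is an illustrative assumption):

    import numpy as np

    def preprocess(expr, min_std=0.5):
        # expr: genes x samples matrix of expression values.
        # Filter away genes with insufficient variation across samples.
        expr = expr[expr.std(axis=1) > min_std]
        # Normalize each gene across samples to mean 0, st. dev. 1.
        mu = expr.mean(axis=1, keepdims=True)
        sd = expr.std(axis=1, keepdims=True)
        return (expr - mu) / sd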
46
SOM results
Large grid: 10 × 10
3 cells
47
Clustering visualization
48
2D SOM visualization
49
SOM output visualization
50
The Y-Cluster
51
Part 8
Beyond Clustering
52
Support vector machines
Used for classification of genes according to
function:
1) Choose positive and negative examples (label +/-)
2) Transform input space to feature space
3) Construct maximum-margin hyperplane
4) Classify new genes as members / non-members
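A minimal sketch of steps 1-4 (scikit-learn and the toy data are illustrative assumptions; Brown et al. used their own SVM implementation):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Step 1: expression profiles for positive (in-class) and negative genes.
    X_pos = rng.standard_normal((40, 79)) + 1.0   # e.g. 79 expression measurements
    X_neg = rng.standard_normal((200, 79))
    X = np.vstack([X_pos, X_neg])
    y = np.array([1] * len(X_pos) + [-1] * len(X_neg))

    # Steps 2-3: the RBF kernel implicitly maps inputs to a feature space,
    # where the maximum-margin hyperplane is constructed.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    # Step 4: classify new genes as members (+1) / non-members (-1).
    new_genes = rng.standard_normal((5, 79))
    print(clf.predict(new_genes))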
53
Support vector machines (continued)
(Brown et al., 2000 PNAS 97(1), 262-267)
- Using the class definitions made by the MIPS
yeast genome database, SVMs were trained to
recognize six functional classes: tricarboxylic
acid (TCA) cycle, respiration, cytoplasmic
ribosomes, proteasome, histones, and
helix-turn-helix proteins.
54
Support vector machines (continued)
Examples of predicted functional classifications
for previously unannotated genes by the SVMs:

Class  Gene     Locus  Comments
TCA    YHR188C         Conserved in worm, Schizosaccharomyces pombe, human
TCA    YKL039W  PTM1   Major transport facilitator family; likely integral membrane protein
Resp   YKR016W         Not highly conserved; possible homolog in S. pombe
Resp   YKR046C         No convincing homologs
Ribo   YKL056C         Homolog of translationally controlled tumor protein; abundant, fingers
Ribo   YNL053W  MSG5   Protein-tyrosine phosphatase; bypasses growth arrest by mating factor
Prot   YDR330W         Ubiquitin regulatory domain protein; S. pombe homolog
Prot   YJL036W         Member of sorting nexin family
Prot   YDL053C         No convincing homologs
Prot   YLR387C         Three C2H2 zinc fingers; similar to YBR267W, not coregulated
55
Automatic discovery of regulatory patterns in
promoter region
(Juhl and Knudsen, 2000 Bioinformatics,
16:326-333)
From SGD:
DNA chip: 91 data sets. These data sets consist
of the 500 bp upstream regions and the red-green
ratios.
56
Automatic discovery of regulatory patterns in
promoter region (continued)
- Sequence patterns correlated to whole-cell
expression data were found by Kolmogorov-Smirnov
tests.
- Regulatory elements were identified by
systematic calculations of the significance of
correlation between words found in functional
annotation of genes and DNA words occurring in
their promoter regions.
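A sketch of the Kolmogorov-Smirnov step (illustrative data, with scipy standing in for the authors' own implementation): compare the expression ratios of genes whose upstream region contains a candidate word against those of all other genes.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)

    # Illustrative log red-green ratios for genes with and without a
    # candidate word in their 500 bp upstream region.
    with_word = rng.normal(loc=0.8, scale=1.0, size=60)    # shifted distribution
    without_word = rng.normal(loc=0.0, scale=1.0, size=900)

    stat, p = ks_2samp(with_word, without_word)
    # A small p-value suggests the word is correlated with expression,
    # i.e. it is a candidate regulatory element.
    print(f"KS statistic = {stat:.3f}, p = {p:.2e}")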
57
Bayesian networks analysis
(Friedman et al. 2000 J. Comp. Biol., 7:601-620)
- Graph-based model of joint multivariate
probability distributions
- The model can capture properties of conditional
independence between variables
- Can describe complex stochastic processes
- Provides clear methodologies for learning from
(noisy) observations
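Concretely, a Bayesian network over variables X_1, ..., X_n encodes the joint distribution as a product of local conditionals, one per node given its parents in the graph:

\[
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\big( X_i \mid \mathbf{Pa}(X_i) \big),
\]

and the conditional independences are read directly off the graph structure.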
58
Bayesian networks analysis (continued)
59
Bayesian networks analysis (continued)
- 76 gene expression measurements of 6,177 yeast
ORFs.
- 800 genes whose expression varied over
cell-cycle stages were selected.
- Learned networks whose variables were the
expression levels of each of these 800 genes.
60
Movie
http://www.dkfz-heidelberg.de/abt0840/whuber/mamovie.html
61
Part 9
Concluding Remarks
62
Future directions
  • Algorithms optimized for small samples (the no.
    of samples will remain small for many tasks)
  • Integration with other data
  • biological networks
  • medical text
  • protein data
  • cost-sensitive classification algorithms
  • error cost depends on outcome (don't want to miss
    treatable cancer), treatment side effects, etc.

63
Summary
  • Microarray Data Analysis -- a revolution in life
    sciences!
  • Beware of false positives
  • Principled methodology can produce good results