Analysis of Gene Expression, by Anne R. Haake and Rhys Price Jones

Slides: 93

Provided by: AnneH72

Learn more at: https://www.cs.rit.edu

1
Analysis of Gene Expression
Anne R. Haake, Rhys Price Jones
2
Gene Expression Analysis

Connecting structural genomics to functional
genomics

3
How do we relate gene identity to cell
physiology, disease, and drug discovery?

Functional Genomics
development and application of global
(genome-wide or system-wide) experimental
approaches to assess gene function by making use
of the information and reagents provided by
structural genomics

4
High Throughput Systems for Studying Global Gene
Expression are Complex

Need to consider
the biology behind the experiments
the interpretation of the experiments
advancements in biotechnology
the computing issues

5
The Flow of Information

A gene is expressed in 2 steps
DNA is transcribed into RNA (mRNA)
RNA is translated into protein

6
Genotype to Phenotype

Individual cells in an organism have the same
genes (DNA)
the genotype
but... not all genes are active (expressed) in
each cell
It is the expression of thousands of genes and
their products (RNA, proteins), functioning in a
complicated and orchestrated way, that makes a
specific cell what it is.
the phenotype

7
The Flow of Information
8
Gene Expression Depends on Context

The subsets of genes that are expressed
(RNA/protein) will differ among cells, tissues,
organs, conditions
the subset expressed confers unique properties to
the cell

neuron
liver
muscle
muscle
9
Differential Gene Expression

The level of expression of genes also differs
with the cellular context
i.e. the amount of a given RNA will vary
We can think of gene expression in eukaryotes as
having both an on/off switch and volume
control

10
Specific Patterns of Gene Expression

Tissue/Cell type-specific
-e.g. skin cell vs. brain cell
-e.g. keratinocyte vs. melanocyte
Developmental stage
-e.g. embryonic skin cell vs. adult skin cell
Disease state
-e.g. normal skin cell vs. skin tumor cell
Environment-specific (drugs, toxins)
-e.g. skin cell untreated vs. treated

11
Analyze Gene Expression

We measure gene expression by analyzing the
genetic molecule, messenger RNA (mRNA)
We also often are interested in measuring
proteins

12
We Can Analyze RNA Content

First, isolate mRNA from cells or tissues

Next, identify RNAs in the sample
One to few RNAs at a time
Multiple RNAs (high throughput techniques)
Sometimes called global expression analysis
Identify by hybridization (base-pairing)

radioactive or enzyme-linked probe to the
immobilized RNA
Probe is complementary to RNA of interest
Called cDNA

14
RNA Expression Analysis

One type of analysis is called a Dot Blot

samples are spotted onto filter and then
hybridized with labeled probe
So, the sequence is used to generate the data
(via hybridization) but the data itself is image
data. We scan the images to get intensities for
each spot.
15
State of the Art High-Throughput Methods

Multiple genes -> entire genome: expression analyzed
at once!
RNA (the transcriptome)
DNA microarrays
SAGE: serial analysis of gene expression
MPSS: massively parallel signature sequencing
Proteins (the proteome)
protein arrays
mass spectrometry

16
Gene Expression Analysis

Thousands of different mRNAs are present in a
given cell; together they make up the
transcriptional profile
It is important to remember that when a gene
expression profile is analyzed in a given sample,
it is just a snapshot in time and space.

17
Regulation of Gene Expression: How Do Different
Transcriptional Profiles Arise?

Prokaryotes (e.g. bacteria)
simple organisms; gene expression responsive
primarily to environment
sets of genes are generally on or off
functionally related genes are organized into
units called operons
Eukaryotes
Not just on/off and volume control but even more
complicated!
complex multi-level control

18
The Expression Snapshot
All of these mechanisms together determine which
RNAs/proteins are present in the snapshot.
19
Gene Regulatory Networks

Genes act in concert
Interrelationships are complex!
Scientists rarely get funded to study one gene
anymore!!
We need ways to understand what the snapshot
means

20
Gene Networks

"The approach to biology for the past 30 years
has been to study individual proteins and genes
in isolation. The future will be the study of
the genes and proteins of organisms in the
context of their informational pathways or
networks."
Leroy Hood, Director of the Institute for Systems
Biology, Nature, Oct. 19, 2000.

21
Gene Networks: Some Examples

Genes and their products are related through
their roles in
metabolic pathways
cell signalling networks

22
Metabolic Pathway
23
Cell Signalling Networks
www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif
24
What can we learn by studying global patterns of
gene expression?

Individual gene expression patterns
Classifications for diagnosis, prediction
Groups of Genes
Molecular taxonomy of disease
Gene Networks/Pathways
Reconstruction of metabolic and regulatory pathways

25
Gene Expression Analysis

Biotechnology
High-Throughput techniques of Biochemistry/
Molecular Biology
RNA or protein

Informatics
Management of large, complex data sets
Data mining to gain useful information

26
High Throughput Techniques

We will only consider microarrays in detail today

27
Gene Expression Analysis Using Microarrays
Experimental Design
Array Production
Sample Preparation
Scanning
Image Analysis
Data Processing
Data Analysis
Information Integration
Biology Wet Lab
Computer Workstation
28
GeneChip vs Spotted Arrays

GeneChip Arrays use oligonucleotides
Oligo arrays are built on a solid support

Spotted arrays utilize nucleic acids made in
solution
Solutions are then spotted onto a solid support

29
What are cDNA (Spotted) Microarrays?

A miniaturized, simultaneous version of the
traditional cDNA dot blot
Enables massive gene expression profiling
10,000s of genes at once
cDNA probes are amplified directly from culture
by PCR and purified
Purified probes are printed onto coated glass
slides

30
cDNA Microarray Expression Analysis
Duggan et al. (1999) Nature Genetics 21: 11
31
ScanAlyze: Image Analysis

Spot Finding
Background subtraction
Intensity Calculation

32
Image Analysis Software

Freeware or Shareware
ScanAlyze
MAGIC (MicroArray Genome Imaging and Clustering
Tool)
Spot
Automated spot finding

33
What do the spots represent?

Fluorescence intensity is a measure of the
relative abundance of individual mRNAs
Experimental relative to control
expressed as a ratio from spotted arrays
(2-color)
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1801
Have to be careful when comparing between arrays
from experiment to experiment.

34
GeneChip Oligonucleotide Array
35
GeneChip Expression Analysis: Hybridization and
Staining
Array
Hybridized Array
cRNA Target
Streptavidin-phycoerythrin conjugate
Courtesy of M. Hessner, CAAGED Workshop
36
Affymetrix Chips
300,000 Probes
Perfect Match and Mismatch
Average Difference Values
Courtesy of J. Glasner, CAAGED Workshop
37
Current Problems Facing Expression Analysis on
the Biotech side

Standardization and quality control in the
experiments (data quality at many levels)
Cost

38
Problem in reproducibility of experimental data

Lots of variation in arrays
more than 100 experimental steps
Sources of variation
biological variability in each RNA extract
each labeling reaction is different
each slide is a separate hybridization
spots on the slide are variable across slides
(and within slides when double spotted)
each color is scanned separately
Need Replicates and Statistics!

39
The Value of Replicates

What is a replicate?
Doing the same experiment more than once
An experimental design issue
How many?
Most people don't do true replicates
Why not?
cost is primary
limiting sample size
ignorance of statistical considerations
competition

40
Outcome

Noisy data
Data preprocessing is necessary
normalization
scaling
Heavy reliance on statistics today

41
Pre-processing

Gene filtering
control genes
uninformative genes
Normalization and scaling
allows comparisons across arrays
scaling to control dynamic range
Transformation
logarithmic transformation for improved
statistical properties

42
Normalization
[Figure: Cy5 signal (log2) plotted against Cy3 signal (log2)]
43
Outcomes of Microarray Analysis

Large, complex data sets
example of a routine study
50,000 genes from 20 samples -> approx. 1-2
x 10^6 pieces of data
challenges for Bioinformatics
annotation, storage, retrieval, sharing of data
information from the data

44
Current State of Microarray Data Availability

Wide availability of technology has given rise to
a large number of distributed databases
data scattered among many independent sites
(accessible via Internet) or not publicly
available at all
Need for standardization!

45
Public Repositories: Efforts Towards
Standardization

GeneX at US National Center for Genome Resources
ArrayExpress at European Bioinformatics Institute
Gene Expression Omnibus at US National Center for
Biotechnology Information
Stanford University Database

46
Standardization of the biological databases is a
big issue

A prime example of databases in need of
standardization: the gene expression databases
Why?
wide availability of technologies such as the
microarray has given rise to a large number of
heterogeneous, distributed databases
differ in annotation, database structure,
availability
standardization is necessary to enable scientists
to share and compare data

47
MGED Group and Standardization Issues

Microarray Gene Expression Database (MGED) Group
www.mged.org
MGED is taking on the challenge of
standardization
Four major projects

48
MGED Projects

MIAME - The formulation of the minimum
information about a microarray experiment
required to interpret and verify the results.
MAGE - The establishment of a data exchange
format (MAGE-ML) and object model (MAGE-OM) for
microarray experiments.

49
MGED Projects

Ontologies - The development of ontologies for
microarray experiment description and biological
material (biomaterial) annotation in particular.
Normalization - The development of
recommendations regarding experimental controls
and data normalization methods.

50
Some Basic Statistics

dot product
mean
standard deviation
log base 2
etc.
util.ss

51
Gene chips

Spots representing thousands of genes
Two populations of cDNA
different conditions to be compared
One colored with Cy5 (red)
One colored with Cy3 (green)
Mixed, incubated with the chip
Figures from Campbell-Heyer Chapter 4

52
Red/Green Intensity measurements

(define redgreens '((2345 2467) (3589 2158)
(4109 1469) (1500 3589) (1246 1258) (1937 2104)
(2561 1562) (2962 3012) (3585 1209) (2796 1005)
(2170 4245) (1896 2996) (1023 3354) (1698 2896)))
Shows (red green) intensities for 14 (out of
6200!) genes

53
Should we normalize?

Average of reds is 2386.9
Average of greens is 2380.3
What does John Quackenbush say? (page 420)
Calculate standard deviations.
Return to this issue
For now, no normalization
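The two channel averages quoted above, and the standard deviations the slide asks for, can be checked with a short sketch. It is written in Python rather than the Scheme of the slides, and it assumes the population standard deviation (divide by n); the slides do not say which form util.ss uses.

```python
from statistics import mean, pstdev

# (red, green) intensity pairs copied from the redgreens list on slide 52
redgreens = [(2345, 2467), (3589, 2158), (4109, 1469), (1500, 3589),
             (1246, 1258), (1937, 2104), (2561, 1562), (2962, 3012),
             (3585, 1209), (2796, 1005), (2170, 4245), (1896, 2996),
             (1023, 3354), (1698, 2896)]

reds = [r for r, _ in redgreens]
greens = [g for _, g in redgreens]

red_mean, green_mean = mean(reds), mean(greens)
# Assumption: population standard deviation (pstdev), not the sample form
red_sd, green_sd = pstdev(reds), pstdev(greens)

print("means:", round(red_mean, 1), round(green_mean, 1))
print("sds:  ", round(red_sd, 1), round(green_sd, 1))
```

Because the two means differ by less than 0.3%, a global rescaling of one channel would barely change these 14 values, which is consistent with the slide's decision to skip normalization for now.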

54
Ratios of red values to green

(define redgreenratios (map (lambda (x)
(round2 (/ (car x) (cadr x)))) redgreens))
Produces (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98
2.97 2.78 0.51 0.63 0.31 0.59)
Which genes are expressed more in red than green?
Should these values be normalized?
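The ratio list above can be reproduced with the following Python sketch, which mirrors the Scheme `map` on the slide. Python's built-in `round` stands in for the slide's `round2` helper, on the assumption that `round2` rounds to two decimals.

```python
# (red, green) intensity pairs copied from the redgreens list on slide 52
redgreens = [(2345, 2467), (3589, 2158), (4109, 1469), (1500, 3589),
             (1246, 1258), (1937, 2104), (2561, 1562), (2962, 3012),
             (3585, 1209), (2796, 1005), (2170, 4245), (1896, 2996),
             (1023, 3354), (1698, 2896)]

# Red/green ratio per gene, rounded to two decimals
ratios = [round(r / g, 2) for r, g in redgreens]

# 0-based positions of genes whose red signal exceeds the green one
more_red = [i for i, x in enumerate(ratios) if x > 1.0]

print(ratios)
print(more_red)
```

Five of the 14 genes (positions 1, 2, 6, 8 and 9, counting from 0) come out with a ratio above 1.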

55
Yet another Color scheme

(0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78
0.51 0.63 0.31 0.59)
Highly expressed   Neutral   Less expressed
>2.0   >1.3   close to 1.0   >0.5   <0.5
Seems arbitrary?
Log scale??
Why oh why did they re-use red and green?
Clustering? Meaning?

56
Larger experiment

12 Genes
Expression values at 0, 2, 4, 6, 8 and 10 hours

57
Table 4.2 of Campbell/Heyer

Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125
Normalized how?

58
Take logs

C 0 3.0 3.58 4.0 3.58 3.0
D 0 1.58 2.0 2.0 1.58 1.0
E 0 2.0 3.0 3.0 3.0 3.0
F 0 0 0 -2.0 -2.0 -3.32
G 0 1.0 1.58 2.0 1.58 1.0
H 0 -1.0 -1.6 -2.0 -1.6 -1.0
I 0 2.0 3.0 2.0 0 -1.0
J 0 1.0 0 1.0 0 1.0
K 0 0 0 0 1.58 1.58
L 0 1.0 1.58 2.0 1.58 1.0
M 0 -1.6 -2.0 -2.0 -1.6 -1.0
N 0 -3.0 -3.59 -4.0 -3.59 -3.0
Compare
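As a check on the table above, this Python sketch takes log base 2 of every entry in Table 4.2 and rounds to two decimals. The dictionary name `table42` is illustrative only; the values are copied from the slide.

```python
import math

# Expression values from Table 4.2 at 0, 2, 4, 6, 8 and 10 hours
table42 = {
    "C": [1, 8, 12, 16, 12, 8],
    "D": [1, 3, 4, 4, 3, 2],
    "E": [1, 4, 8, 8, 8, 8],
    "F": [1, 1, 1, .25, .25, .1],
    "G": [1, 2, 3, 4, 3, 2],
    "H": [1, .5, .33, .25, .33, .5],
    "I": [1, 4, 8, 4, 1, .5],
    "J": [1, 2, 1, 2, 1, 2],
    "K": [1, 1, 1, 1, 3, 3],
    "L": [1, 2, 3, 4, 3, 2],
    "M": [1, .33, .25, .25, .33, .5],
    "N": [1, .125, .0833, .0625, .0833, .125],
}

# log2 turns an n-fold increase and an n-fold decrease into +x and -x
logtable42 = {name: [round(math.log2(v), 2) for v in row]
              for name, row in table42.items()}

for name, row in logtable42.items():
    print(name, row)
```

On the log scale, rows such as G and H become mirror images of each other, which is hard to see in the raw values.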

59
How Similar are two Rows?

How similar are the expressions of two genes?
First we'll normalize each row
(define normalize ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))
What are the new mean and standard deviation?
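The question can be answered empirically with a Python version of `normalize`. This sketch assumes `standarddeviation` is the population form (divide by n), which is the form that makes the Pearson correlation of a row with itself exactly 1.

```python
from statistics import mean, pstdev

def normalize(row):
    """Subtract the mean and divide by the (population) standard deviation."""
    m = mean(row)
    s = pstdev(row)  # assumption: population sd, matching util.ss
    return [(x - m) / s for x in row]

z = normalize([1, 2, 3, 4, 3, 2])  # row G of Table 4.2
print([round(v, 3) for v in z])
print(round(mean(z), 6), round(pstdev(z), 6))
```

After normalization every row has mean 0 and standard deviation 1, so rows are compared by shape rather than by absolute level.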

60
How Similar are two Rows?

Calculate the Pearson Correlation between pairs
of rows
(define pc ; Pearson correlation
  (lambda (xs ys)
    (/ (dotproduct (normalize xs) (normalize ys))
       (length xs))))
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 2 3 4 3 2))   ; row L
1.0
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 3 4 4 3 2))   ; row D
0.8971499589146109
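For readers without the Scheme helpers from util.ss, the same calculation can be sketched in Python: normalize both rows, take the dot product, and divide by the length. The population standard deviation is assumed, as above.

```python
from statistics import mean, pstdev

def normalize(row):
    """z-scores using the population standard deviation (an assumption)."""
    m, s = mean(row), pstdev(row)
    return [(x - m) / s for x in row]

def pc(xs, ys):
    """Pearson correlation: dot product of the z-scored rows, divided by n."""
    zx, zy = normalize(xs), normalize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

G = [1, 2, 3, 4, 3, 2]
L = [1, 2, 3, 4, 3, 2]
D = [1, 3, 4, 4, 3, 2]

print(pc(G, L))
print(pc(G, D))
```

Up to floating-point rounding, this reproduces the 1.0 and 0.897 shown on the slide.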

61
Some other pairs

Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125
> (pc '(1 3 4 4 3 2)            ; row D
      '(1 .33 .25 .25 .33 .5))  ; row M
-0.9260278787295065
> (pc '(1 2 3 4 3 2)            ; row G
      '(1 .5 .33 .25 .33 .5))   ; row H
-0.9090853650855358

62
Correlation is sensitive to relative magnitudes

pc(G,L) = 1 -- identically expressed genes
pc(G,D) = .897 -- similarly expressed genes
pc(D,M) = -.926 -- reciprocally expressed
pc(G,H) = -.909 -- also reciprocally expressed
What happens if, instead of using the expression
data we use the log transforms?
pc(G,L) = 1.0
pc(G,D) = 0.939
pc(D,M) = -1.0
pc(G,H) = -1.0
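The comparison between raw and log-transformed correlations can be rerun with this Python sketch. It computes logs from the unrounded table entries, so the third decimal may differ slightly from the slide, which appears to start from rounded logs.

```python
import math
from statistics import mean, pstdev

def pc(xs, ys):
    """Pearson correlation using population standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

rows = {
    "G": [1, 2, 3, 4, 3, 2],
    "L": [1, 2, 3, 4, 3, 2],
    "D": [1, 3, 4, 4, 3, 2],
    "M": [1, .33, .25, .25, .33, .5],
    "H": [1, .5, .33, .25, .33, .5],
}
logs = {name: [math.log2(v) for v in row] for name, row in rows.items()}

for a, b in [("G", "L"), ("G", "D"), ("D", "M"), ("G", "H")]:
    raw = pc(rows[a], rows[b])
    logged = pc(logs[a], logs[b])
    print(f"pc({a},{b}) raw={raw:.3f} log={logged:.3f}")
```

Reciprocal pairs such as D and M become almost exactly anticorrelated after the log transform, because a 4-fold rise and a 4-fold fall are then equal and opposite.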

63
Hierarchical Clustering

Repeat
Replace the two closest objects by their
combination
Until only one object remains

64
What are the objects?

(define objects
  (map (lambda (x)
         (cons
           (symbol->string (car x))
           (cdr x)))
       logtable42))
Initially, the objects are the genes with the log
transformed expression levels
Typical object
("E" 0 2.0 3.0 3.0 3.0 3.0)

65
Combining objects

(define combine
  (lambda (xs ys)
    (cons (string-append (car xs) (car ys))     ; combine names
          (map (lambda (x y) (/ (+ x y) 2.0))   ; average the entries
               (cdr xs) (cdr ys)))))
Typical combined pair
("EG" 0 1.5 2.29 2.5 2.29 2.0)
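The repeat-until loop from the Hierarchical Clustering slide, together with `combine`, can be sketched in Python as follows. The slides do not say how "closest" is measured, so Euclidean distance is an assumption here, and the function names `distance` and `cluster` are illustrative.

```python
import math

def combine(a, b):
    """Merge two (name, values) objects: join the names, average the entries."""
    return (a[0] + b[0], [(x + y) / 2.0 for x, y in zip(a[1], b[1])])

def distance(a, b):
    # Assumption: Euclidean distance; the slides do not name the measure
    return math.dist(a[1], b[1])

def cluster(objects):
    """Repeatedly replace the two closest objects by their combination."""
    objects = list(objects)
    merges = []
    while len(objects) > 1:
        pairs = [(i, j) for i in range(len(objects))
                 for j in range(i + 1, len(objects))]
        i, j = min(pairs, key=lambda p: distance(objects[p[0]], objects[p[1]]))
        merged = combine(objects[i], objects[j])
        merges.append(merged[0])
        objects = [o for k, o in enumerate(objects) if k not in (i, j)] + [merged]
    return merges

E = ("E", [0, 2.0, 3.0, 3.0, 3.0, 3.0])
G = ("G", [0, 1.0, 1.58, 2.0, 1.58, 1.0])
L = ("L", [0, 1.0, 1.58, 2.0, 1.58, 1.0])
D = ("D", [0, 1.58, 2.0, 2.0, 1.58, 1.0])

print(combine(E, G))
print(cluster([G, L, D]))
```

Note that `combine` takes a plain average, so a cluster that already holds several genes counts the same as a single gene when merged; that follows the slide rather than size-weighted averaging.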

66
Manual Hierarchical Clustering

Let's go to Emacs

67
K-means Clustering -- Lloyd's Algorithm

Partition data into k clusters
REPEAT
  FOR each datapoint
    Calculate its distance to the centroid of each cluster
    IF this is minimal for its own cluster
      Leave the datapoint in its current cluster
    ELSE
      Place it in its closest cluster
UNTIL no datapoint is moved
Goal: minimize sum of squared distances from datapoints
to centroids
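The pseudocode above can be turned into a small runnable Python sketch. It recomputes all centroids once per pass (a batch variant), uses Euclidean distance, and keeps a cluster's previous centroid if the cluster empties; those choices, the function names, and the starting partition are assumptions made for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def kmeans(points, assignment, k):
    """assignment[i] is the starting cluster (0..k-1) of points[i].
    Every cluster must start with at least one point."""
    assignment = list(assignment)
    centroids = [None] * k
    while True:
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = centroid(members)
        moved = False
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            best = dists.index(min(dists))
            if dists[best] < dists[assignment[i]]:
                assignment[i] = best  # place it in its closest cluster
                moved = True
        if not moved:  # UNTIL no datapoint is moved
            return assignment

# Log-transformed rows C, N, E, M from the slides, with a deliberately bad start
points = [[0, 3.0, 3.58, 4.0, 3.58, 3.0],       # C
          [0, -3.0, -3.59, -4.0, -3.59, -3.0],  # N
          [0, 2.0, 3.0, 3.0, 3.0, 3.0],         # E
          [0, -1.6, -2.0, -2.0, -1.6, -1.0]]    # M
result = kmeans(points, [0, 0, 1, 1], 2)
print(result)
```

Starting from the mixed partition {C, N} and {E, M}, the loop ends with the induced genes C and E in one cluster and the repressed genes N and M in the other.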

68
Analysis of k-means clustering

There are always exactly k clusters
No cluster is empty (why?)
The clusters are not hierarchical
The clusters do not overlap
Run time with n datapoints
Partitioning O(n)
FOR loop is O(nk)
REPEAT loop is ???
Kanungo et al.

69
Pro and Con

Pro
With small k, may be faster than hierarchical
Clusters may be tighter
Con
Sensitive to initial choice of k
Sensitive to initial partition
May converge to local, rather than global minimum
Not clear how good resulting clusters are

70
Other Methods for Clustering

Self Organizing Maps
SWARM technology
SOM/SWARM hybrid

71
Mining of Expression Data

A gene expression pattern derived from a single
microarray is simply a snapshot (one
experimental sample vs reference)
Usually want to understand a process or changes
in expression over a collection of samples
-> gene expression profile

72
Three levels of microarray gene expression data
processing
Brazma et al., Nature Genetics, 29:365-371, 2001
73
Goal of Analysis of Expression Matrix

Some statistical methods applied to
Group similar genes together -> groups of
functionally similar genes.
Group similar cell samples together.
Extract representative genes in each group.

74
Typical approach

Look for patterns
compare rows to find evidence for co-regulation
of genes
compare columns to find evidence for relatedness
among samples
1) Choose a measure of similarity (distance)
among the objects being compared; each row or
column is considered a vector in space
2) Then, group together objects (genes or
samples) with similar properties; this is a
multidimensional analysis

75
Pattern Recognition

Clustering
Feature extraction/selection
Classification-discrimination analysis

76
Analytic Approaches

Clustering: identification of associations
between data points; organization of data into
groups; exploratory analysis
Clustering Algorithms
Hierarchical
K-means
Self-organizing maps
Others

77
Gene Expression Matrix: Hierarchical Clustering
[Figure: clustered expression matrix, genes as rows and samples as columns]
Eisen et al. http://www.pnas.org/cgi/content/full/95/25/14863
78
Feature Selection and Classification

First, identify features (genes) that
discriminate between classes
Then use features for classification
machine learning approach
supervised analysis
assignment of a new sample (pattern) to a
previously specified class, based on sample
features and a trained classifier

79
Classic Example: Classification of AML vs. ALL

Biological/Clinical Problems
previously, no single reliable test to
distinguish them
differ greatly in clinical course and response to
treatments

Comparing 2 acute leukemias
acute myeloid leukemia (AML)
acute lymphoblastic leukemia (ALL)

Golub et al., Science Oct 15 1999 531-537
80
Study Design
Golub et al., Science Oct 15 1999 531-537
81
Results of the study

1) Clustering of microarray data using tumors of
known type
-> found 1100 of 6817 genes correlated with class
distinction
2) Formation of a class predictor: 50 most
informative genes used as a training set
-> classification of unknown tumors

Golub et al., Science Oct 15 1999 531-537
82

The prediction of a new sample is based on
'weighted votes' of a set of informative genes
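A minimal Python sketch of a weighted-vote classifier follows. It is based on the scheme usually attributed to Golub et al., where each gene votes w * (x - b) with w = (mean1 - mean2) / (sd1 + sd2) and b the midpoint of the two class means. The slide itself gives no formula, so treat the details as an assumption; the two-gene data set is hypothetical and only shows the mechanics.

```python
from statistics import mean, pstdev

def train(samples, labels):
    """Per-gene (weight, boundary) from labelled training samples."""
    params = []
    for g in range(len(samples[0])):
        aml = [s[g] for s, lab in zip(samples, labels) if lab == "AML"]
        all_ = [s[g] for s, lab in zip(samples, labels) if lab == "ALL"]
        # Weight: how well gene g separates the classes (assumed formula)
        weight = (mean(aml) - mean(all_)) / (pstdev(aml) + pstdev(all_))
        boundary = (mean(aml) + mean(all_)) / 2
        params.append((weight, boundary))
    return params

def predict(params, sample):
    """Sum the positive (AML) and negative (ALL) votes; the larger total wins."""
    votes = [w * (x - b) for (w, b), x in zip(params, sample)]
    v_aml = sum(v for v in votes if v > 0)
    v_all = -sum(v for v in votes if v < 0)
    return "AML" if v_aml > v_all else "ALL"

# Hypothetical expression levels for two genes in four training tumors
samples = [[5.0, 1.0], [6.0, 2.0], [1.0, 5.0], [2.0, 7.0]]
labels = ["AML", "AML", "ALL", "ALL"]
params = train(samples, labels)

print(predict(params, [5.5, 2.0]))
print(predict(params, [1.0, 6.0]))
```

Genes that separate the classes cleanly get large weights, so a few strongly informative genes can outvote many weak ones.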
84
Free Software for Microarray Analysis

Cluster and TreeView
Michael Eisen
http://rana.lbl.gov/EisenSoftware.htm

85
More Free Software

Expression Profiler (European Bioinformatics
Inst)
http://ep.ebi.ac.uk/
GeneX (National Center for Genome Resources)
http://www.ncgr.org/genex/index.html
ArrayViewer and MEV (TIGR)
http://www.tigr.org/software/
Many More!!

86
Microarray Analysis Software

Popular commercial packages
Spotfire DecisionSite (Spotfire, Inc)
GeneSpring (Silicon Genetics)
Affymetrix Microarray Suite
Affymetrix Data Mining Tool (Affymetrix, Inc.)
Rosetta Inpharmatics' Resolver

87
Why cluster analysis may not be the answer

Clustering methods typically require user inputs
Example: distance measure
Clustering methods differ in the way that the
number of clusters is specified.
Clustering methods are often sensitive to the
initialization condition (starting guess)
Local vs. global sampling of clustering space

88
Cluster Analysis Challenges

Noise in the data itself
Large data sets
most of the techniques currently used were not
developed for multidimensional data
What about networks?
limitation of cluster analysis: similarity in
expression pattern suggests co-regulation but
doesn't reveal cause-effect relationships

89
Using information networks as an interpretive
layer between phenotypes and the underlying
genes, proteins and metabolites
Highly connected genes are often critical in the
onset of cancer and metabolic diseases. However,
drug treatment targeting less connected genes
will have fewer side effects.
J. Blanchard, CAAGED Workshop 2002
90
Other Analytic Approaches are Being Explored to
Reverse Engineer Networks

Bayesian Networks
represent the dependence structure between
multiple interacting quantities (e.g. expression
levels of genes)
gene interactions: models of causal influence
good for noisy data

91
Analytic Challenges

Advanced methods may require significant
computational resources
Numerically complex calculations (large
correlation matrices)
Combinatorially large search spaces (Bayesian
nets)
Many training cycles (neural nets)
Global optimization (genetic algorithms)
Archiving, indexing, and correlation of large
datasets (data extraction and data mining
visualization)

S. Atlas, CAAGED Workshop
92
Take-home Messages

With current technology global gene expression
studies are best used for hypothesis building
Other experimental methods and smaller data sets
needed to address the reproducibility problem
New analytical approaches are needed to deal with
the multidimensional data
Need for high performance computing
Need for Bio/CS/IT/Stats people to work together!
