Title: Analysis of Gene Expression Anne R. Haake Rhys Price Jones
1Analysis of Gene Expression
Anne R. Haake, Rhys Price Jones
2Gene Expression Analysis
- Connecting structural genomics to functional
genomics
3How do we relate gene identity to cell
physiology, disease, and drug discovery?
- Functional Genomics
- development and application of global
(genome-wide or system-wide) experimental
approaches to assess gene function by making use
of the information and reagents provided by
structural genomics
4High Throughput Systems for Studying Global Gene
Expression are Complex
- Need to consider
- the biology behind the experiments
- the interpretation of the experiments
- advancements in biotechnology
- the computing issues
5The Flow of Information
- A gene is expressed in 2 steps
- DNA is transcribed into RNA (mRNA)
- RNA is translated into protein
6Genotype to Phenotype
- Individual cells in an organism have the same
genes (DNA) - the genotype
- but not all genes are active (expressed) in
each cell
- It is the expression of thousands of genes and
their products (RNA, proteins), functioning in a
complicated and orchestrated way, that makes a
specific cell what it is - the phenotype
7The Flow of Information
8Gene Expression Depends on Context
- The subsets of genes that are expressed
(RNA/protein) will differ among cells, tissues,
organs, and conditions
- the subset expressed confers unique properties to
the cell
[Figure: example cell types: neuron, liver, muscle]
9Differential Gene Expression
- The level of expression of genes also differs
with the cellular context
- i.e. the amount of a given RNA will vary
- We can think of gene expression in eukaryotes as
having both an on/off switch and volume
control
10 Specific Patterns of Gene Expression
- Tissue/Cell type-specific
- e.g. skin cell vs. brain cell
- e.g. keratinocyte vs. melanocyte
- Developmental stage
- e.g. embryonic skin cell vs. adult skin cell
- Disease state
- e.g. normal skin cell vs. skin tumor cell
- Environment-specific (drugs, toxins)
- e.g. skin cell untreated vs. treated
11Analyze Gene Expression
- We measure gene expression by analyzing the
genetic molecule, messenger RNA (mRNA)
- We are also often interested in measuring
proteins
12We Can Analyze RNA Content
- First, isolate mRNA from cells or tissues
13- Next, identify RNAs in the sample
- One to few RNAs at a time
- Multiple RNAs (high throughput techniques)
- Sometimes called global expression analysis
- Identify by hybridization (base-pairing)
- radioactive or enzyme-linked probe to the
immobilized RNA
- Probe is complementary to RNA of interest
- Called cDNA
14RNA Expression Analysis
- One type of analysis is called a Dot Blot:
samples are spotted onto a filter and then
hybridized with a labeled probe
So the sequence is used to generate the data
(via hybridization), but the data themselves are image
data. We scan the images to get intensities for
each spot.
15State of the Art High-Throughput Methods
- Multiple genes -> entire genome expression analyzed
at once!
- RNA (the transcriptome)
- DNA microarrays
- SAGE: serial analysis of gene expression
- MPSS: massively parallel signature sequencing
- Proteins (the proteome)
- protein arrays
- mass spectrometry
16Gene Expression Analysis
- Thousands of different mRNAs are present in a
given cell; together they make up the
transcriptional profile
- It is important to remember that when a gene
expression profile is analyzed in a given sample,
it is just a snapshot in time and space.
17Regulation of Gene Expression: How Do Different
Transcriptional Profiles Arise?
- Prokaryotes (e.g. bacteria)
- simple organisms; gene expression responsive
primarily to environment
- sets of genes are generally on or off
- functionally related genes are organized into
units called operons
- Eukaryotes
- Not just on/off and volume control but even more
complicated!
- complex multi-level control
18The Expression Snapshot
All of these mechanisms together determine which
RNAs/proteins are present in the snapshot.
19Gene Regulatory Networks
- Genes act in concert
- Interrelationships are complex!
- Scientists rarely get funded to study one gene
anymore!!
- We need ways to understand what the snapshot
means
20Gene Networks
- "The approach to biology for the past 30 years
has been to study individual proteins and genes
in isolation. The future will be the study of
the genes and proteins of organisms in the
context of their informational pathways or
networks."
- Leroy Hood, Director of the Institute for Systems
Biology, Nature, Oct. 19, 2000.
21Gene Networks: Some Examples
- Genes and their products are related through
their roles in
- metabolic pathways
- cell signalling networks
22Metabolic Pathway
23Cell Signalling Networks
www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif
24What can we learn by studying global patterns of
gene expression?
- Individual gene expression patterns
- Classifications for diagnosis, prediction
- Groups of Genes
- Molecular taxonomy of disease
- Gene Networks/Pathways
- Reconstruction of metabolic regulatory pathways
25Gene Expression Analysis
- Biotechnology
- High-Throughput techniques of Biochemistry/
Molecular Biology
- RNA or protein
- Informatics
- Management of large, complex data sets
- Data mining to gain useful information
26High Throughput Techniques
- We will only consider microarrays in detail today
27 Gene Expression Analysis Using Microarrays
Stages: Experimental Design, Array Production, Sample
Preparation, Scanning, Image Analysis, Data
Processing, Data Analysis, Information Integration
Where the work happens: Biology Wet Lab, Computer Workstation
28GeneChip vs Spotted Arrays
- GeneChip arrays use oligonucleotides
- Oligo arrays are built on a solid support
- Spotted arrays use nucleic acids made in
solution
- Solutions are then spotted onto a solid support
29What are cDNA (Spotted) Microarrays?
- A miniaturized, simultaneous version of the
traditional cDNA dot blot
- Enables massive gene expression profiling
- 10,000s of genes at once
- cDNA probes are amplified directly from
culture by PCR and purified
- Purified probes are printed onto coated glass
slides
30cDNA Microarray Expression Analysis
Duggan et al. (1999) Nature Genetics 21:11
31 ScanAlyze Image Analysis
- Spot Finding
- Background subtraction
- Intensity Calculation
32Image Analysis Software
- Freeware or Shareware
- ScanAlyze
- MAGIC (MicroArray Genome Imaging and Clustering
Tool)
- Spot
- Automated spot finding
33What do the spots represent?
- Fluorescence intensity is a measure of the
relative abundance of individual mRNAs
- Experimental relative to control
- expressed as a ratio from spotted arrays
(2-color)
- http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1801
- Have to be careful when comparing between arrays
from experiment to experiment.
34GeneChip Oligonucleotide Array
35GeneChip Expression Analysis: Hybridization and
Staining
[Figure labels: Array, Hybridized Array, cRNA Target,
Streptavidin-phycoerythrin conjugate]
Courtesy of M. Hessner, CAAGED Workshop
36Affymetrix Chips
- 300,000 probes
- Perfect Match and Mismatch
- Average Difference values
Courtesy of J. Glasner, CAAGED Workshop
37Current Problems Facing Expression Analysis on
the Biotech side
- Standardization and quality control in the
experiments (data quality at many levels)
- Cost
38Problem in reproducibility of experimental data
- Lots of variation in arrays
- more than 100 experimental steps
- Sources of variation
- biological variability in each RNA extract
- each labeling reaction is different
- each slide is a separate hybridization
- spots on the slide are variable across slides
(and within slides when double spotted)
- each color is scanned separately
- Need Replicates and Statistics!
39The Value of Replicates
- What is a replicate?
- Doing the same experiment more than once
- An experimental design issue
- How many?
- Most people don't do true replicates
- Why not?
- cost is primary
- limiting sample size
- ignorance of statistical considerations
- competition
40Outcome
- Noisy data
- Data preprocessing is necessary
- normalization
- scaling
- Heavy reliance on statistics today
41Pre-processing
- Gene filtering
- control genes
- uninformative genes
- Normalization and scaling
- allows comparisons across arrays
- scaling to control dynamic range
- Transformation
- logarithmic transformation for improved
statistical properties
42Normalization
Cy5 signal (log2)
Cy3 signal (log2)
43Outcomes of Microarray Analysis
- Large, complex data sets
- example of a routine study
- 50,000 genes from 20 samples -> approx. 1-2
x 10^6 pieces of data
- challenges for Bioinformatics
- annotation, storage, retrieval, sharing of data
- information from the data
44Current State of Microarray Data Availability
- Wide availability of technology has given rise to
a large number of distributed databases
- data scattered among many independent sites
(accessible via Internet) or not publicly
available at all
- Need for standardization!
45Public Repositories: Efforts Towards
Standardization
- GeneX at US National Center for Genome Resources
- ArrayExpress at European Bioinformatics Institute
- Gene Expression Omnibus at US National Center for
Biotechnology Information
- Stanford University Database
46Standardization of the biological databases is a
big issue
- A prime example of databases in need of
standardization: the gene expression databases
- Why?
- wide availability of technologies such as the
microarray has given rise to a large number of
heterogeneous, distributed databases
- differ in annotation, database structure,
availability
- standardization is necessary to enable scientists
to share and compare data
47MGED Group and Standardization Issues
- Microarray Gene Expression Database (MGED) Group
- www.mged.org
- MGED is taking on the challenge of
standardization
- Four major projects
48MGED Projects
- MIAME - The formulation of the minimum
information about a microarray experiment
required to interpret and verify the results.
- MAGE - The establishment of a data exchange
format (MAGE-ML) and object model (MAGE-OM) for
microarray experiments.
49MGED Projects
- Ontologies - The development of ontologies for
microarray experiment description and biological
material (biomaterial) annotation in particular.
- Normalization - The development of
recommendations regarding experimental controls
and data normalization methods.
50Some Basic Statistics
- dot product
- mean
- standard deviation
- log base 2
- etc.
- util.ss
51Gene chips
- Spots representing thousands of genes
- Two populations of cDNA
- different conditions to be compared
- One colored with Cy5 (red)
- One colored with Cy3 (green)
- Mixed, incubated with the chip
- Figures from Campbell-Heyer Chapter 4
52Red/Green Intensity measurements
- (define redgreens
    '((2345 2467) (3589 2158) (4109 1469) (1500 3589)
      (1246 1258) (1937 2104) (2561 1562) (2962 3012)
      (3585 1209) (2796 1005) (2170 4245) (1896 2996)
      (1023 3354) (1698 2896)))
- Shows (red green) intensities for 14 (out of
6200!) genes
53Should we normalize?
- Average of reds is 2386.9
- Average of greens is 2380.3
- What does John Quackenbush say? (page 420)
- Calculate standard deviations.
- Return to this issue
- For now, no normalization
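The channel averages and standard deviations asked for on this slide can be checked with a short sketch. The helpers `mean` and `standarddeviation` presumably come from util.ss, which is not shown, so the definitions below are assumptions (in particular, a population standard deviation that divides by n).

```scheme
;; Red/green intensities for the 14 genes on the previous slide.
(define redgreens
  '((2345 2467) (3589 2158) (4109 1469) (1500 3589) (1246 1258)
    (1937 2104) (2561 1562) (2962 3012) (3585 1209) (2796 1005)
    (2170 4245) (1896 2996) (1023 3354) (1698 2896)))

;; Assumed stand-ins for the util.ss helpers (not shown in the slides).
(define (mean l)
  (exact->inexact (/ (apply + l) (length l))))

(define (standarddeviation l)          ; population sd: divide by n
  (let ((m (mean l)))
    (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l)))))

(define reds   (map car redgreens))
(define greens (map cadr redgreens))

;; Channel means, then channel standard deviations.
(display (list (mean reds) (mean greens)))
(newline)
(display (list (standarddeviation reds) (standarddeviation greens)))
(newline)
```

The two means differ by less than 0.3%, which is consistent with the slide's decision to skip normalization for this small example.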
54Ratios of red values to green
- (define redgreenratios
    (map (lambda (x) (round2 (/ (car x) (cadr x))))
         redgreens))
- Produces (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98
2.97 2.78 0.51 0.63 0.31 0.59)
- Which genes are expressed more in red than green?
- Should these values be normalized?
55Yet another Color scheme
- (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78
0.51 0.63 0.31 0.59)
- Scale: highly expressed ... neutral ... less expressed
- > 2.0, > 1.3, close to 1.0, > 0.5, < 0.5
- Seems arbitrary?
- Log scale??
- Why oh why did they re-use red and green?
- Clustering? Meaning?
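One way to see why a log scale helps: on a log2 scale a ratio of 2.0 becomes +1 and a ratio of 0.5 becomes -1, so the "highly expressed" and "less expressed" cutoffs are symmetric about 0. The sketch below is only an illustration; the `label` procedure and its three category names are hypothetical, and it collapses the slide's five bins to three.

```scheme
(define redgreenratios
  '(0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78 0.51 0.63 0.31 0.59))

(define (log2 x) (/ (log x) (log 2)))

;; Hypothetical three-way call: a log2 ratio above +1 (ratio > 2.0) is "up",
;; below -1 (ratio < 0.5) is "down", anything else is "neutral".
(define (label r)
  (let ((lr (log2 r)))
    (cond ((> lr 1)  'up)
          ((< lr -1) 'down)
          (else      'neutral))))

(display (map label redgreenratios))
(newline)
;; prints (neutral neutral up down neutral neutral neutral neutral
;;         up up neutral neutral down neutral)
```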
56Larger experiment
- 12 Genes
- Expression values at 0, 2, 4, 6, 8 and 10 hours
57Table 4.2 of Campbell/Heyer
Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125
- Normalized how?
58Take logs
- C 0 3.0 3.58 4.0 3.58 3.0
- D 0 1.58 2.0 2.0 1.58 1.0
- E 0 2.0 3.0 3.0 3.0 3.0
- F 0 0 0 -2.0 -2.0 -3.32
- G 0 1.0 1.58 2.0 1.58 1.0
- H 0 -1.0 -1.6 -2.0 -1.6 -1.0
- I 0 2.0 3.0 2.0 0 -1.0
- J 0 1.0 0 1.0 0 1.0
- K 0 0 0 0 1.58 1.58
- L 0 1.0 1.58 2.0 1.58 1.0
- M 0 -1.6 -2.0 -2.0 -1.6 -1.0
- N 0 -3.0 -3.59 -4.0 -3.59 -3.0
- Compare
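The log table can be regenerated from Table 4.2 with a few lines. This is a sketch: `table42` is a hypothetical name for the raw table, `round2` is assumed to round to two decimals as it appears to in util.ss, and `logtable42` matches the name used later in the slides.

```scheme
;; Table 4.2 of Campbell/Heyer: gene name followed by six time points.
(define table42
  '((C 1 8 12 16 12 8)
    (D 1 3 4 4 3 2)
    (E 1 4 8 8 8 8)
    (F 1 1 1 0.25 0.25 0.1)
    (G 1 2 3 4 3 2)
    (H 1 0.5 0.33 0.25 0.33 0.5)
    (I 1 4 8 4 1 0.5)
    (J 1 2 1 2 1 2)
    (K 1 1 1 1 3 3)
    (L 1 2 3 4 3 2)
    (M 1 0.33 0.25 0.25 0.33 0.5)
    (N 1 0.125 0.0833 0.0625 0.0833 0.125)))

(define (log2 x) (/ (log x) (log 2)))

;; Assumed behaviour of the util.ss helper: round to two decimals.
(define (round2 x) (/ (round (* 100 x)) 100.0))

;; Keep the gene name, replace every value by its rounded log2.
(define logtable42
  (map (lambda (row)
         (cons (car row)
               (map (lambda (x) (round2 (log2 x))) (cdr row))))
       table42))

(for-each (lambda (row) (display row) (newline)) logtable42)
```

Doubling becomes +1 and halving becomes -1, so rows such as G and H, which are reciprocals of each other, turn into mirror images.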
59How Similar are two Rows?
- How similar are the expressions of two genes?
- First we'll normalize each row
- (define normalize   ; subtract mean and divide by sd
-   (lambda (l)
-     (let ((m (mean l))
-           (s (standarddeviation l)))
-       (map (lambda (x) (/ (- x m) s)) l))))
- What are the new mean and standard deviation?
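The question on this slide can be answered by running `normalize` on a row: the result has mean 0 and standard deviation 1 (up to floating-point rounding). In the sketch below, `mean` and `standarddeviation` are assumed stand-ins for the util.ss helpers, using the population form (divide by n).

```scheme
;; Assumed stand-ins for the util.ss helpers (not shown in the slides).
(define (mean l)
  (exact->inexact (/ (apply + l) (length l))))

(define (standarddeviation l)          ; population sd: divide by n
  (let ((m (mean l)))
    (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l)))))

(define normalize                      ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))

(define z (normalize '(1 2 3 4 3 2)))  ; row G

(display (mean z))                     ; approximately 0
(newline)
(display (standarddeviation z))        ; approximately 1
(newline)
```

A row with no variation (all values equal) has a standard deviation of 0, so it cannot be normalized this way and would need to be filtered out first.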
60How Similar are two Rows?
- Calculate the Pearson Correlation between pairs
of rows
- (define pc   ; pearson correlation
-   (lambda (xs ys)
-     (/ (dotproduct (normalize xs) (normalize ys))
-        (length xs))))
- > (pc '( 1 2 3 4 3 2 )    ; row G
-       '( 1 2 3 4 3 2 ))   ; row L
- 1.0
- > (pc '( 1 2 3 4 3 2 )    ; row G
-       '( 1 3 4 4 3 2 ))   ; row D
- 0.8971499589146109
61Some other pairs
- Table 4.2 of Campbell/Heyer (see slide 57)
- > (pc '( 1 3 4 4 3 2)            ; row D
-       '( 1 .33 .25 .25 .33 .5))  ; row M
- -0.9260278787295065
- > (pc '( 1 2 3 4 3 2)            ; row G
-       '( 1 .5 .33 .25 .33 .5))   ; row H
- -0.9090853650855358
62Correlation is sensitive to relative magnitudes
- pc(G,L) = 1 -- identically expressed genes
- pc(G,D) = .897 -- similarly expressed genes
- pc(D,M) = -.926 -- reciprocally expressed
- pc(G,H) = -.909 -- also reciprocally expressed
- What happens if, instead of using the expression
data, we use the log transforms?
- pc(G,L) = 1.0
- pc(G,D) = 0.939
- pc(D,M) = -1.0
- pc(G,H) = -1.0
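The comparison between raw and log-transformed correlations can be reproduced with the sketch below. The helper definitions are assumptions standing in for util.ss (population standard deviation, so that dividing the dot product by the length gives the Pearson correlation).

```scheme
;; Assumed stand-ins for the util.ss helpers (not shown in the slides).
(define (mean l)
  (exact->inexact (/ (apply + l) (length l))))
(define (standarddeviation l)          ; population sd: divide by n
  (let ((m (mean l)))
    (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l)))))
(define (dotproduct xs ys) (apply + (map * xs ys)))
(define (log2 x) (/ (log x) (log 2)))

(define (normalize l)
  (let ((m (mean l)) (s (standarddeviation l)))
    (map (lambda (x) (/ (- x m) s)) l)))

(define (pc xs ys)                     ; pearson correlation
  (/ (dotproduct (normalize xs) (normalize ys))
     (length xs)))

(define G '(1 2 3 4 3 2))
(define D '(1 3 4 4 3 2))
(define H '(1 0.5 0.33 0.25 0.33 0.5))

(display (pc G D)) (newline)                          ; about 0.897
(display (pc (map log2 G) (map log2 D))) (newline)    ; about 0.939
(display (pc G H)) (newline)                          ; about -0.909
(display (pc (map log2 G) (map log2 H))) (newline)    ; very close to -1
```

Because H is (approximately) the reciprocal of G, its log is the negative of G's log, which is why the log-transformed correlation goes to -1 while the raw one does not.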
63Hierarchical Clustering
- Repeat
- Replace the two closest objects by their
combination
- Until only one object remains
64What are the objects?
- (define objects
-   (map (lambda (x)
-          (cons
-            (symbol->string (car x))
-            (cdr x)))
-        logtable42))
- Initially, the objects are the genes with the log
transformed expression levels
- Typical object
- ("E" 0 2.0 3.0 3.0 3.0 3.0)
65Combining objects
- (define combine
-   (lambda (xs ys)
-     (cons (string-append (car xs) (car ys))
-           (map (lambda (x y) (/ (+ x y) 2.0))
-                (cdr xs) (cdr ys)))))
- combine names
- average the entries
- Typical combined pair
- ("EG" 0 1.5 2.29 2.5 2.29 2.0)
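The repeat-until loop of hierarchical clustering can be sketched on top of `combine`. The slides do not say which distance defines "closest", so Euclidean distance is an assumption here, and `distance`, `closest-pair`, `remove-obj` and `cluster` are hypothetical helper names. The input list is assumed to be non-empty.

```scheme
(define (combine xs ys)                ; as on the slide
  (cons (string-append (car xs) (car ys))
        (map (lambda (x y) (/ (+ x y) 2.0)) (cdr xs) (cdr ys))))

;; Assumed distance: Euclidean distance between the value vectors.
(define (distance xs ys)
  (sqrt (apply + (map (lambda (x y) (* (- x y) (- x y)))
                      (cdr xs) (cdr ys)))))

;; Scan all pairs and return the closest one as a two-element list.
(define (closest-pair objs)
  (let loop ((rest objs) (best #f) (bestd #f))
    (if (null? rest)
        best
        (let inner ((others (cdr rest)) (best best) (bestd bestd))
          (if (null? others)
              (loop (cdr rest) best bestd)
              (let ((d (distance (car rest) (car others))))
                (if (or (not bestd) (< d bestd))
                    (inner (cdr others) (list (car rest) (car others)) d)
                    (inner (cdr others) best bestd))))))))

(define (remove-obj o objs)            ; drop one object (by identity)
  (cond ((null? objs) '())
        ((eq? o (car objs)) (cdr objs))
        (else (cons (car objs) (remove-obj o (cdr objs))))))

;; Repeat: replace the two closest objects by their combination,
;; printing each merged name, until only one object remains.
(define (cluster objs)
  (if (null? (cdr objs))
      (car objs)
      (let* ((pair (closest-pair objs))
             (merged (combine (car pair) (cadr pair)))
             (rest (remove-obj (cadr pair) (remove-obj (car pair) objs))))
        (display (car merged)) (newline)
        (cluster (cons merged rest)))))

(define genes
  (list (list "G" 0 1.0 1.58 2.0 1.58 1.0)
        (list "L" 0 1.0 1.58 2.0 1.58 1.0)
        (list "D" 0 1.58 2.0 2.0 1.58 1.0)
        (list "H" 0 -1.0 -1.6 -2.0 -1.6 -1.0)
        (list "M" 0 -1.6 -2.0 -2.0 -1.6 -1.0)))

(display (car (cluster genes)))        ; merges GL, GLD, HM, then HMGLD
(newline)
```

Note that `combine` averages the two objects with equal weight, so a merged cluster of several genes counts the same as a single gene at the next merge. Other linkage rules weight by cluster size.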
66Manual Hierarchical Clustering
67K-means Clustering -- Lloyd's Algorithm
- Partition data into k clusters
- REPEAT
- FOR each datapoint
- Calculate its distance to the centroid of each
cluster
- IF this is minimal for its own cluster
- Leave the datapoint in its current cluster
- ELSE
- Place it in its closest cluster
- UNTIL no datapoint is moved
- Goal: minimize the sum of squared distances from
datapoints to their centroids
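A minimal sketch of the loop above, under two stated departures from the slide: it starts from initial centroids rather than an initial partition, and it reassigns all datapoints in one batch before recomputing centroids. All procedure names are hypothetical.

```scheme
(define (sqdist p q)                   ; squared Euclidean distance
  (apply + (map (lambda (a b) (* (- a b) (- a b))) p q)))

;; Index of the centroid nearest to point p.
(define (nearest p centroids)
  (let loop ((cs (cdr centroids)) (i 1)
             (best 0) (bestd (sqdist p (car centroids))))
    (if (null? cs)
        best
        (let ((d (sqdist p (car cs))))
          (if (< d bestd)
              (loop (cdr cs) (+ i 1) i d)
              (loop (cdr cs) (+ i 1) best bestd))))))

;; Coordinate-wise mean of a non-empty list of points.
(define (centroid pts)
  (let ((n (length pts)))
    (apply map (lambda xs (/ (apply + xs) n)) pts)))

;; Points whose cluster label equals i.
(define (members i points labels)
  (cond ((null? points) '())
        ((= i (car labels))
         (cons (car points) (members i (cdr points) (cdr labels))))
        (else (members i (cdr points) (cdr labels)))))

;; Recompute each centroid; an empty cluster keeps its old centroid.
(define (update centroids points labels)
  (let loop ((cs centroids) (i 0))
    (if (null? cs)
        '()
        (let ((ms (members i points labels)))
          (cons (if (null? ms) (car cs) (centroid ms))
                (loop (cdr cs) (+ i 1)))))))

;; REPEAT assign / recompute UNTIL no datapoint changes cluster.
(define (kmeans points centroids)
  (let loop ((centroids centroids) (labels #f))
    (let ((new-labels (map (lambda (p) (nearest p centroids)) points)))
      (if (equal? new-labels labels)
          labels
          (loop (update centroids points new-labels) new-labels)))))

(define points                         ; log rows G, D, H, M
  '((0 1.0 1.58 2.0 1.58 1.0)
    (0 1.58 2.0 2.0 1.58 1.0)
    (0 -1.0 -1.6 -2.0 -1.6 -1.0)
    (0 -1.6 -2.0 -2.0 -1.6 -1.0)))

;; Start from rows G and H as the two initial centroids.
(display (kmeans points (list (car points) (caddr points))))
(newline)                              ; prints (0 0 1 1)
```

Unlike the point-by-point version on the slide, this batch variant can leave a cluster empty. The sketch handles that by keeping the old centroid, which is one common convention among several.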
68Analysis of k-means clustering
- There are always exactly k clusters
- No cluster is empty (why?)
- The clusters are not hierarchical
- The clusters do not overlap
- Run time with n datapoints
- Partitioning O(n)
- FOR loop is O(nk)
- REPEAT loop is ???
- Kanungo et al.
69Pro and Con
- Pro
- With small k, may be faster than hierarchical
- Clusters may be tighter
- Con
- Sensitive to initial choice of k
- Sensitive to initial partition
- May converge to local, rather than global minimum
- Not clear how good resulting clusters are
70Other Methods for Clustering
- Self Organizing Maps
- SWARM technology
- SOM/SWARM hybrid
71Mining of Expression Data
- A gene expression pattern derived from a single
microarray is simply a snapshot (one
experimental sample vs reference)
- Usually want to understand a process or changes
in expression over a collection of samples
- -> gene expression profile
72Three levels of microarray gene expression data
processing
Brazma et al., Nature Genetics, 29:365-371, 2001
73Goal of Analysis of Expression Matrix
- Some statistical methods are applied to
- Group similar genes together -> groups of
functionally similar genes.
- Group similar cell samples together.
- Extract representative genes in each group.
74Typical approach
- Look for patterns
- compare rows to find evidence for co-regulation
of genes
- compare columns to find evidence for relatedness
among samples
- 1) Choose a measure of similarity (distance)
among the objects being compared; each row or
column is considered a vector in space
- 2) Then, group together objects (genes or
samples) with similar properties; this is a
multidimensional analysis
75Pattern Recognition
- Clustering
- Feature extraction/selection
- Classification-discrimination analysis
76Analytic Approaches
- Clustering: identification of associations
between data points; organization of data into
groups; exploratory analysis
- Clustering Algorithms
- Hierarchical
- K-means
- Self-organizing maps
- Others
77[Figure: Gene Expression Matrix and Hierarchical Clustering;
rows are genes, columns are samples]
Eisen et al. http://www.pnas.org/cgi/content/full/95/25/14863
78Feature Selection and Classification
- First, identify features (genes) that
discriminate between classes
- Then use features for classification
- machine learning approach
- supervised analysis
- assignment of a new sample (pattern) to a
previously specified class, based on sample
features and a trained classifier
79Classic Example: Classification of AML vs. ALL
- Biological/Clinical Problems
- previously, no single reliable test to
distinguish them
- differ greatly in clinical course and response to
treatments
- Comparing 2 acute leukemias
- acute myeloid leukemia (AML)
- acute lymphoid leukemia (ALL)
Golub et al., Science Oct 15 1999 531-537
80 Study Design
Golub et al., Science Oct 15 1999 531-537
81Results of the study
- 1) Clustering of microarray data using tumors of
known type
- -> found 1100 of 6817 genes correlated with class
distinction
- 2) Formation of a class predictor: 50 most
informative genes used as a training set
- -> classification of unknown tumors
Golub et al., Science Oct 15 1999 531-537
82The prediction of a new sample is based on
'weighted votes' of a set of informative genes
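A rough sketch of weighted voting, following the scheme as it is usually described for Golub et al.: each informative gene votes with weight (mu1 - mu2)/(sd1 + sd2), scaled by how far the new sample's expression lies from the midpoint (mu1 + mu2)/2. The gene statistics and the sample below are made-up toy numbers, and all procedure names are hypothetical.

```scheme
;; A gene is (mu1 sd1 mu2 sd2): its mean and standard deviation in the
;; training samples of class 1 and class 2.
(define (weight g)
  (/ (- (car g) (caddr g)) (+ (cadr g) (cadddr g))))
(define (boundary g)
  (/ (+ (car g) (caddr g)) 2))

;; Positive vote favours class 1, negative vote favours class 2.
(define (vote g x)
  (* (weight g) (- x (boundary g))))

(define (keep pred l)                  ; simple filter
  (cond ((null? l) '())
        ((pred (car l)) (cons (car l) (keep pred (cdr l))))
        (else (keep pred (cdr l)))))

;; Returns (winning-class prediction-strength).
;; Assumes at least one non-zero vote.
(define (predict genes sample)
  (let* ((votes (map vote genes sample))
         (v1 (apply + (keep positive? votes)))
         (v2 (- (apply + (keep negative? votes))))
         (strength (/ (abs (- v1 v2)) (+ v1 v2))))
    (list (if (>= v1 v2) 'class1 'class2)
          (exact->inexact strength))))

;; Toy data (not from the study): three genes, one new sample.
(define toy-genes '((10 1 4 2) (3 1 9 1) (5 2 7 2)))
(define toy-sample '(9 4 8))

(display (predict toy-genes toy-sample))
(newline)
```

In this toy case the votes are 4, 6 and -1, so class 1 wins with a prediction strength of 9/11. A low strength would suggest leaving the sample unclassified.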
84Free Software for Microarray Analysis
- Cluster and TreeView
- Michael Eisen
- http://rana.lbl.gov/EisenSoftware.htm
85More Free Software
- Expression Profiler (European Bioinformatics
Inst)
- http://ep.ebi.ac.uk/
- GeneX (National Center for Genome Resources)
- http://www.ncgr.org/genex/index.html
- ArrayViewer and MEV (TIGR)
- http://www.tigr.org/software/
- Many More!!
86Microarray Analysis Software
- Popular commercial packages
- Spotfire DecisionSite (Spotfire, Inc)
- GeneSpring (Silicon Genetics)
- Affymetrix Microarray Suite
- Affymetrix Data Mining Tool (Affymetrix, Inc.)
- Rosetta Inpharmatics' Resolver
87Why cluster analysis may not be the answer
- Clustering methods typically require user inputs
- Example: distance measure
- Clustering methods differ in the way that the
number of clusters is specified.
- Clustering methods are often sensitive to the
initialization condition (starting guess)
- Local vs. global sampling of clustering space
88Cluster Analysis Challenges
- Noise in the data itself
- Large data sets
- most of the techniques currently used were not
developed for multidimensional data
- What about networks?
- limitation of cluster analysis: similarity in
expression pattern suggests co-regulation but
doesn't reveal cause-effect relationships
89Using information networks as an interpretive
layer between phenotypes and the underlying
genes, proteins and metabolites
Highly connected genes are often critical in the
onset of cancer and metabolic diseases. However,
drug treatment targeting less connected genes
will have fewer side effects.
J. Blanchard-CAAGED Workshop 2002
90Other Analytic Approaches are Being Explored to
Reverse Engineer Networks
- Bayesian Networks
- represent the dependence structure between
multiple interacting quantities (e.g. expression
levels of genes)
- gene interactions: models of causal influence
- good for noisy data
91Analytic Challenges
- Advanced methods may require significant
computational resources
- Numerically complex calculations (large
correlation matrices)
- Combinatorially large search spaces (Bayesian
nets)
- Many training cycles (neural nets)
- Global optimization (genetic algorithms)
- Archiving, indexing, and correlation of large
datasets (data extraction, data mining, and
visualization)
S. Atlas CAAGED Workshop
92Take-home Messages
- With current technology, global gene expression
studies are best used for hypothesis building
- Other experimental methods and smaller data sets
are needed to address the reproducibility problem
- New analytical approaches are needed to deal with
the multidimensional data
- Need for high performance computing
- Need for Bio/CS/IT/Stats people to work together!