Analysis of Gene Expression, Anne R. Haake and Rhys Price Jones

Slides: 93
Provided by: AnneH72
Learn more at: https://www.cs.rit.edu
Transcript and Presenter's Notes

1
Analysis of Gene Expression
Anne R. Haake
Rhys Price Jones
2
Gene Expression Analysis
  • Connecting structural genomics to functional
    genomics

3
How do we relate gene identity to cell
physiology, disease, and drug discovery?
  • Functional Genomics
  • development and application of global
    (genome-wide or system-wide) experimental
    approaches to assess gene function by making use
    of the information and reagents provided by
    structural genomics

4
High Throughput Systems for Studying Global Gene
Expression are Complex
  • Need to consider
  • the biology behind the experiments and the
    interpretation of the experiments
  • advancements in biotechnology
  • the computing issues

5
The Flow of Information
  • A gene is expressed in 2 steps
  • DNA is transcribed into RNA (mRNA)
  • RNA is translated into protein

6
Genotype to Phenotype
  • Individual cells in an organism have the same
    genes (DNA)
  • the genotype
  • but... not all genes are active (expressed) in
    each cell
  • It is the expression of thousands of genes and
    their products (RNA, proteins), functioning in a
    complicated and orchestrated way, that makes a
    specific cell what it is.
  • the phenotype

7
The Flow of Information
8
Gene Expression Depends on Context
  • The subsets of genes that are expressed
    (RNA/protein) will differ among cells, tissues,
    organs, conditions
  • the subset expressed confers unique properties to
    the cell

(Figure labels: neuron, liver, muscle)
9
Differential Gene Expression
  • The level of expression of genes also differs
    with the cellular context
  • i.e. the amount of a given RNA will vary
  • We can think of gene expression in eukaryotes as
    having both an on/off switch and volume
    control

10
Specific Patterns of Gene Expression
  • Tissue/Cell type-specific
  • -e.g. skin cell vs. brain cell
  • -e.g. keratinocyte vs. melanocyte
  • Developmental stage
  • -e.g. embryonic skin cell vs. adult skin cell
  • Disease state
  • -e.g. normal skin cell vs. skin tumor cell
  • Environment-specific (drugs, toxins)
  • -e.g. skin cell untreated vs. treated

11
Analyze Gene Expression
  • We measure gene expression by analyzing the
    genetic molecule, messenger RNA (mRNA)
  • We also often are interested in measuring
    proteins

12
We Can Analyze RNA Content
  • First, isolate mRNA from cells or tissues

13
  • Next, identify RNAs in the sample
  • One to few RNAs at a time
  • Multiple RNAs (high throughput techniques)
  • Sometimes called global expression analysis
  • Identify by hybridization (base-pairing)
  • radioactive or enzyme-linked probe to the
    immobilized RNA
  • Probe is complementary to RNA of interest
  • Called cDNA

14
RNA Expression Analysis
  • One type of analysis is called a Dot Blot

samples are spotted onto filter and then
hybridized with labeled probe
So, the sequence is used to generate the data
(via hybridization) but the data itself is image
data. We scan the images to get intensities for
each spot.
15
State of the Art High-Throughput Methods
  • Multiple genes → entire genome expression analyzed
    at once!
  • RNA (the transcriptome)
  • DNA microarrays
  • SAGE: serial analysis of gene expression
  • MPSS: massively parallel signature sequencing
  • Proteins (the proteome)
  • protein arrays
  • mass spectrometry

16
Gene Expression Analysis
  • Thousands of different mRNAs are present in a
    given cell; together they make up the
    transcriptional profile
  • It is important to remember that when a gene
    expression profile is analyzed in a given sample,
    it is just a snapshot in time and space.

17
Regulation of Gene Expression: How Do Different
Transcriptional Profiles Arise?
  • Prokaryotes (e.g. bacteria)
  • simple organisms; gene expression responds
    primarily to the environment
  • sets of genes are generally on or off
  • functionally related genes are organized into
    units called Operons
  • Eukaryotes
  • Not just on/off and volume control but even more
    complicated!
  • complex multi-level control

18
The Expression Snapshot
All of these mechanisms together determine which
RNAs/proteins are present in the snapshot.
19
Gene Regulatory Networks
  • Genes act in concert
  • Interrelationships are complex!
  • Scientists rarely get funded to study one gene
    anymore!!
  • We need ways to understand what the snapshot
    means

20
Gene Networks
  • "The approach to biology for the past 30 years
    has been to study individual proteins and genes
    in isolation. The future will be the study of
    the genes and proteins of organisms in the
    context of their informational pathways or
    networks."
  • Leroy Hood, Director of the Institute for Systems
    Biology, Nature, Oct. 19, 2000.

21
Gene Networks: Some Examples
  • Genes and their products are related through
    their roles in
  • metabolic pathways
  • cell signalling networks

22
Metabolic Pathway
23
Cell Signalling Networks
www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif
24
What can we learn by studying global patterns of
gene expression?
  • Individual gene expression patterns
  • Classifications for diagnosis, prediction
  • Groups of Genes
  • Molecular taxonomy of disease
  • Gene Networks/Pathways
  • Reconstruction of metabolic and regulatory pathways

25
Gene Expression Analysis
  • Biotechnology
  • High-Throughput techniques of Biochemistry/
    Molecular Biology
  • RNA or protein
  • Informatics
  • Management of large, complex data sets
  • Data mining to gain useful information

26
High Throughput Techniques
  • We will only consider microarrays in detail today

27
Gene Expression Analysis Using Microarrays
  • Workflow: Experimental Design → Array Production →
    Sample Preparation → Scanning → Image Analysis →
    Data Processing → Data Analysis → Information
    Integration
  • Figure labels: Biology, Wet Lab, Computer
    Workstation
28
GeneChip vs Spotted Arrays
  • GeneChip Arrays use oligonucleotides
  • Oligo arrays are built on a solid support
  • Spotted arrays utilize nucleic acids made in
    solution
  • Solutions are then spotted onto a solid support

29
What are cDNA (Spotted) Microarrays?
  • A miniaturized, simultaneous version of the
    traditional cDNA dot blot
  • Enables massive gene expression profiling
  • 10,000s of genes at once
  • cDNA probes are amplified directly from culture
    by PCR and purified
  • Purified probes are printed onto coated glass
    slides

30
cDNA Microarray Expression Analysis
Duggan et al. (1999) Nature Genetics 21:11
31
ScanAlyze: Image Analysis
  1. Spot Finding
  2. Background subtraction
  3. Intensity Calculation

32
Image Analysis Software
  • Freeware or Shareware
  • ScanAlyze
  • MAGIC (MicroArray Genome Imaging and Clustering
    Tool)
  • Spot
  • Automated spot finding

33
What do the spots represent?
  • Fluorescence intensity is a measure of the
    relative abundance of individual mRNAs
  • Experimental relative to control
  • expressed as a ratio from spotted arrays
    (2-color)
  • http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1801
  • Have to be careful when comparing between arrays
    from experiment to experiment.

34
GeneChip Oligonucleotide Array
35
GeneChip Expression Analysis: Hybridization and
Staining
Array
Hybridized Array
cRNA Target
Streptavidin-phycoerythrin conjugate
Courtesy of M. Hessner, CAAGED Workshop
36
Affymetrix Chips
300,000 Probes
Perfect Match and Mismatch
Average Difference Values
Courtesy of J. Glasner, CAAGED Workshop
37
Current Problems Facing Expression Analysis on
the Biotech side
  • Standardization and quality control in the
    experiments (data quality at many levels)
  • Cost

38
Problem in reproducibility of experimental data
  • Lots of variation in arrays
  • more than 100 experimental steps
  • Sources of variation
  • biological variability in each RNA extract
  • each labeling reaction is different
  • each slide is a separate hybridization
  • spots on the slide are variable across slides
    (and within slides when double spotted)
  • each color is scanned separately
  • Need Replicates and Statistics!

39
The Value of Replicates
  • What is a replicate?
  • Doing the same experiment more than once
  • An experimental design issue
  • How many?
  • Most people don't do true replicates
  • Why not?
  • cost is primary
  • limiting sample size
  • ignorance of statistical considerations
  • competition

40
Outcome
  • Noisy data
  • Data preprocessing is necessary
  • normalization
  • scaling
  • Heavy reliance on statistics today

41
Pre-processing
  • Gene filtering
  • control genes
  • uninformative genes
  • Normalization and scaling
  • allows comparisons across arrays
  • scaling to control dynamic range
  • Transformation
  • logarithmic transformation for improved
    statistical properties

42
Normalization
Cy5 signal (log2)
Cy3 signal (log2)
43
Outcomes of Microarray Analysis
  • Large, complex data sets
  • example of a routine study
  • 50,000 genes from 20 samples → approx. 1-2
    x 10^6 pieces of data
  • challenges for Bioinformatics
  • annotation, storage, retrieval, sharing of data
  • information from the data

44
Current State of Microarray Data Availability
  • Wide availability of technology has given rise to
    a large number of distributed databases
  • data scattered among many independent sites
    (accessible via Internet) or not publicly
    available at all
  • Need for standardization!

45
Public Repositories: Efforts Towards
Standardization
  • GeneX at US National Center for Genome Resources
  • ArrayExpress at European Bioinformatics Institute
  • Gene Expression Omnibus at US National Center for
    Biotechnology Information
  • Stanford University Database

46
Standardization of the biological databases is a
big issue
  • A prime example of databases in need of
    standardization: the gene expression databases
  • Why?
  • wide availability of technologies such as the
    microarray has given rise to a large number of
    heterogeneous, distributed databases
  • differ in annotation, database structure,
    availability
  • standardization is necessary to enable scientists
    to share and compare data

47
MGED Group and Standardization Issues
  • Microarray Gene Expression Database (MGED) Group
  • www.mged.org
  • MGED is taking on the challenge of
    standardization
  • Four major projects

48
MGED Projects
  • MIAME - The formulation of the minimum
    information about a microarray experiment
    required to interpret and verify the results.
  • MAGE - The establishment of a data exchange
    format (MAGE-ML) and object model (MAGE-OM) for
    microarray experiments.

49
MGED Projects
  • Ontologies - The development of ontologies for
    microarray experiment description and biological
    material (biomaterial) annotation in particular.
  • Normalization - The development of
    recommendations regarding experimental controls
    and data normalization methods.

50
Some Basic Statistics
  • dot product
  • mean
  • standard deviation
  • log base 2
  • etc.
  • util.ss
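
The contents of util.ss are not shown on the slides. As a minimal sketch, assuming helper names that match the later slides (dotproduct, mean, standarddeviation) plus a hypothetical log2, the listed statistics could look like this. The population form of the standard deviation (divide by n) is an assumption, chosen because it makes the later Pearson correlation of a row with itself come out as 1.0.

```scheme
; Hypothetical stand-ins for the helpers in util.ss (not the original file).

(define dotproduct                 ; sum of pairwise products of two lists
  (lambda (xs ys) (apply + (map * xs ys))))

(define mean                       ; arithmetic mean as an inexact number
  (lambda (l) (/ (apply + l) (* 1.0 (length l)))))

(define standarddeviation          ; population standard deviation (divide by n)
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l))))))

(define log2                       ; logarithm base 2 via natural logs
  (lambda (x) (/ (log x) (log 2))))

(display (dotproduct '(1 2 3) '(4 5 6)))          (newline) ; 1*4 + 2*5 + 3*6 = 32
(display (mean '(1 2 3 4)))                       (newline) ; 2.5
(display (standarddeviation '(2 4 4 4 5 5 7 9)))  (newline) ; 2.0
```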

51
Gene chips
  • Spots representing thousands of genes
  • Two populations of cDNA
  • different conditions to be compared
  • One colored with Cy5 (red)
  • One colored with Cy3 (green)
  • Mixed, incubated with the chip
  • Figures from Campbell-Heyer Chapter 4

52
Red/Green Intensity measurements
  • (define redgreens '((2345 2467) (3589 2158)
    (4109 1469) (1500 3589) (1246 1258) (1937 2104)
    (2561 1562) (2962 3012) (3585 1209) (2796 1005)
    (2170 4245) (1896 2996) (1023 3354) (1698 2896)))
  • Shows (red green) intensities for 14 (out of
    6200!) genes

53
Should we normalize?
  • Average of reds is 2386.9
  • Average of greens is 2380.3
  • What does John Quackenbush say? (page 420)
  • Calculate standard deviations.
  • Return to this issue
  • For now, no normalization
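
The two channel averages can be checked with a short sketch. It repeats the redgreens list from the previous slide; mean, standarddeviation and round1 are assumed helpers (not taken from util.ss), and the standard deviation is the population form.

```scheme
(define redgreens '((2345 2467) (3589 2158) (4109 1469) (1500 3589)
                    (1246 1258) (1937 2104) (2561 1562) (2962 3012)
                    (3585 1209) (2796 1005) (2170 4245) (1896 2996)
                    (1023 3354) (1698 2896)))

(define mean                       ; assumed helper: arithmetic mean
  (lambda (l) (/ (apply + l) (* 1.0 (length l)))))

(define standarddeviation          ; assumed helper: population standard deviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l))))))

(define round1                     ; hypothetical helper: round to one decimal place
  (lambda (x) (/ (round (* 10.0 x)) 10.0)))

(define reds   (map car redgreens))   ; first entry of each pair
(define greens (map cadr redgreens))  ; second entry of each pair

(display (round1 (mean reds)))   (newline)  ; 2386.9
(display (round1 (mean greens))) (newline)  ; 2380.3
(display (round1 (standarddeviation reds)))   (newline)
(display (round1 (standarddeviation greens))) (newline)
```

The means are nearly equal, which is the slide's argument for skipping normalization for now; the standard deviations show how widely each channel is spread around that mean.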

54
Ratios of red values to green
  • (define redgreenratios (map (lambda (x)
    (round2 (/ (car x) (cadr x)))) redgreens))
  • Produces (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98
    2.97 2.78 0.51 0.63 0.31 0.59)
  • Which genes are expressed more in red than green?
  • Should these values be normalized?
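
The slide uses round2 without defining it. A minimal runnable sketch, assuming round2 rounds to two decimal places, reproduces the ratios and picks out the genes whose ratio exceeds 1 (more red than green). The name more-red is hypothetical.

```scheme
(define redgreens '((2345 2467) (3589 2158) (4109 1469) (1500 3589)
                    (1246 1258) (1937 2104) (2561 1562) (2962 3012)
                    (3585 1209) (2796 1005) (2170 4245) (1896 2996)
                    (1023 3354) (1698 2896)))

(define round2                     ; assumed definition: round to two decimals
  (lambda (x) (/ (round (* 100.0 x)) 100.0)))

(define redgreenratios             ; red/green ratio for each gene
  (map (lambda (x) (round2 (/ (car x) (cadr x)))) redgreens))

(define more-red                   ; hypothetical: keep only ratios above 1
  (lambda (ratios)
    (cond ((null? ratios) '())
          ((> (car ratios) 1) (cons (car ratios) (more-red (cdr ratios))))
          (else (more-red (cdr ratios))))))

(display redgreenratios) (newline)
; (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78 0.51 0.63 0.31 0.59)
(display (more-red redgreenratios)) (newline)
; (1.66 2.8 1.64 2.97 2.78)
```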

55
Yet another Color scheme
  • (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78
    0.51 0.63 0.31 0.59)
  • Highly expressed / Neutral / Less expressed
  • >2.0, >1.3, close to 1.0, >0.5, <0.5
  • Seems arbitrary?
  • Log scale??
  • Why oh why did they re-use red and green?
  • Clustering? Meaning?
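
One way to make the thresholds less arbitrary is to work on a log scale, where a two-fold increase (ratio 2.0) and a two-fold decrease (ratio 0.5) become +1 and -1. The sketch below assumes a two-fold cutoff in either direction; the cutoff and the names log2 and classify are illustrative choices, not the slide's color scheme.

```scheme
(define log2                       ; logarithm base 2
  (lambda (x) (/ (log x) (log 2))))

(define ratios '(0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98
                 2.97 2.78 0.51 0.63 0.31 0.59))

(define classify                   ; hypothetical rule: two-fold change either way
  (lambda (r)
    (let ((lr (log2 r)))
      (cond ((>= lr 1.0)  'up)       ; ratio of 2.0 or more
            ((<= lr -1.0) 'down)     ; ratio of 0.5 or less
            (else 'neutral)))))

(display (map classify ratios)) (newline)
; (neutral neutral up down neutral neutral neutral neutral
;  up up neutral neutral down neutral)
```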

56
Larger experiment
  • 12 Genes
  • Expression values at 0, 2, 4, 6, 8 and 10 hours

57
Table 4.2 of Campbell/Heyer
  • Name   0 hrs   2 hrs   4 hrs   6 hrs   8 hrs   10 hrs
    C      1       8       12      16      12      8
    D      1       3       4       4       3       2
    E      1       4       8       8       8       8
    F      1       1       1       .25     .25     .1
    G      1       2       3       4       3       2
    H      1       .5      .33     .25     .33     .5
    I      1       4       8       4       1       .5
    J      1       2       1       2       1       2
    K      1       1       1       1       3       3
    L      1       2       3       4       3       2
    M      1       .33     .25     .25     .33     .5
    N      1       .125    .0833   .0625   .0833   .125
  • Normalized how?

58
Take logs
  • C 0 3.0 3.58 4.0 3.58 3.0
  • D 0 1.58 2.0 2.0 1.58 1.0
  • E 0 2.0 3.0 3.0 3.0 3.0
  • F 0 0 0 -2.0 -2.0 -3.32
  • G 0 1.0 1.58 2.0 1.58 1.0
  • H 0 -1.0 -1.6 -2.0 -1.6 -1.0
  • I 0 2.0 3.0 2.0 0 -1.0
  • J 0 1.0 0 1.0 0 1.0
  • K 0 0 0 0 1.58 1.58
  • L 0 1.0 1.58 2.0 1.58 1.0
  • M 0 -1.6 -2.0 -2.0 -1.6 -1.0
  • N 0 -3.0 -3.59 -4.0 -3.59 -3.0
  • Compare
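
The log table can be reproduced from Table 4.2 one row at a time. This is a sketch only: log2, round2 and logrow are assumed helper names, and each row is taken to be a gene name followed by its six expression values.

```scheme
(define log2                       ; logarithm base 2
  (lambda (x) (/ (log x) (log 2))))

(define round2                     ; round to two decimal places
  (lambda (x) (/ (round (* 100.0 x)) 100.0)))

(define logrow                     ; keep the name, take log2 of every value
  (lambda (row)
    (cons (car row)
          (map (lambda (x) (round2 (log2 x))) (cdr row)))))

(display (logrow '(C 1 8 12 16 12 8)))                      (newline)
(display (logrow '(N 1 .125 .0833 .0625 .0833 .125)))       (newline)
```

Rows C and N come out as mirror images of each other (about 3.58 against -3.59 at 4 hours), which is much easier to see than in the raw ratios 12 and .0833.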

59
How Similar are two Rows?
  • How similar are the expressions of two genes?
  • First we'll normalize each row
  • (define normalize ; subtract mean and divide by sd
      (lambda (l)
        (let ((m (mean l))
              (s (standarddeviation l)))
          (map (lambda (x) (/ (- x m) s)) l))))
  • What are the new mean and standard deviation?
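
A quick check answers the question: after normalizing, the mean is 0 and the standard deviation is 1, up to floating-point error. The sketch assumes mean and standarddeviation behave as below (population standard deviation); they stand in for the util.ss versions.

```scheme
(define mean                       ; assumed helper: arithmetic mean
  (lambda (l) (/ (apply + l) (* 1.0 (length l)))))

(define standarddeviation          ; assumed helper: population standard deviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l))))))

(define normalize                  ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))

(define g-normalized (normalize '(1 2 3 4 3 2)))   ; row G

(display (mean g-normalized))              (newline) ; approximately 0
(display (standarddeviation g-normalized)) (newline) ; approximately 1
```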

60
How Similar are two Rows?
  • Calculate the Pearson Correlation between pairs
    of rows
  • (define pc ; Pearson correlation
      (lambda (xs ys)
        (/ (dotproduct (normalize xs) (normalize ys))
           (length xs))))
  • > (pc '(1 2 3 4 3 2)    ; row G
          '(1 2 3 4 3 2))   ; row L
    1.0
  • > (pc '(1 2 3 4 3 2)    ; row G
          '(1 3 4 4 3 2))   ; row D
    0.8971499589146109
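
For readers without util.ss, here is a self-contained sketch of the same calculation. The helper definitions are assumptions; in particular the population standard deviation is used, which is what makes dividing the dot product by (length xs) give exactly the Pearson correlation.

```scheme
(define dotproduct
  (lambda (xs ys) (apply + (map * xs ys))))

(define mean
  (lambda (l) (/ (apply + l) (* 1.0 (length l)))))

(define standarddeviation          ; population form (divide by n)
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l))))))

(define normalize                  ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))

(define pc                         ; Pearson correlation
  (lambda (xs ys)
    (/ (dotproduct (normalize xs) (normalize ys))
       (length xs))))

(display (pc '(1 2 3 4 3 2) '(1 2 3 4 3 2))) (newline) ; rows G and L: about 1.0
(display (pc '(1 2 3 4 3 2) '(1 3 4 4 3 2))) (newline) ; rows G and D: about 0.897
```

Note that a constant row has standard deviation 0, so normalize would divide by zero; none of the rows in Table 4.2 is constant.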

61
Some other pairs
  • Name   0 hrs   2 hrs   4 hrs   6 hrs   8 hrs   10 hrs
    C      1       8       12      16      12      8
    D      1       3       4       4       3       2
    E      1       4       8       8       8       8
    F      1       1       1       .25     .25     .1
    G      1       2       3       4       3       2
    H      1       .5      .33     .25     .33     .5
    I      1       4       8       4       1       .5
    J      1       2       1       2       1       2
    K      1       1       1       1       3       3
    L      1       2       3       4       3       2
    M      1       .33     .25     .25     .33     .5
    N      1       .125    .0833   .0625   .0833   .125
  • > (pc '(1 3 4 4 3 2)             ; row D
          '(1 .33 .25 .25 .33 .5))   ; row M
    -0.9260278787295065
  • > (pc '(1 2 3 4 3 2)             ; row G
          '(1 .5 .33 .25 .33 .5))    ; row H
    -0.9090853650855358

62
Correlation is sensitive to relative magnitudes
  • pc(G,L) = 1 -- identically expressed genes
  • pc(G,D) = .897 -- similarly expressed genes
  • pc(D,M) = -.926 -- reciprocally expressed
  • pc(G,H) = -.909 -- also reciprocally expressed
  • What happens if, instead of using the expression
    data, we use the log transforms?
  • pc(G,L) = 1.0
  • pc(G,D) = 0.939
  • pc(D,M) = -1.0
  • pc(G,H) = -1.0
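
Why do the reciprocal pairs move to -1.0? Row M is (approximately) 1/D at every time point, and log2(1/x) = -log2(x), so the log rows are exact negatives of each other up to the rounding in the table. A sketch, with all helper definitions assumed as before:

```scheme
(define dotproduct (lambda (xs ys) (apply + (map * xs ys))))
(define mean (lambda (l) (/ (apply + l) (* 1.0 (length l)))))
(define standarddeviation          ; population form (divide by n)
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (mean (map (lambda (x) (* (- x m) (- x m))) l))))))
(define normalize
  (lambda (l)
    (let ((m (mean l)) (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))
(define pc                         ; Pearson correlation
  (lambda (xs ys)
    (/ (dotproduct (normalize xs) (normalize ys)) (length xs))))
(define log2 (lambda (x) (/ (log x) (log 2))))

(define logpc                      ; hypothetical: correlation of log-transformed rows
  (lambda (xs ys) (pc (map log2 xs) (map log2 ys))))

(define row-d '(1 3 4 4 3 2))
(define row-g '(1 2 3 4 3 2))
(define row-m '(1 .33 .25 .25 .33 .5))

(display (logpc row-d row-m)) (newline)  ; very close to -1
(display (logpc row-g row-d)) (newline)  ; close to 0.94
```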

63
Hierarchical Clustering
  • Repeat
  • Replace the two closest objects by their
    combination
  • Until only one object remains

64
What are the objects?
  • (define objects
      (map (lambda (x)
             (cons
              (symbol->string (car x))
              (cdr x)))
           logtable42))
  • Initially, the objects are the genes with the log
    transformed expression levels
  • Typical object
  • ("E" 0 2.0 3.0 3.0 3.0 3.0)

65
Combining objects
  • (define combine
      (lambda (xs ys)
        (cons (string-append (car xs) (car ys))
              (map (lambda (x y) (/ (+ x y) 2.0))
                   (cdr xs) (cdr ys)))))
  • combine names
  • average the entries
  • Typical combined pair
  • ("EG" 0 1.5 2.29 2.5 2.29 2.0)
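
Putting combine into the repeat-until loop gives a small agglomerative clustering sketch. The slides do not say how "closest" is measured, so Euclidean distance between profiles is an assumption here, and closest-pair, remove-object and cluster are hypothetical names. Only four of the log-transformed rows are used to keep the run short.

```scheme
(define combine                    ; from the slide: join names, average entries
  (lambda (xs ys)
    (cons (string-append (car xs) (car ys))
          (map (lambda (x y) (/ (+ x y) 2.0)) (cdr xs) (cdr ys)))))

(define distance                   ; assumed measure: Euclidean distance
  (lambda (xs ys)
    (sqrt (apply + (map (lambda (x y) (* (- x y) (- x y)))
                        (cdr xs) (cdr ys))))))

(define closest-pair               ; list of the two objects with smallest distance
  (lambda (objs)
    (let loop ((rest objs) (best #f) (bestd #f))
      (if (null? rest)
          best
          (let inner ((others (cdr rest)) (best best) (bestd bestd))
            (cond ((null? others) (loop (cdr rest) best bestd))
                  ((or (not bestd)
                       (< (distance (car rest) (car others)) bestd))
                   (inner (cdr others)
                          (list (car rest) (car others))
                          (distance (car rest) (car others))))
                  (else (inner (cdr others) best bestd))))))))

(define remove-object              ; drop one object (by identity) from the list
  (lambda (obj objs)
    (cond ((null? objs) '())
          ((eq? obj (car objs)) (cdr objs))
          (else (cons (car objs) (remove-object obj (cdr objs)))))))

(define cluster                    ; repeat until only one object remains
  (lambda (objs)
    (if (null? (cdr objs))
        (car objs)
        (let* ((pair (closest-pair objs))
               (rest (remove-object (cadr pair)
                                    (remove-object (car pair) objs))))
          (cluster (cons (combine (car pair) (cadr pair)) rest))))))

(define objects
  (list (list "E" 0 2.0 3.0 3.0 3.0 3.0)
        (list "G" 0 1.0 1.58 2.0 1.58 1.0)
        (list "L" 0 1.0 1.58 2.0 1.58 1.0)
        (list "H" 0 -1.0 -1.6 -2.0 -1.6 -1.0)))

(display (car (cluster objects))) (newline)   ; GLEH
```

The name of the final object records the merge order: G and L (identical profiles) join first, then E, and the reciprocally expressed H joins last.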

66
Manual Hierarchical Clustering
  • Let's go to emacs

67
K-means Clustering -- Lloyd's Algorithm
  • Partition data into k clusters
  • REPEAT
      FOR each datapoint
        Calculate its distance to the centroid of each
        cluster
        IF this is minimal for its own cluster
          Leave the datapoint in its current cluster
        ELSE
          Place it in its closest cluster
    UNTIL no datapoint is moved
  • Goal: minimize the sum of squared distances from
    datapoints to centroids
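
The pseudocode can be sketched in Scheme as follows. This is one possible reading, not the course's implementation: it reassigns all datapoints in a batch after recomputing the centroids, breaks ties toward the lower-numbered cluster, and simply skips a cluster that has become empty. The data points and all function names are hypothetical.

```scheme
(define sqdist                     ; squared Euclidean distance between two points
  (lambda (p q) (apply + (map (lambda (a b) (* (- a b) (- a b))) p q))))

(define members                    ; points currently assigned to cluster c
  (lambda (c points assignment)
    (cond ((null? points) '())
          ((= c (car assignment))
           (cons (car points) (members c (cdr points) (cdr assignment))))
          (else (members c (cdr points) (cdr assignment))))))

(define centroid                   ; coordinate-wise mean of a non-empty point list
  (lambda (pts)
    (map (lambda (s) (/ s (* 1.0 (length pts))))
         (apply map + pts))))

(define nearest                    ; index of closest centroid; #f marks an empty cluster
  (lambda (p centroids)
    (let loop ((cs centroids) (i 0) (best #f) (bestd #f))
      (cond ((null? cs) best)
            ((and (car cs)
                  (or (not bestd) (< (sqdist p (car cs)) bestd)))
             (loop (cdr cs) (+ i 1) i (sqdist p (car cs))))
            (else (loop (cdr cs) (+ i 1) best bestd))))))

(define indices                    ; the list (0 1 ... k-1)
  (lambda (k)
    (let loop ((i (- k 1)) (acc '()))
      (if (< i 0) acc (loop (- i 1) (cons i acc))))))

(define kmeans                     ; repeat until no datapoint is moved
  (lambda (points assignment k)
    (let* ((centroids
            (map (lambda (c)
                   (let ((ms (members c points assignment)))
                     (if (null? ms) #f (centroid ms))))
                 (indices k)))
           (new (map (lambda (p) (nearest p centroids)) points)))
      (if (equal? new assignment)
          assignment
          (kmeans points new k)))))

(define points '((1.0 1.0) (1.5 2.0) (3.0 4.0) (5.0 7.0)
                 (3.5 5.0) (4.5 5.0) (3.5 4.5)))

(display (kmeans points '(0 1 0 1 0 1 0) 2)) (newline)
; (0 0 1 1 1 1 1)
```

Starting from an alternating partition, the run settles after three passes with the two low points in cluster 0 and the rest in cluster 1. A different initial partition can give a different answer, which is the sensitivity discussed on the following slides.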

68
Analysis of k-means clustering
  • There are always exactly k clusters
  • No cluster is empty (why?)
  • The clusters are not hierarchical
  • The clusters do not overlap
  • Run time with n datapoints
  • Partitioning O(n)
  • FOR loop is O(nk)
  • REPEAT loop is ???
  • Kanungo et al.

Algorithm (repeated from the previous slide):
  Partition data into k clusters
  REPEAT
    FOR each datapoint
      Calculate its distance to the centroid of each cluster
      IF this is minimal for its own cluster
        Leave the datapoint in its current cluster
      ELSE
        Place it in its closest cluster
  UNTIL no datapoint is moved
69
Pro and Con
  • Pro
  • With small k, may be faster than hierarchical
  • Clusters may be tighter
  • Con
  • Sensitive to initial choice of k
  • Sensitive to initial partition
  • May converge to local, rather than global minimum
  • Not clear how good resulting clusters are

70
Other Methods for Clustering
  • Self Organizing Maps
  • SWARM technology
  • SOM/SWARM hybrid

71
Mining of Expression Data
  • A gene expression pattern derived from a single
    microarray is simply a snapshot (one
    experimental sample vs reference)
  • Usually want to understand a process or changes
    in expression over a collection of samples
  • → gene expression profile

72
Three levels of microarray gene expression data
processing
Brazma et al., Nature Genetics, 29:365-371, 2001
73
Goal of Analysis of Expression Matrix
  • Some statistical methods applied to
  • Group similar genes together → groups of
    functionally similar genes.
  • Group similar cell samples together.
  • Extract representative genes in each group.

74
Typical approach
  • Look for patterns
  • compare rows to find evidence for co-regulation
    of genes
  • compare columns to find evidence for relatedness
    among samples
  • 1) Choose a measure of similarity (distance)
    among the objects being compared; each row or
    column is considered a vector in space
  • 2) Then, group together objects (genes or
    samples) with similar properties; this is a
    multidimensional analysis

75
Pattern Recognition
  • Clustering
  • Feature extraction/selection
  • Classification-discrimination analysis

76
Analytic Approaches
  • Clustering: identification of associations
    between data points; organization of data into
    groups; exploratory analysis
  • Clustering Algorithms
  • Hierarchical
  • K-means
  • Self-organizing maps
  • Others

77
Gene Expression Matrix: Hierarchical Clustering
(Figure: rows are genes, columns are samples)
Eisen et al. http://www.pnas.org/cgi/content/full/95/25/14863
78
Feature Selection and Classification
  • First, identify features (genes) that
    discriminate between classes
  • Then use features for classification
  • machine learning approach
  • supervised analysis
  • assignment of a new sample (pattern) to a
    previously specified class, based on sample
    features and a trained classifier

79
Classic Example: Classification of AML vs. ALL
  • Biological/Clinical Problems
  • previously, no single reliable test to
    distinguish them
  • differ greatly in clinical course and response to
    treatments
  • Comparing 2 acute leukemias
  • acute myeloid leukemia (AML)
  • acute lymphoblastic leukemia (ALL)

Golub et al., Science Oct 15 1999 531-537
80
Study Design
Golub et al., Science Oct 15 1999 531-537
81
Results of the study
  • 1) Clustering of microarray data using tumors of
    known type
  • → found 1100 of 6817 genes correlated with class
    distinction
  • 2) Formation of a class predictor: 50 most
    informative genes used as a training set
  • → classification of unknown tumors

Golub et al., Science Oct 15 1999 531-537
82

The prediction of a new sample is based on
'weighted votes' of a set of informative genes
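
A rough sketch of the weighted-vote idea follows. It assumes the scheme described in the cited Golub et al. paper: each informative gene gets a signal-to-noise weight (difference of class means over the sum of class standard deviations) and votes with that weight times the distance of the new sample's expression from the midpoint of the two class means. The gene statistics and samples are made-up numbers, and vote and predict are hypothetical names.

```scheme
; Each informative gene is described by training statistics:
; (mean-in-AML sd-in-AML mean-in-ALL sd-in-ALL), hypothetical values.
(define genes '((10.0 1.0 4.0 1.0)
                (3.0 0.5 8.0 1.5)))

(define vote                       ; positive votes favor AML, negative favor ALL
  (lambda (x gene)
    (let* ((m1 (car gene))   (s1 (cadr gene))
           (m2 (caddr gene)) (s2 (cadddr gene))
           (weight (/ (- m1 m2) (+ s1 s2)))     ; signal-to-noise weight
           (midpoint (/ (+ m1 m2) 2.0)))
      (* weight (- x midpoint)))))

(define predict                    ; sum the votes over all informative genes
  (lambda (sample genes)
    (let ((total (apply + (map vote sample genes))))
      (if (> total 0) 'AML 'ALL))))

(display (predict '(9.0 4.0) genes)) (newline)   ; AML
(display (predict '(5.0 7.0) genes)) (newline)   ; ALL
```

The sketch leaves out any measure of how decisive the vote is; a sample whose votes nearly cancel gets a label here just like a clear-cut one.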
84
Free Software for Microarray Analysis
  • Cluster and TreeView
  • Michael Eisen
  • http://rana.lbl.gov/EisenSoftware.htm

85
More Free Software
  • Expression Profiler (European Bioinformatics
    Inst)
  • http://ep.ebi.ac.uk/
  • GeneX (National Center for Genome Resources)
  • http://www.ncgr.org/genex/index.html
  • ArrayViewer and MEV (TIGR)
  • http://www.tigr.org/software/
  • Many More!!

86
Microarray Analysis Software
  • Popular commercial packages
  • Spotfire DecisionSite (Spotfire, Inc)
  • GeneSpring (Silicon Genetics)
  • Affymetrix Microarray Suite
  • Affymetrix Data Mining Tool (Affymetrix, Inc.)
  • Rosetta Inpharmatics' Resolver

87
Why cluster analysis may not be the answer
  • Clustering methods typically require user inputs
  • Example: distance measure
  • Clustering methods differ in the way that the
    number of clusters is specified.
  • Clustering methods are often sensitive to the
    initialization condition (starting guess)
  • Local vs. global sampling of clustering space

88
Cluster Analysis Challenges
  • Noise in the data itself
  • Large data sets
  • most of the techniques currently used were not
    developed for multidimensional data
  • What about networks?
  • limitation of cluster analysis: similarity in
    expression pattern suggests co-regulation but
    doesn't reveal cause-effect relationships

89
Using information networks as an interpretive
layer between phenotypes and the underlying
genes, proteins and metabolites
Highly connected genes are often critical in the
onset of cancer and metabolic diseases. However,
drug treatment targeting less connected genes
will have fewer side effects.
J. Blanchard, CAAGED Workshop 2002
90
Other Analytic Approaches are Being Explored to
Reverse Engineer Networks
  • Bayesian Networks
  • represent the dependence structure between
    multiple interacting quantities (e.g. expression
    levels of genes)
  • gene interactions: models of causal influence
  • good for noisy data

91
Analytic Challenges
  • Advanced methods may require significant
    computational resources
  • Numerically complex calculations (large
    correlation matrices)
  • Combinatorially large search spaces (Bayesian
    nets)
  • Many training cycles (neural nets)
  • Global optimization (genetic algorithms)
  • Archiving, indexing, and correlation of large
    datasets (data extraction and data mining
    visualization)

S. Atlas CAAGED Workshop
92
Take-home Messages
  • With current technology global gene expression
    studies are best used for hypothesis building
  • Other experimental methods and smaller data sets
    needed to address the reproducibility problem
  • New analytical approaches are needed to deal with
    the multidimensional data
  • Need for high performance computing
  • Need for Bio/CS/IT/Stats people to work together!