Introduction to Microarray and Data Analysis presentation

Title: Introduction to Microarray and Data Analysis

1
Introduction to Microarray and Data Analysis

By
Han-Yu Chuang

2
Biological background: Molecular Biology
3
The Central Dogma of Molecular Biology
4
Basic principles in physics, chemistry and
biology
                                               Principles Known?
Physics     Matter     Elementary Particles    Yes
Chemistry   Compound   Elements                Yes
Biology     Organism   Genes                   No
Every biological rule has exceptions!
5
Measuring Gene Expression
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein would be more direct, but is
currently harder.
6
Microarrays provide a means to measure gene
expression
7
How to measure gene expression?
8
A simple idea: Northern Blot
9
Technology Advanced
10
What is Microarray?

Put a large number (100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide (or
other substrate) in known locations on a grid.
Label an RNA sample and hybridize
Measure amounts of RNA bound to each square in
the grid

11
Imagination on Microarray
12
(No Transcript)
13
Basic principles

Main novelty is one of scale
hundreds or thousands of probes rather than tens
Probes are attached to solid supports
Robotics are used extensively
Informatics is a central component at all stages

14
Major technologies

cDNA probes (> 200 nt), usually produced by PCR,
attached to either nylon or glass supports
Oligonucleotides (25-80 nt) attached to glass
support
Oligonucleotides (25-30 nt) synthesized in situ
on silica wafers (Affymetrix)
Probes attached to tagged beads

15
Areas Being Studied with Microarrays

Differential gene expression between two (or
more) sample types
Similar gene expression across treatments
Tumor sub-class identification using gene
expression profiles
Classification of malignancies into known classes
Identification of marker genes that
characterize different tumor classes
Identification of genes associated with clinical
outcomes (e.g. survival)

16
Applications

Pathway Inference
(Gene regulatory network prediction)

Disease detection
(probe detection classification)

17
Principal uses of chips

Genome-scale gene expression analysis
Differentiation
Responses to environmental factors
Disease processes
Effects of drugs
Detection of sequence variation
Genetic typing
Detection of somatic mutations (e.g. in
oncogenes)
Direct sequencing

18
cDNA chips

Probes are cDNA fragments, usually amplified by
PCR
Probes are deposited on a solid support, either
positively charged nylon or glass slide
Samples (normally poly(A) RNA) are labelled
using fluorescent dyes
At least two samples are hybridized to chip
Fluorescence at different wavelengths measured by
a scanner

19
Standard protocol for comparative hybridization
20
cDNA microarray experiments

mRNA levels compared in many different contexts
Different tissues, same organism (brain vs.
liver)
Same tissue, same organism (treatment vs. control, tumor vs.
non-tumor)
Same tissue, different organisms (wild-type vs. knock-out, transgenic,
or mutant)
Time course experiments (effect of treatment,
development)
Other special designs (e.g. to detect spatial
patterns).

21
Web animation of a cDNA microarray experiment
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
(DNA Microarray Technique)
22
Yeast genome on a chip
23
Brief outline of steps for producing a microarray

cDNA probes attached or synthesized to solid
support
Hybridize targets
Scan array

Using Microarray with cDNA or oligonucleotide

Building the Chip
MASSIVE PCR
PCR PURIFICATION and PREPARATION
PREPARING SLIDES
PRINTING
POST PROCESSING
Preparing RNA
CELL CULTURE AND HARVEST
RNA ISOLATION
cDNA PRODUCTION
Hybing the Chip
PROBE LABELING
ARRAY HYBRIDIZATION
DATA ANALYSIS
25
cDNA microarrays
cDNA clones
26
cDNA microarrays

Compare the genetic expression in two samples of
cells

PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green,
e.g. treatment / control, normal / tumor
tissue
27
HYBRIDIZE: Add equal amounts of labelled cDNA
samples to microarray.
SCAN
Laser
Detector
28
Quantification of expression

For each spot on the slide we calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)
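
The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original presentation; the function name log_ratio and the choice to return None for non-positive corrected intensities are assumptions made for the example.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(red / green) for one spot.

    Returns None when either corrected intensity is not positive,
    because the logarithm is undefined there (an assumed policy).
    """
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Red is twice as bright as green after background correction.
print(log_ratio(1200, 200, 600, 100))  # 1.0
```

A value of 1.0 means a two-fold excess of the red-labelled sample; swapping the two channels gives -1.0, which is why the log ratio is preferred over the raw ratio.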

29
Gene Expression Data

On p genes for n slides: p is O(10,000), n is
O(10-100), but growing.

                    Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Each entry is the gene expression level of one gene in one slide
(e.g. gene 5 in slide 4), measured as
Log2( Red intensity / Green intensity)
These values are conventionally displayed on a
red (> 0), yellow (0), green (< 0) scale.
30
cDNA chip design

Probe selection
Non-redundant set of probes
Includes genes of interest to project
Corresponds to physically available clones
Chip layout
Grouping of probes by function
Correspondence between wells in microtitre plates
and spots on the chip

31
Glass chip manufacturing

Choice of coupling method
Physical (charge), non-specific chemical,
specific chemical (modified PCR primer)
Choice of printing method
Mechanical pins flat tip, split tip, pin ring
Piezoelectric deposition (ink-jet)
Robot design
Precision of movement in 3 axes
Speed and throughput
Number of pins, numbers of spots per pin load

32
Labeling and hybridization

Targets are normally prepared by oligo(dT) primed
cDNA synthesis
Probes should contain 3' end of mRNA
Need Cot-1 DNA as competitor
Specific activity will limit sensitivity of assay
Alternative protocol is to make ds cDNA
containing bacterial promoter, then cRNA
Can work with smaller amount of RNA
Less quantitative
Hybridization usually under coverslips

33
Scanning the arrays

Laser scanners
Excellent spatial resolution
Good sensitivity, but can bleach fluorochromes
Still rather slow
CCD scanners
Spatial resolution can be a problem
Sensitivity easily adjustable (exposure time)
Faster and cheaper than lasers
In all cases, raw data are images showing
fluorescence on surface of chip

34
Microarray data on the Web

Many groups have made their raw data available,
but in many formats
Some groups have created searchable databases
There are several initiatives to create unified
databases
EBI ArrayExpress
NCBI Gene Expression Omnibus
Companies are beginning to sell microarray
expression data (e.g. Incyte)

35
Bioinformatics of microarrays

Experimental (array) design: choice of sequences
to be used as probes
Analysis of scanned images
Spot detection, normalization, quantitation
Primary analysis of hybridization data
Basic statistics, reproducibility, data
scattering, etc.
Comparison of multiple samples
Clustering, SOMs, classification
Sample tracking and databasing of results

36
Biological question: differentially expressed
genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
37
Microarray Image Analysis

Quantitation of fluorescence signals

38
Scanner
PMT
Pinhole
Detector lens
Laser
Beam-splitter
Objective Lens
Dye
Glass Slide
39
Images from scanner

Resolution
standard 10 µm currently, max 5 µm
100 µm spot on chip = 10 pixels in diameter
Image format
TIFF (tagged image file format), 16 bit (65536
levels of grey)
1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
other formats exist, e.g. SCN (used at Stanford
University)
Separate image for each fluorescent sample
channel 1, channel 2, etc.
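
The image-size figures on this slide follow from simple arithmetic, sketched below. The function name image_size_bytes is hypothetical, and the calculation assumes one uncompressed channel with no file header overhead.

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel):
    """Uncompressed size in bytes of one scanned channel."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 um
    pixels_y = round(height_cm * 10_000 / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8

# 1 cm x 1 cm at 10 um per pixel and 16 bits per pixel:
# 1000 x 1000 pixels x 2 bytes each.
print(image_size_bytes(1, 1, 10, 16))  # 2000000

# A 100 um spot at 10 um resolution spans this many pixels.
print(100 // 10)  # 10
```

Halving the pixel size to 5 µm quadruples the pixel count, so the same area would need about 8 MB per channel.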

40
Images examples
41
Practical Problems 1

Comet Tails
Likely caused by insufficiently rapid immersion
of the slides in the succinic anhydride blocking
solution.

42
Practical Problems 2
43
Practical Problems 3

High Background
2 likely causes
Insufficient blocking.
Precipitation of the labeled probe.
Weak Signals

44
Practical Problems 4
Spot overlap. Likely cause: too much
rehydration during post-processing.
45
Practical Problems 5
Dust
46
Processing of images

Addressing or gridding
Assigning coordinates to each of the spots
Segmentation
Classification of pixels either as foreground or
as background
Intensity determination for each spot
Foreground fluorescence intensity pairs (R, G)
Background intensities
Quality measures

47
Addressing

The measurement process depends on the addressing
procedure
Addressing efficiency can be enhanced by allowing
user intervention (slow!)
Most software systems now provide for both manual
and automatic gridding procedures

Registration
48
Problems in automatic addressing

Misregistration of the red and green channels
Rotation of the array in the image
Skew in the array

Rotation
49
Segmentation

Segmentation methods
Fixed circle segmentation
Adaptive circle segmentation
Adaptive shape segmentation
Histogram segmentation

50
Information Extraction

Spot Intensities
mean (pixel intensities).
median (pixel intensities).
Background values
Local
Morphological opening
Constant (global)
None
Quality Information
Area
Circularity
Signal to Noise ratio
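
As a rough sketch of how these quantities might be extracted, the Python below applies fixed circle segmentation to a small pixel patch and reports mean and median foreground, a local background, and the spot area. The function name fixed_circle_spot, the patch layout, and the use of the patch median as the local background are assumptions for illustration, not the method of any particular image-analysis package.

```python
from statistics import mean, median

def fixed_circle_spot(image, cx, cy, radius):
    """Split a small image patch into spot foreground and local background.

    image: list of rows of pixel intensities; (cx, cy) is the spot centre
    in (column, row) pixel coordinates. Pixels within `radius` of the
    centre are foreground, all other pixels in the patch are background.
    """
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (fg if inside else bg).append(value)
    return {
        "fg_mean": mean(fg),
        "fg_median": median(fg),
        "bg_median": median(bg),   # local background estimate
        "area": len(fg),           # quality measure: pixels in the spot
    }

patch = [
    [10, 10, 10, 10, 10],
    [10, 90, 100, 90, 10],
    [10, 100, 120, 100, 10],
    [10, 90, 100, 90, 10],
    [10, 10, 10, 10, 10],
]
spot = fixed_circle_spot(patch, cx=2, cy=2, radius=1)
print(spot["fg_median"] - spot["bg_median"])  # 90
```

The median is less sensitive than the mean to a few very bright pixels such as dust, which is one reason both summaries are usually reported.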

Take the average
51
Quantification of expression

For each spot on the slide we calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)

52
Microarray Data Normalization

Why?
To correct for systematic differences between
samples on the same slide, or between slides,
which do not represent true biological variation
between samples.
How do we know it is necessary?
By examining self-self hybridizations, where no
true differential expression is occurring.
We find dye biases which vary with overall spot
intensity, location on the array, plate origin,
pins, scanning parameters, ...
Goals
- Reduces systematic (not random) effects
- Makes it possible to compare several arrays

53
(No Transcript)
54

Intensity-dependent normalization
Here, run a line through the middle of the MA
plot, shifting the M value of the pair (A, M) by
c = c(A), i.e.
log2 R/G -> log2 R/G - c(A)
One estimate of c(A) is made using the LOWESS
function of Cleveland (1979): LOcally WEighted
Scatterplot Smoothing.

A = (log2 R + log2 G)/2, M = log2 R - log2 G
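
The idea of shifting M by a smooth curve c(A) can be sketched in plain Python. The code below is a simplified stand-in for LOWESS: a single-pass local linear fit with tricube weights and no robustness iterations. The function names, the frac parameter, and the fallback for degenerate neighbourhoods are all assumptions made for this illustration.

```python
import math

def ma_values(red, green):
    """Return (A, M) lists from background-corrected intensities."""
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    return a, m

def local_linear_fit(a, m, frac=0.5):
    """Estimate c(A) at each spot by tricube-weighted local linear regression."""
    n = len(a)
    k = min(n, max(3, math.ceil(frac * n)))      # neighbours per local fit
    fitted = []
    for a0 in a:
        x = [ai - a0 for ai in a]                # centre A on the target spot
        h = sorted(abs(xi) for xi in x)[k - 1]   # bandwidth: k-th nearest spot
        h = h if h > 0 else 1.0
        w = [max(0.0, 1 - (abs(xi) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * mi for wi, mi in zip(w, m))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * mi for wi, xi, mi in zip(w, x, m))
        denom = sw * swxx - swx ** 2
        if denom > 1e-12:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw)   # intercept = value at a0
        else:
            fitted.append(swy / sw)                   # fall back to weighted mean
    return fitted

def normalize(a, m, frac=0.5):
    """Shift each M by the fitted curve: M -> M - c(A)."""
    return [mi - ci for mi, ci in zip(m, local_linear_fit(a, m, frac))]
```

For a self-self hybridization with a purely intensity-dependent dye bias, the normalized M values should come out close to zero. In practice a tested LOWESS implementation from a statistics library would be used instead of this sketch.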
55
Normalization by controls: Microarray Sample Pool
titration series
Pool the whole library
Control set to aid intensity-dependent
normalization
Different concentrations in titration series
Spotted evenly spread across the
slide in each pin-group
56
Differential Expression Which genes have changed?

Goal
Identify genes associated with covariate or
response of interest
Examples
Qualitative covariates or factors: treatment,
cell type, tumor class
Quantitative covariate: dose, time
Responses: survival, cholesterol level
Any combination of these!

57
cDNA gene expression data

Data on G genes for n samples

                    mRNA samples
Genes   sample1   sample2   sample3   sample4   sample5   ...
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene i in mRNA sample j =
(normalized) Log( Red intensity / Green intensity)
58
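An expression profile like this for each gene
[Figure: expression profile of one gene. Vertical axis: mRNA ratio r (Cy5/Cy3); horizontal axis: time / h from the start of the experiment. Annotations: up-regulation (induction), down-regulation (repression).]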
59
Co-regulation and inference of function: Genes
belonging to the same pathway often show
the same regulatory patterns (profiles) across a
variety of biological situations (or in a time
series). Hence, as a hypothesis, genes of unknown
function that show regulatory behaviour similar to
that of genes of known function may have a similar
function.

Which genes are differentially expressed?
Which genes are expressed in a similar way
when compared to expression profiles of genes
with known function?
(co-regulation)
Patterns of expression (diagnostic
fingerprinting)
Reverse engineering of genetic networks

Differential expression: When comparing the
transcriptomes of two different biological
samples (e.g. control, heat-shock), you are
interested in the subset of genes that are
expressed at different levels (up-/
down-regulated).
Expression fingerprinting: Often in medical
applications it is of interest to characterize
the biological status of cells, e.g. the
severity of tumor cells, to be able to respond
with the right therapy.
Reverse engineering: Using expression data to
infer regulatory interactions between a number of
genes responsible for a certain adaptation
process or developmental process.
60
Common methods

t-Test
Fisher
Golub

TNOM
Wilcoxon
WEPO
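
Two of the scores listed here can be sketched as follows. The slide does not say which form of the t-test is meant, so Welch's unequal-variance statistic is used as an assumption. The Golub score is taken to be the signal-to-noise ratio (difference of class means over the sum of class standard deviations). The function names and example data are hypothetical.

```python
import math
from statistics import mean, stdev, variance

def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch form, assumed)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def golub_score(x, y):
    """Signal-to-noise score: mean difference over summed standard deviations."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))

def rank_genes(data, score):
    """Sort gene names by decreasing |score| between the two groups."""
    return sorted(data, key=lambda g: abs(score(*data[g])), reverse=True)

# Log ratios for two genes in two sample classes (made-up numbers).
data = {
    "gene_a": ([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]),
    "gene_b": ([0.1, -0.2, 0.3], [0.0, 0.2, -0.1]),
}
print(rank_genes(data, welch_t))  # gene_a ranks first
```

Both scores are computed gene by gene, so with thousands of genes some large values will arise by chance and a multiple-testing adjustment is normally needed before calling genes differentially expressed.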

62
Weighted Punishment on Overlap (WEPO)

Combine heuristics from para- and non-parametric
methods.
If a gene is differentially expressed, the
expression value of different groups should come
from quite different distributions.

63
The better the genes, the less the overlap

Score each gene by estimating the overlapping
regions of these classes.
The aim is to prevent information loss and maintain
robustness.

64
Formula of Weighted Punishment
Where
65
An Example
66
Another Example
67
Other, More Advanced Microarray Data Analysis

Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated
genes
Meta-studies using data from multiple experiments

68
(No Transcript)
69
Cluster analysis

Used to find groups of objects when not already
known
Unsupervised learning
Associated with each object is a set of
measurements (the feature vector)
Aim is to identify groups of similar objects on
the basis of the observed measurements

70
(No Transcript)
71
Clustering Gene Expression Data

Can cluster genes (rows), e.g. to (attempt to)
identify groups of co-regulated genes
Can cluster samples (columns), e.g. to identify
tumors based on profiles
Can cluster both rows and columns at the same time

72
Clustering Gene Expression Data

Leads to readily interpretable figures
Can be helpful for identifying patterns in time
or space
Useful (essential?) when seeking new subclasses
of samples
Can be used for exploratory purposes

73
Types of Clustering

Hierarchical
Link similar genes, build up to a tree of all
K-means
- Partition genes into a prespecified number
of groups K
Self Organizing Maps (SOM)
Split all genes into similar sub-groups
Finds its own groups (machine learning)
Principal Component
Every gene is a dimension (vector); find a single
dimension that best represents the differences in
the data

74
Hierarchical Clustering
3 clusters?
2 clusters?
75
K-means Clustering
The intended clusters are found.
76
Hierarchical clustering
77
Hierarchical clustering (continued)
To transform the genes × experiments matrix into a
genes × genes matrix, use a gene similarity
metric (Eisen et al. 1998, PNAS 95:14863-14868).
Exactly the same as Pearson's correlation except for the
underlined offset term.
Gi equals the (log-transformed) primary data
for gene G in condition i, for any two genes X
and Y observed over a series of N conditions.
Goffset is set to 0, corresponding to a
fluorescence ratio of 1.0.
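
The formula itself did not survive in this transcript. As a hedged sketch, the Python below implements the similarity metric as it is usually stated for Eisen et al. (1998): a correlation in which each profile is measured from a fixed offset rather than from its own mean. The function name and argument names are hypothetical, and the code assumes neither profile is identically equal to its offset.

```python
import math

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Offset-based correlation between two expression profiles.

    With both offsets at 0 this is the uncentered correlation; with each
    offset set to the profile mean it reduces to Pearson's correlation.
    """
    n = len(x)
    phi_x = math.sqrt(sum((xi - x_offset) ** 2 for xi in x) / n)
    phi_y = math.sqrt(sum((yi - y_offset) ** 2 for yi in y) / n)
    total = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    return total / (n * phi_x * phi_y)

profile = [1.0, -2.0, 0.5, 3.0]
mirror = [-v for v in profile]
print(eisen_similarity(profile, profile))  # close to 1
print(eisen_similarity(profile, mirror))   # close to -1
```

A gene and its mirror image score close to -1, so a clustering based on this similarity places anti-correlated genes far apart unless the absolute value of the score is used.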
78
Hierarchical clustering (continued)
Pearson's correlation example
What if genome expression is clustered based on
negative correlation?
79
Hierarchical clustering (continued)
80
K-means clustering
This method differs from hierarchical
clustering in many ways. In particular:
- There is no hierarchy; the data are partitioned. You
will be presented only with the final cluster
membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the
data are to be grouped.
81
K-means clustering (continued)
Step 1: Transform the n (genes) × m (experiments)
matrix into an n (genes) × n (genes) distance matrix
Step 2: Cluster genes based on a k-means
clustering algorithm
82
K-means clustering (continued)
To transform the n × m matrix into an n × n matrix, use
a similarity (distance) metric
(Tavazoie et al., Nature Genetics 1999
Jul;22(3):281-5).
Euclidean distance
between any two genes X and Y observed over a
series of M conditions.
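
The distance formula is missing from this transcript. The standard Euclidean distance between two profiles, and the resulting genes × genes matrix, can be sketched as below. The function names are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length M."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def distance_matrix(profiles):
    """n x n matrix of pairwise distances between n gene profiles."""
    n = len(profiles)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in distance_matrix(profiles):
    print(row)
```

The matrix is symmetric with a zero diagonal, so only the n(n-1)/2 entries above the diagonal need to be stored for large n.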
83
K-means clustering (continued)
84
K-means clustering algorithm
Step 1: Suppose the gene expression
patterns are positioned in a two-dimensional
space based on a distance matrix
Step 2: The first cluster center (red) is chosen
randomly, and subsequent centers are chosen
by finding the data point farthest from the
centers already chosen. In this example, k = 3.
85
K-means clustering algorithm (continued)
Step 3: Each point is assigned to the
cluster associated with the closest
representative center
Step 4: Minimize the within-cluster sum of
squared distances from the cluster mean by
moving the centroid (star points), that is,
computing a new cluster representative
86
K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new
representative
Run steps 3, 4 and 5 until no further changes
occur.
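
The five steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the presenter's implementation: "farthest from the centers already chosen" is interpreted as maximizing the distance to the nearest chosen center, and the first center is selected by index (the parameter first) instead of randomly so that the example is reproducible.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_first_centers(points, k, first=0):
    """Step 2: start from one point, then repeatedly add the farthest point."""
    centers = [points[first]]
    while len(centers) < k:
        # pick the point whose nearest chosen centre is farthest away
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

def kmeans(points, k, first=0, max_iter=100):
    centers = farthest_first_centers(points, k, first)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centre
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:      # Step 5: stop when nothing changes
            break
        labels = new_labels
        # Step 4: move each centre to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labels, centers = kmeans(points, k=3)
print(labels)
```

Moving each center to the mean of its members is what reduces the within-cluster sum of squared distances in step 4. With a random first center, different runs can give different partitions, so results are often compared across several starts.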
87
Web links

Leming Shi's Gene-Chips.com page: very rich
source of basic information and commercial and
academic links
DNA chips for dummies animation
A step-by-step description of a microarray
experiment by Jeremy Buhler
The Big Leagues: Pat Brown and NHGRI microarray
projects

88
Mini-Review: How to make a cDNA microarray
89
Glass slide: array of bound cDNA probes, 4x4
blocks = 16 print-tip groups
90
Microarray Experiment
91
Hybridization: Binding cDNA samples (targets) to
cDNA probes on slide
cover slip
Hybridise for 5-12 hours
92
(No Transcript)
93
Quantification of expression

For each spot on the slide we calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)

94
Some Considerations for cDNA Microarray
Experiments (I)

Scientific (Aims of the experiment)
Specific questions and priorities
How will the experiments answer the questions
Practical (Logistic)
Types of mRNA samples: reference, control,
treatment, mutant, etc.
Source and amount of material (tissues, cell
lines)
Number of slides available

95
Some Considerations for cDNA Microarray
Experiments (II)

Other Information
Experimental process prior to hybridization:
sample isolation, mRNA extraction, amplification,
labelling, ...
Controls planned: positive, negative, ratio,
etc.
Verification method: Northern, RT-PCR, in situ
hybridization, etc.

96
Experimental Design

Ensure questions of interest can be answered
accurately, under some constraints
Cost, number of slides
Biological material, availability of mRNA

97
Combining data across slides

Data on m genes for n hybridizations

98
The design issue here

Determine which mRNAs are to be labeled with
which fluor, and which are to be hybridized
together on the same slide.
i.e., how the samples are paired onto arrays.

99
Graphical Representation

Multi-digraph
Vertices: mRNA samples
Edges: hybridization
Direction: dye assignment

Cy3 sample
Cy5 Sample
100
Comparing K treatments

Common reference design
Extensibility
All-pairs design
Better in precision
Comparison within slides

101
On Graphical Representation

2 mRNA samples can be compared if there is a path
The precision depends on the number of paths
Direct comparisons within slides more precise
than indirect ones

102
Treatment vs Control

Two samples
e.g. KO vs. WT or mutant vs. WT

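Direct: T and C hybridized together on the same slides
Estimate: average (log (T/C))
Variance: σ²/2
Indirect: T and C each hybridized against Ref
Estimate: log (T / Ref) - log (C / Ref)
Variance: 2σ²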
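
The two variances quoted on this slide for the direct and indirect designs can be reproduced from the design graph. The sketch below assumes independent slides, one log ratio per slide with equal error variance σ², and a connected design; it reports variances in units of σ². The function names are hypothetical, and the approach (inverting the graph Laplacian with one sample fixed as baseline) is one standard way to obtain least-squares contrast variances, not necessarily the method used by the presenter.

```python
from fractions import Fraction
from itertools import combinations

def invert(matrix):
    """Gauss-Jordan inverse of a square matrix of Fractions."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def contrast_variances(n_samples, arrays):
    """Variance (in units of sigma^2) of each estimated pairwise log ratio.

    arrays: list of (i, j) pairs, one per slide, meaning samples i and j
    are hybridized together. Dye direction does not change the variance.
    """
    m = n_samples - 1                      # last sample is the baseline
    lap = [[Fraction(0)] * m for _ in range(m)]
    for i, j in arrays:
        for a, b in ((i, j), (j, i)):      # build the reduced graph Laplacian
            if a < m:
                lap[a][a] += 1
                if b < m:
                    lap[a][b] -= 1
    inv = invert(lap)

    def var(i, j):
        c = [Fraction(0)] * m
        if i < m:
            c[i] += 1
        if j < m:
            c[j] -= 1
        return sum(c[a] * inv[a][b] * c[b] for a in range(m) for b in range(m))

    return {(i, j): var(i, j) for i, j in combinations(range(n_samples), 2)}

def average_variance(n_samples, arrays):
    """Mean variance over all pairwise comparisons."""
    v = contrast_variances(n_samples, arrays)
    return sum(v.values()) / len(v)

# Direct: T (0) and C (1) on two slides. Indirect: each against Ref (2).
print(contrast_variances(2, [(0, 1), (0, 1)])[(0, 1)])   # 1/2
print(contrast_variances(3, [(0, 2), (1, 2)])[(0, 1)])   # 2
```

With the same two slides, the direct comparison is four times more precise than the indirect one. The same function gives an average variance of 2/3 for an all-pairs design on three samples with three slides, which illustrates why all-pairs designs are described as better in precision.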
103
Common reference

A
B
C
Ref
All pairs
104
(No Transcript)
105
(No Transcript)
106
The problem

We suppose comparison between all pairs of
varieties are of equal interest.
For the number of arrays budgeted for an
experiment, which design should we use to gain
more precision?

107
[Figure: three design graphs, labelled V5, S8; V6, S8; and V7, S9.]
From puppy (2002)
108
Traditional ways

Generate the full sets of non-isomorphic
connected designs of given the number of arrays
and samples.
Then calculate each average variance.

109
Difficulties

11,716,571 non-isomorphic connected graphs on 10
nodes. (almost 58 hrs)
1,006,700,565 on 11
164,059,830,476 on 12
Such an exhaustive strategy is too time-consuming.

110
Our strategy

Using a genetic algorithm (GA) as a smart search method, we do not
need to explore all designs to obtain the optimal
one.
For v = 5, a = 8 -> 1.6 secs
v = 6, a = 8 -> 2.1 secs
v = 7, a = 9 -> 2.7 secs
v = 10, a = 20 -> 129.416 secs
v = 12, a = 14 -> 136.481 secs
v = 13, a = 15 -> 175.388 secs

111
[Figure: design graph on nodes 1-12, labelled V12, S24.]
112
Statistical model
For any particular gene
B
A
Sample
Expression level
Intensity
G
R
Intensity ? Expression level
113
(No Transcript)
114
(No Transcript)
115
Example: Time course
T1
T2
T3
T4
t1
t2
t3
t4
t1, t2, t3, t4 true expression levels
116
T1 VS. T2
117
T1 VS. T2
118
T3 VS. T4
119
(No Transcript)
