Title: Microarray
1Microarray
- Yuki Juan
- NTUST
- May 26, 2003
2Content
- Biology background of microarray
- Design of microarray
- The workflow of microarray
- Image analysis of microarray
- Data analysis of microarray
- Discussion
3The Biology Background of Microarray
- The central dogma of life forms
- DNA
- RNA
- Monitoring the expression of genes
4Central Dogma
- DNA Replication
- --ACGCGA--
- --TGCGCT--
- RNA Transcription
- --UGCGCU--
- Protein Translation
- --CYSALA--
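The example on this slide can be traced in a few lines of Python. This is only a minimal sketch: the function names are illustrative, the codon table holds just the two codons that appear on the slide, and the strands are paired position by position as drawn, without tracking 5'/3' orientation.

```python
# Base-pairing rules; in transcription, A on the DNA template pairs with U.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial codon table: only the two codons that occur in the slide's example.
CODONS = {"UGC": "Cys", "GCU": "Ala"}


def replicate(strand):
    """Return the complementary DNA strand, position by position."""
    return "".join(DNA_PAIR[base] for base in strand)


def transcribe(template):
    """Return the RNA complementary to a DNA template strand."""
    return "".join(RNA_PAIR[base] for base in template)


def translate(rna):
    """Translate RNA codon by codon using the partial table above."""
    return [CODONS[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3)]


dna = "ACGCGA"
print(replicate(dna))            # TGCGCT
rna = transcribe(dna)
print(rna)                       # UGCGCU
print("".join(translate(rna)))   # CysAla
```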
5DNA
[Diagram: DNA (replication) → transcription → RNA → translation → Protein]
6DNA
- The double helix
- stable
- Nucleotide
- A, T, G, C
- Base pair
- A T
- G C
- Oligonucleotide
- short DNA (tens of nucleotides, or bps)
(http://www.nhgri.nih.gov/)
7DNA Strand
- DNA has a canonical orientation
- read from 5' to 3'
- antiparallel: one strand runs in the direction opposite to its complement
- 5' TACTGAA 3'
- 3' ATGACTT 5'
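The antiparallel pairing above amounts to taking a reverse complement. A minimal sketch, with an illustrative function name:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the complementary strand, written in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


top = "TACTGAA"                    # 5' TACTGAA 3'
bottom = reverse_complement(top)   # 5' TTCAGTA 3'
print(bottom)                      # TTCAGTA
print(bottom[::-1])                # ATGACTT, the slide's 3' ATGACTT 5'
```

Reading the result backwards gives the lower strand exactly as it is drawn under the upper one.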
8Hydrogen Bond Makes DNA Binding Specifically
[Figure: hydrogen bonds between two paired strands, with the 5' and 3' ends of each strand labeled]
9Hydrogen Bond Makes DNA Binding Specifically
- Base pairs are held together by hydrogen bonds.
- These bonds let A pair specifically with T (or U in RNA) and C pair specifically with G.
10RNA
[Diagram: DNA (replication) → transcription → RNA → translation → Protein]
11RNA
- Types
- messenger RNA (mRNA)
- ribosomal RNA (rRNA)
- transfer RNA (tRNA)
- A gene is expressed by transcribing DNA into single-stranded mRNA
12RNA (Detailed)
(http://www.nhgri.nih.gov/)
13Reverse Transcription
[Diagram: DNA (replication) → transcription → RNA → translation → Protein, with reverse transcription running from RNA back to DNA]
Reverse transcriptase converts RNA into complementary DNA (cDNA).
14The Southern Blot
- A basic DNA detection technique that has been used for over 30 years
- A known strand of DNA is deposited on a solid support (e.g. nitrocellulose paper)
- An unknown mixture of DNA is labeled (radioactive or fluorescent)
- The unknown DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then excess solution is washed off
- If a copy of the known DNA occurs in the unknown sample, it will stick (hybridize), and the labeled DNA will be detected on photographic film
15mRNA Represents Gene Function
- When we measure the level of an mRNA, we are monitoring the activity of a gene.
- Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome.
- A microarray yields the equivalent of over 10,000 blots in a single experiment, which makes monitoring genome-wide activity possible.
17Design of Microarray
- Microarray in different context
- The idea of microarray
- Main type of array chips
18mRNA Levels Compared in Many Different Contexts
- Different tissues, same organism (brain vs. liver)
- Same tissue, same organism (tumor vs. non-tumor)
- Same tissue, different organisms (wild type vs. mutant)
- Time course experiments (development)
- Other special designs (e.g. to detect spatial patterns)
19Idea of Microarray
Cell A
Cell B
Labeled cDNA from geneX
Hybridization to chip
The spot for geneX carries the sequence complementary to the colored cDNA.
This spot shows red color after scanning.
20Over 10,000 Hybridizations Can Be Done at One Time
21Several Types of Arrays
- Spotted DNA arrays
- Developed by Pat Brown's lab at Stanford
- PCR products of full-length genes (>100 nt)
- Affymetrix gene chips
- Photolithography technology from the computer industry allows building many 25-mers
- Ink-jet microarrays from Agilent
- 25-60-mers printed directly on glass slides
- Flexible, rapid, but expensive
22Array Fabrication: Spotting
- Use PCR to amplify DNA
- Robotic "pen" deposits DNA at defined coordinates
- approximately 1-10 ng per spot
- Experimentation with oligos (40, 70 bp)
23This machine can make 48 microarrays
simultaneously.
24Array Fabrication: Photolithography
- Light-activated synthesis
- synthesize oligonucleotides on glass slides
- 10^7 copies per oligo in a 24 x 24 µm square
- Use 20 pairs of different 25-mers per gene
- Perfect match and mismatch
25Array Fabrication: Photolithography
26Affymetrix Microarrays
Raw image
1.28cm
10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM).
Raw gene expression is the intensity difference PM - MM.
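The PM - MM difference can be computed per probe pair and then summarized per gene. The sketch below averages the differences over a gene's probe pairs; the averaging step and the function name are assumptions for illustration, not a description of the vendor's software, and the intensities are made up.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for four probe pairs (the slides use 20 pairs per gene).
pm = [1200.0, 950.0, 1100.0, 870.0]
mm = [300.0, 410.0, 280.0, 390.0]
print(average_difference(pm, mm))   # 685.0
```

Because MM can exceed PM for weakly expressed genes, the result can be negative, which is why negative values show up in raw expression tables.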
27Agilent cDNA microarray and oligonucleotide microarray
- Agilent delivers printed 60-mer microarrays in addition to 25-mer formats.
- The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.
29The Workflow of Microarray
- Sample → RNA extraction → cDNA synthesis and labeling → labeled cDNA
- Plate → plate preparation → array fabrication → array
- Labeled cDNA + array → hybridization → hybridized array → scanning
30cDNA Synthesis and Direct Labeling
31Cy3 and Cy5 cDNA Hybridization onto the Chip
e.g. treatment / control, normal / tumor tissue
Sample loading
1. Loading from the corner of the cover slip: time-consuming and prone to producing bubbles.
2. Loading the sample at the center of the array, then placing the slip smoothly: faster, with a lower chance of bubbles than the first method.
3. Loading the sample at the side of the array, then putting the slip on: the solution attaches to the slip as soon as the slip touches it, and spreads along with the slip as the slip is slowly lowered.
32Scan
Green: down-regulated. Red: up-regulated. Yellow: equal level.
34Image analysis
- To find a spot
- Convert feature into numeric data
- Image normalization
35The Algorithms
- 1. Find spots: find the location of each spot on the microarray.
- 2. Cookie cutter algorithm
- (1) Assume the distribution of pixel counts vs. intensity follows a Gaussian curve
- (2) Use SD or IQR to identify the feature and background of each spot
- (3) Calculate statistics for the pixel population
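Steps (2) and (3) can be illustrated with an IQR-based outlier rejection followed by summary statistics. This is a sketch under assumptions: the function names are hypothetical, the multiplier 1.42 is taken from the IQR slide but its exact role in the original software is not stated here, and the pixel values are made up.

```python
import statistics


def reject_outliers_iqr(pixels, k=1.42):
    """Keep pixels within k * IQR of the quartiles (k = 1.42 is an assumption)."""
    q1, _, q3 = statistics.quantiles(pixels, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in pixels if low <= p <= high]


def spot_statistics(pixels):
    """Summary statistics of a spot's pixel population after outlier rejection."""
    kept = reject_outliers_iqr(pixels)
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
        "sd": statistics.stdev(kept) if len(kept) > 1 else 0.0,
    }


# Hypothetical feature pixels; the value 500.0 mimics a dust speck.
feature = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 500.0]
stats = spot_statistics(feature)
print(stats["n"])   # 8: the 500.0 pixel is rejected
```

The same routine can be applied separately to the feature pixels and to the local background pixels of each spot.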
36Interquartile Range(IQR)
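[Figure: pixel intensity distribution marked at the 25th, 50th, and 75th percentiles, with the IQR, a 1.42 IQR interval, K IQR/2, D, and the two boundaries for rejection labeled]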
37Feature or cookie
[Figure: a feature (cookie) with labels D, exclusion zone, and local background]
38Data Quality
- Irregular size or shape
- Irregular placement
- Low intensity
- Saturation
- Spot variance
- Background variance
[Figure: example spots labeled artifact, misalignment, bad print, indistinguishable, saturated]
39Convert Feature Into Numeric Value
[Table columns: systematic name, gene function, red intensity, red background (b.g.), red b.g.-corrected, green intensity, green background, green b.g.-corrected, ratio (red b.g.-corrected)/(green b.g.-corrected)]
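The ratio column follows directly from the intensity and background columns. A minimal sketch, assuming hypothetical values and an illustrative function name; returning None for non-positive corrected signals is a choice made here, not something the slide specifies.

```python
import math


def corrected_ratio(red, red_bg, green, green_bg):
    """Return (R - R_bg) / (G - G_bg), or None if either corrected signal is <= 0."""
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None
    return r / g


ratio = corrected_ratio(red=2200.0, red_bg=200.0, green=1100.0, green_bg=100.0)
print(ratio)              # 2.0
print(math.log2(ratio))   # 1.0
```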
40Data Normalization
- Normalize data to correct for variances
- Dye bias
- Location bias
- Intensity bias
- Pin bias
- Slide bias
- Control vs. non-control spots
41Data Normalization
Calibrated: red and green equally detected
Uncalibrated: red light under-detected
42Data Normalization
- Assumptions
- The overall mean ratio should be 1
- Most genes are not differentially expressed
- Total intensities of the two dyes are equivalent
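Under these assumptions, a global normalization shifts all log ratios by one constant so that the typical ratio becomes 1. The sketch below centers on the median log ratio rather than the mean, which is a swapped-in choice that leans on the assumption that most genes are unchanged; the function name and data are hypothetical.

```python
import math
import statistics


def global_normalize(red, green):
    """Shift log2(R/G) so that the median log ratio over all spots is 0."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical slide on which the red channel is under-detected by a factor of 2.
red = [100.0, 200.0, 400.0, 50.0, 1600.0]
green = [200.0, 400.0, 800.0, 400.0, 800.0]
print(global_normalize(red, green))   # [0.0, 0.0, 0.0, -2.0, 2.0]
```

After the shift, the three unchanged genes sit at a log ratio of 0 (ratio 1), and only the last two stand out.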
43Intensity Dependent Normalization
44After Normalization
45Additional Normalization
- Pin dependent
- Similar to intensity dependent fit.
- Compute individual lowess fits for each pin group
- Within slide normalization
- After pin dependent normalization, log ratios for each pin are centered around 0
- Scale variance for each pin
- Uses MAD (median absolute deviation)
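The variance scaling step can be sketched as follows. It assumes the log ratios of each pin group are already centered around 0 and that every group has a non-zero MAD; using the geometric mean of the per-pin MADs as the common target is an assumption made for this example, as are the function names and data.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def scale_pin_groups(groups):
    """Rescale each pin group's log ratios so that all groups share one MAD."""
    mads = [mad(g) for g in groups]
    # The geometric mean of the per-pin MADs serves as the common target spread.
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * target / m for v in g] for g, m in zip(groups, mads)]


# Hypothetical log ratios from two pin groups with different spreads.
pins = [[-1.0, -0.5, 0.0, 0.5, 1.0], [-4.0, -2.0, 0.0, 2.0, 4.0]]
scaled = scale_pin_groups(pins)
print([round(mad(g), 6) for g in scaled])   # both groups now have the same MAD
```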
46Additional Normalization
- Dye swap
- Combine relative expression levels without explicit normalization
- Compute lowess fit for
- log2(RR'/GG')/2 vs. log2(AA')/2 (primed values come from the dye-swapped slide)
- Normalized ratio is
- log2(R/G) - c(A)
- where c(A) is the lowess prediction
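The correction log2(R/G) - c(A) can be sketched in plain Python. Two assumptions are made: A is taken to be the average log intensity, log2(R*G)/2, which the slide does not define, and the smoother is a simplified lowess (one pass of tricube-weighted local linear fits, no robustness iterations) rather than any particular package's implementation. Function names and data are hypothetical.

```python
import math


def lowess(x, y, frac=0.5):
    """Simplified lowess: tricube-weighted local linear fit at each x value."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))   # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize_ma(red, green, frac=0.5):
    """Return log2(R/G) - c(A), where c(A) is the lowess prediction at A."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    c = lowess(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]


# Hypothetical spots whose log ratio drifts linearly with intensity (pure dye bias).
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
m_vals = [0.5 * a - 3.0 for a in a_vals]
red = [2 ** (a + m / 2) for a, m in zip(a_vals, m_vals)]
green = [2 ** (a - m / 2) for a, m in zip(a_vals, m_vals)]
print(max(abs(v) for v in normalize_ma(red, green)))   # close to 0: the bias is removed
```

For pin dependent normalization, the same function would be applied separately to the spots of each pin group.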
48Data analysis
- Data filtering
- Fold change analysis
- Classification
- Clustering
- Future direction
49Microarray Data Classification
Microarray chips
Images scanned by laser
Gene             Value
D26528_at        193
D26561_cds1_at   -70
D26561_cds2_at   144
D26561_cds3_at   33
D26579_at        318
D26598_at        1764
D26599_at        1537
D26600_at        1204
D28114_at        707
Datasets
New sample
Data Mining and analysis
Prediction
50The Threshold of Spots
- Filtering: remove genes with insufficient variation
- Remove inadequate spots
- saturated, non-uniform, background too high
- Remove extreme signals
- e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
- Statistical filtering (e.g. p-value < 0.01)
- biological reasons
- feature reduction for algorithmic reasons
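The variation threshold quoted above can be written as a small filter. This is a sketch under assumptions: the gene names and values are made up, and a gene whose minimum is zero or negative is kept because its ratio is undefined (in practice such values are often floored first).

```python
def has_sufficient_variation(values, min_diff=500, min_ratio=5):
    """False when MaxVal - MinVal < min_diff and MaxVal / MinVal < min_ratio."""
    hi, lo = max(values), min(values)
    small_diff = (hi - lo) < min_diff
    small_ratio = lo > 0 and (hi / lo) < min_ratio
    return not (small_diff and small_ratio)


# Hypothetical expression values across three samples.
genes = {
    "geneA": [100, 150, 300],     # diff 200, ratio 3: filtered out
    "geneB": [100, 900, 2000],    # diff 1900: kept
    "geneC": [1000, 1200, 1400],  # diff 400, ratio 1.4: filtered out
}
kept = [name for name, vals in genes.items() if has_sufficient_variation(vals)]
print(kept)   # ['geneB']
```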
51Microarray Data Analysis Types
- Differential gene expression
- Fold change analysis
- Classification (Supervised)
- identify disease
- predict outcome / select best treatment
- Clustering (Unsupervised)
- find new biological classes / refine existing ones
- exploration
52Differential Gene Expression
- n-fold change
- n typically > 2
- May hold no biological relevance
- Often too restrictive
- 2σ expression
- Calculate standard deviation σ
- Genes with expression more than 2σ away are differentially expressed
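Both criteria can be compared on the same set of ratios. In this sketch the standard deviation rule is applied to log2 ratios and measured from their mean; working on the log scale is an assumption, as are the function names and the data.

```python
import math
import statistics


def fold_change_calls(ratios, n=2.0):
    """Flag genes whose ratio is at least n-fold up or down."""
    return [r >= n or r <= 1.0 / n for r in ratios]


def two_sigma_calls(ratios):
    """Flag genes whose log2 ratio lies more than 2 SD from the mean log2 ratio."""
    logs = [math.log2(r) for r in ratios]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return [abs(x - mu) > 2 * sigma for x in logs]


# Hypothetical normalized ratios for ten genes; only the last changes strongly.
ratios = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.2, 0.8, 1.0, 8.0]
print(fold_change_calls(ratios))
print(two_sigma_calls(ratios))
```

On this small data set both rules flag only the last gene; on real data they can disagree, since the 2σ cut-off adapts to the spread of the experiment while the fold threshold does not.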
53Fold Changes-Scatter Plot
54Fold Changes Table
55Classification: Multi-Class
- Similar Approach
- select top genes most correlated to each class
- select best subset using cross-validation
- build a single model separating all classes
- Advanced
- build separate model for each class vs. rest
- choose model making the strongest prediction
56Popular Classification Methods
- Decision Trees/Rules
- find smallest gene sets, but also false positives
- Neural Nets
- work well if number of genes is reduced
- SVM
- good accuracy, does its own gene selection, hard to understand
- K-nearest neighbor: robust for a small number of genes
- Bayesian nets: simple, robust
57Multi-class Data Example
- Brain data, Pomeroy et al., Nature 415, Jan 2002
- 42 examples, about 7,000 genes, 5 classes
- Selected top 100 genes most correlated to each class
- Selected best subset by testing 1, 2, ..., 20 gene subsets, with leave-one-out cross-validation for each
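Leave-one-out cross-validation over candidate gene subsets can be sketched as below. This is not the pipeline of the cited study: a 1-nearest-neighbor classifier stands in for the real model, the gene subsets are passed in by hand rather than ranked by correlation, and the data are made up. In a real analysis the gene ranking should also be redone inside each fold.

```python
import math


def nearest_neighbor(train, labels, sample):
    """Return the label of the training sample closest to `sample` (Euclidean)."""
    best = min(range(len(train)), key=lambda i: math.dist(train[i], sample))
    return labels[best]


def leave_one_out_accuracy(samples, labels, genes):
    """Leave-one-out accuracy using only the gene indices listed in `genes`."""
    reduced = [[s[g] for g in genes] for s in samples]
    correct = 0
    for i, held_out in enumerate(reduced):
        train = reduced[:i] + reduced[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        correct += nearest_neighbor(train, train_labels, held_out) == labels[i]
    return correct / len(samples)


# Hypothetical data: 6 samples x 3 genes; genes 0 and 1 separate the classes, gene 2 is noise.
samples = [
    [1.0, 5.0, 3.0], [1.2, 5.1, 9.0], [0.9, 4.8, 6.0],
    [5.0, 1.0, 3.1], [5.2, 0.9, 9.1], [4.9, 1.1, 6.1],
]
labels = ["A", "A", "A", "B", "B", "B"]
for subset in ([0], [0, 1], [2]):
    print(subset, leave_one_out_accuracy(samples, labels, subset))
```

The subset with the highest held-out accuracy would be chosen, mirroring the idea of testing subsets of 1 to 20 genes.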
58Classification: Other Applications
- Combining clinical and genetic data
- Outcome / Treatment prediction
- Age, sex, and stage of disease are useful
- e.g. if the data come from a male patient, ovarian cancer can be ruled out
59Clustering
- Goals
- Find natural classes in the data
- Identify new classes / gene correlations
- Refine existing taxonomies
- Support biological analysis / discovery
- Different Methods
- Hierarchical clustering, SOMs, etc.
60SOM clustering
- SOM: self-organizing maps
- Preprocessing
- filter away genes with insufficient biological variation
- normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
- Run SOM for many iterations
- Plot the results
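The per-gene normalization in the preprocessing step is a z-score across samples. A minimal sketch with an illustrative function name; it uses the sample standard deviation, which is one of two common conventions.

```python
import statistics


def standardize_gene(values):
    """Scale one gene's expression across samples to mean 0 and st. dev. 1."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("gene has no variation; filter it out first")
    return [(v - mu) / sd for v in values]


profile = [2.0, 4.0, 6.0]           # hypothetical expression of one gene in three samples
print(standardize_gene(profile))    # [-1.0, 0.0, 1.0]
```

After this step, clustering compares the shapes of expression profiles rather than their absolute levels.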
61SOM K Mean By GeneSpring
62Hierarchical Clustering
- The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method
- works with the data in a bottom-up manner.
- Initially, each data point forms its own cluster, and the algorithm repeatedly merges the two clusters that are most similar or have the shortest distance.
- the algorithm involves computing the distance or similarity matrix
- O(N^2) complexity, and thus is not very efficient.
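The bottom-up procedure can be sketched as follows. Single linkage (shortest distance between any two members) and Euclidean distance are assumptions chosen for brevity, and the points are made up. This naive version recomputes linkages on every pass, so it is slower than an implementation that maintains the distance matrix.

```python
import math


def agglomerative(points, n_clusters=1):
    """Naive bottom-up clustering with single linkage.

    Returns the merge history and the final clusters (lists of point indices).
    """
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    merges = []

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the shortest linkage distance.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges, clusters


# Hypothetical expression profiles reduced to two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges, clusters = agglomerative(points, n_clusters=3)
print(clusters)   # [[0, 1], [2, 3], [4]]
```

Running with n_clusters=1 records the full merge history, which is the information a dendrogram displays.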
63Hierarchical clustering
64Future directions
- Algorithms optimized for small samples (the number of samples will remain small for many tasks)
- Integration with other data
- biological networks
- medical text
- protein data
- cost-sensitive classification algorithms
- error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.
65Integrate biological knowledge when analyzing microarray data (from Cheng Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p. 25
67Microarray Potential Applications
- Biological discovery
- new and better molecular diagnostics
- new molecular targets for therapy
- finding and refining biological pathways
- Mutation and polymorphism detection
- Recent examples
- molecular diagnosis of leukemia, breast cancer, ...
- appropriate treatment for genetic signature
- potential new drug targets
68Microarray Limitations
- Cross-hybridization of sequences with high identity
- Chip-to-chip variation
- True measure of abundance?
- Do mRNA levels reflect protein levels?
- Generally, microarrays do not prove new biology; they simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.
- What fold change has biological relevance?
- Need cloned EST or some sequence knowledge; rare messages may go undetected
- Expensive! Not every lab can afford to repeat experiments.
- The real limitation is bioinformatics
69Additional Information
- Review papers on microarray
- Genomics, gene expression and DNA arrays (Nature, June 2000)
- Microarray - technology review (Nature Cell Biology, Aug. 2001)
- Magic of Microarray (Scientific American, Feb. 2002)
- Molecular biology tutorial
- http://www.lsic.ucla.edu/ls3/tutorials/
70Biological data retrieval systems: Entrez
http://www.ncbi.nlm.nih.gov/Database/index.html
- A retrieval system for searching a number of inter-connected databases at the NCBI. It provides access to:
- PubMed: the biomedical literature (Medline)
- GenBank: nucleotide sequence database
- Protein sequence database
- Structure: three-dimensional macromolecular structures
- Genome: complete genome assemblies
- PopSet: population study data sets
- OMIM: Online Mendelian Inheritance in Man
- Taxonomy: organisms in GenBank
- Books: online books
- ProbeSet: gene expression and microarray datasets
- 3D Domains: domains from Entrez Structure
- UniSTS: markers and mapping data
- SNP: single nucleotide polymorphisms
- CDD: conserved domains
- Entrez allows users to perform various searches.