Analysis of Affymetrix GeneChip Data

Provided by: davidm133

1
Analysis of Affymetrix GeneChip Data
  • EPP 245
  • Statistical Analysis of
  • Laboratory Data

2
Basic Design of Expression Arrays
  • For each gene that is a target for the array, we
    have a known DNA sequence.
  • mRNA is reverse transcribed to DNA, and if a
    complementary sequence is on the chip, the
    DNA will be more likely to stick
  • The DNA is labeled with a dye that will fluoresce
    and generate a signal that is monotonic in the
    amount in the sample

3
[Figure: double-stranded DNA sequence annotated with the labels Intron, Exon, and Probe Sequence]
  • cDNA arrays use variable-length probes derived
    from expressed sequence tags (ESTs)
  • Spotted and almost always used with two-color
    methods
  • Can be used in species with an unsequenced genome
  • Long oligo arrays use 60-70-mers
  • Agilent two-color arrays
  • Spotted arrays from UC Davis or elsewhere
  • Usually use computationally derived probes but
    can use probes from sequenced ESTs

4
  • Affymetrix GeneChips use multiple 25-mers
  • For each gene, one or more sets of 8-20 distinct
    probes
  • May overlap
  • May cover more than one exon
  • Affymetrix chips also use mismatch (MM) probes
    that have the same sequence as perfect match (PM)
    probes except for the middle base, which is
    changed to inhibit binding.
  • This is supposed to act as a control, but often
    instead binds to another mRNA species, so many
    analysts do not use them
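The PM/MM relationship can be sketched in a few lines of base R. This is an illustration only: the function name make_mismatch is hypothetical, the 25-mer is simply the first 25 bases of the sequence shown on slide 3 rather than a real probe, and the code assumes the middle (13th) base is replaced by its complement.

```r
# Build a mismatch (MM) probe from a perfect match (PM) probe by
# swapping the middle base for its complement (assumed convention).
make_mismatch <- function(pm) {
  bases <- strsplit(toupper(pm), "")[[1]]
  mid <- (length(bases) + 1) %/% 2          # position 13 for a 25-mer
  bases[mid] <- chartr("ACGT", "TGCA", bases[mid])
  paste(bases, collapse = "")
}

pm <- "TAAATCGATACGCATTAGTTCGACC"            # illustrative 25-mer, not a real probe
mm <- make_mismatch(pm)

# Positions at which PM and MM differ (only the middle one should)
diff_pos <- which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
print(mm)
print(diff_pos)
```

Because only one base in 25 differs, it is plausible that an MM probe still binds its intended target, or some other mRNA, to a noticeable degree.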

5
Probe Design
  • A good probe sequence should match the chosen
    gene or exon from a gene and should not match any
    other gene in the genome.
  • Melting temperature depends on the GC content and
    should be similar on all probes on an array since
    the hybridization must be conducted at a single
    temperature.
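As a rough illustration of the GC-content criterion, the fraction of G and C bases in a candidate probe can be computed directly. The function name gc_content and the three 25-mers are made up for this sketch; a real design pipeline would also estimate melting temperature and check specificity against the genome.

```r
# Fraction of G and C bases in a probe sequence.
gc_content <- function(probe) {
  bases <- strsplit(toupper(probe), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Made-up 25-mers, for illustration only
probes <- c(at_rich = "ATATTAATTATAAATTTATATAATA",
            mixed   = "TAAATCGATACGCATTAGTTCGACC",
            gc_rich = "GCGCCGGCGCGGCCGCGGGCCCGCG")

gc <- sapply(probes, gc_content)
print(round(gc, 2))
```

Probes whose GC fraction is far from that of the rest of the array would be expected to hybridize poorly at the common temperature, so a designer might filter on a range of this value.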

6
  • The affinity of a given piece of DNA for the
    probe sequence can depend on many things,
    including secondary and tertiary structure as
    well as GC content.
  • This means that the relationship between the
    concentration of the RNA species in the original
    sample and the brightness of the spot on the
    array can be very different for different probes
    for the same gene.
  • Thus only comparisons of intensity within the
    same probe across arrays make sense.

7
Affymetrix GeneChips
  • For each probe set, there are 8-20 perfect match
    (PM) probes which may overlap or not and which
    target the same gene
  • There are also mismatch (MM) probes which are
    supposed to serve as a control, but do so rather
    badly
  • Most of us ignore the MM probes

8
Expression Indices
  • A key issue with Affymetrix chips is how to
    summarize the multiple data values on a chip for
    each probe set (aka gene).
  • There have been a large number of suggested
    methods.
  • Generally, the worst ones are those from
    Affymetrix, by a long way. Here "worse" means less
    able to detect real differences.

9
Usable Methods
  • Li and Wong's dChip and follow-on work is
    demonstrably better than MAS 4.0 and MAS 5.0, but
    not as good as RMA and GLA
  • The RMA method of Irizarry et al. is available in
    Bioconductor.
  • The GLA method (Durbin, Rocke, Zhou) is also
    available in Bioconductor

10
Bioconductor Documentation
> library(affy)
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

Loading required package: affyio
Loading required package: preprocessCore

11
Bioconductor Documentation
> openVignette()
Please select a vignette:

 1: affy - 1. Primer
 2: affy - 2. Built-in Processing Methods
 3: affy - 3. Custom Processing Methods
 4: affy - 4. Import Methods
 5: affy - 5. Automatic downloading of CDF packages
 6: Biobase - An introduction to Biobase and ExpressionSets
 7: Biobase - Bioconductor Overview
 8: Biobase - esApply Introduction
 9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances

Selection:

12
Reading Affy Data into R
  • The CEL files contain the data from an array. We
    will look at data from an older type of array,
    the U95A which contains 12,625 probe sets and
    409,600 probes.
  • The CDF file contains information relating probe
    pair sets to locations on the array. These are
    built into the affy package for standard types.

13
Example Data Set
  • Data from Robert Rice's lab on twelve
    keratinocyte cell lines, at six different stages.
  • Affymetrix HG U95A GeneChips.
  • For each gene, we will run a one-way ANOVA with
    two observations per cell (that is, per stage).
  • For this illustration, we will use RMA.
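The per-gene one-way ANOVA can be sketched as follows. Everything here is simulated: the matrix expr stands in for an RMA expression matrix, the number of genes is arbitrary, and the helper anova_p is a hypothetical name. Only the layout (six stages, two arrays per stage) follows the slide.

```r
set.seed(1)                                   # simulated data, not the actual arrays
stage <- factor(rep(0:5, each = 2))           # six stages, two arrays per stage
n_genes <- 100                                # arbitrary number of probe sets
expr <- matrix(rnorm(n_genes * 12, mean = 7), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Give the first gene a real stage effect so that something is detectable
expr[1, ] <- expr[1, ] + 3 * as.integer(as.character(stage))

# One-way ANOVA p-value for one row of the expression matrix
anova_p <- function(y, group) {
  anova(lm(y ~ group))[["Pr(>F)"]][1]
}

p_values <- apply(expr, 1, anova_p, group = stage)
print(head(sort(p_values), 3))
```

With thousands of probe sets, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust().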

14
Files for the Analysis
  • .CDF file has U95A chip definition (which probe
    is where on the chip). Built in.
  • .CEL files contain the raw data after pixel-level
    analysis, one number for each spot. Files are
    called LN0A.CEL, LN0B.CEL, ..., LN5B.CEL and are on
    the web site.
  • 409,600 probe values in 12,625 probe sets.

15
The ReadAffy function
  • The ReadAffy() function reads all of the CEL files
    in the current working directory into an object
    of class AffyBatch, which, like ExpressionSet, is
    built on the Biobase eSet class
  • ReadAffy(widget=TRUE) does so in a GUI that allows
    entry of other characteristics of the dataset
  • You can also specify filenames, phenotype or
    experimental data, and MIAME information
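A minimal sketch of reading a named set of CEL files, rather than everything in the working directory, might look like this. The file names follow the LN0A.CEL to LN5B.CEL pattern used in this example; the read itself is skipped unless the affy package is installed and the files are actually present.

```r
# Build the twelve file names LN0A.CEL, LN0B.CEL, ..., LN5B.CEL
cel_files <- paste0("LN", rep(0:5, each = 2), c("A", "B"), ".CEL")
print(cel_files)

# Only attempt the read if affy is installed and every file exists
if (requireNamespace("affy", quietly = TRUE) && all(file.exists(cel_files))) {
  rrdata <- affy::ReadAffy(filenames = cel_files)
  print(dim(Biobase::exprs(rrdata)))          # probes by arrays
} else {
  message("affy or the CEL files are not available; skipping the read")
}
```

Passing the file names explicitly fixes the column order of the resulting object, which is convenient when a stage factor is built to match.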

16
> rrdata <- ReadAffy()
> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"
> dim(exprs(rrdata))
[1] 409600     12
> colnames(exprs(rrdata))
 [1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
 [7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
> length(probeNames(rrdata))
[1] 201800
> length(unique(probeNames(rrdata)))
[1] 12625
> length((featureNames(rrdata)))
[1] 12625
> featureNames(rrdata)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"

17
The ExpressionSet class
  • An object of class ExpressionSet has several
    slots, the most important of which is an assayData
    object containing one or more matrices. The best
    way to extract parts of this is to use the
    appropriate accessor methods.
  • exprs() extracts an expression matrix
  • featureNames() extracts the names of the probe
    sets.

18
Expression Indices
  • The 409,600 rows of the expression matrix in the
    AffyBatch object each correspond to a probe
    (25-mer)
  • Ordinarily, to use this we need to combine the
    probe-level data for each probe set into a single
    expression number
  • Conceptually, this has several steps

19
Steps in Expression Index Construction
  • Background correction is the process of adjusting
    the signals so that the zero point is similar on
    all parts of all arrays.
  • We like to manage this so that zero signal after
    background correction corresponds approximately
    to zero amount of the mRNA species that is the
    target of the probe set.

20
  • Data transformation is the process of changing
    the scale of the data so that it is more
    comparable from high to low.
  • Common transformations are the logarithm and
    generalized logarithm
  • Normalization is the process of adjusting for
    systematic differences from one array to another.
  • Normalization may be done before or after
    transformation, and before or after probe set
    summarization.
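The transformation and normalization steps can be illustrated in base R. This is a sketch under stated assumptions: the intensities are simulated, the generalized log is written in one common form, log(y + sqrt(y^2 + lambda)), with an arbitrary lambda, and quantile_normalize is a hypothetical helper that shows one widely used normalization, not necessarily the only choice.

```r
# Generalized log: close to log(2y) for large y, but finite for y <= 0.
# lambda is a tuning constant; the value used below is arbitrary.
glog <- function(y, lambda) log(y + sqrt(y^2 + lambda))

# Quantile normalization: give every array (column) the same empirical
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(2)  # simulated intensities: three arrays of differing brightness
raw <- cbind(a1 = rexp(1000, 1 / 200),
             a2 = rexp(1000, 1 / 300),
             a3 = rexp(1000, 1 / 250))

normalized <- quantile_normalize(log2(raw))
print(apply(log2(raw), 2, median))     # medians differ before normalization
print(apply(normalized, 2, median))    # and agree afterwards
print(glog(c(0, 10, 1000), lambda = 100))
```

Here the log transform is applied before normalization, but as noted above the order is a design choice.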

21
  • One may use only the perfect match (PM) probes,
    or may subtract or otherwise use the mismatch
    (MM) probes
  • There are many ways to summarize 20 PM probes and
    20 MM probes on 10 arrays (a total of 400 numbers,
    200 of them PM) into 10 expression index numbers

22
Probe intensities for LASP1 in a radiation dose-response experiment

Probe                 0       1      10     100    Mean
200618_at1          360     216     158     198   233.0
200618_at2          313     402     106     103   231.0
200618_at3          130     182      79      91   120.5
200618_at4          351     370     195     136   263.0
200618_at5          164     130      98     107   124.8
200618_at6          223     219     164     196   200.5
200618_at7          437     529     195     158   329.8
200618_at8          509     554     274     128   366.3
200618_at9          522     720     285     198   431.3
200618_at10         668     715     247     260   472.5
200618_at11         306     286     144     159   223.8
Expression Index  362.1   393.0   176.8   157.6

23
Log probe intensities for LASP1 in a radiation dose-response experiment

Probe                0      1     10    100   Mean
200618_at1        2.56   2.33   2.20   2.30   2.35
200618_at2        2.50   2.60   2.03   2.01   2.28
200618_at3        2.11   2.26   1.90   1.96   2.06
200618_at4        2.55   2.57   2.29   2.13   2.38
200618_at5        2.21   2.11   1.99   2.03   2.09
200618_at6        2.35   2.34   2.21   2.29   2.30
200618_at7        2.64   2.72   2.29   2.20   2.46
200618_at8        2.71   2.74   2.44   2.11   2.50
200618_at9        2.72   2.86   2.45   2.30   2.58
200618_at10       2.82   2.85   2.39   2.41   2.62
200618_at11       2.49   2.46   2.16   2.20   2.33
Expression Index  2.51   2.53   2.21   2.18

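The Expression Index rows in the two LASP1 tables are consistent with simple column averages over the eleven probes, on the raw scale and on the base-10 log scale respectively. The sketch below recomputes them from the raw intensities; the object names are arbitrary, and averaging is only the simplest possible summary, not one of the recommended methods.

```r
probe_ids <- paste0("200618_at", 1:11)
doses <- c("0", "1", "10", "100")

# Raw probe intensities, one row per probe, one column per dose
lasp1 <- matrix(c(360, 216, 158, 198,
                  313, 402, 106, 103,
                  130, 182,  79,  91,
                  351, 370, 195, 136,
                  164, 130,  98, 107,
                  223, 219, 164, 196,
                  437, 529, 195, 158,
                  509, 554, 274, 128,
                  522, 720, 285, 198,
                  668, 715, 247, 260,
                  306, 286, 144, 159),
                ncol = 4, byrow = TRUE,
                dimnames = list(probe_ids, doses))

index_raw <- colMeans(lasp1)          # average of raw intensities
index_log <- colMeans(log10(lasp1))   # average of log10 intensities

print(round(index_raw, 1))
print(round(index_log, 2))
print(round(rowMeans(lasp1), 1))      # probes differ several-fold in brightness
```

The row means range from about 120 to over 470 for the same gene, which is why intensities are compared within a probe across arrays rather than between probes.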
24
The RMA Method
  • Background correction that does not make 0 signal
    correspond to 0 amount
  • Quantile normalization
  • Log2 transform
  • Median polish summary of PM probes
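The last RMA step, the median polish summary, can be sketched with medpolish() from base R's stats package. The helper name median_polish_index and the small matrix are made up, and the input is assumed to be already background corrected and normalized; in practice rma() in the affy package carries out all four steps.

```r
# Summarize a probes-by-arrays matrix of log2 PM intensities into one
# value per array using Tukey's median polish.
median_polish_index <- function(log_pm) {
  fit <- medpolish(log_pm, trace.iter = FALSE)
  fit$overall + fit$col                 # overall level plus array effect
}

# Four probes on three arrays (made-up values)
pm <- matrix(c(120, 240,  480,
               300, 610, 1150,
                80, 150,  330,
               900, 200, 3600),         # the 200 is a deliberate outlier
             ncol = 3, byrow = TRUE,
             dimnames = list(paste0("probe", 1:4), paste0("array", 1:3)))

index <- median_polish_index(log2(pm))
print(round(index, 2))
```

Because medians are used rather than means, a single aberrant probe value such as the outlier above has limited influence on the per-array summary.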

25
> eset <- rma(rrdata)
trying URL 'http://bioconductor.org/packages/2.1/...
Content type 'application/zip' length 1352776 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb

package 'hgu95av2cdf' successfully unpacked and MD5 sums checked

The downloaded packages are in
        C:\Documents and Settings\dmrocke\Local Settings...
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression
> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1] 12625    12
> featureNames(eset)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"

26
> exprs(eset)[1:5,]
          LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL
100_g_at  9.195937 9.388350 9.443115 9.012228 9.311773 9.386037 9.386089
1000_at   8.229724 7.790238 7.733320 7.864438 7.620704 7.930373 7.502759
1001_at   5.066185 5.057729 4.940588 4.839563 4.808808 5.195664 4.952883
1002_f_at 5.409422 5.472210 5.419907 5.343012 5.266068 5.442173 5.190440
1003_s_at 7.262739 7.323087 7.355976 7.221642 7.023408 7.165052 7.011527
          LN3B.CEL LN4A.CEL LN4B.CEL LN5A.CEL LN5B.CEL
100_g_at  9.394606 9.602404 9.711533 9.826789 9.645565
1000_at   7.463158 7.644588 7.497006 7.618449 7.710110
1001_at   4.871329 4.875907 4.853802 4.752610 4.834317
1002_f_at 5.200380 5.436028 5.310046 5.300938 5.427841
1003_s_at 7.185894 7.235551 7.292139 7.218818 7.253799

27
> summary(exprs(eset))
    LN0A.CEL         LN0B.CEL         LN1A.CEL         LN1B.CEL
 Min.   : 2.713   Min.   : 2.585   Min.   : 2.611   Min.   : 2.636
 1st Qu.: 4.478   1st Qu.: 4.449   1st Qu.: 4.458   1st Qu.: 4.477
 Median : 6.080   Median : 6.072   Median : 6.070   Median : 6.078
 Mean   : 6.120   Mean   : 6.124   Mean   : 6.120   Mean   : 6.128
 3rd Qu.: 7.443   3rd Qu.: 7.473   3rd Qu.: 7.467   3rd Qu.: 7.467
 Max.   :12.042   Max.   :12.146   Max.   :12.122   Max.   :11.889
    LN2A.CEL         LN2B.CEL         LN3A.CEL         LN3B.CEL
 Min.   : 2.598   Min.   : 2.717   Min.   : 2.633   Min.   : 2.622
 1st Qu.: 4.444   1st Qu.: 4.469   1st Qu.: 4.425   1st Qu.: 4.428
 Median : 6.008   Median : 6.058   Median : 6.017   Median : 6.028
 Mean   : 6.109   Mean   : 6.125   Mean   : 6.116   Mean   : 6.117
 3rd Qu.: 7.426   3rd Qu.: 7.422   3rd Qu.: 7.444   3rd Qu.: 7.459
 Max.   :13.135   Max.   :13.110   Max.   :13.106   Max.   :13.138
    LN4A.CEL         LN4B.CEL         LN5A.CEL         LN5B.CEL
 Min.   : 2.742   Min.   : 2.634   Min.   : 2.615   Min.   : 2.590
 1st Qu.: 4.468   1st Qu.: 4.433   1st Qu.: 4.448   1st Qu.: 4.487
 Median : 6.074   Median : 6.050   Median : 6.053   Median : 6.068
 Mean   : 6.122   Mean   : 6.120   Mean   : 6.121   Mean   : 6.123
 3rd Qu.: 7.460   3rd Qu.: 7.478   3rd Qu.: 7.477   3rd Qu.: 7.457
 Max.   :12.033   Max.   :12.162   Max.   :11.925   Max.   :11.952

28
Probe Sets not Genes
  • It is unavoidable to refer to a probe set as
    measuring a gene, but nevertheless it can be
    deceptive
  • The annotation of a probe set may be based on
    homology with a gene of possibly known function
    in a different organism
  • Only relatively few probe sets correspond to
    genes with known function and known structure in
    the organism being studied

29
Exercise
  • Download the ten arrays from the web site
  • Load the arrays into R using ReadAffy() and
    construct the RMA expression indices