Title: Multiple testing, correlation and regression, and clustering in R
1Multiple testing, correlation and regression, and
clustering in R
- Multtest package
- Anscombe dataset and stats package
- Cluster package
2Multtest package
- The multtest package contains a collection of
functions for multiple hypothesis testing. - These functions can be used to identify
differentially expressed genes in microarray
experiments, i.e., genes whose expression levels
are associated with a response or covariate of
interest.
3Multtest package
- These procedures are implemented for tests based
on tstatistics, Fstatistics, paired
tstatistics, Wilcoxon statistics. - The multtest package implements multiple testing
procedures for controlling different Type I error
rates. It includes procedures for controlling the
familywise Type I error rate (FWER) Bonferroni,
Hochberg (1988), Holm (1979). - It also includes procedures for controlling the
false discovery rate (FDR) Benjamini and
Hochberg (1995) and Benjamini and Yekutieli
(2001) stepup procedures.
4Data Analysis
- Leukemia Data by Golub et al. (1999)
- gt library(multtest, verbose FALSE)
- gt data(golub)
5Golub Data
- Golub et al. (1999) were interested in
identifying genes that are differentially
expressed in patients with two type of leukemias,
acute lymphoblastic leukemia (ALL, class 0) and
acute myeloid leukemia (AML, class 1). - Gene expression levels were measured using
Affymetrix highdensity oligonucleotide chips
containing p 6, 817 human genes. - The learning set comprises n 38 samples, 27 ALL
cases and 11 AML cases (data available at
http//www.genome.wi.mit.edu/MPR).
6Golub Data
- Three preprocessing steps were applied to the
normalized matrix of intensity values available
on the website - (i) thresholding floor of 100 and ceiling of
16,000 - (ii) filtering exclusion of genes with max / min
lt 5 or (max-min) lt 500, where max and min refer
respectively to the maximum and minimum
intensities for a particular gene across mRNA
samples - (iii) base 10 logarithmic transformation.
- Boxplots of the expression levels for each of the
38 samples revealed the need to standardize the
expression levels within arrays before combining
data across samples. The data were then
summarized by a 3, 05138 matrix X (xji), where
xji denotes the expression level for gene j in
tumor mRNA sample i.
7Golub Dataset
- The dataset golub contains the gene expression
data for the 38 training set tumor mRNA samples
and 3,051 genes retained after preprocessing.
The dataset includes - golub a 3, 051 38 matrix of expression levels
- golub.gnames a 3, 051 3 matrix of gene
identifiers - golub.cl a vector of tumor class labels (0 for
ALL, 1 for AML).
8Golub Dataset
- gt dim(golub)
- gt 1 3051 38
9Golub Dataset
10Golub Dataset
11The mt.teststat and mt.teststat.num.denum
functions
- Used to to compute test statistics for each row
of a data frame, e.g., twosample Welch
tstatistics, Wilcoxon statistics, Fstatistics,
paired tstatistics, block Fstatistics.
12usage
- mt.teststat(X,classlabel,test"t",na.mt.naNUM,non
para"n") -
'test"wilcoxon"'
'test"pairt"'
'test"f"'
mt.teststat.num.denum(X,classlabel,test"t",na.mt
.naNUM,nonpara"n")
13twosample tstatistics
- comparing, for each gene, expression in the ALL
cases to expression in the AML cases - gt teststat lt- mt.teststat(golub, golub.cl)
14QQ plot
- First create an empty .pdf file to plot the QQ
plot. - gt pdf("mtQQ.pdf")
- Plot the qqplot
- gt qqnorm(teststat)
- Plot a diagonal line
- gt qqline(teststat)
- Shuts down the graphical object (e.g., pdf file)
- gt dev.off()
15What is a QQ plot?
16What is a QQ plot?
- The quantile-quantile (q-q) plot is a graphical
technique for determining if two data sets come
from populations with a common distribution (e.g.
normal distribution). - A q-q plot is a plot of the quantiles of the
first data set (expected) against the quantiles
of the second data set (observed).
17What is a QQ plot?
- By a quantile, we mean the fraction (or percent)
of points below the given value. That is, the 0.3
(or 30) quantile is the point at which 30
percent of the data fall below and 70 fall above
that value. - A 45-degree reference line is also plotted.
- If the two sets come from a population with the
same distribution, the points should fall
approximately along this reference line.
18Store the numerator and denominator of the test
stat.
- Create a variable tmp in which you store the
numerators and denominators of the test statistic - tmp lt- mt.teststat.num.denum(golub, golub.cl,
test "t") - Name the numerator as num (teststat.num atribute
of the tmp object) - gt num lt- tmpteststat.num
- Name the denominator as denum (teststat.denum
atribute of the tmp object). - gt denum lt- tmpteststat.denum
19Plot the num to denum
- Create a pdf devise
- gt pdf("mtNumDen.pdf")
- Plot
- gt plot(sqrt(denum), num)
- Shut off the graphics
- gt dev.off()
20Plot the num to denum
21mt.rawp2adjp function
- This function computes adjusted pvalues for
simple multiple testing procedures from a vector
of raw (unadjusted) pvalues. - The procedures include the Bonferroni, Holm
(1979), Hochberg (1988), and Sidak procedures for
strong control of the familywise Type I error
rate (FWER), and the Benjamini and Hochberg
(1995) and Benjamini and Yekutieli (2001)
procedures for (strong) control of the false
discovery rate (FDR).
22Raw p-values (unadjusted)
- As a first approximation, compute raw nominal
twosided pvalues for the 3051 test statistics
using the standard Gaussian distribution. Create
a variable known as rawp0 that contain the
p-values of the 3051 test statistics. - gt rawp0 lt- 2 (1 - pnorm(abs(teststat)))
- gt rawp015
- 1 0.07854436 0.36289759 0.92191171 0.73463771
0.17063542
23The order function
- gt aalt-c(3,5,2,1,9)
- gt ialt-order(aa) index order
- gt ia
- 1 4 3 1 2 5
- gt aaia order aa according to ia
- 1 1 2 3 5 9
24Adjusted p-values
- Create a vector of character strings containing
the names of the multiple testing procedures for
which adjusted p-values are to be computed. This
vector should include any of the following
'"Bonferroni"', '"Holm"', '"Hochberg"',
'"SidakSS"', '"SidakSD"', '"BH"', '"BY"'. - gt procs lt- c("Bonferroni", "Holm",
- "Hochberg", "SidakSS", "SidakSD",
- "BH", "BY")
25Adjusted p-values in the order of the gene list
- Adjusted pvalues can be stored in the original
gene order in adjp using order(resindex) - gt res lt- mt.rawp2adjp(rawp0, procs)
- gt adjp lt- resadjporder(resindex),
26Adjusted p-values in the order of significance
- Adjusted pvalues can be stored in the order of
significance - gt adjp lt- resadjp
- gt adjp15,
27mt.reject function to list number of genes
rejected
28mt.reject function to list number of genes
rejected
29Find the genes most significant
30Get the data of significantly modulated genes
- gt which lt- mt.reject(cbind(rawp0, adjp),
0.000001)which, 2 - gt sum(which)
- 1 143
- gt gsignificantlt-golubwhich,
- gt dim(gsignificant)
- 1 143 38
- gt gsignificant15,15
- ,1 ,2 ,3 ,4 ,5
- 1, -1.45769 -0.32639 -1.46227 -0.18983 -0.12402
- 2, 0.86019 -0.14271 0.67037 0.70706 0.87697
- 3, -1.00702 -0.89365 -1.21154 -1.40715 -1.42668
- 4, -1.27427 -0.66834 -0.58252 -1.40715 -0.03531
- 5, -0.45670 0.48916 -0.48474 0.02261 0.00704
31mt.plot function
- The mt.plot function produces a number of
graphical summaries for the results of multiple
testing procedures and their corresponding
adjusted pvalues.
32mt.plot function
- To produce plots of adjusted pvalues for the
Bonferroni, maxT, Benjamini and Hochberg 7
(1995), and Benjamini and Yekutieli (2001)
procedures use
33Plot adjp values over number of rejected
hypotheses
- gt res lt- mt.rawp2adjp(rawp, c("Bonferroni", "BH",
"BY")) - gt adjp lt- resadjporder(resindex),
- gt allp lt- cbind(adjp, maxT)
- gt dimnames(allp)2 lt- c(dimnames(adjp)2,
"maxT") - gt procs lt- dimnames(allp)2
- gt procs lt- procsc(1, 2, 5, 3, 4)
- gt cols lt- c(1, 2, 3, 5, 6)
- gt ltypes lt- c(1, 2, 2, 3, 3)
- gt mt.plot(adjp,teststat,plottype "pvsr")
34Correlation and Regression in R
- Anscombe dataset
- gt data(anscombe)
- gt ? anscombe
35Anscombe dataset
36Get summaries
37cor function
38Find correlation coefficient of all pairwise
combinations
39Find correlation coefficient of specific pairs
gt cor(anscombe,1,anscombe,2) 1 1 gt
cor(anscombe,1,anscombe,4) 1 -0.5
40cor.test function
41cor.test
- gt cor.test(anscombe,1,anscombe,4)
- Pearson's product-moment correlation
- data anscombe, 1 and anscombe, 4
- t -1.7321, df 9, p-value 0.1173
- alternative hypothesis true correlation is not
equal to 0 - 95 percent confidence interval
- -0.8460984 0.1426659
- sample estimates
- cor
- -0.5
42Plot scatterplots
- gt plot(anscombe,1,anscombe,4)
43Application to Golub dataset
44Calculate the linear regression coefficients and
get a summary
- gt lllt-lm(anscombe,4 anscombe,1)
- gt summary(ll)
45Get the anova table
46Plot the regression line
- gt plot(anscombe,1,anscombe,4)
- gt abline(lm(anscombe,4 anscombe,1))
47llfitted.values
48llcoefficients, llresiduals
49hclust function
50hclust for golub data (correlation distance with
average linkage)
51hclust for golub data (correlation distance with
complete linkage)
52hclust for golub data (euclidean distance with
single linkage)
53hclust for significant golub data (euclidean
distance with single linkage)
54heatmap function
55heatmap for significant genes
- gt attributes(clust.euclid.single)
- names
- 1 "merge" "height" "order"
"labels" "method" - 6 "call" "dist.method"
- class
- 1 "hclust"
- gt clust.euclid.singleorder
- 1 129 10 124 125 54 35 28 59 44 85 110
90 106 15 137 138 139 37 - 19 47 126 63 88 27 134 79 80 76 121 131
81 3 82 11 1 104 78 - 37 140 31 34 98 89 8 26 123 92 113 61
127 16 95 29 96 39 142 - 55 7 57 128 14 32 6 67 9 71 133 52
19 74 72 23 73 132 99 - 73 55 135 56 136 5 97 24 18 25 4 107
84 94 109 114 51 68 69 - 91 45 65 46 41 75 49 102 122 50 100 83
12 86 141 17 116 143 2 - 109 77 105 112 22 36 101 111 13 62 21 60
87 38 58 118 115 119 93 - 127 30 33 108 43 120 91 70 130 53 40 66
64 20 42 103 48 117 - gt gsignificantorderedlt-gsignificantclust.euclid.s
ingleorder, - gt heatmap(gsignificantordered)
56Heatmap for significant genes
57kmeans function
58kmeans for samples based on significant genes
59pam function
60Pam on samples based on significant genes
61Silioutte plot
gt plot(clust.pam.2)
62Microarray Data Analysis Software
- http//rana.lbl.gov/EisenSoftware.htm
- http//classify.stanford.edu/
- http//www.broad.mit.edu/cancer/software/software.
html - http//homes.esat.kuleuven.be/dna/Biol/Software.h
tml - http//vortex.cs.wayne.edu/Projects.html
- http//visitor.ics.uci.edu/cgi-bin/genex/rcluster/
index.cgi - http//www-stat.stanford.edu/7Etibs/SAM/index.htm
l - http//maexplorer.sourceforge.net/
- http//ihome.cuhk.edu.hk/b400559/arraysoft.html
- http//genome.ws.utk.edu/resources/restech.shtml