Multiple testing, correlation and regression, and clustering in R

About This Presentation

Title:

Multiple testing, correlation and regression, and clustering in R

Description:

mt.rawp2adjp function ... mt.reject function to list number of genes rejected ... mt.plot function ... – PowerPoint PPT presentation

Number of Views:180

Avg rating:3.0/5.0

Slides: 63

Provided by: fenBilk

Category:

more less

Transcript and Presenter's Notes

Title: Multiple testing, correlation and regression, and clustering in R

1
Multiple testing, correlation and regression, and
clustering in R

Multtest package
Anscombe dataset and stats package
Cluster package

2
Multtest package

The multtest package contains a collection of
functions for multiple hypothesis testing.
These functions can be used to identify
differentially expressed genes in microarray
experiments, i.e., genes whose expression levels
are associated with a response or covariate of
interest.

3
Multtest package

These procedures are implemented for tests based
on tstatistics, Fstatistics, paired
tstatistics, Wilcoxon statistics.
The multtest package implements multiple testing
procedures for controlling different Type I error
rates. It includes procedures for controlling the
familywise Type I error rate (FWER) Bonferroni,
Hochberg (1988), Holm (1979).
It also includes procedures for controlling the
false discovery rate (FDR) Benjamini and
Hochberg (1995) and Benjamini and Yekutieli
(2001) stepup procedures.

4
Data Analysis

Leukemia Data by Golub et al. (1999)
gt library(multtest, verbose FALSE)
gt data(golub)

5
Golub Data

Golub et al. (1999) were interested in
identifying genes that are differentially
expressed in patients with two type of leukemias,
acute lymphoblastic leukemia (ALL, class 0) and
acute myeloid leukemia (AML, class 1).
Gene expression levels were measured using
Affymetrix highdensity oligonucleotide chips
containing p 6, 817 human genes.
The learning set comprises n 38 samples, 27 ALL
cases and 11 AML cases (data available at
http//www.genome.wi.mit.edu/MPR).

6
Golub Data

Three preprocessing steps were applied to the
normalized matrix of intensity values available
on the website
(i) thresholding floor of 100 and ceiling of
16,000
(ii) filtering exclusion of genes with max / min
lt 5 or (max-min) lt 500, where max and min refer
respectively to the maximum and minimum
intensities for a particular gene across mRNA
samples
(iii) base 10 logarithmic transformation.
Boxplots of the expression levels for each of the
38 samples revealed the need to standardize the
expression levels within arrays before combining
data across samples. The data were then
summarized by a 3, 05138 matrix X (xji), where
xji denotes the expression level for gene j in
tumor mRNA sample i.

7
Golub Dataset

The dataset golub contains the gene expression
data for the 38 training set tumor mRNA samples
and 3,051 genes retained after preprocessing.
The dataset includes
golub a 3, 051 38 matrix of expression levels
golub.gnames a 3, 051 3 matrix of gene
identifiers
golub.cl a vector of tumor class labels (0 for
ALL, 1 for AML).

8
Golub Dataset

gt dim(golub)
gt 1 3051 38

9
Golub Dataset
10
Golub Dataset
11
The mt.teststat and mt.teststat.num.denum
functions

Used to to compute test statistics for each row
of a data frame, e.g., twosample Welch
tstatistics, Wilcoxon statistics, Fstatistics,
paired tstatistics, block Fstatistics.

12
usage

mt.teststat(X,classlabel,test"t",na.mt.naNUM,non
para"n")

'test"wilcoxon"'
'test"pairt"'
'test"f"'
mt.teststat.num.denum(X,classlabel,test"t",na.mt
.naNUM,nonpara"n")
13
twosample tstatistics

comparing, for each gene, expression in the ALL
cases to expression in the AML cases
gt teststat lt- mt.teststat(golub, golub.cl)

14
QQ plot

First create an empty .pdf file to plot the QQ
plot.
gt pdf("mtQQ.pdf")
Plot the qqplot
gt qqnorm(teststat)
Plot a diagonal line
gt qqline(teststat)
Shuts down the graphical object (e.g., pdf file)
gt dev.off()

15
What is a QQ plot?
16
What is a QQ plot?

The quantile-quantile (q-q) plot is a graphical
technique for determining if two data sets come
from populations with a common distribution (e.g.
normal distribution).
A q-q plot is a plot of the quantiles of the
first data set (expected) against the quantiles
of the second data set (observed).

17
What is a QQ plot?

By a quantile, we mean the fraction (or percent)
of points below the given value. That is, the 0.3
(or 30) quantile is the point at which 30
percent of the data fall below and 70 fall above
that value.
A 45-degree reference line is also plotted.
If the two sets come from a population with the
same distribution, the points should fall
approximately along this reference line.

18
Store the numerator and denominator of the test
stat.

Create a variable tmp in which you store the
numerators and denominators of the test statistic
tmp lt- mt.teststat.num.denum(golub, golub.cl,
test "t")
Name the numerator as num (teststat.num atribute
of the tmp object)
gt num lt- tmpteststat.num
Name the denominator as denum (teststat.denum
atribute of the tmp object).
gt denum lt- tmpteststat.denum

19
Plot the num to denum

Create a pdf devise
gt pdf("mtNumDen.pdf")
Plot
gt plot(sqrt(denum), num)
Shut off the graphics
gt dev.off()

20
Plot the num to denum
21
mt.rawp2adjp function

This function computes adjusted pvalues for
simple multiple testing procedures from a vector
of raw (unadjusted) pvalues.
The procedures include the Bonferroni, Holm
(1979), Hochberg (1988), and Sidak procedures for
strong control of the familywise Type I error
rate (FWER), and the Benjamini and Hochberg
(1995) and Benjamini and Yekutieli (2001)
procedures for (strong) control of the false
discovery rate (FDR).

22
Raw p-values (unadjusted)

As a first approximation, compute raw nominal
twosided pvalues for the 3051 test statistics
using the standard Gaussian distribution. Create
a variable known as rawp0 that contain the
p-values of the 3051 test statistics.
gt rawp0 lt- 2 (1 - pnorm(abs(teststat)))
gt rawp015
1 0.07854436 0.36289759 0.92191171 0.73463771
0.17063542

23
The order function

gt aalt-c(3,5,2,1,9)
gt ialt-order(aa) index order
gt ia
1 4 3 1 2 5
gt aaia order aa according to ia
1 1 2 3 5 9

24
Adjusted p-values

Create a vector of character strings containing
the names of the multiple testing procedures for
which adjusted p-values are to be computed. This
vector should include any of the following
'"Bonferroni"', '"Holm"', '"Hochberg"',
'"SidakSS"', '"SidakSD"', '"BH"', '"BY"'.
gt procs lt- c("Bonferroni", "Holm",
"Hochberg", "SidakSS", "SidakSD",
"BH", "BY")

25
Adjusted p-values in the order of the gene list

Adjusted pvalues can be stored in the original
gene order in adjp using order(resindex)
gt res lt- mt.rawp2adjp(rawp0, procs)
gt adjp lt- resadjporder(resindex),

26
Adjusted p-values in the order of significance

Adjusted pvalues can be stored in the order of
significance
gt adjp lt- resadjp
gt adjp15,

27
mt.reject function to list number of genes
rejected
28
mt.reject function to list number of genes
rejected
29
Find the genes most significant
30
Get the data of significantly modulated genes

gt which lt- mt.reject(cbind(rawp0, adjp),
0.000001)which, 2
gt sum(which)
1 143
gt gsignificantlt-golubwhich,
gt dim(gsignificant)
1 143 38
gt gsignificant15,15
,1 ,2 ,3 ,4 ,5
1, -1.45769 -0.32639 -1.46227 -0.18983 -0.12402
2, 0.86019 -0.14271 0.67037 0.70706 0.87697
3, -1.00702 -0.89365 -1.21154 -1.40715 -1.42668
4, -1.27427 -0.66834 -0.58252 -1.40715 -0.03531
5, -0.45670 0.48916 -0.48474 0.02261 0.00704

31
mt.plot function

The mt.plot function produces a number of
graphical summaries for the results of multiple
testing procedures and their corresponding
adjusted pvalues.

32
mt.plot function

To produce plots of adjusted pvalues for the
Bonferroni, maxT, Benjamini and Hochberg 7
(1995), and Benjamini and Yekutieli (2001)
procedures use

33
Plot adjp values over number of rejected
hypotheses

gt res lt- mt.rawp2adjp(rawp, c("Bonferroni", "BH",
"BY"))
gt adjp lt- resadjporder(resindex),
gt allp lt- cbind(adjp, maxT)
gt dimnames(allp)2 lt- c(dimnames(adjp)2,
"maxT")
gt procs lt- dimnames(allp)2
gt procs lt- procsc(1, 2, 5, 3, 4)
gt cols lt- c(1, 2, 3, 5, 6)
gt ltypes lt- c(1, 2, 2, 3, 3)
gt mt.plot(adjp,teststat,plottype "pvsr")

34
Correlation and Regression in R

Anscombe dataset
gt data(anscombe)
gt ? anscombe

35
Anscombe dataset
36
Get summaries
37
cor function
38
Find correlation coefficient of all pairwise
combinations
39
Find correlation coefficient of specific pairs
gt cor(anscombe,1,anscombe,2) 1 1 gt
cor(anscombe,1,anscombe,4) 1 -0.5
40
cor.test function
41
cor.test

gt cor.test(anscombe,1,anscombe,4)
Pearson's product-moment correlation
data anscombe, 1 and anscombe, 4
t -1.7321, df 9, p-value 0.1173
alternative hypothesis true correlation is not
equal to 0
95 percent confidence interval
-0.8460984 0.1426659
sample estimates
cor
-0.5

42
Plot scatterplots

gt plot(anscombe,1,anscombe,4)

43
Application to Golub dataset
44
Calculate the linear regression coefficients and
get a summary

gt lllt-lm(anscombe,4 anscombe,1)
gt summary(ll)

45
Get the anova table
46
Plot the regression line

gt plot(anscombe,1,anscombe,4)
gt abline(lm(anscombe,4 anscombe,1))

47
llfitted.values
48
llcoefficients, llresiduals
49
hclust function
50
hclust for golub data (correlation distance with
average linkage)
51
hclust for golub data (correlation distance with
complete linkage)
52
hclust for golub data (euclidean distance with
single linkage)
53
hclust for significant golub data (euclidean
distance with single linkage)
54
heatmap function
55
heatmap for significant genes

gt attributes(clust.euclid.single)
names
1 "merge" "height" "order"
"labels" "method"
6 "call" "dist.method"
class
1 "hclust"
gt clust.euclid.singleorder
1 129 10 124 125 54 35 28 59 44 85 110
90 106 15 137 138 139 37
19 47 126 63 88 27 134 79 80 76 121 131
81 3 82 11 1 104 78
37 140 31 34 98 89 8 26 123 92 113 61
127 16 95 29 96 39 142
55 7 57 128 14 32 6 67 9 71 133 52
19 74 72 23 73 132 99
73 55 135 56 136 5 97 24 18 25 4 107
84 94 109 114 51 68 69
91 45 65 46 41 75 49 102 122 50 100 83
12 86 141 17 116 143 2
109 77 105 112 22 36 101 111 13 62 21 60
87 38 58 118 115 119 93
127 30 33 108 43 120 91 70 130 53 40 66
64 20 42 103 48 117
gt gsignificantorderedlt-gsignificantclust.euclid.s
ingleorder,
gt heatmap(gsignificantordered)

56
Heatmap for significant genes
57
kmeans function
58
kmeans for samples based on significant genes
59
pam function
60
Pam on samples based on significant genes
61
Silioutte plot
gt plot(clust.pam.2)
62
Microarray Data Analysis Software

http//rana.lbl.gov/EisenSoftware.htm
http//classify.stanford.edu/
http//www.broad.mit.edu/cancer/software/software.
html
http//homes.esat.kuleuven.be/dna/Biol/Software.h
tml
http//vortex.cs.wayne.edu/Projects.html
http//visitor.ics.uci.edu/cgi-bin/genex/rcluster/
index.cgi
http//www-stat.stanford.edu/7Etibs/SAM/index.htm
l
http//maexplorer.sourceforge.net/
http//ihome.cuhk.edu.hk/b400559/arraysoft.html
http//genome.ws.utk.edu/resources/restech.shtml

Write a Comment

User Comments (0)

About PowerShow.com

Multiple testing, correlation and regression, and clustering in R - PowerPoint PPT Presentation

Multiple testing, correlation and regression, and clustering in R

mt.rawp2adjp function ... mt.reject function to list number of genes rejected ... mt.plot function ... – PowerPoint PPT presentation