Multiple testing, correlation and regression, and clustering in R - PowerPoint PPT Presentation

About This Presentation
Title:

Multiple testing, correlation and regression, and clustering in R

Description:

mt.rawp2adjp function ... mt.reject function to list number of genes rejected ... mt.plot function ... – PowerPoint PPT presentation

Number of Views:180
Avg rating:3.0/5.0
Slides: 63
Provided by: fenBilk
Category:

less

Transcript and Presenter's Notes

Title: Multiple testing, correlation and regression, and clustering in R


1
Multiple testing, correlation and regression, and
clustering in R
  • Multtest package
  • Anscombe dataset and stats package
  • Cluster package

2
Multtest package
  • The multtest package contains a collection of
    functions for multiple hypothesis testing.
  • These functions can be used to identify
    differentially expressed genes in microarray
    experiments, i.e., genes whose expression levels
    are associated with a response or covariate of
    interest.

3
Multtest package
  • These procedures are implemented for tests based
    on tstatistics, Fstatistics, paired
    tstatistics, Wilcoxon statistics.
  • The multtest package implements multiple testing
    procedures for controlling different Type I error
    rates. It includes procedures for controlling the
    familywise Type I error rate (FWER) Bonferroni,
    Hochberg (1988), Holm (1979).
  • It also includes procedures for controlling the
    false discovery rate (FDR) Benjamini and
    Hochberg (1995) and Benjamini and Yekutieli
    (2001) stepup procedures.

4
Data Analysis
  • Leukemia Data by Golub et al. (1999)
  • gt library(multtest, verbose FALSE)
  • gt data(golub)

5
Golub Data
  • Golub et al. (1999) were interested in
    identifying genes that are differentially
    expressed in patients with two type of leukemias,
    acute lymphoblastic leukemia (ALL, class 0) and
    acute myeloid leukemia (AML, class 1).
  • Gene expression levels were measured using
    Affymetrix highdensity oligonucleotide chips
    containing p 6, 817 human genes.
  • The learning set comprises n 38 samples, 27 ALL
    cases and 11 AML cases (data available at
    http//www.genome.wi.mit.edu/MPR).

6
Golub Data
  • Three preprocessing steps were applied to the
    normalized matrix of intensity values available
    on the website
  • (i) thresholding floor of 100 and ceiling of
    16,000
  • (ii) filtering exclusion of genes with max / min
    lt 5 or (max-min) lt 500, where max and min refer
    respectively to the maximum and minimum
    intensities for a particular gene across mRNA
    samples
  • (iii) base 10 logarithmic transformation.
  • Boxplots of the expression levels for each of the
    38 samples revealed the need to standardize the
    expression levels within arrays before combining
    data across samples. The data were then
    summarized by a 3, 05138 matrix X (xji), where
    xji denotes the expression level for gene j in
    tumor mRNA sample i.

7
Golub Dataset
  • The dataset golub contains the gene expression
    data for the 38 training set tumor mRNA samples
    and 3,051 genes retained after preprocessing.
    The dataset includes
  • golub a 3, 051 38 matrix of expression levels
  • golub.gnames a 3, 051 3 matrix of gene
    identifiers
  • golub.cl a vector of tumor class labels (0 for
    ALL, 1 for AML).

8
Golub Dataset
  • gt dim(golub)
  • gt 1 3051 38

9
Golub Dataset
10
Golub Dataset
11
The mt.teststat and mt.teststat.num.denum
functions
  • Used to to compute test statistics for each row
    of a data frame, e.g., twosample Welch
    tstatistics, Wilcoxon statistics, Fstatistics,
    paired tstatistics, block Fstatistics.

12
usage
  • mt.teststat(X,classlabel,test"t",na.mt.naNUM,non
    para"n")

'test"wilcoxon"'
'test"pairt"'
'test"f"'
mt.teststat.num.denum(X,classlabel,test"t",na.mt
.naNUM,nonpara"n")
13
twosample tstatistics
  • comparing, for each gene, expression in the ALL
    cases to expression in the AML cases
  • gt teststat lt- mt.teststat(golub, golub.cl)

14
QQ plot
  • First create an empty .pdf file to plot the QQ
    plot.
  • gt pdf("mtQQ.pdf")
  • Plot the qqplot
  • gt qqnorm(teststat)
  • Plot a diagonal line
  • gt qqline(teststat)
  • Shuts down the graphical object (e.g., pdf file)
  • gt dev.off()

15
What is a QQ plot?
16
What is a QQ plot?
  • The quantile-quantile (q-q) plot is a graphical
    technique for determining if two data sets come
    from populations with a common distribution (e.g.
    normal distribution).
  • A q-q plot is a plot of the quantiles of the
    first data set (expected) against the quantiles
    of the second data set (observed).

17
What is a QQ plot?
  • By a quantile, we mean the fraction (or percent)
    of points below the given value. That is, the 0.3
    (or 30) quantile is the point at which 30
    percent of the data fall below and 70 fall above
    that value.
  • A 45-degree reference line is also plotted.
  • If the two sets come from a population with the
    same distribution, the points should fall
    approximately along this reference line.

18
Store the numerator and denominator of the test
stat.
  • Create a variable tmp in which you store the
    numerators and denominators of the test statistic
  • tmp lt- mt.teststat.num.denum(golub, golub.cl,
    test "t")
  • Name the numerator as num (teststat.num atribute
    of the tmp object)
  • gt num lt- tmpteststat.num
  • Name the denominator as denum (teststat.denum
    atribute of the tmp object).
  • gt denum lt- tmpteststat.denum

19
Plot the num to denum
  • Create a pdf devise
  • gt pdf("mtNumDen.pdf")
  • Plot
  • gt plot(sqrt(denum), num)
  • Shut off the graphics
  • gt dev.off()

20
Plot the num to denum
21
mt.rawp2adjp function
  • This function computes adjusted pvalues for
    simple multiple testing procedures from a vector
    of raw (unadjusted) pvalues.
  • The procedures include the Bonferroni, Holm
    (1979), Hochberg (1988), and Sidak procedures for
    strong control of the familywise Type I error
    rate (FWER), and the Benjamini and Hochberg
    (1995) and Benjamini and Yekutieli (2001)
    procedures for (strong) control of the false
    discovery rate (FDR).

22
Raw p-values (unadjusted)
  • As a first approximation, compute raw nominal
    twosided pvalues for the 3051 test statistics
    using the standard Gaussian distribution. Create
    a variable known as rawp0 that contain the
    p-values of the 3051 test statistics.
  • gt rawp0 lt- 2 (1 - pnorm(abs(teststat)))
  • gt rawp015
  • 1 0.07854436 0.36289759 0.92191171 0.73463771
    0.17063542

23
The order function
  • gt aalt-c(3,5,2,1,9)
  • gt ialt-order(aa) index order
  • gt ia
  • 1 4 3 1 2 5
  • gt aaia order aa according to ia
  • 1 1 2 3 5 9

24
Adjusted p-values
  • Create a vector of character strings containing
    the names of the multiple testing procedures for
    which adjusted p-values are to be computed. This
    vector should include any of the following
    '"Bonferroni"', '"Holm"', '"Hochberg"',
    '"SidakSS"', '"SidakSD"', '"BH"', '"BY"'.
  • gt procs lt- c("Bonferroni", "Holm",
  • "Hochberg", "SidakSS", "SidakSD",
  • "BH", "BY")

25
Adjusted p-values in the order of the gene list
  • Adjusted pvalues can be stored in the original
    gene order in adjp using order(resindex)
  • gt res lt- mt.rawp2adjp(rawp0, procs)
  • gt adjp lt- resadjporder(resindex),

26
Adjusted p-values in the order of significance
  • Adjusted pvalues can be stored in the order of
    significance
  • gt adjp lt- resadjp
  • gt adjp15,

27
mt.reject function to list number of genes
rejected
28
mt.reject function to list number of genes
rejected
29
Find the genes most significant
30
Get the data of significantly modulated genes
  • gt which lt- mt.reject(cbind(rawp0, adjp),
    0.000001)which, 2
  • gt sum(which)
  • 1 143
  • gt gsignificantlt-golubwhich,
  • gt dim(gsignificant)
  • 1 143 38
  • gt gsignificant15,15
  • ,1 ,2 ,3 ,4 ,5
  • 1, -1.45769 -0.32639 -1.46227 -0.18983 -0.12402
  • 2, 0.86019 -0.14271 0.67037 0.70706 0.87697
  • 3, -1.00702 -0.89365 -1.21154 -1.40715 -1.42668
  • 4, -1.27427 -0.66834 -0.58252 -1.40715 -0.03531
  • 5, -0.45670 0.48916 -0.48474 0.02261 0.00704

31
mt.plot function
  • The mt.plot function produces a number of
    graphical summaries for the results of multiple
    testing procedures and their corresponding
    adjusted pvalues.

32
mt.plot function
  • To produce plots of adjusted pvalues for the
    Bonferroni, maxT, Benjamini and Hochberg 7
    (1995), and Benjamini and Yekutieli (2001)
    procedures use

33
Plot adjp values over number of rejected
hypotheses
  • gt res lt- mt.rawp2adjp(rawp, c("Bonferroni", "BH",
    "BY"))
  • gt adjp lt- resadjporder(resindex),
  • gt allp lt- cbind(adjp, maxT)
  • gt dimnames(allp)2 lt- c(dimnames(adjp)2,
    "maxT")
  • gt procs lt- dimnames(allp)2
  • gt procs lt- procsc(1, 2, 5, 3, 4)
  • gt cols lt- c(1, 2, 3, 5, 6)
  • gt ltypes lt- c(1, 2, 2, 3, 3)
  • gt mt.plot(adjp,teststat,plottype "pvsr")

34
Correlation and Regression in R
  • Anscombe dataset
  • gt data(anscombe)
  • gt ? anscombe

35
Anscombe dataset
36
Get summaries
37
cor function
38
Find correlation coefficient of all pairwise
combinations
39
Find correlation coefficient of specific pairs
gt cor(anscombe,1,anscombe,2) 1 1 gt
cor(anscombe,1,anscombe,4) 1 -0.5
40
cor.test function
41
cor.test
  • gt cor.test(anscombe,1,anscombe,4)
  • Pearson's product-moment correlation
  • data anscombe, 1 and anscombe, 4
  • t -1.7321, df 9, p-value 0.1173
  • alternative hypothesis true correlation is not
    equal to 0
  • 95 percent confidence interval
  • -0.8460984 0.1426659
  • sample estimates
  • cor
  • -0.5

42
Plot scatterplots
  • gt plot(anscombe,1,anscombe,4)

43
Application to Golub dataset
44
Calculate the linear regression coefficients and
get a summary
  • gt lllt-lm(anscombe,4 anscombe,1)
  • gt summary(ll)

45
Get the anova table
46
Plot the regression line
  • gt plot(anscombe,1,anscombe,4)
  • gt abline(lm(anscombe,4 anscombe,1))

47
llfitted.values
48
llcoefficients, llresiduals
49
hclust function
50
hclust for golub data (correlation distance with
average linkage)
51
hclust for golub data (correlation distance with
complete linkage)
52
hclust for golub data (euclidean distance with
single linkage)
53
hclust for significant golub data (euclidean
distance with single linkage)
54
heatmap function
55
heatmap for significant genes
  • gt attributes(clust.euclid.single)
  • names
  • 1 "merge" "height" "order"
    "labels" "method"
  • 6 "call" "dist.method"
  • class
  • 1 "hclust"
  • gt clust.euclid.singleorder
  • 1 129 10 124 125 54 35 28 59 44 85 110
    90 106 15 137 138 139 37
  • 19 47 126 63 88 27 134 79 80 76 121 131
    81 3 82 11 1 104 78
  • 37 140 31 34 98 89 8 26 123 92 113 61
    127 16 95 29 96 39 142
  • 55 7 57 128 14 32 6 67 9 71 133 52
    19 74 72 23 73 132 99
  • 73 55 135 56 136 5 97 24 18 25 4 107
    84 94 109 114 51 68 69
  • 91 45 65 46 41 75 49 102 122 50 100 83
    12 86 141 17 116 143 2
  • 109 77 105 112 22 36 101 111 13 62 21 60
    87 38 58 118 115 119 93
  • 127 30 33 108 43 120 91 70 130 53 40 66
    64 20 42 103 48 117
  • gt gsignificantorderedlt-gsignificantclust.euclid.s
    ingleorder,
  • gt heatmap(gsignificantordered)

56
Heatmap for significant genes
57
kmeans function
58
kmeans for samples based on significant genes
59
pam function
60
Pam on samples based on significant genes
61
Silioutte plot
gt plot(clust.pam.2)
62
Microarray Data Analysis Software
  • http//rana.lbl.gov/EisenSoftware.htm
  • http//classify.stanford.edu/
  • http//www.broad.mit.edu/cancer/software/software.
    html
  • http//homes.esat.kuleuven.be/dna/Biol/Software.h
    tml
  • http//vortex.cs.wayne.edu/Projects.html
  • http//visitor.ics.uci.edu/cgi-bin/genex/rcluster/
    index.cgi
  • http//www-stat.stanford.edu/7Etibs/SAM/index.htm
    l
  • http//maexplorer.sourceforge.net/
  • http//ihome.cuhk.edu.hk/b400559/arraysoft.html
  • http//genome.ws.utk.edu/resources/restech.shtml
Write a Comment
User Comments (0)
About PowerShow.com