Quick R Tips - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

Quick R Tips

Description:

Title: PowerPoint Presentation Author: nbhagaba Last modified by: Carol Bult Created Date: 7/8/2002 2:59:22 PM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Number of Views:106
Avg rating:3.0/5.0
Slides: 50
Provided by: nbh9
Category:

less

Transcript and Presenter's Notes

Title: Quick R Tips


1
Quick R Tips
  • How to find out what packages are available
  • library()
  • How to find out what packages are actually
    installed locally
  • (.packages())

2
Biological question Differentially expressed
genes Sample class prediction etc.
Experimental design
Microarray experiment
Image analysis
Normalization
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3
Microarray Data Flow
Microarray experiment
Unsupervised Analysis clustering
Image Analysis
Database
Supervised Analysis
Data Selection Missing value estimation
Normalization Centering
Networks Data Integration
Data Matrix
Decomposition techniques
4
A note about Affymetrix (1-color) pre-processing
cross-chip
sequence specific background correction
within-chip
within-probe set aggregation of intensity values
  • Two standard methods
  • MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares
    PM and MM probes)
  • RMA by Speed group (UC Berkeley) (ignores MM
    probes)

5
Why normalize?
  • Microarray data have significant systematic
    variation both within arrays and between arrays
    that is not true biological variation
  • Accurate comparison of genes relative
    expression within and across conditions requires
    normalization of effects
  • Sources of variation
  • Spatial location on the array
  • Dye biases which vary with spot intensity
  • Plate origin
  • Printing/spotting quality
  • Experimenter

6
Normalization Thoughts
  • There are many different ways to normalize data
  • Global median, LOWESS, LOESS, RMA etc
  • By print tip, spatial, etc
  • BUT dont expect it to fix bad data!
  • Wont make up for lack of replicates
  • Wont make up for horrible slides

7
Create a boxplot of the normalized
data boxplot(mydata-1, main "Normalized
Intensities", xlab"Array", ylab"Intensities",
col"blue") To save the boxplot as a jpeg
file jpeg("normal_boxplot.jpg") boxplot(mydata-1
, main "Normalized Intensities", xlab"Array",
ylab"Intensities", col"blue") dev.off()
8
Microarray Data Analysis(slides used with
permission of Dr. John Quackenbush, Dana Farber
creator of MeV software )http//www.tm4.org/mev/
9
  • Classification
  • Hierarchical clustering
  • K-means clustering
  • Coherence
  • PCA
  • Relevance Network
  • Differential gene expression
  • T-test
  • Analysis of Variance
  • Significance of Microarray (SAM)

10
Hierarchical Clustering
  • A type of cluster analysis
  • There is both divisive and agglomerative
    HCagglomerative is most commonly used
  • Group objects that are close to one another
    based on some distance/similarity metric
  • Clusters are created and linked based on a metric
    that evaluates the cluster-to-cluster distance
  • Results are displayed as a dendrogram

11
Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
(HCL-2)
12
Hierarchical Clustering
g5 is most like g7
g5,g7 is most like g1, g4, g8
(HCL-3)
13
Hierarchical Tree
(HCL-4)
14
Hierarchical Clustering
During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined. The distance or similarity between
clusters must be calculated. The rules that
govern this calculation are linkage methods.
(HCL-5)
15
Agglomerative Linkage Methods
  • Linkage methods are rules or metrics that return
    a value that can be used to determine which
    elements (clusters) should be linked.
  • Three linkage methods that are commonly used are
  • Single Linkage
  • Average Linkage
  • Complete Linkage

(HCL-6)
16
Comparison of Linkage Methods
(HCL-10)
17
Bootstrapping (ST)
Bootstrapping resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
18
Jackknifing (ST)
Jackknifing resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
19
Analysis of Bootstrapped and Jackknifed Support
Trees
  • Bootstrapped or jackknifed expression matrices
    are created many times by randomly resampling the
    original expression matrix, using either the
    bootstrap or jackknife procedure.
  • Each time, hierarchical trees are created from
    the resampled matrices.
  • The trees are compared to the tree obtained from
    the original data set.
  • The more frequently a given cluster from the
    original tree is found in the resampled trees,
    the stronger the support for the cluster.
  • As each resampled matrix lacks some of the
    original data, high support for a cluster means
    that the clustering is not biased by a small
    subset of the data.

20
Hierarchical Clustering in R
21
Step 1 Data matrix
  • First you need a numeric matrix
  • Typical array data set will have samples as
    columns and genes as rows
  • We want to be sure our data are in the form of an
    expression matrix
  • Use Biobase library/package
  • See http//www.bioconductor.org/packages/2.2/bioc/
    vignettes/Biobase/inst/doc/ExpressionSetIntroducti
    on.pdf
  • gt exprslt-as.matrix(data, headerTRUE, sep"\t",
    row.names1, as.isTRUE)

22
Step 2 Calculate Distance Matrix
  • Default dist() method in R uses rows as the
    vectors..but we want the distance between
    samples.i.e., the columns of our matrix.
  • There is a handy package to help us at MD
    Anderson called oompaBase
  • source("http//bioinformatics.mdanderson.org/OOMPA
    /oompaLite.R")
  • oompaLite()
  • oompainstall(groupName"all")
  • Once installed, be sure to locally activate the
    libraries
  • library(oompaBase)
  • library(ClassDiscovery)
  • library(ClassComparison)
  • oompaBase also requires the mclust and cobs
    packagesdownload these from CRAN

23
  • Use the function distanceMatrix() to create a
    distance matrix of your samples.
  • Uses the expression set created in Step 1 as
    input
  • Remember that there are many different types of
    distance metrics to choose from!
  • See help(distanceMatrix)
  • xlt- distanceMatrix(exprs,'pearson')

24
Step 3 Cluster
  • Use the hclust() function to create a
    hierarchical cluster based on your distance
    matrix, x, created in Step 2.
  • gt ylt-hclust(x,method"complete")
  • gt plot(y)

25
Testing for Differential Gene Expression with the
T-test
26
  • Get the multtest package from CRAN
  • Package contains data from the Golub leukemia
    microarray data set (ALL v AML)
  • 38 arrays
  • 27 from lymphoblastic
  • 11 from myeloid

http//people.cryst.bbk.ac.uk/wernisch/macourse/
27
  • library(multtest)
  • data(golub)
  • golub.cl
  • Generate the T statistic
  • teststat lt-mt.teststat(golub, golub.cl)
  • Convert into P-values
  • rawp0 lt-2pt(abs(teststat),lower.tailF, df38-2)
  • Correct for multiple testing and show the ten
    most significant genes
  • procs lt-c(Bonferroni, BH)
  • reslt-mt.rawp2adjp((rawp0), procs)
  • resadjp110,

http//people.cryst.bbk.ac.uk/wernisch/macourse/
28
K-Means / K-Medians Clustering (KMC) 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
29
K-Means Clustering 2
3. Calculate mean / median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean / median
expression profile (calculated in step 3) is the
closest to that genes expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
K-Means / K-Medians is most useful when the user
has an a-priori hypothesis about the number of
clusters the genes should group into.
30
Principal Components (PCAG and PCAE) 1
  • PCA simplifies the views of the data.
  • Suppose we have measurements for each gene on
    multiple
  • experiments.
  • Suppose some of the experiments are correlated.
  • PCA will ignore the redundant experiments, and
    will take a
  • weighted average of some of the experiments, thus
    possibly making
  • the trends in the data more interpretable.
  • 5. The components can be thought of as axes in
    n-dimensional
  • space, where n is the number of components. Each
    axis represents a
  • different trend in the data.

31
PCAG and PCAE - 2
In this example, x-axis could mean a continuum
from over-to under-expression (blue and
green genes over-expressed, yellow genes
under-expressed) y-axis could mean that gray
genes are over-expressed in first five expts and
under expressed in The remaining expts, while
brown genes are under-expressed in the first
five expts, and over-expressed in the remaining
expts. z-axis might represent different cyclic
patterns, e.g., red genes might be
over-expressed in odd-numbered expts and
under-expressed in even-numbered ones, whereas
the opposite is true for purple
genes. Interpretation of components is somewhat
subjective.
32
Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes
Genes with low entropy (least variable across
experiments) are excluded from analysis.
33
Relevance Networks
Tmin 0.50
The expression pattern of each gene compared to
that of every other gene.
The remaining relationships between genes define
the subnets
Tmax 0.90
Correlation coefficients outside the boundaries
defined by the minimum and maximum thresholds are
eliminated.
The ability of each gene to predict the
expression of each other gene is assigned a
correlation coefficient
34
T-Tests (TTEST) Between subjects (or unpaired)
- 1
  • Assign experiments to two groups, e.g., in the
    expression matrix
  • below, assign Experiments 1, 2 and 5 to group A,
    and
  • experiments 3, 4 and 6 to group B.

2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
35
TTEST Between subjects - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic
for each gene either from A. Theoretical
t-distribution OR B. Permutation tests.
36
TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Original grouping
Randomized grouping
37
TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene iv) Repeat steps i-iii n times (where n is
specified by the user). v) Let x the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations. vi
) Then, the p-value associated with the gene 1
(x/n)
38
TTEST - Between subjects - 5
  • 5. Determine whether a genes expression levels
    are significantly
  • different between the two groups by one of three
    methods
  • Just alpha If the calculated p-value for a gene
    is less than
  • or equal to the user-input alpha (critical
    p-value), the gene is
  • considered significant.
  • OR
  • Use Bonferroni corrections to reduce the
    probability of
  • erroneously classifying non-significant genes as
    significant.
  • B) Standard Bonferroni correction The user-input
    alpha is divided
  • by the total number of genes to give a critical
    p-value that is used
  • as above.

39
TTEST - Between subjects 6
5C) Adjusted Bonferroni i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (alpha / N), where N is
the total number of genes for the gene with the
second-highest t-value, the critical p-value will
be (alpha/ N-1), and so on.
40
  • The problem of multiple testing
  • (adapted from presentation by Anja von
    Heydebreck, MaxPlanckInstitute for Molecular
    Genetics,
  • Dept. Computational Molecular Biology, Berlin,
    Germany
  • http//www.bioconductor.org/workshops/Heidelberg02
    /mult.pdf)
  • Lets imagine there are 10,000 genes on a chip,
    AND
  • None of them is differentially expressed.
  • Suppose we use a statistical test for
    differential
  • expression, where we consider a gene to be
    differentially expressed if it meets the
    criterion at a
  • p-value of p lt 0.05.

41
  • The problem of multiple testing 2
  • Lets say that applying this test to gene G1
    yields a p-value of p 0.01
  • Remember that a p-value of 0.01 means that there
    is a 1 chance that the gene is not
    differentially expressed, i.e.,
  • Even though we conclude that the gene is
    differentially expressed (because p lt 0.05),
    there is a 1 chance that our conclusion is
    wrong.
  • We might be willing to live with such a low
    probability
  • of being wrong
  • BUT .....

42
  • The problem of multiple testing 3
  • We are testing 10,000 genes, not just one!!!
  • Even though none of the genes is differentially
    expressed, about 5 of the genes (i.e., 500
    genes) will be erroneously concluded to be
    differentially expressed, because we have decided
    to live with a p-value of 0.05
  • If only one gene were being studied, a 5 margin
    of error might not be a big deal, but 500 false
    conclusions in one study? That doesnt sound too
    good.

43
  • The problem of multiple testing - 4
  • There are tricks we can use to reduce the
    severity of
  • this problem.
  • They all involve slashing the p-value for each
    test
  • (i.e., gene), so that while the critical p-value
    for the entire
  • data set might still equal 0.05, each gene will
    be
  • evaluated at a lower p-value.

44
  • Dont get too hung up on p-values.
  • Ultimately, what matters is biological
    relevance.
  • P-values should help you evaluate the strength of
    the
  • evidence, rather than being used as an absolute
    yardstick
  • of significance. Statistical significance is not
    necessarily
  • the same as biological significance.

45
Significance analysis of microarrays (SAM)
  • SAM can be used to pick out significant genes
    based on differential expression between sets of
    samples.

46
SAM -2
  • SAM gives estimates of the False Discovery Rate
    (FDR), which is the proportion of genes likely to
    have been wrongly identified by chance as being
    significant.
  • It is a very interactive algorithm allows users
    to dynamically change thresholds for significance
    (through the tuning parameter delta) after
    looking at the distribution of the test
    statistic.
  • The ability to dynamically alter the input
    parameters based on immediate visual feedback,
    even before completing the analysis, should make
    the data-mining process more sensitive.

47
SAM designs
  • Two-class unpaired to pick out genes whose mean
    expression level is significantly different
    between two groups of samples (analogous to
    between subjects t-test).
  • Two-class paired samples are split into two
    groups, and there is a 1-to-1 correspondence
    between an sample in group A and one in group B
    (analogous to paired t-test).

48
SAM designs - 2
  • Multi-class picks up genes whose mean expression
    is different across gt 2 groups of samples
    (analogous to one-way ANOVA)
  • Censored survival picks up genes whose
    expression levels are correlated with duration of
    survival.
  • One-class picks up genes whose mean expression
    across experiments is different from a
    user-specified mean.

49
SAM Two-Class Unpaired 4
Write a Comment
User Comments (0)
About PowerShow.com