GIS mAdb Intermediate
Transcript and Presenter's Notes
1
GIS mAdb Intermediate Informatics Training
John Greene, Ph.D.
August 29, 2002
2
Default Definitions
  • Signal - refers to (Target Intensity - Background
    Intensity). More precisely, it is the MEAN Target
    Intensity - MEDIAN Background Intensity. MEAN -
    MEDIAN was used based on a publication by Mike
    Eisen at Stanford. You can now also choose
    MEDIAN - MEDIAN.
  • Normalization - by default, we use the overall
    ratio (Signal Cy5 / Signal Cy3). The normalization
    factor is calculated so that median(Ratio) is 1.0.
    Outliers with an extremely low signal are excluded
    from the calculation, as sketched below.
  • Spot Size - for GenePix, Spot Size is the
    percentage of feature pixels more than 1 S.D.
    above background.

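A minimal sketch of this normalization step in Python (not mAdb's actual code; the function name and the low-signal cutoff value are illustrative assumptions):

```python
import numpy as np

def normalize_ratios(cy5, cy3, low_signal_cutoff=100.0):
    """Scale Cy5/Cy3 ratios so that the median ratio is 1.0.

    `low_signal_cutoff` is an illustrative threshold; spots whose
    signal falls below it are excluded from the median, mirroring
    the exclusion of extremely low-signal outliers.
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    ratio = cy5 / cy3
    keep = (cy5 >= low_signal_cutoff) & (cy3 >= low_signal_cutoff)
    factor = np.median(ratio[keep])  # overall normalization factor
    return ratio / factor            # median of the kept spots is now 1.0
```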
3
Recenter
Histogram uses spots that have been extracted and
filtered
4
Whenever possible, use ratios converted to log
base 2
  • Why? Because it makes variation of ratios of
    intensities more independent of absolute
    magnitude.
  • Easier interpretation: negative numbers are
    downregulated genes; positive numbers are
    upregulated genes (see the sketch below)
  • Evens out highly skewed distributions
  • Gives a more realistic sense of variation

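A minimal Python illustration of the symmetry the log base 2 transform provides (the ratio values are made up):

```python
import numpy as np

ratios = np.array([4.0, 2.0, 1.0, 0.5, 0.25])  # illustrative Cy5/Cy3 ratios
print(np.log2(ratios))  # [ 2.  1.  0. -1. -2.]
# A 4-fold up-regulation (+2) and a 4-fold down-regulation (-2) are
# now symmetric around 0, instead of 4.0 vs 0.25 around 1.0.
```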
5
Simple Group Retrieval Tool (ArraySuite data)
Applies spot filtering options to selected arrays
and creates a new working dataset.
6
Extended Dataset Extraction Tool (GenePix Arrays
Only)
7
Spotfilters and Dataset Properties
8
  • Dataset Properties - checkbox to activate
  • Rows ordered by
  • Dataset Location
  • Transient (24 hours after creation)
  • Temporary (30 days after last access)
  • Permanent
  • Dataset Label - highly recommended

9
Data Display - Example
10
Additional Filtering and Analysis Options
11
Additional Data/Array Filtering Options
Applies selected filtering options to the dataset
based on values in the data and creates a new
subset. You can repeat without changing the set
name for trial-and-error filtering.
12
Open/Expand datasets
13
Filtering hierarchy/tree structure - why
dataset management is a necessity
Original spot filtering
Original Dataset
Additional filtering

Data subsets
14
Refreshing Gene Info - Dataset Management
Not yet available on GIS mAdb
15
Dataset Management - delete/move datasets
16
Dataset History
A log is maintained for each dataset tracing the
analysis history. When the history is displayed,
links are provided to allow the user to recall
any dataset in the analysis chain.
17
(No Transcript)
18
Boolean Comparison Summary
Clicking on the Logical Subset links creates a
new working dataset reflecting the Boolean
results.
19
Array Analysis Methods
  • Gene Discovery
  • Outlier detection: simple and group logic
    retrieval tools, multiple array viewers
  • Scatter plots
  • Pattern Prediction
  • t-tests, Wilcoxon tests, ANOVA, Kruskal-Wallis
  • Stanford PAM (imminent)
  • Pattern Discovery
  • Clustering: Hierarchical, K-means, SOMs
  • Multidimensional Scaling, PCA
  • Future: Gene Shaving, Tree Harvesting, ...

20
Designating groups
21
Two Group Statistical Comparison Options
22
T-test
  • The t-test assesses whether the means of two
    groups are statistically different from each
    other.
  • Once you compute the t-value, you have to look it
    up in a table of significance to test whether the
    ratio is large enough to say that the difference
    between the groups is not likely to have been a
    chance finding. To test the significance, you
    need to set a risk level (called the alpha
    level). In most research, the "rule of thumb" is
    to set the alpha level at .05. This means that
    five times out of a hundred you would find a
    statistically significant difference between the
    means even if there was none (i.e., by "chance").
    A code sketch follows this list.
  • More than two groups: ANOVA (parametric) or
    Kruskal-Wallis (non-parametric)

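A minimal sketch of a two-group t-test in Python using scipy (the log2 ratio values are made up; this is not mAdb's implementation):

```python
from scipy import stats

# Illustrative log2 ratios for one gene across two groups of arrays
group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.7, 0.2, 0.5, 0.6]

t, p = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # conventional risk level
print(f"t = {t:.2f}, p = {p:.4f}, significant at alpha: {p < alpha}")
```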
23
Independent T-test variance
  • Equal (pooled) or unequal (separate) variance
  • For independent (non-paired) samples, you must
    choose an option for the variance of the data
  • Checking this option bases the calculation of
    the variance of a difference between two
    proportions, or a difference between two means, on
    the assumption that the variance in the
    populations from which the two groups were
    selected is the same. Note that the default
    choice, two populations with different variances,
    would be preferred by many researchers. You should
    have some evidence, from logic or observation,
    that the variances are the same before selecting
    this option.
  • The pooled variance under the equal-variance
    assumption will usually be larger than under the
    unequal-variance assumption. However, the number
    of degrees of freedom will also be larger, at
    df = (n1 + n2) - 2. This results in a slightly
    more powerful test of statistical significance
    (see the sketch after this list).

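A sketch of the two variance options using scipy's `equal_var` flag, which selects pooled vs. separate variance (illustrative data):

```python
from scipy import stats

group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.9, 0.1, 0.7, 0.5]

# Pooled (equal-variance) t-test: df = (n1 + n2) - 2 = 8
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Separate (unequal-variance, Welch) t-test: the safer default choice
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
```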
24
Wilcoxon tests
  • The t-tests are widely used, but they do depend
    on certain assumptions. These assumptions are:
  • 1. The data are from a normal distribution
    (i.e. parametric)
  • 2. All observations are independent
  • When these assumptions are acceptable, the
    t-tests provide the most sensitive and powerful
    approach to the analysis of the data.
  • However, in many cases, observations arise from
    populations which are clearly non-normal. In
    these cases, simpler tests are available, based
    on signs or on the rank order of the data. These
    are known as non-parametric tests.
  • Independent samples: use the Wilcoxon rank-sum
    test (Mann-Whitney and Wilcoxon rank-sum use
    different methods of calculation, but are
    equivalent in result).
  • Paired (dependent) samples: use the Wilcoxon
    matched-pairs signed-rank test (see the sketch
    after this list)
  • http://www-jime.open.ac.uk/98/12/demos/stats/stats.html

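A sketch of both non-parametric tests using scipy (illustrative data, not mAdb's implementation):

```python
from scipy import stats

group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.7, 0.2, 0.5, 0.6]

# Independent samples: Mann-Whitney / Wilcoxon rank-sum test
u, p_rank = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Paired samples: Wilcoxon matched-pairs signed-rank test
w, p_signed = stats.wilcoxon(group_a, group_b)
```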
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
Ad Hoc Query Tool
30
Ad Hoc Query Output
31
Graphics tools - Scatter Plot
32
Correlation Summary Report - pairwise scatter plots
33
Multiple Array Viewer
34
Multidimensional Scaling
  • Mapping of data points from a high-dimensional
    space into a lower-dimensional space
  • Example: represent a tumor's 5,000-dimensional
    gene profile as a point in 3-dimensional space
  • Typically uses nonlinear optimization methods
    that select lower-dimensional coordinates to
    best match pairwise distances in
    higher-dimensional space
  • Depends only on pairwise distances (Euclidean,
    1-correlation, ...) between points
  • All distances in the lower-dimensional space must
    be viewed in a relative sense
  • Allows missing values in input data (see the
    sketch below)

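A minimal MDS sketch in Python with scikit-learn (random data standing in for an expression matrix; not mAdb's implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))  # 6 samples x 50 genes (illustrative)

# MDS depends only on pairwise distances; here 1-correlation
d = squareform(pdist(X, metric="correlation"))
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(d)
print(coords.shape)  # (6, 3): each sample as a point in 3-D space
```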
35
(No Transcript)
36
PCA
Principal components analysis (PCA) explores the
variability in gene expression patterns and finds
a small number of themes. These themes can be
combined to make all the different gene
expression patterns in a data set. The first
principal component is obtained by finding the
linear combination of expression patterns
explaining the greatest amount of variability in
the data. The second principal component is
obtained by finding another linear
combination of expression patterns
that is at right angles to (i.e. orthogonal and
uncorrelated with) the first principal component.
The second principal component must explain the
greatest amount of the remaining variability in
the data after accounting for the first principal
component. Each succeeding principal component
is similarly obtained. There will never be more
principal components than there are variables
(experimental points) in the data. Any individual
gene expression pattern can be recreated as a
linear combination of the principal component
expression patterns.
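A minimal PCA sketch in Python with scikit-learn illustrating these properties (random data stands in for expression patterns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # 20 genes x 8 experimental points

pca = PCA()
scores = pca.fit_transform(X)

# Components are ordered by variance explained, are mutually
# orthogonal, and never outnumber the 8 variables.
print(pca.explained_variance_ratio_)  # sums to 1.0
print(pca.components_.shape)          # (8, 8)
```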
37
Principal Components Analysis
  • Principal Components Analysis (PCA) is an
    exploratory multivariate statistical technique
    for simplifying complex data sets. Given m
    observations on n variables, the goal of PCA is
    to reduce the dimensionality of the data matrix
    by finding r new variables, where r is less than
    n. Termed principal components, these r new
    variables together account for as much of the
    variance in the original n variables as possible
    while remaining mutually uncorrelated and
    orthogonal. Each principal component is a linear
    combination of the original variables, and so it
    is often possible to ascribe meaning to what the
    components represent. Principal components
    analysis has been used in a wide range of
    biomedical problems, including the analysis of
    microarray data in search of outlier genes
    (Hilsenbeck et al. 1999) as well as the analysis
    of other types of expression data (Vohradsky et
    al. 1997, Craig et al. 1997).
  • Use PCA to focus on specific expression patterns
    and their changes, identify discriminating genes,
    separate contributing profiles and find trends,
    e.g. in time series or dose response curves.
  • For the dispersion matrix, use the correlation
    option when data are scaled to fit within
    boundaries, or when variables are measured in
    different units or have different variances. Most
    often, covariance is the correct choice (when
    variables are measured in the same units and have
    similar variances); see the sketch after this
    list.
  • N.B. PCA does not allow missing values in input
    data; these are filtered out
  • http://www.statsoftinc.com/textbook/stfacan.html
  • http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

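A sketch of the covariance vs. correlation choice in Python; correlation-based PCA is equivalent to running PCA on standardized (z-scored) data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # illustrative observations x variables

# Covariance option: scikit-learn centers the data automatically
pca_cov = PCA().fit(X)

# Correlation option: standardize first, so each variable
# contributes equally regardless of its units or variance
pca_cor = PCA().fit(StandardScaler().fit_transform(X))
```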
38
PCA Details
(First three components)
39
PCA Details
40
MDS/PCA comparison
  • Projection: PCA is a linear projection; MDS is
    nonlinear
  • Missing values: PCA does not allow them (they are
    filtered out); MDS allows them
  • Dissimilarities: PCA preserves large
    dissimilarities better; MDS preserves small
    dissimilarities better
  • Variables: PCA's are meaningful (information
    content known); MDS's are meaningless (information
    content not known)
  • Computation: PCA is efficient for a large number
    of samples; MDS is inefficient
  • Orientation: PCA's is meaningful; MDS's is
    arbitrary
  • Input: PCA is performed on covariance or
    correlation similarities; MDS on any type of
    (dis)similarities

Adapted from Partek Quick Start for Microarray
Analysis
41
Clustering
  • Clustering programs make clusters even if the
    data are completely random; you must examine your
    clusters to see if they make biological sense
  • If clustered by genes, are the genes in certain
    clusters biologically related in function? In a
    pathway?
  • If clustered by array, do the clusters group
    related samples/tissues/diseases/treatments
    together logically?

42
Common clustering methods
Hierarchical Clustering allows you to visualize a
set of samples or genes by organizing them into a
mock-phylogenetic tree, often referred to as a
dendrogram. In these trees, samples or genes
having similar effects on the gene expression
patterns are clustered together.
K-means clustering divides genes into distinct
groups based on their expression patterns. Genes
are initially divided into a user-defined number
(k) of equally-sized groups. Centroids are
calculated for each group, corresponding to the
average of the expression profiles. Individual
genes are then reassigned to the group whose
centroid is most similar to the gene. The process
is iterated until the group compositions converge
(see the sketch below).
Self-Organizing Maps (SOMs) are similar to
k-means clustering, but add an additional
feature: the resulting groups of genes can
be displayed in a rectangular pattern, with
adjacent groups being more similar than groups
further away. Self-Organizing Maps were invented
by Teuvo Kohonen and are used to analyze many
kinds of data.
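A minimal k-means sketch with scikit-learn (random data standing in for log2 ratios; not mAdb's implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # 100 genes x 6 arrays (illustrative)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])            # cluster assignment per gene
print(km.cluster_centers_.shape)  # (4, 6): centroid expression profiles
```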
43
Example of Hierarchical Clustering (Alizadeh et
al., Nature, Feb. 2000)
44
Dendrogram Construction for Hierarchical
Agglomerative Clustering
  • Merge two closest (least distant) objects (genes
    or arrays)
  • Subsequent merges require specification of a
    linkage method to define distance between
    clusters (see the sketch below)
  • Average linkage
  • Complete linkage
  • Single linkage

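A sketch of agglomerative clustering with scipy (illustrative data; the linkage method is selected by name):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))        # 10 genes x 6 arrays (illustrative)

d = pdist(X, metric="correlation")  # 1-correlation distances
Z = linkage(d, method="average")    # or "complete", "single"
tree = dendrogram(Z, no_plot=True)  # dendrogram structure without plotting
```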
45
Euclidean distance
Generally, the distance between two points is
taken as a common metric to assess the similarity
among the components of a population. The most
commonly used distance measure is the Euclidean
metric, which defines the distance between two
points p = (p1, p2, ...) and q = (q1, q2, ...) as

  d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... )
46
Linkage Methods
  • Average Linkage
  • Merge clusters whose average distance between all
    pairs of items (one item from each cluster) is
    minimized
  • Particularly sensitive to distance metric
  • Complete Linkage
  • Merge clusters to minimize the maximum distance
    within any resulting cluster
  • Tends to produce compact clusters
  • Single Linkage
  • Merge clusters at minimum distance from one
    another
  • Prone to chaining and sensitive to noise

47
(Data from Bittner et al., Nature, 2000)
48
Common Distance Metrics for Hierarchical
Clustering
  • Euclidean distance
  • Measures absolute distance (square root of sum of
    squared differences)
  • 1-Correlation
  • Values reflect the amount of linear association
    (pattern dissimilarity); the smaller the value,
    the more similar the gene expression patterns
    (see the sketch below)

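A minimal Python illustration of how the two metrics differ (made-up profiles):

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def one_minus_correlation(p, q):
    return 1.0 - np.corrcoef(p, q)[0, 1]

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]  # same pattern, twice the magnitude
print(euclidean(a, b))              # ~5.48: large absolute distance
print(one_minus_correlation(a, b))  # 0.0: identical expression pattern
```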
49
Server-side Hierarchical Clustering
50
Hierarchical Clustering Output
51
Expanded Heatmap Thumbnail Image
52
Tree View - for PostScript output, or for files
too large to display
http://rana.lbl.gov/EisenSoftware.htm
53
K-means
54
Self-organizing/Kohonen maps
55
Summary Remarks
  • Data quality assessment and pre-processing are
    important.
  • Different study objectives will require different
    statistical analysis approaches.
  • Different analysis methods may produce different
    results. Thoughtful application of multiple
    analysis methods may be required.
  • Chances for spurious findings are enormous, and
    validation of any findings on larger independent
    collections of specimens will be essential.
  • Analysis tools are not an adequate substitute for
    collaboration with professional statisticians and
    data analysts.

56
Acknowledgments
The Single ArrayViewer and
Multi-ArrayViewer were derived from NHGRI uAP
Toolset developed in the NHGRI/Cancer Genetics
Branch under Dr. Jeffrey Trent. The Scatterplot
and Multi-dimensional scaling tools were derived
from work done in the NCI/Biometric Research
Branch under Dr. Richard Simon. Server-side
Cluster uses a derivative of the Xcluster program
developed at Stanford University by Gavin
Sherlock, Head, Microarray Informatics.
57
Acknowledgments
  • CIT NCI mAdb
  • John Powell, Chief, BIMAS
  • Liming Yang, Ph.D.
  • Jim Tomlin
  • Carla Bock
  • Esther Asaki, SRA
  • Robin Martell, SRA
  • Kathy Meyer, SRA
  • Agara Sudhindra, SRA
  • Tammy Qiu, SRA
  • Biometric Research Branch/NCI
  • Richard Simon, Ph.D.
  • Lisa McShane, Ph.D.
  • Michael Radmacher, Ph.D.
  • Joanna Shih, Ph.D.
  • Yingdong Zhao, Ph.D.
  • MSB Section
  • NHGRI Java viewers
  • Mike Bittner
  • Yidong Chen
  • Jeff Trent

58
(No Transcript)
59
Averaging Arrays
Names/Descriptions for averaged arrays: this tool
creates a new dataset consisting of one array per
group. Each array is the average of all arrays
within a group. Averaging is done on the log base
2 ratio values (see the sketch below). The new
averaged arrays will not have an array name or
description. You may enter appropriate
Names/Descriptions to be associated with the new
arrays. If you choose not to enter values, the
name defaults to the Group designation and the
description defaults to NULL.
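A minimal sketch of the averaging step in Python (not mAdb's actual code; treating missing spots as NaN is an assumption):

```python
import numpy as np

def average_group(log2_ratios):
    """Average log2 ratios across the arrays in one group.

    `log2_ratios` is an (n_arrays x n_genes) matrix; missing spots,
    represented here as NaN (an assumption), are ignored in the mean.
    """
    return np.nanmean(np.asarray(log2_ratios, dtype=float), axis=0)

group = [[1.0, -0.5, 2.0],
         [1.4, -0.3, np.nan]]
print(average_group(group))  # [ 1.2 -0.4  2. ]
```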
60
Gene Ontology/KEGG Pathway Summary Report
61
(No Transcript)