GIS mAdb Intermediate
Transcript and Presenter's Notes
1
GIS mAdb Intermediate Informatics Training
John Greene, Ph.D.
August 29, 2002
2
Default Definitions
  • Signal - refers to (Target Intensity - Background
    Intensity). More precisely, it is the MEAN Target
    Intensity - MEDIAN Background Intensity. MEAN -
    MEDIAN was used based on a publication by Mike
    Eisen at Stanford. You can now also choose
    MEDIAN - MEDIAN.
  • Normalization - by default, we use the overall
    ratio (Signal Cy5 / Signal Cy3). The normalization
    factor is calculated so that median(Ratio) is 1.0.
    Outliers with an extremely low signal are excluded
    from the calculation, as sketched below.
  • Spot Size - for GenePix, Spot Size is the
    percentage of feature pixels more than 1 S.D.
    above background.

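A minimal sketch of this normalization step in Python (not mAdb's actual code; the function name and the low-signal cutoff value are illustrative assumptions):

```python
import numpy as np

def normalize_ratios(cy5, cy3, low_signal_cutoff=100.0):
    """Scale Cy5/Cy3 ratios so that the median ratio is 1.0.

    `low_signal_cutoff` is an illustrative threshold; spots whose
    signal falls below it are excluded from the median, mirroring
    the exclusion of extremely low-signal outliers.
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    ratio = cy5 / cy3
    keep = (cy5 >= low_signal_cutoff) & (cy3 >= low_signal_cutoff)
    factor = np.median(ratio[keep])  # overall normalization factor
    return ratio / factor            # median of the kept spots is now 1.0
```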
3
Recenter
Histogram uses spots that have been extracted and
filtered
4
Whenever possible, use ratios converted to log
base 2
  • Why? Because it makes variation of ratios of
    intensities more independent of absolute
    magnitude.
  • Easier interpretation: negative numbers are
    downregulated genes; positive numbers are
    upregulated genes (see the sketch below)
  • Evens out highly skewed distributions
  • Gives a more realistic sense of variation

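A minimal Python illustration of the symmetry the log base 2 transform provides (the ratio values are made up):

```python
import numpy as np

ratios = np.array([4.0, 2.0, 1.0, 0.5, 0.25])  # illustrative Cy5/Cy3 ratios
print(np.log2(ratios))  # [ 2.  1.  0. -1. -2.]
# A 4-fold up-regulation (+2) and a 4-fold down-regulation (-2) are
# now symmetric around 0, instead of 4.0 vs 0.25 around 1.0.
```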
5
Simple Group Retrieval Tool (ArraySuite data)
Applies spot filtering options to selected arrays
and creates a new working dataset.
6
Extended Dataset Extraction Tool (GenePix Arrays
Only)
7
Spotfilters and Dataset Properties
8
  • Dataset Properties - checkbox to activate
  • Rows ordered by
  • Dataset Location
  • Transient (24 hours after creation)
  • Temporary (30 days after last access)
  • Permanent
  • Dataset Label - highly recommended

9
Data Display - Example
10
Additional Filtering and Analysis Options
11
Additional Data/Array Filtering Options
Applies selected filtering options to the dataset
based on values in the data and creates a new
subset. You can repeat without changing the set
name for trial-and-error filtering.
12
Open/Expand datasets
13
Filtering hierarchy/tree structure - why
dataset management is a necessity
Original spot filtering
Original Dataset
Additional filtering

Data subsets
14
Refreshing Gene Info - Dataset Management
Not yet available on GIS mAdb
15
Dataset Management - delete/move datasets
16
Dataset History
A log is maintained for each dataset tracing the
analysis history. When the history is displayed,
links are provided to allow the user to recall
any dataset in the analysis chain.
17
(No Transcript)
18
Boolean Comparison Summary
Clicking on the Logical Subset links creates a
new working dataset reflecting the Boolean
results.
19
Array Analysis Methods
  • Gene Discovery
  • Outlier detection: simple and group logic
    retrieval tools, multiple array viewers
  • Scatter plots
  • Pattern Prediction
  • t-tests, Wilcoxon tests, ANOVA, Kruskal-Wallis
  • Stanford PAM (imminent)
  • Pattern Discovery
  • Clustering: Hierarchical, K-means, SOMs
  • Multidimensional Scaling, PCA
  • Future: Gene Shaving, Tree Harvesting, ...

20
Designating groups
21
Two Group Statistical Comparison Options
22
T-test
  • The t-test assesses whether the means of two
    groups are statistically different from each
    other.
  • Once you compute the t-value, you have to look it
    up in a table of significance to test whether the
    ratio is large enough to say that the difference
    between the groups is not likely to have been a
    chance finding. To test the significance, you
    need to set a risk level (called the alpha
    level). In most research, the "rule of thumb" is
    to set the alpha level at .05. This means that
    five times out of a hundred you would find a
    statistically significant difference between the
    means even if there was none (i.e., by "chance").
    A code sketch follows this list.
  • More than two groups: ANOVA (parametric) or
    Kruskal-Wallis (non-parametric)

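A minimal sketch of a two-group t-test in Python using scipy (the log2 ratio values are made up; this is not mAdb's implementation):

```python
from scipy import stats

# Illustrative log2 ratios for one gene across two groups of arrays
group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.7, 0.2, 0.5, 0.6]

t, p = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # conventional risk level
print(f"t = {t:.2f}, p = {p:.4f}, significant at alpha: {p < alpha}")
```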
23
Independent T-test variance
  • Equal (pooled) or unequal (separate) variance
  • For independent (non-paired) samples, you must
    choose an option for the variance of the data
  • Checking this option bases the calculation of
    the variance of a difference between two
    proportions, or a difference between two means, on
    the assumption that the variance in the
    populations from which the two groups were
    selected is the same. Note that the default
    choice, two populations with different variances,
    would be preferred by many researchers. You should
    have some evidence, from logic or observation,
    that the variances are the same before selecting
    this option.
  • The pooled variance under the equal-variance
    assumption will usually be larger than under the
    unequal-variance assumption. However, the number
    of degrees of freedom will also be larger, at
    df = (n1 + n2) - 2. This results in a slightly
    more powerful test of statistical significance
    (see the sketch after this list).

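A sketch of the two variance options using scipy's `equal_var` flag, which selects pooled vs. separate variance (illustrative data):

```python
from scipy import stats

group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.9, 0.1, 0.7, 0.5]

# Pooled (equal-variance) t-test: df = (n1 + n2) - 2 = 8
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Separate (unequal-variance, Welch) t-test: the safer default choice
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
```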
24
Wilcoxon tests
  • The t-tests are widely used, but they do depend
    on certain assumptions. These assumptions are:
  • 1. The data are from a normal distribution
    (i.e. parametric)
  • 2. All observations are independent
  • When these assumptions are acceptable, the
    t-tests provide the most sensitive and powerful
    approach to the analysis of the data.
  • However, in many cases, observations arise from
    populations which are clearly non-normal. In
    these cases, simpler tests are available, based
    on signs or on the rank order of the data. These
    are known as non-parametric tests.
  • Independent samples: use the Wilcoxon rank-sum
    test (Mann-Whitney and Wilcoxon rank-sum use
    different methods of calculation, but are
    equivalent in result).
  • Paired (dependent) samples: use the Wilcoxon
    matched-pairs signed-rank test (see the sketch
    after this list)
  • http://www-jime.open.ac.uk/98/12/demos/stats/stats.html

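A sketch of both non-parametric tests using scipy (illustrative data, not mAdb's implementation):

```python
from scipy import stats

group_a = [1.8, 2.1, 1.6, 2.3, 1.9]
group_b = [0.4, 0.7, 0.2, 0.5, 0.6]

# Independent samples: Mann-Whitney / Wilcoxon rank-sum test
u, p_rank = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Paired samples: Wilcoxon matched-pairs signed-rank test
w, p_signed = stats.wilcoxon(group_a, group_b)
```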
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
Ad Hoc Query Tool
30
Ad Hoc Query Output
31
Graphics tools - Scatter Plot
32
Correlation Summary Report - pairwise scatter plots
33
Multiple Array Viewer
34
Multidimensional Scaling
  • Mapping of data points from a high-dimensional
    space into a lower-dimensional space
  • Example: represent a tumor's 5,000-dimensional
    gene profile as a point in 3-dimensional space
  • Typically uses nonlinear optimization methods
    that select lower-dimensional coordinates to
    best match pairwise distances in
    higher-dimensional space
  • Depends only on pairwise distances (Euclidean,
    1-correlation, ...) between points
  • All distances in the lower-dimensional space must
    be viewed in a relative sense
  • Allows missing values in input data (see the
    sketch below)

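A minimal MDS sketch in Python with scikit-learn (random data standing in for an expression matrix; not mAdb's implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))  # 6 samples x 50 genes (illustrative)

# MDS depends only on pairwise distances; here 1-correlation
d = squareform(pdist(X, metric="correlation"))
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(d)
print(coords.shape)  # (6, 3): each sample as a point in 3-D space
```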
35
(No Transcript)
36
PCA
Principal components analysis (PCA) explores the
variability in gene expression patterns and finds
a small number of themes. These themes can be
combined to make all the different gene
expression patterns in a data set. The first
principal component is obtained by finding the
linear combination of expression patterns
explaining the greatest amount of variability in
the data. The second principal component is
obtained by finding another linear
combination of expression patterns
that is at right angles to (i.e. orthogonal and
uncorrelated with) the first principal component.
The second principal component must explain the
greatest amount of the remaining variability in
the data after accounting for the first principal
component. Each succeeding principal component
is similarly obtained. There will never be more
principal components than there are variables
(experimental points) in the data. Any individual
gene expression pattern can be recreated as a
linear combination of the principal component
expression patterns.
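A minimal PCA sketch in Python with scikit-learn illustrating these properties (random data stands in for expression patterns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # 20 genes x 8 experimental points

pca = PCA()
scores = pca.fit_transform(X)

# Components are ordered by variance explained, are mutually
# orthogonal, and never outnumber the 8 variables.
print(pca.explained_variance_ratio_)  # sums to 1.0
print(pca.components_.shape)          # (8, 8)
```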
37
Principal Components Analysis
  • Principal Components Analysis (PCA) is an
    exploratory multivariate statistical technique
    for simplifying complex data sets. Given m
    observations on n variables, the goal of PCA is
    to reduce the dimensionality of the data matrix
    by finding r new variables, where r is less than
    n. Termed principal components, these r new
    variables together account for as much of the
    variance in the original n variables as possible
    while remaining mutually uncorrelated and
    orthogonal. Each principal component is a linear
    combination of the original variables, and so it
    is often possible to ascribe meaning to what the
    components represent. Principal components
    analysis has been used in a wide range of
    biomedical problems, including the analysis of
    microarray data in search of outlier genes
    (Hilsenbeck et al. 1999) as well as the analysis
    of other types of expression data (Vohradsky et
    al. 1997, Craig et al. 1997).
  • Use PCA to focus on specific expression patterns
    and their changes, identify discriminating genes,
    separate contributing profiles and find trends,
    e.g. in time series or dose response curves.
  • For the dispersion matrix, use the correlation
    option when data are scaled to fit within
    boundaries, or when variables are measured in
    different units or have different variances. Most
    often, covariance is the correct choice (when
    variables are measured in the same units and have
    similar variances); see the sketch after this
    list.
  • N.B. PCA does not allow missing values in input
    data; these are filtered out
  • http://www.statsoftinc.com/textbook/stfacan.html
  • http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

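A sketch of the covariance vs. correlation choice in Python; correlation-based PCA is equivalent to running PCA on standardized (z-scored) data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # illustrative observations x variables

# Covariance option: scikit-learn centers the data automatically
pca_cov = PCA().fit(X)

# Correlation option: standardize first, so each variable
# contributes equally regardless of its units or variance
pca_cor = PCA().fit(StandardScaler().fit_transform(X))
```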
38
PCA Details
(First three components)
39
PCA Details
40
MDS/PCA comparison
  • Projection: PCA is a linear projection; MDS is
    nonlinear
  • Missing values: PCA does not allow them (they are
    filtered out); MDS allows them
  • Dissimilarities: PCA preserves large
    dissimilarities better; MDS preserves small
    dissimilarities better
  • Variables: PCA's are meaningful (information
    content known); MDS's are meaningless (information
    content not known)
  • Computation: PCA is efficient for a large number
    of samples; MDS is inefficient
  • Orientation: PCA's is meaningful; MDS's is
    arbitrary
  • Input: PCA is performed on covariance or
    correlation similarities; MDS on any type of
    (dis)similarities

Adapted from Partek Quick Start for Microarray
Analysis
41
Clustering
  • Clustering programs make clusters even if the
    data are completely random; you must examine your
    clusters to see if they make biological sense
  • If clustered by genes, are the genes in certain
    clusters biologically related in function? In a
    pathway?
  • If clustered by array, do the clusters group
    related samples/tissues/diseases/treatments
    together logically?

42
Common clustering methods
Hierarchical Clustering allows you to visualize a
set of samples or genes by organizing them into a
mock-phylogenetic tree, often referred to as a
dendrogram. In these trees, samples or genes
having similar effects on the gene expression
patterns are clustered together.
K-means clustering divides genes into distinct
groups based on their expression patterns. Genes
are initially divided into a user-defined number
(k) of equally-sized groups. Centroids are
calculated for each group, corresponding to the
average of the expression profiles. Individual
genes are then reassigned to the group whose
centroid is most similar to the gene. The process
is iterated until the group compositions converge
(see the sketch below).
Self-Organizing Maps (SOMs) are similar to
k-means clustering, but add an additional
feature: the resulting groups of genes can
be displayed in a rectangular pattern, with
adjacent groups being more similar than groups
further away. Self-Organizing Maps were invented
by Teuvo Kohonen and are used to analyze many
kinds of data.
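A minimal k-means sketch with scikit-learn (random data standing in for log2 ratios; not mAdb's implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # 100 genes x 6 arrays (illustrative)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])            # cluster assignment per gene
print(km.cluster_centers_.shape)  # (4, 6): centroid expression profiles
```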
43
Example of Hierarchical Clustering (Alizadeh et
al., Nature, Feb. 2000)
44
Dendrogram Construction for Hierarchical
Agglomerative Clustering
  • Merge two closest (least distant) objects (genes
    or arrays)
  • Subsequent merges require specification of a
    linkage method to define distance between
    clusters (see the sketch below)
  • Average linkage
  • Complete linkage
  • Single linkage

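A sketch of agglomerative clustering with scipy (illustrative data; the linkage method is selected by name):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))        # 10 genes x 6 arrays (illustrative)

d = pdist(X, metric="correlation")  # 1-correlation distances
Z = linkage(d, method="average")    # or "complete", "single"
tree = dendrogram(Z, no_plot=True)  # dendrogram structure without plotting
```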
45
Euclidean distance
Generally, the distance between two points is
taken as a common metric to assess the similarity
among the components of a population. The most
commonly used distance measure is the Euclidean
metric, which defines the distance between two
points p = (p1, p2, ...) and q = (q1, q2, ...) as

  d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... )
46
Linkage Methods
  • Average Linkage
  • Merge clusters whose average distance between all
    pairs of items (one item from each cluster) is
    minimized
  • Particularly sensitive to distance metric
  • Complete Linkage
  • Merge clusters to minimize the maximum distance
    within any resulting cluster
  • Tends to produce compact clusters
  • Single Linkage
  • Merge clusters at minimum distance from one
    another
  • Prone to chaining and sensitive to noise

47
(Data from Bittner et al., Nature, 2000)
48
Common Distance Metrics for Hierarchical
Clustering
  • Euclidean distance
  • Measures absolute distance (square root of sum of
    squared differences)
  • 1-Correlation
  • Values reflect the amount of linear association
    (pattern dissimilarity); the smaller the value,
    the more similar the gene expression patterns
    (see the sketch below)

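A minimal Python illustration of how the two metrics differ (made-up profiles):

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def one_minus_correlation(p, q):
    return 1.0 - np.corrcoef(p, q)[0, 1]

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]  # same pattern, twice the magnitude
print(euclidean(a, b))              # ~5.48: large absolute distance
print(one_minus_correlation(a, b))  # 0.0: identical expression pattern
```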
49
Server-side Hierarchical Clustering
50
Hierarchical Clustering Output
51
Expanded Heatmap Thumbnail Image
52
Tree View - for PostScript output, or for files
too large to display
http://rana.lbl.gov/EisenSoftware.htm
53
K-means
54
Self-organizing/Kohonen maps
55
Summary Remarks
  • Data quality assessment and pre-processing are
    important.
  • Different study objectives will require different
    statistical analysis approaches.
  • Different analysis methods may produce different
    results. Thoughtful application of multiple
    analysis methods may be required.
  • Chances for spurious findings are enormous, and
    validation of any findings on larger independent
    collections of specimens will be essential.
  • Analysis tools are not an adequate substitute for
    collaboration with professional statisticians and
    data analysts.

56
Acknowledgments
The Single ArrayViewer and
Multi-ArrayViewer were derived from NHGRI uAP
Toolset developed in the NHGRI/Cancer Genetics
Branch under Dr. Jeffrey Trent. The Scatterplot
and Multi-dimensional scaling tools were derived
from work done in the NCI/Biometric Research
Branch under Dr. Richard Simon. Server-side
Cluster uses a derivative of the Xcluster program
developed at Stanford University by Gavin
Sherlock, Head, Microarray Informatics.
57
Acknowledgments
  • CIT NCI mAdb
  • John Powell, Chief, BIMAS
  • Liming Yang, Ph.D.
  • Jim Tomlin
  • Carla Bock
  • Esther Asaki, SRA
  • Robin Martell, SRA
  • Kathy Meyer, SRA
  • Agara Sudhindra, SRA
  • Tammy Qiu, SRA
  • Biometric Research Branch/NCI
  • Richard Simon, Ph.D.
  • Lisa McShane, Ph.D.
  • Michael Radmacher, Ph.D.
  • Joanna Shih, Ph.D.
  • Yingdong Zhao, Ph.D.
  • MSB Section
  • NHGRI Java viewers
  • Mike Bittner
  • Yidong Chen
  • Jeff Trent

58
(No Transcript)
59
Averaging Arrays
Names/Descriptions for averaged arrays: this tool
creates a new dataset consisting of one array per
group. Each array is the average of all arrays
within a group. Averaging is done on the log base
2 ratio values (see the sketch below). The new
averaged arrays will not have an array name or
description. You may enter appropriate
Names/Descriptions to be associated with the new
arrays. If you choose not to enter values, the
name defaults to the Group designation and the
description defaults to NULL.
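A minimal sketch of the averaging step in Python (not mAdb's actual code; treating missing spots as NaN is an assumption):

```python
import numpy as np

def average_group(log2_ratios):
    """Average log2 ratios across the arrays in one group.

    `log2_ratios` is an (n_arrays x n_genes) matrix; missing spots,
    represented here as NaN (an assumption), are ignored in the mean.
    """
    return np.nanmean(np.asarray(log2_ratios, dtype=float), axis=0)

group = [[1.0, -0.5, 2.0],
         [1.4, -0.3, np.nan]]
print(average_group(group))  # [ 1.2 -0.4  2. ]
```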
60
Gene Ontology/KEGG Pathway Summary Report
61
(No Transcript)