SMD Data Quality Assessment and Repository Tools Tutorial

1
SMD Data Quality Assessment and Repository Tools
Tutorial

November 10, 2007
Catherine Ball
Janos Demeter

2
SMD: Getting Help

Click on the Help menu
Tool-specific links will be listed at the top.
Use the SMD help index to look for specific
subjects
Send e-mail to
array@genome.stanford.edu

3
Quality Assessment and Repository Tools Tutorial

Quality Assessment Tools
Ratios on Array
HEEBO/MEEBO plots
Graphing tool
Q-score
Repository
Repository
SVD
Synthetic Gene Tool
kNNimpute

4
SMD Data Repository Help

How to use the tool
Limitations of file sizes
Sharing data
Options
Links to help for analysis methods, data file
formats, data retrieval and clustering

5
SMD Help: File Formats
6
File Formats: Pre-clustering (PCL) File
Names and orders of arrays (if arrays are not
clustered)
7
File Formats: Clustering Design Tree (CDT) File
8
SMD Data Repository

What is the SMD Data Repository?
What is the repository?
Using the repository to save or upload data
Using the repository to share data
Using the repository to analyze data
Options for PCL files via the repository
View
Data
Delete
Edit
Cluster
Filter
SVD
Synthetic Genes
KNN Impute
Options for CDT files via the repository
GeneXplorer
TreeView
View Clusters, spots

9
What is the SMD Repository?

A method to save data sets to prevent repeatedly
performing the same data retrieval
A method to share processed data with others
A way SMD can provide you with access to new
and/or computationally-intensive tools

10
Accessing the SMD Data Repository
Here!
11
SMD Data Repository
12
Uploading files to Repository

If uploading clustered data, enter CDT files
If uploading pre-clustering data, enter PCL
files
Choose an organism
Give your data set a unique name
Provide a useful description of your data set

13
Using Your Repository CDT Deposits

View cluster using GeneXplorer or TreeView
View cluster images
View retrieval and clustering report
Download files
Assign access

14
Using Your Repository PCL Deposits
Apply Synthetic Genes to data
Edit the entry
Filter data
Estimate missing data with KNN impute
Download data
15
Using the Repository: CDT File Options
CDT files have a few other options
GeneXplorer
Clustering with Proxy and Spot images
TreeView
Clustering with Spot images
Clustering with Proxy images
16
Viewing Repository Entries

Name
Organism
Number of genes
Number of arrays
Size of file
Date uploaded
Description
Data retrieval summary

17
Downloading Repository Entries

Downloading puts the file(s) into a folder, labeled
with your SMD user name, on your computer's
desktop

18
Deleting Repository Entries

Shows details about your repository entry
Asks you to confirm before deleting!

19
Editing Entries -- How to Share!

Change repository entry name
Change description
Give a GROUP access to a repository entry
Give an SMD USER access to a repository entry

20
Filtering Data in Repository Entries

If your repository entry is a PCL file, you can
re-enter the SMD filtering pipeline

21
SVD: Singular Value Decomposition

The goal of SVD is to find a set of patterns that
describe the greatest amount of variance in a
dataset
SVD determines unique orthogonal (or
uncorrelated) gene and corresponding array
expression patterns (i.e. "eigengenes" and
"eigenarrays," respectively) in the data
Patterns might be correlated with biological
processes OR might be correlated with technical
artifacts
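
A minimal sketch of the decomposition described above, assuming NumPy is available and using a made-up expression matrix (rows are genes, columns are arrays). The variable names and toy values are illustrative only; this is not SMD's implementation. Following a common convention, the rows of Vt are treated as eigengenes (patterns across arrays) and the columns of U as eigenarrays (patterns across genes).

```python
import numpy as np

# Toy expression matrix (made-up values): rows = genes, columns = arrays.
expression = np.array([
    [ 2.0,  1.0, -1.0, -2.0],
    [ 1.9,  1.1, -0.9, -2.1],
    [-1.0,  1.0, -1.0,  1.0],
    [ 0.5, -0.5,  0.5, -0.5],
    [ 0.1,  0.0, -0.1,  0.0],
])

# Thin SVD: expression == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(expression, full_matrices=False)

eigengenes = Vt    # each row is one pattern across the arrays
eigenarrays = U    # each column is one pattern across the genes

# Singular values come back sorted, so the first eigengene
# describes the largest share of the variation in the data.
for index, value in enumerate(s, start=1):
    print(f"eigengene {index}: singular value {value:.3f}")
```

The eigengenes are mutually orthogonal, which is what "uncorrelated patterns" refers to on the slide.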

22
SVD: The Concept (easy version)

Let's imagine we have a three-dimensional cigar,
as shown in A
We can represent this in one dimension, by
looking at its lengthwise shadow (B)
Looking at its cross-wise shadow (C), we get an
orthogonal view of the cigar that tells us more
about the three-dimensional object than B alone.

23
SVD: Missing Data Estimation

Some algorithms (such as SVD) cannot operate with
missing data
You can use this simple method or you can use
KNNImpute to estimate missing data

24
SVD: Display in SMD
25
SVD: Raster Display

Each row represents an eigengene -- an
orthogonal representation of the genes in the
dataset
The topmost eigengene contributes the most to the
data set

26
SVD: View Projection

Clicking on a row in the Raster Display brings
you to the Projection View
You can select genes that have high and low
contributions from an eigengene and download them
in a PCL file
In this way, you might use SVD to help classify
subtypes

27
SVD: Eigenexpression

Each bar shows the probability of expression of
each eigengene
You can compare the probabilities to see which
eigengenes contribute more to the overall view
of the data
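
One common way to define this quantity is the squared singular value of an eigengene divided by the sum of all squared singular values. The slides do not state the formula SMD uses, so treat the definition below as an assumption; the function name and the input values are hypothetical.

```python
def eigenexpression_fractions(singular_values):
    """Return each eigengene's share of the total squared singular values."""
    squared = [value * value for value in singular_values]
    total = sum(squared)
    return [square / total for square in squared]


# Made-up singular values, largest first.
fractions = eigenexpression_fractions([4.0, 2.0, 1.0])
for index, fraction in enumerate(fractions, start=1):
    print(f"eigengene {index}: {fraction:.1%} of overall expression")
```

The fractions sum to 1, so the bars can be compared directly.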

28
SVD: Plot selected eigengenes

You can plot as many or as few eigengenes as you
like
This plot gives you an easy-to-understand view of
the behavior of each eigengene

29
Synthetic Genes

Purpose
average data based on arbitrary groupings of
genes
- for biological reasons
- for technical reasons
Can average data using
- common genelists
- your own genelists
After averaging
- a new row for the synthetic gene data
- Original data can be removed/included

30
Synthetic Genes

Common lists available (only mouse and human
data)
Unigene (all clones/oligos that report on a given
Unigene id will be averaged and shown as the
Unigene id)
LocusLink (same as above, but for LocusLink id)
These lists are useful to collapse data by gene,
rather than suid/luid.
They allow comparison of experiments between
different platforms - oligo print to cDNA print
or spotted arrays to Agilent arrays where the
arrays don't share common suids. They can also be
used to compare cDNA prints with HEEBO/MEEBO arrays
These synthetic gene lists are updated on a
regular basis.

31
Synthetic Genes

Other common synthetic gene lists
chromosome arms
cytobands
5 Mb tiles based on GoldenPath mappings
Tissue types
tumor types
processes
Additional lists: see
http://smd.stanford.edu/help/synthGenes.shtml

32
Synthetic Genes

You can use your own genelists
1 genelist for each synthetic gene
Name of the genelist is the synthetic gene's name

- tab-delimited text file
File must have header (NAME, WEIGHT)
NAME contains cloneid
WEIGHT can be -1 to 1 (weight of clone
during averaging)
- Can have comment lines (start with #)
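
A rough sketch of how such a genelist could be read and applied. The clone ids and data values are made up. Dividing the weighted sum by the sum of absolute weights is an assumption, since the slides do not say how SMD normalizes the weights.

```python
import csv
import io

# Hypothetical genelist: tab-delimited, NAME/WEIGHT header, '#' comments.
GENELIST_TEXT = """# example genelist (clone ids are made up)
NAME\tWEIGHT
IMAGE:100\t1
IMAGE:200\t0.5
IMAGE:300\t-1
"""


def read_genelist(text):
    """Return {clone id: weight}, skipping blank and comment lines."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    weights = {}
    for row in reader:
        weight = float(row["WEIGHT"])
        if not -1.0 <= weight <= 1.0:
            raise ValueError(f"weight out of range for {row['NAME']}")
        weights[row["NAME"]] = weight
    return weights


def synthetic_gene(weights, data):
    """Weighted average per array; missing values (None) are skipped.

    Assumption: the weighted sum is divided by the sum of |weights|.
    """
    n_arrays = len(next(iter(data.values())))
    averaged = []
    for j in range(n_arrays):
        numerator = 0.0
        denominator = 0.0
        for clone, weight in weights.items():
            values = data.get(clone)
            if values is None or values[j] is None:
                continue
            numerator += weight * values[j]
            denominator += abs(weight)
        averaged.append(numerator / denominator if denominator else None)
    return averaged


weights = read_genelist(GENELIST_TEXT)
data = {
    "IMAGE:100": [1.0, 2.0],
    "IMAGE:200": [2.0, None],
    "IMAGE:300": [-1.0, 0.5],
}
synthetic_row = synthetic_gene(weights, data)
```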

33
Synthetic Genes

Tool only works on pcl files in repository
During data retrieval the include UIDs option
should not be used
After collapsing, file can be downloaded, added
to your repository, and/or clustered
Currently works only for human and mouse data

34
Synthetic Genes/Merge PCL Files

Related tool: Merge PCL Files
On main page (lists menu -> all programs) under
tools section
Can be used to combine 2 pcl files from different
sources into a single pcl file.
Cloneids that belong to the same gene can be
combined into single row (based on a translation
file provided).

35
Synthetic Genes/Merge PCL Files
36
Synthetic Genes/Merge PCL Files

Same experiments in the pcl files can be averaged
Averaging method can be mean/median
Translation file:
Tab-delimited text file
First column: desired final identifier
Second column: desired final annotation
Third and subsequent columns: identifiers (first
column of a pcl file) in the pcl files that
should be collapsed to the identifier in the
first column.
Data for identifiers not included in the
translation file will not be collapsed
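
The collapsing step can be sketched as follows, assuming a translation file with no header line and a data set held as a dictionary of rows. The identifiers, values, and function names are hypothetical; this is not the tool's actual code.

```python
import statistics

# Hypothetical translation file: final id, final annotation, source ids.
TRANSLATION_TEXT = "GENE1\tgene one\tCLONE_A\tOLIGO_B\n"


def read_translation(text):
    """Map each source identifier to its final identifier."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        final_id, _annotation, *source_ids = line.split("\t")
        for source_id in source_ids:
            mapping[source_id] = final_id
    return mapping


def collapse(rows, mapping, method="mean"):
    """Average rows that translate to the same final identifier.

    Identifiers missing from the mapping are kept as they are.
    """
    average = statistics.mean if method == "mean" else statistics.median
    groups = {}
    for identifier, values in rows.items():
        final_id = mapping.get(identifier, identifier)
        groups.setdefault(final_id, []).append(values)
    collapsed = {}
    for final_id, members in groups.items():
        averaged = []
        for column in zip(*members):
            present = [v for v in column if v is not None]
            averaged.append(average(present) if present else None)
        collapsed[final_id] = averaged
    return collapsed


rows = {
    "CLONE_A": [1.0, 3.0],
    "OLIGO_B": [3.0, None],
    "CLONE_C": [5.0, 5.0],   # not in the translation file
}
merged = collapse(rows, read_translation(TRANSLATION_TEXT), method="mean")
```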

37
KNNImpute: The Missing Values Problem

Microarrays can have systematic or random missing
values
Some algorithms aren't robust to missing values
Large literature on parameter estimation exists
What's best to do for microarrays?

38
Why Estimate Missing Values?

39
KNNimpute Algorithm

Idea: use genes with similar expression profiles
to estimate missing values
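
A simplified sketch of the idea, not the KNNimpute program itself. It assumes Euclidean distance over the values two genes share and a plain, unweighted mean of the k nearest genes. The matrix values and the function name are made up.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k most similar rows."""
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square difference over the shared columns.
                distance = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((distance, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Rows are genes, columns are arrays; None marks a missing value.
matrix = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, -4.0, -9.0],
]
filled = knn_impute(matrix, k=2)
```

The first gene's missing value is estimated from the two genes whose profiles are closest to it; the dissimilar fourth gene is ignored.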

40
Clustering: Cluster Image

Scale is indicated on the color bar
Gene names are at the right
Tree generated by hierarchical clustering is at
the left

41
Clustering: Display Clustered Spot Images

Spot images can also be viewed in a clustered
image
This can give you a visual impression of the data
that are the basis of your analysis

42
Clustering: Display Adjacent Cluster and
Clustered Spot Images
43
GENEXPLORER
44
TREEVIEW
45
SMD: Getting Help

Click on the Help menu
Tool-specific links will be listed at the top.
Use the SMD help index to look for specific
subjects
Send e-mail to
array@genome.stanford.edu

46
Quality Assessment and Repository Tutorial

Quality assessment tools
Ratios on Array
H/Meebo plots
Graphing tool
Q-score
Repository
Repository
SVD
Synthetic Gene Tool
kNNimpute

47
Ratios on Array Tool

Accessible from the display data -> view data
pages
Ratios on array

48
Ratios on Array Tool

Quick visualization of log-ratio distribution on
the slide
Color assignments are based on log-ratio values
and also intensity
Can visualize normalized or non-normalized
log-ratios
PLUS ANOVA analysis to detect spatial bias
(print-tip or plate)

49
Ratios on Array Tool

Not normalized vs. normalized (loess intensity,
print-tip)

50
Ratios on Array Tool

One-way ANOVA to test dependence of log-ratios on
print-tip and printing plate
F-statistic is given for the hypothesis "no bias
in data"
In the example, normalization significantly
reduced print-tip bias
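
For reference, a one-way ANOVA F-statistic can be computed as below. This is a textbook sketch with made-up log-ratios grouped by print-tip; it is not the code SMD runs, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """F-statistic for 'all group means are equal'.

    groups: one list of log-ratios per print-tip (or per plate).
    """
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, means))
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up log-ratios for two print-tips, before and after normalization.
before = [[1.0, 1.2, 0.8], [-1.0, -0.8, -1.2]]
after = [[0.1, -0.1, 0.0], [0.1, -0.1, 0.0]]

f_before = one_way_anova_f(before)
f_after = one_way_anova_f(after)
```

A large F-statistic means the log-ratios depend on the print-tip; a value near zero is consistent with no spatial bias.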

51
HEEBO/MEEBO plots
Single experiment
Batch access

HEEBO/MEEBO quality assessment graphs from
BioConductor package
If you used doping controls on the slide, the
graphs are automatically generated during
experiment loading
Accessible from:
For single experiment: display data -> view data
pages
For batch: from main page, under tools
You can create new graphs or look at existing
ones
Help page: http://smd.stanford.edu/help/arrayQuality.shtml

52
HEEBO/MEEBO plots

Can be used for a gpr file uploaded from desktop
- print has to be present in SMD and oligo_ids in
the id/name column
In batch for a result set list on
loader.stanford.edu
If called for a specific experiment, the values
are already filled in.
Normalization options available from limma. Note:
this normalization will NOT change data stored in
SMD; it is only used for generating graphs
Background subtraction methods - same story as
normalization
Job is placed in the job-queue - email is sent
with link

55
HEEBO/MEEBO plots
56
HEEBO/MEEBO diagnostics

MA-plots before and after normalization
A = 1/2 (log2(Cy5) + log2(Cy3))
M = log2(Cy5 / Cy3)
Loess lines are shown for sectors if print-tip
normalization was selected
Distribution should be centered around M = 0, with
no intensity dependence
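
The two MA-plot coordinates can be computed per spot as in this small sketch. The intensities are made up and the function name is hypothetical.

```python
import math


def ma_values(cy5, cy3):
    """Return (M, A) for one spot from its two channel intensities."""
    m = math.log2(cy5 / cy3)                       # log-ratio
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))    # mean log-intensity
    return m, a


# Made-up intensities: Cy5 is four times brighter than Cy3.
m_value, a_value = ma_values(1024.0, 256.0)
print(m_value, a_value)
```

A spot with equal Cy5 and Cy3 intensities has M = 0 regardless of A, which is why a well-normalized slide is centered on the M = 0 line.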

57
HEEBO/MEEBO diagnostics

Distribution of ranked log-ratios (M-values) on
slide, before and after normalization
Spatial distribution of non-normalized A-values

58
HEEBO/MEEBO diagnostics

Histograms of signal-to-noise ratios for Cy5
(upper) and Cy3 (lower) channels
Histogram for all probes (probe) and curves for
subgroups (doping, negative, positive controls
and actual probes)

59
HEEBO/MEEBO diagnostics

Box-plots for groups of reporters (colors same as
on previous)
A-values without background subtraction
Normalized M-values for positive/negative
controls (should be around 0 for type 1
experiment)

60
HEEBO/MEEBO doping controls

Amount of doping control (DC) vs. observed
fluorescence intensity
Expected sigmoid curve
Additional graphs for individual DCs

61
HEEBO/MEEBO doping controls

non-normalized Cy5 vs. Cy3 signal intensity (log2
scale) (background corrected if selected)
DCs with same ratio should fit line parallel to
diagonal
Log-ratio increases from top left to bottom right

62
HEEBO/MEEBO doping controls

Observed vs. expected log-ratios (normalized and
bg corrected) for each doping control group
Ratios should be aligned on the diagonal
Graphs for individual doping controls as well

63
HEEBO/MEEBO mismatch and tiling controls

Mismatch and tiling probes are used for 2 tests
Assess integrity of sample (degradation) - tiling
probes
Degree of cross-hybridization - mismatch probes
Mutations are anchored (at the extremities) or
distributed (along transcript)
Calculated binding energies vs. normalized (i.e.
divided by median of corresponding wild type
probes) raw intensities

64
HEEBO/MEEBO mismatch and tiling controls

Percent mismatch vs. log2 intensity for anchored
(blue) and distributed (green) probes
Wild-type probe on left (red box-plot) and
negative controls on the right (red box-plot)
Right axis: fraction of all A-values

65
HEEBO/MEEBO mismatch and tiling controls

Tiling probes were designed along the transcript
Non-normalized signal intensities (Cy5 and Cy3)
vs. the probe's distance from the 3'-end
A quick drop in signal indicates a problem in the
sample (degradation/IVT)

66
Graphing Tool

Can be accessed directly from display data page
or from view data page.
It allows you to create graphs of any two data
columns in linear or log space
Can be applied to an individual experiment or in
batch to an experiment set
Interactive tool

67
Graphing Tool

In batch mode (for experiment set) it can be
configured to work on a subset of the experiments
in the set.

68
Graphing Tool

Can create scatter plots or histograms
Can transform data to log space
Wide selection of data columns to choose from
Combine with data filter to look at distribution
of subset of the data

71
Graphing Tool: Filter selection

Data filters should be customized for the data
retrieved.
Graphing tool helps in filter selection and
finding a cut-off value

72
Graphing Tool: Filter selection

Plot filter field (here regression correlation)
against test field (log ratio).
Log ratios should center around 0.
Here, the log ratios appear to diverge below a
regression correlation of about 0.4 - 0.6.

73
Graphing Tool: Filter selection

Foreground / Background (log scale) plotted
against log-ratio
Data should center around a log ratio of zero
Impose cutoff at 2.0 (linear, 0.3 log10) to
eliminate flare at low relative intensity.
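
Applied to retrieved data, such a cutoff amounts to a simple per-spot test. The sketch below uses made-up spots and hypothetical field names; only the 2.0 threshold comes from the slide.

```python
import math

CUTOFF = 2.0  # linear foreground/background ratio, about 0.3 in log10


def passes_filter(foreground, background, cutoff=CUTOFF):
    """Keep a spot only if its foreground is at least cutoff x background."""
    return background > 0 and foreground / background >= cutoff


# Made-up spots; 'fg' and 'bg' are hypothetical column names.
spots = [
    {"id": "spot1", "fg": 1200.0, "bg": 150.0},
    {"id": "spot2", "fg": 180.0, "bg": 150.0},
    {"id": "spot3", "fg": 400.0, "bg": 200.0},
]
kept = [spot["id"] for spot in spots
        if passes_filter(spot["fg"], spot["bg"])]
print(kept, round(math.log10(CUTOFF), 1))
```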

74
Graphing Tool: Filter selection

As intensity decreases, the log(ratio) tends to
scatter
Spots with low intensities might seem falsely
significant
A cut-off value of 250 (approx. 2^8) is suggested for
Ch2 normalized net

75
Q-score Tool

Tool to use for filter and cut-off value
selection
Currently usable for cDNA slides (uses UNIGENE
clusterid), for human and mouse (will be extended
to HEEBO/MEEBO arrays)
Still in experimental stage
Simple idea: pool reporters that belong to the same
gene, calculate their spread, and combine the values
for each gene into a score for the whole array ->
Q-score
Filtering that removes bad quality spots should
decrease spread of measurements for genes, hence
improve (decrease) Q-score
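
The slides do not give the exact formula, so the sketch below is only one possible reading of the idea. It assumes the spread is the sample standard deviation of log-ratios per gene and that the per-gene values are combined by taking their mean. The gene names, values, and function name are hypothetical.

```python
import statistics


def q_score(log_ratios_by_gene):
    """Mean per-gene spread of log-ratios (assumed definition).

    Genes with fewer than two reporters are skipped, because a spread
    cannot be computed from a single value.
    """
    spreads = [statistics.stdev(values)
               for values in log_ratios_by_gene.values()
               if len(values) >= 2]
    return statistics.mean(spreads) if spreads else None


# Made-up log-ratios pooled by gene, before and after filtering.
unfiltered = {"geneA": [1.0, 1.2, 3.0], "geneB": [-0.5, -0.4]}
filtered = {"geneA": [1.0, 1.2], "geneB": [-0.5, -0.4]}

score_before = q_score(unfiltered)
score_after = q_score(filtered)
```

Removing the outlying reporter for geneA lowers the score, which is the behavior the slide describes for a filter that removes bad spots.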

76
Q-score Tool

Works in batch for a group of slides (from same
print) in a result set list
Requires a genelist that specifies which
reporters to use. Common genelists are available
for human and mouse clusterids
Filters and their ranges need to be defined.
Log-ratio mean/median is used to calculate
Q-score
Run from the job-queue, email is sent to user
with the link

77
Q-score Tool

Output is a set of graphs showing the fraction of
reporters not filtered out and the corresponding
Q-scores at increasingly stringent filter values
Cut-off values (if found) are saved in a new result
set list for data retrieval

78
SMD Office Hours

Grant S201
Mondays 3 - 5 pm
Wednesdays 2 - 4 pm

79
SMD Staff
Gavin Sherlock, Co-Investigator
Catherine Ball, Director
Patrick Brown, Co-Investigator
Farrell Wymore, Lead Programmer
Michael Nitzberg, Database Administrator
Catherine Beauheim, Scientific Programmer
Zac Zachariah, Systems Administrator
Janos Demeter, Computational Biologist
Heng Jin, Scientific Programmer
Takashi Kido, Visiting Scholar
Don Maier, Senior Software Engineer
