SMD Data Quality Assessment and Repository Tools Tutorial

1
SMD Data Quality Assessment and Repository Tools
Tutorial
  • November 10, 2007
  • Catherine Ball
  • Janos Demeter

2
SMD Getting Help
  • Click on the Help menu
  • Tool-specific links will be listed at the top.
  • Use the SMD help index to look for specific
    subjects
  • Send e-mail to array_at_genome.stanford.edu

3
Quality Assessment and Repository Tools Tutorial
  • Quality Assessment Tools
  • - Ratios on Array
  • - HEEBO/MEEBO plots
  • - Graphing tool
  • - Q-score
  • Repository
  • - Repository
  • - SVD
  • - Synthetic Gene Tool
  • - kNNimpute

4
SMD Data Repository Help
  • How to use the tool
  • Limitations of file sizes
  • Sharing data
  • Options
  • Links to help for analysis methods, data file
    formats, data retrieval and clustering

5
SMD Help: File Formats
6
File Formats: Pre-clustering (PCL) File
Names and orders of arrays (if arrays are not
clustered)
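
To make the PCL layout concrete, here is a minimal reader sketch. It assumes the commonly documented PCL layout (tab-delimited; a header row with an identifier column, NAME, GWEIGHT, then one column per array; an optional EWEIGHT row; blank cells for missing data). The function name `read_pcl` and the sample text are hypothetical and are not part of SMD.

```python
def read_pcl(text):
    """Return (array_names, rows) from PCL-formatted text.

    Assumes columns 0-2 are identifier, NAME and GWEIGHT, and that an
    optional EWEIGHT row follows the header. Blank cells become None.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    arrays = header[3:]  # array names, in the order they appear in the file
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[0] == "EWEIGHT":
            continue  # per-array weights, not expression data
        values = [float(v) if v.strip() else None for v in fields[3:]]
        values += [None] * (len(arrays) - len(values))  # pad short rows
        rows.append((fields[0], fields[1], values))
    return arrays, rows


# Hypothetical two-gene, two-array file.
pcl_text = (
    "UID\tNAME\tGWEIGHT\tarray_1\tarray_2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "G1\tgene one\t1\t0.5\t-1.2\n"
    "G2\tgene two\t1\t\t0.3\n"
)
arrays, rows = read_pcl(pcl_text)
```

The array names come straight from the header row, which is where the names and order of arrays live when the arrays have not been clustered.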
7
File Formats: Clustering Design Tree (CDT) File
8
SMD Data Repository
  • What is the SMD Data Repository?
  • - What is the repository?
  • - Using the repository to save or upload data
  • - Using the repository to share data
  • - Using the repository to analyze data
  • Options for PCL files via the repository
  • - View
  • - Data
  • - Delete
  • - Edit
  • - Cluster
  • - Filter
  • - SVD
  • - Synthetic Genes
  • - KNN Impute
  • Options for CDT files via the repository
  • - GeneXplorer
  • - TreeView
  • - View Clusters, spots

9
What is the SMD Repository?
  • A method to save data sets to prevent repeatedly
    performing the same data retrieval
  • A method to share processed data with others
  • A way SMD can provide you with access to new
    and/or computationally-intensive tools

10
Accessing the SMD Data Repository
Here!
11
SMD Data Repository
12
Uploading files to Repository
  • If uploading clustered data, enter CDT files
  • If uploading pre-clustering data, enter PCL
    files
  • Choose an organism
  • Give a unique name to your data set
  • Provide a useful description to your data set

13
Using Your Repository CDT Deposits
  • View cluster using GeneXplorer or TreeView
  • View cluster images
  • View retrieval and clustering report
  • Download files
  • Assign access

14
Using Your Repository PCL Deposits
Apply Synthetic Genes to data
Edit the entry
Filter data
Estimate missing data with KNN impute
Download data
15
Using the Repository: CDT File Options
CDT files have a few other options
GeneXplorer
Clustering with Proxy and Spot images
TreeView
Clustering with Spot images
Clustering with Proxy images
16
Viewing Repository Entries
  • Name
  • Organism
  • Number of genes
  • Number of arrays
  • Size of file
  • Date uploaded
  • Description
  • Data retrieval summary

17
Downloading Repository Entries
  • Downloading puts file(s) into a folder labeled
    with your SMD user name on your computer's
    desktop

18
Deleting Repository Entries
  • Details about your repository entry
  • Asks you to confirm before deleting!

19
Editing Entries -- How to Share!
  • Change repository entry name
  • Change description
  • Add access to repository entry to a GROUP
  • Add access to a repository entry to a SMD USER

20
Filtering Data in Repository Entries
  • If your repository entry is a PCL file, you can
    re-enter the SMD filtering pipeline

21
SVD: Singular Value Decomposition
  • The goal of SVD is to find a set of patterns that
    describe the greatest amount of variance in a
    dataset
  • SVD determines unique orthogonal (or
    uncorrelated) gene and corresponding array
    expression patterns (i.e. "eigengenes" and
    "eigenarrays," respectively) in the data
  • Patterns might be correlated with biological
    processes OR might be correlated with technical
    artifacts
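
As a rough illustration of the decomposition described on this slide, the sketch below runs NumPy's `numpy.linalg.svd` on a small genes x arrays matrix. The function name `svd_patterns` and the toy values are hypothetical, and the naming assumes the common convention in which rows of the right factor are called eigengenes and columns of the left factor eigenarrays; SMD's own implementation may differ in detail.

```python
import numpy as np


def svd_patterns(expression):
    """Decompose a genes x arrays matrix.

    Returns (eigenarrays, singular_values, eigengenes) such that
    expression = eigenarrays @ diag(singular_values) @ eigengenes.
    """
    data = np.asarray(expression, dtype=float)
    # full_matrices=False returns the compact ("economy") decomposition.
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    eigenarrays = u   # each column is a pattern across genes
    eigengenes = vt   # each row is a pattern across arrays
    return eigenarrays, s, eigengenes


# Toy data: 4 genes (rows) x 3 arrays (columns) of log-ratios.
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [-1.0, -2.1, -3.0],
    [0.5, 0.1, -0.4],
])
eigenarrays, singular_values, eigengenes = svd_patterns(toy)
```

The singular values come back sorted from largest to smallest, so the first eigengene is the pattern that accounts for the most variation in the data. The decomposition itself cannot tell whether a pattern is biological or a technical artifact.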

22
SVD: The Concept (easy version)
  • Let's imagine we have a three-dimensional cigar,
    as shown in A
  • We can represent this in one dimension, by
    looking at its lengthwise shadow (B)
  • Looking at its cross-wise shadow (C), we get an
    orthogonal view of the cigar that tells us more
    about the three-dimensional object than B alone.

23
SVD: Missing Data Estimation
  • Some algorithms (such as SVD) cannot operate with
    missing data
  • You can use this simple method or you can use
    KNNImpute to estimate missing data

24
SVD: Display in SMD
25
SVD: Raster Display
  • Each row represents an eigengene -- an
    orthogonal representation of the genes in the
    dataset
  • The topmost eigengene contributes the most to the
    data set

26
SVD: View Projection
  • Clicking on a row in the Raster Display brings
    you the Projection View
  • You can select genes that have high and low
    contributions from an eigengene and download them
    in a PCL file
  • In this way, you might use SVD to help classify
    subtypes

27
SVD: Eigenexpression
  • Each bar shows the probability of expression of
    each eigengene
  • You can compare the probabilities to see which
    eigengenes contribute more to the overall view
    of the data
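
One common way to compute such a per-eigengene probability is to take each squared singular value as a fraction of the sum of all squared singular values. The sketch below assumes that definition; the slide does not state the exact formula SMD uses, and the function name and input values are hypothetical.

```python
def eigenexpression_fractions(singular_values):
    """Return each eigengene's share of the total squared singular values."""
    total = sum(s * s for s in singular_values)
    if total == 0:
        raise ValueError("all singular values are zero")
    return [s * s / total for s in singular_values]


# Hypothetical singular values for three eigengenes.
fractions = eigenexpression_fractions([6.0, 2.0, 1.0])
```

With these inputs the first eigengene accounts for 36/41 of the total, so it dominates the overall view of the data.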

28
SVD: Plot selected eigengenes
  • You can plot as many or as few eigengenes as you
    like
  • This plot gives you an easy-to-understand view of
    the behavior of each eigengene

29
Synthetic Genes
  • Purpose
  • average data based on arbitrary groupings of
    genes
  • - for biological reasons
  • - for technical reasons
  • Can average data using
  • - common genelists
  • - your own genelists
  • After averaging
  • - a new row for the synthetic gene data
  • - Original data can be removed/included

30
Synthetic Genes
  • Common lists available (only mouse and human
    data)
  • Unigene (all clones/oligos that report on a given
    Unigene id will be averaged and shown as the
    Unigene id)
  • LocusLink (same as above, but for LocusLink id)
  • These lists are useful to collapse data by gene,
    rather than suid/luid.
  • They allow comparison of experiments between
    different platforms - oligo print to cDNA print
    or spotted arrays to Agilent arrays, where the
    arrays don't share common suids. They can also be
    used to compare cDNA prints with HEEBO/MEEBO arrays
  • These synthetic gene lists are updated on a
    regular basis.

31
Synthetic Genes
  • Other common synthetic gene lists
  • chromosome arms
  • cytobands
  • 5 Mb tiles based on GoldenPath mappings
  • Tissue types
  • tumor types
  • processes
  • Additional lists: see
    http://smd.stanford.edu/help/synthGenes.shtml

32
Synthetic Genes
  • You can use your own genelists
  • - 1 genelist for each synthetic gene
  • - Name of the genelist is the synthetic gene's name
  • - Tab-delimited text file
  • - File must have header (NAME, WEIGHT)
  • - NAME contains cloneid
  • - WEIGHT can be -1 to 1 (weight of clone
    during averaging)
  • - Can have comment lines (start with #)
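
The genelist format above can be sketched as follows. The parser follows the slide (tab-delimited, NAME/WEIGHT header, comment lines skipped). The averaging rule is an assumption: the slide does not say how weights are normalized, so this sketch divides the weighted sum by the sum of absolute weights. All names and clone ids are hypothetical.

```python
import csv
import io


def read_genelist(text):
    """Parse a tab-delimited genelist with a NAME/WEIGHT header.

    Lines starting with '#' are treated as comments and skipped.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return {row["NAME"]: float(row["WEIGHT"]) for row in reader}


def synthetic_gene(weights, data):
    """Weighted average per array over the clones in the genelist.

    data maps clone id -> list of values (None marks missing data).
    Assumption: normalize by the sum of absolute weights.
    """
    n_arrays = len(next(iter(data.values())))
    result = []
    for j in range(n_arrays):
        num = den = 0.0
        for clone, w in weights.items():
            value = data.get(clone, [None] * n_arrays)[j]
            if value is not None:
                num += w * value
                den += abs(w)
        result.append(num / den if den else None)
    return result


genelist_text = (
    "# hypothetical genelist for one synthetic gene\n"
    "NAME\tWEIGHT\n"
    "IMAGE:1\t1\n"
    "IMAGE:2\t0.5\n"
    "IMAGE:3\t-1\n"
)
data = {
    "IMAGE:1": [2.0, 1.0],
    "IMAGE:2": [4.0, None],
    "IMAGE:3": [1.0, 3.0],
}
weights = read_genelist(genelist_text)
row = synthetic_gene(weights, data)
```

A negative weight lets a clone that is expected to move in the opposite direction count against the average instead of toward it.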

33
Synthetic Genes
  • Tool only works on pcl files in repository
  • During data retrieval the include UIDs option
    should not be used
  • After collapsing, file can be downloaded, added
    to your repository, and/or clustered
  • Currently works only for human and mouse data

34
Synthetic Genes/Merge PCL Files
  • Related tool: Merge PCL Files
  • On main page (lists menu -> all programs) under
    tools section
  • Can be used to combine 2 pcl files from different
    sources into a single pcl file.
  • Cloneids that belong to the same gene can be
    combined into single row (based on a translation
    file provided).

35
Synthetic Genes/Merge PCL Files
36
Synthetic Genes/Merge PCL Files
  • Same experiments in the pcl files can be averaged
  • Averaging method can be mean/median
  • Translation file:
  • - Tab-delimited text file
  • - First column: desired final identifier
  • - Second column: desired final annotation
  • - Third and subsequent columns: identifiers (first
    column of a pcl file) in the pcl files that
    should be collapsed to the identifier in the
    first column.
  • Data for identifiers not included in the
    translation file will not be collapsed
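
The translation-file logic can be sketched as below: parse the three-or-more-column file, then collapse rows that map to the same final identifier using the mean or median. This is an illustration of the idea only, not the Merge PCL Files tool itself; function names and identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def parse_translation(text):
    """Map each source identifier to (final identifier, final annotation)."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need final id, annotation and at least one source id
        final_id, annotation, *sources = fields
        for source in sources:
            if source:
                mapping[source] = (final_id, annotation)
    return mapping


def collapse_rows(rows, mapping, method="mean"):
    """Collapse (identifier, values) rows that share a final identifier.

    Rows whose identifier is absent from the mapping pass through as-is.
    None marks missing data and is ignored when averaging.
    """
    average = mean if method == "mean" else median
    groups = defaultdict(list)
    order = []
    for identifier, values in rows:
        key = mapping.get(identifier, (identifier, ""))[0]
        if key not in groups:
            order.append(key)
        groups[key].append(values)
    collapsed = []
    for key in order:
        merged = []
        for column in zip(*groups[key]):
            present = [v for v in column if v is not None]
            merged.append(average(present) if present else None)
        collapsed.append((key, merged))
    return collapsed


translation = (
    "TP53\ttumor protein p53\tIMAGE:10\tOLIGO:7\n"
    "MYC\tMYC proto-oncogene\tIMAGE:22\n"
)
rows = [
    ("IMAGE:10", [1.0, 2.0]),
    ("OLIGO:7", [3.0, None]),
    ("IMAGE:22", [0.5, 0.5]),
    ("IMAGE:99", [9.0, 9.0]),
]
merged = collapse_rows(rows, parse_translation(translation))
```

Here the cDNA clone and the oligo that both report on TP53 end up in one row, while `IMAGE:99`, which is not in the translation file, is left untouched.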

37
KNNImpute: The Missing Values Problem
  • Microarrays can have systematic or random missing
    values
  • Some algorithms aren't robust to missing values
  • Large literature on parameter estimation exists
  • What's best to do for microarrays?

38
Why Estimate Missing Values?

 
39
KNNimpute: Algorithm
  • Idea: use genes with similar expression profiles
    to estimate missing values
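
A minimal sketch of this idea follows. It is a simplification, not the KNNimpute program itself: similarity is a Euclidean distance over the arrays observed in both genes, and the estimate is the unweighted mean of the k nearest genes' values. The function name, the value of k and the toy data are assumptions for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar genes (rows).

    Distance is the root-mean-square difference over columns observed in
    both rows. Only genes with a value in the target column are candidates.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                if other_index == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data: 4 genes x 3 arrays, one missing value in the first gene.
genes = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [-2.0, -1.0, 0.0],
]
imputed = knn_impute(genes, k=2)
```

The missing value is estimated from the two genes whose profiles track the first gene closely; the fourth gene, which behaves differently, is ignored.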

40
Clustering: Cluster Image
  • Scale is indicated on the color bar
  • Gene names are at the right
  • Tree generated by hierarchical clustering is at
    the left

41
Clustering: Display Clustered Spot Images
  • Spot images can also be viewed in a clustered
    image
  • This can give you a visual impression of the data
    that are the basis of your analysis

42
Clustering: Display Adjacent Cluster and
Clustered Spot Images
43
GENEXPLORER
44
TREEVIEW
45
SMD Getting Help
  • Click on the Help menu
  • Tool-specific links will be listed at the top.
  • Use the SMD help index to look for specific
    subjects
  • Send e-mail to array_at_genome.stanford.edu

46
Quality Assessment and Repository Tutorial
  • Quality assessment tools
  • - Ratios on Array
  • - H/Meebo plots
  • - Graphing tool
  • - Q-score
  • Repository
  • - Repository
  • - SVD
  • - Synthetic Gene Tool
  • - kNNimpute

47
Ratios on Array Tool
  • Accessible from the display data -> view data
    pages
  • Ratios on array

48
Ratios on Array Tool
  • Quick visualization of log-ratio distribution on
    the slide
  • Color assignments are based on log-ratio values
    and also intensity
  • Can visualize normalized or non-normalized
    log-ratios
  • PLUS ANOVA analysis to detect spatial bias
    (print-tip or plate)

49
Ratios on Array Tool
  • Not normalized vs. normalized (loess intensity,
    print-tip)

50
Ratios on Array Tool
  • One-way ANOVA to test dependence of log-ratios on
    print-tip and printing plate
  • F-statistic is given for the hypothesis "no bias
    in data"
  • In the example, normalization significantly
    reduced print-tip bias
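
The F-statistic for such a test can be computed as in the sketch below, which groups log-ratios by a factor such as print-tip and compares between-group to within-group variance. This is the textbook one-way ANOVA formula; the function name and toy values are hypothetical and the tool's own output may include more than this.

```python
from collections import defaultdict


def one_way_anova_f(log_ratios, groups):
    """Return the one-way ANOVA F-statistic for values grouped by a factor."""
    by_group = defaultdict(list)
    for value, group in zip(log_ratios, groups):
        by_group[group].append(value)
    n = len(log_ratios)
    k = len(by_group)
    grand_mean = sum(log_ratios) / n
    # Variation of group means around the grand mean.
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2
        for v in by_group.values())
    # Variation of individual spots around their own group mean.
    ss_within = sum(
        (x - sum(v) / len(v)) ** 2
        for v in by_group.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data: spots from print-tip 2 are shifted by about one log unit.
biased = one_way_anova_f(
    [0.1, -0.1, 0.0, 0.9, 1.1, 1.0], [1, 1, 1, 2, 2, 2])
unbiased = one_way_anova_f(
    [0.1, -0.1, 0.1, -0.1], [1, 1, 2, 2])
```

A large F (about 150 in the biased toy case) argues against the "no bias" hypothesis, while an F near zero means the group means are indistinguishable.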

51
HEEBO/MEEBO plots
Single experiment
Batch access
  • HEEBO/MEEBO quality assessment graphs from
    BioConductor package
  • If you used doping controls on the slide, the
    graphs are automatically generated during
    experiment loading
  • Accessible from:
  • - For single experiment: display data -> view data
    pages
  • - For batch: from main page, under tools
  • You can create new graphs or look at existing
    ones
  • Help page:
    http://smd.stanford.edu/help/arrayQuality.shtml

52
HEEBO/MEEBO plots
  • Can be used for a gpr file uploaded from desktop
    - print has to be present in SMD and oligo_ids in
    the id/name column
  • In batch for a result set list on
    loader.stanford.edu
  • If called for a specific experiment, the values
    are already filled in.
  • Normalization options available from limma. Note:
    this normalization will NOT change data stored in
    SMD; it is only used for generating graphs
  • Background subtraction methods - same story as
    normalization
  • Job is placed in the job-queue - email is sent
    with link

55
HEEBO/MEEBO plots
56
HEEBO/MEEBO diagnostics
  • MA-plots before and after normalization
  • A = 1/2 (log2(Cy5) + log2(Cy3))
  • M = log2(Cy5 / Cy3)
  • Loess lines are shown for sectors if print-tip
    normalization was selected
  • Distribution should be centered around M = 0, with
    no intensity dependence
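
The M and A values plotted here are the log-ratio, M = log2(Cy5 / Cy3), and the average log intensity, A = 1/2 (log2(Cy5) + log2(Cy3)). A small sketch of the calculation, with a hypothetical function name and made-up intensities:

```python
import math


def ma_values(cy5, cy3):
    """Return (M, A) for one spot from its Cy5 and Cy3 intensities."""
    m = math.log2(cy5 / cy3)                       # log-ratio
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))    # mean log intensity
    return m, a


# Hypothetical spot: Cy5 is four times as bright as Cy3.
m, a = ma_values(1600.0, 400.0)
```

For this spot M is 2 (a four-fold ratio) and A is log2(800), the log of the geometric mean of the two channels.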

57
HEEBO/MEEBO diagnostics
  • Distribution of ranked log-ratios (M-values) on
    slide, before and after normalization
  • Spatial distribution of non-normalized A-values

58
HEEBO/MEEBO diagnostics
  • Histograms of signal-to-noise ratios for Cy5
    (upper) and Cy3 (lower) channels
  • Histogram for all probes (probe) and curves for
    subgroups (doping, negative, positive controls
    and actual probes)

59
HEEBO/MEEBO diagnostics
  • Box-plots for groups of reporters (colors same as
    on previous)
  • A-values without background subtraction
  • Normalized M-values for positive/negative
    controls (should be around 0 for type 1
    experiment)

60
HEEBO/MEEBO doping controls
  • Amount of doping control (DC) vs. observed
    fluorescence intensity
  • Expected sigmoid curve
  • Additional graphs for individual DCs

61
HEEBO/MEEBO doping controls
  • non-normalized Cy5 vs. Cy3 signal intensity (log2
    scale) (background corrected if selected)
  • DCs with same ratio should fit line parallel to
    diagonal
  • Log-ratio increases from top left to bottom right

62
HEEBO/MEEBO doping controls
  • Observed vs. expected log-ratios (normalized and
    bg corrected) for each doping control group
  • Ratios should be aligned on the diagonal
  • Graphs for individual doping controls as well

63
HEEBO/MEEBO mismatch and tiling controls
  • Mismatch and tiling probes are used for 2 tests
  • Assess integrity of sample (degradation) - tiling
    probes
  • Degree of cross-hybridization - mismatch probes
  • Mutations are anchored (at the extremities) or
    distributed (along transcript)
  • Calculated binding energies vs. normalized (i.e.
    divided by median of corresponding wild type
    probes) raw intensities

64
HEEBO/MEEBO mismatch and tiling controls
  • Percent mismatch vs. log2 intensity for anchored
    (blue) and distributed (green) probes
  • Wild-type probe on left (red box-plot) and
    negative controls on the right (red box-plot)
  • Right axis: fraction of all A-values

65
HEEBO/MEEBO mismatch and tiling controls
  • Tiling probes were designed along the transcript
  • Non-normalized signal intensities (Cy5 and Cy3)
    vs. probe's distance from the 3'-end
  • Quick drop in signal indicates a problem in the
    sample (degradation/IVT)

66
Graphing Tool
  • Can be accessed directly from display data page
    or from view data page.
  • It allows you to create graphs of any two data
    columns in linear or log space
  • Can be applied for individual experiment or in
    batch for experiment set
  • Interactive tool

67
Graphing Tool
  • In batch mode (for experiment set) it can be
    configured to work on a subset of the experiments
    in the set.

68
Graphing Tool
  • Can create scatter plots or histograms
  • Can transform data to log space
  • Wide selection of data columns to choose from
  • Combine with data filter to look at distribution
    of subset of the data

71
Graphing Tool: Filter selection
  • Data filters should be customized for the data
    retrieved.
  • Graphing tool helps in filter selection and
    finding a cut-off value

72
Graphing Tool: Filter selection
  • Plot filter field (here: regression correlation)
    against test field (log ratio).
  • Log ratios should center around 0.
  • Here, the log ratios appear to diverge below a
    regression correlation of about 0.4 - 0.6.

73
Graphing Tool: Filter selection
  • Foreground / Background (log scale) plotted
    against log-ratio
  • Data should center around a log ratio of zero
  • Impose cutoff at 2.0 (linear, 0.3 log10) to
    eliminate flare at low relative intensity.
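
Applied outside the tool, this cutoff could look like the sketch below. The 2.0 threshold and its log10 equivalent come from the slide; the function name, the handling of non-positive intensities, and the spot values are assumptions for illustration.

```python
import math

CUTOFF_LINEAR = 2.0
CUTOFF_LOG10 = math.log10(CUTOFF_LINEAR)  # about 0.30, as on the slide


def keep_spot(foreground, background, cutoff=CUTOFF_LINEAR):
    """True when foreground/background meets the cutoff.

    Assumption: spots with non-positive intensities are dropped.
    """
    if foreground <= 0 or background <= 0:
        return False
    return foreground / background >= cutoff


# Hypothetical spots: (name, foreground, background).
spots = [
    ("spot_a", 1200.0, 300.0),
    ("spot_b", 450.0, 400.0),
    ("spot_c", 800.0, 400.0),
]
kept = [name for name, fg, bg in spots if keep_spot(fg, bg)]
```

Only `spot_b`, whose signal is barely above background, is removed; those are the spots that produce the flare at low relative intensity.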

74
Graphing Tool: Filter selection
  • As intensity decreases, the log(ratio) tends to
    scatter
  • Spots with low intensities might seem falsely
    significant
  • A cut-off value of 250 (about 2^8) is suggested for
    Ch2 normalized net

75
Q-score Tool
  • Tool to use for filter and cut-off value
    selection
  • Currently usable for cDNA slides (uses UNIGENE
    clusterid), for human and mouse (will be extended
    to HEEBO/MEEBO arrays)
  • Still in experimental stage
  • Simple idea: pool reporters that belong to same
    gene, calculate their spread and combine values
    for each gene into score for whole array ->
    Q-score
  • Filtering that removes bad quality spots should
    decrease spread of measurements for genes, hence
    improve (decrease) Q-score
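
The slide does not give the exact formula, so the sketch below is only one plausible reading of the idea: spread is taken as the sample standard deviation of log-ratios within each gene, and the per-gene spreads are combined by a plain mean. The function name, cluster ids and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def q_score(reporters):
    """Mean per-gene standard deviation of log-ratios.

    reporters: iterable of (gene_id, log_ratio). Genes with fewer than
    two reporters carry no spread information and are skipped.
    Returns None when no gene has at least two reporters.
    """
    by_gene = defaultdict(list)
    for gene, log_ratio in reporters:
        by_gene[gene].append(log_ratio)
    spreads = [stdev(values) for values in by_gene.values()
               if len(values) >= 2]
    return mean(spreads) if spreads else None


# Hypothetical array: one reporter for Hs.1 is a bad-quality outlier.
all_spots = [
    ("Hs.1", 1.0), ("Hs.1", 1.2), ("Hs.1", 3.0),
    ("Hs.2", -0.5), ("Hs.2", -0.7),
]
filtered_spots = [spot for spot in all_spots if spot != ("Hs.1", 3.0)]

q_before = q_score(all_spots)
q_after = q_score(filtered_spots)
```

Removing the outlying reporter lowers the score, which is the behavior the slide describes for a filter that removes bad-quality spots.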

76
Q-score Tool
  • Works in batch for a group of slides (from same
    print) in a result set list
  • Requires a genelist that specifies which
    reporters to use. Common genelists for human and
    mouse clusterids
  • Filters and their ranges need to be defined.
  • Log-ratio mean/median is used to calculate
    Q-score
  • Run from the job-queue, email is sent to user
    with the link

77
Q-score Tool
  • Output is a set of graphs showing the fraction of
    reporters not filtered out and the corresponding
    Q-scores at increasingly stringent filter values
  • Cut-off values (if found) saved in a new result
    set list for data retrieval

78
SMD Office Hours
  • Grant S201
  • Mondays 3 - 5 pm
  • Wednesdays 2 - 4 pm

79
SMD Staff
Gavin Sherlock, Co-Investigator
Catherine Ball, Director
Patrick Brown, Co-Investigator
Farrell Wymore, Lead Programmer
Michael Nitzberg, Database Administrator
Catherine Beauheim, Scientific Programmer
Zac Zachariah, Systems Administrator
Janos Demeter, Computational Biologist
Heng Jin, Scientific Programmer
Takashi Kido, Visiting Scholar
Don Maier, Senior Software Engineer