Statistics Tools in GeneSpring - PowerPoint PPT Presentation

Title: PowerPoint Presentation Author: train Last modified by: jjin Created Date: 11/15/2001 2:46:32 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Statistics Tools in GeneSpring
  • The Center for Bioinformatics
  • UNC at Chapel Hill
  • Jianping Jin Ph.D.
  • Bioinformatics Scientist
  • Phone (919)843-6015
  • E-mail jjin_at_email.unc.edu
  • Fax (919)966-6821

2
What Can GeneSpring Do?
  • Works with both Affymetrix and two-color data.
  • Views data graphically (classification, graph,
    tree, scatter plot, Venn diagram).
  • Performs statistical analyses.
  • Annotates genes (updating from GenBank,
    LocusLink, UniGene, biochemical pathways).

3
What statistical analyses does GS do?
  • Clustering
  • k-means (non-hierarchical)
  • Self-organizing map
  • Gene trees (hierarchical dendrograms).
  • Principal component analysis
  • t-test analyses (p-values)
  • Like a known gene or average of genes
  • Like a pattern drawn with the mouse
  • Genes with high confidence
  • Genes with relative expression in certain ranges
  • Pathway analysis: finding genes that fit in a
    certain place in a pathway.
  • Sequence analysis: automatically finding
    regulatory sequences.
  • Automatic functional annotation of sub-trees in
    dendrograms.

4
Tree Clustering
  • Standard correlation
  • Smooth correlation
  • Change correlation
  • Upregulated correlation
  • Pearson correlation
  • Spearman correlation
  • Spearman confidence
  • Two-sided Spearman confidence
  • Distance

5
Notations to the Formulas
  • Result: the result of the calculation for genes
    A and B.
  • n: the number of samples being correlated
    over.
  • a: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of
    expression values for gene A.
  • b: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of
    expression values for gene B.
  • $a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
  • $|a| = \sqrt{a \cdot a}$
6
Standard Correlation
  • Equation: $\mathrm{Result} = \dfrac{a \cdot b}{|a|\,|b|}$
  • Also called Pearson correlation around zero.
  • Measures the angular separation of the expression
    vectors for genes A and B.
  • Answers the question: do the peaks match up?
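
The standard correlation above can be sketched in a few lines of Python. This is a minimal illustration of the formula only, not GeneSpring's own code; the function names `dot`, `norm` and `standard_correlation` are made up for this example.

```python
import math

def dot(a, b):
    """Return a.b = a_1*b_1 + a_2*b_2 + ... + a_n*b_n."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Return |a| = sqrt(a.a)."""
    return math.sqrt(dot(a, a))

def standard_correlation(a, b):
    """Return a.b / (|a| |b|), the cosine of the angle between a and b."""
    denom = norm(a) * norm(b)
    if denom == 0:
        raise ValueError("undefined for an all-zero expression vector")
    return dot(a, b) / denom

gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]  # same shape as gene_a, twice the magnitude
print(standard_correlation(gene_a, gene_b))  # very close to 1: the peaks match up
```

Because only the angle between the vectors matters, scaling one gene's values by a positive constant does not change the result.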

7
Pearson Correlation
  • Equation: $\mathrm{Result} = \dfrac{A \cdot B}{|A|\,|B|}$
  • Very similar to the standard correlation, except it
    measures the angle between the expression vectors
    for genes A and B around the mean of the expression
    vectors.
  • A: the vector made by subtracting the mean of all
    elements in vector a from each element in a.
  • Do the same for b to make a vector B.
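
As a rough sketch of the recipe above (center each vector on its mean, then take the standard correlation of the centered vectors), assuming plain Python lists of expression values. The helper names are illustrative only.

```python
import math

def center(v):
    """Subtract the mean of v from each element of v."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson_correlation(a, b):
    """Standard correlation of the mean-centered vectors A and B."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    A, B = center(a), center(b)
    dot_ab = sum(x * y for x, y in zip(A, B))
    norm_a = math.sqrt(sum(x * x for x in A))
    norm_b = math.sqrt(sum(y * y for y in B))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined for a constant expression vector")
    return dot_ab / (norm_a * norm_b)

# A constant offset between two genes does not affect the Pearson correlation.
print(pearson_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

The offset example is where this measure differs from the standard correlation, which would give a value below 1 for the same pair.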

8
Spearman Confidence
  • r: the value of the Spearman correlation.
  • SC = 1 - (probability you would get a value of r
    or higher by chance).
  • A measure of similarity, not a correlation.
  • A high Spearman correlation (a low p-value) gives
    a high SC value.
  • Takes account of the number of sub-experiments in
    your experiment set.

9
Two-sided Spearman Confidence
  • A measure of similarity, very similar to the
    Spearman confidence.
  • Two-sided test of whether the Spearman correlation
    is significantly greater than or less than zero.
  • Answers the question: what genes behave similarly
    or opposite to a specific gene?
  • Probably not good for k-means or tree clustering.
  • Result = 1 - (probability you would get a Spearman
    correlation of r or higher, or -r or lower,
    by chance).
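
One way to make "by chance" concrete is an exact permutation test: shuffle one gene's ranks in every possible order and count how often the correlation is at least as extreme as the observed r. The sketch below takes that approach for both the one-sided and two-sided confidence. It is an assumption for illustration, not necessarily how GeneSpring computes the probability, and it is only practical for a handful of samples because it enumerates n! orderings. The two-sided branch compares absolute values, which matches the slide's wording when r is positive.

```python
import itertools
import math

def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        raise ValueError("undefined for a constant vector")
    return sum(x * y for x, y in zip(da, db)) / denom

def spearman_confidence(a, b, two_sided=False):
    """1 - P(correlation at least as extreme as r) over all rank orderings."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)  # Spearman correlation = Pearson correlation of ranks
    eps = 1e-12          # tolerance for floating-point comparisons
    extreme = total = 0
    for perm in itertools.permutations(rb):
        rp = pearson(ra, perm)
        total += 1
        if two_sided:
            extreme += abs(rp) >= abs(r) - eps
        else:
            extreme += rp >= r - eps
    return 1 - extreme / total

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(spearman_confidence(a, b))                  # one-sided
print(spearman_confidence(a, b, two_sided=True))  # two-sided
```

With five samples there are 120 orderings, so even a perfect correlation cannot score above 1 - 1/120. This is one way to see how the measure takes the number of sub-experiments into account.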

10
Distance
  • A measurement of dissimilarity, not a correlation
    at all.
  • Euclidean distance between the expression profiles
    (the values for each point in N-dimensional space)
    of genes A and B.
  • $\mathrm{Distance} = \dfrac{|a - b|}{\sqrt{N}}$, where N is the
    number of experiment points.
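
A minimal sketch of this distance, assuming each expression profile is a plain list with one value per experiment point. The function name is illustrative.

```python
import math

def distance(a, b):
    """Euclidean distance between profiles a and b, divided by sqrt(N)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("profiles must be non-empty and of equal length")
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return euclidean / math.sqrt(n)

# Two profiles that differ by 3 at every one of 4 experiment points.
print(distance([0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]))
```

Dividing by the square root of N makes the result a root-mean-square difference per point, so distances stay comparable between experiments with different numbers of points.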

11
Special Case Correlations
  • Smooth correlation, Change correlation and
    Upregulated correlation.
  • All three are modified versions of the standard
    correlation.
  • They only make sense when the data are in a sequence,
    such as before/after, a time series, or a drug series.

12
Smooth Correlation
  • Make a new vector A from a by interpolating the
    average of each consecutive pair of elements of a.
  • Insert this new value between the old values.
  • Do this for each pair of elements that would be
    connected by a line in the graph screen.
  • Do the same to make a vector B from b.
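
The interpolation step can be sketched as follows. The sketch assumes every consecutive pair of points is connected by a line in the graph, which may not hold for every experiment layout, and the function names are made up for this example.

```python
import math

def smooth(a):
    """Insert the average of each consecutive pair between that pair."""
    if not a:
        return []
    out = [a[0]]
    for prev, curr in zip(a, a[1:]):
        out.append((prev + curr) / 2)  # interpolated midpoint
        out.append(curr)
    return out

def standard_correlation(a, b):
    """a.b / (|a| |b|)."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    return dot_ab / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def smooth_correlation(a, b):
    """Standard correlation of the smoothed vectors A and B."""
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```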

13
Change Correlation
  • The opposite of what the smooth correlation looks
    for: it considers only the change in expression
    level between adjacent points.
  • Similar to the standard correlation, but uses an arc
    tangent transformation of the ratio between adjacent
    pairs of points to create the expression vector.
    This is less sensitive to outliers than using the
    ratio directly.
  • The value created between two values $a_i$ and
    $a_{i+1}$ is $\arctan(a_{i+1}/a_i) - \pi/4$.
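
A small sketch of the change transformation, arctan(next / current) - pi/4 for each adjacent pair, followed by the standard correlation of the transformed vectors. It assumes all expression values are positive (so the ratio is defined), and the function names are illustrative.

```python
import math

def change_vector(a):
    """Return arctan(a[i+1] / a[i]) - pi/4 for each adjacent pair in a."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(a, a[1:])]

def standard_correlation(a, b):
    """a.b / (|a| |b|)."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    return dot_ab / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def change_correlation(a, b):
    """Standard correlation of the change vectors of a and b."""
    return standard_correlation(change_vector(a), change_vector(b))

# Both genes double at every step, so their change vectors are identical.
print(change_correlation([1.0, 2.0, 4.0], [10.0, 20.0, 40.0]))
```

Subtracting pi/4 centers the transformation so that no change (a ratio of 1) maps to zero, increases map to positive values and decreases to negative values. Because the arc tangent is bounded, a very large ratio cannot dominate the vector.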

14
Upregulated Correlation
  • Very similar to the change correlation, but it only
    considers positive changes. All negative values
    for the arc tangent are set to zero.
  • Make a new vector A from a by looking at the
    change between each pair of elements of a.
  • The value created between two values $a_i$ and
    $a_{i+1}$ is $\max(\arctan(a_{i+1}/a_i) - \pi/4,\ 0)$.
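
The upregulated transformation differs from the change transformation only in clamping negative values to zero. A minimal sketch, again assuming positive expression values and using an illustrative function name:

```python
import math

def upregulated_vector(a):
    """Return max(arctan(a[i+1] / a[i]) - pi/4, 0) for each adjacent pair."""
    return [
        max(math.atan(nxt / cur) - math.pi / 4, 0.0)
        for cur, nxt in zip(a, a[1:])
    ]

# The drop from 2 to 1 is clamped to zero; the rise from 1 to 3 is kept.
print(upregulated_vector([2.0, 1.0, 3.0]))
```

The upregulated correlation would then be the standard correlation of two such vectors, so genes are compared only on where and how strongly they increase.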

15
Algorithm to Build Gene Tree
  • Step 1: Determine if there is only one gene or
    subtree left. If yes, go to step five.
  • Step 2: Find the two closest genes/subtrees.
  • Step 3: Merge these two into one subtree.
  • Step 4: Return to step one.
  • Step 5: Merge together branches where the distance
    between sub-branches is less than the separation
    ratio, considering only genes that are less
    than the minimum distance apart.
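
The first four steps amount to agglomerative (bottom-up) hierarchical clustering. The sketch below illustrates that loop only. It assumes the distance measure from the Distance slide and average linkage between subtrees, neither of which the slides specify for tree building, and it leaves out the final branch-merging step because the slides do not give enough detail to reproduce it.

```python
import math

def distance(a, b):
    """Euclidean distance divided by sqrt(N)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def build_tree(profiles):
    """profiles: dict of gene name -> expression vector.
    Returns the tree as nested tuples of gene names."""
    if not profiles:
        raise ValueError("need at least one gene")
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def cluster_distance(c1, c2):
        # Average linkage (an assumption): mean distance over all gene pairs.
        pairs = [(g1, g2) for g1 in c1[1] for g2 in c2[1]]
        return sum(distance(profiles[g1], profiles[g2]) for g1, g2 in pairs) / len(pairs)

    while len(clusters) > 1:  # step 1: stop when one subtree is left
        # Step 2: find the two closest genes/subtrees.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Step 3: merge them into one subtree.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # Step 4: loop back to step 1.
    return clusters[0][0]

profiles = {"g1": [1.0, 2.0, 3.0], "g2": [1.1, 2.1, 3.1], "g3": [9.0, 8.0, 7.0]}
print(build_tree(profiles))  # ('g3', ('g1', 'g2'))
```

Recomputing every pairwise distance on each pass keeps the sketch short, but it is slow for large gene lists.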

16
Algorithm to Build Tree
  • Minimum distance: how far down the tree discrete
    branches are depicted. A higher number puts more
    genes in a group, so groups are less specific.
  • Separation ratio: the correlation difference between
    groups of clustered genes. Between 0 and 1.
    Increasing the separation ratio increases the
    branchiness of the tree.

17
Principal Components Analysis
  • Not a clustering method.
  • PCA finds the most abundant building blocks: a set
    of expression patterns.
  • The 1st PC is obtained by finding the linear
    combination of expression patterns that accounts
    for the most variability in the data, and so on
    for the following components.
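
To illustrate what the first principal component is, the sketch below finds the direction of greatest variance with power iteration on the covariance matrix. The slides do not say which numerical method GeneSpring uses, so this is an assumption for illustration. Rows are taken to be genes and columns to be conditions, and the function name is made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: list of expression profiles (one list per gene).
    Returns a unit vector along the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two profiles")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between conditions.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Arbitrary start vector; power iteration fails only if it happens to be
    # exactly orthogonal to the leading component.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        length = math.sqrt(sum(x * x for x in w))
        if length == 0:
            break  # no variance along v; give up and return the current vector
        v = [x / length for x in w]
    return v

rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
print(first_principal_component(rows))  # both entries close to 0.7071
```

Later components could be found the same way after removing the variance explained by the earlier ones, which is what "and so on" refers to.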

18
k-Means Clustering
  • Divides genes into a user-defined number (k) of
    equal-sized groups, based on their expression
    patterns.
  • Creates centroids at the average location of each
    group of genes.
  • With each iteration, genes are reassigned to the
    group with the closest centroid.
  • After all of the genes have been reassigned, the
    location of the centroids is recalculated.
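
The loop described above can be sketched as follows. The sketch assumes that the equal-sized groups are only the starting partition (taken here from list order) and that closeness is squared Euclidean distance. Both are assumptions for illustration, and the function names are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(profiles, k, iterations=100):
    """profiles: list of expression vectors. Returns a group index per gene."""
    n = len(profiles)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of genes")
    # Start from k roughly equal-sized groups, based on list order.
    assignment = [i * k // n for i in range(n)]
    for _ in range(iterations):
        # Centroid = average location of each group's genes.
        centroids = []
        for g in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == g]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append(None)  # empty group: cannot attract genes
        # Reassign every gene to the group with the closest centroid.
        new_assignment = [
            min(
                (g for g in range(k) if centroids[g] is not None),
                key=lambda g: sq_dist(p, centroids[g]),
            )
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # converged: no gene changed group
        assignment = new_assignment
    return assignment

profiles = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2], [0.9, 1.1], [9.1, 8.9]]
print(k_means(profiles, 2))  # [0, 1, 0, 1, 0, 1]
```

In the example the starting partition mixes low and high genes, and one reassignment pass is enough to separate them.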

19
Self-Organizing Maps
  • Similar to k-means clustering.
  • Shows the relationship between groups in a 2-D map.
  • Best represents the variability of the data,
    while still maintaining similarity between adjacent
    nodes, e.g. point (1,2) is one unit away from (1,3).

20
What does t-test mean in GS?
  • Replicates: one-sample Student's t-test.
  • Comparisons for 2 groups: Student's two-sample
    t-test.
  • Comparisons for multiple groups: one-way analysis
    of variance (ANOVA).
  • Filtering genes: based on a one-sample t-test of
    the mean expression level across replicates vs. a
    reference value (Expression Percentage
    Restriction).
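
For reference, the one-sample and two-sample Student's t statistics can be computed with Python's standard library as sketched below. This covers only the test statistics; turning them into p-values needs the t distribution with n - 1 and n_x + n_y - 2 degrees of freedom respectively, which is not shown. The function names are illustrative, and the two-sample version assumes equal variances (the classic pooled form).

```python
import math
import statistics

def one_sample_t(values, reference):
    """t statistic for the mean of replicate values vs. a reference value."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - reference) / standard_error

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

replicates = [2.0, 4.0, 6.0]
print(one_sample_t(replicates, 4.0))                    # mean equals the reference
print(two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # clearly separated groups
```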

21
Filter Genes Analysis Tools
  • Global Error Model: filters out genes with large
    standard deviations or error values.
  • Raw data filtering: gets rid of genes too close
    to the background.
  • Sample-to-sample comparison: fold comparison among
    different samples.
  • Statistical group comparison: filters out genes that
    do not vary significantly across different groups.
  • Data File Restriction: based on other fields (P/S
    call, +/- pairs).

22
Statistical Group Comparison
  • Finds genes with a statistically significant
    difference in mean expression levels across all
    groups.
  • For two groups: Student's two-sample t-test.
  • For multiple groups: ANOVA.
  • Non-parametric comparison: for each gene, the rank
    order is used for analysis. Wilcoxon two-sample
    test (Mann-Whitney U test) for two groups, the
    Kruskal-Wallis test for multiple groups.
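
To show what "the rank order is used" means in practice, here is a minimal sketch of the Mann-Whitney U statistic for one gene measured in two groups. It uses the direct pair-counting definition and stops at the statistic (no p-value). The function name is illustrative.

```python
def mann_whitney_u(x, y):
    """U statistic for group x: the number of (x_i, y_j) pairs with
    x_i > y_j, counting ties as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

group_1 = [1.0, 2.0, 3.0]
group_2 = [4.0, 5.0, 6.0]
print(mann_whitney_u(group_1, group_2))  # 0.0: no value in group_1 exceeds group_2
print(mann_whitney_u(group_2, group_1))  # 9.0: every one of the 3 x 3 pairs
```

Only the ordering of the values matters, so replacing 6.0 with 600.0 would leave both results unchanged. That is what makes the test robust to outliers.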

23
Data Normalization
  • In two-color experiments, normalize vs. the
    control channel (green) for each gene.
  • Normalize each sample to itself or to a positive
    control. This makes different samples comparable
    to one another.
  • Normalizing each gene to itself removes the
    differing intensity scales from multiple experiment
    readings (highly recommended if not using a
    two-color experiment).
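
Two of these normalizations can be sketched very simply. The slides do not state the exact arithmetic, so the choices here are assumptions for illustration: the two-color normalization is taken to be the signal-to-control ratio for each gene, and "normalizing a gene to itself" is taken to mean dividing its measurements by their median.

```python
import statistics

def normalize_to_control(signal, control):
    """Two-color data: ratio of signal channel to control channel per gene."""
    return [s / c for s, c in zip(signal, control)]

def normalize_gene_to_itself(values):
    """Divide one gene's measurements across samples by their median
    (using the median is an assumption, not stated in the slides)."""
    median = statistics.median(values)
    if median == 0:
        raise ValueError("cannot normalize a gene whose median is zero")
    return [v / median for v in values]

print(normalize_to_control([10.0, 30.0], [5.0, 10.0]))   # [2.0, 3.0]
print(normalize_gene_to_itself([100.0, 200.0, 400.0]))   # [0.5, 1.0, 2.0]
```

After per-gene normalization a strongly expressed gene and a weakly expressed gene with the same relative pattern produce the same values, which is what lets them cluster together.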

24
NCI-60 cell lines

25
DrugActivity_AT