Transcript and Presenter's Notes

Title: Canadian Bioinformatics Workshops


1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Module 5
  • David Wishart
  • Informatics and Statistics for Metabolomics
  • June 16-17, 2011

4
Distributions & Significance
5
Univariate Statistics
6
Univariate Statistics
  • Univariate means a single variable
  • If you measure a population using some single
    measure such as height, weight, test score, IQ,
    you are measuring a single variable
  • If you plot that single variable over the whole
    population, measuring the frequency with which a
    given value occurs, you will get the following

7
A Bell Curve
[Figure: frequency (# of each) vs. height]
Also called a Gaussian or Normal Distribution
8
Features of a Normal Distribution
μ = mean
  • Symmetric Distribution
  • Has an average or mean value (μ) at the centre
  • Has a characteristic width called the standard
    deviation (σ)
  • Most common type of distribution known

9
Normal Distribution
  • Almost any set of biological or physical
    measurements will display some variation and
    these will almost always follow a Normal
    distribution
  • The larger the set of measurements, the more
    normal the curve
  • Minimum set of measurements to get a normal
    distribution is 30-40

10
Gaussian Distribution
11
Some Equations
Mean:                 μ = Σxᵢ / N
Variance:             σ² = Σ(xᵢ − μ)² / N
Standard deviation:   σ = √[ Σ(xᵢ − μ)² / N ]
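As an illustration, these three formulas map directly onto a few lines of NumPy; the height values below are hypothetical, and the population forms (dividing by N) are used to match the slide.

```python
import numpy as np

heights = np.array([165.0, 170.5, 172.0, 168.3, 175.1, 169.9])  # hypothetical sample

mu = heights.sum() / heights.size                  # mean: sum(x_i) / N
var = ((heights - mu) ** 2).sum() / heights.size   # variance: sum((x_i - mu)^2) / N
sigma = np.sqrt(var)                               # standard deviation

print(mu, var, sigma)
# np.mean(heights), np.var(heights), np.std(heights) give the same results
```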
12
Standard Deviations (Z-values)
13
Significance
  • Based on the Normal Distribution, the probability
    that something is >1 SD away (larger or smaller)
    from the mean is ~32%
  • Based on the Normal Distribution, the probability
    that something is >2 SD away (larger or smaller)
    from the mean is ~5%
  • Based on the Normal Distribution, the probability
    that something is >3 SD away (larger or smaller)
    from the mean is ~0.3%
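A minimal sketch (assuming SciPy is available) of where the ~32%, ~5% and ~0.3% figures come from, as two-sided tail areas of the standard normal distribution:

```python
from scipy.stats import norm

# Two-sided probability of being more than k standard deviations from the mean
for k in (1, 2, 3):
    p = 2 * norm.sf(k)          # sf(k) = P(Z > k) for a standard normal
    print(f">{k} SD: {p:.3f}")  # ~0.317, ~0.046, ~0.003 (i.e. ~32%, ~5%, ~0.3%)
```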

14
Significance
  • In a test with a class of 400 students, if you
    score the average you typically receive a C
  • In a test with a class of 400 students, if you
    score 1 SD above the average you typically
    receive a B
  • In a test with a class of 400 students, if you
    score 2 SD above the average you typically
    receive an A

15
The P-value
  • The p-value is the probability of obtaining a
    test statistic (a score, a set of events, a
    height) at least as extreme as the one that was
    actually observed
  • One "rejects the null hypothesis" when the
    p-value is less than the significance level a
    which is often 0.05 or 0.01
  • When the null hypothesis is rejected, the result
    is said to be statistically significant

16
P-value
  • If the average height of an adult (M/F) human is
    5′7″ and the standard deviation is 5″, what is
    the probability of finding someone who is more
    than 6′10″ tall?
  • If you choose an α of 0.05, is a 6′11″
    individual a member of the human species?
  • If you choose an α of 0.01, is a 6′11″ individual
    a member of the human species?
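A hedged sketch of the height example, assuming the mean is 5′7″ = 67 inches, the SD is 5 inches, and a one-sided tail is appropriate:

```python
from scipy.stats import norm

mean_in, sd_in = 67, 5           # 5'7" mean and 5" standard deviation, in inches
for height_in in (82, 83):       # 6'10" and 6'11"
    z = (height_in - mean_in) / sd_in
    p = norm.sf(z)               # one-sided probability of being at least this tall
    print(height_in, z, round(p, 5))
# z = 3.0 gives p ~ 0.0013 and z = 3.2 gives p ~ 0.0007,
# both below an alpha of 0.05 and of 0.01
```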

17
P-value
  • If you flip a coin 20 times and the coin turns up
    heads 14/20 times, the probability that this would
    occur is about 60,000/1,048,576 ≈ 0.058
  • If you choose an α of 0.05, is this coin a fair
    coin?
  • If you choose an α of 0.10, is this coin a fair
    coin?
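This probability is the upper tail of a Binomial(20, 0.5) distribution; a short check with SciPy (an illustration, not part of the original slides):

```python
from scipy.stats import binom

# One-sided probability of 14 or more heads in 20 flips of a fair coin
p = binom.sf(13, 20, 0.5)   # sf(13) = P(X >= 14) for X ~ Binomial(n=20, p=0.5)
print(round(p, 4))          # ~0.0577, matching the slide's ~0.058
```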

18
Mean, Median & Mode
[Figure: skewed distribution with the mode, median, and mean marked]
19
Mean, Median, Mode
  • In a Normal Distribution the mean, mode and
    median are all equal
  • In skewed distributions they are unequal
  • Mean - average value, affected by extreme values
    in the distribution
  • Median - the middlemost value, usually half way
    between the mode and the mean
  • Mode - most common value

20
Different Distributions
Unimodal vs. Bimodal
21
Other Distributions
  • Binomial Distribution
  • Poisson Distribution
  • Extreme Value Distribution
  • Skewed or Exponential Distribution

22
Binomial Distribution
Pascal's triangle (binomial coefficients):
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1

P(x) is given by the terms of the expansion (p + q)ⁿ
23
Poisson Distribution
24
Extreme Value Distribution
  • Arises from sampling the extreme end of a normal
    distribution
  • A distribution which is skewed due to its
    selective sampling
  • Skew can be either right or left

[Figure: Gaussian distribution for comparison]
25
Skewed Distribution
  • Resembles an exponential or Poisson-like
    distribution
  • Lots of extreme values far from mean or mode
  • Hard to do useful statistical tests with this
    type of distribution

[Figure: skewed distribution with outliers labelled]
26
Fixing a Skewed Distribution
  • A skewed distribution or exponentially decaying
    distribution can be transformed into a normal
    or Gaussian distribution by applying a log
    transformation
  • This brings the outliers a little closer to the
    mean because it rescales the x-variable; it also
    makes the distribution much more Gaussian
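A small sketch of the log transformation on made-up, right-skewed concentration data (the log-normal sample here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
conc = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # hypothetical skewed concentrations

log_conc = np.log10(conc)     # the log transform pulls in the long right tail

print(conc.mean(), np.median(conc))          # mean > median: skewed
print(log_conc.mean(), np.median(log_conc))  # mean ~ median: much more Gaussian
```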

27
Log Transformation
Skewed distribution
Normal distribution
28
Log Transformation on Real Data
29
Distinguishing 2 Populations
Normals
Leprechauns
30
The Result
[Figure: two overlapping height distributions (# of each vs. height)]
Are they different?
31
What about these 2 Populations?
32
The Result
[Figure: two overlapping height distributions (# of each vs. height)]
Are they different?
33
Student's t-Test
  • Also called the t-Test
  • Used to determine if 2 populations are different
  • Formally allows you to calculate the probability
    that 2 sample means are the same
  • If the t-Test statistic gives you p = 0.4, and
    α is 0.05, then the 2 populations are the
    same
  • If the t-Test statistic gives you p = 0.04, and
    α is 0.05, then the 2 populations are
    different
  • Paired and unpaired t-Tests are available; the paired
    version is used for before-and-after experiments, while
    the unpaired version is for 2 randomly chosen samples
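A minimal sketch of both flavours of the t-test using SciPy; the two height samples are simulated stand-ins for the "Normals" and "Leprechauns" populations:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
normals = rng.normal(loc=170, scale=7, size=40)      # hypothetical heights, group 1
leprechauns = rng.normal(loc=100, scale=7, size=40)  # hypothetical heights, group 2

# Unpaired t-test: two independently sampled groups
t, p = ttest_ind(normals, leprechauns)
print(t, p)   # p < 0.05 here, so the two means are significantly different

# For before-and-after measurements on the same subjects, the paired version
# would be scipy.stats.ttest_rel(before, after)
```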

34
Student's t-Test
  • A t-Test can also be used to determine whether 2
    clusters are different if the clusters follow a
    normal distribution

[Figure: two clusters plotted against Variable 1 and Variable 2]
35
Distinguishing 3 Populations
Normals
Leprechauns
Elves
36
The Result
[Figure: three height distributions (# of each vs. height)]
Are they different?
37
Distinguishing 3 Populations
38
The Result
[Figure: three overlapping height distributions (# of each vs. height)]
Are they different?
39
ANOVA
  • Also called Analysis of Variance
  • Used to determine if 3 or more populations are
    different; it is a generalization of the t-Test
  • Formally, ANOVA provides a statistical test (by
    looking at group variance) of whether or not the
    means of several groups are all equal
  • Uses an F-statistic (the F-test) to test for significance
  • There are 1-way, 2-way, 3-way and n-way ANOVAs; the most
    common is the 1-way ANOVA, which is only concerned with
    whether any of the 3 populations are different, not
    which pair is different
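A one-way ANOVA sketch with SciPy's f_oneway; the three simulated height samples are hypothetical stand-ins for the three populations:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
normals = rng.normal(170, 7, 30)
leprechauns = rng.normal(100, 7, 30)
elves = rng.normal(190, 7, 30)

# One-way ANOVA: tests whether the group means are all equal
F, p = f_oneway(normals, leprechauns, elves)
print(F, p)   # p < 0.05 -> at least one mean differs (but not which pair)
```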

40
ANOVA
  • ANOVA can also be used to determine whether 3
    clusters are different if the clusters follow a
    normal distribution

[Figure: three clusters plotted against Variable 1 and Variable 2]
41
Normalization
42
Normalization
  • What if we measured the top population using a
    ruler that was miscalibrated or biased (inches
    were short by 10%)? We would get the following
    result

[Figure: shifted height distribution (# of each vs. height)]
43
Normalization
  • Normalization adjusts for systematic bias in the
    measurement tool
  • After normalization we would get

[Figure: corrected height distribution (# of each vs. height)]
44
Data Comparisons & Dependencies
45
Data Comparisons
  • In many kinds of experiments we want to know what
    happened to a population before and after
    some treatment or intervention
  • In other situations we want to measure the
    dependency of one variable against another
  • In still others we want to assess how the
    observed property matches the predicted property
  • In all cases we will measure multiple samples or
    work with a population of subjects
  • The best way to view this kind of data is through
    a scatter plot

46
A Scatter Plot
47
Scatter Plots
  • If there is some dependency between the two
    variables, or if there is a relationship between
    the predicted and observed variable, or if the
    before and after treatments led to some
    effect, then it is possible to see some clear
    patterns in the scatter plot
  • This pattern or relationship is called correlation

48
Correlation
[Figure: positive correlation, uncorrelated, and negative correlation]
49
Correlation
High correlation
Low correlation
Perfect correlation
50
Correlation Coefficient
[Figure: scatter plots with r = 0.85, r = 0.4, and r = 1.0]
51
Correlation Coefficient
  • Sometimes called coefficient of linear
    correlation or Pearson product-moment correlation
    coefficient
  • A quantitative way of determining what model (or
    equation or type of line) best fits a set of data
  • Commonly used to assess most kinds of
    predictions, simulations, comparisons or
    dependencies
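A small sketch of computing r with SciPy on made-up, linearly related data; note that pearsonr also returns the p-value corresponding to the t-test on the regression slope described on the next slide:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)   # hypothetical dependent variable

r, p = pearsonr(x, y)     # Pearson product-moment correlation and its p-value
print(round(r, 2), p)     # the p-value tests whether the correlation differs from 0
```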

52
Student's t-Test (Again)
  • The t-Test can also be used to assess the
    statistical significance of a correlation
  • It specifically determines whether the slope of
    the regression line is statistically different
    from 0

53
Correlation and Outliers
Experimental error or something important?
A single bad point can destroy a good
correlation
54
Outliers
  • Can be both good and bad
  • When modeling data you don't like to see
    outliers (suggests the model is bad)
  • Often a good indicator of experimental or
    measurement errors -- only you can know!
  • When plotting metabolite concentration data you
    do like to see outliers
  • A good indicator of something significant

55
Detecting Clusters
[Figure: height vs. weight scatter plot]
56
Is it Right to Calculate a Correlation
Coefficient?
[Figure: height vs. weight scatter plot, r = 0.73]
57
Or is There More to This?
[Figure: height vs. weight scatter plot with separate male and female clusters]
58
Clustering Applications in Bioinformatics
  • Metabolomics and Cheminformatics
  • Microarray or GeneChip Analysis
  • 2D Gel or ProteinChip Analysis
  • Protein Interaction Analysis
  • Phylogenetic and Evolutionary Analysis
  • Structural Classification of Proteins
  • Protein Sequence Families

59
Clustering
  • Definition - a process by which objects that are
    logically similar in characteristics are grouped
    together.
  • Clustering is different from classification
  • In classification the objects are assigned to
    pre-defined classes; in clustering the classes
    are yet to be defined
  • Clustering helps in classification

60
Clustering Requires...
  • A method to measure similarity (a similarity
    matrix) or dissimilarity (a dissimilarity
    coefficient) between objects
  • A threshold value with which to decide whether an
    object belongs with a cluster
  • A way of measuring the distance between two
    clusters
  • A cluster seed (an object to begin the clustering
    process)

61
Clustering Algorithms
  • K-means or Partitioning Methods - divides a set
    of N objects into M clusters -- with or without
    overlap
  • Hierarchical Methods - produces a set of nested
    clusters in which each pair of objects is
    progressively nested into a larger cluster until
    only one cluster remains
  • Self-Organizing Feature Maps - produces a cluster
    set through iterative training

62
K-means or Partitioning Methods
  • Make the first object the centroid for the first
    cluster
  • For the next object calculate the similarity to
    each existing centroid
  • If the similarity is greater than a threshold add
    the object to the existing cluster and
    redetermine the centroid, else use the object to
    start new cluster
  • Return to step 2 and repeat until done
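A sketch of the threshold-based partitioning procedure described above (note this is the simple "leader"-style procedure from the slide, not the classical iterative K-means algorithm); the points and the threshold are hypothetical:

```python
import numpy as np

def partition_cluster(points, threshold):
    """Assign each point to the nearest existing centroid if it is within
    `threshold`; otherwise start a new cluster with that point."""
    clusters = [[points[0]]]                      # first object seeds the first cluster
    centroids = [np.asarray(points[0], float)]
    for p in points[1:]:
        d = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(d))
        if d[i] <= threshold:                     # similar enough: join the cluster
            clusters[i].append(p)
            centroids[i] = np.mean(clusters[i], axis=0)   # re-determine the centroid
        else:                                     # too far: seed a new cluster
            clusters.append([p])
            centroids.append(np.asarray(p, float))
    return clusters, centroids

pts = np.array([[0, 0], [0.5, 0.2], [5, 5], [5.3, 4.8], [0.1, 0.4]])
clusters, cents = partition_cluster(pts, threshold=2.0)
print(len(clusters), cents)   # two clusters are found for these points
```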

63
K-means or Partitioning Methods
[Figure: partitioning example -- initial cluster, choose object 1, choose object 2, test against centroid, join, recompute centroid]
64
Hierarchical Clustering
  • Find the two closest objects and merge them into
    a cluster
  • Find and merge the next two closest objects (or
    an object and a cluster, or two clusters) using
    some similarity measure and a predefined
    threshold
  • If more than one cluster remains return to step 2
    until finished
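A sketch of agglomerative (hierarchical) clustering with SciPy; the 10 x 5 data matrix of "metabolite profiles" is simulated purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Hypothetical metabolite profiles: 10 samples x 5 metabolites, in two groups
data = np.vstack([rng.normal(0, 1, (5, 5)), rng.normal(4, 1, (5, 5))])

# Repeatedly merge the two closest objects/clusters (average linkage)
Z = linkage(data, method="average", metric="euclidean")

# Cut the resulting tree so that two flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # the first 5 and last 5 samples fall into separate clusters
```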

65
Hierarchical Clustering
[Figure: hierarchical clustering example -- initial clusters, pairwise comparison, select closest pair, select next closest pair]
66
Hierarchical Clustering
[Figure: dendrogram linking metabolite expression curves A-F]
Find 2 most similar metabolite expression
levels or curves
Find the next closest pair of levels or curves
Iterate
Heat map
67
Multivariate Statistics
68
Multivariate Statistics
  • Multivariate means multiple variables
  • If you measure a population using multiple
    measures at the same time such as height, weight,
    hair colour, clothing colour, eye colour, etc.
    you are performing multivariate statistics
  • Multivariate statistics requires more complex,
    multidimensional analyses or dimensional
    reduction methods

69
A Typical Metabolomics Experiment
70
A Metabolomics Experiment
  • Metabolomics experiments typically measure many
    metabolites at once, in other words the
    instruments are measuring multiple variables and
    so metabolomic data are inherently multivariate
    data
  • Metabolomics requires multivariate statistics

71
Multivariate Statistics: The Trick
  • The key trick in multivariate statistics is to
    find a way that effectively reduces the
    multivariate data into univariate data
  • Once done, then you can apply the same univariate
    concepts such as p-values, t-Tests and ANOVA
    tests to the data
  • The trick is dimensional reduction

72
Dimension Reduction: PCA
  • PCA = Principal Component Analysis
  • Process that transforms a number of possibly
    correlated variables into a smaller number of
    uncorrelated variables called principal
    components
  • Reduces 1000s of variables to 2-3 key features

Scores plot
73
Principal Component Analysis
Hundreds of peaks → 2 components
Scores plot
PCA captures what should be visually detectable
If you can't see it, PCA probably won't help
74
Visualizing PCA
  • PCA of a bagel
  • One projection produces a wiener
  • Another projection produces an O
  • The O projection captures most of the variation
    and has the largest eigenvector (PC1)
  • The wiener projection is PC2 and gives depth info

75
PCA - The Details
  • PCA involves the calculation of the eigenvalue
    (singular value) decomposition of a data
    covariance matrix
  • PCA is an orthogonal linear transformation
  • PCA transforms data to a new coordinate system so
    that the greatest variance of the data comes to
    lie on the first coordinate (1st PC), the second
    greatest variance on the 2nd PC etc.

[Diagram: data matrix with samples s1...sk as rows and variables x1...xn as
 columns, decomposed into scores t1...tm and loadings p1...pm]
Scores (t): eigenvectors -- uncorrelated, orthogonal
Loadings (p): weights of the original variables
scores = loadings × data:  t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
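A minimal PCA sketch with scikit-learn on a simulated matrix of samples x metabolite intensities (the data and the group offset are hypothetical); it produces the scores and loadings referred to above and on the next slides:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Hypothetical data: 30 samples (rows) x 100 metabolite intensities (columns),
# with the second group of samples slightly offset from the first
X = np.vstack([rng.normal(0.0, 1.0, (15, 100)),
               rng.normal(1.0, 1.0, (15, 100))])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)         # scores: each sample's coordinates on PC1/PC2
loadings = pca.components_.T          # loadings: each variable's weight on each PC

print(scores.shape, loadings.shape)   # (30, 2) and (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each PC
```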
76
Visualizing PCA
  • Airport data from USA
  • 5000 samples
  • X1 - latitude
  • X2 - longitude
  • X3 - altitude
  • What should you expect?

Data from Roy Goodacre (U of Manchester)
77
Visualizing PCA
PCA is equivalent to K-means clustering
78
K-means Clustering
[Figure: K-means partitioning example -- initial cluster, choose object 1, choose object 2, test against centroid, join, recompute centroid]
79
PCA Clusters
  • Once dimensional reduction has been achieved you
    obtain clusters of data that are mostly normally
    distributed with means and variances (in PCA
    space)
  • It is possible to use t-Tests and ANOVA tests to
    determine if these clusters or their means are
    significantly different or not

80
PCA and ANOVA
  • ANOVA can also be used to determine whether 3
    clusters are different if the clusters follow a
    normal distribution

[Figure: three clusters plotted on PC 1 vs. PC 2]
81
PCA Plot Nomenclature
  • PCA generates 2 kinds of plots: the scores plot
    and the loadings plot
  • The scores plot (on the right) plots the data using the
    main principal components

82
PCA Loadings Plot
  • Loadings plot shows how much each of the
    variables (metabolites) contributed to the
    different principal components
  • Variables at the extreme corners contribute most
    to the scores plot separation

83
PCA Details/Advice
  • In some cases PCA will not succeed in identifying
    any clear clusters or obvious groupings no matter
    how many components are used. If this is the
    case, it is wise to accept the result and assume
    that the presumptive classes or groups cannot be
    distinguished
  • As a general rule, if a PCA analysis fails to
    achieve even a modest separation of classes, then
    it is probably not worthwhile using other
    statistical techniques to try to separate them

84
PCA Q2 and R2
  • The performance of a PCA model can be
    quantitatively evaluated in terms of an R2 and/or
    a Q2 value
  • R2 is the correlation index and refers to the
    goodness of fit or the explained variation (range
    0-1)
  • Q2 refers to the predicted variation or quality
    of prediction (range 0-1)
  • Typically Q2 and R2 track very closely together

85
PCA R2
  • R2 is a quantitative measure (with a maximum
    value of 1) that indicates how well the PCA model
    is able to mathematically reproduce the data in
    the data set
  • A poorly fit model will have an R2 of 0.2 or 0.3,
    while a well-fit model will have an R2 of 0.7 or
    0.8.

86
PCA Q2
  • To guard against over-fitting, the value Q2 is
    commonly determined. Q2 is usually estimated by
    cross validation or permutation testing to assess
    the predictive ability of the model relative to
    the number of principal components used in the
    model
  • Generally a Q2 > 0.5 is considered good, while a
    Q2 of 0.9 is outstanding

87
PCA vs. PLS-DA
  • Partial Least Squares Discriminant Analysis
  • PLS-DA is a supervised classification technique
    while PCA is an unsupervised clustering technique
  • PLS-DA uses labeled data while PCA uses no
    prior knowledge
  • PLS-DA enhances the separation between groups of
    observations by rotating PCA components such that
    a maximum separation among classes is obtained
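scikit-learn has no dedicated PLS-DA class; a common approach (assumed in this sketch) is to run PLS regression against dummy-coded class labels and inspect the resulting scores:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
# Hypothetical labelled data: two classes of 20 samples x 50 variables
X = np.vstack([rng.normal(0.0, 1.0, (20, 50)), rng.normal(0.5, 1.0, (20, 50))])
y = np.array([0] * 20 + [1] * 20)      # class labels = the prior knowledge

plsda = PLSRegression(n_components=2)
plsda.fit(X, y)                        # PLS regression against the 0/1 labels
scores = plsda.transform(X)            # analogous to a PCA scores plot, but rotated
                                       # to maximize separation between the classes
print(scores.shape)                    # (40, 2)
```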

88
Other Supervised Classification Methods
  • SIMCA = Soft Independent Modeling of Class
    Analogy
  • OPLS = Orthogonal Projections to Latent Structures
    (orthogonal PLS)
  • Support Vector Machines
  • Random Forest
  • Naïve Bayes Classifiers
  • Neural Networks

89
Breaching the Data Barrier
  • Unsupervised Methods: PCA, K-means clustering, Factor Analysis
  • Supervised Methods: PLS-DA, LDA, PLS-Regression
  • Machine Learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
90
Data Analysis Progression
  • Unsupervised Methods
  • PCA or cluster to see if natural clusters form or
    if data separates well
  • Data is unlabeled (no prior knowledge)
  • Supervised Methods/Machine Learning
  • Data is labeled (prior knowledge)
  • Used to see if data can be classified
  • Helps separate less obvious clusters or features
  • Statistical Significance
  • Supervised methods always generate clusters --
    this can be very misleading
  • Check if clusters are real by label permutation
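A hedged sketch of the label-permutation check: fit a supervised model (PLS regression against 0/1 labels here) with cross-validation, then repeat with shuffled labels; if the real score is no better than the permuted scores, the apparent separation is not significant. All data are simulated, and sklearn.model_selection.permutation_test_score packages the same idea.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 50))             # hypothetical data with no real signal
y = np.array([0] * 20 + [1] * 20)         # class labels

model = PLSRegression(n_components=2)
real_score = cross_val_score(model, X, y, cv=5).mean()

# Refit many times with randomly permuted labels
perm_scores = [cross_val_score(model, X, rng.permutation(y), cv=5).mean()
               for _ in range(100)]

# Fraction of permuted fits that do as well as the real fit (empirical p-value)
p_value = np.mean(np.array(perm_scores) >= real_score)
print(real_score, p_value)   # for this random data the "separation" is not significant
```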

91
Testing Significance
[Diagram: labelled data analyzed by PCA and PLS-DA/SVM, compared against permuted data analyzed by PLS-DA/SVM]
92
Note of Caution
  • Supervised classification methods are powerful
  • They learn from experience, generalize from previous
    examples, and perform pattern recognition
  • Too many people skip the PCA or clustering steps
    and jump straight to supervised methods
  • Some get great separation and think the job is
    done - this is where the errors begin
  • Too many don't assess significance using
    permutation testing or n-fold cross validation
  • If separation isn't partially obvious by
    eye-balling your data, you may be treading on
    thin ice