Transcript and Presenter's Notes

Title: Canadian Bioinformatics Workshops


1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Module 5
  • David Wishart
  • Informatics and Statistics for Metabolomics
  • June 16-17, 2011

4
Distributions & Significance
5
Univariate Statistics
6
Univariate Statistics
  • Univariate means a single variable
  • If you measure a population using some single
    measure such as height, weight, test score, IQ,
    you are measuring a single variable
  • If you plot that single variable over the whole
    population, measuring the frequency with which a
    given value occurs, you will get the following

7
A Bell Curve
[Figure: frequency (# of each) vs. height]
Also called a Gaussian or Normal Distribution
8
Features of a Normal Distribution
μ = mean
  • Symmetric Distribution
  • Has an average or mean value (μ) at the centre
  • Has a characteristic width called the standard
    deviation (σ)
  • Most common type of distribution known

9
Normal Distribution
  • Almost any set of biological or physical
    measurements will display some variation and
    these will almost always follow a Normal
    distribution
  • The larger the set of measurements, the more
    normal the curve
  • Minimum set of measurements to get a normal
    distribution is 30-40

10
Gaussian Distribution
11
Some Equations
Mean:                 μ = Σxᵢ / N
Variance:             σ² = Σ(xᵢ − μ)² / N
Standard deviation:   σ = √[ Σ(xᵢ − μ)² / N ]
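As an illustration, these three formulas map directly onto a few lines of NumPy; the height values below are hypothetical, and the population forms (dividing by N) are used to match the slide.

```python
import numpy as np

heights = np.array([165.0, 170.5, 172.0, 168.3, 175.1, 169.9])  # hypothetical sample

mu = heights.sum() / heights.size                  # mean: sum(x_i) / N
var = ((heights - mu) ** 2).sum() / heights.size   # variance: sum((x_i - mu)^2) / N
sigma = np.sqrt(var)                               # standard deviation

print(mu, var, sigma)
# np.mean(heights), np.var(heights), np.std(heights) give the same results
```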
12
Standard Deviations (Z-values)
13
Significance
  • Based on the Normal Distribution, the probability
    that something is >1 SD away (larger or smaller)
    from the mean is ~32%
  • Based on the Normal Distribution, the probability
    that something is >2 SD away (larger or smaller)
    from the mean is ~5%
  • Based on the Normal Distribution, the probability
    that something is >3 SD away (larger or smaller)
    from the mean is ~0.3%
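A minimal sketch (assuming SciPy is available) of where the ~32%, ~5% and ~0.3% figures come from, as two-sided tail areas of the standard normal distribution:

```python
from scipy.stats import norm

# Two-sided probability of being more than k standard deviations from the mean
for k in (1, 2, 3):
    p = 2 * norm.sf(k)          # sf(k) = P(Z > k) for a standard normal
    print(f">{k} SD: {p:.3f}")  # ~0.317, ~0.046, ~0.003 (i.e. ~32%, ~5%, ~0.3%)
```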

14
Significance
  • In a test with a class of 400 students, if you
    score the average you typically receive a C
  • In a test with a class of 400 students, if you
    score 1 SD above the average you typically
    receive a B
  • In a test with a class of 400 students, if you
    score 2 SD above the average you typically
    receive an A

15
The P-value
  • The p-value is the probability of obtaining a
    test statistic (a score, a set of events, a
    height) at least as extreme as the one that was
    actually observed
  • One "rejects the null hypothesis" when the
    p-value is less than the significance level a
    which is often 0.05 or 0.01
  • When the null hypothesis is rejected, the result
    is said to be statistically significant

16
P-value
  • If the average height of an adult (M/F) human is
    5′7″ and the standard deviation is 5″, what is
    the probability of finding someone who is more
    than 6′10″ tall?
  • If you choose an α of 0.05, is a 6′11″
    individual a member of the human species?
  • If you choose an α of 0.01, is a 6′11″ individual
    a member of the human species?
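A hedged sketch of the height example, assuming the mean is 5′7″ = 67 inches, the SD is 5 inches, and a one-sided tail is appropriate:

```python
from scipy.stats import norm

mean_in, sd_in = 67, 5           # 5'7" mean and 5" standard deviation, in inches
for height_in in (82, 83):       # 6'10" and 6'11"
    z = (height_in - mean_in) / sd_in
    p = norm.sf(z)               # one-sided probability of being at least this tall
    print(height_in, z, round(p, 5))
# z = 3.0 gives p ~ 0.0013 and z = 3.2 gives p ~ 0.0007,
# both below an alpha of 0.05 and of 0.01
```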

17
P-value
  • If you flip a coin 20 times and the coin turns up
    heads 14/20 times, the probability that this would
    occur is about 60,000/1,048,576 ≈ 0.058
  • If you choose an α of 0.05, is this coin a fair
    coin?
  • If you choose an α of 0.10, is this coin a fair
    coin?
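This probability is the upper tail of a Binomial(20, 0.5) distribution; a short check with SciPy (an illustration, not part of the original slides):

```python
from scipy.stats import binom

# One-sided probability of 14 or more heads in 20 flips of a fair coin
p = binom.sf(13, 20, 0.5)   # sf(13) = P(X >= 14) for X ~ Binomial(n=20, p=0.5)
print(round(p, 4))          # ~0.0577, matching the slide's ~0.058
```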

18
Mean, Median & Mode
[Figure: skewed distribution with the mode, median, and mean marked]
19
Mean, Median, Mode
  • In a Normal Distribution the mean, mode and
    median are all equal
  • In skewed distributions they are unequal
  • Mean - average value, affected by extreme values
    in the distribution
  • Median - the middlemost value, usually half way
    between the mode and the mean
  • Mode - most common value

20
Different Distributions
Unimodal vs. Bimodal
21
Other Distributions
  • Binomial Distribution
  • Poisson Distribution
  • Extreme Value Distribution
  • Skewed or Exponential Distribution

22
Binomial Distribution
Pascal's triangle (binomial coefficients):
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1

P(x) is given by the terms of the expansion (p + q)ⁿ
23
Poisson Distribution
24
Extreme Value Distribution
  • Arises from sampling the extreme end of a normal
    distribution
  • A distribution which is skewed due to its
    selective sampling
  • Skew can be either right or left

[Figure: Gaussian distribution for comparison]
25
Skewed Distribution
  • Resembles an exponential or Poisson-like
    distribution
  • Lots of extreme values far from mean or mode
  • Hard to do useful statistical tests with this
    type of distribution

[Figure: skewed distribution with outliers labelled]
26
Fixing a Skewed Distribution
  • A skewed distribution or exponentially decaying
    distribution can be transformed into a normal
    or Gaussian distribution by applying a log
    transformation
  • This brings the outliers a little closer to the
    mean because it rescales the x-variable; it also
    makes the distribution much more Gaussian
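A small sketch of the log transformation on made-up, right-skewed concentration data (the log-normal sample here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
conc = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # hypothetical skewed concentrations

log_conc = np.log10(conc)     # the log transform pulls in the long right tail

print(conc.mean(), np.median(conc))          # mean > median: skewed
print(log_conc.mean(), np.median(log_conc))  # mean ~ median: much more Gaussian
```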

27
Log Transformation
Skewed distribution
Normal distribution
28
Log Transformation on Real Data
29
Distinguishing 2 Populations
Normals
Leprechauns
30
The Result
[Figure: two overlapping height distributions (# of each vs. height)]
Are they different?
31
What about these 2 Populations?
32
The Result
[Figure: two overlapping height distributions (# of each vs. height)]
Are they different?
33
Student's t-Test
  • Also called the t-Test
  • Used to determine if 2 populations are different
  • Formally allows you to calculate the probability
    that 2 sample means are the same
  • If the t-Test statistic gives you p = 0.4, and
    α is 0.05, then the 2 populations are the
    same
  • If the t-Test statistic gives you p = 0.04, and
    α is 0.05, then the 2 populations are
    different
  • Paired and unpaired t-Tests are available; the paired
    version is used for before-and-after experiments, while
    the unpaired version is for 2 randomly chosen samples
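A minimal sketch of both flavours of the t-test using SciPy; the two height samples are simulated stand-ins for the "Normals" and "Leprechauns" populations:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
normals = rng.normal(loc=170, scale=7, size=40)      # hypothetical heights, group 1
leprechauns = rng.normal(loc=100, scale=7, size=40)  # hypothetical heights, group 2

# Unpaired t-test: two independently sampled groups
t, p = ttest_ind(normals, leprechauns)
print(t, p)   # p < 0.05 here, so the two means are significantly different

# For before-and-after measurements on the same subjects, the paired version
# would be scipy.stats.ttest_rel(before, after)
```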

34
Student's t-Test
  • A t-Test can also be used to determine whether 2
    clusters are different if the clusters follow a
    normal distribution

[Figure: two clusters plotted against Variable 1 and Variable 2]
35
Distinguishing 3 Populations
Normals
Leprechauns
Elves
36
The Result
[Figure: three height distributions (# of each vs. height)]
Are they different?
37
Distinguishing 3 Populations
38
The Result
[Figure: three overlapping height distributions (# of each vs. height)]
Are they different?
39
ANOVA
  • Also called Analysis of Variance
  • Used to determine if 3 or more populations are
    different; it is a generalization of the t-Test
  • Formally, ANOVA provides a statistical test (by
    looking at group variance) of whether or not the
    means of several groups are all equal
  • Uses an F-statistic (the F-test) to test for significance
  • There are 1-way, 2-way, 3-way and n-way ANOVAs; the most
    common is the 1-way ANOVA, which is only concerned with
    whether any of the 3 populations are different, not
    which pair is different
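A one-way ANOVA sketch with SciPy's f_oneway; the three simulated height samples are hypothetical stand-ins for the three populations:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
normals = rng.normal(170, 7, 30)
leprechauns = rng.normal(100, 7, 30)
elves = rng.normal(190, 7, 30)

# One-way ANOVA: tests whether the group means are all equal
F, p = f_oneway(normals, leprechauns, elves)
print(F, p)   # p < 0.05 -> at least one mean differs (but not which pair)
```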

40
ANOVA
  • ANOVA can also be used to determine whether 3
    clusters are different if the clusters follow a
    normal distribution

[Figure: three clusters plotted against Variable 1 and Variable 2]
41
Normalization
42
Normalization
  • What if we measured the top population using a
    ruler that was miscalibrated or biased (inches
    were short by 10%)? We would get the following
    result

[Figure: shifted height distribution (# of each vs. height)]
43
Normalization
  • Normalization adjusts for systematic bias in the
    measurement tool
  • After normalization we would get

[Figure: corrected height distribution (# of each vs. height)]
44
Data Comparisons & Dependencies
45
Data Comparisons
  • In many kinds of experiments we want to know what
    happened to a population before and after
    some treatment or intervention
  • In other situations we want to measure the
    dependency of one variable against another
  • In still others we want to assess how the
    observed property matches the predicted property
  • In all cases we will measure multiple samples or
    work with a population of subjects
  • The best way to view this kind of data is through
    a scatter plot

46
A Scatter Plot
47
Scatter Plots
  • If there is some dependency between the two
    variables, or if there is a relationship between
    the predicted and observed variable, or if the
    before and after treatments led to some
    effect, then it is possible to see some clear
    patterns in the scatter plot
  • This pattern or relationship is called correlation

48
Correlation
[Figure: positive correlation, uncorrelated, and negative correlation]
49
Correlation
High correlation
Low correlation
Perfect correlation
50
Correlation Coefficient
[Figure: scatter plots with r = 0.85, r = 0.4, and r = 1.0]
51
Correlation Coefficient
  • Sometimes called coefficient of linear
    correlation or Pearson product-moment correlation
    coefficient
  • A quantitative way of determining what model (or
    equation or type of line) best fits a set of data
  • Commonly used to assess most kinds of
    predictions, simulations, comparisons or
    dependencies
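A small sketch of computing r with SciPy on made-up, linearly related data; note that pearsonr also returns the p-value corresponding to the t-test on the regression slope described on the next slide:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)   # hypothetical dependent variable

r, p = pearsonr(x, y)     # Pearson product-moment correlation and its p-value
print(round(r, 2), p)     # the p-value tests whether the correlation differs from 0
```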

52
Student's t-Test (Again)
  • The t-Test can also be used to assess the
    statistical significance of a correlation
  • It specifically determines whether the slope of
    the regression line is statistically different
    from 0

53
Correlation and Outliers
Experimental error or something important?
A single bad point can destroy a good
correlation
54
Outliers
  • Can be both good and bad
  • When modeling data you don't like to see
    outliers (suggests the model is bad)
  • Often a good indicator of experimental or
    measurement errors -- only you can know!
  • When plotting metabolite concentration data you
    do like to see outliers
  • A good indicator of something significant

55
Detecting Clusters
[Figure: height vs. weight scatter plot]
56
Is it Right to Calculate a Correlation
Coefficient?
[Figure: height vs. weight scatter plot, r = 0.73]
57
Or is There More to This?
[Figure: height vs. weight scatter plot with separate male and female clusters]
58
Clustering Applications in Bioinformatics
  • Metabolomics and Cheminformatics
  • Microarray or GeneChip Analysis
  • 2D Gel or ProteinChip Analysis
  • Protein Interaction Analysis
  • Phylogenetic and Evolutionary Analysis
  • Structural Classification of Proteins
  • Protein Sequence Families

59
Clustering
  • Definition - a process by which objects that are
    logically similar in characteristics are grouped
    together.
  • Clustering is different from classification
  • In classification the objects are assigned to
    pre-defined classes; in clustering the classes
    are yet to be defined
  • Clustering helps in classification

60
Clustering Requires...
  • A method to measure similarity (a similarity
    matrix) or dissimilarity (a dissimilarity
    coefficient) between objects
  • A threshold value with which to decide whether an
    object belongs with a cluster
  • A way of measuring the distance between two
    clusters
  • A cluster seed (an object to begin the clustering
    process)

61
Clustering Algorithms
  • K-means or Partitioning Methods - divides a set
    of N objects into M clusters -- with or without
    overlap
  • Hierarchical Methods - produces a set of nested
    clusters in which each pair of objects is
    progressively nested into a larger cluster until
    only one cluster remains
  • Self-Organizing Feature Maps - produces a cluster
    set through iterative training

62
K-means or Partitioning Methods
  • Make the first object the centroid for the first
    cluster
  • For the next object calculate the similarity to
    each existing centroid
  • If the similarity is greater than a threshold add
    the object to the existing cluster and
    redetermine the centroid, else use the object to
    start new cluster
  • Return to step 2 and repeat until done
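A sketch of the threshold-based partitioning procedure described above (note this is the simple "leader"-style procedure from the slide, not the classical iterative K-means algorithm); the points and the threshold are hypothetical:

```python
import numpy as np

def partition_cluster(points, threshold):
    """Assign each point to the nearest existing centroid if it is within
    `threshold`; otherwise start a new cluster with that point."""
    clusters = [[points[0]]]                      # first object seeds the first cluster
    centroids = [np.asarray(points[0], float)]
    for p in points[1:]:
        d = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(d))
        if d[i] <= threshold:                     # similar enough: join the cluster
            clusters[i].append(p)
            centroids[i] = np.mean(clusters[i], axis=0)   # re-determine the centroid
        else:                                     # too far: seed a new cluster
            clusters.append([p])
            centroids.append(np.asarray(p, float))
    return clusters, centroids

pts = np.array([[0, 0], [0.5, 0.2], [5, 5], [5.3, 4.8], [0.1, 0.4]])
clusters, cents = partition_cluster(pts, threshold=2.0)
print(len(clusters), cents)   # two clusters are found for these points
```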

63
K-means or Partitioning Methods
[Figure: partitioning example -- initial cluster, choose object 1, choose object 2, test against centroid, join, recompute centroid]
64
Hierarchical Clustering
  • Find the two closest objects and merge them into
    a cluster
  • Find and merge the next two closest objects (or
    an object and a cluster, or two clusters) using
    some similarity measure and a predefined
    threshold
  • If more than one cluster remains return to step 2
    until finished
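A sketch of agglomerative (hierarchical) clustering with SciPy; the 10 x 5 data matrix of "metabolite profiles" is simulated purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Hypothetical metabolite profiles: 10 samples x 5 metabolites, in two groups
data = np.vstack([rng.normal(0, 1, (5, 5)), rng.normal(4, 1, (5, 5))])

# Repeatedly merge the two closest objects/clusters (average linkage)
Z = linkage(data, method="average", metric="euclidean")

# Cut the resulting tree so that two flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # the first 5 and last 5 samples fall into separate clusters
```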

65
Hierarchical Clustering
[Figure: hierarchical clustering example -- initial clusters, pairwise comparison, select closest pair, select next closest pair]
66
Hierarchical Clustering
[Figure: dendrogram linking metabolite expression curves A-F]
Find 2 most similar metabolite expression
levels or curves
Find the next closest pair of levels or curves
Iterate
Heat map
67
Multivariate Statistics
68
Multivariate Statistics
  • Multivariate means multiple variables
  • If you measure a population using multiple
    measures at the same time such as height, weight,
    hair colour, clothing colour, eye colour, etc.
    you are performing multivariate statistics
  • Multivariate statistics requires more complex,
    multidimensional analyses or dimensional
    reduction methods

69
A Typical Metabolomics Experiment
70
A Metabolomics Experiment
  • Metabolomics experiments typically measure many
    metabolites at once, in other words the
    instruments are measuring multiple variables and
    so metabolomic data are inherently multivariate
    data
  • Metabolomics requires multivariate statistics

71
Multivariate Statistics: The Trick
  • The key trick in multivariate statistics is to
    find a way that effectively reduces the
    multivariate data into univariate data
  • Once done, then you can apply the same univariate
    concepts such as p-values, t-Tests and ANOVA
    tests to the data
  • The trick is dimensional reduction

72
Dimension Reduction: PCA
  • PCA = Principal Component Analysis
  • Process that transforms a number of possibly
    correlated variables into a smaller number of
    uncorrelated variables called principal
    components
  • Reduces 1000s of variables to 2-3 key features

Scores plot
73
Principal Component Analysis
Hundreds of peaks → 2 components
Scores plot
PCA captures what should be visually detectable
If you can't see it, PCA probably won't help
74
Visualizing PCA
  • PCA of a bagel
  • One projection produces a wiener
  • Another projection produces an O
  • The O projection captures most of the variation
    and has the largest eigenvector (PC1)
  • The wiener projection is PC2 and gives depth info

75
PCA - The Details
  • PCA involves the calculation of the eigenvalue
    (singular value) decomposition of a data
    covariance matrix
  • PCA is an orthogonal linear transformation
  • PCA transforms data to a new coordinate system so
    that the greatest variance of the data comes to
    lie on the first coordinate (1st PC), the second
    greatest variance on the 2nd PC etc.

[Diagram: data matrix with samples s1...sk as rows and variables x1...xn as
 columns, decomposed into scores t1...tm and loadings p1...pm]
Scores (t): eigenvectors -- uncorrelated, orthogonal
Loadings (p): weights of the original variables
scores = loadings × data:  t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
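A minimal PCA sketch with scikit-learn on a simulated matrix of samples x metabolite intensities (the data and the group offset are hypothetical); it produces the scores and loadings referred to above and on the next slides:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Hypothetical data: 30 samples (rows) x 100 metabolite intensities (columns),
# with the second group of samples slightly offset from the first
X = np.vstack([rng.normal(0.0, 1.0, (15, 100)),
               rng.normal(1.0, 1.0, (15, 100))])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)         # scores: each sample's coordinates on PC1/PC2
loadings = pca.components_.T          # loadings: each variable's weight on each PC

print(scores.shape, loadings.shape)   # (30, 2) and (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each PC
```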
76
Visualizing PCA
  • Airport data from USA
  • 5000 samples
  • X1 - latitude
  • X2 - longitude
  • X3 - altitude
  • What should you expect?

Data from Roy Goodacre (U of Manchester)
77
Visualizing PCA
PCA is equivalent to K-means clustering
78
K-means Clustering
[Figure: K-means partitioning example -- initial cluster, choose object 1, choose object 2, test against centroid, join, recompute centroid]
79
PCA Clusters
  • Once dimensional reduction has been achieved you
    obtain clusters of data that are mostly normally
    distributed with means and variances (in PCA
    space)
  • It is possible to use t-Tests and ANOVA tests to
    determine if these clusters or their means are
    significantly different or not

80
PCA and ANOVA
  • ANOVA can also be used to determine whether 3
    clusters are different if the clusters follow a
    normal distribution

[Figure: three clusters plotted on PC 1 vs. PC 2]
81
PCA Plot Nomenclature
  • PCA generates 2 kinds of plots: the scores plot
    and the loadings plot
  • The scores plot (on the right) plots the data using the
    main principal components

82
PCA Loadings Plot
  • Loadings plot shows how much each of the
    variables (metabolites) contributed to the
    different principal components
  • Variables at the extreme corners contribute most
    to the scores plot separation

83
PCA Details/Advice
  • In some cases PCA will not succeed in identifying
    any clear clusters or obvious groupings no matter
    how many components are used. If this is the
    case, it is wise to accept the result and assume
    that the presumptive classes or groups cannot be
    distinguished
  • As a general rule, if a PCA analysis fails to
    achieve even a modest separation of classes, then
    it is probably not worthwhile using other
    statistical techniques to try to separate them

84
PCA Q2 and R2
  • The performance of a PCA model can be
    quantitatively evaluated in terms of an R2 and/or
    a Q2 value
  • R2 is the correlation index and refers to the
    goodness of fit or the explained variation (range
    0-1)
  • Q2 refers to the predicted variation or quality
    of prediction (range 0-1)
  • Typically Q2 and R2 track very closely together

85
PCA R2
  • R2 is a quantitative measure (with a maximum
    value of 1) that indicates how well the PCA model
    is able to mathematically reproduce the data in
    the data set
  • A poorly fit model will have an R2 of 0.2 or 0.3,
    while a well-fit model will have an R2 of 0.7 or
    0.8.

86
PCA Q2
  • To guard against over-fitting, the value Q2 is
    commonly determined. Q2 is usually estimated by
    cross validation or permutation testing to assess
    the predictive ability of the model relative to
    the number of principal components used in the
    model
  • Generally a Q2 > 0.5 is considered good, while a
    Q2 of 0.9 is outstanding

87
PCA vs. PLS-DA
  • Partial Least Squares Discriminant Analysis
  • PLS-DA is a supervised classification technique
    while PCA is an unsupervised clustering technique
  • PLS-DA uses labeled data while PCA uses no
    prior knowledge
  • PLS-DA enhances the separation between groups of
    observations by rotating PCA components such that
    a maximum separation among classes is obtained
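scikit-learn has no dedicated PLS-DA class; a common approach (assumed in this sketch) is to run PLS regression against dummy-coded class labels and inspect the resulting scores:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
# Hypothetical labelled data: two classes of 20 samples x 50 variables
X = np.vstack([rng.normal(0.0, 1.0, (20, 50)), rng.normal(0.5, 1.0, (20, 50))])
y = np.array([0] * 20 + [1] * 20)      # class labels = the prior knowledge

plsda = PLSRegression(n_components=2)
plsda.fit(X, y)                        # PLS regression against the 0/1 labels
scores = plsda.transform(X)            # analogous to a PCA scores plot, but rotated
                                       # to maximize separation between the classes
print(scores.shape)                    # (40, 2)
```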

88
Other Supervised Classification Methods
  • SIMCA = Soft Independent Modeling of Class
    Analogy
  • OPLS = Orthogonal Projections to Latent Structures
    (orthogonal PLS)
  • Support Vector Machines
  • Random Forest
  • Naïve Bayes Classifiers
  • Neural Networks

89
Breaching the Data Barrier
  • Unsupervised Methods: PCA, K-means clustering, Factor Analysis
  • Supervised Methods: PLS-DA, LDA, PLS-Regression
  • Machine Learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
90
Data Analysis Progression
  • Unsupervised Methods
  • PCA or cluster to see if natural clusters form or
    if data separates well
  • Data is unlabeled (no prior knowledge)
  • Supervised Methods/Machine Learning
  • Data is labeled (prior knowledge)
  • Used to see if data can be classified
  • Helps separate less obvious clusters or features
  • Statistical Significance
  • Supervised methods always generate clusters --
    this can be very misleading
  • Check if clusters are real by label permutation
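A hedged sketch of the label-permutation check: fit a supervised model (PLS regression against 0/1 labels here) with cross-validation, then repeat with shuffled labels; if the real score is no better than the permuted scores, the apparent separation is not significant. All data are simulated, and sklearn.model_selection.permutation_test_score packages the same idea.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 50))             # hypothetical data with no real signal
y = np.array([0] * 20 + [1] * 20)         # class labels

model = PLSRegression(n_components=2)
real_score = cross_val_score(model, X, y, cv=5).mean()

# Refit many times with randomly permuted labels
perm_scores = [cross_val_score(model, X, rng.permutation(y), cv=5).mean()
               for _ in range(100)]

# Fraction of permuted fits that do as well as the real fit (empirical p-value)
p_value = np.mean(np.array(perm_scores) >= real_score)
print(real_score, p_value)   # for this random data the "separation" is not significant
```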

91
Testing Significance
[Diagram: labelled data analyzed by PCA and PLS-DA/SVM, compared against permuted data analyzed by PLS-DA/SVM]
92
Note of Caution
  • Supervised classification methods are powerful
  • They learn from experience, generalize from previous
    examples, and perform pattern recognition
  • Too many people skip the PCA or clustering steps
    and jump straight to supervised methods
  • Some get great separation and think the job is
    done - this is where the errors begin
  • Too many don't assess significance using
    permutation testing or n-fold cross validation
  • If separation isn't partially obvious by
    eye-balling your data, you may be treading on
    thin ice