Canadian Bioinformatics Workshops - PowerPoint PPT Presentation

About This Presentation
Title:

Canadian Bioinformatics Workshops

Description:

Canadian Bioinformatics Workshops www.bioinformatics.ca – PowerPoint PPT presentation

Slides: 48
Provided by: Michael3563
Transcript and Presenter's Notes

Title: Canadian Bioinformatics Workshops


1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Lecture 3: Univariate Analyses, Discrete Data
MBP1010, Dr. Paul C. Boutros, Winter 2014

Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others

4
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I: continuous
  • Lecture 3: Univariate Analyses II: discrete
  • Lecture 4: Multivariate Analyses I: specialized
    models
  • Lecture 5: Multivariate Analyses II: general
    models
  • Lecture 6: Sequence Analysis
  • Lecture 7: Microarray Analysis I: Pre-Processing
  • Lecture 8: Microarray Analysis II:
    Multiple-Testing
  • Lecture 9: Machine-Learning
  • Final Exam (written)

5
How Will You Be Graded?
  • 9%: Participation (1% per week)
  • 56%: Assignments (8 x 7% each)
  • 35%: Final Examination (in-class)
  • Each individual will get their own, unique
    assignment
  • Assignments will all be in R, and will be graded
    according to computational correctness only (i.e.
    does your R script yield the correct result when
    run)
  • Final Exam will include multiple-choice and
    written answers

6
Course Information Updates
  • Website will have up to date information, lecture
    notes, sample source-code from class, etc.
  • http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html
  • Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
  • New TA (focusing on bioinformatics component)
    will be Irakli (Erik) Dzneladze
  • Assignment 1 is released today, due on January
    30
  • Assignment 2 will be released on January 31, due
    Feb 7
  • Updated course-schedule on website

7
House Rules
  • Cell phones to silent
  • No side conversations
  • Hands up for questions

8
Review From Lecture 1
  • Population vs. Sample

All MBP students = population; MBP students in
1010 = sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don't we use Excel/spreadsheets?
Input errors, reproducibility, wrong results
9
Review From Lecture 2
  • Define discrete data

Gaps on the number-line (continuous data have no gaps)
What is the central limit theorem?
A random variable that is the sum of many small
independent random variables is approximately
normally distributed
Theoretical vs. empirical quantiles
Probability vs. percentage of values less than p
Components of a boxplot?
25% - 1.5 x IQR, 25%, 50%, 75%, 75% + 1.5 x IQR
10
Boxplot
Descriptive statistics can be intuitively
summarized in a Boxplot.
> boxplot(x)
[Figure: boxplot annotated with the 75% quantile, median, 25% quantile, the IQR, and whiskers reaching 1.5 x IQR beyond the box]
Everything more than 1.5 x IQR above or below the
box is considered an "outlier".
IQR = Inter-Quantile Range = 75% quantile - 25%
quantile
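The outlier rule on this slide can be sketched numerically. This is a minimal illustration, not part of the original deck: the data vector is made up, and `quantile()` is used for the quartiles, whereas `boxplot()` itself uses Tukey's hinges, so the two can differ slightly on small samples.

```r
# Hypothetical data: nine ordinary values and one extreme value
x <- c(1:9, 50)

q25 <- unname(quantile(x, 0.25))   # 25% quantile
q75 <- unname(quantile(x, 0.75))   # 75% quantile
iqr <- q75 - q25                   # same as IQR(x)

# Fences: 1.5 x IQR beyond the box
lower_fence <- q25 - 1.5 * iqr
upper_fence <- q75 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

For larger samples, `boxplot.stats(x)$out` returns the points that `boxplot()` would actually draw as outliers.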
11
Review From Lecture 2
  • How can you interpret a QQ plot?

Compares two samples, or a sample and a
distribution. A straight line indicates identity.
What is hypothesis testing?
Confirmatory data-analysis: test a null hypothesis
What is a p-value?
Evidence against the null: probability of a false
positive (FP), i.e. the probability of seeing a
value at least as extreme by chance alone
12
Review From Lecture 2
  • Parametric vs. non-parametric tests

Parametric tests have distributional assumptions
What is the t-statistic?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from a normal distribution;
independence of replicates; independence of
groups; homoscedasticity
13
Flow-Chart For Two-Sample Tests
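Is Data Sampled From a Normally-Distributed
Population?
  • Yes: Equal Variance (F-Test)?
  • No: Sufficient n for CLT (>30)?
  • Sufficient n, Yes: Equal Variance (F-Test)?
  • Sufficient n, No: Wilcoxon U-Test
  • Equal Variance, Yes: Homoscedastic T-Test
  • Equal Variance, No: Heteroscedastic T-Test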
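The flow-chart can be sketched as a small R function. This is only an illustration under stated assumptions: the function name `choose_two_sample_test`, the `normal` flag (the caller decides whether the population is normal), the per-group reading of "n > 30", and the 0.05 cut-off for the F-test are all hypothetical choices that the slide does not specify.

```r
# Hypothetical helper following the flow-chart; not from the original deck.
# x, y   : numeric vectors (the two samples)
# normal : TRUE if the data are judged to come from a normal population
# alpha  : assumed threshold for the F-test of equal variances
choose_two_sample_test <- function(x, y, normal, alpha = 0.05) {
  enough_for_clt <- min(length(x), length(y)) > 30
  if (!normal && !enough_for_clt) {
    # Not normal and too few observations for the CLT: Wilcoxon U-test
    return(wilcox.test(x, y))
  }
  # Otherwise compare variances with an F-test ...
  equal_var <- var.test(x, y)$p.value >= alpha
  # ... and run the homoscedastic or heteroscedastic (Welch) t-test
  t.test(x, y, var.equal = equal_var)
}

# Usage with made-up numbers: small, non-normal samples
res <- choose_two_sample_test(c(1.1, 2.2, 3.3), c(4.4, 5.5, 6.6), normal = FALSE)
print(res$method)
```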
14
Topics For This Week
  • Correlations
  • ceRNAs
  • Attendance
  • Common discrete univariate analyses

15
Power, error rates and decision
Power calculation in R
> power.t.test(n = 5, delta = 1, sd = 2,
               alternative = "two.sided", type = "one.sample")
     One-sample t test power calculation
              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided
Other tests are available; see ??power.
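The same function can be turned around to ask how many samples are needed. A minimal sketch, assuming a target power of 0.8 (a common convention, not a value from the slide): leave `n` unspecified and supply `power` instead.

```r
# Solve for n rather than power; the 0.8 target is an assumed convention
res <- power.t.test(delta = 1, sd = 2, sig.level = 0.05, power = 0.8,
                    type = "one.sample", alternative = "two.sided")

# Round up: a fractional sample cannot be collected
n_needed <- ceiling(res$n)
print(n_needed)
```

With the same effect size and standard deviation as on the slide, this shows that n = 5 is far too small to reach conventional power.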
16
Power, error rates and decision
Pr(False Negative) = Pr(Type II error)
Let's Try Some Power Analyses in R
µ0
µ1
Pr(False Positive) = Pr(Type I error)
17
Problem
  • When we measure more than one variable for
    each member of a population, a scatter plot may
    show us that the values are not completely
    independent: there is, e.g., a trend for one
    variable to increase as the other increases.
  • Regression analyses assess the dependence.
  • Examples:
  • Height vs. weight
  • Gene dosage vs. expression level
  • Survival analysis: probability of death vs. age

18
Correlation
When one variable depends on the other, the
variables are to some degree correlated. (Note:
correlation need not imply causality.) In R, the
function cov() measures covariance and cor()
measures the Pearson coefficient of correlation
(a normalized measure of covariance). Pearson's
coefficient of correlation ranges from -1
to 1, with 0 indicating no linear correlation.
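To make "a normalized measure of covariance" concrete, the sketch below divides the covariance by the two standard deviations and compares the result with `cor()`. The height and weight values are simulated purely for illustration; the seed and parameters are arbitrary assumptions.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility
height <- rnorm(100, mean = 170, sd = 10)      # hypothetical heights (cm)
weight <- 0.9 * height + rnorm(100, sd = 8)    # hypothetical weights (arbitrary units)

# Pearson's r is the covariance rescaled by both standard deviations
pearson_manual <- cov(height, weight) / (sd(height) * sd(weight))

print(pearson_manual)
print(cor(height, weight))                     # should agree with the manual value

# Covariance depends on the units; correlation does not
print(cov(height / 100, weight))               # heights in metres: covariance shrinks
print(cor(height / 100, weight))               # correlation is unchanged
```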
19
Pearson's Coefficient of Correlation
How to interpret the correlation coefficient
Explore varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.9999666
20
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.9661111
21
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.4
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.6652423
22
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.01
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.01232522
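The four examples above can be run in one pass. This is a minimal sketch: the helper name `mix_cor` and the seed are hypothetical, and the printed values will differ from the slides because the random draws differ.

```r
# Hypothetical helper: mix a signal x with independent noise and
# return the Pearson correlation between x and the mixture
mix_cor <- function(r, n = 50) {
  x <- rnorm(n)
  y <- (r * x) + ((1 - r) * rnorm(n))
  cor(x, y)
}

set.seed(1)  # arbitrary seed, only so that reruns give the same numbers
for (r in c(0.99, 0.8, 0.4, 0.01)) {
  cat(sprintf("r = %.2f  cor(x, y) = %.3f\n", r, mix_cor(r)))
}
```

Note that the mixing weight r is not itself the correlation: the correlation falls off more slowly than r because the noise term is scaled by (1 - r).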
23
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.3438495
24
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] -0.5024503
25
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y)
> cor(x,y)
[1] 0.6334732
26
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b)
> cor(a,b)
[1] 0.04531711
27
Correlation coefficient
28
Other Correlations
  • There are many other types of correlations
  • Spearman's correlation
  • rho
  • Kendall's correlation
  • tau
  • Spearman is a Pearson on ranked values
  • Spearman rho = 1 means a monotonic (increasing) relationship
  • Pearson R = 1 means a linear (increasing) relationship
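A short sketch of the difference between the three coefficients, using a noise-free exponential relationship (an illustrative choice, with an arbitrary seed): the relationship is perfectly monotonic but not linear, so the rank-based measures reach 1 while Pearson's R does not.

```r
set.seed(7)                    # arbitrary seed
x <- runif(50, -1, 1)
y <- exp(5 * x)                # strictly increasing in x, but far from linear

pearson  <- cor(x, y)                        # default method = "pearson"
spearman <- cor(x, y, method = "spearman")   # Pearson on ranks
kendall  <- cor(x, y, method = "kendall")    # based on concordant pairs

print(c(pearson = pearson, spearman = spearman, kendall = kendall))

# "Spearman is a Pearson on ranked values":
print(cor(rank(x), rank(y)))
```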

29
When Do We Use Statistics?
  • Ubiquitous in modern biology
  • Every class I will show a use of statistics in a
    (very, very) recent Nature paper.

January 9, 2014
30
Non-Small Cell Lung Cancer 101
15% 5-year survival
Lung Cancer
80% of lung cancer
Non-Small Cell
Small Cell
Large Cell (and others)
Squamous Cell Carcinomas
Adenocarcinomas
31
Non-Small Cell Lung Cancer 102
Stage I: Local Tumour Only (IA = small tumour, IB = large tumour)
Stage II: Local Lymph Nodes
Stage III: Distal Lymph Nodes
Stage IV: Metastasis
32
General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
33
Test Multiple Constructs for Activity
34
What Statistical Analysis Did They Do?
  • No information given in main text!
  • Figure legend says:
  • "Values are technical triplicates, have been
    performed independently three times, and
    represent mean +/- standard deviation (s.d.) with
    propagated error."
  • In supplementary they say:
  • "Unless otherwise specified, statistical
    significance was assessed by the Student's
    t-test"
  • So, what would you do differently?

35
Attendance Break
36
Let's Go Back to Discrete vs. Continuous
  • Definition?
  • Let's take a few examples of discrete univariate
    statistical analyses in biology and write them
    down here
  • Cell counts
  • Embryo pigmentation: yes/no with morpholino
  • SNP calling
  • Immunohistochemistry
  • Colony formations

37
Four Main Discrete Univariate Tests
  • Hypergeometric test
  • Is a sample randomly selected from a fixed
    population?
  • Proportion test
  • Are two proportions equivalent?
  • Fisher's Exact test
  • Are two binary classifications associated?
  • (Pearson's) Chi-Squared Test
  • Are paired observations on two variables
    independent?

38
Hypergeometric Test
  • Is a sample randomly selected from a fixed
    population?
  • Closer to discrete mathematics than statistics
  • Technically: sampling without replacement
  • In R: ?phyper
  • Classic example: marbles
  • Less classic: poker

5/24 marbles are yellow
1/6 sampled marbles are yellow
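The marble example can be worked through with the hypergeometric functions in R. A minimal sketch, assuming the figure means a bag of 24 marbles of which 5 are yellow, and a sample of 6 drawn without replacement that contains 1 yellow marble.

```r
yellow <- 5     # yellow marbles in the bag
other  <- 19    # non-yellow marbles (24 - 5)
drawn  <- 6     # marbles sampled without replacement

# Probability of exactly one yellow marble in the sample
p_exactly_one <- dhyper(1, yellow, other, drawn)

# Probability of one or fewer yellow marbles (lower tail)
p_at_most_one <- phyper(1, yellow, other, drawn)

# The same density from first principles: ways to pick 1 yellow and
# 5 non-yellow, divided by all ways to pick 6 marbles
p_by_counting <- choose(5, 1) * choose(19, 5) / choose(24, 6)

print(c(p_exactly_one, p_at_most_one, p_by_counting))
```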
39
Hypergeometric Test Biological Example
  • Classic example in genomics: pathway analysis
  • I do a screen and identify n genes associated
    with something
  • Are those n genes biased towards a pathway?
  • Well, a pathway contains m genes
  • So is the overlap between n and m what a random
    selection would give? Hypergeometric test!
  • Similar example: drug screening
  • I test 1000 drugs to see which ones kill a
    cell-line
  • 100 of these are kinase inhibitors
  • 100 drugs kill my cell-line
  • 30 of these are kinase inhibitors
  • Did I find more kinase inhibitors than expected
    by chance?
  • Let's do the calculation
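One way to do the drug-screening calculation in R is sketched below. The numbers are the ones on the slide; the variable names are illustrative. The upper tail of `phyper()` is strict (more than q), so q is set to 30 - 1 to get the probability of 30 or more.

```r
total_drugs   <- 1000   # drugs screened
kinase_inhib  <- 100    # kinase inhibitors among them ("white balls")
killers       <- 100    # drugs that kill the cell-line (the sample)
kinase_killer <- 30     # kinase inhibitors among the killers

# Expected number of kinase inhibitors among the killers under random sampling
expected_hits <- killers * kinase_inhib / total_drugs

# P(X >= 30): probability of at least 30 kinase inhibitors by chance
p_enrichment <- phyper(kinase_killer - 1,
                       kinase_inhib,
                       total_drugs - kinase_inhib,
                       killers,
                       lower.tail = FALSE)

print(expected_hits)
print(p_enrichment)
```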

40
Hypergeometric Venn Diagram Overlap
Let's pretend X and Y are sets of genes (or
drugs, etc.) found in two separate
experiments. We want to know: is there more
overlap than expected by chance? To do this:
Total Balls = total number of genes considered
(but a gene must be analyzed in both experiments;
exclude those studied in only one)
Black Balls = all genes found in experiment X
White Balls = all genes not found in experiment X
Sample = all genes found in experiment Y
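The mapping from sets to balls can be wrapped in a small function. This is a sketch under assumptions: the function name `overlap_pvalue` and its arguments are hypothetical, and `universe` stands for the genes analyzed in both experiments. Note that `phyper()` counts the first colour it is given as the "successes", so the genes found in X are passed first.

```r
# Hypothetical helper: P(overlap >= observed) under random sampling
# x, y     : character vectors of genes found in experiments X and Y
# universe : genes analyzed in BOTH experiments (the "total balls")
overlap_pvalue <- function(x, y, universe) {
  universe <- unique(universe)
  x <- intersect(x, universe)      # drop genes studied in only one experiment
  y <- intersect(y, universe)
  k <- length(intersect(x, y))     # observed overlap
  phyper(k - 1,                          # upper tail is strict, hence k - 1
         length(x),                      # genes found in X
         length(universe) - length(x),   # genes not found in X
         length(y),                      # sample: genes found in Y
         lower.tail = FALSE)
}

# Usage with made-up gene names
genes <- paste0("g", 1:20)
print(overlap_pvalue(genes[1:5], genes[1:5], genes))    # complete overlap
print(overlap_pvalue(genes[1:5], genes[6:10], genes))   # no overlap
```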
41
Proportion Test
  • Are two proportions equivalent?
  • Example: is the fraction of people who play
    hockey in MBP different from the fraction who
    play hockey in Mathematics?
  • Mathematics: 12/85
  • MBP: 24/135
  • In R: ?prop.test
  • Only useful for two-group studies
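The hockey example translates directly into a `prop.test()` call; the counts are the ones on the slide and the variable name is illustrative.

```r
# Successes (hockey players) and group sizes: Mathematics first, then MBP
hockey <- prop.test(x = c(12, 24), n = c(85, 135))

print(hockey$estimate)   # the two observed proportions
print(hockey$p.value)    # two-sided p-value for equal proportions
```

By default `prop.test()` applies Yates' continuity correction for a two-group comparison; pass `correct = FALSE` to turn it off.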

42
Proportion Test Biological Example
  • Does the frequency of TP53 mutations differ
    between prostate cancer patients who will suffer
    a recurrence and those who will not?
  • 12/150 patients whose tumours recur have mutated
    TP53
  • 50/921 patients whose tumours do not recur have
    mutated TP53
  • P-value: guesses?
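After guessing, the TP53 comparison can be checked in R. A minimal sketch using the counts from the slide; it also shows the effect of the continuity correction, which is an addition for illustration rather than something the slide asks for.

```r
# Mutated TP53 among recurrent (12/150) and non-recurrent (50/921) patients
tp53 <- prop.test(x = c(12, 50), n = c(150, 921))

# Same test without Yates' continuity correction
tp53_uncorrected <- prop.test(x = c(12, 50), n = c(150, 921), correct = FALSE)

print(tp53$estimate)             # observed mutation frequencies
print(tp53$conf.int)             # confidence interval for the difference
print(tp53$p.value)
print(tp53_uncorrected$p.value)
```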

43
Fisher's Exact Test
  • Are two binary categorizations associated?
  • Based on a contingency table
  • What are these? Have we seen any before?
  • In R: ?fisher.test
  • Classic example: drinking tea

Dr. Muriel Bristol claimed to be able to taste
whether tea or milk was added first to a cup. Dr.
Ronald Fisher didn't believe her.
[Contingency table of her calls vs. the true pouring order; cell counts: 0, 4, 0, 4]
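The tea-tasting example can be run with `fisher.test()`. This sketch assumes the usual telling of the story, which matches the cell counts on the slide: eight cups, four of each kind, and every cup identified correctly. The row and column labels are illustrative.

```r
# Rows: what she said; columns: what was actually poured first
tea <- matrix(c(4, 0,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(guess = c("milk", "tea"),
                              truth = c("milk", "tea")))

# One-sided test: is she better than chance at matching guess to truth?
tea_test <- fisher.test(tea, alternative = "greater")
print(tea_test$p.value)

# With all cups correct, only 1 of the choose(8, 4) possible ways of
# picking four "milk first" cups is at least this good
print(1 / choose(8, 4))
```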
44
Fisher's Exact Test: Biological Example
  • You can use this any time you form a contingency
    table
  • Any time you make predictions (biomarkers)
  • Any time you compare two binary phenomena
  • Examples?

45
(Pearson's) Chi-Squared Test
  • Are two variables independent?
  • There are a lot of different chi-squared tests.
    Why?
  • Pearson
  • Yates
  • McNemar
  • Portmanteau test
  • In R: ?chisq.test
  • You can think of it as a multiple-category
    Fisher's test
  • The assumptions break down if a cell has fewer
    than 5 values

46
Chi-Squared Test Biological Example
  • Comparing sex across different tumour subtypes

                          Female  Male
Adenocarcinoma               192   250
Squamous Cell Carcinoma      202   261
Small Cell Carcinoma           9    15
Neuroendocrine                12    10
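The sex-by-subtype comparison can be sketched in R as follows. The counts are from the slide; the assignment of each pair of numbers to Female/Male and to a subtype follows the order in which they appear, which is an assumption about the original table layout.

```r
counts <- matrix(c(192, 250,
                   202, 261,
                     9,  15,
                    12,  10),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Adenocarcinoma",
                                   "Squamous Cell Carcinoma",
                                   "Small Cell Carcinoma",
                                   "Neuroendocrine"),
                                 c("Female", "Male")))

sex_by_subtype <- chisq.test(counts)

print(sex_by_subtype$expected)   # check that no expected count is below 5
print(sex_by_subtype$parameter)  # degrees of freedom: (4 - 1) x (2 - 1)
print(sex_by_subtype$p.value)
```

If any expected count were below 5, `fisher.test(counts)` would be a reasonable fallback.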
47
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I: continuous
  • Lecture 3: Univariate Analyses II: discrete
  • Lecture 4: Multivariate Analyses I: specialized
    models
  • Lecture 5: Multivariate Analyses II: general
    models
  • Lecture 6: Sequence Analysis
  • Lecture 7: Microarray Analysis I: Pre-Processing
  • Lecture 8: Microarray Analysis II:
    Multiple-Testing
  • Lecture 9: Machine-Learning
  • Final Exam (written)