Canadian Bioinformatics Workshops

- www.bioinformatics.ca


Lecture 3: Univariate Analyses, Discrete Data

MBP1010, Dr. Paul C. Boutros, Winter 2014

Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (430 BCE)

DEPARTMENT OF MEDICAL BIOPHYSICS

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others

Course Overview

- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I: continuous
- Lecture 3: Univariate Analyses II: discrete
- Lecture 4: Multivariate Analyses I: specialized models
- Lecture 5: Multivariate Analyses II: general models
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I: Pre-Processing
- Lecture 8: Microarray Analysis II: Multiple-Testing
- Lecture 9: Machine-Learning
- Final Exam (written)

How Will You Be Graded?

- 9%: Participation (1% per week)
- 56%: Assignments (8 x 7% each)
- 35%: Final Examination (in-class)
- Each individual will get their own, unique assignment
- Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)
- Final Exam will include multiple-choice and written answers

Course Information Updates

- Website will have up-to-date information, lecture notes, sample source-code from class, etc.
- http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html
- Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
- New TA (focusing on bioinformatics component) will be Irakli (Erik) Dzneladze
- Assignment 1 is released today, due on January 30
- Assignment 2 will be released on January 31, due Feb 7
- Updated course-schedule on website

House Rules

- Cell phones to silent
- No side conversations
- Hands up for questions

Review From Lecture 1

- Population vs. Sample

All MBP students: population. MBP students in 1010: sample.

- How do you report statistical information?

P-value, variance, effect-size, sample-size, test

- Why don't we use Excel/spreadsheets?

Input errors, reproducibility, wrong results

Review From Lecture 2

- Define discrete data

Gaps on the number-line (continuous data have no gaps)

- What is the central limit theorem?

A random variable that is the sum of many small, independent random variables is approximately normally distributed

- Theoretical vs. empirical quantiles

Probability vs. percentage of values less than p

- Components of a boxplot?

25% quantile - 1.5 x IQR, 25% quantile, 50% quantile (median), 75% quantile, 75% quantile + 1.5 x IQR

Boxplot

Descriptive statistics can be intuitively summarized in a boxplot.

> boxplot(x)

Figure labels, top to bottom: 75% quantile + 1.5 x IQR, 75% quantile, median, 25% quantile, 25% quantile - 1.5 x IQR.

IQR = interquartile range = 75% quantile - 25% quantile.

Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier".
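The boxplot components can be computed by hand. The following is a minimal sketch on simulated data; the seed, the sample and the planted extreme value of 6 are illustrative assumptions, not values from the lecture. Note that boxplot() itself uses hinges from fivenum(), which can differ slightly from quantile().

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- c(rnorm(50), 6)               # 50 normal values plus one planted extreme value

q   <- quantile(x, c(0.25, 0.5, 0.75))   # 25% quantile, median, 75% quantile
iqr <- IQR(x)                            # 75% quantile - 25% quantile

lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values beyond the fences are the ones flagged as "outliers"
outliers <- x[x < lower_fence | x > upper_fence]
outliers

boxplot(x)
```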

Review From Lecture 2

- How can you interpret a QQ plot?

Compares two samples, or a sample and a distribution. A straight line indicates identity.

- What is hypothesis testing?

Confirmatory data-analysis: test a null hypothesis

- What is a p-value?

Evidence against the null hypothesis: the probability of seeing a value at least as extreme by chance alone if the null is true (loosely, the probability of a false positive)

Review From Lecture 2

- Parametric vs. non-parametric tests

Parametric tests have distributional assumptions

- What is the t-statistic?

Signal:noise ratio

- Assumptions of the t-test?

Data sampled from a normal distribution; independence of replicates; independence of groups; homoscedasticity
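To make the signal:noise reading of the t-statistic concrete, here is a small sketch that computes the homoscedastic two-sample statistic by hand and compares it with t.test(). The two vectors are made-up measurements used only for illustration.

```r
# Made-up measurements for two groups (illustrative only)
x <- c(5.1, 4.9, 5.6, 5.8, 6.0)
y <- c(4.2, 4.8, 4.5, 5.0, 4.4)

# Signal: difference between the group means
signal <- mean(x) - mean(y)

# Noise: standard error of that difference, using the pooled variance
n_x <- length(x)
n_y <- length(y)
pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
noise <- sqrt(pooled_var * (1 / n_x + 1 / n_y))

t_manual  <- signal / noise
t_builtin <- unname(t.test(x, y, var.equal = TRUE)$statistic)

c(manual = t_manual, builtin = t_builtin)
```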

Flow-Chart For Two-Sample Tests

Is the data sampled from a normally-distributed population?

- Yes: Equal variance (F-test)?
  - Yes: Homoscedastic t-test
  - No: Heteroscedastic t-test
- No: Sufficient n for CLT (>30)?
  - Yes: continue to the equal-variance check
  - No: Wilcoxon U-test
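One way to read the flow-chart is as a small decision function. The sketch below is an interpretation, not the lecture's own code: the helper name choose_two_sample_test is hypothetical, the Shapiro-Wilk test is an assumed stand-in for the normality question, and a non-normal sample with n > 30 is assumed to rejoin the equal-variance branch.

```r
# Hypothetical helper following the two-sample flow-chart.
# Assumption: shapiro.test() answers the "normally distributed?" question.
choose_two_sample_test <- function(x, y, alpha = 0.05) {
  looks_normal <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  enough_for_clt <- min(length(x), length(y)) > 30

  # Neither normal nor large enough for the CLT: fall back to the rank test
  if (!looks_normal && !enough_for_clt) {
    return(wilcox.test(x, y))
  }

  # F-test for equal variances decides between the two t-tests
  equal_variance <- var.test(x, y)$p.value > alpha
  t.test(x, y, var.equal = equal_variance)
}

# Small, strongly skewed samples (made-up values)
skewed_x <- c(1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 50)
skewed_y <- c(2, 2.2, 2.4, 2.6, 2.8, 3, 3.2, 3.4, 60)
result <- choose_two_sample_test(skewed_x, skewed_y)
result$method
```

Choosing a test from the same data that is then tested is a simplification; the sketch only mirrors the order of the questions in the chart.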

Topics For This Week

- Correlations
- ceRNAs
- Attendance
- Common discrete univariate analyses

Power, error rates and decision

Power calculation in R

> power.t.test(n = 5, delta = 1, sd = 2, alternative = "two.sided", type = "one.sample")

     One-sample t test power calculation

              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided

Other tests are available: see ??power.
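power.t.test() can also be run in the other direction: leave n out and supply a target power to get the required sample size. The sketch below reuses the effect size and standard deviation from the call above; the 80% power target is an assumed, conventional choice rather than a value from the lecture.

```r
# Power with n = 5, as in the call above
p5 <- power.t.test(n = 5, delta = 1, sd = 2,
                   alternative = "two.sided", type = "one.sample")
p5$power

# Solve for n instead: omit n and give the desired power (80% is an assumption)
need <- power.t.test(delta = 1, sd = 2, power = 0.8,
                     alternative = "two.sided", type = "one.sample")

# Round up, since a fractional sample cannot be collected
n_required <- ceiling(need$n)
n_required
```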

Power, error rates and decision

Pr(False Negative) = Pr(Type II error)

Pr(False Positive) = Pr(Type I error)

(Figure axis labels: µ0, µ1)

Let's Try Some Power Analyses in R

Problem

- When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is e.g. a trend for one variable to increase as the other increases.
- Regression analyses assess the dependence.
- Examples
  - Height vs. weight
  - Gene dosage vs. expression level
  - Survival analysis: probability of death vs. age

Correlation

When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.) In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance). Pearson's coefficient of correlation ranges from -1 to 1, with 0 indicating no linear correlation.
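The phrase "normalized measure of covariance" can be checked directly: dividing cov() by the product of the two standard deviations reproduces cor(). A minimal sketch on simulated data (the seed and the linear model are arbitrary choices):

```r
set.seed(42)                 # arbitrary seed
x <- rnorm(50)
y <- 2 * x + rnorm(50)       # y depends on x, plus noise

# Pearson's coefficient is the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

c(manual = r_manual, builtin = cor(x, y))
```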

Pearson's Coefficient of Correlation

How to interpret the correlation coefficient

Explore varying degrees of randomness ...

> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9999666

Pearson's Coefficient of Correlation

Varying degrees of randomness ...

> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.9661111

Pearson's Coefficient of Correlation

Varying degrees of randomness ...

> x <- rnorm(50)
> r <- 0.4
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.6652423

Pearson's Coefficient of Correlation

Varying degrees of randomness ...

> x <- rnorm(50)
> r <- 0.01
> y <- (r * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.01232522
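The four simulations above can be wrapped in one function. In this sketch the names simulate_cor and implied_cor are hypothetical helpers, and the seed is arbitrary. One point worth seeing: r is a mixing weight, not the correlation itself. For independent standard-normal x and noise, the population correlation of this construction is r / sqrt(r^2 + (1 - r)^2), which is why r = 0.8 gives a sample correlation near 0.97 rather than 0.8.

```r
# Hypothetical helper: repeat the slide's simulation for a given weight r
simulate_cor <- function(r, n = 50) {
  x <- rnorm(n)
  y <- (r * x) + ((1 - r) * rnorm(n))
  cor(x, y)
}

# Population correlation implied by the mixing weight r
implied_cor <- function(r) {
  r / sqrt(r^2 + (1 - r)^2)
}

set.seed(2014)                       # arbitrary seed
weights <- c(0.99, 0.8, 0.4, 0.01)

data.frame(
  r        = weights,
  observed = sapply(weights, simulate_cor),
  implied  = implied_cor(weights)
)
```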

Pearson's Coefficient of Correlation

Non-linear relationships ...

> x <- runif(50, -1, 1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x * pi)) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.3438495

Pearson's Coefficient of Correlation

Non-linear relationships ...

> x <- runif(50, -1, 1)
> r <- 0.9
> # polynomial ...
> y <- (r * x * x) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] -0.5024503

Pearson's Coefficient of Correlation

Non-linear relationships ...

> x <- runif(50, -1, 1)
> r <- 0.9
> # exponential
> y <- (r * exp(5 * x)) + ((1 - r) * rnorm(50))
> plot(x, y); cor(x, y)
[1] 0.6334732

Pearson's Coefficient of Correlation

Non-linear relationships ...

> x <- runif(50, -1, 1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x * pi)) + ((1 - r) * rnorm(50))
> b <- (r * sin(x * pi)) + ((1 - r) * rnorm(50))
> plot(a, b); cor(a, b)
[1] 0.04531711

Correlation coefficient

Other Correlations

- There are many other types of correlations
- Spearman's correlation
  - rho
- Kendall's correlation
  - tau
- Spearman is a Pearson on ranked values
- Spearman's rho = 1 means a perfectly monotonic (increasing) relationship
- Pearson's R = 1 means a perfectly linear (increasing) relationship
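The difference between monotonic and linear shows up on a noise-free exponential curve, similar to the exponential example earlier but with the noise term dropped (an assumption made here so the relationship is exactly monotonic). The seed is arbitrary.

```r
set.seed(7)                     # arbitrary seed
x <- runif(50, -1, 1)
y <- exp(5 * x)                 # strictly increasing in x, but far from linear

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Spearman is Pearson computed on the ranks
spearman_by_hand <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, kendall = kendall,
  spearman_by_hand = spearman_by_hand)
```

The rank-based coefficients reach 1 because the ordering of y matches the ordering of x exactly, while Pearson's coefficient stays below 1 because the points do not lie on a straight line.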

When Do We Use Statistics?

- Ubiquitous in modern biology
- Every class I will show a use of statistics in a (very, very) recent Nature paper.

January 9, 2014

Non-Small Cell Lung Cancer 101

15% 5-year survival

Lung Cancer
- Non-Small Cell (80% of lung cancer)
  - Adenocarcinomas
  - Squamous Cell Carcinomas
  - Large Cell (and others)
- Small Cell

Non-Small Cell Lung Cancer 102

- Stage I: Local Tumour Only (IA: small tumour; IB: large tumour)
- Stage II: Local Lymph Nodes
- Stage III: Distal Lymph Nodes
- Stage IV: Metastasis

General Idea: HMGA2 is a ceRNA

What are ceRNAs?

Salmena et al. Cell 2011

Test Multiple Constructs for Activity

What Statistical Analysis Did They Do?

- No information given in main text!
- Figure legend says:
  - "Values are technical triplicates, have been performed independently three times, and represent mean +/- standard deviation (s.d.) with propagated error."
- In supplementary they say:
  - "Unless otherwise specified, statistical significance was assessed by the Student's t-test"
- So, what would you do differently?

Attendance Break

Let's Go Back to Discrete vs. Continuous

- Definition?
- Let's take a few examples of discrete univariate statistical analyses in biology and write them down here
  - Cell counts
  - Embryo pigmentation: yes/no with morpholino
  - SNP calling
  - Immunohistochemistry
  - Colony formations

Four Main Discrete Univariate Tests

- Hypergeometric test
  - Is a sample randomly selected from a fixed population?
- Proportion test
  - Are two proportions equivalent?
- Fisher's Exact test
  - Are two binary classifications associated?
- (Pearson's) Chi-Squared Test
  - Are paired observations on two variables independent?

Hypergeometric Test

- Is a sample randomly selected from a fixed population?
- Closer to discrete mathematics than statistics
- Technically: sampling without replacement
- In R: ?phyper
- Classic example: marbles
- Less classic: poker

(Figure labels: 5/24 are yellow; 1/6 sampled are yellow)
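For the marble example, dhyper() and phyper() give the point and cumulative probabilities. This sketch assumes the figure labels mean an urn of 24 marbles with 5 yellow, from which 6 are drawn and 1 is yellow; that reading of the figure is an assumption.

```r
yellow_in_urn <- 5     # "successes" in the population
other_in_urn  <- 19    # 24 marbles in total, so 19 are not yellow
drawn         <- 6     # sample size, drawn without replacement

# Probability of drawing exactly 1 yellow marble
p_exactly_one <- dhyper(1, yellow_in_urn, other_in_urn, drawn)

# Probability of drawing 1 or fewer yellow marbles
p_at_most_one <- phyper(1, yellow_in_urn, other_in_urn, drawn)

# The same point probability from first principles (counting combinations)
p_by_counting <- choose(5, 1) * choose(19, 5) / choose(24, 6)

c(exactly_one = p_exactly_one, at_most_one = p_at_most_one,
  by_counting = p_by_counting)
```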

Hypergeometric Test: Biological Example

- Classic example in genomics: pathway analysis
  - I do a screen and identify n genes associated with something
  - Are those n genes biased towards a pathway?
  - Well, a pathway contains m genes
  - So is n a random selection of m? Hypergeometric test!
- Similar example: drug screening
  - I test 1000 drugs to see which ones kill a cell-line
  - 100 of these are kinase inhibitors
  - 100 drugs kill my cell-line
  - 30 of these are kinase inhibitors
  - Did I find more kinase inhibitors than expected by chance?
- Let's do the calculation
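The drug-screening calculation can be sketched with phyper(), using the counts from the example. The question is one-sided (at least 30 kinase inhibitors among the 100 killers), so the upper tail is taken starting just below the observed count.

```r
total_drugs       <- 1000
kinase_inhibitors <- 100    # "successes" in the population
killers           <- 100    # size of the sample (drugs that kill the cell-line)
observed_overlap  <- 30     # kinase inhibitors among the killers

# Overlap expected if killers were a random draw from all drugs
expected_overlap <- killers * kinase_inhibitors / total_drugs

# P(X >= 30): lower.tail = FALSE gives P(X > q), so use q = observed - 1
p_value <- phyper(observed_overlap - 1,
                  kinase_inhibitors,
                  total_drugs - kinase_inhibitors,
                  killers,
                  lower.tail = FALSE)

c(expected = expected_overlap, p_value = p_value)
```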

Hypergeometric: Venn Diagram Overlap

Let's pretend X and Y are sets of genes (or drugs, etc.) found in two separate experiments. We want to know: is there more overlap than expected by chance? To do this:

- Total balls: total number of genes considered (but a gene must be analyzed in both experiments; exclude those studied in only one)
- Black balls: all genes found in experiment X
- White balls: all genes not found in experiment X
- Sample: all genes found in experiment Y
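The urn mapping above translates into a short function. This is a sketch: overlap_p_value is a hypothetical helper, and the gene names and set sizes are invented for illustration. Note that R's phyper() calls the category being counted "white balls"; here that is the set of genes found in experiment X.

```r
# Hypothetical helper: is the overlap of x and y larger than expected by chance?
overlap_p_value <- function(x, y, universe) {
  # Keep only genes that were analyzed in both experiments
  x <- intersect(x, universe)
  y <- intersect(y, universe)
  overlap <- length(intersect(x, y))

  # P(overlap >= observed) when y is a random draw from the universe
  phyper(overlap - 1,
         length(x),                      # genes found in X
         length(universe) - length(x),   # genes not found in X
         length(y),                      # sample: genes found in Y
         lower.tail = FALSE)
}

# Invented example: 1000 genes, two hits lists of 100 sharing 30 genes
universe <- paste0("gene", 1:1000)
set_x <- universe[1:100]
set_y <- universe[71:170]

p_overlap <- overlap_p_value(set_x, set_y, universe)
p_overlap
```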

Proportion Test

- Are two proportions equivalent?
- Example: is the fraction of people who play hockey in MBP different from the fraction who play hockey in Mathematics?
  - Mathematics: 12/85
  - MBP: 24/135
- In R: prop.test
- Only useful for two-group studies
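The hockey example maps directly onto prop.test(), which takes the counts of "successes" and the group sizes. A minimal sketch using the numbers from the slide:

```r
hockey_players <- c(Mathematics = 12, MBP = 24)    # people who play hockey
group_sizes    <- c(Mathematics = 85, MBP = 135)   # people asked in each group

hockey_test <- prop.test(hockey_players, group_sizes)

hockey_test$estimate   # the two observed proportions
hockey_test$p.value    # evidence against "the proportions are equal"
```

By default prop.test() applies Yates' continuity correction; passing correct = FALSE turns it off.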

Proportion Test: Biological Example

- Does the frequency of TP53 mutations differ between prostate cancer patients who will suffer a recurrence and those who will not?
  - 12/150 patients whose tumours recur have mutated TP53
  - 50/921 patients whose tumours do not recur have mutated TP53
- P-value guesses?
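After guessing, the TP53 numbers can be checked the same way. The sketch below uses the counts from the slide; fisher.test() on the matching 2 x 2 table is included as an assumed cross-check, not something the slide asks for.

```r
tp53_mutated <- c(recur = 12, no_recur = 50)
patients     <- c(recur = 150, no_recur = 921)

tp53_test <- prop.test(tp53_mutated, patients)
tp53_test$estimate
tp53_test$p.value

# Cross-check with an exact test on the same counts (mutated vs. wild-type)
tp53_table <- rbind(recur    = c(mutated = 12, wild_type = 150 - 12),
                    no_recur = c(mutated = 50, wild_type = 921 - 50))
fisher.test(tp53_table)$p.value
```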

Fisher's Exact Test

- Are two binary categorizations associated?
- Based on a contingency table
- What are these? Have we seen any before?
- In R: ?fisher.test
- Classic example: drinking tea

Dr. Muriel Bristol claimed to be able to taste whether tea or milk was added first to a cup. Dr. Ronald Fisher didn't believe her.

[2 x 2 contingency table; cell counts as extracted: 0, 4, 0, 4]
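The tea-tasting experiment can be run through fisher.test(). This sketch assumes the design usually described for this example: eight cups, four of each kind, and every cup identified correctly, so the table has 4s on the diagonal and 0s elsewhere. The row and column labels are illustrative.

```r
# Assumed outcome: all eight cups classified correctly
tea <- matrix(c(4, 0,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(Guess = c("Milk first", "Tea first"),
                              Truth = c("Milk first", "Tea first")))

# One-sided test: is she better than chance at matching guess to truth?
tea_test <- fisher.test(tea, alternative = "greater")
tea_test$p.value

# Only 1 of the choose(8, 4) ways to pick four cups is fully correct
1 / choose(8, 4)
```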

Fisher's Exact Test: Biological Example

- You can use this any time you form a contingency table
- Any time you make predictions (biomarkers)
- Any time you compare two binary phenomena
- Examples?

(Pearson's) Chi-Squared Test

- Are two variables independent?
- There are a lot of different chi-squared tests. Why?
  - Pearson
  - Yates
  - McNemar
  - Portmanteau test
- In R: ?chisq.test
- You can think of it as a multiple-category Fisher's test
- The assumptions break down if <5 values in a cell

Chi-Squared Test: Biological Example

- Comparing sex across different tumour subtypes

                          Female   Male
Adenocarcinoma               192    250
Squamous Cell Carcinoma      202    261
Small Cell Carcinoma           9     15
Neuroendocrine                12     10
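The sex-by-subtype comparison can be sketched with chisq.test(). The assignment of counts to rows is an assumption about how the slide's table reads (in particular 9/15 for small cell carcinoma and 12/10 for neuroendocrine).

```r
# Counts as read from the slide; row assignment is an assumption
subtype_by_sex <- matrix(c(192, 250,
                           202, 261,
                             9,  15,
                            12,  10),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(
                           Subtype = c("Adenocarcinoma",
                                       "Squamous Cell Carcinoma",
                                       "Small Cell Carcinoma",
                                       "Neuroendocrine"),
                           Sex = c("Female", "Male")))

sex_test <- chisq.test(subtype_by_sex)

sex_test$expected    # expected counts under independence
sex_test$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
sex_test$p.value
```

The table of expected counts is also where the rule of thumb about small cells can be checked before trusting the p-value.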
