Introduction to Biostatistical Analysis Using R
Statistics course for PhD students in Veterinary Sciences (PowerPoint transcript)
1
Introduction to Biostatistical Analysis Using R
Statistics course for PhD students in Veterinary Sciences
Session 2 Lecture: Introduction to statistical hypothesis testing. Null and alternative hypotheses. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Model simplification.
Lecturer: Lorenzo Marini, PhD
Department of Environmental Agronomy and Crop Production, University of Padova, Viale dell'Università 16, 35020 Legnaro, Padova.
E-mail: lorenzo.marini@unipd.it  Tel. +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
2
Inference
A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing answers only the question: "How well do the findings fit the possibility that chance factors alone might be responsible?"
[Diagram: a Population is sampled to give a Sample; estimation (uncertainty!!!) yields a Statistical Model, which is then used for testing against the population]
3
Key concepts: Session 1
  • Statistical testing in five steps
  • 1. Construct a null hypothesis (H0)
  • 2. Choose a statistical analysis (assumptions!!!)
  • 3. Collect the data (sampling)
  • 4. Calculate P-value and test statistic
  • 5. Reject/accept (H0) if P is small/large

Remember the order!!!
Concept of replication vs. pseudoreplication:
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)

[Figure: scatterplot of n = 6 points illustrating the key quantities: an observation yi, the mean of y, and the residual between them, plotted against x]
4
Hypothesis testing
  • 1. Hypothesis formulation (null hypothesis H0
    vs. alternative hypothesis H1)
  • 2. Compute the probability P that a particular
    result we observed (e.g. a difference in means)
    could have occurred by chance, if the null
    hypothesis were true
  • 3. In short, P is a measure of the credibility
    of the null hypothesis
  • 4. If this probability P is lower than a defined
    threshold (e.g. < 0.05), we can reject the null
    hypothesis

5
Wrong conclusions: Type 1 and Type 2 errors
Hypothesis testing: types of error

Actual situation    Reject H0                              Accept H0
Effect              Correct (effect detected)              Type 2 error (β): effect not detected
No effect           Type 1 error (α): effect detected,     Correct (no effect detected,
                    but none exists                        none exists)
As power increases, the chances of a Type 2 error decrease.
Statistical power depends on: the statistical significance criterion used in the test; the size of the difference or the strength of the similarity (effect size) in the population; and the sensitivity of the data.
6
Statistical analyses
Mean comparisons for 2 populations: test the difference between the means of two samples.

Correlation: in probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence.

Analysis of count or proportion data: whole numbers or integers (not continuous, with different distributional properties) or proportions.
7
Mean comparisons for 2 samples
The t test
H0: means do not differ. H1: means differ.
  • Assumptions:
  • Independence of cases (work with true
    replications!!!) - this is a requirement of the
    design.
  • Normality - the distributions in each of the
    groups are normal.
  • Homogeneity of variances - the variance of data
    in the groups should be the same (use Fisher's F
    test or the Fligner-Killeen test for homogeneity
    of variances).
  • Together, these form the common assumption that
    the errors are independently, identically, and
    normally distributed.

8
Normality
Before we can carry out a test assuming normality of the data, we need to check our distribution (not always beforehand!!!).
Graphical analysis
In many cases we must check this assumption after having fitted the model (e.g. regression or multifactorial ANOVA).
hist(y); lines(density(y))
library(car); qq.plot(y)   # qqPlot() in recent versions of car; or qqnorm(y) in base R
RESIDUALS MUST BE NORMAL
Tests for normality:
Shapiro-Wilk normality test: shapiro.test()
Skewness and kurtosis (t tests)
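A minimal sketch of these checks on simulated data (y here is a hypothetical response vector):

set.seed(1)
y <- rnorm(100, mean = 10, sd = 2)   # hypothetical response

hist(y, freq = FALSE)                # histogram on the density scale
lines(density(y))                    # kernel density overlay
qqnorm(y); qqline(y)                 # base-R Q-Q plot
shapiro.test(y)                      # large P-value: no evidence against normality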
9
Normality: histogram and Q-Q plot
10
Normality: quantile-quantile plot
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets.
11
Normality: histogram
A normal distribution must be symmetrical around the mean.
library(animation)
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq <- quincunx(balls = 2000, col.balls = rainbow(1))
barplot(freq, space = 0)   # frequency table
12
Normality
In case of non-normality, 2 possible approaches:
1. Change the distribution (use GLMs) - advanced statistics.
E.g. Poisson (count data), binomial (proportion data)
2. Data transformation:
Logarithmic (skewed data)
Square-root (counts)
Arcsine (percentages)
Probit (proportions)
Box-Cox transformation
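A quick sketch of these transformations in R (y, counts, and p are hypothetical vectors):

y <- c(1.2, 3.5, 8.9, 24.1)    # hypothetical skewed data
counts <- c(0, 3, 7, 12)       # hypothetical counts
p <- c(0.10, 0.45, 0.80)       # hypothetical proportions (0-1)

log(y)           # logarithmic (add a small constant first if zeros occur)
sqrt(counts)     # square-root
asin(sqrt(p))    # arcsine(square-root) for proportions/percentages
qnorm(p)         # probit transformation
# Box-Cox: MASS::boxcox(lm(y ~ 1)) suggests a power transformation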
13
Homogeneity of variance: two samples
Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler: it is called Fisher's F test.
To compare two variances, all you do is divide the larger variance by the smaller variance.
E.g. students on the left vs. students on the right
F calculated: F <- var(A)/var(B)
F critical: qf(0.975, nA - 1, nB - 1)
If the calculated F is larger than the critical value, we reject the null hypothesis.
The test can be carried out with the var.test() function.
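A minimal worked sketch with two hypothetical samples A and B:

set.seed(2)
A <- rnorm(15, mean = 5, sd = 1)   # hypothetical sample A (nA = 15)
B <- rnorm(12, mean = 5, sd = 2)   # hypothetical sample B (nB = 12)

F <- var(B) / var(A)                          # larger variance on top
F > qf(0.975, df1 = 12 - 1, df2 = 15 - 1)     # TRUE: reject equal variances
var.test(B, A)                                # the same test in one call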
14
Homogeneity of variance: > two samples
It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test and the Fligner-Killeen test.
bartlett.test(response, factor)
fligner.test(response, factor)
There are differences between the tests: Fisher's and Bartlett's are very sensitive to outliers, whereas Fligner-Killeen is not.
15
Mean comparison
In many cases, a researcher is interested in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters.
H0: the two means are the same. H1: the two means differ.
  • All assumptions met? Parametric: t.test(),
    with independent or paired samples.
  • Some assumptions not met? Non-parametric:
    wilcox.test(). The Wilcoxon test is a
    non-parametric alternative to Student's t-test
    for the case of two samples (rank-sum for
    independent samples, signed-rank for paired
    samples).
16
Mean comparison: 2 independent samples
Two independent samples
Students on the left
Students on the right
The two samples are statistically independent. The test can be carried out with the t.test() function.
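A minimal sketch with two hypothetical groups (scores of students on the left and on the right):

set.seed(3)
left  <- rnorm(20, mean = 24, sd = 3)   # hypothetical scores, left side
right <- rnorm(20, mean = 26, sd = 3)   # hypothetical scores, right side

# R's default is the Welch test (unequal variances);
# set var.equal = TRUE after checking homogeneity with var.test()
t.test(left, right)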
17
Mean comparison: t test for paired samples
Paired sampling in time or in space.
E.g. test your performance before and after the course: I measure twice on the same student.
Time 1: a <- c(1, 2, 3, 2, 3, 2, 2)
Time 2: b <- c(1, 2, 1, 1, 5, 1, 2)
If we have information about dependence, we have to use it!!!
The test can be carried out with the t.test() function: with it we can deal with the dependence.
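With the data above, the pairing is one extra argument (paired = TRUE matches the observations student by student):

a <- c(1, 2, 3, 2, 3, 2, 2)   # time 1, the same 7 students
b <- c(1, 2, 1, 1, 5, 1, 2)   # time 2

t.test(a, b, paired = TRUE)   # tests whether the mean within-student difference is 0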
18
Mean comparison: Wilcoxon test
A B
3 5
4 5
4 6
3 7
2 4
3 4
1 3
3 5
5 6
2 5
Rank procedure: U = n1*n2 + n1*(n1 + 1)/2 - R1, where n1 and n2 are the numbers of observations and R1 is the sum of the ranks in sample 1.
The test can be carried out with the wilcox.test() function.
NB: a correction is applied for tied ranks.
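The same table in R (values copied from the two columns above):

A <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
B <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

# exact = FALSE uses the normal approximation, which is how R
# handles the tied ranks that occur in these data
wilcox.test(A, B, exact = FALSE)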
19
Correlation
Correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables.
[Schematic: 458 sampling units (1, 2, 3, 4, ..., 458), each with plant species richness x1 ... x458 and bird species richness l1 ... l458]
Three alternative approaches:
1. Parametric - cor()
2. Nonparametric - cor()
3. Bootstrapping - replicate(), boot()
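A sketch of the three approaches with hypothetical richness vectors x (plants) and l (birds):

set.seed(4)
x <- rpois(458, lambda = 20)                 # hypothetical plant richness
l <- rpois(458, lambda = 5) + round(x / 4)   # hypothetical bird richness

cor(x, l)                                    # 1. parametric (Pearson)
cor(x, l, method = "spearman")               # 2. nonparametric (ranks)

# 3. bootstrap: resample sampling units with replacement
boot.r <- replicate(1000, {
  i <- sample(458, replace = TRUE)
  cor(x[i], l[i])
})
quantile(boot.r, c(0.025, 0.975))            # percentile confidence interval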
20
Correlation: causal relationship?
Which is the response variable in a correlation
analysis?
NONE
[Same schematic as above: 458 sampling units with plant and bird species richness]
21
Correlation
Plot the two variables in a Cartesian space
A correlation of 1 means that there is a perfect
positive LINEAR relationship between variables.
A correlation of -1 means that there is a
perfect negative LINEAR relationship between
variables. A correlation of 0 means there is no
LINEAR relationship between the two variables.
22
Correlation
Same correlation coefficient!
r = 0.816
23
Parametric correlation: when is it significant?
Pearson product-moment correlation coefficient r.
Hypothesis testing using the t distribution: H0: cor = 0; H1: cor ≠ 0.
t = r * sqrt(n - 2) / sqrt(1 - r^2), compared with the t critical value for d.f. = n - 2.
  • Assumptions:
  • Two random variables from random populations
  • cor() detects ONLY linear relationships (see
    the sketch below)
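A minimal sketch: cor.test() computes both the coefficient and this t test (x and l are hypothetical variables):

set.seed(5)
x <- rnorm(30)              # hypothetical variable 1
l <- 0.5 * x + rnorm(30)    # hypothetical variable 2

r <- cor(x, l)
n <- length(x)
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic, d.f. = n - 2
2 * pt(-abs(t.stat), df = n - 2)            # two-sided P-value

cor.test(x, l)              # the same test in one call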

24
Nonparametric correlation
Rank procedures: distribution-free, but with less power.
Spearman correlation coefficient (a Pearson correlation computed on ranks).
The Kendall tau rank correlation coefficient: P is the number of concordant pairs, n is the total number of pairs.
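Both rank correlations are one argument away in R (x and y hypothetical):

set.seed(6)
x <- rnorm(20)
y <- x + rnorm(20)

cor(x, y, method = "spearman")       # Spearman rho
cor(x, y, method = "kendall")        # Kendall tau
cor.test(x, y, method = "kendall")   # with a significance test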
25
Scale-dependent correlation
NB: Don't use grouped data to compute an overall correlation!!!
[Figure: data grouped in 7 sites, where the overall correlation differs from the within-site correlations]
26
Issues related to correlation
1. Temporal autocorrelation: values in close years are more similar (dependence of the data).
2. Spatial autocorrelation: values in close sites are more similar (dependence of the data).
[Maps: Moran's I ≈ 0 (random pattern) vs. Moran's I ≈ 1 (strong spatial clustering)]
Moran's I or Geary's C: measures of global spatial autocorrelation.
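A hedged sketch of global Moran's I with the ape package (hypothetical coordinates and response; inverse-distance weights are one common choice):

library(ape)                       # provides Moran.I()

set.seed(7)
xy <- cbind(runif(50), runif(50))  # hypothetical site coordinates
z  <- rnorm(50)                    # hypothetical response at each site

w <- 1 / as.matrix(dist(xy))       # inverse-distance weight matrix
diag(w) <- 0                       # no self-weights

Moran.I(z, w)                      # observed I, expected I under H0, P-value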
27
Three issues related to correlation
2. Temporal autocorrelation: values in close years are more similar (dependence of the data).
Working with time series, you are likely to have a temporal pattern in the data.
E.g. population dynamics.
Autoregressive models (not covered!)
28
Three issues related to correlation
3. Spatial autocorrelation: values in close sites are more similar (dependence of the data).
ISSUE: can we explain the spatial autocorrelation with our models?
Moran's I or Geary's C (univariate response): measures of global spatial autocorrelation, computed on the raw response or on the residuals after model fitting.
Hint: if you find spatial autocorrelation in your residuals, you should start worrying.
29
Frequency data
  • Properties of frequency data
  • Count data
  • Proportion data

Count data: we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. the number of students coming to the first lesson).
Proportion data: we know the number doing a particular thing, but also the number not doing that thing (e.g. mortality of the students who attend the first lesson but not the second).
30
Count data
Straightforward general linear methods (assuming constant variance and normal errors) are not appropriate for count data, for four main reasons:
- The linear model might lead to the prediction of negative counts.
- The variance of the response variable is likely to increase with the mean.
- The errors will not be normally distributed.
- Many zeros are difficult to handle in transformations.

Options:
- Classical tests with contingency tables
- Generalized linear models with Poisson distribution and log link function (extremely powerful and flexible!!!)
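A minimal Poisson GLM sketch with hypothetical counts y and a predictor x:

set.seed(8)
x <- runif(100, 0, 10)                           # hypothetical predictor
y <- rpois(100, lambda = exp(0.1 + 0.2 * x))     # hypothetical counts

m <- glm(y ~ x, family = poisson(link = "log"))  # log link keeps predictions positive
summary(m)                                       # coefficients are on the log scale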
31
Count data: contingency tables
We can assess the significance of the differences between observed and expected frequencies in a variety of ways:
- Pearson's chi-squared (χ²)
- G test
- Fisher's exact test

                  Group 1   Group 2   Row total (R)
Trait 1           a         b         a+b
Trait 2           c         d         c+d
Column total (C)  a+c       b+d       a+b+c+d (G)

H0: the frequencies found in the rows are independent of the frequencies in the columns.
32
Count data: contingency tables
Pearson's chi-squared: an example
Following staff training in insemination techniques in cattle, an insemination centre compared three different methods:

                   Method I   Method II   Method III   Row total (Ri)
Pregnant           275        192         261          728
Non-pregnant       78         64          123          265
Column total (Ci)  353        256         386          993 (G)
What about the sampling? Where are the potential biases?
H0: the pregnancy rate is independent of the method. Method I (275/353 ≈ 78%), Method II (192/256 = 75%), Method III (261/386 ≈ 68%)?
The test uses expected and observed frequencies. The table above presents the observed (Oi) frequencies; we need a model to get the expected (Ei) frequencies.
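The whole test in R, using the observed table above:

preg <- matrix(c(275, 192, 261,
                  78,  64, 123),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Pregnant", "Non-pregnant"),
                               c("I", "II", "III")))

chisq.test(preg)            # Pearson's chi-squared test of independence
chisq.test(preg)$expected   # the Ei computed from row and column totals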
33
Count data: contingency tables
- Pearson's chi-squared (χ²)
We need a model to define the expected frequencies (E) in the case of perfect independence: Ei = (row total * column total) / G.
χ² = Σ (Oi - Ei)² / Ei, compared with the critical value of the χ² distribution with (rows - 1)(columns - 1) d.f.
O1 = 275, E1 = (728 * 353)/993
O2 = 192, E2 = (728 * 256)/993

                   Method I   Method II   Method III   Row total (Ri)
Pregnant           275        192         261          728
Non-pregnant       78         64          123          265
Column total (Ci)  353        256         386          993 (G)
34
Count data: contingency tables
- G test
1. We need a model to define the expected frequencies (E) (many possibilities), e.g. perfect independence.
2. G = 2 Σ O ln(O / E), compared against the χ² distribution.
- Fisher's exact test: fisher.test()
Use it if expected values are less than 4 or 5.
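A quick sketch of Fisher's exact test on a small hypothetical 2 x 2 table with low expected counts:

tab <- matrix(c(3, 9,
                8, 2), nrow = 2, byrow = TRUE)   # hypothetical small counts
fisher.test(tab)   # exact P-value, no large-sample approximation needed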
35
Proportion data
Proportion data have three important properties that affect the way the data should be analyzed:
- the data are strictly bounded (0-1);
- the variance is non-constant (it depends on the mean);
- errors are non-normal.

- Classical tests with probit or arcsine transformation
- Generalized linear models with binomial distribution and logit link function (extremely powerful and flexible!!!)
36
Proportion data: traditional approach
Transform the data!
Arcsine transformation: takes care of the error distribution; p' = arcsin(sqrt(p/100)), where p are percentages (0-100).
Probit transformation: takes care of the non-linearity; p' = qnorm(p), where p are proportions (0-1).
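Both transformations in R (hypothetical vectors):

p.pct <- c(12, 45, 78, 96)          # hypothetical percentages (0-100)
p     <- c(0.12, 0.45, 0.78, 0.96)  # hypothetical proportions (0-1)

asin(sqrt(p.pct / 100))   # arcsine(square-root), in radians
qnorm(p)                  # probit = inverse standard-normal CDF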
37
Proportion data: modern analysis
An important class of problems involves data on proportions, such as studies on percentage mortality (LD50), infection rates of diseases, proportions responding to clinical treatment (bioassay), sex ratios, or in general data on proportional response to an experimental treatment.
2 approaches:
1. Transform both response and explanatory variables, or
2. Use Generalized Linear Models (GLMs) with different error distributions.
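A minimal binomial GLM sketch: the response is a two-column matrix of successes and failures (hypothetical mortality data against a hypothetical dose):

set.seed(9)
dose <- 1:10                                     # hypothetical dose levels
n    <- rep(30, 10)                              # animals tested per dose
dead <- rbinom(10, n, plogis(-3 + 0.6 * dose))   # hypothetical deaths

m <- glm(cbind(dead, n - dead) ~ dose, family = binomial(link = "logit"))
summary(m)
-coef(m)[1] / coef(m)[2]   # LD50: the dose at which mortality is 50%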
38
MODEL: generally speaking, a statistical model is a function of your explanatory variables that explains the variation in your response variable (y).
Statistical modelling
E.g. Y = a + b*x1 + c*x2 + d*x3
Y: response variable (performance of the students)
xi: explanatory variables (ability of the teacher, background, age)
The object is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data.
The best model is the one that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (many ways to reach this!).
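In R, fitting such a model and estimating a, b, c, d is a single lm() call (everything here is simulated/hypothetical):

set.seed(10)
students <- data.frame(teacher = runif(50, 1, 10),     # hypothetical predictors
                       background = runif(50, 1, 10),
                       age = runif(50, 24, 35))
students$performance <- with(students,
  2 + 0.8 * teacher + 0.5 * background + 0.1 * age + rnorm(50))

m <- lm(performance ~ teacher + background + age, data = students)
coef(m)      # the estimated a, b, c, d
summary(m)   # tests whether each parameter differs from zero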
39
Getting started with complex statistical modelling
Statistical modelling
It is essential that you can answer the following questions:
- Which of your variables is the response variable?
- Which are the explanatory variables?
- Are the explanatory variables continuous or categorical, or a mixture of both?
- What kind of response variable do you have: a continuous measurement, a count, a proportion, a time at death, or a category?
40
Statistical modelling: multicollinearity
1. Multicollinearity: correlation between predictors in non-orthogonal multiple linear models.
Confounding effects are difficult to separate, and the variables are not independent.
This makes an important difference to our statistical modelling because, in orthogonal designs, the variation attributed to a given factor is constant and does not depend upon the order in which factors are removed from the model. In contrast, with non-orthogonal data, the variation attributable to a given factor does depend upon the order in which factors are removed from the model.
The order of variable selection makes a huge
difference (please wait for session 4!!!)
41
Statistical modelling: multicollinearity
1. Multicollinearity
E.g. Y = a + b*x1 + c*x2 + d*x3
Y: response variable (performance of the students)
xi: explanatory variables (ability of the teacher, background, age)
Do you see potential collinearity here?
Collinearity is a major problem in situations where we cannot control all the factors; it is typical of multiple regression.
42
Getting started with complex statistical modelling
Statistical modelling
The explanatory variables:
(a) All explanatory variables continuous - Regression
(b) All explanatory variables categorical - Analysis of variance (ANOVA)
(c) Explanatory variables both continuous and categorical - Analysis of covariance (ANCOVA)
The response variable:
(a) Continuous - Normal regression, ANOVA or ANCOVA
(b) Proportion - Logistic regression, GLM logit-linear models
(c) Count - GLM log-linear models
(d) Binary - GLM binary logistic analysis
(e) Time at death - Survival analysis
43
Statistical modelling
Each analysis estimates a MODEL.
You want the model to be minimal (parsimony) and adequate (it must describe a significant fraction of the variation in the data). It is very important to understand that there is not just one model:
given the data, and given our choice of model, what values of the parameters of that model make the observed data most likely?
Model building: estimation of the parameters (slopes and levels of factors)
Occam's Razor
44
Statistical modelling
Occam's Razor
- Models should have as few parameters as possible;
- linear models should be preferred to non-linear models;
- experiments relying on few assumptions should be preferred to those relying on many;
- models should be pared down until they are minimal adequate;
- simple explanations should be preferred to complex explanations.
MODEL SIMPLIFICATION
The process of model simplification is an
integral part of hypothesis testing in R. In
general, a variable is retained in the model only
if it causes a significant increase in deviance
when it is removed from the current model.
45
Statistical modelling: model simplification
Parsimony requires that the model should be as simple as possible. This means that the model should not contain any redundant parameters or factor levels.
Model simplification (see the sketch below):
- remove non-significant interaction terms;
- remove non-significant quadratic or other non-linear terms;
- remove non-significant explanatory variables;
- group together factor levels that do not differ from one another;
- in ANCOVA, set non-significant slopes of continuous explanatory variables to zero.
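A hedged sketch of one simplification step with update() and anova() (hypothetical data; the F test compares the deviance with and without the interaction):

set.seed(11)
d <- data.frame(x = runif(60), g = gl(2, 30))
d$y <- 1 + 2 * d$x + as.numeric(d$g) + rnorm(60)   # no true interaction

m1 <- lm(y ~ x * g, data = d)    # maximal model, with interaction
m2 <- update(m1, . ~ . - x:g)    # remove the non-significant interaction
anova(m1, m2)                    # no significant increase in deviance: keep m2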