Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students (presentation transcript)
1
Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Session 4. Lecture: Regression Analysis. Practical: multiple regression.
Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell'Università 16, 35020 Legnaro, Padova.
E-mail: lorenzo.marini@unipd.it  Tel.: +39 049 8272807  Skype: lorenzo.marini
http://www.biodiversity-lorenzomarini.eu/
2
Statistical modelling: more than one parameter

Nature of the response variable:
- NORMAL: General Linear Models
- POISSON, BINOMIAL: Generalized Linear Models (GLM)

General Linear Models, by the nature of the explanatory variables:
- Categorical: ANOVA (Session 3)
- Continuous: Regression (Session 4)
- Categorical + continuous: ANCOVA
3
REGRESSION

Linear methods:
- Simple linear: one X, linear relation
- Multiple linear: two or more Xi, linear relation
- Polynomial: one X but more slopes, non-linear relation

Non-linear methods:
- Non-linear: one X, complex relation
4
LINEAR REGRESSION: lm()

Regression analysis is a technique for modelling and analysing numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).

Assumptions:
- Independence: the Y-values and the error terms must be independent of each other.
- Linearity between Y and X.
- Normality: the populations of Y-values and the error terms are normally distributed for each level of the predictor variable x.
- Homogeneity of variance: the populations of Y-values and the error terms have the same variance at each level of the predictor variable x.

(Don't test the raw data for normality or heteroscedasticity; check the residuals instead!)
5
LINEAR REGRESSION: lm()

AIMS:
1. To describe the linear relationship between Y and the Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with the Xi.
2. To predict new values of Y from new values of the Xi (PREDICTIVE APPROACH).

Model: yi = a + βxi + εi   (Y: response; Xi: predictors)

We estimate one INTERCEPT and one or more SLOPES.
6
SIMPLE LINEAR REGRESSION

Simple LINEAR regression step by step:
I step: check linearity (visualization with plot())
II step: estimate the parameters (one slope and one intercept)
III step: check the residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)
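A minimal R sketch of the three steps on simulated data (the variables x and y and the numbers below are illustrative, not part of the original slides):

## simulated data for illustration
set.seed(42)
x <- runif(30, 0, 10)
y <- 2 + 0.8 * x + rnorm(30, sd = 1)

plot(x, y)                 ## I step: check linearity visually
model <- lm(y ~ x)         ## II step: estimate intercept and slope
summary(model)
par(mfrow = c(2, 2))
plot(model)                ## III step: residual diagnostics (normality, variance)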
7
SIMPLE LINEAR REGRESSION

Normality: do not test normality over the whole set of y values; the assumption applies to the y values (error terms) at each level of x.
8
SIMPLE LINEAR REGRESSION

MODEL: the model gives the fitted values
ŷi = a + βxi

SLOPE: β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
INTERCEPT: a = ȳ − βx̄

The model does not explain everything:
RESIDUALS: residual i = observed yi − fitted ŷi
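A sketch of these formulas in R, reusing the x and y simulated above (variable names are illustrative); the estimates match coef(lm(y ~ x)):

beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  ## slope
alpha <- mean(y) - beta * mean(x)                                    ## intercept
y_fit <- alpha + beta * x                                            ## fitted values
y_res <- y - y_fit                                                   ## residuals = observed - fitted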
9
SIMPLE LINEAR REGRESSION

Least-squares regression explained with the animation package:

library(animation)

## Slope changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50
            candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

## Intercept changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the intercept,
            so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()
10
SIMPLE LINEAR REGRESSION

Parameter t testing

Hypothesis testing: H0: β = 0 (there is no relation between X and Y); H1: β ≠ 0.

Parameter t testing (tests each single parameter!): we must measure the unreliability associated with each of the estimated parameters, i.e. we need the standard errors:

SE(β) = sqrt[ (residual SS / (n − 2)) / Σ(xi − x̄)² ]
t = (β − 0) / SE(β)
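A hand computation of SE(β) and t in R, assuming the beta and residuals from the sketch above; compare with summary(lm(y ~ x))$coefficients:

n       <- length(x)
rss     <- sum(y_res^2)                                   ## residual SS
se_beta <- sqrt((rss / (n - 2)) / sum((x - mean(x))^2))   ## standard error of the slope
t_beta  <- (beta - 0) / se_beta                           ## t statistic
p_beta  <- 2 * pt(-abs(t_beta), df = n - 2)               ## two-sided p-value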
11
SIMPLE LINEAR REGRESSION

If the model is significant (β ≠ 0): how much variation is explained?

Measure of goodness of fit (explained variation):
Total SS = Σ(yobserved,i − ȳ)²
Model SS = Σ(yfitted,i − ȳ)²
Residual SS = Total SS − Model SS

R² = Model SS / Total SS

R² does not provide information about the significance.
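The same decomposition in R, continuing the sketch above; the result equals summary(lm(y ~ x))$r.squared:

total_ss <- sum((y - mean(y))^2)       ## total variation in y
model_ss <- sum((y_fit - mean(y))^2)   ## variation explained by the model
r2 <- model_ss / total_ss              ## proportion of variation explained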
12
SIMPLE LINEAR REGRESSION: example 1

If the model is significant, then check the model:

1. Linearity between X and Y? OK: no patterns in the residuals vs. predictor plot.
13
SIMPLE LINEAR REGRESSION: example 1

2. Normality of the residuals: Q-Q plot and Shapiro-Wilk test on the residuals. OK:

> shapiro.test(residuals)
        Shapiro-Wilk normality test
data:  residuals
W = 0.9669, p-value = 0.2461
14
SIMPLE LINEAR REGRESSION: example 1

3. Homoscedasticity: regress the absolute residuals on the fitted values. OK (no significant trend):

Call: lm(formula = abs(residuals) ~ yfitted)
Coefficients:
             Estimate       SE       t      P
(Intercept)   2.17676  2.04315   1.065  0.293
yfitted       0.11428  0.07636   1.497  0.142
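A sketch of the call that would produce this kind of output, with model as fitted in the earlier sketch (names are illustrative):

het_check <- lm(abs(residuals(model)) ~ fitted(model))  ## |residuals| vs fitted
summary(het_check)   ## a significant slope would suggest heteroscedasticity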
15
SIMPLE LINEAR REGRESSION: example 2

1. Linearity between X and Y? NO: the scatterplot and the residual plots show clear patterns, so there is NO LINEARITY between X and Y.
16
SIMPLE LINEAR REGRESSION: example 2

2. Normality of the residuals: Q-Q plot and Shapiro-Wilk test on the residuals. NO:

> shapiro.test(residuals)
        Shapiro-Wilk normality test
data:  residuals
W = 0.8994, p-value = 0.001199
17
SIMPLE LINEAR REGRESSION: example 2

3. Homoscedasticity? NO: the spread of the residuals changes with the fitted values.
18
SIMPLE LINEAR REGRESSION: example 2

How to deal with non-linearity and non-normality?

Transformation of the data:
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsine transformation

Polynomial regression: regression with multiple terms (linear, quadratic, and cubic) in one X variable:
Y = a + b1·X + b2·X² + b3·X³ + error
X is one variable!!!
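A sketch of such a polynomial fit in R (x and y as in the earlier simulated example); it is still a linear model in the coefficients, and I() protects the powers inside the formula:

poly_model <- lm(y ~ x + I(x^2) + I(x^3))   ## one X, three slopes
summary(poly_model)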
19
POLYNOMIAL REGRESSION: one X, n parameters

Hierarchy in the testing (always test the highest-order term first)!!!!

Fit X + X² + X³: if the cubic term is significant (P < 0.01), stop; if n.s., drop it.
Fit X + X²: if the quadratic term is significant (P < 0.01), stop; if n.s., drop it.
Fit X: if the linear term is significant (P < 0.01), stop; if n.s., there is no relation.

NB: do not delete lower-order terms even if they are non-significant.
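One way to implement this hierarchy with nested models and anova() (illustrative, using the simulated x and y from above):

cubic     <- lm(y ~ x + I(x^2) + I(x^3))
quadratic <- lm(y ~ x + I(x^2))
linear    <- lm(y ~ x)
anova(quadratic, cubic)    ## test the cubic term first
anova(linear, quadratic)   ## only if the cubic term was n.s.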
20
MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression: regression with two or more variables
Y = a + b1·X1 + b2·X2 + … + bi·Xi

Assumptions: the same assumptions as in simple linear regression!!!

There are important issues involved in carrying out a multiple regression:
- which explanatory variables to include (VARIABLE SELECTION)
- NON-LINEARITY in the response to the explanatory variables
- INTERACTIONS between explanatory variables
- correlation between explanatory variables (COLLINEARITY)
- RELATIVE IMPORTANCE of variables
21
MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression MODEL: regression with two or more variables
Y = a + b1·X1 + b2·X2 + … + bi·Xi

Each slope (bi) is a partial regression coefficient. The bi are the most important parameters of the multiple regression model: they measure the expected change in the dependent variable for a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.
22
MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression MODEL EXPANDED: we can add polynomial terms and interactions:
Y = a + linear terms + quadratic and cubic terms + interactions

QUADRATIC AND CUBIC TERMS account for NON-LINEARITY.
INTERACTIONS account for non-independent effects of the factors.
23
MULTIPLE LINEAR REGRESSION

Multiple regression step by step:
I step: check collinearity (visualization with pairs() and correlations) and check linearity
II step: variable selection and model building (different procedures to select the significant variables)
III step: check the residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)
24
MULTIPLE LINEAR REGRESSION: I STEP

I STEP: check collinearity and check linearity.

Let's begin with an example from air pollution studies: how is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?
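A sketch of step I, using R's built-in airquality data as a stand-in for the slide's ozone data (the renaming below is an assumption, not the course's actual data set):

ozone.data <- na.omit(data.frame(ozone = airquality$Ozone,
                                 rad   = airquality$Solar.R,
                                 temp  = airquality$Temp,
                                 wind  = airquality$Wind))
pairs(ozone.data, panel = panel.smooth)       ## linearity, pairwise relations
cor(ozone.data[, c("rad", "temp", "wind")])   ## collinearity among predictors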
25
MULTIPLE LINEAR REGRESSION: II STEP

II STEP: MODEL BUILDING

Start with a complex model with interactions and quadratic and cubic terms, then simplify down to the Minimum Adequate Model.

How to carry out a model simplification in multiple regression:
1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.
26
MULTIPLE LINEAR REGRESSION: II STEP

Start with the most complicated model (this is one approach):

model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2))

                 Estimate  Std.Error      t  Pr(>|t|)
(Intercept)       5.7E+02    2.1E+02   2.74      0.01
temp             -1.1E+01    4.3E+00  -2.50      0.01
wind             -3.2E+01    1.2E+01  -2.76      0.01
rad              -3.1E-01    5.6E-01  -0.56      0.58
I(rad^2)         -3.6E-04    2.6E-04  -1.41      0.16
I(temp^2)         5.8E-02    2.4E-02   2.44      0.02
I(wind^2)         6.1E-01    1.5E-01   4.16      0.00
temp:wind         2.4E-01    1.4E-01   1.74      0.09
temp:rad          8.4E-03    7.5E-03   1.12      0.27
wind:rad          2.1E-02    4.9E-02   0.42      0.68
temp:wind:rad    -4.3E-04    6.6E-04  -0.66      0.51

!!! We cannot delete the other non-significant terms yet !!!
Delete only the highest interaction: temp:wind:rad.
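A sketch of this first deletion, assuming the ozone.data frame built earlier: update() drops the term and anova() tests whether the simpler model fits significantly worse:

model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2),
             data = ozone.data)
model2 <- update(model1, ~ . - temp:wind:rad)   ## drop the highest interaction only
anova(model1, model2)   ## n.s. difference -> keep the simpler model2
summary(model2)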
27
MULTIPLE LINEAR REGRESSION: II STEP

Manual model simplification (one of many philosophies): delete the non-significant terms one by one, following the hierarchy:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms

Go from COMPLEX to SIMPLE by deletion; at each deletion, test: is the fit of the simpler model significantly worse?

IMPORTANT!!!
If a quadratic or cubic term is significant, you cannot delete the lower-order (linear or quadratic) terms even if they are not significant. If an interaction is significant, you cannot delete its linear terms even if they are not significant.
28
MULTIPLE LINEAR REGRESSION: III STEP

III STEP: we must check the assumptions. Here both checks fail: the variance tends to increase with y (NO homoscedasticity) and the errors are non-normal (NO normality).

We can transform the data (e.g. log-transformation of y):

model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2))
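A sketch of the re-check after transformation (ozone.data as in the earlier sketch):

model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2), data = ozone.data)
par(mfrow = c(2, 2))
plot(model)   ## residuals vs fitted, Q-Q plot, scale-location, leverage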
29
MULTIPLE LINEAR REGRESSION: more than one x

The log-transformation has improved our model, but maybe there is an outlier.
30
PARTIAL REGRESSION

With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent of the covariates.

WHEN?
- We would like to hold third variables constant but cannot manipulate them, so we use statistical control.

HOW?
- Statistical control is based on residuals: if we regress Y on X1 and take the residuals of Y, this part of Y is uncorrelated with X1, so anything the Y residuals correlate with cannot be explained by X1 (see the sketch below).
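A minimal sketch of statistical control by residuals on fresh simulated data (x1 is the covariate being controlled for; all names are illustrative):

set.seed(1)
x1 <- runif(50)
x2 <- 0.5 * x1 + runif(50)               ## x2 correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(50)
y_res <- residuals(lm(y ~ x1))           ## part of y uncorrelated with x1
summary(lm(y_res ~ x2))                  ## effect of x2 with x1 held constant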

31
PARTIAL REGRESSION: VARIATION PARTITIONING

Relative importance of groups of explanatory variables: R² = 76% (TOTAL EXPLAINED VARIATION). What is space and what is environment?

[Figure: map of sites (latitude vs. longitude, km) and partition of the total variation into Environment, Space, shared (Space ∩ Environment) and Unexplained fractions]

Full.model <- lm(species ~ environment_i + space_i)

Response variable: orthopteran species richness
Explanatory variables: SPACE (latitude, longitude); ENVIRONMENT (temperature, land-cover heterogeneity)
32
VARIATION PARTITIONING: varpart() in vegan

Fit the full and partial models, keeping residuals where needed:

Full.model <- lm(SPECIES ~ temp + het + lat + long)         ## TVE = 76%
Env.model <- lm(SPECIES ~ temp + het)                       ## keep env.residuals
Pure.Space.model <- lm(ENV.RESIDUALS ~ lat + long)          ## VE = 15% (pure space)
Space.model <- lm(SPECIES ~ lat + long)                     ## keep space.residuals
Pure.env.model <- lm(SPACE.RESIDUALS ~ temp + het)          ## VE = 40% (pure environment)
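vegan's varpart() automates this partitioning; a sketch on simulated data (assuming the vegan package is installed; the variable names mirror the slide, and varpart() reports the fractions as adjusted R²):

library(vegan)
set.seed(7)
df <- data.frame(temp = runif(40), het = runif(40),
                 lat  = runif(40), long = runif(40))
df$SPECIES <- with(df, 2 * temp + het + lat + rnorm(40, sd = 0.5))
vp <- varpart(df$SPECIES, ~ temp + het, ~ lat + long, data = df)
vp        ## fractions: pure environment, pure space, shared, unexplained
plot(vp)  ## Venn diagram of the fractions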
33
NON-LINEAR REGRESSION: nls()

Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters of a specific non-linear equation, and their standard errors, from data.

We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling.

In place of lm() we write nls() (this stands for non-linear least squares). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.
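A sketch of nls() with an asymptotic exponential y = a − b·exp(−c·x) on simulated data (the function and start values are illustrative, chosen from the plot):

set.seed(3)
x <- seq(1, 50, length = 60)
y <- 100 - 80 * exp(-0.1 * x) + rnorm(60, sd = 3)
fit <- nls(y ~ a - b * exp(-c * x),
           start = list(a = 100, b = 80, c = 0.1))   ## initial guesses
summary(fit)   ## estimates and standard errors of a, b, c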
34
NON-LINEAR REGRESSION: step by step

1. Plot y against x.
2. Get an idea of the family of functions that you can fit.
3. Start fitting the different models.
4. Specify initial guesses for the values of the parameters.
5. Get the MAM for each model by model simplification.
6. Check the residuals.
7. Compare PAIRS of models and choose the best.

Alternative approach: multimodel inference (minimum deviance + minimum number of parameters). AIC = scaled deviance + 2k, where k = number of parameters + 1. Compare GROUPS of models at a time; use model weights and model averaging (see Burnham and Anderson, 2002).
35
nls(): examples of function families
- Asymptotic functions
- S-shaped functions
- Humped functions
- Exponential functions
36
nls(): look at the data

Which family (asymptotic, S-shaped, humped, exponential) fits? Here the data suggest an asymptotic exponential.

Understand the role of the parameters a, b, and c. Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation at the limits, i.e. find the values of y when x = 0 and when x → ∞.
37
nls(): fit the model

1. Estimate a, b, and c (iteratively).
2. Extract the fitted values (ŷi).
3. Check the curve fit graphically.

Can we try another function from the same family? Model choice is always an important issue in curve fitting (particularly for prediction). Different functions behave differently at the limits! Think about your biological system, not just the residual deviance!
38
nls(): fit a second model (Michaelis-Menten)

1. Extract the fitted values (ŷi).
2. Check the curve fit graphically.

You can see that the asymptotic exponential (solid line) reaches its asymptote first, while the Michaelis-Menten (dotted line) continues to increase. Model choice would therefore be enormously important if you intended to use the model for prediction at ages much greater than 50 months.
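A sketch of the comparison, reusing the simulated data and fit from above; SSmicmen() is R's self-starting Michaelis-Menten model Vm·x/(K + x), so no start values are needed:

fit_mm <- nls(y ~ SSmicmen(x, Vm, K))   ## self-starting Michaelis-Menten
plot(x, y)
lines(x, fitted(fit),    lty = 1)       ## asymptotic exponential: solid
lines(x, fitted(fit_mm), lty = 2)       ## Michaelis-Menten: dotted
AIC(fit, fit_mm)                        ## penalized comparison of the two fits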
39
Application of regression: prediction

Regression models for prediction: a model can be used to predict values of y in space or in time from new xi values. Predict only within the spatial extent and the data range used to fit the model (interpolation YES, extrapolation NO).

Before using a model for prediction it has to be VALIDATED!!! There are 2 APPROACHES.
40
VALIDATION

1. In a data-rich situation, set aside a validation set: use one part of the data set to fit the model and the second part to assess the prediction error of the final selected model (residual = prediction error between the real y and the y predicted by the model fit).

2. If data are scarce, we must resort to artificially produced validation sets:
- Cross-validation
- Bootstrap
41
K-FOLD CROSS-VALIDATION

Split the data randomly into k groups of roughly equal size. Take turns using one group as the test set and the other k − 1 as the training set for fitting the model:

Round 1: Train | Train | Train | Train | Test   -> prediction error 1
Round 2: Train | Train | Train | Test  | Train  -> prediction error 2
Round 3: Train | Train | Test  | Train | Train  -> prediction error 3
Round 4: Train | Test  | Train | Train | Train  -> prediction error 4
Round 5: Test  | Train | Train | Train | Train  -> prediction error 5
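A minimal sketch of 5-fold cross-validation for a simple regression on simulated data (mean squared error as the prediction error; all names are illustrative):

set.seed(9)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))   ## random fold labels
cv_err <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = d[fold != i, ])                   ## train on k - 1 folds
  mean((d$y[fold == i] - predict(fit, d[fold == i, ]))^2)   ## test on the held-out fold
})
mean(cv_err)   ## cross-validated mean squared prediction error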
42
BOOTSTRAP

1. Generate a large number (n = 10,000) of bootstrap samples.
2. For each bootstrap sample, compute the prediction error (error 1, error 2, error 3, ..., error n).
3. The mean of these estimates is the bootstrap estimate of the prediction error.
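A minimal sketch of one common variant (d as in the cross-validation sketch; here the error is computed on the cases left out of each bootstrap sample, an assumption since the slide does not specify; n kept at 1,000 for speed, the slide uses 10,000):

set.seed(11)
n_boot <- 1000
boot_err <- replicate(n_boot, {
  idx <- sample(nrow(d), replace = TRUE)   ## bootstrap sample of rows
  fit <- lm(y ~ x, data = d[idx, ])
  oob <- setdiff(seq_len(nrow(d)), idx)    ## rows not drawn this time
  mean((d$y[oob] - predict(fit, d[oob, ]))^2)
})
mean(boot_err)   ## bootstrap estimate of prediction error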
43
Application of regression: prediction

1. Do not use your model for prediction without carrying out a validation: if you can, use an independent data set to validate the model; if you cannot, use at least bootstrap or cross-validation.
2. Never extrapolate.