Title: Linear correlation and linear regression: summary of tests
1. Linear correlation and linear regression: summary of tests
2. Recall: Covariance
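For reference, the definition (population form, then the sample form):

cov(X,Y) = E[(X − μ_X)(Y − μ_Y)]
sample: cov(x,y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)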
3. Interpreting Covariance
- cov(X,Y) > 0: X and Y are positively correlated
- cov(X,Y) < 0: X and Y are inversely correlated
- cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)
4. Correlation coefficient
- Pearson's correlation coefficient is standardized covariance (unitless): r = cov(X,Y) / (SD(X) · SD(Y))
5. Recall: dice problem
- Var(X) = 2.9167 (the roll of the first die)
- Var(Y) = 5.8333 (the sum of the two dice)
- Cov(X,Y) = 2.9167
R², the Coefficient of Determination: R² = SS_explained / TSS
Interpretation of R²: 50% of the total variation in the sum of the two dice is explained by the roll on the first die. Makes perfect intuitive sense!
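A quick worked check of that 50% figure, using the variances and covariance from the dice slide:

r = Cov(X,Y) / (SD(X) · SD(Y)) = 2.9167 / sqrt(2.9167 × 5.8333) ≈ 0.707
R² = r² ≈ 0.50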
6. Correlation
- Measures the relative strength of the linear relationship between two variables
- Unit-less
- Ranges between −1 and +1
- The closer to −1, the stronger the negative linear relationship
- The closer to +1, the stronger the positive linear relationship
- The closer to 0, the weaker any linear relationship
7. Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y vs. X, illustrating r = −1, r = −0.6, r = 0 (top row) and r = +1, r = +0.3, r = 0 (bottom row)]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
8. Linear Correlation
[Four scatter plots of Y vs. X: two showing linear relationships, two showing curvilinear relationships]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
9. Linear Correlation
[Four scatter plots of Y vs. X: two showing strong relationships, two showing weak relationships]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
10. Linear Correlation
[Two scatter plots of Y vs. X showing no relationship]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
11. Some calculation formulas
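A standard computational form of the sample correlation coefficient:

r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )

equivalently, r = cov(x,y) / (s_x · s_y).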
12. Sampling distribution of the correlation coefficient
The sample correlation coefficient follows a T-distribution with n − 2 degrees of freedom (since you have to estimate the standard error).
- Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.
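Concretely, the resulting test statistic for H0: ρ = 0 is:

t = r · sqrt(n − 2) / sqrt(1 − r²), compared against a T-distribution with n − 2 degrees of freedom.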
13. Sample size requirements for r
14. Correlation in SAS

/* To get correlations between variables 1 and 2, 1 and 3, and 2 and 3 */
PROC CORR data=yourdata;
  var variable1 variable2 variable3;
run;

/* To get correlations between variables 3 and 1, and 3 and 2 */
PROC CORR data=yourdata;
  var variable1 variable2;
  with variable3;
run;
15. Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).
16. What is Linear?
17. What's Slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
18. Simple linear regression
The linear regression model: Hours of homework = 14.2 + 4.4 × (ounces of caffeinated coffee)
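A minimal SAS sketch of fitting such a model, assuming a dataset named study with variables homework and coffee (hypothetical names):

PROC REG data=study;
  model homework = coffee;  /* intercept and slope estimated by least squares */
run;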
19. Prediction
- If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities.)
20. EXAMPLE
- The distribution of baby weights at Stanford: N(3400, 360000), i.e., mean = 3400 grams, variance = 360,000 (SD = 600 grams)
- Your best guess at a random baby's weight, given no information about the baby, is what?
- 3400 grams
- But what if you have relevant information? Can you make a better guess?
21. Predictor variable
- X = gestation time
- Assume that babies that gestate for longer are born heavier, all other things being equal.
- Pretend (at least for the purposes of this example) that this relationship is linear.
- Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth weight.
22. Y depends on X
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks)]
23. Prediction
- A new baby is born that had gestated for just 30 weeks. What's your best guess at the birth weight?
- Are you still best off guessing 3400? NO!
24. At 30 weeks
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks); babies at X = 30 weeks cluster around Y = 3000 g]
25. At 30 weeks
[The same scatter plot, annotated at X = 30 weeks and Y = 3000 g]
26. At 30 weeks
- The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
- In math-speak: E(Y | X = 30 weeks) = 3000 grams
27. But...
- Note that not every Y-value (Y_i) sits on the line. There's variability.
- Y_i = 3000 + random error_i
- In fact, babies that gestate for 30 weeks have birth weights that center at 3000 grams, but vary around 3000 with some variance σ².
- Approximately what distribution do birth weights follow? Normal. Y | X = 30 weeks ~ N(3000, σ²)
28. And, if X = 20, 30, or 40
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks), with columns of points at X = 20, 30, and 40]
29. If X = 20, 30, or 40
[The same plot: Y = baby weights (g) vs. X = gestation times (weeks), at X = 20, 30, and 40]
30. Mean values fall on the line
- E(Y | X = 40 weeks) = 4000
- E(Y | X = 30 weeks) = 3000
- E(Y | X = 20 weeks) = 2000
- E(Y | X) = μ_{Y|X} = 100 grams/week × X weeks
31. Linear Regression Model
- Y's are modeled:
- Y_i = 100 × X_i + random error_i
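A minimal SAS sketch of this model, simulating hypothetical babies from Y = 100·X + error and recovering the slope with PROC REG (the dataset name, sample size, and the error SD of 200 g are all assumptions for illustration):

data babies;
  call streaminit(1234);                             /* reproducible random draws */
  do i = 1 to 100;
    gestation = 20 + 20*rand("uniform");             /* X: 20 to 40 weeks */
    weight = 100*gestation + rand("normal", 0, 200); /* Y = 100X + random error */
    output;
  end;
run;

proc reg data=babies;
  model weight = gestation;                          /* fitted slope should be near 100 */
run;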
32. Assumptions (or the fine print)
- Linear regression assumes that:
- 1. The relationship between X and Y is linear
- 2. Y is distributed normally at each value of X
- 3. The variance of Y at every value of X is the same (homogeneity of variances)
- Why? The math requires it: the mathematical process is called least squares because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).
33. Non-homogeneous variance
[Scatter plot: Y = birth weight (100 g) vs. X = gestation time (weeks), with unequal spread of Y across values of X]
34. Least squares estimation
Least squares estimation: a little calculus. What are we trying to estimate? β, the slope. What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (also called the residuals, or left-over unexplained variability):
difference_i = y_i − (βx_i + α)
difference_i² = (y_i − (βx_i + α))²
Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus. From here it takes a little math trickery to solve for β.
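For reference, that trickery yields the familiar least squares estimates:

β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² = cov(x, y) / var(x)
α̂ = ȳ − β̂ x̄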
35. The Regression Picture
Least squares estimation gave us the line (β) that minimized the sum of the squared errors.
R² = SS_reg / SS_total
36. Results of least squares
- Slope (beta coefficient): β̂
- Intercept: α̂
- The regression line always goes through the point (x̄, ȳ).
37. Relationship with correlation
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).
38. Expected value of y
Expected value of y at the level x = x_i: ŷ_i = α̂ + β̂ x_i
39. Residual
We fit the regression coefficients such that the sum of the squared residuals was minimized (least squares regression).
40. Residual
Residual = observed value − predicted value: e_i = y_i − ŷ_i
41. Standard error of y|x
42. The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
[Scatter plot: Y = baby weights (g) vs. X = gestation times (weeks), showing similar spread around the line at X = 20, 30, and 40]
43. Standard error of beta
44. Comparing Standard Errors of the Slope
The standard error of the slope is a measure of the variation in the slope of regression lines from different possible samples.
[Two scatter plots of Y vs. X: one yielding a small and one a large standard error of the slope]
45. Sampling distribution of beta
- Slope
- Sampling distribution of the slope: β̂ ~ T_{n−2}(β, s.e.(β̂))
- H0: β1 = 0 (no linear relationship)
- H1: β1 ≠ 0 (a linear relationship does exist)
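So the hypothesis test works just like any other t-test:

t = (β̂ − 0) / s.e.(β̂), compared against a T-distribution with n − 2 degrees of freedom.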
46. (Standard error of intercept)
47. Residual Analysis: check assumptions
- The residual for observation i, e_i, is the difference between its observed and predicted value
- Check the assumptions of regression by examining the residuals
- Examine for the linearity assumption
- Examine for constant variance at all levels of X (homoscedasticity)
- Evaluate the normal distribution assumption
- Evaluate the independence assumption
- Graphical analysis of residuals
- Can plot residuals vs. X (see the SAS sketch below)
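A minimal SAS sketch of that residual plot, reusing the assumed babies dataset from the simulation above:

proc reg data=babies;
  model weight = gestation;
  output out=diag r=residual p=predicted;  /* save residuals and fitted values */
run;

proc sgplot data=diag;
  scatter x=gestation y=residual;          /* look for curvature, fanning, or trends */
  refline 0 / axis=y;
run;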
48. Residual Analysis for Linearity
[Two panels, each a scatter plot of Y vs. x with its residual plot beneath: a curved pattern in the residuals indicates Not Linear; a patternless band indicates Linear]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
49. Residual Analysis for Homoscedasticity
[Two panels, each a scatter plot of Y vs. x with its residual plot beneath: an even band of residuals indicates constant variance; a fanning pattern indicates non-constant variance]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
50. Residual Analysis for Independence
[Three residuals-vs.-X plots: systematic patterns indicate Not Independent; a patternless scatter indicates Independent]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
51. A t-test is linear regression!
- In our class, the average alcohol consumed weekly was 3.5 oz/week (SD = 1.7) in Red Sox fans (n = 4) and 1.7 oz/week (SD = 2.1) in non-Red Sox fans (n = 21).
- We can evaluate these data with a t-test or a linear regression.
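A minimal SAS sketch of the two equivalent analyses, assuming a dataset class with variables alcohol and SoxFan (1 = fan, 0 = not; names are hypothetical):

proc ttest data=class;
  class SoxFan;
  var alcohol;              /* two-sample t-test of mean alcohol by fan status */
run;

proc reg data=class;
  model alcohol = SoxFan;   /* slope = difference in group means; same p-value
                               as the equal-variance t-test */
run;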
52. As a linear regression

Variable    Label      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1   1.69048              0.45328          3.73      0.0011
SoxFan      SoxFan      1   1.80952              1.13320          1.60      0.1240

The intercept (1.69 oz/week) is the mean for non-fans, and the SoxFan coefficient (1.81) is the difference in group means (1.69 + 1.81 = 3.5 oz/week for fans), with the same p-value (0.124) as the equal-variance t-test.
53. Multiple Linear Regression
- More than one predictor:
- ŷ = α + β1·X + β2·W + β3·Z
- Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant (see the SAS sketch below).
54. ANOVA is linear regression!
- A categorical variable with more than two groups:
- E.g., groups 1, 2, and 3 (mutually exclusive)
- ŷ = α (value for group 1) + β1·(1 if in group 2) + β2·(1 if in group 3)
- This is called dummy coding, where multiple binary variables are created to represent being in each category (or not) of a categorical variable (see the sketch below).
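A minimal SAS sketch of dummy coding by hand, assuming a dataset mydata with outcome y and a 3-level variable group (hypothetical names):

data coded;
  set mydata;
  g2 = (group = 2);   /* 1 if in group 2, else 0 */
  g3 = (group = 3);   /* 1 if in group 3, else 0 */
run;

proc reg data=coded;
  model y = g2 g3;    /* intercept = group 1 mean; betas = differences from group 1 */
run;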
55. Functions of multivariate analysis
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions
56. Table 3. Relationship of Combinations of Macronutrients to BP (SBP and DBP) for 11,342 Men, Years 1 Through 6 of MRFIT: Multiple Linear Regression Analyses
Models controlled for baseline age, race (black, nonblack), education, smoking, and serum cholesterol.
Circulation. 1996 Nov 15;94(10):2417-23.
57. In math terms: SBP = α − 0.0346·(% protein) + β_age·(Age) + ...

Variable                  SBP                DBP
Total protein, % kcal     −0.0346 (−1.10)    −0.0568 (−3.17)

Translation: controlled for the other variables in the model (as well as baseline age, race, etc.), every 1% increase in the percent of calories coming from protein correlates with a 0.0346 mmHg decrease in systolic BP. (NS)
Also (from a separate model), every 1% increase in the percent of calories coming from protein correlates with a 0.0568 mmHg decrease in diastolic BP. (significant)
DBP = α − 0.0568·(% protein) + β_age·(Age) + ...
58. Multivariate regression pitfalls
- Multi-collinearity
- Residual confounding
- Overfitting
59. Multicollinearity
- Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
- Model building and diagnostics are tricky business!
60. Residual confounding
- You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible).
- Residual confounding can lead to significant adjusted odds ratios (ORs) as high as 1.5 to 2.0 if measurement error is high.
- Hypothetical example: In a case-control study of lung cancer, researchers identified a link between alcohol drinking and cancer in smokers only. The OR was 1.3 for 1-2 drinks per day (compared with none) and 1.5 for 3 or more drinks per day. Though the authors adjusted for the number of cigarettes smoked per day in multivariate regression, we cannot rule out residual confounding by level of smoking (which may be tightly linked to alcohol drinking).
61. Overfitting
- In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model.
- The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.
- Example (hypothetical): In a randomized trial of an intervention to speed bone healing after fracture, researchers built a multivariate regression model to predict time to recovery in a subset of women (n = 12). An automatic selection procedure came up with a model containing age, weight, use of oral contraceptives, and treatment status; the predictors were all highly significant and the model had a nearly perfect R-square of 99.5%.
- This is likely an example of overfitting. The researchers have fit a model to exactly their particular sample of data, but it will likely have no predictive ability in a new sample.
- Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model.
62. Overfitting
Pure noise variables still produce good R² values if the model is overfitted. The figure shows the distribution of R² values from a series of simulated regression models containing only noise variables. (Figure 1 from Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)
63. Other types of multivariate regression
- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome
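Minimal SAS sketches for each outcome type, with all dataset and variable names assumed for illustration:

proc reg data=d;                    /* continuous outcome */
  model sbp = age treatment;
run;

proc logistic data=d descending;    /* binary outcome (models P(dead = 1)) */
  model dead = age treatment;
run;

proc phreg data=d;                  /* time-to-event outcome; censor = 0 means censored */
  model time*censor(0) = age treatment;
run;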
64. Overview of statistical tests
- The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.
- E.g., blood pressure (continuous outcome); pounds and age (continuous predictors); treatment (1/0, a binary predictor)
65. (No transcript)
66Alternative summary statistics for various types
of outcome data
67Continuous outcome (means)
68Binary or categorical outcomes (proportions)
69Time-to-event outcome (survival data)