Title: Linear correlation and linear regression: summary of tests
1. Linear correlation and linear regression: summary of tests
2. Recall: Covariance
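For reference, the definition (population form, then the sample form):

cov(X,Y) = E[(X − μ_X)(Y − μ_Y)]
sample: cov(x,y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)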
3. Interpreting Covariance
- cov(X,Y) > 0: X and Y are positively correlated
- cov(X,Y) < 0: X and Y are inversely correlated
- cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)
4. Correlation coefficient
- Pearson's correlation coefficient is standardized covariance (unitless): r = cov(X,Y) / (SD(X) · SD(Y))
5. Recall: dice problem
- Var(X) = 2.9167 (the roll of the first die)
- Var(Y) = 5.8333 (the sum of the two dice)
- Cov(X,Y) = 2.9167
R², the Coefficient of Determination: R² = SS_explained / TSS
Interpretation of R²: 50% of the total variation in the sum of the two dice is explained by the roll on the first die. Makes perfect intuitive sense!
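A quick worked check of that 50% figure, using the variances and covariance from the dice slide:

r = Cov(X,Y) / (SD(X) · SD(Y)) = 2.9167 / sqrt(2.9167 × 5.8333) ≈ 0.707
R² = r² ≈ 0.50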
6. Correlation
- Measures the relative strength of the linear relationship between two variables
- Unit-less
- Ranges between −1 and +1
- The closer to −1, the stronger the negative linear relationship
- The closer to +1, the stronger the positive linear relationship
- The closer to 0, the weaker any linear relationship
7. Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y vs. X, illustrating r = −1, r = −0.6, r = 0 (top row) and r = +1, r = +0.3, r = 0 (bottom row)]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
8. Linear Correlation
[Four scatter plots of Y vs. X: two showing linear relationships, two showing curvilinear relationships]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
9. Linear Correlation
[Four scatter plots of Y vs. X: two showing strong relationships, two showing weak relationships]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
10. Linear Correlation
[Two scatter plots of Y vs. X showing no relationship]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
11. Some calculation formulas
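A standard computational form of the sample correlation coefficient:

r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )

equivalently, r = cov(x,y) / (s_x · s_y).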
12. Sampling distribution of the correlation coefficient
The sample correlation coefficient follows a T-distribution with n − 2 degrees of freedom (since you have to estimate the standard error).
- Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.
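Concretely, the resulting test statistic for H0: ρ = 0 is:

t = r · sqrt(n − 2) / sqrt(1 − r²), compared against a T-distribution with n − 2 degrees of freedom.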
13. Sample size requirements for r
14. Correlation in SAS

/* To get correlations between variables 1 and 2, 1 and 3, and 2 and 3 */
PROC CORR data=yourdata;
  var variable1 variable2 variable3;
run;

/* To get correlations between variables 3 and 1, and 3 and 2 */
PROC CORR data=yourdata;
  var variable1 variable2;
  with variable3;
run;
15. Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).
16. What is Linear?
17. What's Slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
18. Simple linear regression
The linear regression model: Hours of homework = 14.2 + 4.4 × (ounces of caffeinated coffee)
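A minimal SAS sketch of fitting such a model, assuming a dataset named study with variables homework and coffee (hypothetical names):

PROC REG data=study;
  model homework = coffee;  /* intercept and slope estimated by least squares */
run;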
19. Prediction
- If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities.)
20. EXAMPLE
- The distribution of baby weights at Stanford: N(3400, 360000), i.e., mean = 3400 grams, variance = 360,000 (SD = 600 grams)
- Your best guess at a random baby's weight, given no information about the baby, is what?
- 3400 grams
- But what if you have relevant information? Can you make a better guess?
21. Predictor variable
- X = gestation time
- Assume that babies that gestate for longer are born heavier, all other things being equal.
- Pretend (at least for the purposes of this example) that this relationship is linear.
- Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth weight.
22. Y depends on X
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks)]
23. Prediction
- A new baby is born that had gestated for just 30 weeks. What's your best guess at the birth weight?
- Are you still best off guessing 3400? NO!
24. At 30 weeks
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks); babies at X = 30 weeks cluster around Y = 3000 g]
25. At 30 weeks
[The same scatter plot, annotated at X = 30 weeks and Y = 3000 g]
26. At 30 weeks
- The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
- In math-speak: E(Y | X = 30 weeks) = 3000 grams
27. But...
- Note that not every Y-value (Y_i) sits on the line. There's variability.
- Y_i = 3000 + random error_i
- In fact, babies that gestate for 30 weeks have birth weights that center at 3000 grams, but vary around 3000 with some variance σ².
- Approximately what distribution do birth weights follow? Normal. Y | X = 30 weeks ~ N(3000, σ²)
28. And, if X = 20, 30, or 40
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks), with columns of points at X = 20, 30, and 40]
29. If X = 20, 30, or 40
[The same plot: Y = baby weights (g) vs. X = gestation times (weeks), at X = 20, 30, and 40]
30. Mean values fall on the line
- E(Y | X = 40 weeks) = 4000
- E(Y | X = 30 weeks) = 3000
- E(Y | X = 20 weeks) = 2000
- E(Y | X) = μ_{Y|X} = 100 grams/week × X weeks
31. Linear Regression Model
- Y's are modeled:
- Y_i = 100 × X_i + random error_i
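A minimal SAS sketch of this model, simulating hypothetical babies from Y = 100·X + error and recovering the slope with PROC REG (the dataset name, sample size, and the error SD of 200 g are all assumptions for illustration):

data babies;
  call streaminit(1234);                             /* reproducible random draws */
  do i = 1 to 100;
    gestation = 20 + 20*rand("uniform");             /* X: 20 to 40 weeks */
    weight = 100*gestation + rand("normal", 0, 200); /* Y = 100X + random error */
    output;
  end;
run;

proc reg data=babies;
  model weight = gestation;                          /* fitted slope should be near 100 */
run;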
32. Assumptions (or the fine print)
- Linear regression assumes that:
- 1. The relationship between X and Y is linear
- 2. Y is distributed normally at each value of X
- 3. The variance of Y at every value of X is the same (homogeneity of variances)
- Why? The math requires it: the mathematical process is called least squares because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).
33. Non-homogeneous variance
[Scatter plot: Y = birth weight (100 g) vs. X = gestation time (weeks), with unequal spread of Y across values of X]
34. Least squares estimation
Least squares estimation: a little calculus. What are we trying to estimate? β, the slope. What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (also called the residuals, or left-over unexplained variability):
difference_i = y_i − (βx_i + α)
difference_i² = (y_i − (βx_i + α))²
Find the β that gives the minimum sum of the squared differences. How do you minimize a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus. From here it takes a little math trickery to solve for β.
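For reference, that trickery yields the familiar least squares estimates:

β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² = cov(x, y) / var(x)
α̂ = ȳ − β̂ x̄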
35. The Regression Picture
Least squares estimation gave us the line (β) that minimized the sum of the squared errors.
R² = SS_reg / SS_total
36. Results of least squares
- Slope (beta coefficient): β̂
- Intercept: α̂
- The regression line always goes through the point (x̄, ȳ).
37. Relationship with correlation
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).
38. Expected value of y
Expected value of y at the level x = x_i: ŷ_i = α̂ + β̂ x_i
39. Residual
We fit the regression coefficients such that the sum of the squared residuals was minimized (least squares regression).
40. Residual
Residual = observed value − predicted value: e_i = y_i − ŷ_i
41. Standard error of y|x
42. The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
[Scatter plot: Y = baby weights (g) vs. X = gestation times (weeks), showing similar spread around the line at X = 20, 30, and 40]
43. Standard error of beta
44. Comparing Standard Errors of the Slope
The standard error of the slope is a measure of the variation in the slope of regression lines from different possible samples.
[Two scatter plots of Y vs. X: one yielding a small and one a large standard error of the slope]
45. Sampling distribution of beta
- Slope
- Sampling distribution of the slope: β̂ ~ T_{n−2}(β, s.e.(β̂))
- H0: β1 = 0 (no linear relationship)
- H1: β1 ≠ 0 (a linear relationship does exist)
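So the hypothesis test works just like any other t-test:

t = (β̂ − 0) / s.e.(β̂), compared against a T-distribution with n − 2 degrees of freedom.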
46. (Standard error of intercept)
47. Residual Analysis: check assumptions
- The residual for observation i, e_i, is the difference between its observed and predicted value
- Check the assumptions of regression by examining the residuals
- Examine for the linearity assumption
- Examine for constant variance at all levels of X (homoscedasticity)
- Evaluate the normal distribution assumption
- Evaluate the independence assumption
- Graphical analysis of residuals
- Can plot residuals vs. X (see the SAS sketch below)
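A minimal SAS sketch of that residual plot, reusing the assumed babies dataset from the simulation above:

proc reg data=babies;
  model weight = gestation;
  output out=diag r=residual p=predicted;  /* save residuals and fitted values */
run;

proc sgplot data=diag;
  scatter x=gestation y=residual;          /* look for curvature, fanning, or trends */
  refline 0 / axis=y;
run;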
48. Residual Analysis for Linearity
[Two panels, each a scatter plot of Y vs. x with its residual plot beneath: a curved pattern in the residuals indicates Not Linear; a patternless band indicates Linear]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
49. Residual Analysis for Homoscedasticity
[Two panels, each a scatter plot of Y vs. x with its residual plot beneath: an even band of residuals indicates constant variance; a fanning pattern indicates non-constant variance]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
50. Residual Analysis for Independence
[Three residuals-vs.-X plots: systematic patterns indicate Not Independent; a patternless scatter indicates Independent]
- Slide from Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
51. A t-test is linear regression!
- In our class, the average alcohol consumed weekly was 3.5 oz/week (SD = 1.7) in Red Sox fans (n = 4) and 1.7 oz/week (SD = 2.1) in non-Red Sox fans (n = 21).
- We can evaluate these data with a t-test or a linear regression.
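A minimal SAS sketch of the two equivalent analyses, assuming a dataset class with variables alcohol and SoxFan (1 = fan, 0 = not; names are hypothetical):

proc ttest data=class;
  class SoxFan;
  var alcohol;              /* two-sample t-test of mean alcohol by fan status */
run;

proc reg data=class;
  model alcohol = SoxFan;   /* slope = difference in group means; same p-value
                               as the equal-variance t-test */
run;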
52. As a linear regression

Variable    Label      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept   1   1.69048              0.45328          3.73      0.0011
SoxFan      SoxFan      1   1.80952              1.13320          1.60      0.1240

The intercept (1.69 oz/week) is the mean for non-fans, and the SoxFan coefficient (1.81) is the difference in group means (1.69 + 1.81 = 3.5 oz/week for fans), with the same p-value (0.124) as the equal-variance t-test.
53. Multiple Linear Regression
- More than one predictor:
- ŷ = α + β1·X + β2·W + β3·Z
- Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant (see the SAS sketch below).
54. ANOVA is linear regression!
- A categorical variable with more than two groups:
- E.g., groups 1, 2, and 3 (mutually exclusive)
- ŷ = α (value for group 1) + β1·(1 if in group 2) + β2·(1 if in group 3)
- This is called dummy coding, where multiple binary variables are created to represent being in each category (or not) of a categorical variable (see the sketch below).
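A minimal SAS sketch of dummy coding by hand, assuming a dataset mydata with outcome y and a 3-level variable group (hypothetical names):

data coded;
  set mydata;
  g2 = (group = 2);   /* 1 if in group 2, else 0 */
  g3 = (group = 3);   /* 1 if in group 3, else 0 */
run;

proc reg data=coded;
  model y = g2 g3;    /* intercept = group 1 mean; betas = differences from group 1 */
run;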
55. Functions of multivariate analysis
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions
56. Table 3. Relationship of Combinations of Macronutrients to BP (SBP and DBP) for 11,342 Men, Years 1 Through 6 of MRFIT: Multiple Linear Regression Analyses
Models controlled for baseline age, race (black, nonblack), education, smoking, and serum cholesterol.
Circulation. 1996 Nov 15;94(10):2417-23.
57. In math terms: SBP = α − 0.0346·(% protein) + β_age·(Age) + ...

Variable                  SBP                DBP
Total protein, % kcal     −0.0346 (−1.10)    −0.0568 (−3.17)

Translation: controlled for the other variables in the model (as well as baseline age, race, etc.), every 1% increase in the percent of calories coming from protein correlates with a 0.0346 mmHg decrease in systolic BP. (NS)
Also (from a separate model), every 1% increase in the percent of calories coming from protein correlates with a 0.0568 mmHg decrease in diastolic BP. (significant)
DBP = α − 0.0568·(% protein) + β_age·(Age) + ...
58. Multivariate regression pitfalls
- Multi-collinearity
- Residual confounding
- Overfitting
59. Multicollinearity
- Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
- Model building and diagnostics are tricky business!
60. Residual confounding
- You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible).
- Residual confounding can lead to significant adjusted odds ratios (ORs) as high as 1.5 to 2.0 if measurement error is high.
- Hypothetical example: In a case-control study of lung cancer, researchers identified a link between alcohol drinking and cancer in smokers only. The OR was 1.3 for 1-2 drinks per day (compared with none) and 1.5 for 3 or more drinks per day. Though the authors adjusted for the number of cigarettes smoked per day in multivariate regression, we cannot rule out residual confounding by level of smoking (which may be tightly linked to alcohol drinking).
61. Overfitting
- In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model.
- The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.
- Example (hypothetical): In a randomized trial of an intervention to speed bone healing after fracture, researchers built a multivariate regression model to predict time to recovery in a subset of women (n = 12). An automatic selection procedure came up with a model containing age, weight, use of oral contraceptives, and treatment status; the predictors were all highly significant and the model had a nearly perfect R-square of 99.5%.
- This is likely an example of overfitting. The researchers have fit a model to exactly their particular sample of data, but it will likely have no predictive ability in a new sample.
- Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model.
62. Overfitting
Pure noise variables still produce good R² values if the model is overfitted. The figure shows the distribution of R² values from a series of simulated regression models containing only noise variables. (Figure 1 from Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).)
63. Other types of multivariate regression
- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome
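Minimal SAS sketches for each outcome type, with all dataset and variable names assumed for illustration:

proc reg data=d;                    /* continuous outcome */
  model sbp = age treatment;
run;

proc logistic data=d descending;    /* binary outcome (models P(dead = 1)) */
  model dead = age treatment;
run;

proc phreg data=d;                  /* time-to-event outcome; censor = 0 means censored */
  model time*censor(0) = age treatment;
run;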
64. Overview of statistical tests
- The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.
- E.g., blood pressure (continuous outcome); pounds and age (continuous predictors); treatment (1/0, a binary predictor)
65. (No transcript)
66Alternative summary statistics for various types
of outcome data
67Continuous outcome (means)
68Binary or categorical outcomes (proportions)
69Time-to-event outcome (survival data)