Linear correlation and linear regression: summary of tests (PowerPoint presentation transcript)

Transcript and Presenter's Notes

1
Linear correlation and linear regression
summary of tests
2
Recall Covariance
3
Interpreting Covariance
  • cov(X,Y) > 0: X and Y are positively correlated
  • cov(X,Y) < 0: X and Y are inversely correlated
  • cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, though zero covariance alone does not guarantee independence)

4
Correlation coefficient
  • Pearson's correlation coefficient is standardized
    covariance (unitless)
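To make "standardized covariance" concrete, here is a minimal Python sketch (not part of the original slides; it assumes numpy and uses made-up data): dividing the sample covariance by the two sample standard deviations reproduces what np.corrcoef reports.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                      # made-up data, not from the slides
    y = 2 * x + rng.normal(size=200)              # y is positively related to x

    cov_xy = np.cov(x, y, ddof=1)[0, 1]           # sample covariance
    r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # covariance standardized by the two SDs
    print(r, np.corrcoef(x, y)[0, 1])             # the two numbers agree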

5
Recall dice problem
  • Var(X) = 2.9167
  • Var(Y) = 5.8333
  • Cov(X,Y) = 2.9167

R² = Coefficient of Determination = SSexplained/TSS

Interpretation of R² = 50%: 50% of the total
variation in the sum of the two dice is explained
by the roll on the first die. Makes perfect
intuitive sense!
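As a quick numerical check of these dice figures, here is a short Python sketch (not from the original slides; it assumes numpy): enumerate the 36 equally likely rolls, take X as the first die and Y as the sum, and the variances, covariance, and R² above fall out.

    import itertools
    import numpy as np

    # All 36 equally likely outcomes of rolling two dice
    rolls = list(itertools.product(range(1, 7), repeat=2))
    x = np.array([a for a, b in rolls], dtype=float)      # first die
    y = np.array([a + b for a, b in rolls], dtype=float)  # sum of the two dice

    var_x = x.var()                                    # population variance: 35/12 ≈ 2.9167
    var_y = y.var()                                    # 70/12 ≈ 5.8333
    cov_xy = ((x - x.mean()) * (y - y.mean())).mean()  # ≈ 2.9167

    r = cov_xy / np.sqrt(var_x * var_y)                # ≈ 0.707
    print(var_x, var_y, cov_xy, r**2)                  # r² = 0.5, i.e. 50% explained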
6
Correlation
  • Measures the relative strength of the linear
    relationship between two variables
  • Unit-less
  • Ranges between −1 and +1
  • The closer to −1, the stronger the negative linear relationship
  • The closer to +1, the stronger the positive linear relationship
  • The closer to 0, the weaker any linear relationship

7
Scatter Plots of Data with Various Correlation
Coefficients
[Six scatter plots of Y vs. X: r = −1, r = −0.6, and r = 0 (top row); r = +1, r = +0.3, and r = 0 (bottom row)]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

8
Linear Correlation
Linear relationships vs. curvilinear relationships
[Four scatter plots of Y vs. X: two showing linear relationships, two showing curvilinear relationships]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

9
Linear Correlation
Strong relationships vs. weak relationships
[Four scatter plots of Y vs. X: two showing strong linear relationships, two showing weak linear relationships]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

10
Linear Correlation
No relationship
[Two scatter plots of Y vs. X showing no linear relationship]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

11
Some calculation formulas
12
Sampling distribution of correlation coefficient
The sample correlation coefficient follows a
T-distribution with n-2 degrees of freedom (since
you have to estimate the standard error).
  • note, like a proportion, the variance of the
    correlation coefficient depends on the
    correlation coefficient itself?substitute in
    estimated r
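A minimal Python sketch of that test (not from the original slides; it assumes scipy, and the r and n below are hypothetical): the statistic t = r·sqrt(n − 2)/sqrt(1 − r²) is referred to a t-distribution with n − 2 degrees of freedom.

    import numpy as np
    from scipy import stats

    r, n = 0.45, 30                              # hypothetical sample correlation and sample size
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # test statistic for H0: rho = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-sided p-value
    print(t, p)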

13
Sample size requirements for r
14
Correlation in SAS
  /* To get correlations between variables 1 and 2, 1 and 3, and 2 and 3 */
  PROC CORR data=yourdata;
    var variable1 variable2 variable3;
  run;

  /* To get correlations between variables 3 and 1 and 3 and 2 */
  PROC CORR data=yourdata;
    var variable1 variable2;
    with variable3;
  run;

15
Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as
equals. In regression, one variable is considered
the independent (predictor) variable (X) and the
other the dependent (outcome) variable (Y).
16
What is Linear?
  • Remember this?
  • Y = mX + B

17
What's Slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
18
Simple linear regression
The linear regression model: Hours of homework = 14.2 + 4.4(ounces of caffeinated coffee)
19
Prediction
  • If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities?)

20
EXAMPLE
  • The distribution of baby weights at Stanford: N(3400, 360000)
  • Your best guess at a random baby's weight, given no information about the baby, is what?
  • 3400 grams
  • But, what if you have relevant information? Can you make a better guess?

21
Predictor variable
  • X = gestation time
  • Assume that babies that gestate for longer are born heavier, all other things being equal.
  • Pretend (at least for the purposes of this example) that this relationship is linear.
  • Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth weight.

22
Y depends on X
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks)]
23
Prediction
  • A new baby is born that had gestated for just 30 weeks. What's your best guess at the birth weight?
  • Are you still best off guessing 3400? NO!

24
At 30 weeks
[Scatter plot of birth weight (g) vs. gestation time (weeks), highlighting weights of about 3000 g at X = 30 weeks]
25
At 30 weeks
[Same scatter plot, focusing on the distribution of birth weights at X = 30 weeks]
26
At 30 weeks
  • The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.
  • In math-speak: E(Y | X = 30 weeks) = 3000 grams

27
But
  • Note that not every Y-value (Yi) sits on the line. There's variability.
  • Yi = 3000 + random error_i
  • In fact, babies that gestate for 30 weeks have birth weights that center at 3000 grams, but vary around 3000 with some variance σ²
  • Approximately what distribution do birth weights follow? Normal. (Y | X = 30 weeks) ~ N(3000, σ²)

28
And, if X = 20, 30, or 40
[Scatter plot: Y = birth weight (g) vs. X = gestation time (weeks), marking the values at X = 20, 30, and 40 weeks]
29
If X = 20, 30, or 40
[Same plot: baby weights (g) vs. gestation times (weeks), with the distribution of weights shown at X = 20, 30, and 40 weeks]
30
Mean values fall on the line
  • E(Y | X = 40 weeks) = 4000
  • E(Y | X = 30 weeks) = 3000
  • E(Y | X = 20 weeks) = 2000
  • E(Y | X) = μ_Y|X = 100 grams/week · X weeks

31
Linear Regression Model
  • Y's are modeled:
  • Yi = 100·Xi + random error_i
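A simulation sketch of this model in Python (not part of the original slides; it assumes numpy, and the error standard deviation of 200 g is an arbitrary choice): generate gestation times, add random error around 100 grams per week, and least squares recovers a slope near 100.

    import numpy as np

    rng = np.random.default_rng(1)
    weeks = rng.uniform(20, 40, size=500)                # simulated gestation times (weeks)
    weight = 100 * weeks + rng.normal(0, 200, size=500)  # Yi = 100*Xi + random error_i

    slope, intercept = np.polyfit(weeks, weight, deg=1)  # least-squares fit
    print(slope, intercept)                              # slope comes out close to 100 g/week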

32
Assumptions (or the fine print)
  • Linear regression assumes that:
  • 1. The relationship between X and Y is linear
  • 2. Y is distributed normally at each value of X
  • 3. The variance of Y at every value of X is the same (homogeneity of variances)
  • Why? The math requires it: the mathematical process is called "least squares" because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).

33
Non-homogeneous variance
[Scatter plot: Y = birth weight (100 g) vs. X = gestation time (weeks), with the spread of Y increasing as X increases]
34
Least squares estimation
Least Squares Estimation: a little calculus.
What are we trying to estimate? β, the slope.
What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (the differences are also called the residuals, or left-over unexplained variability):
Difference_i = y_i − (βx_i + a)
Difference_i² = (y_i − (βx_i + a))²
Find the β that gives the minimum sum of the squared differences. How do you find a minimum? Take the derivative, set it equal to zero, and solve: a typical max/min problem from calculus. From here it takes a little math trickery to solve for β.
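Written out (in LaTeX; not reproduced from the original slide images), the standard derivation this slide alludes to is:

\begin{aligned}
\min_{a,\beta}\; & \sum_{i=1}^{n}\bigl(y_i - (\beta x_i + a)\bigr)^2 \\
\frac{\partial}{\partial a}:\; & -2\sum_{i}\bigl(y_i - \beta x_i - a\bigr) = 0 \\
\frac{\partial}{\partial \beta}:\; & -2\sum_{i} x_i\bigl(y_i - \beta x_i - a\bigr) = 0 \\
\Rightarrow\; & \hat{\beta} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
\qquad \hat{a} = \bar{y} - \hat{\beta}\,\bar{x}
\end{aligned}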
35
The Regression Picture
Least squares estimation gave us the line (β) that minimized the sum of squared errors.
R² = SSreg / SStotal
36
Results of least squares
Slope (beta coefficient)
Intercept
The regression line always goes through the point (x̄, ȳ).
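These closed-form results are easy to verify numerically; here is a Python sketch (not from the original slides; it assumes numpy and uses simulated data) that computes the slope and intercept by hand, checks them against np.polyfit, and confirms the fitted line passes through (x̄, ȳ).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(20, 40, size=100)              # simulated data, not the slide's data
    y = 100 * x + rng.normal(0, 300, size=100)

    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a_hat = y.mean() - beta_hat * x.mean()         # intercept
    print(beta_hat, a_hat)
    print(np.polyfit(x, y, deg=1))                 # same slope and intercept
    print(a_hat + beta_hat * x.mean(), y.mean())   # line passes through (x-bar, y-bar)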
37
Relationship with correlation
In correlation, the two variables are treated as
equals. In regression, one variable is considered
the independent (predictor) variable (X) and the
other the dependent (outcome) variable (Y).
38
Expected value of y
Expected value of y at a given level of x, x = xi
39
Residual
We fit the regression coefficients such that the sum
of the squared residuals was minimized (least
squares regression).
40
Residual
Residual = observed value − predicted value
41
Standard error of y|x
42
The standard error of Y given X is the average
variability around the regression line at any
given value of X. It is assumed to be equal at
all values of X.
[Scatter plot: baby weights (g) vs. gestation times (weeks), with equal spread around the regression line at X = 20, 30, and 40 weeks]
43
Standard error of beta
44
Comparing Standard Errors of the Slope
The standard error of the slope is a measure of the variation in the slope of regression lines from different possible samples.
[Two scatter plots of Y vs. X: one consistent with a small standard error of the slope, one with a large standard error]
45
Sampling distribution of beta
  • Slope
  • Sampling distribution of the slope: β̂ ~ T_n−2(β, s.e.(β̂))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)
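A Python sketch of this slope test (not from the original slides; it assumes numpy and scipy and uses simulated data): compute the slope, its standard error s_y|x / sqrt(Σ(x − x̄)²), and refer t = β̂ / s.e.(β̂) to a t-distribution with n − 2 degrees of freedom.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 40
    x = rng.uniform(20, 40, size=n)                        # simulated data
    y = 100 * x + rng.normal(0, 300, size=n)

    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - beta * x.mean()
    resid = y - (a + beta * x)
    s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))           # residual standard error
    se_beta = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of the slope

    t = beta / se_beta                                     # test of H0: beta1 = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(beta, se_beta, t, p)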
46
(Standard error of intercept)
47
Residual Analysis: check assumptions
  • The residual for observation i, ei, is the
    difference between its observed and predicted
    value
  • Check the assumptions of regression by examining
    the residuals
  • Examine for linearity assumption
  • Examine for constant variance for all levels of X
    (homoscedasticity)
  • Evaluate normal distribution assumption
  • Evaluate independence assumption
  • Graphical Analysis of Residuals
  • Can plot residuals vs. X
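A minimal Python sketch of such a residual-versus-X plot (not from the original slides; it assumes numpy and matplotlib and uses simulated data):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=200)            # simulated predictor
    y = 3 + 2 * x + rng.normal(0, 1, size=200)  # simulated outcome

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)

    plt.scatter(x, residuals, s=10)
    plt.axhline(0, color="grey")                # residuals should scatter evenly around zero
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()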

48
Residual Analysis for Linearity
[Two pairs of plots (Y vs. x on top, residuals vs. x below): a curved pattern in the residuals indicates a relationship that is not linear; residuals scattered evenly around zero indicate a linear relationship]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

49
Residual Analysis for Homoscedasticity
[Two pairs of plots (Y vs. x on top, residuals vs. x below): residual spread that changes with x indicates non-constant variance; even residual spread indicates constant variance]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

50
Residual Analysis for Independence
[Three plots of residuals vs. X: patterned residuals indicate observations that are not independent; randomly scattered residuals indicate independence]
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall

51
A t-test is linear regression!
  • In our class, the average alcohol consumed weekly was 3.5 oz/week (sd = 1.7) in Red Sox fans (n = 4) and 1.7 oz/week (sd = 2.1) in non-Red Sox fans (n = 21).
  • We can evaluate these data with a t-test or a linear regression.

52
As a linear regression

  Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept   Intercept    1   1.69048              0.45328          3.73      0.0011
  SoxFan      SoxFan       1   1.80952              1.13320          1.60      0.1240
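The equivalence is easy to reproduce; here is a Python sketch (not from the original slides; it assumes scipy and statsmodels, and it simulates two groups rather than using the class data): the pooled two-sample t statistic and p-value match the t and p for the regression slope on a 0/1 group indicator.

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    fans = rng.normal(3.5, 1.7, size=4)        # simulated "Red Sox fans"
    nonfans = rng.normal(1.7, 2.1, size=21)    # simulated "non-fans"

    # Two-sample t-test with pooled variance
    t, p = stats.ttest_ind(fans, nonfans, equal_var=True)

    # Same comparison as a regression on a 0/1 group indicator
    y = np.concatenate([fans, nonfans])
    group = np.concatenate([np.ones(4), np.zeros(21)])
    fit = sm.OLS(y, sm.add_constant(group)).fit()

    print(t, p)
    print(fit.tvalues[1], fit.pvalues[1])      # identical t and p for the group slope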
53
Multiple Linear Regression
  • More than one predictor:
  • ŷ = α + β1·X + β2·W + β3·Z
  • Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
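A Python sketch of fitting such a model (not from the original slides; it assumes numpy and statsmodels, and the predictor names and effect sizes are made up): with three predictors in the model, each fitted coefficient estimates the change in the outcome per one-unit change in that predictor, holding the others fixed.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 300
    x = rng.normal(size=n)                             # simulated predictors
    w = rng.normal(size=n)
    z = rng.normal(size=n)
    y = 1.0 + 2.0 * x - 0.5 * w + rng.normal(size=n)   # z has no real effect on y

    X = sm.add_constant(np.column_stack([x, w, z]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # estimates of alpha, beta1, beta2, beta3 (roughly 1, 2, -0.5, 0)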

54
ANOVA is linear regression!
  • A categorical variable with more than two groups
  • E.g. groups 1, 2, and 3 (mutually exclusive)
  • ŷ = α (value for group 1) + β1·(1 if in group 2) + β2·(1 if in group 3)
  • This is called dummy coding, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
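A Python sketch of the equivalence (not from the original slides; it assumes numpy, pandas, scipy, and statsmodels, with simulated groups): regress the outcome on dummy variables for groups 2 and 3, and the overall F test matches one-way ANOVA.

    import numpy as np
    import pandas as pd
    from scipy import stats
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    df = pd.DataFrame({
        "group": np.repeat(["g1", "g2", "g3"], 30),      # simulated group labels
        "y": np.concatenate([rng.normal(10, 2, 30),
                             rng.normal(12, 2, 30),
                             rng.normal(11, 2, 30)]),
    })

    # Dummy coding: group g1 is the reference; 0/1 indicators for g2 and g3
    dummies = pd.get_dummies(df["group"], drop_first=True).astype(float)
    fit = sm.OLS(df["y"], sm.add_constant(dummies)).fit()

    f_anova, p_anova = stats.f_oneway(*[g["y"].values for _, g in df.groupby("group")])
    print(fit.fvalue, fit.f_pvalue)   # same F and p as ...
    print(f_anova, p_anova)           # ... the one-way ANOVA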

55
Functions of multivariate analysis
  • Control for confounders
  • Test for interactions between predictors (effect
    modification)
  • Improve predictions

56
Table 3. Relationship of Combinations of Macronutrients to BP (SBP and DBP) for 11,342 Men, Years 1 Through 6 of MRFIT: Multiple Linear Regression Analyses
Models controlled for baseline age, race (black, nonblack), education, smoking, serum cholesterol.
Circulation. 1996 Nov 15;94(10):2417-23.
57
In math terms: SBP = α − 0.0346(% protein) + β_age(Age) + ...

  Variable                 SBP               DBP
  Total protein, % kcal    −0.0346 (−1.10)   −0.0568 (−3.17)

Translation: controlled for the other variables in the model (as well as baseline age, race, etc.), every 1% increase in the percent of calories coming from protein correlates with a 0.0346 mmHg decrease in systolic BP. (NS)
Also (from a separate model), every 1% increase in the percent of calories coming from protein correlates with a 0.0568 mmHg decrease in diastolic BP. (significant)

DBP = α − 0.0568(% protein) + β_age(Age) + ...
58
Multivariate regression pitfalls
  • Multi-collinearity
  • Residual confounding
  • Overfitting

59
Multicollinearity
  • Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
  • Model building and diagnostics are tricky business!

60
Residual confounding
  • You cannot completely wipe out confounding simply
    by adjusting for variables in multiple regression
    unless variables are measured with zero error
    (which is usually impossible).
  • Residual confounding can lead to significant
    adjusted odds ratios (ORs) as high as 1.5 to 2.0
    if measurement error is high.
  • Hypothetical example: In a case-control study of
    lung cancer, researchers identified a link
    between alcohol drinking and cancer in smokers
    only. The OR was 1.3 for 1-2 drinks per day
    (compared with none) and 1.5 for 3 drinks per
    day. Though the authors adjusted for number of
    cigarettes smoked per day in multivariate
    regression, we cannot rule out residual
    confounding by level of smoking (which may be
    tightly linked to alcohol drinking).

61
Overfitting
  • In multivariate modeling, you can get highly
    significant but meaningless results if you put
    too many predictors in the model.
  • The model is fit perfectly to the quirks of your
    particular sample, but has no predictive ability
    in a new sample.
  • Example (hypothetical): In a randomized trial of an intervention to speed bone healing after fracture, researchers built a multivariate regression model to predict time to recovery in a subset of women (n = 12). An automatic selection procedure came up with a model containing age, weight, use of oral contraceptives, and treatment status; the predictors were all highly significant and the model had a nearly perfect R-square of 99.5%.
  • This is likely an example of overfitting. The
    researchers have fit a model to exactly their
    particular sample of data, but it will likely
    have no predictive ability in a new sample.
  • Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model.

62
Overfitting
Pure noise variables still produce good R² values
if the model is overfitted. The distribution of
R² values from a series of simulated regression
models containing only noise variables. (Figure
1 from Babyak MA. What You See May Not Be What
You Get: A Brief, Nontechnical Introduction to
Overfitting in Regression-Type Models.
Psychosomatic Medicine 66:411-421 (2004).)
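The phenomenon is easy to reproduce; here is a minimal Python simulation sketch (not Babyak's own code; it assumes numpy and statsmodels): with 12 observations and 8 pure-noise predictors, the fitted R² is typically large even though nothing real is being modeled.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    n, k = 12, 8                            # 12 subjects, 8 pure-noise predictors
    r2 = []
    for _ in range(1000):
        X = sm.add_constant(rng.normal(size=(n, k)))   # noise predictors
        y = rng.normal(size=n)                         # outcome is also pure noise
        r2.append(sm.OLS(y, X).fit().rsquared)

    print(np.mean(r2))                      # averages about k/(n-1), roughly 0.73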
63
Other types of multivariate regression
  • Multiple linear regression is for normally
    distributed outcomes
  • Logistic regression is for binary outcomes
  • Cox proportional hazards regression is used when
    time-to-event is the outcome

64
Overview of statistical tests
  • The following table gives the appropriate choice
    of a statistical test or measure of association
    for various types of data (outcome variables and
    predictor variables) by study design.

e.g., blood pressure, pounds, age, treatment (1/0)
65
(No Transcript)
66
Alternative summary statistics for various types
of outcome data
67
Continuous outcome (means)
68
Binary or categorical outcomes (proportions)
69
Time-to-event outcome (survival data)