bivariate EDA and regression analysis

Transcript and Presenter's Notes
1
bivariate EDA and regression analysis
4
[figure: scatterplot with axes labeled Y and X]
8
scatterplot matrix
11
scatterplots
  • scatterplots provide the most detailed summary of
    a bivariate relationship, but they are not
    concise, and there are limits to what else you
    can do with them
  • simpler kinds of summaries may be useful
  • more compact summaries often capture less detail
  • may support more extended mathematical analyses
  • may reveal fundamental relationships

13
y = a + bx
14
y = a + bx
b = slope: b = Δy/Δx = (y2 - y1)/(x2 - x1)
a = y-intercept
16
y = a + bx
  • we can predict values of y from values of x
  • predicted values of y are called ŷ ("y-hat")
  • the predicted values (ŷ) are often regarded as
    dependent on the (independent) x values
  • try to assign independent values to the x-axis,
    dependent values to the y-axis

17
y = a + bx
  • becomes a concise summary of a point
    distribution, and a model of a relationship
  • may have important explanatory and predictive
    value

19
  • how do we come up with these lines?
  • various options
  • by eye
  • calculating a Tukey Line (resistant to
    outliers)
  • locally weighted regression (LOWESS)
  • least squares regression

20
linear regression
  • linear regression and correlation analysis are
    generally concerned with fitting lines to real
    data
  • least squares regression is one of the main tools
  • attempts to minimize deviation of observed points
    from the regression line
  • maximizes its potential for prediction

21
  • the standard approach minimizes the squared
    variation in y: SSE = Σ(yi - (a + bxi))²
  • Note:
  • these are the vertical deviations
  • this is a sum-squared-error approach

22
  • regressing x on y would involve defining the line
  • by minimizing the squared horizontal deviations,
    Σ(xi - (a′ + b′yi))²

23
  • calculating a line that minimizes this value is
    called regressing y on x
  • appropriate when we are trying to predict y from
    x
  • this is also called Model I Regression

24
  • start by calculating the slope (b):

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

(the numerator is proportional to the covariance of x and y)
25
  • once you have the slope, you can calculate the
    y-intercept (a): a = ȳ - b·x̄
26
regression pathologies
  • things to avoid in regression analysis

(slides 27-30: figures illustrating regression pathologies)
31
Tukey Line
  • resistant to outliers
  • divide cases into thirds, based on x-axis
  • identify the median x and y values in upper and
    lower thirds
  • slope: b = (My3 - My1)/(Mx3 - Mx1)
  • intercept: a = median of all values yi - b·xi
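
(Not from the slides: a Python sketch of this recipe, assuming a simple
three-way split on x and no handling of ties.)

    import statistics

    def tukey_line(x, y):
        """Resistant line: slope from outer-third medians, intercept from all cases."""
        pairs = sorted(zip(x, y))              # order cases along the x-axis
        third = len(pairs) // 3
        lower, upper = pairs[:third], pairs[-third:]
        mx1 = statistics.median(p[0] for p in lower)
        my1 = statistics.median(p[1] for p in lower)
        mx3 = statistics.median(p[0] for p in upper)
        my3 = statistics.median(p[1] for p in upper)
        b = (my3 - my1) / (mx3 - mx1)          # slope from the two median points
        a = statistics.median(yi - b * xi for xi, yi in pairs)
        return a, b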

33
Correlation
  • regression concerns fitting a linear model to
    observed data
  • correlation concerns the degree of fit between
    observed data and the model...
  • if most points lie near the line
  • the fit of the model is good
  • the two variables are strongly correlated
  • values of y can be well predicted from x

34
Pearson's r
  • this is assessed using the product-moment
    correlation coefficient:

r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² · Σ(yi - ȳ)²]

  • the covariance (the numerator), standardized by a
    measure of variation in both x and y

35
[figure: scatterplot with a point labeled (xi, yi)]
36
  • unlike the covariance, r is unit-less
  • ranges between -1 and 1
  • 0 = no correlation
  • -1 and 1 = perfect negative and positive
    correlation (respectively)
  • r is symmetrical
  • correlation between x and y is the same as
    between y and x
  • no question of independence or dependence
  • recall, this symmetry is not true of regression
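
(Not from the slides: a Python sketch of r, using only the standard math
module; the function name is illustrative.)

    import math

    def pearson_r(x, y):
        """Product-moment correlation: covariance scaled by both standard deviations."""
        n = len(x)
        x_bar, y_bar = sum(x) / n, sum(y) / n
        cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
        sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
        return cov / (sx * sy)  # unit-less, in [-1, 1], symmetrical in x and y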

37
  • regression/correlation
  • one can assess the strength of a relationship by
    seeing how knowledge of one variable improves the
    ability to predict the other
  • this is like what we did with PRE (proportional
    reduction of error) approaches to contingency
    table analysis
  • (recall Goodman and Kruskal's tau)

39
  • if you ignore x, the best predictor of y will be
    the mean of all y values (y-bar)
  • if the y measurements are widely scattered,
    prediction errors will be greater than if they
    are close together
  • we can assess the dispersion of y values around
    their mean by the total sum of squares: SST = Σ(yi - ȳ)²

41
  • coefficient of determination (r²)
  • describes the proportion of variation that is
    explained or accounted for by the regression
    line: r² = 1 - SSE/SST (a PRE measure)
  • r² = .5
  • → half of the variation is explained by the
    regression
  • → half of the variation in y is explained by
    variation in x
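
(Not from the slides: a Python sketch of r² as a proportional-reduction-of-
error measure, reusing least_squares from the earlier sketch.)

    def r_squared(x, y):
        """Proportion of variation in y accounted for by the regression line."""
        a, b = least_squares(x, y)
        y_bar = sum(y) / len(y)
        sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # error around the line
        sst = sum((yi - y_bar) ** 2 for yi in y)                     # error around the mean
        return 1 - sse / sst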

43
correlation and percentages
  • much of what we want to learn about association
    between variables can be learned from counts
  • e.g., are high counts of bone needles associated
    with high counts of end scrapers?
  • sometimes, similar questions are posed of
    percent-standardized data
  • e.g., are high proportions of decorated pottery
    associated with high proportions of copper bells?

44
caution
  • these are different questions and have different
    implications for formal regression
  • percents will show at least some level of
    correlation even if the underlying counts do not
  • spurious correlation (negative)
  • closed-sum effect
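
(Not from the slides: a small Python simulation of the closed-sum effect,
reusing pearson_r from the earlier sketch; the counts are randomly generated
purely for illustration.)

    import random

    random.seed(1)
    # 200 assemblages, three artifact types with independent counts
    counts = [[random.randint(10, 100) for _ in range(3)] for _ in range(200)]
    pcts = [[100 * c / sum(row) for c in row] for row in counts]

    a_cnt = [row[0] for row in counts]; b_cnt = [row[1] for row in counts]
    a_pct = [row[0] for row in pcts];   b_pct = [row[1] for row in pcts]

    print(pearson_r(a_cnt, b_cnt))  # near 0: the counts are independent
    print(pearson_r(a_pct, b_pct))  # negative: closure induces spurious correlation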

48
rank order correlation
  • appropriate where continuous measurements imply a
    false level of accuracy
  • site survey data
  • estimations of site-size, population, etc.
  • often rough approximations, but still can be
    expected (at least in theory) to vary as a
    function of other variables
  • if we can rank observations with confidence we
    can gain similar insights to what we get from r

49
rank order correlation
  • also appropriate where you want to reduce the
    effect of outliers
  • may also help if the trend is not strictly
    linear (resistant to the exact shape of the trend)

50
Spearman's rs
  • based on Pearson's r, but applied to ranks:

rs = 1 - 6ΣDi² / (n(n² - 1))

  • Di is the difference between pairs of ranks
  • range is -1 to 1; 0 means no correlation

51
example
  • site-size vs. site-catchment
  • 5 sites
  • ranked from largest to smallest
  • ranked with respect to quantity and quality of
    arable land within a 1 km radius
  • are the bigger sites surrounded by better land?

53
[figure: scatterplot of site-size rank (1-5) against catchment rank (1-5)]
54
  • rs = 1 - (6 × 6)/(5(25 - 1))
  • rs = 1 - 36/120 = .7
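
(Not from the slides: a Python sketch of the shortcut formula; no correction
for tied ranks.)

    def spearman_rs(x, y):
        """Spearman's rank correlation from squared rank differences."""
        def ranks(v):
            order = sorted(range(len(v)), key=lambda i: v[i])
            r = [0] * len(v)
            for rank, i in enumerate(order, start=1):
                r[i] = rank
            return r
        rx, ry = ranks(x), ranks(y)
        n = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of Di squared
        return 1 - 6 * d2 / (n * (n * n - 1))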

55
Kendall's tau
  • calculate a value S, then feed S into a second
    equation to get tau
  • rank cases on one variable
  • for each case, note how many cases, further
    along, are correctly ordered and/or misordered on
    the second variable
  • S = (correctly ordered pairs) - (misordered pairs)

56
Tau = S/(n(n - 1)/2); here n = 5, so n(n - 1)/2 = 10 and Tau = 6/10 = .6
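
(Not from the slides: a Python sketch counting correctly ordered and
misordered pairs; tied pairs are simply skipped.)

    def kendall_tau(x, y):
        """Kendall's tau: S over the number of pairs, n(n-1)/2."""
        n = len(x)
        s = 0
        for i in range(n):
            for j in range(i + 1, n):
                prod = (x[i] - x[j]) * (y[i] - y[j])
                if prod > 0:
                    s += 1  # pair ordered the same way on both variables
                elif prod < 0:
                    s -= 1  # pair misordered
        return s / (n * (n - 1) / 2)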
57
rs vs. Tau
  • not much reason to prefer one over the other
  • numerical value of Tau usually a little less than
    rs
  • Tau is better if number of cases is small
  • neither can be used like r² (i.e., as a way of
    explaining observed variation)
  • basically a relative measure

58
regression assumptions
  • we want regression to have predictive potential
  • scatterplot pathologies work against this
  • these are probably the most important things to
    avoid, but there are other underlying assumptions

59
regression assumptions
  • both variables are measured at the interval scale
    or above
  • variation is the same at all points along the
    regression line (variation is homoscedastic)

60
residuals
  • vertical deviations of points around the
    regression
  • for case i, residual = yi - ŷi = yi - (a + bxi)
  • residuals in y should not show patterned
    variation either with x or y-hat
  • normally distributed around the regression line
  • residual error should not be autocorrelated
    (errors/residuals in y are independent)

61
standard error of the regression
  • recall standard error of an estimate (SEE) is
    like a standard deviation
  • can calculate an SEE for residuals associated
    with a regression formula

62
  • to the degree that the regression assumptions
    hold, there is a 68% probability that true values
    of y lie within 1 SEE of ŷ
  • 95% within 2 SEE
  • can plot lines showing the SEE
  • ŷ = a + bx ± 1 SEE
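
(Not from the slides: a Python sketch, assuming the usual n - 2 degrees of
freedom and reusing least_squares from the earlier sketch.)

    import math

    def see(x, y):
        """Standard error of the estimate: spread of residuals around the line."""
        a, b = least_squares(x, y)
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # n - 2 because two parameters (a and b) were estimated from the data
        return math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))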

64
data transformations and regression
  • read Shennan, Chapter 9 (esp. pp. 151-173)

67
let VAR1T = sqr(VAR1)
68
  • distribution and fall-off models
  • e.g., density of obsidian vs. distance from the
    quarry

71
let LG_DENS = log(DENSITY)
72
y = 1.70 - .05x
remember: y is logged density
73
log y = 1.70 - .05x
fplot y = exp(1.70 - .05x)
74
begin
PLOT DENSITY*DISTANCE / FILL=1,0,0
fplot y = exp(1.70-.05x)
  XLABEL='' YLABEL='' XTICK=0 XPIP=0 YTICK=0 YPIP=0
  XMIN=0 XMAX=80 YMIN=0 YMAX=6
end
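
(Not from the slides: the same fall-off fit sketched in Python; distance and
density stand for the observed values, and least_squares is the earlier
sketch. Natural log is assumed, matching the exp() back-transform above.)

    import math

    def fit_falloff(distance, density):
        """Regress log(density) on distance, then back-transform the prediction."""
        log_dens = [math.log(d) for d in density]
        a, b = least_squares(distance, log_dens)
        return a, b, (lambda x: math.exp(a + b * x))  # fall-off curve y = exp(a + b*x)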
75
transformation summary
  • correcting left skew
  • x⁴ stronger
  • x³ strong
  • x² mild
  • correcting right skew
  • √x weak
  • log(x) mild
  • -1/x strong
  • -1/x² stronger
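
(Not from the slides: the same ladder written as Python callables; pick one,
apply it, and re-inspect the distribution. Values must be positive for the
right-skew corrections.)

    import math

    LEFT_SKEW = {
        "stronger": lambda x: x ** 4,
        "strong":   lambda x: x ** 3,
        "mild":     lambda x: x ** 2,
    }
    RIGHT_SKEW = {
        "weak":     lambda x: math.sqrt(x),
        "mild":     lambda x: math.log(x),
        "strong":   lambda x: -1 / x,
        "stronger": lambda x: -1 / x ** 2,
    }

    # example: mildly correct a right-skewed variable
    var1t = [RIGHT_SKEW["mild"](v) for v in [2.0, 5.0, 40.0]]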