Chapters 8 and 9 Correlations Between Data Sets

- Math 1680

Overview

- Scatter Plots
- Associations
- The Correlation Coefficient
- Sketching Scatter Plots
- Changes of Scale
- Summary

Scatter Plots

- Often, we are interested in comparing two related data sets
- Heights and weights of students
- SAT scores and freshman GPA
- Age and fuel efficiency of vehicles
- We can draw a scatter plot of the data set
- Plot paired data points on a Cartesian plane

Scatter Plots

- Scatter plot for the heights of 1,078 fathers and their adult sons
- From HANES study

Scatter Plots

- What does the dashed diagonal line represent?
- Find the point representing a 5'3¼" father who has a 5'6½" son

Scatter Plots

- What does the vertical dashed column represent?
- Consider the families where the father was 72" tall, to the nearest inch
- How tall was the tallest son?
- Shortest?

Scatter Plots

- Was the average height of the fathers around 64", 68" or 72"?
- Was the SD of the fathers' heights around 3", 6" or 9"?

Scatter Plots

- The points form a swarm that is more or less football-shaped
- This indicates that there is a linear association between the fathers' heights and the sons' heights

Scatter Plots

- Short fathers tend to have short sons, and tall fathers tend to have tall sons
- We say there is a positive association between the heights of fathers and sons
- What would it mean for there to be a negative association between the heights?

Scatter Plots

- Does knowing the father's height give a precise prediction of his son's height?
- Does knowing the father's height let you better predict his son's height?

Scatter Plots

- We will generally assume the scatter plots are football-shaped
- Association is linear in nature
- Each data set is approximately normal

Scatter Plots

- Key features of scatter plots
- Given two data sets X and Y,
- The point of averages is the point (µx, µy)
- The average of a data set is denoted by µ (Greek mu, for mean)
- The subscript indicates which set is being referenced
- It will be in the center of the cloud
- Due to the normal approximation, the vast majority (about 95%) of the cloud should fall within 2 SDs below and above the average for both X and Y
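
The 2-SD rule of thumb can be checked numerically. The sketch below is only an illustration: the data are simulated (hypothetical heights in inches, not the study's measurements), and the helper name `within_two_sds` is made up for this example. It computes the point of averages and the share of each data set that lies within 2 SDs of its average.

```python
import random
import statistics

def within_two_sds(values):
    """Fraction of values lying within 2 SDs of their own average."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (divide by n)
    inside = [v for v in values if abs(v - mu) <= 2 * sd]
    return len(inside) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical, simulated data: roughly normal "heights" in inches.
xs = [rng.gauss(68, 3) for _ in range(10_000)]
ys = [rng.gauss(69, 3) for _ in range(10_000)]

point_of_averages = (statistics.fmean(xs), statistics.fmean(ys))
print("point of averages:", point_of_averages)
print("fraction of X within 2 SDs:", within_two_sds(xs))
print("fraction of Y within 2 SDs:", within_two_sds(ys))
```

For approximately normal data both fractions should come out close to 0.95.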

Associations

- When given a value in one data set, we often want to make a prediction for the other data set
- We call our given value the independent variable
- We call the value we are trying to predict the dependent variable

Associations

- If there is indeed a relationship between the two data sets, we can say various things about their association
- Strong: Knowing X helps you a lot in predicting Y, and vice versa
- Weak: Knowing X doesn't really help you predict Y, and vice versa
- Positive: X and Y tend to move in the same direction
- The higher in one you look, the higher in the other you should be
- Negative: X and Y tend to move in opposite directions
- The higher in one you look, the lower in the other you should be

Associations

- Positive associations
- Study time/final grade
- Height/weight
- SAT score/GPA
- Clouds in sky/chance of rain
- Bowling practice/bowling score
- Age of husband/age of wife

- Negative associations
- Age of car/fuel efficiency
- Golfing practice/golf score
- Dental hygiene/cavities formed
- Pollution/air quality
- Speed/mile time

Associations

- What kind of association is this?

Associations

- What kind of association is this?

Associations

- Remember that even a very strong association does not necessarily imply a causal relationship
- There may be a confounding influence at play

The Correlation Coefficient

- While strong/weak and positive/negative give a sense of the association, we want a way to quantify the strength and direction of the association
- The correlation coefficient (r) is the statistic which accomplishes this

The Correlation Coefficient

- The correlation coefficient is always between -1 and 1
- A positive r means that there is a positive association between the sets
- A negative r means that there is a negative association between the sets
- If r is close to 0, then there is only a weak association between the sets
- If r is close to -1 or 1, then there is a strong association between the sets

The Correlation Coefficient

- The following plots have the same averages and SDs, with 50 points in each
- The only difference between them is the correlation coefficient
- Note how the points fall into a line as r approaches -1 or 1
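
Clouds like these can be reproduced by simulation. The sketch below is one common way to do it, not necessarily how the slides' plots were made: it assumes standard normal X and builds Y as r·X plus independent noise scaled by √(1 − r²). The names `simulate_cloud` and `correlation` are chosen for this example.

```python
import math
import random

def correlation(xs, ys):
    """Average of the products minus product of the averages, over the SDs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def simulate_cloud(r, n, seed=0):
    """Draw n points whose long-run correlation is r (assumes -1 <= r <= 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mix x with independent noise so that Y has SD 1 and correlation r.
        y = r * x + math.sqrt(1 - r * r) * noise
        xs.append(x)
        ys.append(y)
    return xs, ys

for target in (-0.9, 0.0, 0.5, 0.9):
    xs, ys = simulate_cloud(target, n=50)
    print(target, round(correlation(xs, ys), 2))
```

With only 50 points the sample r wobbles around the target value; larger samples land closer to it.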

The Correlation Coefficient

- To calculate r
- Find the average and SD of each data set
- Multiply the data sets pairwise and find the average
- The correlation is the average of the product minus the product of the averages, all divided by the product of the SDs
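
The recipe translates almost line for line into code. The sketch below assumes the SD is the population SD (divide by n, not n − 1), which is what this formula needs in order to give the usual correlation coefficient. The function names are chosen for this example, and the data are the seven pairs from the first "Compute r" exercise in this section.

```python
import math

def average(values):
    return sum(values) / len(values)

def sd(values):
    """Population SD: root of the average squared deviation from the average."""
    mu = average(values)
    return math.sqrt(average([(v - mu) ** 2 for v in values]))

def correlation(xs, ys):
    # Step 1: average and SD of each data set.
    avg_x, avg_y = average(xs), average(ys)
    sd_x, sd_y = sd(xs), sd(ys)
    # Step 2: multiply pairwise and average the products.
    avg_product = average([x * y for x, y in zip(xs, ys)])
    # Step 3: (average of product - product of averages) / product of SDs.
    return (avg_product - avg_x * avg_y) / (sd_x * sd_y)

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 1, 4, 3, 7, 5, 6]
r = correlation(xs, ys)
print(round(r, 4))  # 0.8214
```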

The Correlation Coefficient

X   Y
1   5
3   9
4   7
5   1
7   13
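
As a worked check, the calculation for this table can be written out step by step. The SDs here divide by n (an assumption, matching the "average of the product" recipe):

```latex
\begin{align*}
\mu_X &= \tfrac{1+3+4+5+7}{5} = 4, &
\mu_Y &= \tfrac{5+9+7+1+13}{5} = 7 \\
\mathrm{SD}_X &= \sqrt{\tfrac{9+1+0+1+9}{5}} = 2, &
\mathrm{SD}_Y &= \sqrt{\tfrac{4+4+0+36+36}{5}} = 4 \\
\text{average of } XY &= \tfrac{5+27+28+5+91}{5} = 31.2 \\
r &= \frac{31.2 - 4 \cdot 7}{2 \cdot 4} = \frac{3.2}{8} = 0.4
\end{align*}
```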

The Correlation Coefficient

- Compute r for the following data

X   Y
1   2
2   1
3   4
4   3
5   7
6   5
7   6

r = 0.8214

X   Y
1   3
3   7
4   9
5   11
7   15

r = 1

The Correlation Coefficient

- Estimate the correlation

The Correlation Coefficient

- Estimate the correlation

Sketching Scatter Plots

- The SD line is the line consisting of all the points where the standard score in X equals the standard score in Y
- zX = zY
- To sketch the SD line, draw a line along the long axis of the football shape
- Note that the SD line always goes through the point of averages

Sketching Scatter Plots

- Given the five-statistic summary (averages, SDs, and correlation) for a pair of data sets, we can sketch the scatter plot
- Plot the point of averages in the center
- Mark two SDs in both directions, on both axes
- Plot the point 1 SD above average for both data sets
- Draw a line connecting this point and the point of averages
- This is the SD line
- Draw an ellipse with the SD line as its long axis
- Ellipse should go just beyond the 2 SD marks in all directions
- The value of r determines how oblong the ellipse is

Sketching Scatter Plots

- A study of the IQs of husbands and wives obtained the following results
- Husbands: average IQ 100, SD 15
- Wives: average IQ 100, SD 15
- r = 0.6
- Sketch the scatter plot
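
Before drawing, the sketching steps can be reduced to a handful of numbers. The helper below is hypothetical (the name `sketch_landmarks` and the dictionary layout are made up for this example). It lists the point of averages, the 2-SD marks on each axis, and the second point needed to draw the SD line. One assumption goes beyond the slides: for a negative r the SD line is taken to fall rather than rise, so the second point is 1 SD above average in X and 1 SD below average in Y.

```python
def sketch_landmarks(avg_x, sd_x, avg_y, sd_y, r):
    """Key points for sketching a football-shaped scatter plot by hand."""
    # Assumption: the SD line rises when r >= 0 and falls when r < 0.
    sign = 1 if r >= 0 else -1
    return {
        "point_of_averages": (avg_x, avg_y),
        "x_marks": (avg_x - 2 * sd_x, avg_x + 2 * sd_x),
        "y_marks": (avg_y - 2 * sd_y, avg_y + 2 * sd_y),
        "one_sd_point": (avg_x + sd_x, avg_y + sign * sd_y),
        "sd_line_slope": sign * sd_y / sd_x,
    }

# Husband/wife IQ summary from the slide: averages 100, SDs 15, r = 0.6.
landmarks = sketch_landmarks(100, 15, 100, 15, 0.6)
for name, value in landmarks.items():
    print(name, value)
```

Note that r does not move any of these landmarks except through its sign; its size only controls how narrow the ellipse around the SD line should be drawn.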

Changes of Scale

- The correlation coefficient is not affected by changes of scale
- Moving: adding the same number to all of the values of one variable
- Stretching: multiplying all the values of one variable by the same positive number
- Would r change if we multiplied by a negative number?
- The correlation coefficient is also unaffected by interchanging the two data sets
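
These claims are easy to test numerically. The sketch below reuses the five-point table from earlier in the deck. The constants 10, 3 and -2 are arbitrary choices for illustration, and `correlation` is a helper written for this example using population SDs.

```python
import math

def correlation(xs, ys):
    """Average of the products minus product of the averages, over the SDs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

r = correlation(xs, ys)
moved = correlation([x + 10 for x in xs], ys)      # add a constant to X
stretched = correlation(xs, [3 * y for y in ys])   # multiply Y by a positive number
flipped = correlation([-2 * x for x in xs], ys)    # multiply X by a negative number
swapped = correlation(ys, xs)                      # interchange the data sets

print("original :", round(r, 4))
print("moved    :", round(moved, 4))
print("stretched:", round(stretched, 4))
print("flipped  :", round(flipped, 4))
print("swapped  :", round(swapped, 4))
```

Moving, stretching and swapping leave r unchanged, while multiplying one variable by a negative number reverses its sign and keeps its size.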

Changes of Scale

- Compute r for each of the following data sets

X   Y
0   8
4   9
6   10
8   12
12  6

X   Y
0   2
2   3
3   4
4   6
6   0

r = -0.15 for both data sets (the second is the first with X halved and 6 subtracted from Y)

Summary

- The relationship between two variables, X and Y, can be graphed in a scatter plot
- When the scatter plot is tightly clustered around a line, there is a strong linear association between X and Y
- A scatter plot can be characterized by its five-statistic summary
- Average and SD of the X values
- Average and SD of the Y values
- Correlation coefficient

Summary

- When the correlation coefficient gets closer to -1 or 1, the points cluster more tightly around a line
- Positive association has a positive r-value
- Negative association has a negative r-value
- Calculating the correlation coefficient
- Take the average of the product
- Subtract the product of the averages
- Divide the difference by the product of the SDs

Summary

- The correlation coefficient is not affected by changes of scale or transposing the variables
- Correlation does not measure causation!