Spearman - PowerPoint PPT Presentation

Provided by: DrJames75

Spearman's rank correlation

The standard correlation coefficient that we have
seen so far is called the Pearson product-moment
correlation.
In the social sciences especially, and in the
biological sciences, it is often more common to
use Spearman's rank correlation coefficient.
Spearman was a rather nasty individual who spent
a lot of time trying to show that Black people
were inferior (and therefore suitable only as
infantry cannon fodder in the US Army). Spearman
is also responsible for a fair amount of work on
IQ tests and some of the misconceptions
associated with that idea.
The well-known evolutionary biologist Stephen J.
Gould has a great book on this subject called
The Mismeasure of Man.

As the name suggests, Spearman's correlation works
with the ranks rather than the raw data.
What are the ranks?
Essentially, the ranks are almost the labels
associated with the order statistics.
However, there is one key difference.
In the case of a tie, there are a couple of ways
to assign rank.
We could follow the method used in the Olympics:
two people with equal times receive the same
medal, but the next person is placed one level
lower. E.g. if there are two golds, then the next
medal awarded is bronze.
Numerically, if there are k values tied at
position x, the next largest value receives rank
x + k.
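
The Olympic-style rule can be sketched in a few lines of Python. The function name competition_ranks is made up for this illustration, and smaller values are assumed to rank first (as with race times).

```python
def competition_ranks(values):
    """Olympic-style ranks: ties share the best rank, later ranks are skipped."""
    ordered = sorted(values)
    # index() gives the first 0-based position of v in sorted order,
    # so k values tied at position x all get x and the next value gets x + k.
    return [ordered.index(v) + 1 for v in values]


# Hypothetical race times: two runners tie for first place.
print(competition_ranks([9.8, 9.8, 9.9]))  # [1, 1, 3]
```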

Statistical ranking is slightly different.
Tied observations receive an average rank.
That is, if there is only one observation at
position i, then it receives rank i.
If there is a tie between k order statistics,
then they each receive the average of their labels.
This is probably best demonstrated with an example.
Let x = {0.2, 0.0, -0.3, 0.3, 0.3, -0.4, 0.4,
-0.4, -0.2, -0.4}
We sort this into order:
x = {-0.4, -0.4, -0.4, -0.3, -0.2, 0.0, 0.2,
0.3, 0.3, 0.4}
Then the ranks are
2, 2, 2, 4, 5, 6, 7, 8.5, 8.5, 10
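
The average-rank rule can be checked with a short Python sketch. The helper name average_ranks is an assumption of this example, and it is written for clarity rather than speed.

```python
def average_ranks(values):
    """Statistical ranks: tied values each receive the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1          # lowest 1-based position held by v
        last = first + ordered.count(v) - 1   # highest 1-based position held by v
        ranks.append((first + last) / 2)
    return ranks


x = [0.2, 0.0, -0.3, 0.3, 0.3, -0.4, 0.4, -0.4, -0.2, -0.4]
# Ranks are returned in the original data order; sort them to compare with the slide.
print(sorted(average_ranks(x)))
# [2.0, 2.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.5, 8.5, 10.0]
```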

As previously stated, Spearman's rank
correlation works with the ranks.
That is, we have a set of n observations on two
variables, x and y.
Spearman's correlation is then given by
$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
where d_i is the difference between the ith rank
for x and the ith rank for y.
It turns out that, when there are no ties, this
formula is algebraically equivalent to using the
Pearson product-moment correlation on the ranks.
And this in turn provides a method for
calculation of ρ.
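
As a rough check of that equivalence, the sketch below computes the coefficient both ways in Python: from the rank differences, using the usual formula 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), and as a Pearson correlation on the ranks. The data and all function names are hypothetical, and the data are chosen to have no ties, which is when the two routes agree exactly.

```python
import math


def average_ranks(values):
    """Ranks with ties sharing the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_from_differences(x, y):
    """Spearman's coefficient from squared rank differences (exact without ties)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((u - v) ** 2 for u, v in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data with no tied values.
x = [1.2, 3.4, 2.2, 5.0, 4.1]
y = [2.0, 3.9, 3.1, 4.5, 6.0]
print(spearman_from_differences(x, y))
print(pearson(average_ranks(x), average_ranks(y)))
```

Both lines should print the same value, up to floating-point rounding.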

This form of correlation used to be preferred for
its computational simplicity.
However, with today's computational power this is
no longer relevant.
In many practical cases, although ρ is
calculated on the ranks, it doesn't really give
very different values from r.
To calculate it using R, type
cor(rank(x), rank(y))

Advanced topics in regression

We have covered a very small subset of regression
analysis.
There is a whole raft of advanced techniques
that are beyond the scope of this course.
However, you should at least be aware that there
are special techniques for different types of
situations.
One of the most commonly used regression methods
is known as logistic regression.
Logistic regression is used for situations where
the response is a proportion of the form m_i/n_i.
n_i is usually treated as fixed and m_i is a count.
Logistic regression is often used to identify
significant variables in surveys and case-control
studies (e.g. the SIDS study).
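
To make the m_i/n_i setup concrete, here is a minimal Python sketch that fits a logistic model with a single predictor, log(p_i / (1 - p_i)) = b0 + b1 * x_i, by Newton's method. The single-predictor restriction, the function name fit_logistic, the fixed iteration count, and the data are all assumptions of this illustration; in practice a library routine would be used instead.

```python
import math


def fit_logistic(x, m, n, iterations=25):
    """Fit log(p / (1 - p)) = b0 + b1 * x to grouped data m_i out of n_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score vector (g0, g1) and the information
        # matrix [[h00, h01], [h01, h11]] at the current estimates.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, mi, ni in zip(x, m, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)      # binomial variance of m_i
            g0 += mi - ni * p
            g1 += xi * (mi - ni * p)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: m_i events out of n_i = 20 subjects at each level of x.
x = [0, 1, 2, 3, 4]
m = [2, 5, 9, 14, 17]
n = [20, 20, 20, 20, 20]
b0, b1 = fit_logistic(x, m, n)
print("intercept:", b0, "slope:", b1)
```

A positive slope here means the estimated probability of the event rises with x.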

Example

This data comes from a medical study.
The aim is to identify the factors/variables that
help predict the probability that a patient will
die in an adult intensive care unit.
There are 19 possible independent predictors,
some continuous, some discrete, and 200 people in
this study. The response is STA.
ID    ID number of the patient
STA   Vital status (0 = Lived, 1 = Died)
AGE   Patient's age in years
SEX   Patient's sex (0 = Male, 1 = Female)
RACE  Patient's race (1 = White, 2 = Black, 3 = Other)
SER   Service at ICU admission (0 = Medical, 1 = Surgical)
CAN   Is cancer part of the present problem? (0 = No, 1 = Yes)
CRN   History of chronic renal failure (0 = No, 1 = Yes)
INF   Infection probable at ICU admission (0 = No, 1 = Yes)
CPR   CPR prior to ICU admission (0 = No, 1 = Yes)
SYS   Systolic blood pressure at ICU admission (in mm Hg)
HRA   Heart rate at ICU admission (beats/min)
RE    Previous admission to an ICU within 6 months (0 = No, 1 = Yes)
TYP   Type of admission (0 = Elective, 1 = Emergency)
FRA   Long bone, multiple, neck, single area, or hip fracture (0 = No, 1 = Yes)
PO2   PO2 from initial blood gases (0 = >60, 1 = ≤60)
PH    pH from initial blood gases (0 = ≥7.25, 1 = <7.25)
PCO   PCO2 from initial blood gases (0 = ≤45, 1 = >45)
BIC   Bicarbonate from initial blood gases (0 = ≥18, 1 = <18)
CRE   Creatinine from initial blood gases (0 = ≤2.0, 1 = >2.0)
LOC   Level of consciousness at admission (0 = no coma or stupor, 1 = deep stupor, 2 = coma)
Initially we fit a model with all 19 predictors
and use the significance tests to see which
variables are important.

Generalized linear models

Logistic regression is just one example from a
family of techniques called generalized linear
models (GLMs).
GLMs have the same form as the logistic model,
but the function f() changes depending on which
GLM you're fitting, and the distribution of the
errors is not always normal (it isn't for the
logistic model).
f() is called the link function, and there is
usually a typical link corresponding to each
distribution.
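
For reference, the general form and the usual (canonical) links for three common distributions can be written as below. The notation is an assumption of this sketch: \mu_i for the mean of the ith response, and x_{i1}, ..., x_{ik} for its k predictor values.

```latex
% General GLM form: the link f connects the mean response to the linear predictor.
\[
  f(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
  \qquad \mu_i = E[y_i]
\]

% Typical links by distribution.
\begin{align*}
  \text{Normal (ordinary regression):} \quad & f(\mu) = \mu \\
  \text{Binomial (logistic regression):} \quad & f(\mu) = \log\frac{\mu}{1 - \mu} \\
  \text{Poisson:} \quad & f(\mu) = \log \mu
\end{align*}
```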

Everything else

Outside of GLMs there are a number of other
popular models.
Poisson regression and Cox proportional hazards
models are very common in survival analysis and
reliability.
Generalized additive models (GAMs) are used in
many different fields.
And then of course there is the whole field of
non-linear regression models. If you should ever
get into neural nets then you will find
non-linear regression plays a big part.

Prediction

It is sometimes necessary to use a regression
model for prediction.
You should be aware of a few things if you're
going to do this.
Never predict outside the range of your data.
Interpolate, but don't extrapolate. Why?
You haven't observed any data outside what you
have already, therefore you have no idea how the
response behaves outside the range.
Every predicted value is an estimate.
Your predicted values are not carved in stone.
They're estimates based on data, therefore there
is error in them.
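
One way to enforce the "interpolate, don't extrapolate" rule in code is to refuse predictions outside the observed range. The Python sketch below does this for a straight-line fit with one predictor; the function names and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict_within_range(x, y, x_new):
    """Predict at x_new, but only if it lies inside the observed x range."""
    if not min(x) <= x_new <= max(x):
        raise ValueError("x_new is outside the observed data; refusing to extrapolate")
    intercept, slope = fit_line(x, y)
    return intercept + slope * x_new


# Hypothetical data observed for x between 1 and 5.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
print(predict_within_range(x, y, 3.5))   # interpolation is allowed
# predict_within_range(x, y, 10) would raise ValueError (extrapolation)
```

The returned number is still only an estimate, so it should be reported with some measure of its error.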

Confidence intervals

It is possible to construct a confidence interval
around your regression line.
This is equivalent to constructing a confidence
interval for each fitted value.
Why would we do this?
Don't do it for no reason.
There is a difference between a confidence interval
for a fitted value and one for a future
(interpolated) value.
How do we do this?

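Let
$\hat{y}_i$ be the ith fitted value,
s be the residual standard error,
r = n - k - 1 be the degrees of freedom for the
residual standard error, and
$x_i$ be the ith row of the design matrix X, i.e. the
x values corresponding to the ith y value. Then a
100(1-α)% confidence interval for $\hat{y}_i$ is given
by
$\hat{y}_i \pm t_r(1 - \alpha/2) \, s \sqrt{x_i^T (X^T X)^{-1} x_i}$
where $t_r(1 - \alpha/2)$ is the 1 - α/2 quantile of the t distribution with r degrees of freedom.
If $\hat{y}_0$, with row of x values $x_0$, is an interpolated value then this becomes
$\hat{y}_0 \pm t_r(1 - \alpha/2) \, s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}$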
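
For the single-predictor case (k = 1, so r = n - 2), the two kinds of interval can be sketched in Python as follows. This assumes the standard least-squares formulas, in which the design-matrix term for a point x0 reduces to 1/n + (x0 - mean(x))^2 / Sxx. The function name and data are hypothetical, and the t critical value is passed in by hand (looked up from a table) because the Python standard library has no t quantile function.

```python
import math


def regression_intervals(x, y, x0, t_crit):
    """Fitted value at x0 with intervals for the line and for a future value."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error on r = n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))
    fit = intercept + slope * x0
    h = 1 / n + (x0 - x_bar) ** 2 / sxx        # design-matrix term for x0
    half_line = t_crit * s * math.sqrt(h)      # interval for the regression line
    half_new = t_crit * s * math.sqrt(1 + h)   # interval for a future observation
    return fit, (fit - half_line, fit + half_line), (fit - half_new, fit + half_new)


# Hypothetical data; 3.182 is the tabulated 97.5% t quantile on 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
fit, line_interval, new_interval = regression_intervals(x, y, 3, 3.182)
print(fit, line_interval, new_interval)
```

The interval for a future value is always the wider of the two, because it also includes the scatter of a single new observation around the line.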