Spearman - PowerPoint PPT Presentation

Provided by: DrJames75

1
Spearman's rank correlation
  • The standard correlation coefficient that we have
    seen so far is called the Pearson product-moment
    correlation.
  • In the social sciences especially, and in the
    biological sciences, it is often more common to
    use Spearman's rank correlation coefficient.
  • Spearman was a rather nasty individual who spent
    a lot of time trying to show blacks were inferior
    (and therefore only suitable for the infantry
    cannon fodder in the US army). Spearman is also
    responsible for a fair amount of work on IQ tests
    and some of the misconceptions associated with
    that idea.
  • The well-known evolutionary biologist Stephen J.
    Gould has a great book called The Mismeasure of
    Man on this subject.

2
  • As the name suggests, Spearman's correlation works
    with the ranks rather than the raw data.
  • What are the ranks?
  • Essentially, the ranks are almost the labels
    (positions) associated with the order statistics.
  • However, there is one key difference: ties.
  • In the case of a tie, there are a couple of ways
    to assign rank.
  • We could follow the method used in the Olympics:
    two people with equal times receive the same
    medal, but the next person drops one extra
    level. E.g. if there are two golds, then the next
    medal awarded is bronze.
  • Numerically, if there are k values tied at
    position x, the next largest value receives rank
    x + k.
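The Olympic-style rule can be sketched in R using the ties.method argument of rank(). The race times below are made up purely for illustration; they are not from the course data.

```r
# Hypothetical race times: the first two runners tie for first place
times <- c(9.8, 9.8, 10.1, 10.3)

# ties.method = "min" gives every tied value the lowest rank in its group,
# so k values tied at position x are followed by rank x + k
olympic_ranks <- rank(times, ties.method = "min")
print(olympic_ranks)
```

Here the two tied runners both get rank 1 and the next runner gets rank 1 + 2 = 3, which is the "two golds, then bronze" situation.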

3
  • Statistical ranking is slightly different.
  • Tied observations receive an average rank.
  • That is, if there is only one observation at
    position i, then it receives rank i.
  • If there is a tie between k order statistics,
    then they each receive the average of their
    labels. This is probably best demonstrated with
    an example.
  • Let x = (0.2, 0.0, -0.3, 0.3, 0.3, -0.4, 0.4,
    -0.4, -0.2, -0.4)
  • We sort this into order:
  • x = (-0.4, -0.4, -0.4, -0.3, -0.2, 0.0, 0.2,
    0.3, 0.3, 0.4)
  • Then the ranks are
  • 2, 2, 2, 4, 5, 6, 7, 8.5, 8.5, 10
  • The three values of -0.4 occupy positions 1, 2
    and 3, so each receives (1 + 2 + 3)/3 = 2. The
    two values of 0.3 occupy positions 8 and 9, so
    each receives 8.5.
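The average-rank example above can be checked in R. This is a minimal sketch using the x values from the slide; rank() uses average ranks for ties by default.

```r
x <- c(0.2, 0.0, -0.3, 0.3, 0.3, -0.4, 0.4, -0.4, -0.2, -0.4)

# rank() defaults to ties.method = "average" and returns the ranks
# in the original order of x, not in sorted order
ranks_original_order <- rank(x)

# Ranking the sorted data reproduces the ranks in the order listed on the slide
ranks_sorted <- rank(sort(x))
print(ranks_sorted)
```

Note that rank(x) itself keeps the original ordering of the observations, which is what is needed when pairing the ranks of x with the ranks of y.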

4
  • So, as previously stated, Spearman's rank
    correlation works with the ranks.
  • That is, we have a set of n observations on two
    variables, x and y.
  • Spearman's correlation is then given by
    $$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$
  • where d_i is the difference between the ith rank
    for x and the ith rank for y.
  • It turns out that, when there are no ties, this
    formula is algebraically equivalent to using the
    Pearson product-moment correlation on the ranks.
  • And this in turn provides a method for
    calculation of ρ.
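The equivalence between the rank-difference formula and the Pearson correlation of the ranks can be checked numerically. This is a small sketch on made-up data with no ties; the values of x and y are hypothetical.

```r
# Hypothetical paired observations with no tied values
x <- c(3.1, 1.2, 5.4, 2.2, 4.8)
y <- c(2.0, 1.5, 6.1, 3.3, 4.0)
n <- length(x)

# d holds the differences between the rank of x_i and the rank of y_i
d <- rank(x) - rank(y)

# Spearman's formula based on the squared rank differences
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Pearson product-moment correlation applied to the ranks
rho_pearson_on_ranks <- cor(rank(x), rank(y))

print(c(rho_formula, rho_pearson_on_ranks))
```

Here the rank differences are 1, 0, 0, -1, 0, so the sum of squares is 2 and both calculations give 1 - 12/120 = 0.9.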

5
  • This form of correlation used to be preferred for
    its computational simplicity.
  • However, with today's computational power this is
    no longer relevant.
  • For many practical purposes, although ρ is
    calculated on the ranks, it often doesn't give
    very different values from r, unless the data
    contain outliers or the relationship is strongly
    non-linear.
  • To calculate it using R, type
  • cor(rank(x), rank(y))
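R's cor() also has a method argument that does the ranking internally. The sketch below compares the two approaches on the x values from the earlier ranking example; the y values are made up for illustration.

```r
x <- c(0.2, 0.0, -0.3, 0.3, 0.3, -0.4, 0.4, -0.4, -0.2, -0.4)
# Hypothetical second variable, invented for this example
y <- c(1.1, 0.4, -0.2, 0.9, 1.5, -0.8, 2.0, -0.1, 0.3, -0.6)

# Spearman's correlation computed by ranking explicitly
rho_by_hand <- cor(rank(x), rank(y))

# The same quantity using the built-in method argument
rho_builtin <- cor(x, y, method = "spearman")

# Pearson's r on the raw data, for comparison
r_pearson <- cor(x, y)

print(c(rho_by_hand, rho_builtin, r_pearson))
```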

6
Advanced topics in regression
  • We have covered a very small subset of regression
    analysis
  • There is a whole raft of advanced techniques
    that are beyond the scope of this course.
  • However, you should at least be aware that there
    are special techniques for different types of
    situations.
  • One of the most commonly used regression methods
    is known as logistic regression.
  • Logistic regression is used for situations where
    the response is a proportion of the form m_i/n_i.
  • n_i is usually treated as fixed and m_i is an
    observed count (the number of "successes" out of
    n_i).
  • Logistic regression is often used to identify
    significant variables in surveys and case-control
    studies (e.g. the SIDS study).

7
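  • Logistic regression fits a model of the form
    $$ \pi_i = f(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) $$
    where \pi_i is the expected value of the
    proportion m_i/n_i
  • where the function f( ) is the logistic function
    $$ f(z) = \frac{e^{z}}{1 + e^{z}} $$
  • Hence the name of the procedure.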
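A logistic regression for proportions m_i/n_i might be fitted in R roughly as follows. This is only a sketch: the dose, n and m values are invented for illustration, and plogis() is R's built-in logistic function.

```r
# Hypothetical grouped data: m successes out of n trials at each dose
dose <- 1:5
n <- rep(20, 5)
m <- c(2, 5, 9, 14, 18)

# For grouped binomial data the response is given as cbind(successes, failures)
fit <- glm(cbind(m, n - m) ~ dose, family = binomial)

# Linear predictor beta_0 + beta_1 * dose for each observation
eta <- predict(fit, type = "link")

# Applying the logistic function to the linear predictor gives the
# fitted proportions
p_hat <- plogis(eta)

print(round(p_hat, 3))
```

The values in p_hat agree with fitted(fit), which is one way to see that the fitted proportions are the logistic function applied to a linear combination of the predictors.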

8
Example
  • These data come from a medical study.
  • The aim is to identify the factors/variables that
    help predict the probability that a patient will
    die in an adult intensive care unit (ICU).
  • There are 19 possible predictors, some continuous
    and some discrete, and 200 people in this study.
    The response is STA.
  • ID: ID number of the patient
  • STA: Vital status (0 = Lived, 1 = Died)
  • AGE: Patient's age in years
  • SEX: Patient's sex (0 = Male, 1 = Female)
  • RACE: Patient's race (1 = White, 2 = Black,
    3 = Other)
  • SER: Service at ICU admission (0 = Medical,
    1 = Surgical)
  • CAN: Is cancer part of the present problem?
    (0 = No, 1 = Yes)
  • CRN: History of chronic renal failure (0 = No,
    1 = Yes)
  • INF: Infection probable at ICU admission
    (0 = No, 1 = Yes)
  • CPR: CPR prior to ICU admission (0 = No,
    1 = Yes)

9
  • SYS: Systolic blood pressure at ICU admission (in
    mm Hg)
  • HRA: Heart rate at ICU admission (beats/min)
  • RE: Previous admission to an ICU within 6 months
    (0 = No, 1 = Yes)
  • TYP: Type of admission (0 = Elective,
    1 = Emergency)
  • FRA: Long bone, multiple, neck, single area, or
    hip fracture (0 = No, 1 = Yes)
  • PO2: PO2 from initial blood gases (0 = >60,
    1 = ≤60)
  • PH: pH from initial blood gases (0 = ≥7.25,
    1 = <7.25)
  • PCO: PCO2 from initial blood gases (0 = ≤45,
    1 = >45)
  • BIC: Bicarbonate from initial blood gases
    (0 = ≥18, 1 = <18)
  • CRE: Creatinine from initial blood gases
    (0 = ≤2.0, 1 = >2.0)
  • LOC: Level of consciousness at admission
    (0 = no coma or stupor, 1 = deep stupor, 2 = coma)
  • Initially we fit a model with all 19 predictors
    and use the significance tests to see which
    variables are important
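The "fit everything, then look at the significance tests" step might look like this in R. The ICU data are not reproduced here, so the sketch below uses a small simulated stand-in with only three of the predictors; icu_sim and all of its values are invented, and only the column names are borrowed from the study.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the ICU data (NOT the real study data)
icu_sim <- data.frame(
  AGE = round(runif(n, 18, 90)),    # age in years
  SYS = round(rnorm(n, 130, 30)),   # systolic blood pressure
  TYP = rbinom(n, 1, 0.7)           # 0 = elective, 1 = emergency
)

# Simulate vital status from an assumed logistic model
eta <- -3 + 0.03 * icu_sim$AGE - 0.01 * icu_sim$SYS + 1.5 * icu_sim$TYP
icu_sim$STA <- rbinom(n, 1, plogis(eta))

# STA ~ . regresses STA on every other column in the data frame
fit_full <- glm(STA ~ ., data = icu_sim, family = binomial)

# Coefficient table with estimates, standard errors, z values and p-values
coef_table <- summary(fit_full)$coefficients
print(round(coef_table, 3))
```

With the real data, the same call with all 19 predictors in the data frame (and ID excluded) would give one row per coefficient, and the p-value column is what the significance tests refer to.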

10
Generalized linear models
  • Logistic regression is just one example from a
    family of techniques called generalized linear
    models (GLMs).
  • GLMs have the same form as the logistic model.
  • But the function f() changes depending on which
    GLM you're fitting, and the distribution of the
    response is not always normal (it isn't for the
    logistic model, where it is binomial).
  • The inverse of f() is called the link function
    (the logit, for the logistic model), and there is
    usually a typical link corresponding to a
    distribution.
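In R, each GLM family object records its default link, which gives a quick way to see the "typical link" for a few common distributions. This is a small sketch using only base R family functions.

```r
# A few of the family objects that glm() accepts
families <- list(
  gaussian = gaussian(),
  binomial = binomial(),
  poisson  = poisson(),
  Gamma    = Gamma()
)

# Each family object stores the name of its default link function
default_links <- sapply(families, function(fam) fam$link)
print(default_links)

# linkinv is the inverse link; for the binomial family it is the
# logistic function, so a linear predictor of 0 maps to a probability of 0.5
p_at_zero <- binomial()$linkinv(0)
print(p_at_zero)
```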

11
Everything else
  • Beyond the GLMs covered so far there are a number
    of other popular models.
  • Poisson regression (itself a GLM) and Cox
    proportional hazards models are very common in
    survival analysis and reliability.
  • Generalized additive models (GAMs) are used in
    many different fields.
  • And then of course there is the whole field of
    non-linear regression models. If you should ever
    get into neural nets then you will find
    non-linear regression plays a big part.
12
Prediction
  • It is sometimes necessary to use a regression
    model for prediction
  • You should be aware of a few things if you're
    going to do this.
  • Never predict outside the range of your data.
  • Interpolate, but don't extrapolate. Why?
  • You haven't observed any data outside what you
    have already, therefore you have no idea how the
    response behaves outside that range.
  • Every predicted value is an estimate.
  • Your predicted values are not carved in stone.
  • They're estimates based on data, therefore there
    is error in them.
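One way to guard against accidental extrapolation is to check new x values against the observed range before predicting. The sketch below uses R's built-in cars data set (stopping distance versus speed) rather than any data from this course; the new speeds are arbitrary illustrative values.

```r
# Simple linear regression on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Speeds at which we would like a prediction (illustrative values)
new_speeds <- c(10, 40)

# Only values inside the observed range of speed count as interpolation
observed_range <- range(cars$speed)
in_range <- new_speeds >= observed_range[1] & new_speeds <= observed_range[2]

predictions <- predict(fit, newdata = data.frame(speed = new_speeds))

# Refuse to report predictions that would require extrapolation
predictions[!in_range] <- NA
print(data.frame(speed = new_speeds, in_range = in_range, dist = predictions))
```

predict() itself will happily return a number for a speed of 40, which is exactly why the range check has to be done by the analyst.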

13
Confidence intervals
  • It is possible to construct a confidence interval
    around your regression line
  • This is equivalent to constructing a confidence
    interval for each fitted value
  • Why would we do this?
  • Don't do it for no reason.
  • There is a difference between a confidence
    interval for a fitted value and an interval for
    a future (interpolated) value.
  • How do we do this?

14
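  • Let
  • $\hat{y}_i$ be the ith fitted value
  • s be the residual standard error
  • r = n - k - 1 be the degrees of freedom for the
    residual standard error, where k is the number of
    predictors
  • $x_i^T$ be the ith row of the design matrix X,
    i.e. the x values corresponding to the ith y
    value, then a 100(1-α)% confidence interval for
    $\hat{y}_i$ is given by
    $$ \hat{y}_i \pm t_{r}(\alpha/2) \, s \sqrt{x_i^T (X^T X)^{-1} x_i} $$
    where $t_{r}(\alpha/2)$ is the upper α/2 point of
    Student's t distribution on r degrees of freedom
  • If $\hat{y}_i$ is an interpolated value then this
    becomes
    $$ \hat{y}_i \pm t_{r}(\alpha/2) \, s \sqrt{1 + x_i^T (X^T X)^{-1} x_i} $$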
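Both intervals are available from predict() in R, and they can be checked against the standard textbook formulas. This is a sketch on the built-in cars data set, not on course data; the speed of 15 is an arbitrary illustrative value, and the names x0, h and tq are just local helpers.

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

# Interval for the fitted (mean) value and for a future observation
conf_int <- predict(fit, new, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new, interval = "prediction", level = 0.95)

# The same intervals by hand
X  <- model.matrix(fit)        # design matrix (intercept and speed)
x0 <- c(1, 15)                 # row of x values for the new point
s  <- summary(fit)$sigma       # residual standard error
r  <- df.residual(fit)         # n - k - 1 degrees of freedom
h  <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
tq <- qt(0.975, r)             # upper 2.5% point of t on r df
y_hat <- sum(coef(fit) * x0)

manual_conf <- y_hat + c(-1, 1) * tq * s * sqrt(h)
manual_pred <- y_hat + c(-1, 1) * tq * s * sqrt(1 + h)

print(rbind(conf_int, pred_int))
```

The interval for a future value is always wider, because the extra 1 under the square root accounts for the variability of a single new observation around the regression line.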