2. Exploratory Data Analysis - PowerPoint PPT Presentation

1
2. Exploratory Data Analysis
  • OR An ABC of EDA
  • Peter Watson
  • http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/graphFAQ

2
No a priori ideas (model) in EDA!
  • For classical analysis, the sequence is
  • Problem => Data => Model => Analysis =>
    Conclusions
  • For EDA, the sequence is
  • Problem => Data => Analysis => Model =>
    Conclusions

3
EDA - Exploratory data analysis
  • Informal graphical techniques (Tukey, 1977) which
  • look at underlying structure
  • identify outliers
  • check assumptions in later formal analyses
    (Normality, equality of variance)
  • Most of the EDA techniques are graphical and
    quite simple
  • A picture is worth more than ten thousand words
    (Chinese proverb)

4
Graphical displays
  • Histograms
  • Boxplots
  • Quantile Plots
  • Error Bar plots (groups)
  • Stem and Leaf Displays
  • Scatterplots (especially for checking linearity
    of correlations, residual plots from regressions)
  • (see also regression talk)
  • Under SPSS EXPLORE

5
Symmetry
  • Clustered around median
  • median = mean = mode
  • no skewness
  • CIs of mean assume symmetry

6
Skew and kurtosis
  • Skew
  • < 0: lower (downward) straggle
  • 0: symmetric
  • > 0: upper straggle
  • Rules of Thumb (Hair et al., 1998; Simon, 2002)
  • Negative skew: < -1
  • Positive skew: > 1
  • Kurtosis
  • < 0: flat (Platykurtic)
  • 0: normal peak
  • > 0: peaked about mean (Leptokurtic)
  • Rules of Thumb (Simon, 2002)
  • Positive kurtosis: > 3
  • Negative kurtosis: < -3

7
Peakedness
  • Kurtosis measures peakedness
  • Don't want too peaked or too uniform
    distributions
  • Too peaked - no variation
  • Too flat - no one typical value

8
Types of Kurtosis (Miles & Shevlin, 2001)
9
Bimodality
  • This is a mixture of two distributions (clear dip
    around the middle; multipeaked)
  • Histograms are usually good at spotting this
  • Suggests modelling the first half and second half
    separately

10
Beck Score
  • Positive skew (> 1.0)
  • Most scores around zero
  • Scores above 13 - clinically depressed
  • One score of 46!

11
Boxplots
12
Boxplots
  • median line in red box
  • Middle half in red box (1.3 sds)
  • Outliers: circles and stars
  • Shape of data

13
Outliers in boxplots
  • Inner fence - moderately weird. Over 1.5
    boxlengths from upper/lower quartiles. Circles in
    SPSS
  • (2.67 sds from mean in normal data)
  • Outer fence - decidedly weird. Over 3 boxlengths
    from upper/lower quartiles. Asterisks in SPSS
  • (4.67 sds from mean in normal data)
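
The sd equivalents quoted for the two fences can be checked with a short calculation. This is only a sketch: it assumes exactly normal data and uses exact normal quartiles, so it gives about 2.70 and 4.72 rather than the slide's 2.67 and 4.67, which presumably come from rounded inputs (quartile at 0.67 sd, box length 1.33 sd). The variable names are illustrative.

```python
from statistics import NormalDist

# For a standard normal distribution, express the boxplot fences in sd units.
z = NormalDist()                  # mean 0, sd 1
upper_quartile = z.inv_cdf(0.75)  # about 0.67 sd above the mean
box_length = 2 * upper_quartile   # interquartile range, about 1.35 sd

# Fences are measured from the quartile, in multiples of the box length.
inner_fence = upper_quartile + 1.5 * box_length
outer_fence = upper_quartile + 3.0 * box_length

print(f"box length  = {box_length:.2f} sd")
print(f"inner fence = {inner_fence:.2f} sd from the mean")
print(f"outer fence = {outer_fence:.2f} sd from the mean")
```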

14
Hinges
  • Boxplots actually use Hinges to define locations
    of boxes and outliers
  • Upper Hinge similar to Upper quartile
  • Lower Hinge similar to Lower quartile
  • Inter-hinge spread similar to interquartile range

15
Boxplot of Beck score
  • Positive skew
  • Concentration of outliers above median score

16
Effect of an outlier
  • biases mean (green line)
  • inflates variance of mean
  • median more robust

17
Robustness to outliers
  • Number of positive responses (max = 6)
  • 0,0,0,4,4,5,5,5,6,6,6,6,6,6,6
  • 95% Bootstrap Confidence Intervals
  • Median (4.89, 5.11); Observed Median = 5.00
  • Mean (4.11, 4.41); Observed Mean = 4.33

18
Consistency of median
19
Obtaining 95% CIs for skewed data: Bootstrapping
(Efron & Tibshirani, 1993)
See also http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
20
Example revisited
  • 1000 random samples of size equal to original
    sample (N = 15)
  • Results
  • Point estimates: Mean = 4.37, Median = 5
  • 95% CIs: Mean (3.27, 5.27); Median (4, 6)
  • Outliers exerting undue influence on the mean
  • See http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/boot
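
The resampling described above can be sketched in a few lines. This is a minimal percentile bootstrap, assuming simple resampling with replacement and the middle 95% of the resampled statistics as the interval; the function name `percentile_ci` and the fixed seed are illustrative choices, not taken from the slides, so the limits will not reproduce the slide's figures exactly.

```python
import random
from statistics import mean, median

responses = [0, 0, 0, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6]  # data from the slide

def percentile_ci(values, statistic, n_resamples=1000, level=0.95, seed=1):
    """Percentile bootstrap CI: resample with replacement, keep the middle 95%."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

mean_ci = percentile_ci(responses, mean)
median_ci = percentile_ci(responses, median)
print("observed mean  ", round(mean(responses), 2), "CI", mean_ci)
print("observed median", median(responses), "CI", median_ci)
```

The three zeros pull the resampled means around far more than the resampled medians, which is the sense in which the outliers influence the mean.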

21
The sampling distribution of the variance follows
a chi-square, which tends to N(n-1, 2(n-1)) as n
increases
s.e.(mean) should be 5/sqrt(25) = 1
N = 25: normal(24, 48); observed mean is 23.75,
observed variance = 6.82 x 6.82 = 46.5
22
Other approaches to identifying outliers (besides
boxplots)
  • Cases with z-scores exceeding 2.5
  • (z-score subtracts mean and divides by s.d.)
  • Grubbs' test (see CBU website for details)
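
The z-score rule can be sketched as follows. The scores are hypothetical (one extreme value added to an otherwise unremarkable sample), and `z_score_outliers` is an illustrative name; the 2.5 cutoff is the one quoted above.

```python
from statistics import mean, stdev

scores = [2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 46]  # hypothetical scores, one extreme

def z_score_outliers(values, cutoff=2.5):
    """Return the values whose absolute z-score exceeds the cutoff."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs((v - centre) / spread) > cutoff]

print(z_score_outliers(scores))
```

Because the extreme score also inflates the mean and the s.d. used to compute z, this rule becomes less sensitive as outliers get larger or more numerous.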

23
Quantile Plots
  • Raw Beck score
  • deviates from straight line
  • Substantial skew
  • Limits choice of statistical tests we can use to
    analyse Beck score
  • Bump above line: positive skew

24
Reverse scored Beck
  • Beck is now negatively skewed
  • the bump is now under the line

25
S shapes
  • Symptoms of Kurtosis
  • uniformity
  • peakedness

26
Testing normality more formally
  • Kolmogorov-Smirnov test
  • Shapiro-Wilk
  • Overly sensitive for large samples

Non-Normal
27
Symmetric plots
  • Rank Beck distances above and below the Beck
    score median
  • Plots i-th lowest distance above the median
    against i-th lowest distance below the median
  • Not many points plotted, as so many points lie
    below the median, so doesn't show asymmetry very
    well; multiple points with same co-ordinates
  • If symmetric, points fall on line x = y
  • distances above median > distances below median
28
Stem and Leaf of Beck Score
  • Stem & Leaf
  • 6 . 0 = 6.0
  • Each leaf = 4 cases

29
Temperature
  • What is unusual about this distribution?
  • Clue: spacing.
  • Each leaf = two temperatures
  • Stem -6, leaf 6 = -6.6 degrees C
  • Frequency    Stem & Leaf
       2.00       -6 . 6
       4.00       -5 . 00
      10.00       -4 . 44444
       6.00       -3 . 338
      14.00       -2 . 2227777
      14.00       -1 . 1116666
       8.00       -0 . 0055
      12.00        0 . 005555
       6.00        1 . 616
      14.00        2 . 7777777
      16.00        3 . 33333888
      14.00        4 . 4444444
      13.00        5 . 555555
      11.00        7 . 22227
       7.00        8 . 888
       6.00        9 . 444

30
Scales (c/o RSS News)
  • Grain diameters recorded to nearest division 1
    inch apart
  • Subsequently told to report in cm: 1 inch =
    2.5 cm approx.
  • Raw data (1,0,2,1,1,0,4,1,3,0,1,1) in inches
  • (2.50, 0, 5.00, 2.50, 2.50, 0, 10.00, 2.50, 7.50,
    0, 2.50, 2.50) in cm
  • The village post office is 1.21 km (2 miles)
    across the valley on the left
  • (1 lb) 454 grammes of cheese, (1 pint) 560ml of
    beer
  • The human mind likes whole numbers

31
Percentage success
  • What is wrong with this graph?
  • No axis labels or title
  • Y axis strangely scaled: can't have percentages
    < 0 or > 100
  • green markers smaller
  • Green and red not distinguishable by colour-blind
    person; yellow partially hidden by background
  • Other caveats
  • Joining points can be misleading
  • make sure tick marks on scales are not too near
    one another to give false effect.

32
Scale invariance or not.
  • When is 40 approximately equal to 25?
  • When is 73 equal to 111? (asked on University
    Challenge in 2005)
  • ANSWERS
  • In km and mph (40 km/h is about 25 mph). In base 8
    (111 in base 8 is 73 in decimal). Computers think
    in binary (base 2)!
  • But
  • February's temperature was 55F (13C), which is
    three times the average
  • So the average equals 55/3 = 18.3F (13/3 =
    4.3C)?
  • Why is this patently untrue?
  • ANSWER
  • 18.3F is below freezing (< 32F) but 4.3C is
    above freezing (> 0C).
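
The February temperature puzzle can be checked directly. This sketch uses only the standard Fahrenheit/Celsius conversion; the function and variable names are illustrative.

```python
def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32) * 5 / 9

def celsius_to_fahrenheit(temp_c):
    return temp_c * 9 / 5 + 32

february_f = 55
february_c = fahrenheit_to_celsius(february_f)  # about 12.8, the slide's 13C

third_of_f = february_f / 3  # 18.3F
third_of_c = february_c / 3  # about 4.3C

# If "three times the average" meant anything, these would be the same
# temperature. They are not: one is below freezing, the other above.
print(round(fahrenheit_to_celsius(third_of_f), 1), "C")
print(round(celsius_to_fahrenheit(third_of_c), 1), "F")
```

Ratios only make sense on scales with a true zero; Fahrenheit and Celsius both place zero arbitrarily, so dividing by 3 gives a different answer on each.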

33
One more thing...
  • When is Halloween equal to Christmas?
  • ANSWER: Oct. 31 = Dec. 25, i.e. 8x3 + 1 = 2x10 + 5
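
Both base-8 puzzles can be confirmed with Python's built-in base conversion; nothing here goes beyond the arithmetic on the slides.

```python
# int(text, base) reads a string of digits in the given base.
halloween = int("31", 8)       # "Oct. 31" read as octal
university = int("111", 8)     # 111 read as octal

print(halloween, university)   # decimal values of the two octal numbers
print(oct(25), oct(73))        # Python writes octal with a 0o prefix
```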

34
Error Bar Charts
  • Interactive Bar charts
  • Bar length represents 95% confidence interval for
    the mean
  • females have higher depression scores than males

35
Bubble Plots (in R): years in education related to
income/prestige
36
Multiple scatter plots (R)
37
Ladder of Powers (Marsh,1988)
  • Powers (double star function in SPSS COMPUTE),
    e.g. 3**2 = 9
  • 2 square
  • 1 untransformed
  • 0.5 square root
  • 0 (natural) log
  • -0.5 inverse square root
  • -1 reciprocal
  • -2 inverse square
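
The rungs of the ladder can be written as one small function. This is a sketch: `ladder_transform` is an illustrative name, and the restriction to strictly positive values is an assumption made here so that the log and the negative powers are always defined.

```python
import math

def ladder_transform(value, power):
    """Apply one rung of the ladder of powers to a positive value."""
    if value <= 0:
        raise ValueError("this sketch assumes strictly positive values")
    if power == 0:
        return math.log(value)  # the natural log takes the place of power 0
    return value ** power

ladder = [2, 1, 0.5, 0, -0.5, -1, -2]
for power in ladder:
    print(power, round(ladder_transform(9.0, power), 4))
```

Moving down the ladder (towards -2) pulls in a long upper tail more and more strongly; moving up does the opposite.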

38
Choosing a power
  • Trial and Error
  • Box-Cox transformation
  • SPSS Box-Cox macro available at
  • http://stat.tamu.edu/ftp/pub/mspeed/stat653/spss/

39
Box-Cox applied to Beck score
  • Looks for a power that minimises Beck score
    variance
  • Suggests a power of 0.3 (near to log transform
    (power = 0))
  • Regression: improve fit of a covariate to predict
    a test score; Box-Cox can flag up a non-linear
    relationship
  • Can be used to help determine z-scores and means
    but can be misleading for very skewed data e.g.
    when floor and ceiling effects are present
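
The idea of searching for a variance-minimising power can be sketched as a grid search. This is not the SPSS macro: it assumes one common formulation in which the Box-Cox transform is rescaled by the geometric mean so that variances are comparable across powers, and it runs on a hypothetical skewed sample (exponentiated normal quantiles) rather than the Beck data. All function names are illustrative.

```python
import math
from statistics import NormalDist, pvariance

def normalised_box_cox(values, power):
    """Box-Cox transform rescaled by the geometric mean (values must be > 0)."""
    logs = [math.log(v) for v in values]
    geometric_mean = math.exp(sum(logs) / len(logs))
    if power == 0:
        return [geometric_mean * lg for lg in logs]  # log is the power-0 case
    scale = power * geometric_mean ** (power - 1)
    return [(v ** power - 1) / scale for v in values]

def best_power(values, powers):
    """Grid search: the power whose rescaled transform has the smallest variance."""
    return min(powers, key=lambda p: pvariance(normalised_box_cox(values, p)))

# Hypothetical positively skewed sample whose log is exactly normal-shaped.
n = 50
sample = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]
grid = [k / 10 for k in range(-20, 21)]  # powers from -2 to 2 in steps of 0.1

print(best_power(sample, grid))
```

Scores containing zeros need a shift before any such search, since the log and negative powers are undefined at zero.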

40
Predicted test score using a covariate vs actual
test score (raw and square rooted)
More linear relationship taking square root
(right hand side picture)
41
Box Cox on residual variance
  • test score = constant + A x item score + residual
  • Can use boxcox on residuals of fitting item score
    on test score
  • suggests using square root of y
  • This is the transform of test score which
    minimizes the residual variance

42
Exponential
  • Clicks = constant + A x exp(-B x Age)
  • Another type of non-linear relationship.
    Characterised by ever increasing rates of
    changes as you get older

43
Log Beck
  • Skew = 0.60
  • Kurtosis = -0.06
  • Acceptable using rule of thumb

44
Quantile plot - log Beck
  • Fits closer to a straight line
  • Log transform has made the distribution more
    Normal
  • Log transform enables the use of more powerful
    statistical tests

45
Symmetry of midpoints
  • midpoints of percentiles
  • average of thresholds marking blue and green
    areas should be equal in symmetric distributions

46
Midpoints of beck score
  • Beck
  • Median = 6
  • 0.5(Sum of Midpoints) - Median
  • Quarters: 8.3%
  • Eighths: 16.7%
  • Sixteenths: 38%
  • Log(Beck + 1)
  • Median = 1.95
  • 0.5(Sum of Midpoints) - Median
  • Quarters: 3.0%
  • Eighths: 14.4%
  • Sixteenths: 26.7%
  • MORE SYMMETRIC!
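
A midpoint summary of this kind can be sketched as below. Two things are assumptions made for illustration: the gap is expressed as a percentage of the median, and the quantiles use Python's default interpolation rule, which may differ from the one behind the slide's figures. The data are hypothetical.

```python
from statistics import median, quantiles

def midpoint_asymmetry(values, n_parts):
    """Percent gap between the median and the midpoint of the outermost cut
    points when the data are split into n_parts equal parts (4 = quartiles)."""
    cuts = quantiles(values, n=n_parts)  # n_parts - 1 cut points
    midpoint = 0.5 * (cuts[0] + cuts[-1])
    centre = median(values)
    return 100 * (midpoint - centre) / centre

symmetric = list(range(1, 18))        # hypothetical, evenly spaced scores
skewed = [v * v for v in symmetric]   # hypothetical positively skewed scores

for parts in (4, 8, 16):
    print(parts,
          round(midpoint_asymmetry(symmetric, parts), 1),
          round(midpoint_asymmetry(skewed, parts), 1))
```

For symmetric data every midpoint sits on the median, so the gap stays at zero; with positive skew the gap grows as the cut points move further into the tails.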

47
Rank transform
  • Downweight outliers
  • Useful if power transformations fail
  • Useful summary measures
  • Medians
  • Interquartile ranges (Boxplots)
  • Rank sums (Non-parametric tests)

48
Using ranks - example
  • Compare cost (in ) of two care centres
  • Care Centres O & R
  • Any patient cost saving?

49
Centre O - Stem and Leaf
  • STEM WIDTH = 200
  • 2 EXTREMES
  • POSITIVE SKEW

50
Centre R - Stem and Leaf
  • (stem width = 100)
  • outliers present
  • positive skew
  • rank test needed

51
RESULTS
  • UNRANKED
  • t(147) = 0.91, p = .36
  • centre costs the same
  • Uses means
  • RANKED
  • mean Rank
  • Study O 65.06
  • Study R 85.63
  • M-W Z = -2.96, p = .003
  • Centre R costlier
  • Uses ranks
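
The contrast between the two analyses can be illustrated with a hand-rolled rank comparison. This is a sketch under stated assumptions: the costs are hypothetical (not the study's data), and the z uses the plain normal approximation to the Mann-Whitney statistic with no tie or continuity correction, so it will not match SPSS output exactly. Function names are illustrative.

```python
import math
from statistics import NormalDist, mean

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def mann_whitney_z(group_a, group_b):
    """Mean ranks plus a normal-approximation z and two-sided p-value."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    z = (u_a - n_a * n_b / 2) / math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return rank_sum_a / n_a, sum(ranks[n_a:]) / n_b, z, p_two_sided

# Hypothetical costs: centre O is cheaper for most patients but has one extreme.
centre_o = [100, 120, 130, 150, 160, 170, 180, 5000]
centre_r = [200, 210, 230, 250, 260, 280, 300, 320]

print("means     ", mean(centre_o), mean(centre_r))
print("rank test ", mann_whitney_z(centre_o, centre_r))
```

In this made-up example the single extreme cost drags the mean of centre O above that of centre R, while the ranks show centre R costing more for nearly every patient.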

52
Nonparametric tests
  • PROS
  • Downweight outliers
  • Fewer assumptions
  • Useful for skewed distributions
  • CONS
  • Less powerful
  • Lose information
  • Limited range of tests

53
Equal Group Variances
  • Important for t-tests and ANOVAs
  • No covariate-by-group interaction in ANCOVA;
    Quade's (1967) method is a nonparametric
    equivalent
  • May need to transform outcome
  • Tests available to identify problems

54
Levene's test
  • Are group variances equal?
  • Gets slope of spread vs location
  • Compares slope to 0
  • produces F-test

55
Proportions
  • Variance of a proportion depends on value of
    proportion!
  • Arcsine transform resolves this
  • In SPSS use function in COMPUTE to do transform
  • 2 * arsin(sqrt(p))
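
The transform itself is a one-liner outside SPSS as well. This sketch prints the variance of a raw proportion from n trials, p(1-p)/n, next to the transformed value; the choice of n = 50 and the three proportions are illustrative only.

```python
import math

def arcsine_transform(p):
    """2 * arcsin(sqrt(p)) for a proportion p between 0 and 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return 2 * math.asin(math.sqrt(p))

n = 50  # illustrative number of trials behind each proportion
for p in (0.1, 0.5, 0.9):
    raw_variance = p * (1 - p) / n  # changes with p
    print(p, round(raw_variance, 4), round(arcsine_transform(p), 3))
```

After the transform the variance is approximately 1/n whatever the value of p, which is what makes group variances comparable.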

56
Funny you should say that...
  • There is no truth to the allegation that
    statisticians are mean. They are just your
    standard normal deviates.
  • Why don't statisticians like to model new
    clothes?
  • Lack of fit.
  • Did you hear about the statistician who invented
    a device to measure the weight of trees? It's
    referred to as the ? scale
  • ?log
  • Old statisticians never die, they just undergo a
    transformation.
  • Or in summary: Normal lack of fit? Try a log
    transformation!
  • http://research.microsoft.com/users/lamport/pubs/hair.pdf

57
And Finally...
  • A Statistician is someone who can have their
    head in an oven and their feet in an ice box and
    say that on the whole they are feeling perfectly
    normal
  • Check you are using appropriate summary measures
  • Further details including references on EDA at
  • http://www.itl.nist.gov/div898/handbook/eda/eda.htm
  • Thanks to Frank Duckworth (RSS News article on
    scales)
  • Thanks to Chrissy Fletcher for supplying the
    jokes
  • Allan Reese (CEFAS, graphical comments)
  • Next week (Thursday). 11am
  • Ian Nimmo-Smith
  • The anatomy of statistical methods: models,
    hypotheses, significance and power