Title: 2. Exploratory Data Analysis
1 2. Exploratory Data Analysis
- OR: An ABC of EDA
- Peter Watson
- http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/graphFAQ
2 No a priori ideas (model) in EDA!
- For classical analysis, the sequence is
- Problem -> Data -> Model -> Analysis -> Conclusions
- For EDA, the sequence is
- Problem -> Data -> Analysis -> Model -> Conclusions
3 EDA - Exploratory data analysis
- Informal graphical techniques (Tukey, 1977) which
- look at underlying structure
- identify outliers
- check assumptions in later formal analyses (Normality, equality of variance)
- Most of the EDA techniques are graphical and quite simple
- "A picture is worth more than ten thousand words" (Chinese proverb)
4 Graphical displays
- Histograms 
- Boxplots 
- Quantile Plots 
- Error Bar plots (groups) 
- Stem and Leaf Displays 
- Scatterplots (especially for checking linearity 
 of correlations, residual plots from regressions)
-  (see also regression talk) 
- Under SPSS EXPLORE
5 Symmetry
- Clustered around median
- median = mean = mode
- no skewness
- CIs of mean assume symmetry
6 Skew and kurtosis
- Skew
- < 0: downward straggle (long lower tail)
- 0: symmetric
- > 0: upward straggle (long upper tail)
- Rules of Thumb (Hair et al., 1998; Simon, 2002)
- Negative skew: < -1
- Positive skew: > 1
- Kurtosis
- < 0: flat (platykurtic)
- 0: normal peak
- > 0: peaked about mean (leptokurtic)
- Rules of Thumb (Simon, 2002)
- Positive kurtosis: > 3
- Negative kurtosis: < -3
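The rules of thumb above can be tried out with a minimal Python sketch. It uses simple moment estimators (SPSS reports bias-adjusted versions, so values may differ slightly); the function names and the example scores are hypothetical, not taken from the talk.

```python
def skew_and_excess_kurtosis(values):
    """Return (skew, excess kurtosis) from simple moment estimators."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a Normal distribution
    return skew, excess_kurtosis


def flag_by_rule_of_thumb(skew, excess_kurtosis):
    """Apply the cut-offs quoted on the slide: |skew| > 1, |kurtosis| > 3."""
    return {"skewed": abs(skew) > 1, "kurtotic": abs(excess_kurtosis) > 3}


# Hypothetical scores with one large value in the upper tail.
scores = [0, 1, 1, 2, 2, 3, 4, 5, 9, 46]
skew, kurt = skew_and_excess_kurtosis(scores)
print(round(skew, 2), round(kurt, 2), flag_by_rule_of_thumb(skew, kurt))
```

A single large value in the upper tail is enough to push the skew past the +1 cut-off.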
7 Peakedness
- Kurtosis measures peakedness
- Don't want too peaked or too uniform distributions
- Too peaked - no variation
- Too flat - no one typical value
8 Types of Kurtosis (Miles & Shevlin, 2001)
9 Bimodality
- This is a mixture of two distributions (clear dip around the middle; multi-peaked)
- Histograms are usually good at spotting this 
- Suggests modelling the first half and second half 
 separately
10 Beck Score
- Positive skew (> 1.0)
- Most scores around zero 
- Scores above 13 - clinically depressed 
- One score of 46! 
11 Boxplots
12 Boxplots
- Median = line in red box
- Middle half of data in red box (about 1.3 sds wide in Normal data)
- Outliers = circles and stars
- Shape of data 
13 Outliers in boxplots
- Inner fence - moderately weird. Over 1.5 boxlengths from upper/lower quartiles. Circles in SPSS
- (2.67 sds from mean in Normal data)
- Outer fence - decidedly weird. Over 3 boxlengths from upper/lower quartiles. Asterisks in SPSS
- (4.67 sds from mean in Normal data)
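The fence distances can be checked from the quartiles of a standard Normal distribution. This is a minimal sketch using Python's standard library; `fence_in_sd_units` is a hypothetical helper name. With unrounded quartiles the fences come out near 2.70 and 4.72 sds; the slide's 2.67 and 4.67 correspond to rounding the quartile to 0.67 and the box length to 1.33.

```python
from statistics import NormalDist


def fence_in_sd_units(multiplier):
    """Distance of a boxplot fence from the mean, in sds, for Normal data."""
    upper_quartile = NormalDist().inv_cdf(0.75)  # about 0.674 sds above the mean
    box_length = 2 * upper_quartile              # interquartile range, about 1.349 sds
    return upper_quartile + multiplier * box_length


inner = fence_in_sd_units(1.5)  # circles in SPSS
outer = fence_in_sd_units(3.0)  # asterisks in SPSS
print(round(inner, 2), round(outer, 2))
```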
14 Hinges
- Boxplots actually use Hinges to define locations 
 of boxes and outliers
- Upper Hinge similar to Upper quartile 
- Lower Hinge similar to Lower quartile 
- Inter-hinge spread similar to interquartile range 
15 Boxplot of Beck score
- Positive skew 
- Concentration of outliers above median score
16 Effect of an outlier
- biases mean (green line) 
- inflates variance of mean 
- median more robust
17 Robustness to outliers
- Number of positive responses (max = 6)
- 0,0,0,4,4,5,5,5,6,6,6,6,6,6,6
- 95% Bootstrap Confidence Intervals
- Median (4.89, 5.11), Observed Median = 5.00
- Mean (4.11, 4.41), Observed Mean = 4.33
18 Consistency of median
19 Obtaining 95% CIs for skewed data: Bootstrapping (Efron & Tibshirani, 1993)
See also http://www.ruf.rice.edu/lane/stat_sim/sampling_dist/index.html
20 Example revisited
- 1000 random samples, drawn with replacement, of size equal to original sample (N = 15)
- Results
- Point estimates: Mean = 4.37, Median = 5
- 95% CIs: Mean (3.27, 5.27), Median (4, 6)
- Outliers exerting undue influence on the mean
- See http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/boot
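The resampling procedure can be sketched in a few lines of Python. This is an illustrative percentile bootstrap, not necessarily the exact method behind the figures quoted on the slide; `bootstrap_ci` and its parameters are hypothetical names, and the interval end points depend on the random seed.

```python
import random
import statistics

# The 15 responses listed on the "Robustness to outliers" slide.
responses = [0, 0, 0, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6]


def bootstrap_ci(data, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    # Each resample has the original size and is drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


mean_ci = bootstrap_ci(responses, statistics.mean)
median_ci = bootstrap_ci(responses, statistics.median)
print("mean CI:", mean_ci, "median CI:", median_ci)
```

Passing the statistic as a function means the same routine serves for the mean, the median, or any other summary.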
21 The sampling distribution of the variance follows a chi-square, which tends to N(n-1, 2(n-1)) as n increases
Should have s.e.(mean) = 5/sqrt(25) = 1
N = 25: Normal(24, 48); observed mean is 23.75, observed variance = 6.82 x 6.82 = 46.5
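A small simulation illustrates this. For Normal data, (n-1) x sample variance / sigma^2 follows a chi-square with n-1 degrees of freedom, so for n = 25 its mean should be near 24 and its variance near 48. The sketch below assumes a population s.d. of 5 (one reading of the slide) and an arbitrary number of simulated samples.

```python
import random
import statistics


def simulate_scaled_variances(n=25, sigma=5.0, n_samples=2000, seed=1):
    """Simulate (n-1) * s^2 / sigma^2 for repeated Normal samples of size n."""
    rng = random.Random(seed)
    scaled = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        scaled.append((n - 1) * statistics.variance(sample) / sigma ** 2)
    return scaled


scaled = simulate_scaled_variances()
# Compare with the chi-square(24) values: mean 24, variance 48.
print(round(statistics.mean(scaled), 2), round(statistics.variance(scaled), 2))
```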
22 Other approaches to identifying outliers (besides boxplots)
- Cases with z-scores exceeding 2.5 in absolute value
- (z-score subtracts mean and divides by s.d.)
- Grubbs' test (see CBU website for details)
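The z-score rule is easy to sketch in Python. The helper name and the example scores are hypothetical; the 2.5 cut-off is the one quoted above.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample s.d. (n - 1 denominator)
    return [x for x in values if abs(x - mean) / sd > threshold]


# Hypothetical scores with one extreme value.
scores = [0, 1, 1, 2, 2, 3, 4, 5, 9, 46]
print(zscore_outliers(scores))
```

An extreme value inflates the mean and s.d. used to judge it, so z-scores can understate how unusual an outlier is. This is one reason boxplot fences, based on quartiles, are a useful complement.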
23 Quantile Plots
- Raw Beck score
- deviates from straight line
- Substantial skew
- Limits choice of statistical tests we can use to analyse Beck score
- Bump above line = positive skew
24 Reverse scored Beck
- Beck is now negatively skewed
- the bump is now under the line
25 S shapes
- Symptoms of Kurtosis 
- uniformity 
- peakedness
26 Testing normality more formally
- Kolmogorov-Smirnov test
- Shapiro-Wilk test
- Overly sensitive for large samples 
Non-Normal 
27 Symmetric plots
- Rank Beck distances above and below the Beck score median
- Plots i-th lowest distance above the median against i-th lowest distance below the median
- Few distinct points are plotted: so many scores lie below the median, with multiple points sharing the same co-ordinates, that the plot doesn't show asymmetry very well
- If symmetric, points fall on line x = y
- Here, distances above median > distances below median
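The construction of a symmetry plot can be sketched as follows. As an illustration it uses the 15 responses from the "Robustness to outliers" slide rather than the Beck scores (which are not listed in the talk); the function name is hypothetical.

```python
import statistics


def symmetry_plot_points(values):
    """Pair the i-th smallest distance below the median with the i-th above it."""
    median = statistics.median(values)
    below = sorted(median - x for x in values if x < median)
    above = sorted(x - median for x in values if x > median)
    # zip stops at the shorter list, so unmatched distances are not plotted.
    return list(zip(below, above))


responses = [0, 0, 0, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6]
points = symmetry_plot_points(responses)
print(points)  # each pair is (distance below, distance above)
```

For these responses the pairs fall on only two distinct co-ordinates, and the distances below the median are the larger ones, i.e. the opposite asymmetry to the Beck score.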
28 Stem and Leaf of Beck Score
- Stem Leaf
- 6 . 0 = 6.0
- Each leaf = 4 cases
29 Temperature
- What is unusual about this distribution?
- Clue: spacing.
- Each leaf = two temperatures
- STEM LEAF
- -6 . 6 = -6.6 Degrees C
- Frequency Stem  Leaf 
-  2.00 -6 . 6 
-  4.00 -5 . 00 
-  10.00 -4 . 44444 
-  6.00 -3 . 338 
-  14.00 -2 . 2227777 
-  14.00 -1 . 1116666 
-  8.00 -0 . 0055 
-  12.00 0 . 005555 
-  6.00 1 . 616 
-  14.00 2 . 7777777 
-  16.00 3 . 33333888 
-  14.00 4 . 4444444 
-  13.00 5 . 555555 
-  11.00 7 . 22227 
-  7.00 8 . 888 
-  6.00 9 . 444 
30 Scales (c/o RSS News)
- Grain diameters recorded to nearest division, 1 inch apart
- Subsequently told to report in cm: 1 inch = 2.5 cm approx.
-  Raw data (1,0,2,1,1,0,4,1,3,0,1,1) in inches 
- (2.50, 0, 5.00, 2.50, 2.50, 0, 10.00, 2.50, 7.50, 
 0, 2.50, 2.50) in cm
- The village post office is 1.21 km (2 miles) 
 across the valley on the left
- (1 lb) 454 grammes of cheese, (1 pint) 560ml of 
 beer
- The human mind likes whole numbers 
31 Percentage success
- What is wrong with this graph?
- No axis labels or title
- Y axis strangely scaled: can't have percentages < 0 or > 100
- green markers smaller
- Green and red not distinguishable by colour blind person; yellow partially hidden by background
- Other caveats 
- Joining points can be misleading 
- make sure tick marks on scales are not too near 
 one another to give false effect.
32 Scale invariance or not.
- When is 40 approximately equal to 25?
- When is 73 equal to 111? (asked on University Challenge in 2005)
- ANSWERS
- 40 km is approximately 25 miles (km vs mph on a speedometer). 73 in base 10 is 111 in base 8. Computers think in binary (base 2)!
- But
- February's temperature was 55F (13C), which is three times the average
- So the average equals 55/3 = 18.3F (13/3 = 4.3C)?
- Why is this patently untrue?
- ANSWER
- 18.3F is below freezing (< 32F) but 4.3C is above freezing (> 0C).
33 One more thing...
- When is Halloween equal to Christmas?
- ANSWER: Oct. 31 = Dec. 25, i.e. 8x3 + 1 = 2x10 + 5
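The puzzles on these two slides can be checked in a few lines of Python. This is only an arithmetic sketch; the conversion factor 1 mile = 1.609344 km and the helper name are additions, not from the talk.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Base puzzles: the same digits read in base 8 give a different base-10 value.
halloween = int("31", 8)    # "Oct 31" read as octal
challenge = int("111", 8)   # 111 in base 8

# Unit puzzle: 40 km expressed in miles.
miles = 40 / 1.609344

# Temperature puzzle: a third of 55F, converted properly to Celsius.
third_in_celsius = fahrenheit_to_celsius(55 / 3)

print(halloween, challenge, round(miles, 1), round(third_in_celsius, 1))
# prints: 25 73 24.9 -7.6
```

Ratios such as "three times" only carry across units when both scales share the same zero, which Fahrenheit and Celsius do not.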
34 Error Bar Charts
- Interactive Bar charts
- Bar length represents 95% confidence interval for the mean
- females have higher depression scores than males
35 Bubble Plots (in R): years in education related to income/prestige
36 Multiple scatter plots (R)
37 Ladder of Powers (Marsh, 1988)
- Powers (double star function in SPSS COMPUTE), e.g. 3**2 = 9
-  2 square 
-  1 untransformed 
-  0.5 square root 
-  0 (natural) log 
- -0.5 inverse square root 
- -1 reciprocal 
- -2 inverse square 
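The ladder can be written as one function in which power 0 stands for the natural log. This is a minimal Python sketch with a hypothetical function name; like SPSS, Python uses `**` for powers.

```python
import math


def ladder_transform(x, power):
    """Apply one rung of the ladder of powers to a positive value x."""
    if x <= 0:
        raise ValueError("logs and negative powers need positive values")
    if power == 0:
        return math.log(x)  # the log takes the place of the zeroth power
    return x ** power


for power in (2, 1, 0.5, 0, -0.5, -1, -2):
    print(power, round(ladder_transform(4.0, power), 4))
```

Moving down the ladder (towards negative powers) pulls in a long upper tail; moving up stretches it.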
38 Choosing a power
- Trial and Error
- Box-Cox transformation
- SPSS Box-Cox macro available at
- http://stat.tamu.edu/ftp/pub/mspeed/stat653/spss/
39 Box-Cox applied to Beck score
- Looks for a power that minimises Beck score variance
- Suggests a power of 0.3 (near to log transform (power = 0))
- Regression: to improve the fit of a covariate used to predict a test score, Box-Cox can flag up a non-linear relationship
- Can be used to help determine z-scores and means but can be misleading for very skewed data, e.g. when floor and ceiling effects are present
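One way to see what "a power that minimises variance" means is a grid search over the normalised Box-Cox transform, in which each power is rescaled by the geometric mean so that variances are comparable across powers. This is a sketch of the general idea, not the SPSS macro; the function names, the grid and the example data are hypothetical. Box-Cox needs strictly positive values, so scores containing zeros would need an offset first.

```python
import math
import statistics


def normalised_boxcox(values, power):
    """Box-Cox transform rescaled by the geometric mean of the data."""
    gm = math.exp(statistics.fmean([math.log(x) for x in values]))
    if abs(power) < 1e-12:
        return [gm * math.log(x) for x in values]  # limiting case: log transform
    return [(x ** power - 1) / (power * gm ** (power - 1)) for x in values]


def best_power(values, candidates):
    """Return the candidate power giving the smallest transformed variance."""
    return min(
        candidates,
        key=lambda p: statistics.pvariance(normalised_boxcox(values, p)),
    )


# Hypothetical positive data whose logs are symmetric about zero.
data = [math.exp(v) for v in (-2, -1, -0.5, 0, 0.5, 1, 2)]
grid = [i / 10 for i in range(-20, 21)]  # powers from -2 to 2 in steps of 0.1
print(best_power(data, grid))
```

For these data the search settles on the log transform (power 0), as expected for values that are symmetric on the log scale.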
40 Predicted test score using a covariate vs actual test score (raw and square rooted)
More linear relationship taking square root (right hand side picture)
41 Box Cox on residual variance
- test score = constant + A*item score + residual
- Can use Box-Cox on residuals of fitting item score on test score
- suggests using square root of y
- This is the transform of test score which minimizes the residual variance
42 Exponential
- Clicks = constant + A*exp(-B*Age)
- Another type of non-linear relationship. Characterised by ever increasing rates of change as you get older
43 Log Beck
- Skew = 0.60
- Kurtosis = -0.06
- Acceptable using rule of thumb
44 Quantile plot - log Beck
- Fits closer to a straight line 
- Log transform has made the distribution more 
 Normal
- Log transform enables the use of more powerful 
 statistical tests
45 Symmetry of midpoints
- midpoints of percentiles 
- average of thresholds marking blue and green 
 areas should be equal in symmetric distributions
46 Midpoints of Beck score
- Beck
- Median = 6
- 0.5(Sum of Midpoints) - Median
- Quarters 8.3
- Eighths 16.7
- Sixteenths 38
- Log(Beck + 1)
- Median = 1.95
- 0.5(Sum of Midpoints) - Median
- Quarters 3.0
- Eighths 14.4
- Sixteenths 26.7
- MORE SYMMETRIC!
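The midpoint check can be sketched as below: for quarters, eighths and sixteenths, average the lower and upper cut-points and subtract the median. The function name and example scores are hypothetical, and the quantile definition used here (Python's default) may differ from the one behind the figures above, so this illustrates the idea rather than reproducing those numbers.

```python
import statistics


def midsummary_offsets(values):
    """0.5 * (lower cut-point + upper cut-point) - median, for three depths."""
    median = statistics.median(values)
    offsets = {}
    for name, n in (("quarters", 4), ("eighths", 8), ("sixteenths", 16)):
        cuts = statistics.quantiles(values, n=n)  # n - 1 cut-points
        offsets[name] = 0.5 * (cuts[0] + cuts[-1]) - median
    return offsets


symmetric = list(range(1, 32))                # 1, 2, ..., 31
skewed = [0, 1, 1, 2, 2, 3, 4, 5, 9, 46]      # hypothetical positively skewed scores
print(midsummary_offsets(symmetric))
print(midsummary_offsets(skewed))
```

Offsets near zero at every depth indicate symmetry; offsets that grow as the cut-points move into the tails indicate positive skew.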
47 Rank transform
- Downweight outliers 
- Useful if power transformations fail 
- Useful summary measures 
- Medians 
- Interquartile ranges (Boxplots) 
- Rank sums (Non-parametric tests) 
48 Using ranks - example
- Compare cost (in £) of two care centres
- Care Centres O & R
- Any patient cost saving?
49 Centre O stem & leaf display
- STEM WIDTH = 200
- 2 EXTREMES
- POSITIVE SKEW
50 Centre R - Stem and Leaf
- (stem width = 100)
- outliers present 
- positive skew 
- rank test needed
51 RESULTS
- UNRANKED
- t(147) = 0.91, p = .36
- centre costs the same
- Uses means
- RANKED
- mean Rank
- Study O 65.06
- Study R 85.63
- M-W Z = -2.96, p = .003
- Centre R costlier
- Uses ranks
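The ranked analysis can be sketched as a Mann-Whitney rank-sum calculation with a Normal approximation. This simplified version gives tied values their average rank but applies no tie correction to the variance, so it will not exactly match SPSS when ties are present; the function names and the cost figures are hypothetical.

```python
import math


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def mann_whitney_z(group_a, group_b):
    """Normal-approximation z for the rank sum of group_a (no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return (rank_sum_a - expected) / sd


# Hypothetical costs: centre_o has one extreme value that dominates its mean.
centre_o = [100, 120, 130, 150, 160, 2000]
centre_r = [170, 180, 200, 210, 220, 240]
print(round(mann_whitney_z(centre_o, centre_r), 2))
```

Because ranks downweight the extreme cost, the negative z shows centre O ranking lower overall even though its mean is the higher of the two.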
52 Nonparametric tests
- PROS 
- Downweight outliers 
- Fewer assumptions 
- Useful for skewed distributions 
- CONS 
- Less powerful 
- Lose information 
- Limited range of tests 
53 Equal Group Variances
- Important for t-tests and ANOVAs
- No covariate by group interaction in ANCOVA; Quade's (1967) method is a nonparametric equivalent
- May need to transform outcome 
- Tests available to identify problems 
54 Levene's test
- Are group variances equal? 
- Gets slope of spread vs location 
- Compares slope to 0 
- produces F-test 
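One common formulation of Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group mean. The sketch below computes only that F statistic, under this assumed formulation; it is not the SPSS implementation, and the function name and data are hypothetical.

```python
import statistics


def levene_f(*groups):
    """F statistic from a one-way ANOVA on absolute deviations from group means."""
    deviations = [[abs(x - statistics.fmean(g)) for x in g] for g in groups]
    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = sum(sum(d) for d in deviations) / n_total
    # Between-group and within-group sums of squares of the deviations.
    between = sum(
        len(d) * (statistics.fmean(d) - grand_mean) ** 2 for d in deviations
    )
    within = sum(
        sum((z - statistics.fmean(d)) ** 2 for z in d) for d in deviations
    )
    return (between / (k - 1)) / (within / (n_total - k))


narrow = [1, 2, 3, 4, 5]
wide = [-8, -4, 0, 4, 8]
print(round(levene_f(narrow, wide), 2))
```

The statistic is referred to an F distribution with (k - 1, N - k) degrees of freedom, where k is the number of groups and N the total sample size.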
55 Proportions
- Variance of a proportion depends on value of proportion!
- Arcsine transform resolves this
- In SPSS use function in COMPUTE to do transform
- 2 * arsin(sqrt(p))
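The same transform in Python, as a minimal sketch (the function name is hypothetical). The variance of a sample proportion is p(1 - p)/n, which changes with p. After the transform the approximate variance is 1/n whatever the value of p, because the derivative of 2*arcsin(sqrt(p)) is 1/sqrt(p(1 - p)).

```python
import math


def arcsine_transform(p):
    """Variance-stabilising transform for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return 2.0 * math.asin(math.sqrt(p))


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(arcsine_transform(p), 3))
```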
56 Funny you should say that...
- There is no truth to the allegation that 
 statisticians are mean. They are just your
 standard normal deviates.
- Why don't statisticians like to model new 
 clothes?
- Lack of fit. 
- Did you hear about the statistician who invented 
 a device to measure the weight of trees? It's
 referred to as the ? scale
- ?log 
- Old statisticians never die, they just undergo a 
 transformation.
- Or in summary: Normal lack of fit? Try a log transformation!
- http://research.microsoft.com/users/lamport/pubs/hair.pdf
57 And Finally...
-  A Statistician is someone who can have their 
 head in an oven and their feet in an ice box and
 say that on the whole they are feeling perfectly
 normal
- Check you are using appropriate summary measures 
- Further details including references on EDA at
- http://www.itl.nist.gov/div898/handbook/eda/eda.htm
- Thanks to Frank Duckworth (RSS News article on scales)
- Thanks to Chrissy Fletcher for supplying the 
 jokes
- Allan Reese (CEFAS, graphical comments) 
- Next week (Thursday). 11am 
- Ian Nimmo-Smith 
- The anatomy of statistical methods: models, hypotheses, significance and power