Title: Last week we used stemplots and histograms to describe the shape, location, and spread of a distribution. This week we use numerical summaries of location and spread.
1. Summary Statistics
- Last week we used stemplots and histograms to
describe the shape, location, and spread of a
distribution. This week we use numerical
summaries of location and spread.
2. Main Summary Statistics by Type
- Central location
  - Mean
  - Median
  - Mode
- Spread
  - Variance and standard deviation
  - Quartiles and interquartile range (IQR)
- Shape
  - Statistical measures of shape (e.g., skewness and kurtosis) are available but are seldom used in practice (not covered)
3. Notation
- n = sample size
- X = variable
- xi = value of individual i
- Σ = sum all values (capital sigma)
- Illustrative example (sample.sav), data
  - 21 42 5 11 30 50 28 27 24 52
  - n = 10
  - X = age
  - x1 = 21, x2 = 42, ..., x10 = 52
  - Σx = 21 + 42 + ... + 52 = 290
4. Sample Mean
xbar = Σx / n
Illustrative example: n = 10 (data and intermediate
calculations on prior slide), so xbar = 290 / 10 = 29.0
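As a quick check of the arithmetic, here is a minimal Python sketch of the sample mean for the illustrative ages. The variable names (`ages`, `total`, `xbar`) are chosen here for illustration and are not part of the slides.

```python
# Ages from the illustrative example (sample.sav)
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)      # sample size
total = sum(ages)  # sum of all values
xbar = total / n   # sample mean = sum divided by n

print(n, total, xbar)  # 10 290 29.0
```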
5. Population Mean
- Same operation as sample mean, but based on entire population (N = population size)
- Not available in practice, but important conceptually
6. Interpretation of xbar
- Sample mean used to predict
  - an observation drawn at random from a sample
  - an observation drawn at random from the population
  - the population mean
- Gravitational center (balance point)
7. Median: a different kind of average
- Middle value
- Covered last week
- Order data
- Depth of median is (n + 1) / 2
- When n is odd → middle value
- When n is even → average two middle values
- Illustrative example, n = 10 → median has depth (10 + 1) / 2 = 5.5
05 11 21 24 27 28 30 42 50 52 → median = average of 27 and 28 = 27.5
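The depth rule can be sketched in Python as follows. The function name `median_by_depth` is an illustrative choice, not something defined in the slides; in practice a library routine would usually be used instead.

```python
def median_by_depth(values):
    """Median via the depth rule: depth = (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: depth is a whole number, so take the middle value
        return ordered[int(depth) - 1]
    # Even n: depth ends in .5, so average the two values on either side
    lower = ordered[int(depth) - 1]
    upper = ordered[int(depth)]
    return (lower + upper) / 2


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(median_by_depth(ages))  # 27.5
```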
8. Median is robust
- Robust = resistant to skews and outliers
This data set has a mean (xbar) of 1600:
1362 1439 1460 1614 1666 1792 1867
This data set has an outlier and a mean of 2743:
1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances. The median
was not influenced by the outlier.
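The comparison can be reproduced with Python's standard `statistics` module. This is only a sketch; the list names `rates` and `with_outlier` are illustrative.

```python
from statistics import mean, median

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_outlier = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # 1867 replaced by 9867

# The mean shifts a lot when one value becomes extreme...
print(mean(rates), round(mean(with_outlier)))

# ...while the median stays at the same middle value
print(median(rates), median(with_outlier))
```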
9. Mode
- Mode = value with greatest frequency
- e.g., 4, 7, 7, 7, 8, 8, 9 has mode 7
- Used only in very large data sets
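One way to find the mode in Python is to tally frequencies with `collections.Counter`. This is a sketch for the example above; note that `most_common(1)` reports only one value even when several values tie for the greatest frequency.

```python
from collections import Counter

data = [4, 7, 7, 7, 8, 8, 9]

counts = Counter(data)                       # frequency of each value
value, frequency = counts.most_common(1)[0]  # most frequent value and its count

print(value, frequency)  # 7 3
```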
10. Mean, Median, Mode
- Symmetrical data: mean = median
- Positive skew: mean > median (mean gets pulled by tail)
- Negative skew: mean < median
11. Spread = Variability
- Variability = amount values spread above and below the average
- Measures of spread
  - Range and interquartile range
  - Standard deviation and variance (this week)
12. Range = max - min
The range is rarely used in practice because it tends
to underestimate the population range and is not
robust
13. Standard deviation
Most common descriptive measure of spread
Sample variance: s² = Σ(xi - xbar)² / (n - 1)
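14. Standard deviation (formula)
s = √(s²) = √[Σ(xi - xbar)² / (n - 1)]
Sample standard deviation s is the usual estimator of the
population standard deviation σ (strictly, it is the sample
variance s² that is an unbiased estimator of σ²).
Population standard deviation σ is rarely known
in practice.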
15. New data set (Metabolic Rates). This example is
not in your lecture notes
- Metabolic rates (cal/day), n = 7
- 1792 1666 1362 1614 1460 1867 1439
16. Metabolic rates showing mean and deviations
of first two observations
17. Standard Deviation Calculation (metabolic.sav,
introduced slide 15)
Observations   Deviations             Squared deviations
1792           1792 - 1600 =  192     (192)²  = 36,864
1666           1666 - 1600 =   66     (66)²   =  4,356
1362           1362 - 1600 = -238     (-238)² = 56,644
1614           1614 - 1600 =   14     (14)²   =    196
1460           1460 - 1600 = -140     (-140)² = 19,600
1867           1867 - 1600 =  267     (267)²  = 71,289
1439           1439 - 1600 = -161     (-161)² = 25,921
SUMS           Σ = 0                  SS = 214,870
Sum of deviations will always equal zero
18. Standard Deviation: Metabolic data (cont.)
Variance (s²) = SS / (n - 1) = 214,870 / 6 = 35,811.667
Standard deviation (s) = √35,811.667 = 189.24
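The hand calculation for the metabolic data (deviations, sum of squares, variance, standard deviation) can be checked with a short Python sketch. The variable names are illustrative, and the n - 1 divisor is the usual sample-variance convention.

```python
import math

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # metabolic rates (cal/day)

n = len(rates)
xbar = sum(rates) / n                    # sample mean
deviations = [x - xbar for x in rates]   # each observation minus the mean
ss = sum(d ** 2 for d in deviations)     # sum of squared deviations (SS)
variance = ss / (n - 1)                  # sample variance
sd = math.sqrt(variance)                 # sample standard deviation

print(xbar, sum(deviations), ss)
print(round(variance, 3), round(sd, 2))
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor and can serve as a second check.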
19. General rule for rounding means and standard
deviations
- Report the mean to one additional decimal place beyond that of the data
- To achieve accuracy, intermediate calculations should carry at least one further decimal place
- Illustrative example
  - Suppose data are recorded with one-decimal accuracy (i.e., xx.x)
  - Report mean with two-decimal accuracy (i.e., xx.xx)
  - Carry all intermediate calculations with at least three-decimal accuracy (i.e., xx.xxx)
Even more important: always use common sense and
judgment.
20. TI-30XIIS about 12
In practice, we often use software or a
calculator to check our standard deviation
21. Interpretation of Standard Deviation
- Larger standard deviation → greater variability
  - s1 = 15 and s2 = 10 → group 1 has more variability
- 68-95-99.7 rule: Normal data only
  - 68% of data within 1 SD of mean, 95% within 2 SD of mean, and 99.7% within 3 SD of mean
  - e.g., if mean = 30 and SD = 10, then 95% of individuals are in the range 30 ± (2)(10) = 30 ± 20 (10 to 50)
- Chebychev's rule: All data
  - at least 75% of data within 2 SD of mean
  - e.g., mean = 30 and SD = 10, then at least 75% of individuals in range 30 ± (2)(10) (10 to 50)
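Both rules use the same interval, mean ± k SD, and differ only in the share of data they promise. A minimal Python sketch follows; the helper names are hypothetical, and the general bound 1 - 1/k² is the standard statement of Chebychev's rule, of which the slide quotes only the k = 2 case.

```python
def sd_interval(mean, sd, k):
    """Return (low, high) for the range mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


def chebychev_min_fraction(k):
    """Lower bound on the fraction of any data set within k SDs (for k > 1)."""
    return 1 - 1 / k ** 2


print(sd_interval(30, 10, 2))     # (10, 50)
print(chebychev_min_fraction(2))  # 0.75
```

For Normal data the 68-95-99.7 rule says about 95% of values fall in that 10 to 50 range; Chebychev's rule guarantees only at least 75%, but it holds for any distribution.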
22. Quartiles and IQR
- Quartiles divide the ordered data into four equally-sized groups
- Q0 = minimum
- Q1 = 25th percentile
- Q2 = 50th percentile (median)
- Q3 = 75th percentile
- Q4 = maximum
23. Rule for quartiles
- Find the median → Q2
- Middle of lower half of data set → Q1
- Middle of upper half of the data → Q3
Bottom half: 05 11 21 24 27     Top half: 28 30 42 50 52
Q1 = 21, Q2 = 27.5, Q3 = 42
IQR = Q3 - Q1 = 42 - 21 = 21 gives spread of
middle 50% of the data
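The rule for quartiles can be sketched in Python as below. The function names are illustrative. For odd n the sketch assumes the median is counted in both halves, as the metabolic.sav example in these slides does; statistical software often uses other quartile conventions and may give slightly different values.

```python
def middle(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the lower-half / upper-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2               # size of each half; median shared when n is odd
    q1 = middle(ordered[:half])       # middle of lower half
    q2 = middle(ordered)              # median
    q3 = middle(ordered[n - half:])   # middle of upper half
    return q1, q2, q3


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3, q3 - q1)  # 21 27.5 42 21
```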
24. 5-Point Summary (sample.sav)
- Q0 = 5 (minimum)
- Q1 = 21 (lower hinge)
- Q2 = 27.5 (median)
- Q3 = 42 (upper hinge)
- Q4 = 52 (maximum)
Best descriptive statistics for skewed data
25. Illustrative example (metabolic.sav)
1362 1439 1460 1614 1666 1792 1867 → median = 1614
Bottom half: 1362 1439 1460 1614 → Q1 = (1439 + 1460) / 2 = 1449.5
Top half: 1614 1666 1792 1867 → Q3 = (1666 + 1792) / 2 = 1729
5-point summary: 1362, 1449.5, 1614, 1729, 1867
26. Box-and-whiskers plot (boxplot)
- 5-point summary + outside values
- Procedure
  - Determine 5-point summary
  - Draw box from Q1 to Q3
  - Draw line at Q2
  - Calculate IQR = Q3 - Q1
  - Calculate fences
    - FLower = Q1 - 1.5(IQR)
    - FUpper = Q3 + 1.5(IQR)
  - Determine if there are any outside values; if so, plot separately
  - Determine inside values and draw whiskers from box to inside values
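The fence and whisker steps of the procedure can be sketched in Python. The function name and the dictionary keys are hypothetical, and the hinges Q1 and Q3 are passed in rather than computed. The data are those of boxplot example 2 in these slides (Q1 = 22, Q3 = 29).

```python
def fences_and_whiskers(values, q1, q3):
    """Apply the 1.5 x IQR fence rule given the hinges q1 and q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Outside values lie beyond a fence and are plotted separately
    outside = sorted(x for x in values if x < lower_fence or x > upper_fence)
    # Inside values lie within the fences; whiskers reach the extreme ones
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outside": outside,
        "lower_inside": min(inside),
        "upper_inside": max(inside),
    }


example2 = [3, 21, 22, 24, 25, 26, 28, 29, 31, 51]
result = fences_and_whiskers(example2, q1=22, q3=29)
print(result["lower_fence"], result["upper_fence"])    # 11.5 39.5
print(result["outside"])                               # [3, 51]
print(result["lower_inside"], result["upper_inside"])  # 21 31
```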
27. Boxplot example
05 11 21 24 27 28 30 42 50 52
- 5-point: 5, 21, 27.5, 42, 52
- IQR = 42 - 21 = 21
- FU = 42 + (1.5)(21) = 73.5
  - No values above (outside)
  - Upper inside value = 52
- FL = 21 - (1.5)(21) = -10.5
  - No values below (outside)
  - Lower inside value = 5
28. Boxplot example 2
3 21 22 24 25 26 28 29 31 51
- 5-point: 3, 22, 25.5, 29, 51
- IQR = 29 - 22 = 7
- FU = 29 + (1.5)(7) = 39.5
  - One outside (51)
  - Inside value = 31
- FL = 22 - (1.5)(7) = 11.5
  - One outside (3)
  - Inside value = 21
29. Boxplot example 3 (metabolic.sav)
1362 1439 1460 1614 1666 1792 1867
- 5-point: 1362, 1449.5, 1614, 1729, 1867 (slide 25)
- IQR = 1729 - 1449.5 = 279.5
- FU = 1729 + (1.5)(279.5) = 2148.25
  - None outside
  - Upper inside = 1867
- FL = 1449.5 - (1.5)(279.5) = 1030.25
  - None outside
  - Lower inside = 1362
30. Interpretation of boxplots
- Location
  - Position of median
  - Position of box
- Spread
  - Hinge-spread (box length) = IQR
  - Whisker-to-whisker spread (range or range minus the outside values)
- Shape
  - Symmetry of box
  - Size of whiskers
  - Outside values (potential outliers)
31. Side-by-side boxplots
Boxplots are especially useful for comparing
groups