Title: Descriptive Statistics Part II
1Chapter 3
- Descriptive Statistics Part II
- Describing Central Tendency
- Measures of Variation
2Important Characteristics of data
- Center: A value that indicates where the middle of the data set is located.
- Variation: A measure of the amount that the data values vary among themselves.
- Distribution: The nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed).
- Outliers: Sample values that lie very far away from the vast majority of the other sample values.
3Describing Central Tendency
- Mean, $\mu$, is the average or expected value
- Median, Md, is the middle point of the ordered measurements
- Mode, Mo, is the most frequent value
- Percentiles and Quartiles
4Basic Symbols
- The sample size, i.e. the number of items in the sample, is denoted n.
- The population size, i.e. the total number of items in the entire population, is denoted N.
5Mean
- If the data are from a population, the mean is denoted by $\mu$ (mu).
- If the data are from a sample, the mean is denoted by $\bar{x}$.
- $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- The sample mean $\bar{x}$ is a point estimate of the population mean $\mu$.
6Example 3.1
- Suppose we compiled a sample of the weights of 5 professional football players: 255, 216, 346, 300, 270
- $\bar{x} = (255 + 216 + 346 + 300 + 270)/5 = 1387/5 = 277.4$
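The mean of the Example 3.1 weights can be checked with a few lines of Python. This is a minimal sketch; the function name `sample_mean` is chosen here for illustration and is not part of the course notes.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Weights of the 5 football players from Example 3.1
weights = [255, 216, 346, 300, 270]
print(sample_mean(weights))  # 277.4
```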
7Example 3.2
Given below is a sample of monthly rent values (in dollars) for one-bedroom apartments. The data are a sample of 70 apartments in a particular city, presented in ascending order.
Source: Anderson, Sweeney, and Williams
8The Median
- The median is a value such that at least 50% of all measurements are less than or equal to it and at least 50% of all measurements are greater than or equal to it.
- The median is the measure of location most often reported for annual income and property value data.
- This measure is used instead of the mean since a few extremely large incomes or property values can inflate the mean.
9- The median Md is found as follows:
- Arrange values in ascending order (smallest to largest).
- If the number of measurements is odd, the median is the middle value.
- If the number of measurements is even, the median is the average of the two middle values.
10Example 3.3
- Suppose the following represent a sample of salaries of 13 internists (in thousands of dollars):
- 127 132 138 141 144 146 152 154 165 171 177 192 241
- Since n = 13 (odd), the median is the middlemost or 7th measurement, Md = 152
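The procedure for finding the median can be sketched in Python as follows. The function name `median` and the small even-sized list are illustrative assumptions; the salary data are those of Example 3.3.

```python
def median(values):
    """Return the median: the middle value, or the average of the two middle values."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    data = sorted(values)          # arrange in ascending order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                 # odd n: the middle value
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even n: average the two middle values


# Internist salaries from Example 3.3 (in thousands of dollars)
salaries = [127, 132, 138, 141, 144, 146, 152, 154, 165, 171, 177, 192, 241]
print(median(salaries))        # 152
print(median([4, 1, 3, 2]))    # 2.5 (made-up even-sized list)
```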
11Example 3.2 Revisited
Median = (475 + 475)/2 = 475
12Mode
The mode, Mo, is the measurement that occurs most frequently.
- The greatest frequency can occur at two or more different values.
- If the data have exactly two modes, the data are bimodal.
- If the data have more than two modes, the data are multimodal.
- Mode is an important measure of location for qualitative data (the median and mean cannot be computed for qualitative data).
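Because a data set can have more than one mode, a helper that returns every most-frequent value is convenient. The sketch below is one possible way to do this; the function name `modes` and the example values are made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 1, 2, 2, 3]))                    # [1, 2] (bimodal)
print(modes(["red", "blue", "red", "green"]))    # ['red'] (qualitative data)
```

Note that the function works on qualitative labels as well as numbers, which matches the point above that the mode is the one measure of location available for qualitative data.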
13Mode
450 occurred most frequently (7 times), so Mode = 450
14Percentiles and Quartiles
- A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
- Admission test scores for colleges and universities are frequently reported in terms of percentiles.
15Percentiles
- The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
- Steps for computing percentiles
- 1. Arrange the data in ascending order.
- 2. Compute index i, the position of the pth percentile: i = (p/100)n
- 3a. If i is not an integer, round up. The pth percentile is the value in this position.
- 3b. If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
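The steps above can be sketched in Python. This is an illustration of the rule as stated in these notes, not of any library's percentile method (other software often interpolates differently). The function name `percentile` and the restriction 0 < p < 100 are assumptions made here; the data are the internist salaries from Example 3.3.

```python
import math


def percentile(values, p):
    """Return the p-th percentile using the round-up / average rule."""
    if not values:
        raise ValueError("percentiles of an empty data set are undefined")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)              # Step 1: ascending order
    n = len(data)
    i = p * n / 100                    # Step 2: index i = (p/100)n
    if i != int(i):                    # Step 3a: not an integer, so round up
        return data[math.ceil(i) - 1]  # positions are 1-based, list indexes 0-based
    i = int(i)                         # Step 3b: average positions i and i + 1
    return (data[i - 1] + data[i]) / 2


salaries = [127, 132, 138, 141, 144, 146, 152, 154, 165, 171, 177, 192, 241]
print(percentile(salaries, 25))   # 141  (i = 3.25, rounded up to 4)
print(percentile(salaries, 50))   # 152  (i = 6.5, rounded up to 7)
print(percentile(salaries, 75))   # 171  (i = 9.75, rounded up to 10)
```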
16Example 3.2 Revisited
90th Percentile
- i = (p/100)n = (90/100)70 = 63
- Averaging the 63rd and 64th data values
- 90th Percentile = (580 + 590)/2 = 585
17 65th Percentile
i = (p/100)n = (65/100)70 = 45.5
This is not an integer, so round i up to 46.
Data value in position 46 = 500, so 65th Percentile = 500
18Quartiles
- Quartiles are specific percentiles
- First Quartile = 25th Percentile
- Second Quartile = 50th Percentile = Median
- Third Quartile = 75th Percentile
19Example 3.2 Revisited
- Third quartile = 75th percentile
- i = (p/100)n = (75/100)70 = 52.5, which rounds up to 53
- Third quartile = data value in position 53 = 525
20Measures of Variation
- The range is the largest minus the smallest measurement.
- The variance is the average of the squared deviations from the mean.
- The standard deviation is the square root of the variance.
- In a comparison of multiple variables, the one with the largest variance shows the most variability in the data.
21Example 3.3 Revisited
Internists' Salaries (in thousands of dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Range = 241 - 127 = 114 (114,000 dollars)
22Example 3.2 Revisited
Range = largest value - smallest value
Range = 615 - 425 = 190
23Variance
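If the data set is a sample, the variance is denoted by $s^2$:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
If the data set is a population, the variance is denoted by $\sigma^2$:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$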
24Standard Deviation
If the data set is a sample, the standard deviation is denoted s:
$s = \sqrt{s^2}$
If the data set is a population, the standard deviation is denoted $\sigma$ (sigma):
$\sigma = \sqrt{\sigma^2}$
25Example 3.1 Revisited (recall $\bar{x}$ = 277.4)
$x_i$         $\bar{x}$   $x_i - \bar{x}$           $(x_i - \bar{x})^2$
$x_1$ = 255   277.4       255 - 277.4 = -22.4       (-22.4)^2 = 501.76
$x_2$ = 216   277.4       216 - 277.4 = -61.4       (-61.4)^2 = 3769.96
$x_3$ = 346   277.4       346 - 277.4 = 68.6        (68.6)^2 = 4705.96
$x_4$ = 300   277.4       300 - 277.4 = 22.6        (22.6)^2 = 510.76
$x_5$ = 270   277.4       270 - 277.4 = -7.4        (-7.4)^2 = 54.76
                                                    sum = 9543.2
Since this is sample data, the sum is divided by n - 1:
$s^2 = 9543.2/(5 - 1) = 2385.8$ and $s = \sqrt{2385.8} \approx 48.84$
26Example 3.1 Revisited (continued)
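If this were data from the entire population:
$x_i$   $\mu$    $x_i - \mu$   $(x_i - \mu)^2$
255     277.4    -22.4         501.76
216     277.4    -61.4         3769.96
346     277.4    68.6          4705.96
300     277.4    22.6          510.76
270     277.4    -7.4          54.76
                               sum = 9543.2
The sum is then divided by N:
$\sigma^2 = 9543.2/5 = 1908.64$ and $\sigma = \sqrt{1908.64} \approx 43.69$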
27Example 3.4
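Compute the standard deviation of the following sample data: 4, 5, 1, -2, 7
The sample mean is $\bar{x} = (4 + 5 + 1 - 2 + 7)/5 = 3$.
$x_i$   $\bar{x}$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
4       3           1                 1
5       3           2                 4
1       3           -2                4
-2      3           -5                25
7       3           4                 16
                                      sum = 50
$s^2 = 50/(5 - 1) = 12.5$ and $s = \sqrt{12.5} \approx 3.54$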
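The variance and standard deviation calculations for Examples 3.1 and 3.4 can be checked with a short sketch. The function names and the `sample` flag are illustrative choices made here: `sample=True` divides by n - 1, while `sample=False` treats the data as a whole population and divides by N.

```python
import math


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data to compute the variance")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


def std_dev(values, sample=True):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance(values, sample))


weights = [255, 216, 346, 300, 270]                 # Example 3.1
print(round(variance(weights), 2))                  # 2385.8
print(round(std_dev(weights), 2))                   # 48.84
print(round(variance(weights, sample=False), 2))    # 1908.64
print(round(std_dev(weights, sample=False), 2))     # 43.69

example_3_4 = [4, 5, 1, -2, 7]                      # Example 3.4
print(round(variance(example_3_4), 2))              # 12.5
print(round(std_dev(example_3_4), 2))               # 3.54
```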
28Z score
- The z-score is often called the standardized value.
- It is a measure of location that tells how far a particular observation is from the mean.
- It denotes the number of standard deviations a data value $x_i$ is from the mean.
- A data value less than the sample mean will have a z-score less than zero.
- A data value greater than the sample mean will have a z-score greater than zero.
- A data value equal to the sample mean will have a z-score of zero.
$z_i = \frac{x_i - \bar{x}}{s}$
where $x_i$ is the data value for which you want the z-score
29Example 3.1 Revisited
255, 216, 346, 300, 270
With $\bar{x} = 277.4$ and $s = 48.84$, the z-score for the data value 216 is
$z = (216 - 277.4)/48.84 \approx -1.26$
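The z-score for this example can be reproduced with a one-line function. This is a minimal sketch; the name `z_score` is chosen here, and the mean and standard deviation are the sample values for Example 3.1 used elsewhere in these notes (277.4 and 48.84).

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std_dev


print(round(z_score(216, 277.4, 48.84), 2))   # -1.26 (below the mean, so negative)
print(z_score(277.4, 277.4, 48.84))           # 0.0  (equal to the mean)
```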
30Chebyshev's Rule
- Chebyshev's rule applies to any data set, regardless of the shape of the distribution of the data.
- It can be used to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean.
- Chebyshev's Rule
- At least $(1 - 1/k^2)$ of the items in any data set will be within k standard deviations of the mean, where k is any value greater than 1.
- Implications
- At least 75% of the items must be within k = 2 standard deviations of the mean (i.e. within the interval $(\bar{x} - 2s, \bar{x} + 2s)$).
- At least 89% of the items must be within k = 3 standard deviations of the mean (i.e. within the interval $(\bar{x} - 3s, \bar{x} + 3s)$).
- At least 94% of the items must be within k = 4 standard deviations of the mean (i.e. within the interval $(\bar{x} - 4s, \bar{x} + 4s)$).
31Example 3.1 Revisited
Let k = 1.5 with $\bar{x} = 277.4$ and s = 48.84.
According to Chebyshev's Rule, at least $(1 - 1/(1.5)^2) = 1 - 0.44 = 0.56$, or 56%, of the football players' weights are between
$\bar{x} - k(s) = 277.4 - 1.5(48.84) = 204.14$ and
$\bar{x} + k(s) = 277.4 + 1.5(48.84) = 350.66$
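Chebyshev's bound and the corresponding interval are easy to compute for any k greater than 1. The following is a sketch only; the function names `chebyshev_bound` and `chebyshev_interval` are assumptions made for this illustration.

```python
def chebyshev_bound(k):
    """Minimum proportion of items within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def chebyshev_interval(mean, std_dev, k):
    """Interval (mean - k*s, mean + k*s) to which the bound applies."""
    return (mean - k * std_dev, mean + k * std_dev)


# Example 3.1 with k = 1.5
print(round(chebyshev_bound(1.5), 2))             # 0.56
low, high = chebyshev_interval(277.4, 48.84, 1.5)
print(round(low, 2), round(high, 2))              # 204.14 350.66

# The implications listed for k = 2, 3, 4
for k in (2, 3, 4):
    print(k, round(chebyshev_bound(k), 4))        # 0.75, 0.8889, 0.9375
```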
32Empirical Rule for Normal Populations
- For a Normal distribution, the Empirical Rule can be used to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean.
- If a population has mean $\mu$ and standard deviation $\sigma$ and is described by a normal curve, then
- 68.26% of the population measurements lie within one standard deviation of the mean: $[\mu - \sigma, \mu + \sigma]$
- 95.44% of the population measurements lie within two standard deviations of the mean: $[\mu - 2\sigma, \mu + 2\sigma]$
- 99.73% of the population measurements lie within three standard deviations of the mean: $[\mu - 3\sigma, \mu + 3\sigma]$
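The Empirical Rule percentages can be recomputed from the normal distribution itself. The sketch below uses the standard identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z; the function name `normal_within` is an assumption made here. The computed values for k = 1 and k = 2 differ from the figures quoted above in the last digit, presumably because the quoted figures come from a rounded normal table.

```python
import math


def normal_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * normal_within(k), 2))   # 68.27, 95.45, 99.73
```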
33Outliers
- Outliers are defined as sample values that lie very far away from the vast majority of the other sample values.
- Your book uses the term unusual to refer to outliers.
- We will use the course notes to numerically determine an outlier or an unusual value, NOT the criteria set in your textbook.
- If we assume the data are normally distributed, there are two ways to numerically determine if a sample value is an outlier:
- 1.) Determine the interval $(\bar{x} - 3s, \bar{x} + 3s)$. If a value is OUTSIDE the interval, then this value is an outlier.
- 2.) Compute the z-value for a sample value. If z > 3 or z < -3, then the value is an outlier.
34Example (Outliers)
- For the first quiz in MATH 123, the average quiz score was 16 (out of 20 pts.), with a variance of 4. Would a score of 11 be considered an unusually low score?
- We were given
- $\bar{x} = 16$, $s = 2$, and $x_i = 11$
- Note: s = 2, not 4, because we must take the square root of the variance to get the standard deviation.
- Using the 1st method, we determine the range
- $(\bar{x} - 3s, \bar{x} + 3s) = (16 - 3(2), 16 + 3(2)) = (10, 22)$.
- Since 11 is within this range, this score is not unusually low.
- Using the 2nd method, we determine the z-value
- $z = (11 - 16)/2 = -2.5$
- Since -2.5 is not less than -3, this score is not unusually low.
- Note: Either method can be used to determine whether a value is an outlier, because both will ALWAYS yield the same result.
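Both outlier checks from the course notes can be written as small functions and applied to the quiz example. This is a sketch under the same assumption as above (normally distributed data, three-standard-deviation cutoff); the function names are illustrative, and the score of 9 is a made-up value added to show a case that is flagged.

```python
import math


def is_outlier_by_interval(x, mean, std_dev):
    """Method 1: x is an outlier if it falls outside (mean - 3s, mean + 3s)."""
    return x < mean - 3 * std_dev or x > mean + 3 * std_dev


def is_outlier_by_z(x, mean, std_dev):
    """Method 2: x is an outlier if its z-value is above 3 or below -3."""
    z = (x - mean) / std_dev
    return z > 3 or z < -3


mean = 16
s = math.sqrt(4)          # the variance is 4, so the standard deviation is 2

print(is_outlier_by_interval(11, mean, s))   # False: 11 lies inside (10, 22)
print(is_outlier_by_z(11, mean, s))          # False: z = -2.5
print(is_outlier_by_interval(9, mean, s))    # True: 9 lies below 10
print(is_outlier_by_z(9, mean, s))           # True: z = -3.5
```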
35The End