Title: Interval estimates of the mean for small n, σ unknown. Estimates of sample size.
1. Interval estimates of the mean for small n, σ
unknown. Estimates of sample size.
Economics 224 Notes for October 15
2. Interval estimates using the t distribution
- When a random sample of size n is drawn from a
normally distributed population whose standard
deviation σ is unknown, the sampling distribution
of the sample mean is described by a t
distribution: the standardized value
(x̄ - µ)/(s/√n) has a t distribution.
- The interval estimate has the same format as
earlier, but with a t value replacing the Z
value, and the sample standard deviation s
replacing σ. The interval estimate of the
population mean µ is
\[ \bar{x} \pm t \frac{s}{\sqrt{n}} \]
- A confidence level must be specified for each
interval estimate; the confidence level and the
sample size determine the t value.
3. t distribution
- The shape of the t distribution is similar to
that of the normal distribution: bell-shaped,
peaked in the centre, symmetrical about the mean,
and asymptotic to the horizontal axis.
- The table of the t distribution is a standardized
t distribution, that is, the t values have mean 0
and are measured in units of the estimated
standard error. The standard deviation of t is
√(df/(df - 2)) for df > 2, slightly above 1 and
approaching 1 as df increases.
- There are many different t distributions, one for
each degree of freedom (see explanation on a
later slide).
- For small degrees of freedom, the t distribution
is very dispersed. As the degrees of freedom
increase, the t distribution becomes more
concentrated around the mean.
- The limiting distribution for the t distribution
is the normal distribution. That is, as the
degrees of freedom increase, the t distribution
is approximated by the normal distribution.
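The dispersion point can be checked numerically. The following is a minimal sketch, assuming the standard result that a t distribution with df > 2 has variance df/(df - 2); the function name is illustrative and not from the text.

```python
import math

def t_standard_deviation(df):
    """Standard deviation of a t distribution; finite only for df > 2."""
    if df <= 2:
        raise ValueError("the variance of t is finite only for df > 2")
    return math.sqrt(df / (df - 2))

# The standard deviation shrinks toward 1 (the standard normal value)
# as the degrees of freedom increase.
for df in (3, 5, 11, 30, 100, 1000):
    print(f"df = {df:>4}: sd of t = {t_standard_deviation(df):.3f}")
```

For small df the value is well above 1, which is the extra dispersion described above; for large df it is practically 1.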
4. The concept of degrees of freedom (df)
- Degrees of freedom refers to how many sample
values can vary freely. In many statistical
procedures, some sample values are constrained by
the parameters to be estimated.
- When a t distribution describes the sampling
distribution of the mean, one degree of freedom
is lost since s is used as an estimate of σ (ASW,
304). Computing s uses the n deviations from the
sample mean, which must sum to zero, so only
n - 1 of them can vary freely. In this case, the
t distribution has n - 1 degrees of freedom,
where n is the sample size.
- Degrees of freedom are used in the chi-square
test and distribution and in regression and
analysis of variance models. The type and number
of constraints and degrees of freedom differ from
model to model.
5. t table (ASW, 303 and Appendix B, Table 2)
- In ASW, the t table gives areas in the upper tail
of the distribution. The t distribution is
symmetric about a mean of t = 0, so the same area
for the lower tail is given by the negative of
the t value in the table.
- Each degree of freedom defines a different t
distribution. For degrees of freedom above 100,
use Z values from the standard normal
distribution, since the normal distribution
closely approximates the t distribution for large
df.
- t values are given in the body of the table, with
areas under the curve given at the top of each
column, and df at the start of each row.
- The values in the table state the t value, or
number of standard errors, to the right of
the centre of the distribution that is required
to include all but the area in the right tail of
the distribution.
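Table entries of this kind can be reproduced without the printed table. The sketch below is one possible approach, not the method used to build the ASW table: it integrates the t density with Simpson's rule and finds the t value by bisection, using only the Python standard library. The function names are illustrative, and the search range assumes upper-tail areas between about 0.005 and 0.5 with df of 2 or more.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_area(t, df, steps=1000):
    """Area to the right of t >= 0, via Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_value(area, df):
    """t value that leaves `area` in the upper tail, found by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > area:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

# One row per df, one column per upper-tail area, as in the table layout.
for df in (8, 11, 100):
    row = "  ".join(f"{t_value(a, df):6.3f}" for a in (0.05, 0.025, 0.01))
    print(f"df = {df:>3}: {row}")
```

To three decimals this should agree with the table values quoted in these notes: 2.896 for df = 8 with 0.01 in the upper tail, and 1.796 for df = 11 with 0.05 in the upper tail.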
6. General notation for interval estimates
- The confidence coefficient is given the symbol
(1 - α). α is the first letter of the Greek
alphabet and is termed alpha. For interval
estimates, α is merely a symbol used to denote
the area in the two tails of a distribution. The
area in the middle of the distribution is (1 - α)
or (1 - α) × 100%, and there is α/2 of the area in
each of the tails of the distribution.
- When a sample mean is normally distributed, the Z
values for the 95% interval estimate, or the
(1 - 0.05) × 100% = 95% interval, are ±1.96. That
is, if α = 0.05,
\[ \alpha/2 = 0.025 \quad \text{and} \quad Z_{\alpha/2} = Z_{0.025} = 1.96 \]
- and the interval estimate of the population
mean µ is
\[ \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
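The link between the confidence coefficient and the Z value can be sketched with the Python standard library. `NormalDist` is a real class in the `statistics` module; the helper name `z_value` is illustrative.

```python
from statistics import NormalDist

def z_value(confidence):
    """Z value that leaves alpha/2 in each tail for confidence 1 - alpha."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} interval: Z = {z_value(confidence):.3f}")
```

For the 95% interval this gives the familiar 1.96, since α = 0.05 leaves 0.025 in each tail.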
7. Notation for interval estimates using the t
distribution
- For the t distribution, the notation is the same
as in the last slide, with t replacing Z. The
only addition is that for each df, there is a
different t value.
- For a sample mean x̄ with a t distribution, the t
values for the (1 - α) × 100% interval are
±t_{α/2} for the appropriate df.
- If a sample mean has a t distribution, df = n - 1,
that is, the sample size minus 1.
- For a 98% interval estimate of a population mean,
where the sample size is n = 9, or df = n - 1 =
8, t = 2.896.
- In this case, the interval estimate of the
population mean µ is
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = \bar{x} \pm 2.896 \frac{s}{\sqrt{9}} \]
8. Example of wages of workers employed at new jobs
after a plant shutdown I
- Prior to the shutdown of an Ontario manufacturing
plant in the 1990s, male workers were paid $13.76
per hour and female workers $11.80 per hour.
- Two years after the workers were laid off,
researchers located some of these laid off
workers. For twelve male workers who found new
jobs, the mean hourly wage was $12.20, with a
standard deviation of $3.27. For twelve female
workers who found new jobs, the mean hourly wage
was $8.11, with a standard deviation of $3.53.
- Obtain 90% interval estimates of the mean wage
for all laid off male workers who found new jobs,
and for all such female workers. What do you
conclude from these results?
Source: Data and research from Belinda Leach and
Anthony Winson, "Bringing 'Globalization' Down to
Earth: Restructuring and Labour in Rural
Communities," Canadian Review of Sociology and
Anthropology, 32:3, August 1995.
9. Example of wages II
- These are small samples, and for neither males
nor females is the population standard deviation
of wages known. While male wages in the new jobs
may not be exactly normally distributed, assume
that they are symmetrically distributed or close
to normal. Assume the same for the distribution
of wages of female workers. Assume that each
sample is a random sample of all laid off workers
who had new jobs at the time of the study.
- From the above assumptions, for each of male and
female workers who found new jobs, the
distribution of sample mean pay is a t
distribution. In each case, the sample has size
n = 12, so the df associated with each interval
estimate is df = 12 - 1 = 11. For a 90% interval
estimate, there is 90% or 0.90 of the area in the
middle of the distribution and 5% or 0.05 of the
area in each of the two tails of the
distribution. The appropriate t value for 11 df
and 90% confidence is 1.796.
10. Example of wages III
- For males, let µ be the mean wage of all those
laid off male workers who found new jobs. The
90% interval estimate for the population mean µ
is
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 12.20 \pm 1.796 \times \frac{3.27}{\sqrt{12}} \]
The standard error is 3.27/√12 = 0.944, so the
interval is 12.20 ± 1.70 or (10.50, 13.90).
- For females, the procedure is the same and the
resulting 90% interval estimate is
\[ 8.11 \pm 1.796 \times \frac{3.53}{\sqrt{12}} \]
The standard error is 3.53/√12 = 1.019, so the
interval is 8.11 ± 1.83 or (6.28, 9.94).
11. Example of wages IV
- These interval estimates provide evidence that
the mean hourly pay of laid off workers has
declined, strongly for females and less
conclusively for males.
- For males, the 90% confidence interval is $10.50
to $13.90. The sample mean of $12.20 is below the
pre-layoff level of $13.76, but the interval
includes $13.76, so a decline in mean pay for all
re-employed males is suggested rather than
established at this confidence level.
- For females the situation appears worse. The 90%
interval estimate of mean pay of re-employed
females is from $6.28 to $9.94, an interval well
below the pre-layoff level of $11.80.
- Neither result is certain but evidence at the
time of the study is that workers, especially
females, experienced a decline in mean pay, as
compared with the pre-layoff situation.
- Cautions: samples may not be random,
distribution of pay may not be normal,
uncertainty with only 90% confidence.
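The wage intervals can be recomputed directly from the figures in the example. This is a minimal sketch: the sample means, standard deviations, n = 12, and the table value t = 1.796 come from the slides, while the function and variable names are illustrative. The margin of error is the t value times the standard error s/√n.

```python
import math

def t_interval(mean, s, n, t_value):
    """Return (lower, upper) for mean ± t_value * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return mean - margin, mean + margin

T_11_DF_90 = 1.796  # table value for df = 11 with 0.05 in each tail

samples = {
    "males": {"mean": 12.20, "s": 3.27, "pre_layoff": 13.76},
    "females": {"mean": 8.11, "s": 3.53, "pre_layoff": 11.80},
}
for group, d in samples.items():
    lower, upper = t_interval(d["mean"], d["s"], 12, T_11_DF_90)
    inside = lower <= d["pre_layoff"] <= upper
    print(f"{group}: ({lower:.2f}, {upper:.2f}); "
          f"pre-layoff wage {d['pre_layoff']:.2f} inside interval: {inside}")
```

With these inputs the margins of error are about 1.70 for males and 1.83 for females. The pre-layoff male wage of 13.76 falls inside the male interval, while the pre-layoff female wage of 11.80 lies above the female interval.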
12. Using the t distribution
- Small n often occurs in practice.
- σ unknown is the usual situation.
- Normal distribution of population. An exactly
normal population is unlikely, but so long as the
population distribution is close to symmetric,
the results should still be reliable.
- In these situations, when estimating a population
mean, it is advisable to use the t distribution
as the sampling distribution of the sample mean,
rather than the normal distribution.
- For larger sample sizes, the sampling
distribution of sample means can be approximated
by the normal distribution.
- All of the above assume that the sample is a
random sample, or is equivalent to a random
sample.
13. t distribution in economic analysis
- Much economic analysis uses very large sample
sizes. But there are situations where n is small
and the population standard deviation is unknown.
Then the sample mean has a t distribution. When
n > 100, the normal provides a good approximation
to the t distribution.
- Experiments, administrative data, or other
situations with a small number of cases.
- Measurement error is often close to normally
distributed.
- Regression analysis, especially with time series
data, where the number of observations across
time is not large. Regression coefficients have
a t distribution (ASW, Ch. 12).
- Economic variables are sometimes assumed to be
normally distributed, but with unknown
variability, so the t distribution is used for
the distribution of the sample mean.
14. Estimating sample size (ASW, 310-313)
- Prior to conducting a research study, it is often
useful to estimate the sample size required to
achieve a particular margin of error, specifying
a confidence level.
- While this may not be the final sample size a
researcher obtains, the following calculations
provide an estimate of the number of population
elements from which a researcher should attempt
to obtain data. This, in turn, can be used to
plan the research study and estimate the time and
cost that will be required to conduct it.
- The final sample may differ because cost may be
too great, respondents may refuse to participate,
some questions may go unanswered, or time may be
insufficient.
- The method examined here provides the required
sample size for a random sample from a
population, given the margin of error and
confidence level.
15. Formula
\[ n = \frac{(Z_{\alpha/2})^2 \sigma^2}{E^2} \]
- Margin of error is E.
- Confidence level is (1 - α) × 100% and the
corresponding normal value is Z_{α/2}.
- Population standard deviation is σ.
- n is the required sample size when random
sampling from this population.
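16. Rationale for formula
- Formula for interval estimate is
\[ \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
- Researcher wishes an interval
\[ \bar{x} \pm E \]
- Let
\[ E = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
- And solve this expression for n, giving
\[ \sqrt{n} = \frac{Z_{\alpha/2} \sigma}{E} \quad \text{so} \quad n = \frac{(Z_{\alpha/2})^2 \sigma^2}{E^2} \]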
17. Example of sample size I
- A manager at Access Communications wants to know
whether it is worthwhile to target university
students for a promotion. In order to do this
she would like to know how many minutes of TV
students watch each day, accurate to within five
minutes, with 99% confidence. The upper limit
of budget expenditures for the study is $1,000.
- You have been hired as a consultant to the
manager and your task is to survey a sample of
university students to obtain the required
information. What sample size is required?
What would you recommend to the manager?
18. Example of sample size II
- Fortunately you have kept the Excel worksheet
from Economics 224 and when you check it, you
determine that the standard deviation of the
hours students in Economics 224 reported watching
TV daily was 1.298 hours.
- From this, you use the requirements specified by
the manager, that is, a margin of error of 5
minutes or E = 5/60 = 0.0833 hours and 99%
confidence. The Z value is 2.576, so the
required sample size is
\[ n = \frac{(Z_{\alpha/2})^2 \sigma^2}{E^2} = \frac{2.576^2 \times 1.298^2}{(5/60)^2} = 1609.9 \]
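The calculation can be sketched in a few lines. The inputs (Z = 2.576, a standard deviation of 1.298 hours, a margin of error of 5 minutes) come from the example; the function name is illustrative. The margin is kept as the exact fraction 5/60 rather than the rounded 0.0833, and the result is rounded up to a whole number of students.

```python
import math

def required_sample_size(z, sigma, margin):
    """n = (z * sigma / margin)^2, rounded up to a whole element."""
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    return math.ceil((z * sigma / margin) ** 2)

# sigma and margin must be in the same units (hours here).
n = required_sample_size(z=2.576, sigma=1.298, margin=5 / 60)
print(f"required sample size: {n}")
```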
19. Example III
- You report that the required sample size is at
least n = 1,610.
- You also report that a larger sample size might
be required, since your estimate of the
variability of the population may be low.
That is, σ for all university students might
exceed 1.3 hours.
- If this is a random sample, to obtain a sampling
frame and then contact each student by telephone,
email, or Canada Post, you estimate that the cost
of sampling is approximately $10 per student, for
a total cost of about $16,100, well over the
$1,000 budget.
- From this, you might recommend a smaller sample
size, with relaxed requirements, say E = 15
minutes, 95% confidence, for a sample of around
100 students. You might note that the required
margin of error of five minutes is very difficult
to obtain and too demanding.
- Explore less expensive methods of conducting the
survey.
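The comparison of the original and relaxed requirements can be sketched as follows. The standard deviation, the $10 cost per student, and the $1,000 budget are taken from the example; the function name `plan` and the constant names are illustrative. Z values are computed from the normal distribution rather than read from a table, which gives the same sample sizes here.

```python
import math
from statistics import NormalDist

SIGMA_HOURS = 1.298     # standard deviation from the Economics 224 worksheet
COST_PER_STUDENT = 10   # estimated sampling cost in dollars
BUDGET = 1000           # upper limit of expenditures in dollars

def plan(margin_minutes, confidence):
    """Return (required n, total cost) for a margin of error in minutes."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = math.ceil((z * SIGMA_HOURS / (margin_minutes / 60)) ** 2)
    return n, n * COST_PER_STUDENT

for margin_minutes, confidence in ((5, 0.99), (15, 0.95)):
    n, cost = plan(margin_minutes, confidence)
    print(f"E = {margin_minutes} min, {confidence:.0%} confidence: "
          f"n = {n}, cost = ${cost:,}, within budget: {cost <= BUDGET}")
```

Under these assumptions the relaxed plan needs 104 students, in line with the "around 100" mentioned above; a sample of exactly 100 would fit the budget at a slightly larger margin of error.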
20. Estimating σ
- To obtain an estimate of the required n, some
estimate of the population standard deviation σ
is required.
- Use s from previous studies or similar
populations.
- Pilot study. Obtain a preliminary estimate of σ.
- Judgment or best guess. Dividing the range by 4
can produce a reasonable provisional estimate of
σ. If there are outliers, it may be best to
eliminate these. For example, what is the σ of
income for Saskatchewan residents? Minimum $0
and maximum might be $10 million plus. But make
the range from $0 to $100,000 and this may
include 99% plus of the population. Rough
estimate of σ for Saskatchewan income would be
around $25,000. (For 2001, σ = $23,000, from
Census).
- Structure sampling procedure so that the sample
size can later be increased, if necessary.
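The range rule can be written out as a one-line helper. This is a rough sketch with an illustrative function name; the rule rests on the idea that, for roughly normal data, about 95% of values lie within two standard deviations of the mean, so the trimmed range spans about four standard deviations.

```python
def sd_from_range(low, high):
    """Provisional standard deviation: trimmed range divided by 4."""
    if high <= low:
        raise ValueError("high must exceed low")
    return (high - low) / 4

# Saskatchewan income example: range trimmed to $0 to $100,000.
print(f"provisional sigma: ${sd_from_range(0, 100_000):,.0f}")
```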
21. Additional notes about sample size I
- When obtaining n from the formula, round up.
- Make sure units for σ and E in the formula are
the same.
- n is larger with
  - Greater variability σ in the population.
  - Larger confidence level.
  - Smaller margin of error E.
- Trade off between costs and accuracy of results.
- For a random sample, n does not depend on
population size if the proportion of the
population sampled (n/N) is small.
- It may not be possible to obtain the required n,
so the researcher will have to settle for a larger
margin of error or reduced confidence level, or
both. For example, time series data on Internet
use may only be available for n = 10 years.
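How n responds to each of the three factors can be sketched by changing one input at a time. The baseline below reuses figures from the TV example (σ = 1.298 hours, E = 0.25 hours, Z = 1.96 for 95% confidence); the scenario labels and function name are illustrative.

```python
import math

def required_sample_size(z, sigma, margin):
    """n = (z * sigma / margin)^2, rounded up."""
    return math.ceil((z * sigma / margin) ** 2)

base = dict(z=1.96, sigma=1.298, margin=0.25)
scenarios = {
    "baseline (95%, sigma 1.298 h, E 0.25 h)": base,
    "sigma doubled": {**base, "sigma": 2 * 1.298},
    "99% confidence (Z = 2.576)": {**base, "z": 2.576},
    "margin of error halved": {**base, "margin": 0.125},
}
for label, inputs in scenarios.items():
    print(f"{label}: n = {required_sample_size(**inputs)}")
```

Because σ and E enter the formula squared, doubling σ or halving E roughly quadruples the required n, while moving from 95% to 99% confidence raises it by a smaller factor.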
22. Additional notes about sample size II
- Sample size given by above formula indicates the
number of population elements actually required
in the study. If individuals or firms to be
surveyed are reluctant to participate, cannot be
found, or are unwilling to answer some questions,
expand the number of elements approached in the
hope that the n indicated can be obtained. For
example, if the formula indicates a required n =
500 and 25% nonresponse is expected, expand
sample size to 650 or 700.
- Sampling procedure affects required sample size.
Cluster samples might need to have larger n but
stratification of a sample might reduce the
required sample size. There are different
formulae for more complex sampling procedures.
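The nonresponse adjustment can be sketched as dividing the required n by the expected response rate. This is one simple way to arrive at a figure in the 650 to 700 range quoted above, not a formula given in the text; the function name is illustrative.

```python
import math

def elements_to_approach(required_n, expected_nonresponse):
    """Number to contact so that about required_n respond."""
    if not 0 <= expected_nonresponse < 1:
        raise ValueError("expected_nonresponse must be in [0, 1)")
    return math.ceil(required_n / (1 - expected_nonresponse))

print(elements_to_approach(500, 0.25))
```

With a required n of 500 and 25% nonresponse this gives 667, since 75% of 667 is just over 500.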
23. Weighting of sample elements
- This issue is not discussed in the text but is
one that needs consideration in much survey
sampling.
- Sampling procedure may be designed so each
element selected in the sample represents a
different number of population elements (e.g.
cluster, stratified, multistage sampling).
Research methodology should report the weighting
procedures to be used when conducting data
analysis. Statistics Canada often includes a
weight in the data set.
- Weighting may occur after data are obtained, to
estimate characteristics of the population. For
example, if males and females are about equal in
number in a population but a sample has fewer
males than females, data from males may be more
heavily weighted when analyzing and reporting
results.
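The male and female example can be sketched as a simple after-the-fact weighting, where each group's weight is its population share divided by its sample share. All numbers below (a sample of 40 males and 60 females, and the group means) are hypothetical values chosen for illustration, and the function names are not from the text.

```python
def group_weights(population_shares, sample_counts):
    """Weight per group = population share / sample share."""
    total = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / total)
            for g in sample_counts}

def weighted_mean(group_means, sample_counts, weights):
    """Mean of the sample after applying the group weights."""
    numerator = sum(weights[g] * sample_counts[g] * group_means[g]
                    for g in group_means)
    denominator = sum(weights[g] * sample_counts[g] for g in group_means)
    return numerator / denominator

# Hypothetical inputs: equal population shares, unequal sample counts.
population_shares = {"males": 0.5, "females": 0.5}
sample_counts = {"males": 40, "females": 60}
group_means = {"males": 3.0, "females": 2.0}  # hypothetical values

weights = group_weights(population_shares, sample_counts)
unweighted = sum(sample_counts[g] * group_means[g] for g in group_means) / 100
print(f"weights: {weights}")
print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean: {weighted_mean(group_means, sample_counts, weights):.2f}")
```

Males are under-represented in this hypothetical sample, so each male response receives a weight above 1 and the weighted mean moves toward the male group mean.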
24. Conclusion about interval estimates and sample
size
- Formulae are precise but approximations are often
used.
- Random sample?
- Standard deviation of population?
- Confidence level arbitrary.
- Nonresponse and other nonsampling errors.
- When data come from samples, there is usually
sampling error. Interval estimates and estimates
of sample size are necessary but remember above
cautions about their accuracy.
- Replication of studies, similar research on
related topics and comparable populations.
- Careful sample and research design and data
analysis.
25. Later on Wednesday or on Monday
- Normal approximation to the binomial (ASW, 6.3).
- Sampling distribution of the sample proportion
(ASW, 7.6).
- Interval estimate of a population proportion
(ASW, 8.4).
- Sample size for estimation of a population
proportion (ASW, 315-316).
- Review: Monday during class and Tuesday at
review session.