Large-Sample Estimation - PowerPoint PPT Presentation

1
Large-Sample Estimation
  • Stat 700 Lecture 09
  • 10/18-10/23

2
Overview of Lecture
  • The Problem of Statistical Inference
  • Methods of Inference
  • Estimation (Point and Interval)
  • Hypotheses Testing
  • Point Estimation of the Mean, Standard Deviation,
    and Proportion
  • Interval Estimation of the Mean and Proportion
  • Sample Size Determination
  • Estimation of the Difference of Means
  • Estimation of the Difference of Proportions

3
The Problem of Inference
  • What we now know!
  • Population: the collection of interest to us.
  • Population Models: provided by probability models
    such as the Bernoulli distribution, normal
    distribution, exponential distribution, etc.
  • (Population) Parameters: characteristics of the
    population/distributions. Examples are the mean
    μ, the standard deviation σ, and the (population)
    proportion p. Others are the (population) median
    and the population quartiles.
  • Goal: to know these parameters to make decisions.

4
Inference Problem continued
  • We also know how to
  • take a sample from a population (by surveys or
    designed experiments), and
  • compute sample statistics, which are
    characteristics of the sample. For example, we
    can compute the sample mean (x̄), sample standard
    deviation (S), and the sample proportion (p̂).
  • Goal: to use these sample statistics to infer
    about the population parameters.

5
Inference Problem continued
  • Furthermore, we also know how
  • sample statistics behave in a probabilistic way,
    when we consider the experiment of taking a
    sample from a population, by looking at the
    statistics' sampling distributions. In
    particular, we know the mean of a sample
    statistic as well as its variability as measured
    by its standard error.
  • A key thing to realize is that a sample statistic
    will usually not coincide exactly with the
    associated parameter, but its values will tend to
    cluster around the value of the parameter,
    especially when the sample size is large enough!

6
Inference Problem 1: Estimation
  • The basic questions when dealing with estimation
    problems are
  • Based on the sample data, what is the value of
    the parameter of interest? This is the problem
    of point estimation.
  • or
  • Based on the sample data, what is an interval of
    values in which we will have a pre-specified
    confidence that the value of the parameter
    belongs to this interval? This is the problem of
    interval estimation, or the construction of a
    confidence interval.

7
Inference Problem 2: Hypotheses Testing
  • When dealing with hypotheses testing, on the
    other hand, our aim is to determine, based on the
    sample data, which of two complementary
    propositions, called statistical hypotheses,
    about the parameter of interest is true.
  • In hypotheses testing, we are not really
    interested in knowing the exact value of the
    parameter; rather, we are simply interested in
    deciding between competing claims about the
    parameter based on the sample data.

8
An Illustration
  • Situation: The population of interest is the
    collection of all American households and their
    annual out-of-pocket medical expenses. Suppose
    that we would like to determine the proportion,
    p, of American households which incur at least
    $1000 of out-of-pocket medical expenses during
    the year. This p is the parameter of interest.
  • Why is this parameter, p, relevant in public
    policy?
  • Except for the fact that p is between 0 and 1, we
    do not know its exact value.

9
Illustration continued
  • Study: We take an SRS of n = 2000 American
    households, and determine for each household
    their annual out-of-pocket medical expenses.
    Suppose that out of these 2000 households, 114
    incurred out-of-pocket medical expenses of at
    least $1000, so p̂ = 114/2000 = .057.
  • Problem of Estimation: Based on the sample data,
    what is the value of p? Or, what is an interval
    [L, U] such that we will be 95% confident that p
    is in this interval?
  • Problem of Hypotheses Testing: Based on the
    sample data, which of the following statements is
    true: p is less than 0.05, or p is at least 0.05?

10
Point Estimation
  • For our discussion, we shall let θ denote a
    generic population parameter, so it could be the
    mean μ, the variance σ², the standard deviation
    σ, or the proportion p.
  • A point estimator (denoted by θ̂) of a parameter
    θ is a procedure, a rule, or a formula for
    obtaining a value from the sample data which will
    serve as an estimate of θ. As such, a point
    estimator is a sample statistic.
  • When the data have been obtained, the realized
    value of a point estimator is called a point
    estimate.

11
Examples of Point Estimators
  • Example 1: For estimating the population mean μ,
    possible point estimators are
  • Estimator 1: Sample Mean
  • Estimator 2: Sample Median
  • Estimator 3: Sample Midrange, which is the
    average of the smallest and largest observations
  • Estimator 4: (Sum of Observations + 1)/(n + 2)
  • Question: Which among these four possible point
    estimators should we use?

12
Examples continued
  • Example 2: For estimating the population
    proportion p, a point estimator is the sample
    proportion, p̂, which is the proportion of
    successes in the sample.
  • Example 3: For estimating the population
    variance σ², a possible point estimator is the
    sample variance S². This is the variance formula
    with divisor (n − 1). However, another possible
    estimator of σ² is the same formula with divisor
    n instead.

13
Comparing Competing Estimators
  • Suppose there are several possible estimators of
    a parameter (for example, in estimating the
    population mean, there could be several candidate
    estimators). How do we decide which estimator to
    use?
  • What are the desirable or good properties that we
    want from our estimators?
  • How do we know which estimator will have the
    desirable properties?

14
Desirable Properties of Estimators
  • Ideally, an estimator should always give the
    exact value of the parameter, whatever that value
    is. But this will never be satisfied in reality!
  • Property of Unbiasedness: On average, the
    estimator should equal the parameter being
    estimated. Formally, this means that the mean of
    the sampling distribution of the estimator
    (recall that an estimator is a sample statistic,
    so it has a sampling distribution) should equal
    the value of the parameter it is estimating,
    whatever the value of the parameter is.

15
Desirable Properties continued
  • For example, since from our study of the sampling
    distribution of the sample mean we found that the
    mean of the sample mean equals the population
    mean, the sample mean is unbiased for the
    population mean.
  • The sample proportion is also unbiased for the
    population proportion.
  • The sample variance S² is also unbiased for the
    population variance σ². This is the reason for
    dividing by (n − 1) in the formula.
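
The unbiasedness of S² (and the downward bias of the divisor-n version) can be checked with a small simulation. This is an illustrative sketch, not part of the slides; it uses the small discrete population that appears later in the lecture, p(2) = .4, p(4) = .5, p(5) = .1, whose true variance is σ² = 1.21, and an arbitrary seed:

```python
import random
import statistics

# Illustrative sketch: check by simulation that S^2 (divisor n - 1) is about
# unbiased for sigma^2, while the divisor-n version underestimates it on average.
# Population: p(2) = .4, p(4) = .5, p(5) = .1, so sigma^2 = 1.21.
random.seed(1)
values, weights = [2, 4, 5], [0.4, 0.5, 0.1]
true_var = 1.21
n, reps = 10, 20000

sum_unbiased = 0.0
sum_biased = 0.0
for _ in range(reps):
    sample = random.choices(values, weights=weights, k=n)
    sum_unbiased += statistics.variance(sample)    # divides by n - 1
    sum_biased += statistics.pvariance(sample)     # divides by n

avg_unbiased = sum_unbiased / reps   # close to 1.21
avg_biased = sum_biased / reps       # systematically below 1.21
print(avg_unbiased, avg_biased)
```

The divisor-n average comes out smaller by the factor (n − 1)/n, which is exactly the bias the (n − 1) divisor removes.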

16
Desirable Properties continued
  • Property of Small Variation: this is the
    property of an estimator being precise, in the
    sense that its variability is small. In
    practical terms, we want the values of the
    estimator to be closely clustered around what it
    is trying to estimate.
  • The variability of an estimator is measured by
    the standard deviation of its sampling
    distribution, which we call the standard error.
    The smaller the standard error is, the more
    desirable the estimator, provided that it is
    unbiased.

17
Margin of Error (ME) of an Estimator
  • When reporting a point estimate, we also report a
    measure of its variability, and this measure of
    variability is usually reported as the margin of
    error (ME) of the estimate, which is equal to
    1.96 times its standard error. That is,
  • ME = (1.96) × (Standard Error of the Estimator)

18
Interpretation of the Margin of Error
  • The reason for this definition of the margin of
    error is that the sampling distribution of the
    estimator will usually be approximately normal
    (by the central limit theorem) with mean equal to
    the value of the parameter being estimated; hence
    the interval from
  • (Parameter Value) − 1.96(Std. Error) to
  • (Parameter Value) + 1.96(Std. Error)
  • will contain approximately 95% of all the
    possible values of the estimator. Therefore,
    approximately 95% of the time, the point estimate
    will not differ by more than one ME from the true
    parameter value.
  • But why 95%? It is the convention handed to us!
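
The "95% of the time within one ME" statement can be demonstrated with a quick simulation. This is an illustrative sketch (not from the slides); it draws sample means from the small discrete population used later in the lecture, with an arbitrary seed:

```python
import math
import random

# Illustrative sketch: how often does the sample mean land within one
# ME = 1.96 * (sigma / sqrt(n)) of the true mean?  About 95% of the time.
random.seed(0)
values, weights = [2, 4, 5], [0.4, 0.5, 0.1]   # population with mu = 3.3, sigma = 1.1
mu, sigma = 3.3, 1.1
n, reps = 30, 10000
me = 1.96 * sigma / math.sqrt(n)   # margin of error of the sample mean

hits = 0
for _ in range(reps):
    sample = random.choices(values, weights=weights, k=n)
    xbar = sum(sample) / n
    if abs(xbar - mu) <= me:
        hits += 1

coverage = hits / reps
print(coverage)   # approximately 0.95
```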

19
Illustration of Comparison of Estimators
  • To see in a concrete way how estimators are
    compared, consider the estimation of the
    population mean in the population considered in
    the discussion of sampling distributions. This
    population has
  • p(2) = .4, p(4) = .5, p(5) = .1
  • Population Mean: μ = 3.3
  • Population Standard Deviation: σ = 1.1
  • We compare the four estimators of the mean
    mentioned earlier:
  • Sample Mean, Sample Median, Sample Midrange, and
    ((Sum of Xs) + 1)/(n + 2).

20
Comparison continued
  • Our comparison will be based on samples of size
    n = 10. A theoretical comparison is not easy, so
    we rely on a Monte Carlo simulation.
  • We generate 500 samples of size n = 10 from the
    population and, for each sample, compute the
    estimate based on each of the 4 estimators.
  • We then look at the simulated sampling
    distributions of the 4 estimators to see which
    estimators are unbiased and compare their
    variability.
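
The Monte Carlo study described above can be sketched as follows (the seed is arbitrary; `random.choices` draws each sample with replacement from the discrete population):

```python
import random
import statistics

# Sketch of the simulation: 500 samples of size n = 10 from the population
# p(2) = .4, p(4) = .5, p(5) = .1 (true mean 3.3), with the four estimators
# of the mean computed for each sample.
random.seed(42)
values, weights = [2, 4, 5], [0.4, 0.5, 0.1]
n, reps, mu = 10, 500, 3.3

means, medians, midranges, est4s = [], [], [], []
for _ in range(reps):
    s = random.choices(values, weights=weights, k=n)
    means.append(sum(s) / n)                  # Estimator 1: sample mean
    medians.append(statistics.median(s))      # Estimator 2: sample median
    midranges.append((min(s) + max(s)) / 2)   # Estimator 3: sample midrange
    est4s.append((sum(s) + 1) / (n + 2))      # Estimator 4: (sum + 1)/(n + 2)

avg_mean = sum(means) / reps
avg_median = sum(medians) / reps
avg_midrange = sum(midranges) / reps
avg_est4 = sum(est4s) / reps
print(avg_mean, avg_median, avg_midrange, avg_est4)
```

Comparing the four averages against μ = 3.3 (and the spreads of the four lists) reproduces the conclusions on the following slides: the sample mean is essentially unbiased, while the median and Estimator 4 are clearly biased.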

21
First 10 Samples from the Simulation
  • For sample 1: Sample Midrange = (2 + 5)/2 = 3.5,
    while Estimate 4 = (31 + 1)/(10 + 2) = 32/12 ≈
    2.6667.
  • Sample Mean and Sample Median are computed the
    usual way.

22
BoxPlots of the Simulated Sampling Distributions
Recall: Target is μ = 3.3
23
Histograms of the Simulated Sampling
Distributions Using Same Scales
24
Parameters of the Simulated Sampling
Distributions and Comparisons
  • The sample mean is closest to being unbiased.
    Next is the sample midrange, although it is still
    biased.
  • The Sample Median and Estimator 4 are very biased.
  • The sample median is very variable, or imprecise.
  • The sample mean is best, though the midrange is
    also good.

25
Point Estimation of the Mean μ
  • When the population of interest is normal with
    (unknown) mean μ and standard deviation σ, then,
    based on theoretical analysis, the best estimator
    of μ is the sample mean x̄. The margin of error
    is
  • ME = (1.96)(σ/√n).
  • If σ is not known, then the margin of error can
    be reported as
  • ME = (1.96)(S/√n),
  • where S is the sample standard deviation.

26
Point Estimation of Mean ...
  • When the population is not normal and the sample
    size is large, the sample mean need not be the
    best estimator anymore, but it is still unbiased
    for the population mean, and has decent
    variability.
  • For example, when the population is Uniform, the
    population mean is best estimated by the Sample
    Midrange instead of the Sample Mean.
  • However, for our purposes, we will simply use the
    Sample Mean as the estimator of the population
    mean, and its margin of error will be (assuming σ
    is not known)
  • ME = (1.96)(S/√n).

27
Point Estimation of the Population Proportion, p
  • When the population is Bernoulli, so the
    parameter of interest is p, the proportion of
    "Successes" in the population, then the best
    estimator of p is the sample proportion p̂.
  • When np > 5 and n(1 − p) > 5, its margin of error
    is estimated by
  • ME = (1.96)√(p̂(1 − p̂)/n).

28
An Example
  • Situation: Suppose we want to estimate the mean
    systolic blood pressure for the population of
    1910 people in the blood pressure data set.
  • Sample: We take a sample of size n = 30 from the
    population, and the sample data are
  • 100, 110, 118, 134, ., 92, 104, 100, 110, 130,
    110, 132, 102, 128, 88, 135, 140, 90, 108, 112,
    100, 130, 136, 124, 150, 138, 130, 104, 114, 110
  • The one dot indicates a missing value in the
    data, so n = 29 in this case.

29
Example continued
  • Sample Statistics: x̄ = 116.52, S = 16.76
  • Therefore, the point estimate for μ is
  • x̄ = 116.52
  • with margin of error
  • ME = (1.96)(16.76)/√29 ≈ 6.10.
  • Interpretation: We are 95% confident that the
    true mean systolic blood pressure for the
    population is therefore between
  • [116.52 − 6.10, 116.52 + 6.10] = [110.42, 122.62]
  • Indeed, the true value of μ is 114.59. (On
    target!!)
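
The computations above can be reproduced from the readings listed on the previous slide (29 values, with the missing one dropped):

```python
import math

# The 29 observed systolic readings from the slide (missing value omitted).
data = [100, 110, 118, 134, 92, 104, 100, 110, 130, 110, 132, 102, 128, 88,
        135, 140, 90, 108, 112, 100, 130, 136, 124, 150, 138, 130, 104, 114, 110]

n = len(data)                     # 29
xbar = sum(data) / n              # sample mean
# Sample standard deviation, with divisor (n - 1).
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
me = 1.96 * s / math.sqrt(n)      # margin of error

print(round(xbar, 2), round(s, 2), round(me, 2))   # 116.52 16.76 6.1
```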

30
Example Freshly-Brewed vs Instant
  • Example: A matched pairs experiment was performed
    to compare the taste of instant versus
    fresh-brewed coffee. Each subject tastes two
    unmarked cups of coffee, one of each type, in
    random order, and states which he/she prefers. Of
    the 50 subjects who participated, 19 prefer the
    instant coffee. Let p be the probability that a
    randomly chosen subject prefers freshly brewed
    coffee over instant coffee; that is, p is the
    proportion in the population who prefer
    freshly-brewed coffee.
  • Based on the given information, provide a point
    estimate for p.

31
Example continued
  • Based on the sample data, there are 31 out of the
    50 who preferred freshly-brewed coffee, so the
    sample proportion is p̂ = 31/50 = .62. This is
    our point estimate of p.
  • We report this by also providing an estimate of
    its margin of error, which is
  • ME = (1.96)√((.62)(1 − .62)/50) ≈ .13.
  • Based on this information, we are 95% confident
    that the true p is between .62 − .13 = .49 and
    .62 + .13 = .75. Because this interval still
    includes .5, it is not possible to conclude that
    more than 50% prefer freshly-brewed coffee over
    instant coffee.
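
A minimal sketch of the computation for the coffee example:

```python
import math

# Point estimate and margin of error for the coffee-preference proportion.
n, successes = 50, 31            # 31 of 50 preferred freshly-brewed coffee
phat = successes / n             # sample proportion
me = 1.96 * math.sqrt(phat * (1 - phat) / n)   # ME = 1.96 * sqrt(p(1-p)/n)

lower, upper = phat - me, phat + me
print(round(phat, 2), round(me, 2))   # 0.62 0.13
```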

32
Interval Estimation of the Mean, μ
  • Consider a population or distribution with
    unknown mean μ and standard deviation σ. We take
    a sample from this population of size n, where n
    is large (at least 30).
  • Let α be a number between 0 and 1. A 100(1 − α)%
    interval estimator of μ is a random interval
    [L, U], where L and U are computed from the
    sample data, such that the probability that the
    interval [L, U] covers the mean μ equals (1 − α).
    That is,
  • P[L < μ < U] = 1 − α.

33
Derivation of the Interval Estimator
  • Let z_α be such that P[Z > z_α] = α, where Z is
    the standard normal variable.
  • Therefore, P[−z_(α/2) < Z < z_(α/2)] = 1 − α.
  • By virtue of the Central Limit Theorem, x̄ is
    approximately normal with mean μ and standard
    deviation (standard error) σ/√n. Therefore,
  • P[x̄ − z_(α/2)(σ/√n) < μ < x̄ + z_(α/2)(σ/√n)]
    ≈ 1 − α.

34
Continued ...
  • Based on this equation, we therefore obtain the
    large-sample 100(1 − α)% interval estimator of
    the population mean μ to be
  • [x̄ − z_(α/2)(σ/√n), x̄ + z_(α/2)(σ/√n)].

35
Some Comments
  • The interval estimator in the preceding slide
    assumes that the population standard deviation is
    known. In many situations, however, this will
    not be the case.
  • If σ is not known, then we replace it by S, the
    sample standard deviation, in the computation of
    the lower and upper bounds.
  • Terminology: After the sample data have been
    gathered, we can calculate the lower and upper
    bounds of the interval. This realized interval
    is called a 100(1 − α)% confidence interval for
    μ.

36
Interpretation of a Confidence Interval
  • Based on our derivation of the interval
    estimator, 100(1 − α)% of all the possible
    samples of size n will produce interval estimates
    that contain the true mean μ, while the remaining
    100α% will produce intervals that do not include
    the true mean μ. Consequently, for the particular
    confidence interval that we obtained, we
    associate a 100(1 − α)% confidence that it
    includes the true value of μ.

37
Relationships
  • With σ and α remaining constant, if n is
    increased, then the length of the interval will
    decrease, which is desirable.
  • With σ and n remaining fixed, increasing the
    confidence coefficient (1 − α) will lead to an
    increase in the length of the interval.
  • With α and n remaining fixed, we could decrease
    the length of the interval by decreasing σ. This
    could be done, for instance, by improving the
    measurement process.

38
Example
  • Situation: An experiment was conducted to
    estimate the effect of smoking on the blood
    pressure of a group of 34 college-age cigarette
    smokers. The difference for each participant was
    obtained by taking the difference in the blood
    pressure readings at the time of graduation and
    five years later. The sample mean increase in
    blood pressure was 9.7 millimeters of mercury,
    with a sample standard deviation of 5.8.
  • Question: Obtain a 95% confidence interval for
    the mean μ, which is the mean increase in the
    blood pressure reading among all college-age
    cigarette smokers.

39
The Confidence Interval
  • Since n = 34 > 30, the standard error is
    (5.8)/√34 ≈ .9947.
  • For a confidence coefficient of 95%, z_.025 =
    1.96.
  • Therefore the appropriate margin of error becomes
    (1.96)(.9947) ≈ 1.95.
  • The 95% confidence interval is therefore
  • [9.7 − 1.95, 9.7 + 1.95] = [7.75, 11.65].
  • Interpretation: We are 95% confident that this
    interval contains the true value of μ.

40
Decreasing the Confidence Coefficient
  • If instead we decrease the confidence coefficient
    to 90%, so α = 0.10, then z_.05 = 1.645.
  • Therefore, the appropriate margin of error is
    (1.645)(.9947) ≈ 1.64.
  • The 90% confidence interval therefore becomes
  • [9.7 − 1.64, 9.7 + 1.64] = [8.06, 11.34].
  • Notice that this interval is shorter than the 95%
    confidence interval, but then we are less
    confident that it contains the true mean μ.
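
Both intervals can be computed with one small helper (`mean_ci` is a hypothetical name; the z values are the ones on the slides):

```python
import math

# Large-sample z interval for the mean: xbar ± z * s / sqrt(n).
def mean_ci(xbar, s, n, z):
    """Return (lower, upper) for the interval xbar ± z * s / sqrt(n)."""
    me = z * s / math.sqrt(n)
    return xbar - me, xbar + me

ci95 = mean_ci(9.7, 5.8, 34, 1.96)    # z_.025 = 1.96
ci90 = mean_ci(9.7, 5.8, 34, 1.645)   # z_.05 = 1.645
print([round(v, 2) for v in ci95])    # [7.75, 11.65]
print([round(v, 2) for v in ci90])    # [8.06, 11.34]
```

The 90% interval is shorter, matching the trade-off described on this slide.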

41
Sample Size Determination
  • Suppose we want to determine the appropriate
    sample size such that the margin of error for the
    100(1 − α)% confidence interval is at most B,
    where B is a pre-specified upper bound. Then we
    must have
  • z_(α/2)σ/√n ≤ B, so solving for n, we obtain the
    formula for the minimum sample size needed to
    satisfy the desired condition to be
  • n ≥ (z_(α/2)σ/B)².

42
When the Population Standard Deviation is Not
Known
  • This sample size formula requires the standard
    deviation σ. If σ is not known, we could do one
    of the following:
  • Perform a small pilot study to obtain an estimate
    of σ, and use the resulting estimate in the
    formula.
  • Use a historical value of σ, if one is available.
  • Use an upper bound for the value of σ, that is,
    use the largest possible value that σ could have
    in the situation of interest. This will provide
    a conservative (safe) value for the sample size
    n.
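
The sample-size rule can be sketched as follows (the function name and the illustrative inputs σ = 5.8 and B = 1 are assumptions for the example, not from the slides; n is rounded up since it must be an integer):

```python
import math

# Minimum n so that the margin of error z * sigma / sqrt(n) is at most B,
# i.e. n >= (z * sigma / B)^2.
def sample_size_mean(sigma, B, z=1.96):
    """Smallest integer n with z * sigma / sqrt(n) <= B."""
    return math.ceil((z * sigma / B) ** 2)

# Hypothetical illustration: sigma = 5.8 (say, from a pilot study), B = 1.
print(sample_size_mean(5.8, 1.0))   # 130
```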

43
Confidence Interval for Proportion
  • The 100(1 − α)% confidence interval for the
    population proportion, when n > 30, is derived
    similarly and is of the form
  • p̂ ± z_(α/2)√(p̂(1 − p̂)/n).

44
Determining the Sample Size when Constructing a CI
for the Proportion, p
  • If one wants the 100(1 − α)% confidence interval
    for p to have a margin of error of at most B,
    then the appropriate formula becomes
  • n ≥ p(1 − p)(z_(α/2)/B)².

45
Continued ...
  • However, this formula requires the value of p,
    which is what we are trying to determine. Two
    routes to circumvent this problem are:
  • Use a prior estimate of p, that is, some
    historical or previous value of p.
  • Use the value of p such that p(1 − p) is largest.
    This occurs when p = 1/2, so p(1 − p) = 1/4.
    Using this procedure, the sample size formula
    becomes

46
Conservative Formula for Determining the Sample
Size when Constructing a CI for the Proportion, p
  • n ≥ (z_(α/2))²/(4B²).
47
Example
  • Suppose that interest is in obtaining a 95%
    confidence interval for the proportion p which
    represents the proportion of Americans without
    health insurance. What would be the appropriate
    sample size in order that the margin of error of
    the interval is at most 0.03?
  • In this case, B = .03 and α = 0.05. Therefore,
    z_.025 = 1.96. Furthermore, since we do not have
    any idea about what p might be, we use the
    conservative formula to obtain
  • n ≥ (1.96)²/((4)(0.03)²) ≈ 1067.1.
  • Thus, at least 1068 people should be sampled.
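
The conservative computation can be sketched as follows (rounding up, since the sample size must be an integer):

```python
import math

# Conservative sample size for a proportion: n >= z^2 / (4 * B^2),
# using the worst case p(1 - p) <= 1/4.
def sample_size_prop_conservative(B, z=1.96):
    """Smallest integer n with n >= z^2 / (4 * B^2)."""
    return math.ceil(z ** 2 / (4 * B ** 2))

print(sample_size_prop_conservative(0.03))   # 1068, since 1.96^2/(4 * 0.03^2) ≈ 1067.1
```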

48
Two-Sample Problems
  • Consider now the situation where we have two
    populations. Population 1 has mean μ₁ and
    standard deviation σ₁, and population 2 has mean
    μ₂ and standard deviation σ₂.
  • Our objective is to construct a confidence
    interval for the difference μ₁ − μ₂. This
    interval is to be constructed from a sample of
    size n₁ from population 1 and a sample of size
    n₂ from population 2, with the samples being
    independent of each other.
  • For each sample we obtain the sample mean and
    standard deviation.

49
Available Data for Two-Sample Problems
  • The sample data can therefore be summarized into
    a table of the form:

    Population   Sample Size   Sample Mean   Sample Std. Dev.
    1            n₁            x̄₁            S₁
    2            n₂            x̄₂            S₂

50
Confidence Interval for the Difference of Two
Means
  • For this two-sample problem, when the sample
    sizes are both at least 30, the 100(1 − α)%
    confidence interval for μ₁ − μ₂ is given by
  • (x̄₁ − x̄₂) ± z_(α/2)√(S₁²/n₁ + S₂²/n₂).

51
Example On Obesity
  • Situation: An experiment was conducted to compare
    two diets, A and B, designed for weight
    reduction. Two groups of 30 overweight dieters
    each were randomly selected. One group was placed
    on diet A and the other on diet B, and their
    weight losses were recorded over a 30-day period.
    The means and standard deviations of the
    weight-loss measurements for the two groups are
    given in the table below.

    Diet   Sample Size   Mean Loss   Std. Dev.
    A      30            21.3        2.6
    B      30            13.4        1.9

52
99% Confidence Interval for the Difference of the
Means
  • For a 99% confidence interval, we have z_.005 =
    2.575.
  • The estimate of the standard error becomes
  • √((2.6)²/30 + (1.9)²/30) = √.3457 ≈ .5879.
  • The appropriate margin of error is therefore
  • (2.575)(.5879) ≈ 1.51.
  • The difference of sample means is 21.3 − 13.4 =
    7.9.
  • The 99% CI for the difference of the population
    means becomes [7.9 − 1.51, 7.9 + 1.51] = [6.39,
    9.41].
  • Since this interval does not contain 0, diet A is
    more effective in reducing weight.
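
A sketch of the two-sample interval computation (`diff_means_ci` is a hypothetical helper name):

```python
import math

# Large-sample CI for mu1 - mu2:
# (xbar1 - xbar2) ± z * sqrt(s1^2/n1 + s2^2/n2).
def diff_means_ci(xbar1, s1, n1, xbar2, s2, n2, z):
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # estimated standard error
    me = z * se                                   # margin of error
    d = xbar1 - xbar2
    return d - me, d + me

# Diet A vs diet B, 99% confidence (z_.005 = 2.575).
lower, upper = diff_means_ci(21.3, 2.6, 30, 13.4, 1.9, 30, 2.575)
print(round(lower, 2), round(upper, 2))   # 6.39 9.41
```

Since the whole interval lies above 0, the data favor diet A, as the slide concludes.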