Title: Computation Details Confidence Intervals for the Center of Data
- When interest is in the median (the typical value), use either a nonparametric (NP) interval or a parametric interval.
- An NP interval estimate for the true population median is computed using the cumulative probabilities of the binomial distribution.
- 1. The desired α is stated: the acceptable risk of not including the true median.
- 2. One-half of this risk (α/2) is assigned to each end of the interval.
- 3. Table A, the cdf of the binomial distribution with parameters n and p = 0.5, provides the lower and upper critical values x' and x at one-half the desired α level. This table is identical to the one used for the Sign Test, discussed later.
- 4. These critical values are transformed to ranks $R_l$ and $R_u$ corresponding to the data points $C_l$ and $C_u$ at the ends of the confidence interval:
- $R_l = x' + 1$
- $R_u = n - x' = x$
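The rank calculation in steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration, not the routine Minitab uses: the helper names `binom_cdf` and `median_ci_ranks` are hypothetical, the cumulative binomial probabilities are computed directly instead of being read from Table A, and the 13-value sample is used only for illustration.

```python
from math import comb

def binom_cdf(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def median_ci_ranks(n, alpha=0.05):
    """Return (R_l, R_u, achieved confidence) for the NP interval on the median."""
    # x' is the largest count whose cumulative probability does not exceed alpha/2.
    x_lower = -1
    while binom_cdf(x_lower + 1, n) <= alpha / 2:
        x_lower += 1
    if x_lower < 0:
        raise ValueError("sample too small for this alpha")
    r_l = x_lower + 1          # R_l = x' + 1
    r_u = n - x_lower          # R_u = n - x'
    achieved = 1 - 2 * binom_cdf(x_lower, n)
    return r_l, r_u, achieved

# Illustrative sample (assumed for this sketch).
data = sorted([19.2, 16.2, 10.7, 16.6, 3.6, 18.1, 8.6,
               15.3, 14.0, 14.2, 16.9, 13.4, 5.7])
r_l, r_u, conf = median_ci_ranks(len(data), alpha=0.05)
# Ranks are 1-based, so subtract 1 when indexing the sorted list.
print(r_l, r_u, data[r_l - 1], data[r_u - 1], round(conf, 4))
```

The achieved confidence is usually somewhat higher than the nominal 1 - α because the binomial distribution is discrete, which is why the NP interval cannot always hit the desired level exactly for small samples.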
- The resulting confidence interval reflects the shape (skewed or symmetric) of the original data.
- An NP interval cannot always produce exactly the desired confidence level when the sample size is small.
- Minitab (SINTERVAL) uses nonlinear interpolation to obtain exact (1 - α) levels.
- For n > 20, a normal approximation can be used to compute the intervals.
- Computed ranks are rounded to the nearest integer when necessary.
Parametric Interval Estimate for the Median
- As mentioned in the first lecture, the geometric mean of X (GM_X) is an estimate of the median of X when y = ln(X) is normal or fairly symmetric.
- The mean of y and the confidence interval on the mean of y become the geometric mean and its (asymmetric) confidence interval after being transformed back to the original units by exponentiation.
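A minimal Python sketch of this back-transformation is given below. The function name `geometric_mean_ci`, the five data values, and the choice to pass the t critical value in by hand (2.776 for 4 degrees of freedom at 95% confidence, read from a t table) are all assumptions made for illustration.

```python
import math
import statistics

def geometric_mean_ci(x, t_crit):
    """CI for the median of X via the mean of y = ln(X), back-transformed."""
    y = [math.log(v) for v in x]
    n = len(y)
    y_bar = statistics.mean(y)
    s_y = statistics.stdev(y)                 # sample standard deviation of the logs
    half = t_crit * s_y / math.sqrt(n)        # half-width of the CI on the mean of y
    return math.exp(y_bar - half), math.exp(y_bar), math.exp(y_bar + half)

# Hypothetical positive, right-skewed data.
x = [2.0, 4.0, 8.0, 16.0, 32.0]
lower, gm, upper = geometric_mean_ci(x, t_crit=2.776)
print(lower, gm, upper)
```

The interval is symmetric in log units but asymmetric in the original units: the upper limit lies farther from the geometric mean than the lower limit does.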
Confidence Intervals for the Mean
- Intervals may also be computed for the true population mean μ. These are appropriate if the center of mass of the data is the statistic of interest.
- Symmetric intervals around the sample mean are computed most often. For large sample sizes, a symmetric interval adequately describes the variation of the mean regardless of the shape of the data distribution.
- Note: the C.I. on the geometric mean is not an interval estimate of the mean.
Prediction Intervals to Evaluate a Future New Observation
- The question is often asked whether a new observation is likely to have come from the same distribution as previously collected data.
- This can be evaluated by determining whether the new observation falls outside the prediction interval computed from the existing data.
- E.g., calculate a prediction interval from background data (or data from the same well), then check new compliance data (or a new observation from the same well) against the prediction interval.
- PIs are wider than CIs because an individual observation is more variable than a summary statistic.
- NP prediction interval: valid for all data.
- Symmetric prediction interval: valid only for symmetric data.
- Asymmetric prediction interval: valid only when the logs are symmetric.
- For a normal population, a table is available for the prediction interval of k new observations.
Two-Sided NP PI
- The NP PI of confidence level (1 - α) is simply the interval between the α/2 and (1 - α/2) percentiles of the distribution.
- The interval contains 100(1 - α) percent of the data.
- Therefore, if the new observation comes from the same distribution as the previously measured data, there is a 100α percent chance that it will lie outside the prediction interval.
- PI = $X_{(\alpha/2)(n+1)}$ to $X_{(1-\alpha/2)(n+1)}$, where the subscript is the rank among the n ordered observations.
One-Sided NP PI
- One-sided PIs are appropriate if the interest is in whether a new observation is larger than the existing data, or smaller than the existing data, but not both.
- One-sided PI:
- $\text{New} < X_{\alpha(n+1)}$ or $\text{New} > X_{(1-\alpha)(n+1)}$ (one or the other, but not both).
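The NP prediction limits above are order statistics at ranks that are multiples of (n + 1). A minimal Python sketch follows; the function names, the `side` argument, and the use of linear interpolation between adjacent data values for non-integer ranks are assumptions of this sketch, and the 19 data values are hypothetical.

```python
import math

def order_stat(sorted_x, rank):
    """Value at a 1-based, possibly fractional rank (linear interpolation assumed)."""
    n = len(sorted_x)
    rank = round(rank, 10)                    # guard against floating-point noise
    if rank < 1 or rank > n:
        raise ValueError("rank outside 1..n: more data or a larger alpha is needed")
    lo = math.floor(rank)
    frac = rank - lo
    if frac == 0:
        return sorted_x[lo - 1]
    return sorted_x[lo - 1] + frac * (sorted_x[lo] - sorted_x[lo - 1])

def np_prediction_interval(x, alpha=0.10, side="two"):
    """Nonparametric PI; side is 'two', 'upper', or 'lower'."""
    xs = sorted(x)
    n = len(xs)
    if side == "two":
        return (order_stat(xs, (alpha / 2) * (n + 1)),
                order_stat(xs, (1 - alpha / 2) * (n + 1)))
    if side == "upper":                       # is the new value larger than existing data?
        return (None, order_stat(xs, (1 - alpha) * (n + 1)))
    if side == "lower":                       # is the new value smaller than existing data?
        return (order_stat(xs, alpha * (n + 1)), None)
    raise ValueError("side must be 'two', 'upper', or 'lower'")

# Hypothetical sample of n = 19 values, so that (n + 1) = 20.
sample = [float(v) for v in range(1, 20)]
two_sided = np_prediction_interval(sample, alpha=0.10, side="two")
upper_only = np_prediction_interval(sample, alpha=0.10, side="upper")
print(two_sided, upper_only)
```

With small samples the required rank can fall below 1 or above n, in which case no NP interval at that α exists; the sketch raises an error rather than extrapolating.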
Parametric PI - Symmetric PI
- Assumes the data follow a normal distribution. PIs are symmetric around the sample mean and wider than the CI on the mean.
- The equation for the PI differs from that for the CI by adding the term $s^2$ under the square root, where s is the standard deviation of individual observations around their mean.
- PI = $\bar{X} - t_{(\alpha/2,\,n-1)}\sqrt{s^2 + s^2/n}$ to $\bar{X} + t_{(\alpha/2,\,n-1)}\sqrt{s^2 + s^2/n}$
- The PI and the CI differ in length by a factor that depends only on n: the PI half-width $t\,s\sqrt{1 + 1/n}$ is $\sqrt{n+1}$ times the CI half-width $t\,s/\sqrt{n}$.
- Therefore the PI can be computed from the CI and the sample size n. This is useful when using Minitab or other software that does not have a routine to compute the PI.
- One-sided PIs are computed as before, using α rather than α/2 and comparing new data to only one end of the PI.
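Because the symmetric PI and the CI on the mean share the same center, t value, and s, the PI half-width equals the CI half-width multiplied by sqrt(n + 1). The short Python sketch below applies that relationship; the function name `pi_from_ci` and the numbers in the example are hypothetical, and the result is valid only for a two-sided CI on the mean of (assumed) normal data at the same confidence level.

```python
import math

def pi_from_ci(ci_low, ci_high, n):
    """Symmetric PI for one new observation, from a two-sided CI on the mean."""
    center = (ci_low + ci_high) / 2           # the sample mean
    half_ci = (ci_high - ci_low) / 2          # equals t * s / sqrt(n)
    half_pi = half_ci * math.sqrt(n + 1)      # equals t * s * sqrt(1 + 1/n)
    return center - half_pi, center + half_pi

# Hypothetical CI of 9 to 11 from n = 3 observations.
pi_low, pi_high = pi_from_ci(9.0, 11.0, n=3)
print(pi_low, pi_high)
```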
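Asymmetric PI
- This is based on the logs of the skewed data.
- PI = $\exp\left(\bar{y} - t_{(\alpha/2,\,n-1)}\sqrt{s_y^2 + s_y^2/n}\right)$ to $\exp\left(\bar{y} + t_{(\alpha/2,\,n-1)}\sqrt{s_y^2 + s_y^2/n}\right)$
- where y = ln(X), and $\bar{y}$ and $s_y$ are the mean and standard deviation of the logs.
- Use parametric intervals only when the data are normal or lognormal.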
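As a rough illustration of the asymmetric PI, the Python sketch below computes the symmetric PI on y = ln(X) and exponentiates its end points. The function name `lognormal_pi`, the sample values, and the hand-supplied t critical value (2.776 for 4 degrees of freedom, two-sided 95%) are assumptions for this example.

```python
import math
import statistics

def lognormal_pi(x, t_crit):
    """Asymmetric PI for one new observation, assuming ln(X) is normal."""
    y = [math.log(v) for v in x]
    n = len(y)
    y_bar = statistics.mean(y)
    s2 = statistics.variance(y)               # sample variance of the logs
    half = t_crit * math.sqrt(s2 + s2 / n)    # PI half-width in log units
    return math.exp(y_bar - half), math.exp(y_bar + half)

# Hypothetical skewed data.
x_skewed = [2.0, 4.0, 8.0, 16.0, 32.0]
pi_lo, pi_hi = lognormal_pi(x_skewed, t_crit=2.776)
print(pi_lo, pi_hi)
```

The limits are equidistant from the geometric mean on a ratio scale, not on an arithmetic scale, which is what makes the interval asymmetric in the original units.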
Confidence Intervals for Quantiles (Percentiles)
- E.g., what is the CI of the 100-yr flood?
- The 100-yr flood is the 99th percentile (0.99 quantile) of the distribution of annual flood data.
- Similarly, the 2-yr flood is the median, or 50th percentile, of annual floods.
- In environmental monitoring, the median, 95th, or some other percentile should not exceed (or fall below) a standard (e.g., a water quality standard).
Valid for All Data - NP Interval
- The procedure is similar to that for the median, except that the binomial probabilities use parameters n and p = the quantile of interest.
- For n > 20, the normal approximation can be used.
- The 0.5 in the rank formulas is a continuity correction term.
- The computed ranks $R_l$ and $R_u$ are rounded to the nearest integer.
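The large-sample rank calculation can be sketched in Python as below. The formulas used, R_l = np - z sqrt(np(1-p)) + 0.5 and R_u = np + z sqrt(np(1-p)) + 0.5 with z the upper α/2 standard normal quantile, are an assumption that is consistent with the 0.5 continuity correction mentioned above; they are not quoted from the slides. The function name `quantile_ci_ranks` and the clipping of ranks to 1..n are also choices made for this sketch.

```python
import math
from statistics import NormalDist

def quantile_ci_ranks(n, p, alpha=0.05):
    """Approximate ranks (R_l, R_u) bounding a CI for the p-th quantile, n > 20."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    se = math.sqrt(n * p * (1 - p))           # sd of the binomial count
    r_l = n * p - z * se + 0.5                # 0.5 is the continuity correction
    r_u = n * p + z * se + 0.5
    # Python's round() sends exact .5 ties to the even integer; ties are rare here.
    r_l, r_u = round(r_l), round(r_u)
    return max(r_l, 1), min(r_u, n)           # keep ranks inside the sample

print(quantile_ci_ranks(25, 0.5))             # median, n = 25
print(quantile_ci_ranks(100, 0.9))            # 90th percentile, n = 100
```

The interval end points are then the data values at those ranks in the sorted sample.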
NP Tests for Percentiles
NP Test for Whether a Percentile Differs From X0 (Two-Sided Test)
- E.g., a water quality standard X0 could be set such that the median of daily concentrations should not exceed X0 ppb.
- Compute the interval for the percentile. If X0 falls within this interval, the percentile is not significantly different from X0 at the α level.
NP Test for Whether a Percentile Exceeds X0 (One-Sided Test)
- Compute a one-sided CI for the percentile. Remember that the entire error level α is placed on the side below the percentile point estimate.
NP Test for Whether a Percentile Is Less Than X0 (One-Sided Test)
- Compute a one-sided CI for the percentile, placing all of the error α on the side above the estimated percentile.
Intervals for Normal Population
- Factors for calculating two-sided 95% intervals for a normal distribution (or data transformed to normality) are available in a table.
- The 95% CI to contain the true mean is given by $\bar{x} \pm c_M(n)\,s$,
- where the factor $c_M(n)$ is obtained from the table and s is the sample standard deviation.
- The factor is the 97.5th percentile of the t-distribution with n - 1 degrees of freedom, divided by $\sqrt{n}$.
- E.g., if n = 5, the calculated mean is 50.1, and the standard deviation is 1.31, the factor is 1.24.
- To compute a 95% tolerance interval to contain a specific percentage (e.g., 90%) of a normal population, we use $\bar{x} \pm c_{T,90}(n)\,s$,
- where $c_{T,90}(n)$ is the factor for the given sample size n and s is the standard deviation.
- E.g., we have 5 observations; the sample mean is 50.1 and the standard deviation is 1.31. From the table, the factor to contain 90% of a normal distribution with 95% confidence when n = 5 is 4.28. Hence the tolerance interval is from 44.49 to 55.71.
- Factors for other percentages (95% and 99%) are also given in the table.
- Prediction intervals for future observations (a PI to contain all k future observations) are computed similarly using factors from the table. With 95% confidence, all k future observations from a previously sampled normal population will be located in the interval given by $\bar{x} \pm c\,s$, where c is the tabled factor for the given n and k,
- for k = 1, 2, 5, 10, and 20.
- E.g., for a random sample of n = 5 observations, a 95% PI to contain the values of k = 2 further randomly selected observations from the same population is 50.1 ± 3.70(1.31).
Comparison of Length of Intervals
- Comparison of lengths of statistical intervals for the examples used.
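The three worked examples (n = 5, mean 50.1, standard deviation 1.31) can be compared with a few lines of Python. The factors 1.24, 3.70, and 4.28 are the tabled values quoted in the examples; the function name `factor_interval` and the dictionary labels are hypothetical.

```python
def factor_interval(mean, s, factor):
    """Two-sided interval mean +/- factor * s, using a tabled factor."""
    half = factor * s
    return mean - half, mean + half

mean, s = 50.1, 1.31
factors = {
    "95% CI for the mean": 1.24,
    "95% PI for k = 2 future observations": 3.70,
    "95% tolerance interval for 90% of the population": 4.28,
}

intervals = {name: factor_interval(mean, s, c) for name, c in factors.items()}
for name, (low, high) in intervals.items():
    # Print the end points and the total length of each interval.
    print(f"{name}: {low:.2f} to {high:.2f} (length {high - low:.2f})")
```

For the same data, the CI on the mean is the shortest interval, the prediction interval is longer, and the tolerance interval is the longest.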
Bootstrap Methods for Standard Errors and Confidence Intervals
- The bootstrap method for estimating the standard error of a statistic is one of the most significant developments in the field of statistics. The method has also been called the computer-intensive method or a resampling method. It can be used for hypothesis testing and probability evaluation. The method is completely nonparametric. The essential features of the bootstrap approach are best illustrated by an example.
- Suppose we wish to estimate the median m from a sample of 13 data points:
- 19.2 16.2 10.7 16.6 3.6 18.1 8.6
- 15.3 14.0 14.2 16.9 13.4 5.7
- and that we want the standard error of the estimate, or a CI for m. The sample median of these data is m = 14.2.
- Step 1
- Draw a random sample of size 13, with replacement, from the original sample, e.g.,
- 3.6 19.2 8.6 15.3 8.6 3.6 14.2 10.7 10.7 10.7 16.9 14.2
- This is the first bootstrap sample. Note that some of the original sample values occur more than once, and others not at all. The key phrase here is "with replacement."
- Step 2
- Calculate the median of the bootstrap sample:
- m1 = 10.7
- Step 3
- Carry out Steps 1 and 2 a large number B (say 200) of times, to obtain 200 bootstrap estimates m1, ..., m200.
- Step 4
- The standard deviation of the 200 bootstrap medians gives the standard error of m.
- To obtain the confidence interval, first sort the 200 bootstrap medians in ascending order (lowest to highest). For a 90% confidence interval for m, choose the bootstrap values of rank 10 and rank 191 as the lower and upper confidence limits.
- The above steps can be implemented easily in MINITAB by writing a short, simple macro. The macro can be written during the MINITAB session or with a text editor.
- Assume the data are in column C1. Using Notepad or another text editor, type:
    gmacro
    bootmed
    do k1 = 1:200                 # repeat for 200 bootstrap samples
      sample 13 c1 c2;            # draw 13 values from C1 into C2
      replace.                    # sample with replacement
      let c3(k1) = median(c2)     # store the bootstrap median in row k1 of C3
    enddo
    endmacro
- Save the file as bootmed.mtb in the macro directory of Minitab.
- To run, type during the Minitab session:
    MTB > bootmed
- The 200 values of the bootstrap medians will be in C3.
- DESCRIBE C3 will give statistics of the 200 medians.
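For readers without Minitab, the same four steps can be sketched in Python using only the standard library. The function name `bootstrap_median`, the fixed random seed, and the return layout are assumptions of this sketch; results will differ from any Minitab run because the random resamples differ.

```python
import random
import statistics

data = [19.2, 16.2, 10.7, 16.6, 3.6, 18.1, 8.6,
        15.3, 14.0, 14.2, 16.9, 13.4, 5.7]

def bootstrap_median(data, B=200, alpha=0.10, seed=0):
    """Bootstrap standard error and percentile CI for the median."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    n = len(data)
    # Steps 1-3: resample with replacement B times and take each median.
    medians = sorted(
        statistics.median(rng.choices(data, k=n)) for _ in range(B)
    )
    # Step 4: the spread of the bootstrap medians estimates the standard error.
    se = statistics.stdev(medians)
    k = round(B * alpha / 2)                  # 10 when B = 200 and alpha = 0.10
    lower = medians[k - 1]                    # rank 10
    upper = medians[B - k]                    # rank 191
    return medians, se, (lower, upper)

medians, se, (lower, upper) = bootstrap_median(data)
print("sample median:", statistics.median(data))
print("bootstrap SE:", round(se, 3))
print("90% CI:", lower, "to", upper)
```

A larger B gives more stable limits; 200 is used here only to match the example in the text.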
Summary
- 1. Probability vs. statistics (deductive vs. inductive).
- 2. Things to consider in an estimator.
- 3. Parametric and nonparametric interval estimates.
- 4. Types of interval estimates:
- confidence: mean or median
- prediction: one or more future values
- confidence on a percentile (tolerance)