DEFINITIONS - PowerPoint PPT Presentation

Provided by: satga
Learn more at: https://qbic.fiu.edu
1
DEFINITIONS
We are studying the effect of environmental
conditions on human subjects. The SYSTEM is the
human subject. The POPULATION is the temperature
readings of human subjects under the experimental
conditions. The RANDOM VARIABLE is the measurement
recorded on a thermometer. The SYSTEM OUTPUT is the
reading on the thermometer.
2
  • In this example the MODEL is assumed to be a
    normal distribution.
  • There are two PARAMETERS of interest in a normal
    distribution, µ and σ.
  • Ten subjects are tested and the DATA are
    98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4,
    97.9, 97.4.
  • Entering these data in SPSS we obtain ESTIMATES of
    µ as 97.85 and σ as 0.550.
  • These are the estimates of the MEAN and STANDARD
    DEVIATION of the population temperatures.
  • The estimates of skewness ($\sqrt{b_1}$) and kurtosis
    ($b_2$) can be obtained from the SPSS output and are
    0.798 and 0.975.
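The estimates of µ and σ can be checked without SPSS. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and `stdev` uses the usual n - 1 divisor.

```python
from statistics import mean, stdev

# Temperature readings of the ten subjects (from the slide).
temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]

mu_hat = mean(temps)       # estimate of the mean
sigma_hat = stdev(temps)   # sample standard deviation (n - 1 divisor)

print(f"mean = {mu_hat:.2f}, standard deviation = {sigma_hat:.3f}")
```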

3
Estimate the 90th percentile
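  • 1) If we assume that the normal distribution is
    the correct model, then the 90th PERCENTILE of the
    standard normal distribution, Z.90, is 1.282.
  • Using the equation
    $Y_{0.90} = \mu + \sigma Z_{0.90}$
  • and substituting the estimates for the parameters µ
    and σ yields Y.90 = 97.85 + 0.550(1.282) = 98.555.
  • The value 1.645 is the 95th percentile of the
    standard normal; with it the same equation gives
    Y.95 = 98.755.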
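A short sketch of the model-based percentile, assuming the estimates 97.85 and 0.550 and using `statistics.NormalDist` in place of the normal table. The function name `normal_percentile` is illustrative, not part of the course material.

```python
from statistics import NormalDist

mu_hat, sigma_hat = 97.85, 0.550   # estimates quoted on the slide

def normal_percentile(p, mu, sigma):
    """Return the 100p-th percentile of a normal(mu, sigma) model."""
    z_p = NormalDist().inv_cdf(p)  # percentile of the standard normal
    return mu + sigma * z_p

y90 = normal_percentile(0.90, mu_hat, sigma_hat)
y95 = normal_percentile(0.95, mu_hat, sigma_hat)
print(f"Y.90 = {y90:.3f}, Y.95 = {y95:.3f}")
```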

4
  • 2) If we do not assume a statistical model and
    want to estimate the percentile directly from the
    data, the following equation is used:
    $i = np + 1/2$
  • where n is the number of data points, p is the
    desired percentile (as a proportion) and i is the
    order number in the ordered data list
    corresponding to the desired percentile.
  • To find the estimate of the 75th percentile for
    our example:
    $i = 10 \times 0.75 + 0.5 = 8.0$
  • Thus the eighth value in the ordered list is the
    75th percentile. If i is not an integer we use
    linear interpolation to find the estimate.
  • The ordered list is denoted the ORDER STATISTICS.
    The order statistics for our example are
    96.7, 97.4, 97.6, 97.7, 97.8, 97.9, 98.2, 98.2,
    98.4, 98.6.
  • Thus 98.2 is the estimate of the 75th percentile.
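The order-statistic rule i = np + 1/2 with linear interpolation can be sketched as follows. The function name and the clamping of i to the range 1..n are assumptions made for this illustration; the slide does not say what to do when i falls outside the list.

```python
import math

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]

def order_stat_percentile(data, p):
    """Estimate the 100p-th percentile from the order statistics."""
    xs = sorted(data)                 # the order statistics
    n = len(xs)
    i = n * p + 0.5                   # 1-based, possibly fractional order number
    i = min(max(i, 1.0), float(n))    # assumption: clamp to the available range
    lower = math.floor(i)
    frac = i - lower
    if frac == 0 or lower >= n:
        return xs[lower - 1]
    # Linear interpolation between the two neighbouring order statistics.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

p75 = order_stat_percentile(temps, 0.75)
p50 = order_stat_percentile(temps, 0.50)
```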

5
  • To find P(Y < 99.0) = F(99.0) we use the
    DISTRIBUTION FUNCTION of the standard normal
    distribution.
  • To obtain this probability, we use the equation
    $Z = (Y - \mu)/\sigma$
  • with the estimates of µ and σ, and the table of
    the normal distribution to find F(Z).
  • For this example
    $Z = (99.0 - 97.85)/0.55 = 2.09$ and $F(2.09) = 0.982$.
  • If we want to obtain a confidence interval for µ
    we use the Student t distribution.
  • The 95% interval from SPSS is (97.46, 98.24).
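As a sketch, the table lookup for P(Y < 99.0) can be replaced by `statistics.NormalDist`, assuming the estimates 97.85 and 0.550 are used for µ and σ.

```python
from statistics import NormalDist

mu_hat, sigma_hat = 97.85, 0.550

z = (99.0 - mu_hat) / sigma_hat       # standardize the value of interest
prob_below_99 = NormalDist().cdf(z)   # F(z) for the standard normal

print(f"Z = {z:.2f}, P(Y < 99.0) = {prob_below_99:.3f}")
```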

6
Testing Hypotheses
  • To TEST the HYPOTHESIS
  • H0: µ = 98.7 vs. HA: µ < 98.7
  • a lower-tailed t test is used. This can be done
    using SPSS.
  • For this example t = -4.89 and p = 0.00043.
  • Thus the conclusion is that the data provide
    strong evidence against µ = 98.7.
  • Note: for a one-tailed test we must divide the
    two-tailed p-value from SPSS by two.
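The t statistic can be reproduced by hand. This sketch assumes a hypothesized mean of 98.7, which is the value that reproduces t = -4.89 with these data; the p-value itself needs a t table or a statistics package and is not computed here.

```python
from math import sqrt
from statistics import mean, stdev

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
mu0 = 98.7   # hypothesized mean (assumption, see the note above)

n = len(temps)
standard_error = stdev(temps) / sqrt(n)
t_stat = (mean(temps) - mu0) / standard_error   # compare with t on n - 1 df

print(f"t = {t_stat:.2f} on {n - 1} degrees of freedom")
```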

7
CONTINUOUS DISTRIBUTIONS
8
Normal Distribution
  • The normal is the most commonly used distribution
    to model system output. The normal distribution
    is used to represent system output that results
    from the additive effect of many factors.
  • Consider the blood pressure measurement from a
    single individual. As you know, one's blood
    pressure varies each time it is measured. The
    reading obtained is the result of many factors
    both physical and mental at the time of the
    reading. Thus it would be logical to use the
    normal distribution as a model for blood pressure
    readings. Many of the system outputs that are
    measured are the result of a series of added
    effects.

9
  • The density function of the normal distribution
    is given by
    $f(t) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}[(t - \mu)/\sigma]^2\}$,
    $-\infty < t < \infty$, $\sigma > 0$, $-\infty < \mu < \infty$,
    where µ and σ are the mean and standard
    deviation of the distribution.
  • The distribution function can only be evaluated
    numerically and hence a table or computer program
    must be used to determine F(y).
    You learned in your statistics class how to
    estimate µ and σ using the sample mean
    and sample standard deviation. (See equations
    (2-31) and (2-51a) in your text.) The normal
    distribution is symmetrical about its mean, hence
    its skewness measure is zero. The kurtosis
    measure is 3. The Student t distribution is used
    to obtain confidence intervals for the parameter
    µ and the chi-squared distribution is used for
    the parameter σ. Review your statistics notes for
    these formulas.

10
Normal Distribution
11
The Half Normal Distribution
  • In some problems the random variable of interest
    is the absolute value of the measurement. Thus if
    we are measuring the deviation from a standard
    and are not interested in whether the reading is
    positive or negative, and the distribution of the
    original data is normal, then the appropriate
    model to use for the absolute values is a
    half-normal distribution.

12
  • The density of the half-normal distribution is
    $f(t) = [2/(\pi\sigma^2)]^{1/2} \exp[-t^2/(2\sigma^2)]$, $t > 0$, $\sigma > 0$.
  • There is only one parameter for the half-normal,
    σ.
  • The mean of the distribution is 0.798σ and the
    variance is 0.363σ².
  • The graph of the distribution looks like the
    positive portion of the normal curve.
  • Thus it is skewed, with a skewness measure of
    0.995 and a kurtosis measure of 3.869.

13
Half-Normal Distribution
14
Matching of Moments
  • The method of MATCHING OF MOMENTS is a useful
    technique for estimating the parameters of a
    distribution. In this method we equate the
    moments obtained from the  data (sample mean,
    sample variance, sample measure of skewness
    and/or sample measure of kurtosis) with the
    moments of the selected model. We match one
    moment for each  unknown parameter. Thus for the
    half-normal distribution we match the sample mean
    with the mean of the half-normal, 0.798s. If a
    distribution had two parameters we would match
    the sample mean and variance with the mean and
    variance of the model. 

15
  • EXAMPLE
  • A pacemaker is being tested to determine the
    average error from the nominal of 60 beats per
    minute. Thus the random variable of interest is
    the deviation from the nominal of 60.
  • We are not concerned whether the deviation is
    positive or negative. The readings are
  • 6.5 1.6 6.9 11.7 5.7 2.1 2.5 5.3
    1.8 2.0 10.9 6.0 13.2
  • The sample mean is 5.86. Therefore we match 5.86
    with 0.798σ, yielding the estimate of σ of
    5.86/0.798 = 7.34. Tables and other methods of
    estimation can be found in the 1961 issue of
    Technometrics, volume 3, page 543, in an
    article by Leone and Nelson entitled "The Folded
    Normal Distribution". The folded normal
    distribution is more general than the half-normal,
    which is the special case of the folded normal
    where the distribution is folded at zero.
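The matching-of-moments step for the half-normal can be sketched in a few lines. The constant 0.798 is sqrt(2/pi), the mean of a half-normal with σ = 1; the variable names are illustrative.

```python
from math import pi, sqrt
from statistics import mean

# Absolute deviations from the nominal 60 beats per minute (from the slide).
deviations = [6.5, 1.6, 6.9, 11.7, 5.7, 2.1, 2.5, 5.3,
              1.8, 2.0, 10.9, 6.0, 13.2]

mean_factor = sqrt(2 / pi)                    # half-normal mean is mean_factor * sigma
sigma_hat = mean(deviations) / mean_factor    # match sample mean to model mean

print(f"sample mean = {mean(deviations):.2f}, sigma estimate = {sigma_hat:.2f}")
```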

16
Exponential distribution
Consider some outcome that represents the time to
the end of life of a system under study, the time
of death of a subject, the time of failure of a
piece of equipment, or the time between
occurrences of an event where the probability of
occurrence is proportional to the length of the
time interval and the rate of occurrence is
constant over time. Let F(t) = Pr{T < t}, the
distribution function. Now the probability of
a failure occurring in the next
instant of time, given that the system is working at
time t, is given by the conditional probability in the
interval (t, t + Δt).
17
This probability is
$[F(t + \Delta t) - F(t)]/[1 - F(t)] = \lambda \Delta t$
where λ is the constant rate of
occurrence. The solution of this equation is
$1 - F(t) = e^{-\lambda t}$, where t > 0 and λ > 0.
The density function is obtained by differentiating
F(t) with respect to t. This yields the exponential
distribution
$f(t) = \lambda e^{-\lambda t}$, t > 0 and λ > 0.
The distribution function is $F(t) = 1 - e^{-\lambda t}$.
18
Exponential Distribution
19
The Conditional Failure Rate Function
A useful function which can serve as a guide for
the selection of a failure model is r(t), the
CONDITIONAL FAILURE RATE FUNCTION. This yields
the conditional probability of a failure
occurring in the next instant of time given that
the system has not failed up to time t:
$r(t) = f(t)/[1 - F(t)]$
The value of r(t) for the exponential distribution is
$r(t) = \lambda e^{-\lambda t}/[1 - (1 - e^{-\lambda t})] = \lambda$
20
The conditional failure rate for the exponential
distribution is a constant, so it would be used
for the distribution of the time to failure for
systems which do not wear out. The reciprocal
of the parameter λ is both the mean and standard
deviation of this distribution. The reciprocal of
the sample mean, $\bar{t}$, is used to estimate this
parameter. A confidence interval for this
parameter can be obtained using the fact that the
statistic $2n\lambda\bar{t}$ has a chi-squared distribution
with 2n degrees of freedom, where n is the sample
size. Hence a (1 - α)100% confidence interval follows
from the fact that
$\Pr\{\chi^2_{2n,\alpha/2} < 2n\lambda\bar{t} < \chi^2_{2n,1-\alpha/2}\} = 1 - \alpha$,
yielding the interval
$(\chi^2_{2n,\alpha/2}/(2n\bar{t}),\ \chi^2_{2n,1-\alpha/2}/(2n\bar{t}))$.
Review the use of the chi-squared tables.
21
In an experiment to test the durability of a new
design of pacemakers an accelerated test was
conducted on 50 units. The average time to
failure was 4.05 months. Estimate the parameter λ
and obtain a 95% confidence interval.
22
The estimate of λ is 1/4.05 = 0.247. The
confidence interval is obtained via use of the
chi-square table with 100 degrees of freedom.
Using α = 0.05, the 0.025 and 0.975
percentiles are found in the table to be 74.2 and
129.6. This yields the interval (74.2/(100 × 4.05),
129.6/405) = (0.183, 0.320). We conclude that the
best estimate of the mean life of the pacemakers
is 4.05 months and we are 95% confident that
the mean life is between 3.13 months (1/0.320)
and 5.46 months (1/0.183) in the accelerated
environment.
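The interval can be sketched without a chi-squared table. The Python standard library has no chi-squared quantile function, so this sketch substitutes the Wilson-Hilferty approximation for the table lookup (an assumption of this illustration, not the method used on the slide); for 100 degrees of freedom it agrees with the tabled 74.2 and 129.6 to the precision shown.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the p-th quantile of chi-squared(df)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

n, t_bar = 50, 4.05            # sample size and mean time to failure (months)
lam_hat = 1.0 / t_bar          # estimate of the failure rate

lower_q = chi2_quantile_wh(0.025, 2 * n)
upper_q = chi2_quantile_wh(0.975, 2 * n)

# 95% interval for lambda, then inverted to give an interval for the mean life.
lam_ci = (lower_q / (2 * n * t_bar), upper_q / (2 * n * t_bar))
mean_life_ci = (1.0 / lam_ci[1], 1.0 / lam_ci[0])
```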
23
Gamma distribution
Let us now consider a model that can be used when
the event or failure does not occur until there
are η sub-occurrences and the time between each
of these sub-occurrences has an independent
exponential distribution. Examples are the
life of an animal that does not die until it is
attacked five times by a predator, or the time
to overhaul a machine after it is repaired six
times, where the times between events are
independent exponential variables with a constant
value of λ. Thus the random variable of interest
is the sum of the times between failures until the
η-th occurrence, where the times between
occurrences are independent exponential variables
each with the parameter λ.
24
This random variable has a gamma distribution
whose density is
$f(t) = [\lambda^{\eta}/\Gamma(\eta)]\, t^{\eta-1} e^{-\lambda t}$, t > 0, η > 0, λ > 0.
The function $\Gamma(\eta) = (\eta - 1)!$
when η is a positive integer. When η is not an
integer the value of the function must be
looked up in a table of the gamma function. The
distribution function can only be evaluated
analytically when η is an integer. The
distribution function F(t), when η is an integer,
is given by
$F(t) = 1 - \sum_{k=0}^{\eta-1} (\lambda t)^k e^{-\lambda t}/k!$.
This sum can be obtained
from a Poisson table with parameter λt and
y = η - 1. Check this out in your text. The mean of
this distribution is η/λ and the variance is
η/λ².
25
The parameters can be estimated by the method of
matching the moments. Since there are two
parameters we use the sample mean and sample
variance in the matching process. The estimate
of λ is $\bar{t}/s^2$. The estimate of η is $\bar{t}^2/s^2$, i.e.
the sample mean squared divided by the sample
variance. If the parameter η can only take on
integer values the gamma distribution is
sometimes called the Erlangian distribution.
Computation of confidence intervals and the
conditional failure rate are complicated and can
be found in Statistical Modeling Techniques by
Shapiro and Gross, published by Marcel Dekker,
1981. The failure rate increases with time when
η > 1 and decreases when it is less than one.
26
An experiment is run to estimate the average time
it takes for a machine to require a complete
overhauling. A machine is overhauled after it
needs to be recalibrated six times. The times
between recalibrations have independent
exponential distributions. The average time
between overhauls is 525.5 hours and the standard
deviation is 207.7. The estimate of λ is
525.5/207.7² = 0.0122 and the estimate of η is
0.0122(525.5) = 6.4. Find the probability that
a machine will need overhauling in less than 300
hours.
27
  • We can get an approximate answer to this if
    we assume that η is equal to 6.0.
  • Use the Poisson table with y = 6 - 1 = 5 and the
    column value of λt = (0.0122)300 = 3.66.
  • Using the closest column value to 3.66, the
    sum for a value of 5 is approximately 0.844.
  • Therefore F(300) = 1 - 0.844 = 0.156.
  • S. Shapiro and L. Chen, Composite Test for the
    Gamma Distribution, Journal of Quality
    Technology, 33, 47-59 (1998).
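The moment estimates and the integer-shape distribution function can be sketched as below. The names `lam_hat`, `eta_hat` and `erlang_cdf` are illustrative. Because the sketch evaluates the Poisson sum at the exact value of λt rather than at the nearest table column, its answer differs slightly from the table-based value.

```python
from math import exp, factorial

t_bar, s = 525.5, 207.7         # sample mean and standard deviation (hours)

lam_hat = t_bar / s ** 2        # matching of moments: rate parameter
eta_hat = t_bar ** 2 / s ** 2   # matching of moments: shape parameter

def erlang_cdf(t, eta, lam):
    """F(t) for a gamma distribution whose shape eta is a positive integer."""
    x = lam * t
    poisson_sum = sum(x ** k * exp(-x) / factorial(k) for k in range(eta))
    return 1.0 - poisson_sum

# Approximate answer with the shape rounded to 6, as on the slide.
p_overhaul_300 = erlang_cdf(300.0, 6, lam_hat)
print(f"lambda = {lam_hat:.4f}, eta = {eta_hat:.1f}, F(300) = {p_overhaul_300:.3f}")
```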

28
Gamma distribution
29
Weibull distribution
The exponential model is limited as a
lifetime model since it can only be used in
situations where the conditional failure rate is
constant. If we start with the function
$r(t) = (\eta/\sigma)(t/\sigma)^{\eta-1}$, then if η > 1 the function
increases with time and if η < 1 it
decreases. Note that when η = 1, r(t) is constant
and it is the function for the exponential
distribution. Setting r(t) equal to f(t)/(1 - F(t))
yields
$f(t) = (\eta/\sigma)(t/\sigma)^{\eta-1} \exp[-(t/\sigma)^{\eta}]$, t ≥ 0, η > 0, σ > 0,
and $F(t) = 1 - \exp[-(t/\sigma)^{\eta}]$, t ≥ 0.
The mean of this distribution is
$\sigma\,\Gamma(1 + 1/\eta)$, where Γ(x) is the gamma function
discussed previously. The variance of the
distribution is $\sigma^2\{\Gamma(1 + 2/\eta) - [\Gamma(1 + 1/\eta)]^2\}$.
The estimation of the parameters requires a
numerical procedure or a graphical procedure to
be discussed later. Once the parameters are known
the distribution function can be used to obtain
probabilities.
30
  • In a study of 20 patients receiving an analgesic
    to relieve headache pain, a Weibull distribution
    was used to model the time to the cessation of
    pain.
  • The estimate of η was 2.79 and the estimate of σ
    was 2.14. The mean relief time was 1.89 hours.
  • The probability that the relief time will exceed
    4.0 hours is obtained from
    $1 - F(4) = \exp[-(4/2.14)^{2.79}] = 0.003$.
  • L. Chen and S. Shapiro, Can the Idea of the QH
    Test for Normality be Used for Testing the
    Weibull Distribution, J. Statistical Computation
    and Simulation, 55, 258-263 (1996).
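A minimal sketch of the Weibull calculation, assuming 2.79 is the shape estimate and 2.14 the scale estimate. The model mean computed from the gamma function need not equal the quoted mean relief time exactly, since the slide does not say how that figure was obtained.

```python
from math import exp, gamma

eta_hat, sigma_hat = 2.79, 2.14   # shape and scale estimates from the slide

def weibull_survival(t, eta, sigma):
    """1 - F(t) for a Weibull distribution with shape eta and scale sigma."""
    return exp(-((t / sigma) ** eta))

p_exceed_4 = weibull_survival(4.0, eta_hat, sigma_hat)
model_mean = sigma_hat * gamma(1.0 + 1.0 / eta_hat)   # sigma * Gamma(1 + 1/eta)

print(f"P(T > 4) = {p_exceed_4:.3f}, model mean = {model_mean:.2f} hours")
```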

31
Weibull distribution
σ = 1 for all plots
32
The Rayleigh distribution is a Weibull with
η = 2.
33
Lognormal distribution
It was stated that the normal distribution is the
statistical model for events that represent
additive effects. We now consider a model that is
used when the event is caused by multiplicative
effects. If $T = X_1 X_2 \cdots X_n$ then $Y = \ln(T) = \sum \ln X_i$,
so Y is a sum of random variables and can be
modeled by a normal distribution. Then T has a
lognormal distribution with density function
$f(t) = [t\sigma\sqrt{2\pi}]^{-1} \exp\{-[1/(2\sigma^2)](\ln t - \mu)^2\}$,
t > 0, σ > 0, -∞ < µ < ∞. The mean and
variance of the lognormal are $\exp(\mu + \sigma^2/2)$ and
$\exp(2\mu + \sigma^2)[\exp(\sigma^2) - 1]$. Estimation of the
parameters is simple. Take the natural log of the
data; the estimate of µ is the sample mean of
the logs and the estimate of σ is the sample
standard deviation of the logs. Remember that
these are not the mean and the standard deviation
of the data!
34
An experiment is run to estimate the growth of a
strain of bacteria in a period of two days. The
growth at any point in time depends on the size
at the instant prior to the measurement. There is
a multiplicative effect that determines the size
of the bacteria colony. We will model the size
using a lognormal distribution. The following
sizes of ten colonies were measured after two
days: 9.98 10.36 10.04 12.82 10.86
10.39 9.06 11.17 10.29 10.78. Taking the
natural log of these numbers yields 2.30 2.34
2.31 2.55 2.39 2.34 2.20
2.41 2.33 2.38. The estimate of µ is
2.355 and the estimate of σ is 0.090. The mean
size of the colonies is
$\exp(2.355 + 0.090^2/2) \approx 10.58$. The variance of the colony
size is $\exp[2(2.355) + 0.090^2][\exp(0.090^2) - 1] \approx 111.96(0.0081) \approx 0.91$.
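The lognormal estimates can be recomputed directly from the ten colony sizes. This is a sketch using the standard library only; it works with unrounded logs, so its results may differ in the last digit from hand calculations based on two-decimal logs.

```python
from math import exp, log
from statistics import mean, stdev

sizes = [9.98, 10.36, 10.04, 12.82, 10.86, 10.39, 9.06, 11.17, 10.29, 10.78]

logs = [log(x) for x in sizes]   # natural logs of the data
mu_hat = mean(logs)              # estimate of mu: mean of the logs
sigma_hat = stdev(logs)          # estimate of sigma: std. deviation of the logs

# Mean and variance of the fitted lognormal model.
lognormal_mean = exp(mu_hat + sigma_hat ** 2 / 2)
lognormal_var = exp(2 * mu_hat + sigma_hat ** 2) * (exp(sigma_hat ** 2) - 1)

print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"model mean = {lognormal_mean:.2f}, model variance = {lognormal_var:.2f}")
```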
35
Lognormal distribution
36
Logistic distribution
The logistic distribution plays a major role in
describing growth processes, survival data and
demographic studies. Some of the early
applications were in the study of
population growth as well as numerous
other growth function studies which are
referenced in the Handbook of the Logistic
Distribution by Balakrishnan (1992). More
recently it has been used in bioassay studies and
in the analysis of survival distributions.
The density function is
$f(y) = \frac{\pi}{\sigma\sqrt{3}} \cdot \frac{\exp[-\pi(y-\mu)/(\sigma\sqrt{3})]}{\{1 + \exp[-\pi(y-\mu)/(\sigma\sqrt{3})]\}^2}$, $-\infty < y < \infty$,
with corresponding distribution function
given by
$F(y) = \{1 + \exp[-\pi(y-\mu)/(\sigma\sqrt{3})]\}^{-1}$.
The mean and variance of the
distribution are µ and σ².
37
Like the normal distribution, practitioners often
work with the standard logistic random variable,
$Z = \pi(Y - \mu)/(\sigma\sqrt{3})$. Z has density
$f(z) = e^{-z}/(1 + e^{-z})^2$. The
standard logistic distribution function
is $F(z) = 1/(1 + e^{-z})$. The moment estimators of the
parameters are the sample mean for µ and the
sample variance for σ². The distribution
resembles the normal distribution but has heavier
tails.
38
The growth of a bacteria colony can be modeled by
a logistic distribution. In an experiment the
diameter of a colony is measured after three days'
growth. Ten colonies are measured and the results
are 23 46 32 38 30 68 44 37 36 30. The
sample mean is 38.4 and the sample standard
deviation is 12.44. Hence, the estimate of µ is
38.4 and the estimate of σ is 12.44. The
probability that the size of a colony will be
less than 22 is given by F(22). Converting this
to the standard variable,
$Z = 3.14(22 - 38.4)/[12.44(1.73)] = -2.39$,
then F(-2.39) = 0.084. Thus the probability that
a colony will have a diameter of less than 22
is approximately 0.084.
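The logistic calculation can be sketched as follows, assuming the parameterization in which µ is the mean and σ the standard deviation, so that the standardized variable is Z = π(y - µ)/(σ√3). The function name is illustrative.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

diameters = [23, 46, 32, 38, 30, 68, 44, 37, 36, 30]

mu_hat = mean(diameters)       # moment estimate of mu
sigma_hat = stdev(diameters)   # moment estimate of sigma

def logistic_cdf(y, mu, sigma):
    """F(y) for a logistic distribution with mean mu and std. deviation sigma."""
    z = pi * (y - mu) / (sigma * sqrt(3))   # standard logistic variable
    return 1.0 / (1.0 + exp(-z))

p_below_22 = logistic_cdf(22, mu_hat, sigma_hat)
print(f"P(diameter < 22) = {p_below_22:.3f}")
```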
39
Logistic distribution
40
See pages 122-132 of HS for a summary description
of this material. There are several other models
which are not covered in this course. Such
models as the Pareto and the extreme value
distribution can be found in the literature.
41
DISCRETE DISTRIBUTIONS
42
Discrete distributions
The foregoing distributions are used when the
data being observed are measurements. In some
cases the data are counts (discrete) and hence a
discrete distribution must be chosen. The
selection process depends on two factors: the
type of experiment and the question under study.
There are two types of experiments that we will
cover in this course. The first involves
Bernoulli trials.
43
Bernoulli trial
A Bernoulli trial is one in which the outcome of the
experiment has only two possibilities. We will
call these success and failure. These are coded
one and zero. This covers experiments such as
live or die, win or lose, success or failure,
cure or not cure, etc. The Bernoulli random
variable is
Y = 0 if a failure occurs,
Y = 1 if a success occurs.
The probability function is
$f(y) = p^y (1 - p)^{1-y}$, y = 0, 1.
This is not a useful model by itself but serves
as a building block for other discrete
distributions.

44
The Binomial Distribution
Experiment: n independent Bernoulli trials where
Pr{1} = p on each trial.
Question: What is the probability of Y successes in n
trials?
Answer: Binomial distribution
$f(y) = \frac{n!}{y!(n-y)!}\, p^y (1-p)^{n-y}$, y = 0, 1, ..., n,
0 < p < 1, where n is the sample size, Y is the
number of ones and p is the probability of a one.
The estimator of the parameter p is Y/n, the
mean of Y is np and the variance is np(1-p).
45
A new antibiotic is being tested on 20 subjects
that have been infected with a virus. At the end
of three days they are tested to see if the
infection is gone. There are two possible
outcomes for this experiment: Yes and No. The
trials are independent and we are interested in
the probability that two or fewer people will not be
cured (Y ≤ 2). The results were that only one
person was not cured. In this case Y is the
number of persons not cured. The estimate of p is
1/20 = 0.05. Using a binomial table or the
computer, the probability that Y is less than or
equal to two is 0.9245. Thus, the estimate of
the probability that someone will be cured by
the treatment is 0.95.
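In place of a binomial table, the cumulative probability can be summed directly. A minimal sketch, with an illustrative function name:

```python
from math import comb

def binomial_cdf(y, n, p):
    """P(Y <= y) for a binomial distribution with n trials and probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(y + 1))

n_subjects = 20
p_hat = 1 / n_subjects   # one subject out of 20 was not cured

p_at_most_two = binomial_cdf(2, n_subjects, p_hat)
print(f"P(Y <= 2) = {p_at_most_two:.4f}")
```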
46
The Geometric distribution
Experiment: Independent Bernoulli trials where
Pr{1} = p on all trials.
Question: What is the probability that the first
success comes on trial Y?
Answer: Geometric distribution. In this
situation the random variable is the number of
trials, while in the binomial distribution it was
the number of ones occurring in n trials.
$f(y) = (1-p)^{y-1} p$, y = 1, 2, ..., 0 < p < 1.
The mean of
the distribution is 1/p and the variance is
(1-p)/p². The estimator of the parameter p is
the reciprocal of the sample mean.
47
In the use of a defibrillator the device is used
until the patient's heart starts beating on its
own. A study was conducted to estimate the
probability that the heart starts beating after
at least four uses of the device. The data
available are the number of trials required on 20
patients for a successful use of the device. The
data are 2 5 3 6 7 4 2 8
2 1 4 6 3 7 1 4 2 4 5 2.
The sample mean is 78/20 = 3.9. Thus the
estimate of p is 1/3.9 = 0.256. Here
$P(Y \ge y) = (1-p)^{y-1}$ and $P(Y < y) = 1 - (1-p)^{y-1}$. The
estimated probability that a defibrillator will
be used at least five times is (1 - 0.256)^4 =
0.306. A generalization of the geometric is the
negative binomial or Pascal distribution. The
experiment is the same as in the geometric case;
however, the question is what is the probability
that the s-th success occurs on trial Y. The
formula and other information can be found in HS.
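The geometric estimate and tail probability can be sketched in a few lines; `geometric_at_least` is an illustrative name for P(Y ≥ y) = (1 - p)^(y - 1).

```python
from statistics import mean

# Number of uses of the device needed for each of the 20 patients.
uses = [2, 5, 3, 6, 7, 4, 2, 8, 2, 1, 4, 6, 3, 7, 1, 4, 2, 4, 5, 2]

p_hat = 1 / mean(uses)   # reciprocal of the sample mean

def geometric_at_least(y, p):
    """P(Y >= y): the first y - 1 trials are all failures."""
    return (1 - p) ** (y - 1)

p_five_or_more = geometric_at_least(5, p_hat)
print(f"p = {p_hat:.3f}, P(Y >= 5) = {p_five_or_more:.3f}")
```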
48
The Hypergeometric distribution
Experiment: Dependent Bernoulli trials in a
finite population of size N from which a sample
of size n is taken, where there are k ones and N-k
zeros.
Question: What is the probability of Y
successes in the sample of n?
The probability function is
$f(y) = \frac{k!}{y!(k-y)!} \cdot \frac{(N-k)!}{(n-y)!(N-k-n+y)!} \Big/ \frac{N!}{n!(N-n)!}$,
y = 0, 1, 2, ..., n, y ≤ k, n - y ≤ N - k.
The mean of the distribution is
nk/N and the variance is nk(N-k)(N-n)/[N²(N-1)].
The parameter of interest in this case is k;
its estimator is the greatest integer less than or
equal to (N+1)y/n. In some problems N is the
unknown parameter. If that is the case the
estimator of N is the greatest integer less than
or equal to kn/y.
49
A public health official is interested in
estimating the number of persons with a given
disease in a group of 20. The examination is
costly so he takes a random sample of 10 people
and he finds that 4 have the disease. He wishes
to estimate k, the number of sick individuals.
The estimate of k is 21(4)/10 = 8.4, hence the
estimate is 8 sick individuals. Using this
estimate, the probability of finding 4 people out
of ten with the disease would be
$[8!/(4!\,4!)] \times [12!/(6!\,6!)] / [20!/(10!\,10!)] = 0.35$.
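The hypergeometric estimate and probability can be sketched with `math.comb`; the function name is illustrative.

```python
from math import comb, floor

N, n, y = 20, 10, 4   # group size, sample size, sick persons found in the sample

k_hat = floor((N + 1) * y / n)   # greatest integer <= (N + 1) y / n

def hypergeometric_pmf(y, N, k, n):
    """P(Y = y) when n are sampled without replacement from N containing k ones."""
    return comb(k, y) * comb(N - k, n - y) / comb(N, n)

p_four_sick = hypergeometric_pmf(y, N, k_hat, n)
print(f"k estimate = {k_hat}, P(Y = 4) = {p_four_sick:.2f}")
```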
50
The Poisson distribution
Experiment: Count the number of occurrences of
independent events that occur at rate λ in a
period of time or in an area, volume, unit, etc.
Question: What is the probability of Y
occurrences in the period of time, area,
etc.?
Answer: Poisson distribution. This
experiment does not involve Bernoulli trials nor
is there a sample size. Examples are the number
of times a pacemaker has to be replaced in a
patient in a period of one year, the number of
tumors found on a patient, the number of fish
caught in a five minute period, the number of
arrivals of ambulances in a one hour period, etc.
The probability function is
$f(y) = \lambda^y e^{-\lambda}/y!$, λ > 0, y = 0, 1, 2, ...
The distribution function is
tabled in most statistics books. Note we used
these tables to evaluate F(t) for the gamma
distribution when η is an integer. The mean and
variance of the distribution are both λ. The sample
mean is an estimator of the parameter λ.
51
In an experiment the number of bacterial colonies
on a slide is counted. This count can be
modeled by a Poisson distribution. A total of
25 slides are examined, yielding a sample mean of
24.0. The estimate of λ is 24 colonies per
slide. If we wish to estimate the probability
that there could be fewer than 30 colonies on a
slide we find P(Y < 30) = 0.868 from
the Poisson table using F(29).
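The Poisson table lookup can be replaced by a direct sum. This sketch builds each term from the previous one to avoid large factorials; the function name is illustrative.

```python
from math import exp

def poisson_cdf(y, lam):
    """P(Y <= y) for a Poisson distribution with mean lam."""
    term = exp(-lam)          # P(Y = 0)
    total = term
    for k in range(1, y + 1):
        term *= lam / k       # P(Y = k) obtained from P(Y = k - 1)
        total += term
    return total

lam_hat = 24.0                            # sample mean of colonies per slide
p_fewer_than_30 = poisson_cdf(29, lam_hat)
print(f"P(Y < 30) = {p_fewer_than_30:.3f}")
```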
52
TESTING MODEL FIT
53
Test of Model Fit
An important part of analyzing data is to check
on whether the model you intend to use is a
reasonable approximation of the system output. A
test of the distributional assumption should
precede any data analysis. There are tests for
all the models that we covered in the previous
lecture. I will show you some of these; for the
others I will supply references. There are many
procedures available for testing model fit; many
of the ones presented here were developed by your
professor, who obviously has a bias in their
selection.
54
Probability Plotting
The simplest method of assessing model fit is to
use a graphical procedure known as probability
plotting. This technique can be used for
distributions that do not have a shape parameter
or can be transformed to one that does not have a
shape parameter. A shape parameter is one that
changes the shape of the distribution. The
normal, logistic and exponential distributions
have no shape parameters. The lognormal and
Weibull can be transformed to distributions
without shape parameters. If the shape parameter η is
known then probability plots can be made for
specified values of this parameter.
55
  • To construct a probability plot one follows these
    steps:
  • 1. Obtain a sheet of probability paper for the
    distribution being tested.
  • 2. Rank the observations from smallest to largest.
  • 3. Plot the ranked observations vs. $p_i = i/(n+1)$. The
    letter i is the rank number of the data point,
    which runs from 1 to n.
  • 4. Depending on the brand of the paper chosen, one
    of the two axes will have a preprinted scale. The
    values of p are plotted on this axis.
  • 5. The values of the ordered observations are
    plotted on the other axis. You must scale this
    axis according to the values of the data. Try to
    use as much of the axis as possible.
  • 6. Plot the data.

56
7. If the data fall in an approximate straight
line then you have chosen a reasonable model. It
helps to draw a straight line on the paper
to judge whether it reasonably fits
the data. Remember the data are random variables
and will not form a perfect straight line. Look
for departures in the tails. See chapter 8 in HS
for examples of plots.
8. If the plot is
approximately linear, crude estimators of the
parameters can be obtained from the plot.
9. In
evaluating a plot remember that the variance of
points in the tail(s) is higher than those in the
middle of the distribution. Thus the relative
linearity of the plot near the tails of a
distribution will often seem poorer than at the
center of the distribution even if the correct
model was chosen. This statement is not
applicable if a tail is bounded. Thus for the
exponential distribution it is not true for the
lower tail.
57
10. The plotted points are ordered and hence not
independent. Each point is higher than the
previous one. Thus we should not expect them to
be randomly scattered about the plotted
line.
11. A model can never be proven to be
correct even if a straight line appears to be
appropriate. This is especially true for small
sample sizes, where only extreme differences from
the selected model can be detected. The best one
can say is that there is no evidence from the
data that the model is unreasonable.
Using the
data from the class assignment, prepare a normal
probability plot by hand on the probability paper
given to you. Normal probability plots can also
be done using SPSS. Enter the data in the
computer and use SPSS to prepare the plot.
Compare the two plots.
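The coordinates of a normal probability plot can also be computed directly. The sketch below assumes plotting positions p_i = i/(n + 1) and uses standard normal quantiles in place of the preprinted scale. The correlation of the plotted points is added here only as a rough numerical summary of linearity; it is not a step from the slides and is not a formal test. The temperature data from the first example are used for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_plot_points(data):
    """Return (z_i, x_(i)) pairs: normal quantile of i/(n+1) vs. ordered data."""
    xs = sorted(data)
    n = len(xs)
    return [(NormalDist().inv_cdf(i / (n + 1)), x)
            for i, x in enumerate(xs, start=1)]

def correlation(pairs):
    """Pearson correlation of the plotted points (rough linearity summary)."""
    zs = [z for z, _ in pairs]
    xs = [x for _, x in pairs]
    z_bar, x_bar = mean(zs), mean(xs)
    sxy = sum((z - z_bar) * (x - x_bar) for z, x in pairs)
    szz = sum((z - z_bar) ** 2 for z in zs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sqrt(szz * sxx)

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
points = normal_plot_points(temps)
r = correlation(points)
```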
58
Normal Distribution
An estimate of µ, the mean of the distribution,
can be obtained from the plotted line by
determining the data value at which the line
crosses the 50th percentile line (the preprinted
scale). The parameter σ can be estimated from the
slope of the plotted line. For any normal
distribution the standard deviation equals
approximately two-fifths of the difference
between the 90th and the 10th percentiles. The
percentiles are obtained from the plotted line:
the data values where the plotted line crosses the
.10 and .90 percentile lines on the preprinted
scale. Any percentile can be obtained directly
from the plotted line in like manner. Use the
plot to estimate the standard deviation for the
above data set and compare the result with that
obtained from SPSS. Find the value from the
distribution that will not be exceeded by more
than five percent of the values.
59
Weibull Distribution
The plot for the Weibull distribution is done
exactly as that for the normal distribution
except the scales on the paper are different and
there are two scales for each axis. On the
paper shown in HS the values on the bottom X
scale are Z = ln X. The scaling of this axis
makes the transformation. You need only plot
the data using the preprinted scale. The
parameter η can be estimated from the slope of
the plotted line as follows:
60
  • 1. Select 2 values of W, the Y axis values at the
    right-hand side of the paper. Call these W2 (the
    larger value chosen) and W1 (the smaller value
    chosen).
  • 2. Using the scale at the top of the X axis, find
    values Z2 and Z1 corresponding to where the
    chosen W lines cross the plotted line.
  • 3. The estimator of η is $b = (W_2 - W_1)/(Z_2 - Z_1)$.
  • 4. The estimator of σ is obtained from finding a,
    the intercept where the plotted line
    crosses the line Z = 0 on the top scale.
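The steps above can be sketched numerically. This sketch assumes the W scale is W = ln(-ln(1 - F)) and the Z scale is Z = ln X, so that the plotted line is W = ηZ - η ln σ. Under that assumption the slope b estimates η and the intercept a gives σ = exp(-a/b). The conversion from a to σ is this assumption worked out, not a formula quoted from the slides or from HS, and the function name is illustrative.

```python
from math import exp, log

def weibull_from_line(z1, w1, z2, w2):
    """Estimate (eta, sigma) from two points read off the plotted line."""
    b = (w2 - w1) / (z2 - z1)   # slope: estimate of the shape eta
    a = w1 - b * z1             # intercept at Z = 0
    sigma = exp(-a / b)         # assumes W = eta * Z - eta * ln(sigma)
    return b, sigma

# Check with two points lying exactly on the line for eta = 2, sigma = 3.
eta_true, sigma_true = 2.0, 3.0
z1, z2 = 0.0, 1.0
w1 = eta_true * (z1 - log(sigma_true))
w2 = eta_true * (z2 - log(sigma_true))
eta_est, sigma_est = weibull_from_line(z1, w1, z2, w2)
```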

61
Other Probability Plots
Probability plots for the lognormal distribution
can be done in two ways. If you do not have a
sheet of lognormal paper you need only take the
log of the data and plot the transformed data on
regular normal probability paper. Use the same
procedures for estimating the two parameters as
was done for the normal case. If you have a sheet
of lognormal paper then simply plot the data
without transforming them. This paper has a log
scale. There is also probability paper for the
gamma distribution if you know the value of η and
it is an integer. The paper varies with the value
of η. Note that when η is equal to one the
distribution is the exponential. The scale
parameter of the gamma distribution can be
estimated by the slope of the line. See HS for
further details.
62
Tests Of Distributional Fit
Probability plots are a simple method of checking
whether a chosen model is reasonable. However, it
is a subjective procedure and two individuals
looking at the same plot could come to different
conclusions. A formal test of hypothesis can
be made for which a p-value can be obtained and
used for making a decision as to the
reasonableness of the chosen model. Most of
these procedures are complex and only references
to them will be given. There are several
general approaches to these test procedures; one
that I have used will be given for this class.
63
We have seen that in the normal probability
plotting procedure one can estimate the variance
of the distribution from the slope of the plotted
line squared. However this estimate is a valid
estimator only if the points plotted fall in an
approximate linear pattern. If the points do not
fall in an approximate straight line then the
estimated variance is incorrect. The sample
variance is an unbiased estimator of the variance
whether or not the sample data came from a normal
model. Similar to the ANOVA technique that you
learned in your statistics class a test of the
null hypothesis that the model is correct can be
made by comparing these two estimates by
computing their ratio.
64
Shapiro-Wilk W Test For The Normal Distribution
The numerator and denominator of the ratio for
testing for normality are not independent, and
hence the ratio does not have an F distribution
and the slope of the line cannot be obtained by
simple linear regression. The steps for
computing the ratio and obtaining the test
statistic, W, can be found in HS. This test
procedure is included in most general software
packages including SPSS. Use the data from the
normal probability plot examples to test for
normality.
65
The Brain-Shapiro Test for the Exponential
Distribution
  • The test for the exponential distribution in HS
    is outdated. A better test was devised by Brain
    and Shapiro and published in Technometrics,
    Vol. 25, pp. 69-76, 1983, under the title "A
    Regression Test for Exponentiality: Censored and
    Complete Samples." The test statistic is
    computed as follows.
  • 1. Order the data from smallest to largest:
    $X_1 \le X_2 \le \dots \le X_n$.
  • 2. Compute the quantities
    $Y_{i+1} = (n-i)(X_{i+1} - X_i)$, $i = 1, 2, 3, \dots, n-1$.
  • 3. Compute
    $Z = [12/(n-2)]^{1/2} \sum (i - n/2)\, Y_{i+1} / \sum Y_{i+1}$,
    with both sums taken over $i = 1, 2, \dots, n-1$.
  • 4. Compute
    $V = [5/(4(n+1)(n-2)(n-3))]^{1/2} \, [12 \sum (i - n/2)^2\, Y_{i+1} - n(n-2) \sum Y_{i+1}] / \sum Y_{i+1}$,
    with the sums over the same range.
  • 5. Calculate the test statistic $U = Z^2 + V^2$.

66
6. For large samples U has a chi-squared
distribution with two degrees of freedom when the
data come from an exponential distribution. This
is an upper-tail test: non-exponentiality will
result in large values of U. When n > 15 the
critical values for the test are as follows:
    90th percentile:   $U_{0.90} = 4.605 - 2.5/n$
    95th percentile:   $U_{0.95} = 5.991 + 1.25/n$
    97.5th percentile: $U_{0.975} = 7.378 + 3.333/n$
7. Thus if the value of U is greater than one of
the three values above then the p-value is less
than 1 minus the corresponding subscript. Hence if
it is greater than the 90th percentile then the
p-value is less than 0.10.
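The steps above can be sketched in a few lines of Python. This is a minimal illustration for a complete (uncensored) sample, not a reference implementation. The function names are made up for this example, the weight $(n-i)$ in step 2 is an assumption about the garbled original, and the p-value uses only the large-sample chi-squared result (for two degrees of freedom the upper-tail probability is $e^{-U/2}$), not the small-sample critical values.

```python
import math
import random

def brain_shapiro_u(data):
    """Return (Z, V, U) following steps 1-5 above for a complete sample."""
    x = sorted(data)                      # step 1: order the data
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 observations")
    # step 2: weighted spacings; y[i-1] holds Y_{i+1} = (n-i)(X_{i+1} - X_i)
    y = [(n - i) * (x[i] - x[i - 1]) for i in range(1, n)]
    total = sum(y)
    if total <= 0:
        raise ValueError("all observations are equal")
    lin = sum((i - n / 2) * y[i - 1] for i in range(1, n))
    quad = sum((i - n / 2) ** 2 * y[i - 1] for i in range(1, n))
    z = math.sqrt(12 / (n - 2)) * lin / total                   # step 3
    v = (math.sqrt(5 / (4 * (n + 1) * (n - 2) * (n - 3)))
         * (12 * quad - n * (n - 2) * total) / total)           # step 4
    return z, v, z * z + v * v                                  # step 5

def large_sample_p_value(u):
    """Upper-tail probability of a chi-squared variable with 2 df."""
    return math.exp(-u / 2)

# Illustrative use on simulated exponential data (hypothetical sample).
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(50)]
z, v, u = brain_shapiro_u(sample)
print(z, v, u, large_sample_p_value(u))
```

A large U, and hence a small p-value, points away from the exponential model.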
67
Chi-squared Goodness-of-Fit Test
The chi-squared goodness of fit test is used to
test whether a selected discrete model is
appropriate to model a set of discrete data. We
will not discuss this procedure since it was
covered in your prior statistics course. The
procedure is described in HS.
68
Goodness-of-Fit Tests for Other Distributions
  • The following are references for tests for other
    models.

69
Johnson Distribution
Scientists are often faced with the problem of
summarizing a set of data by means of a
mathematical function which fits the data and
allows them to obtain estimates of percentiles
and probabilities. A common practice is to use a
flexible family of distributions to accomplish
this. In most cases the family has four
parameters. One such system is the Johnson system
of distributions, which has three families, each
with four parameters. These families are described
in HS; however, the procedures given there for
determining which of the three families to use and
for estimating the parameters are out of date. The
current procedures are found in an article by
Slifker and Shapiro entitled "The Johnson System:
Selection and Parameter Estimation," published in
Technometrics, Vol. 22, pp. 239-246, 1980.
70
The following is abstracted from that article.
The system was devised by using a transformation
of a standard normal variable, using the equation
    $z = \gamma + \eta\, k_i(x; \lambda, \varepsilon)$
where z is a standard normal variable and $k_i$ is
a function that includes a wide variety of shapes.
The parameters $\gamma$ and $\eta$ are shape
parameters, $\lambda$ is a scale parameter and
$\varepsilon$ is a location parameter.
71
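The three families are obtained by letting the
function k equal the following:
    $k_1(x; \lambda, \varepsilon) = \sinh^{-1}[(x - \varepsilon)/\lambda]$,
    denoted the SU distribution, $-\infty < x < \infty$.
    $k_2(x; \lambda, \varepsilon) = \ln[(x - \varepsilon)/(\lambda + \varepsilon - x)]$,
    denoted the SB distribution, $\varepsilon < x < \varepsilon + \lambda$.
    $k_3(x; \lambda, \varepsilon) = \ln[(x - \varepsilon)/\lambda]$,
    denoted the SL distribution, $x > \varepsilon$.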
72
  • The SL distribution is a form of the log-normal
    distribution having three parameters. The first
    step in using this system is to determine which
    of the three families to use. The following is
    the procedure to accomplish this.
  • 1. Choose a value of z > 0. Any value will do;
    however, if you want a good fit in the tails of
    the distribution, a value close to 0.5 is
    recommended for moderate sample sizes and 1.0 or
    higher for large sample sizes. A good value is
    0.548.
  • 2. Compute 3z. If the recommended value of 0.548
    is used then 3z is approximately 1.645.
  • 3. Determine from a table of the normal
    distribution the percentage points corresponding
    to -3z, -z, z and 3z. For the values selected
    above the corresponding percentage points are
    0.05, 0.292, 0.708 and 0.95. Call these the
    $p_i$.

73
4. We next estimate the data percentiles
corresponding to these percentages, using the
equation $i = n p_i + 1/2$ to determine the ith
ranking in an ordered list of the data. Note that
n here is the sample size. The value of i will
usually not be an integer and linear interpolation
will be necessary.
5. Next compute the quantities m, n and p using
the data values $x_{jz}$, j = -3, -1, 1, 3, from
the prior step as follows (this n is a difference
of percentiles, not the sample size):
    $m = x_{3z} - x_{z}$
    $n = x_{-z} - x_{-3z}$
    $p = x_{z} - x_{-z}$
6. If $mn/p^2 > 1$ use the SU distribution, if it
is less than one use the SB and if it equals one
use the SL.
7. Once the proper family is selected the next
step is to estimate the parameters. The
estimation formulas for each family are
different; you must select the appropriate set
depending on the selection from the last step.
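Steps 1 to 6 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, ranks that fall outside 1..n are clamped to the extreme order statistics, and the tolerance `tol` around one is an added convenience (with real data the ratio will almost never equal one exactly), not part of the procedure as described.

```python
from statistics import NormalDist

def percentile_by_rank(sorted_x, p):
    """Data percentile at rank i = n*p + 1/2 (1-based), linearly interpolated."""
    n = len(sorted_x)
    i = min(max(n * p + 0.5, 1.0), float(n))   # assumption: clamp rank to 1..n
    lo = int(i)                                # integer part of the rank
    if lo >= n:
        return sorted_x[-1]
    return sorted_x[lo - 1] + (i - lo) * (sorted_x[lo] - sorted_x[lo - 1])

def classify(m, n, p, tol=0.0):
    """Return (family, ratio), where ratio = m*n/p^2 (step 6)."""
    if p == 0:
        raise ValueError("p is zero; the ratio m*n/p^2 is undefined")
    ratio = m * n / p ** 2
    if ratio > 1 + tol:
        return "SU", ratio
    if ratio < 1 - tol:
        return "SB", ratio
    return "SL", ratio

def select_johnson_family(data, z=0.548, tol=0.0):
    x = sorted(data)
    cdf = NormalDist().cdf                     # replaces the normal table (step 3)
    q = {j: percentile_by_rank(x, cdf(j * z)) for j in (-3, -1, 1, 3)}  # step 4
    m = q[3] - q[1]                            # step 5
    n = q[-1] - q[-3]
    p = q[1] - q[-1]
    return classify(m, n, p, tol)

# Equally spaced data look like a bounded distribution.
print(select_johnson_family(range(1, 101)))
```

Keeping `classify` separate from the percentile step makes it possible to apply the rule directly to values of m, n and p that have already been computed by hand.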
74
  i) Johnson SU Distribution

The values of the parameters are presented in
such a way as to emphasize their dependence on
the ratios m/p and n/p. Parameter estimates for
Johnson SU Distribution
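For reference, the SU estimating equations can be written out in the form in which they are applied in the worked example later in this transcript. This is a reconstruction from that example, assuming the symbols $\gamma$, $\eta$, $\lambda$, $\varepsilon$ and the percentiles $x_z$, $x_{-z}$ follow the Slifker and Shapiro notation; it should be checked against the article before use.

```latex
\begin{aligned}
\hat\eta &= \frac{2z}{\cosh^{-1}\!\left[\tfrac{1}{2}\left(\frac{m}{p}+\frac{n}{p}\right)\right]} \\[4pt]
\hat\gamma &= \hat\eta\,\sinh^{-1}\!\left[\frac{\frac{n}{p}-\frac{m}{p}}
             {2\left(\frac{m}{p}\cdot\frac{n}{p}-1\right)^{1/2}}\right] \\[4pt]
\hat\lambda &= \frac{2p\left(\frac{m}{p}\cdot\frac{n}{p}-1\right)^{1/2}}
              {\left(\frac{m}{p}+\frac{n}{p}-2\right)\left(\frac{m}{p}+\frac{n}{p}+2\right)^{1/2}} \\[4pt]
\hat\varepsilon &= \frac{x_{z}+x_{-z}}{2}
              + \frac{p\left(\frac{n}{p}-\frac{m}{p}\right)}{2\left(\frac{m}{p}+\frac{n}{p}-2\right)}
\end{aligned}
```

Note that $\frac{m}{p}\cdot\frac{n}{p} = mn/p^2$, which exceeds one whenever the SU family has been selected, so the square roots are real.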
75
ii) Johnson SB Distribution
The solutions for the SB parameters turn out to
depend on the ratios p/m and p/n (as opposed to
m/p and n/p for SU). Parameter estimates for
Johnson SB Distribution
76
iii) Johnson SL Distribution
Parameter estimates for Johnson SL Distribution
77
8. To determine the value of F(x) for the data
set one simply substitutes the estimates of the
parameters into the transformation for z for the
selected model and uses a table of the normal
distribution.
9. To determine a data percentile corresponding to
a given percentage point p we solve the defining
equation of $z_p$ to obtain $x_p$, where $z_p$ is
the standard normal value corresponding to the
desired value of p. In each case let
$w = (z_p - \gamma)/\eta$.
    For SL use $x_p = \varepsilon + \lambda e^{w}$
    For SB use $x_p = [\varepsilon + (\lambda + \varepsilon) e^{w}]/(1 + e^{w})$
    For SU use $x_p = \lambda (e^{2w} - 1)/(2 e^{w}) + \varepsilon$
78
  • The following example was taken from a large
    sample of 9440 measurements.
  • Since this is a very large sample we use a value
    of z = 1.0.
  • The first step is to order the data from the
    smallest to the largest.
  • Using the chosen value of z we obtain the
    percentage points corresponding to

    -3z (p = 0.0014), -z (p = 0.1587),
    z (p = 0.8413) and 3z (p = 0.9986).
  • Using $i = n p_i + 1/2$ we find that

    $x_{-3z}$ = 10.16, $x_{-z}$ = 13.58, $x_{z}$ = 15.24
    and $x_{3z}$ = 16.68.
  • Thus m = 1.447, n = 3.172 and p = 1.661. (Note
    that the numbers on the above line were rounded
    off.) Hence $mn/p^2$ = 1.664. This indicates that
    the proper model is the SU distribution.
  • Using the estimating equations for this model,
    with m/p = 0.871 and n/p = 1.910, we find that
    the estimates of the parameters are
  • $\eta = 2(1)/\cosh^{-1}[\tfrac{1}{2}(0.871 + 1.910)] = 2.333$.
  • $\gamma = 2.333 \sinh^{-1}\{(1.910 - 0.871)/[2((1.910)(0.871) - 1)^{1/2}]\} = 1.402$.
  • $\lambda = 2(1.661)(1.664 - 1)^{1/2}/[(0.871 + 1.910 - 2)(0.871 + 1.910 + 2)^{1/2}] = 1.585$.
  • $\varepsilon = (15.242 + 13.58)/2 + 1.661(1.910 - 0.871)/[2(1.910 + 0.871 - 2)] = 15.516$.
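The arithmetic in this example can be checked with a short script. It is a sketch that assumes the SU estimating equations take the form used in the substitutions above; the function and argument names are invented for illustration.

```python
import math

def johnson_su_estimates(m, n, p, x_z, x_minus_z, z):
    """Return (eta, gamma, lam, eps) from m, n, p and the two inner percentiles."""
    mp, n_p = m / p, n / p
    if mp * n_p <= 1:
        raise ValueError("m*n/p^2 must exceed 1 for the SU family")
    root = math.sqrt(mp * n_p - 1)
    eta = 2 * z / math.acosh(0.5 * (mp + n_p))
    gamma = eta * math.asinh((n_p - mp) / (2 * root))
    lam = 2 * p * root / ((mp + n_p - 2) * math.sqrt(mp + n_p + 2))
    eps = (x_z + x_minus_z) / 2 + p * (n_p - mp) / (2 * (mp + n_p - 2))
    return eta, gamma, lam, eps

# Values taken from the example: z = 1.0, m = 1.447, n = 3.172, p = 1.661.
eta, gamma, lam, eps = johnson_su_estimates(
    m=1.447, n=3.172, p=1.661, x_z=15.242, x_minus_z=13.58, z=1.0)
print(round(eta, 3), round(gamma, 3), round(lam, 3), round(eps, 3))
```

Because the inputs are themselves rounded, the results agree with the figures in the example only to about three significant digits.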

79
In order to find the probability that a
measurement will be smaller than 9.0 we compute
F(9.0) by first finding the corresponding z value:
    $z = \gamma + \eta \sinh^{-1}[(x - \varepsilon)/\lambda]$
    $z = 1.402 + 2.333 \sinh^{-1}[(9.0 - 15.516)/1.585]$
    $\;\; = 1.402 + 2.333 \sinh^{-1}(-4.11) = -3.54$
    $F(-3.54) \approx 0.0002$.
Thus there is a very low probability of getting a
measurement below 9.0. If one desires the median
of the distribution (p = 0.5), corresponding to a
value of z of zero, then
    $x_{0.50} = \lambda (e^{2w} - 1)/(2 e^{w}) + \varepsilon$,
    $w = (z_p - \gamma)/\eta = (0 - 1.402)/2.333 = -0.60$,
    $x_{0.50} = 1.585(0.30 - 1)/(2 \times 0.549) + 15.516 = 14.51$.
Thus one-half of the distribution is below 14.51.
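The two calculations on this slide, F(x) and a percentile for a fitted SU model, can be sketched as a pair of functions. This assumes the SU transformation $z = \gamma + \eta \sinh^{-1}[(x - \varepsilon)/\lambda]$ given earlier; `NormalDist` from the standard library stands in for the normal table, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def su_cdf(x, gamma, eta, lam, eps):
    """F(x) for a Johnson SU model: transform x to z, then look up the normal CDF."""
    z = gamma + eta * math.asinh((x - eps) / lam)
    return NormalDist().cdf(z)

def su_percentile(prob, gamma, eta, lam, eps):
    """x_p for a Johnson SU model; lam*sinh(w) equals lam*(e^{2w} - 1)/(2e^w)."""
    w = (NormalDist().inv_cdf(prob) - gamma) / eta
    return eps + lam * math.sinh(w)

# Parameter estimates from the worked example.
params = dict(gamma=1.402, eta=2.333, lam=1.585, eps=15.516)
print(su_cdf(9.0, **params))         # probability of a measurement below 9.0
print(su_percentile(0.5, **params))  # median of the fitted distribution
```

The two functions are inverses of each other, which gives a quick consistency check on any set of fitted parameters.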
80
1. S. Gulati and S. Shapiro, "Goodness of Fit for
the Pareto Distribution," in Statistical Models
and Biomedical and Technical Systems, Birkhauser,
Boston, 263-277 (2007).
2. S. Gulati and S. Shapiro, "Goodness of Fit
Tests for the Logistic Distribution," Journal of
Statistical Theory and Practice, 1,
81
Analyzing Random Models Via Simulation
In the prior two weeks you learned about modeling
of biological systems. However, these models only
represent the average output to be expected,
because the variables in the model are all
constants. In real life each subject has a
different value for each constant, and in order to
get a more realistic picture of the output these
constants should be replaced by random variables
that have distributions like the ones we have
discussed. To do this it is necessary to choose a
statistical distribution for each variable in the
model, assign values to the parameters of each
random variable and then run the model over and
over again on a computer, inserting a new value
for each of the random variables on every run.
This technique is called Monte Carlo simulation,
and the model is typically run 1000 or more times.
Thus you generate a data set that gives the
distribution of the output, as opposed to a static
model that gives you one value. Then you can use
the Johnson system to fit a model to the output
and find desired probabilities. Thus you are able
to get a more comprehensive picture of the
properties of the output.
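As a rough illustration of the idea, the sketch below replaces the two constants of a made-up model with random variables and runs it 1000 times. Everything specific here is hypothetical: the model (output = rate / clearance), the choice of a normal and a log-normal distribution, and their parameter values are assumptions chosen only to show the mechanics.

```python
import math
import random
import statistics

def model_output(rate, clearance):
    """Hypothetical deterministic model with two constants."""
    return rate / clearance

def simulate(runs=1000, seed=1):
    """Run the model repeatedly, drawing new values of the constants each time."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(runs):
        rate = rng.gauss(100.0, 10.0)                        # assumed normal
        clearance = rng.lognormvariate(math.log(5.0), 0.2)   # assumed log-normal
        outputs.append(model_output(rate, clearance))
    return outputs

outputs = sorted(simulate())
print("static model value:", model_output(100.0, 5.0))
print("mean of simulated outputs:", statistics.mean(outputs))
print("rough 5th and 95th percentiles:", outputs[49], outputs[949])
```

The sorted outputs are the kind of data set to which a Johnson distribution could then be fitted.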
82
In this lecture we will first discuss how to
generate random numbers from some of the
distributions covered earlier. Most computer
systems have programs for generating random
variates from the normal and uniform
distributions. We will use these programs to
generate variates from other models. Define Z as
a normal variate with mean zero and variance one
and U as a uniform variate on the interval (0,1).
83
Normal Distribution
Suppose we desire random variates, Y, from a
normal distribution with mean $\mu$ and variance
$\sigma^2$. Setting $Y = \mu + \sigma Z$ will
yield the desired random variable.
84
Exponential Distribution
  • For any continuous distribution, F(Y) has a
    uniform distribution on (0,1).
  • If we can set F(Y) = U we can generate a
    random variable by solving for Y.
  • Thus if F(Y) can be written as an analytical
    function, as it can for the exponential, then by
    solving for Y we can obtain a variate from that
    distribution.
  • For the exponential distribution
    $F(y) = 1 - e^{-\lambda y} = U$.
  • Thus $Y = -(1/\lambda) \ln(1 - U)$ yields the
    desired random variate from the exponential
    distribution.
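A minimal sketch of this inverse-transform step, assuming the exponential is parameterized as $F(y) = 1 - e^{-\lambda y}$ so that the mean is $1/\lambda$. The function name is hypothetical, and Python's `random.Random.random()` supplies the uniform variate U.

```python
import math
import random

def exponential_variate(lam, rng):
    """Solve F(Y) = U for Y, where F(y) = 1 - exp(-lam * y)."""
    u = rng.random()                  # uniform on [0, 1)
    return -math.log(1.0 - u) / lam   # 1 - u lies in (0, 1], so the log is defined

rng = random.Random(42)
draws = [exponential_variate(2.0, rng) for _ in range(10000)]
print(sum(draws) / len(draws))        # should be close to 1/lam = 0.5
```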

85
Gamma Distribution (integer shape parameter)
  • The gamma distribution with an integer shape
    parameter $\eta$ can be viewed as a sum of $\eta$
    independent exponential variables with scale
    parameter $\lambda$.
  • Thus to generate a random variate we merely add
    $\eta$ exponential variables using the above
    formula:
  • $Y = -(1/\lambda) \sum_{i=1}^{\eta} \ln(1 - U_i)$,
    where the $U_i$, $i = 1, 2, \dots, \eta$, are
    independent uniform variates.
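The sum-of-exponentials construction can be sketched as below. The function name and the symbols `eta` (integer shape) and `lam` are assumptions for illustration; with this parameterization the mean of the gamma variate is eta/lam.

```python
import math
import random

def gamma_variate_integer_shape(eta, lam, rng):
    """Sum of eta independent exponential variates, each -(1/lam) * ln(1 - U_i)."""
    if not isinstance(eta, int) or eta < 1:
        raise ValueError("eta must be a positive integer")
    return -sum(math.log(1.0 - rng.random()) for _ in range(eta)) / lam

rng = random.Random(7)
draws = [gamma_variate_integer_shape(3, 2.0, rng) for _ in range(10000)]
print(sum(draws) / len(draws))   # should be close to eta/lam = 1.5
```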

86
Log-normal Distribution
  • The log of the log-normal variable has a normal
    distribution with parameters $\mu$ and $\sigma$.
  • Thus, we simply take the inverse of the natural
    log (that is, the exponential) of a normal
    variate with those parameters:
  • $Y = e^{\mu + \sigma Z}$
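Both the normal and the log-normal generators reduce to a transformation of a standard normal variate Z. The sketch below uses `random.Random.gauss(0, 1)` as the source of Z; the function names are hypothetical.

```python
import math
import random

def normal_variate(mu, sigma, rng):
    """Y = mu + sigma * Z, with Z standard normal."""
    return mu + sigma * rng.gauss(0.0, 1.0)

def lognormal_variate(mu, sigma, rng):
    """Y = exp(mu + sigma * Z), so that ln(Y) is normal with parameters mu, sigma."""
    return math.exp(normal_variate(mu, sigma, rng))

rng = random.Random(11)
logs = [math.log(lognormal_variate(1.0, 0.5, rng)) for _ in range(10000)]
print(sum(logs) / len(logs))   # mean of ln(Y), should be close to mu = 1.0
```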

87
Weibull Distribution
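  • Since $F(Y) = 1 - \exp[-(Y/\sigma)^{\eta}] = U$,
    where $\sigma$ is the scale parameter and $\eta$
    the shape parameter,
  • then solving for Y yields
  • $Y = \sigma [-\ln(1 - U)]^{1/\eta}$.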
  • Other Distributions
  • Formulas for some discrete and other continuous
    distributions can be found in HS.
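The Weibull case follows the same inverse-transform pattern. The sketch assumes the parameterization $F(y) = 1 - \exp[-(y/\sigma)^{\eta}]$, with `sigma` the scale and `eta` the shape; these symbol names and the function name are assumptions for illustration.

```python
import math
import random

def weibull_variate(sigma, eta, rng):
    """Solve 1 - exp(-(Y/sigma)**eta) = U for Y."""
    u = rng.random()                                   # uniform on [0, 1)
    return sigma * (-math.log(1.0 - u)) ** (1.0 / eta)

rng = random.Random(3)
draws = sorted(weibull_variate(2.0, 1.5, rng) for _ in range(10001))
print(draws[5000])                                     # sample median
print(2.0 * math.log(2.0) ** (1.0 / 1.5))              # theoretical median
```

Setting `eta = 1` reduces this to the exponential generator with $\lambda = 1/\sigma$.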