The Generalized Bootstrap GB - PowerPoint PPT Presentation

1
The Generalized Bootstrap (GB)
  • Edward J. Dudewicz
  • Department of Mathematics
  • Syracuse University
  • Syracuse, New York 13244

2
  • The Generalized Bootstrap (GB): History, Theory,
    Comparison (with the Naïve Bootstrap (NB) of
    Efron), and Practice
  • or
  • I Fitted a Distribution to My Data, Now What Do
    I Do With It? (Don't Use Your Daddy's Bootstrap)

3
History
  • I. Suppose one has a dataset consisting of n
    observations $X_1, X_2, \dots, X_n$ where
  • a. the observations are independent random
    variables, i.e.
  • $P(X_1 \le x_1, X_2 \le x_2, \dots, X_n \le x_n)
    = P(X_1 \le x_1) P(X_2 \le x_2) \cdots P(X_n \le x_n)$
    for all real numbers $x_1, x_2, \dots, x_n$, and
  • b. each of the observations has the same
    probability distribution, i.e.
  • $P(X_1 \le x) = P(X_2 \le x) = \cdots = P(X_n \le x)$
    for every real number $x$.
  • (We often denote the common value $P(X_i \le x)$ by
    $F(x)$, called the distribution function (d.f.) of
    $X_i$.)
  • We then say the dataset $X_1, X_2, \dots, X_n$
    consists of independent (a.) and identically
    distributed (b.) (i.i.d.) random variables.

4
Examples
  • This situation is encountered often in
    statistical practice. For example, Michael E.
    Ginevan and Douglas E. Splitstone (Statistical
    Tools for Environmental Quality Measurement, CRC
    Press, 2003, Ch. 6) cite these cases from
    environmental quality:
  • Ex. 1: On a residential site, 224 samples were
    collected and the arsenic concentration in the
    surface soil was measured.
  • Ex. 2: The concentration of copper in the
    effluent of a treatment plant discharged into a
    river was measured on 265 days.

5
  • In such examples, often it is assumed that the
    d.f. is one of the classical statistical models
    such as the normal, log-normal, or Weibull
    distribution.
  • In Ex. 1, it often would be assumed that the
    sample mean follows the normal distribution
    (relying on the Central Limit Theorem). This
    might allow one to assess the risk to an
    individual who moves around a residential lot at
    random.
  • However, the distribution of the original data is
    highly skewed towards the upper tail, and it
    turns out that even the sample mean has a skewed
    distribution.

6
  • In Ex. 2, it often would be assumed (USEPA
    assumption for effluent concentration) that the
    data follow a log-normal distribution. This
    might allow one to estimate the 99th percentile
    of effluent concentration, which might be used in
    a permit for the amount allowed in effluent (and
    with the 99th percentile in the permit, one would
    expect a 1% chance of violating the permit on any
    one day). However, the Shapiro-Wilk test
    strongly rejects a log-normal model. And since
    penalties for violating such a permit from the
    state Department of Environmental Protection can
    range from $10,000 per day, and may include a
    possible jail term, one might be ill-advised to
    assume a log-normal model in this setting.

7
Problem
  • So, we have a Problem: the data often do not
    follow such classical models as the normal,
    log-normal, or Weibull distribution.

8
What might be a Solution?
  • An approach, in use as early as 1967, is based on
    drawing samples at random with replacement from
    the data at hand.
  • For instance, in Ex. 1, draw 10,000 samples of
    size 224 at random with replacement from the
    dataset, and for each of the samples find the
    sample mean. To estimate the 95th percentile of
    the exposure distribution, take the 9500th from
    smallest of the 10,000 sample means.
  • In Ex. 2, generate (e.g.) 5,000 samples of size
    265 at random with replacement from the dataset,
    and for each find the 99th percentile of copper
    concentration in the effluent. Take the 95th
    percentile of these 5,000 values, i.e. the
    (0.95)(5000) = 4750th from smallest, as a number
    for which we are 95 percent confident that there
    is no more than a one-percent chance of an
    unintentional permit violation if the permit
    limit for effluent copper is set at this 4750th
    value.
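
The two resampling recipes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `arsenic` and `copper` lists are synthetic stand-ins (not the Ginevan and Splitstone data), and the helper names `kth_smallest` and `naive_bootstrap` are hypothetical, introduced here for the example.

```python
import math
import random
import statistics

def kth_smallest(values, p):
    """Return the ceil(p * n)-th smallest of `values` (1-indexed order statistic)."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) - 1e-9))  # tolerance guards float round-off
    return ordered[k - 1]

def naive_bootstrap(data, statistic, n_resamples, rng):
    """Apply `statistic` to n_resamples samples drawn with replacement from `data`."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

rng = random.Random(2003)

# Synthetic stand-ins for the real measurements (an assumption of this sketch).
arsenic = [rng.lognormvariate(1.0, 1.0) for _ in range(224)]
copper = [rng.lognormvariate(2.0, 0.5) for _ in range(265)]

# Ex. 1: the 9500th from smallest of 10,000 bootstrap sample means.
means = naive_bootstrap(arsenic, statistics.fmean, 10_000, rng)
ex1_limit = kth_smallest(means, 0.95)

# Ex. 2: the 4750th from smallest of 5,000 bootstrap 99th percentiles.
p99s = naive_bootstrap(copper, lambda s: kth_smallest(s, 0.99), 5_000, rng)
ex2_limit = kth_smallest(p99s, 0.95)
```

Every resampled value comes from the original list, so neither `ex1_limit` nor `ex2_limit` can exceed the largest observation in its dataset.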

9
bootstrapping
  • This method gained wide use when it was given
    the name "bootstrap" (or the bootstrap method,
    BM) by B. Efron in 1979. We note that the
    Merriam-Webster Online Dictionary defines a
    bootstrap as "a looped strap sewed at the side or
    the rear top of a boot to help in pulling it on."
    I have boots I sometimes use in winter, and they
    have bootstraps. As an aside, according to
    Wikipedia, Karl Friedrich Hieronymus, Baron von
    Münchhausen (1720–1797), was a German nobleman
    who supposedly told tall tales about his
    adventures, one of them being that he used his
    own bootstraps to pull himself out of the ocean,
    which gave rise to the term "bootstrapping."

10
Theory
  • The Theory behind the bootstrap resampling method
    is essentially that of the Central Limit Theorem
    (or of the Glivenko-Cantelli Theorem): it will
    perform well as the sample size approaches
    infinity.
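
For reference, the Glivenko-Cantelli Theorem can be stated in terms of the d.f. F of i.i.d. observations. The symbol $F_n$ for the empirical d.f. is notation introduced here for illustration, not taken from the slides.

```latex
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},
\qquad
\sup_{x \in \mathbb{R}} \left| F_n(x) - F(x) \right|
\xrightarrow{\text{a.s.}} 0
\quad \text{as } n \to \infty .
```

Resampling with replacement from the data is the same as sampling from $F_n$, which is why the guarantee is an asymptotic one and says nothing by itself about small n.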

11
  • But, in practice, this method often fails to
    perform well. As Ginevan and Splitstone note
    for the two examples given above, "application of
    the bootstrap to the estimation of extreme
    percentiles is not as robust as its application
    to the estimation of summary statistics like the
    mean ... It requires large original sample sizes
    and the results are more sensitive to outlying
    observations."

12
  • So, the bootstrap is not the solution. What to
    do? Ginevan and Splitstone say:
  • "What can I do that would be any better? In
    this case a better alternative is not obvious,
    so, despite its limitations, the bootstrap
    provides a reasonable solution."

13
  • Lack of a better alternative does not make a
    reasonable solution.

14
  • But, there is a better alternative to the naïve
    bootstrap.

15
drawbacks of the BM
  • Consider the drawbacks of the BM. For example,
    if we have 50 years of data on rainfall in an
    area and we are designing a levee, the BM will
    never yield observations more extreme than the
    most extreme in the dataset of 50. This can
    easily lead to design of a levee that will not
    withstand the 100-year flood (a typical design
    criterion in the U.S.) or the 1000-year flood (a
    design criterion in Europe) with high
    probability. (For more details, see Dudewicz
    (1992) and Dudewicz and Mishra (1988, Section
    5.6).) Thus:
  • the Bootstrap Method is fraught with danger of
    seriously inadequate results.
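
The levee drawback is easy to check numerically. The sketch below uses a synthetic 50-year rainfall record (an assumption; the gamma distribution and its parameters are arbitrary choices for illustration) and confirms that no resample ever contains a value beyond the observed maximum. It also computes the chance that a 50-year record contains a 100-year event at all, assuming independent years.

```python
import random

rng = random.Random(50)

# Hypothetical 50 years of annual maximum rainfall (synthetic, arbitrary units).
rainfall = [rng.gammavariate(4.0, 10.0) for _ in range(50)]
observed_max = max(rainfall)

# Largest value seen anywhere across 2,000 bootstrap resamples of size 50.
largest_resampled = max(
    max(rng.choices(rainfall, k=len(rainfall))) for _ in range(2_000)
)

# Chance that 50 independent years include at least one event whose
# annual exceedance probability is 1/100 (a "100-year" event).
p_contains_100yr = 1 - (1 - 1 / 100) ** 50
```

Here `largest_resampled` can never exceed `observed_max`, and `p_contains_100yr` is about 0.395, so in most 50-year records the 100-year event has not been observed and the BM cannot produce it.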

16
The Generalized Bootstrap (GB) Method
  • The Generalized Bootstrap alternative was
    introduced formally in Dudewicz (1992) (and
    discussed earlier in Karian and Dudewicz (1991,
    Section 6.6)).

17
essence of the Generalized Bootstrap (GB) Method
  • The essence of the Generalized Bootstrap (GB)
    Method is to fit an EGLD (Extended Generalized
    Lambda Distribution) to the available data, and
    then take samples from the fitted distribution
    (and work with them as the BM does with its
    samples from the data itself). This method has
    been shown to do better than the BM when the
    number of data points is not very large, and do
    as well as the BM when the number of data points
    is large. Sun and Müller-Schwarze (1996) have an
    excellent exposition with real-data examples.
    More recent examples in datamining, including
    interactive computer systems, consumer purchases,
    crop damage, and pathogens, are considered by
    Dudewicz and Karian (1999b).
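
The fit-then-sample mechanics can be sketched as follows. Fitting an EGLD needs specialized routines, so this sketch swaps in a Gumbel distribution fitted by the method of moments purely as a stand-in; it is not the GB(EGLD) itself. The rainfall data are synthetic, and the function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel(data):
    """Method-of-moments Gumbel fit (stand-in for an EGLD fit): returns (mu, beta)."""
    beta = statistics.stdev(data) * math.sqrt(6) / math.pi
    mu = statistics.fmean(data) - EULER_GAMMA * beta
    return mu, beta

def sample_gumbel(mu, beta, n, rng):
    """Draw n values by inverting the d.f. F(x) = exp(-exp(-(x - mu) / beta))."""
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # skip u == 0 so that log(u) is defined
            draws.append(mu - beta * math.log(-math.log(u)))
    return draws

def generalized_bootstrap(data, statistic, n_resamples, rng):
    """Fit once, then apply `statistic` to samples drawn from the fitted d.f."""
    mu, beta = fit_gumbel(data)
    n = len(data)
    return [statistic(sample_gumbel(mu, beta, n, rng)) for _ in range(n_resamples)]

rng = random.Random(1992)
rainfall = [rng.gammavariate(4.0, 10.0) for _ in range(50)]  # synthetic record
observed_max = max(rainfall)

gb_maxima = generalized_bootstrap(rainfall, max, 2_000, rng)
share_beyond_data = sum(m > observed_max for m in gb_maxima) / len(gb_maxima)
```

Because the samples come from a fitted distribution with unbounded support, a positive share of the generated 50-year maxima exceeds the largest observed value, which resampling the data directly cannot do.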

18
A Recent Example of use of the GB Method
  • In the article "A Rainfall-Based Model for
    Predicting the Regional Incidence of Wheat Seed
    Infection by Stagonospora nodorum in New York"
    (Phytopathology 92 (2002), 511-518), Denis A.
    Shah and Gary C. Bergstrom of the Department of
    Plant Pathology at Cornell University developed a
    predicted distribution of seed infection
    incidence. They modeled the relationship between
    incidence of wheat seed infection (by S. nodorum)
    and rainfall.

19
they had Data ... available for 7 years only
  • In their study, they had "Data on the incidence
    of wheat seed infection by S. nodorum in New
    York ... available for 7 years only. Because least
    squares parameter estimates have a bias inversely
    proportional to sample size, there is reason for
    concern with such a small data set."

20
a solution
  • "As a solution, we generated data sets for the
    regression analysis by the generalized bootstrap
    method ... It differs from the familiar
    bootstrap ... in that samples are generated from
    probability distributions fitted to the original
    data rather than sampling from the observed data
    itself."

21
The bootstrap is not capable of ... observations
that may be rare
  • The bootstrap is not capable of generating
    samples containing observations that may be rare
    because it is based on the actual observed data,

22
drawback ... overcome by the generalized bootstrap
  • but this drawback was overcome by the
    generalized bootstrap, which is based on the
    probability distribution assumed to describe the
    population from which the actual sampled data
    were obtained. For the multiple regression, we
    generated 1,000 generalized bootstrap samples for
    each year (7,000 total samples) from the
    probability distributions fitted to seed
    infection incidence and rainfall.

23
Comparisons
  • One could base a GB method on any distribution,
    not only the EGLD, so let us call the GB
    introduced in 1992 the GB(EGLD).

24
the GB(N)
  • For example, one could base a method on the
    Normal distribution, say the GB(N). But the N
    distribution has only a mean and variance to be
    fitted; the skewness is fixed at 0 and the
    kurtosis at 3. So, this method would not be able
    to adapt to the wide range of distributions one
    finds in applications.
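
A short sketch makes the GB(N) limitation concrete. It fits a normal distribution to right-skewed data and compares skewness before and after. The exponential data are synthetic, and `sample_skewness` is a hypothetical helper written for this example.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness: third central moment over the cubed std. deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(24)
data = [rng.expovariate(1.0) for _ in range(200)]  # synthetic, right-skewed

# GB(N): fit only a mean and a standard deviation, then sample from the fit.
fitted = statistics.NormalDist.from_samples(data)
gb_n_sample = fitted.samples(20_000, seed=24)

data_skew = sample_skewness(data)         # clearly positive for these data
gb_n_skew = sample_skewness(gb_n_sample)  # near 0 whatever the data look like
```

The fitted normal reproduces the mean and variance of the data but discards the skewness, which is the adaptability a four-parameter family is meant to restore.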

25
the GB(EGLD) suitable for ... applications
  • In contrast, the EGLD fits all means and
    variances, and a wide range of skewness and
    kurtosis, which makes the GB(EGLD) suitable for
    many applications.

26
Good fitting ... is key
  • Good fitting of the basic EGLD is key to good GB
    results, and this is a thrust of many of the
    papers to be presented at this Symposium.

27
this Symposium ... of importance
  • Other distributions and distribution systems
    could also be used as a basis for a GB, and
    results with them are thus also of importance in
    this use of the fitted distribution, as are
    results with multivariate distributions.