Producing data PowerPoint PPT Presentation


Title: Producing data


1
Chapter 3
  • Producing data

2
Observation versus Experiment
  • An observational study observes individuals and
    measures variables of interest but does not
    attempt to influence the responses.
  • An experiment deliberately imposes some treatment
    on individuals to observe their responses.

3
Confounding
  • Two variables (explanatory variables or lurking
    variables) are confounded when their effects on a
    response variable cannot be distinguished from
    each other.

4
Population, Sample
  • The population in a statistical study is the
    entire group of individuals about which we want
    information.
  • A sample is a part of the population from which
    we actually collect information, which we use to
    draw conclusions about the whole.

5
Population, Sample
  • For the remainder of the course, you must be able
    to differentiate between a population of interest
    and a sample that gives information about a
    population.
  • Use your text to find many scenarios; practicing
    with them will aid your understanding. In other
    words, do as many problems as possible.
  • Populations are not always groups of people.
    They can be groups of objects (e.g., items
    inspected for quality control).

6
Example
  • All students at ISU, all citizens of Ames, and
    all consumers driving Mercedes are examples of
    populations.
  • Students who study on the second floor of Parks
    Library are a sample from the population of all
    ISU students.
  • Mercedes drivers in Ames are a sample from all
    consumers driving Mercedes.
  • The trees in front of Curtis Hall are a sample.

7
Example
  • Each week, the Gallup Poll questions a sample of
    about 1500 adult U.S. residents to determine
    national opinion on a wide variety of issues,
    such as the approval rating of the president.
  • What is the population of interest?
  • All U.S. adults
  • What is the sample?
  • 1500 sampled U.S. adults

8
Example
  • A social scientist wants to know the opinions of
    employed adult women about government funding
    for day care. She obtains a list of the 520
    members of a local business and professional
    women's club and mails a questionnaire to 100 of
    these women selected at random. Only 48
    questionnaires are returned.
  • What is the population in this study?
  • What is the sample from whom information is
    actually obtained?
  • What is the rate (percent) of response?

9
Voluntary Response Sample
  • A voluntary response sample consists of people
    who choose themselves by responding to a general
    appeal. Voluntary response samples are biased
    because people with strong opinions, especially
    negative opinions, are most likely to respond.

10
Bias
  • The design of a study is biased if it
    systematically favors certain outcomes.
  • Example: a scale that always shows objects as
    2 kg too heavy.
  • Example: a survey whose wording skews the
    opinions people report.
  • "Given the myriad of health problems alcohol can
    cause, do you still support lowering the legal
    drinking age to 18?" How will most people
    respond to such a leading question?

11
Simple Random Sample
  • A simple random sample (SRS) of size n consists
    of n individuals from the population chosen in
    such a way that every set of n individuals has an
    equal chance to be the sample actually selected.
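One way to draw an SRS in practice is to label every individual and let software choose the labels. The sketch below assumes a hypothetical population of 520 ID numbers and a sample size of 100; Python's standard `random.sample` draws without replacement so that every set of n labels is equally likely.

```python
import random

# Hypothetical population: ID numbers standing in for 520 individuals.
population = list(range(1, 521))
n = 100

rng = random.Random(2024)        # fixed seed only so the sketch is repeatable
srs = rng.sample(population, n)  # draws n distinct individuals

print(sorted(srs)[:10])          # first few selected IDs
```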

12
Undercoverage and Nonresponse
  • Undercoverage occurs when some groups in the
    population are left out of the process of
    choosing the sample.
  • Nonresponse occurs when an individual chosen for
    the sample can't be contacted or refuses to
    cooperate.

13
Subjects, Factors, Treatments
  • The individuals studied in an experiment are
    often called subjects, especially if they are
    people.
  • The explanatory variables in an experiment are
    often called factors.
  • A treatment is any specific experimental
    condition applied to the subjects. If an
    experiment has several factors, a treatment is a
    combination of a specific value (often called a
    level) of each of the factors.
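To see how treatments arise from factor levels, the sketch below enumerates every combination for two made-up factors. The factor names and levels are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical factors and their levels (not from the slides).
factors = {
    "ad_length_seconds": [30, 90],
    "repetitions": [1, 3, 5],
}

# Each treatment is one level of every factor: 2 x 3 = 6 treatments.
treatments = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
```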

14
Completely Randomized Design
  • In a completely randomized experimental design,
    all the subjects are allocated at random among
    all the treatments.
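A completely randomized allocation can be sketched by shuffling the subject list and cutting it into equal groups. The 12 subject labels and three treatment names below are hypothetical.

```python
import random

# Hypothetical subjects and treatments, for illustration only.
subjects = [f"subject_{i:02d}" for i in range(1, 13)]
treatments = ["A", "B", "C"]

rng = random.Random(7)       # fixed seed only so the sketch is repeatable
shuffled = subjects[:]
rng.shuffle(shuffled)        # impersonal chance decides the order

# Cut the shuffled list into equal-sized groups, one per treatment.
group_size = len(shuffled) // len(treatments)
assignment = {
    treatment: sorted(shuffled[i * group_size:(i + 1) * group_size])
    for i, treatment in enumerate(treatments)
}

for treatment, group in assignment.items():
    print(treatment, group)
```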

15
Principles of Experimental Design
  • Control the effects of lurking variables on the
    response, most simply by comparing two or more
    treatments.
  • Randomize: use impersonal chance to assign
    subjects to treatments.
  • Replicate each treatment on enough subjects to
    reduce chance variation in the results.

16
Statistical Significance
  • An observed effect so large that it would rarely
    occur by chance is called statistically
    significant.
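One way to make "would rarely occur by chance" concrete is a re-randomization check, which the slides do not describe; it is offered here only as a sketch. The response values are invented. The code shuffles the pooled responses many times and counts how often chance alone produces a difference at least as large as the observed one.

```python
import random

# Hypothetical responses from a randomized two-group experiment.
treatment = [24, 27, 31, 29, 33, 30]
control = [21, 23, 25, 22, 26, 24]

def mean_difference(group_a, group_b):
    """Difference between the two group means."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

observed = mean_difference(treatment, control)

# Shuffle all responses, split them into two groups of the original
# sizes, and count how often the shuffled difference is at least as large.
rng = random.Random(1)  # fixed seed only so the sketch is repeatable
pooled = treatment + control
trials = 10_000
at_least_as_large = 0
for _ in range(trials):
    rng.shuffle(pooled)
    shuffled_diff = mean_difference(pooled[:len(treatment)], pooled[len(treatment):])
    if shuffled_diff >= observed:
        at_least_as_large += 1

chance_fraction = at_least_as_large / trials
print(f"observed difference: {observed}")
print(f"fraction of shuffles at least as large: {chance_fraction:.4f}")
```

A small fraction means the observed effect would rarely arise from the random assignment alone.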

17
Section 3.3
  • Statistical Inference

18
Example
  • A market research firm interviews a random sample
    of 2500 adults.
  • Result: 66% find shopping for clothes frustrating
    and time-consuming.
  • We want to know the opinion of almost 210 million
    adult Americans who make up the population.
  • Because the sample was chosen at random, it is
    reasonable to think that these 2500 people
    represent the entire population pretty well.

19
Example
  • 2500 adults were asked whether they agree or
    disagree with the statement:
  • "I like buying new clothes, but shopping
    is often frustrating and time-consuming."
  • 1650 said they agreed.
  • The sample proportion p̂ = 1650/2500 = 0.66
    is a statistic.
  • The corresponding parameter is the proportion
    (call it p) of all adult U.S. residents who would
    have said "Agree" if asked the same question.
  • What's the truth about the almost 210 million
    American adults who make up the population?
  • A basic move in statistics is to use a fact about
    a sample to estimate the truth about the whole
    population.
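The statistic in this example is just the fraction of the sample that agreed. A minimal sketch using the counts from the slide (variable names are illustrative):

```python
# Counts from the slide; variable names are illustrative.
sample_size = 2500
agreed = 1650

# The sample proportion is a statistic: it is computed from the sample alone.
p_hat = agreed / sample_size
print(f"sample proportion = {p_hat:.2f}")

# The parameter p (the proportion among all adults) stays unknown;
# p_hat is only an estimate of it.
```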

20
Statistical Inference
  • Statistical Inference is when we infer
    conclusions about the wider population from data
    on selected individuals
  • To think about inference, we must keep straight
    whether a number describes a sample or a
    population
  • Definitions time!

21
Parameters and Statistics
  • A parameter is a number that describes the
    population
  • a fixed number
  • in practice, we don't know its value
  • A statistic is a number that describes a sample
  • its value is known when we have taken a sample
  • value can change from sample to sample
  • often used to estimate an unknown parameter
  • In the Gallup Poll, the parameter is the
    proportion of adult U.S. residents who approve of
    FEMA's response to Katrina
  • The statistic is the proportion of people sampled
    who approve of FEMA's response to Katrina

22
Parameter, Statistic
  • Example
  • We denote a Normal distribution with mean μ
    and standard deviation σ as N(μ, σ).
  • Formally, we call μ and σ parameters.
  • When we describe a reasonably symmetric
    histogram, we can use x̄ and s to describe its
    center and spread.
  • x̄ and s are called statistics.

23
Keep In Mind the Big Picture
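  • Population: described by a parameter
  • Sample: described by a sample statistic
  • Inference: use the sample statistic to draw
    conclusions about the population parameter

24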
Sampling Variability
  • If we took a second sample of 2500 adults, the
    new sample would have different people.
  • It is almost certain that there would not be
    exactly 1650 positive responses.
  • The value of p̂ will vary from sample to
    sample!
  • If we choose different samples from the same
    population, we will end up with different values
    of the statistic.

25
Sampling Variability
  • Random sampling eliminates bias from the act of
    choosing a sample, but the results can still be
    wrong because of the variability that results
    when we choose at random.
  • If the variation when we take repeated samples
    from the same population is too great, we can't
    trust the results of any one sample
  • If we take lots of random samples of the same
    size from the same population, the variation from
    sample to sample will follow a predictable pattern

26
Sampling Variability
  • All of statistical inference is based on one
    idea: to see how trustworthy a procedure is, ask
    what would happen if we repeated it many times
  • Definition: The sampling distribution of a
    statistic is the distribution of values taken by
    the statistic in all possible samples of the same
    size from the same population.
  • What would happen if we took many samples?
  • take a large number of samples from the same
    population
  • calculate the sample statistic for each sample
  • make a histogram of the values of the sample
    statistic
  • examine the distribution displayed in the
    histogram for shape, center, and spread, as well
    as outliers or other deviations
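The four steps above can be sketched as a small simulation. The population proportion of 0.6, the sample size of 100, and the 1000 repetitions are assumptions chosen for illustration, and the text histogram is a rough stand-in for a plotted one.

```python
import random
from collections import Counter
from statistics import mean, stdev

TRUE_P = 0.6        # assumed population proportion (illustrative)
SAMPLE_SIZE = 100
NUM_SAMPLES = 1000

rng = random.Random(3)  # fixed seed only so the sketch is repeatable

# Steps 1 and 2: take many samples and compute the statistic for each.
counts = [
    sum(rng.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]
p_hats = [count / SAMPLE_SIZE for count in counts]

# Step 3: histogram with bins two percentage points wide.
histogram = Counter(count // 2 * 2 for count in counts)
for lower in sorted(histogram):
    bar = "#" * (histogram[lower] // 5)
    print(f"{lower / SAMPLE_SIZE:.2f}  {bar}")

# Step 4: summarize center and spread.
print(f"mean = {mean(p_hats):.3f}, standard deviation = {stdev(p_hats):.3f}")
```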

27
Sampling Distribution
  • The sampling distribution of a statistic is the
    distribution of values taken by the statistic in
    all possible samples of the same size from the
    same population
  • The sampling distribution is the ideal pattern
    that would emerge if we looked at all possible
    samples of the same size from our population
  • One of the uses of probability theory in
    statistics is to obtain sampling distributions
    without simulation

28
Sampling Variability, Sampling Distribution
  • Example: Opinion about shopping.
  • Case 1: take 1000 random samples.
  • Each sample has size 100.
  • The sample statistic is the sample proportion p̂.
  • We use the sample proportion p̂ to estimate the
    unknown value of the population proportion p.
  • [Figure: histogram of the p̂ values for the 1000
    random samples]

29
Sampling Variability, Sampling Distribution
  • Example: Opinion about shopping.
  • Case 2: take 1000 random samples.
  • Each sample has size 2500.
  • We use the sample proportion p̂ to estimate the
    unknown value of the population proportion p.
  • [Figure: histogram of the 1000 p̂ values]

30
Sampling Distribution
  • Shape: both histograms look Normal.
  • Center: in both cases, the values of the sample
    proportion p̂ vary from sample to sample, but the
    values are centered at 0.6.
  • (The mean of the 1000 values of p̂ is 0.598 for
    samples of size 100 and 0.6002 for samples of
    size 2500.)
  • Spread: the values of p̂ from samples of size
    2500 are much less spread out than the values
    from samples of size 100. In fact, the standard
    deviations are about 0.01 and 0.05, respectively.
  • As sample size gets larger, the variation in
    the sampling distribution gets smaller.
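The effect of sample size on spread can be checked by simulation. The sketch below assumes a population proportion of 0.6. It compares the simulated standard deviation of p̂ with the standard formula sqrt(p(1 - p)/n), which the slides do not state.

```python
import math
import random
from statistics import stdev

TRUE_P = 0.6         # assumed population proportion (illustrative)
NUM_SAMPLES = 1000
rng = random.Random(5)  # fixed seed only so the sketch is repeatable

def simulated_sd(n):
    """Standard deviation of p-hat over NUM_SAMPLES samples of size n."""
    p_hats = [
        sum(rng.random() < TRUE_P for _ in range(n)) / n
        for _ in range(NUM_SAMPLES)
    ]
    return stdev(p_hats)

results = {}
for n in (100, 2500):
    results[n] = simulated_sd(n)
    theory = math.sqrt(TRUE_P * (1 - TRUE_P) / n)
    print(f"n = {n:4d}: simulated sd = {results[n]:.4f}, formula = {theory:.4f}")
```

Multiplying the sample size by 25 cuts the spread by a factor of about 5, because the spread shrinks with the square root of n.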

31
Want more details? Ex. 3.20
  • Our text contains a much more detailed
    discussion of the shopping question.
  • See pages 208-209 and figures 3.7 and 3.8

32
Bias and Variability
  • Bias concerns the center of the sampling
    distribution
  • a statistic used to estimate a parameter is
    unbiased if the mean of its sampling distribution
    is equal to the true value of the parameter being
    estimated
  • The variability of a statistic is described by
    the spread of its sampling distribution
  • this spread is determined by the sampling design
    and the sample size n
  • statistics from larger samples have smaller
    spreads

33
Bias and Variability
  • We can think of the true value of the population
    parameter as the bull's-eye on a target, and we
    can think of the sample statistic as an arrow
    fired at the bull's-eye
  • bias and variability describe what happens when
    an archer fires many arrows at the target (page
    213)
  • bias means that the aim is off, and the arrows
    land consistently off the bull's-eye in the same
    direction
  • large variability means that repeated shots are
    widely scattered on the target

34
Bias and Variability

35
Managing Bias and Variability
  • To reduce bias, use random sampling. When we
    start with a list of the entire population,
    simple random sampling produces unbiased
    estimates: the values of a statistic computed
    from an SRS neither consistently overestimate nor
    consistently underestimate the value of the
    population parameter.
  • To reduce the variability of a statistic from an
    SRS, use a larger sample. You can make the
    variability as small as you want by taking a
    large enough sample.
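The two remedies act on different problems, which a simulation can illustrate. The sketch below builds a hypothetical population of 10,000 yes/no opinions (60% yes overall) and compares an SRS from the full list with samples drawn only from a "reachable" half that leans 75% yes, a made-up case of undercoverage.

```python
import random
from statistics import mean, stdev

rng = random.Random(9)  # fixed seed only so the sketch is repeatable

# Hypothetical population of 10,000 yes(1)/no(0) opinions.
reachable = [1] * 3750 + [0] * 1250      # 75% yes
unreachable = [1] * 2250 + [0] * 2750    # 45% yes
population = reachable + unreachable     # 60% yes overall
true_p = sum(population) / len(population)

def simulate(frame, n, repeats=500):
    """Return (center, spread) of p-hat over repeated samples of size n."""
    p_hats = [sum(rng.sample(frame, n)) / n for _ in range(repeats)]
    return mean(p_hats), stdev(p_hats)

results = {}
for design, frame in (("SRS of everyone", population), ("reachable only", reachable)):
    for n in (50, 500):
        results[design, n] = simulate(frame, n)
        center, spread = results[design, n]
        print(f"{design:16s} n={n:3d}  center={center:.3f}  spread={spread:.3f}")
print(f"true p = {true_p}")
```

In this setup a larger sample shrinks the spread under both designs, but the center of the "reachable only" design stays near 0.75 rather than 0.6: a bigger sample does not repair bias.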

36
Sampling from Large Populations
  • Population Size Does Not Matter
  • The variability of a statistic from a random
    sample does not depend on the size of the
    population, as long as the population is at least
    100 times larger than the sample
  • If we denote the population size by N, then we
    want N > 100n, where n is the sample size
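This claim can be checked by drawing samples of the same size from populations of very different sizes. The sketch below assumes two hypothetical populations (10,000 and 1,000,000 individuals, each 60% yes) and an SRS of size 100 from each.

```python
import random
from statistics import stdev

rng = random.Random(13)  # fixed seed only so the sketch is repeatable
SAMPLE_SIZE = 100
REPEATS = 1000

def make_population(size):
    """Hypothetical population of yes(1)/no(0) opinions, 60% yes."""
    yes = size * 6 // 10
    return [1] * yes + [0] * (size - yes)

def spread_of_p_hat(population):
    """Standard deviation of p-hat over repeated SRSs of SAMPLE_SIZE."""
    p_hats = [
        sum(rng.sample(population, SAMPLE_SIZE)) / SAMPLE_SIZE
        for _ in range(REPEATS)
    ]
    return stdev(p_hats)

spreads = {
    size: spread_of_p_hat(make_population(size))
    for size in (10_000, 1_000_000)
}
for size, spread in spreads.items():
    print(f"N = {size:>9,d}: spread of p-hat = {spread:.4f}")
```

The two spreads should come out close to each other even though one population is 100 times larger than the other.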

37
Why Randomize?
  • The act of randomizing guarantees that our data
    are subject to the laws of probability
  • The behavior of statistics is described by a
    sampling distribution.
  • The form of the distribution is known, and in
    many cases is approximately Normal
  • Usually, the center of the distribution lies at
    the true parameter value
  • The spread of the distribution describes the
    variability of the statistic

38
Cautions
  • The proper statistical design is not the only
    aspect of a good sample or experiment
  • The sampling distribution shows only how a
    statistic varies due to the operation of chance
    in randomization
  • The sampling distribution reveals nothing about
    possible bias due to undercoverage or nonresponse
    in a sample or to lack of realism in an
    experiment
  • The true distance of a statistic from the
    parameter it is estimating can be much larger
    than the sampling distribution suggests (random
    chance!)
  • We cannot actually gauge the added error

39
Problems
  • 3.66
  • Statistic, it describes the sample
  • 3.68
  • a) High Variability (HV), High Bias (HB)
  • b) LV,LB
  • c) HV,LB
  • d) LV,HB
  • 3.70
  • a) It won't vary. Population size does not
    impact variability.
  • b) It will vary. The sample size changed!

40
Section 3.3 Summary
  • A number that describes a population is a
    parameter. A number that can be computed from
    the data is a statistic. The purpose of sampling
    or experimentation is usually to use statistics
    to make statements about unknown parameters.

41
Section 3.3 Summary
  • A statistic from a probability sample or
    randomized experiment has a sampling distribution
    that describes how the statistic varies in
    repeated data production. The sampling
    distribution answers the question, "What would
    happen if we repeated the sample or experiment
    many times?" Formal statistical inference is
    based on the sampling distributions of statistics.

42
Section 3.3 Summary
  • A statistic as an estimator of a parameter may
    suffer from bias or from high variability. Bias
    means that the center of the sampling
    distribution is not equal to the true value of
    the parameter. The variability of the statistic
    is described by the spread of its sampling
    distribution.

43
Section 3.3 Summary
  • Properly chosen statistics from randomized data
    production designs have no bias resulting from
    the way the sample is selected or the way the
    subjects are assigned to treatments. We can
    reduce the variability of the statistic by
    increasing the size of the sample or the size of
    the experimental groups.