Title: Producing data
1Chapter 3
2Observation versus Experiment
- An observational study observes individuals and
measures variables of interest but does not
attempt to influence the responses.
- An experiment deliberately imposes some treatment
on individuals to observe their responses.
3Confounding
- Two variables (explanatory variables or lurking
variables) are confounded when their effects on a
response variable cannot be distinguished from
each other.
4Population, Sample
- The population in a statistical study is the
entire group of individuals about which we want
information.
- A sample is a part of the population from which
we actually collect information, which we use to
draw conclusions about the whole.
5Population, Sample
- For the remainder of the course, you must be able
to differentiate between a population of interest
and a sample that gives information about that
population.
- Your text contains many scenarios; practicing with
them will aid your understanding. In other words,
do as many problems as possible.
- Populations are not always groups of people.
They can be groups of objects (e.g., items
inspected for quality control).
6Example
- All students at ISU, all citizens of Ames, and all
consumers driving a Mercedes are examples of
populations.
- Students who study on the second floor of Parks
Library are a sample from the population of all
ISU students.
- Mercedes drivers in Ames are a sample from all
consumers driving a Mercedes.
- The trees in front of Curtis Hall are a sample.
7Example
- Each week, the Gallup Poll questions a sample of
about 1500 adult U.S. residents to determine
national opinion on a wide variety of issues,
such as the approval rating of the president.
- What is the population of interest?
- All U.S. adults
- What is the sample?
- 1500 sampled U.S. adults
8Example
- A social scientist wants to know the opinions of
employed adult women about government funding
for day care. She obtains a list of the 520
members of a local business and professional
women's club and mails a questionnaire to 100 of
these women selected at random. Only 48
questionnaires are returned.
- What is the population in this study?
- What is the sample from whom information is
actually obtained?
- What is the rate (percent) of response?
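The response rate in the last question is the number of questionnaires returned divided by the number mailed. A minimal sketch of that arithmetic, using the counts from the example (the variable names are illustrative only):

```python
# Counts taken from the example; variable names are illustrative.
mailed = 100     # questionnaires mailed to randomly selected club members
returned = 48    # questionnaires actually returned

response_rate = returned / mailed
print(f"Response rate: {response_rate:.0%}")  # prints "Response rate: 48%"
```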
9Voluntary Response Sample
- A voluntary response sample consists of people
who choose themselves by responding to a general
appeal. Voluntary response samples are biased
because people with strong opinions, especially
negative opinions, are most likely to respond.
10Bias
- The design of a study is biased if it
systematically favors certain outcomes.
- A scale that always shows objects as 2 kg too
heavy
- A survey worded so that responses are skewed away
from respondents' true opinions
- "Given the myriad of health problems alcohol can
cause, do you still support lowering the legal
drinking age to 18?" How will most people
respond to such a leading question?
11Simple Random Sample
- A simple random sample (SRS) of size n consists
of n individuals from the population chosen in
such a way that every set of n individuals has an
equal chance to be the sample actually selected.
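One way to draw an SRS in practice is to let software pick n labels from a list of the population. The sketch below assumes a hypothetical sampling frame of 520 ID numbers. The function name `simple_random_sample` is made up for illustration, and Python's standard `random.sample` does the drawing without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals chosen so that every set of n is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no individual appears twice.
    return rng.sample(list(population), n)

# Hypothetical sampling frame: ID numbers 1..520 standing in for a membership list.
frame = range(1, 521)
srs = simple_random_sample(frame, 100, seed=1)
print(len(srs), "individuals selected; first five IDs:", srs[:5])
```

Fixing `seed` makes the selection reproducible, while leaving it as `None` gives a fresh sample on each run.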
12Undercoverage and Nonresponse
- Undercoverage occurs when some groups in the
population are left out of the process of
choosing the sample.
- Nonresponse occurs when an individual chosen for
the sample can't be contacted or refuses to
cooperate.
13Subjects, Factors, Treatments
- The individuals studied in an experiment are
often called subjects, especially if they are
people.
- The explanatory variables in an experiment are
often called factors.
- A treatment is any specific experimental
condition applied to the subjects. If an
experiment has several factors, a treatment is a
combination of a specific value (often called a
level) of each of the factors.
14Completely Randomized Design
- In a completely randomized experimental design,
all the subjects are allocated at random among
all the treatments.
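A completely randomized design can be carried out by shuffling the list of subjects and dealing them out to the treatments. This is only a sketch. The subject labels, the two treatment names, and the function name are assumptions made for illustration.

```python
import random

def completely_randomized_design(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # impersonal chance decides the order
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical experiment: 20 subjects, two treatments.
subjects = [f"S{i:02d}" for i in range(1, 21)]
groups = completely_randomized_design(subjects, ["treatment", "control"], seed=3)
for name, members in groups.items():
    print(name, sorted(members))
```

Dealing in turn after a shuffle keeps the group sizes as equal as possible, which is a design choice of this sketch and not a requirement of the definition.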
15Principles of Experimental Design
- Control the effects of lurking variables on the
response, most simply by comparing two or more
treatments.
- Randomize: use impersonal chance to assign
subjects to treatments.
- Replicate each treatment on enough subjects to
reduce chance variation in the results.
16Statistical Significance
- An observed effect so large that it would rarely
occur by chance is called statistically
significant.
17Section 3.3
18Example
- A market research firm interviews a random sample
of 2500 adults.
- Result: 66% find shopping for clothes frustrating
and time-consuming.
- We want to know the opinion of the almost 210 million
adult Americans who make up the population.
- Because the sample was chosen at random, it is
reasonable to think that these 2500 people
represent the entire population pretty well.
19Example
- 2500 adults were asked whether they agree or
disagree that "I like buying new clothes, but shopping
is often frustrating and time-consuming."
- 1650 said they agreed.
- The sample proportion p̂ = 1650/2500 = 0.66 is a statistic.
- The corresponding parameter is the proportion
(call it p) of all adult U.S. residents who would
have said "Agree" if asked the same question.
- What's the truth about the almost 210 million
American adults who make up the population?
- A basic move in statistics is to use a fact about
a sample to estimate the truth about the whole
population.
20Statistical Inference
- Statistical inference is when we infer
conclusions about the wider population from data
on selected individuals.
- To think about inference, we must keep straight
whether a number describes a sample or a
population.
- Definitions time!
21Parameters and Statistics
- A parameter is a number that describes the
population
- a fixed number
- in practice, we don't know its value
- A statistic is a number that describes a sample
- its value is known when we have taken a sample
- value can change from sample to sample
- often used to estimate an unknown parameter
- In the Gallup Poll, the parameter is the
proportion of adult U.S. residents who approve of
FEMA's response to Katrina.
- The statistic is the proportion of people sampled
who approve of FEMA's response to Katrina.
22Parameter, Statistic
- Example
- We denote a Normal distribution with mean μ
and standard deviation σ as N(μ, σ).
- Formally, we call μ and σ parameters.
- When describing a reasonably symmetric histogram,
we can use x̄ and s to describe its center and
spread.
- x̄ and s are called statistics.
23Keep In Mind the Big Picture
[Diagram: Population (population parameter) and Sample (sample statistic), linked by inference]
24Sampling Variability
- If we took a second sample of 2500 adults, the
new sample would have different people.
- It is almost certain that there would not be
exactly 1650 positive responses.
- The value of p̂ will vary from sample to
sample!
- If we choose different samples from the same
population, we will end up with different values of
the statistic.
25Sampling Variability
- Random samples eliminate bias from the act of
choosing a sample, but they can still be
wrong because of the variability that results
when we choose at random.
- If the variation when we take repeated samples
from the same population is too great, we can't
trust the results of any one sample.
- If we take lots of random samples of the same
size from the same population, the variation from
sample to sample will follow a predictable pattern.
26Sampling Variability
- All of statistical inference is based on one
idea: to see how trustworthy a procedure is, ask
what would happen if we repeated it many times.
- Definition: The sampling distribution of a
statistic is the distribution of values taken by
the statistic in all possible samples of the same
size from the same population.
- What would happen if we took many samples?
- take a large number of samples from the same
population
- calculate the sample statistic for each sample
- make a histogram of the values of the sample
statistic
- examine the distribution displayed in the
histogram for shape, center, and spread, as well
as outliers or other deviations
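The four steps above can be sketched as a small simulation. Two things here are assumptions and not part of the slides: the population proportion is set to p = 0.6, and the population is treated as so large that each sampled individual agrees independently with probability p. The function name is illustrative.

```python
import random
import statistics
from collections import Counter

def simulate_sample_proportions(p, n, num_samples, seed=None):
    """Take num_samples samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(num_samples):
        # Each individual agrees with probability p (large-population assumption).
        successes = sum(rng.random() < p for _ in range(n))
        p_hats.append(successes / n)
    return p_hats

# Steps 1 and 2: many samples, one statistic per sample.
p_hats = simulate_sample_proportions(p=0.6, n=100, num_samples=1000, seed=0)

# Step 3: a crude text histogram, with values rounded to the nearest 0.05.
bins = Counter(round(ph * 20) / 20 for ph in p_hats)
for center in sorted(bins):
    print(f"{center:.2f} {'*' * (bins[center] // 10)}")

# Step 4: summarize center and spread.
print("mean of p-hat values:", round(statistics.mean(p_hats), 3))
print("std. dev. of p-hat values:", round(statistics.stdev(p_hats), 3))
```

Rerunning with a larger `n` shows how the histogram changes with sample size.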
27Sampling Distribution
- The sampling distribution of a statistic is the
distribution of values taken by the statistic in
all possible samples of the same size from the
same population.
- The sampling distribution is the ideal pattern
that would emerge if we looked at all possible
samples of the same size from our population.
- One of the uses of probability theory in
statistics is to obtain sampling distributions
without simulation.
28Sampling Variability, Sampling Distribution
- Example: Opinion about shopping.
- Case 1: get 1000 random samples
- Each sample has size 100
- Sample statistic is the sample proportion p̂
- We use the sample proportion to estimate the
unknown value of the population proportion p.
- The histogram of p̂ values for 1000 random
samples
29Sampling Variability, Sampling Distribution
- Example: Opinion about shopping.
- Case 2: get 1000 random samples
- Each sample has size 2500
- We use the sample proportion to estimate the
unknown value of the population proportion p.
- The histogram of 1000 p̂ values
30Sampling Distribution
- Shape: both histograms look Normal.
- Center: In both cases, the values of the sample
proportion vary from sample to sample, but the
values are centered at 0.6.
- (The mean of the 1000 values of p̂ is 0.598 for
samples of size 100 and 0.6002 for samples of
size 2500.)
- Spread: The values of p̂ from samples of size
2500 are much less spread out than the values
from samples of size 100. In fact, the standard
deviations are 0.01 and 0.051 respectively.
- As sample size gets larger, the variation in
the sampling distribution gets smaller.
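Probability theory gives a formula for this spread. For an SRS from a population much larger than the sample, the standard deviation of the sample proportion is sqrt(p(1 - p)/n). This is a standard result and is not stated on the slides. The sketch below evaluates it for the two sample sizes, assuming p = 0.6 (the center reported above). The function name is illustrative.

```python
import math

def sd_of_sample_proportion(p, n):
    """Theoretical standard deviation of p-hat for a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

# Assumed population proportion p = 0.6.
for n in (100, 2500):
    print(f"n = {n:>4}: sd = {sd_of_sample_proportion(0.6, n):.4f}")
# n =  100: sd = 0.0490
# n = 2500: sd = 0.0098
```

Multiplying the sample size by 25 divides the standard deviation by 5, because n sits under a square root.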
31Want more details? Ex. 3.20
- Our text contains a much more detailed
discussion of the shopping question.
- See pages 208-209 and Figures 3.7 and 3.8.
32Bias and Variability
- Bias concerns the center of the sampling
distribution
- a statistic used to estimate a parameter is
unbiased if the mean of its sampling distribution
is equal to the true value of the parameter being
estimated
- The variability of a statistic is described by
the spread of its sampling distribution
- this spread is determined by the sampling design
and the sample size n
- statistics from larger samples have smaller
spreads
33Bias and Variability
- We can think of the true value of the population
parameter as the bull's-eye on a target, and we
can think of the sample statistic as an arrow
fired at the bull's-eye
- bias and variability describe what happens when
an archer fires many arrows at the target (page
213)
- bias means that the aim is off, and the arrows
land consistently off the bull's-eye in the same
direction
- large variability means that repeated shots are
widely scattered on the target
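The archery picture can be put into numbers. Bias is how far the average shot lands from the bull's-eye, and variability is how scattered the shots are. The sketch below uses made-up estimates of a parameter whose true value is taken to be 0.60. The data and the function name are illustrative assumptions, not results from the text.

```python
import statistics

def bias_and_variability(estimates, true_value):
    """Bias: mean of the estimates minus the truth. Variability: their std. dev."""
    bias = statistics.mean(estimates) - true_value
    spread = statistics.stdev(estimates)
    return bias, spread

# Made-up "shots" at a bull's-eye of 0.60.
shots = {
    "low bias, low variability":   [0.59, 0.61, 0.60, 0.60, 0.61, 0.59],
    "high bias, low variability":  [0.69, 0.71, 0.70, 0.70, 0.71, 0.69],
    "low bias, high variability":  [0.45, 0.75, 0.60, 0.52, 0.68, 0.60],
}
results = {name: bias_and_variability(vals, 0.60) for name, vals in shots.items()}
for name, (bias, spread) in results.items():
    print(f"{name}: bias = {bias:+.3f}, spread = {spread:.3f}")
```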
34Bias and Variability
35Managing Bias and Variability
- To reduce bias, use random sampling. When we
start with a list of the entire population,
simple random sampling produces unbiased
estimates: the values of a statistic computed from
an SRS neither consistently overestimate nor
consistently underestimate the value of the
population parameter.
- To reduce the variability of a statistic from an
SRS, use a larger sample. You can make the
variability as small as you want by taking a
large enough sample.
36Sampling from Large Populations
- Population Size Does Not Matter
- The variability of a statistic from a random
sample does not depend on the size of the
population, as long as the population is at least
100 times larger than the sample.
- If we denote the population size by N, then we
want N ≥ 100n, where n is the sample size.
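One way to see why the 100-times rule works is the finite population correction. For an SRS drawn without replacement, the standard deviation of a sample proportion is sqrt(p(1 - p)/n) multiplied by sqrt((N - n)/(N - 1)). This is a standard result brought in here for illustration and is not stated on the slides. The sketch assumes p = 0.6 and n = 2500.

```python
import math

def sd_p_hat(p, n, N):
    """SD of the sample proportion for an SRS of size n from a population of size N."""
    fpc = math.sqrt((N - n) / (N - 1))  # finite population correction
    return math.sqrt(p * (1 - p) / n) * fpc

n = 2500
for N in (100 * n, 1_000_000, 210_000_000):
    print(f"N = {N:>11,}: sd = {sd_p_hat(0.6, n, N):.5f}")
```

Once N is at least 100n, the correction factor is at least about 0.995, so the standard deviation changes by less than 1% no matter how much larger the population gets.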
37Why Randomize?
- The act of randomizing guarantees that our data
are subject to the laws of probability.
- The behavior of statistics is described by a
sampling distribution.
- The form of the distribution is known, and in
many cases is approximately Normal.
- Usually, the center of the distribution lies at
the true parameter value.
- The spread of the distribution describes the
variability of the statistic.
38Cautions
- Proper statistical design is not the only
aspect of a good sample or experiment.
- The sampling distribution shows only how a
statistic varies due to the operation of chance
in randomization.
- The sampling distribution reveals nothing about
possible bias due to undercoverage or nonresponse
in a sample, or to lack of realism in an
experiment.
- The true distance of a statistic from the
parameter it is estimating can be much larger
than the sampling distribution, which reflects only
random chance, suggests.
- We cannot actually gauge the added error.
- 3.66
- Statistic; it describes the sample
- 3.68
- a) High Variability (HV), High Bias (HB)
- b) LV,LB
- c) HV,LB
- d) LV,HB
- 3.70
- a) It won't vary. Population size does not
impact variability.
- b) It will vary. The sample size changed!
40Section 3.3 Summary
- A number that describes a population is a
parameter. A number that can be computed from
the data is a statistic. The purpose of sampling
or experimentation is usually to use statistics
to make statements about unknown parameters.
41Section 3.3 Summary
- A statistic from a probability sample or
randomized experiment has a sampling distribution
that describes how the statistic varies in
repeated data production. The sampling
distribution answers the question, "What would
happen if we repeated the sample or experiment
many times?" Formal statistical inference is
based on the sampling distributions of statistics.
42Section 3.3 Summary
- A statistic as an estimator of a parameter may
suffer from bias or from high variability. Bias
means that the center of the sampling
distribution is not equal to the true value of
the parameter. The variability of the statistic
is described by the spread of its sampling
distribution.
43Section 3.3 Summary
- Properly chosen statistics from randomized data
production designs have no bias resulting from
the way the sample is selected or the way the
subjects are assigned to treatments. We can
reduce the variability of the statistic by
increasing the size of the sample or the size of
the experimental groups.