Probability Theory - PowerPoint PPT Presentation

About This Presentation
Title:

Probability Theory

Slides: 149
Provided by: bioinfFb

Transcript and Presenter's Notes

Title: Probability Theory


1
Probability Theory
  • Review of essential concepts

2
Probability
  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • $0 \le P(A) \le 1$
  • $P(\Omega) = 1$

3
Problem 1
  • Given that $P(A) = 0.6$ and $P(B) = 0.7$, which of the
    following cannot be true?
  • $P(A \cup B) = 0.5$ ($\cup$ means "or")
  • $P(A \cap B) = 0.9$ ($\cap$ means "and")
  • P(A ? B) 0.2
  • P(A ? B) 0.4
  • P(A ? B) 0.7

4
Conditional Probability
  • A and B are called independent if
    $P(A \cap B) = P(A)\,P(B)$
  • $P(A \mid B) = P(A \cap B) / P(B)$
  • $P(A \mid B)$ is the share of A within B
  • A and B are independent $\iff P(A \mid B) = P(A)$

5
Complete Probability
  • $P(A) = P(A \mid H_1)P(H_1) + P(A \mid H_2)P(H_2) + \dots + P(A \mid H_n)P(H_n)$
  • $H_1, H_2, \dots, H_n$: a complete disjoint system of events

6
Bayes Formula
  • $P(B \mid A)$: prior probability
  • $P(A \mid B)$: posterior probability

7
Problem 2
  • Suppose a certain drug test is 99% sensitive and
    99% specific, that is, the test will correctly
    identify a drug user as testing positive 99% of
    the time, and will correctly identify a non-user
    as testing negative 98% of the time. Let's assume
    a corporation decides to test its employees for
    opium use, and 0.5% of the employees use the
    drug. What is the probability that, given a
    positive drug test, an employee is actually a
    drug user?
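
Problem 2 can be checked numerically with Bayes' formula. The sketch below is only an illustration: the function name `posterior_user` is made up here, and because the problem text quotes both 99 and 98 for the non-user case, the specificity is left as a parameter and both values are tried.

```python
def posterior_user(prevalence, sensitivity, specificity):
    """P(user | positive test) via Bayes' formula."""
    true_pos = sensitivity * prevalence                   # P(positive and user)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(positive and non-user)
    return true_pos / (true_pos + false_pos)

# Problem 2 figures: 0.5% of employees use the drug, 99% sensitivity.
# Assumption: specificity is tried at both 0.99 and 0.98.
for spec in (0.99, 0.98):
    print(spec, round(posterior_user(0.005, 0.99, spec), 4))
```

Either way the posterior stays well below one half, because true users are so rare among those tested.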

8
Problem 3
  • We are presented with three doors - red, green,
    and blue - one of which has a prize. We choose
    the red door, which is not opened until the
    presenter performs an action. The presenter who
    knows what door the prize is behind, and who must
    open a door, but is not permitted to open the
    door we have picked or the door with the prize,
    opens the blue door and reveals that there is no
    prize behind it and subsequently asks if we wish
    to change our mind about our initial selection of
    red. What is the probability that the prize is
    behind each of the green and red doors?
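
One way to work Problem 3 is to enumerate the three possible prize locations and condition on the presenter opening the blue door. The sketch assumes the prize is placed uniformly at random and that the presenter picks uniformly when two doors are allowed; neither assumption is stated in the problem.

```python
from fractions import Fraction

doors = ("red", "green", "blue")
chosen, opened = "red", "blue"

# Joint probability of (prize location, presenter opens `opened`).
joint = {}
for prize in doors:
    # Doors the presenter may open: not our pick, not the prize.
    allowed = [d for d in doors if d != chosen and d != prize]
    # Assumption: uniform choice among the allowed doors.
    p_open = Fraction(1, len(allowed)) if opened in allowed else Fraction(0)
    joint[prize] = Fraction(1, 3) * p_open

total = sum(joint.values())
posterior = {door: p / total for door, p in joint.items()}
print(posterior)
```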

9
Random Variables
  • Discrete (Uniform, Binomial, Poisson, Geometric,
    Hypergeometric, Negative Binomial, ...)
  • Continuous (Uniform, Normal, Exponential, Gamma,
    Chi-square, Student, Fisher, Dirichlet, ...)

10
Discrete Distributions
Poisson
11
Continuous Distributions
Beta distribution
12
Binomial Distribution
  • Binomial random number: the number of successes
    in n independent trials; p = probability of success
    in one trial

(Plots: p = 0.1, p = 0.3, p = 0.5)
13
Problem 4
The probability that a certain machine will
produce a defective item is 0.20. If a
random sample of 6 items is taken from the output
of this machine, what is the probability
that there will be 5 or more defectives in the
sample?
14
Problem 5
There are 10 patients on the Neo-Natal Ward of a
local hospital who are monitored by 2 staff
members. If the probability (at any one time) of
a patient requiring emergency attention by a
staff member is 0.3, and the patients are assumed
to behave independently, what is the probability at
any one time that there will not be sufficient
staff to attend to all emergencies?
15
Cumulative Probability
X: random variable; $F(x) = P(X \le x)$
Most of the data analysis tools have a built-in
function for the cumulative binomial probability
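
Where no such built-in is at hand, the cumulative binomial probability can be summed directly. The helper names `binom_pmf` and `binom_cdf` below are illustrative, and the reading of Problem 5 (each staff member attends one patient at a time, so staff are insufficient when more than 2 patients need help) is an assumption.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """F(k) = P(X <= k) for X ~ Binom(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Problem 4: P(X >= 5) with n = 6, p = 0.20
p4 = 1 - binom_cdf(4, 6, 0.20)
# Problem 5: P(X > 2) with n = 10, p = 0.3
p5 = 1 - binom_cdf(2, 10, 0.3)
print(round(p4, 4), round(p5, 4))
```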
16
Poisson Distribution
  • Poisson random number: the number of rare events
    per unit of time or space

(Plots: $\lambda = 1.5$, $\lambda = 5$)
17
Problem 6
  • The marketing manager of a company has noted that
    she usually receives 10 complaint calls during a
    week (consisting of five working days), and that
    the calls occur at random. Find the probability
    that she gets five such calls in one day.

18
Problem 7
  • The rate at which a particular defect occurs in
    lengths of plastic film being produced by a
    stable manufacturing process is 4.2 defects per
    75 meter length. A random sample of the film is
    selected and it was found that the length of the
    film in the sample was 25 meters. What is the
    probability that there will be at most 2 defects
    found in the sample?
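
Problems 6 and 7 both reduce to rescaling the rate to the interval of interest and evaluating the Poisson probability function. A minimal sketch, with `poisson_pmf` as an illustrative helper name:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Problem 6: 10 calls per 5-day week -> lam = 2 per day; P(X = 5)
p6 = poisson_pmf(5, 10 / 5)
# Problem 7: 4.2 defects per 75 m -> lam = 4.2 * 25 / 75 = 1.4 per 25 m; P(X <= 2)
p7 = sum(poisson_pmf(k, 4.2 * 25 / 75) for k in range(3))
print(round(p6, 4), round(p7, 4))
```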

19
Normal Distribution
20
Cumulative Probability
Standard Normal Distribution
21
Other Normal Distributions
  • $Z \sim N(0, 1)$
  • Mean 0
  • Variance 1
  • $X \sim N(\mu, \sigma)$
  • Mean $\mu$
  • Variance $\sigma^2$
  • $Z = (X - \mu)/\sigma$

22
Problem 8
  • The diameters of steel disks produced in a plant
    are normally distributed with a mean of 2.5 cm
    and standard deviation of 0.02 cm. What is the
    probability that a disk picked at random has a
    diameter greater than 2.54 cm?

23
Problem 9
  • The height of an adult male is known to be
    normally distributed with a mean of 69 inches and
    a standard deviation of 2.5 inches. What is the
    height of the doorway such that 96 percent of the
    adult males can pass through it without having to
    bend?

24
Problem 10
  • The longevity of people living in a certain
    locality has a standard deviation of 14 years.
    What is the mean longevity if 30% of the people
    live longer than 75 years? Assume a normal
    distribution for life spans.
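
Problems 8 to 10 use the normal distribution in three directions: a tail probability, a quantile, and solving for the mean. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# Problem 8: P(X > 2.54) for X ~ N(2.5, 0.02)
p8 = 1 - NormalDist(mu=2.5, sigma=0.02).cdf(2.54)
# Problem 9: height h with P(X <= h) = 0.96 for X ~ N(69, 2.5)
h9 = NormalDist(mu=69, sigma=2.5).inv_cdf(0.96)
# Problem 10: P(X > 75) = 0.30 with sigma = 14, so mu = 75 - 14 * z_0.70
mu10 = 75 - 14 * Z.inv_cdf(0.70)
print(round(p8, 4), round(h9, 2), round(mu10, 2))
```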

25
Normal Approximation to Binomial
  • $X \sim \mathrm{Binom}(n, p)$
  • n: number of trials
  • p: probability of a single success
  • $X \approx N(\mu, \sigma)$
  • $\mu = np$
  • $\sigma^2 = np(1-p)$

$n > 40$, $np > 5$, $n(1-p) > 5$
26
Problem 11
The unemployment rate in a certain city is 8.5%.
A random sample of 100 people from the labor
force is drawn. Find the approximate probability
that the sample contains at least ten unemployed
people.
27
Continuity correction
Normal approximation is still an approximation
28
Problem 12
Companies are interested in the demographics of
those who listen to the radio programs they
sponsor. A radio station has determined that only
20% of listeners phoning in to a morning talk
program are male. During a particular week, 200
calls are received by this program. What is the
approximate probability that at least 50 of the
callers are male?
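
To see how the continuity correction plays out in Problems 11 and 12, the normal approximation can be put next to the exact binomial tail. This is a sketch; the function names are made up for the illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

def at_least_normal(k, n, p):
    """Normal approximation to P(X >= k), with continuity correction."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return 1 - NormalDist(mu, sigma).cdf(k - 0.5)

def at_least_exact(k, n, p):
    """Exact binomial P(X >= k), for comparison."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Problem 11: n = 100, p = 0.085, at least 10 unemployed
# Problem 12: n = 200, p = 0.20, at least 50 male callers
for k, n, p in [(10, 100, 0.085), (50, 200, 0.20)]:
    print(k, n, p, round(at_least_normal(k, n, p), 4), round(at_least_exact(k, n, p), 4))
```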
29
Poisson Approximation to Binomial
  • $X \sim \mathrm{Binom}(n, p)$
  • n: number of trials
  • p: probability of a single success
  • $X \approx \mathrm{Poisson}(\lambda)$
  • $\lambda = np$

$n \to \infty$, $p \to 0$, $np \to \mathrm{const}$
30
Problem 13
A certain genetic characteristic will express
itself in 0.001 of the population. In a sample of
$n = 3000$ subjects, $k = 7$ are observed to display the
characteristic, whereas only three are expected
to display the characteristic. How likely is it
that a rate this great or greater could occur by
mere chance?
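
For Problem 13 the Poisson approximation with expected count 3 gives the tail probability directly, and the exact binomial tail is cheap enough to compute as a cross-check. A minimal sketch:

```python
from math import comb, exp, factorial

n, p, k = 3000, 0.001, 7
lam = n * p  # expected count (about 3)

# P(X >= 7) under the Poisson approximation
poisson_tail = 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))
# Exact binomial tail, for comparison
binom_tail = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
print(round(poisson_tail, 4), round(binom_tail, 4))
```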
31
Expected Value
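$E(X) = \sum_i x_i p_i$ (not a random number)
$E(XY) = 1 \cdot \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3} = E(X)E(Y)$
$E(X) = 0 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{2} = \tfrac{1}{2}$
$E(Y) = 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{2}{3} = \tfrac{2}{3}$
X and Y are independent $\iff$ $\{X = a\}$ and $\{Y = b\}$ are
independent events for all a, b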
32
Variance
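$\mathrm{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - (E(X))^2$
$E(X) = 2/3$
$E(X - E(X)) = -2/9 + 2/9 = 0$
$\mathrm{Var}(X) = \tfrac{4}{9} \cdot \tfrac{1}{3} + \tfrac{1}{9} \cdot \tfrac{2}{3} = \tfrac{2}{9}$
$E(X^2) = 2/3$
$\mathrm{Var}(X) = E(X^2) - E^2(X) = 2/3 - 4/9 = 2/9$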
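
The two variance formulas can be checked with exact fractions. The distribution used below (X = 0 with probability 1/3, X = 1 with probability 2/3) is an assumption chosen because it reproduces the mean of 2/3 and the variance of 2/9 shown on the slide.

```python
from fractions import Fraction as F

# Assumed distribution: value -> probability
dist = {0: F(1, 3), 1: F(2, 3)}

def expect(g, dist):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in dist.items())

mean = expect(lambda x: x, dist)
var_def = expect(lambda x: (x - mean) ** 2, dist)        # E[(X - E X)^2]
var_short = expect(lambda x: x * x, dist) - mean ** 2    # E[X^2] - (E X)^2
print(mean, var_def, var_short)
```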
33
Expected Value and Variance
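  • X: random variable
  • $E(X + Y) = E(X) + E(Y)$
  • $E(cX) = cE(X)$
  • $E(c) = c$
  • If X and Y are independent then $E(XY) = E(X)E(Y)$
  • $\mathrm{Var}(X) = E(X^2) - E^2(X)$
  • $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$
  • If X and Y are independent then
    $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
  • For arbitrary X and Y,
    $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$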

34
Exercises
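  • Using properties of E(X), prove that
  • $\mathrm{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - (E(X))^2$
  • $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
  • where
  • $\mathrm{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]$
  • $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$
  • Find X and Y such that X and Y are dependent but
    $\mathrm{Cov}(X, Y) = 0$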

35
Problem 14
  • The Attila Barbell Company makes bars for weight
    lifting. The weights of the bars are independent
    and are normally distributed with a mean of 720
    ounces (45 pounds) and a standard deviation of 4
    ounces. The bars are shipped 10 in a box to the
    retailers. The weights of the empty boxes are
    normally distributed with a mean of 320 ounces
    and a standard deviation of 8 ounces. The weights
    of the boxes filled with 10 bars are expected to
    be normally distributed with a mean of 7,520
    ounces. What is the standard deviation?
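
Problem 14 applies the rule that variances of independent terms add. The sketch assumes the box weight is independent of the bar weights, which the problem implies but does not state outright.

```python
from math import sqrt

n_bars = 10
sd_bar, sd_box = 4.0, 8.0            # ounces
mean_total = n_bars * 720 + 320      # 7,520 ounces, as stated in the problem

# Var(total) = 10 * Var(bar) + Var(box), assuming independence
var_total = n_bars * sd_bar**2 + sd_box**2
sd_total = sqrt(var_total)
print(mean_total, round(sd_total, 2))
```

Note that the standard deviations themselves do not add: 10 × 4 + 8 = 48 would badly overstate the spread.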

36
Statistics
  • Part I Sampling distribution

37
Sampling Distribution
  • Sample: $X_1, X_2, \dots, X_n$
  • $X_i$ are random numbers

Population: heights of adult males
  • All $X_i$ are
  • from the same distribution
  • are independent

38
Sample Mean
  • All Xi are
  • from the same distribution, i.e.,
  • $E(X_i) = \mu$, $\mathrm{Var}(X_i) = \sigma^2$
  • are independent random numbers

39
The Law of Large Numbers

40
Illustrative example
Population: {1, 2, 3}; sample size n = 2
41
Central Limit Theorem
  • The sum of a sufficiently large number of
    identically distributed independent random
    variables is approximately normally distributed
    regardless of the population distribution

42
Normal Approximation to Binomial
X: number of successes in n trials; $X = X_1 + X_2 + \dots + X_n$
43
Problem 18
  • There are two games involving flipping a coin. In
    the first game you win a prize if you can throw
    between 45% and 55% heads. In the second game
    you win if you can throw more than 80% heads. For
    each game would you rather flip the coin 30 times
    or 300 times?

44
Sampling distribution
$\bar{X}$ is approximately normal when $n > 40$
$\bar{X}$ is approximately normal regardless of the
original distribution
45
Problem 15
  • The average outstanding bill for delinquent
    customer accounts for a national department store
    chain is 187.50 with a standard deviation of
    54.50. In a simple random sample of 50
    delinquent accounts, what is the probability that
    the mean outstanding bill is over 200?

46
Problem 16
  • The average number of daily emergency room
    admissions at a hospital is 85 with standard
    deviation of 37. In a simple random sample of 30
    days, what is the probability that the mean
    number of daily emergency admissions is between
    75 and 95?

47
Problem 17
  • A summer resort rents rowboats to customers but
    does not allow more than four people to a boat.
    Each boat is designed to hold no more than 800
    pounds. Suppose the distribution of adult males
    who rent boats, including their clothes and gear,
    is normal with a mean of 190 pounds and standard
    deviation of 10 pounds. If the weights of
    individual passengers are independent, what is
    the probability that a group of four adult male
    passengers will exceed the acceptable weight
    limit of 800 pounds?
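
Problems 15 to 17 all rest on the sampling distribution of a mean or a sum. A sketch using `statistics.NormalDist`, with `sample_mean_dist` as an illustrative helper name:

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_dist(mu, sigma, n):
    """Approximate distribution of the mean of n independent draws (CLT)."""
    return NormalDist(mu, sigma / sqrt(n))

# Problem 15: P(mean bill > 200), mu = 187.50, sigma = 54.50, n = 50
p15 = 1 - sample_mean_dist(187.50, 54.50, 50).cdf(200)
# Problem 16: P(75 < mean admissions < 95), mu = 85, sigma = 37, n = 30
d16 = sample_mean_dist(85, 37, 30)
p16 = d16.cdf(95) - d16.cdf(75)
# Problem 17: total weight of 4 passengers ~ N(4 * 190, 10 * sqrt(4)); P(total > 800)
p17 = 1 - NormalDist(4 * 190, 10 * sqrt(4)).cdf(800)
print(round(p15, 4), round(p16, 4), round(p17, 4))
```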

48
Statistics
  • Part II Hypothesis testing

49
Hypothesis testing
  • H0: null hypothesis
  • HA: alternative hypothesis

In a court: H0: the person is not guilty; HA: the person is guilty
Doctor's appointment: H0: patient is sick; HA: patient is not sick
50
Type I/II error
  • Type I error ($\alpha$)
  • It is the error of rejecting a null hypothesis
    when it is actually true.
  • Type II error ($\beta$)
  • It is the error of failing to reject a null
    hypothesis when it is in fact false.

51
Decision rule
  • Assume we get many samples
  • We set up a decision rule which rejects or
    accepts the null hypothesis for each sample
  • Sometimes we will commit Type I error
  • Sometimes we will commit Type II error
  • (Of course many times we will be correct!)

Decision rule comes separately from the set of
hypotheses
52
Type I/II error
53
Problem 19
  • A patient claims that he consumes only 2000
    calories per day, but a dietician suspects that
    the actual figure is higher. The dietician plans
    to check his food intake for 30 days and will
    reject the patient's claim if the 30-day-mean is
    more than 2100 calories. If the standard
    deviation (in calories per day) is 350, what is
    the probability that the dietician will
    mistakenly reject a patient's true claim?

54
Problem 20
  • City planners wish to test the claim that
    shoppers park for an average of only 47 minutes
    in the downtown area. The planners have decided
    to tabulate parking durations for 225 shoppers
    and to reject the claim if the sample mean
    exceeds 50 minutes. If the claim is wrong and the
    true mean is 51 minutes, what is the probability
    that the random sample will lead to a mistaken
    failure to reject the claim? Assume that the
    standard deviation in parking durations is 27
    minutes.
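
Problems 19 and 20 ask for the two error probabilities of a given decision rule. In the sketch below the sample mean is treated as normal with standard deviation sigma / sqrt(n); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Problem 19: reject "mu = 2000" if the 30-day mean exceeds 2100; sigma = 350.
# Type I error: the claim is true but the rule rejects it.
alpha = 1 - NormalDist(2000, 350 / sqrt(30)).cdf(2100)

# Problem 20: reject "mu = 47" if the mean of 225 durations exceeds 50; sigma = 27.
# Type II error: the true mean is 51 but the rule fails to reject.
beta = NormalDist(51, 27 / sqrt(225)).cdf(50)
print(round(alpha, 4), round(beta, 4))
```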

55
P-value
  • P-value is the probability of obtaining a result
    at least as extreme as the one that was actually
    observed, given that the null hypothesis is true.

56
Hypothesis testing
  • P-value is a function of sample
  • $\alpha$ is a function of decision rule
  • Reject H0 if p-value < $\alpha$
  • Small p-value indicates that you see something
    very unusual if H0 were true

57
Problem 21
  • A service station advertises that its mechanics
    can change a muffler in only 15 minutes. A
    consumers group doubts this claim and runs a
    hypothesis test using 49 cars needing new
    mufflers. In this sample the mean changing time
    is 16.25 minutes with a standard deviation of 3.5
    minutes. Is this strong evidence against the 15
    minute claim?
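
A possible worked version of Problem 21, assuming a one-sided large-sample z-test (n = 49, with the sample standard deviation standing in for the population value):

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s, mu0 = 49, 16.25, 3.5, 15
z = (xbar - mu0) / (s / sqrt(n))      # test statistic
p_value = 1 - NormalDist().cdf(z)     # one-sided: H_A is "mean > 15"
print(round(z, 2), round(p_value, 4))
```

A p-value this small would lead to rejecting the 15-minute claim at the usual 5% and 1% levels.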

58
Estimators
  • An estimator is a function of the observable
    sample data that is used to estimate an unknown
    population parameter
  • $\bar{x}$ is an estimator for $\mu$
  • s is an estimator for $\sigma$
  • $\hat{p}$ is an estimator for p

59
Unbiased effective estimators
  • Let $\theta$ be the unknown parameter
  • Let $\hat{\theta}$ be an estimator
  • $\hat{\theta}$ is unbiased if $E(\hat{\theta}) = \theta$
  • is effective if

60
Unbiased vs. effective
Unbiased but ineffective
Effective but biased
We are looking for unbiased and effective
estimators
61
Mean Squared Error
  • Bias
  • Variance
  • Mean Squared Error
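
For reference, the three quantities listed above are tied together by the standard decomposition below. The symbols $\theta$ (parameter) and $\hat{\theta}$ (estimator) are notation assumed here.

```latex
\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta, \qquad
\mathrm{Var}(\hat{\theta}) = E\big[(\hat{\theta} - E(\hat{\theta}))^2\big], \qquad
\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
```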

62
Problem ?
  • A box contains 70 black and 30 white balls. Ten
    balls are chosen at random and two estimators of
    the following form are considered
  • where n10. Which estimator is more effective?
    (i.e., has a smaller MSE?)

63
Standard error
  • Standard error standard deviation of the
    estimator

64
Problem 22
  • A local restaurant owner claims that only 15% of
    visiting tourists stay for more than 2 days. A
    chamber of commerce volunteer is sure that the
    real percentage is higher. He plans to survey 100
    tourists and intends to speak up if at least 18
    of the tourists stay longer than 2 days. What is
    the probability of mistakenly rejecting the
    restaurant owner's claim if it is true?
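
Problem 22 is a Type I error calculation for a proportion. The sketch computes it two ways: with the normal approximation based on the standard error, and with the exact binomial tail. Reading "at least 18 of the tourists" as a count of 18 out of 100 is an assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

p0, n, cutoff = 0.15, 100, 18
se = sqrt(p0 * (1 - p0) / n)          # standard error of the sample proportion

# Normal approximation, without continuity correction
alpha_normal = 1 - NormalDist(p0, se).cdf(cutoff / n)
# Exact binomial P(X >= 18) when the owner's claim is true
alpha_exact = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(cutoff, n + 1))
print(round(alpha_normal, 4), round(alpha_exact, 4))
```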

65
Two-sample mean
  • Two independent samples, $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$,
    have independent sample means

66
Two-sample proportion
  • Two independent sample proportions

67
Problem 23
  • A historian believes that the average height of
    soldiers in World War II was greater than that of
    soldiers in World War I. She examines a random
    sample of records of 100 men in each war and
    notes standard deviations of 2.5 and 2.3 inches
    in World War I and World War II, respectively. If
    the average height from the sample of World War
    II soldiers is 1 inch greater than that from the
    sample of World War I soldiers, what conclusion
    is justified from a two-sample hypothesis test
    where $H_0\colon \mu_1 = \mu_2$ vs. $H_A\colon \mu_1 < \mu_2$?
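
Problem 23 can be sketched as a two-sample z-test, using the sample standard deviations in place of the population values (reasonable here because both samples have 100 men):

```python
from math import sqrt
from statistics import NormalDist

n1 = n2 = 100
s1, s2 = 2.5, 2.3            # World War I, World War II (inches)
diff = 1.0                   # sample mean WWII minus sample mean WWI

se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
z = diff / se
p_value = 1 - NormalDist().cdf(z)    # one-sided, H_A: mu1 < mu2
print(round(z, 2), round(p_value, 4))
```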

68
Confidence intervals
  • Hypothesis testing: A coffee machine is supposed
    to deliver 8 ounces of coffee in a cup, but in my
    sample of 10 cups I get only 7.5 ounces. Is this
    ok?
  • Confidence intervals: My sample of 10 cups of
    coffee contains on average 7.5 ounces of liquid.
    What is the likely estimate for the mean amount
    of coffee per cup?
  • Hypothesis testing and construction of confidence
    intervals are mutually inverse problems

69
Confidence intervals
  • Parameter = Estimate $\pm$ critical value $\times$ SE,
  • SE: standard error

70
Critical value
(Figure: standard normal density; central area 0.95, tail areas 0.025 each, critical value z = 1.96)
71
Problem 19 revisited
  • A patient claims that he consumes only 2000
    calories per day, but a dietician suspects that
    the actual figure is higher. The dietician
    checked his food intake for 30 days and found
    that the 30-day-mean is more than 2100 calories.
    What is the 95% confidence interval for the
    number of calories in the patient's diet?
  • Assume standard deviation of 350 calories per
    day.

72
Problem 22 revisited
  • A chamber of commerce volunteer is interested in
    the percentage of visiting tourists staying for
    more than 2 days in a certain hotel. He surveyed
    100 tourists and found that 18 of them stay
    longer than 2 days. What is the 99% confidence
    interval for the percentage of visiting tourists
    who stay for more than 2 days?

73
Problem 24
  • In a random sample of 300 high school students,
    225 said they managed time effectively, while in
    a similar sample of 270 college students, only
    108 felt they were effective time managers. What
    is a 99% confidence interval estimate for the
    difference between the proportions of high school
    and college students who think they manage time
    effectively?
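
Problem 24 follows the "estimate plus or minus critical value times SE" pattern for a difference of two proportions. A minimal sketch with illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 225, 300     # high school students
x2, n2 = 108, 270     # college students
p1, p2 = x1 / n1, x2 / n2

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = NormalDist().inv_cdf(1 - 0.01 / 2)   # two-sided 99% critical value

lower = (p1 - p2) - z_crit * se
upper = (p1 - p2) + z_crit * se
print(round(lower, 3), round(upper, 3))
```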

74
Problem 25
  • A medical researcher believes that taking 1000
    milligrams of vitamin C per day will result in
    fewer colds than a daily intake of 500 milligrams
    will. In a group of 50 volunteers taking 1000
    milligrams per day, the numbers of colds per
    individual during a winter season averaged 1.8
    with a variance of 1.5. Similar data from a group
    of 60 volunteers taking 500 milligrams per day
    showed an average of 2.4 with a variance of 1.6.
    What was the P-value of this test?

75
How do we get $\sigma$?
  • Population standard deviation is usually unknown
  • If sample size is large ($n > 40$) then we can assume
    that the sample standard deviation (s)
    approximates the population standard deviation
    ($\sigma$) well enough
  • If sample size is small then this assumption is
    no longer valid, i.e., sampling error in the
    estimation of $\sigma$ cannot be ignored

76
Known vs. unknown $\sigma$
$\sigma$ known: z
$\sigma$ unknown, small sample: t
$\sigma$ unknown, large sample: z
77
Student t-distribution
78
Student t-distribution
  • Student t-distribution has one parameter called
    degrees of freedom
  • When the number of degrees of freedom is large,
    the t-distribution is close to z-distribution

79
t-distribution table
Degrees of freedom = sample size - 1
80
Problem 26
  • An article ("Undergraduate Marijuana use and
    Anger" by Sue Stoner) in a 1988 issue of the
    Journal of Psychology (Vol. 122, p. 33) reported
    that in a sample of 17 marijuana users the mean
    and standard deviation on an anger expression
    scale were 42.72 and 6.05, respectively. Test
    whether this result is significantly greater than
    the established mean of 41.6 for nonusers. What
    assumptions are necessary for the above test to
    be valid?

81
T-test assumptions
  • Random sampling (like in z-test)
  • Normal population (unlike z-test, where sample
    mean is automatically normal regardless of the
    population when sample size is large)
  • Degrees of freedom = number of independent
    observations (actually, residuals)

82
Problem 27
  • A hospital exercise laboratory technician notes
    the resting pulse rates of five joggers to be 60,
    58, 59, 61, and 67, respectively, while the
    resting pulse rates of seven non-exercisers are
    83, 60, 75, 71, 91, 82, and 84, respectively.
    Establish a 99% confidence interval estimate for
    the difference in pulse rates between joggers and
    non-exercisers.
  • (Means and standard deviations are 61, 78, 3.54,
    and 10.23, respectively)

83
Equal variances assumption
  • Assume that both populations have the same
    standard deviation (i.e., amount of exercise
    affects mean of the population, not its standard
    deviation)

d.f. = $\min\{n, m\} - 1$
d.f. = $n + m - 2$
84
Problem 27 revisited
  • A hospital exercise laboratory technician notes
    the resting pulse rates of five joggers to be 60,
    58, 59, 61, and 67, respectively, while the
    resting pulse rates of seven non-exercisers are
    83, 60, 75, 71, 91, 82, and 84, respectively.
    Establish a 99% confidence interval estimate for
    the difference in pulse rates between joggers and
    non-exercisers. Assume equal variances.
  • (Means and standard deviations are 61, 78, 3.54,
    and 10.23, respectively)
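
Both versions of Problem 27 can be sketched side by side. The t critical values are hard-coded from a standard t-table (an assumption of this sketch, since the Python standard library has no t-distribution), and pairing d.f. = min{n, m} - 1 with the unpooled standard error is the conservative textbook convention.

```python
from math import sqrt
from statistics import mean, variance

joggers = [60, 58, 59, 61, 67]
non_exercisers = [83, 60, 75, 71, 91, 82, 84]
n, m = len(joggers), len(non_exercisers)
diff = mean(non_exercisers) - mean(joggers)
v1, v2 = variance(joggers), variance(non_exercisers)   # sample variances

# Upper 0.5% points of Student's t, copied from a table (not computed here)
T_DF4, T_DF10 = 4.604, 3.169

# Without the equal-variance assumption: d.f. = min(n, m) - 1 = 4
se_unpooled = sqrt(v1 / n + v2 / m)
ci_unpooled = (diff - T_DF4 * se_unpooled, diff + T_DF4 * se_unpooled)

# With equal variances: pooled variance, d.f. = n + m - 2 = 10
pooled_var = ((n - 1) * v1 + (m - 1) * v2) / (n + m - 2)
se_pooled = sqrt(pooled_var * (1 / n + 1 / m))
ci_pooled = (diff - T_DF10 * se_pooled, diff + T_DF10 * se_pooled)
print(ci_unpooled, ci_pooled)
```

Under these assumptions the pooled interval is narrower, mainly because it is allowed more degrees of freedom.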

85
Problem 28
  • A researcher believes a new diet should improve
    weight gain in laboratory mice. If ten control
    mice on the old diet gain an average of 4 ounces
    with a standard deviation of 0.3 ounces, while
    the average gain for the ten mice on the new diet
    is 4.8 ounces with a standard deviation of 0.2
    ounces, what is the p-value?

86
Dependent samples
  • Trace metals in drinking water wells affect the
    flavor of the water and unusually high
    concentrations can pose a health hazard. In the
    paper, Trace Metals of South Indian River
    Region (Environmental Studies, 1982, 62-6),
    trace metal concentrations (mg/L) on zinc were
    found from water drawn from the bottom and the
    top of each of 6 wells.

87
Dependent samples
One sample t-test
88
FAQs
  • Do I have to divide by square root of n?
  • Yes, if you are looking for $P(\bar{X} > 100)$
  • No, if you are looking for $P(X > 100)$
  • Do I have to divide by square root of n in
    one-proportion or two-proportion tests?
  • No. If you use Standard Error, it already
    contains the square root of n
  • When I compute standard deviation from the
    sample, do I have to divide it by square root of
    n?
  • Yes, if your calculations involve sample mean.

89
Common misconception
  • Sample standard deviation is an estimator for the
    population standard deviation
  • Standard deviation of the sampling distribution
    is smaller than the population standard deviation
  • Sample standard deviation is NOT an estimator for
    the standard deviation of the sampling
    distribution

90
Estimation of $\sigma$
91
Chi-square table
92
Problem 29
  • A supplier of 100 ohm/cm silicon wafers claims
    that his fabrication process can produce wafers
    with sufficient consistency so that the standard
    deviation of resistance for the lot does not
    exceed 10 ohm/cm. A sample of 10 wafers taken
    from the lot has a standard deviation of 13.97
    ohm/cm. Is the supplier's claim reasonable?

93
Problem 30
  • A container of oil is supposed to contain 1000 ml
    of oil. We want to be sure that the standard
    deviation of the oil container is less than 20
    ml. We randomly select 10 cans of oil with a mean
    of 997 ml and a standard deviation of 32 ml.
    Using this sample, construct a 95% confidence
    interval for the true value of sigma. Does the
    confidence interval suggest that the variation in
    oil containers is at an acceptable level?

94
Estimation of sample size
  • What is a minimum sample size needed to estimate
    the population mean within 2 units?
  • What is a minimum sample size needed to estimate
    the population proportion within 2 percent units?

95
Problem 31
  • An electrical firm which manufactures a certain
    type of bulb wants to estimate its mean life.
    Assuming that the life of the light bulb is
    normally distributed and that the standard
    deviation is known to be 40 hours, how many bulbs
    should be tested so that we can be 90 percent
    confident that the estimate of the mean will not
    differ from the true mean life by more than 10
    hours?

96
Problem 32
  • A quality control engineer wants to estimate the
    fraction of defective bulbs in a large lot of
    light bulbs. From past experience, he feels that
    the actual fraction of defective bulbs should be
    somewhere around 0.2. How large a sample should
    be taken if he wants to estimate the true
    fraction within 0.02 using a 95% confidence
    interval?

97
Problem 33
  • Many television viewers express doubts about the
    validity of certain commercials. Let p represent
    the true proportion of consumers who believe what
    is shown in Timex television commercials. If
    Timex has no prior information regarding the true
    value of p, how many consumers should be included
    in their sample so that they will be 85%
    confident that their estimate is within 0.03 of
    the true value of p?
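
The sample-size questions in Problems 31 to 33 come from solving "critical value times SE is at most the margin" for n. The helper names below are illustrative; using p = 0.5 when nothing is known about p is the usual worst-case convention.

```python
from math import ceil
from statistics import NormalDist

def z_two_sided(confidence):
    """Critical value z with central area `confidence` under N(0, 1)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma, margin, confidence):
    return ceil((z_two_sided(confidence) * sigma / margin) ** 2)

def n_for_proportion(margin, confidence, p=0.5):
    # p = 0.5 maximizes p * (1 - p), so it is the safe default
    return ceil(z_two_sided(confidence) ** 2 * p * (1 - p) / margin ** 2)

n31 = n_for_mean(sigma=40, margin=10, confidence=0.90)
n32 = n_for_proportion(margin=0.02, confidence=0.95, p=0.2)
n33 = n_for_proportion(margin=0.03, confidence=0.85)
print(n31, n32, n33)
```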

98
Statistics
  • Part III Contingency tables

99
Non-parametric hypotheses
  • H0: features are independent
  • HA: features are dependent

A restaurant owner surveys a random sample of 385
customers to determine whether customer
satisfaction is related to gender and age.
100
Assumption of independence
If gender/age and satisfaction were independent
then P(satisfied and young male) =
P(satisfied) $\times$ P(young male)
P(satisfied) = 302/385
P(young male) = 33/385
P(satisfied and young male) = $302 \cdot 33 / 385^2$
Expected number of satisfied young males = $302 \cdot 33 / 385$
101
Observed and Expected
Observed
Expected
102
Chi-square test for independence
d.f. = $(n - 1) \times (m - 1)$
103
Problem 34
  • A sociologist conducts a test whether there is a
    relationship between cheating on exams and
    socioeconomic status. A random sample of 750 high
    school students yields the following results
  • What is the conclusion about cheating and
    socioeconomic status at the 5% significance level?

104
Chi-square goodness of fit
  • A grocery store manager wishes to determine
    whether a certain product will sell equally well
    in any of the five locations in the store. Five
    displays are set up, one for each location, and
    the resulting numbers of the product sold are
    noted
  • Is there enough evidence to claim a difference?

105
Chi-square goodness of fit
Total: 43 + 29 + ... + 48 = 206
We expect 206/5 = 41.2 units sold in each location
H0: The distribution is uniform
HA: The distribution is not uniform
d.f. = n - 1
106
Problem 35
  • A geneticist claims that four species of fruit
    flies should appear in the ratio of 1:3:3:9.
    Suppose that a sample of 4000 fruit flies
    contained 226, 764, 733, and 2277 flies of each
    species, respectively. At the 10% significance
    level, is there sufficient evidence to reject the
    geneticist's hypothesis?
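
The goodness-of-fit statistic for Problem 35 can be computed directly. The critical value is taken from a standard chi-square table rather than computed, which is an assumption of this sketch.

```python
observed = [226, 764, 733, 2277]
ratio = [1, 3, 3, 9]

total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]      # 250, 750, 750, 2250

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Upper 10% point of chi-square with 3 d.f., copied from a table
CRITICAL_10PCT_DF3 = 6.251
reject = chi2 > CRITICAL_10PCT_DF3
print(round(chi2, 3), df, reject)
```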

107
Chi-square test warning
  • Chi-square test is applicable only if the
    expected value in each cell is greater than 5
    (Compare to Binomial Distribution)
  • If this doesn't hold, you might find Fisher exact
    test more useful

108
Problem 36
  • A sample of teenagers might be divided into male
    and female on the one hand, and those that are
    and are not currently dieting on the other. We
    hypothesize, perhaps, that the proportion of
    dieting individuals is higher among the women
    than among the men, and we want to test whether
    any difference of proportions that we observe is
    significant.

Expected < 5
109
Fisher exact test
Hypergeometric Distribution
110
Statistics
  • Part IV Regression and ANOVA

111
The least squares line
  • A simple data set consists of data pairs
    $(x_i, y_i)$, $i = 1, \dots, n$,
  • where $x_i$ is an independent variable and $y_i$ is a
    dependent variable
  • The model function has the form
  • $y = a + bx$
  • We wish to find a and b for which the model
    "best" fits the data.

112
Residuals
  • The least squares method defines "best" as when
    $S = \sum r_i^2$ is at a minimum.
  • A residual $r_i$ is defined as the difference
    between the values of the dependent variable and
    the predicted values from the estimated model
  • $r_i = y_i - (a + b x_i)$
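
A small sketch of the least squares fit follows. It uses the standard closed-form solution for a and b; the data points are hypothetical and chosen only for illustration, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing S = sum of squared residuals for y = a + b x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, not taken from the slides
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
S = sum(r * r for r in residuals)
print(round(a, 3), round(b, 3), round(S, 4))
```

With an intercept in the model, the residuals of the fitted line sum to zero up to rounding.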

113
Regression Line
  • Residuals are shown by blue lines
  • Sum of squares of the residuals is at minimum

114
Residual plot
  • The sum of the residuals is always zero
  • A pattern in the residual plot indicates that a
    non-linear model should be used

115
Influential scores and outliers
  • In regression, an outlier is a data point with
    large residual
  • An influential score is the data point which
    significantly influences the regression line
  • If an influential score is removed from the
    sample, the regression line will change
    significantly

116
Problem 37
  • Which of the five points is an outlier, and which
    is an influential score?

117
Solving the regression
118
Regression slope and intercept
119
Correlation Coefficient
  • The correlation coefficient indicates the degree
    of linear dependence

120
Correlation and slope
121
Coefficient of determination
  • SST: total sum of squares
  • SSX: sum of squares explained by X
  • SSE: sum of squares of residuals
  • SST = SSX + SSE
  • The square of the sample correlation coefficient,
    which is also known as the coefficient of
    determination, is the fraction of the variance in
    y that is accounted for by a linear fit of x

122
Sums of squares
red
blue
123
SE of the regression slope
  • The regression line is a result of random
    sampling
  • Different samples produce different lines
  • There is a family of lines for the given
    population; you get just one

124
SE of the regression slope
where $s_e$ is the standard deviation of the
regression error
125
Problem 38
  • What is the equation of the fitted line?
  • Find an approximate confidence interval for the
    regression slope
  • Test the hypothesis that the slope is non-zero

126
Problem 39
Find the regression line and a 95% confidence
interval for the regression slope.
127
Confidence vs. prediction intervals
  • Suppose I fuel my car 7 days a week, from Sunday
    to Sunday, each day at a randomly chosen gas
    station. I get a sample of gasoline prices for 7
    days
  • Confidence interval is for the average gasoline
    price on Monday
  • Prediction interval is for a gasoline price at a
    randomly chosen gas station on Monday

128
Confidence vs. prediction intervals
  • Confidence interval
  • Prediction interval

129
Problem 39 revisited
Find a 95% prediction interval for the dive
duration at 25 degrees Celsius
130
ANOVA: Analysis of Variance
  • A collection of models, in which the variance of
    the observed set is partitioned into components
    due to explanatory variables
  • Assumptions
  • Independence of observations
  • The distributions in each of the groups are
    normal
  • Variance homogeneity, called homoscedasticity:
    the variance of data in groups should be the
    same.

131
One-way ANOVA
  • A manager wishes to determine whether the mean
    times required to complete a certain task differ
    for the three levels of employee training. He
    randomly selected 10 employees with each of the
    three levels of training.
  • Do the data provide sufficient evidence to
    indicate that the mean times required to complete
    a certain task differ for at least two of the
    three levels of training?

132
Steiner's Theorem
a
xi
133
Problem 40
  • Three different milling machines were being
    considered for purchase by a manufacturer.
    Potentially, the company would be purchasing
    hundreds of these machines, so it wanted to make
    sure it made the best decision. Initially, five
    of each machine were borrowed, and each was
    randomly assigned to one of 15 technicians (all
    technicians were similar in skill). Each machine
    was put through a series of tasks and rated using
    a standardized test. The higher the score on the
    test, the better the performance of the machine.
    The data are

134
Partition of sum of squares
  • SST = SSA + SSE
  • SST: total sum of squares
  • SSA: sum of squares for factor A
  • SSE: sum of squares of errors

135
Partition of sum of squares
136
The ANOVA table
  • SSA: Sum of squares, Factor
  • SSE: Sum of squares, Error
  • MSA: Mean sum of squares, Factor
  • MSE: Mean sum of squares, Error
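
The entries of a one-way ANOVA table can be computed by hand as sketched below. The three groups are hypothetical numbers made up for the illustration (the problem's own data table is not reproduced here), and the F statistic is formed as MSA / MSE.

```python
# Hypothetical scores for three groups of five, for illustration only
groups = [
    [24, 25, 26, 27, 28],
    [30, 31, 32, 33, 34],
    [22, 23, 24, 25, 26],
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

sst = sum((x - grand_mean) ** 2 for x in all_values)                      # total
ssa = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))  # factor
sse = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)  # error

df_a = len(groups) - 1
df_e = len(all_values) - len(groups)
msa, mse = ssa / df_a, sse / df_e
f_stat = msa / mse
print(round(sst, 3), round(ssa, 3), round(sse, 3), round(f_stat, 3))
```

The F statistic is then compared with the Fisher distribution with (df_a, df_e) degrees of freedom.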

137
Fisher distribution
138
Problem 40 solution
In Excel: Tools -> Data Analysis -> Single Factor
ANOVA
139
Two-way ANOVA
  • Group A is given vodka, Group B is given gin, and
    Group C is given a placebo. Groups are tested
    with a memory task: one-way ANOVA
  • In an experiment testing the effects of
    expectations, subjects are randomly assigned to
    four groups:
  • expect vodka, receive vodka
  • expect vodka, receive placebo
  • expect placebo, receive vodka
  • expect placebo, receive placebo
  • Each group is then tested on a memory task:
    two-way ANOVA

140
Partition of sum of squares
  • SST = SSA + SSB + SSE
  • SST: total sum of squares
  • SSA: sum of squares for factor A
  • SSB: sum of squares for factor B
  • SSE: sum of squares of errors

141
Partition of sum of squares
142
Problem 41
  • Three different milling machines were being
    considered for purchase by a manufacturer.
    Machines are operated by 5 different crew
    technicians

143
What is the error term?
144
Two-way ANOVA table
  • At the 5% level
  • The variation across rows (crew technicians) is
    NOT significant
  • The variation across columns (machines) is
    significant

145
Problem 42
  • Some varieties of nematodes feed on the roots of
    lawn grasses and crops such as strawberries and
    tomatoes. Four brands of nematocides are to be
    compared. Twelve plots of land of comparable
    fertility that were suffering from nematodes were
    planted with a crop. The yields of each plot were
    recorded and part of the ANOVA table appears
    below
  • Find the value of F

146
THE END
147
Extra Problems
  • All bags entering a research facility are
    screened. Ninety-seven percent of the bags that
    contain forbidden material trigger an alarm.
    Fifteen percent of the bags that do not contain
    forbidden material also trigger the alarm. If 1
    out of every 1,000 bags entering the building
    contains forbidden material, what is the
    probability that a bag that triggers the alarm
    will actually contain forbidden material?

148
Extra problems
  • Pepper plants watered lightly every day for a
    month show an average growth of 27 cm with the
    standard deviation of 8.3 cm, while pepper plants
    watered heavily once a week for a month show an
    average growth of 29 cm with the standard
    deviation of 7.9 cm. In a sample of 60 plants,
    half of which were given each of the water
    treatments, what is the probability that the
    difference in average growth between the two
    halves is between -3 and 3 cm?
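
The pepper-plant problem can be sketched by treating the difference of the two sample means as normal. Using the quoted standard deviations as if they were the population values, and 30 plants per treatment, are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

n = 30                                  # plants per treatment (half of 60)
mean_diff = 27 - 29                     # light watering minus heavy watering
se_diff = sqrt(8.3**2 / n + 7.9**2 / n)

diff_dist = NormalDist(mean_diff, se_diff)
p_between = diff_dist.cdf(3) - diff_dist.cdf(-3)
print(round(se_diff, 3), round(p_between, 3))
```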