Continuous Probability Distributions

Transcript and Presenter's Notes

Title: Continuous Probability Distributions


1
Chapter 8
  • Continuous Probability Distributions

2
Continuous Probability Distributions
  • Continuous probability distributions are
    typically associated with ratio scales
  • Height: how likely is it that a child in the
    class is 1.7 meters tall?
  • Finance: what are the chances that the ratio of
    First and Second Quarter profits will be ≥ 1.25?
  • Vision Science: at what wavelength (measured in
    nm) in the electromagnetic spectrum are the M
    human photoreceptors maximally receptive?
  • Physics: the magnetic moment of the electron is
    1.001159652201 ± 0.000000000030

3
  • Similar to discrete probability distributions,
    continuous probability distributions identify the
    events in a probability space with sets of
    numbers on the number line.

4
  • A function f is a probability density function
    (pdf) if
  • f(x) ≥ 0, for every number x, and
  • the total area under f is 1: ∫ f(x) dx = 1,
    where the integral is taken over the whole real
    line.

5
  • If f is the pdf of a continuous distribution,
    then F, defined by F(a) = ∫_{-∞}^{a} f(x) dx, is
    the cumulative distribution function (cdf) of
    that distribution

6
  • We define the probability of the event of the
    random variable yielding a value less than some
    number a as
  • Pr(X < a) = F(a)
  • Similarly, the probability of X being greater
    than a is
  • Pr(X > a) = 1 - F(a)

7
  • We define the probability of the event of the
    random variable yielding a value in the interval
    [a, b] as
  • Pr(a ≤ X ≤ b) = F(b) - F(a)
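(As an aside, not from the original slides, which use Eviews: a minimal Python sketch of these three rules, using the standard normal cdf from scipy.stats as a stand-in for F.)

```python
from scipy.stats import norm  # standard normal N(0, 1) as an example distribution

F = norm.cdf  # its cdf plays the role of F

a, b = 0.5, 1.0
print("Pr(X < a)       =", F(a))          # F(a)
print("Pr(X > a)       =", 1 - F(a))      # 1 - F(a)
print("Pr(a <= X <= b) =", F(b) - F(a))   # F(b) - F(a)
```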

8
  • We would like to understand a continuous
    probability distribution like

9
  • So let's approximate it with
  • We ignore the extreme values whose absolute
    value exceeds 4
  • We use cell marks (cf. chap. 3) to estimate the
    probability of falling within a given range

10
  • Because of the way we take cell marks, with only
    a few categories, our accuracy is limited
  • With more categories, our accuracy improves

11
  • Now we can figure out the probabilities of being
    in one of these categories
  • Just as in the previous chapter, we can represent
    these probabilities precisely and completely with
    a histogram.
  • At this point it is crucial to remember that
    histograms express probabilities of events (i.e.,
    the probability of being in one of these
    intervals) as the area of the histogram
    corresponding to the event.

12
  • The probability that X yields a value between .5
    and 1

13
  • In probability theory, we demand that the total
    area of the histogram = 1
  • (This contrasts with how Eviews handles sample
    distributions.)
  • So we have
  • Let's check our accuracy on Eviews.

14
  • We now have a probability distribution for 9
    possible categories
  • Each category is an interval of possible values
  • We trimmed off the extremities: the values
    greater than 4 or less than -4.
  • We'll leave these extremities alone for now.
  • But why stop with just 9 categories?
  • Let's make a more fine-grained histogram, one
    with 20 categories
  • Howzabout 40 categories? 100? 1000?

15
  • If you remember your calculus, you can see what
    we're doing here.
  • We are creating increasingly fine-grained
    (discrete) approximations of a continuous curve.
  • We finish off (this part of) our project by going
    whole-hog
  • We don't stop with n = 100, or n = 10 million
  • Instead, we let n go to infinity.
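(A sketch, not from the slides: the refinement step can be checked numerically. Assuming the curve being approximated is the standard normal pdf, as the (-4, 4) range suggests, the grouped-data bar heights get closer to f as the number of categories grows.)

```python
import numpy as np
from scipy.stats import norm

a, b = -4.0, 4.0
for n in (9, 20, 40, 100, 1000):
    edges = np.linspace(a, b, n + 1)                         # n equally spaced cells on (a, b)
    cell_prob = norm.cdf(edges[1:]) - norm.cdf(edges[:-1])   # probability of landing in each cell
    heights = cell_prob / np.diff(edges)                     # bar height = probability / length
    mids = (edges[:-1] + edges[1:]) / 2                      # cell marks
    gap = np.max(np.abs(heights - norm.pdf(mids)))
    print(f"n = {n:4d} categories: worst gap between bar height and f = {gap:.6f}")
```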

16
  • Let's look at this situation a bit more
    carefully.
  • For any number n you like (for ease, let's assume
    n > 10):
  • We create a partition of the interval (a, b) by
    specifying n + 1 points c0, c1, ..., cn, all
    equally spaced apart
  • Thus a = c0, b = cn, and for every ci (i ≤ n),
    ci = a + i(b - a)/n

17
  • Thus, intuitively speaking, our probability
    distribution (leaving out the extremities for
    now) turns out to be represented by the histogram
    of the grouped data for n groups, but with n → ∞,
    and each category containing a single number.

18
  • Let's use Length(ci) and Height(ci) to denote the
    length of category ci (= ci - ci-1) and the
    height of the bar associated with ci.

19
  • In our current example, a = -4 and b = 4.
  • f is the pdf of the continuous distribution
  • It characterizes how probabilities are
    distributed across the infinitely many numbers in
    the interval (a, b).
  • It replaces the probability function Pr(ci-1 <
    X ≤ ci) used in our discrete distributions.

20
Extending the distribution to the entire line
of real numbers
  • Let's now turn to those numbers outside of (a, b)
    that we've ignored so far
  • To make the situation visually more obvious,
    let's pretend we were working with the interval
    (-1.5, 1.5), instead of (-4, 4).

21
  • So far, we've seen how to go from

22
  • to

23
  • Notice that by working with (-1.5, 1.5) our
    estimate of the curve is forced to be less
    accurate, because the area under the curve must
    still be 1.

24
  • But now what about those extremities that we've
    been ignoring?
  • We want our theory to allow every number to be a
    possible value, not just those between a and b.
  • So we need to extend our theory just a little bit
    more
  • We will do what we just did, but we will extend
    each boundary by some quantity m:
  • (a - m), (b + m)
  • E.g. (-1.5 - .5), (1.5 + .5)
  • So our new interval will be (-2, 2)

25
  • Now we go from

26
  • to

27
  • Notice also how our approximation improves

28
  • Let's make m even larger, and go from

29
  • to

30
  • Now our approximation is getting pretty good

31
  • The remaining probabilities that we haven't yet
    accounted for,
  • Pr(X ≤ -3) and Pr(X ≥ 3),
  • are rather small, but that doesn't matter here
  • We can continue extending our probability space
    by setting
  • m = 5
  • m = 6
  • m = 60
  • m = 10,000

32
  • In short, we go from

33
  • to

34
  • Let's examine three features of the pdf f:
  • 1. Our construction of f ensures that
    Pr(X = c) = 0, for any number c.
  • 2. f is a derivative.
  • 3. f is not a probability function.

35
  • Some Preliminaries.
  • Recall that, for grouped data, the probability of
    landing in a cell is the area of its bar:
    Pr(ci-1 < X ≤ ci) = Length(ci) × Height(ci)

36
  • More specifically, for any appropriate n and i,
    such as n = 100 and i = 32:
    Pr(c31 < X ≤ c32) = Length(c32) × Height(c32)

37
  • 1. Pr(X = k) = 0 for all numbers k.
  • Earlier we showed that Pr(ci-1 < X ≤ ci) =
    Length(ci) × Height(ci)
  • Hence, for the cell ci containing k,
    Pr(X = k) ≤ Length(ci) × Height(ci)

38
But as n gets very large, the length of every
cell ci (= ci - ci-1) gets very small, so this
bound, and with it Pr(X = k), shrinks to 0
39
  • 2. f is a derivative
  • From our Preliminaries, we have Pr(ci-1 < X ≤ ci)
    = Length(ci) × Height(ci)
  • Hence Height(ci) = Pr(ci-1 < X ≤ ci) / Length(ci)

40
  • Recall that Pr(ci-1 < X ≤ ci) = F(ci) - F(ci-1)
  • Notice also that Length(ci) = ci - ci-1
  • So we can argue that Height(ci) =
    (F(ci) - F(ci-1)) / (ci - ci-1)

41
  • So, in conclusion, in the limit f(x) equals the
    limit of (F(ci) - F(ci-1)) / (ci - ci-1) as the
    cells shrink. But this means that f is a
    derivative

42
  • Question: Is it possible to put this last
    equality in the form for derivatives given by my
    calculus book, F′(x) = lim h→0 [F(x + h) - F(x)] / h?
  • (Here, f = F′.)

43
  • Let h = (b - a)/n, the common length of the cells.
  • So h is determined by n, and as n gets large, h
    gets arbitrarily small.
  • For each h, we can define a function of x equal
    to the difference quotient [F(ci) - F(ci-1)] / h,
  • where ci-1 < x ≤ ci.
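(A sketch, not from the slides: numerically, the calculus-book difference quotient of F converges to f as h shrinks. Here the standard normal cdf and pdf from scipy.stats stand in for F and f.)

```python
from scipy.stats import norm

F, f = norm.cdf, norm.pdf   # cdf and pdf of N(0, 1), standing in for F and f
x = 0.7                     # any point will do

for h in (0.5, 0.1, 0.01, 0.001):
    diff_quot = (F(x + h) - F(x)) / h            # [F(x + h) - F(x)] / h
    print(f"h = {h:6.3f}: difference quotient = {diff_quot:.6f}, f(x) = {f(x):.6f}")
```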

44
  • Now we define
  • So we have

45
(No Transcript)
46
  • There is another way that we can tell that f is a
    derivative
  • From the Fundamental Theorem of Calculus, we have
    the relationship F(x) = ∫_{-∞}^{x} f(t) dt, so
    F′ = f.

47
  • 3. f is not a probability function.
  • Notice that pdfs take single numbers as their
    arguments; probability functions take sets of
    numbers as their arguments.

48
  • A concrete (counter-)example:
  • Sometimes f takes on values greater than 1.
  • Probability functions, by definition, cannot do
    this!
  • But for any 0 ≤ a ≤ b ≤ 1, Pr(a ≤ X ≤ b) ≤ 1
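(The slide's own counter-example is in a figure; as an illustrative stand-in, the uniform distribution on (0, 0.5) has density 2 everywhere on that interval, yet every probability it assigns is at most 1.)

```python
from scipy.stats import uniform

X = uniform(loc=0, scale=0.5)                           # uniform distribution on (0, 0.5)
print("density at x = 0.25:", X.pdf(0.25))              # 2.0 -- a pdf may exceed 1
print("Pr(0.1 <= X <= 0.3):", X.cdf(0.3) - X.cdf(0.1))  # 0.4 -- probabilities never do
```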

49
The Uniform Distribution
  • The uniform distribution is simple but important.
  • The uniform distribution over the interval (a,
    b) is defined by the pdf f(x) = 1/(b - a) for
    a < x < b, and f(x) = 0 otherwise.

50
Here is the uniform distribution on (0, 1)
51
Here is the uniform distribution on (-2, 14)
52
Here are the cdfs of the two distributions. Why
is the cdf F(x) = (x - a)/(b - a) (for a < x < b)?
53
Here are the cdfs of the two distributions.
54
In general, the cdf of U(a, b) (i.e., the uniform
distribution on the interval from a to b) is
F(x) = 0 for x ≤ a, F(x) = (x - a)/(b - a) for
a < x < b, and F(x) = 1 for x ≥ b.
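(A minimal sketch, not from the slides: hand-rolled pdf and cdf for U(a, b), checked against scipy.stats.uniform.)

```python
import numpy as np
from scipy.stats import uniform

def unif_pdf(x, a, b):
    """pdf of U(a, b): 1/(b - a) inside the interval, 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x > a) & (x < b), 1.0 / (b - a), 0.0)

def unif_cdf(x, a, b):
    """cdf of U(a, b): 0 below a, (x - a)/(b - a) inside, 1 above b."""
    x = np.asarray(x, dtype=float)
    return np.clip((x - a) / (b - a), 0.0, 1.0)

a, b = -2.0, 14.0
xs = np.array([-3.0, 0.0, 6.0, 20.0])
print(unif_cdf(xs, a, b))
print(uniform(loc=a, scale=b - a).cdf(xs))   # scipy agrees
```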
55
  • The uniform distribution is useful in cases where
    a number is known (or assumed) to fall within a
    definite finite interval, and you have no further
    information about what that number might be
  • Since you have no reason to treat one number as
    more likely than any other, you give them all the
    same density

56
  • The uniform distribution appears when all the
    data must appear in some fixed interval, but
    there is absolutely no further information or
    structure that would bias the random variable
    to take one value rather than another.

57
Example
  • The uniform distribution is often used as a kind
    of null or default hypothesis regarding the
    distribution of probabilities within a
    population.
  • E.g., in a situation where people have varying
    degrees of tendency to visit McDonald's over
    Burger King, the least informative hypothesis
    would be a uniform distribution of probabilities
    (on the interval [0, 1])

58
  • The uniform distribution on (a, b) is a
    probability distribution, since f(x) ≥ 0 and the
    area under f is (b - a) × 1/(b - a) = 1.

59
Expectations
  • Expectations are defined similarly to those for
    discrete random variables.
  • If X is a continuous random variable whose pdf is
    f, then E(X) = ∫_{-∞}^{∞} x f(x) dx.

60
Expectations
  • Using this definition, we can also define the
    variance, standard deviation, etc. of X: e.g.,
    Var(X) = E[(X - μ)²] = ∫ (x - μ)² f(x) dx.

61
Expectations
  • You should be able to calculate that if X ~ U(a,
    b), then E(X) = (a + b)/2 and
    Var(X) = (b - a)²/12.
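(A sketch, not from the slides: the definitions of E(X) and Var(X) as integrals can be checked numerically for U(a, b) against the closed forms above.)

```python
from scipy import integrate

a, b = -2.0, 14.0
pdf = lambda x: 1.0 / (b - a)            # pdf of U(a, b) on (a, b)

mean, _ = integrate.quad(lambda x: x * pdf(x), a, b)
var, _ = integrate.quad(lambda x: (x - mean) ** 2 * pdf(x), a, b)

print(mean, (a + b) / 2)                 # both 6.0
print(var, (b - a) ** 2 / 12)            # both 21.33...
```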

62
Expectations
  • Importantly, everything we have proven about
    expectations for discrete random variables holds
    for continuous random variables.
  • The linearity of expectations holds: e.g.,
    E(aX + bY) = aE(X) + bE(Y).

64
Expectations
  • Whether we use the linearity of expectations or
    calculate directly from our definitions, for any
    continuously distributed random variable X, whose
    mean is μ and standard deviation is σ, we have
    E(Z) = 0 and Var(Z) = 1,
  • where Z = (X - μ)/σ is the standardization of X.

65
The Normal Distribution
  • The normal distribution (aka the Gaussian
    distribution) is probably the most common
    distribution in all of science.

N(0, 1)
66
  • The pdf of the normal distribution is
    f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²))
  • In the case where μ = 0, σ = 1, this equation
    simplifies to (and has a special name, the
    standard normal pdf φ):
    φ(x) = (1 / √(2π)) exp(-x² / 2)

67
  • The cdf of the normal distribution is
    F(x) = ∫_{-∞}^{x} f(t) dt (there is no simpler
    closed form)
  • In the case where μ = 0, σ = 1, this function has
    a special name, the standard normal cdf Φ.
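(A sketch, not from the slides: the standard normal pdf written out by hand, and its cdf obtained by numerically integrating that pdf, checked against scipy.stats.norm.)

```python
import math
from scipy import integrate
from scipy.stats import norm

def phi(x):
    """Standard normal pdf: (1 / sqrt(2*pi)) * exp(-x**2 / 2)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf, via numerical integration of phi."""
    val, _ = integrate.quad(phi, -10, x)   # -10 is effectively -infinity here
    return val

print(phi(1.0), norm.pdf(1.0))   # both ~0.241971
print(Phi(1.0), norm.cdf(1.0))   # both ~0.841345
```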

68
  • We often use expressions like
  • N(μ, σ²),
  • which is shorthand for "the normal distribution
    with a mean of μ and a variance of σ²."
  • We also write things like
  • X ~ N(μ, σ²),
  • which is shorthand for "X is normally
    distributed, with a mean of μ and a variance of
    σ²."

69
X ~ N(0, 1)
70
X ~ N(3, 1)
71
X ~ N(6, 1)
72
X ~ N(16, 1)
73
X ~ N(3.1, 1)
74
X ~ N(0, 1)
75
X ~ N(0, 3)
76
X ~ N(0, 5)
77
  • X ~ N(0, 1), Y ~ N(0, 3), Z ~ N(0, 5)
  • For any N(μ, σ²), where is the high point of the
    pdf?

78
  • For any N(μ, σ²):
  • the standardized skew is 0
  • the standardized kurtosis is 3

79
  • Notice that the pdf for N(μ, σ²) can be seen as
    using the standardization of X:
    f(x) = (1/σ) φ(z),
  • where z = (x - μ)/σ.

80
  • It is easy to turn one normal distribution into
    another.
  • If X ~ N(0, 1), and Y = a + bX (b ≠ 0), then
  • Y ~ N(a, b²)
  • If X ~ N(μ, σ²), and Y = (X - μ)/σ, then
  • Y ~ N(0, 1)
  • If X ~ N(μ, σ²), and Y = a + bX (b ≠ 0), then
  • Y ~ N(bμ + a, (bσ)²)
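(A sketch, not from the slides: a quick simulation check of the last rule, Y = a + bX with X ~ N(μ, σ²) giving Y ~ N(bμ + a, (bσ)²).)

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)   # X ~ N(2, 9)

a, b = 5.0, -0.5
y = a + b * x                                         # Y = a + bX

print(y.mean(), b * mu + a)        # both ~4.0
print(y.var(), (b * sigma) ** 2)   # both ~2.25
```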

81
  • Two numbers that you will encounter are

1.64: For any normally distributed X, there is a
95% chance that the value of X will be less than
μ + (1.64 × σ)
82
  • Two numbers that you will encounter are

1.64: For any normally distributed X, there is a
95% chance that the value of X will be greater
than μ - (1.64 × σ)
83
  • Two numbers that you will encounter are

1.96: For any normally distributed X, there is a
95% chance that the value of X will be within
1.96 standard deviations of the mean.
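(A sketch, not from the slides: both numbers can be recovered from the standard normal quantile function in scipy.stats.)

```python
from scipy.stats import norm

print(norm.ppf(0.95))                      # ~1.645: Pr(X < mu + 1.64*sigma) ~ 0.95
print(norm.ppf(0.975))                     # ~1.960: the two-sided 95% cutoff
print(norm.cdf(1.96) - norm.cdf(-1.96))    # ~0.95:  Pr(within 1.96 sd of the mean)
```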
84
The Central Limit Theorem
  • The CLT is a big part of why the normal
    distribution is so important to science.
  • Given the way we typically do empirical science,
    by collecting random samples, in a certain sense,
    regardless of what distribution they are coming
    from, as the sample size gets large, the mean of
    the random sample becomes approximately normally
    distributed.

85
  • A random sample is a collection of n many i.i.d.
    random variables X1, ..., Xn
  • Notice the capital letters: these are random
    variables, not known quantities.
  • The i.i.d. part is important:
  • Independent and Identically Distributed
  • But let the distribution that they all have in
    common be any probability distribution in the
    world that you like
  • Then as n gets very large, the sum of the
    standardizations of the Xi's (divided by n^(1/2))
    approaches a normal distribution with a mean of 0
    and a variance of 1.

86
  • This is called the Central Limit Theorem.
  • More carefully, it says:
  • If X1, ..., Xn are i.i.d. random variables from a
    distribution with mean μ and variance σ², then
    (X1 + ... + Xn - nμ) / (σ n^(1/2)) is
    approximately distributed as N(0, 1) for large n.
  • In short, the Central Limit Theorem says that the
    variability in the whole of any (large) random
    sample is approximately distributed as N(0, 1).
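(A sketch, not from the slides: a simulation of the statement above with a decidedly non-normal starting distribution, the Exponential(1), whose mean and standard deviation are both 1.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 20_000
mu, sigma = 1.0, 1.0                                  # mean and sd of Exponential(1)

samples = rng.exponential(scale=1.0, size=(reps, n))  # i.i.d., very skewed Xi's
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))   # standardized sums

print(z.mean(), z.std())             # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))     # ~0.95, just as for N(0, 1)
```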

87
  • What this means is that if you randomly sample a
    population
  • children, cities, cancer patients, purchases,
    etc.,
  • and then standardize these measurements and add
    them up,
  • the result, divided by n^(1/2), can of course
    still vary some
  • (you could've sampled different children,
    patients, ...),
  • but it will nevertheless vary with a distribution
    that is similar to N(0, 1), especially as n gets
    large.

88
  • Notice that the only sources of uncertainty
    come from the i.i.d. random variables X1, ..., Xn.
  • Thus, the Central Limit Theorem tells us that the
    sum of all these random variables is
    approximately normally distributed
  • This distribution will be N(nμ, nσ²), not
    necessarily N(0, 1).

89
  • The CLT explains why normal distributions are
    fairly common
  • When a population is made up of individuals who
    are all of the same general type, and who differ
    from one another due to a large number of
    influences that are themselves mutually
    independent, the resulting population will often
    be (approximately) normally distributed.

90
  • The CLT explains why normal distributions are
    fairly common
  • E.g., people are roughly the same height, but
    heights differ due to many largely independent
    influences
  • diet, various genetic propensities, illness
    during adolescence, age, amputation, etc.
  • Thus, this population (of human heights) might
    naturally be modeled as a random variable
  • Y = Y1 + Y2 + ... + Yn, where the Yi's are these
    separate influences, and
  • Y is (approximately) normally distributed.

91
  • Since Y = Y1 + ... + Yn, the CLT tells us that Y
    will be approximately normally distributed
  • And if we know the mean and variance of the Yi's,
    then if we can approximate n, we can approximate
    the precise distribution of Y.
  • In our height example, the Yi's might be
  • Y1 = quality of diet during adolescence
  • Y2 = racial/ethnic background (on a good scale)
  • Y3 = degree of height propensity from some given
    genetic type
  • Y4 = amount of mercury in local water supply
  • Y5 = severity of measles in childhood.
  • Y6 = severity of mumps in childhood.
  • Etc.

92
  • More generally, if our measurement Y is the
    combination of some other variables, etc., along
    with the Xi's, then we may have a situation
    where, e.g.,
  • Y = a + bZ + (X1 + ... + Xn)
  • Y = a + bZ + ε
  • Here the single random variable (X1 + ... + Xn)
    is approximately normally distributed,
  • although Y and/or Z may not be.
  • The CLT is one of the reasons why the error in
    our models frequently turns out to be a random
    variable ε ~ N(0, σ²)
  • A nice visualization of this phenomenon is at
    http://www.inf.ethz.ch/personal/gut/lognormal/

93
  • Notice that the CLT can be seen as involving the
    standardization of a (big) random variable:
  • (X1 + ... + Xn - nμ) / (σ n^(1/2)) is the very
    same thing as the standardization of the sum
    X1 + ... + Xn, since that sum has mean nμ and
    standard deviation σ n^(1/2).

94
(No Transcript)
95
Chebyshev's Inequality
  • Let X be a random variable with any distribution
    you like, with a mean μ and standard deviation σ.
  • Chebyshev's theorem then says that for any c > 0:
    Pr(|X - μ| ≥ cσ) ≤ 1/c²
  • In other words, regardless of X's distribution,
    the probability of X yielding a value more than c
    standard deviations away from X's mean is never
    more than 1/c².

96
  • So regardless of X's actual distribution,
  • The probability that X yields a value more than 2
    standard deviations from the mean is at most
    1/4 = .25.
  • The probability that X yields a value more than 5
    standard deviations from the mean is at most
    1/25 = .04.
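(A sketch, not from the slides: checking the bound empirically for one arbitrary distribution, here Exponential(1).)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # any distribution you like
mu, sigma = x.mean(), x.std()

for c in (2, 5):
    frac = np.mean(np.abs(x - mu) > c * sigma)   # observed tail frequency
    print(f"c = {c}: observed {frac:.4f} <= Chebyshev bound {1 / c**2:.4f}")
```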

97
  • The core of classical statistical inference
    involves finding data that is simply too unlikely
    to have come from a certain distribution.
  • E.g., often our data set x1, ..., xn produces a
    certain number b (e.g., b might be the sample
    mean).
  • Often our experimental design allows us to create
    a complex random variable W out of the random
    variables that generated the data set,
  • X1, ..., Xn
  • We then see whether the probability that W would
    produce a value as extreme as b is below a
    certain threshold:
  • Pr(W yields a value at least as extreme as b)
    < .05???

98
  • In theory, we could merely use our threshold (.05
    in our example) to figure out how extreme our
    data had to be to allow us to draw this
    conclusion.
  • If we use Chebyshev's inequality, W will have to
    be further than σ/(.05)^(1/2) ≈ 4.472σ away from
    the mean of W.
  • Although this boundary is rather remarkable,
    because it holds for any random variable, it is
    rather inefficient.
  • If we can obtain more information about the null
    hypothesis that we are testing, we may be able
    to draw stronger conclusions from less extreme
    data.

99
  • For example: Suppose our null hypothesis
    distribution is N(0, 1), and our threshold is
    .05.
  • From Chebyshev's inequality, we can calculate
    that Pr(|X - μ| ≥ cσ) ≤ 1/c², and we want
    1/c² = .05.
  • So we solve for c: c = (1/.05)^(1/2) ≈ 4.472.

100
  • Since our null hypothesis distribution is N(0,
    1), we have μ = 0 and σ = 1, so we can continue:
  • Thus, to draw a statistical inference using
    Chebyshev's inequality, our random variable would
    have to yield a value more extreme than ±4.472
  • As we've seen, this hardly ever occurs from
  • N(0, 1)!
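(A sketch, not from the slides: comparing the Chebyshev cutoff with the exact N(0, 1) cutoff, and how rarely N(0, 1) exceeds the Chebyshev one.)

```python
from scipy.stats import norm

alpha = 0.05
cheb_cut = (1 / alpha) ** 0.5            # from 1/c**2 = .05: c ~ 4.472
norm_cut = norm.ppf(1 - alpha / 2)       # exact two-sided N(0, 1) cutoff: ~1.96

print(cheb_cut, norm_cut)
print(2 * (1 - norm.cdf(cheb_cut)))      # Pr(|X| > 4.472) under N(0, 1): ~8e-06
print(2 * (1 - norm.cdf(norm_cut)))      # Pr(|X| > 1.96)  under N(0, 1): 0.05
```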

101
  • In short, it can be rather hard to draw
    inferences using Chebyshev's inequality
  • This is a price you pay for the fact that the
    inequality is so general.

102
  • But what if we made use of the information that
    our null hypothesis was N(0, 1)?
  • This amounts to utilizing more information in the
    experimental design.
  • As you will learn later, if you do use this
    information, then you can draw an inference (at
    the .05 level) even if your data isn't more
    extreme than ±4.472
  • Instead, it only needs to be more extreme than
    ±1.96

103
  • In sum, there is a kind of trade-off:
  • Chebyshev's inequality requires no (significant)
    background assumptions, and so applies everywhere
  • But it is very inefficient.
  • The techniques we will explore later require some
    significant background assumptions, and so cannot
    apply to all situations.
  • But they are much more efficient.