Review of Basic Statistical Concepts presentation

1
Review of Basic Statistical Concepts

Farideh Dehkordi-Vakil

2
Review of Basic Statistical Concepts

Descriptive Statistics
Methods that organize and summarize data.
Numerical summary
Graphical Methods
Inferential Statistics
Generalizing from a sample to the population from
which it was selected.
Estimation
Hypothesis testing

3
Review of Basic Statistical Concepts

Population
The entire collection of individuals or objects
about which information is desired.
Sample
A subset of the population selected in some
prescribed manner for study.

4
Review of Basic Statistical Concepts

Numerical summaries
Measure of central tendencies
Mean
Median
Measure of variability
Variance, Standard deviation
Range
Quartiles

5
Review of Basic Statistical Concepts

The Mean
To find the mean of a set of observations, add
their values and divide by the number of
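observations. If the n observations are
$x_1, x_2, \ldots, x_n$, their mean is
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
In a more compact notation,
$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$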

6
Example: Books Page Length

A sample of n = 8 books is selected from a
library's collection, and the page length of each one
is determined, resulting in the following data
set.
x1 = 247, x2 = 312, x3 = 198, x4 = 780,
x5 = 175, x6 = 286, x7 = 293, x8 = 258
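
The mean of this data set can be checked with a short Python sketch. The eight page lengths come from the slide; the function name `mean` is only an illustrative choice.

```python
# Page lengths of the n = 8 sampled books, as listed on the slide.
page_lengths = [247, 312, 198, 780, 175, 286, 293, 258]


def mean(values):
    """Add the observations and divide by the number of observations."""
    return sum(values) / len(values)


x_bar = mean(page_lengths)
print(x_bar)  # 318.625
```

Note how the single long book (780 pages) pulls the mean above seven of the eight observations.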

7
Review of Basic Statistical Concepts

Median M
The Median M is the midpoint of a distribution,
the number such that half of the observations are
smaller and the other half are larger. To find
the median of a distribution
Arrange all observations in order of size, from
smallest to largest.
If the number of observations n is odd, the
median M is the center observation in the ordered
list.
If the number of observations n is even, the
median M is the mean of the two center
observations in the ordered list.

8
Review of Basic Statistical Concepts

Quartiles Q1 and Q3
To calculate the quartiles
Arrange the observations in increasing order and
locate the median M in the ordered list of
observations.
The first quartile Q1 is the median of the
observations whose position in the ordered list
is to the left of the location of the overall
median.
The third quartile Q3 is the median of the
observations whose position in the ordered list
is to the right of the location of the overall
median.

9
Example: Books Page Length

Median
Order the list:
175, 198, 247, 258, 286, 293, 312, 780
There are two middle numbers, 258 and 286.
The median is the average of these two numbers:
M = (258 + 286) / 2 = 272

10
Review of Basic Statistical Concepts

The Five Number Summary and Box-Plot
The five number summary of a distribution
consists of the smallest observation, the first
quartile, the median, the third quartile, and the
largest observation, written in order from
smallest to largest. In symbols, the five number
summary is
Minimum Q1 M Q3 Maximum
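
The median and quartile rules from the earlier slides can be sketched in Python and applied to the book page lengths. The function names are illustrative, and the sketch assumes at least two observations. Statistical packages often use slightly different quartile conventions, so their Q1 and Q3 may not match these exactly.

```python
def median(ordered):
    """Median of a list that is already sorted from smallest to largest."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the center observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two center ones


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) using the slide's quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations left of the median's location
    upper = ordered[half + n % 2:]    # observations right of the median's location
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


page_lengths = [247, 312, 198, 780, 175, 286, 293, 258]
summary = five_number_summary(page_lengths)
print(summary)  # (175, 222.5, 272.0, 302.5, 780)
```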

11
Review of Basic Statistical Concepts

A box-plot is a graph of the five number Summary.
A central box spans the quartiles.
A line in the box marks the median.
Lines extend from the box out to the smallest and
largest observations.
Box-plots are most useful for side-by-side
comparison of several distributions.

12
Review of Basic Statistical Concepts
13
Review of Basic Statistical Concepts

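The Variance $s^2$
The variance $s^2$ of a set of observations is the
average of the squares of the deviations of the
observations from their mean. In symbols, the
variance of n observations is
$s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
or, more compactly,
$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$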

14
Review of Basic Statistical Concepts

The Standard Deviation s
The standard deviation s is the square root of
the variance $s^2$:
$s = \sqrt{s^2}$
Computational formula for variance
$s^2 = \dfrac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}$

15
Example Book page length
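
The worked numbers for this slide are not preserved in the transcript. As a hedged sketch, the variance and standard deviation of the eight page lengths can be computed in Python. It assumes the usual sample variance with divisor n - 1 and checks the definition against the equivalent computational form.

```python
import math

page_lengths = [247, 312, 198, 780, 175, 286, 293, 258]
n = len(page_lengths)
x_bar = sum(page_lengths) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in page_lengths) / (n - 1)

# Computational form: uses only the sum and the sum of squares.
sum_x = sum(page_lengths)
sum_x2 = sum(x * x for x in page_lengths)
s2_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(s2_definition, s2_computational, round(s, 2))
```

Both forms give a variance of about 36945.1 and a standard deviation of about 192.2 pages, a spread driven largely by the 780-page book.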
16
Review of Basic Statistical Concepts

Choosing a Summary
The five number summary is usually better than
the mean and standard deviation for describing a
skewed distribution or a distribution with
extreme outliers. Use $\bar{x}$ and s only for
reasonably symmetric distributions that are free
of outliers.

17
Review of Basic Statistical Concepts

Introduction to Inference
The purpose of inference is to draw conclusions
from data.
Conclusions take into account the natural
variability in the data, therefore formal
inference relies on probability to describe
chance variation.
We will go over the two most prominent types of
formal statistical inference:
Confidence intervals for estimating the value of
a population parameter.
Tests of significance, which assess the evidence
for a claim.
Both types of inference are based on the sampling
distribution of statistics.

18
Review of Basic Statistical Concepts

Parameters and Statistics
A parameter is a number that describes the
population.
A parameter is a fixed number, but in practice we
do not know its value.
A statistic is a number that describes a sample.
The value of a statistic is known when we have
taken a sample, but it can change from sample to
sample.
We often use a statistic to estimate an unknown
parameter.

19
Review of Basic Statistical Concepts

Since both methods of formal inference are based
on sampling distributions, they require a
probability model for the data.
The model is most secure and inference is most
reliable when the data are produced by a properly
randomized design.
When we use statistical inference we assume that
the data come from a simple random sample
(SRS) or a randomized experiment.

20
Example: Consumer attitude towards shopping

A recent survey asked a nationwide random sample
of 2500 adults if they agreed or disagreed with
the following statement:
"I like buying new clothes, but shopping is often
frustrating and time consuming."
Of the respondents, 1650 said they agreed.
The proportion of the sample who agreed that
clothes shopping is often frustrating is
$\hat{p} = \dfrac{1650}{2500} = 0.66$

21
Example: Consumer attitude towards shopping

The number $\hat{p}$ = 0.66 is a statistic.
The corresponding parameter is the proportion
(call it P) of all adult U.S. residents who would
have said "Yes" if asked the same question.
We don't know the value of the parameter P, so we use
$\hat{p}$ as its estimate.

22
Review of Basic Statistical Concepts

If the marketing firm took a second random sample
of 2500 adults, the new sample would have
different people in it.
It is almost certain that there would not be
exactly 1650 positive responses.
That is, the value of $\hat{p}$ will vary from sample
to sample.
Random samples eliminate bias from the act of
choosing a sample, but they can still be wrong
because of the variability that results when we
choose at random.

23
24
Review of Basic Statistical Concepts

The first advantage of choosing at random is that
it eliminates bias.
The second advantage is that if we take lots of
random samples of the same size from the same
population, the variation from sample to sample
will follow a predictable pattern.
All statistical inference is based on one idea
to see how trustworthy a procedure is, ask what
would happen if we repeated it many times.

25
Review of Basic Statistical Concepts

Sampling Distribution of Statistics
Suppose that exactly 60% of adults find shopping
for clothes frustrating and time consuming.
That is, the truth about the population is that P
= 0.6 (a parameter).
We select an SRS (simple random sample) of size
100 from this population and use the sample
proportion ($\hat{p}$, a statistic) to estimate the
unknown value of the population proportion P.
What is the distribution of $\hat{p}$?

26
Review of Basic Statistical Concepts

To answer this question:
Take a large number of samples of size 100 from
this population.
Calculate the sample proportion $\hat{p}$ for each
sample.
Make a histogram of the values of $\hat{p}$.
Examine the distribution displayed in the
histogram for shape, center, and spread, as well
as outliers or other deviations.

27
Review of Basic Statistical Concepts

The results of many SRSs have a regular pattern.
Here we draw 1000 SRSs of size 100 from the same
population.
The histogram shows the distribution of the 1000
sample proportions $\hat{p}$.
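
The simulation described on these slides can be sketched in Python. This is an illustration under stated assumptions, not the software used for the original histogram. Each respondent is drawn independently with probability P = 0.6 of agreeing, which approximates an SRS from a very large population, and the random seed is an arbitrary choice made only for repeatability.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch gives the same output each run

P = 0.6             # population proportion from the slide
SAMPLE_SIZE = 100   # size of each SRS
NUM_SAMPLES = 1000  # number of repeated samples


def sample_proportion(p, n):
    """Proportion of 'agree' answers in one simulated sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n


p_hats = [sample_proportion(P, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# Crude text histogram: one row per observed value of p-hat.
counts = {}
for p_hat in p_hats:
    counts[p_hat] = counts.get(p_hat, 0) + 1
for value in sorted(counts):
    print(f"{value:.2f} {'*' * counts[value]}")

center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)
print("center:", round(center, 3), "spread:", round(spread, 3))
```

The printed histogram should be roughly bell shaped and centered near 0.6, with a spread close to the theoretical value sqrt(0.6 × 0.4 / 100) ≈ 0.049.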

28
Review of Basic Statistical Concepts

Sampling Distribution
The sampling distribution of a statistic is the
distribution of values taken by the statistic in
all possible samples of the same size from the
same population.

29
Normal Distribution

These curves, called normal curves, are
Symmetric
Single peaked
Bell shaped
Normal curves describe normal distributions.

30
Normal Density Curve

The exact density curve for a particular normal
distribution is described by giving its mean μ
and its standard deviation σ.
The mean is located at the center of the
symmetric curve and it is the same as the median.
The standard deviation σ controls the spread of a
normal curve.

31
Normal Density Curve
32
Standard Normal Distribution

The standard Normal distribution is the Normal
distribution N(0, 1) with mean
μ = 0 and standard deviation σ = 1.

33
Standard Normal Distribution

If a variable x has any normal distribution N(μ,
σ) with mean μ and standard deviation σ, then the
standardized variable
$z = \dfrac{x - \mu}{\sigma}$
has the standard Normal distribution.

34
The Standard Normal Table

Table A is a table of the area under the
standard Normal curve. The table entry for each
value z is the area under the curve to the left
of z.

35
The Standard Normal Applet

Or you can use this applet
http/www.stat.sc.eduwest/applets/normaldemo.html

36
The Standard Normal Table

What is the area under the standard normal curve
to the right of
z = -2.15?
Compact notation:
P(Z > -2.15) = 1 - 0.0158 = 0.9842

37
The Standard Normal Table

What is the area under the standard normal curve
between z = 0 and z = 2.3?
Compact notation:
P(0 < Z < 2.3) = 0.9893 - 0.5 = 0.4893
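
The two table lookups above can be reproduced without a printed table. The sketch below uses `NormalDist` from Python's standard `statistics` module, whose `cdf` method returns the area to the left of a value, just as Table A does.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution N(0, 1)

# Area to the right of z = -2.15 is 1 minus the area to the left.
area_right = 1 - Z.cdf(-2.15)

# Area between z = 0 and z = 2.3 is a difference of two left-hand areas.
area_between = Z.cdf(2.3) - Z.cdf(0)

print(round(area_right, 4), round(area_between, 4))  # 0.9842 0.4893
```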

38
Example: Annual rate of return on stock indexes

The annual rate of return on stock indexes (which
combine many individual stocks) is approximately
Normal. Since 1954, the Standard & Poor's 500
stock index has had a mean yearly return of about
12%, with a standard deviation of 16.5%. Take this
Normal distribution to be the distribution of
yearly returns over a long period. The market is
down for the year if the return on the index is
less than zero. In what proportion of years is
the market down?

39
Example: Annual rate of return on stock indexes

State the problem
Call the annual rate of return for the Standard &
Poor's 500-stock index x. The variable x has the
N(12, 16.5) distribution. We want the proportion
of years with x < 0.
Standardize
Subtract the mean, then divide by the standard
deviation, to turn x into a standard Normal z:
$z = \dfrac{x - 12}{16.5}$, so x < 0 corresponds to $z < \dfrac{0 - 12}{16.5} = -0.73$

40
Example: Annual rate of return on stock indexes

Draw a picture to show the standard normal curve
with the area of interest shaded.
Use the table
The proportion of observations less than
-0.73 is 0.2327.
The market is down on an annual basis about
23.27% of the time.

41
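Example: Annual rate of return on stock indexes

What percent of years have annual return between
12% and 50%?
State the problem
We want the proportion of years with 12 ≤ x ≤ 50.
Standardize
$\dfrac{12 - 12}{16.5} \le z \le \dfrac{50 - 12}{16.5}$, that is, $0 \le z \le 2.30$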

42
Example: Annual rate of return on stock indexes

Draw a picture.
Use table.
The area between 0 and 2.30 is the area below
2.30 minus the area below 0.
0.9893 - 0.50 = 0.4893
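
Both stock-index questions can be checked with a short Python sketch that works with the N(12, 16.5) distribution directly and skips the standardizing step. The variable names are illustrative. The answers differ slightly from the table-based ones because the table rounds z to two decimal places.

```python
from statistics import NormalDist

# Yearly returns modeled as N(12, 16.5), in percent, as on the slides.
returns = NormalDist(mu=12, sigma=16.5)

# Proportion of years the market is down: P(x < 0).
p_down = returns.cdf(0)

# Proportion of years with a return between 12% and 50%.
p_12_to_50 = returns.cdf(50) - returns.cdf(12)

print(round(p_down, 3), round(p_12_to_50, 3))
```

The unrounded results, roughly 0.234 and 0.489, agree with the table answers 0.2327 and 0.4893 to about two decimal places.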

43
Estimation

So far, we have used our sample estimates as
point estimates of parameters, for example
$\bar{x}$ as an estimate of μ and $s^2$ as an estimate of σ².
These estimators have properties.

44
Estimators

They are both unbiased estimators
The expected value of an unbiased estimator is
equal to the parameter that it is trying to
estimate.

45
Estimators

For example, the variance estimator that divides by
n instead of n - 1 is biased.
It tends to give an answer that is a little too
small.
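
A standard illustration of an estimator that comes out a little too small is the variance computed with divisor n rather than n - 1. The slide's own formula is not preserved, so treating this as the intended example is an assumption. The simulation below is a minimal sketch: the N(0, 1) population (true variance 1), the sample size of 5, the number of repetitions, and the seed are all illustrative choices.

```python
import random
import statistics

random.seed(2)  # arbitrary seed for repeatability

SAMPLE_SIZE = 5
NUM_SAMPLES = 20000

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    x_bar = sum(sample) / SAMPLE_SIZE
    sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)
    divide_by_n.append(sum_sq_dev / SAMPLE_SIZE)
    divide_by_n_minus_1.append(sum_sq_dev / (SAMPLE_SIZE - 1))

avg_biased = statistics.mean(divide_by_n)
avg_unbiased = statistics.mean(divide_by_n_minus_1)
print(round(avg_biased, 2), round(avg_unbiased, 2))
```

Over many samples the divide-by-n estimator averages about 0.8, which is (n - 1)/n times the true variance, while the divide-by-(n - 1) estimator averages close to 1.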

46
Estimators

$\bar{x}$ is also a minimum variance estimator of μ.
This means that it has the smallest variability
among all unbiased estimators of μ.
What if we want to do more than just provide a
point estimate?

47
Estimating with Confidence

Suppose we are interested in the value of some
parameter, and we want to construct a confidence
interval for it, with some desired level of
confidence

48
Estimating with Confidence

Suppose we can estimate this parameter from
sample data, and we know the distribution of this
estimator, then we can use this knowledge and
construct a probability statement involving both
the estimator and the true value of the
parameter.
This statement is manipulated mathematically to
produce confidence intervals.

49
Confidence intervals

The general form of a confidence interval is
sample value of estimator ±
(factor) × (SE of estimator)
The value of the factor will depend on the level
of confidence desired, and the distribution of
the estimator.

50
Estimating with Confidence

Community banks are banks with less than a
billion dollars of assets. There are
approximately 7500 such banks in the United
States. In many studies of the industry these
banks are considered separately from banks that
have more than a billion dollars of assets. The
latter banks are called large institutions. The
Community Bankers Council of the American Bankers
Association (ABA) conducts an annual survey of
community banks. For the 110 banks that make up
the sample in a recent survey, the mean assets
are $\bar{x}$ = 220 (in millions of dollars). What can
we say about μ, the mean assets of all community
banks?

51
Estimating with Confidence

The sample mean $\bar{x}$ is the natural estimator of
the unknown population mean μ.
We know that
$\bar{x}$ is an unbiased estimator of μ.
The law of large numbers says that the sample
mean must approach the population mean as the
size of the sample grows.
Therefore, the value $\bar{x}$ = 220 appears to be a
reasonable estimate of the mean assets μ for all
community banks.
But, how reliable is this estimate?

52
Standard Error of Estimator

An estimate without an indication of its
variability is of limited value.
Questions about the variation of an estimator are
answered by looking at the spread of its sampling
distribution.
According to the Central Limit Theorem:
If the entire population of community bank assets
has mean μ and standard deviation σ, then in
repeated samples of size 110 the sample mean $\bar{x}$
approximately follows the N(μ, σ/√110)
distribution.

53
Standard Error of Estimator

Suppose that the true standard deviation σ is
equal to the sample standard deviation s = 161.
This is not realistic, although it will give
reasonably accurate results for samples as large
as 100. Later on we will learn how to proceed
when σ is not known.
Therefore, by the Central Limit Theorem, in repeated
sampling the sample mean $\bar{x}$ is approximately
normal, centered at the unknown population mean
μ, with standard deviation
$\sigma_{\bar{x}} = \dfrac{161}{\sqrt{110}} \approx 15.35$ (millions of dollars)

54
Confidence Interval for the Population Mean

We use the sampling distribution of the sample
mean $\bar{x}$ to construct a level C confidence
interval for the mean μ of a population.
We assume that the data are an SRS of size n.
The sampling distribution is exactly N(μ,
σ/√n) when the population has the N(μ, σ)
distribution.
The Central Limit Theorem says that this same
sampling distribution is approximately correct
for large samples whenever the population mean
and standard deviation are μ and σ.

55
Confidence Interval for a Population Mean

Choose an SRS of size n from a population having
unknown mean μ and known standard deviation σ. A
level C confidence interval for μ is
$\bar{x} \pm z^* \dfrac{\sigma}{\sqrt{n}}$
Here z* is the critical value with area C
between -z* and z* under the standard Normal
curve. The quantity
$z^* \dfrac{\sigma}{\sqrt{n}}$
is the margin of error. The interval is exact
when the population distribution is normal and is
approximately correct when n is large in other
cases.

56
Confidence Interval for a Population Mean

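Recall the community bank example.
What is a 90% confidence interval for the mean
assets of all community banks?
Estimator: $\bar{x}$ = 220
Standard error of the sample mean: $\sigma_{\bar{x}} = 161 / \sqrt{110} \approx 15.35$
Factor: z* = 1.645
Interval: 220 ± 1.645 × 15.35 = 220 ± 25.25, or about 194.75 to 245.25 (millions of dollars)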
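
The 90% interval for the community bank example can be computed with a short Python sketch. The function name `z_confidence_interval` is an illustrative choice. As on the slides, the sample standard deviation s = 161 is treated as if it were the known population σ.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Level-C interval x_bar ± z* · sigma / sqrt(n) for a population mean."""
    # z* leaves area C between -z* and z*, so area (1 + C) / 2 lies to its left.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = z_confidence_interval(x_bar=220, sigma=161, n=110, confidence=0.90)
print(round(low, 2), round(high, 2))
```

With z* ≈ 1.645 and a standard error of about 15.35, the margin of error is about 25.25, giving an interval of roughly 194.75 to 245.25 million dollars.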

57
Example: Banks' loan-to-deposit ratio

The ABA survey of community banks also asked
about the loan-to-deposit ratio (LTDR), a bank's
total loans as a percent of its total deposits.
The mean LTDR for the 110 banks in the sample is
and the standard deviation is s =
12.3. This sample is sufficiently large for us to
use s as the population σ here. Find a 95%
confidence interval for the mean LTDR for
community banks.
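
The sample mean for this example is not preserved in the transcript, so the sketch below computes only the margin of error of the 95% interval, which depends on s and n alone. The interval itself would be the sample mean plus or minus this margin.

```python
import math
from statistics import NormalDist

s = 12.3           # sample standard deviation, used in place of sigma
n = 110            # number of banks in the sample
confidence = 0.95

z_star = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96
margin = z_star * s / math.sqrt(n)
print(round(z_star, 3), round(margin, 2))
```

The margin of error is about 2.3 percentage points of LTDR.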

58
Confidence Interval for a Population Mean

What if the sample size is small and the
population standard deviation is not known?
Then the standardized sample mean, with s in place
of σ, follows Student's t distribution (provided
the population is approximately Normal).
The Student's t distribution has a symmetric,
bell-shaped density centered at zero, and depends
on a parameter called the degrees of freedom.
The number of degrees of freedom depends upon the
sample size.

59
Student's t Distribution

The density of the Student's t distribution
differs from that of the standard normal
density.
The density in the tails and flanks is different
from that of the normal distribution.
The tails are higher and wider than those of a
standard normal density, indicating that the
standard deviation is larger than that of the
standard normal, especially for small sample sizes.

60
Student's t Distribution
Figure: the N(0, 1) density and the t distribution with 1 degree of freedom
Source: http//www.wordiq.com/definition/ImageT_distribution_1df.png
61
Student's t Distribution

As the sample size becomes larger, the degrees of
freedom of the Student's t distribution also
become larger.
As the degrees of freedom become larger, the
Student's t distribution approaches the standard
normal distribution.

62
Student's t Distribution
http//www.etfos.hr/fridl/primjer6.htm
63
Student's t Distribution

A level C confidence interval for μ when the
sample size is small and the population standard
deviation is not known is
$\bar{x} \pm t^* \dfrac{s}{\sqrt{n}}$
The t-distribution has n - 1 degrees of freedom.
Given the confidence level, the value t* can
be determined from published tables for the
t-distribution.

64
Student's t Distribution

Example
If the sample size is 15, the critical value for
a 95% confidence interval from a t-table is
2.145.
Note that the degrees of freedom is 14.
If the sample size is 25, what is the critical
value for a 90% confidence interval?
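
Critical values like these can also be found numerically rather than from a table. Python's standard library has no t distribution, so the sketch below integrates the t density with Simpson's rule and searches for the critical value by bisection. This is an illustrative approach under those assumptions, not the method used to build published tables, and the helper names are hypothetical. It is meant for moderate degrees of freedom, since `math.gamma` overflows for very large ones.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus symmetry about zero."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def t_critical(confidence, df):
    """Value t* with area `confidence` between -t* and t*, found by bisection."""
    target = (1 + confidence) / 2
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


t_95_n15 = t_critical(0.95, df=15 - 1)
t_90_n25 = t_critical(0.90, df=25 - 1)
print(round(t_95_n15, 3), round(t_90_n25, 3))  # approximately 2.145 and 1.711
```

So for a sample of size 25 (24 degrees of freedom), the critical value for 90% confidence is about 1.711.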

65
Tests of Significance

Confidence intervals are appropriate when our
goal is to estimate a population parameter.
The second type of inference is directed at
assessing the evidence provided by the data in
favor of some claim about the population.
A significance test is a formal procedure for
comparing observed data with a hypothesis whose
truth we want to assess.
The hypothesis is a statement about the
parameters in a population or model.
The results of a test are expressed in terms of a
probability that measures how well the data and
the hypothesis agree.

66
Example: Banks' net income

The community bank survey described previously
also asked about net income and reported the
percent change in net income between the first
half of last year and the first half of this
year. The mean change for the 110 banks in the
sample is $\bar{x}$ = 8.1%. Because the sample size
is large, we are willing to use the sample
standard deviation s = 26.4% as if it were the
population standard deviation σ. The large sample
size also makes it reasonable to assume that
$\bar{x}$ is approximately normal.

67
Example: Banks' net income

Is the 8.1% mean increase in a sample good
evidence that the net income for all banks has
changed?
The sample result might happen just by chance
even if the true mean change for all banks is μ =
0.
To answer this question we ask another:
Suppose that the truth about the population is
that μ = 0 (this is our hypothesis).
What is the probability of observing a sample
mean at least as far from zero as 8.1?

68
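Example: Banks' net income

The answer is about 0.0006 for a sample mean this far
above zero (z = 3.22), or about 0.0013 counting both
directions.
Because this probability is so small, we see that
the sample mean $\bar{x}$ = 8.1 is incompatible with
a population mean of μ = 0.
We conclude that the income of community banks
has changed since last year.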

69
Example: Banks' net income

The fact that the calculated probability is very
small leads us to conclude that the average
percent change in income is not in fact zero.
Here is why.
If the true mean is μ = 0, we would see a sample
mean as far above zero as 8.1 only about six times per
10,000 samples (about 13 times per 10,000 counting
deviations in either direction).
So there are only two possibilities:
μ = 0 and we have observed something very
unusual, or
μ is not zero but has some other value that makes
the observed data more probable.

70
Example: Banks' net income

We calculated a probability taking the first of
these choices as true (μ = 0). That probability
guides our final choice.
If the probability is very small, the data don't
fit the first possibility and we conclude that
the mean is not in fact zero.

71
Tests of Significance: Formal details

The first step in a test of significance is to
state a claim that we will try to find evidence
against.
Null Hypothesis H0
The statement being tested in a test of
significance is called the null hypothesis.
The test of significance is designed to assess
the strength of the evidence against the null
hypothesis.
Usually the null hypothesis is a statement of no
effect or no difference. We abbreviate null
hypothesis as H0.

72
Tests of Significance: Formal details

A null hypothesis is a statement about a
population, expressed in terms of some parameter
or parameters.
The null hypothesis in our bank survey example is
H0: μ = 0
It is convenient also to give a name to the
statement we hope or suspect is true instead of
H0.
This is called the alternative hypothesis and is
abbreviated as Ha.
In our bank survey example the alternative
hypothesis states that the percent change in net
income is not zero. We write this as
Ha: μ ≠ 0

73
Tests of Significance: Formal details

Since Ha expresses the effect that we hope to
find evidence for, we often begin with Ha and then
set up H0 as the statement that the hoped-for
effect is not present.
Stating Ha is not always straightforward.
It is not always clear whether Ha should be
one-sided or two-sided.
The alternative Ha: μ ≠ 0 in the bank net income
example is two-sided.
In any given year, income may increase or
decrease, so we include both possibilities in the
alternative hypothesis.

74
Tests of Significance: Formal details

Test statistics
We will learn the form of significance tests in a
number of common situations. Here are some
principles that apply to most tests and that help
in understanding the form of tests:
The test is based on a statistic that estimates
the parameter appearing in the hypotheses.
Values of the estimate far from the parameter
value specified by H0 give evidence against H0.

75
Example: Banks' net income

The test statistic
In our banking example the null hypothesis is
H0: μ = 0, and a sample gave
$\bar{x}$ = 8.1. The test statistic for this problem is the
standardized version of $\bar{x}$:
$z = \dfrac{\bar{x} - 0}{\sigma / \sqrt{n}} = \dfrac{8.1}{26.4 / \sqrt{110}} \approx 3.22$
This statistic is the distance between the sample
mean and the hypothesized population mean in the
standard scale of z-scores.

76
Example: Banks' net income

p-values
The p-value is the probability, computed assuming
that H0 is true, that the test statistic would
take a value as extreme as or more extreme than
the one actually observed.
The smaller the p-value, the stronger the
evidence against H0.
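
The test statistic and p-value for the bank net income example can be sketched in Python, using the slide's values (sample mean 8.1, s = 26.4 used as σ, n = 110). The variable names are illustrative. The sketch reports the one-tail area and, because the alternative in this example is two-sided, the doubled two-sided p-value.

```python
import math
from statistics import NormalDist

x_bar = 8.1    # sample mean percent change in net income
mu_0 = 0.0     # value of the mean under H0
sigma = 26.4   # sample standard deviation, used as if it were sigma
n = 110        # number of banks

# Standardized distance between the sample mean and the hypothesized mean.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Area beyond |z| in one tail, and the two-sided p-value for Ha: mu != 0.
upper_tail = 1 - NormalDist().cdf(abs(z))
p_two_sided = 2 * upper_tail

print(round(z, 2), round(upper_tail, 4), round(p_two_sided, 4))
```

The statistic is z ≈ 3.22, with about 0.0006 in one tail and a two-sided p-value of about 0.0013. Either figure is well below the usual significance levels.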

77
Example: Banks' net income

Conclusion
One approach is to state in advance how much
evidence against H0 we will require in order to
reject it.
The level that says this evidence is strong
enough is called the significance level and is
denoted by the letter α.
We compare the p-value with the significance
level.
We reject H0 if the p-value is smaller than the
significance level, and say that the data are
statistically significant at level α.

78
One-sample t-test

Suppose we have a simple random sample of size n
from a Normally distributed population with mean
μ and standard deviation σ.
The standardized sample mean, or one-sample z
statistic,
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
has the standard Normal distribution N(0, 1).
When we substitute the standard error s/√n for the
standard deviation of the mean σ/√n, the
statistic does not have a Normal distribution.

79
The t-distribution

t-test
Suppose that an SRS of size n is drawn from a N(μ,
σ) population. Then the one-sample t statistic
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
has the t-distribution with n - 1 degrees of
freedom.
There is a different t distribution for each
sample size.
A particular t distribution is specified by
giving the degrees of freedom.

80
Exploring Relationships between Two Quantitative
Variables

Scatter plots
Represent the relationship between two different
continuous variables measured on the same
subjects.
Each point in the plot represents the values for
one subject for the two variables.

81
Exploring Relationships between Two Quantitative
Variables

Example
Data reported by the Organisation for Economic
Co-operation and Development (OECD) on its 29 member
nations in 1998.
Per capita gross domestic product is on the x-axis.
Per capita health care expenditures is on the y-axis.

82
Exploring Relationships between Two Quantitative
Variables

We can describe the overall pattern of a scatter
plot by
Form or shape
Direction
Strength

83
Exploring Relationships between Two Quantitative
Variables

Form or shape
The form shown by the scatter plot is linear if
the points lie in a straight-line pattern.
Strength
The relationship is strong if the points lie
close to a line, with little scatter.

84
Exploring Relationships between Two Quantitative
Variables

Direction
Positive and negative association
Two variables are positively associated when
above-average values of one variable tend to
occur in individuals with above average values
for the other variable, and below average values
of both also tend to occur together.
Two variables are negatively associated when above
average values for one tend to occur in subjects
with below average values of the other, and
vice versa.

85
Exploring Relationships between Two Quantitative
Variables

Per capita health care example
subjects studied are countries
Form of relationship is roughly linear
The direction is positive
The relationship is strong.

86
Correlation

It is often useful to have a measure of degree of
association between two variables. For example,
you may believe that sales may be affected by
expenditures on advertising, and want to measure
the degree of association between sales and
advertising.
The correlation coefficient is a numeric measure of
the direction and strength of the linear relationship
between two continuous variables.
The notation for the sample correlation coefficient
is r.

87
Correlation

There are several alternative ways to write the
algebraic expression for the correlation
coefficient. The following is one.
$r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - \left(\sum X\right)^2} \, \sqrt{n \sum Y^2 - \left(\sum Y\right)^2}}$
X and Y represent the two variables of interest,
for example advertising and sales, or per capita
gross domestic product and per capita health
care expenditure.
n is the number of subjects in the sample.
The notation for the population correlation
coefficient is ρ.

88
Correlation

Facts about the correlation coefficient
r has no unit.
r > 0 indicates a positive association; r < 0
indicates a negative association.
r is always between -1 and 1.
Values of r near 0 imply a very weak linear
relationship.
Correlation measures only the strength of linear
association.
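
A correlation coefficient can be computed from sums of the data with a short Python sketch. It uses one common computational form of r, which may not be the exact expression shown on the original slide. The advertising and sales figures are hypothetical numbers invented purely for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation r from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data, for illustration only.
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = correlation(advertising, sales)
print(round(r, 4))  # 0.7746
```

Points lying exactly on a rising line give r = 1, and points lying exactly on a falling line give r = -1, which matches the facts listed above.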

89
Correlation

We could perform a hypothesis test to determine
whether the value of a sample correlation
coefficient (r) gives us reason to believe that
the population correlation (ρ) is significantly
different from zero.
The hypothesis test would be
H0: ρ = 0
Ha: ρ ≠ 0

90
Correlation

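The test statistic would be
$t = \dfrac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$
The test statistic has a t-distribution with n - 2
degrees of freedom.
Reject H0 if
$|t| > t^*$, where t* is the critical value of that t-distribution for the chosen significance level α (area α/2 in each tail).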

91
Example: Do wages rise with experience?

Many factors affect the wages of workers: the
industry they work in, their type of job, their
education and their experience, and changes in
general levels of wages. We will look at a sample
of 59 married women who hold customer service
jobs in Indiana banks. The following table gives
their weekly wages at a specific point in time and
also their length of service with their employer,
in months. The size of the place of work is
recorded simply as large (100 or more workers)
or small. Because industry, job type, and the
time of measurement are the same for all 59
subjects, we expect to see a clear relationship
between wages and length of service.

92
Example Do wages rise with experience?
93
Example Do wages rise with experience?
94
Example Do wages rise with experience?

The correlation between wages and length of
service for the 59 bank workers is r = 0.3535.
We expect a positive correlation between length
of service and wages in the population of all
married female bank workers. Is the sample result
convincing that this is true?

95
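Example: Do wages rise with experience?

To compute correlation we need
Replacing these in the formula
We want to test
H0: ρ = 0 versus Ha: ρ > 0
The test statistic is
$t = \dfrac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \dfrac{0.3535 \sqrt{57}}{\sqrt{1 - 0.3535^2}} \approx 2.853$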

96
Example: Do wages rise with experience?

Comparing t = 2.853 with critical values from the
t-table with n - 2 = 57 degrees of freedom helps
us to make our decision.
Conclusion
Since P(T > 2.853) < 0.005, we reject H0.
There is a positive correlation between wages and
length of service.
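
The test statistic and its tail probability can be checked with a Python sketch. The values r = 0.3535 and n = 59 come from the slides. The t density is integrated numerically with Simpson's rule because the standard library has no t distribution; the helper names are hypothetical, and this is an illustrative approach rather than the applet's method.

```python
import math

r = 0.3535   # sample correlation between wages and length of service
n = 59       # number of bank workers
df = n - 2   # degrees of freedom for the test

# Test statistic for H0: rho = 0.
t_stat = r * math.sqrt(df) / math.sqrt(1 - r * r)


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the Simpson's-rule area on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


p_value = t_upper_tail(t_stat, df)
print(round(t_stat, 3), round(p_value, 4))
```

The statistic is about 2.853 and the one-sided tail probability is roughly 0.003, consistent with the conclusion that P(T > 2.853) < 0.005.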

97
T-distribution applet

Tail probabilities for Student's t-distribution can
be computed using the applet at the following site.
http//www.acs.ucalgary.ca/nosal/src/Applets/T-T
ailProb/T-TailProb.html
