1
BASIC STATISTICS
  • Alan J. Chaput, BScPhm, PharmD, MD, MSc (Epid),
    FRCPC
  • June 28, 2007

2
Descriptive statistics
3
Types of data
  • Categorical
  • Places individuals into one of several categories
  • Quantitative (numerical)
  • Numerical values for which arithmetic operations
    such as adding and averaging make sense

4
How would you categorize these variables?
  • Height
  • Hours of sleep
  • Hair color
  • Eye color
  • Ever taken stats course
  • Heart rate
  • 7-point Likert scale

5
Distribution of a variable
  • Tells us values a variable takes and how often it
    takes them
  • Start data analyses by exploring distributions of
    single variables with a graph
  • Later, move on to studying relationships between
    variables

6
Displaying distributions
  • Categorical
  • Pie charts
  • Bar graphs
  • Quantitative
  • Histograms
  • Stem-and-leaf plots
  • Boxplots

7
Graphs vs. tables
  • Graphs
  • A picture's worth a thousand words
  • Can show ALL the data points
  • Visual impact of data (presentations)
  • Tables
  • More efficient (usually)
  • Show actual values (more precision)
  • Easier to produce (historically)

8
Categorical Variables: Example U.S. Solid Waste
(2000)
Pie Chart
Difficult to do by hand; use a computer program
(e.g., Excel) for production if necessary.
9
Categorical Variables: Example U.S. Solid Waste
(2000)
Bar Graph
Notes: 1) Bars do not touch (categorical data)
2) Can plot counts or percents
10
Quantitative variables - histograms
  • Divide data into class intervals of equal width
  • Count how many observations fall in each interval
  • Draw histogram

11
Weight Data: Histogram
(figure: histogram of student weights; x-axis = weight,
y-axis = number of students)
12
Interpreting histograms
  • Shape
  • Symmetric (bell, other)
  • Asymmetric (right-tail, left-tail)
  • Unimodal, bimodal (a mode is a high point)
  • Centre (find middle position)
  • Count observations (n)
  • Find middle position: (n + 1) / 2
  • Find middle value: the value that sits at the middle
    position
  • Spread (from low to high)
  • Outliers (values outside the regular pattern)

13
Interpreting histograms: Illustrative example
State population, percent Hispanic (Fig 1.3)
  • Shape: asymmetrical with right tail
  • Center
  • n = 50 states
  • Middle position: (50 + 1) / 2 = 25.5
  • Middle value is in the first category, so is
    between 0 and 5
  • Spread: from 0.7 (W. Virginia) to 42.1 (New
    Mexico)
  • Outlier: New Mexico

14
Interpreting histograms: Illustrative example (Fig
1.4 in text)
  • Shape: symmetrical bell
  • Center
  • n = 100
  • Middle position: (100 + 1) / 2 = 50.5
  • Middle value: around 7
  • Spread: from 2 to 12
  • Outlier: 12 (maybe)

15
Stem-and-leaf plots
  • For quantitative variables
  • Separate each value into a stem value (first part
    of the number) and leaf value (the remaining part
    of the number)
  • Create a stem axis
  • Write each leaf to the right of its stem value
  • Place leaves in rank order
  • Interpretation like a histogram on its side

16
Weight Data: Stemplot (Stem-and-Leaf Plot)
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Key: 20 | 3 means 203 pounds. Stems = 10s; leaves = 1s.
17
Interpretation like a histogram on its side
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
  • Shape: positive skew
  • Center
  • n = 53
  • middle position: (53 + 1) / 2 = 27
  • middle value = 165
  • Spread: from 100 to 260
  • Outlier: 260

18
Boxplot
  • Central box spans Q1 to Q3
  • A line in the box marks the median
  • Lines extend from the box out to the minimum and
    maximum

19
Weight Data Boxplot
20
Boxplots are esp. useful for comparing groups
21
Numerical summaries
  • Centre
  • Median
  • Mean
  • Spread
  • Quartiles (and IQR)
  • Standard deviation (and variance)

22
Mean (Arithmetic Average)
  • Traditional measure of center
  • Notation: x̄ (x-bar)
  • Sum the values and divide by the sample size (n)
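The formula itself appeared as an image on the slide; it is
the standard sample mean, in LaTeX:

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i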

23
Median
  • Half the ordered values are less than or equal to
    the median (and half are greater)
  • If n is odd, the median is the middle ordered
    value
  • If n is even, the median is the average of the
    two middle ordered values

24
Comparing the mean and median
  • Mean ≈ median when data are symmetrical
  • Mean ≠ median when data are skewed

25
Spread (variability)
  • Variability = the amount the values spread above
    and below the centre
  • Can be measured in several ways
  • Range
  • Quartiles and inter-quartile range
  • Variance and standard deviation

26
Range
  • Range = maximum − minimum
  • The range is NOT a reliable measure of spread

27
Quartiles
  • The quartiles are the 3 numbers that divide the
    ordered data into 4 equally sized groups
  • Q1 has 25% of the data below it
  • Q2 has 50% of the data below it (median)
  • Q3 has 75% of the data below it

28
Obtaining the quartiles
  • Order the data
  • Find the median (this is Q2)
  • Look at the lower half of the data (those below
    the median)
  • The median of this lower half = Q1
  • Look at the upper half of the data (those above
    the median)
  • The median of this upper half = Q3

29
5 number summary
  • Minimum
  • Q1
  • Median (Q2)
  • Q3
  • Maximum
  • Note
  • IQR = Q3 − Q1
  • IQR gives the spread of the middle 50% of the data

30
Variance and standard deviation
  • The most common measures of spread
  • Based on deviations around the mean
  • Each data value has a deviation: xᵢ − x̄

31
Variance Formula
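The formula itself appeared as an image on the slide; the
standard sample variance is, in LaTeX:

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2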
32
Standard Deviation: square root of the variance
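In LaTeX:

    s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}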
33
Choosing summary statistics
  • Use the mean and standard deviation for
    reasonably symmetric distributions that are free
    of outliers
  • Use the median and IQR (or 5-number summary) when
    data are skewed or when outliers are present

34
The normal distribution
35
Who is this??
36
Mathematical formula of a normal curve
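The formula appeared as an image on the slide; the Normal
density with mean µ and standard deviation σ is, in LaTeX:

    f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}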
37
Normal curves
  • Bell-shaped
  • Not too steep, not too fat
  • Defined by means and standard deviations

38
Normal Curves
  • The mean and standard deviation computed from
    actual observations (data) are denoted by x̄ and
    s, respectively.
  • The mean and standard deviation of the
    distribution represented by the density curve are
    denoted by µ (mu) and σ (sigma), respectively.

39
Bell-Shaped Curve: The Normal Distribution
(figure: bell curve annotated with its standard deviation)
40
The Normal Distribution
  • Mean µ defines the center of the curve
  • Standard deviation σ defines the spread
  • Notation is N(µ, σ).

41
68-95-99.7 Rule for Any Normal Curve
  • 68% of the observations fall within one standard
    deviation of the mean
  • 95% of the observations fall within two standard
    deviations of the mean
  • 99.7% of the observations fall within three
    standard deviations of the mean
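These percentages can be checked against the standard Normal
CDF; a minimal sketch in Python using scipy:

    from scipy.stats import norm

    # area under the Normal curve within k standard deviations of the mean
    for k in (1, 2, 3):
        area = norm.cdf(k) - norm.cdf(-k)
        print(k, round(area, 4))  # 0.6827, 0.9545, 0.9973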

42
68-95-99.7 Rule for Any Normal Curve
43
Standard Normal (Z) Distribution
  • The Standard Normal distribution has mean 0 and
    standard deviation 1
  • We call this a Z distribution: Z ~ N(0, 1)
  • Any Normal variable X can be turned into a Z
    variable (standardized) by subtracting µ and
    dividing by σ
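The standardization formula, in LaTeX:

    z = \frac{x - \mu}{\sigma}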

44
Standard Normal Table
45
Statistical tests of normality
  • Kolmogorov-Smirnov test
  • Anderson-Darling test
  • Shapiro-Wilk test
  • D'Agostino-Pearson omnibus test

46
Idea of probability
  • Probability is the science of chance behavior
  • Chance behavior is unpredictable in the short
    run, but has a predictable pattern in the long
    run
  • A phenomenon is random if individual outcomes are
    uncertain but there is a predictable distribution
    of outcomes in a large number of repetitions.

The probability of any outcome of a random
phenomenon can be defined as the proportion of
times the outcome would occur in a very long
series of repetitions.
47
How probabilities behave
Eventually, the proportion of heads in fair coin
tosses approaches 0.5
48
Recall the Normal curve
  • We use the Normal density curve to determine
    probabilities

49
Normal probability distribution
individuals with X such that x1 < X < x2
The shaded area under the density curve shows the
proportion, or percent, of individuals in the
population with values of X between x1 and x2.
Because the probability of drawing one individual
at random depends on the frequency of this type
of individual in the population, the probability
is also the shaded area under the curve.
50
Normal probability distribution
A variable whose value is a number resulting from
a random process is a random variable. The
probability distribution of many random variables
is the normal distribution. It shows what values
the random variable can take and is used to
assign probabilities to those values.
Example: probability distribution of women's
heights. Here, since we chose a woman randomly,
her height, X, is a random variable.
To calculate probabilities with the normal
distribution, we will standardize the random
variable
51
Men's Height Example (NHANES, 1980)
  • What proportion of men are less than 68 inches
    tall?

(figure: Normal curve with the standardized value −0.71
marked below 0)
52
Standardized Scores
  • How many standard deviations is 68 from µ on
    X ~ N(70, 2.8)?
  • z = (x − µ) / σ
  • = (68 − 70) / 2.8 = −0.71
  • The value 68 is 0.71 standard deviations below
    the mean of 70

53
Table A: Standard Normal Probabilities
Reading row −0.7 and column .01 gives
P(Z < −0.71) = .2389
54
Men's Height Example (NHANES, 1980)
  • What proportion of men are greater than 68 inches
    tall?
  • Area under the curve sums to 1, so Pr(X > x) = 1 −
    Pr(X < x), as shown below

1 − .2389 = .7611
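Both probabilities can be reproduced from the Normal CDF; a
minimal sketch in Python (the table lookup rounds z to −0.71,
hence the small discrepancy in the fourth decimal):

    from scipy.stats import norm

    p_less = norm.cdf(68, loc=70, scale=2.8)  # P(X < 68), about 0.2375
    p_more = 1 - p_less                       # P(X > 68), about 0.7625
    print(round(p_less, 4), round(p_more, 4))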
55
Reminder: standardizing N(µ, σ)
We standardize Normal data by calculating
z-scores.
Any N(µ, σ) can be standardized to N(0, 1).
56
Distribution of womens heights
N(µ, σ) = N(64.5, 2.5)
Example: What's the proportion of women with a
height between 57" and 72"? That's within 3
standard deviations σ of the mean µ, so that
proportion is roughly 99.7%.
Since about 99.7% of all women have heights
between 57" and 72", the chance of picking one
woman at random with a height in that range is
also about 99.7%.
57
What is the probability, if we pick one woman at
random, that her height will be some value X?
For instance, between 68" and 70": P(68 < X <
70)? Because the woman is selected at random, X
is a random variable.
N(µ, σ) = N(64.5, 2.5)
The cumulative probabilities are P(X < 68) = 0.9192
and P(X < 70) = 0.9861.
The area under the curve for the interval 68" to
70" is 0.9861 − 0.9192 = 0.0669. Thus, the
probability that a randomly chosen woman falls
into this range is 6.69%.
58
Odds and probability
59
Standard 2×2 table
Columns: Outcome/Disease (present/absent)
Rows: Exposure/Treatment (cells a, b in the exposed/treated
row; cells c, d in the unexposed/control row)
60
Calculations from 2x2 table
  • RR = [a/(a+b)] / [c/(c+d)]
  • RRR = [c/(c+d) − a/(a+b)] / [c/(c+d)]
  • ARR = c/(c+d) − a/(a+b)
  • NNT = 1/ARR
  • OR = (a/b) / (c/d) = ad/cb
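A minimal sketch of these calculations in Python, assuming
the conventional layout above (a, b = treated/exposed row;
c, d = control/unexposed row; the function name is
hypothetical):

    def two_by_two(a, b, c, d):
        # outcome risks in the treated and control rows
        risk_rx = a / (a + b)
        risk_ctl = c / (c + d)
        rr = risk_rx / risk_ctl          # relative risk
        arr = risk_ctl - risk_rx         # absolute risk reduction
        rrr = arr / risk_ctl             # relative risk reduction
        nnt = 1 / arr                    # number needed to treat
        odds_ratio = (a / b) / (c / d)   # = ad/cb
        return rr, rrr, arr, nnt, odds_ratio

    print(two_by_two(10, 90, 20, 80))  # e.g., 10/100 vs. 20/100 events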

61
Why use odds ratios?
  • Perfectly good measure of association
  • Can be estimated from a case-control study (RR
    cannot)
  • Offers advantages in meta-analysis
  • Easier to model than RD or RR
  • As the risk falls, the odds and risk come closer
    together; for low event rates, the OR and RR are
    very close

62
Odds and probability/risk
  • Odds are an alternative way of describing the
    chance of an event
  • Odds = Prob / (1 − Prob)
  • Rearranging:
  • Prob = Odds / (1 + Odds)
  • e.g., a probability of 0.20 gives odds of
    0.20 / 0.80 = 0.25

63
Risk and odds
64
Producing data: sampling
65
Inference!
  • We often want answers to questions about a large
    group of individuals -- this is the population
  • We seldom study the entire population, but
    instead select a subset of the population -- this
    is the sample
  • We use inferential techniques to make conclusions
    about the population based on the data in the
    sample

66
Two types of studies
  • Observational: individuals are studied without
    intervention
  • E.g., case-control and cohort studies
  • Experimental studies: the investigator assigns
    an explanatory factor to some subjects but not to
    others
  • E.g., clinical trial

67
Observations vs. Experiments
  • Both types of studies may be used to learn about
    relationships between variables
  • Experimental studies are better suited for
    determining cause-and-effect because they can
    deal with confounding variables via randomization

68
Sample Quality
  • To do a good study, you need a good sample
  • Poor quality samples produce misleading results
  • Study should be designed to generate high quality
    data that can then be used to infer population
    characteristics

69
Examples of Poor Quality Sampling Designs
  • Voluntary response sampling
  • Allows individuals to choose to be in the study
  • e.g., Call-in polls (pp. 178-9 in text)
  • Convenience sampling
  • Individuals that are easiest to reach are
    selected
  • e.g., Interviewing at the mall (p. 179)
  • These techniques favor certain outcomes and
    cannot be trusted to reveal population
    characteristics (sampling bias)

70
Simple Random Sample (SRS)
  • To avoid biased sampling, use impersonal chance
    mechanisms as the basis for selection
  • Simple Random Sample (SRS)
  • (1) Each individual in population has the same
    chance of being selected
  • (2) Every possible sample has an equal chance to
    be chosen

71
Methods for selecting SRSs
  • Physical, e.g., pick numbers from a hat
  • Computerized random number generators
  • Use a table of random digits

72
Picking SRS (Illustration)
  • Suppose 30 individuals labeled 01 to 30 and we
    want to select two at random
  • Random digit table:
  • Select a row in the table at random
  • Break the digits into pairs
  • 68417 35013 15529 → 68 41 73 50 13 15 52 9...
  • Skipping pairs above 30, the first two individuals
    in the sample are 13 and 15
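In practice a computerized random number generator replaces
the digit table; a minimal sketch in Python:

    import random

    random.seed(42)  # fixed seed so the illustration is reproducible
    srs = random.sample(range(1, 31), k=2)  # SRS of 2 from individuals 01-30
    print(srs)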

73
Producing data: experiments
74
Experimentation
  • In an experiment, the investigator exposes the
    explanatory factor to some individuals but not to
    others. The investigator then measures the effect
    on the response variable.
  • In an observational (non-experimental) study,
    individuals are studied without an imposed
    intervention, creating greater opportunity for
    confounding

75
Comparison
Comparison is the first principle of experimentation
  • The effects of treatment can be judged only in
    relation to what would happen in its absence (all
    other things being equal)
  • You cannot assess the effects of a treatment
    without a proper comparison group because
  • Many factors contribute to a response
  • Conditions change on their own over time
  • People are open to suggestion (Placebo effect)
  • Observation changes behavior (Hawthorne effect)

76
Randomization
Randomization is the second principle of
experimentation
  • Randomization = use of chance mechanisms to
    assign treatments
  • Randomization balances confounding variables
    among treatment groups

77
Blinding
Blinding is the third principle of experimentation
  • Blinding = assessment of the response in subjects
    is made without knowledge of which treatment they
    are receiving
  • Single blinding: subjects are unaware of
    treatment group
  • Double blinding: subjects and investigators are
    blinded

78
The Logic of Randomization
  • Randomization ensures that differences in the
    response are due to either
  • Treatment
  • Chance in the assignment of treatments
  • If an experiment finds a difference among groups,
    we then ask whether this difference is due to the
    treatment or due to chance
  • If the observed difference is larger than what
    would be expected just by chance, then we say it
    is statistically significant

79
The logic of randomization
  • Consider an experiment of weight gain in
    laboratory rats
  • Just by luck, some faster-growing rats are going
    to end up in one group or the other
  • If we assign many rats to each group, the effects
    of chance will balance out
  • Use enough controls to balance out chance
    differences

80
Sampling distribution of means
81
Parameters and Statistics
  • Parameter = a fixed number that describes the
    location or spread of a population (the value of
    the parameter is NOT known)
  • Statistic = a number calculated from data in the
    sample (the value of the statistic IS known after
    it is calculated)
  • Sampling variability: different samples or
    experiments from the same population yield
    different values of the same statistic

82
Parameters and statistics
  • The mean of a population is denoted µ → this is
    a parameter
  • The mean of a sample is called x-bar (x̄) → this
    is a statistic
  • Illustration
  • Average age of all U of O students (µ) is 26.5
  • A SRS of 10 U of O students yields an average age
    (x-bar) of 22.3
  • x-bar and µ are related but are not the same
    thing!

83
Law of Large Numbers
The figure to the right demonstrates the law of
large numbers. The average of the first 50 or so
observations is unreliable. As n increases, the
sample mean becomes a better reflection of the
population mean.
84
Sampling distribution of x-bar
Key questions: What would happen if we took many
samples or did the experiment many times? How
would this affect the statistics calculated from
such samples?
85
Case Study: Does This Wine Smell Bad?
  • For a variable with µ = 25 µg/L, σ = 7 µg/L
    with a Normal distribution, suppose we take 1,000
    samples, each of n = 10, from this population,
    calculate x-bar from each sample, and plot the
    x-bars as a histogram

86
The distribution of all x-bars
x-bar is an unbiased estimate of µ
Averages are less variable than individual
observations.
87
Central Limit Theorem
No matter the shape of the population, the
distribution of x-bars will tend to be Normal
when the sample is large.
88
Central Limit Theorem: Illustrative example, time
to perform an activity
  • Data: time to perform an activity (hours)
  • Population NOT Normal (Fig a) with µ = 1, σ = 1
  • Fig (b) is for x-bars based on n = 2
  • Fig (c) is for x-bars based on n = 10
  • Fig (d) is for x-bars based on n = 25
  • Distributions become increasingly Normal because
    of the Central Limit Theorem

89
Confidence intervals: the basics
90
Statistical Inference
  • Two types of statistical inference
  • Confidence Intervals
  • Tests of Significance

91
Confidence Interval: Mean of a Normal Population
  • Take an SRS of size n from a Normal population
    with unknown mean µ and known standard deviation
    σ. A level C confidence interval for µ is:
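The interval appeared as an image on the slide; its standard
form is, in LaTeX:

    \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}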

92
Confidence Interval: Mean of a Normal Population
93
How Confidence Intervals Behave
  • The margin of error is m = z* σ/√n
  • The margin of error gets smaller, resulting in
    more accurate inference,
  • when n gets larger
  • when z* gets smaller (confidence level gets
    smaller)
  • when σ gets smaller (less variation)

94
Interpretation of a confidence interval
  • We are (1 − α) × 100% confident that the true value
    of µ lies in the interval µL to µH
  • If we used the interval several times, then (1 − α)
    × 100% of the time it will cover the true value
    of µ
  • The main purpose of a CI is to estimate an
    unknown parameter with an indication of how
    accurate the estimate is and how confident we are
    that the result is accurate

95
Level of Confidence (C)
  • Confidence level = success rate of the method
  • e.g., a 95% CI says we got this interval by a
    method that gives correct results 95% of the
    time (next slide)
  • Most common levels of confidence are 90%, 95%,
    and 99%

96
Common MISinterpretations of a CI
  • The mean µ will lie within the interval with
    probability 0.95
  • µ is in this interval with probability 0.95
  • The mean of a future sample from this population
    will lie in the interval
  • 95% of the data will lie in the interval

97
Factors that influence a CI
  • The higher the level of confidence, the wider the
    CI
  • The larger the variability in the sample, the
    wider the CI
  • The larger the sample size, the narrower the CI

98
Tests of significance: the basics
99
Recall basics about inference
  • Goal to generalize from the sample (statistic)
    to the population (parameter)
  • Two forms of inference
  • Confidence intervals
  • Significance testing
  • Both CI and significance testing are based on the
    idea of a sampling distribution

100
Stating Hypotheses
  • The goal of this procedure is to quantify the
    evidence against a claim of no difference
  • The claim being tested is called the null
    hypothesis H0
  • The null hypothesis is contradicted by the
    alternative hypothesis Ha (which indicates that
    there is a difference)
  • The test is designed to assess the strength of
    evidence against the null hypothesis.

101
Test hypotheses
  • Null: H0: µ = µ0
  • One-sided alternatives:
  • Ha: µ > µ0
  • Ha: µ < µ0
  • Two-sided alternative:
  • Ha: µ ≠ µ0

102
Hypothesis testing
103
P-value
  • The P-value provides the probability that the
    test statistic would take a value as extreme or
    more extreme than the value observed if the null
    hypothesis were true.
  • The smaller the P-value, the stronger the
    evidence the data provide against the null
    hypothesis

104
P-value
  • A measure of the strength of evidence the data
    provide for or against the null hypothesis
  • Large p-values indicate that Ho is quite
    plausible given the data
  • Small p-values indicate Ho is implausible; the
    data are inconsistent with Ho
  • Rejecting the null hypothesis usually implies a
    treatment effect or a real difference exists
  • Strength of evidence is a continuous spectrum, but
    we tend to use p < 0.05 as the point at which we
    reject Ho
  • The value of p which just rejects Ho is called α,
    the Type I error rate: we will inappropriately
    reject a true Ho with risk α
  • Always try to compute an exact p-value if possible
    (as opposed to saying p < 0.05 or
    0.025 < p < 0.05)

105
Strength of evidence
106
Statistical Significance
  • If the P-value is as small or smaller than the
    significance level α (i.e., P-value ≤ α), then we
    say that the data are statistically significant at
    level α.
  • If we choose α = 0.05, we are requiring that the
    data give evidence against H0 so strong that it
    would occur no more than 5% of the time when H0
    is true.
  • If we choose α = 0.01, we are insisting on
    stronger evidence against H0, evidence so strong
    that it would occur only 1% of the time when H0
    is true.

107
One vs. two-sided tests
  • Determined by the alternative hypothesis
  • Two-sided test implies that we are interested in
    detecting departures from Ho in BOTH directions
  • One-sided test implies that we are specifying a
    DIRECTIONAL alternative hypothesis (usually
    dictated by our understanding of the biology)
  • Must specify the direction of effect a priori

108
One vs. two-sided tests
  • Some people refuse to accept that one-sided
    tests are legitimate (e.g. NEJM, FDA)
  • Statistical trickery
  • Always possible that treatment has negative
    effect
  • But
  • Perfectly acceptable statistically speaking
  • Accepting Ho does not rule out a negative
    treatment effect
  • It is legitimate, and often we are not trying to
    prove that treatment is harmful

109
Inference about a population mean
110
Conditions for Inference about a Mean
  • Data are an SRS of size n
  • Population has a Normal distribution with mean µ
    and standard deviation σ (both µ and σ are
    unknown)
  • Because σ is unknown (the realistic situation), we
    can NOT use z procedures for our confidence
    intervals and significance tests

111
Standard Error
When we do not know the population standard deviation
σ, we use the sample standard deviation s to
calculate the standard error: SE = s / √n
This is called the standard error of the mean
112
One-Sample t Statistic
  • The one-sample z statistic now becomes a
    one-sample t statistic
  • The t statistic does NOT follow a Normal
    distribution
  • It follows a t distribution with n − 1 degrees
    of freedom
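The statistic appeared as an image on the slide; its standard
form is, in LaTeX:

    t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}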

113
The t Distributions
  • t distributions are a family of distributions
    similar to the Standard Normal Z distribution
  • Each member of the t family is identified by its
    degrees of freedom
  • Notation: t(k) denotes a t distribution with k
    degrees of freedom

114
The t Distributions
As k increases, the t(k) curve approaches the Z
curve. As n increases, s becomes a better
estimate of σ.
115
t Table
Table C gives t critical values with upper tail
probability p and corresponding confidence level C
The bottom row of the table applies to z because
a t with infinite degrees of freedom is a
standard Normal (Z) variable
116
One-Sample t Confidence Interval
  • Take an SRS of size n from a population with
    unknown mean µ and unknown standard deviation σ.
    A level C confidence interval for µ is given by
    x̄ ± t* (s/√n),

where t* is the critical value with (n − 1) degrees of
freedom for confidence level C (from the t Table)
117
One-Sample t Test
  • The t test is similar in form to the z test
    learned earlier. The test statistic is
    t = (x̄ − µ0) / (s/√n)

The test statistic has n − 1 degrees of
freedom. Get the approximate P-value from the t
table.
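A minimal sketch of a one-sample t test in Python; the data
values and null mean here are hypothetical, purely for
illustration:

    import numpy as np
    from scipy import stats

    x = np.array([5.4, 4.8, 6.1, 5.0, 5.7, 4.9, 5.3, 5.8])  # hypothetical sample
    t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)     # H0: mu = 5.0
    print(t_stat, p_value)  # two-sided P-value with n - 1 = 7 df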
118
Matched Pairs t Procedures
  • Matched-pair samples allow us to compare
    responses to two treatments within paired subjects
  • Apply the one-sample t procedures to the observed
    differences within pairs
  • The parameter µ is the mean difference in the
    responses to the two treatments within matched
    pairs in the entire population

119
Case Study: Matched Pairs
Air Pollution
  • Pollution index measurements were recorded for
    two areas of a city on each of 8 days
  • To analyze, subtract Area B levels from Area A
    levels. The 8 differences form a single sample.
  • Are the average pollution levels the same for the
    two areas of the city?


120
Normality Assumption: t Procedure Robustness
  • t procedures give exact results when the
    population is Normal. They are robust, producing
    nearly accurate confidence intervals and P-values
    even when the population is not exactly Normal
  • Sample size less than 15: Use t procedures if
    the data appear about Normal (symmetric, single
    peak, no outliers). If the data are skewed or if
    outliers are present, do not use t.
  • Sample size at least 15: The t procedures can be
    used except in the presence of outliers or strong
    skewness in the data.
  • Large samples: The t procedures can be used even
    for clearly skewed distributions when the sample
    is large, roughly n ≥ 40.

121
Can we use a t procedure?
Moderately sized data set (n = 20) with strong
negative skew: t procedures cannot be trusted
122
Can we use t?
  • This histogram shows the distribution of word
    lengths in Shakespeare's plays. The sample is
    very large.
  • The data have a strong positive skew, but there
    are no outliers. We can use the t procedures
    since n ≥ 40

123
Can we use t?
The distribution has no clear violations of
Normality. Therefore, we trust the t procedure.
124
2-sample problems
125
Conditions for inference: comparing two means
  • We have two independent SRSs (simple random
    samples) coming from two distinct populations
    (like men vs. women) with (µ1, σ1) and (µ2, σ2)
    unknown.
  • Both populations should be Normally distributed.
    However, in practice, it is enough that the two
    distributions have similar shapes and that the
    sample data contain no strong outliers.

126
Two-sample t-test
  • The null hypothesis is that both population means
    µ1 and µ2 are equal; thus their difference is
    equal to zero.
  • H0: µ1 = µ2 ⇔ µ1 − µ2 = 0
  • with either a one-sided or a two-sided
    alternative hypothesis.
  • We find how many standard errors (SE) away from
    (µ1 − µ2) the observed difference (x̄1 − x̄2) is by
    standardizing with t:
    t = [(x̄1 − x̄2) − (µ1 − µ2)] / SE,
    where SE = √(s1²/n1 + s2²/n2)
  • Because in a two-sample test H0 poses (µ1 − µ2) =
    0, we simply use t = (x̄1 − x̄2) / SE
  • with df = the smaller of (n1 − 1, n2 − 1)
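A minimal sketch of the unequal-variance (Welch) version in
Python with hypothetical data; note that scipy's Welch degrees
of freedom differ from the conservative "smaller of n1 − 1,
n2 − 1" rule above:

    import numpy as np
    from scipy import stats

    x1 = np.array([12.1, 11.4, 13.0, 12.7, 11.9])  # hypothetical group 1
    x2 = np.array([10.2, 11.1, 10.8, 10.5, 11.4])  # hypothetical group 2
    t_stat, p_value = stats.ttest_ind(x1, x2, equal_var=False)  # Welch t-test
    print(t_stat, p_value)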

127
Two-sample t confidence interval
  • Because we have two independent samples we use
    the difference between both sample averages
    (x̄1 − x̄2) to estimate (µ1 − µ2).
  • Practical use of t: t*
  • C is the area between −t* and t*.
  • We find t* in the line of Table C for df =
    the smaller of (n1 − 1, n2 − 1) and the column for
    confidence level C.
  • The margin of error m is m = t* √(s1²/n1 + s2²/n2)

128
Robustness
  • The two-sample statistic is the most robust when
    both sample sizes are equal and both sample
    distributions are similar. But even when we
    deviate from this, two-sample tests tend to
    remain quite robust.
  • As a guideline, a combined sample size (n1 + n2)
    of 40 or more will allow you to work even with
    the most skewed distributions.

129
Two-sample test assuming equal variance
  • There are two versions of the two-sample t-test:
    one assuming equal variance (pooled two-sample
    test) and one not assuming equal variance
    (unequal variance) for the two populations. You
    may have noticed slightly different formulas and
    degrees of freedom.

The pooled (equal variance) two-sample t-test was
often used before computers because it has
exactly the t distribution for degrees of freedom
n1 + n2 − 2. However, the assumption of equal
variance is hard to check, and thus the unequal
variance test is safer.
(figure: two normally distributed populations with
unequal variances)
130
Comparing two standard deviations
  • It is also possible to compare two population
    standard deviations σ1 and σ2 by comparing the
    standard deviations of two SRSs. However, the
    procedures are not robust at all against
    deviations from normality.
  • When s1² and s2² are sample variances from
    independent SRSs of sizes n1 and n2 drawn from
    normal populations, the F-statistic F = s1² / s2²
  • has the F distribution with n1 − 1 and n2 − 1
    degrees of freedom when H0: σ1 = σ2 is true.
  • The F-value is then compared with critical values
    from Table D for the P-value with a one-sided
    alternative; this P-value is doubled for a
    two-sided alternative.

131
Proportions
132
Two-Way Tables
  • Cross-tabulate counts to form two-way table
  • row variable
  • column variable
  • The count of observations falling into each
    combination of categories falls in a cell of the
    table
  • Counts are totaled to create marginal totals

133
2 independent samples
  • Described as a 2 x 2 table
  • Exact test
  • Fisher's Exact test
  • Approximate tests
  • Z-test
  • Chi-squared test
  • Summary measures
  • RD (risk difference): Ho: RD = 0
  • RR: Ho: RR = 1
  • OR: Ho: OR = 1

134
Fisher's test or chi-square?
  • Either can be used for analyzing contingency
    tables with 2 rows and 2 columns
  • Fisher's test is the best choice as it always
    gives the exact P value
  • The chi-square test is simpler to calculate but
    yields only an approximate P value
  • If using a computer, choose Fisher's test
  • Avoid chi-square when the numbers in the
    contingency table are very small (< 6)

135
2 related (paired) samples
  • Exact test
  • Based on binomial distribution
  • Approximate test
  • McNemar's Chi-squared test
  • Summary measure
  • OR estimator
  • OR confidence interval

136
2 independent stratified samples
  • 2 x 2 x k table
  • Estimation of the OR
  • Mantel-Haenszel chi-square (good because it can
    handle cells with 0)
  • Woolf (precision-weighted) chi-square (most
    commonly used for meta-analysis because the
    formulas are simple, but not as good as M-H)
  • Peto (O-E-V) chi-square
  • Generalized M-H and P-W estimators
  • Tests for homogeneity over strata, i.e., can ORs
    be pooled across strata?
  • Exact test
  • Zelen's test
  • Approximate chi-squared tests
  • Breslow-Day
  • Woolf/Precision-weighted

137
Recommended analysis of 2 x 2 x k tables
  • Choose a suitable effect measure (RD, RR or OR)
  • Test for homogeneity of stratum-specific effect
    measures
  • Compute the summary effect measure estimate and
    its associated CI
  • Test for association

138
Sample size calculation
139
http://statpages.org/
  • (for all your statistical needs, including sample
    size calculations)

140
Nonparametric tests
141
Non-parametric tests
  • Make no distributional assumptions about the data
    (non-normal data)
  • Main focus is the p-value; they don't lend
    themselves to CIs or sample size calculations
  • Sign test: equivalent to a one-sample test
  • Wilcoxon Rank Sum (Mann-Whitney Test):
    equivalent to the independent two-sample t-test
  • Wilcoxon Signed Rank: equivalent to the paired
    t-test

142
Non-parametric tests - advantages
  • Fewer assumptions are required (i.e. no
    distributional assumptions or assumptions about
    equality of variances)
  • Only nominal (categorical) or ordinal (ranked)
    data are required, rather than numerical
    (interval) data

143
Non-parametric tests - disadvantages
  • They are less efficient
  • Less powerful than parametric counterparts
  • Often lead to overestimation of variances of test
    statistics when there are large proportions of
    tied observations
  • They don't lend themselves easily to CIs and
    sample size calculations
  • Interpretation of non-parametric results is quite
    hard

144
Scatterplots and correlation
145
Variable X and variable Y
  • Relationship between 2 quantitative variables
  • Explanatory variable X
  • Response variable Y
  • Does X cause Y ?

146
Scatterplot
  • Start by plotting bivariate data points to make a
    scatterplot

147
Example of a scatterplot
X = percent of students taking the SAT; Y = mean SAT
verbal score. What is the relation between X and Y?
148
Interpreting scatterplots
  • Form: can the data be described by a straight line?
  • Exceptions (outliers) to the form
  • Direction: upward or downward
  • Strength: extent to which data points adhere to
    the trend line

149
Examples of Forms
150
Strength and direction
  • Direction: positive, negative, neither
  • Strength: how closely do points adhere to the
    trend line?
  • Close fitting → strong
  • Loose fitting → weak

151
Strength cannot be judged by eye alone
  • These two scatterplots are of the same data set
  • The second scatter plot looks like a stronger
    correlation, but this is an artifact of the axis
    scaling

152
Correlation coefficient (r)
  • r = the correlation coefficient
  • r is always between −1 and 1, inclusive
  • r = 1 → all points on an upward-sloping line
  • r = −1 → all points on a downward-sloping line
  • r = 0 → no line or a horizontal line
  • The closer r is to 1 or −1, the better the fit,
    the stronger the correlation

153
Interpretation of r
Direction: positive or negative. Strength: the
closer |r| is to 1, the stronger the correlation.
0.0 ≤ |r| < 0.3 → weak correlation
0.3 ≤ |r| < 0.7 → moderate correlation
0.7 ≤ |r| < 1.0 → strong correlation
|r| = 1.0 → perfect correlation
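A minimal sketch of computing r in Python with hypothetical
data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # hypothetical Y
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
    print(round(r, 3))                       # close to +1: strong positive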
154
Not all Relationships are Linear: Miles per
Gallon versus Speed
  • r ≈ 0 (flat line) with a strong non-linear
    relation
  • But...

155
Not all Relationships are Linear: Miles per
Gallon versus Speed
  • Very strong non-linear (curved) relationship
  • r was misleading!

156
Outliers and Correlation
The outlier in the above graph decreases r so
that r ≈ 0. If we remove the outlier → strong
relation
157
Beware!
  • Not all relations are linear
  • Outliers can have large influence on r
  • Lurking variables confound relations

158
Regression
159
Objectives of Regression
  • Quantitative X (explanatory)
  • Quantitative Y (response)
  • Objectives
  • Describe the change in Y per unit X
  • Predict average Y at given X

160
Equation for an algebraic line
Y = (intercept) + (slope)(X), or Y = (slope)(X) +
(intercept)
Intercept → where the line crosses the Y axis. Slope →
angle (steepness) of the line
161
Equation for a regression line
  • Algebraic line → every point falls on the line.
    Exact: Y = intercept + (slope)(X)
  • Statistical line → scatter around a linear trend
  • Predicted: Y = intercept + (slope)(X)

ŷ = a + bx, where ŷ (y-hat) is the predicted
value of Y, a is the intercept, and b is the slope
162
What Line Fits Best?
  • The method we use to draw the best fitting line
    is called the least squares method
  • If we try to draw the line by eye, different
    people will draw different lines

163
The least squares regression line
  • Each point has a residual:
  • Residual = observed y − predicted y
  • = distance of the point from the line predicted
    by the model

The least squares line minimizes the sum of the
squared residuals
164
Regression Line
  • For the bird population data:
  • a = 31.9343
  • b = −0.3040
  • The linear regression equation is:
  • ŷ = 31.9343 − 0.3040x

The slope, −0.3040, represents the average change
in Y per unit X
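A minimal sketch of fitting a least-squares line in Python;
the data are hypothetical (the bird-population values were
not reproduced in the transcript):

    import numpy as np

    x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])   # hypothetical X
    y = np.array([32.0, 28.5, 26.2, 22.9, 19.8])  # hypothetical Y
    b, a = np.polyfit(x, y, deg=1)                # slope b, intercept a
    y_hat = a + b * x                             # predicted values
    print(round(a, 4), round(b, 4))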
165
Cautions About Correlation and Regression
  • Describe only linear relationships
  • Beware of influential outliers
  • Cannot predict beyond the range of X (do not
    extrapolate)
  • Beware of lurking variables (variables other than
    X and Y)
  • Association does not equal causation!

166
Transformations
  • Often we work with transformed data rather than
    the original numbers
  • 1/x for skewed left data
  • Log (x) or ln (x) when skewed right
  • Transforming data can
  • Make a skewed distribution more symmetric
  • Make the distribution more normal-like
  • Stabilize variability
  • Linearize a relationship between 2 or more
    variables
  • Show summary statistics in original units but
    test on the transformed scale

167
Caution
  • Even strong correlations may be non-causal
    (Beware lurking variables!)

168
Survival analysis
169
One sample case
  • Time-to-event data
  • Estimating the survival curve: the Kaplan-Meier
    (product-limit) method
  • Inferences based on survival curve

170
Two-sample case
  • Comparison at a fixed time
  • Hazard rates
  • Overall comparison
  • Mantel-Haenszel method
  • Logrank method
  • Estimation of common hazard ratio
  • Summary event rates
  • Sample size

171
What statistical test should I use?
172
Inferential statistics
173
Inferential statistics
174
Multiple comparisons
  • ANOVA, Kruskal-Wallis, and chi-square will test
    whether there are any differences between groups
  • If doing multiple comparisons, must correct:
  • Bonferroni (multiply p by the number of comparisons)
  • Student-Newman-Keuls
  • Tukey's
  • Scheffé's F test

175
Common errors in statistical interpretation
176
Error 1
  • The p-value is the probability of the null
    hypothesis being true based on the observed result

177
Error 2
  • Failure to reject the null hypothesis equals
    acceptance of the null hypothesis

178
Error 3
  • Failure to reject the null hypothesis equals
    rejection of the alternative hypothesis

179
Error 4
  • An α level of 0.05 is a standard with an
    objective basis

180
Error 5
  • The smaller the p-value, the larger the effect

181
Error 6
  • Statistical significance implies importance

182
Error 7
  • Data verify or refute theory