John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics


1
Introductory Statistics

John Matthews, Professor of Medical Statistics,
School of Mathematics and Statistics
Janine Gray, Senior Lecturer and Deputy Director,
Newcastle Clinical Trials Unit

University of Newcastle-upon-Tyne
2
Course Outline

Data Description
Mean, Median, Standard Deviation
Graphs
The Normal Distribution
Populations and Samples
Confidence intervals and p-values
Estimation and Hypothesis testing
Continuous data
Categorical data
Regression and Correlation

3
Course Objectives

To have an understanding of the Normal
distribution and its relationship to common
statistical analyses
To have an understanding of basic statistical
concepts such as confidence intervals and
p-values
To know which analysis is appropriate for
different types of data

4
Recommended Textbooks

Swinscow TDV and Campbell MJ. Statistics at
Square One (10th edn). BMJ Books
Altman DG. Practical Statistics for Medical
Research. Chapman and Hall
Bland M. An Introduction to Medical Statistics.
Oxford Medical Publications
Campbell MJ, Machin D. Medical Statistics: A
Commonsense Approach. Wiley

5
Other reading

Chinn S. Statistics for the European Respiratory
Journal. Eur Respir J 2001; 18: 393-401
www.mas.ncl.ac.uk/njnsm/medfac/MDPhD/notes.htm
BMJ statistics notes

6
Types of Data

Numerical Data
discrete
number of lesions
number of visits to GP
continuous
height
lesion area

7
Types of Data

Categorical
unordered
Pregnant/Not pregnant
married/single/divorced/separated/widowed
ordered (ordinal)
minimal/moderate/severe/unbearable
Stage of breast cancer I II III IV

8
Exercise

What type are the following variables?
a) sex
b) diastolic blood pressure
c) diagnosis
d) height
e) family size
f) cancer stage

9
Types of Data

Outcome/Dependent variable
outcome of interest
e.g. survival, recovery
Explanatory/Independent variable
treatment group
age
sex

10
Histogram of Birthweight (grams) at 40 weeks GA
11
Summary Statistics

Location
Mean (average value)
Median (middle value)
Mode (most frequently occurring value)
Variability
Variance/SD
Range
Centiles

12
Birthweights (g) at 40 weeks Gestation

mean 3441g
median 3428g
sd 434g
min 2050g
max 4975g
range 2925g

13
Boxplot
14
Symmetric Data

mean = median (approximately)
standard deviation is a suitable measure of spread

15
Skew Data

median: the "typical" value
mean: affected by extreme values, so larger than
the median when the long tail is to the right
SD: fairly meaningless
centiles: less affected by extreme
values/outliers
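
A minimal sketch of the point above, using made-up numbers (they are purely illustrative and do not come from the course data): replacing one value with an extreme one shifts the mean but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical measurements; the values are illustrative only.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
# Same data with the largest value replaced by an extreme one.
skewed = symmetric[:-1] + [60]

for label, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

In the skewed list the mean is pulled towards the extreme value, while the median still describes a "typical" observation.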

16
Half of all doctors are below average.

Even if all surgeons are equally good, about half
will have below average results, one will have
the worst results, and the worst results will be
a long way below average
Ref. BMJ 1998; 316: 1734-1736

17
Discrete Data: Principal diagnosis of patients in
Tooting Bec Hospital
18
Bar Chart
19
Summarising data - Summary

Choosing the appropriate summary statistics and
graph depends upon the type of variable you have
Categorical (unordered/ordered)
Continuous (symmetric/skew)

20
The Normal Distribution

N(μ, σ²)
μ = unknown population mean - estimate using
sample mean
σ = unknown population SD - estimate using sample
SD
Birthweight is N(3441, 434²)

21
N(0,1) - Standard Normal Distribution
68% of values within ±1 SD units
95% of values within ±1.96 SD units
99% of values within ±2.58 SD units
z = SD units
22
Birthweight (g) at 40 weeks
95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 - 4561 grams
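
The two ranges can be reproduced from the quoted mean and SD. This is only a sketch of the arithmetic; the function name `normal_range` is an illustrative choice, and the multipliers 1.96 and 2.58 are the ones given on the slides.

```python
MEAN_G, SD_G = 3441, 434  # birthweight mean and SD from the slides (grams)


def normal_range(mean, sd, z):
    """Return (lower, upper) = mean minus/plus z standard deviations."""
    return mean - z * sd, mean + z * sd


range_95 = normal_range(MEAN_G, SD_G, 1.96)
range_99 = normal_range(MEAN_G, SD_G, 2.58)

print(f"95%: {range_95[0]:.0f} to {range_95[1]:.0f} g")
print(f"99%: {range_99[0]:.0f} to {range_99[1]:.0f} g")
```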
23
Further Reading

http://www.mas.ncl.ac.uk/njnsm/medfac/docs/intro.pdf
Altman DG, Bland JM (1996) Presentation of
numerical data. BMJ 312, 572
Altman DG, Bland JM. (1995) The normal
distribution. BMJ 310, 298.

24
Samples and Populations

Use samples to estimate population quantities
(parameters) such as disease prevalence, mean
cholesterol level etc
Samples are not interesting in their own right -
only to infer information about the population
from which they are drawn
Sampling Variation
Populations are unique - samples are not.

25
Samples and Populations

How much might these estimates vary from sample
to sample?
Determine precision of estimates (how close/far
away from the population?)

26
(Artificial) example

Have 5000 measurements of diastolic blood
pressure from airline pilots. This accounts for
ALL airline pilots and is the population of
airline pilots.
(Artificial example - if we had the whole
population we wouldn't need to sample!!)
Since we have the population, we know the true
population characteristics. It is these we are
trying to estimate from a sample.

27
Population distribution of diastolic BP from
Airline Pilots (in mmHg)
True mean = 78.2, true SD = 9.4
28
Example

Write each measurement on a piece of paper and
put into a hat.
Draw 5 pieces of paper and calculate the mean of
the BP.
replace and repeat 49 more times
End up with 50 (different) estimates of mean BP

29
Sampling Distribution

Each estimate of the mean will be different.
Treat this as a random sample of means
Plot a histogram of the means.
This is an estimate of the sampling distribution
of the mean.
Can get the sampling distribution of any
parameter in a similar way.
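
The hat-drawing experiment can be imitated in code. This is a sketch under stated assumptions: the real pilot measurements are not available here, so the population is simulated as 5000 Normal values with the quoted mean 78.2 and SD 9.4, and the seed and the function name `sample_means` are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Stand-in population: 5000 simulated values, not the real pilot data.
population = [random.gauss(78.2, 9.4) for _ in range(5000)]


def sample_means(pop, n, repeats=50):
    """Draw `repeats` random samples of size n and return the mean of each."""
    return [mean(random.sample(pop, n)) for _ in range(repeats)]


for n in (5, 10, 100):
    means = sample_means(population, n)
    print(f"N = {n:3d}: SD of the 50 sample means = {stdev(means):.2f}")
```

The spread of the 50 means shrinks as the sample size grows, which is what a histogram of the means would show.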

30
Distribution of the mean
Population: μ = 78.2, σ = 9.4
50 samples, N = 5
50 samples, N = 10
50 samples, N = 100
31
Distribution of the Mean

BUT! Don't need to take multiple samples
Standard error of the mean
SE of the mean is the SD of the distribution of
the sample mean

32
Distribution of Sample Mean

Distribution of the sample mean is Normal regardless
of the distribution of the sample (unless the sample
is small or very skew)
SO: can apply Normal theory to the sample mean also

33
Distribution of Sample Mean

i.e. 95% of sample means lie within 1.96 SEs of
the (unknown) true mean
This is the basis for a 95% confidence interval
(CI)
A 95% CI is an interval which on 95% of occasions
includes the population mean

34
Example

57 measurements of FEV1 in male medical students

35
Example

95% of the population lie within mean ± 1.96 SD,
i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres

36
Example

Thus for the FEV1 data, there is a 95% chance that
the interval mean ± 1.96 SE contains the true
population mean, i.e. between 3.89 and 4.23 litres
This is the 95% confidence interval for the mean
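
As a check on the interval, here is a short sketch of the calculation from the quoted figures (57 students, mean 4.06 litres, SD 0.67 litres). The variable names are illustrative.

```python
from math import sqrt

n, mean_fev1, sd_fev1 = 57, 4.06, 0.67  # values quoted on the slides (litres)

se = sd_fev1 / sqrt(n)           # standard error of the mean
ci_low = mean_fev1 - 1.96 * se   # lower 95% confidence limit
ci_high = mean_fev1 + 1.96 * se  # upper 95% confidence limit

print(f"SE = {se:.3f}, 95% CI = {ci_low:.2f} to {ci_high:.2f} litres")
```

The interval is much narrower than the 95% range for individuals because it uses the SE rather than the SD.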

37
Confidence Intervals

The confidence interval (CI) measures
uncertainty. The 95% confidence interval is the
range of values within which we can be 95% sure
that the true value lies for the whole of the
population of patients from whom the study
patients were selected. The CI narrows as the
number of patients on which it is based
increases.

38
Standard Deviations and Standard Errors

The SE is the SD of the sampling distribution (of
the mean, say)
SE = SD/√N
Use SE to describe the precision of estimates
(for example Confidence intervals)
Use SD to describe the variability of samples,
populations or distributions (for example
reference ranges)

39
The t-distribution

When N is small, estimate of SD is particularly
unreliable and the distribution of sample mean is
not Normal
Distribution is more variable - longer tails
Shape of distribution depends upon sample size
This distribution is called the t-distribution

40
N = 2
t(1): 95% within ±12.7
Curves shown: N(0,1) and t(1)
41
N = 10
t(9): 95% within ±2.26
Curves shown: N(0,1) and t(9)
42
N = 30
t(29): 95% within ±2.04
43
t-distribution

As N becomes larger, t-distribution becomes more
similar to Normal distribution
Degrees of Freedom (DF) = sample size - 1
DF is a measure of the amount of information
contained in the data set

44
Implications

Confidence interval for the mean
Sample size < 30: use t-distribution
Sample size > 30: use either Normal or t
distribution
Note: stats packages (generally) will
automatically use the correct distribution for
confidence intervals

45
Example

Numbers of hours of relief obtained by 7
arthritic patients after receiving a new drug
2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
Mean = 3.33, SD = 1.03, DF = 6, 5% point of t(6) = 2.45
95% CI: 3.33 ± 2.45 × 1.03/√7 = 2.38 to 4.28 hours
Normal 95% CI: 3.33 ± 1.96 × 1.03/√7 = 2.57 to
4.09 hours, TOO NARROW!!
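
The example can be reworked from the seven raw values. A sketch only: the standard library has no t-distribution, so the 5% point for 6 DF is entered as the table value 2.447 (an assumption, rounded to 2.45 on the slide), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]  # data from the slide

T_CRIT_6DF = 2.447  # table value for t with 6 DF (assumed; 2.45 on the slide)
Z_CRIT = 1.96       # corresponding Normal value

m = mean(hours)
s = stdev(hours)            # sample SD, using the N - 1 divisor
se = s / sqrt(len(hours))   # standard error of the mean

t_ci = (m - T_CRIT_6DF * se, m + T_CRIT_6DF * se)
z_ci = (m - Z_CRIT * se, m + Z_CRIT * se)

print(f"mean = {m:.2f}, SD = {s:.2f}, SE = {se:.2f}")
print(f"t-based 95% CI:      {t_ci[0]:.2f} to {t_ci[1]:.2f} hours")
print(f"Normal-based 95% CI: {z_ci[0]:.2f} to {z_ci[1]:.2f} hours")
```

With only 7 observations the Normal-based interval is noticeably narrower than the t-based one, which is why it understates the uncertainty.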

46
Hypothesis Testing

Enables us to measure the strength of evidence
supplied by the data concerning a proposition of
interest
In a trial comparing two treatments there will
ALWAYS be a difference between the estimates for
each treatment - a real difference or random
variation?

47
Null Hypothesis

Study hypothesis - hypothesis in the mind of the
investigator (patients with diabetes have raised
blood pressure)
Null hypothesis is the converse of the study
hypothesis - aim to disprove it (patients with
diabetes do not have raised blood pressure)
Hypothesis of no effect/difference

48
Two-Sample t-test

Two independent samples
Can the two samples be considered to be the same
with respect to the variable you are measuring or
are they different?
Sample means will ALWAYS be different - real
difference or random variation?
ASSUMPTION: data are normally distributed and
the SD in each group is similar

49
Two-Sample t-test

24 hour total energy expenditure (MJ/day) in
groups of lean and obese women
Do the women differ in their energy expenditure?
Null hypothesis: energy expenditure in lean and
obese women is the same

50
Boxplot of energy expenditure MJ/day
51
Two-sample t-test

Summary statistics
        lean   obese
Mean     8.1    10.3
SD       1.2     1.4
N         13       9
Difference in means = 10.3 - 8.1 = 2.2
SE of difference = 0.57 (weighted average)

52
Two Sample t-test

Test statistic is 2.2/0.57 = 3.9
N1 + N2 - 2 DF (= 20)
Calculate the probability of observing a value at
least as extreme as 3.9 if the null hypothesis is
true
If the null hypothesis is true, the test
statistic should have a t-distribution with 20 df
(df = N1 + N2 - 2)
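
A sketch of how the standard error and test statistic follow from the summary statistics, assuming the usual pooled-variance form of the two-sample t-test (the "weighted average" of the two variances). Because the inputs here are the rounded summary values, the results come out near 0.56 and 3.95 rather than exactly the quoted 0.57 and 3.9, which were presumably computed from the unrounded data.

```python
from math import sqrt

# Summary statistics from the slides (MJ/day)
n_lean, mean_lean, sd_lean = 13, 8.1, 1.2
n_obese, mean_obese, sd_obese = 9, 10.3, 1.4

df = n_lean + n_obese - 2

# Pooled variance: the two variances weighted by their degrees of freedom.
pooled_var = ((n_lean - 1) * sd_lean**2 + (n_obese - 1) * sd_obese**2) / df

# Standard error of the difference between the two means.
se_diff = sqrt(pooled_var * (1 / n_lean + 1 / n_obese))

t_stat = (mean_obese - mean_lean) / se_diff

print(f"DF = {df}, SE of difference = {se_diff:.2f}, t = {t_stat:.2f}")
```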

53
Two Sample t-test

95% of values from a t-distribution with 20 DF lie
between -2.09 and 2.09
Probability of observing a value as extreme or
more extreme than 3.9 in a t-distribution with 20
df is 0.001
A value of 3.9 is therefore very unlikely to have
come from a t-distribution with 20 df
Conclude that energy expenditure is significantly
different between lean and obese women

54
The P-value

The P-value is the probability of observing a
test statistic at least as extreme as that
observed if the null hypothesis is true

55
t distribution with 20 df
56
Confidence Interval for the difference in two
means

95% CI: 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57
or from 1.05 to 3.41 MJ/day
Thus we are 95% confident that obese women use
between 1.05 and 3.41 MJ/day more energy than
lean women

57
Confidence Interval or P-value?

Confidence interval!!!
P-value will tell you whether or not there is a
statistically significant difference
confidence interval will give information about
the size of the difference and the strength of
the evidence

58
Paired t-test

Obvious pairing between observations
two measurements on each subject (before-after
study)
case-control pairs
Assumption - paired data are normally distributed
Example - Systolic blood pressure (SBP) measured
in 16 middle aged men before and after a standard
exercise. Post-exercise SBP - Pre-exercise SBP
calculated for each man

59
Boxplot of differences
60
Paired t-test

Mean difference = 6.6
SE(Mean) = 1.5
t = 6.6/1.5 = 4.4
Compare with t(15)
P < 0.001
Conclusion: mean systolic blood pressure is
higher after exercise than before

61
Paired t-test

95% confidence interval for the mean difference:
6.6 ± 2.13 × 1.5 = 3.4 to 9.8
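
The paired t-test arithmetic in brief, as a sketch using the quoted mean difference, its standard error and the t(15) multiplier of 2.13. The variable names are illustrative.

```python
mean_diff, se_mean, n = 6.6, 1.5, 16  # values quoted on the slides
T_CRIT_15DF = 2.13                    # 5% point of t with n - 1 = 15 DF

t_stat = mean_diff / se_mean
ci = (mean_diff - T_CRIT_15DF * se_mean, mean_diff + T_CRIT_15DF * se_mean)

print(f"t = {t_stat:.1f} on {n - 1} DF")
print(f"95% CI for the mean difference: {ci[0]:.1f} to {ci[1]:.1f}")
```

The interval excludes zero, in line with the small P-value.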

62
Categorical Variables

To investigate the relationship between two
categorical variables form contingency table
Hypothesis tests
Chi-squared test (χ² test)
Fisher's exact test (small samples)
McNemar's test (paired data)

63
Chi-squared test

Used to test for associations between categorical
variables (2 or more distinct outcomes)
Example - a comparison between psychotherapy and
usual care for major depression in primary care

64
Patient Reported Recovery at 8 months
65
Patient Reported Recovery at 8 months

Difference between proportions recovered: 30.8%
95% confidence interval for difference: 17.7% to
43.8%

66
Larger tables

Similar methods can be applied to larger tables
to test the association between two categorical
variables
Example - Is there an association between housing
tenure and time of delivery of baby
(preterm/term).
Null hypothesis: there is no relationship
between housing tenure and time of delivery

67
Relationship between housing tenure and time of
delivery
68
Relationship between housing tenure and time of
delivery

DF = (5-1) × (2-1) = 4
P = 0.03
Thus we have strong evidence of a relationship between
housing tenure and time of delivery

69
Notes

Chi-squared test not valid if expected values are
small (< 5)
Combine rows or columns to obtain a smaller table
with larger expected values
Use Fisher's exact test for small tables
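
A minimal sketch of the chi-squared calculation, including the check on small expected values. The tables from the slides are not reproduced in this transcript, so the 2 x 2 counts below are hypothetical, and `chi_squared` is an illustrative helper rather than a library function.

```python
def chi_squared(table):
    """Return (statistic, df, expected counts) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under no association: row total x column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Hypothetical counts (rows: two groups; columns: outcome yes / no).
observed = [[30, 10], [20, 40]]
stat, df, expected = chi_squared(observed)
small_cells = [e for row in expected for e in row if e < 5]

print(f"chi-squared = {stat:.2f} on {df} DF")
print("expected values below 5:", small_cells)
```

If `small_cells` is not empty, the notes above apply: combine categories or use Fisher's exact test instead.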

70
McNemar's test

Appropriate for use with paired or matched
(case-control) data with a dichotomous outcome
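
A sketch of the McNemar statistic in its simplest form (chi-squared with 1 DF, no continuity correction). Only the discordant pairs enter the calculation. The counts below are hypothetical and `mcnemar_statistic` is an illustrative name.

```python
def mcnemar_statistic(b, c):
    """Chi-squared statistic (1 DF) from the two discordant-pair counts b and c."""
    return (b - c) ** 2 / (b + c)


# Hypothetical discordant pairs:
# b = pairs positive on method A only, c = pairs positive on method B only.
stat = mcnemar_statistic(25, 10)
print(f"McNemar chi-squared = {stat:.2f} on 1 DF")
```

Values above 3.84, the 5% point of chi-squared with 1 DF, correspond to P < 0.05.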

71
Example - McNemar's test

Skaane compared the use of mammography and
ultrasound in the assessment of 327 (228 palpable
and 99 non-palpable) consecutive malignant
tumours confirmed at histology.
Acta Radiologica 1999; 40: 486-490

72
McNemar's test - example
73
McNemar's test - example

308/327 (94%) were picked up by mammography
compared with 278/327 (85%) picked up by
ultrasound
P < 0.001
Conclusion: mammography is significantly more
sensitive in diagnosing tumours than ultrasound
in a population of mixed malignant tumours

74
Hypothesis testing - summary
Adapted from Chinn S. Statistics for the European
Respiratory Journal.
75
Correlation and Regression

Relationship between two continuous variables
regression
correlation

76
Relationship between two continuous variables

3 main purposes for doing this
to assess whether the two variables are
associated (correlation)
to enable the value of one variable to be
predicted from any known value of the other
variable (regression)
to assess the amount of agreement between two
variables (method comparison study)

77
Example

Women from a pre-defined geographical area were
invited to have their haemoglobin (Hb) level and
packed cell volume measured. They were also
asked their age.

78
Haemoglobin and packed cell volume
79
Example - relationships between variables

Association between Hb and PCV? Hb affects PCV
or PCV affects Hb?
Use correlation to measure the strength of an
association
Association between Hb and age? Age must affect
Hb and not vice versa
Use regression to predict Hb from age

80
Correlation

Not interested in causation, i.e. does a high
PCV cause a high Hb level?
Interested in association, i.e. is a high PCV
associated with a high Hb level?
sample correlation coefficient
summarises strength of relationship
can be used to test the hypothesis that the
population correlation coefficient is 0

81
Correlation Coefficient

dimensionless, from -1 to +1
measures the strength of a linear relationship
+ve - high value of one variable associated with
high value of the other
-ve - high value of one variable associated with
low value of the other
±1 - exact linear relationship
strictly called the Pearson correlation coefficient

82
Example Data
r = -0.4
r = 1
r = 0
r = 0.7
83
When not to use the correlation coefficient

If the relationship is non-linear
With caution in the presence of outliers
When the variables are measured over more than
one distinct group (e.g. disease groups)
When one of the variables is fixed in advance
When assessing agreement

84
Correlation - example data
85
Is there an alternative?

If the data are non-linear or there is an outlier,
use the Spearman rank correlation coefficient

86
Haemoglobin and Packed Cell Volume

Without outlier
Pearson = 0.67
Spearman = 0.63
With outlier
Pearson = 0.34
Spearman = 0.48
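
To illustrate why the rank-based coefficient is less sensitive to an outlier, here is a sketch with made-up data (the Hb and PCV measurements are not reproduced in this transcript). Spearman's coefficient is computed as the Pearson coefficient of the ranks. The helper names are illustrative, and the simple ranking assumes no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upwards; assumes there are no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data, then the same data with one outlying point added.
x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 5, 4, 6]
x_out, y_out = x + [20], y + [0]

for label, xs, ys in (("without outlier", x, y), ("with outlier", x_out, y_out)):
    print(f"{label}: Pearson = {pearson(xs, ys):.2f}, Spearman = {spearman(xs, ys):.2f}")
```

Both coefficients fall when the outlier is added, but the Pearson coefficient moves further because it depends on the actual distances rather than on the ordering alone.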

87
Regression

Assume a change in x will cause a change in y
predict y for a given value of x
usually not logical to believe y causes x
y is the dependent variable (vertical axis)
x is the independent variable (horizontal axis)

88
Example - Haemoglobin vs Age
89
Regression

Logical to assume that increasing age leads to
increasing Hb
Not logical to assume Hb affects age!
Assume underlying true linear relationship
Make an estimate of what that true linear
relationship is

90
Estimating a regression line

How do I identify the best straight line?
least squares estimate
straight line determined by slope and intercept
y = a + b × x
a and b are estimates of the true intercept and
slope and are subject to sampling variation
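
A sketch of the least squares estimates for a single predictor, using the standard closed-form expressions (slope = Sxy/Sxx, intercept = mean of y minus slope times mean of x). The data are made up for illustration and `least_squares` is an illustrative name.

```python
def least_squares(x, y):
    """Return (a, b): intercept and slope minimising the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx      # slope
    a = my - b * mx    # intercept: the line passes through (mean x, mean y)
    return a, b


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares(x, y)
print(f"y = {a:.1f} + {b:.1f} * x")
```

A different sample from the same population would give slightly different values of a and b, which is the sampling variation mentioned above.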

91
Regression line of haemoglobin on age
92
Regression of haemoglobin on age

Variable(s) Entered on Step Number 1..  AGE  Age (Years)

Multiple R           .87959
R Square             .77367
Adjusted R Square    .76110
Standard Error      1.17398

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1         84.80397      84.80397
Residual      18         24.80803       1.37822

F = 61.53133     Signif F = .0000

93
Regression of haemoglobin on age

---------------------- Variables in the Equation ----------------------
Variable            B       SE B    95% Confdnce Intrvl B        T   Sig T
AGE           .134251    .017115     .098295     .170208     7.844   .0000
(Constant)   8.239786    .794261    6.571104    9.908467    10.374   .0000

94
What does this tell us?

Mean Hb = 8.2 + 0.13 × AGE
95% CI for the slope goes from 0.098 to 0.170
P < 0.0001
Significant relationship between Hb and age
77% of the variability in Hb can be accounted for
by age
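
These summary figures can be traced back to the regression output. A sketch of the arithmetic, using the sums of squares and coefficients printed there. The t multiplier 2.101 for 18 DF is a table value and an assumption, since it is not shown in the output.

```python
# Figures taken from the regression output
ss_regression, ss_residual = 84.80397, 24.80803
df_regression, df_residual = 1, 18
slope, se_slope = 0.134251, 0.017115

T_CRIT_18DF = 2.101  # 5% point of t with 18 DF (table value, assumed)

# Proportion of the variability in Hb accounted for by age.
r_squared = ss_regression / (ss_regression + ss_residual)

# F statistic: regression mean square divided by residual mean square.
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

# t statistic and 95% confidence interval for the slope.
t_slope = slope / se_slope
slope_ci = (slope - T_CRIT_18DF * se_slope, slope + T_CRIT_18DF * se_slope)

print(f"R squared = {r_squared:.3f}, F = {f_stat:.2f}, t = {t_slope:.3f}")
print(f"95% CI for the slope: {slope_ci[0]:.3f} to {slope_ci[1]:.3f}")
```

With a single predictor, F is the square of the t statistic for the slope, so the two tests agree.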

95
How can it be used?

Predict mean Hb for a given age
E.g. What is the mean Hb of a 50 year old?
Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl
95% CI for the estimate from 14.4 to 15.5 g/dl
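
A sketch of the prediction using the coefficients from the regression output. It also shows why the quoted interval is centred near 15.0 rather than 14.7: the rounded coefficients (8.2 and 0.13) give 14.7, while the unrounded ones give about 14.95. The function name is illustrative.

```python
INTERCEPT, SLOPE = 8.239786, 0.134251  # from the regression output


def predicted_mean_hb(age_years):
    """Predicted mean haemoglobin (g/dl) at a given age, from the fitted line."""
    return INTERCEPT + SLOPE * age_years


hb_50 = predicted_mean_hb(50)
hb_50_rounded = 8.2 + 0.13 * 50  # same calculation with rounded coefficients

print(f"unrounded coefficients: {hb_50:.2f} g/dl")
print(f"rounded coefficients:   {hb_50_rounded:.1f} g/dl")
```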

96
How can it be used?

To calculate reference ranges for the population
E.g. What range would you expect 95% of 50 year
olds to lie within? (reference range)
Between 12.4 and 17.5 g/dl

97
95% confidence interval for the mean and 95%
prediction interval for individuals
98
Definitions

Predicted value
the value predicted by the regression line
an estimate of the mean value
Residual
Observed value - predicted value

99
What assumptions have I made?

The relationship is approximately linear
The residuals have a normal distribution

100
Multiple Regression

One outcome variable with multiple predictor
variables
Residuals assumed to be normally distributed
Predictor variables can be continuous or
categorical
No assumptions made about distribution of
continuous predictor variables

101
Multiple Regression

Example. Does the value of packed cell volume
improve the prediction of Hb?
Model fitted
Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed
cell volume (%)
R² = 83%
Knowledge of packed cell volume improves the
prediction of haemoglobin

102
Summary

Regression can be used to estimate the numerical
relationship between an outcome variable and one
or more predictor variables
Correlation coefficient alone is of limited use
