Title: John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics
1Introductory Statistics
- John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics
- Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit
University of Newcastle-upon-Tyne
2Course Outline
- Data Description
- Mean, Median, Standard Deviation
- Graphs
- The Normal Distribution
- Populations and Samples
- Confidence intervals and p-values
- Estimation and Hypothesis testing
- Continuous data
- Categorical data
- Regression and Correlation
3Course Objectives
- To have an understanding of the Normal distribution and its relationship to common statistical analyses
- To have an understanding of basic statistical concepts such as confidence intervals and p-values
- To know which analysis is appropriate for different types of data
4Recommended Textbooks
- Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books
- Altman DG. Practical Statistics for Medical Research. Chapman and Hall
- Bland M. An Introduction to Medical Statistics. Oxford Medical Publications
- Campbell MJ, Machin D. Medical Statistics: A Commonsense Approach. Wiley
5Other reading
- Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18: 393-401
- www.mas.ncl.ac.uk/njnsm/medfac/MDPhD/notes.htm
- BMJ statistics notes
6Types of Data
- Numerical Data
- discrete
- number of lesions
- number of visits to GP
- continuous
- height
- lesion area
7Types of Data
- Categorical
- unordered
- Pregnant/Not pregnant
- married/single/divorced/separated/widowed
- ordered (ordinal)
- minimal/moderate/severe/unbearable
- Stage of breast cancer: I, II, III, IV
8Exercise
- What type are the following variables?
- a) sex
- b) diastolic blood pressure
- c) diagnosis
- d) height
- e) family size
- f) cancer stage
9Types of Data
- Outcome/Dependent variable
- outcome of interest
- e.g. survival, recovery
- Explanatory/Independent variable
- treatment group
- age
- sex
10Histogram of Birthweight (grams) at 40 weeks GA
11Summary Statistics
- Location
- Mean (average value)
- Median (middle value)
- Mode (most frequently occurring value)
- Variability
- Variance/SD
- Range
- Centiles
12Birthweights (g) at 40 weeks Gestation
- mean 3441g
- median 3428g
- sd 434g
- min 2050g
- max 4975g
- range 2925g
13Boxplot
14Symmetric Data
- mean = median (approx)
- standard deviation
15Skew Data
- median = "typical" value
- mean affected by extreme values - larger than median when the data are right-skewed
- SD fairly meaningless
- centiles (less affected by extreme values/outliers)
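A small illustration of the points on symmetric and skew data, using made-up values (not data from the course): one extreme value moves the mean and SD a long way, while the median stays put.

```python
import statistics

# Hypothetical values, invented for illustration only
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 100]   # same values with one extreme observation

def summarise(values):
    """Return (mean, median, sample SD) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

sym_mean, sym_median, sym_sd = summarise(symmetric)
skew_mean, skew_median, skew_sd = summarise(skewed)

print(f"symmetric: mean={sym_mean}, median={sym_median}, SD={sym_sd:.2f}")
print(f"skewed:    mean={skew_mean}, median={skew_median}, SD={skew_sd:.2f}")
```

In the skewed list the median is unchanged, which is why the median and centiles are the usual summaries for skew data.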
16Half of all doctors are below average.
- Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average
- Ref. BMJ 1998; 316: 1734-1736
17Discrete Data: Principal diagnosis of patients in Tooting Bec Hospital
18Bar Chart
19Summarising data - Summary
- Choosing the appropriate summary statistics and graph depends upon the type of variable you have
- Categorical (unordered/ordered)
- Continuous (symmetric/skew)
20The Normal Distribution
- N(μ, σ²)
- μ = unknown population mean - estimate using sample mean
- σ = unknown population SD - estimate using sample SD
- Birthweight is N(3441, 434²)
21N(0,1) - Standard Normal Distribution
68% within ±1 SD units
95% within ±1.96 SD units
99% within ±2.58 SD units
z = SD units
22Birthweight (g) at 40 weeks
95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 - 4561 grams
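The birthweight ranges can be reproduced from the quoted mean and SD. The sketch below is only an illustration: the function names are made up here, and it uses the multipliers 1, 1.96 and 2.58 as given in the slides rather than exact Normal quantiles.

```python
from statistics import NormalDist

MEAN_G = 3441   # birthweight mean quoted in the slides (grams)
SD_G = 434      # birthweight SD quoted in the slides (grams)

def central_range(mean, sd, z):
    """Interval from mean - z*sd to mean + z*sd."""
    return mean - z * sd, mean + z * sd

def coverage(z):
    """Proportion of a Normal distribution within z SDs either side of the mean."""
    std_normal = NormalDist()   # the standard Normal, N(0, 1)
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1.0, 1.96, 2.58):
    low, high = central_range(MEAN_G, SD_G, z)
    print(f"z = {z}: {coverage(z):.1%} of babies between {low:.0f} g and {high:.0f} g")
```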
23Further Reading
- http://www.mas.ncl.ac.uk/njnsm/medfac/docs/intro.pdf
- Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572.
- Altman DG, Bland JM (1995) The normal distribution. BMJ 310, 298.
24Samples and Populations
- Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc.
- Samples are not interesting in their own right - only to infer information about the population from which they are drawn
- Sampling Variation
- Populations are unique - samples are not.
25Samples and Populations
- How much might these estimates vary from sample to sample?
- Determine precision of estimates (how close/far away from the population?)
26(Artificial) example
- Have 5000 measurements of diastolic blood pressure from airline pilots. This accounts for ALL airline pilots and is the population of airline pilots.
- (Artificial example - if we had the whole population we wouldn't need to sample!!)
- Since we have the population, we know the true population characteristics. It is these we are trying to estimate from a sample.
27Population distribution of diastolic BP from
Airline Pilots (in mmHg)
True mean = 78.2, True SD = 9.4
28Example
- Write each measurement on a piece of paper and put into a hat.
- Draw 5 pieces of paper and calculate the mean of the BP.
- Replace and repeat 49 more times
- End up with 50 (different) estimates of mean BP
29Sampling Distribution
- Each estimate of the mean will be different.
- Treat this as a random sample of means
- Plot a histogram of the means.
- This is an estimate of the sampling distribution of the mean.
- Can get the sampling distribution of any parameter in a similar way.
30Distribution of the mean
μ = 78.2, σ = 9.4
Population
50 samples, N = 5
50 samples, N = 10
50 samples, N = 100
31Distribution of the Mean
- BUT! Don't need to take multiple samples
- Standard error of the mean
- SE of the mean is the SD of the distribution of
the sample mean
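The hat-drawing experiment can be imitated on a computer. The sketch below is not the course's own simulation: the population is generated artificially from a Normal distribution with mean 78.2 and SD 9.4 (an assumption standing in for the 5000 pilots), and the seed and function name are arbitrary. It compares the SD of 50 sample means with SD/√N for each sample size.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Artificial population: 5000 values from N(78.2, 9.4) (an assumption)
population = [random.gauss(78.2, 9.4) for _ in range(5000)]
pop_sd = statistics.pstdev(population)

def sd_of_sample_means(sample_size, n_samples=50):
    """Draw n_samples random samples and return the SD of their means."""
    means = [statistics.mean(random.sample(population, sample_size))
             for _ in range(n_samples)]
    return statistics.stdev(means)

results = {}
for n in (5, 10, 100):
    observed = sd_of_sample_means(n)
    theoretical = pop_sd / math.sqrt(n)   # standard error of the mean
    results[n] = (observed, theoretical)
    print(f"N = {n:3d}: SD of 50 sample means = {observed:.2f}, "
          f"SD/sqrt(N) = {theoretical:.2f}")
```

The two columns should be close for each N, which is why a single sample is enough to estimate the standard error.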
32Distribution of Sample Mean
- Distribution of the sample mean is approximately Normal regardless of the distribution of the sample (unless the sample is small or very skew)
- SO: can apply Normal theory to the sample mean also
33Distribution of Sample Mean
- i.e. 95% of sample means lie within 1.96 SEs of the (unknown) true mean
- This is the basis for a 95% confidence interval (CI)
- A 95% CI is an interval which on 95% of occasions includes the population mean
34Example
- 57 measurements of FEV1 in male medical students
35Example
- Mean = 4.06 litres, SD = 0.67 litres
- 95% of the population lie within mean ± 1.96 SD, i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres
36Example
- Thus for FEV1 data, 95% chance that the interval mean ± 1.96 SE contains the true population mean, i.e. between 3.89 and 4.23 litres
- This is the 95% confidence interval for the mean
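The FEV1 confidence interval follows from the figures quoted in the example (57 students, mean 4.06 litres, SD 0.67 litres). A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
import math

N = 57        # number of students (from the example)
MEAN = 4.06   # sample mean FEV1 in litres (from the example)
SD = 0.67     # sample SD in litres (from the example)
Z_95 = 1.96   # Normal multiplier for 95%

se = SD / math.sqrt(N)        # standard error of the mean
ci_low = MEAN - Z_95 * se
ci_high = MEAN + Z_95 * se

print(f"SE = {se:.3f} litres")
print(f"95% CI for the mean: {ci_low:.2f} to {ci_high:.2f} litres")
```

The interval is much narrower than the range containing 95% of individuals because it uses the SE, not the SD.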
37Confidence Intervals
- The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.
38Standard Deviations and Standard Errors
- The SE is the SD of the sampling distribution (of the mean, say)
- SE = SD/√N
- Use SE to describe the precision of estimates (for example confidence intervals)
- Use SD to describe the variability of samples, populations or distributions (for example reference ranges)
39The t-distribution
- When N is small, the estimate of SD is particularly unreliable and the distribution of the sample mean is not Normal
- Distribution is more variable - longer tails
- Shape of distribution depends upon sample size
- This distribution is called the t-distribution
40N = 2
Curves: N(0,1) and t(1); t(1): 95% within ±12.7
41N = 10
Curves: N(0,1) and t(9); t(9): 95% within ±2.26
42N = 30
t(29): 95% within ±2.04
43t-distribution
- As N becomes larger, the t-distribution becomes more similar to the Normal distribution
- Degrees of Freedom (DF) = sample size - 1
- DF is a measure of the amount of information contained in the data set
44Implications
- Confidence interval for the mean
- Sample size < 30: use t-distribution
- Sample size > 30: use either Normal or t-distribution
- Note: stats packages (generally) will automatically use the correct distribution for confidence intervals
45Example
- Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
- Mean = 3.33, SD = 1.03, DF = 6, t(6) = 2.45
- 95% CI = 3.33 ± 2.45 × 1.03/√7 = 2.38 to 4.28 hours
- Normal 95% CI = 3.33 ± 1.96 × 1.03/√7 = 2.57 to 4.09 hours: TOO NARROW!!
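The arthritis example can be checked directly from the seven observations. This sketch takes the multiplier 2.45 (for 6 DF) from the slide rather than computing it. Because it works from unrounded values of the mean and SD, its limits can differ from the slide's in the second decimal place.

```python
import math
import statistics

hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]   # data from the slide
T_CRIT = 2.45   # 95% multiplier quoted on the slide for 6 DF
Z_CRIT = 1.96   # Normal 95% multiplier

n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)   # sample SD (divides by n - 1)
se = sd / math.sqrt(n)         # standard error of the mean

t_ci = (mean - T_CRIT * se, mean + T_CRIT * se)
z_ci = (mean - Z_CRIT * se, mean + Z_CRIT * se)

print(f"mean = {mean:.2f}, SD = {sd:.2f}, DF = {n - 1}")
print(f"t-based 95% CI:      {t_ci[0]:.2f} to {t_ci[1]:.2f} hours")
print(f"Normal-based 95% CI: {z_ci[0]:.2f} to {z_ci[1]:.2f} hours")
```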
46Hypothesis Testing
- Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest
- In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?
47Null Hypothesis
- Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure)
- Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure)
- Hypothesis of no effect/difference
48Two-Sample t-test
- Two independent samples
- Can the two samples be considered to be the same with respect to the variable you are measuring or are they different?
- Sample means will ALWAYS be different - real difference or random variation?
- ASSUMPTION: data are normally distributed and the SD in each group is similar
49Two-Sample t-test
- 24 hour total energy expenditure (MJ/day) in groups of lean and obese women
- Do the women differ in their energy expenditure?
- Null hypothesis: energy expenditure in lean and obese women is the same
50Boxplot of energy expenditure MJ/day
51Two-sample t-test
- Summary statistics:
        Lean   Obese
  Mean  8.1    10.3
  SD    1.2    1.4
  N     13     9
- Difference in means = 10.3 - 8.1 = 2.2
- SE of difference = 0.57 (from the pooled SD, a weighted average of the two group SDs)
52Two Sample t-test
- Test statistic is 2.2/0.57 = 3.9
- N1 + N2 - 2 DF (= 20)
- Calculate the probability of observing a value at least as extreme as 3.9 if the null hypothesis is true
- If the null hypothesis is true, the test statistic should have a t-distribution with 20 df (df = N1 + N2 - 2)
53Two Sample t-test
- 95% of values from a t-distribution with 20 DF lie between -2.09 and 2.09
- Probability of observing a value as extreme or more extreme than 3.9 in a t-distribution with 20 df is 0.001
- Only a very small probability that the value of 3.9 fits reasonably with a t-distribution with 20 df
- Conclude that energy expenditure is significantly different between lean and obese women
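The two-sample calculation can be sketched from the summary statistics alone. The function below is an illustrative helper, not part of the course material, and it assumes equal SDs in the two groups, as the test does. Because the quoted means and SDs are rounded, the results agree with the slide values (SE 0.57, t 3.9) only approximately.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from summary statistics, assuming equal SDs."""
    df = n1 + n2 - 2
    # Pooled variance: average of the two variances weighted by their DF
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean2 - mean1
    return diff, se_diff, diff / se_diff, df

# Lean: mean 8.1, SD 1.2, N 13; obese: mean 10.3, SD 1.4, N 9 (from the slides)
diff, se_diff, t_stat, df = pooled_t(8.1, 1.2, 13, 10.3, 1.4, 9)

T_CRIT_20 = 2.09   # 95% point of t with 20 DF, as quoted in the slides
print(f"difference = {diff:.1f}, SE = {se_diff:.2f}, t = {t_stat:.2f}, DF = {df}")
print("significant at the 5% level:", abs(t_stat) > T_CRIT_20)
```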
54The P-value
- The P-value is the probability of observing a
test statistic at least as extreme as that
observed if the null hypothesis is true
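To make the definition concrete, the two-sided P-value for the energy expenditure example (t = 3.9 on 20 DF) can be approximated by integrating the t density numerically. This is a sketch using only the standard library, with Simpson's rule chosen here for simplicity. A statistics package would normally do this for you.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=2000):
    """P(|T| >= |t_obs|), using Simpson's rule on the interval [0, |t_obs|]."""
    upper = abs(t_obs)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # P(0 <= T <= |t_obs|)
    return 2 * (0.5 - central_half)

p_value = two_sided_p(3.9, 20)
print(f"two-sided P for t = 3.9 on 20 DF: {p_value:.4f}")
```

The result is of the order of 0.001, in line with the value quoted for this example.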
55t distribution with 20 df
56Confidence Interval for the difference in two
means
- 95% CI: 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57
- or from 1.05 to 3.41 MJ/day
- Thus we are 95% confident that obese women use between 1.05 and 3.41 MJ/day more energy than lean women
57Confidence Interval or P-value?
- Confidence interval!!!
- P-value will tell you whether or not there is a statistically significant difference
- Confidence interval will give information about the size of the difference and the strength of the evidence
58Paired t-test
- Obvious pairing between observations
- two measurements on each subject (before-after study)
- case-control pairs
- Assumption - the differences between paired observations are normally distributed
- Example - Systolic blood pressure (SBP) measured in 16 middle aged men before and after a standard exercise. The difference (post-exercise SBP minus pre-exercise SBP) was calculated for each man
59Boxplot of differences
60Paired t-test
- Mean difference = 6.6
- SE(Mean) = 1.5
- t = 6.6/1.5 = 4.4
- Compare with t(15)
- P < 0.001
- Conclusion - mean systolic blood pressure is higher after exercise than before
61Paired t-test
- 95% confidence interval for the mean difference
- 6.6 ± 2.13 × 1.5 = 3.4 to 9.8
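The paired analysis reduces to a one-sample calculation on the differences. A minimal sketch using the summary figures quoted for the 16 men, with the multiplier 2.13 for 15 DF taken from the slide (variable names are illustrative):

```python
N_PAIRS = 16       # men measured before and after exercise
MEAN_DIFF = 6.6    # mean of (post - pre) differences, from the slide
SE_MEAN = 1.5      # standard error of the mean difference, from the slide
T_CRIT_15 = 2.13   # 95% multiplier for 15 DF, as quoted on the slide

df = N_PAIRS - 1
t_stat = MEAN_DIFF / SE_MEAN
ci_low = MEAN_DIFF - T_CRIT_15 * SE_MEAN
ci_high = MEAN_DIFF + T_CRIT_15 * SE_MEAN

print(f"t = {t_stat:.1f} on {df} DF")
print(f"95% CI for the mean difference: {ci_low:.1f} to {ci_high:.1f}")
```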
62Categorical Variables
- To investigate the relationship between two categorical variables form a contingency table
- Hypothesis tests
- Chi-squared test (χ² test)
- Fisher's exact test (small samples)
- McNemar's test (paired data)
63Chi-squared test
- Used to test for associations between categorical variables (2 or more distinct outcomes)
- Example - a comparison between psychotherapy and usual care for major depression in primary care
64Patient Reported Recovery at 8 months
65Patient Reported Recovery at 8 months
- Difference between proportions = 30.8%
- 95% confidence interval for difference: 17.7% to 43.8%
66Larger tables
- Similar methods can be applied to larger tables to test the association between two categorical variables
- Example - Is there an association between housing tenure and time of delivery of baby (preterm/term)?
- Null hypothesis: there is no relationship between housing tenure and time of delivery
67Relationship between housing tenure and time of
delivery
68Relationship between housing tenure and time of
delivery
- DF = (5-1) × (2-1) = 4
- P = 0.03
- Thus we have strong evidence of a relationship between housing tenure and time of delivery
69Notes
- Chi-squared test not valid if expected values are small (< 5)
- Combine rows or columns to obtain a smaller table with larger expected values
- Use Fisher's exact test for small tables
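The chi-squared statistic, its degrees of freedom and the expected values it relies on can be sketched as follows. The function names and the 2 × 2 table of counts are invented for illustration. They are not the trial or housing tenure data.

```python
def chi_squared(table):
    """Return (statistic, df, expected counts) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under no association: row total x column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum((obs - exp) ** 2 / exp
                    for obs_row, exp_row in zip(table, expected)
                    for obs, exp in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

def has_small_expected(expected, threshold=5):
    """True if any expected count is below the threshold."""
    return any(e < threshold for row in expected for e in row)

# Hypothetical 2 x 2 table of counts, invented for illustration only
example = [[30, 10],
           [20, 40]]
stat, df, expected = chi_squared(example)
print(f"chi-squared = {stat:.2f} on {df} DF")
print("any expected value below 5:", has_small_expected(expected))
```

For a table with 5 rows and 2 columns the same function returns 4 degrees of freedom, matching (5-1) × (2-1).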
70McNemar's test
- Appropriate for use with paired or matched
(case-control) data with a dichotomous outcome
71Example - McNemar's test
- Skaane compared the use of mammography and ultrasound in the assessment of 327 (228 palpable and 99 non-palpable) consecutive malignant tumours confirmed at histology.
- Acta Radiologica vol 40: 486-490 (1999)
72McNemar's test - example
73McNemar's test - example
- 308/327 (94%) were picked up by mammography compared with 278/327 (85%) picked up by ultrasound
- P < 0.001
- Conclusion: mammography is significantly more sensitive in diagnosing tumours than ultrasound in a population of mixed malignant tumours
74Hypothesis testing - summary
Adapted from Chinn S. Statistics for the European
Respiratory Journal.
75Correlation and Regression
- Relationship between two continuous variables
- regression
- correlation
76Relationship between two continuous variables
- 3 main purposes for doing this
- to assess whether the two variables are associated (correlation)
- to enable the value of one variable to be predicted from any known value of the other variable (regression)
- to assess the amount of agreement between two variables (method comparison study)
77Example
- Women from a pre-defined geographical area were
invited to have their haemoglobin (Hb) level and
packed cell volume measured. They were also
asked their age.
78Haemoglobin and packed cell volume
79Example - relationships between variables
- Association between Hb and PCV? Hb affects PCV or PCV affects Hb?
- Use correlation to measure the strength of an association
- Association between Hb and age? Age must affect Hb and not vice versa
- Use regression to predict Hb from age
80Correlation
- Not interested in causation, i.e. does a high PCV cause a high Hb level?
- Interested in association, i.e. is a high PCV associated with a high Hb level?
- sample correlation coefficient
- summarises strength of relationship
- can be used to test the hypothesis that the population correlation coefficient is 0
81Correlation Coefficient
- dimensionless, from -1 to 1
- measures the strength of a linear relationship
- +ve - high value of one variable associated with high value of the other
- -ve - high value of one variable associated with low value of the other
- ±1 = exact linear relationship
- strictly called Pearson correlation coefficient
82Example Data
r = -0.4
r = 1
r = 0
r = 0.7
83When not to use the correlation coefficient
- If the relationship is non-linear
- with caution in the presence of outliers
- when the variables are measured over more than one distinct group (e.g. disease groups)
- when one of the variables is fixed in advance
- when assessing agreement
84Correlation - example data
85Is there an alternative?
- If the data are non-linear or there is an outlier
- use the Spearman rank correlation coefficient
86Haemoglobin and Packed Cell Volume
- Without outlier
- Pearson = 0.67
- Spearman = 0.63
- With outlier
- Pearson = 0.34
- Spearman = 0.48
87Regression
- Assume a change in x will cause a change in y
- predict y for a given value of x
- usually not logical to believe y causes x
- y is the dependent variable (vertical axis)
- x is the independent variable (horizontal axis)
88Example - Haemoglobin vs Age
89Regression
- Logical to assume that increasing age leads to increasing Hb
- Not logical to assume Hb affects age!
- Assume underlying true linear relationship
- Make an estimate of what that true linear relationship is
90Estimating a regression line
- How do I identify the best straight line?
- least squares estimate
- straight line determined by slope and intercept
- y = a + b × x
- a and b are estimates of the true intercept and
slope and are subject to sampling variation
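The least squares estimates have a simple closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A minimal sketch, using a function name and five data points invented for illustration (not the haemoglobin data):

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: line passes through (mean_x, mean_y)
    return a, b

# Hypothetical points, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

The residuals from a least squares line always sum to zero (up to rounding), which is a quick check on the fit.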
91Regression line of haemoglobin on age
92Regression of haemoglobin on age
- Variable(s) Entered on Step Number 1: AGE, Age (Years)
  Multiple R          .87959
  R Square            .77367
  Adjusted R Square   .76110
  Standard Error     1.17398
- Analysis of Variance
                DF   Sum of Squares   Mean Square
  Regression     1         84.80397      84.80397
  Residual      18         24.80803       1.37822
  F = 61.53133    Signif F = .0000
93Regression of haemoglobin on age
- Variables in the Equation
  Variable           B       SE B   95% Confdnce Intrvl B        T   Sig T
  AGE          .134251    .017115    .098295     .170208     7.844   .0000
  (Constant)  8.239786    .794261   6.571104    9.908467    10.374   .0000
94What does this tell us?
- Mean Hb = 8.2 + 0.13 × AGE
- 95% CI for the slope goes from 0.098 to 0.170
- P < 0.0001
- Significant relationship between Hb and age
- 77% of the variability in Hb can be accounted for by age
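The quantities summarised here can be recovered from the numbers in the regression output. The sketch below does that arithmetic. The only value not taken from the output is the multiplier 2.101, the standard tabulated 97.5% point of the t-distribution with 18 DF, which is supplied here as an assumption.

```python
SS_REGRESSION = 84.80397   # sum of squares, regression (1 DF)
SS_RESIDUAL = 24.80803     # sum of squares, residual
DF_RESIDUAL = 18
SLOPE = 0.134251           # B for AGE
SE_SLOPE = 0.017115        # SE B for AGE
T_CRIT_18 = 2.101          # tabulated 97.5% point of t, 18 DF (not in the output)

r_squared = SS_REGRESSION / (SS_REGRESSION + SS_RESIDUAL)
f_stat = SS_REGRESSION / (SS_RESIDUAL / DF_RESIDUAL)   # regression MS / residual MS
t_slope = SLOPE / SE_SLOPE
slope_ci = (SLOPE - T_CRIT_18 * SE_SLOPE, SLOPE + T_CRIT_18 * SE_SLOPE)

print(f"R squared = {r_squared:.3f}")
print(f"F = {f_stat:.2f}, t for slope = {t_slope:.3f}, t squared = {t_slope ** 2:.2f}")
print(f"95% CI for slope: {slope_ci[0]:.3f} to {slope_ci[1]:.3f}")
```

With a single predictor, F is the square of the t statistic for the slope, so the two tests give the same P-value.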
95How can it be used?
- Predict mean Hb for a given age
- E.g. What is the mean Hb of a 50 year old?
- Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl
- 95% CI for the estimate: from 14.4 to 15.5 g/dl
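Prediction from the fitted line is a single evaluation of intercept + slope × age. The helper below is illustrative only. It is run twice, once with the rounded coefficients used on the slide and once with the unrounded coefficients from the regression output, to show how much the rounding matters.

```python
def predict_mean_hb(age, intercept, slope):
    """Predicted mean haemoglobin (g/dl) at a given age in years."""
    return intercept + slope * age

# Coefficients as rounded on the slide
rounded = predict_mean_hb(50, 8.2, 0.13)
# Coefficients as printed in the regression output
unrounded = predict_mean_hb(50, 8.239786, 0.134251)

print(f"rounded coefficients:   {rounded:.1f} g/dl")
print(f"unrounded coefficients: {unrounded:.2f} g/dl")
```

The unrounded prediction lies near the middle of the quoted confidence interval, which suggests the interval was computed from the unrounded fit.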
96How can it be used?
- To calculate reference ranges for the population
- E.g. What range would you expect 95% of 50 year olds to lie within? (reference range)
- Between 12.4 and 17.5 g/dl
97 95% Confidence Interval for the Mean, 95% prediction interval for individuals
98Definitions
- Predicted value
- the value predicted by the regression line
- an estimate of the mean value
- Residual
- Observed value - predicted value
99What assumptions have I made?
- The relationship is approximately linear
- The residuals have a normal distribution
100Multiple Regression
- One outcome variable with multiple predictor variables
- Residuals assumed to be normally distributed
- Predictor variables can be continuous or categorical
- No assumptions made about distribution of continuous predictor variables
101Multiple Regression
- Example. Does the value of packed cell volume improve the prediction of Hb?
- Model fitted
- Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed cell volume (%)
- R² = 83%
- Knowledge of packed cell volume improves the prediction of haemoglobin
102Summary
- Regression can be used to estimate the numerical relationship between an outcome variable and one or more predictor variables
- Correlation coefficient alone is of limited use