Transcript and Presenter's Notes

Title: John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics


1
Introductory Statistics
  • John Matthews, Professor of Medical Statistics,
    School of Mathematics and Statistics
  • Janine Gray, Senior Lecturer and Deputy Director,
    Newcastle Clinical Trials Unit

University of Newcastle-upon-Tyne
2
Course Outline
  • Data Description
  • Mean, Median, Standard Deviation
  • Graphs
  • The Normal Distribution
  • Populations and Samples
  • Confidence intervals and p-values
  • Estimation and Hypothesis testing
  • Continuous data
  • Categorical data
  • Regression and Correlation

3
Course Objectives
  • To have an understanding of the Normal
    distribution and its relationship to common
    statistical analyses
  • To have an understanding of basic statistical
    concepts such as confidence intervals and
    p-values
  • To know which analysis is appropriate for
    different types of data

4
Recommended Textbooks
  • Swinscow TDV and Campbell MJ. Statistics at
    Square One (10th edn). BMJ Books
  • Altman DG. Practical Statistics for Medical
    Research. Chapman and Hall
  • Bland M. An Introduction to Medical Statistics.
    Oxford Medical Publications
  • Campbell MJ, Machin D. Medical Statistics: A
    Commonsense Approach. Wiley

5
Other reading
  • Chinn S. Statistics for the European Respiratory
    Journal. Eur Respir J 2001; 18: 393-401
  • www.mas.ncl.ac.uk/njnsm/medfac/MDPhD/notes.htm
  • BMJ statistics notes

6
Types of Data
  • Numerical Data
  • discrete
  • number of lesions
  • number of visits to GP
  • continuous
  • height
  • lesion area

7
Types of Data
  • Categorical
  • unordered
  • Pregnant/Not pregnant
  • married/single/divorced/separated/widowed
  • ordered (ordinal)
  • minimal/moderate/severe/unbearable
  • Stage of breast cancer I II III IV

8
Exercise
  • What type are the following variables?
    a) sex
    b) diastolic blood pressure
    c) diagnosis
    d) height
    e) family size
    f) cancer stage

9
Types of Data
  • Outcome/Dependent variable
  • outcome of interest
  • e.g. survival, recovery
  • Explanatory/Independent variable
  • treatment group
  • age
  • sex

10
Histogram of Birthweight (grams) at 40 weeks GA
11
Summary Statistics
  • Location
  • Mean (average value)
  • Median (middle value)
  • Mode (most frequently occurring value)
  • Variability
  • Variance/SD
  • Range
  • Centiles

12
Birthweights (g) at 40 weeks Gestation
  • mean = 3441 g
  • median = 3428 g
  • sd = 434 g
  • min = 2050 g
  • max = 4975 g
  • range = 2925 g

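These location and variability summaries can be reproduced for any sample with a few lines of NumPy; a minimal sketch, where the `weights` array is a small hypothetical placeholder rather than the birthweight data behind the slide.

```python
import numpy as np

# Hypothetical birthweights in grams (placeholder - not the course data)
weights = np.array([3100, 3250, 3441, 3500, 3380, 3620, 2950, 4100, 3300, 3700])

print("mean  :", weights.mean())
print("median:", np.median(weights))
print("sd    :", weights.std(ddof=1))            # sample SD (divides by n - 1)
print("min   :", weights.min())
print("max   :", weights.max())
print("range :", weights.max() - weights.min())
print("2.5th and 97.5th centiles:", np.percentile(weights, [2.5, 97.5]))
```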
13
Boxplot
14
Symmetric Data
  • mean ≈ median (approx)
  • standard deviation is an appropriate summary of spread

15
Skew Data
  • median = "typical" value
  • mean affected by extreme values - larger than
    the median
  • SD fairly meaningless
  • centiles (less affected by extreme
    values/outliers)

16
Half of all doctors are below average.
  • Even if all surgeons are equally good, about half
    will have below average results, one will have
    the worst results, and the worst results will be
    a long way below average
  • Ref: BMJ 1998; 316: 1734-1736

17
Discrete Data: Principal diagnosis of patients in
Tooting Bec Hospital
18
Bar Chart
19
Summarising data - Summary
  • Choosing the appropriate summary statistics and
    graph depends upon the type of variable you have
  • Categorical (unordered/ordered)
  • Continuous (symmetric/skew)

20
The Normal Distribution
  • N(μ, σ²)
  • μ = unknown population mean - estimate using
    sample mean
  • σ = unknown population SD - estimate using sample
    SD
  • Birthweight is N(3441, 434²)

21
N(0,1) - Standard Normal Distribution (z = SD units)
68% of values lie within 1 SD
95% within 1.96 SDs
99% within 2.58 SDs
22
Birthweight (g) at 40 weeks
95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 - 4561 grams
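As a check, these ranges follow directly from mean ± z × SD with the slide's summaries (mean 3441 g, SD 434 g). A minimal sketch using scipy; note the slide rounds 2.576 up to 2.58, so the 99% limits differ by a few grams.

```python
from scipy.stats import norm

mean, sd = 3441, 434                      # birthweight summaries from the slides

for coverage in (0.95, 0.99):
    z = norm.ppf(0.5 + coverage / 2)      # 1.96 for 95%, 2.576 for 99%
    lo, hi = mean - z * sd, mean + z * sd
    print(f"{coverage:.0%} of birthweights within {z:.2f} SDs: {lo:.0f} to {hi:.0f} g")
```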
23
Further Reading
  • http//www.mas.ncl.ac.uk/njnsm/medfac/docs/intro.
    pdf
  • Altman DG, Bland JM (1996) Presentation of
    numerical data. BMJ 312, 572
  • Altman DG, Bland JM. (1995) The normal
    distribution. BMJ 310, 298.

24
Samples and Populations
  • Use samples to estimate population quantities
    (parameters) such as disease prevalence, mean
    cholesterol level etc
  • Samples are not interesting in their own right -
    only to infer information about the population
    from which they are drawn
  • Sampling Variation
  • Populations are unique - samples are not.

25
Samples and Populations
  • How much might these estimates vary from sample
    to sample?
  • Determine precision of estimates (how close/far
    away from the population?)

26
(Artificial) example
  • Have 5000 measurements of diastolic blood
    pressure from airline pilots. This accounts for
    ALL airline pilots and is the population of
    airline pilots.
  • (Artificial example - if we had the whole
    population we wouldn't need to sample!!)
  • Since we have the population, we know the true
    population characteristics. It is these we are
    trying to estimate from a sample.

27
Population distribution of diastolic BP from
Airline Pilots (in mmHg)
True mean = 78.2, true SD = 9.4
28
Example
  • Write each measurement on a piece of paper and
    put into a hat.
  • Draw 5 pieces of paper and calculate the mean of
    the BP.
  • replace and repeat 49 more times
  • End up with 50 (different) estimates of mean BP

29
Sampling Distribution
  • Each estimate of the mean will be different.
  • Treat this as a random sample of means
  • Plot a histogram of the means.
  • This is an estimate of the sampling distribution
    of the mean.
  • Can get the sampling distribution of any
    parameter in a similar way.

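The hat experiment can be imitated by simulation. A sketch, assuming (for illustration only) that the population of 5000 pilot blood pressures is roughly Normal with the quoted mean 78.2 and SD 9.4.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(78.2, 9.4, size=5000)   # stand-in for the 5000 pilot BPs

# Draw 50 samples of size 5 and keep the mean of each ("draws from the hat")
sample_means = [rng.choice(population, size=5, replace=False).mean()
                for _ in range(50)]

# A histogram of sample_means estimates the sampling distribution of the mean
print("mean of the 50 sample means:", np.mean(sample_means))
print("SD of the 50 sample means  :", np.std(sample_means, ddof=1))
print("population SD / sqrt(5)    :", population.std(ddof=1) / np.sqrt(5))
```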
30
Distribution of the mean
μ = 78.2, σ = 9.4
Histograms: the population, and the means of 50 samples
each of size N = 5, N = 10 and N = 100
31
Distribution of the Mean
  • BUT! Don't need to take multiple samples
  • Standard error of the mean
  • SE of the mean is the SD of the distribution of
    the sample mean

32
Distribution of Sample Mean
  • Distribution of the sample mean is Normal regardless
    of the distribution of the sample (unless the sample
    is small or very skew)
  • SO: Can apply Normal theory to the sample mean also

33
Distribution of Sample Mean
  • i.e. 95% of sample means lie within 1.96 SEs of
    the (unknown) true mean
  • This is the basis for a 95% confidence interval
    (CI)
  • A 95% CI is an interval which on 95% of occasions
    includes the population mean

34
Example
  • 57 measurements of FEV1 in male medical students

35
Example
  • 95% of the population lie within mean ± 1.96 × SD,
    i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38
    litres

36
Example
  • Thus for the FEV1 data, there is a 95% chance that
    the interval contains the true population mean,
    i.e. between 3.89 and 4.23 litres
  • This is the 95% confidence interval for the mean

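A minimal sketch of the calculation behind this interval, using the FEV1 summaries implied by the slides (mean 4.06 litres, SD 0.67, N = 57).

```python
import math

mean, sd, n = 4.06, 0.67, 57          # FEV1 summaries from the slides
se = sd / math.sqrt(n)                # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean FEV1: {lower:.2f} to {upper:.2f} litres")   # ~3.89 to 4.23
```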
37
Confidence Intervals
  • The confidence interval (CI) measures
    uncertainty. The 95% confidence interval is the
    range of values within which we can be 95% sure
    that the true value lies for the whole of the
    population of patients from whom the study
    patients were selected. The CI narrows as the
    number of patients on which it is based
    increases.

38
Standard Deviations and Standard Errors
  • The SE is the SD of the sampling distribution (of
    the mean, say)
  • SE = SD/√N
  • Use SE to describe the precision of estimates
    (for example Confidence intervals)
  • Use SD to describe the variability of samples,
    populations or distributions (for example
    reference ranges)

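The distinction can be made concrete with the same FEV1 summaries: the SD gives a range for individual measurements (reference range), the SE gives a range for the estimated mean (confidence interval). A minimal sketch:

```python
import math

mean, sd, n = 4.06, 0.67, 57
se = sd / math.sqrt(n)                # SE = SD / sqrt(N)

# SD describes individuals: approximate 95% reference range for single measurements
print("95% reference range:", round(mean - 1.96 * sd, 2), "to", round(mean + 1.96 * sd, 2))

# SE describes the precision of the estimate: 95% CI for the mean
print("95% CI for the mean:", round(mean - 1.96 * se, 2), "to", round(mean + 1.96 * se, 2))
```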
39
The t-distribution
  • When N is small, estimate of SD is particularly
    unreliable and the distribution of sample mean is
    not Normal
  • Distribution is more variable - longer tails
  • Shape of distribution depends upon sample size
  • This distribution is called the t-distribution

40
N = 2: t(1) compared with N(0,1); 95% of t(1) lies within ±12.7
41
N = 10: t(9) compared with N(0,1); 95% of t(9) lies within ±2.26
42
N = 30: 95% of t(29) lies within ±2.04
43
t-distribution
  • As N becomes larger, t-distribution becomes more
    similar to Normal distribution
  • Degrees of Freedom (DF) = sample size - 1
  • DF = measure of the amount of information contained
    in the data set

44
Implications
  • Confidence interval for the mean
  • Sample size < 30: use the t-distribution
  • Sample size > 30: use either the Normal or the t
    distribution
  • Note: Stats packages (generally) will
    automatically use the correct distribution for
    confidence intervals

45
Example
  • Numbers of hours of relief obtained by 7
    arthritic patients after receiving a new drug:
    2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
  • Mean = 3.33, SD = 1.03, DF = 6, t(6) = 2.45
  • 95% CI = 3.33 ± 2.45 × 1.03/√7 = 2.38 to 4.28 hours
  • Normal 95% CI = 3.33 ± 1.96 × 1.03/√7 = 2.57 to
    4.09 hours: TOO NARROW!!

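The same interval can be obtained from the raw data with scipy, which uses the t-distribution with N - 1 = 6 degrees of freedom. A minimal sketch:

```python
import numpy as np
from scipy import stats

relief = np.array([2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3])   # hours of relief, N = 7

mean = relief.mean()
se = stats.sem(relief)                                    # SD / sqrt(N)
lower, upper = stats.t.interval(0.95, df=len(relief) - 1, loc=mean, scale=se)

print(f"mean = {mean:.2f}, 95% CI = {lower:.2f} to {upper:.2f} hours")  # ~2.38 to 4.28
```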
46
Hypothesis Testing
  • Enables us to measure the strength of evidence
    supplied by the data concerning a proposition of
    interest
  • In a trial comparing two treatments there will
    ALWAYS be a difference between the estimates for
    each treatment - a real difference or random
    variation?

47
Null Hypothesis
  • Study hypothesis - hypothesis in the mind of the
    investigator (patients with diabetes have raised
    blood pressure)
  • Null hypothesis is the converse of the study
    hypothesis - aim to disprove it (patients with
    diabetes do not have raised blood pressure)
  • Hypothesis of no effect/difference

48
Two-Sample t-test
  • Two independent samples
  • Can the two samples be considered to be the same
    with respect to the variable you are measuring or
    are they different?
  • Sample means will ALWAYS be different - real
    difference or random variation?
  • ASSUMPTION: Data are normally distributed and the
    SD in each group is similar

49
Two-Sample t-test
  • 24 hour total energy expenditure (MJ/day) in
    groups of lean and obese women
  • Do the women differ in their energy expenditure?
  • Null hypothesis: energy expenditure in lean and
    obese women is the same

50
Boxplot of energy expenditure MJ/day
51
Two-sample t-test
  • Summary statistics:
                 lean    obese
    Mean          8.1     10.3
    SD            1.2      1.4
    N              13        9
  • Difference in means = 10.3 - 8.1 = 2.2
  • SE of difference = 0.57 (weighted average)

52
Two Sample t-test
  • Test statistic is 2.2/0.57 = 3.9
  • N1 + N2 - 2 DF (= 20)
  • Calculate the probability of observing a value at
    least as extreme as 3.9 if the null hypothesis is
    true
  • If the null hypothesis is true, the test
    statistic should have a t-distribution with 20 df
    (df = N1 + N2 - 2)

53
Two Sample t-test
  • 95% of values from a t-distribution with 20 DF lie
    between -2.09 and 2.09
  • Probability of observing a value as extreme or
    more extreme than 3.9 in a t-distribution with 20
    df is 0.001
  • Only a very small probability that the value of
    3.9 fits reasonably with a t-distribution with 20
    df
  • Conclude that energy expenditure is significantly
    different between lean and obese women

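Because only the group summaries are quoted on the slides, the test can be reproduced from those summaries alone; scipy's `ttest_ind_from_stats` performs the classical pooled-SD two-sample t-test. A minimal sketch:

```python
from scipy import stats

# Lean vs obese 24-hour energy expenditure (MJ/day): summaries from the slides
t, p = stats.ttest_ind_from_stats(mean1=8.1, std1=1.2, nobs1=13,
                                  mean2=10.3, std2=1.4, nobs2=9,
                                  equal_var=True)      # pooled-SD (classical) t-test

print(f"t = {t:.1f} on 20 df, two-sided P = {p:.4f}")  # |t| ~ 3.9, P ~ 0.001
```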
54
The P-value
  • The P-value is the probability of observing a
    test statistic at least as extreme as that
    observed if the null hypothesis is true

55
t distribution with 20 df
56
Confidence Interval for the difference in two
means
  • 95% CI = 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57
  • or from 1.05 to 3.41 MJ/day
  • Thus we are 95% confident that obese women use
    between 1.05 and 3.41 MJ/day more energy than
    lean women

57
Confidence Interval or P-value?
  • Confidence interval!!!
  • P-value will tell you whether or not there is a
    statistically significant difference
  • confidence interval will give information about
    the size of the difference and the strength of
    the evidence

58
Paired t-test
  • Obvious pairing between observations
  • two measurements on each subject (before-after
    study)
  • case-control pairs
  • Assumption - paired data are normally distributed
  • Example - Systolic blood pressure (SBP) measured
    in 16 middle aged men before and after a standard
    exercise. Post-exercise SBP - Pre-exercise SBP
    calculated for each man

59
Boxplot of differences
60
Paired t-test
  • Mean difference = 6.6
  • SE(Mean) = 1.5
  • t = 6.6/1.5 = 4.4
  • Compare with t(15)
  • P < 0.001
  • Conclusion - mean systolic blood pressure is
    higher after exercise than before

61
Paired t-test
  • 95% confidence interval for the mean difference
  • 6.6 ± 2.13 × 1.5 = 3.4 to 9.8

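Only the summary figures are given on the slides, so the sketch below works from those (mean difference 6.6 mmHg, SE 1.5, N = 16); with the raw paired measurements one would simply call `scipy.stats.ttest_rel`.

```python
from scipy import stats

mean_diff, se, n = 6.6, 1.5, 16          # post - pre SBP summaries from the slides
df = n - 1

t = mean_diff / se
p = 2 * stats.t.sf(abs(t), df)           # two-sided P-value
tcrit = stats.t.ppf(0.975, df)           # ~2.13 for 15 df

print(f"t = {t:.1f}, P = {p:.4f}")       # t ~ 4.4, P < 0.001
print(f"95% CI: {mean_diff - tcrit * se:.1f} to {mean_diff + tcrit * se:.1f} mmHg")  # ~3.4 to 9.8
```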
62
Categorical Variables
  • To investigate the relationship between two
    categorical variables, form a contingency table
  • Hypothesis tests
  • Chi-squared test (χ² test)
  • Fisher's exact test (small samples)
  • McNemar's test (paired data)

63
Chi-squared test
  • Used to test for associations between categorical
    variables (2 or more distinct outcomes)
  • Example - a comparison between psychotherapy and
    usual care for major depression in primary care

64
Patient Reported Recovery at 8 months
65
Patient Reported Recovery at 8 months
  • Difference between means = 30.8
  • 95% confidence interval for the difference: 17.7 to
    43.8

66
Larger tables
  • Similar methods can be applied to larger tables
    to test the association between two categorical
    variables
  • Example - Is there an association between housing
    tenure and time of delivery of baby
    (preterm/term).
  • Null hypothesis: There is no relationship
    between housing tenure and time of delivery

67
Relationship between housing tenure and time of
delivery
68
Relationship between housing tenure and time of
delivery
  • DF = (5-1) × (2-1) = 4
  • P = 0.03
  • Thus we have strong evidence of a relationship
    between housing tenure and time of delivery

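A sketch of the chi-squared test for a larger table with scipy. The cell counts below are hypothetical (the actual housing tenure data are not reproduced on the slides); the degrees of freedom, (rows - 1) × (columns - 1) = 4, match the slide.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: 5 housing tenure categories (rows) x preterm/term (columns)
table = np.array([[10, 150],
                  [ 8,  90],
                  [12,  80],
                  [15,  70],
                  [ 9,  40]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, P = {p:.3f}")   # df = (5-1) x (2-1) = 4
```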
69
Notes
  • Chi-squared test not valid if expected values are
    small (< 5)
  • Combine rows or columns to obtain a smaller table
    with larger expected values
  • Use Fisher's exact test for small tables

70
McNemar's test
  • Appropriate for use with paired or matched
    (case-control) data with a dichotomous outcome

71
Example - McNemar's test
  • Skaane compared the use of mammography and
    ultrasound in the assessment of 327 (228 palpable
    and 99 non-palpable) consecutive malignant
    tumours confirmed at histology.
  • Acta Radiologica 1999; 40: 486-490

72
McNemar's test - example
73
McNemar's test - example
  • 308/327 (94%) were picked up by mammography
    compared with 278/327 (85%) picked up by
    ultrasound
  • P < 0.001
  • Conclusion: Mammography is significantly more
    sensitive in diagnosing tumours than ultrasound
    in a population of mixed malignant tumours

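A sketch of McNemar's test with statsmodels. The 2×2 cell counts below are hypothetical but chosen to be consistent with the slide margins (mammography 308/327 detected, ultrasound 278/327 detected); only the discordant cells drive the test.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: mammography detected / not; columns: ultrasound detected / not
# (hypothetical counts consistent with 308/327 and 278/327)
table = [[270, 38],
         [  8, 11]]

result = mcnemar(table, exact=False, correction=True)   # chi-squared version
print(f"statistic = {result.statistic:.1f}, P = {result.pvalue:.4f}")
```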
74
Hypothesis testing - summary
Adapted from Chinn S. Statistics for the European
Respiratory Journal.
75
Correlation and Regression
  • Relationship between two continuous variables
  • regression
  • correlation

76
Relationship between two continuous variables
  • 3 main purposes for doing this
  • to assess whether the two variables are
    associated (correlation)
  • to enable the value of one variable to be
    predicted from any known value of the other
    variable (regression)
  • to assess the amount of agreement between two
    variables (method comparison study)

77
Example
  • Women from a pre-defined geographical area were
    invited to have their haemoglobin (Hb) level and
    packed cell volume measured. They were also
    asked their age.

78
Haemoglobin and packed cell volume
79
Example - relationships between variables
  • Association between Hb and PCV? Hb affects PCV
    or PCV affects Hb?
  • Use correlation to measure the strength of an
    association
  • Association between Hb and age? Age must affect
    Hb and not vice versa
  • Use regression to predict Hb from age

80
Correlation
  • Not interested in causation, i.e. does a high
    PCV cause a high Hb level?
  • Interested in association, i.e. is a high PCV
    associated with a high Hb level?
  • sample correlation coefficient
  • summarises strength of relationship
  • can be used to test the hypothesis that the
    population correlation coefficient is 0

81
Correlation Coefficient
  • dimensionless, from -1 to 1
  • measures the strength of a linear relationship
  • +ve - high value of one variable associated with a
    high value of the other
  • -ve - high value of one variable associated with a
    low value of the other
  • ±1 = exact linear relationship
  • strictly called Pearson correlation coefficient

82
Example Data
Scatter plots with r = -0.4, r = 1, r = 0 and r = 0.7
83
When not to use the correlation coefficient
  • If the relationship is non-linear
  • with caution in the presence of outliers
  • when the variables are measured over more than
    one distinct group (i.e. disease groups)
  • when one of the variables is fixed in advance
  • Assessing agreement

84
Correlation - example data
85
Is there an alternative?
  • If the data are non-linear or there is an outlier
  • use the Spearman rank correlation coefficient

86
Haemoglobin and Packed Cell Volume
  • Without outlier
  • Pearson = 0.67
  • Spearman = 0.63
  • With outlier
  • Pearson = 0.34
  • Spearman = 0.48

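A sketch of computing both coefficients with scipy on hypothetical Hb/PCV values (not the course data), showing how a single extreme point moves the Pearson coefficient much more than the rank-based Spearman coefficient.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Hypothetical packed cell volume (%) and haemoglobin (g/dl) values
pcv = rng.normal(40, 4, size=20)
hb = 0.3 * pcv + rng.normal(0, 1, size=20)

print("Pearson :", round(pearsonr(hb, pcv)[0], 2))
print("Spearman:", round(spearmanr(hb, pcv)[0], 2))

# Add one extreme, discordant point and recompute
hb_out = np.append(hb, 20.0)
pcv_out = np.append(pcv, 20.0)
print("With outlier - Pearson :", round(pearsonr(hb_out, pcv_out)[0], 2))
print("With outlier - Spearman:", round(spearmanr(hb_out, pcv_out)[0], 2))
```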
87
Regression
  • Assume a change in x will cause a change in y
  • predict y for a given value of x
  • usually not logical to believe y causes x
  • y is the dependent variable (vertical axis)
  • x is the independent variable (horizontal axis)

88
Example - Haemoglobin vs Age
89
Regression
  • Logical to assume that increasing age leads to
    increasing Hb
  • Not logical to assume Hb affects age!
  • Assume underlying true linear relationship
  • Make an estimate of what that true linear
    relationship is

90
Estimating a regression line
  • How do I identify the best straight line?
  • least squares estimate
  • straight line determined by slope and intercept
  • y = a + b × x
  • a and b are estimates of the true intercept and
    slope and are subject to sampling variation

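A sketch of a least-squares fit with scipy on hypothetical age/Hb data; the values are simulated around the slide's fitted line (Mean Hb = 8.2 + 0.13 × age) and are not the real course data.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)

# Hypothetical ages (years) and haemoglobin values (g/dl)
age = rng.uniform(20, 70, size=20)
hb = 8.2 + 0.13 * age + rng.normal(0, 1.2, size=20)

fit = linregress(age, hb)                 # least-squares slope and intercept
print(f"intercept a = {fit.intercept:.2f}, slope b = {fit.slope:.3f}")
print(f"R-squared = {fit.rvalue ** 2:.2f}, P = {fit.pvalue:.4g}")
print(f"predicted mean Hb at age 50: {fit.intercept + fit.slope * 50:.1f} g/dl")
```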
91
Regression line of haemoglobin on age
92
Regression of haemoglobin on age
  • Variable(s) entered on step number 1: AGE  (Age in Years)
    Multiple R           .87959
    R Square             .77367
    Adjusted R Square    .76110
    Standard Error       1.17398
  • Analysis of Variance
                  DF    Sum of Squares    Mean Square
    Regression     1          84.80397       84.80397
    Residual      18          24.80803        1.37822
    F = 61.53133,  Signif F = .0000

93
Regression of haemoglobin on age
  • Variables in the Equation
    Variable       B           SE B        95% Confidence Interval for B
    AGE            .134251     .017115     .098295  to  .170208
    (Constant)     8.239786    .794261     6.571104 to  9.908467
  • Variable       T           Sig T
    AGE            7.844       .0000
    (Constant)     10.374      .0000

94
What does this tell us?
  • Mean Hb = 8.2 + 0.13 × AGE
  • 95% CI for the slope goes from 0.098 to 0.170
  • P < 0.0001
  • Significant relationship between Hb and age
  • 77% of the variability in Hb can be accounted for
    by age

95
How can it be used?
  • Predict mean Hb for a given age
  • E.g. What is the mean Hb of a 50 year old?
  • Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl
  • 95% CI for the estimate: 14.4 to 15.5 g/dl

96
How can it be used?
  • To calculate reference ranges for the population
  • E.g. What range would you expect 95% of 50 year
    olds to lie within? (reference range)
  • Between 12.4 and 17.5 g/dl

97
95% Confidence Interval for the Mean and 95%
prediction interval for individuals
98
Definitions
  • Predicted value
  • the value predicted by the regression line
  • an estimate of the mean value
  • Residual
  • Observed value - predicted value

99
What assumptions have I made?
  • The relationship is approximately linear
  • The residuals have a normal distribution

100
Multiple Regression
  • One outcome variable with multiple predictor
    variables
  • Residuals assumed to be normally distributed
  • Predictor variables can be continuous or
    categorical
  • No assumptions made about distribution of
    continuous predictor variables

101
Multiple Regression
  • Example: Does the value of packed cell volume
    improve the prediction of Hb?
  • Model fitted:
  • Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed
    cell volume (%)
  • R² = 83%
  • Knowledge of packed cell volume improves the
    prediction of haemoglobin

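A sketch of fitting such a model with statsmodels on hypothetical data simulated around the slide's fitted equation (Mean Hb = 5.2 + 0.1 × age + 0.1 × PCV); the variable names `age`, `pcv` and `hb` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical data simulated around the slide's fitted model
df = pd.DataFrame({"age": rng.uniform(20, 70, size=30),
                   "pcv": rng.normal(40, 4, size=30)})
df["hb"] = 5.2 + 0.1 * df["age"] + 0.1 * df["pcv"] + rng.normal(0, 0.8, size=30)

model = smf.ols("hb ~ age + pcv", data=df).fit()
print(model.params)      # intercept and the two slopes
print(model.rsquared)    # proportion of the variability in Hb explained
```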
102
Summary
  • Regression can be used to estimate the numerical
    relationship between an outcome variable and one
    or more predictor variables
  • Correlation coefficient alone is of limited use