Title: Master class: Data, understanding it, interpreting it and using it
1Master class: Data, understanding it, interpreting
it and using it
- Ruth Harrell
- Liann Brookes-smith
2Agenda
- 9.30am to 10.30am
- 10.30am: break
- 10.45 to 11.30am
- 11.40 to 12.30pm
- 12.30 to 1.30pm: lunch
- 1.30 to 2.30pm: probability
- 2.30 to 2.45pm: break
- 2.45 to 3.30pm: sampling and curve
- 3.30 to 4.30pm: confidence and risk
3Introduction
- Statistics may be defined as "a body of methods
for making wise decisions in the face of
uncertainty." W.A. Wallis
- "There are three kinds of lies: lies, damned
lies, and statistics." Disraeli (according to
Mark Twain)
- "98% of all statistics are made up." Author
unknown
- "Statistics are like bikinis. What they reveal is
suggestive, but what they conceal is vital."
Aaron Levenstein
- "If you cannot measure it, it does not exist."
Author unknown
4Question to the Room
- What are statistics?
- Why are data important?
- What do you feel about stats?
- What do they tell us?
- E.g. 40% of children in XX area have dental
caries: what does that tell us?
- List types of data you are aware of or use in
your day-to-day work
5Practitioner competencies
- Obtain, verify, analyse and interpret data and/or
information to improve the health and wellbeing
outcomes of a population / community / group,
demonstrating:
- a. knowledge of the importance of accurate and
reliable data / information and the anomalies
that might occur
- b. knowledge of the main terms and concepts used
in epidemiology and the routinely used methods
for analysing quantitative and qualitative data
- c. ability to make valid interpretations of the
data and/or information and communicate these
clearly to a variety of audiences
6Aim for the day
- The aim of the day is to improve people's
understanding of the data they use, and how to
analyse and interpret it.
- This session concentrates on the data rather
than things such as study design, but we are
happy to discuss and answer questions on both:
you can't understand what the data is telling you
without understanding how it has been collected
and the potential for bias.
7Topics covered
- Types of data
- Basic probability and stats
- Understanding how data is collected
- Measures of odds and ratios: comparing
populations and study results
- Population sampling: good samples and bad
samples
- Understanding confidence intervals and p values:
is the result reliable?
- How I apply data to what I am doing
8 9Describing the data
- We have a responsibility to present data in a way
that can be easily understood, and which does not
misrepresent the true meaning of the data.
- Key decisions are made based on the data, or
more accurately people's impression of the data,
so this has an impact on use of resources and
eventually on patient care.
- Accurate analysis and presentation of the data
saves lives!
10Quantitative vs. Qualitative
- Quantitative data measures quantity, i.e. it is
numerical.
- Qualitative data is usually more descriptive and
not measured in numbers.
- However, data originally obtained as qualitative
information about individual items may give rise
to quantitative data if they are summarised by
means of counts.
11Discrete - Continuous
- Discrete data can only take certain particular
values.
- Continuous data falls on a scale.
- For example, height is continuous, but the number
of siblings is discrete.
12Nominal - Ordinal
- Nominal comes from the Latin nomen, meaning
'name', and is used to describe categorical data.
There is no quantitative relationship between the
different categories (though sometimes a number
may be assigned for ease of analysis). An example
is ethnicity.
- Ordinal data again describes categories, but there
is some order to them, though the relationship
between them may not be well defined. An example
is Agenda for Change pay scales: they are ordered
and can therefore be put in sequence (but there
is no numerical relationship between them).
13Transforming the data
- Sometimes the data you have isn't in the most
effective form for display.
- E.g. you have data on weight in kilos.
- Having a list of continuous weights is not
intuitive, therefore you convert this to BMI
(which also needs height) and then to categories,
i.e. those who are underweight, healthy weight,
obese and morbidly obese.
- Continuous to ordinal.
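As a rough illustration of the continuous-to-ordinal step, the sketch below turns weight and height into a BMI and then into an ordered category. The cut-off values and the extra "overweight" band are assumptions (commonly quoted adult thresholds), not figures taken from the slides, and the people listed are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a continuous BMI onto ordered categories.

    The cut-offs are an assumption (commonly quoted adult thresholds),
    not values given in the slides.
    """
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "healthy weight"
    if value < 30:
        return "overweight"
    if value < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical people: (weight in kg, height in m)
people = [(50, 1.75), (70, 1.75), (95, 1.75), (130, 1.75)]
categories = [bmi_category(bmi(w, h)) for w, h in people]
print(categories)
```

Once only the category is stored, the original weight cannot be recovered, which is the loss of detail that comes with this kind of transformation.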
14Transforming the data (2)
- With this you can display more meaningful data
- BUT
- You lose the detail, such as the number at the
edge of each category (borderline). You can't
transform it back.
- What you transform it to may not be the best use
of data.
- You can also transform data using more complex
calculations, such as taking the log of each
number; this will sometimes convert skewed data
to normally distributed data (discussed later).
15Exercise
16Displaying the data
- What are the options?
- Tables: simple descriptive, cross tab (mention
pivot table)
- Graphs: bar, line, x-y or scatter, pie chart
17Basic statistics and probability
- Having looked at the raw data and carried out any
transformations you felt necessary, you now want
to describe the features of this data.
- Distributions: plotting the data is the first
step in this. You need to consider the shape of
the graph before you know how best to analyse the
data.
18Types of graph
19Types of graph
20Types of graph
21Types of graph
22 23Data measures
- Definitions
- Range: the difference between the highest and the
lowest values in a set
- Mean: the sum of the measured values divided by
the number of measures
- Median: the middle measure
- Mode: the measure found most often
- Interquartile range: a measure of statistical
dispersion, equal to the difference between the
upper and lower quartiles
- Standard deviation: a measure of how spread out
the numbers are
24Mean, median and mode
- Mean = (sum of observations) / (number of
observations)
- Mode = the most common observation
- Median = the number where 50% of observations are
below and 50% are above
25Standard Deviation and IQR
- Std Dev = square root of [sum of (squared
difference between each observation and the
mean) / (number of observations - 1)]
- IQR = the difference between the value at the 25th
percentile and the value at the 75th percentile
26Formulas
- Sample mean
- $\bar{x} = \left( \sum x_i \right) / n$
- Sample standard deviation
- $s = \sqrt{ \sum (x_i - \bar{x})^2 / (n - 1) }$
- $x_i$ is each observation
- $n$ is the number of observations
- $\sum$ means sum
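The measures defined on the last few slides can be checked on a small data set. The sketch below uses Python's standard `statistics` module on made-up observations; the data are hypothetical, and the quartile convention is the module's default, which is one of several in use.

```python
import statistics

# Hypothetical observations, e.g. GP visits per patient in a year
data = [2, 3, 3, 4, 5, 7, 11]

mean = statistics.mean(data)          # sum of observations / number of observations
median = statistics.median(data)      # middle value once the data are sorted
mode = statistics.mode(data)          # most common observation
value_range = max(data) - min(data)   # highest value minus lowest value
std_dev = statistics.stdev(data)      # sample standard deviation (divides by n - 1)

# quantiles(n=4) returns the 25th, 50th and 75th percentiles
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range

print(mean, median, mode, value_range, round(std_dev, 2), iqr)
```

Here the mean (5) is pulled above the median (4) by the single large value of 11, which is the kind of skew worth spotting before choosing a summary measure.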
27 28 29How reliable is my data?
- Any data missing?
- How old is it?
- What is the denominator?
- Who collected it?
- How was it collected?
- Ways to avoid making statements about inaccurate
data?
30Describing data
31Interpret the graph
- This graph shows the trend of obesity in adults
from 1993 to 2007.
- Percentage of what (all adults presumed; all
registered? all resident?) What age is defined as
an adult?
- Is the increase due to chance or an actual
increase?
- Data is quantitative/continuous
32Bias
- When looking at data, sometimes the relationship
we see is caused by the way in which we are
measuring, not by what is actually there.
33Fudging
- Rate or number?
- You have 50 cases of COPD in area 1, and 150
cases of COPD in area 2. Should you do something
in area 2?
- Area 1 has a population of 2000
- Area 2 has a population of 5000
- In area 1 the rate in 50-74 year olds is 20/1000
- In area 2 the rate in 50-74 year olds is 42/1000
- Area 1's data was from 2004
- Area 2's data was from 2005-2009
- Area 1 is 20/1000, confidence interval (12-48 per
1000)
- Area 2 is 42/1000, confidence interval (18-56
per 1000)
- Now what?
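To make the "rate or number" point concrete, the sketch below turns the case counts and populations on this slide into crude rates per 1000, and checks whether the two quoted confidence intervals overlap. The function names are illustrative only, and an overlap check is a quick visual aid rather than a formal test.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Crude rate: cases per 1000 people in the population."""
    return 1000 * cases / population


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if two (lower, upper) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Figures from the slide
area1_rate = rate_per_1000(50, 2000)    # 25 per 1000
area2_rate = rate_per_1000(150, 5000)   # 30 per 1000

# Confidence intervals quoted for the 50-74 year old rates (per 1000)
area1_ci = (12, 48)
area2_ci = (18, 56)
overlap = intervals_overlap(area1_ci, area2_ci)

print(area1_rate, area2_rate, overlap)
```

Area 2 has three times as many cases, but its crude rate is only slightly higher, and the wide, overlapping intervals leave open that the two areas do not really differ.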
34Exercise
- Exercise 5
- What do these data tell you? Key message?
- What would you ask of these data? What further
information would you want to know?
35Basics of probability
- Probability is a way of quantifying the
judgements that we make all the time, from "do I
need an umbrella?" to "shall I bet on that
horse?"
- Probability is measured on a linear scale of 0 to
1, where 0 is impossible and 1 is absolutely
certain.
36Probability
- Why is probability relevant to public health?
- Probability gives us a quantitative measurement
of the chances of something happening, and there
are 2 key ways in which it is used in public
health:
- It is another word for risk (or, if it has a
positive impact, benefit). For example, the
probability that someone who smokes cigarettes
will get lung cancer has been shown to be much
higher than for someone who doesn't smoke.
- It helps us to answer the question "how likely is
it that the observed effect is due to our
intervention, not just to chance?", and is used in
all types of studies: testing medical
treatments, evaluating the impact of public
health interventions, assessing the need of one
population compared to another.
37Probability and risk
- Probability (risk) = number of events divided by
the number of opportunities; odds = number of
events divided by the number of non-events
- Risk in exposed = number of events divided by the
number of exposed
- Risk in unexposed = number of events divided by
the number of unexposed
- Relative risk (or risk ratio) is the ratio of the
probability of the event occurring in the exposed
group to that in the non-exposed group
- Absolute risk difference is the difference in
risk between the exposed and unexposed.
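A small worked example may help keep these measures apart. The sketch below uses invented counts (30 events among 100 exposed, 10 among 100 unexposed); only the formulas come from the definitions above, and the odds ratio is included for comparison.

```python
# Hypothetical counts, not data from the slides
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

# Risk = events / number of people in the group
risk_exposed = exposed_events / exposed_total
risk_unexposed = unexposed_events / unexposed_total

# Relative risk (risk ratio) and absolute risk difference
relative_risk = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed

# Odds = events / non-events, and the ratio of the two odds
odds_exposed = exposed_events / (exposed_total - exposed_events)
odds_unexposed = unexposed_events / (unexposed_total - unexposed_events)
odds_ratio = odds_exposed / odds_unexposed

print(round(relative_risk, 2), round(risk_difference, 2), round(odds_ratio, 2))
```

With these counts the risk ratio is 3 while the odds ratio is nearer 3.9: the two only agree closely when the event is rare.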
38Probability cont
- What is the probability of a 6 if you throw an
unbiased die?
- What is the probability of a total of 6 if you
throw two unbiased dice?
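One way to check answers to questions like these is to list every equally likely outcome and count the ones of interest. The sketch below does that by brute force; it assumes fair six-sided dice.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

# One die: one favourable face out of six
p_six_one_die = Fraction(sum(1 for f in faces if f == 6), len(faces))

# Two dice: 36 equally likely ordered pairs
two_dice = list(product(faces, repeat=2))
favourable = [(a, b) for a, b in two_dice if a + b == 6]
p_total_six = Fraction(len(favourable), len(two_dice))

print(p_six_one_die)   # 1/6
print(favourable)      # the pairs that add up to 6
print(p_total_six)     # 5/36
```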
39Welcome back!!
- I'm not an outlier; I just haven't found my
distribution yet.
40Exercise
- Exercise 6
- Worse and early death 0-3/10
- No change 4-5 /10
- Cure 2-6/10
41Population sampling (1)
- In the real world we don't usually get data from
everybody that we are interested in. Why not?
- Cost and resources may be too large
- People may choose to opt in or out
- May have incomplete data (data entry problems
etc.)
42Population sampling (2)
- So what we need to do is measure a sample of
people and infer from that sample what the
population looks like. We can do this by tweaking
the statistical formula used, but there are two
things to consider:
- If your sample size is too low you are unlikely
to get a reasonable result. You can still use
the formula, but you need to bear this in mind
when interpreting it.
- Think about who you have managed to sample: are
they representative of the population? (Imagine
walking into a large open-plan office with a set
of scales and asking people if they would mind
being weighed: who is more likely to volunteer?)
43Population sampling (3)
- If we have a REPRESENTATIVE sample, we can apply
a statistical tweak to help us to estimate the
figure for the population.
- If we don't (if the sample is biased), though we
can carry out the maths, it will always be flawed.
44Population sampling (4)
- Principle
- Measure your sample
- Calculate the mean and standard deviation (of the
sample)
- Calculate the standard error = standard deviation
of the sample / sqrt(n)
- To estimate the population mean, we say our best
guess is that the population mean is equal to the
sample mean
- Then we can use the standard error to estimate
how close we think our estimate is.
- First we need to talk about confidence intervals
45Which one is an insult?
- Darling, you are two standard deviations below
the mean
- Of course you're normal (mean 10, mode 7)
- You are mean
- Your looks are in the 80th percentile
- The difference between you and her is a standard
deviation
47Probability, Population Sampling and the Normal
Curve
- Thinking about our data that fitted the normal
curve
- By using the mathematical model we can easily
calculate probabilities.
- The maths tells us that:
- The total area under the normal curve is equal to
1.
- The probability that any new observation will
fall within one standard deviation of the mean is
68%
- The probability that any new observation will
fall within two standard deviations of the mean
is 95%
- The probability that any new observation will
fall within three standard deviations of the mean
is 99.7%
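The 68 / 95 / 99.7 figures can be reproduced from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k: float) -> float:
    """Probability that an observation lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 1))
```

The exact two-standard-deviation figure is a little over 95%; the multiplier that gives exactly 95% is about 1.96, which is why that number appears in confidence interval formulas.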
48Examples
49CERN experiments observe particle consistent with
long-sought Higgs boson. Geneva, 4 July 2012. "We
observe in our data clear signs of a new
particle, at the level of 5 sigma, in the mass
region around 126 GeV. The outstanding
performance of the LHC and ATLAS and the huge
efforts of many people have brought us to this
exciting stage," said ATLAS experiment
spokesperson Fabiola Gianotti, "but a little more
time is needed to prepare these results for
publication."
At five-sigma there is only one chance in nearly
two million that the result is wrong, i.e. that
the measurement seen is a random fluctuation.
50Confidence intervals (1)
- If we measure one individual's IQ we can be 95%
sure that it would fall between 70 and 130
- This interval is called the 95% confidence
interval.
- We use 95% by convention; sometimes other figures
are used, such as 98%.
- If we measure the heights of a class of children
and we have a mean of 1.2m, standard deviation of
0.1m, what is your estimate for the height of a
child randomly selected from the sample?
- 1.2 +/- 0.2, i.e. 95% of this sample lies between
1.0 and 1.4m
51Confidence intervals (2)
- Reminder: the heights of a class of children have
a mean of 1.2m, standard deviation of 0.1m
- We measure a new child and their height is 1.5m.
What does this mean?
- This is equal to the mean + 3 standard deviations.
This means we had less than a 0.5% chance that we
would have this height in a child in this
population. That doesn't mean they are not part
of the distribution (0.5% is not that rare), but
you might be sensible to check a few things to be
sure they are part of the same population (age!).
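The height example can be worked through numerically. The sketch below assumes the class heights follow a normal distribution with the stated mean of 1.2m and standard deviation of 0.1m, and asks how unusual a height of 1.5m would be.

```python
from statistics import NormalDist

# Assumption: heights are normally distributed with the slide's mean and SD
heights = NormalDist(mu=1.2, sigma=0.1)

new_child = 1.5
z = (new_child - heights.mean) / heights.stdev   # distance from the mean in SDs
p_this_tall_or_taller = 1 - heights.cdf(new_child)

# Range covering roughly 95% of children: mean +/- 2 standard deviations
low = heights.mean - 2 * heights.stdev
high = heights.mean + 2 * heights.stdev

print(round(z, 1), round(100 * p_this_tall_or_taller, 2), round(low, 1), round(high, 1))
```

Under this assumption fewer than 0.5% of children would be 1.5m or taller, consistent with the "less than 0.5%" statement.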
52Confidence intervals (3)
- This time we are using confidence intervals to
estimate our true population characteristics
based on a sample.
- Best estimate of the mean = measured mean of
sample
- Standard error of the mean = std deviation of
sample / sqrt(number of measurements in the
sample)
- Therefore we can say that we are 95% confident
that the mean of the population lies between the
sample mean +/- 2 x standard error
- This implies that:
- Our estimate of the mean gets better as n
increases, because our error gets smaller.
- This is the way we usually use confidence
intervals in public health, as we usually measure
a sample and infer the population.
- Examples: Health Survey for England, household
surveys, etc.
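The recipe above (sample mean, standard error, mean +/- 2 standard errors) can be sketched in a few lines. The sample of heights is invented, and the function name and the multiplier of 2 are illustrative choices that follow the slide's rule of thumb.

```python
import math
import statistics


def mean_confidence_interval(values, multiplier=2.0):
    """Return (mean, lower, upper) using mean +/- multiplier x standard error."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - multiplier * standard_error, mean + multiplier * standard_error


# Hypothetical sample of children's heights in metres
sample = [1.10, 1.15, 1.18, 1.20, 1.22, 1.25, 1.30, 1.20, 1.12, 1.28]

mean, lower, upper = mean_confidence_interval(sample)
print(round(mean, 3), round(lower, 3), round(upper, 3))

# Same spread of values but twice as many of them: the interval narrows
mean2, lower2, upper2 = mean_confidence_interval(sample * 2)
print(round(upper - lower, 3), round(upper2 - lower2, 3))
```

The second call shows the point about n: with more measurements the standard error shrinks and the interval around the mean tightens.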
53- You are a significant part of my life
- P value = 9
54I would never treat you differently to your
sisters
- Sister 1 CI 4-9
- Sister 2 CI 5-11
- Sister 3 CI 4-13
- ME CI 2-3
55Comparing two samples
- The important question is: is there a difference
between two populations?
- This question might be asked in slightly
different ways for different types of study, but
is fundamentally the same
- For an RCT you compare the control group with the
intervention group
- For a cohort you compare the outcomes in those
exposed to a risk factor with those not exposed
- For a case-control you look at the group with the
disease and compare their risk factors to those
without the disease
- You might look at before and after an
intervention was put in place
- You might compare one city or country to another
56Comparing two samples (2)
The important question is: is there a difference
between two populations?
57Comparing two samples (3)
- We can calculate the difference between the two
populations as:
- Mean difference = mean of pop 1 - mean of pop 2
- Confidence interval = mean difference +/- 1.96 x SE
- SE (standard error) is a combination of the
standard errors for each sample ($s_1$ and $s_2$
are the sample standard deviations, $n_1$ and
$n_2$ the sample sizes)
- $SE = \sqrt{ (s_1^2 / n_1) + (s_2^2 / n_2) }$
- (SE can be slightly different for different
situations but this gives you an idea)
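A minimal sketch of this calculation is shown below, using the standard error formula on this slide. The two sets of summary figures are invented for illustration, and the function name is hypothetical.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Mean difference with a 95% confidence interval (difference +/- z x SE)."""
    difference = mean1 - mean2
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return difference, difference - z * standard_error, difference + z * standard_error


# Hypothetical summary data for two samples
diff, lower, upper = difference_ci(mean1=5.0, sd1=2.0, n1=100,
                                   mean2=4.0, sd2=2.0, n2=100)
print(round(diff, 2), round(lower, 2), round(upper, 2))
```

With these made-up numbers the interval runs from about 0.45 to 1.55 and does not include 0, so the data are not consistent with "no difference" at the 95% level.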
58T tests
- Testing using a t test
- You need to know the mean and standard deviation
of both of your samples.
- You start with a hypothesis: this is that there
is no difference between the two samples (or
populations)
- You then do some maths
- t = (mean of sample 1 - mean of sample 2) / SE
- where $SE = \sqrt{ (s_1^2 / n_1) + (s_2^2 / n_2) }$,
with $s_1$, $s_2$ the standard deviations and
$n_1$, $n_2$ the sizes of the two samples
59T tests (2)
- So what does t mean?
- t = the horizontal axis of a normal distribution
with mean = 0 and standard deviation = 1 (a good
approximation for large samples)
- You can read from a table of t values the
probability of seeing a difference this large if
the two samples came from the same population
- Most important value:
- If t > 1.96 then that probability is < 0.05 (5%).
This suggests that the two samples are
fundamentally different
- By convention, we reject the null hypothesis if
p < 0.05
- It's good practice to quote the p value, e.g.
p = 0.01
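The link between t and p can be sketched as follows. This uses the normal curve as a large-sample approximation, in line with the 1.96 cut-off quoted on the slide; a full t test would use the t distribution with the appropriate degrees of freedom. The summary figures and function name are hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, two-sided p) using a large-sample normal approximation."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical summary data for two samples
t, p = two_sample_test(mean1=5.0, sd1=2.0, n1=100, mean2=4.0, sd2=2.0, n2=100)
print(round(t, 2), round(p, 4))

# The conventional threshold: a t of 1.96 corresponds to p of about 0.05
p_at_threshold = 2 * (1 - NormalDist().cdf(1.96))
print(round(p_at_threshold, 3))
```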
60T tests (3)
- What do these results mean?
- Mean difference = 0, with 95% confidence interval
(-1.0, 1.0), p = 0.50
- Mean difference = 0.5, with 95% confidence
interval (0.1, 0.9), p = 0.049
- Mean difference = 1, with 95% confidence interval
(-0.1, 1.1), p = 0.055
- Mean difference = 1, with 95% confidence interval
(0.2, 1.8), p = 0.02
61Risk differences
- Same principle: null hypothesis is that there is
no difference
- For no difference, the 95% confidence interval
would include 0
- If it does not include 0, then you can be 95%
confident that there is a risk difference.
- You can also quote a p value
- Example: the risk difference for having a heart
attack in the placebo group compared to the
intervention group was 2%, with a 95% confidence
interval of (1.5% to 2.4%), p = 0.02
- Would you take the intervention?
62Risk differences (2)
- You can also calculate the number needed to treat
from this
- NNT is the number of people you need to treat to
prevent one event from occurring
- Example: the risk difference for having a heart
attack in the placebo group compared to the
intervention group was 2%, with a 95% confidence
interval of (1.5% to 2.4%), p = 0.02
- If you treat 100 people you avoid 2 heart
attacks.
- NNT = 50
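The NNT is simply 1 divided by the absolute risk difference expressed as a proportion. The sketch below applies that to the example on this slide and, as an extra step not shown on the slide, to the two ends of the quoted confidence interval.

```python
def number_needed_to_treat(risk_difference: float) -> float:
    """NNT = 1 / absolute risk difference (as a proportion, e.g. 0.02 for 2%)."""
    if risk_difference <= 0:
        raise ValueError("risk difference must be positive to give an NNT")
    return 1 / risk_difference


nnt = number_needed_to_treat(0.02)          # 2% risk difference

# Ends of the quoted 95% confidence interval (1.5% to 2.4%)
nnt_if_larger_effect = number_needed_to_treat(0.024)
nnt_if_smaller_effect = number_needed_to_treat(0.015)

print(round(nnt), round(nnt_if_larger_effect), round(nnt_if_smaller_effect))
```

So the best estimate is 50 people treated per heart attack avoided, with a plausible range of roughly 42 to 67.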
63Risk ratio
- A relative measure of risk, very commonly used
- Same principle: null hypothesis is that there is
no difference IN THE RATIO OF RISKS
- For no difference, the 95% confidence interval
would include 1
- Why 1 this time?
- Because if both had the same risk, the ratio
would be 1
- If it does not include 1, then you can be 95%
confident that there is a difference in risk.
- You can also quote a p value
64Odds ratio
- A relative measure of risk, very commonly used
- Very similar to risk ratio
- Used for certain types of study, and the result
of some calculations
- For no difference, the 95% confidence interval
would include 1
- If it does not include 1, then you can be 95%
confident that there is a difference.
- You can also quote a p value
65Examples
- Meta-analysis of the 5 prospective cohort studies
(86,092 patients) indicated that individuals with
periodontal disease had a 1.14 times higher risk
of developing CHD than the controls (relative
risk 1.14, 95% CI 1.074-1.213, P < .001)
- The risk of VTE was 2.33 for obesity (95% CI,
1.68 to 3.24), 1.51 for hypertension (95% CI,
1.23 to 1.85), 1.42 for diabetes mellitus (95%
CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95
to 1.46), and 1.16 for hypercholesterolemia (95%
CI, 0.67 to 2.02).
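Reading results like these comes down to checking whether each 95% confidence interval includes 1. The sketch below applies that rule to the VTE figures quoted above; the function name is made up for illustration.

```python
# Relative risks of VTE quoted above: (estimate, lower 95% CI, upper 95% CI)
vte_risk = {
    "obesity": (2.33, 1.68, 3.24),
    "hypertension": (1.51, 1.23, 1.85),
    "diabetes mellitus": (1.42, 1.12, 1.77),
    "smoking": (1.18, 0.95, 1.46),
    "hypercholesterolemia": (1.16, 0.67, 2.02),
}


def excludes_no_effect(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True if the interval does not contain the 'no difference' value."""
    return not (lower <= null_value <= upper)


significant = [name for name, (_, lower, upper) in vte_risk.items()
               if excludes_no_effect(lower, upper)]
not_significant = [name for name in vte_risk if name not in significant]

print(significant)
print(not_significant)
```

For a ratio measure the "no difference" value is 1; for a risk difference or mean difference the same check would use `null_value=0`.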
66In summary
- Your boss says:
- "Do we need a weight loss service for kids in XXX
area?"
- You collect data: definition of kids, is this
data accurate, how was it collected, what year?
- Compare the areas: are you much different? Is
there an underlying reason?
- Is this value statistically significant?
67In summary (2)
- You look at a service elsewhere (from evidence)
- You ask yourself: who was included in this
sample, and are they different to my population?
- Looking at the odds: what proportion of kids will
this work on?
- Look to see if the test group were biased compared
to the control group
- Were the results normally distributed, skewed or
other?
68In summary (3)
- Were the results significantly different between
the two groups?
- Can you rely on these findings?
- You have just found the need
- Evaluated its accuracy
- Reviewed a solution
- Looked at effectiveness
- WELL DONE!!!
69Useful websites
- Basic maths and probability
- http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm
- Tutorials on statistics
- http://www.stattrek.com/tutorials/statistics-tutorial.aspx