Master class: Data, understanding it, interpreting it and using it.

Transcript and Presenter's Notes

Title: Master class Data, understanding it, interpreting it and using it.


1
Master class: Data, understanding it, interpreting
it and using it.
  • Ruth Harrell
  • Liann Brookes-smith

2
Agenda
  • 9.30am - 10.30am
  • 10.30am break
  • 10.45 - 11.30am
  • 11.40 - 12.30pm
  • 12.30 - 1.30pm lunch
  • 1.30 - 2.30pm probability
  • 2.30 - 2.45pm break
  • 2.45 - 3.30pm sampling and curve
  • 3.30 - 4.30pm confidence and risk

3
Introduction
  • Statistics may be defined as "a body of methods
    for making wise decisions in the face of
    uncertainty." W.A. Wallis
  • "There are three kinds of lies: lies, damned
    lies, and statistics." Disraeli (according to
    Mark Twain)
  • "98% of all statistics are made up." Author
    unknown
  • "Statistics are like bikinis. What they reveal is
    suggestive, but what they conceal is vital."
    Aaron Levenstein
  • "If you cannot measure it, it does not exist."
    Author unknown

4
Question to the Room
  • What are statistics?
  • Why are data important?
  • What do you feel about stats?
  • What do they tell us?
  • E.g. 40% of children in XX area have dental
    caries. What does that tell us?
  • List the types of data you are aware of or use in
    your day-to-day work

5
Practitioner competencies
  • Obtain, verify, analyse and interpret data and/or
    information to improve the health and wellbeing
    outcomes of a population / community / group
    demonstrating:
  • a. knowledge of the importance of accurate and
    reliable data / information and the anomalies
    that might occur
  • b. knowledge of the main terms and concepts used
    in epidemiology and the routinely used methods
    for analysing quantitative and qualitative data
  • c. ability to make valid interpretations of the
    data and/or information and communicate these
    clearly to a variety of audiences

6
Aim for the day
  • The aim of the day is to improve people's
    understanding of the data they use, and of how to
    analyse and interpret it.
  • This session concentrates on the data rather
    than on things such as study design, but we are
    happy to discuss and answer questions on both:
    you can't understand what the data is telling you
    without understanding how it has been collected
    and the potential for bias.

7
Topics covered
  1. Types of data
  2. Basic probability and stats
  3. Understanding how data is collected
  4. Measures of odds and ratios - comparing
    populations and study results.
  5. Population sampling - Good samples and bad
    samples
  6. Understanding confidence intervals and p values -
    is the result reliable?
  7. How I apply data to what I am doing

8
  • Types of data

9
Describing the data
  • We have a responsibility to present data in a way
    that can be easily understood, and which does not
    misrepresent the true meaning of the data.
  • Key decisions are made based on the data (or,
    more accurately, on people's impression of the
    data), so this has an impact on the use of
    resources and eventually on patient care.
  • Accurate analysis and presentation of the data
    saves lives!

10

Quantitative vs. Qualitative
  • Quantitative data measures quantity, i.e. it is
    numerical.
  • Qualitative data is usually more descriptive and
    not measured in numbers.
  • However, data originally obtained as qualitative
    information about individual items may give rise
    to quantitative data if they are summarised by
    means of counts

11
Discrete - Continuous
  • Discrete data can only take certain particular
    values.
  • Continuous data falls on a scale.
  • For example height is continuous, but the number
    of siblings is discrete.

12
Nominal - Ordinal
  • Nominal comes from the Latin nomen, meaning
    'name', and is used to describe categorical data.
    There is no quantitative relationship between the
    different categories (though sometimes a number
    may be assigned for ease of analysis). An example
    is ethnicity.
  • Ordinal data again describes categories, but there
    is some order to them, though the relationship
    between them may not be well defined. An example
    is Agenda for Change pay scales: they are ordered
    and can therefore be put in sequence, but there is
    no fixed numerical relationship between them.

13
Transforming the data
  • Sometimes the data you have isn't in the most
    effective form for display.
  • E.g. you have data on weight in kilos.
  • A list of continuous weights is not intuitive,
    so you combine it with height to calculate BMI
    and then group people into categories, i.e.
    underweight, healthy weight, overweight, obese
    and morbidly obese.
  • Continuous to ordinal.
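
The continuous-to-ordinal step can be sketched in Python. The function names, the example person and the cut-offs (the commonly used adult BMI thresholds of 18.5, 25, 30 and 40) are assumptions for illustration and do not come from the slides.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a continuous BMI value onto an ordered (ordinal) category.

    The cut-offs are commonly used adult thresholds (an assumption here).
    """
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "healthy weight"
    if value < 30:
        return "overweight"
    if value < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = bmi(70, 1.75)
example_category = bmi_category(example_bmi)
```

Two people with BMIs of 24.9 and 25.1 end up in different categories even though their measurements are almost identical, which is the kind of detail that grouping hides.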

14
Transforming the data (2)
  • With this you can display more meaningful data
  • BUT
  • You lose detail, such as the number of people on
    the edge of each category (borderline cases). You
    can't transform it back.
  • What you transform it to may not be the best use
    of the data.
  • You can also transform data using calculations
    such as taking the log of each number; this
    will sometimes convert skewed data to normally
    distributed data (discussed later)
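
A small illustration of the log transform, using an invented data set and a hand-written skewness function (both are assumptions, not part of the slides). Skewness near zero indicates a roughly symmetric shape.

```python
import math


def skewness(values):
    """Population skewness: mean cubed deviation divided by the SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: each value is double the previous one.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)        # clearly positive: long right tail
logged_skew = skewness(logged)  # approximately zero: symmetric after the log
```

Real data will rarely become perfectly symmetric; this data set was chosen so that the effect is easy to see.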

15
Exercise
  • Exercise 1 and 2

16
Displaying the data
  • What are the options?
  • Tables: simple descriptive, cross tab (mention
    pivot table)
  • Graphs: bar, line, x-y or scatter, pie chart.

17
Basic statistics and probability
  • Having looked at the raw data and carried out any
    transformations you felt necessary, you now want
    to describe the features of this data.
  • Distributions: plotting the data is the first
    step in this. You need to consider the shape of
    the graph before you know how best to analyse the
    data.

18
Types of graph
  • Normal

19
Types of graph
  • Skewed

20
Types of graph
  • Bimodal

21
Types of graph
  • Uniform

22
  • 15 minute
  • Break!

23
Data measures
  • Definitions
  • Range: the difference between the highest and the
    lowest values in a set
  • Mean: the sum of all measured values divided by
    the number of measures
  • Median: the middle measure when the values are
    put in order
  • Mode: the measure found most often
  • Interquartile range: a measure of statistical
    dispersion, equal to the difference between
    the upper and lower quartiles
  • Standard deviation: a measure of how spread
    out the numbers are around the mean

24
Mean, median and mode
  • Mean = (sum of observations) / (number of
    observations)
  • Mode = the most common observation
  • Median = the number where 50% of observations are
    below and 50% are above

25
Standard Deviation and IQR
  • Std Dev = square root of [ sum of (difference
    squared between each observation and the mean) /
    (number of observations - 1) ]
  • IQR = the difference between the value at the 25th
    percentile and the value at the 75th percentile

26
Formulas
  • Sample mean
  • x̄ = ( Σ xi ) / n
  • Sample standard deviation
  • s = sqrt( Σ ( xi - x̄ )² / ( n - 1 ) )
  • xi is each observation
  • n is the number of observations
  • Σ means sum
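
The measures defined on the last few slides can be tried out with Python's standard `statistics` module. The observations below are invented for illustration.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)   # highest minus lowest
mean = statistics.mean(data)         # (sum of xi) / n
median = statistics.median(data)     # middle value of the sorted data
mode = statistics.mode(data)         # most common value
std_dev = statistics.stdev(data)     # sample SD, divides by (n - 1)

# Quartiles: software packages use slightly different conventions,
# so the IQR can differ a little from one tool to another.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```

Here the mean (5) sits above the median (4.5) and the mode (4), a first hint that the data has a longer tail on the right.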

27
  • Exercise 3

28
  • Exercise 4

29
How reliable is my data?
  • Any data missing?
  • How old is it?
  • What is the denominator?
  • Who collected it?
  • How was it collected?
  • Ways to avoid making statements about inaccurate
    data?

30
Describing data
31
Interpret the graph
  • This graph shows the trend in adult obesity
    from 1993 to 2007
  • Percentage of what? (All adults presumed, but all
    registered or all resident?) What age is defined
    as an adult?
  • Is the increase due to chance or is it an actual
    increase?
  • Data is quantitative/continuous

32
Bias
  • When looking at data, sometimes the relationship
    we see is caused by the way in which we are
    measuring, not by what is actually there.

33
Fudging
  • Rate or number?
  • You have 50 cases of COPD in area 1, and 150
    cases of COPD in area 2. Should you do something
    in area 2?
  • Area 1 has a population of 2000
  • Area 2 has a population of 5000
  • In area 1 the rate in 50-74 year olds is 20/1000
  • In area 2 the rate in 50-74 year olds is 42/1000
  • Area 1's data was from 2004
  • Area 2's data was from 2005-2009
  • Area 1 is 20/1000, confidence interval (12-48 per
    1000)
  • Area 2 is 42/1000, confidence interval (18-56
    per 1000)
  • Now what?
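
One way to work through the numbers on this slide is sketched below. The helper names are hypothetical; the case counts, populations and intervals are the ones quoted above.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Crude rate: cases per 1000 people in the population."""
    return cases * 1000 / population


def intervals_overlap(a, b) -> bool:
    """True if two (lower, upper) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


area1_rate = rate_per_1000(50, 2000)    # whole-population crude rate
area2_rate = rate_per_1000(150, 5000)

# Confidence intervals quoted for the 50-74 age group (per 1000).
ci_overlap = intervals_overlap((12, 48), (18, 56))
```

Area 2 has three times as many cases but a crude rate of 30 per 1000 against 25 per 1000, and the two quoted intervals overlap, so the raw counts alone do not settle the question. Overlapping intervals are only a rough guide, not a formal test.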

34
Exercise
  • Exercise 5
  • What do these data tell you? Key message?
  • What would you ask of these data? What further
    information would you want to know?

35
Basics of probability
  • Probability is a way of quantifying the
    judgements that we make all the time, from "do I
    need an umbrella?" to "shall I bet on that
    horse?"
  • Probability is measured on a linear scale of 0 to
    1, where 0 is impossible and 1 is absolutely
    certain.

36
Probability
  • Why is probability relevant to public health?
  • Probability gives us a quantitative measurement
    of the chances of something happening, and there
    are 2 key ways in which it is used in public
    health:
  • It is another word for risk (or, if it has a
    positive impact, benefit). For example, the
    probability that someone who smokes cigarettes
    will get lung cancer has been shown to be much
    higher than for someone who doesn't smoke.
  • It helps us to answer the question "how likely is
    it that the observed effect is due to our
    intervention, not just to chance?", and is used in
    all types of studies: testing medical
    treatments, evaluating the impact of public
    health interventions, assessing the need of one
    population compared to another.

37
Probability and risk
  • Odds = number of events divided by the number of
    non-events
  • Risk in exposed = number of events divided by the
    number of exposed
  • Risk in unexposed = number of events divided by
    the number of unexposed
  • Relative risk (or risk ratio) is the ratio of the
    probability of the event occurring in the exposed
    group to that in the unexposed group
  • Absolute risk difference is the difference in risk
    between the exposed and unexposed.
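
A worked example of these measures, using an invented cohort (the counts are hypothetical). Fractions are used so the results stay exact. Odds are taken here as events divided by non-events, which is the usual definition.

```python
from fractions import Fraction

# Hypothetical cohort: counts invented for illustration.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

# Risk: events divided by everyone in the group.
risk_exposed = Fraction(exposed_events, exposed_total)
risk_unexposed = Fraction(unexposed_events, unexposed_total)

relative_risk = risk_exposed / risk_unexposed      # ratio of the two risks
risk_difference = risk_exposed - risk_unexposed    # absolute difference

# Odds: events divided by non-events.
odds_exposed = Fraction(exposed_events, exposed_total - exposed_events)
odds_unexposed = Fraction(unexposed_events, unexposed_total - unexposed_events)
odds_ratio = odds_exposed / odds_unexposed
```

In this example the relative risk is 3 while the odds ratio is 27/7 (about 3.9). The two measures only agree closely when the event is rare.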

38
Probability cont
  • What is the probability of a 6 if you throw an
    unbiased dice?
  • What is the probability of a total of 6 if you
    throw two unbiased dice?
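
Both questions can be answered by listing the equally likely outcomes and counting the favourable ones. A short sketch using only the Python standard library:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # the six faces of an unbiased die

# One die: one favourable face out of six.
p_single_six = Fraction(1, len(faces))

# Two dice: list all ordered pairs and count those that total 6.
outcomes = list(product(faces, repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 6]
p_total_six = Fraction(len(favourable), len(outcomes))
```

The favourable pairs are (1, 5), (2, 4), (3, 3), (4, 2) and (5, 1), giving 5 out of 36.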

39
Welcome back!!
  • I'm not an outlier I just haven't found my
    distribution yet.

40
Exercise
  • Exercise 6
  • Worse and early death 0-3/10
  • No change 4-5 /10
  • Cure 2-6/10

41
Population sampling (1)
  • In the real world we don't usually get data from
    everybody that we are interested in. Why not?
  • The cost and resources required may be too large
  • People may choose to opt in or out
  • May have incomplete data (data entry problems
    etc)

42
Population sampling (2)
  • So what we need to do is measure a sample of
    people and infer from that sample what the
    population looks like. We can do this by tweaking
    the statistical formula used, but there are two
    things to consider:
  • If your sample size is too low you are unlikely
    to get a reasonable result. You can still use
    the formula but you need to bear this in mind
    when interpreting it
  • Think about who you have managed to sample: are
    they representative of the population? (Imagine
    walking in to a large open plan office with a set
    of scales and asking people if they would mind
    being weighed. Who is more likely to volunteer?)

43
Population sampling (3)
  • If we have a REPRESENTATIVE sample, we can apply
    a statistical tweak to help us to estimate the
    figure for the population.
  • If we don't (if the sample is biased), we can
    still carry out the maths, but the result will
    always be flawed.

44
Population sampling (4)
  • Principle
  • Measure your sample
  • Calculate the mean and standard deviation (of the
    sample)
  • Calculate the standard error = standard deviation
    of the sample / √n
  • To estimate the population mean, we say our best
    guess is that it is equal to the sample
    mean
  • Then we can use the standard error to estimate
    how close we think our estimate is.
  • First we need to talk about confidence intervals

45
Which one is an insult?
  • Darling, you are two standard deviations below
    the mean
  • Of course you're normal (mean 10, mode 7)
  • You are mean
  • Your looks are in the 80th percentile
  • The difference between you and her is a standard
    deviation

47
Probability, Population Sampling and the Normal
Curve
  • Thinking about our data that fitted the normal
    curve:
  • By using the mathematical model we can easily
    calculate probabilities.
  • The maths tells us that:
  • The total area under the normal curve is equal to
    1.
  • The probability that any new observation will
    fall within one standard deviation of the mean is
    68%
  • The probability that any new observation will
    fall within two standard deviations of the mean
    is 95%
  • The probability that any new observation will
    fall within three standard deviations of the mean
    is 99.7%
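
These three figures can be checked with `statistics.NormalDist` from the Python standard library. The helper name `prob_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k: float) -> float:
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within = {k: prob_within(k) for k in (1, 2, 3)}
```

The values come out at roughly 0.683, 0.954 and 0.997. Exactly 95% corresponds to about 1.96 standard deviations, which is why 1.96 appears in confidence interval formulas and 2 is used as a convenient approximation.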

48
Examples
49
CERN experiments observe particle consistent with
long-sought Higgs boson. Geneva, 4 July 2012. "We
observe in our data clear signs of a new
particle, at the level of 5 sigma, in the mass
region around 126 GeV. The outstanding
performance of the LHC and ATLAS and the huge
efforts of many people have brought us to this
exciting stage," said ATLAS experiment
spokesperson Fabiola Gianotti, "but a little more
time is needed to prepare these results for
publication."
At five sigma there is only one chance in nearly
two million that the result is wrong, i.e. that the
measurement seen is a random fluctuation.
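
The "one chance in nearly two million" figure can be checked against the normal curve, assuming it refers to the two-sided tail beyond 5 standard deviations (an assumption; the text does not say which tail is meant). Strictly, it is the chance of a random fluctuation at least this large when there is no real effect.

```python
from statistics import NormalDist

# Probability of landing more than 5 standard deviations from the mean,
# in either direction, under a standard normal distribution.
two_sided_tail = 2 * (1 - NormalDist().cdf(5))

# Express the same probability as "one chance in N".
one_in = 1 / two_sided_tail
```

Under this assumption `one_in` comes out at roughly 1.7 million.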
50
Confidence intervals (1)
  • If we measure one individual's IQ we can be 95%
    sure that it would fall between 70 and 130 (a
    mean of 100 ± 2 standard deviations of 15)
  • This interval is called the 95% confidence
    interval.
  • We use 95% by convention; sometimes other figures
    are used, such as 98%.
  • If we measure the heights of a class of children
    and we have a mean of 1.2m, standard deviation of
    0.1, what is your estimate for the height of a
    child randomly selected from the sample?
  • 1.2 ± 0.2, i.e. 95% of this sample lies between
    1.0 and 1.4m

51
Confidence intervals (2)
  • Reminder: the heights of a class of children have
    a mean of 1.2m, standard deviation of 0.1
  • We measure a new child and their height is 1.5m.
    What does this mean?
  • This is equal to the mean + 3 standard deviations.
    This means we had less than a 0.5% chance that we
    would have this height in a child in this
    population. That doesn't mean they are not part
    of the distribution (0.5% is not that rare) but
    you might be sensible to check a few things to be
    sure they are part of the same population (age!).
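
The arithmetic behind this slide, sketched with `statistics.NormalDist` and assuming the heights follow a normal distribution with the quoted mean and standard deviation:

```python
from statistics import NormalDist

# Mean and standard deviation quoted for the class (in metres).
class_heights = NormalDist(mu=1.2, sigma=0.1)
new_child = 1.5

# How many standard deviations above the mean is the new child?
z = (new_child - class_heights.mean) / class_heights.stdev

# Chance of a child from this distribution being at least this tall.
p_at_least_this_tall = 1 - class_heights.cdf(new_child)
```

The child is 3 standard deviations above the mean, and the chance of a height at least this large is about 0.1%, which is consistent with "less than 0.5%".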

52
Confidence intervals (3)
  • This time we are using confidence intervals to
    estimate our true population characteristics
    based on a sample.
  • Best estimate of the mean = measured mean of
    sample
  • Standard error of the mean = std deviation of
    sample / √(number of measurements in the sample)
  • Therefore we can say that we are 95% confident
    that the mean of the population lies between the
    sample mean ± 2 × standard error
  • This implies that:
  • Our estimate of the mean gets better as n
    increases because our error gets smaller.
  • This is the way we usually use confidence
    intervals in public health, as we usually measure
    a sample and infer the population.
  • Examples: Health Survey for England, household
    surveys, etc.
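
A minimal sketch of this calculation, assuming the standard error is the sample standard deviation divided by the square root of n (the usual formula). The mean and standard deviation are taken from the heights example; the sample sizes of 25 and 100 are hypothetical.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sd / math.sqrt(n)


def approx_95_ci(mean: float, sd: float, n: int):
    """Approximate 95% confidence interval: mean plus or minus 2 standard errors."""
    se = standard_error(sd, n)
    return (mean - 2 * se, mean + 2 * se)


# Same sample mean and SD, two hypothetical sample sizes.
ci_n25 = approx_95_ci(1.2, 0.1, 25)    # about (1.16, 1.24)
ci_n100 = approx_95_ci(1.2, 0.1, 100)  # about (1.18, 1.22)
```

Quadrupling the sample size halves the width of the interval, which is what "our error gets smaller as n increases" means in practice.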

53
  • You are a significant part of my life
  • P value 9

54
I would never treat you differently to your
sisters
  • Sister 1 CI 4-9
  • Sister 2 CI 5-11
  • Sister 3 CI 4-13
  • ME CI 2-3

55
Comparing two samples
  • The important question is: is there a difference
    between two populations?
  • This question might be asked in slightly
    different ways for different types of study, but
    is fundamentally the same:
  • For an RCT you compare control group with the
    intervention group
  • For a cohort you compare the outcomes in those
    exposed to a risk factor compared to those not
    exposed
  • For a case-control you look at the group with the
    disease and compare their risk factors to those
    without the disease
  • You might look at before and after an
    intervention was put in place
  • You might compare one city or country to another

56
Comparing two samples (2)
The important question is: is there a difference
between two populations?
57
Comparing two samples (3)
  • We can calculate the difference between the two
    populations as:
  • Mean difference = mean of pop 1 - mean of pop 2
  • Confidence interval = mean difference ± 1.96 × SE
  • SE (standard error) is a combination of the
    standard errors for each sample (s1 and s2 are
    the standard deviations of the two samples)
  • SE = sqrt( (s1² / n1) + (s2² / n2) )
  • (SE can be slightly different for different
    situations but this gives you an idea)
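
The calculation on this slide can be sketched as follows, reading the SE formula as the square root of (s1² / n1 + s2² / n2). The function name and the summary statistics are hypothetical.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Mean difference and its confidence interval (95% when z = 1.96)."""
    diff = mean1 - mean2
    # Combined standard error of the difference between two sample means.
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical samples: means 10 and 9, both with SD 2 and 50 observations.
diff, (lower, upper) = difference_ci(10.0, 2.0, 50, 9.0, 2.0, 50)
```

Here the standard error is 0.4, so the interval runs from about 0.22 to 1.78. It does not include 0, which suggests a real difference between the two groups.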

58
T tests
  • Testing using a t test
  • You need to know the mean and standard deviation
    of both of your samples.
  • You start with a hypothesis (the null hypothesis):
    that there is no difference between the two
    samples (or populations)
  • You then do some maths
  • t = (mean of sample 1 - mean of sample 2) / SE
  • where SE = sqrt( (standard dev of sample 1)² / n1
    + (standard dev of sample 2)² / n2 )

59
T tests (2)
  • So what does t mean?
  • t is the position on the horizontal axis of a
    distribution which, for large samples, is close to
    a normal distribution with mean = 0 and standard
    deviation = 1
  • You can read the probability of seeing a
    difference this large, if the two samples came
    from the same population, from a table of t values
  • Most important value:
  • if |t| > 1.96 then this probability is < 0.05
    (5%), which suggests that the two samples are
    fundamentally different
  • By convention, we reject the null hypothesis if
    p < 0.05
  • It's good practice to quote the p value, e.g.
    p = 0.01
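
A sketch of the test using only the standard library. It uses the normal approximation for the p value, which is reasonable for large samples; for small samples a table of t values (or a statistics package) should be used instead. The function names and summary statistics are hypothetical.

```python
import math
from statistics import NormalDist


def t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in sample means divided by its standard error."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


def approx_two_sided_p(t):
    """Two-sided p value using the normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical samples: means 10 and 9, both with SD 2 and 50 observations.
t = t_statistic(10.0, 2.0, 50, 9.0, 2.0, 50)
p = approx_two_sided_p(t)
reject_null = p < 0.05
```

With these numbers t is 2.5 and p is a little over 0.01, so by the usual convention the null hypothesis of no difference would be rejected.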

60
T tests (3)
  • What do these results mean?
  • Mean difference = 0, with 95% confidence interval
    (-1.0, 1.0), p = 0.50
  • Mean difference = 0.5, with 95% confidence
    interval (0.1, 0.9), p = 0.049
  • Mean difference = 1, with 95% confidence interval
    (-0.1, 1.1), p = 0.055
  • Mean difference = 1, with 95% confidence interval
    (0.2, 1.8), p = 0.02

61
Risk differences
  • Same principle: the null hypothesis is that there
    is no difference
  • For no difference, the 95% confidence interval
    would include 0
  • If it does not include 0, then you can be 95%
    confident that there is a risk difference.
  • You can also quote a p value
  • Example: the risk difference for having a heart
    attack in the placebo group compared to the
    intervention group was 2% with a 95% confidence
    interval of (1.5% to 2.4%), p = 0.02
  • Would you take the intervention?

62
Risk differences (2)
  • You can also calculate the number needed to treat
    (NNT) from this
  • NNT is the number of people you need to treat to
    prevent one event from occurring
  • Example: the risk difference for having a heart
    attack in the placebo group compared to the
    intervention group was 2% with a 95% confidence
    interval of (1.5% to 2.4%), p = 0.02
  • If you treat 100 people you avoid 2 heart
    attacks.
  • NNT = 100 / 2 = 50
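
The NNT is one divided by the absolute risk difference. A short sketch, reading the risk difference in the example as 2 per 100 (0.02) with an interval of 0.015 to 0.024; the function name is hypothetical.

```python
def number_needed_to_treat(risk_difference: float) -> float:
    """NNT: one divided by the absolute risk difference (as a proportion)."""
    return 1 / risk_difference


nnt = number_needed_to_treat(0.02)

# Applying the same formula to the ends of the confidence interval
# gives a plausible range for the NNT.
nnt_if_effect_is_larger = number_needed_to_treat(0.024)   # about 42
nnt_if_effect_is_smaller = number_needed_to_treat(0.015)  # about 67
```

So the estimate of 50 comes with a range of roughly 42 to 67 people treated for each heart attack avoided.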

63
Risk ratio
  • A relative measure of risk, very commonly used
  • Same principle: the null hypothesis is that there
    is no difference IN THE RATIO OF RISKS
  • For no difference, the 95% confidence interval
    would include 1
  • Why 1 this time?
  • Because if both had the same risk, the ratio
    would be 1
  • If it does not include 1, then you can be 95%
    confident that the risks differ.
  • You can also quote a p value

64
Odds ratio
  • A relative measure of risk, very commonly used
  • Very similar to risk ratio
  • Used for certain types of study, and as the result
    of some calculations
  • For no difference, the 95% confidence interval
    would include 1
  • If it does not include 1, then you can be 95%
    confident that there is a difference.
  • You can also quote a p value

65
Examples
  • Meta-analysis of the 5 prospective cohort studies
    (86,092 patients) indicated that individuals with
    periodontal disease had a 1.14 times higher risk
    of developing CHD than the controls (relative
    risk 1.14, 95% CI 1.074-1.213, P < .001)
  • the risk of VTE was 2.33 for obesity (95% CI,
    1.68 to 3.24), 1.51 for hypertension (95% CI,
    1.23 to 1.85), 1.42 for diabetes mellitus (95%
    CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95
    to 1.46), and 1.16 for hypercholesterolemia (95%
    CI, 0.67 to 2.02).
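
The rule "does the 95% confidence interval include 1?" can be applied to the VTE figures above. The figures are treated here as ratio measures for which 1 means no difference (an assumption about how they were calculated); the helper name is hypothetical.

```python
# (risk factor, ratio estimate, lower 95% limit, upper 95% limit)
vte_results = [
    ("obesity", 2.33, 1.68, 3.24),
    ("hypertension", 1.51, 1.23, 1.85),
    ("diabetes mellitus", 1.42, 1.12, 1.77),
    ("smoking", 1.18, 0.95, 1.46),
    ("hypercholesterolemia", 1.16, 0.67, 2.02),
]


def excludes_one(lower: float, upper: float) -> bool:
    """True if the interval lies entirely above or entirely below 1."""
    return lower > 1 or upper < 1


significant = [name for name, _, lo, hi in vte_results if excludes_one(lo, hi)]
not_significant = [name for name, _, lo, hi in vte_results if not excludes_one(lo, hi)]
```

On this reading, obesity, hypertension and diabetes mellitus show a difference at the 95% level, while the intervals for smoking and hypercholesterolemia include 1.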

66
In summary
  • Your boss says:
  • "Do we need a weight loss service for kids in XXX
    area?"
  • You collect data: what is the definition of kids,
    is the data accurate, how was it collected, and in
    what year?
  • Compare the areas: is yours much different, and is
    there an underlying reason?
  • Is this value statistically significant?

67
In summary (2)
  • You look at a service elsewhere (from evidence)
  • You ask yourself: who was included in this
    sample, and are they different to my population?
  • Looking at the odds, what proportion of kids will
    this work on?
  • Look to see if the test group was biased compared
    to the control group
  • Were the results normally distributed, skewed or
    other?

68
In summary (3)
  • Was the difference between the two groups
    significant?
  • Can you rely on these findings?
  • You have just found the need.
  • Evaluated its accuracy
  • Reviewed a solution
  • Looked at effectiveness
  • WELL DONE!!!

69
Useful websites
  • Basic maths and probability
  • http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm
  • Tutorials on statistics
  • http://www.stattrek.com/tutorials/statistics-tutorial.aspx