Data types and presentation of data - PowerPoint PPT Presentation

Provided by: jillmo9

Transcript and Presenter's Notes

1
Data types and presentation of data
  • Dr Gordon Prescott
  • Senior Lecturer in Medical Statistics

2
Why are you here?
  • Who wants to be a journalist, novelist or author?

Yet you have invested time learning to
communicate, speak, write, spell, organise facts,
write essays, and make an argument to support
your point of view.
Who wants to be a mathematician?
Yet you use numbers, make estimations, make
judgements about time, money and risk every
day. What can you afford to spend on
accommodation or food? Does it make sense for
you to buy a bus pass? How many times a week can
you go out? Is your plan of action sensible or
too risky?
3
Why are you here?
  • Who wants to be a good scientist/researcher?

You must invest time and learn to apply intrinsic
numerical skills in the correct way in order to
organise numerical information so that you can
communicate, speak and write about it, to read
and write scientific reports, to make arguments
using numerical evidence to support and justify a
scientific conclusion.
4
Skills common to statistics and medical and
scientific research
Communication
Analysis
These skills will be useful well beyond this
course
General skills
Team working
Reporting
Problem solving
5
Statistics in health
medicine - public health - the health service -
the pharmaceutical industry
Medical statistics
Medicine and health are of crucial importance to
society. Huge resources are devoted to these areas.
  • Promoting good health
  • Investigating different methods of treatment
  • Caring for the ill
  • Understanding disease
  • Managing the health service
Question: How many times more likely is it that a
man or woman in the UK will die from heart
disease than a man or woman of the same age in
Japan?
Answer: 7
6
Statistics in health
Allan Hackshaw, St Bartholomew's and the Royal
London School of Medicine and Dentistry
  • MSc in Biometry, now a lecturer in epidemiology
    and medical statistics.
  • His work involves
  • assessing the health effects of passive smoking
  • investigating methods of detecting breast and
    ovarian cancer
  • studying chromosomal abnormalities in pregnancy
  • teaching medical screening and epidemiology to
    undergraduates and postgraduates

"My work in medical statistics has always
involved collaborating with experts in medicine
and it is important to be able to communicate
effectively. Statistics are used to clarify
research issues and quantify their effects. The
results can influence local, national and
international policy."
7
Statistics in health
pharmaceutical statistics
The pharmaceutical industry is one of the UK's
biggest economic successes. It employs large
numbers of scientists, including many
statisticians. Statisticians are responsible for
designing and analysing experiments to assess the
effects of drugs and to check for any
side-effects. They can also be involved in all
areas of drug development, from chemical
discovery through clinical development to
post-marketing safety surveillance. There are
strict legal requirements relating to any drug
development, and the role of the statistician is
central.
8
Administration
  • £5 charge for handbooks to cover printing
  • Computer practicals in Computer room 3
  • 9 - 10 HS and PHR
  • 10 - 11 Clinical pharmacology
    non-graduating students
  • 11 - 12 Nutrition
  • Go to 2nd floor on main stairs and take half
    flight of stairs towards back (North) of
    building. Computer room 3 is at the back of the
    café area.

9
Administration
  • Open door policy for this course,
  • Wednesday 2 - 4 pm, Room 1.026
  • Student statistics clinics for other statistical
    queries not related to this course, e.g. projects
  • Five times a week 1 - 2 pm Monday-Friday
  • Aug/Sept only twice a week
  • Assessment
  • data analysis assignment (40%) November
  • exam (60%) January

10
To get the most out of this course
  • Read the relevant handbook section prior to the
    lecture
  • Consult one of the recommended textbooks
  • If you do not understand something
  • Please come and see me
  • Open door policy
  • Make an appointment
  • Ask tutors during the SPSS practicals or tutorial
    sessions

11
Web resources
  • Public Health > Teaching > Applied Statistics
    (PU5005)
  • All PowerPoint presentations used in lectures
  • Worked examples for SPSS practicals
  • Any other correspondence regarding this course

12
Criteria for assessment
Students' marks will reflect the following
criteria
13
Very Important
NO EXTENSIONS unless a medical certificate is
presented
14
Texts
  • Statistics at a Glance - Petrie and Sabin
  • Excellent if little knowledge of statistics,
    short topics
  • Medical Statistics: A Common Sense Approach -
    Campbell
  • Essential Medical Statistics - Kirkwood
  • More detailed text, covers more than this course
  • An Introduction to Medical Statistics - Bland
  • Have a look and choose the one that you prefer.

15
Previous exposure to applied statistics
16
Outline
  • Introduction to data types
  • Preparation of data for analysis
  • Descriptive statistics
  • Summarising data
  • location
  • spread
  • Graphical presentation of data

17
Qualitative data
  • Dichotomous (binary): the variable has only 2
    possible categories (mutually exclusive)
  • E.g. Result: Success/Failure; Gender: Male/Female
  • Nominal: the variable has more than 2 categories,
    mutually exclusive and unordered
  • E.g. Blood group: A, B, AB, O; Marital status:
    Married, Divorced/Widowed, Single
  • Ordinal: the variable has more than 2 categories,
    mutually exclusive and ordered
  • E.g. Disease stage: Mild, Moderate, Severe;
    Satisfaction: Very satisfied, Satisfied,
    Unsatisfied
18
Quantitative data
  • Discrete: the variable often represents counts
    (integer values)
  • E.g. Number of visits to a GP in a year; Number
    of children; Number of times admitted to hospital
    in the last 5 years
  • Continuous: the variable can take any value
    within a range of values
  • E.g. Height in cm; Weight in kg; Distance from
    home to work in km
19
Ordinal v discrete
  • Ordinal
  • Stage of breast cancer I II III IV
  • Discrete
  • Number of children 0 1 2 3 4 5
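
The ordinal versus discrete distinction can be sketched in code. The observations below are invented for illustration, and the approach (ranking categories by their position in a list) is only one possible representation: ordering is meaningful for the cancer stages, while arithmetic is meaningful only for the counts.

```python
from statistics import mean

# Ordinal: categories with an order but no meaningful distances between them.
STAGES = ["I", "II", "III", "IV"]  # order taken from the slide
stage_rank = {stage: rank for rank, stage in enumerate(STAGES)}

# Hypothetical observations, invented purely for illustration.
observed_stages = ["II", "I", "IV", "II", "III"]
children_counts = [0, 2, 1, 3, 2]

# Ordering is valid for ordinal data, so a middle (median) category exists.
sorted_stages = sorted(observed_stages, key=stage_rank.get)
median_stage = sorted_stages[len(sorted_stages) // 2]

# Arithmetic is valid for discrete counts, so a mean can be reported.
mean_children = mean(children_counts)

print("Median stage:", median_stage)
print("Mean number of children:", mean_children)
```

An "average stage" is deliberately not computed: the gap between stages I and II need not equal the gap between III and IV.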

20
Importance of data type
  • The type of data is critically important in
    determining which methods of analysis will be
    appropriate and valid
  • The major distinction is between continuous and
    categorical variables

21
Class exercise
  • Handbook, page 7/8
  • 10 minutes
  • Decide how a variable is to be measured
  • Complete data type
  • Think about any other areas of health that should
    be considered

22
(No Transcript)
23
Data checking
  • Aim is to identify and rectify errors in the
    data
  • These errors may be due to
  • data mistyped when entered onto computer
  • confusion over correct units of measurement
  • digits may be transposed
  • If the original data is correct the errors can be
    rectified
  • If the original data is incorrect and the value
    is implausible, then a missing value code should
    be used

24
Implausible values
  • Categorical data
  • All possible categories (codes) are specified,
    any values outwith this range are implausible
  • Continuous data
  • Not as clear cut as to what values are
    implausible
  • Set reasonable ranges within which values should
    lie
  • Suspicious values should be checked and any
    errors found should be corrected
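
A minimal sketch of these checks, assuming a hypothetical dataset with a coded categorical variable and a height in cm. The valid codes and the plausible range are assumptions chosen for the example, not values from the course.

```python
# Hypothetical plausibility rules; codes and ranges are assumptions.
VALID_SEX_CODES = {1, 2}           # e.g. 1 = male, 2 = female
HEIGHT_RANGE_CM = (120.0, 220.0)   # reasonable range for adult height

# Invented records for illustration only.
records = [
    {"id": 1, "sex": 1, "height_cm": 172.0},
    {"id": 2, "sex": 3, "height_cm": 165.0},  # code 3 is not defined
    {"id": 3, "sex": 2, "height_cm": 1.68},   # possibly metres, not cm
]


def is_suspicious(record):
    """Flag a record whose category code or height is implausible."""
    low, high = HEIGHT_RANGE_CM
    bad_code = record["sex"] not in VALID_SEX_CODES
    bad_height = not (low <= record["height_cm"] <= high)
    return bad_code or bad_height


suspicious_ids = [r["id"] for r in records if is_suspicious(r)]
print("Check against the original data:", suspicious_ids)
```

Flagged records are only candidates for checking; whether they are corrected or set to missing depends on the original data.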

25
Outliers
  • May find outlying values that seem incompatible
    with the rest of the data
  • Suspicious values should be carefully checked
  • Outliers can have a considerable influence on the
    results of statistical analysis
  • The decision to include or exclude outliers
    should be made with caution
  • Often useful to analyse data with and without the
    outlier
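
Analysing with and without an outlier can be sketched as follows. The weights are invented, and treating the largest value as the outlier is an assumption made for the example.

```python
from statistics import mean, median

# Hypothetical weights in kg; 160 is the suspected outlier.
weights = [61, 64, 66, 68, 70, 72, 75, 160]
without_outlier = [w for w in weights if w != 160]

mean_with, median_with = mean(weights), median(weights)
mean_without, median_without = mean(without_outlier), median(without_outlier)

print(f"With outlier:    mean {mean_with}, median {median_with}")
print(f"Without outlier: mean {mean_without}, median {median_without}")
```

In this example the mean moves by more than 10 kg while the median barely changes, which is one reason to report both analyses.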

26
Missing values
  • Missing values are usually given a code such as
    -9, 9, 99 or 999
  • Must remember to declare these as missing before
    starting to analyse the data
  • Consider the reason why the data are missing
  • random
  • non-random
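
The effect of forgetting to declare missing value codes can be sketched as below. The data and the set of codes are hypothetical; a code must be a value that cannot occur as a real measurement for that variable.

```python
from statistics import mean

# Assumed missing value codes for this hypothetical variable
# (age of working adults, so 99 and 999 cannot be real values).
MISSING_CODES = {-9, 99, 999}

raw_ages = [34, 99, 51, -9, 28, 999, 45]

# Declare the codes as missing before analysing.
ages = [None if value in MISSING_CODES else value for value in raw_ages]
valid_ages = [value for value in ages if value is not None]

print("Mean with codes left in:", round(mean(raw_ages), 1))
print("Mean after declaring missing:", mean(valid_ages))
print("Number missing:", ages.count(None))
```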

27
Continuous data
  • The shape of the frequency distribution should be
    assessed for continuous data

28
Descriptive statistics
  • After the collected data have been checked, the
    next step is to describe the data.
  • Describe the subjects from which data has been
    collected
  • What are the characteristics of those subjects?
  • Summarise the data in a simple and informative
    manner

29
Example: Lifestyle survey
  • Characteristics of participants
  • age, sex, height, weight, employment, health
    status
  • Food
  • Smoking
  • Alcohol
  • Health Checks

30
Summaries of qualitative data
  • Analysis of qualitative data often leads to counts
    of occurrences
  • in different categories
  • together with the corresponding percentages
  • For a single qualitative variable
  • use a table or a graphical presentation
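
A frequency table for a single qualitative variable can be sketched with the standard library. The blood groups below are invented for illustration.

```python
from collections import Counter

# Hypothetical blood groups, invented for illustration.
blood_groups = ["O", "A", "A", "B", "O", "AB", "O", "A", "O", "B"]

counts = Counter(blood_groups)
total = len(blood_groups)
percentages = {group: 100 * n / total for group, n in counts.items()}

# Print the table with the most frequent category first.
for group, n in counts.most_common():
    print(f"{group:<3} {n:>3} {percentages[group]:>6.1f}%")
```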

31
Frequencies: nominal
32
Frequencies: ordinal
33
Graphical presentation: pie chart
34
Bar chart
35
  • Summary of two qualitative
  • variables together
  • Crosstabulation
  • Clustered bar chart
  • counts
  • percentages
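
A crosstabulation with counts and within-group percentages can be sketched as below. The variable names and observations are hypothetical, not taken from the lifestyle survey.

```python
from collections import Counter

# Hypothetical (sex, smoker) pairs, invented for illustration.
observations = [
    ("male", "yes"), ("male", "no"), ("male", "no"), ("male", "no"),
    ("female", "yes"), ("female", "no"), ("female", "no"),
    ("female", "no"), ("female", "no"),
]

cell_counts = Counter(observations)
group_totals = Counter(sex for sex, _ in observations)

# Percentage of each smoking category within each sex.
percent_within_group = {
    (sex, smoker): 100 * count / group_totals[sex]
    for (sex, smoker), count in cell_counts.items()
}

for (sex, smoker), count in sorted(cell_counts.items()):
    pct = percent_within_group[(sex, smoker)]
    print(f"{sex:<7} {smoker:<4} n={count}  {pct:.1f}%")
```

When the groups differ in size, the within-group percentages are usually easier to compare than the raw counts.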

36
Two variables: crosstabulation
37
Bar chart: counts in two groups
38
Bar chart: percentages within two groups
39
Frequency or percent?
males n = 119, females n = 158
40
Presentation of qualitative data
41
Continuous data
  • For continuous data it is important to check the
    shape of the frequency distribution. To do this
    a histogram should be plotted
  • The shape of the distribution is important in
    determining the most appropriate summary measures
    to indicate the location and spread of the data
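
The counting behind a histogram can be sketched in a few lines. The ages and the bin width of 10 years are assumptions for the example; in practice the plot would come from a statistics package.

```python
from collections import Counter

# Hypothetical ages in years, invented for illustration only.
ages = [23, 27, 31, 35, 36, 42, 44, 45, 47, 58, 63]
BIN_WIDTH = 10  # assumed bin width

# Map each value to the lower edge of its bin and count values per bin.
bin_counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# Text histogram: one '#' per observation in each bin.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    upper = lower + BIN_WIDTH - 1
    print(f"{lower:>3}-{upper:<3} {'#' * bin_counts[lower]}")
```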

42
Histogram
43
Shapes of frequency distributions
  • Symmetrical
  • Skewed
  • Positively (tail to right)
  • Negatively (tail to left)
  • Most distributions encountered in medical work
    are symmetrical or positively skewed

44
Distributions
Symmetrical (Normal)
Skewed
Normal
Negatively skewed
Positively skewed
45
Measures of central tendency
  • Mean - arithmetic average (μ, mu)
  • sum all observations and divide by the number of
    observations
  • Median - central value of the distribution
  • rank observations and the median is the
    observation below which 50% of all values fall
  • Mode - value which occurs most often
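
The three measures can be computed with Python's statistics module. The data are invented and deliberately positively skewed, so the mean is pulled above the median.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed data (e.g. number of GP visits in a year).
visits = [2, 3, 3, 4, 5, 7, 18]

visits_mean = mean(visits)      # sum of observations / number of observations
visits_median = median(visits)  # middle value of the ranked observations
visits_mode = mode(visits)      # value that occurs most often

print("Mean:", visits_mean)
print("Median:", visits_median)
print("Mode:", visits_mode)
```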

46
Location of mean and median
[Figure: mean < median (negatively skewed), mean =
median (symmetrical), median < mean (positively
skewed)]
47
Mean or median?
  • When the frequency distribution of the dataset
    is
  • symmetrical - the mean is commonly the most
    appropriate measure of location
  • skewed - the median is commonly the most
    appropriate measure of location

48
Measures of variation
  • Range
  • largest measurement minus smallest measurement
  • Sample variance (σ², sigma squared)
  • sum of squared distances of each observation from
    the mean, divided by n - 1
  • Sample standard deviation (σ, sigma)
  • the square root of the sample variance
  • Coefficient of variation
  • the standard deviation is expressed as a
    percentage of the mean
  • Interquartile range
  • the difference between the 25th percentile and
    75th percentile of the dataset
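
The range, sample variance, standard deviation and interquartile range can be sketched as below, using invented data. Quartiles are computed with the default method of statistics.quantiles; other software may use a different percentile convention and give slightly different values.

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical measurements, invented for illustration.
values = [2, 4, 4, 5, 7, 8, 12]

value_range = max(values) - min(values)

# Sample variance by hand: squared distances from the mean over n - 1.
m = mean(values)
variance_by_hand = sum((x - m) ** 2 for x in values) / (len(values) - 1)

sample_variance = variance(values)  # same quantity from the library
sample_sd = stdev(values)           # square root of the sample variance

q1, _, q3 = quantiles(values, n=4)  # 25th, 50th and 75th percentiles
iqr = q3 - q1

print("Range:", value_range)
print("Variance:", sample_variance)
print("SD:", round(sample_sd, 2))
print(f"Interquartile range: {q1} to {q3} (width {iqr})")
```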

49
Quartiles
50
Properties of Normal curve
51
Example: recorded age for 282 subjects
52
Appropriate measures of location and spread
53
Coefficient of variation (CV)
CV = SD/mean
  • CV is a unitless quantity which indicates the
    variability around the mean in relation to the
    size of the mean
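  • Useful for comparing variability between groups,
    especially when measurements have been made using
    different units (e.g. kilograms and pounds); the
    measurement scale needs a true zero, so it is not
    suitable for temperatures in Celsius or Fahrenheit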
  • Measurement error
  • CV often used to describe measurement error
  • Best used in situations where the measurement
    error is related to value of the measurement
  • When measurement error does not depend on the
    value of the measurement, the SD (within
    subjects) is a better representation of
    measurement error
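
That the CV is unitless can be checked with a short sketch. The weights are invented, and the function name is hypothetical; the same measurements expressed in kilograms and in pounds give the same CV.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the SD expressed as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical weights, invented for illustration.
weights_kg = [60.0, 72.5, 80.0, 95.0]
LB_PER_KG = 2.20462
weights_lb = [w * LB_PER_KG for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)

print(f"CV in kg: {cv_kg:.2f}%")
print(f"CV in lb: {cv_lb:.2f}%")
```

The SD and the mean are both multiplied by the same conversion factor, so the factor cancels in the ratio.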

54
Presentation and summary of continuous data
  • Mean, standard deviation
  • The mean age of responders was 43.1 years
    (standard deviation (SD) 16.8)
  • Median, interquartile range
  • The median age was 43.0 years (interquartile
    range 25.0 to 56.3)

55
Graphical presentation of continuous data
  • Histogram
  • indicates shape of frequency distribution
  • Box and Whisker
  • Indicates shape and can be useful when comparing
    two or more groups of subjects

56
Histogram
  • Assessment of normality
  • Judgement by eye
  • Normal plot
  • Statistical test

57
Normal plot to assess normality
  • This plot can help you assess whether your data
    can reasonably be assumed to come from a normal
    distribution
  • Any departures from a straight line are seen as
    departures of the sample data from normality
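
The points of a Normal plot can be sketched with the standard library. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the invented data; statistical packages may use a slightly different formula.

```python
from statistics import NormalDist, mean, stdev


def normal_plot_points(values):
    """Pair each ordered observation with its expected Normal value."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting position (i - 0.5) / n is an assumed convention.
    return [
        (fitted.inv_cdf((i - 0.5) / n), observed)
        for i, observed in enumerate(ordered, start=1)
    ]


# Hypothetical measurements, invented for illustration.
sample = [4.1, 4.8, 5.0, 5.3, 6.0]
points = normal_plot_points(sample)

for expected, observed in points:
    print(f"expected {expected:.2f}  observed {observed:.2f}")
```

If the data come from a Normal distribution, plotting observed against expected values gives points lying close to a straight line.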

58
Box and Whisker Plot
  • Box and whisker plot displays
  • Median - horizontal line in box
  • 25th and 75th percentile are the edges of the
    shaded box
  • Minimum and maximum values are the whiskers
  • Any outliers are plotted outside the whiskers
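
The quantities behind a box and whisker plot can be sketched as below. The data are invented. The rule used to separate outliers (more than 1.5 times the interquartile range beyond the box) is a widely used convention assumed for this example; the slide does not state which rule the plotting software applies.

```python
from statistics import quantiles

# Hypothetical measurements, invented for illustration.
values = [2, 4, 4, 5, 7, 8, 12, 30]

q1, med, q3 = quantiles(values, n=4)  # box edges and median line
iqr = q3 - q1

# Assumed convention: points beyond 1.5 x IQR from the box are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]
inside = [v for v in values if low_fence <= v <= high_fence]

# Whiskers reach the smallest and largest values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(f"Box: {q1} to {q3}, median {med}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print("Outliers:", outliers)
```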

59
Example
  • Weights of the 26 men from the 1996 US Olympic
    rowing team. Information is also available on
    the event in which the men competed.

60
Histogram of rowers' weights
  • Impossible values?
  • Outliers?
  • Mean 191.8 lbs
  • Median 202.5 lbs

61
Events competed in
62
Weights excluding lightweights
  • Mean 206.32 lbs
  • Median 207.0 lbs
  • Recall that for the complete dataset
  • Mean 191.85 lbs
  • Median 202.5 lbs

63
Summary
  • Type of data is very important
  • Qualitative or quantitative
  • Data checking: suspicious values, outliers
  • Different methods needed to summarise and
    graphically present them
  • Methods for one qualitative variable
  • Methods for two qualitative variables
  • Methods for a continuous variable
  • Distributions: symmetric or skewed
  • Different summaries required depending on
    distribution