Title: Data types and presentation of data
1Data types and presentation of data
- Dr Gordon Prescott
- Senior Lecturer in Medical Statistics
2Why are you here?
- Who wants to be a journalist, novelist or author?
Yet you have invested time learning to
communicate, speak, write, spell, organise facts,
write essays, and make an argument to support
your point of view.
- Who wants to be a mathematician?
Yet you use numbers, make estimations, make
judgements about time, money and risk every
day. What can you afford to spend on
accommodation or food? Does it make sense for
you to buy a bus pass? How many times a week can
you go out? Is your plan of action sensible or
too risky?
3Why are you here?
- Who wants to be a good scientist/researcher?
Must invest time and learn to apply intrinsic
numerical skills in the correct way in order to
organise numerical information so that you can
communicate, speak and write about it, to read
and write scientific reports, to make arguments
using numerical evidence to support and justify a
scientific conclusion.
4Skills common to statistics and medical and
scientific research
Communication
Analysis
These skills will be useful well beyond this
course
General skills
Team working
Reporting
Problem solving
5Statistics in health
medicine - public health - the health service -
the pharmaceutical industry
medical statistics
Medicine and health are of crucial importance to
society. Huge resources are devoted to these areas.
- Promoting good health
- Investigating different methods of treatment
- Caring for the ill
- Understanding disease
- Managing the health service
Question: How many times more likely is it that a
man or woman in the UK will die from heart
disease than a man or woman of the same age in
Japan?
Answer: 7
6Statistics in health
Allan Hackshaw, St Bartholomew's and the Royal
London School of Medicine and Dentistry
- MSc in Biometry, now a lecturer in epidemiology
and medical statistics.
- His work involves
- assessing the health effects of passive smoking
- investigating methods of detecting breast and
ovarian cancer
- studying chromosomal abnormalities in pregnancy
- teaching medical screening and epidemiology to
undergraduates and postgraduates
"My work in medical statistics has always
involved collaborating with experts in medicine
and it is important to be able to communicate
effectively. Statistics are used to clarify
research issues and quantify their effects. The
results can influence local, national and
international policy."
7Statistics in health
pharmaceutical statistics
The pharmaceutical industry is one of the UK's
biggest economic successes. It employs a large
number of scientists, including many
statisticians. Statisticians are responsible for
designing and analysing experiments to assess the
effects of drugs and to check for any
side-effects. They can also be involved in all
areas of drug development, from chemical
discovery through clinical development to
post-marketing safety surveillance. There are
strict legal requirements relating to any drug
development, and the role of the statistician is
central.
8Administration
- £5 charge for handbooks to cover printing
- Computer practicals in Computer room 3
- 9 - 10 HS and PHR
- 10 - 11 Clinical pharmacology,
non-graduating students
- 11 - 12 Nutrition
- Go to 2nd floor on main stairs and take half
flight of stairs towards back (North) of
building. Computer room 3 is at the back of the
café area.
9Administration
- Open door policy for this course,
- Wednesday 2 - 4 pm, Room 1.026
- Student statistics clinics for other statistical
queries not related to this course, i.e. projects
- Five times a week 1 - 2 pm Monday-Friday
- Aug/Sept only twice a week
- Assessment
- data analysis assignment (40%) November
- exam (60%) January
10To get the most out of this course
- Read the relevant handbook section prior to the
lecture
- Consult one of the recommended textbooks
- If you do not understand something
- Please come and see me
- Open door policy
- Make an appointment
- Ask tutors during the SPSS practicals or tutorial
sessions
11Web resources
- Public Health > Teaching > Applied Statistics
(PU5005)
- All PowerPoint presentations used in lectures
- Worked examples for SPSS practicals
- Any other correspondence regarding this course
12Criteria for assessment
Students' marks will reflect the following
criteria
13Very Important
NO EXTENSIONS unless medical certificate is
presented
14Texts
- Statistics at a Glance - Petrie and Sabin
- Excellent if little knowledge of statistics,
short topics
- Medical Statistics: a common sense approach -
Campbell
- Essential Medical Statistics - Kirkwood
- More detailed text, covers more than this course
- An introduction to medical statistics - Bland
- Have a look and choose the one that you prefer.
15Previous exposure to applied statistics
16Outline
- Introduction to data types
- Preparation of data for analysis
- Descriptive statistics
- Summarising data
- location
- spread
- Graphical presentation of data
17Qualitative data
Dichotomous (Binary): This variable has only 2
possible categories (mutually exclusive). E.g.
Result: Success/Failure; Gender: Male/Female.
Nominal: This variable has more than 2 categories,
mutually exclusive and unordered. E.g. Blood group:
A, B, AB, O; Marital status: Married,
Divorced/Widowed, Single.
Ordinal: This variable has more than 2 categories,
mutually exclusive and ordered. E.g. Disease stage:
Mild, Moderate, Severe; Satisfaction: Very
satisfied, Satisfied, Unsatisfied.
18Quantitative data
Discrete: This variable often represents counts
(integer values). E.g. Number of visits to a GP
in a year, Number of children, Number of times
admitted to hospital in the last 5 years.
Continuous: This variable can take any value
within a range of values. E.g. Height in cm,
Weight in kg, Distance from home to work in km.
19Ordinal v discrete
- Ordinal
- Stage of breast cancer: I, II, III, IV
- Discrete
- Number of children: 0, 1, 2, 3, 4, 5
20Importance of data type
- The type of data is critically important in
determining which methods of analysis will be
appropriate and valid
- The major distinction is between continuous and
categorical variables
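As a rough illustration of why the data type matters, the sketch below picks a summary according to whether a variable is categorical or continuous. The function name `summarise`, the `kind` labels and both small datasets are hypothetical and are not taken from the course material.

```python
from collections import Counter
from statistics import mean, median


def summarise(values, kind):
    """Return a simple summary suited to the stated data type."""
    if kind == "categorical":
        # Categorical data: count how often each category occurs.
        return dict(Counter(values))
    if kind == "continuous":
        # Continuous data: report measures of location.
        return {"mean": mean(values), "median": median(values)}
    raise ValueError(f"unknown data type: {kind}")


blood_group = ["A", "O", "O", "AB", "A", "O"]   # nominal, made-up values
height_cm = [160.0, 172.5, 168.0, 181.0]        # continuous, made-up values

print(summarise(blood_group, "categorical"))
print(summarise(height_cm, "continuous"))
```

Averaging the blood groups would be meaningless, which is the practical reason the type has to be decided before any analysis.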
21Class exercise
- Handbook, page 7/8
- 10 minutes
- Decide how a variable is to be measured
- Complete data type
- Think about any other areas of health that should
be considered
23Data checking
- Aim is to identify and rectify errors in the
data
- These errors may be due to
- data mistyped when entered onto computer
- confusion over correct units of measurement
- transposed digits
- If the original data is correct the errors can be
rectified
- If the original data is incorrect and the value
is implausible, then a missing value code should
be used
24Implausible values
- Categorical data
- All possible categories (codes) are specified,
any values outwith this range are implausible
- Continuous data
- Not as clear cut as to what values are
implausible
- Set reasonable ranges within which values should
lie
- Suspicious values should be checked and any
errors found should be corrected
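The two checks above can be sketched in a few lines. The coding scheme (1 = male, 2 = female), the plausible height range and the data values are all assumptions made for illustration. The flagged values would still need checking against the original records.

```python
VALID_SEX_CODES = {1, 2}            # hypothetical coding: 1 = male, 2 = female
HEIGHT_RANGE_CM = (100.0, 220.0)    # hypothetical plausible range for adults


def implausible_codes(values, valid_codes):
    """Return (index, value) pairs whose code is not a specified category."""
    return [(i, v) for i, v in enumerate(values) if v not in valid_codes]


def out_of_range(values, low, high):
    """Return (index, value) pairs lying outside the range low..high."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


sex = [1, 2, 2, 3, 1]
# 1.68 looks like metres entered instead of cm; 712.0 looks like 172.0 transposed.
height_cm = [172.0, 1.68, 165.5, 712.0, 180.0]

print(implausible_codes(sex, VALID_SEX_CODES))      # [(3, 3)]
print(out_of_range(height_cm, *HEIGHT_RANGE_CM))    # [(1, 1.68), (3, 712.0)]
```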
25Outliers
- May find outlying values that seem incompatible
with the rest of the data
- Suspicious values should be carefully checked
- Outliers can have a considerable influence on the
results of statistical analysis
- The decision to include or exclude outliers
should be made with caution
- Often useful to analyse data with and without the
outlier
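A minimal sketch of analysing the data with and without a suspicious value is given below. The weights are made up, and 170.0 is simply assumed to be the value under suspicion.

```python
from statistics import mean, median

weights_kg = [62.0, 70.0, 68.0, 74.0, 66.0, 170.0]   # hypothetical data
SUSPECT = 170.0                                      # value under suspicion


def location(values):
    """Mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}


with_outlier = location(weights_kg)
without_outlier = location([w for w in weights_kg if w != SUSPECT])

print(with_outlier)      # {'mean': 85.0, 'median': 69.0}
print(without_outlier)   # {'mean': 68.0, 'median': 68.0}
```

In this example the mean moves by 17 kg while the median moves by 1 kg, which shows how much influence a single value can have on some summaries and how little on others.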
26Missing values
- Missing values are usually given a code such as
-9, 9, 99 or 999
- Must remember to declare these as missing before
starting to analyse the data
- Consider the reason why the data are missing
- random
- non-random
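The sketch below shows what happens if missing-value codes are not declared before analysis. The codes -9 and 999, the helper `declare_missing` and the ages are illustrative assumptions. A statistics package would do the equivalent through its own missing-value settings.

```python
from statistics import mean

MISSING_CODES = {-9, 999}   # hypothetical codes chosen for this variable


def declare_missing(values, missing_codes):
    """Replace missing-value codes with None so they stay out of calculations."""
    return [None if v in missing_codes else v for v in values]


age = [34, 999, 51, -9, 47]                      # made-up ages with two codes
cleaned = declare_missing(age, MISSING_CODES)
observed = [v for v in cleaned if v is not None]

print(mean(age))        # codes treated as real ages: badly distorted
print(mean(observed))   # mean of the three observed ages
```

The codes must be values that cannot occur as real data for that variable: 99 would be a poor choice for age in an elderly population.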
27Continuous data
- The shape of the frequency distribution should be
assessed for continuous data
28Descriptive statistics
- Once the collected data has been checked, the
next step is to describe the data.
- Describe the subjects from which data has been
collected
- What are the characteristics of those subjects?
- Summarise the data in a simple and informative
manner
29Example Lifestyle survey
- Characteristics of participants
- age, sex, height, weight, employment, health
status
- Food
- Smoking
- Alcohol
- Health Checks
30Summaries of qualitative data
- Analysis of qualitative data often leads to counts
of occurrences
- in different categories
- together with the corresponding percentages
- For a single qualitative variable
- use a table or a graphical presentation
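A frequency table for a single qualitative variable can be sketched as counts with their percentages. The smoking categories and responses below are made up for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, percentage of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


smoking = ["never", "current", "never", "ex", "never", "current", "ex", "never"]

for category, (n, pct) in frequency_table(smoking).items():
    print(f"{category:<8} {n:>3} {pct:>5}%")
```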
31Frequencies nominal
32Frequencies ordinal
33Graphical presentation pie chart
34Bar chart
35Summary of two qualitative
variables together
- Crosstabulation
- Clustered bar chart
- counts
- percentages
36Two variables crosstabulation
37Bar chart counts in two groups
38Bar chart percentages within two groups
39Frequency or percent?
males n = 119, females n = 158
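When the groups differ in size, counts and within-group percentages can tell different stories. The sketch below uses made-up (sex, smoker) records, not the survey data: both groups contain 3 smokers, but the groups have 4 and 12 members.

```python
from collections import Counter

# Hypothetical records: (group, outcome)
records = (
    [("male", "yes")] * 3 + [("male", "no")] * 1
    + [("female", "yes")] * 3 + [("female", "no")] * 9
)


def crosstab(rows):
    """Count of each (group, outcome) combination."""
    return Counter(rows)


def within_group_percentages(rows):
    """Percentage of each outcome within its own group."""
    cells = Counter(rows)
    group_totals = Counter(group for group, _ in rows)
    return {(g, o): 100 * n / group_totals[g] for (g, o), n in cells.items()}


counts = crosstab(records)
percents = within_group_percentages(records)

print(counts[("male", "yes")], counts[("female", "yes")])      # 3 3
print(percents[("male", "yes")], percents[("female", "yes")])  # 75.0 25.0
```

The counts are identical, but the percentages are the fairer comparison because they allow for the different group sizes.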
40Presentation of qualitative data
41Continuous data
- For continuous data it is important to check the
shape of the frequency distribution. To do this
a histogram should be plotted
- The shape of the distribution is important in
determining the most appropriate summary measures
to indicate the location and spread of the data
42Histogram
43Shapes of frequency distributions
- Symmetrical
- Skewed
- Positively (tail to right)
- Negatively (tail to left)
- Most distributions encountered in medical work
are symmetrical or positively skewed
44Distributions
Figure: example frequency distributions:
symmetrical (Normal), negatively skewed and
positively skewed
45Measures of central tendency
- Mean - arithmetic average (μ, mu)
- sum all observations and divide by the number of
observations
- Median - central value of the distribution
- rank observations and the median is the
observation below which 50% of all values fall
- Mode - value which occurs most often
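The three measures can be computed with Python's standard `statistics` module, as sketched below. The data are made-up counts of GP visits, chosen to be positively skewed.

```python
from statistics import mean, median, mode

visits = [0, 1, 1, 2, 3, 4, 10]   # hypothetical number of GP visits in a year

# Mean: sum all observations and divide by the number of observations.
mean_by_hand = sum(visits) / len(visits)

print(mean(visits), mean_by_hand)   # arithmetic average
print(median(visits))               # middle value of the ranked observations
print(mode(visits))                 # most frequent value
```

Here the mean (3) is pulled above the median (2) by the single large value of 10.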
46Location of mean and median
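Figure: position of the mean relative to the
median. Negatively skewed: mean < median.
Symmetrical: mean = median. Positively skewed:
mean > median.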
47Mean or median?
- When the frequency distribution of the dataset
is
- Symmetrical - the mean is commonly the most
appropriate measure of location
- Skewed - the median is commonly the most
appropriate measure of location
48Measures of variation
- Range
- largest measurement minus smallest measurement
- Sample variance (σ², sigma squared)
- sum of squared distances of each observation from
the mean observation, divided by n-1
- Sample standard deviation (σ, sigma)
- the square root of the sample variance
- Coefficient of variation
- the standard deviation is expressed as a
percentage of the mean
- Interquartile range
- the difference between the 25th percentile and
75th percentile of the dataset
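These definitions can be checked numerically with the sketch below, which uses made-up ages. Packages differ in how they interpolate percentiles. The quartiles here come from the default method of Python's `statistics.quantiles`, so another package may give slightly different values.

```python
from math import sqrt
from statistics import mean, quantiles, stdev, variance

ages = [23, 31, 35, 42, 48, 54, 61]   # hypothetical ages in years

value_range = max(ages) - min(ages)

# Sample variance from the definition: squared distances from the mean,
# summed and divided by n - 1.
m = mean(ages)
variance_by_hand = sum((x - m) ** 2 for x in ages) / (len(ages) - 1)

sample_variance = variance(ages)      # library version, also divides by n - 1
sample_sd = stdev(ages)               # square root of the sample variance

q1, q2, q3 = quantiles(ages, n=4)     # 25th, 50th and 75th percentiles
iqr = q3 - q1

print(value_range, round(sample_variance, 2), round(sample_sd, 2), q1, q3, iqr)
```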
49Quartiles
50Properties of Normal curve
51Example Recorded age for 282 subjects
52Appropriate measures of location and spread
53Coefficient of variation (CV)
CV = SD/mean
- CV is a unitless quantity which indicates the
variability around the mean in relation to the
size of the mean
- Useful for comparing variability between groups,
especially when measurements have been made using
different units (e.g. kilograms and pounds; the
scale needs a true zero, so not Celsius and
Fahrenheit)
- Measurement error
- CV often used to describe measurement error
- Best used in situations where the measurement
error is related to value of the measurement
- When measurement error does not depend on the
value of the measurement, the SD (within
subjects) is a better representation of
measurement error
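The sketch below shows that the CV does not depend on the unit when the two units differ only by a scale factor. The weights are made up, and 2.20462 lb per kg is the conversion factor used. The CV is returned as a percentage of the mean.

```python
from statistics import mean, stdev

LB_PER_KG = 2.20462


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


weights_kg = [60.0, 72.0, 81.0, 90.0, 65.0]          # hypothetical weights
weights_lb = [w * LB_PER_KG for w in weights_kg]     # same people, in pounds

print(round(stdev(weights_kg), 2), round(stdev(weights_lb), 2))   # SDs differ
print(round(coefficient_of_variation(weights_kg), 2))
print(round(coefficient_of_variation(weights_lb), 2))             # same CV
```

The conversion multiplies the SD and the mean by the same factor, so the factor cancels in the ratio.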
54Presentation and summary of continuous data
- Mean, standard deviation
- The mean age of responders was 43.1 years
(standard deviation (SD) 16.8)
- Median, interquartile range
- The median age was 43.0 years (interquartile
range 25.0 to 56.3)
55Graphical presentation of continuous data
- Histogram
- indicates shape of frequency distribution
- Box and Whisker
- Indicates shape and can be useful when comparing
two or more groups of subjects
56Histogram
- Assessment of normality
- Judgement by eye
- Normal plot
- Statistical test
57Normal plot to assess normality
- This plot can help you assess whether your data
can reasonably be assumed to come from a normal
distribution
- Any departures from a straight line are seen as
departures of the sample data from normality
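The points behind a Normal plot can be sketched as follows: each ranked observation is paired with the value expected at that rank under a Normal distribution with the sample's mean and SD. The plotting position (i - 0.5)/n is one common convention and is an assumption here. The function name and data are hypothetical, and a statistics package may use a slightly different formula.

```python
from statistics import NormalDist, mean, stdev


def normal_plot_points(values):
    """Pair each ordered observation with its expected Normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))


sample = [5.6, 4.1, 7.1, 5.0, 6.2]   # made-up measurements

for expected, observed in normal_plot_points(sample):
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

If the data are close to Normal, plotting observed against expected gives points close to a straight line.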
58Box and Whisker Plot
- Box and whisker plot displays
- Median - horizontal line in box
- 25th and 75th percentile are the edges of the
shaded box
- Minimum and maximum values are the whiskers
- Any outliers are plotted outside the whiskers
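The quantities drawn in a box and whisker plot can be computed as in the sketch below. The slides do not say how outliers are identified. The rule used here (values more than 1.5 times the interquartile range beyond the box) is a common convention and is an assumption. The data are made up, and the quartiles use the default method of `statistics.quantiles`.

```python
from statistics import median, quantiles


def box_plot_summary(values):
    """Median, quartiles, whisker ends and outliers (assumed 1.5 x IQR rule)."""
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "median": median(ordered),
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),     # smallest value that is not an outlier
        "whisker_high": max(inside),    # largest value that is not an outlier
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


ages = [23, 31, 35, 42, 48, 54, 61, 120]   # hypothetical; 120 is suspicious
print(box_plot_summary(ages))
```

With no outliers present, the whiskers under this rule are simply the minimum and maximum values.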
59Example
- Weights of the 26 men from the 1996 US Olympic
rowing team. Information is also available on
the event in which the men competed.
60Histogram of rowers weights
- Impossible values?
- Outliers?
- Mean 191.8 lbs
- Median 202.5 lbs
61Events competed in
62Weights excluding lightweights
- Mean 206.32 lbs
- Median 207.0 lbs
- Recall that for the complete dataset
- Mean 191.85 lbs
- Median 202.5 lbs
63Summary
- Type of data very important
- Qualitative or quantitative
- Data checking: suspicious values, outliers
- Different methods needed to summarise and
graphically present them
- Methods for one qualitative variable
- Methods for two qualitative variables
- Methods for a continuous variable
- Distributions: symmetric or skewed
- Different summaries required depending on
distribution