# DTC Quantitative Methods: Descriptive Statistics (Thursday 17th January 2013)

## 1. DTC Quantitative Methods: Descriptive Statistics
## 2. Some relevant online course extracts

- Cramer (1998) Chapter 2: measurement and univariate analysis.
- Diamond and Jefferies (2001) Chapter 5: measures and displays of spread.
- Sarantakos (2007) Chapter 5: graphical displays.
- Huizingh (2007) Chapter 12: SPSS material.

## 3. Some basic terminology

- Quantitative measures are typically referred to as variables.
- Some variables are generated directly by the data-generation process, but other, derived variables may be constructed later on from the original set of variables.
- As the next slide indicates, variables are frequently referred to in more specific ways.

## 4. Cause(s) and effect?

- Often, one variable (and occasionally more than one) is viewed as being the dependent variable.
- Variables which are viewed as impacting upon this variable, or outcome, are often referred to as independent variables.
- However, in some forms of statistical analysis, independent variables are referred to in more specific ways (as can be seen within the menus of SPSS for Windows).

## 5. Levels of measurement (types of quantitative data)

- A nominal variable relates to a set of categories, such as ethnic groups or political parties, which is not ordered.
- An ordinal variable relates to a set of categories which are ordered, such as social classes or levels of educational qualification.
- An interval-level variable relates to a scale measure, such as age or income, that can be subjected to mathematical operations such as averaging.

## 6. How many variables?

- The starting point for statistical analyses is typically an examination of the distributions of values for the variables of interest. Such examinations of variables one at a time are a form of univariate analysis.
- Once a researcher moves on to looking at relationships between pairs of variables, she or he is engaging in bivariate analysis...
- ...and if they attempt to explain why two variables are related with reference to another variable or variables, they have moved on to a form of multivariate analysis.

## 7. Looking at categorical variables

- For nominal/ordinal variables this largely means looking at the frequencies of each category, often pictorially using, say, bar charts or pie charts.
- It is usually easier to get a sense of the relative importance of the various categories if one converts the frequencies into percentages!

## 8. Example of a frequency table

Place met marital or cohabiting partner:

| Place | Frequency | % |
|---|---|---|
| At school, college or university | 872 | 12.4 |
| At/through work | 1405 | 19.9 |
| In a pub/cafe/restaurant/bar/club | 2096 | 29.7 |
| At a social event organised by friend(s) | 1055 | 14.9 |
| Other | 1631 | 23.1 |
| TOTAL | 7059 | 100.0 |
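The conversion from frequencies to percentages recommended on the previous slide can be sketched in a few lines of Python; the labels and counts are taken from the table above.

```python
# Convert the frequencies from the table above into percentages.
freqs = {
    "At school, college or university": 872,
    "At/through work": 1405,
    "In a pub/cafe/restaurant/bar/club": 2096,
    "At a social event organised by friend(s)": 1055,
    "Other": 1631,
}
total = sum(freqs.values())  # 7059
percentages = {place: round(100 * n / total, 1) for place, n in freqs.items()}

for place, pct in percentages.items():
    print(f"{place}: {pct}%")
```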
## 9. Example of a pie-chart

[Pie chart of the same categories: at school/college/university; at/through work; in a pub/cafe/restaurant/bar/club; at a social event organised by friend(s); other.]
## 10. What are percentages?

- It may seem self-evident, but percentages are a form of descriptive statistic.
- Specifically, they are useful in describing the distributions (of frequencies) of nominal or ordinal (i.e. categorical) variables.
- When we consider interval-level variables, or more than one variable, we need (somewhat) more sophisticated descriptive statistics.

## 11. Descriptive statistics...

- ...are data summaries which provide an alternative to graphical representations of distributions of values (or relationships).
- ...aim to describe key aspects of distributions of values (or relationships).
- ...are of most relevance when we are thinking...

## 12. Description or inference?

- Descriptive statistics summarise relevant features of a set of values.
- Inferential statistics help researchers decide whether features of quantitative data from a sample can safely be concluded to be present in the population.
- Generalising from a sample to a population is part of the process of statistical inference.
- One objective may be to produce an estimate of the proportion of people in the population with a particular characteristic, i.e. a process of estimation.

## 13. Types of (univariate) descriptive statistics

- Measures of...
  - ...location (averages)
  - ...spread (variation)
  - ...skewness (asymmetry)
  - ...kurtosis
- We typically want to know about the first two, and only occasionally the third and fourth!

## 14. What is kurtosis anyway?

- "Increasing kurtosis is associated with the movement of probability mass from the shoulders of a distribution into its center and tails." (Balanda, K.P. and MacGillivray, H.L. 1988. 'Kurtosis: A Critical Review', The American Statistician 42(2): 111-119.)
- Below, kurtosis increases from left to right...

## 15. Visualising scale variables

- For interval-level data the appropriate visual summary of a distribution is a histogram; examining one can allow the researcher to assess whether it is reasonable to assume that the quantity of interest has a particular distributional shape (and whether it exhibits skewness).
- Unlike in bar charts, distances along the horizontal dimension of a histogram have a well-defined, consistent meaning: they represent differences between values on the interval-level scale in question.

## 16. Example of a histogram

[Histogram not reproduced.]
## 17. Measures of location

- Mean: the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases.
- Median: the middle value, when the values are ranked/ordered.
- Mode: the most common value.
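The three measures can be sketched with Python's standard `statistics` module, using the seminar-attendance data that appears later in the deck:

```python
import statistics

# Number of seminars attended (the example data used later in the deck).
attendance = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(attendance)      # sum of values / number of cases
median = statistics.median(attendance)  # middle value of the ordered list
mode = statistics.mode(attendance)      # most common value

print(mean, median, mode)  # 6.1 5.5 4
```

Note that for this list the mean, median and mode all differ, which is itself a hint that the distribution is not symmetric.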

## 18. Measures of spread

- Standard deviation (and variance): linked with the mean, as it is based on averaging squared deviations from it. (The variance is simply the standard deviation squared.)
- Interquartile range / quartile deviation: linked with the median, as they are also based on the values placed in order.

## 19. Measures of location and spread: an example (household size)

| | West Midlands | London |
|---|---|---|
| Mean | 2.94 | 2.96 |
| Median | 2 | 3 |
| Mode | 2 | 2 |
| s.d. | 1.93 | 1.58 |
| Skewness | 2.10 | 1.27 |
| Kurtosis | 5.54 | 2.24 |
## 20. Why is the standard deviation so important?

- The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations.
- It is also central to some of the statistical theory used in statistical testing/statistical inference...

## 21. An example of the calculation of a standard deviation

- Number of seminars attended by a sample of undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5
- Mean = 61/10 = 6.1
- Variance = ((5 − 6.1)² + (4 − 6.1)² + (4 − 6.1)² + (7 − 6.1)² + (9 − 6.1)² + (8 − 6.1)² + (9 − 6.1)² + (4 − 6.1)² + (6 − 6.1)² + (5 − 6.1)²) / (10 − 1) = 36.9/9 = 4.1
- Standard deviation = square root of variance = 2.025
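The slide's arithmetic can be checked in Python; the standard library's `statistics.stdev` uses the same n − 1 (sample) formula:

```python
import statistics

x = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
mean = sum(x) / len(x)                # 61/10 = 6.1

# Sum the squared deviations from the mean, then divide by n - 1.
ss = sum((v - mean) ** 2 for v in x)  # 36.9
variance = ss / (len(x) - 1)          # 4.1
sd = variance ** 0.5                  # ~2.025

print(round(sd, 3))
assert abs(sd - statistics.stdev(x)) < 1e-12
```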

## 22. The Empire Median Strikes Back!

- Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a boxplot.
- Boxplots are based on medians and quartiles rather than on the more commonly used mean and standard deviation.

## 23. Example of a boxplot

[Boxplot not reproduced.]
## 24. Moving on to bivariate descriptive statistics...

- These are referred to as measures of association, as they quantify the (strength of the) association between two variables.
- The most well-known of these is the (Pearson) correlation coefficient, often referred to as "the correlation coefficient", or even "the correlation".
- This quantifies the closeness of the relationship between two interval-level variables (scales).

## 25. Positive and negative relationships

- Positive or direct relationships:
  - If the points cluster around a line that runs from the lower left to the upper right of the graph area, then the relationship between the two variables is positive or direct.
  - An increase in the value of x is more likely to be associated with an increase in the value of y.
  - The closer the points are to the line, the stronger the relationship.
- Negative or inverse relationships:
  - If the points tend to cluster around a line that runs from the upper left to the lower right of the graph, then the relationship between the two variables is negative or inverse.
  - An increase in the value of x is more likely to be associated with a decrease in the value of y.

## 26. Working out the correlation coefficient (Pearson's r)

- Pearson's r tells us how much one variable changes as the values of another change: their covariation.
- Variation is measured with the standard deviation, which measures the average variation of each variable from its mean.
- Covariation is measured by calculating the amount by which each value of X varies from the mean of X, and the amount by which each value of Y varies from the mean of Y, multiplying the differences together, and finding the average (by dividing by n − 1).
- Pearson's r is calculated by dividing this by (s.d. of X) × (s.d. of Y) in order to standardise it.

## 27. Working out the correlation coefficient (Pearson's r), continued

- Because r is standardised it will always fall between −1 and 1.
- A correlation of either 1 or −1 means perfect association between the two variables.
- A correlation of 0 means that there is no association.
- Note: correlation does not mean causation. We can only investigate causation by reference to our theory. However (thinking about it the other way round), there is unlikely to be causation if there is no correlation.

## 28. A scatterplot of the values of two interval-level variables

[Scatterplot not reproduced.]
## 29. Example of calculating a correlation coefficient (corresponding to the last slide)

- X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5; Mean(X) = 6.1
- Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6; Mean(Y) = 7.1
- Covariation = (5 − 6.1)(8 − 7.1) + (4 − 6.1)(7 − 7.1) + ... = −0.99 + 0.21 + ... = 7.9
- s.d.(X) = 2.02; s.d.(Y) = 1.37
- r = (7.9 / 9) / (2.02 × 1.37) = 0.316
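The same calculation in Python, following the slide's steps (covariation divided by n − 1, then standardised by the two sample standard deviations):

```python
import statistics

x = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
y = [8, 7, 9, 7, 8, 8, 8, 5, 5, 6]
mx, my = statistics.mean(x), statistics.mean(y)

# Sum of the products of paired deviations from the means.
covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))  # 7.9
covariance = covariation / (len(x) - 1)

# Standardise by the product of the (sample) standard deviations.
r = covariance / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 2))
```

Using the unrounded standard deviations gives r ≈ 0.32, in line with the slide's 0.316 (which was computed from the rounded values 2.02 and 1.37).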

## 30. Looking at the relationship between two categorical variables

- If two variables are nominal or ordinal, i.e. categorical, we can look at the relationship between them in the form of a cross-tabulation, using percentages to summarise the pattern. (Typically, if there is one variable that can be viewed as depending on the other, i.e. a dependent variable, and the categories of this variable make up the columns of the cross-tabulation, then it makes sense to have percentages that sum to 100 across each row; these are referred to as row percentages.)

## 31. An example of a cross-tabulation (from Jamieson et al., 2002)

"When you and your current partner first decided to set up home or move in together, did you think of it as a permanent arrangement or something that you would try and then see how it worked?"

| | Both 'permanent' | Both 'try and see' | Different answers | TOTAL |
|---|---|---|---|---|
| Cohabiting without marriage | 15 (48) | 4 (13) | 12 (39) | 31 (100) |
| Cohabited and then married | 16 (67) | 1 (4) | 7 (29) | 24 (100) |
| Married without cohabiting | 9 (100) | 0 (0) | 0 (0) | 9 (100) |

(Counts, with row percentages in brackets.)

Jamieson, L. et al. 2002. 'Cohabitation and commitment: partnership plans of young men and women', Sociological Review 50(3): 356-377.
## 32. Alternative forms of percentage

- In the following example, row percentages allow us to compare outcomes between the categories of an independent variable.
- However, we can also use column percentages to look at the composition of each category of the dependent variable.
- In addition, we can use total percentages to look at how the cases are distributed across combinations of the two variables.

## 33. Example cross-tabulation II: row percentages

[Table not reproduced.] Derived from Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987) Social Mobility and Class Structure in Modern Britain (2nd edition). Oxford: Clarendon Press.
## 34. Example cross-tabulation II: column percentages

[Table not reproduced.]
## 35. Example cross-tabulation II: total percentages

[Table not reproduced.]
## 36. Percentages and association

- It is possibly self-evident that the differences between the percentages in different rows (or columns) can collectively be viewed as measuring association.
- In the case of a 2x2 cross-tabulation (i.e. one with two rows and two columns), the difference between the percentages is a measure of association for that cross-tabulation.
- But there are other ways of quantifying the association in a cross-tabulation.

## 37. Odds ratios as a measure of association

- The patterns in the social mobility table examined in an earlier session can clearly be expressed as differences in percentages (e.g. the differences between the percentages of sons with fathers in classes I and VII who are themselves in classes I and VII).
- However, an alternative way of quantifying these class differences is to compare the odds of class I fathers having sons in class I as opposed to class VII with the odds of class VII fathers having sons in class I as opposed to class VII.
- The ratio of these two sets of odds is an odds ratio, which will have a value close to 1.0 if the two sets of odds are similar, i.e. if there is little or no difference between the chances of being in classes I and VII for sons with fathers in classes I and VII respectively.

## 38. Odds ratios vs. differences: an example (gender and higher education)

| Age group | Sex | Degree (%) | No degree | Percentage difference | Odds ratio |
|---|---|---|---|---|---|
| 30-39 | Men | 56 (13.0) | 374 | -0.8 | (56/374)/(70/438) = 0.937 |
| 30-39 | Women | 70 (13.8) | 438 | | |
| 40-49 | Men | 56 (14.4) | 334 | 5.3 | (56/334)/(38/378) = 1.668 |
| 40-49 | Women | 38 (9.1) | 378 | | |
| 50-59 | Men | 34 (9.9) | 308 | 4.7 | (34/308)/(18/329) = 2.018 |
| 50-59 | Women | 18 (5.2) | 329 | | |

(Counts, with the percentage holding a degree in brackets.)
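The differences and odds ratios above can be reproduced from the counts. Note that the slide's differences are between the already-rounded percentages (e.g. 14.4 − 9.1 = 5.3), so the sketch rounds before subtracting; the helper function name is my own.

```python
# Percentage-point difference and odds ratio for one age group,
# computed from the counts of men/women with and without a degree.
def degree_stats(men_deg, men_no, women_deg, women_no):
    pct_men = round(100 * men_deg / (men_deg + men_no), 1)
    pct_women = round(100 * women_deg / (women_deg + women_no), 1)
    difference = round(pct_men - pct_women, 1)
    odds_ratio = round((men_deg / men_no) / (women_deg / women_no), 3)
    return difference, odds_ratio

print(degree_stats(56, 374, 70, 438))  # age 30-39 -> (-0.8, 0.937)
print(degree_stats(56, 334, 38, 378))  # age 40-49 -> (5.3, 1.668)
print(degree_stats(34, 308, 18, 329))  # age 50-59 -> (4.7, 2.018)
```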

## 39. Choice of measure can matter!

- The choice between differences between percentages and odds ratios as a way of quantifying differences between groups can matter, as in the preceding example of the effect of gender on the likelihood of having a degree, according to age.
- The difference values of 4.7, 5.3 and -0.8 suggest that inequality increased before it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937 suggest a small decrease in inequality before a larger decrease led to approximate equality!
- Evidently, there are competing ways of measuring association in a cross-tabulation. But neither differences between percentages nor odds ratios provides an overall summary of the association in a cross-tabulation.

## 40. Another measure of association

- If we need an overall measure of association for two cross-tabulated (categorical) variables, one standard possibility is Cramér's V.
- Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship; but it can only take on positive values, and it makes no assumption of linearity.
- It is derived from a test statistic (inferential statistic), chi-square, which we will consider in a later session.
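Cramér's V can be sketched from the chi-square statistic as V = sqrt(chi² / (n(k − 1))), where n is the total count and k is the smaller of the number of rows and columns. The 2x2 table below is invented purely for illustration:

```python
from math import sqrt

# An invented 2x2 cross-tabulation (observed counts).
table = [[30, 20], [20, 30]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Chi-square: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / n.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

k = min(len(table), len(table[0]))  # the smaller table dimension
v = sqrt(chi2 / (n * (k - 1)))
print(round(v, 2))  # 0.2
```

For a 2x2 table this reduces to the phi coefficient; for larger tables the k − 1 divisor keeps V within the 0-1 range.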

## 41. An example of Cramér's V

Cramér's V = 0.074

[Cross-tabulation not reproduced.]
## 42. Other measures of association for cross-tabulations

- In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these:
- Goodman, L.A. and Kruskal, W.H. 1979. Measures of Association for Cross Classifications. New York: Springer-Verlag.
- ...and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal.

One of Tog's (distant) relatives
## 43. What if one variable is a set of categories, and the other is a scale?

- The equivalent to comparing percentages in this instance is comparing means, but there may be quite a lot of these!
- So one possible overall measure of association used in this situation is eta-squared (η²).
- But this is a less familiar measure (at least to researchers in some social science disciplines!)
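A minimal sketch of eta-squared as the between-group sum of squares divided by the total sum of squares of the scale variable; the group labels and values below are invented for illustration:

```python
# Invented example: a scale variable measured within two categories.
groups = {
    "group A": [2.0, 3.0, 4.0],
    "group B": [6.0, 7.0, 8.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Total variation, and the part of it attributable to group differences
# (each group's contribution is its size times its squared mean deviation).
ss_total = sum((v - grand_mean) ** 2 for v in all_values)
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

eta_squared = ss_between / ss_total
print(round(eta_squared, 3))  # 0.857
```

An eta-squared of 0 would mean the group means are identical; a value of 1 would mean all variation in the scale lies between the groups.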