DTC Quantitative Methods: Descriptive Statistics
Thursday 17th January 2013
Some relevant online course extracts
  • Cramer (1998) Chapter 2 - Measurement and univariate analysis.
  • Diamond and Jefferies (2001) Chapter 5 - Measures and displays of spread.
  • Sarantakos (2007) Chapter 5 - Graphical displays.
  • Huizingh (2007) Chapter 12 - SPSS material.

Some basic terminology
  • Quantitative measures are typically referred to
    as variables.
  • Some variables are generated directly via the
    data generation process, but other, derived
    variables may be constructed from the original
    set of variables later on.
  • As the next slide indicates, variables are
    frequently referred to in more specific ways.

Cause(s) and effect?
  • Often, one variable (and occasionally more than
    one variable) is viewed as being the dependent
    variable, i.e. the outcome.
  • Variables which are viewed as impacting upon this
    dependent variable are often referred to as
    independent variables.
  • However, for some forms of statistical analysis,
    independent variables are referred to in more
    specific ways (as can be seen within the menus of
    SPSS for Windows).

Levels of measurement (types of quantitative variable)
  • A nominal variable relates to a set of categories
    which is not ordered, such as ethnic groups or
    political parties.
  • An ordinal variable relates to a set of categories
    in which the categories are ordered, such as social
    classes or levels of educational attainment.
  • An interval-level variable relates to a scale
    measure, such as age or income, that can be
    subjected to mathematical operations such as
    addition and subtraction.

How many variables?
  • The starting point for statistical analyses is
    typically an examination of the distributions of
    values for the variables of interest. Such
    examinations of variables one at a time are a
    form of univariate analysis.
  • Once a researcher moves on to looking at
    relationships between pairs of variables she or
    he is engaging in bivariate analyses.
  • And if they attempt to explain why two
    variables are related with reference to another
    variable or variables, they have moved on to a
    form of multivariate analysis.

Looking at categorical variables
  • For nominal/ordinal variables this largely means
    looking at the frequencies of each category,
    often pictorially using, say, bar charts or
    pie charts.
  • It is usually easier to get a sense of the
    relative importance of the various categories if
    one converts the frequencies into percentages!

Example of a frequency table

Place met marital or cohabiting partner     Frequency      %
At school, college or university                  872    12.4
At/through work                                  1405    19.9
In a pub/cafe/restaurant/bar/club                2096    29.7
At a social event organised by friend(s)         1055    14.9
Other                                            1631    23.1
TOTAL                                            7059   100.0
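The percentage column of a frequency table like this one is derived directly from the raw counts. A minimal Python sketch (illustrative only; the course materials themselves use SPSS):

```python
# Counts taken from the frequency table above.
freqs = {
    "At school, college or university": 872,
    "At/through work": 1405,
    "In a pub/cafe/restaurant/bar/club": 2096,
    "At a social event organised by friend(s)": 1055,
    "Other": 1631,
}

total = sum(freqs.values())  # 7059
# Convert each frequency into a percentage of the total.
percentages = {place: round(100 * count / total, 1)
               for place, count in freqs.items()}
```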
Example of a pie-chart (showing the same categories as the frequency table above)
What are percentages?
  • It may seem self-evident, but percentages are a
    form of descriptive statistic.
  • Specifically, they are useful in describing the
    distributions (of frequencies) for nominal or
    ordinal (i.e. categorical) variables.
  • When we consider interval-level variables or more
    than one variable, we need (somewhat) more
    sophisticated descriptive statistics.

Descriptive statistics...
  • ... are data summaries which provide an
    alternative to graphical representations of
    distributions of values (or relationships)
  • ... aim to describe key aspects of distributions
    of values (or relationships)
  • ... are of most relevance when we are thinking
    about interval-level variables (scales)

Description or inference?
  • Descriptive statistics summarise relevant
    features of a set of values.
  • Inferential statistics help researchers decide
    whether features of quantitative data from a
    sample can be safely concluded to be present in
    the population.
  • Generalising from a sample to a population is
    part of the process of statistical inference.
  • One objective may be to produce an estimate of
    the proportion of people in the population with a
    particular characteristic, i.e. a process of
    estimation.

Types of (univariate) descriptive statistics
  • Measures of ...
  • ... location (averages)
  • ... spread
  • ... skewness (asymmetry)
  • ... kurtosis
  • We typically want to know about the first two,
    sometimes about the third, and rarely about the
    fourth.
What is kurtosis anyway?
  • "Increasing kurtosis is associated with the
    movement of probability mass from the shoulders
    of a distribution into its center and tails."
    (Balanda, K.P. and MacGillivray, H.L. 1988.
    'Kurtosis: A Critical Review', The American
    Statistician 42(2): 111-119.)
  • Below, kurtosis increases from left to right...

Visualising scale variables
  • For interval-level data the appropriate visual
    summary of a distribution is a histogram,
    examining which can allow the researcher to
    assess whether it is reasonable to assume that
    the quantity of interest has a particular
    distributional shape (and whether it exhibits
    skewness).
  • Unlike bar charts, distances along the
    horizontal dimension of a histogram have a
    well-defined, consistent meaning, i.e. they
    represent differences between values on the
    interval-level scale in question.

Example of a histogram
Measures of location
  • Mean (the arithmetic average of the values,
    i.e. the result of dividing the sum of the
    values by the total number of cases).
  • Median (the middle value, when the values are
    placed in order).
  • Mode (the most common value).
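As an illustrative sketch (the course uses SPSS, not Python), the three measures of location can be computed with Python's standard library, here on the seminar-attendance sample that appears later in these slides:

```python
from statistics import mean, median, mode

# Seminar-attendance sample used later in these slides.
values = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

average = mean(values)   # sum of the values divided by the number of cases
middle = median(values)  # middle value once the values are placed in order
common = mode(values)    # the most common value
```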

... and measures of spread
  • Standard deviation (and variance)
    (This is linked with the mean, as it is based on
    averaging squared deviations from it. The
    variance is simply the standard deviation
    squared.)
  • Interquartile range / quartile deviation
    (These are linked with the median, as they
    are also based on the values placed in order.)

Measures of location and spread: an example
(household size)

             West Midlands    London
Mean              2.94         2.96
Median            2            3
Mode              2            2
s.d.              1.93         1.58
Skewness          2.10         1.27
Kurtosis          5.54         2.24
Why is the standard deviation so important?
  • The standard deviation (or, more precisely, the
    variance) is important because it introduces the
    idea of summarising variation in terms of summed,
    squared deviations.
  • And it is also central to some of the statistical
    theory used in statistical testing/statistical
    inference.

An example of the calculation of a standard deviation
  • Number of seminars attended by a sample of
    undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5
  • Mean = 61/10 = 6.1
  • Variance = ((5 - 6.1)² + (4 - 6.1)² + (4 - 6.1)² + (7 - 6.1)²
    + (9 - 6.1)² + (8 - 6.1)² + (9 - 6.1)² + (4 - 6.1)²
    + (6 - 6.1)² + (5 - 6.1)²) / (10 - 1) = 36.9/9 = 4.1
  • Standard deviation = square root of variance ≈ 2.02
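The same arithmetic can be sketched in a few lines of Python (illustrative only, not part of the course materials), using the n - 1 divisor for the sample variance as on the slide:

```python
import math

# Seminar-attendance sample from the slide.
values = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
n = len(values)

sample_mean = sum(values) / n  # 61/10 = 6.1
# Sample variance: summed squared deviations from the mean, divided by n - 1.
variance = sum((x - sample_mean) ** 2 for x in values) / (n - 1)
sd = math.sqrt(variance)       # about 2.02
```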

The Empire Median Strikes Back!
  • Comparing descriptive statistics between groups
    can be done graphically in a rather nice way
    using a form of display called a boxplot.
  • Boxplots are based on medians and quartiles
    rather than on the more commonly found mean and
    standard deviation.

Example of a boxplot
Moving on to bivariate descriptive statistics
  • These are referred to as measures of
    association, as they quantify the (strength of
    the) association between two variables.
  • The most well-known of these is the (Pearson)
    correlation coefficient, often referred to as
    "the correlation coefficient", or even just
    "the correlation".
  • This quantifies the closeness of the relationship
    between two interval-level variables (scales).

Positive and negative relationships
  • Positive or direct relationships
  • If the points cluster around a line that runs
    from the lower left to upper right of the graph
    area, then the relationship between the two
    variables is positive or direct.
  • An increase in the value of x is more likely to
    be associated with an increase in the value of y.
  • The closer the points are to the line, the
    stronger the relationship.
  • Negative or inverse relationships
  • If the points tend to cluster around a line that
    runs from the upper left to lower right of the
    graph, then the relationship between the two
    variables is negative or inverse.
  • An increase in the value of x is more likely to
    be associated with a decrease in the value of y.

Working out the correlation coefficient
(Pearson's r)
  • Pearson's r tells us how much one variable
    changes as the values of another variable change.
  • Variation is measured with the standard
    deviation. This measures average variation of
    each variable from the mean for that variable.
  • Covariation is measured by calculating the amount
    by which each value of X varies from the mean of
    X and the amount by which each value of Y varies
    from the mean of Y, multiplying the
    differences together, and finding the average (by
    dividing by n - 1).
  • Pearson's r is calculated by dividing this by (SD
    of X) x (SD of Y) in order to standardise it.

Working out the correlation coefficient
(Pearson's r)
  • Because r is standardised it will always fall
    between 1 and -1.
  • A correlation of either 1 or -1 means perfect
    association between the two variables.
  • A correlation of 0 means that there is no
    association.
  • Note: correlation does not mean causation. We can
    only investigate causation by reference to our
    theory. However (thinking about it the other way
    round), there is unlikely to be causation if there
    is not correlation.

A scatterplot of the values of two
interval-level variables
Example of calculating a correlation coefficient
(corresponding to the last slide)
  • X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5   Mean(X) = 6.1
  • Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6   Mean(Y) = 7.1
  • (5 - 6.1)(8 - 7.1) + (4 - 6.1)(7 - 7.1) + ... etc.
    = -0.99 + 0.21 + ... = 7.9 (covariation)
  • S.D.(X) = 2.02   S.D.(Y) = 1.37
  • r = (7.9 / 9) / (2.02 x 1.37) = 0.316
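This worked example can be reproduced by hand in Python (an illustrative sketch, not part of the course materials): covariance divided by the product of the two standard deviations, all with the n - 1 divisor:

```python
import math

# Data from the slide.
X = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
Y = [8, 7, 9, 7, 8, 8, 8, 5, 5, 6]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n  # 6.1 and 7.1

# Covariation: averaged products of paired deviations (dividing by n - 1).
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)

# Standard deviations of X and Y (also with the n - 1 divisor).
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))

r = cov / (sd_x * sd_y)  # about 0.316
```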

Looking at the relationship between two
categorical variables
  • If two variables are nominal or ordinal, i.e.
    categorical, we can look at the relationship
    between them in the form of a cross-tabulation,
    using percentages to summarize the pattern.
    (Typically, if there is one variable that can be
    viewed as depending on the other, i.e. a
    dependent variable, and the categories of this
    variable make up the columns of the
    cross-tabulation, then it makes sense to have
    percentages that sum to 100 across each row;
    these are referred to as row percentages.)

An example of a cross-tabulation (from Jamieson
et al., 2002)

"When you and your current partner first decided
to set up home or move in together, did you think
of it as a permanent arrangement or something
that you would try and then see how it worked?"

                              Both        Both try    Different
                              permanent   and see     answers     TOTAL
Cohabiting without marriage   15 (48)     4 (13)      12 (39)     31 (100)
Cohabited and then married    16 (67)     1 (4)       7 (29)      24 (100)
Married without cohabiting    9 (100)     0 (0)       0 (0)       9 (100)

Row percentages shown in brackets. Jamieson, L. et al. 2002.
'Cohabitation and commitment: partnership plans of young men and
women', Sociological Review 50(3): 356-377.
Alternative forms of percentage
  • In the following example, row percentages allow
    us to compare outcomes between the categories of
    an independent variable.
  • However, we can also use column percentages to
    look at the composition of each category of the
    dependent variable.
  • In addition, we can use total percentages to look
    at how the cases are distributed across
    combinations of the two variables.
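The three kinds of percentage can be sketched in Python on a made-up 2x2 table of counts (the numbers are purely illustrative, with rows as the independent variable and columns as the dependent variable):

```python
# Made-up 2x2 table of counts, purely for illustration.
table = [[30, 20],
         [10, 40]]

n = sum(sum(row) for row in table)
col_totals = [sum(col) for col in zip(*table)]

# Row percentages: each row sums to 100.
row_pcts = [[100 * cell / sum(row) for cell in row] for row in table]
# Column percentages: each column sums to 100.
col_pcts = [[100 * row[j] / col_totals[j] for j in range(len(row))]
            for row in table]
# Total percentages: the whole table sums to 100.
total_pcts = [[100 * cell / n for cell in row] for row in table]
```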

Example cross-tabulation II: row percentages
(Derived from Goldthorpe, J.H. with Llewellyn,
C. and Payne, C. 1987. Social Mobility and
Class Structure in Modern Britain (2nd Edition).
Oxford: Clarendon Press.)
Example cross-tabulation II: column percentages
Example cross-tabulation II: total percentages
Percentages and association
  • It is possibly self-evident that the differences
    between the percentages in different rows (or
    columns) can collectively be viewed as measuring
    association.
  • In the case of a 2x2 cross-tabulation (i.e. one
    with two rows and two columns), the difference
    between the percentages is a measure of
    association for that cross-tabulation.
  • But there are other ways of quantifying the
    association in the cross-tabulation.

Odds ratios as a measure of association
  • The patterns in the social mobility table
    examined in an earlier session can clearly be
    expressed as differences in percentages (e.g. the
    differences between the percentages of sons with
    fathers in classes I and VII who are themselves
    in classes I and VII).
  • However, an alternative way of quantifying these
    class differences is to compare the odds of class
    I fathers having sons in class I as opposed to
    class VII with the odds of class VII fathers
    having sons in class I as opposed to class VII.
  • The ratio of these two sets of odds is an odds
    ratio, which will have a value of close to 1.0 if
    the two sets of odds are similar, i.e. if there
    is little or no difference between the chances of
    being in classes I and VII for sons with fathers
    in classes I and VII respectively.

Odds ratios vs. differences: an example (gender
and higher education)

             Degree       No degree    % difference    Odds ratio
Age 30-39
  Men         56 (13.0)      374           -0.8           0.937
  Women       70 (13.8)      438
Age 40-49
  Men         56 (14.4)      334            5.3           1.668
  Women       38 (9.1)       378
Age 50-59
  Men         34 (9.9)       308            4.7           2.018
  Women       18 (5.2)       329

Percentages with a degree shown in brackets.
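The odds ratios in this table can be reproduced from the raw counts. A short Python sketch (illustrative only): for each age group, the odds of men having a degree divided by the corresponding odds for women:

```python
# Counts taken from the table above: (degree, no degree) for men and women.
groups = {
    "30-39": ((56, 374), (70, 438)),
    "40-49": ((56, 334), (38, 378)),
    "50-59": ((34, 308), (18, 329)),
}

def odds_ratio(men, women):
    """Odds of men having a degree divided by the odds for women."""
    (men_deg, men_no), (women_deg, women_no) = men, women
    return (men_deg / men_no) / (women_deg / women_no)

ratios = {age: round(odds_ratio(men, women), 3)
          for age, (men, women) in groups.items()}
# ratios: 0.937, 1.668 and 2.018, matching the slide.
```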

Choice of measure can matter!
  • The choice of differences between percentages
    versus odds ratios as a way of quantifying
    differences between groups can matter, as in the
    preceding example of the effect of gender on
    the likelihood of having a degree, according to
    age group.
  • The difference values of 4.7, 5.3 and -0.8
    suggest that inequality increased before it
    disappeared, whereas the odds ratios of 2.018,
    1.668 and 0.937 suggest a small decrease in
    inequality before a larger decrease led to
    approximate equality!
  • Evidently, there are competing ways of measuring
    association in a cross-tabulation. But neither
    differences between percentages nor odds ratios
    provide an overall summary of the association in
    a cross-tabulation.

Another measure of association
  • If we need an overall measure of association for
    two cross-tabulated (categorical) variables, one
    standard possibility is Cramér's V.
  • Like the Pearson correlation coefficient, it has
    a maximum of 1, and 0 indicates no relationship,
    but it can only take on positive values, and
    makes no assumption of linearity.
  • It is derived from a test statistic (inferential
    statistic), chi-square, which we will consider in
    a later session.

An example of Cramér's V
Cramér's V = 0.074
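The slide does not reproduce the table behind this value, so the following sketch computes Cramér's V for a made-up 2x2 table instead, via the chi-square statistic (which these slides only introduce properly in a later session):

```python
import math

# Made-up 2x2 contingency table, purely for illustration.
table = [[30, 20],
         [20, 30]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Chi-square: summed squared differences between observed and
# expected counts, scaled by the expected counts.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# Cramér's V: sqrt of chi-square over n times (min(rows, cols) - 1).
k = min(len(table), len(table[0]))
v = math.sqrt(chi2 / (n * (k - 1)))
```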
Other measures of association for cross-tabulations
  • In a literature review more than thirty years
    ago, Goodman and Kruskal identified several dozen
    of these:
  • Goodman, L.A. and Kruskal, W.H. 1979. Measures
    of Association for Cross Classifications. New
    York: Springer-Verlag.
  • ...and I added one of my own, Tog, which measures
    inequality (in a particular way) where both
    variables are ordinal.

One of Tog's (distant) relatives
What if one variable is a set of categories, and
the other is a scale?
  • The equivalent to comparing percentages in this
    instance is comparing means, but there may be
    quite a lot of these!
  • So one possible overall measure of association
    used in this situation is eta-squared (η²).
  • But this is a less familiar measure (at least to
    researchers in some social science disciplines!)