DTC Quantitative Methods Descriptive Statistics Thursday 17th January 2013

Some relevant online course extracts

- Cramer (1998) Chapter 2
- - Measurement and univariate analysis.
- Diamond and Jefferies (2001) Chapter 5
- - Measures and displays of spread.
- Sarantakos (2007) Chapter 5
- - Graphical displays.
- Huizingh (2007) Chapter 12
- - SPSS material.

Some basic terminology

- Quantitative measures are typically referred to as variables.
- Some variables are generated directly via the data generation process, but other, derived variables may be constructed from the original set of variables later on.
- As the next slide indicates, variables are frequently referred to in more specific ways.

Cause(s) and effect?

- Often, one variable (and occasionally more than one variable) is viewed as being the dependent variable.
- Variables which are viewed as impacting upon this variable, or outcome, are often referred to as independent variables.
- However, for some forms of statistical analyses, independent variables are referred to in more specific ways (as can be seen within the menus of SPSS for Windows).

Levels of measurement (Types of quantitative data)

- A nominal variable relates to a set of categories, such as ethnic groups or political parties, which is not ordered.
- An ordinal variable relates to a set of categories in which the categories are ordered, such as social classes or levels of educational qualification.
- An interval-level variable relates to a scale measure, such as age or income, that can be subjected to mathematical operations such as averaging.

How many variables?

- The starting point for statistical analyses is typically an examination of the distributions of values for the variables of interest. Such examinations of variables one at a time are a form of univariate analysis.
- Once a researcher moves on to looking at relationships between pairs of variables she or he is engaging in bivariate analyses.
- ... and if they attempt to explain why two variables are related with reference to another variable or variables, they have moved on to a form of multivariate analysis.

Looking at categorical variables

- For nominal/ordinal variables this largely means looking at the frequencies of each category, often pictorially using, say, bar-charts or pie-charts.
- It is usually easier to get a sense of the relative importance of the various categories if one converts the frequencies into percentages!

Example of a frequency table

Place met marital or cohabiting partner        Frequency      %

At school, college or university                     872   12.4
At/through work                                     1405   19.9
In a pub/cafe/restaurant/bar/club                   2096   29.7
At a social event organised by friend(s)            1055   14.9
Other                                               1631   23.1

TOTAL                                               7059  100.0
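
The conversion from frequencies to percentages can be sketched in a few lines of Python. The counts are those in the frequency table above; the variable names are illustrative and not part of the original slides.

```python
# Frequencies from the table above (place met marital or cohabiting partner).
frequencies = {
    "At school, college or university": 872,
    "At/through work": 1405,
    "In a pub/cafe/restaurant/bar/club": 2096,
    "At a social event organised by friend(s)": 1055,
    "Other": 1631,
}

total = sum(frequencies.values())

# Each percentage is the category frequency divided by the total, times 100.
percentages = {
    place: round(100 * count / total, 1)
    for place, count in frequencies.items()
}

for place, pct in percentages.items():
    print(f"{place}: {frequencies[place]} ({pct}%)")
print(f"TOTAL: {total}")
```

Rounding to one decimal place reproduces the percentage column of the table.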

Example of a pie-chart

[Pie chart of place met marital or cohabiting partner. Slices: At school, college or university; At/through work; In a pub/cafe/restaurant/bar/club; At a social event organised by friend(s); Other.]

What are percentages?

- It may seem self-evident, but percentages are a form of descriptive statistic.
- Specifically, they are useful in describing the distributions (of frequencies) for nominal or ordinal (i.e. categorical) variables.
- When we consider interval-level variables or more than one variable, we need (somewhat) more sophisticated descriptive statistics.

Descriptive statistics...

- ... are data summaries which provide an alternative to graphical representations of distributions of values (or relationships)
- ... aim to describe key aspects of distributions of values (or relationships)
- ... are of most relevance when we are thinking about interval-level variables (scales)

Description or inference?

- Descriptive statistics summarise relevant features of a set of values.
- Inferential statistics help researchers decide whether features of quantitative data from a sample can be safely concluded to be present in the population.
- Generalizing from a sample to a population is part of the process of statistical inference.
- One objective may be to produce an estimate of the proportion of people in the population with a particular characteristic, i.e. a process of estimation.

Types of (univariate) descriptive statistics

- Measures of ...
- ... location (averages)
- ... spread
- ... skewness (asymmetry)
- ... kurtosis
- We typically want to know about the first two, sometimes about the third, and rarely about the fourth!

What is kurtosis anyway?

- Increasing kurtosis is associated with the movement of probability mass from the shoulders of a distribution into its center and tails. (Balanda, K.P. and MacGillivray, H.L. 1988. Kurtosis: A Critical Review, The American Statistician 42(2): 111-119.)
- Below, kurtosis increases from left to right...

Visualising scale variables

- For interval-level data the appropriate visual summary of a distribution is a histogram, examining which can allow the researcher to assess whether it is reasonable to assume that the quantity of interest has a particular distributional shape (and whether it exhibits skewness).
- Unlike bar charts, distances along the horizontal dimension of a histogram have a well-defined, consistent meaning, i.e. they represent differences between values on the interval-level scale in question.

Example of a histogram

Measures of location

- Mean (the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases)
- Median (the middle value, when the values are ranked/ordered)
- Mode (the most common value)

... and measures of spread

- Standard deviation (and Variance)
- - (This is linked with the mean, as it is based on averaging squared deviations from it. The variance is simply the standard deviation squared.)
- Interquartile range / Quartile deviation
- - (These are linked with the median, as they are also based on the values placed in order.)
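
As a rough illustration of the location measures and the median-based measures of spread, the sketch below uses Python's standard `statistics` module on the seminar-attendance values that appear in the standard deviation example in these slides. Quartile conventions differ between packages, so the quartiles here (Python's default "exclusive" method) are an assumption and may not match SPSS output exactly.

```python
import statistics

# Seminar attendance values from the standard deviation example in these slides.
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(seminars)      # sum of the values / number of cases
median = statistics.median(seminars)  # middle of the ranked values
mode = statistics.mode(seminars)      # most common value

# Quartiles under Python's default "exclusive" convention (an assumption;
# other software may use a slightly different rule).
q1, q2, q3 = statistics.quantiles(seminars, n=4)
iqr = q3 - q1                  # interquartile range
quartile_deviation = iqr / 2   # half the interquartile range

print("mean:", mean, "median:", median, "mode:", mode)
print("IQR:", iqr, "quartile deviation:", quartile_deviation)
```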

Measures of location and spread: an example (household size)

[Two panels, labelled West Midlands and London in the source, in that order.]

Panel    Mean   Median   Mode   s.d.   Skewness   Kurtosis
First    2.94   2        2      1.93   2.10       5.54
Second   2.96   3        2      1.58   1.27       2.24

Why is the standard deviation so important?

- The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations.
- And it is also central to some of the statistical theory used in statistical testing/statistical inference...

An example of the calculation of a standard deviation

- Number of seminars attended by a sample of undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5
- Mean = 61/10 = 6.1
- Variance = ((5 - 6.1)² + (4 - 6.1)² + (4 - 6.1)² + (7 - 6.1)² + (9 - 6.1)² + (8 - 6.1)² + (9 - 6.1)² + (4 - 6.1)² + (6 - 6.1)² + (5 - 6.1)²)/(10 - 1) = 36.9/9 = 4.1
- Standard deviation = Square root of variance = 2.025
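
The same calculation can be checked step by step in Python. This is only a sketch of the arithmetic on the slide (sample variance with an n - 1 divisor); the variable names are illustrative.

```python
import math
import statistics

# Number of seminars attended by the sample of undergraduates.
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

n = len(seminars)
mean = sum(seminars) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in seminars]

# Sample variance: summed squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (n - 1)
sd = math.sqrt(variance)

print("mean:", round(mean, 1))
print("sum of squared deviations:", round(sum(squared_deviations), 1))
print("variance:", round(variance, 1))
print("standard deviation:", round(sd, 3))

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sd, statistics.stdev(seminars))
```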

The Empire Median Strikes Back!

- Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a boxplot.
- Boxplots are based on medians and quartiles rather than on the more commonly found mean and standard deviation.

Example of a boxplot

Moving on to bivariate descriptive statistics...

- These are referred to as measures of association, as they quantify the (strength of the) association between two variables.
- The most well-known of these is the (Pearson) correlation coefficient, often referred to as the correlation coefficient, or even the correlation.
- This quantifies the closeness of the relationship between two interval-level variables (scales).

Positive and negative relationships

- Positive or direct relationships
- - If the points cluster around a line that runs from the lower left to upper right of the graph area, then the relationship between the two variables is positive or direct.
- - An increase in the value of x is more likely to be associated with an increase in the value of y.
- - The closer the points are to the line, the stronger the relationship.
- Negative or inverse relationships
- - If the points tend to cluster around a line that runs from the upper left to lower right of the graph, then the relationship between the two variables is negative or inverse.
- - An increase in the value of x is more likely to be associated with a decrease in the value of y.

Working out the correlation coefficient (Pearson's r)

- Pearson's r tells us how much one variable changes as the values of another change, i.e. their covariation.
- Variation is measured with the standard deviation. This measures average variation of each variable from the mean for that variable.
- Covariation is measured by calculating the amount by which each value of X varies from the mean of X, and the amount by which each value of Y varies from the mean of Y, multiplying the differences together, and finding the average (by dividing by n-1).
- Pearson's r is calculated by dividing this by (SD of X) x (SD of Y) in order to standardize it.

Working out the correlation coefficient (Pearson's r)

- Because r is standardized it will always fall between 1 and -1.
- A correlation of either 1 or -1 means perfect association between the two variables.
- A correlation of 0 means that there is no association.
- Note: correlation does not mean causation. We can only investigate causation by reference to our theory. However (thinking about it the other way round) there is unlikely to be causation if there is not correlation.

A scatterplot of the values of two interval-level variables

Example of calculating a correlation coefficient (corresponding to the last slide)

- X: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5; Mean(X) = 6.1
- Y: 8, 7, 9, 7, 8, 8, 8, 5, 5, 6; Mean(Y) = 7.1
- (5 - 6.1)(8 - 7.1) + (4 - 6.1)(7 - 7.1) + ... etc.
- = -0.99 + 0.21 + ... = 7.9 (Covariation)
- S.D.(X) = 2.02; S.D.(Y) = 1.37
- (7.9 / 9) / (2.02 x 1.37) = 0.316
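
The worked example can be reproduced with a short Python sketch that follows the steps described on the earlier slides (covariation, division by n - 1, then standardizing by the two standard deviations). The variable names are illustrative.

```python
import math

x = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
y = [8, 7, 9, 7, 8, 8, 8, 5, 5, 6]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariation: summed products of the paired deviations from each mean.
covariation = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample standard deviations (n - 1 divisor), as on the slides.
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson's r: average covariation standardized by the two standard deviations.
r = (covariation / (n - 1)) / (sd_x * sd_y)

print("covariation:", round(covariation, 1))
print("S.D.(X):", round(sd_x, 2), "S.D.(Y):", round(sd_y, 2))
print("r:", round(r, 3))
```

The final figure is computed from the unrounded standard deviations, which is why it agrees with the slide's 0.316 to three decimal places.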

Looking at the relationship between two categorical variables

- If two variables are nominal or ordinal, i.e. categorical, we can look at the relationship between them in the form of a cross-tabulation, using percentages to summarize the pattern. (Typically, if there is one variable that can be viewed as depending on the other, i.e. a dependent variable, and the categories of this variable make up the columns of the cross-tabulation, then it makes sense to have percentages that sum to 100 across each row; these are referred to as row percentages.)

An example of a cross-tabulation (from Jamieson et al., 2002)

When you and your current partner first decided to set up home or move in together, did you think of it as a permanent arrangement or something that you would try and then see how it worked?

Counts, with row percentages in brackets:

                               Both permanent   Both try and see   Different answers   TOTAL
Cohabiting without marriage    15 (48)          4 (13)             12 (39)             31 (100)
Cohabited and then married     16 (67)          1 (4)              7 (29)              24 (100)
Married without cohabiting     9 (100)          0 (0)              0 (0)               9 (100)

Jamieson, L. et al. 2002. Cohabitation and commitment: partnership plans of young men and women, Sociological Review 50(3): 356-377.

Alternative forms of percentage

- In the following example, row percentages allow us to compare outcomes between the categories of an independent variable.
- However, we can also use column percentages to look at the composition of each category of the dependent variable.
- In addition, we can use total percentages to look at how the cases are distributed across combinations of the two variables.
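
The three kinds of percentage can be sketched in Python using the counts from the Jamieson et al. cross-tabulation earlier in these slides. Two of the counts (12 and 16) are not legible in the transcript and are inferred here from the row totals, so treat them as an assumption; the variable names are illustrative.

```python
rows = [
    "Cohabiting without marriage",
    "Cohabited and then married",
    "Married without cohabiting",
]
cols = ["Both permanent", "Both try and see", "Different answers"]

# Counts; 12 and 16 are inferred from the row totals of 31 and 24 (assumption).
counts = [
    [15, 4, 12],
    [16, 1, 7],
    [9, 0, 0],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Row percentages: each cell as a share of its row total.
row_pct = [[100 * c / rt for c in row] for row, rt in zip(counts, row_totals)]
# Column percentages: each cell as a share of its column total.
col_pct = [[100 * c / ct for c, ct in zip(row, col_totals)] for row in counts]
# Total percentages: each cell as a share of all cases.
total_pct = [[100 * c / grand_total for c in row] for row in counts]

for name, pcts in zip(rows, row_pct):
    print(name, [round(p) for p in pcts])
```

Row percentages sum to 100 across each row, column percentages sum to 100 down each column, and total percentages sum to 100 over the whole table.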

Example Cross-tabulation II: Row percentages

Derived from Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987). Social Mobility and Class Structure in Modern Britain (2nd Edition). Oxford: Clarendon Press.

Example Cross-tabulation II: Column percentages

Example Cross-tabulation II: Total percentages

Percentages and Association

- It is possibly self-evident that the differences between the percentages in different rows (or columns) can collectively be viewed as measuring association.
- In the case of a 2x2 cross-tabulation (i.e. one with two rows and two columns), the difference between the percentages is a measure of association for that cross-tabulation.
- But there are other ways of quantifying the association in the cross-tabulation.

Odds ratios as a measure of association

- The patterns in the social mobility table examined in an earlier session can clearly be expressed as differences in percentages (e.g. the differences between the percentages of sons with fathers in classes I and VII who are themselves in classes I and VII).
- However, an alternative way of quantifying these class differences is to compare the odds of class I fathers having sons in class I as opposed to class VII with the odds of class VII fathers having sons in class I as opposed to class VII.
- The ratio of these two sets of odds is an odds ratio, which will have a value of close to 1.0 if the two sets of odds are similar, i.e. if there is little or no difference between the chances of being in classes I and VII for sons with fathers in classes I and VII respectively.

Odds Ratios vs. Differences: An Example (Gender and Higher Education)

Age     Sex     Degree (%)   No Degree   % difference   Odds ratio
30-39   Men     56 (13.0)    374         -0.8           (56/374)/(70/438) = 0.937
        Women   70 (13.8)    438
40-49   Men     56 (14.4)    334         5.3            (56/334)/(38/378) = 1.668
        Women   38 (9.1)     378
50-59   Men     34 (9.9)     308         4.7            (34/308)/(18/329) = 2.018
        Women   18 (5.2)     329
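
A minimal Python sketch of the two measures, using the counts in the example above, is given below. It assumes that the slide's percentage differences were taken between the percentages after rounding to one decimal place, which is what reproduces the figures 5.3 and 4.7; the function and variable names are illustrative.

```python
# (degree, no degree) counts for men and women in each age group.
tables = {
    "30-39": {"men": (56, 374), "women": (70, 438)},
    "40-49": {"men": (56, 334), "women": (38, 378)},
    "50-59": {"men": (34, 308), "women": (18, 329)},
}


def pct_with_degree(degree, no_degree):
    """Percentage of the group holding a degree."""
    return 100 * degree / (degree + no_degree)


def odds(degree, no_degree):
    """Odds of holding a degree as opposed to not holding one."""
    return degree / no_degree


results = {}
for age, table in tables.items():
    men_pct = round(pct_with_degree(*table["men"]), 1)
    women_pct = round(pct_with_degree(*table["women"]), 1)
    # Difference between the rounded percentages (assumed slide convention).
    difference = round(men_pct - women_pct, 1)
    # Odds ratio: men's odds divided by women's odds.
    odds_ratio = round(odds(*table["men"]) / odds(*table["women"]), 3)
    results[age] = (difference, odds_ratio)
    print(age, "difference:", difference, "odds ratio:", odds_ratio)
```

An odds ratio of 1.0 would indicate identical odds for men and women; values above 1.0 here indicate higher odds for men.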

Choice of measure can matter!

- The choice of differences between percentages versus odds ratios as a way of quantifying differences between groups can matter, as in the preceding example of the effect of gender on the likelihood of having a degree, according to age.
- The difference values of 4.7, 5.3 and -0.8 suggest that inequality increased before it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937 suggest a small decrease in inequality before a larger decrease led to approximate equality!
- Evidently, there are competing ways of measuring association in a cross-tabulation. But neither differences between percentages nor odds ratios provide an overall summary of the association in a cross-tabulation.

Another measure of association

- If we need an overall measure of association for two cross-tabulated (categorical) variables, one standard possibility is Cramér's V.
- Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship, but it can only take on positive values, and makes no assumption of linearity.
- It is derived from a test statistic (inferential statistic), chi-square, which we will consider in a later session.
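
For readers who want to see how Cramér's V is built from chi-square, the sketch below uses the standard formula V = sqrt(chi-square / (n x (k - 1))), where n is the number of cases and k is the smaller of the number of rows and columns. The function name `cramers_v` is illustrative, and the sketch assumes every row and column total is non-zero. The table used in the usage line is the age 50-59 gender by degree table from the earlier example.

```python
import math


def cramers_v(table):
    """Cramér's V for a cross-tabulation given as a list of rows of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square / (n * (k - 1)))


# Men and women aged 50-59: (degree, no degree) counts from the earlier example.
v = cramers_v([[34, 308], [18, 329]])
print(round(v, 3))
```

A table with identical row distributions gives V = 0, and a table in which each row falls entirely into its own column gives V = 1.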

An example of Cramér's V

Cramér's V = 0.074

Other measures of association for cross-tabulations

- In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these.
- - Goodman, L.A. and Kruskal, W.H. 1979. Measures of association for cross classifications. New York: Springer-Verlag.
- ... and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal.

One of Tog's (distant) relatives

What if one variable is a set of categories, and the other is a scale?

- The equivalent to comparing percentages in this instance is comparing means, but there may be quite a lot of these!
- So one possible overall measure of association used in this situation is eta² (η²) (eta-squared).
- But this is a less familiar measure (at least to researchers in some social science disciplines!)
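
As a hedged illustration, eta-squared can be computed as the between-group sum of squares divided by the total sum of squares, i.e. the share of the variation in the scale variable that is accounted for by differences between the category means. The function name and the household-size figures below are hypothetical and are not taken from the slides.

```python
def eta_squared(groups):
    """Eta-squared for a dict mapping category name -> list of scale values."""
    values = [v for group in groups.values() for v in group]
    grand_mean = sum(values) / len(values)

    # Total variation: squared deviations of every value from the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Between-group variation: squared deviations of each group mean from the
    # grand mean, weighted by group size.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return ss_between / ss_total


# Hypothetical household sizes for three categories (illustration only).
example = {
    "Region A": [1, 2, 2, 3],
    "Region B": [2, 3, 4, 5],
    "Region C": [1, 1, 2, 2],
}
print(round(eta_squared(example), 3))
```

The value is 0 when all the category means are equal and 1 when there is no variation within categories.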