Introduction to Data Analysis.

- Levels of measurement and descriptive statistics

What's this course about?

- Introduction to the use of quantitative data in social science.
- The tools we need in order to use numerical data (i.e. anything we can count) to better understand the world.
- Very basic introduction; students intending to write theses using primarily quantitative data should also attend the intermediate/advanced lectures.

Why am I here?

- Your own research.
- Using quantitative data as an integral part of your thesis.
- Using quantitative data as supplementary evidence.
- Making better use of qualitative data.
- Other people's research.
- Understanding work in your area.
- Criticising work in your area.
- It's compulsory.

What is Statistics?

- Methods for:
- Designing and carrying out research studies
- Describing collected data
- Making decisions/inferences about phenomena represented by data

Some key terms (1)

- Population: the total set of individual objects or persons of interest in a study.
- Sample: a subset of the population that is actually observed.

Key Terms (2)

- Descriptive statistics consist of graphical and numerical techniques for summarizing the information in a collection of data.
- Inferential statistics consist of procedures for making generalizations about characteristics of a population, based on information from a sample.

Key terms (3)

- Parameters are the characteristics of the population about which we make inferences using sample data.
- Statistics are the corresponding characteristics of the sample data, upon which we base our inferences about parameters.

Variables and their measurement

- Variable: a measurement of a characteristic of a subject (something or someone) that varies across subjects in a population of subjects.
- There are different levels of measurement, which means that we have to examine different types of data in different ways.

Nominal level measures (1)

- Just represent a category.
- e.g. Male / Female.
- e.g. Single / Married / Divorced.
- Since there is no ordering, these are nominal measures.
- Often called qualitative, since two values differ in quality not quantity.

Nominal level measures (2)

- Can quantify these data by tabulating them.
- Normally represent nominal data in a simple table with percentages.
- Take the marital status of all of my 25 friends (i.e. the population we are looking at is all Ryan's friends).

Marital status   Number     %
Single               18    72
Married               6    24
Divorced              1     4
Total                25   100
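
The tabulation above can be reproduced in a few lines. This is only a minimal sketch: the list of statuses is rebuilt from the counts in the table, and the helper name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Marital status of the 25 friends, rebuilt from the counts in the table.
statuses = ["Single"] * 18 + ["Married"] * 6 + ["Divorced"] * 1


def frequency_table(values):
    """Return {category: (count, percentage)} for a list of nominal values."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}


table = frequency_table(statuses)
for category, (count, percent) in table.items():
    # One row per category: name, count, percentage of the total.
    print(f"{category:<10}{count:>4}{percent:>6.0f}%")
```

The percentages are simply each count divided by the total of 25, so they necessarily sum to 100.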

Ordinal level measures

- Categories again, but these categories are ordered.
- e.g. Many polling/survey questions.
- "It was right for Britain to send troops to Iraq":
- Strongly agree
- Agree
- Disagree
- Strongly disagree.
- The distance between each category is unknown.
- Strong agreers are more hawkish than agreers, but we have no idea how much more hawkish they are.
- We can say one observation is greater in rank than another.
- Can be ranking in class (for example) or from naturally ordered categories.
- Called quantitative because different values represent different magnitudes.

Interval level measures

- Numbers represent a quantitative variable.
- e.g. Income, number of pupils per teacher, age, etc.
- There is a specific distance between each level.
- We can not only say that my sister is younger than I am, but that she is 2 years younger.
- Age is a continuous variable; one can also subdivide the measure (784 days, 3 hours and 2 minutes younger).
- It is also true that my parents have only 2 children.
- Number of children is a discrete variable; you cannot sub-divide children. You have 1, or 2, or 3. You can't have 2½ children.

Descriptive statistics

- Most statistics that we will cover today apply to variables that are interval level measures.
- Descriptive statistics are just that. They describe a large amount of data in a summary form.
- Why bother? Because we're often interested in what a typical person (or country or school or parliament etc.) looks like.

Measuring the central tendency

- What we want to do is reduce a lot of interval level measurements to a few numbers.
- The salaries of all of my best friends (the population is Ryan's best friends).
- What is the typical annual salary of a best friend of mine?

Name      Salary
Ellen     75,000
Jenny     13,000
Justin    31,000
Andrew    26,000
Mungo     15,000

The mean

- The most usual way of measuring the central tendency is to use the mean (or average).
- This is simply the sum of the measurements divided by the number of observations.
- For our salaried people:
- Mean = 32,000
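
As a quick check of that figure, here is a minimal sketch that computes the mean from the five salaries listed above. The function name `mean` is just a local helper, not part of the slides.

```python
# Annual salaries of the five best friends, as listed on the slide.
salaries = {
    "Ellen": 75_000,
    "Jenny": 13_000,
    "Justin": 31_000,
    "Andrew": 26_000,
    "Mungo": 15_000,
}


def mean(values):
    """Sum of the measurements divided by the number of observations."""
    values = list(values)
    return sum(values) / len(values)


mean_salary = mean(salaries.values())
print(mean_salary)  # 32000.0
```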

A (very) little bit of math

- To introduce some terms which will be useful later, the mean is calculated as follows. Suppose we have n observations, with each value denoted by X_1, X_2 and so on until X_n. Then the mean is described as follows:

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

Or, to put it another way:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

The mean's properties

- Shift of origin of measurement.
- If everyone earns 2000 more, then the new mean salary is just the old mean salary (32,000) PLUS 2000.
- Change of scale.
- If we calculate salary in dollars (say £1 = $2), then the new mean salary is simply twice the old mean salary.
- Sum of two variables.
- Imagine that income = salary + savings interest.
- Mean income = mean salary + mean savings interest.
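
The three properties can be checked numerically. In this sketch the salaries are the five from the earlier table, while the savings-interest figures are invented purely for illustration and do not come from the slides.

```python
from statistics import mean

salary = [75_000, 13_000, 31_000, 26_000, 15_000]
# Hypothetical savings interest for the same five people (made-up numbers).
interest = [1_200, 0, 300, 450, 50]

# Shift of origin: everyone earns 2000 more.
mean_shifted = mean(s + 2_000 for s in salary)

# Change of scale: express each salary in a currency worth half as much.
mean_rescaled = mean(s * 2 for s in salary)

# Sum of two variables: income = salary + savings interest.
mean_income = mean(s + i for s, i in zip(salary, interest))

print(mean_shifted == mean(salary) + 2_000)            # True
print(mean_rescaled == 2 * mean(salary))               # True
print(mean_income == mean(salary) + mean(interest))    # True
```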

The median

- Another common way of deriving one number to describe many is to use the median.
- Imagine we ranked all observations; the median is simply the observation in the middle (½ of observations above and ½ below).
- In ascending order the salaries are:
- 13,000 15,000 26,000 31,000 75,000.
- Median = 26,000.
- With an even number of observations there is no single middle value, so we take the average of the two middle ones. If these were 26,000 and 31,000: Median = ½(26,000 + 31,000) = 28,500.
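
A minimal sketch of the median calculation for both an odd and an even number of observations follows. The even-sized case drops the lowest salary purely for illustration; that choice is an assumption, not something stated in the slides.

```python
def median(values):
    """Middle value of the ranked data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [75_000, 13_000, 31_000, 26_000, 15_000]

median_odd = median(salaries)
print(median_odd)   # 26000

# Assumed even-sized example: the same salaries without the lowest one.
median_even = median([15_000, 26_000, 31_000, 75_000])
print(median_even)  # 28500.0
```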

The median's properties (1)

- Shift of origin of measurement: YES
- Change of scale: YES
- Sum of two variables: NO
- The lack of this property is somewhat important (which will become apparent in the following weeks), and is related to one of the reasons why we generally use the mean in most statistical analysis.
- Nonetheless, the median does have some advantages over the mean in describing some types of data.

The median's properties (2)

- For our salary example, the mean of my best friends' salaries gives a substantially higher value than the median (6000 more).
- This is due to the distribution of the observations. For the mean and median to be the same, the distribution of observations needs to be symmetrical.
- Imagine we now look at all my friends and acquaintances (the population of 25 people as before), and plot the frequency of each salary for all 25.

Frequency graph of salaries

[Figure: frequency graph of the salaries of all 25 people, with the median (26,000) and the mean (34,000) marked.]

Positions of the median and mean

- For distributions with a long tail to the right, the mean will take a higher value than the median.
- This is generally true across the world for income distributions, and is captured by Pen's parade of dwarfs and a few giants.
- If such a parade were organised today, then the person of mean height (and income) would be taller (and richer) than 65% of the population and so would pass by after 40 minutes of the hour-long parade had elapsed.
- Mean income is 24,000, median income is 16,000.
- For data with outliers the median can give a better idea of what the typical observation is like.
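
To see how a single large value pulls the mean more than the median, the sketch below compares both measures for the five best-friend salaries with and without the top earner. Removing the top earner is an illustrative assumption, not part of the slides.

```python
from statistics import mean, median

salaries = [75_000, 13_000, 31_000, 26_000, 15_000]
# Same data with the single largest salary removed.
without_top = [s for s in salaries if s != max(salaries)]

summary = {
    "all five": (mean(salaries), median(salaries)),
    "without the top earner": (mean(without_top), median(without_top)),
}

for label, (m, med) in summary.items():
    print(f"{label}: mean = {m}, median = {med}")
```

With all five salaries the mean sits 6,000 above the median; once the highest salary is dropped, the gap shrinks to 750.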

Ordinal level data

- The median can be used for ordinal level data.
- Imagine we had asked my 5 best friends about their position on the Iraq war: 2 strongly agreed with sending British troops, one agreed, one disagreed and one strongly disagreed.
- We can rank these answers and then find the median.
- Strongly agree, strongly agree, agree, disagree, strongly disagree.
- Thus the median answer is agree.
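
The same ranking can be done in code by sorting the answers by their position on the response scale. This is a sketch only: `SCALE` and `ordinal_median` are hypothetical names, and the helper deliberately handles only an odd number of answers, since with an even count the two middle categories may differ and cannot be averaged.

```python
# Response categories in rank order.
SCALE = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]


def ordinal_median(responses, scale):
    """Middle category of an odd number of ordinal responses."""
    if len(responses) % 2 == 0:
        raise ValueError("even number of responses: no single middle category")
    ranked = sorted(responses, key=scale.index)
    return ranked[len(ranked) // 2]


# The five answers from the example, in arbitrary order.
answers = ["Agree", "Strongly disagree", "Strongly agree",
           "Disagree", "Strongly agree"]

median_answer = ordinal_median(answers, SCALE)
print(median_answer)  # Agree
```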

Nominal level data

- In general, we can't use the median or mean for nominal data.
- Normally use the mode. This is the most commonly occurring value.
- e.g. if 53 people here are politics students, 40 are sociology students, and 46 are studying other subjects, then the modal value is politics.
- There is one special case in which we can use the mean for nominal data, however.

Nominal binary data

- Binary data is an exception, as we can use the mean. Binary data (e.g. Yes/No, Male/Female) can be coded as 0 or 1.
- For a variable measuring sex, men are coded 1 and women coded 0.
- The mean score for those 0s and 1s is the proportion of men. There were 2 women and 3 men amongst my best friends, so the mean is 3/5 = 0.6.
- The median does NOT make sense for binary data. It just tells us what the majority of the population is.
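
Both ideas for nominal data, the mode and the mean of a 0/1 coded variable, can be sketched briefly. The subject counts and the 3 men / 2 women split come from the slides; the variable names are illustrative.

```python
from collections import Counter

# Number of students per subject, from the mode example.
subjects = Counter({"politics": 53, "sociology": 40, "other": 46})
# The mode is the most commonly occurring category.
modal_subject = subjects.most_common(1)[0][0]
print(modal_subject)  # politics

# Sex of the five best friends, coded 1 = man, 0 = woman.
sex = [1, 1, 1, 0, 0]
# The mean of a 0/1 variable is the proportion of 1s.
proportion_men = sum(sex) / len(sex)
print(proportion_men)  # 0.6
```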

Measures of dispersion

- The mean (or median) tells us something about the centre of the distribution, but what about its dispersion?
- The means/medians of the below distributions of children's scores on a maths test in three different classes are all the same (48 observations, mean of 7, median of 7), but each tells a quite different story.

The range

- The range is simply a measure of the distance between the largest and smallest observations.
- The range for our salary example is therefore:
- 75,000 - 13,000 = 62,000.
- Clearly this is not ideal as it relies on only two observations.
- Say we have 1000 poker players. 999 win nothing, and 1 wins 1 million. The range indicates lots of variation, when most people are actually identical.
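
A short sketch of the range for both examples follows. `value_range` is a made-up helper name (chosen to avoid shadowing Python's built-in `range`).

```python
def value_range(values):
    """Distance between the largest and smallest observations."""
    return max(values) - min(values)


salary_range = value_range([75_000, 13_000, 31_000, 26_000, 15_000])
print(salary_range)  # 62000

# 1000 poker players: 999 win nothing and one wins 1 million.
winnings = [0] * 999 + [1_000_000]
poker_range = value_range(winnings)
# Share of players with exactly the same result (zero winnings).
share_identical = winnings.count(0) / len(winnings)

print(poker_range)      # 1000000
print(share_identical)  # 0.999
```

The range is driven entirely by the one winner, even though 99.9% of the players have identical winnings.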

The variance and standard deviation

- A better way of assessing how much values of a variable vary around the mean is to use the standard deviation or variance.
- Basic idea is to measure how different individual values are from the mean value.
- Some of these deviations from the mean will be positive and some negative, so we square each deviation.

The variance

- Take my 5 best friends. The mean salary was 32,000.
- If we added up all the differences then we would get zero, so we need to square the differences (i.e. multiply them by themselves).

[Figure: the five salaries plotted against the mean of 32,000, showing each person's difference from the mean, e.g. Ellen: 75,000 - 32,000 = 43,000; Mungo: 15,000 - 32,000 = -17,000.]

Calculating variance

- Salary example, with 5 observations and a mean of 32,000.

Salary (000s)   Deviation from mean   Squared deviation
75              75 - 32 = 43          43 × 43 = 1849
13              13 - 32 = -19         -19 × -19 = 361
31              31 - 32 = -1          -1 × -1 = 1
26              26 - 32 = -6          -6 × -6 = 36
15              15 - 32 = -17         -17 × -17 = 289

Calculating standard deviation

- The standard deviation is the most common way to measure deviation from the mean and is simply the square root of the variance.
- We normally call the variance s² and the standard deviation s. Thus for our example, s² = 507.2 and s = 22.5.
- These figures divide the sum of squared deviations by n = 5; when working with a sample we usually use n - 1 in the denominator.
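
The calculation can be followed step by step in code. This sketch works in thousands, as the table does, and computes the variance both ways: dividing by n, which reproduces the figures quoted here, and dividing by n - 1, the usual choice for sample data.

```python
from math import sqrt

salaries = [75, 13, 31, 26, 15]  # in thousands

n = len(salaries)
mean = sum(salaries) / n                          # 32.0
squared_devs = [(x - mean) ** 2 for x in salaries]
sum_sq = sum(squared_devs)                        # 2536.0

# Dividing by n treats the five friends as the whole population.
var_n = sum_sq / n
sd_n = sqrt(var_n)

# Dividing by n - 1 is the usual choice when the data are a sample.
var_n1 = sum_sq / (n - 1)
sd_n1 = sqrt(var_n1)

print(f"n:     variance = {var_n:.1f}, sd = {sd_n:.1f}")
print(f"n - 1: variance = {var_n1:.1f}, sd = {sd_n1:.1f}")
```

Dividing by n gives a variance of 507.2 and a standard deviation of about 22.5; dividing by n - 1 gives 634.0 and about 25.2.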

Examples of standard deviation

s = 1.02: Tight distribution (all children perform similarly)

s = 1.67: Clustered distribution (most children perform to a similar level, with some variation)

s = 4.01: Dispersed distribution (one group of geniuses, one group of idiots)

But what does it mean?

- Our salary example had a standard deviation of 22.5 (in thousands), but for the distributions above s varied between 1 and 4. What does this tell us?
- Best way to think of it is as a kind of rough average distance of an observation to the mean.
- Thus the standard deviation depends on the units we are measuring in.
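
The dependence on units is easy to demonstrate: the same salaries expressed in thousands and in single currency units give standard deviations that differ by exactly that factor of 1000. This sketch uses `statistics.pstdev`, which divides by n, to match the figure quoted above.

```python
from statistics import pstdev

in_thousands = [75, 13, 31, 26, 15]
# The same salaries expressed in single currency units.
in_units = [x * 1000 for x in in_thousands]

sd_thousands = pstdev(in_thousands)
sd_units = pstdev(in_units)

print(round(sd_thousands, 1))  # 22.5
print(round(sd_units))         # 22521
```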

Standard deviation summary

- Broadly speaking, high levels of s indicate greater variation, and the value of s gives a broad idea of a typical distance from the mean.
- The concept of standard deviation is an important one, and next week I'll talk more about particular types of distributions and their properties.

How to (not) lie with statistics

- Even simple descriptive statistics can be misused in order to mislead.
- Particularly the case for simple graphs.
- Most examples I will use here are from Edward Tufte, The Visual Display of Quantitative Information (1983, and later reprints).
- See any copy of any of the Sunday papers for similar glaring errors, however.

Too little information

- Presenting too little summary information.
- Example courtesy of Tukey (1979) in JASA.
- Take Washoe County in Nevada, USA. There is a mean population density of 13½ people per square mile.
- The mean is not informative without information on the distribution, however, for in fact 80% of the inhabitants live in two cities.
- The cities have population densities of 5000 per square mile.
- The rest of the county has a population density of 2½ people per square mile.

Base years (1)

- Picking your base year (Tufte 1983).

Base years (2)

Measures over time (1)

Measures over time (2)

The lie factor

- Are doctors really becoming smaller?

Small differences

- Just because something's top or bottom of a list doesn't imply anything.
- The difference between top and bottom might be very small.
- Close to home, look at the Norrington table for this. The difference between the middle 10 colleges is essentially zero, but it's the ranking that everyone cares about.
- Ranking of countries by something like literacy rates is often similarly futile. There has to be one at the top with 99.9%, but all Western countries will have 99% rates.

(very) Small samples

- "9 out of 10 cats prefer Whiskers."
- We may think that the evidence for this is strong if thousands of cats had their opinion solicited, but maybe weak if only 10 cats were questioned out of the population of millions.
- Knowing when a small sample is too small is one of the topics we will cover over the next two weeks and is a critical part of understanding commonly used statistics.

How to talk back to a statistic

- Who says so?
- We all want to prove our own theories correct.
- How does he know?
- Is the data reputable?
- What's missing?
- e.g. means are no use without standard deviations.
- Does it make sense?
- Social science is the science of the bloody obvious most of the time. Don't let numbers confuse or fool you: if it sounds wrong, it probably is.