Statistical Techniques for Analyzing Quantitative Data

- Maryam Ramezani
- Values in Computer Technology
- CSC 426

Outline

Role of Statistics in Research

- With statistics, we can summarize large bodies of data, make predictions about future trends, and determine when different experimental treatments have led to significantly different outcomes.
- Statistics are among the most powerful tools in the researcher's toolbox.

How do statistics come into research?

- In quantitative research we use numbers to represent physical or nonphysical phenomena.
- We use statistics to summarize and interpret those numbers.

Exploring and Organizing a Data Set

- Look at your data and find ways of organizing them
- Example: test scores for 11 children
- What do you see?

Ruth 96, Robert 60, Chuck 68, Margaret 88, Tom 56, Mary 92, Ralph 64, Bill 72, Alice 80, Adam 76, Kathy 84

Exploring and Organizing a Data Set

Alphabetical Order
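
As a rough illustration of organizing the eleven test scores listed earlier, the sketch below sorts them alphabetically and then by score. The variable names (`scores`, `by_name`, `by_score`) are chosen here for illustration and do not come from the slides.

```python
# Test scores for the 11 children, in the order they were listed.
scores = {
    "Ruth": 96, "Robert": 60, "Chuck": 68, "Margaret": 88, "Tom": 56,
    "Mary": 92, "Ralph": 64, "Bill": 72, "Alice": 80, "Adam": 76, "Kathy": 84,
}

# Organize alphabetically by name (tuples sort by their first element).
by_name = sorted(scores.items())

# Organize from highest to lowest score.
by_score = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in by_name:
    print(f"{name:<9}{score}")
```

Each ordering may reveal a different pattern, which is the point of trying more than one way of organizing the same data.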

Using Computer Spreadsheets to Organize and Analyze Data

- Sorting
- Graphing
- Formulas
- What-ifs
- Save, store, recall, and update information

Functions of Statistics

- Descriptive statistics
- describe what the data look like
- Inferential statistics
- draw inferences about a large population by collecting small samples

Considering the Nature of the Data

- Continuous or discrete
- Nominal, ordinal, interval or ratio scale
- Normal or non-normal distribution

Continuous versus Discrete Variables

- Continuous data take on any value within a finite or infinite interval. You can count, order, and measure continuous data.
- Examples: height, weight, temperature, the amount of sugar in an orange, the time required to run a mile.
- Discrete data: the values or observations are distinct and separate, i.e., they can be counted (1, 2, 3, ...).
- Examples: the number of kittens in a litter, the number of patients in a doctor's surgery, the number of flaws in one metre of cloth, gender (male, female), blood group (O, A, B, AB).

Nominal Data

- The numbers are simply labels. You can count, but not order or measure, nominal data.
- Example: males could be coded as 0 and females as 1; the marital status of an individual could be coded as Y if married and N if single.
- Classification data, e.g., m/f
- No ordering, e.g., it makes no sense to state that M > F
- Arbitrary labels, e.g., m/f, 0/1, etc.

Ordinal Data

- Ordered, but differences between values are not important
- e.g., Likert scales: rank your degree of satisfaction on a scale of 1 to 5
- The difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference in enjoyment expressed by giving a rating of 4 rather than 3.
- You can count and order, but not measure, ordinal data.

Interval Data

- Ordered, constant scale, but no natural zero
- Differences make sense, but ratios do not
- e.g., temperature: 30° - 20° = 20° - 10°, but 20° is not twice as hot as 10°!
- e.g., dates: the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then.

Ratio Data

- Like interval data but has true zero
- Ordered, Constant scale, natural zero
- e.g., height, weight, age, length
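
To make the interval versus ratio distinction concrete, here is a small sketch that assumes the temperatures in the interval-data example are in degrees Celsius. Converting to Kelvin, which has a true zero, shows that differences are preserved while the Celsius ratio of 2 is not physically meaningful. The function name `celsius_to_kelvin` is introduced here for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale and survive the conversion.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are not: 20/10 is 2 in Celsius, but on the Kelvin scale
# the same two temperatures differ by only a few percent.
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```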

Normal and Non-Normal Distributions

Normal Distribution

Non-Normal Distributions

Skewed to the Left (Negatively Skewed)

Skewed to the Right (Positively Skewed)

Leptokurtic and Platykurtic Distributions

Descriptive Statistics

- Descriptive statistics describe data
- Points of Central Tendency
- Amount of Variability
- Relation of different variables to each other

Points of Central Tendency: Mean

- Measuring center: if the n observations are $x_1, x_2, \ldots, x_n$, the arithmetic mean is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Geometric Mean

e.g., biological growth, population growth
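
The sketch below contrasts the arithmetic mean with the geometric mean, taking the geometric mean to be the nth root of the product of n positive values (the standard definition; the slide's own formula did not survive). The growth factors are hypothetical and only illustrate why the geometric mean suits multiplicative processes such as population growth.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """nth root of the product of n positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical yearly growth factors for a population: +10%, +20%, +30%.
growth = [1.10, 1.20, 1.30]

am = arithmetic_mean(growth)
gm = geometric_mean(growth)

# Growing by gm each year for three years reproduces the total growth
# (1.10 * 1.20 * 1.30); growing by am each year would overstate it.
print(round(am, 4), round(gm, 4))
```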

Measure of Central Tendency

Measures of Variability

How great is the spread?

- Range = highest score - lowest score
- The quartiles: the pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3
- Interquartile range = Quartile 3 - Quartile 1

- Example:
- 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30
- M?, Q1?, Q3?
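
One way to work the example is sketched below. It assumes the "median of each half" convention for quartiles (Q1 is the median of the observations below the median position, Q3 the median of those above it); other conventions interpolate differently and can give slightly different quartiles. The helper names are illustrative.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention.

    Assumes at least two observations.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # observations below the median position
    upper = data[-half:]   # observations above the median position
    return median(lower), median(data), median(upper)

observations = [13, 13, 16, 19, 21, 21, 23, 23, 24,
                26, 26, 27, 27, 27, 28, 28, 30, 30]

q1, m, q3 = quartiles(observations)
iqr = q3 - q1
data_range = max(observations) - min(observations)
print(m, q1, q3, iqr, data_range)
```

Under this convention the 18 observations give M = 25, Q1 = 21, Q3 = 27, an interquartile range of 6, and a range of 17.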

Measures of Variability

Standard Deviation

Standardized score
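
The formulas for this slide were lost, so the following is only a sketch under common definitions: the sample standard deviation with an n - 1 divisor (what `statistics.stdev` computes) and the standardized score z = (x - mean) / standard deviation. It reuses the eleven test scores from the earlier example; whether the slide used the n or the n - 1 divisor is not known.

```python
import statistics

# The eleven test scores from the earlier example.
scores = [96, 60, 68, 88, 56, 92, 64, 72, 80, 76, 84]

mean = statistics.mean(scores)    # arithmetic mean of the scores
sd = statistics.stdev(scores)     # sample standard deviation (n - 1 divisor)

def z_score(x, mean, sd):
    """Standardized score: distance from the mean in standard deviations."""
    return (x - mean) / sd

z_highest = z_score(max(scores), mean, sd)
z_lowest = z_score(min(scores), mean, sd)
print(mean, round(sd, 2), round(z_highest, 2), round(z_lowest, 2))
```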

Measure of Relationship Correlation

- Correlation indicates the strength and direction of a linear relationship between two variables.
- See page 266 for other examples of correlation statistics
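
As an illustration, the sketch below computes the Pearson product-moment coefficient r, one common correlation statistic for linear relationships. The slides do not say which coefficient they have in mind, and the study-hours data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]         # hypothetical hours of study
marks = [52, 58, 61, 70, 74]    # hypothetical test marks

r = pearson_r(hours, marks)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its magnitude gives the strength; a value near +1 or -1 indicates a nearly perfect linear relationship.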

Notes about Correlation

- Finding a substantial correlation between two characteristics requires reasonably valid and reliable measurement of both
- Correlation does not indicate causation

Examples of using Statistics in Computer Science

- Conceptual representation of user transactions or sessions

Pageview/objects

Session/user data

Inferential Statistics

- We use sample statistics as estimates of population parameters.
- The quality of all statistical analysis depends on the quality of the sample data.

Random sampling: every unit in the population has an equal chance of being chosen. A random sample should represent the population well, so sample statistics from a random sample should provide reasonable estimates of population parameters.
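
A minimal sketch of simple random sampling follows. The population is simulated (10,000 hypothetical measurements), and `random.sample` draws units without replacement so that each has the same chance of selection. The seed and all numbers are assumptions made for the illustration.

```python
import random
import statistics

random.seed(426)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 measurements centred near 100.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Simple random sample: every unit has the same chance of being chosen.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)   # the parameter
sample_mean = statistics.mean(sample)           # the statistic estimating it
print(round(population_mean, 1), round(sample_mean, 1))
```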

Some definitions

- Parameter describes a population
- Statistic describes a sample

A parameter is a characteristic or quality of a population that is constant in concept, although its value varies from one population to another. Example: the radius is a parameter of a circle.

Inferential Statistics

- Estimate a population parameter from a random sample
- Test statistical hypotheses

Inferential Statistics: Estimating a Population Parameter from a Sample

- All sample statistics have some error in estimating population parameters.
- Example: estimate the mean height of 10-year-old boys in Chicago; sample = 200 boys.
- How close is the sample mean to the population mean?
- We don't know, but we do know:
- The means from an infinite number of random samples form a normal distribution.
- The population mean equals the average (mean) of all the sample means.
- The standard deviation of this sampling distribution (the standard error) is directly related to the standard deviation of the characteristic in question for the overall population.

Standard Error

- The standard error tells us how much a particular mean varies from one sample to another when all samples are the same size and drawn randomly from the same population.
- Standard error: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
- n is the size of each sample and $\sigma$ is the population standard deviation, which we don't have!
- We use the standard deviation of the sample, $s$, instead.
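
The simulation below is a sketch of the idea, assuming the usual formula standard error = population standard deviation / sqrt(n). It draws many random samples of the same size from a hypothetical normal population and compares the spread of their means with that formula. All constants (mean 140, standard deviation 7, sample size 50) are invented for the illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA = 140.0, 7.0      # hypothetical population mean and std
N, NUM_SAMPLES = 50, 2000   # size of each sample, number of samples drawn

# Mean of each of many equally sized random samples.
sample_means = [
    statistics.mean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

observed_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = SIGMA / math.sqrt(N)          # sigma / sqrt(n)
grand_mean = statistics.mean(sample_means)     # should sit close to MU

print(round(observed_se, 3), round(theoretical_se, 3), round(grand_mean, 2))
```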

Accuracy of the Estimator

As in many problems, there is a trade-off between accuracy and dollars.

What do we get for our money if we invest in obtaining a larger sample size?

n = 100? n = 200?
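
To put rough numbers on the trade-off, the sketch below assumes the standard error is estimated as s / sqrt(n) and uses a hypothetical sample standard deviation of 10. Doubling the sample size from 100 to 200 shrinks the standard error by a factor of about 0.71 rather than halving it; halving it takes four times the sample.

```python
import math

def standard_error(s, n):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

s = 10.0  # hypothetical sample standard deviation

for n in (100, 200, 400):
    print(n, round(standard_error(s, n), 3))

# Relative gain from doubling the sample size.
shrink_factor = standard_error(s, 200) / standard_error(s, 100)
```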

Point versus Interval Estimate

- A point estimate is a single value (a point) taken from a sample and used to estimate the corresponding parameter of a population.
- $\bar{x}$, $s$, $s^2$, and $r$ estimate $\mu$, $\sigma$, $\sigma^2$, and $\rho$, respectively.
- An interval estimate is a range of values (an interval) within whose limits a population parameter probably lies.
- We say that we are 95% confident that the unknown population mean lies in the interval.

95% confidence interval for $\mu$:

$(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n})$

- In only 5% of all samples does this interval fail to contain the population mean $\mu$;
- that is, 5% of all samples give inaccurate results.
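
A small sketch of the interval estimate follows, using the "sample mean plus or minus 2 standard errors" rule with the sample standard deviation standing in for the unknown population value. The input numbers (mean 140, standard deviation 7, n = 200) are hypothetical, and the function name is illustrative.

```python
import math

def confidence_interval_95(sample_mean, sd, n):
    """Approximate 95% interval: mean plus or minus 2 standard errors."""
    margin = 2 * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: 200 observations, mean 140.0, standard deviation 7.0.
low, high = confidence_interval_95(sample_mean=140.0, sd=7.0, n=200)
print(round(low, 2), round(high, 2))
```

The multiplier 2 is the rounded value used on the slide; the more precise normal-theory multiplier for 95% confidence is about 1.96.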

Testing Hypothesis

- Confidence intervals are used when the goal of our analysis is to estimate an unknown parameter in the population.
- A second goal of a statistical analysis is to verify some claim about the population on the basis of the data.
- Research hypothesis / statistical hypothesis
- A test of significance is a procedure to assess the truth about a hypothesis using the observed data. The results of the test are expressed in terms of a probability that measures how well the data support the hypothesis.

Example: To determine whether the mean nicotine content of a brand of cigarettes is greater than the advertised value of 1.4 milligrams, a health advocacy group takes a sample of 500 cigarettes and measures the amount of nicotine in the sample.

Sample values: the sample average of nicotine is 1.51 mg; the standard deviation is 1.016.

The estimated amount of nicotine is 1.51 mg, based on the sample values. The standard error of the sample average is $\text{S.E.} = \text{s.d.}/\sqrt{n-1} \approx 0.045$. Is there an actual difference between the sample value (1.51 mg) and the advertised value (1.4 mg)? Or is it just due to sampling error? To answer this question we need a test of significance.

Stating Hypotheses

The null hypothesis H0 expresses the idea that the observed difference is due to chance. It is a statement of no effect or no difference, and is expressed in terms of the population parameter.

Let $\mu$ denote the true average amount of nicotine. H0: $\mu = 1.4$ mg

The alternative hypothesis Ha represents the idea that the difference is real. It is expressed as the statement we hope or suspect is true instead of the null hypothesis.

The alternative hypothesis states that the cigarettes contain a higher amount of nicotine, that is, Ha: $\mu > 1.4$ mg
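
The slides stop short of carrying out the test, so the following is only a sketch of one way it could be done: a one-sided z test of H0 (true mean = 1.4 mg) against Ha (true mean > 1.4 mg), using the sample values from the nicotine example and the slide's standard error formula s.d./sqrt(n - 1). Treating the test statistic as standard normal is an assumption that is reasonable for a sample of 500.

```python
import math
from statistics import NormalDist

advertised = 1.4     # mg, the value claimed under H0
sample_mean = 1.51   # mg, from the sample of 500 cigarettes
sd = 1.016           # sample standard deviation
n = 500

se = sd / math.sqrt(n - 1)            # the slide's formula, about 0.045
z = (sample_mean - advertised) / se   # distance from H0 in standard errors

# One-sided p-value: probability of a z at least this large if H0 were true.
p_value = 1 - NormalDist().cdf(z)

print(round(se, 3), round(z, 2), round(p_value, 4))
```

A small p-value means a difference this large would rarely arise from sampling error alone, which counts as evidence against H0.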

General comments on stating hypotheses

- It is not easy to state the null and the alternative hypotheses!
- The hypotheses are statements about the population values.
- The alternative hypothesis Ha is often called the researcher's hypothesis, because it is the hypothesis we are interested in.
- A significance test is a test against the null hypothesis.
- Often we set Ha first, and then H0 is defined as the opposite statement!

Errors in Hypothesis testing

- Type I error: the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected.
- Type II error: the null hypothesis H0 is not rejected when it is in fact false.
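
Both kinds of error can be seen in a simulation. The sketch below assumes a one-sided z test with a known population standard deviation and a 5% significance level; every constant in it is hypothetical. When H0 is true, each rejection is a Type I error; when H0 is false, each failure to reject is a Type II error.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical setup: H0 says the mean is MU0; sigma is treated as known.
MU0, SIGMA, N, TRIALS, ALPHA = 1.4, 1.0, 30, 4000, 0.05
critical_z = NormalDist().inv_cdf(1 - ALPHA)   # about 1.645

def rejects_h0(true_mu):
    """Draw one sample from a population with mean true_mu and test H0."""
    sample = [random.gauss(true_mu, SIGMA) for _ in range(N)]
    z = (sum(sample) / N - MU0) / (SIGMA / math.sqrt(N))
    return z > critical_z

# H0 is true here, so every rejection is a Type I error.
type_1_rate = sum(rejects_h0(MU0) for _ in range(TRIALS)) / TRIALS

# H0 is false here (true mean 1.7), so every non-rejection is a Type II error.
type_2_rate = sum(not rejects_h0(1.7) for _ in range(TRIALS)) / TRIALS

print(type_1_rate, type_2_rate)
```

The Type I rate should land near the chosen significance level, while the Type II rate depends on how far the true mean is from the H0 value and on the sample size.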

Meta-Analysis

- "Meta-analysis refers to the analysis of analyses ... the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings." (Glass, 1976, p. 3)
- Conduct a fairly extensive search for relevant studies
- Identify appropriate studies to include in the meta-analysis
- Convert each study's results to a common statistical index

Using Statistical Software Packages

- SPSS
- SAS
- MATLAB Statistics Toolbox
- SYSTAT, Minitab, StatView, Statistica

Interpreting the Data

- Relating the findings to the original research problem and to the specific research questions and hypotheses
- Relating the findings to preexisting literature, concepts, theories, and research results
- Determining whether the findings have practical significance
- Identifying limitations of the study

