Descriptive vs. Inferential Statistics

- Descriptive
- Methods for summarizing data
- Summaries usually consist of graphs and numerical summaries of the data
- Inferential
- Methods of making decisions or predictions about a population based on sample information.

Data Vocabulary

- We will refer to "data" as plural and "data set" as a particular collection of data as a whole.
- Observation: each data value.
- Subject (or individual): an item for study (e.g., an employee in your company).
- Variable: a characteristic about the subject or individual (e.g., employee's income).

Data Vocabulary

Consider the multivariate data set with

- 5 variables
- 8 subjects
- 5 x 8 = 40 observations

Data Vocabulary: Data Types

- A data set may have a mixture of data types.

Data Vocabulary: Attribute Data

- Also called categorical, nominal or qualitative data.
- Values are described by words rather than numbers.
- For example, automobile style (e.g., X = full, midsize, compact, subcompact).

Data Vocabulary: Data Coding

- Coding refers to using numbers to represent categories to facilitate statistical analysis.
- Coding an attribute as a number does not make the data numerical.
- For example, 1 = Bachelor's, 2 = Master's, 3 = Doctorate
- 1 = Liberal, 2 = Moderate, 3 = Conservative

Data Vocabulary: Binary Data

- A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (codes themselves are arbitrary).
- For example, 1 = employed, 0 = not employed; 1 = married, 0 = not married; 1 = male, 0 = female; 1 = female, 0 = male
- The coding itself has no numerical value, so binary variables are attribute data.

Data Vocabulary: Numerical Data

- Numerical or quantitative data arise from counting or some kind of mathematical operation.
- For example,
- Number of auto insurance claims filed in March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter (e.g., X = 0.0447).
- Can be broken down into two types: discrete or continuous data.

Data Vocabulary: Discrete Data

- A numerical variable with a countable number of values that can be represented by an integer (no fractional values).
- For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O'Hare (e.g., X = 37)

Data Vocabulary: Continuous Data

- A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios).
- Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).

Data Vocabulary - Rounding

- Ambiguity is introduced when continuous data are rounded to whole numbers.
- Underlying measurement scale is continuous.
- Precision of measurement depends on instrument.
- Sometimes discrete data are treated as continuous when the range is very large (e.g., SAT scores) and small differences (e.g., 604 or 605) aren't of much importance.

Four Levels of Measurement

Nominal Level of Measurement

- Nominal data merely identify a category.
- Nominal data are qualitative, attribute, categorical or classification data (e.g., Apple, Compaq, Dell, HP).
- Nominal data are usually coded numerically; codes are arbitrary (e.g., 1 = Apple, 2 = Compaq, 3 = Dell, 4 = HP).
- The only permissible mathematical operations are counting (e.g., frequencies) and simple statistics.

Ordinal Level of Measurement

- Ordinal data codes can be ranked (e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).
- Distance between codes is not meaningful (e.g., distance between 1 and 2, or between 2 and 3, or between 3 and 4 lacks meaning).
- Many useful statistical tests exist for ordinal data.
- Especially useful in social science, marketing and human resource research.

Interval Level of Measurement

- Data can not only be ranked, but also have meaningful intervals between scale points (e.g., difference between 60°F and 70°F is same as difference between 20°F and 30°F).
- Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average).
- Zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).

Level of Measurement: Likert Scales

- A special case of interval data frequently used in survey research.
- The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).

Likert Scales

- Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is the same as the interval, say, from 3 to 4).
- Ratios are not meaningful (e.g., here 4 is not twice 2).
- Many statistical calculations can be performed (e.g., averages, correlations, etc.).

Time Series vs. Cross-sectional Data: Time Series

- Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days).
- Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.
- We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 1999 to 2008).

Time Series vs. Cross-sectional Data: Cross-sectional

- Each observation represents a different individual unit (e.g., person) at the same point in time (e.g., monthly VISA balances).
- We are interested in variation among observations or in relationships.
- We can combine the two data types to get pooled cross-sectional and time series data.

Population and Sample

- Population: all subjects of interest
- Sample: subset of the population for whom we have data

Populations and Samples

Population

Example: The Sample and the Population for an Exit Poll

- In California in 2003, a special election was held to consider whether Governor Gray Davis should be recalled from office.
- An exit poll sampled 3160 of the 8 million people who voted.

Example: The Sample and the Population for an Exit Poll

- What's the sample and the population for this exit poll?
- The population was the 8 million people who voted in the election.
- The sample was the 3160 voters who were interviewed in the exit poll.

Parameter and Statistic

- A parameter is a numerical summary of the population
- A statistic is a numerical summary of a sample taken from the population

Sampling Methods

Simple Random Sample

- Every item in the population of N items has the same chance of being chosen in the sample of n items.
- We rely on random numbers to select a name.
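
As a rough illustration of the idea, a simple random sample can be drawn with Python's standard `random` module. The list of names and the helper `simple_random_sample` are hypothetical and not from the slides; the seed is there only so the sketch is repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item has the same chance of selection."""
    rng = random.Random(seed)  # seed only makes this sketch reproducible
    return rng.sample(list(population), n)

# Hypothetical population of N = 10 names (not from the slides)
names = ["Ann", "Bob", "Cy", "Dee", "Eli", "Fay", "Gus", "Hal", "Ida", "Jo"]
chosen = simple_random_sample(names, 3, seed=1)
print(chosen)
```

Sampling is without replacement here, so no name can be selected twice.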

Graphical Summaries

- Describe the main features of a variable
- For quantitative variables, key features are
- center (Where are the data values concentrated? What seem to be typical or middle data values?)
- spread (How much variation is there in the data? How spread out are the data values? Are there unusual values?)
- shape (Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?)
- For categorical variables, the key feature is the percentage in each of the categories

Frequency Table

- A method of organizing data
- Lists all possible values for a variable along with the number of observations for each value
- Natural categories exist for qualitative variables
- For quantitative variables artificial bins are created
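
A frequency table for a categorical variable might be sketched as follows. The `regions` list is made-up illustration data, not the table from the slides, and `frequency_table` is a hypothetical helper.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, proportion of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (c, c / total) for v, c in counts.items()}

# Hypothetical observations of a categorical variable
regions = ["Florida", "Hawaii", "Florida", "California", "Florida"]
table = frequency_table(regions)
for region, (count, proportion) in table.items():
    print(f"{region}: count={count}, proportion={proportion:.2f}, percent={100 * proportion:.0f}%")
```

The proportion is the count divided by the total number of observations, and the percentage is the proportion times 100.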

Example Shark Attacks

- What is the variable?
- Is it categorical or quantitative?
- How is the proportion for Florida calculated?
- How is the percentage for Florida calculated?

Example Shark Attacks

- Insights: what the data tell us about shark attacks

Graphs for Categorical Data

- Pie Chart: a circle having a slice of pie for each category. Center angle of slice represents relative frequency/percentage.
- Bar Graph: a graph that displays a vertical bar for each category. Length of bars represents frequency.

Example: Sources of Electricity Use in the U.S. and Canada

Pie Chart

- A pie chart can only convey a general idea of the data.
- Pie charts should be used to portray data which sum to a total (e.g., percent market shares).
- A pie chart should only have a few (i.e., 3 to 5) slices.
- Each slice should be labeled with data values or percents.

Bar Chart

Pie Charts Are Often Abused

- Consider the following charts used to illustrate an article from the Wall Street Journal. Which type is better? Why?

Ill-Advised Pie Chart Options

- Exploded and 3-D pie charts add strong visual impact but slices are hard to assess.

Summarizing Quantitative Data

- Example: Price/Earnings Ratios
- P/E ratios are current stock price divided by earnings per share in the last 12 months. For example

Graphs for Quantitative Data

- Dot Plot: shows a dot for each observation
- Histogram: uses bars to portray the data
- Which is best?
- Dot plot
- More useful for small data sets
- Data values are retained
- Histogram
- More useful for large data sets
- Most compact display
- More flexibility in defining intervals

Dot Plot

- A dot plot is the simplest graphical display of n individual values of numerical data.
- Easy to understand
- Not good for large samples (e.g., > 5,000).
- Make a scale that covers the data range
- Mark the axes and label them
- Plot each data value as a dot above the scale at its approximate location
- If more than one data value lies at about the same axis location, the dots are piled up vertically.

Dot Plot

- Range of data shows dispersion.
- Clustering shows central tendency.
- Dot plots do not tell much about the shape of the distribution.
- Can add annotations (text boxes) to call attention to specific features.

Frequency Distributions and Histograms

- A frequency distribution is a table formed by classifying n data values into k classes (bins).
- Bin limits define the values to be included in each bin. Widths must all be the same.
- Frequencies are the number of observations within each bin.
- Express as relative frequencies (frequency divided by the total) or percentages (relative frequency times 100).

Constructing a Frequency Distribution

- Sort data in ascending order (e.g., P/E ratios)
- Choose the number of bins (k)
- k should be much smaller than n.
- Too many bins results in sparsely populated bins; too few and dissimilar data values are lumped together.

Constructing a Frequency Distribution: Sturges' Rule

Constructing a Frequency Distribution

- Set the bin limits according to k from Sturges' Rule
- For example, for k = 7 bins, the approximate bin width is
- To obtain nice limits, round the width to 10 and start the first bin at 0 to yield 0, 10, 20, 30, 40, 50, 60, 70
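
The formulas on these slides did not survive in the transcript. A commonly quoted form of Sturges' Rule is k = 1 + 3.3 log10(n), with an approximate bin width of (largest value - smallest value) / k. The sketch below assumes those forms; the function names are hypothetical.

```python
import math

def sturges_bins(n):
    """Suggested number of bins k for n observations (assumed form: 1 + 3.3 log10 n)."""
    return round(1 + 3.3 * math.log10(n))

def approx_bin_width(x_min, x_max, k):
    """Approximate bin width before rounding to a 'nice' value (assumed form)."""
    return (x_max - x_min) / k

k = sturges_bins(68)  # 68 is the number of P/E ratios mentioned later in the slides
print(k)
print(approx_bin_width(0, 70, k))  # limits 0 to 70 as in the nice-limits example
```

With n = 68 the rule suggests 7 bins, which agrees with the k = 7 used in the example.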

Constructing a Frequency Distribution

- Put the data values in the appropriate bin
- In general, the lower limit is included in the bin while the upper limit is excluded.
- Create the table; you can include
- Frequencies: counts for each bin
- Relative frequencies: absolute frequency divided by total number of data values.
- Cumulative frequencies: accumulated relative frequency values as bin limits increase.
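
The steps above can be sketched in a few lines. The data values are invented for illustration and `frequency_distribution` is a hypothetical helper; it follows the stated convention that the lower limit is included and the upper limit is excluded.

```python
def frequency_distribution(data, limits):
    """Tabulate bins [limits[i], limits[i+1]): lower limit included, upper excluded.

    Returns (frequencies, relative frequencies, cumulative relative frequencies).
    Assumes every data value falls inside the overall range of the limits.
    """
    k = len(limits) - 1
    freq = [0] * k
    for x in data:
        for i in range(k):
            if limits[i] <= x < limits[i + 1]:
                freq[i] += 1
                break
    n = len(data)
    rel = [f / n for f in freq]
    cum, running = [], 0.0
    for r in rel:
        running += r
        cum.append(running)
    return freq, rel, cum

data = [7, 10, 12, 19, 20, 35]  # hypothetical values, not the P/E data
freq, rel, cum = frequency_distribution(data, [0, 10, 20, 30, 40])
print(freq, rel, cum)
```

Note that the value 10 lands in the second bin and 20 in the third, because each bin excludes its upper limit.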

Bin Limits for the P/E Ratio Data

Frequency Distributions and Histograms

- A histogram is a graphical representation of a frequency distribution.
- Y-axis shows frequency within each bin.
- A histogram is a bar chart with no gaps between bars
- X-axis ticks show end points of each bin.

Frequency Distributions and Histograms

- Consider 3 histograms for the P/E ratio data with different bin widths. What do they tell you?

Frequency Distributions and Histograms: Modal Class

- A histogram bar that is higher than those on either side is called the modal class.
- Monomodal: a single modal class.
- Bimodal: two modal classes.
- Multimodal: more than two modal classes.
- Modal classes may be artifacts of the way bin limits are chosen.

Shape of Histograms

- A histogram suggests the shape of the population.
- It is influenced by number of bins and bin limits.
- Skewness is indicated by the direction of the longer tail of the histogram.
- Left-skewed (negatively skewed): a longer left tail.
- Right-skewed (positively skewed): a longer right tail.
- Symmetric: both tail areas approximately the same.

Line Charts

- Used to display a time series or spot trends, or to compare time periods.

- Can display several variables at once.

Scatter Plots for Bivariate Data

- A scatter plot shows n pairs of observations as dots (or some other symbol) on an XY graph.
- A starting point for bivariate data analysis.
- Allows observations about the relationship between two variables.
- Answers the question: Is there an association between the two variables and if so, what kind of association?

Scatter Plot Example: Birth Rates vs. Life Expectancy

- Here is a scatter plot with life expectancy on the X-axis and birth rates on the Y-axis.

- Is there an association between the two variables?

- Is there a cause-and-effect relationship?

Scatter Plot Example: Aircraft Fuel Consumption

- Consider five observations on flight time and fuel consumption for a twin-engine Piper Cheyenne aircraft.
- A causal relationship is assumed since a longer flight would consume more fuel.

Scatter Plot Example: Aircraft Fuel Consumption

- Here is the scatter plot with flight time (explanatory) on the X-axis and fuel use (response) on the Y-axis. Is there an association between the variables?

Scatter Plots and Policy Making

- Scatter plots can be helpful when policy decisions need to be made.
- For example, compare traffic fatalities resulting from crashes per million vehicles sold between 1995 and 1999.
- Do SUVs create a greater risk to the drivers of both cars?

Numerical Descriptive Statistics

- How can we describe the center of quantitative data?

Measures of Central Tendency

Measures of Central Tendency - Mean

- A familiar measure of central tendency.

- In Excel, use function AVERAGE(Data) where Data is an array of data values.

Characteristics of the Mean

- Arithmetic mean is the most familiar average.
- Affected by every sample item.
- The balancing point or fulcrum for the data.

Characteristics of the Median

Comparison Among Mean, Median, and Mode

- Consider the following quiz scores for 4 students

Lee's scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
Pat's scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
Sam's scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
Xiao's scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)

- What does the mode for each student tell you?
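
The quiz-score summaries can be checked with Python's `statistics` module. The `modes` helper is a hypothetical convenience that reports "no mode" as an empty list when every value occurs exactly once.

```python
from collections import Counter
from statistics import mean, median

def modes(scores):
    """Return the most frequent value(s); empty list if every value occurs once."""
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

quizzes = {
    "Lee": [60, 70, 70, 70, 80],
    "Pat": [45, 45, 70, 90, 100],
    "Sam": [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}
summary = {name: (mean(s), median(s), modes(s)) for name, s in quizzes.items()}
for name, (m, med, mo) in summary.items():
    print(name, m, med, mo)
```

All four students share the same mean and median, so only the mode distinguishes the score patterns.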

Relationships Among Mean, Median and Mode

Measures of Variation

- Variation is the spread of data points about the center of the distribution in a sample. Consider the following measures of dispersion

The Range

Range = largest measurement - smallest measurement

Example: Internists' Salaries (in thousands of dollars)

127 132 138 141 144 146 152 154 165 171 177 192 241

Range = 241 - 127 = 114 ($114,000)

The Variance

Population: $X_1, X_2, \ldots, X_N$

Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

The Standard Deviation

Example: Population Variance/Standard Deviation

Population of annual returns for five junk bond mutual funds: 10.0, 9.4, 9.1, 8.3, 7.8

$\mu = \frac{10.0 + 9.4 + 9.1 + 8.3 + 7.8}{5} = \frac{44.6}{5} = 8.92$

$\sigma^2 = \frac{1.1664 + 0.2304 + 0.0324 + 0.3844 + 1.2544}{5} = \frac{3.068}{5} = 0.6136$
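
As a quick check of this example, the population mean, variance and standard deviation can be computed with Python's `statistics` module, which divides by N for the population versions.

```python
from statistics import mean, pstdev, pvariance

returns = [10.0, 9.4, 9.1, 8.3, 7.8]  # the five fund returns from the example

mu = mean(returns)            # population mean
sigma2 = pvariance(returns)   # sum of squared deviations divided by N
sigma = pstdev(returns)       # square root of the population variance

print(round(mu, 2), round(sigma2, 4), round(sigma, 4))
```

The standard deviation is simply the square root of the variance, so it is in the same units as the returns.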

Sample Variance Example

Sample: 2, 3, 5, 6. Here n = 4 and x̄ = 4

xi    x̄    (xi - x̄)    (xi - x̄)^2
2     4     -2          4
3     4     -1          1
5     4      1          1
6     4      2          4

Sum of squared deviations = 10

s^2 = 10 / (4 - 1) = 3.33

Example: Sample Variance/Standard Deviation

Sample of five car mileages: 30.8, 31.7, 30.1, 31.6, 32.1

s^2 = 2.572 ÷ 4 = 0.643
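
Both sample-variance examples can be reproduced with `statistics.variance`, which divides the sum of squared deviations by n - 1. This is only a verification sketch; the variable names are illustrative.

```python
from statistics import stdev, variance

small_sample = [2, 3, 5, 6]
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

s2_small = variance(small_sample)   # 10 / (4 - 1)
s2_mileage = variance(mileages)     # 2.572 / (5 - 1)
s_mileage = stdev(mileages)         # square root of the sample variance

print(round(s2_small, 2), round(s2_mileage, 3), round(s_mileage, 4))
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.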

Coefficient of Variation

- Useful for comparing variables measured in different units or with different means.
- A unit-free measure of dispersion
- Expressed as a percent of the mean.
- Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
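
The formula itself is not preserved in the transcript. The sketch below assumes the usual sample form, CV = 100 × s / x̄, and applies it to the five car mileages used earlier; `coefficient_of_variation` is a hypothetical helper.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation as a percent of the mean (assumed form: 100 * s / mean)."""
    m = mean(data)
    if m <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * stdev(data) / m

cv_mileage = coefficient_of_variation([30.8, 31.7, 30.1, 31.6, 32.1])
print(round(cv_mileage, 2))
```

Because the units cancel, the result can be compared across variables measured on different scales.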

Coefficient of Variation Examples

Mean Absolute Deviation

- The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
- Uses absolute values of the deviations around the mean.
- Excel's function is AVEDEV(Array)
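
A minimal sketch of the MAD calculation, using the small sample 2, 3, 5, 6 from the sample variance example; the function name is illustrative.

```python
from statistics import mean

def mean_absolute_deviation(data):
    """Average absolute distance of each value from the mean."""
    m = mean(data)
    return sum(abs(x - m) for x in data) / len(data)

mad = mean_absolute_deviation([2, 3, 5, 6])
print(mad)
```

Here the mean is 4 and the absolute deviations are 2, 1, 1 and 2, so the average distance is 6 / 4.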

Central Tendency vs. Dispersion

- Consider the histograms of hole diameters drilled in a steel plate during manufacturing.

- The desired distribution is outlined in red.

Central Tendency vs. Dispersion

Acceptable variation but mean is less than 5 mm.

Desired mean (5mm) but too much variation.

- Take frequent samples to monitor quality.

Central Tendency vs. Dispersion: Job Performance

- A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

Section 2.6 & 2.7

- Interpreting Standard Deviation and Measures of Relative Standing

Empirical Rule

- For bell-shaped data sets
- Approximately 68% of the observations fall within 1 standard deviation of the mean
- Approximately 95% of the observations fall within 2 standard deviations of the mean
- Approximately 100% of the observations fall within 3 standard deviations of the mean

Scale in std. dev. units (μ = 9.12, σ = 0.15)

Empirical Rule: Detecting Unusual Observations

- The P/E ratio data contains several large data values. Are they unusual or outliers?

Empirical Rule: Detecting Unusual Observations

- If the sample came from a normal distribution, then the Empirical Rule states

22.72 ± 1(14.08) = (8.6, 36.8)
22.72 ± 2(14.08) = (-5.4, 50.9)
22.72 ± 3(14.08) = (-19.5, 65.0)
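
The three intervals are just the mean plus or minus 1, 2 and 3 standard deviations. A short sketch, using the mean 22.72 and standard deviation 14.08 quoted for the P/E data; `empirical_intervals` is a hypothetical helper.

```python
def empirical_intervals(center, spread):
    """Intervals center ± k*spread for k = 1, 2, 3."""
    return {k: (center - k * spread, center + k * spread) for k in (1, 2, 3)}

intervals = empirical_intervals(22.72, 14.08)
for k, (low, high) in intervals.items():
    print(f"within {k} std. dev.: ({low:.2f}, {high:.2f})")
```

A negative lower limit is not meaningful for P/E ratios; it is only a consequence of applying a symmetric rule to right-skewed data.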

Empirical Rule: Detecting Unusual Observations

- Are there any unusual values or outliers?

[Figure: P/E values 7, 8, ..., 48, 55, 68, 91 with mean 22.72]

Defining a Standardized Variable or Z-Score

- A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$

Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$

Z-Score Example

- z_i tells how far away the observation is from the mean. A negative z value indicates the observation is below the mean, while a positive z value indicates the observation is above the mean.
- For example, for the P/E data, the first value is x1 = 7. Using the sample mean 22.72 and standard deviation 14.08 quoted for these data, the associated z value is z1 = (7 - 22.72) / 14.08 ≈ -1.12.
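
A z-score is a one-line calculation. The sketch below assumes the sample mean 22.72 and standard deviation 14.08 given for the P/E data on the Empirical Rule slides; `z_score` is an illustrative name.

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread

z1 = z_score(7, 22.72, 14.08)   # first P/E value, x1 = 7
print(round(z1, 2))
```

The negative sign shows that the first P/E value lies a little more than one standard deviation below the mean.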

Percentiles, Deciles and Quartiles

- Percentiles are data that have been divided into 100 groups.
- For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
- Deciles are data that have been divided into 10 groups.
- Quintiles are data that have been divided into 5 groups.
- Quartiles are data that have been divided into 4 groups.

Use of Percentiles and Quartiles

- Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use 5th, 25th, 50th, 75th and 90th percentiles).
- Percentiles are used in employee merit evaluation and salary benchmarking.
- Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

Quartiles

- Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
- The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

Quartiles

- The second quartile Q2 is the median, an important indicator of central tendency.
- Q1 and Q3 measure dispersion since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.

Calculating Quartiles

- For small data sets, find quartiles using the method of medians

Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie below Q2.
Step 4. Find the median of the data values that lie above Q2.
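
The four steps can be sketched as below, applied to the internists' salaries from the range example. One assumption is made that the slides do not spell out: the lower and upper halves are split by position, so when n is odd the median itself belongs to neither half.

```python
from statistics import median

def quartiles_by_medians(data):
    """Method of medians: Q2 is the median; Q1 and Q3 are medians of the two halves."""
    xs = sorted(data)                  # Step 1
    n = len(xs)
    q2 = median(xs)                    # Step 2
    lower = xs[: n // 2]               # values below Q2 (by position; assumption)
    upper = xs[(n + 1) // 2 :]         # values above Q2 (by position; assumption)
    return median(lower), q2, median(upper)  # Steps 3 and 4

salaries = [127, 132, 138, 141, 144, 146, 152, 154, 165, 171, 177, 192, 241]
q1, q2, q3 = quartiles_by_medians(salaries)
print(q1, q2, q3)
```

Other conventions (including Excel's) interpolate between positions and can give slightly different quartiles for the same data.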

Calculating Quartiles

- Use Excel function QUARTILE(Array, k) to return the kth quartile.
- Excel treats quartiles as a special case of percentiles. For example, to calculate Q3
- QUARTILE(Array, 3)
- PERCENTILE(Array, 75)
- Excel calculates the quartile positions as

Central Tendency Using Quartiles

Dispersion Using Quartiles

Box Plots

- A useful tool of exploratory data analysis (EDA).
- Also called a box-and-whisker plot.
- Based on a five-number summary
- Consider the five-number summary for the 68 P/E

ratios

Xmin, Q1, Q2, Q3, Xmax

Detecting Unusual Observations and Potential Outliers

- IQR = Q3 - Q1
- An observation is considered unusual if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile
- An observation is a potential outlier if it falls more than 3 x IQR below the first quartile or more than 3 x IQR above the third quartile
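
These two rules translate directly into a small classifier. The quartile values used below are hypothetical, and the label "typical" for values inside the inner fences is a name chosen here, not one used on the slides.

```python
def classify(x, q1, q3):
    """Label x using the 1.5 x IQR (unusual) and 3 x IQR (potential outlier) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "potential outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "unusual"
    return "typical"

# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10
for value in (15, 40, 60):
    print(value, classify(value, 10, 20))
```

With these quartiles the inner fences are -5 and 35 and the outer fences are -20 and 50, so 40 is unusual and 60 is a potential outlier.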

Box - Whiskers Plots

Box Plots

- Fences and Unusual Data Values
- Truncate the whisker at the fences and display unusual values and outliers as dots.
- Based on these fences, there are three unusual P/E values and two outliers.

Probability Concepts

An experiment is any process of observation with an uncertain outcome. The possible outcomes for an experiment are called the experimental outcomes. Probability is a measure of the chance that an experimental outcome will occur when an experiment is carried out.

Probability

If E is an experimental outcome, then P(E) denotes the probability that E will occur.

Conditions

- If E can never occur, then P(E) = 0
- If E is certain to occur, then P(E) = 1
- The probabilities of all the experimental outcomes must sum to 1.

Assigning Probabilities to Experimental Outcomes

- Classical Method
- For equally likely outcomes
- Relative frequency or Empirical Approach
- In the long run
- Subjective
- Assessment based on experience, expertise, or intuition

The Sample Space

The sample space of an experiment is the set of all experimental outcomes.

Example: Genders of Two Children

Computing Probabilities of Events

An event is a set (or collection) of experimental outcomes. The probability of an event is the sum of the probabilities of the experimental outcomes that belong to the event.

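Probabilities: Equally Likely Outcomes

If the sample space outcomes (or experimental outcomes) are all equally likely, then the probability that an event will occur is equal to the ratio

(number of sample space outcomes that correspond to the event) / (total number of sample space outcomes)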

Example: Computing Probabilities

Events

- P(one boy and one girl) = P(BG) + P(GB) = ¼ + ¼ = ½
- P(at least one girl) = P(BG) + P(GB) + P(GG) = ¼ + ¼ + ¼ = ¾

Note: The experimental outcomes are BB, BG, GB, GG. All outcomes are equally likely: P(BB) = ... = P(GG) = ¼
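
The same probabilities can be obtained by listing the sample space and counting outcomes, which is valid here only because the four outcomes are taken to be equally likely. The `probability` helper is illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for the genders of two children: BB, BG, GB, GG
sample_space = ["".join(p) for p in product("BG", repeat=2)]

def probability(event):
    """Equally likely outcomes: favorable outcomes divided by all outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_one_each = probability(lambda o: set(o) == {"B", "G"})
p_at_least_one_girl = probability(lambda o: "G" in o)
print(sample_space, p_one_each, p_at_least_one_girl)
```

Using `Fraction` keeps the answers exact, so they can be compared directly with ½ and ¾.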

Event Relations

The Addition Rule for Unions

The probability that A or B (the union of A and B) will occur is

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Conditional Probability

The probability of an event A, given that the event B has occurred, is called the conditional probability of A given B and is denoted as $P(A \mid B)$. Further,

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Independence of Events

Two events A and B are said to be independent if and only if $P(A \mid B) = P(A)$ or, equivalently, $P(B \mid A) = P(B)$

Multiplication Rule for Intersections

The probability that A and B (the intersection of A and B) will occur is

$P(A \cap B) = P(A) \, P(B \mid A) = P(B) \, P(A \mid B)$

If A and B are independent, then the probability that A and B (the intersection of A and B) will occur is

$P(A \cap B) = P(A) \, P(B)$

Applications of Independence

- To illustrate system reliability, suppose a Web site has 2 independent file servers. Each server has 99% reliability. What is the total system reliability? Let
- F1 be the event that server 1 fails
- F2 be the event that server 2 fails
- P(F1 ∩ F2) = P(F1) P(F2) = (.01)(.01) = .0001. So, the probability that both servers are down is .0001.
- The probability that at least one server is up is 1 - .0001 = .9999 or 99.99%
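
The calculation generalizes to any number of independent servers: the system is down only if every server fails. A minimal sketch, with a hypothetical function name:

```python
def system_reliability(server_reliability, n_servers):
    """Probability that at least one of n independent servers is up."""
    p_all_fail = (1 - server_reliability) ** n_servers  # multiplication rule for independent events
    return 1 - p_all_fail

two_servers = system_reliability(0.99, 2)
print(round(two_servers, 4))
```

The result depends entirely on the independence assumption; servers that share a power supply or network link would not fail independently.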

Applications of Independence: the Five Nines Rule

Contingency Tables

Contingency Tables Example: Salary Gains and MBA Tuition

- Are large salary gains more likely to accrue to graduates of high-tuition MBA programs?
- For example, find the marginal probability of a small salary gain (P(S1)).
- The marginal probability of a single event is found by dividing a row or column total by the total sample size.
- P(S1) = 17/67 = 0.2537
- Conclude that about 25% of salary gains at the top-tier schools were under $50,000.

Contingency Tables Example: Salary Gains and MBA Tuition

- Find the marginal probability of a low tuition P(T1).
- P(T1) = 16/67 = 0.2388
- There is a 24% chance that a top-tier school's MBA tuition is under $40,000.

Contingency Tables Example: Salary Gains and MBA Tuition

- Find the joint probability of a low tuition and large salary gains P(T1 ∩ S3)
- P(T1 ∩ S3) = 1/67 = 0.0149
- There is less than a 2% chance that a top-tier school has both low tuition and large salary gains.

Contingency Tables Example: Salary Gains and MBA Tuition

- Find the conditional probability that the salary gains are small (S1) given that the MBA tuition is large (T3).
- P(S1 | T3) = 5/32 = 0.1563
- There is about a 16% chance that a top-tier school has small salary gains given the tuition is large.

Salary Gains and MBA Tuition - Independence

- To check for independent events in a contingency table, compare the conditional to the marginal probabilities.
- For example, if small salary gains (S1) were independent of high tuition (T3), then P(S1 | T3) = P(S1).
- What do you conclude about events S1 and T3?
- They are dependent, or not independent
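
The comparison can be made explicit with the counts quoted on the earlier slides (17 of 67 schools with small salary gains, and 5 of the 32 high-tuition schools with small salary gains). The variable names are illustrative.

```python
from fractions import Fraction

p_s1 = Fraction(17, 67)           # marginal probability P(S1)
p_s1_given_t3 = Fraction(5, 32)   # conditional probability P(S1 | T3)

# Independence would require the conditional to equal the marginal
independent = p_s1_given_t3 == p_s1
print(float(p_s1), float(p_s1_given_t3), independent)
```

With sample data the two values will rarely match exactly, so in practice the question is whether the gap (about 0.25 versus about 0.16 here) is large enough to matter; the slides treat it as evidence of dependence.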

Contingency Tables Relative Frequencies

- Calculate the relative frequencies below for each cell of the cross-tabulation table to facilitate probability calculations.
- Symbolic notation for relative frequencies