
Clustering and Grouping Analysis

Victor M. H. Borden, Ph.D.
Associate Vice Chancellor, Information Management and Institutional Research
Associate Professor of Psychology
Indiana University Purdue University Indianapolis
vborden@iupui.edu

- Methods for Identifying Groups and Determining How Groups Differ

Purpose

- To provide a working knowledge of:
- Logistic Regression
- Discriminant Analysis
- Cluster Analysis
- To demonstrate their application to common institutional research issues:
- student market segmentation
- student retention
- faculty workload

Learning Objectives

- Understand the fundamental concepts of logistic regression, cluster analysis, and discriminant analysis
- Determine when to use appropriate variations of each technique
- Understand the data requirements for performing these analyses
- Use SPSS software to perform basic logistic, cluster, and discriminant analyses

Learning Objectives

- Know how to interpret the tabular and graphical outputs of these procedures to evaluate the validity and reliability of various solutions
- Prepare reports on the results of these analyses for professional or lay audiences
- Understand the relationship between these and related statistical methods

Workshop Pre-requisites

- Basic Statistics
- General Linear Models
- Statistical Software
- Institutional Research

Workshop Method

- Introduction to basics using an IR example
- On-your-own exercises using IR datasets
- Discussion of methods and issues as you experience them

Workshop Schedule Day 1

- Introduction and overview (15)
- Logistic regression
- Basic concepts with example (45)
- On your own examples (30)
- Break (30)
- Discriminant analysis
- Basic concepts with example (30)
- On your own examples (30)
- Logistic regression v. discriminant analysis

Workshop Schedule Day 2

- Cluster Analysis
- Basic concepts with example (45)
- Example Peer institution identification (30)
- Break (30)
- Decision Tree Techniques
- Basic concepts with example (30)
- Free play (45)

Overview

- Analyzing differences among existing groups
- Extending the regression model to look at a (dichotomous) group outcome variable
- Logistic regression
- Discriminant analysis
- Identifying groups out of whole cloth
- Cluster analysis
- Focus on proximity aspect
- Decision trees as a hybrid model
- CHAID

Workshop Datasets

Existing Group Differences

- The outcome of interest is membership in a group
- Retained vs. non-returning students
- Admits who matriculate vs. those who don't
- Alums who donate vs. those who don't
- Faculty who get grants vs. those who don't
- Institutions that get sanctioned for assessment on their accreditation visits vs. those who don't
- Class sections that meet during the day vs. evening

Three Basic Questions

- Which, if any, of the variables are useful for predicting group membership?
- What is the best combination of variables to optimize predictions among the original sample?
- How useful is that combination for classifying new cases?

One or More Groups

- Group outcomes can be dichotomous or polychotomous
- Logistic regression and discriminant analysis can handle both
- We will focus on the dichotomous case, with only lip service to the polychotomous situation

Examining Group Differences

- Why not t-test or ANOVA?
- The group factor is the dependent variable (outcome), not an independent variable (predictor, causal agent, etc.)
- No random assignment to group
- But we always violate that assumption
- Requires a normal distribution of the outcome

The Linear Regression Problem

- Group membership as the outcome (dependent) variable violates an important assumption, with serious consequences
- Under certain conditions the problems are not completely debilitating:
- Group membership evenly distributed
- Predictors are all solid continuous/normal variables

Two Regression-Based Solutions

- Logistic regression
- Transforms outcome into continuous odds ratio
- Readily accommodates continuous and categorical predictors
- Interpretations differ substantially from the OLS linear form
- Includes classification matrix
- Discriminant analysis
- Uses standard OLS procedures
- Requires continuous/normal predictors
- Interpretations similar to OLS linear regression
- Includes classification matrix

Remember OLS Linear Form

- Finding the linear equation that best fits the pattern

OLS Linear Form

- Overall fit of model
- Significance of model
- Predictor (b) coefficients

The Group Outcome Problem

- Y equals either 0 or 1
- Predictions can be > 1 or < 0
- Coefficients may be biased
- Heteroscedasticity is present
- The error term is not normally distributed, so hypothesis tests are invalid

The Logistic Regression Solution

- Use a different method for estimating parameters: Maximum Likelihood Estimation (MLE) instead of OLS
- Maximizes the probability that predicted Y equals observed Y
- Transforms the outcome variable into a form that is continuous/normal
- The natural log of the odds ratio, or logit:
- Ln(P/(1-P)) = a + b1x1 + b2x2
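The workshop demos use SPSS; as a language-neutral illustration of the logit transform itself, here is a small Python sketch (the probabilities are arbitrary):

```python
import math

def logit(p):
    # natural log of the odds ratio: ln(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(z):
    # back-transform a linear prediction a + b1*x1 + b2*x2 into a probability
    return 1 / (1 + math.exp(-z))

# p = .5 maps to a logit of 0; the transform stretches probabilities near
# 0 and 1 onto the whole real line, giving a continuous outcome for MLE
print(logit(0.5))                       # 0.0
print(round(logit(0.8), 4))             # ln(4), about 1.3863
print(round(inv_logit(logit(0.8)), 1))  # back to 0.8
```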

The Odds Ratio

- The probability of being in one group relative to the probability of being in the other group
- If P(group 1) = .5, the odds ratio is 1 (.5/.5)
- If the retention rate is 80%, the odds ratio is 0.8/0.2 = 4 (odds of 4 to 1 of being retained)
- If the yield rate is 67%, the odds ratio is 0.67/0.33 = 2, or 2 to 1
- If 12.5% of alums donate, the odds ratio is 0.125/0.875 = 0.143, or 1 to 7
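The slide's arithmetic can be checked directly; this sketch just evaluates P/(1-P) for the quoted rates:

```python
def odds(p):
    # probability of being in one group relative to the other: P / (1 - P)
    return p / (1 - p)

print(round(odds(0.5), 2))    # 1.0, even odds
print(round(odds(0.8), 2))    # 4.0, retention of 80%: 4 to 1
print(round(odds(0.67), 2))   # 2.03, yield of 67%: about 2 to 1
print(round(odds(0.125), 3))  # 0.143, 12.5% donors: 1 to 7
```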

Predictors and Coefficients

- Predictors can be continuous or categorical (dummy) variables
- The coefficient shows the change in Ln(P/(1-P)) for a unit change in the predictor
- Can be converted into marginal effects: the effect on the probability that Y=1 for a unit change in X
- Not easy to explain, but:
- Can talk in general terms (positive, negative, zero)
- Have classification statistics that are more intuitive

Logistic Regression in SPSS

- Retention dataset

Retention Example Output

- Omnibus Tests
- Model Summary
- Classification table

Retention Example Output

- Predictor statistics

Interpreting Logistic Reg Output

- Omnibus Tests
- Overall significance of model
- Relative performance of one model v. another
- Model summary
- Goodness-of-fit R² statistics
- Several versions, none of which are real R² values
- Classification table
- Ability to successfully classify from prediction
- Remember: the prediction is a probability that then has to be categorized

Interpreting Coefficients

- The B value is the change in ln(odds ratio) for a unit change in the predictor
- The S.E. is the standard error of the coefficient
- Relates to significance and estimation
- The Wald statistic is like the t-value in OLS linear regression
- Has a corresponding significance level
- Can be incorrect for large coefficients
- Exp(B) is the odds multiplier
- The factor by which the odds that Y=1 change for a unit change in X
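Exp(B) is just e raised to the coefficient. A hypothetical coefficient (not taken from the workshop output) shows the link between B and the odds multiplier:

```python
import math

# hypothetical logit coefficient for GPA, for illustration only
b_gpa = 0.80
exp_b = math.exp(b_gpa)   # Exp(B): factor applied to the odds per unit change
print(round(exp_b, 2))    # 2.23, so the odds a bit more than double

# a coefficient of 0 gives Exp(B) = 1: no effect on the odds
print(math.exp(0.0))      # 1.0
```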

Interpreting Coefficients

A unit change in SATACT increases the odds that the student will be retained by a factor of 1, that is, not at all.

A full-time student is twice as likely to be retained as a part-time student.

A unit (full grade) change in GPA increases the odds that a student will be retained by more than a factor of 2.

On Your Own

- Try different variables or entry methods on the retention data
- Predict admissions yield status with the application data set
- Predict full- vs. part-time faculty status with the faculty data set
- Distinguish between two Carnegie categories on the institutional data set
- Don't forget to select only two groups

Questions and Answers

Discriminant Analysis

- Closer to linear regression in form and interpretation
- Predictors must be continuous or dichotomous
- Logistic regression can handle polychotomous categorical variables
- Can be used for multi-group outcomes
- Generates k-1 orthogonal solutions for k groups

Requirements for Discriminant

- Two or more groups
- At least two cases per group
- Any number of discriminating variables, provided that it is less than the total number of cases minus 2
- Discriminating variables are on an interval or ratio scale

Requirements for Discriminant

- No discriminating variable can be a linear combination of any other discriminating variable
- The covariance matrices for each group must be approximately equal, unless special formulas are used
- Each group must have been drawn from a population with a multivariate normal distribution on the discriminating variables

Interpreting Group Differences

- Which variables best predict differences?
- What is the best combination of predictors? The canonical discriminant function

The Discriminant Function

- Derivation
- Like regression: maximize the between-groups sum of squares relative to the within-groups sum of squares for the value D
- Interpretation
- Overall function statistics
- Predictor variable statistics
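For two groups and two predictors, the discriminant weights can be computed by hand as Sw^-1 (m1 - m2), the direction maximizing between-group separation relative to pooled within-group scatter. The [GPA, full-time] rows below are made up for illustration, not the workshop's retention file:

```python
def mean(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def within_scatter(groups):
    # pooled within-group scatter matrix (2x2 here)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

retained = [[3.2, 1], [3.5, 1], [2.9, 1], [3.8, 0]]  # made-up [GPA, full-time]
left     = [[2.1, 0], [2.4, 1], [1.9, 0], [2.6, 0]]

m1, m2 = mean(retained), mean(left)
sw_inv = inv2(within_scatter([retained, left]))
w = [sum(sw_inv[i][j] * (m1[j] - m2[j]) for j in range(2)) for i in range(2)]

score = lambda x: w[0] * x[0] + w[1] * x[1]  # discriminant score D
# every retained case scores above every non-returning case on D
print(all(score(r) > score(l) for r in retained for l in left))  # True
```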

Retention Example

- Overall model statistics
- Eigenvalue/Canonical Correlation
- Wilks' Lambda
- Lambda = 1 - (canonical correlation)²

Predictor Coefficients

- Standardized Discriminant Coefficients
- Variables with the largest (absolute) coefficients contribute most to prediction of group membership
- Sign is the direction of effect

Retention Coefficients

- Structure coefficients
- Correlation between each predictor and the overall discriminant function

Classification in Discriminant

- Prior probabilities
- Can be .5 or set by size of group

Classification in Discriminant

- Accuracy within each group as well as overall

The Classification Matrix

- Comparing actual group membership against predicted group membership, using the classification function
- Can have an "unknown" region
- Split samples can (should?) be used to further test the accuracy of classification

The Classification Matrix

- Measures of interest include:
- Overall prediction accuracy: (a+d)/N
- Sensitivity (accuracy among positive cases): a/(a+c)
- Specificity (accuracy among negative cases): d/(b+d)
- False positive rate: b/(b+d)
- False negative rate: c/(a+c)
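With the four cells labeled a (true positives), b (false positives), c (false negatives), and d (true negatives), the measures are simple ratios; the counts below are made up:

```python
# a = true positives, b = false positives, c = false negatives,
# d = true negatives (illustrative counts, N = 150)
a, b, c, d = 70, 10, 20, 50
N = a + b + c + d

accuracy = (a + d) / N       # overall prediction accuracy
sensitivity = a / (a + c)    # accuracy among positive cases
specificity = d / (b + d)    # accuracy among negative cases
false_pos = b / (b + d)      # complement of specificity
false_neg = c / (a + c)      # complement of sensitivity

print(accuracy)                                      # 0.8
print(round(sensitivity, 3), round(specificity, 3))  # 0.778 0.833
```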

The Confusion Matrix

- But that's not all:
- Prevalence: (a+c)/N
- Overall diagnostic power: (b+d)/N
- Positive predictive power: a/(a+b)
- Negative predictive power: d/(c+d)
- Misclassification rate: (b+c)/N
- Odds ratio: (a·d)/(b·c)
- Kappa: [(a+d) - ((a+c)(a+b) + (b+d)(c+d))/N] / [N - ((a+c)(a+b) + (b+d)(c+d))/N]
- NMI: 1 - [-a·ln(a) - b·ln(b) - c·ln(c) - d·ln(d) + (a+b)·ln(a+b) + (c+d)·ln(c+d)] / [N·ln(N) - ((a+c)·ln(a+c) + (b+d)·ln(b+d))]
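The same four cells drive the extra measures. This sketch evaluates a few of them, including Kappa, again with made-up counts:

```python
# same cell labels as before: a, b, c, d (illustrative counts)
a, b, c, d = 70, 10, 20, 50
N = a + b + c + d

prevalence = (a + c) / N          # share of actual positives
diag_power = (b + d) / N          # overall diagnostic power
pos_pred_power = a / (a + b)      # accuracy among predicted positives
neg_pred_power = d / (c + d)      # accuracy among predicted negatives
misclassification = (b + c) / N
odds_ratio = (a * d) / (b * c)

# Kappa: agreement beyond what chance alone would produce
chance = ((a + c) * (a + b) + (b + d) * (c + d)) / N
kappa = ((a + d) - chance) / (N - chance)
print(round(kappa, 3))  # 0.595
```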

Adjusting for Prior Probabilities or the "Costs" of Misclassification

- Methods so far have considered each group equally
- One can take into account known differences in group composition
- This usually takes one of two forms:
- Prior information regarding the likely distribution of group sizes
- There are known higher "costs" of misclassifying objects into one group compared to the other

.5 v. Group Size Cutoffs-Discrim

.5 v. Group Size Cutoffs-Logistic

Logistic v. Discrim Classification

- Classification measure calculator for 2x2
- http://members.aol.com/johnp71/ctab2x2.html

On Your Own

- Rerun your logistic regressions as discriminant analyses
- Play with different cutoff conditions
- .5 vs. predicted from group size for discriminant
- Set your own value for logistic regression

Questions and Answers

Logistic vs. Discriminant

- Logistic
- Accommodates categorical predictors with > 2 groups
- Fewer assumptions
- More robust
- Easier to use?
- Discriminant
- Easier to interpret?
- More classification features
- Can accommodate costs of misclassification

Reporting Results

- Logistic regression coefficients (and their anti-logs) are difficult to convey graphically
- Positive-impact values (above 1) range considerably
- Negative-impact values (below 1) have a limited range
- Delta-P is an alternative
- Change in the probability of the outcome given a unit change in the predictor
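Delta-P evaluates the coefficient's effect on the probability scale at a chosen baseline; the coefficient here is hypothetical:

```python
import math

def delta_p(b, base_p=0.5):
    # change in predicted probability for a unit change in the predictor,
    # evaluated at a chosen baseline probability
    base_logit = math.log(base_p / (1 - base_p))
    return 1 / (1 + math.exp(-(base_logit + b))) - base_p

print(round(delta_p(0.8), 3))            # 0.19 at a .5 baseline
print(delta_p(0.8, 0.9) < delta_p(0.8))  # True: smaller shift near the ceiling
```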

Reporting Results

- The classification table and some of the related measures are usually the most effective way to convey the usefulness of results
- As with all higher-level analyses, the most important point is to interpret in the context of real decisions
- E.g., the impact of changing a selection index cutoff, in terms of entering class size and predicted change in retention rate

Some Reasonable Examples

- Smith and Nielsen, Longwood College
- http://www.longwood.edu/assessment/Retention_Analysis.htm
- DePauw University
- http://www.depauw.edu/admin/i_research/research/year1_02supp.pdf

Good Night!

- Read Chapter 5 of RIR Stats Volume
- To reinforce lessons for today and tomorrow
- If you are having trouble falling asleep

Cluster Analysis

- Any of a wide variety of numerical procedures that can be used to create a classification scheme
- Conceptually easy to understand and well suited to segmentation studies
- It is a heuristic algorithm, not supported by extensive statistical reasoning
- It is entirely data driven
- Sometimes yields inconsistent results

Cluster Analysis

- Creating groups out of whole cloth
- Drawing circles around points scattered in n-dimensional space

What Is a Cluster?

- A set of objects, or points, that are relatively close to each other and relatively far from points in other clusters
- This view tends to favor spherical clusters over ones of other shapes

Steps to Cluster Analysis

- Selecting variables
- Selecting a similarity or distance measure
- Choosing a clustering algorithm

Selecting Variables

- The most popular forms are based on measures of "similarity" according to some combination of attributes
- The choice of variables is one of the most critical steps
- Should be guided by an explicit theory, or at least solid reasoning
- Higher education researchers typically have ready access to certain types of student characteristics

Choosing a Similarity Measure

- Distance measures: spatial relationships
- Association measures: similarities or dissimilarities, using measures of association (e.g., correlation, contingency tables)
- The type of variable constrains the choice
- Nominal variables require either association coefficients or a decision-tree technique
- Continuous variables lend themselves to distance-type measures

Distance-Type Measures

- Several are cases of what is called the Minkowski metric
- Euclidean distance (r = 2): the distance between two points in n-dimensional space
- City-block metric (r = 1): the sum of differences along each measure
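Both cases fall out of one formula; this sketch evaluates the Minkowski metric for two made-up points:

```python
def minkowski(x, y, r):
    # r = 2 gives Euclidean distance, r = 1 the city-block metric
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

p, q = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]  # two made-up points in 3-D
print(minkowski(p, q, 2))  # 5.0 (straight-line distance)
print(minkowski(p, q, 1))  # 7.0 (sum of coordinate differences)
```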

Distance-Type Measures

- Another common distance measure is Mahalanobis D², which takes into account correlations among the predictors

Standardized vs. Unstandardized Measures

- One must be careful about the implications of using standardized vs. unstandardized measures in computing these distances

Matching-Type Measures

- Association coefficients
- The only game in town when the predictors are nominally scaled
- The predictor variables are usually converted to binary indicators
- Similarity coefficients are a form of matching-type measure based on a series of binary variables that represent the presence or absence of a trait

Contingency Table-based Similarity Coefficients

- Possible coefficients differ according to:
- How negative matches (0,0) are incorporated
- Whether matched pairs are equally weighted or doubled
- Whether unmatched pairs carry twice the weight of matched pairs
- Whether negative matches are excluded altogether

Contingency-Table Measures

- (a+d)/(a+b+c+d): matching coefficient
- a/(a+b+c+d): Russel/Rao index
- a/(a+b+c): Jaccard coefficient
- 2a/(2a+b+c): Dice's coefficient
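These four coefficients differ only in how the match counts enter the ratio; the counts below are made up:

```python
# Binary-trait matches between two objects:
# a = both present, b and c = mismatches, d = both absent (made-up counts)
a, b, c, d = 4, 1, 2, 3

matching = (a + d) / (a + b + c + d)  # negative matches count as agreement
russel_rao = a / (a + b + c + d)      # negative matches only in denominator
jaccard = a / (a + b + c)             # negative matches excluded entirely
dice = 2 * a / (2 * a + b + c)        # matched pairs double-weighted

print(matching, russel_rao)               # 0.7 0.4
print(round(jaccard, 3), round(dice, 3))  # 0.571 0.727
```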

Correlation Coefficients

- Pearson r, Spearman r, etc.
- Correlation is across variables and between each pair of objects

Across Variable Correlation

Standard Across Person Correlation

The Distance Matrix

- Regardless of method, the first step in cluster analysis is to produce a distance matrix
- A row and column for each object
- Cells represent the distance or similarity measure between each pair
- Symmetric, with a diagonal of 0's for distance matrices or 1's for similarity measures
- This is what makes cluster analyses like these so computationally intensive
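The matrix itself is straightforward to build; a small Python sketch with Euclidean distance and three made-up objects:

```python
import math

def distance_matrix(objects):
    # one row and column per object; symmetric, zeros on the diagonal
    n = len(objects)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = math.dist(objects[i], objects[j])
    return m

pts = [(0, 0), (3, 4), (6, 8)]  # three made-up objects in 2-D
dm = distance_matrix(pts)
print(dm[0][1], dm[1][2], dm[0][2])  # 5.0 5.0 10.0
```

The number of cells grows with the square of the number of objects, which is why the slide calls these analyses computationally intensive.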

Choosing a Clustering Algorithm

- Hierarchical algorithms
- Agglomerative methods start with each object in its own cluster and then merge points and clusters until some criterion is reached
- Single linkage (nearest neighbor)
- Complete linkage (furthest neighbor)
- Average linkage
- Ward's error sum of squares
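The agglomerative idea can be sketched in a few lines; this toy single-linkage (nearest neighbor) version uses made-up points and stops at a target cluster count:

```python
import math

def single_linkage(points, k):
    # agglomerative clustering: start with each point in its own cluster and
    # repeatedly merge the two clusters whose nearest members are closest,
    # stopping when k clusters remain
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]  # made-up points
out = single_linkage(pts, 2)
print(sorted(len(c) for c in out))  # [2, 3]
```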

Choosing a Clustering Algorithm

- Hierarchical algorithms (continued)
- Divisive methods start with one group of the whole and partition objects into smaller clusters until some criterion is reached
- Splinter-average distance
- Decision tree methods
- Partitioning algorithms
- K-means clustering
- Trace-based methods
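K-means is the most common partitioning algorithm; a minimal sketch on two made-up blobs of points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # partitioning algorithm: assign each point to its nearest centroid,
    # then recompute centroids, repeating for a fixed number of passes
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return groups

# two well-separated blobs of three points each (made-up data)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
out = kmeans(pts, 2)
print(sorted(len(g) for g in out))  # [3, 3]
```

Unlike the hierarchical methods, k must be chosen in advance, and different starting centroids can yield different partitions, which is one source of the inconsistency noted earlier.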

Peer Institution Example

- Variables Derived from IPEDS

Create Proximity Matrix

- Screen institutions to a manageable number (< 300)
- Select Classify > Hierarchical Cluster
- Place predictors in the Variables box
- Under Method, choose Z scores in the Standardize box (by variable)
- Paste the syntax
- Erase the Cluster procedure
- Change the proximity matrix file name so you can find it
- Run it

Using Proximity Matrix

- Find the target institution (sort by name)
- Identify its Varname and find the target column
- Get rid of excess columns
- Sort (ascending) by the varname column
- VOILA! Institutions are now sorted by similarity to the target

Graphical Clustering Methods

- Glyphs, Metroglyphs, Fourier series, and Chernoff faces

Decision Trees

- A hybrid between clustering and discriminant analysis
- The criterion variable does not define the groups
- But the groups are defined so as to maximize differences according to the criterion
- The purpose is to identify key variables for distinguishing among groups and formulating group membership prediction rules

Functions of Decision Trees

- Derive decision rules from data
- Develop a classification system to predict future observations
- Illustrate these through a decision tree
- Discretize continuous variables

SPSS AnswerTree

- Three decision tree algorithms
- All use "brute force" methods

Common Features

- Merging categories of the predictor variables so that non-significantly different values are pooled together
- Splitting the variables at points that maximize differences
- Stopping the branching when further splits do not contribute significantly
- Pruning branches from an existing tree
- Validation and error estimation

CHAID

- Not just binary splits
- Handles nominal, ordinal, and continuous variables
- Useful for discretizing continuous variables
- Demo and sample output included within session support materials
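CHAID itself merges categories using chi-squared tests; the sketch below is not CHAID, just a brute-force illustration of the splitting step, the "discretizing" idea: it tries every midpoint of a continuous predictor (hypothetical GPA and retention values) and keeps the cut that misclassifies least.

```python
def best_split(x, y):
    # try every midpoint between adjacent sorted predictor values and keep
    # the cut whose majority-vote sides misclassify the fewest cases
    pairs = sorted(zip(x, y))
    best = (None, len(y))
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value < cut]
        right = [label for value, label in pairs if value >= cut]
        errors = min(left.count(0), left.count(1)) + min(right.count(0), right.count(1))
        if errors < best[1]:
            best = (cut, errors)
    return best

gpa = [1.8, 2.0, 2.2, 2.4, 3.0, 3.2, 3.4, 3.6]  # hypothetical predictor
retained = [0, 0, 0, 1, 1, 1, 1, 1]             # hypothetical criterion
cut, errs = best_split(gpa, retained)
print(round(cut, 1), errs)  # 2.3 0
```

A real tree algorithm applies this kind of search recursively within each branch and uses a statistical stopping rule rather than raw error counts.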


Playing with CHAID, etc.

- Work with the retention dataset
- Use retention status as the criterion
- Use semester GPA as the criterion
- Try the newer institutional dataset
- Use graduation rate as the criterion
- Try nearest neighbor analysis with the newer data

Final Thoughts

- The flexibility of logistic regression models makes them the coin of the realm
- E.g., multinomial logistic and HLM regression
- Cluster analysis is so data driven as to make its use fairly limited
- The threshold approach to peer identification is much more popular, but it's always good to run things multiple ways (see the IR Primer peer chapter)
- CHAID is fun to play with and informative
