Clustering and Grouping Analysis - PowerPoint PPT Presentation (84 slides)

Transcript and Presenter's Notes

Clustering and Grouping Analysis
Victor M. H. Borden, Ph.D.
Associate Vice Chancellor, Information Management and Institutional Research
Associate Professor of Psychology
Indiana University Purdue University
  • Methods for Identifying Groups and Determining
    How Groups Differ

  • To provide a working knowledge of three techniques
  • Logistic Regression
  • Discriminant Analysis
  • Cluster Analysis
  • To demonstrate their application to common
    institutional research issues
  • student market segmentation
  • student retention
  • faculty workload

Learning Objectives
  • Understand the fundamental concepts of logistic
    regression, cluster analysis, and discriminant analysis
  • Determine when to use appropriate variations of
    each technique
  • Understand the data requirements for performing
    these analyses
  • Use SPSS software to perform basic logistic,
    cluster and discriminant analyses

Learning Objectives
  • Know how to interpret the tabular and graphical
    outputs of these procedures to evaluate the
    validity and reliability of various solutions
  • Prepare reports on the results of these analyses
    for professional or lay audiences
  • Understand the relationship between these and
    related statistical methods

Workshop Pre-requisites
  • Basic Statistics
  • General Linear Models
  • Statistical Software
  • Institutional Research

Workshop Method
  • Introduction to basics using an IR example
  • On your own exercise using IR datasets
  • Discussion of methods and issues as you
    experience them

Workshop Schedule Day 1
  • Introduction and overview (15)
  • Logistic regression
  • Basic concepts with example (45)
  • On your own examples (30)
  • Break (30)
  • Discriminant analysis
  • Basic concepts with example (30)
  • On your own examples (30)
  • Logistic regression v. discriminant analysis

Workshop Schedule Day 2
  • Cluster Analysis
  • Basic concepts with example (45)
  • Example Peer institution identification (30)
  • Break (30)
  • Decision Tree Techniques
  • Basic concepts with example (30)
  • Free play (45)

  • Analyzing differences among existing groups
  • Extending the regression model to look at a
    (dichotomous) group outcome variable
  • Logistic regression
  • Discriminant analysis
  • Identifying groups out of whole cloth
  • Cluster analysis
  • Focus on proximity aspect
  • Decision trees as a hybrid model

Workshop Datasets
Existing Group Differences
  • The outcome of interest is membership in a group
  • Retained vs. non-returning students
  • Admits who matriculate vs. those who don't
  • Alums who donate vs. those who don't
  • Faculty who get grants vs. those who don't
  • Institutions that get sanctioned for assessment
    on their accreditation visits vs. those that don't
  • Class sections that meet during the day vs.
    those that meet in the evening

Three Basic Questions
  • Which, if any, of the variables are useful for
    predicting group membership?
  • What is the best combination of variables to
    optimize predictions among the original sample?
  • How useful is that combination for classifying
    new cases?

One or More Groups
  • Group outcomes can be dichotomous or polychotomous
  • Logistic regression and discriminant analysis
    can handle both
  • We will focus on the dichotomous case, with only
    lip service to the polychotomous situation

Examining Group Differences
  • Why not t-test or ANOVA?
  • Group factor is the dependent variable (outcome), not
    the independent variable (predictor, causal agent)
  • No random assignment to group
  • But we always violate that assumption
  • Requires normal distribution of outcome

The Linear Regression Problem
  • Group membership as the outcome (dependent)
    variable violates an important assumption that
    has serious consequences
  • Under certain conditions the problems are not
    completely debilitating
  • Group membership evenly distributed
  • Predictors are all solid continuous/normal

Two Regression-Based Solutions
  • Logistic regression
  • Transforms the outcome into a continuous log odds (logit)
  • Readily accommodates continuous and categorical
  • Interpretations differ substantially from OLS
    linear form
  • Includes classification matrix
  • Discriminant analysis
  • Uses standard OLS procedures
  • Requires continuous/normal predictors
  • Interpretations similar to OLS linear regression
  • Includes classification matrix

Remember OLS Linear Form
  • Finding linear equation that fits pattern best

OLS Linear Form
  • Overall fit of model
  • Significance of model
  • Predictor (b) coefficients

The Group Outcome Problem
  • Y equals either 0 or 1
  • Predictions can be > 1 or < 0
  • Coefficients may be biased
  • Heteroscedasticity is present
  • Error term is not normally distributed so
    hypothesis tests are invalid

The Logistic Regression Solution
  • Use a different method for estimating parameters
  • Maximum Likelihood Estimates (MLE) instead of OLS
  • Maximizes the probability that predicted Y equals
    observed Y
  • Transforms the outcome variable into a form that
    is continuous/normal
  • The natural log of the odds ratio or logit
  • Ln(P/(1-P)) = a + b1x1 + b2x2

The Odds Ratio
  • The probability of being in one group relative to
    the probability of being in the other group
  • If P(group 1) = .5, the odds ratio is 1 (.5/.5)
  • If the retention rate is 80%, the odds ratio is
    0.8/0.2 = 4 (odds of 4 to 1 of being retained)
  • If the yield rate is 67%, the odds ratio is
    0.67/0.33 = 2, or 2 to 1
  • If 12.5% of alums donate, the odds ratio is
    0.125/0.875 = .143, or 1 to 7
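The arithmetic above is easy to check with a few lines of Python (a quick sketch, not part of the original workshop materials):

```python
# Odds of being in the target group: P / (1 - P).
def odds(p):
    return p / (1.0 - p)

print(round(odds(0.5), 3))    # 1.0   (even odds)
print(round(odds(0.8), 3))    # 4.0   (4 to 1 of being retained)
print(round(odds(0.125), 3))  # 0.143 (about 1 to 7)
```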

Predictors and Coefficients
  • Predictors can be continuous or categorical
    (dummy) variables
  • Coefficient shows the change in Ln(P/(1-P)) for a
    unit change in the predictor
  • Can be converted into marginal effects: the effect
    on the probability that Y = 1 for a unit change in X
  • Not easy to explain, but
  • Can talk in general terms (positive, negative)
  • Have classification statistics that are more
    readily understood

Logistic Regression in SPSS
  • Retention dataset

Retention Example Output
  • Omnibus Tests
  • Model Summary
  • Classification table

Retention Example Output
  • Predictor statistics

Interpreting Logistic Reg Output
  • Omnibus Tests
  • Overall significance of model
  • Relative performance of one model v. another
  • Model summary
  • Goodness of fit R2 Statistics
  • Several versions, none of which is a true R2
  • Classification table
  • Ability to successfully classify from prediction
  • Remember prediction is probability that then has
    to be categorized

Interpreting Coefficients
  • B value is the change in ln(odds ratio) for a unit
    change in the predictor
  • S.E. is the error in the predictor
  • Relates to significance and estimation
  • Wald statistic is like the t-value in OLS linear regression
  • Has corresponding significance level
  • Can be incorrect for large coefficients
  • Exp(B) is the marginal effect
  • effect on the probability that Y = 1 for a unit change in X

Interpreting Coefficients
A unit change in SATACT increases the odds that
the student will be retained by a factor of 1;
that is, not at all.
A full-time student is twice as likely to be
retained as a part-time student.
A unit (full grade) change in GPA increases by
more than a factor of 2 the likelihood that a
student will be retained.
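The "factor of 2" interpretation can be sketched in Python. Note that this is illustrative only: the coefficient and baseline probability below are assumed for the example, not taken from the workshop's actual SPSS output.

```python
import math

b_gpa = 0.8   # assumed logit coefficient for a one-grade GPA change
exp_b = math.exp(b_gpa)
print(round(exp_b, 2))  # Exp(B): the odds multiply by about 2.23 per GPA unit

# Converting a coefficient into a change in probability depends on the
# baseline probability, which is why Exp(B) alone is hard to explain:
p0 = 0.70                 # assumed baseline retention probability
odds0 = p0 / (1 - p0)     # baseline odds
odds1 = odds0 * exp_b     # odds after a one-unit GPA increase
p1 = odds1 / (1 + odds1)  # back to a probability
print(round(p1, 2))       # new predicted retention probability, ~0.84
```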
On Your Own
  • Try different variables or entry methods on
    retention data
  • Predict admissions yield status with application
    data set
  • Predict full- vs. part-time faculty status with
    faculty data set
  • Distinguish between two Carnegie categories on
    institutional data set
  • Don't forget to select only two groups

Questions and Answers
Discriminant Analysis
  • Closer to linear regression in form and interpretation
  • Predictors must be continuous or dichotomous
  • Logistic regression can handle polychotomous
    categorical variables
  • Can be used for multi-group outcome
  • Generates k-1 orthogonal solutions for k groups

Requirements for Discriminant
  • Two or more groups
  • At least two cases per group
  • Any number of discriminating variables, provided
    that it is less than the total number of cases
    minus 2
  • Discriminating variables are on an interval or
    ratio scale

Requirements for Discriminant
  • No discriminating variable can be a linear
    combination of any other discriminating variable
  • The covariance matrices for each group must be
    approximately equal, unless special formulas are used
  • Each group has been drawn from a population with
    a multivariate normal distribution on the
    discriminating variables

Interpreting Group Differences
  • Which variables best predict differences?
  • What is the best combination of predictors? The
    canonical discriminant function

The Discriminant Function
  • Derivation
  • Like regression, maximize the between-groups sum
    of squares, relative to the within-groups sum of
    squares for the value D
  • Interpretation
  • Overall function statistics
  • Predictor variable statistics

Retention Example
  • Overall model statistics
  • Eigenvalue/Canonical Correlation
  • Wilks' Lambda
  • Lambda = 1 - (canonical correlation)^2

Predictor Coefficients
  • Standardized Discriminant Coefficients
  • Variables with largest (absolute) coefficient
    contribute most to prediction of group membership
  • Sign is direction of effect

Retention Coefficients
  • Structure coefficients
  • Correlation between each predictor and overall
    discriminant function

Classification in Discriminant
  • Prior probabilities
  • Can be .5 or set by size of group
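The difference between the two cutoff choices can be shown with a toy example (a sketch; the probabilities below are hypothetical, not from the workshop datasets):

```python
# With a .5 cutoff the two groups are treated as equally likely;
# with priors set from group size, the predicted probability is
# compared against the base rate instead.
p_hat = 0.62           # a model's predicted probability of retention
print(p_hat >= 0.5)    # True: classified as retained under a .5 cutoff

prior = 0.80           # assumed base retention rate (group-size prior)
print(p_hat >= prior)  # False: classified as not retained under the prior
```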

Classification in Discriminant
  • Accuracy within each group as well as overall

The Classification Matrix
  • Comparing actual group membership against
    predicted group membership, using the
    classification function
  • Can have an "unknown" region
  • Split samples can (should?) be used to further
    test the accuracy of classification

The Classification Matrix
  • Measures of interest include
  • Overall prediction accuracy
  • (a+d)/N
  • Sensitivity: accuracy among positive cases
  • a/(a+c)
  • Specificity: accuracy among negative cases
  • d/(b+d)
  • False positive rate
  • b/(b+d)
  • False negative rate
  • c/(a+c)
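These rates are straightforward to compute once the 2x2 cells are labeled a (true positives), b (false positives), c (false negatives), and d (true negatives). A quick Python sketch, with made-up counts:

```python
# Classification-matrix rates for a 2x2 table with cells
# a (true +), b (false +), c (false -), d (true -).
def rates(a, b, c, d):
    N = a + b + c + d
    return {
        "accuracy": (a + d) / N,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive_rate": b / (b + d),
        "false_negative_rate": c / (a + c),
    }

r = rates(a=40, b=10, c=5, d=45)  # hypothetical counts
print(r["accuracy"])              # 0.85
print(round(r["sensitivity"], 3)) # 0.889
```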

The Confusion Matrix
  • But that's not all
  • Prevalence: (a+c)/N
  • Overall diagnostic power: (b+d)/N
  • Positive predictive power: a/(a+b)
  • Negative predictive power: d/(c+d)
  • Misclassification rate: (b+c)/N
  • Odds ratio: (ad)/(bc)
  • Kappa: ((a+d) - (((a+c)(a+b) + (b+d)(c+d))/N)) /
    (N - (((a+c)(a+b) + (b+d)(c+d))/N))
  • NMI: 1 - (-a.ln(a) - b.ln(b) - c.ln(c) - d.ln(d) +
    (a+b).ln(a+b) + (c+d).ln(c+d)) /
    (N.ln(N) - ((a+c).ln(a+c) + (b+d).ln(b+d)))
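The kappa and NMI formulas are easier to read as code. A sketch (same hypothetical cell counts as before; all cells must be nonzero for NMI, since it takes logs of every cell):

```python
import math

def kappa(a, b, c, d):
    # Chance-corrected agreement: (observed - expected) / (N - expected)
    N = a + b + c + d
    expected = ((a + c) * (a + b) + (b + d) * (c + d)) / N
    return ((a + d) - expected) / (N - expected)

def nmi(a, b, c, d):
    # Normalized mutual information for a 2x2 classification table
    N = a + b + c + d
    ln = math.log
    num = (-a * ln(a) - b * ln(b) - c * ln(c) - d * ln(d)
           + (a + b) * ln(a + b) + (c + d) * ln(c + d))
    den = N * ln(N) - ((a + c) * ln(a + c) + (b + d) * ln(b + d))
    return 1 - num / den

print(round(kappa(40, 10, 5, 45), 2))  # 0.7
print(round(nmi(40, 10, 5, 45), 2))
```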

Adjusting for Prior Probabilities or the "Costs"
of Misclassification
  • Methods so far have considered each group equally
  • One can take into account known differences in
    group composition
  • This usually takes one of two forms
  • Prior information regarding the likely
    distribution of group size
  • There are known higher "costs" of misclassifying
    objects into one group compared to the other

.5 v. Group Size Cutoffs-Discrim
.5 v. Group Size Cutoffs-Logistic
Logistic v. Discrim Classification
  • Classification measure calculator for 2x2
  • http//

On Your Own
  • Rerun your logistic regressions as discriminant
  • Play with different cutoff conditions
  • .5 vs. predicted from group size for discriminant
  • Set your own value for logistic regression

Questions and Answers
Logistic vs. Discriminant
  • Logistic
  • Accommodates categorical predictors with > 2 levels
  • Fewer assumptions
  • More robust
  • Easier to use?
  • Discriminant
  • Easier to interpret?
  • More classification features
  • Can accommodate costs of misclassification

Reporting Results
  • Logistic regression coefficients (and their
    anti-logs) are difficult to convey graphically
  • Positive impact: values above 1, an unbounded range
  • Negative impact: values below 1 have a limited
    range (0 to 1)
  • Delta P is an alternative
  • Change in probability of outcome given unit
    change in predictor

Reporting Results
  • Classification table and some of the related
    measures are usually most effective way to convey
    usefulness of results
  • As with all higher level analyses, the most
    important point is to interpret in context of
    real decisions
  • E.g., impact of changing selection index cutoff
    in terms of entering class size and predicted
    change in retention rate

Some Reasonable Examples
  • Smith and Nielsen, Longwood College
  • http//
  • DePauw University
  • http//

Good Night!
  • Read Chapter 5 of RIR Stats Volume
  • To reinforce lessons for today and tomorrow
  • If you are having trouble falling asleep

Cluster Analysis
  • Any of a wide variety of numerical procedures
    that can be used to create a classification
  • Conceptually easy to understand and well suited
    to segmentation studies
  • It is a heuristic algorithm, not supported by
    extensive statistical reasoning
  • It is entirely data driven
  • Sometimes yields inconsistent results

Cluster Analysis
  • Creating groups out of whole cloth
  • Drawing circles around points scattered in
    n-dimensional space

What Is a Cluster?
  • A set of objects or points that are relatively
    close to each other and relatively far from
    points in other clusters.
  • This view tends to favor spherical clusters
    over ones of other shapes

Steps to Cluster Analysis
  • Selecting variables
  • Selecting a similarity or distance measure
  • Choosing a clustering algorithm

Selecting Variables
  • Most popular forms are based on measures of
    "similarity" according to some combination of variables
  • The choice of variables is one of the most
    critical steps
  • Should be guided by an explicit theory or at
    least solid reasoning
  • Higher education researchers typically have ready
    access to certain types of student data

Choosing a Similarity Measure
  • Distance measures Spatial relationship
  • Association measures Similarities or
    dissimilarities, using measures of association
    (e.g., Correlation, contingency tables)
  • The type of variable constrains the choice
  • Nominal variables require either association
    coefficients or a decision-tree technique
  • Continuous variables lend themselves to
    distance-type measures

Distance-Type Measures
  • Several are cases of what is called the Minkowski metric
  • Euclidean distance (r = 2): distance between two
    points in n-dimensional space
  • City-block metric (r = 1): the sum of absolute
    differences along each dimension
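Both are special cases of one formula, which can be sketched in a few lines of Python (illustrative, not from the workshop materials):

```python
# Minkowski distance of order r between two points:
# r = 2 gives Euclidean distance, r = 1 the city-block metric.
def minkowski(x, y, r):
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

p, q = (0, 0), (3, 4)
print(minkowski(p, q, 2))  # 5.0 (Euclidean: the 3-4-5 triangle)
print(minkowski(p, q, 1))  # 7.0 (city block: 3 + 4)
```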

Distance-Type Measures
  • Another common distance measure is Mahalanobis
    D2, which takes into account correlations among
    the predictors

Standardized vs. Unstandardized Measures
  • One must be careful about the implications of
    using standardized vs. unstandardized measures in
    computing these distances.

Matching-Type Measures
  • Association coefficients
  • The only game in town when the predictors are
    nominally scaled.
  • The predictor variables are usually converted to
    binary indicators.
  • Similarity Coefficients are a form of matching
    type measure based on a series of binary
    variables that represent the presence or absence
    of a trait.

Contingency Table-based Similarity Coefficients
  • Possible coefficients differ according to:
  • How negative matches (0,0) are incorporated
  • Whether matched pairs are equally weighted or
    weighted differently
  • Whether unmatched pairs carry twice the weight of
    matched pairs
  • Whether negative matches are excluded altogether

Contingency-Table Measures
  • (a+d)/(a+b+c+d): matching coefficient
  • a/(a+b+c+d): Russell/Rao index
  • a/(a+b+c): Jaccard coefficient
  • 2a/(2a+b+c): Dice's coefficient
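Here a counts positive matches (1,1), b and c the two kinds of mismatch, and d the negative matches (0,0). A quick sketch with two hypothetical five-trait binary profiles shows how the coefficients treat negative matches differently:

```python
def matching(a, b, c, d):
    return (a + d) / (a + b + c + d)   # counts negative matches

def russell_rao(a, b, c, d):
    return a / (a + b + c + d)         # negative matches in denominator only

def jaccard(a, b, c, d):
    return a / (a + b + c)             # ignores negative matches

def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)     # double-weights positive matches

x = [1, 1, 0, 0, 1]  # hypothetical presence/absence profiles
y = [1, 0, 0, 1, 1]
a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # 2
b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # 1
c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # 1
d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # 1
print(jaccard(a, b, c, d))   # 0.5: the (0,0) trait is ignored
print(matching(a, b, c, d))  # 0.6: the (0,0) trait counts as agreement
```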

Correlation Coefficients
  • Pearson r, Spearman r, etc.
  • Correlation is across variables and between each
    pair of objects

Across Variable Correlation
Standard Across Person Correlation
The Distance Matrix
  • Regardless of method, the first step in cluster
    analysis is to produce a distance matrix.
  • A row and column for each object
  • Cells represent the distance or similarity
    measure between each pair.
  • Symmetric with diagonal of 0's for distance
    matrices, or 1's for similarity measures.
  • This is what makes cluster analyses like these so
    computationally intensive.
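The structure described above can be built by hand for a tiny example (a sketch, not the SPSS procedure the workshop uses):

```python
# A distance matrix: one row and column per object, Euclidean distance
# in each cell, zeros on the diagonal, symmetric overall. For n objects
# this means n*(n-1)/2 distinct distances, which is why clustering gets
# computationally intensive as n grows.
objects = {"A": (1.0, 2.0), "B": (1.0, 3.0), "C": (4.0, 6.0)}

def euclid(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

names = sorted(objects)
matrix = [[euclid(objects[row], objects[col]) for col in names]
          for row in names]
print(matrix[0][0])                  # 0.0 (diagonal)
print(matrix[0][1])                  # 1.0 (A to B)
print(matrix[0][2] == matrix[2][0])  # True (symmetric)
```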

Choosing a Clustering Algorithm
  • Hierarchical algorithms
  • Agglomerative methods start with each object in
    its own cluster and then merge points and
    clusters until some criterion is reached
  • Single linkage (nearest neighbor)
  • Complete linkage (furthest neighbor)
  • Average linkage
  • Ward's error sum of squares
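The agglomerative idea can be sketched in pure Python with single linkage (a bare-bones illustration; real packages use far more efficient algorithms):

```python
# Single-linkage agglomerative clustering: start with every object in
# its own cluster and repeatedly merge the two closest clusters
# (closest = nearest pair of members) until k clusters remain.
def single_linkage(points, k):
    clusters = [[p] for p in points]

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        # Nearest-neighbor rule: distance between the closest pair
        return min(dist(p, q) for p in c1 for q in c2)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
out = single_linkage(pts, 2)
print(sorted(len(c) for c in out))  # [2, 2]: two tight pairs
```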

Choosing a Clustering Algorithm
  • Hierarchical algorithms (continued)
  • Divisive methods start with one group of the
    whole and partition objects into smaller
    clusters until some criterion is reached.
  • Splinter-average distance
  • Decision tree methods
  • Partitioning algorithms
  • K-means clustering
  • Trace-based methods
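K-means, the most common partitioning algorithm, can be sketched in a few lines (illustrative only; the starting centroids below are fixed by hand, whereas real implementations choose and restart them):

```python
# Minimal k-means: assign each point to its nearest centroid, recompute
# each centroid as the mean of its cluster, repeat until stable.
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
cents, groups = kmeans(pts, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(cents)  # [(0.0, 0.5), (5.0, 5.5)]
```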

Peer Institution Example
  • Variables Derived from IPEDS

Create Proximity Matrix
  • Screen institutions to a manageable number
  • Select Classify > Hierarchical Cluster
  • Place predictors in the Variables box
  • Under Method, choose Z scores in the Standardize
    box (by variable)
  • Paste syntax
  • Erase the Cluster procedure
  • Change the proximity matrix file name so you can
    find it
  • Run it

Using Proximity Matrix
  • Find target institution (sort by name)
  • Identify Varname and find target column
  • Get rid of excess columns
  • Sort (ascending) by varname column
  • VOILA! Institutions are now sorted by similarity
    to the target

Graphical Clustering Methods
  • Glyphs, Metroglyphs, Fourier series, and Chernoff faces

Decision Trees
  • Hybrid between clustering and discriminant analysis
  • The criterion variable does not define the groups
  • But the groups are defined so as to maximize
    differences according to the criterion.
  • The purpose is to identify key variables for
    distinguishing among groups and formulating group
    membership prediction rules.

Functions of Decision Trees
  • Derive decision rules from data
  • Develop classification system to predict future cases
  • Illustrate these through a decision tree
  • Discretize continuous variables
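The core operation behind all of these is finding a good split point on a predictor. A sketch of that step, scored here with weighted Gini impurity (AnswerTree's algorithms use chi-square and related criteria, but the idea is the same; the GPA data below are made up):

```python
# Gini impurity of a set of 0/1 labels: 0 when pure, .5 when 50/50.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

# Find the cut point on a continuous predictor that minimizes the
# weighted impurity of the two resulting groups (discretization).
def best_split(xs, ys):
    best = (None, float("inf"))
    for cut in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best[1]:
            best = (cut, score)
    return best

gpa = [1.5, 1.8, 2.0, 2.9, 3.2, 3.6]  # hypothetical GPAs
ret = [0, 0, 0, 1, 1, 1]              # retained (1) / not retained (0)
print(best_split(gpa, ret))  # (2.0, 0.0): a perfect split at GPA <= 2.0
```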

SPSS AnswerTree
  • Three decision tree algorithms
  • All use "brute force" methods

Common Features
  • Merging categories of the predictor variables so
    that non-significantly different values are
    pooled together
  • Splitting the variables at points that maximize
    group differences
  • Stopping the branching when further splits do not
    contribute significantly
  • Pruning branches from an existing tree
  • Validation and error estimation

  • Not just binary splits
  • Handles nominal, ordinal, and continuous variables
  • Useful for discretizing continuous variables
  • Demo and sample output included within session
    support materials

Playing with CHAID, etc.
  • Work with retention dataset
  • Use retention status as the criterion
  • Use semester GPA as the criterion
  • Try the newer institutional dataset
  • Use graduation rate as the criterion
  • Try nearest neighbor analysis with newer data

Final Thoughts
  • The flexibility of logistic regression models
    makes them the coin of the realm
  • E.g., Multinomial logistic HLM regression
  • Cluster analysis is so data driven as to make its
    use fairly limited
  • The threshold approach to peer identification is
    much more popular, but it's always good to run
    things multiple ways (see the IR Primer peer chapter)
  • CHAID is fun to play with and informative
