1
Multivariate Coarse Classing of Nominal Variables
  • Geraldine E. Rosario
  • Talk given at Fair Isaac on July 14, 2003
  • Based on the paper "Mapping Nominal Values to Numbers
    for Effective Visualization," InfoVis 2003.

2
Outline
  • Motivation
  • Overview of Distance-Quantification-Classing
    approach
  • Algorithmic Details
  • Experimental Evaluation
  • Wrap-Up

3
Those pesky nominal variables
  • Nominal variable: a variable whose values do not
    have a natural ordering or distance
  • High-cardinality nominal variable: one with a large
    number of distinct values
  • Examples?
  • Examples of business applications using nominal
    variables?
  • Why do you usually pre-process/transform them
    before doing data analysis?

4
Visualizing Nominal Variables
  • Most data visualization tools are designed for
    numeric variables.
  • What if the variable is nominal?
  • Most tools designed for nominal variables cannot
    handle a large number of values.

5
Quantified Nominal Variables
Are the order and spacing of values within each
variable believable?
6
Coarse Classing Nominal Variables
  • Possible ways of classing nominal variables with
    high cardinality:
  • Domain expertise
  • Univariate: using information about the variable
    itself, e.g., based on frequency of occurrence of
    the attributes
  • Bivariate: using information from one other
    variable, e.g., relationship with a predictor
    variable
  • Multivariate: based on the profile across several
    other variables, e.g., using cluster analysis
  • Is multivariate coarse classing better?

7
The approach
8
Proposed Approach
  • Pre-process nominal variables using a
    Distance-Quantification-Classing (DQC) approach
  • Steps (a toy end-to-end sketch follows below):
  • Distance: transform the data so that the distance
    between 2 nominal values can be calculated (based
    on the variable's relationship with other variables)
  • Quantification: assign order and spacing to the
    nominal values
  • Classing, or intra-dimension clustering: determine
    which values are similar to each other and can be
    grouped together
  • Each step can be done by more than one technique.
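To make the three steps concrete, here is a toy Python sketch of a DQC-style pipeline. It is an assumed simplification, not the paper's algorithm: distance via row profiles of a counts table, quantification via the first SVD dimension (the paper uses correspondence analysis), and classing via hierarchical clustering.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.DataFrame({                      # toy data, made up
    "color":   ["blue", "green", "red", "blue", "white", "red"],
    "quality": ["good", "ok",    "bad", "good", "ok",    "bad"],
})

# Distance: profile of each color across quality (row percentages).
counts = pd.crosstab(df["color"], df["quality"])
profiles = counts.div(counts.sum(axis=1), axis=0)

# Quantification: first SVD dimension of the centered profiles.
centered = (profiles - profiles.mean(axis=0)).to_numpy()
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scale = pd.Series(u[:, 0] * s[0], index=counts.index)

# Classing: hierarchical clustering of the 1-D quantified values.
tree = linkage(scale.to_numpy().reshape(-1, 1), method="ward")
groups = fcluster(tree, t=2, criterion="maxclust")
print(pd.DataFrame({"scale": scale, "group": groups}))
```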

9
Distance-Quantification-Classing Approach
10
Example: Input to Output
Task: pre-process COLOR based on its patterns
across QUALITY and SIZE.
Data:
  Quality (3 values): good, ok, bad
  Color (6 values): blue, green, orange, purple, red, white
  Size (10 values): a to j
11
Other Potential Uses of DQC as Pre-Processor
  • For techniques that require numeric inputs:
    linear regression, some clustering algorithms
    (can speed up calculations but with some loss of
    accuracy)
  • For techniques that require low-cardinality
    nominal variables: scorecards, neural networks,
    association rules
  • FICO-specific:
  • Multivariate coarse classing
  • ClusterBots: nominal variables could be
    quantified and distance calculations would be
    simpler. Could it be applied to mixed variables?
  • Product groups, merchant groups
  • Can you think of other uses?

12
Details Details
13
Distance Step: Correspondence Analysis
  • Used for analyzing n-way tables containing some
    measure of association between rows and columns
  • Simple Correspondence Analysis (SCA): for 2
    variables
  • Multiple Correspondence Analysis (MCA): for > 2
    variables. Uses SCA.
  • Focused Correspondence Analysis (FCA): proposed
    alternative to MCA when memory is limited. Uses
    SCA.
  • Reinvented as Dual Scaling, Reciprocal Averaging,
    Homogeneity Analysis, etc.
  • Similar to PCA but for nominal variables

14
Simple Correspondence Analysis: The Basic Idea
Calculate the χ² statistic (it measures the strength
of association between COLOR and QUALITY based on
the assumption of independence). Any deviation from
independence will increase the χ² value.
Can we find similar COLORs based on their
association with QUALITY? Colors with similar
profiles are candidates for grouping.
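As a quick illustration, the χ² test of independence on a COLOR-by-QUALITY counts table can be computed as below; the counts are made up for illustration (the slide's actual table is not in the transcript).

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: blue, green, orange, purple, red, white
# cols: good, ok, bad
counts = np.array([
    [10,  6,  5],
    [ 2,  8,  7],
    [ 9,  3,  2],
    [ 8,  5,  4],
    [ 3,  9,  8],
    [11,  2,  2],
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# Rows with similar profiles (e.g. blue and purple above) contribute
# similar deviations from the expected counts.
```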
15
Simple Correspondence Analysis: Steps
  • Normalize the counts table into row percentage and
    column percentage matrices. Similar row profiles:
    (blue, purple); similar column profiles: (ok, bad).
  • Identify a few independent dimensions which can
    reconstruct the χ² value (via SVD / eigen-analysis);
    this also yields the eigenvalues.
  • Scale the new dimensions such that the χ² distances
    between row points are maximized.

Coordinates for independent dimensions:

          Dim1    Dim2
  Blue   -0.02   -0.28
  Green  -0.54    0.14
  Orange  0.55    0.10
  Purple  0.00   -0.25
  Red    -0.50    0.20
  White   0.57    0.19
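A minimal numpy sketch of SCA on a counts table, following the standard SVD formulation in [Gre93]; this is an assumed implementation, not the presenter's code.

```python
import numpy as np

def simple_ca(N):
    """Simple correspondence analysis of a counts table N."""
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    # Standardized residuals from the independence model r*c.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]  # principal coordinates
    eigenvalues = sv ** 2                    # importance of each dimension
    # Only the first min(rows, cols) - 1 dimensions are meaningful.
    return row_coords, eigenvalues

N = np.array([[10.0, 6, 5], [2, 8, 7], [9, 3, 2]])   # toy counts
coords, ev = simple_ca(N)
```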
16
Simple Correspondence Analysis: The Output
  • Coordinates Matrix
  • Set of independent dimensions
  • Dimensions ordered by diminishing importance
  • Total # of independent dimensions = min(r, c) − 1
  • Similar to principal components from PCA
  • Eigenvalues
  • Indicates the importance of each independent
    dimension

17
Distance Step Alternative: Multiple
Correspondence Analysis
  • Steps
  • BurtTable(rawdataMatrix) → burtMatrix
  • SCA(burtMatrix) → coordMatrix, evaluesVector
  • ReduceNDim(coordMatrix, evaluesVector) →
    coordMatrixSubset
  • Input to SCA is the Burt table, which crosses all
    variables by all variables


[Figure: Burt table layout for variables X1, X2, X3 —
a grid of blocks where block (i, j) is the Xi-by-Xj
counts table, e.g. X1 by X1, X1 by X2, ...]
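Assuming the standard construction, the Burt table is Z′Z, where Z is the 0/1 indicator (dummy) matrix of all variables. A small pandas sketch:

```python
import pandas as pd

def burt_table(df):
    Z = pd.get_dummies(df).astype(int)   # indicator matrix, one column per value
    return Z.T @ Z                       # all pairwise counts tables at once

df = pd.DataFrame({
    "color":   ["blue", "green", "red", "blue"],
    "quality": ["good", "ok",    "bad", "good"],
})
print(burt_table(df))                    # diagonal blocks hold value frequencies
```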

18
Multiple Correspondence Analysis
  • Features
  • For a given variable, determines which values are
    similar to each other by comparing value profiles
    across all other variables
  • multivariate
  • maximizes usage of information
  • memory-intensive
  • Simultaneously analyzes all variables
  • efficient calculations

19
Reduce Number of Dimensions to Keep
  • Reduce the number of independent dimensions to
    keep for subsequent analysis (needed due to the
    large number of analysis variables and their high
    cardinality)

[Figure: scree plot of eigenvalue vs. dimension (1–5)]
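One common cutoff rule (an assumption here; the deck does not pin one down) keeps the leading dimensions that account for a fixed share of the total inertia, i.e. of the eigenvalue sum:

```python
import numpy as np

def n_dims_to_keep(eigenvalues, share=0.90):
    # Keep the smallest prefix of dimensions whose eigenvalues
    # cover `share` of the total.
    frac = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(frac, share) + 1)

print(n_dims_to_keep([0.40, 0.25, 0.15, 0.12, 0.08]))  # -> 4
```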
20
Distance Step Alternative: Focused Correspondence
Analysis
  • Proposed alternative to MCA when memory space is
    limited
  • Core idea: instead of comparing value profiles
    across all other nominal variables, just compare
    value profiles across the nominal variables which
    are most correlated with the target variable
  • Input to Simple CA:


[Figure: FCA input for target variable Xi — stacked
counts tables crossing Xi with its most-correlated
variables, e.g. Xi by X1, Xi by X3, ..., Xi by X9]
21
Focused Correspondence Analysis
  • Steps
  • PairwiseAssociate(rawdataMatrix) → assocMatrix
  • Set k (# of analysis variables to use)
  • FCATable(rawdataMatrix, k, assocMatrix) →
    fcaInputMatrix
  • SCA(fcaInputMatrix) → coordMatrix, evaluesVector
  • ReduceNDim(coordMatrix, evaluesVector) →
    coordMatrixSubset

22
FCA: Calculate Pairwise Association
  • Used the Uncertainty Coefficient U(R|C) to measure
    the strength of nominal association
  • Bounded in [0, 1]
  • U(R|C) = 1 ⇒ the value of row variable R can be
    known precisely given the value of column variable C
  • Example U(R|C) association matrix
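The uncertainty coefficient can be computed from a counts table via entropies, using the PROC FREQ definition U(R|C) = (H(R) + H(C) − H(RC)) / H(R). A small sketch (assumed implementation, not the presenter's):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def uncertainty_coefficient(counts):
    """U(row | col) for a rows-by-columns counts table."""
    P = counts / counts.sum()
    H_r = entropy(P.sum(axis=1))          # H(R)
    H_c = entropy(P.sum(axis=0))          # H(C)
    H_rc = entropy(P.ravel())             # H(RC)
    return (H_r + H_c - H_rc) / H_r

# Perfect association: rows fully determined by columns -> U = 1.
print(uncertainty_coefficient(np.array([[5.0, 0.0], [0.0, 7.0]])))
```

The full association matrix is then a double loop over variable pairs, each pair tabulated with pd.crosstab.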

23
FCA: Determine top k associated variables for
each nominal variable
  • Set k > 2 to ensure use of at least one analysis
    variable per target variable
  • Cannot use a threshold on the association measure

24
Focused Correspondence Analysis
  • Features
  • One-at-a-time analysis
  • Less/controllable memory usage
  • Sub-optimal quantification compared to MCA
  • Requires pre-processing step to determine top
    correlated variables per target variable
  • longer run time

25
Quantification Step: Modified Optimal Scaling
Nominal-to-numeric mapping.

Coordinates for independent dimensions:

          Dim1    Dim2
  Blue   -0.02   -0.28
  Green  -0.54    0.14
  Orange  0.55    0.10
  Purple  0.00   -0.25
  Red    -0.50    0.20
  White   0.57    0.19

Optimal Scaling goal: maximize the variance of the
scores of the records, where score_i = average_j(q_ij)
and q_ij is the quantified value of record i on
variable j.
26
Quantification Step: Modified Optimal Scaling
  • Problem with Optimal Scaling: perfect
    associations between variables are not recreated
    in the quantified versions
  • Modified Optimal Scaling:
  • Let p = # of eigenvalues equal to 1.0
  • If p > 1 then set …
  • Else set …

27
Classing Step: Hierarchical Cluster Analysis
Cluster analysis, weighted by counts, on the
coordinates from FCA.
28
Loss of Information due to Classing
  • Determine the variable V with the highest
    association with target X.
  • Create the X-by-V counts table.
  • Calculate the total table measure of association
    (e.g., U(X|V)).
  • Starting from the bottom of the tree, for every
    pair of nodes merged, calculate the cumulative
    information loss (a sketch follows below).
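A sketch of the per-merge loss computation; this is an assumed implementation in which merging two values of X means summing their rows in the X-by-V counts table, then re-measuring U(X|V):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def u_x_given_v(counts):
    """U(X|V) for an X-by-V counts table (rows are X values)."""
    P = counts / counts.sum()
    H_x = entropy(P.sum(axis=1))
    return (H_x + entropy(P.sum(axis=0)) - entropy(P.ravel())) / H_x

def info_loss_pct(counts, i, j):
    """% of U(X|V) lost when X values (rows) i and j merge; assumes i < j."""
    merged = np.delete(counts, j, axis=0)
    merged[i] = counts[i] + counts[j]
    full = u_x_given_v(counts)
    return 100.0 * (full - u_x_given_v(merged)) / full

X_by_V = np.array([[8.0, 1, 1], [7, 2, 1], [1, 2, 9]])
print(info_loss_pct(X_by_V, 0, 1))   # merging similar rows -> small loss
```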

29
Distance-Quantification-Classing Approach
30
Does this approach work?
31
Experimental Evaluation
  • Wrong quantification and classing will introduce
    artificial patterns and cause errors in
    interpretation
  • Evaluation measures:
  • Believability (perception)
  • Quality of visual display (perception)
  • Quality of classing (statistical)
  • Quality of quantification (statistical)
  • Space (computational): FCA needs less space
  • Run time (computational): MCA is faster
32
Test Data Sets
33
Believability and Quality of Visual Display
  • Given two displays resulting from different
    nominal-to-numeric mappings:
  • Which mapping gives a more believable ordering
    and spacing?
  • Based on your domain knowledge, are the values
    that are positioned close together similar to
    each other?
  • Are the values that are positioned far from the
    rest of the values really outliers?
  • Which display has less clutter?

34
Automobile Data: Alphabetical
35
Automobile Data: MCA
Are these patterns believable?
36
Automobile Data: FCA
Are these patterns believable?
37
PERF Data: Alphabetical
Region–Country: 1-to-many. Country–Product: many-to-many.
Are these associations preserved and revealed?
38
PERF Data: FCA
Region–Country: 1-to-many. Country–Product: many-to-many.
Are these associations preserved and revealed?
39
Quality of Classing
  • Classing A is better than classing B if, given a
    classing tree, the rate of information loss with
    each merging is slower

[Chart: information loss due to classing for one
variable. The lower the line, the slower the
information loss, and the better the classing.
To compare two classings, calculate the difference
between the lines.]
40
Which classing is better depends on the data set
[Chart: distribution of the difference between the lines]
41
Quality of Quantification
  • A quantification is good if:
  • data points that are close together in nominal
    space are also close together in numeric space
  • two variables that are highly associated with each
    other have quantified versions that also have high
    correlation

42
MCA gives better quantification
  • Average Squared Correlation: a higher value means
    a better quantification
  • Correlation between MCA and FCA scales: how close
    are the FCA scales to the MCA scales?
43
Had enough yet?
44
Going back to Multivariate Coarse Classing
  • Other issues
  • Missing values
  • Mixed or numeric variables as analysis variables
  • Nominal values with small counts
  • Robustness of quantification and classing

45
Can you think of other uses of DQC at FICO?
  • For techniques that require numeric inputs:
    linear regression, some clustering algorithms
    (can speed up calculations but with some loss of
    accuracy)
  • For techniques that require low-cardinality
    nominal variables: scorecards, neural networks,
    association rules
  • FICO-specific:
  • Multivariate coarse classing
  • ClusterBots: nominal variables could be
    quantified and distance calculations would be
    simpler. Could it be applied to mixed variables?
  • Product groups, merchant groups
  • ???????

46
Implementation
  • A SAS version exists
  • PROC CORRESP, PROC CLUSTER, PROC FREQ
  • C version in development

47
Summary
  • DQC is a general-purpose approach for
    pre-processing nominal variables for data analysis
    techniques requiring numeric variables or
    low-cardinality nominal variables
  • DQC is multivariate, data-driven, scalable,
    distance-preserving, and association-preserving
  • FCA is a viable alternative to MCA when memory
    space is limited
  • Quality of classing and quantification:
  • depends on the strength of associations within the
    data set
  • is in the eye of the user

48
Yippee, it's over!
  • Original InfoVis 2003 paper: Mapping Nominal
    Values to Numbers for Effective Visualization.
  • http://davis.wpi.edu/xmdv/documents.html
  • XmdvTool homepage:
  • http://davis.wpi.edu/xmdv
  • xmdv@cs.wpi.edu
  • Code is free for research and education.

49
References
  • [Gre93] Greenacre, M.J. (1993). Correspondence
    Analysis in Practice. London: Academic Press.
  • [Gre84] Greenacre, M.J. (1984). Theory and
    Applications of Correspondence Analysis. London:
    Academic Press.
  • [Sta] StatSoft Inc. Correspondence Analysis.
    http://www.statsoftinc.com/textbook/stcoran.html
  • [Fri99] Friendly, M. (1999). "Visualizing
    Categorical Data." In Sirken, M.G., et al. (eds.),
    Cognition and Survey Research. New York: John
    Wiley & Sons.
  • [Kei97] Keim, D.A. (1997). Visual Techniques for
    Exploring Databases. Invited tutorial, Int.
    Conference on Knowledge Discovery in Databases
    (KDD'97), Newport Beach, CA.
  • [Hua97b] Huang, Z. (1997). A Fast Clustering
    Algorithm to Cluster Very Large Categorical Data
    Sets in Data Mining.
  • SAS manuals (PROC CORRESP, PROC CLUSTER, PROC
    FREQ)

50
What input tables can SCA accept?
  • In general, SCA can use as input any table that
    has these properties:
  • The table uses the same physical units or
    measurements, and
  • The values in the table are non-negative.
  • The FCA input table satisfies these properties.

51
Uncertainty Coefficient U(R|C)
Source: SAS PROC FREQ
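The formula itself appears only as an image in the original slide; from the SAS PROC FREQ documentation it is:

```latex
U(R \mid C) = \frac{H(R) + H(C) - H(RC)}{H(R)},
\qquad H(X) = -\sum_x p_x \log p_x \ \text{(entropy)}
```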
52
Average Squared Correlation
  • Given the raw data matrix R = [r_ij], where the
    columns represent the variables, create a new
    matrix Q = [q_ij], where q_ij is the quantified
    version of r_ij. Let Q_j = the jth column of Q.
  • For each record i, calculate
    score_i = average_j(q_ij).
  • For each variable j, calculate
    corr_j = correlation(Q_j, score).
  • Calculate the average of the squared correlations.
  • Source: [Gre93]
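A direct numpy sketch of these steps (assumed implementation):

```python
import numpy as np

def average_squared_correlation(Q):
    """Q: records-by-variables matrix of quantified values."""
    score = Q.mean(axis=1)                       # score_i = avg_j q_ij
    corrs = [np.corrcoef(Q[:, j], score)[0, 1]   # corr_j per variable
             for j in range(Q.shape[1])]
    return float(np.mean(np.square(corrs)))

Q = np.array([[0.1, 0.2], [0.5, 0.4], [0.9, 0.8]])
print(average_squared_correlation(Q))
```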