Title: The Marriage of Market Basket Analysis to Predictive Modeling
Slide 1: The Marriage of Market Basket Analysis to Predictive Modeling
Slide 2: How Would You Mine This Transactional Data?
Slide 3: Is Data Mining Simply Market Basket Analysis?
Slide 4: Market Basket Analysis identifies the rule /our_company/bboard/ ⇒ hr/café/, but
- How do you use this information?
- Can the information be used to develop a predictive model?
- More generally, how do you develop predictive models using transactional tables?
Slide 5: Data Mining Software Objectives
- Predictive Modeling
- Clustering
- Market Basket Analysis
- Feature Discovery, that is, improving the predictive accuracy of existing models
Slide 6: Agenda
- Converting a transactional table to a modeling table
- The curse of dimensionality and possible fixes
- A feature discovery process using market basket analysis output as an input to predictive modeling
- A dimensional reduction scheme using confidence
Slide 7: DM Table Structures
- Transactional tables (Market Basket Analysis)

    Trans-id  page   spend  count
    id-1      page1      0      1
    id-1      page2      0      1
    id-1      page3      0      1
    id-1      page4  19.99      1
    id-1      page5      0      1
    id-2      page1      0      1

- Modeling tables (modeling and clustering tools)

    Trans-id  page   spend  count
    id-1      .      19.99      5
    id-2      .          0      1
Slide 8: Converting Transactional Into Modeling Data
- Continuous variable case: easy
- Collapse the spend or count columns via the sum, mean, or frequency statistic for each transaction-id value
- Proc sql:

    proc sql;
      create table new as
      select id, sum(amount) as total
      from old
      group by id;
    quit;

- Categorical variable case: challenging
- It seems the detail page information is lost when the rows are rolled up or collapsed
- However, with transposition you collapse the rows onto a single row for each id, with each distinct page now becoming a column in the modeling table and taking the count or sum statistic as its value (see the transpose sketch after this list)
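A minimal sketch of that transposition, assuming a transactional table named old with columns id, page, and spend (placeholder names, not taken from the slides):

    /* Roll up repeat visits so there is one row per id-page pair */
    proc summary data=old nway;
      class id page;
      var spend;
      output out=rolled (drop=_type_ _freq_) sum=spend;
    run;

    /* Pivot: one row per id, one spend_ column per distinct page */
    proc transpose data=rolled out=wide (drop=_name_) prefix=spend_;
      by id;
      id page;
      var spend;
    run;

With 5 distinct pages this yields columns spend_page1 through spend_page5; with 1,000 pages it yields 1,000 columns, which is the problem taken up two slides later.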
Slide 9: The Input Discovery Process
- The existing modeling table contains: id-1, age, income, job-category, married, recency, frequency, zip-code
- The new potential predictors per the transpose contain: id-1, spend on page1, spend on page2, spend on page3, spend on page4, spend on page5
- Augment the existing modeling table with the new inputs and, hopefully, discover new, significant predictors that improve predictive accuracy (a join sketch follows)
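A sketch of that augmentation, assuming the existing modeling table is model_tbl and the transposed table from the previous sketch is wide, both keyed on id (hypothetical names):

    proc sql;
      create table augmented as
      select m.*,
             w.spend_page1, w.spend_page2, w.spend_page3,
             w.spend_page4, w.spend_page5
      from model_tbl as m
      left join wide as w
        on m.id = w.id;   /* ids with no transactions keep missing values in the new columns */
    quit;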
Slide 10: Problem with the Transpose Method
- Suppose the server has 1,000 distinct pages; the transpose method now produces 1,000 new columns instead of 5
- Sparsity: the new columns have a preponderance of missing values; e.g., id-2 will have 5 missing values and 1 non-missing
- Regression, neural network, and clustering tools struggle with this many variables, especially when there is such a preponderance of the same values (e.g., zeros or missings)
Slide 11: The Curse of Dimensionality
- Suppose interest lies in a second classification column too, e.g., both time (hour) and page visited
- The transpose method now produces 1,000 + 24 new variables, assuming no interest in interactions
- If interactions are of interest, then there will be 24,000 (1,000 x 24) new variables generated
Slide 12: General Fix
- Reduce the number of levels of the categorical variable (e.g., using confidence)
- Use the transpose method to convert the transactional table to a modeling table
- Add the new inputs to the traditional modeling table in an effort to improve predictive accuracy
Slide 13: Creating Rules-Based Dummy Variables
- Obtain rules using market basket analysis
- Choose the rule of interest
- Identify folks having the rule of interest in their market basket
- Create a dummy variable flagging them
- Augment the traditional modeling table with the dummy variable
- Use the dummy variable as an input or target in a predictive modeling tool
Slide 14: Using SQL to Identify Folks Having a Rule of Interest in Their Market Basket
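The slide's code does not survive in this text version; the idea can be sketched in PROC SQL, assuming a transactional table old with columns id and page and the illustrative rule page1 ⇒ page2 (all names are assumptions):

    proc sql;
      create table rule_ids as
      select id
      from old
      where page in ('page1', 'page2')
      group by id
      having count(distinct page) = 2;   /* basket contains both sides of the rule */
    quit;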
Slide 15: Creating a Rule-Based Dummy Variable
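Again the original code is not in the export; a sketch of the flagging step, building on the rule_ids table from the previous sketch (hypothetical names):

    data rule_flag;
      set rule_ids;
      rule_dummy = 1;   /* 1 = the rule of interest is in this id's basket */
    run;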
Slide 16: The All-Info Table
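The all-info table presumably joins the traditional modeling inputs to the rule-based dummy; a sketch under the same assumed names:

    proc sql;
      create table all_info as
      select m.*,
             coalesce(r.rule_dummy, 0) as rule_dummy   /* 0 for ids whose basket lacks the rule */
      from model_tbl as m
      left join rule_flag as r
        on m.id = r.id;
    quit;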
Slide 17: Feature Discovery (a new potential predictor or input)
Slide 18: Possible Sub-setting Criteria
- Any rule of interest
- The confidence, e.g., all rules having confidence > 100 (optimal level of confidence?)
- The support, e.g., all rules having support > 10 (optimal level of support?)
- The lift, e.g., all rules having lift > 5 (optimal level of lift?); a filtering sketch follows this list
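A sketch of such sub-setting, assuming the market basket tool writes its rules to a dataset named rules with columns confidence, support, and lift (column names vary by tool and are assumptions here; the thresholds are illustrative):

    data interesting_rules;
      set rules;
      where confidence > 80 and support > 10 and lift > 5;   /* keep rules passing all three cuts */
    run;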
Slide 19: Using Confidence as the Basis for a Reclassification Scheme
- Suppose diapers ⇒ beer has a confidence of 100
- Then the two levels diapers and beer can be mapped into the value diapers ⇒ beer, it seems
- Actually, both the rule and its reverse must have a confidence of 100
Slide 20: The Confidence Reclassification Scheme
- If the confidence for the rule and its opposite is > 80, then combine the two levels into the rule-based level
- e.g., page1 and page2 are both mapped into page1 ⇒ page2
- Using 80 instead of 100 will introduce inaccuracy, but an analyst overwhelmed with too many levels will likely be willing to trade a little accuracy for dimensional reduction (a recoding sketch follows this list)
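A minimal sketch of the recoding, assuming the rule page1 ⇒ page2 and its reverse both clear the 80 confidence cutoff, and a transactional table old with a character column page (hypothetical names):

    data old_recoded;
      length page_new $ 32;
      set old;
      if page in ('page1', 'page2') then page_new = 'page1=>page2';   /* combined rule-based level */
      else page_new = page;
    run;

The recoded page_new column then feeds the transpose step on the next slide, producing fewer columns than the raw page variable would.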
Slide 21: The Confidence Reclassification Scheme (continued)
- Use the transpose method to generate candidate predictors
- Augment the traditional modeling table with the new candidate predictors
- Develop an enhanced model using some of the candidate predictors in the hope of improving predictive accuracy
Slide 22: Contact Information