Title: KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
1. KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University; Claudia Perlich, IBM Research
2. Predictive modeling
- Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes
- Primary example: supervised learning
  - Get training data (x1,y1), (x2,y2), ..., (xn,yn) drawn i.i.d. from a joint distribution on (X, y)
  - Build a model f(x) to describe the relationship between x and y
  - Use it to predict y when only x is observed in the future
- Other cases may relax some of the supervised learning assumptions
  - For example, in KDD Cup 2007 we did not see any yi's and had to extrapolate them from the training xi's (see later in tutorial)
3. Predictive Modeling Competitions
- Competitions like KDD-Cup extract core predictive modeling challenges from their application environment
- Usually supposed to represent real-life predictive modeling challenges
- Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems
- We will see this in examples
4. The Goals of this Tutorial
- Understand the two modes of predictive modeling, their similarities and differences
  - Real-life projects
  - Data mining competitions
- Describe the main factors for success in the two modes of predictive modeling
- Discuss some of the recurring challenges that come up in determining success
- These goals will be addressed and demonstrated through a series of case studies
5. Credentials in Data Mining Competitions
- Claudia Perlich
  - Runner-up, KDD Cup 03
  - Winner, ILP Challenge 05
  - Winner, KDD Cup 09*
- Saharon Rosset
  - Winner, KDD Cup 99
  - Winner, KDD Cup 00
- Jointly
  - Winners of KDD Cup 2007*
  - Winners of KDD Cup 2008*
  - Winners of INFORMS Data Mining Challenge 08*
* With collaborators: Prem Melville, Yan Liu, Grzegorz Swirszcz, Foster Provost, Sofus Macskassy, Aron Inger, Nurit Vatnik, Einat Neuman, Alexandru Niculescu-Mizil
6. Experience with Real-Life Projects
- 2004-2009: collaboration on Business Intelligence projects at IBM Research
- Total of >10 publications on real-life projects
- Total of 4 IBM Outstanding Technical Achievement awards
- IBM accomplishment and major accomplishment
- Finalists in this year's INFORMS Edelman Prize for real-life applications of Operations Research and Statistics
- One of the successful projects will be discussed here as a case study
7. Outline
- Introduction and overview (SR)
  - Differences between competitions and real life
  - Success factors
  - Recurring challenges in competitions and real projects
- Case Studies
  - KDD Cup 2007 (SR)
  - KDD Cup 2008 (CP)
  - Business Intelligence example: Market Alignment Program (MAP) (CP)
- Conclusions and summary (CP)
8. Introduction: What do you think is important?
- Domain knowledge
- Statistics background
- Data mining algorithms
- Computing power
- General experience with data
9. Differences between competitions and projects
In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start from a well-defined predictive task and ignore most of the difficulties of getting to that point.
10. Real-life project evolution and our focus
[Flow diagram: Business/modeling problem definition → Statistical problem definition → Modeling methodology design → Data preparation & integration → Model generation & validation → Implementation & application development.
Examples along the stages: sales force mgmt., wallet estimation; quantile estimation, latent variable estimation; quantile estimation, graphical models; IBM relationships, firmographics; programming, simulation, IBM Wallets; OnTarget, MAP.
Problem definition: not our focus. Implementation: loosely related. The modeling stages in between: our focus.]
11. Two types of competitions
- "Real"
  - Raw data: set up the model yourself
  - Task-specific evaluation
  - Simulates real-life mode
  - Examples: KDD Cup 2007, KDD Cup 2008
  - Approach: understand the domain, analyze the data, build the model
  - Challenges: too numerous to list
- "Sterile"
  - Clean data matrix, standard error measure, often anonymized features
  - Pure machine learning
  - Examples: KDD Cup 2009, PKDD Challenge 2007
  - Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
  - Challenges: size, missing values, features
12. Factors of Success in Competitions and Real Life
- 1. Data and domain understanding
  - Generation of data and task
  - Cleaning and representation/transformation
- 2. Statistical insights
  - Statistical properties
  - Test validity of assumptions
  - Performance measure
- 3. Modeling and learning approach
  - Most publishable part
  - Choice or development of the most suitable algorithm
13. Recurring challenges
- We emphasize three recurring challenges in predictive modeling that often get overlooked
- Data leakage: impact, avoidance and detection
  - Leakage: use of illegitimate data for modeling
  - Legitimate data: data that will be available when the model is applied
  - In competitions, the definition of leakage is unclear
- Adapting learning to real-life performance measures
  - Could move well beyond standard measures like MSE, error rate, or AUC
  - We will see this in two of our case studies
- Relational learning / feature construction
  - Real data is rarely flat, and good, practical solutions for this problem remain a challenge
14. (1) Leakage in Predictive Modeling
- Introduction of predictive information about the target by the data generation, collection, and preparation process
- Trivial example: the binary target was created using a cutoff on a continuous variable, and by accident the continuous variable was not removed
- Reversal of cause and effect, when information from the future becomes available
- It produces models that do not generalize: true model performance is much lower than the out-of-sample (but leakage-contaminated) estimate
- Commonly occurs when combining data from multiple sources or multiple time points, and often manifests in the ordering of data files
- Leakage is surprisingly pervasive in competitions and real life
  - KDD Cup 2007 and KDD Cup 2008 had leakage, as we will see in the case studies
  - The INFORMS competition had leakage due to partial removal of information for only the positive cases
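The "trivial example" above can be reproduced in a few lines. This is an illustrative sketch of ours (all names are hypothetical), not code from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw field; the binary target was created by thresholding it.
income = rng.normal(50.0, 10.0, size=1000)
high_value = (income > 50.0).astype(int)  # target defined by a cutoff

# If 'income' is accidentally left among the predictors, a single split
# reproduces the target perfectly -- a classic leakage red flag.
leaky_accuracy = float(((income > 50.0).astype(int) == high_value).mean())
print(leaky_accuracy)  # 1.0: suspiciously perfect
```

A single feature that predicts the target perfectly is exactly the kind of "too good to be true" signal EDA should flag.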
15. Real-life leakage example
- P. Melville, S. Rosset, R. Lawrence (2008). Customer Targeting Models Using Actively-Selected Web Content. KDD-08
- Built models for identifying new customers for IBM products, based on
  - IBM internal databases
  - Companies' websites
- Example pattern: companies with the word "Websphere" on their website are likely to be good customers for IBM Websphere products
  - Ahem, a slight cause-and-effect problem
- Source of problem: we only have the current view of a company's website, not its view when the company was an IBM prospect (prior to buying)
- Ad-hoc solution: remove all obvious "leakage words"
  - Does not solve the fundamental problem
16. General leakage solution: predict the future
- Niels Bohr is quoted as saying "Prediction is difficult, especially about the future"
- Flipping this around, if
  - The true prediction task is about the future (it usually is)
  - We can make sure that our model only has access to information available at "the present"
  - We can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage
- → We are guaranteed (intuitively and mathematically) that we can prevent leakage
- For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since
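The time-based cutoff can be sketched in a few lines. A minimal sketch assuming hypothetical record and field names:

```python
from datetime import date

# Hypothetical customer records with the date of the outcome event.
records = [
    {"id": 1, "event_date": date(2006, 3, 1)},
    {"id": 2, "event_date": date(2008, 6, 1)},
    {"id": 3, "event_date": date(2005, 1, 15)},
]

cutoff = date(2007, 1, 1)  # "the present": nothing after this feeds the model

# Features may use only pre-cutoff information; outcomes observed
# after the cutoff form the evaluation target.
history = [r for r in records if r["event_date"] < cutoff]
future = [r for r in records if r["event_date"] >= cutoff]
print(len(history), len(future))  # 2 1
```

The discipline, not the code, is the point: every feature must be computable from the pre-cutoff side alone.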
17. (2) Real-life performance measures
- Real-life prediction models should be constructed and judged for performance on real-life measures
  - Address the real problem at hand: optimize the real objective (life span, etc.)
  - At the same time, need to maintain statistical soundness
- Can we optimize these measures directly?
- Are we better off just building good models in general?
- Example: breast cancer detection (KDD Cup 2008)
  - At first sight, a standard classification problem (malignant or benign?)
  - Obvious extension: cost-sensitive objective. Much better to do a biopsy on a healthy subject than send a malignant patient home!
  - Competition objective: optimize effective use of radiologists' time
  - Complex measure called FROC
  - See case study in Claudia's part
18. Optimizing real-life measures
- It is a common approach to use the prediction objective to motivate an empirical loss function for modeling
  - If the prediction objective is the expected value of Y given x, then squared error loss (e.g., linear regression or CART) is appropriate
  - If we want to predict the median of Y instead, then absolute loss is appropriate
  - More generally, quantile loss can be used (cf. MAP case study)
  - We will see successful examples of this approach in two case studies (KDD Cup 07 and MAP)
- What do we do with complex measures like FROC?
  - There is really no way to build a good model for them directly
- Less ambitious approach
  - Build a model using standard approaches (e.g., logistic regression)
  - Post-process your model to do well on the specific measure
  - We will see a successful example of this approach in KDD Cup 08
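The correspondence between loss functions and prediction targets above can be checked numerically. A small sketch on illustrative data:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # skewed sample with an outlier

# Scan constant predictions c and compare empirical losses (grid step 0.05).
grid = np.linspace(0.0, 110.0, 2201)
best_sq = grid[int(np.argmin([((y - c) ** 2).mean() for c in grid]))]
best_abs = grid[int(np.argmin([np.abs(y - c).mean() for c in grid]))]

# Squared loss is minimized at the mean, absolute loss at the median.
print(round(float(best_sq), 2), float(y.mean()))
print(round(float(best_abs), 2), float(np.median(y)))
```

Note how far apart the two optima are on skewed data: the choice of loss function is a choice of prediction target.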
19. (3) Relational and Multi-Level Data
- Real-life databases are rarely flat!
- Example: INFORMS Challenge 08, medical records
[Diagram: several tables of medical records linked by m:n relationships]
20. Approaches for dealing with relational data
- Modeling approaches that use relational data directly
  - There has been a lot of research, but there is a scarcity of practically useful methods that take this approach
- Flattening the relational structure into a standard (X, y) setup
  - The key to this approach is the generation of useful features from the relational tables
  - This is the approach we took in the INFORMS 08 challenge
- Ad-hoc approaches
  - Based on specific properties of the data and modeling problem, it may be possible to divide and conquer the relational setup
  - See example in the KDD Cup 08 case study
21. Modeler's best friend: Exploratory data analysis
- Exploratory data analysis (EDA) is a general name for a class of techniques aimed at
  - Examining data
  - Validating data
  - Forming hypotheses about data
- The techniques are often graphical or intuitive, but can also be statistical
  - Testing very simple hypotheses as a way of getting at more complex ones
  - E.g., test each variable separately against the response, and look for strong correlations
- The most important proponent of EDA was the late, great statistician John Tukey
22. The beauty and value of exploratory data analysis
- EDA is a critical step in creating successful predictive modeling solutions
  - Expose leakage
  - AVOID PRECONCEPTIONS about
    - What matters
    - What would work
    - Etc.
- Example: identifying the KDD Cup 08 leakage through EDA
  - Graphical display of identifier vs. malignant/benign (see case study slide)
  - Could also be discovered via a statistical variable-by-variable examination of significant correlations with the response
  - Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier
23. Elements of EDA for predictive modeling
- Examine data variable by variable
  - Outliers?
  - Missing data patterns?
- Examine relationships with the response
  - Strong correlations?
  - Unexpected correlations?
- Compare to other similar datasets/problems
  - Are variable distributions consistent?
  - Are correlations consistent?
- Stare at raw data, at graphs, at correlations/results
- Unexpected answers to any of these questions may change the course of the predictive modeling process
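A minimal sketch of the "test each variable against the response" screen, on synthetic data with a planted identifier leak (all names are hypothetical, ours rather than the tutorial's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical flat dataset; 'record_id' is the kind of column that
# preconceptions would dismiss as irrelevant.
features = {
    "age": rng.normal(40.0, 10.0, n),
    "visits": rng.poisson(3.0, n).astype(float),
    "record_id": np.arange(n, dtype=float),
}
y = (np.arange(n) < 100).astype(float)  # planted leak: positives come first

# Variable-by-variable screen against the response.
correlations = {name: float(np.corrcoef(x, y)[0, 1])
                for name, x in features.items()}
for name, r in correlations.items():
    flag = "  <-- investigate!" if abs(r) > 0.5 else ""
    print(f"{name:10s} corr={r:+.2f}{flag}")
```

The identifier lights up while the "real" features do not, mirroring how the KDD Cup 08 leakage could have been caught.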
24. Case study 1: Netflix / KDD-Cup 2007
25. October 2006: Announcement of the NETFLIX Competition
- USA Today headline: "Netflix offers $1 million prize for better movie recommendations"
- Details
  - Beat NETFLIX's current recommender Cinematch's RMSE by 10% prior to 2011
  - $50,000 for the annual progress prize
  - First two awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!)
- Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies
- Performance is evaluated on holdout movie-user pairs
- The NETFLIX competition has attracted 50K contestants on 40K teams from >150 different countries
  - 40K valid submissions from 5K different teams
26. NETFLIX Data
[Diagram: from the Internet Movie Database, all movies (80K) → 17K selected (selection unclear); all users (6.8M) → 480K with at least 20 ratings by end of 2005; together forming the NETFLIX competition data of 100M ratings]
27. NETFLIX data generation process
[Timeline diagram, 1998 to 2005/2006: user arrivals and movie arrivals over time; training data covers 1998-2005 over 17K movies; for the KDD Cup there are no new user or movie arrivals in 2006; Task 1 and Task 2 test data are drawn from 2006; qualifier dataset: 3M ratings]
28. KDD-Cup 2007 based on the NETFLIX data
- Training: NETFLIX competition data from 1998-2005
- Test: 2006 ratings, randomly split by movie into two tasks
- Task 1: Who rated what in 2006
  - Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
  - Result: IBM Research team was second runner-up, No. 3 out of 39 teams
- Task 2: Number of ratings per movie in 2006
  - Given a list of 8863 movies, predict the number of additional reviews that all existing users will give in 2006
  - Result: IBM Research team was the winner, No. 1 out of 34 teams
29. Test sets from 2006 for Task 1 and Task 2
[Diagram: marginal 2006 distributions of ratings over users and over movies; sample (movie, user) pairs according to the product of the marginals, then remove pairs that were rated prior to 2006 → Task 1 test set (100K pairs); log(n+1) rating totals per movie → Task 2 test set (8.8K movies)]
30. Task 1: Did user A review movie B in 2006?
- A standard classification task: answer whether existing users will review existing movies
- More in line with the "synthetic" mode of competitions than the "real" mode
- Challenges
  - Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical
  - Complex affecting factors: decreasing interest in old movies, growing tendency of Netflix users to watch (review) more movies
- Key solutions
  - Effective sampling strategies to keep as much information as possible
  - Careful feature extraction from multiple sources
31. Task 2: How many reviews in 2006?
- Task formulation
  - Regression task: predict the total count of reviews from existing users for the 8863 existing movies
  - Evaluation is by RMSE on the log scale
- Challenges
  - Movie dynamics and life-cycle: interest in movies changes over time
  - User dynamics and life-cycle: no new users are added to the database
- Key solutions
  - Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
  - Build a set of quarterly lagged models to determine the overall scaling factor
  - Use Poisson regression
32. Some data observations
Leakage alert!
- The Task 1 test set is a potential response for training a model for Task 2
  - It was sampled according to the marginal (# reviews for movie in '06 / total reviews in '06), which is proportional to the Task 2 response (# reviews for movie in '06)
  - BIG advantage: we get a view of 2006 behavior for half the movies → build a model on this half, apply it to the other half (the Task 2 test set)
- Caveats
  - Proportional sampling implies there is a scaling parameter left over, which we don't know
  - Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set → correcting for this is an inverse rejection sampling problem
33. Test sets from 2006 for Task 1 and Task 2
[Same diagram as slide 29, annotated: estimate the marginal distribution of ratings over movies from the Task 1 test set, a surrogate learning problem for the Task 2 response]
34. Some data observations (ctd.)
- No new movies or reviewers in 2006
- Need to emphasize modeling the life-cycle of movies (and reviewers)
  - How are older movies reviewed relative to newer movies?
  - Does this depend on other features (like the movie's genre)?
- This is especially critical when we consider the scaling caveat above
35. Some statistical perspectives
- The Poisson distribution is very appropriate for counts
  - Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
- The right modeling approach for true counts is Poisson regression:
  n_i ~ Pois(λ_i Δt),  log(λ_i) = Σ_j β_j x_ij,  β̂ = argmax_β ℓ(n | X, β)  (maximum likelihood)
- What does this imply for the model evaluation approach?
  - The variance stabilizing transformation for Poisson is the square root → √n_i has roughly constant variance → RMSE on the log scale emphasizes performance on unpopular movies (small Poisson parameter → larger log-scale variance)
- Still, we assumed that if we do well in a likelihood formulation, we will do well with any evaluation approach
Adapting to evaluation measures!
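Poisson regression as written above can be fit by Newton-Raphson on the log-likelihood. A self-contained numpy sketch on synthetic data (not the competition's actual features):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic movie-level data with a log-linear Poisson rate.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))

# Maximize l(beta) = sum_i [ y_i * x_i'beta - exp(x_i'beta) ] by Newton steps.
beta = np.zeros(3)
for _ in range(25):
    mu = np.exp(X @ beta)            # lambda_i under the current fit
    grad = X.T @ (y - mu)            # score vector
    hess = X.T @ (X * mu[:, None])   # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(np.round(beta, 2))  # close to beta_true
```

This is the standard iteratively reweighted least squares fit for a Poisson GLM with log link; any GLM library would do the same job.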
36. Some statistical perspectives (ctd.)
- Can we invert the rejection sampling mechanism?
- This can be viewed as a missing data problem
  - n_i, m_j are the observed counts for movie i and reviewer j, respectively
  - p_i, q_j are the true marginals for movie i and reviewer j, respectively
  - N is the total number of pairs rejected due to a review prior to 2006
  - U_i, P_j are the users who reviewed movie i prior to 2006 and the movies reviewed by user j prior to 2006, respectively
- Can we design a practical EM algorithm at our huge data size? An interesting research problem
- We implemented an ad-hoc inversion algorithm
  - Iterate until convergence between: assuming movie marginals are fixed, adjusting reviewer marginals; and assuming reviewer marginals are fixed, adjusting movie marginals
- We verified that it indeed improved our data, since it increased correlation with 4Q2005 counts
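The alternating marginal adjustment is in the spirit of iterative proportional fitting. A toy sketch of ours with made-up dimensions (not the tutorial's actual algorithm), where 'seen' marks the rejected (pre-2006) cells:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy ground truth: sampling proportional to p_i * q_j, with some
# (movie, user) cells rejected because the pair was rated before 2006.
n_movies, n_users = 5, 8
p = rng.dirichlet(np.ones(n_movies))   # true movie marginals
q = rng.dirichlet(np.ones(n_users))    # true user marginals
seen = np.zeros((n_movies, n_users), dtype=bool)
seen[[0, 1, 2, 3], [1, 4, 6, 0]] = True

target = np.outer(p, q)
target[seen] = 0.0
observed = target / target.sum()       # distribution the test set reflects

def masked(pm, qm):
    """Product-of-marginals model with the rejected cells zeroed out."""
    M = np.outer(pm, qm)
    M[seen] = 0.0
    return M / M.sum()

# Alternate: fix user marginals and adjust movie marginals, then swap.
p_hat = np.full(n_movies, 1.0 / n_movies)
q_hat = np.full(n_users, 1.0 / n_users)
for _ in range(200):
    M = masked(p_hat, q_hat)
    p_hat *= observed.sum(axis=1) / np.maximum(M.sum(axis=1), 1e-12)
    p_hat /= p_hat.sum()
    M = masked(p_hat, q_hat)
    q_hat *= observed.sum(axis=0) / np.maximum(M.sum(axis=0), 1e-12)
    q_hat /= q_hat.sum()

# The fitted model reproduces the observed (post-rejection) table.
print(np.abs(masked(p_hat, q_hat) - observed).max())
```

At Netflix scale the same alternation runs over sparse count vectors rather than a dense matrix, which is what makes the ad-hoc scheme practical.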
37. Modeling Approach Schema
[Flow diagram: count ratings by movie from the "who reviewed" Task 1 test set (100K) → inverse rejection sampling → estimate Poisson regression M1, predict on Task 1 movies, and use M1 to predict Task 2 movies. In parallel: construct movie features from IMDB and the NETFLIX challenge data; construct lagged features from Q1-Q4 2005 → estimate four Poisson regressions G1-G4, predict for 2006 → find the optimal scaling factor → estimate 2006 total ratings for the Task 2 test set → scale predictions to the total.]
38. Some observations on the modeling approach
- Lagged datasets are meant to simulate forward prediction to 2006
  - Select a quarter (e.g., Q1 05), remove all movies and reviewers that started later
  - Build a model on this data with, e.g., Q3 05 as the response
  - Apply the model to our full dataset, which is naturally cropped at Q4 05 → gives a prediction for Q2 06
  - With several models like this, predict all of 2006
- Two potential uses
  - Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  - Consider only the sum of their predictions, to use for scaling the Task 1 model
- We evaluated models on the Task 1 test set
  - Used a holdout when also building them on this set
- How can we evaluate the models built on lagged datasets?
  - Missing a scaling parameter between the 2006 prediction and the sampled set
  - Solution: select the optimal scaling based on Task 1 test set performance → since the other model was still better, we knew we should use it!
39. Some details on our models and submission
- All models at the movie level. Features we used:
  - Historical reviews in previous months/quarters/years (on log scale)
  - Movie's age since premiere, movie's age in Netflix (since first review)
    - Also consider log, square, etc. → flexibility in the form of the functional dependence
  - Movie's genre
    - Include interactions between genre and age → the life cycle seems to differ by genre!
- Models we considered (MSE on log scale on Task 1 holdout)
  - Poisson regression on Task 1 test set (0.24)
  - Log-scale linear regression model on Task 1 test set (0.25)
  - Sum of lagged models built on 2005 quarters with best scaling (0.31)
- Scaling based on lagged models
  - Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
  - Implied scaling parameter for predictions: about 90%
  - Total of our submitted predictions for the Task 2 test set was 9.3M
40. Competition evaluation
- First we were informed that we had won, with an RMSE of 770
  - They had mistakenly evaluated on the non-log scale
  - Strong emphasis on the most popular movies
  - We won by a large margin → our model did well on the most popular movies!
- Then they re-evaluated on the log scale, and we still won
  - On the log scale the least popular movies are emphasized
  - Recall that the variance stabilizing transformation is in between (square root)
  - So our predictions did well on unpopular movies too!
- Interesting question: would we win on the square-root scale (or, similarly, a Poisson likelihood-based evaluation)? Sure hope so!
41. Competition evaluation (ctd.)
- Results of the competition (log-scale evaluation)
- Components of our model's MSE
  - The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
  - Additional error from an incorrect scaling factor
- Scaling numbers
  - True total reviews: 8.7M
  - Sum of our predictions: 9.3M
- Interesting question: what would be the best scaling
  - For log-scale evaluation? Conjecture: need to under-estimate the true total
  - For square-root evaluation? Conjecture: need to estimate about right
42. Effect of scaling on the two evaluation approaches [chart]
43. Effect of scaling on the two evaluation approaches [chart]
44. KDD Cup 2007 Summary
- Keys to our success
  - Identifying subtle leakage
    - Is it formally leakage? Depends on the intentions of the organizers
  - Appropriate statistical approach
    - Poisson regression
    - Inverting the rejection sampling in the leakage
    - Careful handling of time-series aspects
- Not keys to our success
  - Fancy machine learning algorithms
45. Case Study 2: KDD Cup 2008 - Siemens Medical: Breast Cancer Identification
[Figure: MLO and CC mammography views; 6816 images from 1712 patients, some malignant, yield 105,000 candidates → each candidate is a feature vector (x1, x2, ..., x117, class)]
46. KDD-Cup 2008 based on Mammography
- Training: labeled candidates from 1300 patients, plus the association of each candidate to its location, image and patient
- Test: candidates from a separate set of 1300 patients
- Task 1
  - Rank all candidates by the likelihood of being cancerous
  - Result: IBM Research team was the winner out of 246 submissions
- Task 2
  - Identify a list of healthy patients
  - Result: IBM Research team was the winner out of 205 submissions
47. Task 1: Candidate Likelihood of Cancer
- Almost a standard probability estimation/ranking task at the candidate level
- Somewhat synthetic, as the meaning of the features is unknown
- Challenges
  - Low positive rate: 7% of patients and 0.6% of candidates
    - Beware of overfitting
    - Sampling
  - Unfamiliar evaluation measure
    - FROC, related to AUC
    - Non-robust
    - Hints at locality
- Key solutions
  - Simple linear model
  - Post-processing of scores
  - Leakage in identifiers
Adapting to evaluation measures!
48. Task 2: Classify patients
- A derivative of Task 1
  - A patient is healthy if all her candidates are benign
  - The probability that a patient is healthy is the product of the probabilities of her candidates
- Challenges
  - Extremely non-robust performance measure: including any patient with cancer in the list disqualified the entry
  - Risk tradeoff: need to anticipate the solutions of the other participants
- Key solutions
  - Pick a model with high sensitivity to false negatives
  - Leakage in identifiers: EDA at work
49. EDA on the Breast Cancer Domain
Console output of sorted patient_ID, patient_label:
- 144484 1
- 148717 0
- 168975 0
- 169638 1
- 171985 0
- 177389 1
- 182498 0
- 185266 0
- 193561 1
- 194771 0
- 198716 1
- 199814 1
- 1030694 0
- 1123030 0
- 1171864 0
- 1175742 0
- 1177150 0
- 1194527 0
- 1232036 0
Base rate of 7%????
What about 200K to 999K?
50. Mystery of the Data Generation: Identifier Leakage in the Breast Cancer data
Leakage
- The distribution of identifiers has a strong natural grouping of patient identifiers
  - 3 natural buckets
- The three groups have VERY different base rates of cancer prevalence
- The last group seems to be sorted (cancer first)
  - Total of 4 groups with very different patient probabilities of cancer
- The organizers admitted to having combined data from different years in order to increase the positive rate
51. Building the classification model
- For evaluation we created a stratified 50% training and test split by patient
  - Given few positives (300), results may exhibit high variance
- We explored the use of various learning algorithms, including neural networks, logistic regression and various SVMs
  - Linear models (logistic regression or linear SVMs) yielded the most promising results
  - FROC 0.0834
- Down-sampling the negative class?
  - Keep only 25% of all healthy patients
  - Helped in some cases; not a really reliable improvement
- Add the identifier category (1,2,3,4) as an additional feature
52. Modeling Neighborhood Dependence
Relational data
- Candidates are not really i.i.d. but actually relational
- Stacking
  - Build an initial model and score all candidates
  - Use the labels of neighbors in a second round
- Formulate as an EM problem
  - Treat the labels of the neighbors as unobserved in EM
- Pairwise constraints based on location adjacency
  - Calculate the Euclidean distance between candidates within the same picture, and the distance to the nipple in both views for each breast
  - Select the pairs of candidates with distance difference less than a threshold
  - Constraint: selected pairs of examples (xi,MLO, xi,CC) should have the same predicted labels, i.e., f(xi,MLO) = f(xi,CC)
- Results
  - Seems to improve the probability estimates in terms of AUC
  - Did not improve FROC
53. Outlier Treatment
Statistics
- Many of the 117 numeric features have large outliers
- They incur a huge penalty in terms of likelihood
  - Large bias
  - Badly calibrated probabilities
  - Extreme (wrong) values in the prediction
[Histogram of Feature 10: 142 observations > 5]
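One pragmatic treatment for such heavy-tailed features is winsorizing before fitting. This is a sketch of a common remedy on made-up data, an assumption on our part rather than the tutorial's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical feature with a long right tail, in the spirit of 'Feature 10'.
x = np.concatenate([rng.normal(0.0, 1.0, 990), rng.normal(50.0, 10.0, 10)])

# Winsorize: clip to chosen percentiles so a handful of extreme values
# cannot dominate a likelihood-based fit.
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)

print(float(x.max()), float(x_clipped.max()))  # extreme tail pulled in
```

Whether clipping, log-transforming, or rank-transforming works best depends on the model; the point is that a likelihood-based linear model sees the raw extremes directly.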
54. ROC vs. FROC optimization: post-processing of model scores?
Adapting to evaluation
- In ROC all rows are independent, and both true positives and false positives are counted by row
- FROC has true positive patients and false positive candidates
- A higher TP rate for candidates does not improve FROC unless it comes from a new patient; e.g.,
  - It's better to have 2 correctly identified candidates from different patients than 5 from the same patient
  - It's best to re-order candidates based on model scores so as to ensure many different patients up front
55. Theory of Post-processing
Adapting to evaluation
- Probabilistic approach
  - At any point we want to maximize the expected gradient of the FROC at this point
  - Define for each candidate c of patient i:
    - pc: probability that candidate c is malignant
    - npi: probability that patient i has not yet been identified
  - 3 cases
    - Candidate is positive but the patient is already identified: probability pc (1-npi)
    - Candidate is positive and the patient is new: probability pc npi
    - Candidate is negative: probability 1-pc
  - Pick the candidate with the largest expected gain pc npi / (1-pc)
- Theorem
  - The expected value of FROC for the resulting order is higher than for any other order
- Problem
  - Our probability estimates are not good enough for this to work well
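The greedy rule can be sketched as follows; the names and scores are illustrative, and this is our reading of the slide rather than the tutorial's exact code:

```python
def reorder_candidates(candidates):
    """candidates: list of (patient_id, malignancy_prob), prob < 1.
    Greedily emit the candidate with the largest expected gain
    p_c * np_i / (1 - p_c), where np_i is the probability that
    patient i has not yet been identified."""
    np_i = {pid: 1.0 for pid, _ in candidates}
    remaining = list(candidates)
    ordered = []
    while remaining:
        best = max(remaining, key=lambda c: c[1] * np_i[c[0]] / (1.0 - c[1]))
        remaining.remove(best)
        ordered.append(best)
        np_i[best[0]] *= 1.0 - best[1]  # patient i is less likely still "new"
    return ordered

# Illustrative scores: note how different patients get interleaved up front.
cands = [("A", 0.9), ("A", 0.8), ("B", 0.5), ("B", 0.4), ("C", 0.3)]
print(reorder_candidates(cands))
```

After a patient's strong candidate is emitted, that patient's remaining candidates are discounted, pushing fresh patients toward the front, which is exactly what FROC rewards.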
56. Bad Calibration!
Statistics
Calibration plot
- We consistently over-predict the probability of cancer for the most likely candidates
  - Bias of the linear method
  - High class skew
  - Outliers in the 117 numeric features lead to extreme predictions on the holdout
- Re-calibration?
  - We tried a number of methods
  - No improvement
  - Some resulted in better calibration but hurt the ranking
57. Post-Processing Heuristic
Adapting to evaluation
- Ad-hoc approach
  - Take the top n ranked candidates, where n is approximately the number of positive candidates
  - Select the candidate with the highest score for each patient from this list and put them at the top of the list
  - Iterate until all top n candidates are re-ordered
[Chart: true positive patient rate vs. false positive rate per image]
Re-ordering model scores significantly improves the FROC with no additional modeling
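A sketch of this heuristic (our reading of the slide; names and scores are illustrative): keep the top-n list score-sorted, and in each pass pull out the best remaining candidate of every patient:

```python
def reorder_topn(scored, n):
    """scored: list of (patient_id, score). Re-orders the top-n
    candidates so that each pass contributes at most one candidate
    per patient, in score order; the rest keep their ranking."""
    ranked = sorted(scored, key=lambda c: -c[1])
    top, rest = ranked[:n], ranked[n:]
    reordered = []
    while top:
        seen, picked = set(), []
        for cand in top:                  # top stays score-sorted
            if cand[0] not in seen:
                seen.add(cand[0])
                picked.append(cand)       # best remaining per patient
        top = [c for c in top if c not in picked]
        reordered.extend(picked)
    return reordered + rest

cands = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("B", 0.6), ("C", 0.5)]
print(reorder_topn(cands, 4))  # A and B interleave before A's second candidate
```

Unlike the probabilistic rule on the previous slide, this needs only the ranking, not calibrated probabilities, which is why it was usable despite the calibration problems.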
58. Submissions and Results
- Task 1
  - Bagged linear SVMs with hinge loss and heuristic post-processing
  - This approach scored the winning result of 0.0933 on FROC, out of 246 submissions from 110 unique participants
  - Second place scored 0.0895
  - Some rumors that other participants also found the ID leakage
- Task 2
  - The logistic model performs better than the SVM models, probably because likelihood is more sensitive to extreme errors (the first false negative)
  - The first false negative typically occurs around 1100 patients in the training set
  - We submitted the first 1020 patients ranked by a logistic model that included the ID feature plus the original 117 features
  - Scored a specificity of 0.682 on the test set with no false negatives
    - Only 24 out of 203 submissions had no false negatives
    - Second place scored 0.17 specificity
59. Summary in terms of success factors
- Leakage in the identifier provides information about the likelihood of a patient having cancer
  - Caused by the organizers' effort to increase the positive rate by adding old patients that developed cancer
- Post-processing for FROC optimization
- Awareness of the impact of feature outliers
  - Interacts with the statistical properties of the data and the model
  - Log-likelihood is more sensitive than hinge loss
- Otherwise a simple model, to avoid overfitting
  - Linear models
- Relational modeling is not helpful for the given evaluation
60. KDD Cup 2009
- Data: customer database of Orange, with 100K observations and 15K variables
- Three different tasks and 2.5 versions
  - Prediction: churn, appetency, upselling
  - Versions: fast (5 days) and slow (1 month)
  - Large and small version
- Interesting characteristics
  - Highly sterile; nothing known about anything
  - Leaderboard
  - It was possible to match the large and small versions and receive feedback on 20% of the test set
61. KDD Cup 2000
- Data: online store history for Gazelle.com
- Five different tasks, including
  - Prediction: Who will continue in a session? Who will buy?
  - Insights: characterize heavy spenders
- Interesting characteristics
  - Leakage: internal testing sessions were left in the data
    - Deterministic behavior: if identified, gives 100% accuracy in prediction for part of the data
  - Evaluation in terms of real business objectives?
    - Sort of handled by defining a set of standard questions, each covering a different aspect of the business objective
  - Relational data?
    - Yes, customers had different numbers of sessions, of different lengths, with different stages
62. KDD Cup 2003
- Data: citation rates of physics papers
- Two tasks
  - Predict the change in the number of citations during the next 3 months
  - Write an interesting paper about it
- Interesting characteristics
  - Highly relational: links between papers and authors
  - Feature construction up to the participants
  - Leakage impossible, since the truth was really in the future
  - Evaluation on SSE against integer values (Poisson)
63. ILP Challenge 2003
- Data: yeast genome, including protein sequences, alignment similarity scores with other proteins, and additional protein information from a relational DB
- Task: identify the (potentially multiple) functional classes of each gene
- Interesting characteristics
  - 420 possible classes, very subjective assignment
  - Purely relational, no features available
    - Distances (supposedly p-values) of gene alignments
    - Secondary structure (sequence of amino acids)
    - Protein DB with keywords, etc.
  - Leakage in the identifier: contains a letter identifying the labeling research group
  - Highly unsatisfactory evaluation: precision of the prediction
64. INFORMS data mining contest 2008
- Data: 2 years of hospital records with accounting information (cost, reimbursement, ...), patient demographics, medication history
- Tasks
  - Identify pneumonia patients
  - Design an optimization setting for preventive treatment
- Interesting characteristics
  - Relational setting (4 tables linked through a patient identifier)
  - Leakage: removal of the pneumonia code left hidden traces
  - Dirty data, with plenty of missing values, contradicting demographics and changing patient IDs
65. Data Mining in the Wild: Project Work
- Similarities with competitions (compared to DM research)
  - Single dataset
  - Algorithms can be existing and simple
  - No real need for baselines (although useful)
  - The absolute performance matters
- Differences from competitions
  - You need to decide what the analytical problem is
  - You need to define the evaluation rather than optimize it
  - You need to avoid leakage rather than use it
  - You need to FIND all relevant data rather than use what is there (often leads to relational settings)
  - You need to deliver it somehow to have impact
66Case Study 3 Market Alignment Program
- Wallet
- Total amount of money that the customer can
spend in a certain product category in a given
period - Why Are We Interested in Wallet?
- Customer targeting
- Focus on acquiring customers with high wallet
- For existing customers, focus on high wallet, low
share-of-wallet customers - Sales force management
- Use wallet as a sales force allocation target and
make resource assignment decisions based on
wallet - Evaluate success of sales personnel by
attained share-of-wallet
67Wallet Modeling Challenge
- The customer wallet is never observed
- Nothing to fit a model to
- Even if you have a model, how do you evaluate it?
- We would like a predictive approach from
available data - Firmographics (Sales, Industry, Employees)
- IBM Sales and transaction history
68Define Wallet/Opportunity?
- TOTAL Total customer available budget in total
IT - Can we really hope to attain all of it?
- SERVED Total customer spending on IT products
offered by IBM - Better definition for our marketing purposes
- REALISTIC IBM spending of the best similar
customers
- Ordering IBM Sales ≤ REALISTIC ≤ SERVED ≤ TOTAL ≤ Company Revenue
69REALISTIC Wallets as quantiles
- Motivation
- Imagine 100 identical firms with identical IT
needs - Consider the distribution of the IBM sales to
these firms - Bottom 95 of firms should spend as much as the
top 5 - Define REALISTIC wallet as high percentile of
spending conditional on the customer attributes - Implies that a few customers are spending full
wallet with us - however, we do not know which ones
70Formally Percentile of Conditional
- Distribution of IBM sales s to the customer given
customer attributes x: s | x ~ f(s; x)
- Two obvious ways to get at the pth percentile
- Estimate the conditional by integrating over a
neighborhood of similar customers, then take the pth
percentile of spending in the neighborhood - Create a global model for the pth percentile by building
a global quantile regression model, e.g.
q_p(s | x) = exp(a + xß)  (this is REALISTIC)
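The neighborhood approach can be sketched in a few lines of Python; the feature matrix, spending vector, and Euclidean distance over (presumably standardized) features are all illustrative assumptions, not the tutorial's actual implementation:

```python
import numpy as np

def qknn_predict(X_train, s_train, x_new, k=50, p=0.9):
    """Q-kNN sketch: predict the p-th percentile of spending
    among the k training customers most similar to x_new."""
    d = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training customer
    nbrs = np.argsort(d)[:k]                     # indices of the k nearest neighbors
    return np.quantile(s_train[nbrs], p)         # percentile of their spending
```

The choices of k, distance function, and feature set are exactly the tuning knobs listed under "General kNN" on the next overview slide.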
71Estimation the Quantile Loss Function
- The mean minimizes a sum of squared residuals
- The median minimizes a sum of absolute residuals.
- The p-th quantile minimizes an asymmetrically
weighted sum of absolute residuals
- L_p(y, q) = p (y - q) if y ≥ q, and (1 - p) (q - y) otherwise
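A minimal sketch of this loss in Python, with an illustrative numerical check that the constant minimizing the 0.9 pinball loss sits at the empirical 0.9 quantile (synthetic data, not the tutorial's):

```python
import numpy as np

def quantile_loss(y, pred, p):
    """Pinball loss: under-predictions are weighted p,
    over-predictions are weighted (1 - p)."""
    r = y - pred
    return np.where(r >= 0, p * r, (p - 1) * r).sum()

# The constant that minimizes the 0.9 pinball loss over a grid
# lands (up to grid resolution) on the empirical 0.9 quantile.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=5000)
grid = np.linspace(0.0, 10.0, 2001)
best = grid[np.argmin([quantile_loss(y, c, 0.9) for c in grid])]
```

With p = 0.5 the same loss reduces to (half) the sum of absolute residuals, recovering the median case on the slide.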
72Overview of analytical approaches
- Ad hoc kNN (industry, size)
- Optimization
- General kNN (k, distance, features)
- Quantile regression (model form linear, decision
tree, quanting)
- Decomposition linear model with adjustment
73Data Generation Process
- Need to combine data on revenue with customers
properties - Complicated matching process between in IBM
internal customer view (accounts) and the
external sources (Dun Breadstreet) - Probabilistic process with plenty of heuristics
- Huge danger of introducing data bias
- Tradeoff in data quality and coverage
- Leakage potential
- We can only get current customer information
- This information might be tainted by the
customer's interaction with IBM - Problem gets amplified when we try to augment the
data with home-page information
74Evaluating Measures for Wallet
- We still don't know the truth
- Combined approach
- Quantile loss to assess only the relevant
predictive ability and feature selection - Expert Feedback to select suitable model class
- Business Impact to identify overall effectiveness
75Empirical Evaluation I Quantile Loss
- Setup
- Four domains with relevant quantile modeling
problems direct mailing, housing prices, income
data, IBM sales - Performance on test set in terms of 0.9th
quantile loss - Approaches
- Linear quantile regression
- Q-kNN (kNN with quantile prediction from the
neighbors) - Quantile trees (quantile prediction in the leaf)
- Bagged quantile trees
- Quanting (Langford et al. 2006 -- reduces
quantile estimation to averaged classification
using trees) - Baselines
- Best constant model
- Traditional regression models for expected
values, adjusted under a Gaussian assumption
(add 1.28σ for the 0.9th quantile)
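Why the Gaussian-adjusted baseline struggles on skewed data can be seen in a tiny simulation; the lognormal "spending" distribution here is an illustrative assumption, not the tutorial's data:

```python
import numpy as np

rng = np.random.default_rng(42)
s = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # heavily right-skewed "spending"

gaussian_q90 = s.mean() + 1.28 * s.std()  # Gaussian-adjusted regression baseline
true_q90 = np.quantile(s, 0.9)            # what a quantile method targets directly

# Under skew, mean + 1.28 * std lands well away from the actual 0.9 quantile.
```

This is the mechanism behind the conclusion on the next slide that standard regression is not competitive when residuals are not normal.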
76Performance on Quantile Loss (smaller is better)
- Conclusions
- Standard regression is not competitive (because
the residuals are not normal) - If there is a time-lagged variable, linear
quantile model is best - Splitting criterion is irrelevant in the tree
models - Quanting (using decision trees) and quantile tree
perform comparably - Generalized kNN is not competitive
77Evaluation II MAP Workshops Overview
- Calculated 2005 opportunity using naive Q-kNN
approach - 2005 MAP workshops
- Displayed opportunity by brand
- Expert can accept or alter the opportunity
- Select 3 brands for evaluation DB2, Rational,
Tivoli - Build 100 models for each brand using different
approaches - Compare expert opportunity to model predictions
- Error measures absolute, squared
- Scale original, log, root
- Total of 6 measures
78Expert Feedback to Original Model
Experts accept opportunity (45%)
Increase (17%)
Experts change opportunity (40%)
Decrease (23%)
Experts reduced opportunity to 0 (15%)
79Observations
- Many accounts are set to zero for external
reasons - Exclude these from evaluation since no model can
predict the competitive environment - Exponential distribution of opportunities
- Evaluation on the original (non-log) scale is
subject to large outliers - Experts seem to make percentage adjustments
- Consider log scale evaluation in addition to
original scale and root as intermediate - Suspect strong anchoring bias, 45 of
opportunities were not touched
80Model Comparison Results
We count how often a model scores within the top
10 and top 20 for each of the 6 measures
[Chart of model rankings the displayed model is marked (Anchoring), the best-performing model (Best)]
81MAP Experiments Conclusions
- Q-kNN performs very well after flooring but is
typically inferior prior to flooring - Linear quantile regression for the 80th percentile
performs consistently well (flooring has a minor
effect) - Experts are strongly influenced by the displayed
opportunity (and displayed revenue of previous
years) - Models without last year's revenue don't perform
well - Use linear quantile regression with q = 0.8 in MAP
06
82MAP Business Impact
- MAP launched in 2005
- In 2006, 420 workshops held worldwide, with teams
responsible for most of IBM's revenue - MAP recognized as 2006 IBM Research
Accomplishment - Awarded based on proven business impact
- Runner up in Case Study Award in KDD 2007
- Edelman finalist 2009
- Most important use is segmentation of customer
base - Shift resources into invest segments with low
wallet share
83Business Impact
- For 2006, 270 resource shifts were made to 268
Invest Accounts - We examine the performance of these accounts
relative to background
- REVENUE
- 9% growth in INVEST accounts
- 4% growth in all other accounts
- PIPELINE (relative to 2005)
- 17% growth in INVEST accounts
- 3% growth in all other accounts
- QUOTA ATTAINMENT
- 45% for MAP-shifted resources
- 36% for non-MAP shifts
[Scatter plot 2005 Actual Revenue ($M) vs. Validated Revenue Opportunity ($M), 270 shifts]
84Summary in terms of success factors
- 1 Data and Domain understanding
- Match of business objective to modeling approach
made a previously unsolvable business problem
solvable with predictive modeling - 2 Statistical insight
- Minimizing quantile loss estimates the correct
quantity - A single evaluation metric is not enough in real
life - Autocorrelation helps the linear model
- 3 Modeling
- Extension to tree induction
- Comparative study
- In the end, linear it is
85Identify Potential Causes for Chip Failure
- Data 5K machines of which 18 failed in the last
year - Task Can you identify a (short) list of other
machines that are likely to fail, so they can be
preemptively fixed - Characteristics
- Relational Tool ID, Multiple chips per machine
(only the first failure is detected) - Leakage database is clearly augmented past
failure all failure have a customer associated,
but customer is missing in most non-failure - Statistical observation This is really a
survival analysis problem, the failure does not
occur prior to a runtime of 180 days - Accuracy and even AUC is NOT relevant
- Insight cause of failure
- Lift and false positive rate in the top k is more
important
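Lift in the top k is straightforward to compute from scores and labels; this is a generic sketch, not code from the project:

```python
import numpy as np

def lift_at_k(scores, labels, k):
    """Positive rate among the k highest-scored examples,
    relative to the overall positive rate."""
    order = np.argsort(scores)[::-1]     # highest scores first
    top_rate = labels[order[:k]].mean()  # precision in the top k
    return top_rate / labels.mean()      # lift over the base rate
```

A lift of 2.0 at k means the short list contains failures at twice the background rate, which is the quantity that matters when only the top of the list gets preemptively fixed.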
86Threats in Competitions and Projects
- Competitions
- Mistakes under time pressure
- Accidental use of the target (kernel SVM)
- Complexity
- Overfitting
- Projects
- Unavailability of data
- Data generation problems
- The model is not good enough to be useful
- Model results are not accessible to the user
- If the user has to understand the model you need
to keep it simple - Web delivery of predictions
87Overfitting
- Even if you think that you know this one -
- You probably still overdo it!
- KDD CUP results have shown that a large number of
entries overfit - In 2003, 90% of entries did worse than the best
constant prediction - Corollary Don't overdo it on the search
- Having a holdout does NOT make you immune to
overfitting - you just overfit on the holdout
- 10-fold cross validation does NOT make you immune
either - Leaderboards on 10% of the test set are VERY deceptive
- KDD CUP 2009 The winner of the fast challenge
after only 5 days was indeed the leader of the
board - The winner of the slow challenge after 1 more
month was NOT the leader of the board
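The deceptive-leaderboard effect is easy to reproduce with pure noise: score many random submissions on a 10% public split, crown the leader, and check it on the 90% private split. Everything below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=1000)                       # labels with no signal at all
public, private = np.arange(100), np.arange(100, 1000)  # 10% public / 90% private split

preds = rng.integers(0, 2, size=(200, 1000))            # 200 pure-noise "submissions"
public_acc = (preds[:, public] == y[public]).mean(axis=1)
leader = int(public_acc.argmax())                       # the leaderboard winner

leader_public = public_acc[leader]                      # inflated by selection
leader_private = (preds[leader][private] == y[private]).mean()  # falls back toward chance
```

Selecting the best of 200 coin-flip models on 100 public examples reliably produces an accuracy well above 50% that evaporates on the private split, which is exactly the KDD CUP 2009 slow-challenge story.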
88Overfitting Example KDD CUP 2008
- Data
- 105,000 candidates
- 117 numeric features
- Sounds good, right?
- Overfitting is NOT just about the training size
and model complexity! - Linear models overfit too!
- How robust is the evaluation measure?
- AUC ?
- FROC ?
- Number of healthy patients ???
- What is the base rate?
- 600 positives ?
89Factors of Success in Competitions and Real Life
- 1. Data and domain understanding
- Generation of data and task
- Cleaning and representation/transformation
- 2. Statistical insights
- Statistical properties
- Test validity of assumptions
- Performance measure
- 3. Modeling and learning approach
- Most publishable part
- Choice or development of most suitable algorithm
Real
Sterile
90Success Factor 1 Data and Domain Understanding
- Task and data generation
- Formulate analytical problem (MAP)
- EDA
- Check for Leakage
91Success Factors 2 Statistical insights
- Properties of Evaluation Measures
- Does it measure what you care about?
- Robustness
- Invariance to transformations
- Linkage between model optimization, statistics, and
performance
92Success Factors 3 Models and approach
- How much complexity do you need?
- Often linear does just fine with correctly
constructed features (actually, many of my wins have
been with linear models) - Feature selection
- Can you optimize what you want to optimize?
- How does the model relate to your evaluation
metric? - Regression approaches predict the conditional mean
- Accuracy vs AUC vs Log likelihood
- Does it scale to your problem?
- Some cool methods just do not run on 100K examples
93Summary comparison of case studies
94Invitation Please join us on another data mining
competition!
- INFORMS Data Mining contest on health care data
- Register at www.informsdmcontest2009.org
- Real data of hospital visits for patients with
severe heart disease - Real tasks from an ongoing project
- Transfer to specialized hospitals
- Severity / death
- Relational (multiple hospital stays per patient)
- Evaluation
- AUC
- Publication and workshop at INFORMS 2009