Title: KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
1. KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University; Claudia Perlich, IBM Research
2. Predictive modeling
- Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes
- Primary example: supervised learning
  - Get training data (x1,y1), (x2,y2), ..., (xn,yn) drawn i.i.d. from a joint distribution on (X, y)
  - Build a model f(x) to describe the relationship between x and y
  - Use it to predict y when only x is observed in the future
- Other cases may relax some of the supervised learning assumptions
  - For example, in KDD Cup 2007 we did not see any yi's and had to extrapolate them from the training xi's (see later in tutorial)
3. Predictive Modeling Competitions
- Competitions like KDD-Cup extract core predictive modeling challenges from their application environment
- Usually supposed to represent real-life predictive modeling challenges
- Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems
- We will see this in examples
4. The Goals of this Tutorial
- Understand the two modes of predictive modeling, their similarities and differences
  - Real-life projects
  - Data mining competitions
- Describe the main factors for success in the two modes of predictive modeling
- Discuss some of the recurring challenges that come up in determining success
- These goals will be addressed and demonstrated through a series of case studies
5. Credentials in Data Mining Competitions
- Claudia Perlich
  - Runner-up, KDD Cup 03
  - Winner, ILP Challenge 05
  - Winner, KDD Cup 09*
- Saharon Rosset
  - Winner, KDD Cup 99
  - Winner, KDD Cup 00
- Jointly
  - Winners of KDD Cup 2007*
  - Winners of KDD Cup 2008*
  - Winners of INFORMS Data Mining Challenge 08*
* With collaborators: Prem Melville, Yan Liu, Grzegorz Swirszcz, Foster Provost, Sofus Macskassy, Aron Inger, Nurit Vatnik, Einat Neuman, Alexandru Niculescu-Mizil
6. Experience with Real-Life Projects
- 2004-2009: collaboration on Business Intelligence projects at IBM Research
- Total of >10 publications on real-life projects
- Total of 4 IBM Outstanding Technical Achievement awards
- IBM accomplishment and major accomplishment
- Finalists in this year's INFORMS Edelman Prize for real-life applications of Operations Research and Statistics
- One of the successful projects will be discussed here as a case study
7. Outline
- Introduction and overview (SR)
  - Differences between competitions and real life
  - Success factors
  - Recurring challenges in competitions and real projects
- Case Studies
  - KDD Cup 2007 (SR)
  - KDD Cup 2008 (CP)
  - Business Intelligence example: Market Alignment Program (MAP) (CP)
- Conclusions and summary (CP)
8. Introduction: What do you think is important?
- Domain knowledge
- Statistics background
- Data mining algorithms
- Computing power
- General experience with data
9. Differences between competitions and projects
In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start from a well-defined predictive task and ignore most of the difficulties of getting to that point.
10. Real-life project evolution and our focus
[Flow diagram: Business/modeling problem definition → Statistical problem definition → Modeling methodology design → Data preparation & integration → Model generation & validation → Implementation & application development.
Examples along the stages: sales force mgmt., wallet estimation; quantile estimation, latent variable estimation; quantile estimation, graphical models; IBM relationships, firmographics; programming, simulation, IBM Wallets; OnTarget, MAP.
Problem definition: not our focus. Implementation: loosely related. The modeling stages in between: our focus.]
11. Two types of competitions
- "Real"
  - Raw data: set up the model yourself
  - Task-specific evaluation
  - Simulates real-life mode
  - Examples: KDD Cup 2007, KDD Cup 2008
  - Approach: understand the domain, analyze the data, build the model
  - Challenges: too numerous to list
- "Sterile"
  - Clean data matrix, standard error measure, often anonymized features
  - Pure machine learning
  - Examples: KDD Cup 2009, PKDD Challenge 2007
  - Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
  - Challenges: size, missing values, features
12. Factors of Success in Competitions and Real Life
- 1. Data and domain understanding
  - Generation of data and task
  - Cleaning and representation/transformation
- 2. Statistical insights
  - Statistical properties
  - Test validity of assumptions
  - Performance measure
- 3. Modeling and learning approach
  - Most publishable part
  - Choice or development of the most suitable algorithm
13. Recurring challenges
- We emphasize three recurring challenges in predictive modeling that often get overlooked
- Data leakage: impact, avoidance and detection
  - Leakage: use of illegitimate data for modeling
  - Legitimate data: data that will be available when the model is applied
  - In competitions, the definition of leakage is unclear
- Adapting learning to real-life performance measures
  - Could move well beyond standard measures like MSE, error rate, or AUC
  - We will see this in two of our case studies
- Relational learning / feature construction
  - Real data is rarely flat, and good, practical solutions for this problem remain a challenge
14. (1) Leakage in Predictive Modeling
- Introduction of predictive information about the target by the data generation, collection, and preparation process
- Trivial example: the binary target was created using a cutoff on a continuous variable, and by accident the continuous variable was not removed
- Reversal of cause and effect, when information from the future becomes available
- It produces models that do not generalize: true model performance is much lower than the out-of-sample (but leakage-contaminated) estimate
- Commonly occurs when combining data from multiple sources or multiple time points, and often manifests in the ordering of data files
- Leakage is surprisingly pervasive in competitions and real life
  - KDD Cup 2007 and KDD Cup 2008 had leakage, as we will see in the case studies
  - The INFORMS competition had leakage due to partial removal of information for only the positive cases
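The "trivial example" above can be reproduced in a few lines. This is an illustrative sketch of ours (all names are hypothetical), not code from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw field; the binary target was created by thresholding it.
income = rng.normal(50.0, 10.0, size=1000)
high_value = (income > 50.0).astype(int)  # target defined by a cutoff

# If 'income' is accidentally left among the predictors, a single split
# reproduces the target perfectly -- a classic leakage red flag.
leaky_accuracy = float(((income > 50.0).astype(int) == high_value).mean())
print(leaky_accuracy)  # 1.0: suspiciously perfect
```

A single feature that predicts the target perfectly is exactly the kind of "too good to be true" signal EDA should flag.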
15. Real-life leakage example
- P. Melville, S. Rosset, R. Lawrence (2008). Customer Targeting Models Using Actively-Selected Web Content. KDD-08
- Built models for identifying new customers for IBM products, based on
  - IBM internal databases
  - Companies' websites
- Example pattern: companies with the word "Websphere" on their website are likely to be good customers for IBM Websphere products
  - Ahem, a slight cause-and-effect problem
- Source of problem: we only have the current view of a company's website, not its view when the company was an IBM prospect (prior to buying)
- Ad-hoc solution: remove all obvious "leakage words"
  - Does not solve the fundamental problem
16. General leakage solution: predict the future
- Niels Bohr is quoted as saying "Prediction is difficult, especially about the future"
- Flipping this around, if
  - The true prediction task is about the future (it usually is)
  - We can make sure that our model only has access to information available at "the present"
  - We can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage
- → We are guaranteed (intuitively and mathematically) that we can prevent leakage
- For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since
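The time-based cutoff can be sketched in a few lines. A minimal sketch assuming hypothetical record and field names:

```python
from datetime import date

# Hypothetical customer records with the date of the outcome event.
records = [
    {"id": 1, "event_date": date(2006, 3, 1)},
    {"id": 2, "event_date": date(2008, 6, 1)},
    {"id": 3, "event_date": date(2005, 1, 15)},
]

cutoff = date(2007, 1, 1)  # "the present": nothing after this feeds the model

# Features may use only pre-cutoff information; outcomes observed
# after the cutoff form the evaluation target.
history = [r for r in records if r["event_date"] < cutoff]
future = [r for r in records if r["event_date"] >= cutoff]
print(len(history), len(future))  # 2 1
```

The discipline, not the code, is the point: every feature must be computable from the pre-cutoff side alone.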
17. (2) Real-life performance measures
- Real-life prediction models should be constructed and judged for performance on real-life measures
  - Address the real problem at hand: optimize the real objective (life span, etc.)
  - At the same time, need to maintain statistical soundness
- Can we optimize these measures directly?
- Are we better off just building good models in general?
- Example: breast cancer detection (KDD Cup 2008)
  - At first sight, a standard classification problem (malignant or benign?)
  - Obvious extension: cost-sensitive objective. Much better to do a biopsy on a healthy subject than send a malignant patient home!
  - Competition objective: optimize effective use of radiologists' time
  - Complex measure called FROC
  - See case study in Claudia's part
18. Optimizing real-life measures
- It is a common approach to use the prediction objective to motivate an empirical loss function for modeling
  - If the prediction objective is the expected value of Y given x, then squared error loss (e.g., linear regression or CART) is appropriate
  - If we want to predict the median of Y instead, then absolute loss is appropriate
  - More generally, quantile loss can be used (cf. MAP case study)
  - We will see successful examples of this approach in two case studies (KDD Cup 07 and MAP)
- What do we do with complex measures like FROC?
  - There is really no way to build a good model for them directly
- Less ambitious approach
  - Build a model using standard approaches (e.g., logistic regression)
  - Post-process your model to do well on the specific measure
  - We will see a successful example of this approach in KDD Cup 08
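The correspondence between loss functions and prediction targets above can be checked numerically. A small sketch on illustrative data:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # skewed sample with an outlier

# Scan constant predictions c and compare empirical losses (grid step 0.05).
grid = np.linspace(0.0, 110.0, 2201)
best_sq = grid[int(np.argmin([((y - c) ** 2).mean() for c in grid]))]
best_abs = grid[int(np.argmin([np.abs(y - c).mean() for c in grid]))]

# Squared loss is minimized at the mean, absolute loss at the median.
print(round(float(best_sq), 2), float(y.mean()))
print(round(float(best_abs), 2), float(np.median(y)))
```

Note how far apart the two optima are on skewed data: the choice of loss function is a choice of prediction target.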
19. (3) Relational and Multi-Level Data
- Real-life databases are rarely flat!
- Example: INFORMS Challenge 08, medical records
[Diagram: several tables of medical records linked by m:n relationships]
20. Approaches for dealing with relational data
- Modeling approaches that use relational data directly
  - There has been a lot of research, but there is a scarcity of practically useful methods that take this approach
- Flattening the relational structure into a standard (X, y) setup
  - The key to this approach is the generation of useful features from the relational tables
  - This is the approach we took in the INFORMS 08 challenge
- Ad-hoc approaches
  - Based on specific properties of the data and modeling problem, it may be possible to divide and conquer the relational setup
  - See example in the KDD Cup 08 case study
21. Modeler's best friend: Exploratory data analysis
- Exploratory data analysis (EDA) is a general name for a class of techniques aimed at
  - Examining data
  - Validating data
  - Forming hypotheses about data
- The techniques are often graphical or intuitive, but can also be statistical
  - Testing very simple hypotheses as a way of getting at more complex ones
  - E.g., test each variable separately against the response, and look for strong correlations
- The most important proponent of EDA was the late, great statistician John Tukey
22. The beauty and value of exploratory data analysis
- EDA is a critical step in creating successful predictive modeling solutions
  - Expose leakage
  - AVOID PRECONCEPTIONS about
    - What matters
    - What would work
    - Etc.
- Example: identifying the KDD Cup 08 leakage through EDA
  - Graphical display of identifier vs. malignant/benign (see case study slide)
  - Could also be discovered via a statistical variable-by-variable examination of significant correlations with the response
  - Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier
23. Elements of EDA for predictive modeling
- Examine data variable by variable
  - Outliers?
  - Missing data patterns?
- Examine relationships with the response
  - Strong correlations?
  - Unexpected correlations?
- Compare to other similar datasets/problems
  - Are variable distributions consistent?
  - Are correlations consistent?
- Stare at raw data, at graphs, at correlations/results
- Unexpected answers to any of these questions may change the course of the predictive modeling process
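A minimal sketch of the "test each variable against the response" screen, on synthetic data with a planted identifier leak (all names are hypothetical, ours rather than the tutorial's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical flat dataset; 'record_id' is the kind of column that
# preconceptions would dismiss as irrelevant.
features = {
    "age": rng.normal(40.0, 10.0, n),
    "visits": rng.poisson(3.0, n).astype(float),
    "record_id": np.arange(n, dtype=float),
}
y = (np.arange(n) < 100).astype(float)  # planted leak: positives come first

# Variable-by-variable screen against the response.
correlations = {name: float(np.corrcoef(x, y)[0, 1])
                for name, x in features.items()}
for name, r in correlations.items():
    flag = "  <-- investigate!" if abs(r) > 0.5 else ""
    print(f"{name:10s} corr={r:+.2f}{flag}")
```

The identifier lights up while the "real" features do not, mirroring how the KDD Cup 08 leakage could have been caught.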
24. Case study 1: Netflix / KDD-Cup 2007
25. October 2006: Announcement of the NETFLIX Competition
- USA Today headline: "Netflix offers $1 million prize for better movie recommendations"
- Details
  - Beat NETFLIX's current recommender Cinematch's RMSE by 10% prior to 2011
  - $50,000 for the annual progress prize
  - First two awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!)
- Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies
- Performance is evaluated on holdout movie-user pairs
- The NETFLIX competition has attracted 50K contestants on 40K teams from >150 different countries
  - 40K valid submissions from 5K different teams
26. NETFLIX Data
[Diagram: from the Internet Movie Database, all movies (80K) → 17K selected (selection unclear); all users (6.8M) → 480K with at least 20 ratings by end of 2005; together forming the NETFLIX competition data of 100M ratings]
27. NETFLIX data generation process
[Timeline diagram, 1998 to 2005/2006: user arrivals and movie arrivals over time; training data covers 1998-2005 over 17K movies; for the KDD Cup there are no new user or movie arrivals in 2006; Task 1 and Task 2 test data are drawn from 2006; qualifier dataset: 3M ratings]
28. KDD-Cup 2007 based on the NETFLIX data
- Training: NETFLIX competition data from 1998-2005
- Test: 2006 ratings, randomly split by movie into two tasks
- Task 1: Who rated what in 2006
  - Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
  - Result: IBM Research team was second runner-up, No. 3 out of 39 teams
- Task 2: Number of ratings per movie in 2006
  - Given a list of 8863 movies, predict the number of additional reviews that all existing users will give in 2006
  - Result: IBM Research team was the winner, No. 1 out of 34 teams
29. Test sets from 2006 for Task 1 and Task 2
[Diagram: marginal 2006 distributions of ratings over users and over movies; sample (movie, user) pairs according to the product of the marginals, then remove pairs that were rated prior to 2006 → Task 1 test set (100K pairs); log(n+1) rating totals per movie → Task 2 test set (8.8K movies)]
30. Task 1: Did user A review movie B in 2006?
- A standard classification task: answer whether existing users will review existing movies
- More in line with the "synthetic" mode of competitions than the "real" mode
- Challenges
  - Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical
  - Complex affecting factors: decreasing interest in old movies, growing tendency of Netflix users to watch (review) more movies
- Key solutions
  - Effective sampling strategies to keep as much information as possible
  - Careful feature extraction from multiple sources
31. Task 2: How many reviews in 2006?
- Task formulation
  - Regression task: predict the total count of reviews from existing users for the 8863 existing movies
  - Evaluation is by RMSE on the log scale
- Challenges
  - Movie dynamics and life-cycle: interest in movies changes over time
  - User dynamics and life-cycle: no new users are added to the database
- Key solutions
  - Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
  - Build a set of quarterly lagged models to determine the overall scaling factor
  - Use Poisson regression
32. Some data observations
Leakage alert!
- The Task 1 test set is a potential response for training a model for Task 2
  - It was sampled according to the marginal (# reviews for movie in '06 / total reviews in '06), which is proportional to the Task 2 response (# reviews for movie in '06)
  - BIG advantage: we get a view of 2006 behavior for half the movies → build a model on this half, apply it to the other half (the Task 2 test set)
- Caveats
  - Proportional sampling implies there is a scaling parameter left over, which we don't know
  - Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set → correcting for this is an inverse rejection sampling problem
33. Test sets from 2006 for Task 1 and Task 2
[Same diagram as slide 29, annotated: estimate the marginal distribution of ratings over movies from the Task 1 test set, a surrogate learning problem for the Task 2 response]
34. Some data observations (ctd.)
- No new movies or reviewers in 2006
- Need to emphasize modeling the life-cycle of movies (and reviewers)
  - How are older movies reviewed relative to newer movies?
  - Does this depend on other features (like the movie's genre)?
- This is especially critical when we consider the scaling caveat above
35. Some statistical perspectives
- The Poisson distribution is very appropriate for counts
  - Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
- The right modeling approach for true counts is Poisson regression:
  n_i ~ Pois(λ_i Δt),  log(λ_i) = Σ_j β_j x_ij,  β̂ = argmax_β ℓ(n | X, β)  (maximum likelihood)
- What does this imply for the model evaluation approach?
  - The variance stabilizing transformation for Poisson is the square root → √n_i has roughly constant variance → RMSE on the log scale emphasizes performance on unpopular movies (small Poisson parameter → larger log-scale variance)
- Still, we assumed that if we do well in a likelihood formulation, we will do well with any evaluation approach
Adapting to evaluation measures!
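Poisson regression as written above can be fit by Newton-Raphson on the log-likelihood. A self-contained numpy sketch on synthetic data (not the competition's actual features):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic movie-level data with a log-linear Poisson rate.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))

# Maximize l(beta) = sum_i [ y_i * x_i'beta - exp(x_i'beta) ] by Newton steps.
beta = np.zeros(3)
for _ in range(25):
    mu = np.exp(X @ beta)            # lambda_i under the current fit
    grad = X.T @ (y - mu)            # score vector
    hess = X.T @ (X * mu[:, None])   # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(np.round(beta, 2))  # close to beta_true
```

This is the standard iteratively reweighted least squares fit for a Poisson GLM with log link; any GLM library would do the same job.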
36. Some statistical perspectives (ctd.)
- Can we invert the rejection sampling mechanism?
- This can be viewed as a missing data problem
  - n_i, m_j are the observed counts for movie i and reviewer j, respectively
  - p_i, q_j are the true marginals for movie i and reviewer j, respectively
  - N is the total number of pairs rejected due to a review prior to 2006
  - U_i, P_j are the users who reviewed movie i prior to 2006 and the movies reviewed by user j prior to 2006, respectively
- Can we design a practical EM algorithm at our huge data size? An interesting research problem
- We implemented an ad-hoc inversion algorithm
  - Iterate until convergence between: assuming movie marginals are fixed, adjusting reviewer marginals; and assuming reviewer marginals are fixed, adjusting movie marginals
- We verified that it indeed improved our data, since it increased correlation with 4Q2005 counts
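The alternating marginal adjustment is in the spirit of iterative proportional fitting. A toy sketch of ours with made-up dimensions (not the tutorial's actual algorithm), where 'seen' marks the rejected (pre-2006) cells:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy ground truth: sampling proportional to p_i * q_j, with some
# (movie, user) cells rejected because the pair was rated before 2006.
n_movies, n_users = 5, 8
p = rng.dirichlet(np.ones(n_movies))   # true movie marginals
q = rng.dirichlet(np.ones(n_users))    # true user marginals
seen = np.zeros((n_movies, n_users), dtype=bool)
seen[[0, 1, 2, 3], [1, 4, 6, 0]] = True

target = np.outer(p, q)
target[seen] = 0.0
observed = target / target.sum()       # distribution the test set reflects

def masked(pm, qm):
    """Product-of-marginals model with the rejected cells zeroed out."""
    M = np.outer(pm, qm)
    M[seen] = 0.0
    return M / M.sum()

# Alternate: fix user marginals and adjust movie marginals, then swap.
p_hat = np.full(n_movies, 1.0 / n_movies)
q_hat = np.full(n_users, 1.0 / n_users)
for _ in range(200):
    M = masked(p_hat, q_hat)
    p_hat *= observed.sum(axis=1) / np.maximum(M.sum(axis=1), 1e-12)
    p_hat /= p_hat.sum()
    M = masked(p_hat, q_hat)
    q_hat *= observed.sum(axis=0) / np.maximum(M.sum(axis=0), 1e-12)
    q_hat /= q_hat.sum()

# The fitted model reproduces the observed (post-rejection) table.
print(np.abs(masked(p_hat, q_hat) - observed).max())
```

At Netflix scale the same alternation runs over sparse count vectors rather than a dense matrix, which is what makes the ad-hoc scheme practical.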
37. Modeling Approach Schema
[Flow diagram: count ratings by movie from the "who reviewed" Task 1 test set (100K) → inverse rejection sampling → estimate Poisson regression M1, predict on Task 1 movies, and use M1 to predict Task 2 movies. In parallel: construct movie features from IMDB and the NETFLIX challenge data; construct lagged features from Q1-Q4 2005 → estimate four Poisson regressions G1-G4, predict for 2006 → find the optimal scaling factor → estimate 2006 total ratings for the Task 2 test set → scale predictions to the total.]
38. Some observations on the modeling approach
- Lagged datasets are meant to simulate forward prediction to 2006
  - Select a quarter (e.g., Q1 05), remove all movies and reviewers that started later
  - Build a model on this data with, e.g., Q3 05 as the response
  - Apply the model to our full dataset, which is naturally cropped at Q4 05 → gives a prediction for Q2 06
  - With several models like this, predict all of 2006
- Two potential uses
  - Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  - Consider only the sum of their predictions, to use for scaling the Task 1 model
- We evaluated models on the Task 1 test set
  - Used a holdout when also building them on this set
- How can we evaluate the models built on lagged datasets?
  - Missing a scaling parameter between the 2006 prediction and the sampled set
  - Solution: select the optimal scaling based on Task 1 test set performance → since the other model was still better, we knew we should use it!
39. Some details on our models and submission
- All models at the movie level. Features we used:
  - Historical reviews in previous months/quarters/years (on log scale)
  - Movie's age since premiere, movie's age in Netflix (since first review)
    - Also consider log, square, etc. → flexibility in the form of the functional dependence
  - Movie's genre
    - Include interactions between genre and age → the life cycle seems to differ by genre!
- Models we considered (MSE on log scale on Task 1 holdout)
  - Poisson regression on Task 1 test set (0.24)
  - Log-scale linear regression model on Task 1 test set (0.25)
  - Sum of lagged models built on 2005 quarters with best scaling (0.31)
- Scaling based on lagged models
  - Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
  - Implied scaling parameter for predictions: about 90%
  - Total of our submitted predictions for the Task 2 test set was 9.3M
40. Competition evaluation
- First we were informed that we had won, with an RMSE of 770
  - They had mistakenly evaluated on the non-log scale
  - Strong emphasis on the most popular movies
  - We won by a large margin → our model did well on the most popular movies!
- Then they re-evaluated on the log scale, and we still won
  - On the log scale the least popular movies are emphasized
  - Recall that the variance stabilizing transformation is in between (square root)
  - So our predictions did well on unpopular movies too!
- Interesting question: would we win on the square-root scale (or, similarly, a Poisson likelihood-based evaluation)? Sure hope so!
41. Competition evaluation (ctd.)
- Results of the competition (log-scale evaluation)
- Components of our model's MSE
  - The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
  - Additional error from an incorrect scaling factor
- Scaling numbers
  - True total reviews: 8.7M
  - Sum of our predictions: 9.3M
- Interesting question: what would be the best scaling
  - For log-scale evaluation? Conjecture: need to under-estimate the true total
  - For square-root evaluation? Conjecture: need to estimate about right
42. Effect of scaling on the two evaluation approaches [chart]
43. Effect of scaling on the two evaluation approaches [chart]
44. KDD Cup 2007 Summary
- Keys to our success
  - Identifying subtle leakage
    - Is it formally leakage? Depends on the intentions of the organizers
  - Appropriate statistical approach
    - Poisson regression
    - Inverting the rejection sampling in the leakage
    - Careful handling of time-series aspects
- Not keys to our success
  - Fancy machine learning algorithms
45. Case Study 2: KDD Cup 2008 - Siemens Medical: Breast Cancer Identification
[Figure: MLO and CC mammography views; 6816 images from 1712 patients, some malignant, yield 105,000 candidates → each candidate is a feature vector (x1, x2, ..., x117, class)]
46. KDD-Cup 2008 based on Mammography
- Training: labeled candidates from 1300 patients, plus the association of each candidate to its location, image and patient
- Test: candidates from a separate set of 1300 patients
- Task 1
  - Rank all candidates by the likelihood of being cancerous
  - Result: IBM Research team was the winner out of 246 submissions
- Task 2
  - Identify a list of healthy patients
  - Result: IBM Research team was the winner out of 205 submissions
47. Task 1: Candidate Likelihood of Cancer
- Almost a standard probability estimation/ranking task at the candidate level
- Somewhat synthetic, as the meaning of the features is unknown
- Challenges
  - Low positive rate: 7% of patients and 0.6% of candidates
    - Beware of overfitting
    - Sampling
  - Unfamiliar evaluation measure
    - FROC, related to AUC
    - Non-robust
    - Hints at locality
- Key solutions
  - Simple linear model
  - Post-processing of scores
  - Leakage in identifiers
Adapting to evaluation measures!
48. Task 2: Classify patients
- A derivative of Task 1
  - A patient is healthy if all her candidates are benign
  - The probability that a patient is healthy is the product of the probabilities of her candidates
- Challenges
  - Extremely non-robust performance measure: including any patient with cancer in the list disqualified the entry
  - Risk tradeoff: need to anticipate the solutions of the other participants
- Key solutions
  - Pick a model with high sensitivity to false negatives
  - Leakage in identifiers: EDA at work
49. EDA on the Breast Cancer Domain
Console output of sorted patient_ID, patient_label:
- 144484 1
- 148717 0
- 168975 0
- 169638 1
- 171985 0
- 177389 1
- 182498 0
- 185266 0
- 193561 1
- 194771 0
- 198716 1
- 199814 1
- 1030694 0
- 1123030 0
- 1171864 0
- 1175742 0
- 1177150 0
- 1194527 0
- 1232036 0
Base rate of 7%????
What about 200K to 999K?
50. Mystery of the Data Generation: Identifier Leakage in the Breast Cancer data
Leakage
- The distribution of identifiers has a strong natural grouping of patient identifiers
  - 3 natural buckets
- The three groups have VERY different base rates of cancer prevalence
- The last group seems to be sorted (cancer first)
  - Total of 4 groups with very different patient probabilities of cancer
- The organizers admitted to having combined data from different years in order to increase the positive rate
51. Building the classification model
- For evaluation we created a stratified 50% training and test split by patient
  - Given few positives (300), results may exhibit high variance
- We explored the use of various learning algorithms, including neural networks, logistic regression and various SVMs
  - Linear models (logistic regression or linear SVMs) yielded the most promising results
  - FROC 0.0834
- Down-sampling the negative class?
  - Keep only 25% of all healthy patients
  - Helped in some cases; not a really reliable improvement
- Add the identifier category (1,2,3,4) as an additional feature
52. Modeling Neighborhood Dependence
Relational data
- Candidates are not really i.i.d. but actually relational
- Stacking
  - Build an initial model and score all candidates
  - Use the labels of neighbors in a second round
- Formulate as an EM problem
  - Treat the labels of the neighbors as unobserved in EM
- Pairwise constraints based on location adjacency
  - Calculate the Euclidean distance between candidates within the same picture, and the distance to the nipple in both views for each breast
  - Select the pairs of candidates with distance difference less than a threshold
  - Constraint: selected pairs of examples (xi,MLO, xi,CC) should have the same predicted labels, i.e., f(xi,MLO) = f(xi,CC)
- Results
  - Seems to improve the probability estimates in terms of AUC
  - Did not improve FROC
53. Outlier Treatment
Statistics
- Many of the 117 numeric features have large outliers
- They incur a huge penalty in terms of likelihood
  - Large bias
  - Badly calibrated probabilities
  - Extreme (wrong) values in the prediction
[Histogram of Feature 10: 142 observations > 5]
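One pragmatic treatment for such heavy-tailed features is winsorizing before fitting. This is a sketch of a common remedy on made-up data, an assumption on our part rather than the tutorial's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical feature with a long right tail, in the spirit of 'Feature 10'.
x = np.concatenate([rng.normal(0.0, 1.0, 990), rng.normal(50.0, 10.0, 10)])

# Winsorize: clip to chosen percentiles so a handful of extreme values
# cannot dominate a likelihood-based fit.
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)

print(float(x.max()), float(x_clipped.max()))  # extreme tail pulled in
```

Whether clipping, log-transforming, or rank-transforming works best depends on the model; the point is that a likelihood-based linear model sees the raw extremes directly.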
54. ROC vs. FROC optimization: post-processing of model scores?
Adapting to evaluation
- In ROC all rows are independent, and both true positives and false positives are counted by row
- FROC has true positive patients and false positive candidates
- A higher TP rate for candidates does not improve FROC unless it comes from a new patient; e.g.,
  - It's better to have 2 correctly identified candidates from different patients than 5 from the same patient
  - It's best to re-order candidates based on model scores so as to ensure many different patients up front
55. Theory of Post-processing
Adapting to evaluation
- Probabilistic approach
  - At any point we want to maximize the expected gradient of the FROC at this point
  - Define for each candidate c of patient i:
    - pc: probability that candidate c is malignant
    - npi: probability that patient i has not yet been identified
  - 3 cases
    - Candidate is positive but the patient is already identified: probability pc (1-npi)
    - Candidate is positive and the patient is new: probability pc npi
    - Candidate is negative: probability 1-pc
  - Pick the candidate with the largest expected gain pc npi / (1-pc)
- Theorem
  - The expected value of FROC for the resulting order is higher than for any other order
- Problem
  - Our probability estimates are not good enough for this to work well
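The greedy rule can be sketched as follows; the names and scores are illustrative, and this is our reading of the slide rather than the tutorial's exact code:

```python
def reorder_candidates(candidates):
    """candidates: list of (patient_id, malignancy_prob), prob < 1.
    Greedily emit the candidate with the largest expected gain
    p_c * np_i / (1 - p_c), where np_i is the probability that
    patient i has not yet been identified."""
    np_i = {pid: 1.0 for pid, _ in candidates}
    remaining = list(candidates)
    ordered = []
    while remaining:
        best = max(remaining, key=lambda c: c[1] * np_i[c[0]] / (1.0 - c[1]))
        remaining.remove(best)
        ordered.append(best)
        np_i[best[0]] *= 1.0 - best[1]  # patient i is less likely still "new"
    return ordered

# Illustrative scores: note how different patients get interleaved up front.
cands = [("A", 0.9), ("A", 0.8), ("B", 0.5), ("B", 0.4), ("C", 0.3)]
print(reorder_candidates(cands))
```

After a patient's strong candidate is emitted, that patient's remaining candidates are discounted, pushing fresh patients toward the front, which is exactly what FROC rewards.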
56. Bad Calibration!
Statistics
Calibration plot
- We consistently over-predict the probability of cancer for the most likely candidates
  - Bias of the linear method
  - High class skew
  - Outliers in the 117 numeric features lead to extreme predictions on the holdout
- Re-calibration?
  - We tried a number of methods
  - No improvement
  - Some resulted in better calibration but hurt the ranking
57. Post-Processing Heuristic
Adapting to evaluation
- Ad-hoc approach
  - Take the top n ranked candidates, where n is approximately the number of positive candidates
  - Select the candidate with the highest score for each patient from this list and put them at the top of the list
  - Iterate until all top n candidates are re-ordered
[Chart: true positive patient rate vs. false positive rate per image]
Re-ordering model scores significantly improves the FROC with no additional modeling
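A sketch of this heuristic (our reading of the slide; names and scores are illustrative): keep the top-n list score-sorted, and in each pass pull out the best remaining candidate of every patient:

```python
def reorder_topn(scored, n):
    """scored: list of (patient_id, score). Re-orders the top-n
    candidates so that each pass contributes at most one candidate
    per patient, in score order; the rest keep their ranking."""
    ranked = sorted(scored, key=lambda c: -c[1])
    top, rest = ranked[:n], ranked[n:]
    reordered = []
    while top:
        seen, picked = set(), []
        for cand in top:                  # top stays score-sorted
            if cand[0] not in seen:
                seen.add(cand[0])
                picked.append(cand)       # best remaining per patient
        top = [c for c in top if c not in picked]
        reordered.extend(picked)
    return reordered + rest

cands = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("B", 0.6), ("C", 0.5)]
print(reorder_topn(cands, 4))  # A and B interleave before A's second candidate
```

Unlike the probabilistic rule on the previous slide, this needs only the ranking, not calibrated probabilities, which is why it was usable despite the calibration problems.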
58. Submissions and Results
- Task 1
  - Bagged linear SVMs with hinge loss and heuristic post-processing
  - This approach scored the winning result of 0.0933 on FROC, out of 246 submissions from 110 unique participants
  - Second place scored 0.0895
  - Some rumors that other participants also found the ID leakage
- Task 2
  - The logistic model performs better than the SVM models, probably because likelihood is more sensitive to extreme errors (the first false negative)
  - The first false negative typically occurs around 1100 patients in the training set
  - We submitted the first 1020 patients ranked by a logistic model that included the ID feature plus the original 117 features
  - Scored a specificity of 0.682 on the test set with no false negatives
    - Only 24 out of 203 submissions had no false negatives
    - Second place scored 0.17 specificity
59. Summary in terms of success factors
- Leakage in the identifier provides information about the likelihood of a patient having cancer
  - Caused by the organizers' effort to increase the positive rate by adding old patients that developed cancer
- Post-processing for FROC optimization
- Awareness of the impact of feature outliers
  - Interacts with the statistical properties of the data and the model
  - Log-likelihood is more sensitive than hinge loss
- Otherwise a simple model, to avoid overfitting
  - Linear models
- Relational modeling is not helpful for the given evaluation
60. KDD Cup 2009
- Data: customer database of Orange, with 100K observations and 15K variables
- Three different tasks and 2.5 versions
  - Prediction: churn, appetency, upselling
  - Versions: fast (5 days) and slow (1 month)
  - Large and small version
- Interesting characteristics
  - Highly sterile; nothing known about anything
  - Leaderboard
  - It was possible to match the large and small versions and receive feedback on 20% of the test set
61. KDD Cup 2000
- Data: online store history for Gazelle.com
- Five different tasks, including
  - Prediction: Who will continue in a session? Who will buy?
  - Insights: characterize heavy spenders
- Interesting characteristics
  - Leakage: internal testing sessions were left in the data
    - Deterministic behavior: if identified, gives 100% accuracy in prediction for part of the data
  - Evaluation in terms of real business objectives?
    - Sort of handled by defining a set of standard questions, each covering a different aspect of the business objective
  - Relational data?
    - Yes, customers had different numbers of sessions, of different lengths, with different stages
62. KDD Cup 2003
- Data: citation rates of physics papers
- Two tasks
  - Predict the change in the number of citations during the next 3 months
  - Write an interesting paper about it
- Interesting characteristics
  - Highly relational: links between papers and authors
  - Feature construction up to the participants
  - Leakage impossible, since the truth was really in the future
  - Evaluation on SSE against integer values (Poisson)
63. ILP Challenge 2003
- Data: yeast genome, including protein sequences, alignment similarity scores with other proteins, and additional protein information from a relational DB
- Task: identify the (potentially multiple) functional classes of each gene
- Interesting characteristics
  - 420 possible classes, very subjective assignment
  - Purely relational, no features available
    - Distances (supposedly p-values) of gene alignments
    - Secondary structure (sequence of amino acids)
    - Protein DB with keywords, etc.
  - Leakage in the identifier: contains a letter identifying the labeling research group
  - Highly unsatisfactory evaluation: precision of the prediction
64. INFORMS data mining contest 2008
- Data: 2 years of hospital records with accounting information (cost, reimbursement, ...), patient demographics, medication history
- Tasks
  - Identify pneumonia patients
  - Design an optimization setting for preventive treatment
- Interesting characteristics
  - Relational setting (4 tables linked through a patient identifier)
  - Leakage: removal of the pneumonia code left hidden traces
  - Dirty data, with plenty of missing values, contradicting demographics and changing patient IDs
65. Data Mining in the Wild: Project Work
- Similarities with competitions (compared to DM research)
  - Single dataset
  - Algorithms can be existing and simple
  - No real need for baselines (although useful)
  - The absolute performance matters
- Differences from competitions
  - You need to decide what the analytical problem is
  - You need to define the evaluation rather than optimize it
  - You need to avoid leakage rather than use it
  - You need to FIND all relevant data rather than use what is there (often leads to relational settings)
  - You need to deliver it somehow to have impact
66Case Study 3 Market Alignment Program
- Wallet
- Total amount of money that the customer can
spend in a certain product category in a given
period - Why Are We Interested in Wallet?
- Customer targeting
- Focus on acquiring customers with high wallet
- For existing customers, focus on high wallet, low
share-of-wallet customers - Sales force management
- Use wallet as a sales force allocation target and
make resource assignment decisions based on
wallet - Evaluate success of sales personnel by
attained share-of-wallet
67Wallet Modeling Challenge
- The customer wallet is never observed
- Nothing to fit a model to
- Even if you have a model, how do you evaluate it?
- We would like a predictive approach from
available data - Firmographics (Sales, Industry, Employees)
- IBM Sales and transaction history
68Define Wallet/Opportunity?
- TOTAL Total customer available budget in total
IT - Can we really hope to attain all of it?
- SERVED Total customer spending on IT products
offered by IBM - Better definition for our marketing purposes
- REALISTIC IBM spending of the best similar
customers
- Ordering IBM Sales ≤ REALISTIC ≤ SERVED ≤ TOTAL ≤ Company Revenue
69REALISTIC Wallets as quantiles
- Motivation
- Imagine 100 identical firms with identical IT
needs - Consider the distribution of the IBM sales to
these firms - Bottom 95 of firms should spend as much as the
top 5 - Define REALISTIC wallet as high percentile of
spending conditional on the customer attributes - Implies that a few customers are spending full
wallet with us - however, we do not know which ones
70Formally Percentile of Conditional
- Distribution of IBM sales s to the customer given
customer attributes x: s | x ~ f(s; x)
- Two obvious ways to get at the pth percentile
- Estimate the conditional by integrating over a
neighborhood of similar customers, then take the pth
percentile of spending in the neighborhood - Create a global model for the pth percentile by building
a global quantile regression model, e.g.
q_p(s | x) = exp(a + xß)  (this is REALISTIC)
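The neighborhood approach can be sketched in a few lines of Python; the feature matrix, spending vector, and Euclidean distance over (presumably standardized) features are all illustrative assumptions, not the tutorial's actual implementation:

```python
import numpy as np

def qknn_predict(X_train, s_train, x_new, k=50, p=0.9):
    """Q-kNN sketch: predict the p-th percentile of spending
    among the k training customers most similar to x_new."""
    d = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training customer
    nbrs = np.argsort(d)[:k]                     # indices of the k nearest neighbors
    return np.quantile(s_train[nbrs], p)         # percentile of their spending
```

The choices of k, distance function, and feature set are exactly the tuning knobs listed under "General kNN" on the next overview slide.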
71Estimation the Quantile Loss Function
- The mean minimizes a sum of squared residuals
- The median minimizes a sum of absolute residuals.
- The p-th quantile minimizes an asymmetrically
weighted sum of absolute residuals
- L_p(y, q) = p (y - q) if y ≥ q, and (1 - p) (q - y) otherwise
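A minimal sketch of this loss in Python, with an illustrative numerical check that the constant minimizing the 0.9 pinball loss sits at the empirical 0.9 quantile (synthetic data, not the tutorial's):

```python
import numpy as np

def quantile_loss(y, pred, p):
    """Pinball loss: under-predictions are weighted p,
    over-predictions are weighted (1 - p)."""
    r = y - pred
    return np.where(r >= 0, p * r, (p - 1) * r).sum()

# The constant that minimizes the 0.9 pinball loss over a grid
# lands (up to grid resolution) on the empirical 0.9 quantile.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=5000)
grid = np.linspace(0.0, 10.0, 2001)
best = grid[np.argmin([quantile_loss(y, c, 0.9) for c in grid])]
```

With p = 0.5 the same loss reduces to (half) the sum of absolute residuals, recovering the median case on the slide.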
72Overview of analytical approaches
- Ad hoc kNN (industry, size)
- Optimization
- General kNN (k, distance, features)
- Quantile regression (model form linear, decision
tree, quanting)
- Decomposition linear model with adjustment
73Data Generation Process
- Need to combine data on revenue with customers
properties - Complicated matching process between in IBM
internal customer view (accounts) and the
external sources (Dun Breadstreet) - Probabilistic process with plenty of heuristics
- Huge danger of introducing data bias
- Tradeoff in data quality and coverage
- Leakage potential
- We can only get current customer information
- This information might be tainted by the
customer's interaction with IBM - Problem gets amplified when we try to augment the
data with home-page information
74Evaluating Measures for Wallet
- We still don't know the truth
- Combined approach
- Quantile loss to assess only the relevant
predictive ability and feature selection - Expert Feedback to select suitable model class
- Business Impact to identify overall effectiveness
75Empirical Evaluation I Quantile Loss
- Setup
- Four domains with relevant quantile modeling
problems direct mailing, housing prices, income
data, IBM sales - Performance on test set in terms of 0.9th
quantile loss - Approaches
- Linear quantile regression
- Q-kNN (kNN with quantile prediction from the
neighbors) - Quantile trees (quantile prediction in the leaf)
- Bagged quantile trees
- Quanting (Langford et al. 2006 -- reduces
quantile estimation to averaged classification
using trees) - Baselines
- Best constant model
- Traditional regression models for expected
values, adjusted under a Gaussian assumption
(add 1.28σ for the 0.9th quantile)
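Why the Gaussian-adjusted baseline struggles on skewed data can be seen in a tiny simulation; the lognormal "spending" distribution here is an illustrative assumption, not the tutorial's data:

```python
import numpy as np

rng = np.random.default_rng(42)
s = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # heavily right-skewed "spending"

gaussian_q90 = s.mean() + 1.28 * s.std()  # Gaussian-adjusted regression baseline
true_q90 = np.quantile(s, 0.9)            # what a quantile method targets directly

# Under skew, mean + 1.28 * std lands well away from the actual 0.9 quantile.
```

This is the mechanism behind the conclusion on the next slide that standard regression is not competitive when residuals are not normal.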
76Performance on Quantile Loss (smaller is better)
- Conclusions
- Standard regression is not competitive (because
the residuals are not normal) - If there is a time-lagged variable, linear
quantile model is best - Splitting criterion is irrelevant in the tree
models - Quanting (using decision trees) and quantile tree
perform comparably - Generalized kNN is not competitive
77Evaluation II MAP Workshops Overview
- Calculated 2005 opportunity using naive Q-kNN
approach - 2005 MAP workshops
- Displayed opportunity by brand
- Expert can accept or alter the opportunity
- Select 3 brands for evaluation DB2, Rational,
Tivoli - Build 100 models for each brand using different
approaches - Compare expert opportunity to model predictions
- Error measures absolute, squared
- Scale original, log, root
- Total of 6 measures
78Expert Feedback to Original Model
Experts accept opportunity (45%)
Increase (17%)
Experts change opportunity (40%)
Decrease (23%)
Experts reduced opportunity to 0 (15%)
79Observations
- Many accounts are set to zero for external
reasons - Exclude these from evaluation since no model can
predict the competitive environment - Exponential distribution of opportunities
- Evaluation on the original (non-log) scale is
subject to large outliers - Experts seem to make percentage adjustments
- Consider log scale evaluation in addition to
original scale and root as intermediate - Suspect strong anchoring bias, 45 of
opportunities were not touched
80Model Comparison Results
We count how often a model scores within the top
10 and top 20 for each of the 6 measures
[Chart of model rankings the displayed model is marked (Anchoring), the best-performing model (Best)]
81MAP Experiments Conclusions
- Q-kNN performs very well after flooring but is
typically inferior prior to flooring - Linear quantile regression for the 80th percentile
performs consistently well (flooring has a minor
effect) - Experts are strongly influenced by the displayed
opportunity (and displayed revenue of previous
years) - Models without last year's revenue don't perform
well - Use linear quantile regression with q = 0.8 in MAP
06
82MAP Business Impact
- MAP launched in 2005
- In 2006, 420 workshops held worldwide, with teams
responsible for most of IBM's revenue - MAP recognized as 2006 IBM Research
Accomplishment - Awarded based on proven business impact
- Runner up in Case Study Award in KDD 2007
- Edelman finalist 2009
- Most important use is segmentation of customer
base - Shift resources into invest segments with low
wallet share
83Business Impact
- For 2006, 270 resource shifts were made to 268
Invest Accounts - We examine the performance of these accounts
relative to background
- REVENUE
- 9% growth in INVEST accounts
- 4% growth in all other accounts
- PIPELINE (relative to 2005)
- 17% growth in INVEST accounts
- 3% growth in all other accounts
- QUOTA ATTAINMENT
- 45% for MAP-shifted resources
- 36% for non-MAP shifts
[Scatter plot 2005 Actual Revenue ($M) vs. Validated Revenue Opportunity ($M), 270 shifts]
84Summary in terms of success factors
- 1 Data and Domain understanding
- Match of business objective to modeling approach
made a previously unsolvable business problem
solvable with predictive modeling - 2 Statistical insight
- Minimizing quantile loss estimates the correct
quantity - A single evaluation metric is not enough in real
life - Autocorrelation helps the linear model
- 3 Modeling
- Extension to tree induction
- Comparative study
- In the end, linear it is
85Identify Potential Causes for Chip Failure
- Data 5K machines of which 18 failed in the last
year - Task Can you identify a (short) list of other
machines that are likely to fail, so they can be
preemptively fixed - Characteristics
- Relational Tool ID, Multiple chips per machine
(only the first failure is detected) - Leakage database is clearly augmented past
failure all failure have a customer associated,
but customer is missing in most non-failure - Statistical observation This is really a
survival analysis problem, the failure does not
occur prior to a runtime of 180 days - Accuracy and even AUC is NOT relevant
- Insight cause of failure
- Lift and false positive rate in the top k is more
important
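Lift in the top k is straightforward to compute from scores and labels; this is a generic sketch, not code from the project:

```python
import numpy as np

def lift_at_k(scores, labels, k):
    """Positive rate among the k highest-scored examples,
    relative to the overall positive rate."""
    order = np.argsort(scores)[::-1]     # highest scores first
    top_rate = labels[order[:k]].mean()  # precision in the top k
    return top_rate / labels.mean()      # lift over the base rate
```

A lift of 2.0 at k means the short list contains failures at twice the background rate, which is the quantity that matters when only the top of the list gets preemptively fixed.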
86Threats in Competitions and Projects
- Competitions
- Mistakes under time pressure
- Accidental use of the target (kernel SVM)
- Complexity
- Overfitting
- Projects
- Unavailability of data
- Data generation problems
- The model is not good enough to be useful
- Model results are not accessible to the user
- If the user has to understand the model you need
to keep it simple - Web delivery of predictions
87Overfitting
- Even if you think that you know this one -
- You probably still overdo it!
- KDD CUP results have shown that a large number of
entries overfit - In 2003, 90% of entries did worse than the best
constant prediction - Corollary Don't overdo it on the search
- Having a holdout does NOT make you immune to
overfitting - you just overfit on the holdout
- 10-fold cross validation does NOT make you immune
either - Leaderboards on 10% of the test set are VERY deceptive
- KDD CUP 2009 The winner of the fast challenge
after only 5 days was indeed the leader of the
board - The winner of the slow challenge after 1 more
month was NOT the leader of the board
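The deceptive-leaderboard effect is easy to reproduce with pure noise: score many random submissions on a 10% public split, crown the leader, and check it on the 90% private split. Everything below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=1000)                       # labels with no signal at all
public, private = np.arange(100), np.arange(100, 1000)  # 10% public / 90% private split

preds = rng.integers(0, 2, size=(200, 1000))            # 200 pure-noise "submissions"
public_acc = (preds[:, public] == y[public]).mean(axis=1)
leader = int(public_acc.argmax())                       # the leaderboard winner

leader_public = public_acc[leader]                      # inflated by selection
leader_private = (preds[leader][private] == y[private]).mean()  # falls back toward chance
```

Selecting the best of 200 coin-flip models on 100 public examples reliably produces an accuracy well above 50% that evaporates on the private split, which is exactly the KDD CUP 2009 slow-challenge story.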
88Overfitting Example KDD CUP 2008
- Data
- 105,000 candidates
- 117 numeric features
- Sounds good, right?
- Overfitting is NOT just about the training size
and model complexity! - Linear models overfit too!
- How robust is the evaluation measure?
- AUC ?
- FROC ?
- Number of healthy patients ???
- What is the base rate?
- 600 positives ?
89Factors of Success in Competitions and Real Life
- 1. Data and domain understanding
- Generation of data and task
- Cleaning and representation/transformation
- 2. Statistical insights
- Statistical properties
- Test validity of assumptions
- Performance measure
- 3. Modeling and learning approach
- Most publishable part
- Choice or development of most suitable algorithm
Real
Sterile
90Success Factor 1 Data and Domain Understanding
- Task and data generation
- Formulate analytical problem (MAP)
- EDA
- Check for Leakage
91Success Factors 2 Statistical insights
- Properties of Evaluation Measures
- Does it measure what you care about?
- Robustness
- Invariance to transformations
- Linkage between model optimization, statistics, and
performance
92Success Factors 3 Models and approach
- How much complexity do you need?
- Often linear does just fine with correctly
constructed features (actually, many of my wins have
been with linear models) - Feature selection
- Can you optimize what you want to optimize?
- How does the model relate to your evaluation
metric? - Regression approaches predict the conditional mean
- Accuracy vs AUC vs Log likelihood
- Does it scale to your problem?
- Some cool methods just do not run on 100K examples
93Summary comparison of case studies
94Invitation Please join us on another data mining
competition!
- INFORMS Data Mining contest on health care data
- Register at www.informsdmcontest2009.org
- Real data of hospital visits for patients with
severe heart disease - Real tasks from an ongoing project
- Transfer to specialized hospitals
- Severity / death
- Relational (multiple hospital stays per patient)
- Evaluation
- AUC
- Publication and workshop at INFORMS 2009