1
KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University
Claudia Perlich, IBM Research
2
Predictive modeling
  • Most general definition: build a model from
    observed data, with the goal of predicting some
    unobserved outcomes
  • Primary example: supervised learning
  • Get training data (x1, y1), (x2, y2), ..., (xn, yn)
    drawn i.i.d. from a joint distribution on (X, y)
  • Build model f(x) to describe the relationship
    between x and y
  • Use it to predict y when only x is observed in
    future
  • Other cases may relax some of the supervised
    learning assumptions
  • For example, in KDD Cup 2007 we did not see any
    yi's and had to extrapolate them based on the
    training xi's (see later in the tutorial)

3
Predictive Modeling Competitions
  • Competitions like KDD-Cup extract core
    predictive modeling challenges from their
    application environment
  • Usually supposed to represent real-life
    predictive modeling challenges
  • Extracting a real-life problem from its context
    and making a credible competition out of it is
    often more difficult than it seems
  • We will see it in examples

4
The Goals of this Tutorial
  • Understand the two modes of predictive modeling,
    their similarities and differences
  • Real life projects
  • Data mining competitions
  • Describe the main factors for success in the two
    modes of predictive modeling
  • Discuss some of the recurring challenges that
    come up in determining success
  • These goals will be addressed and demonstrated
    through a series of case studies

5
Credentials in Data Mining Competitions
Claudia Perlich
  • Runner-up KDD CUP 03
  • Winner ILP Challenge 05
  • Winner KDD CUP 09 @
Saharon Rosset
  • Winner KDD CUP 99
  • Winner KDD CUP 00
Jointly
  • Winners of KDD CUP 2007 @
  • Winners of KDD CUP 2008 @
  • Winners of INFORMS data mining challenge 08 @

Collaborators: @Prem Melville, @Yan Liu, @Grzegorz Swirszcz, Foster Provost, Sofus Macskassy, Aron Inger, Nurit Vatnik, Einat Neuman, @Alexandru Niculescu-Mizil
6
Experience with Real Life Projects
  • 2004-2009 Collaboration on Business Intelligence
    projects at IBM Research
  • Total of >10 publications on real-life projects
  • Total of 4 IBM Outstanding Technical Achievement
    awards
  • IBM Accomplishment and Major Accomplishment
  • Finalists in this year's INFORMS Edelman Prize
    for real-life applications of Operations Research
    and Statistics
  • One of the successful projects will be discussed
    here as a case study

7
Outline
  • Introduction and overview (SR)
  • Differences between competitions and real life
  • Success Factors
  • Recurrent challenges in competitions and real
    projects
  • Case Studies
  • KDD CUP 2007 (SR)
  • KDD CUP 2008 (CP)
  • Business Intelligence Example Market Alignment
    Program (MAP) (CP)
  • Conclusions and summary (CP)

8
Introduction: What do you think is important?
  • Domain knowledge
  • Statistics background
  • Data mining algorithms
  • Computing power
  • General Experience with data

9
Differences between competitions and projects
In this tutorial we deal with the predictive
modeling aspect, so our discussion of projects
will also start with a well defined predictive
task and ignore most of the difficulties with
getting to that point
10
Real life project evolution and our focus
(Flattened process diagram.) Stages of a real-life project, with examples from the wallet-estimation work:
  • Business / modeling problem definition (e.g., sales force mgmt., wallet estimation)
  • Statistical problem definition (e.g., quantile estimation, latent variable estimation)
  • Modeling methodology design (e.g., quantile estimation, graphical model)
  • Data preparation & integration (e.g., IBM relationships, firmographics)
  • Model generation & validation (e.g., programming, simulation, IBM Wallets)
  • Implementation & application development (e.g., OnTarget, MAP)
The diagram also labels each stage as "our focus", "loosely related", or "not our focus".
11
Two types of competitions
  • "Real" competitions
    • Raw data
    • Set up the model yourself
    • Task-specific evaluation
    • Simulate real-life mode
    • Examples: KDD Cup 2007, KDD Cup 2008
    • Approach: understand the domain, analyze the data, build the model
    • Challenges: too numerous
  • "Sterile" competitions
    • Clean data matrix
    • Standard error measure
    • Often anonymized features
    • Pure machine learning
    • Examples: KDD Cup 2009, PKDD Challenge 2007
    • Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
    • Challenges: size, missing values, features

12
Factors of Success in Competitions and Real Life
  • 1. Data and domain understanding
  • Generation of data and task
  • Cleaning and representation/transformation
  • 2. Statistical insights
  • Statistical properties
  • Test validity of assumptions
  • Performance measure
  • 3. Modeling and learning approach
  • Most publishable part
  • Choice or development of most suitable algorithm

(Slide annotation: spectrum from "Real" to "Sterile" competitions)
13
Recurring challenges
  • We emphasize three recurring challenges in
    predictive modeling that often get overlooked
  • Data leakage: impact, avoidance, and detection
  • Leakage = use of illegitimate data for modeling
  • Legitimate data = data that will be available
    when the model is applied
  • In competitions, the definition of leakage is
    unclear
  • Adapting learning to real-life performance
    measures
  • Could move well beyond standard measures like
    MSE, error rate, or AUC
  • We will see this in two of our case studies
  • Relational learning/Feature construction
  • Real data is rarely flat, and good, practical
    solutions for this problem remain a challenge

14
1 Leakage in Predictive Modeling
  • Introduction of predictive information about the
    target by the data generation, collection, and
    preparation process
  • Trivial example: the binary target was created using
    a cutoff on a continuous variable and, by
    accident, the continuous variable was not removed
  • Reversal of cause and effect when information
    from the future becomes available
  • It produces models that do not generalize: true
    model performance is much lower than the out-of-sample
    (but leakage-including) estimate
  • Commonly occurs when combining data from multiple
    sources or multiple time points, and often
    manifests in the ordering of data files
  • Leakage is surprisingly pervasive in competitions
    and real life
  • KDD CUP 2007 and KDD CUP 2008 had leakage, as we
    will see in case studies
  • INFORMS competition had leakage due to partial
    removal of information for only positive cases

15
Real life leakage example
  • P. Melville, S. Rosset, R. Lawrence (2008)
    Customer Targeting Models Using Actively-Selected
    Web Content. KDD-08
  • Built models for identifying new customers for
    IBM products, based on
  • IBM Internal databases
  • Companies' websites
  • Example pattern: companies with the word
    "Websphere" on their website are likely to be
    good customers for IBM Websphere products
  • Ahem, a slight cause-and-effect problem
  • Source of problem: we only have the current view of
    a company's website, not its view when the company
    was an IBM prospect (prior to buying)
  • Ad-hoc solution: remove all obvious leakage
    words.
  • Does not solve the fundamental problem

16
General leakage solution predict the future
  • Niels Bohr is quoted as saying, "Prediction is
    difficult, especially about the future"
  • Flipping this around, if
  • The true prediction task is about the future
    (usually is)
  • We can make sure that our model only has access
    to information at the present
  • We can apply the time-based cutoff in the
    competition / evaluation / proof of concept stage
  • ⇒ We are guaranteed (intuitively and
    mathematically) that we can prevent leakage
  • For the websites example, this would require
    getting an internet snapshot from (say) two years
    ago, and using only what we knew then to learn
    who bought since
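
As an illustration of the time-based cutoff, here is a minimal sketch in Python/pandas. The tables and column names (events, outcomes, company_id, feature_date, purchase_date) are hypothetical, not the actual IBM data; the point is only that features see strictly pre-cutoff information while the target is defined from post-cutoff information.

import pandas as pd

CUTOFF = pd.Timestamp("2007-01-01")  # treat this date as "the present"

def build_snapshot(events: pd.DataFrame, outcomes: pd.DataFrame) -> pd.DataFrame:
    # Features may only use information recorded strictly before the cutoff
    past = events[events["feature_date"] < CUTOFF]
    features = past.groupby("company_id").agg(
        n_events=("feature_date", "size"),
        last_seen=("feature_date", "max"),
    )
    # The target is defined only by what happened at or after the cutoff
    future = outcomes[outcomes["purchase_date"] >= CUTOFF]
    bought = future.groupby("company_id").size().rename("n_purchases")
    data = features.join(bought, how="left")
    data["bought_after_cutoff"] = data["n_purchases"].fillna(0).gt(0).astype(int)
    return data.drop(columns="n_purchases")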

17
2 Real-life performance measures
  • Real life prediction models should be constructed
    and judged for performance on real-life measures
  • Address the real problem at hand: optimize $,
    life span, etc.
  • At the same time, need to maintain statistical
    soundness
  • Can we optimize these measures directly?
  • Are we better off just building good models in
    general?
  • Example: breast cancer detection (KDD Cup 2008)
  • At first sight, a standard classification problem
    (malignant or benign?)
  • Obvious extension: cost-sensitive objective. Much
    better to do a biopsy on a healthy subject than to
    send a malignant patient home!
  • Competition objective: optimize effective use of
    radiologists' time
  • Complex measure called FROC
  • See case study in Claudia's part

18
Optimizing real-life measures
  • It is a common approach to use the prediction
    objective to motivate an empirical loss function
    for modeling
  • If the prediction objective is the expected value
    of Y given x, then squared error loss (e.g.,
    linear regression or CART) is appropriate
  • If we want to predict the median of Y instead,
    then absolute loss is appropriate
  • More generally, quantile loss can be used (cf.
    MAP case study)
  • We will see successful examples of this approach
    in two case studies (KDD CUP 07 and MAP)
  • What do we do with complex measures like FROC?
  • There is really no way to build a good model
    directly
  • Less ambitious approach
  • Build a model using standard approaches (e.g.
    logistic regression)
  • post-process your model to do well on the
    specific measure
  • We will see a successful example of this approach
    in KDD CUP 08

19
3 Relational and Multi-Level Data
  • Real-life databases are rarely flat!
  • Example: INFORMS Challenge 08, medical records

(Schema diagram: multiple tables linked by many-to-many (m:n) relationships)
20
Approaches for dealing with relational data
  • Modeling approaches that use relational data
    directly
  • There has been a lot of research, but there is a
    scarcity of practically useful methods that take
    this approach
  • Flattening the relational structure into a
    standard X,y setup
  • The key to this approach is generation of useful
    features from the relational tables
  • This is the approach we took in the INFORMS08
    challenge
  • Ad hoc approaches
  • Based on specific properties of the data and
    modeling problem, it may be possible to divide
    and conquer the relational setup
  • See example in the KDD CUP 08 case study
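
A minimal sketch of the flattening approach in Python/pandas: aggregate each child table into per-entity features and join them onto the flat table. The tables and columns (patients, visits, prescriptions, ...) are hypothetical and only illustrate the idea, not the actual INFORMS08 schema.

import pandas as pd

def flatten(patients: pd.DataFrame, visits: pd.DataFrame,
            prescriptions: pd.DataFrame) -> pd.DataFrame:
    # Aggregate each child table to one row per patient
    visit_feats = visits.groupby("patient_id").agg(
        n_visits=("visit_id", "nunique"),
        total_cost=("cost", "sum"),
        last_visit=("visit_date", "max"),
    )
    rx_feats = prescriptions.groupby("patient_id").agg(
        n_distinct_drugs=("drug_code", "nunique"),
    )
    # Join the aggregates onto the flat patient table -> standard (X, y) setup
    return (patients.set_index("patient_id")
                    .join(visit_feats, how="left")
                    .join(rx_feats, how="left")
                    .fillna(0))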

21
Modeler's best friend: Exploratory data analysis
  • Exploratory data analysis (EDA) is a general name
    for a class of techniques aimed at
  • Examining data
  • Validating data
  • Forming hypotheses about data
  • The techniques are often graphical or intuitive,
    but can also be statistical
  • Testing very simple hypotheses as a way of
    getting at more complex ones
  • E.g. test each variable separately against
    response, and look for strong correlations
  • The most important proponent of EDA was the
    great, late statistician John Tukey

22
The beauty and value of exploratory data analysis
  • EDA is a critical step in creating successful
    predictive modeling solutions
  • Expose leakage
  • AVOID PRECONCEPTIONS about
  • What matters
  • What would work
  • Etc.
  • Example: identifying the KDD CUP 08 leakage through
    EDA
  • Graphical display of identifier vs.
    malignant/benign (see case study slide)
  • Could also be discovered via a statistical
    variable-by-variable examination of significant
    correlations with the response
  • Key to finding this: AVOIDING PRECONCEPTIONS
    about the irrelevance of the identifier

23
Elements of EDA for predictive modeling
  • Examine data variable by variable
  • Outliers?
  • Missing data patterns?
  • Examine relationships with response
  • Strong correlations?
  • Unexpected correlations?
  • Compare to other similar datasets/problems
  • Are variable distributions consistent?
  • Are correlations consistent?
  • Stare at raw data, at graphs, at
    correlations/results
  • Unexpected answers to any of these questions may
    change the course of the predictive modeling
    process
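
A minimal EDA sketch along these lines (Python/pandas, assuming a hypothetical data frame df with a numeric response column): screen every variable for missingness, cardinality, and correlation with the response, then eyeball the extremes.

import pandas as pd

def screen_variables(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(target):
        s = df[col]
        corr = s.corr(df[target]) if pd.api.types.is_numeric_dtype(s) else float("nan")
        rows.append({
            "variable": col,
            "missing_frac": s.isna().mean(),
            "n_unique": s.nunique(),
            "corr_with_target": corr,
        })
    # "Too good to be true" correlations near the top are a classic leakage sign
    return (pd.DataFrame(rows)
              .sort_values("corr_with_target", key=lambda c: c.abs(), ascending=False))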

24
Case study 1: Netflix / KDD-Cup 2007
25
October 2006: Announcement of the NETFLIX
Competition
  • USA Today headline:
  • "Netflix offers $1 million prize for better
    movie recommendations"
  • Details
  • Beat NETFLIX's current recommender Cinematch's RMSE
    by 10% prior to 2011
  • $50,000 for the annual progress prize
  • First two awarded to the AT&T team: 9.4% improvement
    as of 10/08 (almost there!)
  • Data contains a subset of 100 million movie
    ratings from NETFLIX including 480,189 users and
    17,770 movies
  • Performance is evaluated on holdout movie-user
    pairs
  • The NETFLIX competition has attracted 50K
    contestants on 40K teams from >150 different
    countries
  • 40K valid submissions from 5K different teams

26
NETFLIX Data
(Diagram: relation of the competition data to the Internet Movie Database and the full NETFLIX user base)
  • All movies (80K in IMDB); 17K are in the competition data, selection unclear
  • All users (6.8M); 480K are included, each with at least 20 ratings by end of 2005
  • NETFLIX competition data: 100M ratings
27
NETFLIX data generation process
(Timeline diagram, 1998 to 2006: users and movies arrive over time; the training data covers 1998-2005 (17K movies), the qualifier dataset holds 3M ratings, and KDD CUP Tasks 1 and 2 concern 2006, with no new user or movie arrivals in the KDD CUP data.)
28
KDD-CUP 2007: based on the NETFLIX data
  • Training: NETFLIX competition data from 1998-2005
  • Test: 2006 ratings, randomly split by movie into
    two tasks
  • Task 1: Who rated what in 2006
  • Given a list of 100,000 pairs of users and
    movies, predict for each pair the probability
    that the user rated the movie in 2006
  • Result: IBM Research team was second runner-up,
    No. 3 out of 39 teams
  • Task 2: Number of ratings per movie in 2006
  • Given a list of 8863 movies, predict the number
    of additional reviews that all existing users
    will give in 2006
  • Result: IBM Research team was the winner, No. 1
    out of 34 teams

29
Test sets from 2006 for Task 1 and Task 2
(Diagram: starting from the marginal 2006 distributions of ratings over users and movies, (movie, user) pairs are sampled according to the product of the marginals; pairs rated prior to 2006 are removed, giving the Task 1 test set (100K pairs); the Task 2 test set (8.8K movies) consists of per-movie rating totals, shown as log(n+1).)
30
Task 1: Did user A review movie B in 2006?
  • A standard classification task to answer question
    whether existing users will review existing
    movies
  • More in line with the synthetic mode of
    competitions than the real mode
  • Challenges
  • Huge amount of data
  • how to sample the data so that any learning
    algorithms can be applied is critical
  • Complex affecting factors
  • Decreasing interest in old movies, and a growing
    tendency of Netflix users to watch (review) more
    movies
  • Key solutions
  • Effective sampling strategies to keep as much
    information as possible
  • Careful feature extraction from multiple sources

31
Task 2: How many reviews in 2006?
  • Task formulation
  • Regression task to predict the total count of
    reviews from existing users for the 8863
    existing movies
  • Evaluation is by RMSE on log scale
  • Challenges
  • Movie dynamics and life-cycle
  • Interest in movies changes over time
  • User dynamics and life-cycle
  • No new users are added to the database
  • Key solutions
  • Use counts from test set of Task 1 to learn a
    model for 2006 adjusting for pair removal
  • Build set of quarterly lagged models to determine
    the overall scalar
  • Use Poisson regression

32
Some data observations
Leakage Alert!
  • Task 1 test set is a potential response for
    training a model for Task 2
  • It was sampled according to the marginal (# reviews
    for movie in '06 / total reviews in '06), which is
    proportional to the Task 2 response (# reviews
    for movie in '06)
  • BIG advantage: we get a view of 2006 behavior for
    half the movies ⇒ build a model on this half, apply
    it to the other half (the Task 2 test set)
  • Caveats
  • Proportional sampling implies there is a scaling
    parameter left, which we don't know
  • Recall that after sampling, (movie, person) pairs
    that appeared before 2006 were dropped from the Task
    1 test set ⇒ correcting for this is an inverse
    rejection sampling problem

33
Test sets from 2006 for Task 1 and Task 2 (revisited)
(Same sampling diagram as before, with an annotation: the Task 1 test set can be used to estimate the 2006 marginal distribution over movies, a surrogate learning problem for Task 2.)
34
Some data observations (ctd.)
  • No new movies and reviewers in 2006
  • Need to emphasize modeling the life-cycle of
    movies (and reviewers)
  • How are older movies reviewed relative to newer
    movies?
  • Does this depend on other features (like the
    movie's genre)?
  • This is especially critical when we consider the
    scaling caveat above

35
Some statistical perspectives
  • Poisson distribution is very appropriate for
    counts
  • Clearly true of overall counts for 2006
  • Assuming any kind of reasonable reviewers arrival
    process
  • The right modeling approach for true counts is
    Poisson regression: n_i ~ Pois(λ_i·Δt),
    log(λ_i) = Σ_j β_j·x_ij,
    β̂ = argmax_β ℓ(n | X, β) (maximum likelihood)
  • What does this imply for the model evaluation
    approach?
  • The variance-stabilizing transformation for Poisson
    is the square root ⇒ √n_i has roughly constant
    variance ⇒ RMSE on the log scale emphasizes
    performance on unpopular movies (small Poisson
    parameter ⇒ larger log-scale variance)
  • We still assumed that if we do well in a
    likelihood formulation, we will do well with any
    evaluation approach

Adapting to evaluation measures!
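
A minimal sketch of the Poisson regression step, assuming a hypothetical movie-level frame with a 2006 count column and a few illustrative features (statsmodels GLM; illustrative, not the exact competition code):

import statsmodels.api as sm

def fit_poisson(movies):
    # movies: one row per movie, with a 2006 review count and movie-level features
    X = sm.add_constant(movies[["log_reviews_2005", "age_in_netflix", "genre_drama"]])
    y = movies["reviews_2006"]
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # maximum likelihood
    return fit   # fit.predict(X_new) gives predictions on the count scale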
36
Some statistical perspectives (ctd.)
  • Can we invert the rejection sampling mechanism?
  • This can be viewed as a missing data problem
  • ni, mj are the counts for movie i and reviewer j
    respectively
  • pi, qj are the true marginals for movie i and
    reviewer j respectively
  • N is the total number of pairs rejected due to
    review prior to 2006
  • Ui, Pj are the users who reviewed movie i prior
    to 2006 and movies reviewed by user j prior to
    2006, respectively
  • Can we design a practical EM algorithm with our
    huge data size? Interesting research problem
  • We implemented an ad-hoc inversion algorithm
    (see the sketch below)
  • Iterate until convergence between (a) assuming
    movie marginals are fixed and adjusting reviewer
    marginals, and (b) assuming reviewer marginals are
    fixed and adjusting movie marginals
  • We verified that it indeed improved our data
    since it increased correlation with 4Q2005 counts
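
A rough sketch of one way such an alternating adjustment could look (numpy). This is an illustrative reading of the description above, not the authors' exact algorithm; rated_before is a 0/1 movie-by-user matrix of pairs rated prior to 2006, i.e., the pairs that would have been rejected.

import numpy as np

def adjust_marginals(n_movie, m_user, rated_before, n_iter=50):
    # n_movie, m_user: observed (post-rejection) counts per movie / per user
    p = n_movie / n_movie.sum()          # movie marginals
    q = m_user / m_user.sum()            # user marginals
    for _ in range(n_iter):
        # prob. that a pair drawn for movie i survives rejection =
        # mass of users who had NOT rated it before 2006
        survive_movie = 1.0 - rated_before @ q
        p = n_movie / np.clip(survive_movie, 1e-12, None)
        p /= p.sum()
        survive_user = 1.0 - rated_before.T @ p
        q = m_user / np.clip(survive_user, 1e-12, None)
        q /= q.sum()
    return p, q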

37
Modeling Approach Schema
(Flow diagram of the modeling approach:)
  • Who-reviewed (Task 1) test set (100K pairs) → inverse rejection sampling → count ratings by movie → estimate Poisson regression M1 and predict on Task 1 movies
  • Construct movie features from the NETFLIX challenge data and from IMDB
  • Use M1 to predict for the Task 2 movies, then scale the predictions to an estimated total
  • In parallel, construct lagged features for Q1-Q4 2005, estimate four Poisson regressions G1-G4, predict for 2006, estimate the 2006 total ratings for the Task 2 test set, and find the optimal scalar
38
Some observations on modeling approach
  • Lagged datasets are meant to simulate forward
    prediction to 2006
  • Select a quarter (e.g., Q1'05), remove all movies/
    reviewers that started later
  • Build a model on this data with, e.g., Q3'05 as the
    response
  • Apply the model to our full dataset, which is
    naturally cropped at Q4'05 ⇒ gives a prediction
    for Q2'06
  • With several models like this, predict all of
    2006
  • Two potential uses
  • Use as our prediction for 2006 but only if
    better than the model built on Task 1 movies!
  • Consider only sum of their predictions to use for
    scaling the Task 1 model
  • We evaluated models on Task 1 test set
  • Used holdout when also building them on this set
  • How can we evaluate the models built on lagged
    datasets?
  • Missing a scaling parameter between the 2006
    prediction and sampled set
  • Solution: select the optimal scaling based on Task 1
    test set performance ⇒ since the other model was
    still better, we knew we should use it!
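
A minimal sketch of constructing one such lagged training set (pandas; the quarterly count columns, labeled like '2004Q3'...'2005Q4', are a hypothetical layout):

import pandas as pd

def lagged_training_set(counts: pd.DataFrame, feature_end: str, response_q: str):
    # counts: one row per movie, one column per quarter of review counts
    hist_cols = [c for c in counts.columns if c <= feature_end]   # e.g. through '2005Q1'
    started = counts[hist_cols].sum(axis=1) > 0                   # drop movies that start later
    X = counts.loc[started, hist_cols]
    y = counts.loc[started, response_q]                           # e.g. '2005Q3'
    return X, y

# Train on (history through 2005Q1 -> 2005Q3); applying the fitted model to
# history through 2005Q4 then yields a forward prediction for 2006Q2.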

39
Some details on our models and submission
  • All models at movie level. Features we used
  • Historical reviews in previous months/quarters/years
    (on log scale)
  • Movie's age since premiere, movie's age in Netflix
    (since first review)
  • Also consider log, square, etc. ⇒ flexibility
    in the form of the functional dependence
  • Movie's genre
  • Include interactions between genre and age ⇒
    life cycle seems to differ by genre!
  • Models we considered (MSE on log-scale on Task 1
    holdout)
  • Poisson regression on Task 1 test set (0.24)
  • Log-scale linear regression model on Task 1 test
    set (0.25)
  • Sum of lagged models built on 2005 quarters +
    best scaling (0.31)
  • Scaling based on lagged models
  • Our estimate of the number of reviews for all movies
    in the Task 1 test set: about 9.5M
  • Implied scaling parameter for the predictions: about
    90
  • Total of our submitted predictions for Task 2
    test set was 9.3M

40
Competition evaluation
  • First we were informed that we won with RMSE of
    770
  • They mistakenly evaluated on non-log scale
  • Strong emphasis on most popular movies
  • We won by a large margin ⇒ our model did well on the
    most popular movies!
  • Then they re-evaluated on log scale, we still won
  • On log scale the least popular movies are
    emphasized
  • Recall that variance stabilizing transformation
    is in-between (square root)
  • So our predictions did well on unpopular movies
    too!
  • Interesting question: would we win on the square-root
    scale (or, similarly, a Poisson likelihood-based
    evaluation)? Sure hope so!

41
Competition evaluation (ctd.)
  • Results of competition (log-scale evaluation)
  • Components of our model's MSE
  • The error of the model for the scaled-down Task 1
    test set (which we estimated at about 0.24)
  • Additional error from the incorrect scaling factor
  • Scaling numbers
  • True total reviews: 8.7M
  • Sum of our predictions: 9.3M
  • Interesting question: what would be the best scaling
  • For log-scale evaluation? Conjecture: need to
    under-estimate the true total
  • For square-root evaluation? Conjecture: need to
    estimate it about right

42
Effect of scaling on the two evaluation approaches
43
Effect of scaling on the two evaluation approaches
44
KDD CUP 2007 Summary
  • Keys to our success
  • Identify subtle leakage
  • Is it formally leakage? Depends on intentions of
    organizers
  • Appropriate statistical approach
  • Poisson regression
  • Inverting rejection sampling in leakage
  • Careful handling of time-series aspects
  • Not keys to our success
  • Fancy machine learning algorithms

45
Case Study 2: KDD CUP 2008 (Siemens Medical) Breast Cancer Identification
(Data diagram: 1712 patients, 6816 images (MLO and CC views), about 105,000 candidate locations, a small fraction of them malignant; each candidate is a feature vector x1, x2, ..., x117 plus a class label.)
46
KDD-CUP 2008: based on mammography data
  • Training: labeled candidates from 1300 patients,
    plus the association of each candidate to a location,
    image, and patient
  • Test: candidates from a separate set of 1300
    patients
  • Task 1
  • Rank all candidates by the likelihood of being
    cancerous
  • Results: IBM Research team was the winner out of
    246
  • Task 2
  • Identify a list of healthy patients
  • Results: IBM Research team was the winner out of
    205

47
Task 1: Candidate Likelihood of Cancer
  • Almost a standard probability estimation/ranking
    task on the candidate level
  • Somewhat synthetic, as the meaning of the features
    is unknown
  • Challenges
  • Low positive rate: 7% of patients and 0.6% of
    candidates
  • Beware of overfitting
  • Sampling
  • Unfamiliar evaluation measure
  • FROC, related to AUC
  • Non-robust
  • Hint at locality
  • Key solutions
  • Simple linear model
  • Post-processing of scores
  • Leakage in identifiers

Adapting to evaluation measures!
48
Task 2: Classify patients
  • Derivative of Task 1
  • A patient is healthy if all her candidates are
    benign
  • The probability that a patient is healthy is the
    product of her candidates' probabilities of being
    benign
  • Challenges
  • Extremely non-robust performance measure
  • Including any patient with cancer in the list
    disqualified the entry
  • Risk tradeoff: need to anticipate the solutions
    of the other participants
  • Key solution
  • Pick a model with high sensitivity to false
    negatives
  • Leakage in identifiers: EDA at work

49
EDA on the Breast Cancer Domain
Console output of sorted patient_ID, patient_label:
  • 144484 1
  • 148717 0
  • 168975 0
  • 169638 1
  • 171985 0
  • 177389 1
  • 182498 0
  • 185266 0
  • 193561 1
  • 194771 0
  • 198716 1
  • 199814 1
  • 1030694 0
  • 1123030 0
  • 1171864 0
  • 1175742 0
  • 1177150 0
  • 1194527 0
  • 1232036 0

Base rate of 7%???
What about 200K to 999K?
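
A minimal sketch of the EDA step that surfaces this (pandas). The bucket edges below are only illustrative; the actual groups were read off the sorted IDs.

import pandas as pd

def id_bucket_rates(labels: pd.DataFrame) -> pd.DataFrame:
    # labels: columns 'patient_id', 'patient_label' (1 = cancer)
    edges = [0, 2e5, 1e6, labels["patient_id"].max() + 1]   # illustrative cut points
    buckets = pd.cut(labels["patient_id"], bins=edges)
    # very different base rates per bucket are the leakage signal
    return labels.groupby(buckets)["patient_label"].agg(["mean", "size"])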
50
Mystery of the Data Generation: Identifier
Leakage in the Breast Cancer Data
Leakage
  • The distribution of patient identifiers shows a
    strong natural grouping
  • 3 natural buckets
  • The three groups have VERY different base rates of
    cancer prevalence
  • The last group seems to be sorted (cancer first)
  • In total, 4 groups with very different patient
    probabilities of cancer
  • The organizers admitted to having combined data from
    different years in order to increase the positive
    rate

51
Building the classification model
  • For evaluation we created a stratified 50%
    training/test split by patient
  • Given the few positives (300), results may exhibit
    high variance
  • We explored the use of various learning
    algorithms including neural networks, logistic
    regression, and various SVMs
  • Linear models (logistic regression or linear
    SVMs) yielded the most promising results
  • FROC: 0.0834
  • Down-sampling the negative class?
  • Keep only 25% of all healthy patients
  • Helped in some cases, not a really reliable
    improvement
  • Add the identifier category (1, 2, 3, 4) as an
    additional feature
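
A minimal sketch of this modeling setup (scikit-learn), with hypothetical column names; the down-sampling keeps a fraction of the healthy patients together with all their candidates:

import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_candidate_model(cands: pd.DataFrame, keep_healthy_frac: float = 0.25):
    # cands: one row per candidate, columns x1..x117, 'id_bucket', 'patient_id', 'label'
    patient_sick = cands.groupby("patient_id")["label"].transform("max").eq(1)
    healthy_ids = cands.loc[~patient_sick, "patient_id"].drop_duplicates()
    kept_ids = healthy_ids.sample(frac=keep_healthy_frac, random_state=0)
    df = cands[patient_sick | cands["patient_id"].isin(kept_ids)]
    features = [f"x{i}" for i in range(1, 118)] + ["id_bucket"]
    model = LogisticRegression(max_iter=1000)
    model.fit(df[features], df["label"])
    return model   # model.predict_proba(test[features])[:, 1] ranks candidates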

52
Modeling Neighborhood Dependence
Relational Data
  • Candidates are not really iid but actually
    relational
  • Stacking
  • Build initial model and score all candidates
  • Use labels of neighbors in second round
  • Formulate as EM problem
  • Treat the labels of the neighbors as unobserved
    in EM
  • Pair-wise constraints based on location adjacency
  • Calculate the Euclidean distance from the
    candidates within the same picture and distance
    to the nipple in both views for each breast
  • Select the pairs of candidates with distance
    difference less than a threshold
  • Constraint: selected pairs of examples (x_i,MLO,
    x_i,CC) should have the same predicted labels,
    i.e., f(x_i,MLO) = f(x_i,CC)
  • Results
  • Seems to improve the probability estimate in
    terms of AUC
  • Did not improve FROC

53
Outlier Treatment
Statistics
  • Many of the 117 numeric features have large
    outliers
  • Incur a huge penalty in terms of likelihood
  • Large bias
  • Badly calibrated probabilities
  • Extreme (wrong) values in the prediction

(Histogram of Feature 10: 142 observations > 5)
54
ROC vs. FROC optimization: post-processing of
model scores?
Adapting to evaluation
  • In ROC, all rows are independent and both true
    positives and false positives are counted by row;
    FROC counts true positive patients and false
    positive candidates
  • A higher TP rate for candidates does not improve
    FROC unless it comes from a new patient, e.g.:
  • It's better to have 2 correctly identified
    candidates from different patients than 5 from the
    same patient
  • It's best to re-order candidates based on model
    scores so as to ensure many different patients up
    front

55
Theory of Post-processing
Adapting to evaluation
  • Probabilistic approach
  • At any point we want to maximize the expected
    gradient of the FROC at that point
  • Define, for each candidate c of patient i:
  • p_c = probability that candidate c is malignant
  • np_i = probability that patient i has not yet been
    identified
  • 3 cases
  • Candidate is positive but the patient is already
    identified: probability p_c·(1 - np_i)
  • Candidate is positive and the patient is new:
    probability p_c·np_i
  • Candidate is negative: probability 1 - p_c
  • Pick the candidate with the largest expected gain
    p_c·np_i / (1 - p_c)
  • Theorem
  • The expected value of the FROC for this ordering is
    higher than for any other order
  • Problem
  • Our probability estimates are not good enough
    for this to work well

56
Bad Calibration!
Statistics
Calibration Plot
  • We consistently over-predict the probability of
    cancer for the most likely candidates
  • Linear Bias of the method
  • High class-skew
  • Outliers in the 117 numeric features lead to
    extreme predictions on the holdout
  • Re-calibration?
  • We tried a number of methods
  • No improvement
  • Some resulted in better calibration but hurt the
    ranking

57
Post-Processing Heuristic
Adapting to evaluation
  • Ad-hoc approach
  • Take the top n ranked candidates, where n is
    approximately the number of positive candidates
  • Select the candidate with the highest score for
    each patient from this list and put them at the
    top of the list
  • Iterate until all top n candidates are re-ordered


(FROC plot: true positive patient rate vs. false positive rate per image)
Re-ordering model scores significantly improves
the FROC with no additional modeling
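
A sketch of this re-ordering heuristic (pandas), assuming a scored candidate table with hypothetical columns:

import pandas as pd

def reorder_for_froc(scored: pd.DataFrame, n: int) -> pd.DataFrame:
    # scored: columns 'patient_id', 'score' (higher = more suspicious), one row per candidate
    ranked = scored.sort_values("score", ascending=False)
    top, rest = ranked.head(n), ranked.iloc[n:]
    passes = []
    while not top.empty:
        # each pass: the best remaining candidate of every patient, in score order
        best = top.drop_duplicates("patient_id")
        passes.append(best)
        top = top.drop(best.index)
    return pd.concat(passes + [rest])   # many distinct patients now appear up front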
58
Submissions and Results
  • Task 1
  • Bagged linear SVMs with hinge loss and heuristic
    post processing
  • This approach scored the winning result of 0.0933
    on FROC, out of 246 submissions from 110 unique
    participants
  • Second place scored 0.0895
  • Some rumor that other participants also found the
    ID leakage
  • Task 2
  • Logistic model performs better than the SVM
    models probably because likelihood is more
    sensitive to extreme errors (the first false
    negative)
  • The first false negative occurs typically around
    1100 patients in the training set
  • We submitted the first 1020 patients ranked by a
    logistic model that included the ID feature and the
    original 117 features
  • Scored a specificity of 0.682 on the test set
    with no false negatives
  • Only 24 out of 203 submissions had no false
    negatives
  • Second place scored 0.17 specificity

59
Summary in terms of success factors
  • Leakage in the identifier provides information
    about the likelihood of a patient to have cancer
  • Caused by the organizers' effort to increase the
    positive rate by adding old patients that
    developed cancer
  • Post-processing for FROC optimization
  • Awareness of impact of feature outliers
  • Interacts with the statistical properties of the
    data and the model
  • Log-likelihood more sensitive than hinge loss
  • Otherwise simple model to avoid overfitting
  • Linear models
  • Relational modeling was not helpful for the given evaluation

60
KDD CUP 2009
  • Data: customer database of Orange with 100K
    observations and 15K variables
  • Three different tasks and 2.5 versions
  • Prediction: churn, appetency, upselling
  • Versions: fast (5 days) and slow (1 month)
  • Large and small versions
  • Interesting characteristics
  • Highly sterile, nothing known about anything
  • Leaderboard
  • It was possible to match the large and small versions
    and receive feedback on 20% of the test set

61
KDD CUP 2000
  • Data: online store history for Gazelle.com
  • Five different tasks, including
  • Prediction: who will continue in the session? Who
    will buy?
  • Insights: characterize heavy spenders
  • Interesting characteristics
  • Leakage: internal testing sessions were left in the
    data
  • Deterministic behavior
  • If identified, gives 100% accuracy in prediction
    for part of the data
  • Evaluation in terms of real business
    objectives?
  • Sort of handled by defining a set of standard
    questions, each covering a different aspect of the
    business objective
  • Relational data?
  • Yes, customers had different numbers of sessions, of
    different lengths, with different stages

62
KDD CUP 2003
  • Data: citation rates of physics papers
  • Two tasks
  • Predict the change in number of citations during the
    next 3 months
  • Write an interesting paper about it
  • Interesting characteristics
  • Highly relational, links between papers and
    authors
  • Feature construction up to participants
  • Leakage: impossible, since the truth was really in
    the future
  • Evaluation on SSE against integer values (Poisson)

63
ILP Challenge 2003
  • Data: yeast genome, including protein sequence,
    alignment similarity scores with other proteins, and
    additional protein information from a relational DB
  • Task: identify (potentially multiple) functional
    classes for each gene
  • Interesting characteristics
  • 420 possible classes, very subjective assignment
  • Purely relational, no features available
  • Distances (supposedly p-values) of gene alignment
  • Secondary structure (protein of amino acids)
  • Protein DB with keywords, etc
  • Leakage: the identifier contains a letter of
    the labeling research group
  • Highly unsatisfactory evaluation: precision of
    the prediction

64
INFORMS data mining contest 2008
  • Data: 2 years of hospital records with accounting
    information (cost, reimbursement, ...), patient
    demographics, and medication history
  • Tasks
  • Identify pneumonia patients
  • Design optimization setting for preventive
    treatment
  • Interesting characteristics
  • Relational setting (4 tables linked through a
    patient identifier)
  • Leakage: removal of the pneumonia code left
    hidden traces
  • Dirty data with plenty of missing values, contradictory
    demographics, and changing patient IDs

65
Data Mining in the Wild: Project Work
  • Similarities with competitions (compared to DM
    research)
  • Single dataset
  • Algorithms can be existing and simple
  • No real need for baselines (although useful)
  • The absolute performance matters
  • Differences to competitions
  • You need to decide what the analytical problem is
  • You need to define the evaluation rather than
    optimize it
  • You need to avoid leakage rather than use it
  • You need to FIND all relevant data rather than
    use what is there (often leads to relational
    settings)
  • You need to deliver it somehow to have impact

66
Case Study 3: Market Alignment Program
  • Wallet
  • The total amount of money that a customer can
    spend in a certain product category in a given
    period
  • Why Are We Interested in Wallet?
  • Customer targeting
  • Focus on acquiring customers with high wallet
  • For existing customers, focus on high wallet, low
    share-of-wallet customers
  • Sales force management
  • Use wallet as a sales force allocation target and
    make resource assignment decisions based on
    wallet
  • Evaluate the success of sales personnel by
    attained share-of-wallet

67
Wallet Modeling Challenge
  • The customer wallet is never observed
  • Nothing to fit a model to
  • Even if you have a model, how do you evaluate it?
  • We would like a predictive approach from
    available data
  • Firmographics (Sales, Industry, Employees)
  • IBM Sales and transaction history

68
Define Wallet/Opportunity?
  • TOTAL: total customer available budget for IT
  • Can we really hope to attain all of it?
  • SERVED: total customer spending on IT products
    offered by IBM
  • A better definition for our marketing purposes
  • REALISTIC: IBM spending of the best similar
    customers

IBM Sales ≤ REALISTIC ≤ SERVED ≤ TOTAL ≤ Company Revenue

69
REALISTIC Wallets as quantiles
  • Motivation
  • Imagine 100 identical firms with identical IT
    needs
  • Consider the distribution of the IBM sales to
    these firms
  • The bottom 95% of firms should spend as much as the
    top 5%
  • Define REALISTIC wallet as high percentile of
    spending conditional on the customer attributes
  • Implies that a few customers are spending full
    wallet with us
  • however, we do not know which ones

70
Formally: Percentile of a Conditional Distribution
  • Distribution of IBM sales s to the customer given
    customer attributes x: s | x ~ f_θ,x
  • Two obvious ways to get at the pth percentile
  • Estimate the conditional by integrating over a
    neighborhood of similar customers ⇒ take the pth
    percentile of spending in the neighborhood
  • Create a global model for the pth percentile ⇒ build
    global regression models, e.g.,
  • s_x ~ Exp(a·x^β)

(Figure: the REALISTIC wallet marked as a high percentile of the conditional spending distribution)
71
Estimation: the Quantile Loss Function
  • The mean minimizes a sum of squared residuals
  • The median minimizes a sum of absolute residuals.
  • The p-th quantile minimizes an asymmetrically
    weighted sum of absolute residuals
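
In symbols, the standard quantile ("check") loss referred to here can be written as follows (a standard formulation, stated for completeness rather than reproduced from the original slide):

\[
  L_p(y,\hat q) =
  \begin{cases}
    p\,(y - \hat q), & y \ge \hat q,\\
    (1-p)\,(\hat q - y), & y < \hat q,
  \end{cases}
  \qquad
  \hat q_p(x) = \arg\min_{\hat q}\; \mathbb{E}\big[\,L_p(Y,\hat q)\mid x\,\big].
\]

With p = 0.5 this is proportional to the absolute loss, matching the median case above; the MAP work uses high quantiles such as p = 0.8 or 0.9 (see the following slides).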

72
Overview of analytical approaches
(Diagram: overview of analytical approaches, spanning ad hoc to optimization-based methods; the boxes include kNN by industry and size, generalized kNN (choice of k, distance, features), decomposition (linear model + adjustment), and quantile regression with different model forms: linear, decision tree, quanting)
73
Data Generation Process
  • Need to combine data on revenue with customers'
    properties
  • Complicated matching process between the IBM-internal
    customer view (accounts) and the external sources
    (Dun & Bradstreet)
  • Probabilistic process with plenty of heuristics
  • Huge danger of introducing data bias
  • Tradeoff in data quality and coverage
  • Leakage potential
  • We can only get current customer information
  • This information might be tainted by the
    customer's interaction with IBM
  • Problem gets amplified when we try to augment the
    data with home-page information

74
Evaluating Measures for Wallet
  • We still don't know the truth
  • Combined approach
  • Quantile loss to assess only the relevant
    predictive ability and feature selection
  • Expert Feedback to select suitable model class
  • Business Impact to identify overall effectiveness

75
Empirical Evaluation I: Quantile Loss
  • Setup
  • Four domains with relevant quantile modeling
    problems: direct mailing, housing prices, income
    data, IBM sales
  • Performance on the test set in terms of the 0.9
    quantile loss
  • Approaches
  • Linear quantile regression
  • Q-kNN (kNN with quantile prediction from the
    neighbors)
  • Quantile trees (quantile prediction in the leaf)
  • Bagged quantile trees
  • Quanting (Langford et al. 2006; reduces
    quantile estimation to averaged classification
    using trees)
  • Baselines
  • Best constant model
  • Traditional regression models for expected
    values, adjusted under a Gaussian assumption
    (+1.28σ)
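
A minimal sketch of the linear quantile regression baseline in Python (statsmodels; the formula and column names are hypothetical):

import statsmodels.formula.api as smf

def fit_quantile_model(df, q=0.9):
    # e.g. model the 0.9 quantile of IBM sales given firmographics
    model = smf.quantreg("ibm_sales ~ log_revenue + employees + C(industry)", df)
    return model.fit(q=q)   # fitted.predict(new_df) gives the quantile prediction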

76
Performance on Quantile Loss (smaller is better)
  • Conclusions
  • Standard regression is not competitive (because
    the residuals are not normal)
  • If there is a time-lagged variable, linear
    quantile model is best
  • Splitting criterion is irrelevant in the tree
    models
  • Quanting (using decision trees) and quantile tree
    perform comparably
  • Generalized kNN is not competitive

77
Evaluation II: MAP Workshops Overview
  • Calculated 2005 opportunity using naive Q-kNN
    approach
  • 2005 MAP workshops
  • Displayed opportunity by brand
  • Expert can accept or alter the opportunity
  • Select 3 brands for evaluation: DB2, Rational,
    Tivoli
  • Build 100 models for each brand using different
    approaches
  • Compare expert opportunity to model predictions
  • Error measures: absolute, squared
  • Scale: original, log, root
  • Total of 6 measures

78
Expert Feedback to Original Model
Experts accept the opportunity: 45%
Experts change the opportunity: 40% (increase: 17%, decrease: 23%)
Experts reduce the opportunity to 0: 15%
79
Observations
  • Many accounts are set to zero for external
    reasons
  • Exclude from evaluation since no model can
    predict the competitive environment
  • Exponential distribution of opportunities
  • Evaluation on the original (non-log) scale is
    subject to large outliers
  • Experts seem to make percentage adjustments
  • Consider log scale evaluation in addition to
    original scale and root as intermediate
  • Suspect strong anchoring bias: 45% of the
    opportunities were not touched

80
Model Comparison Results
We count how often a model scores within the top
10 and 20 for each of the 6 measures
(Chart: models annotated as "Anchoring" and "Best")
81
MAP Experiments Conclusions
  • Q-kNN performs very well after flooring but is
    typically inferior prior to flooring
  • 80th-percentile linear quantile regression
    performs consistently well (flooring has a minor
    effect)
  • Experts are strongly influenced by the displayed
    opportunity (and the displayed revenue of previous
    years)
  • Models without last year's revenue don't perform
    well
  • Use linear quantile regression with q = 0.8 in MAP
    '06

82
MAP Business Impact
  • MAP launched in 2005
  • In 2006, 420 workshops were held worldwide, with teams
    responsible for most of IBM's revenue
  • MAP recognized as a 2006 IBM Research
    Accomplishment
  • Awarded based on proven business impact
  • Runner-up for the Case Study Award at KDD 2007
  • Edelman finalist 2009
  • Most important use is segmentation of customer
    base
  • Shift resources into invest segments with low
    wallet share

83
Business Impact
  • For 2006, 270 resource shifts were made to 268
    Invest Accounts
  • We examine the performance of these accounts
    relative to background
  • REVENUE
  • 9% growth in INVEST accounts
  • 4% growth in all other accounts
  • PIPELINE (relative to 2005)
  • 17% growth in INVEST accounts
  • 3% growth in all other accounts

(Scatter plot: validated revenue opportunity ($M) vs. 2005 actual revenue ($M), highlighting the 270 shifts)
  • QUOTA ATTAINMENT
  • 45% for MAP-shifted resources
  • 36% for non-MAP shifts

84
Summary in terms of success factors
  • 1 Data and Domain understanding
  • Match of business objective to modeling approach
    made a previously unsolvable business problem
    solvable with predictive modeling
  • 2 Statistical insight
  • Minimizing quantile-loss estimates the correct
    quantity
  • A single evaluation metric is not enough in real
    life
  • Autocorrelation helps the linear model
  • 3 Modeling
  • Extension to tree induction
  • Comparative study
  • In the end linear it is

85
Identify Potential Causes for Chip Failure
  • Data: 5K machines, of which 18 failed in the last
    year
  • Task: can you identify a (short) list of other
    machines that are likely to fail, so they can be
    preemptively fixed?
  • Characteristics
  • Relational: tool ID, multiple chips per machine
    (only the first failure is detected)
  • Leakage: the database was clearly augmented after
    failures; all failures have a customer associated,
    but the customer is missing for most non-failures
  • Statistical observation: this is really a
    survival analysis problem; failure does not
    occur prior to a runtime of 180 days
  • Accuracy and even AUC are NOT relevant
  • Insight: cause of failure
  • Lift and false positive rate in the top k is more
    important

86
Threats in Competitions and Projects
  • Competitions
  • Mistakes under time pressure
  • Accidental use of the target (kernel SVM)
  • Complexity
  • Overfitting
  • Projects
  • Unavailability of data
  • Data generation problems
  • The model is not good enough to be useful
  • Model results are not accessible to the user
  • If the user has to understand the model you need
    to keep it simple
  • Web delivery of predictions

87
Overfitting
  • Even if you think that you know this one,
    you probably still overdo it!
  • KDD CUP results have shown that a large number of
    entries overfit
  • In 2003, 90% of entries did worse than the best
    constant prediction
  • Corollary: Don't overdo it on the search
  • Having a holdout does NOT make you immune to
    overfitting;
    you just overfit on the holdout
  • 10-fold cross-validation does NOT make you immune
    either
  • Leaderboards on 10% of the test set are VERY deceptive
  • KDD CUP 2009 The winner of the fast challenge
    after only 5 days was indeed the leader of the
    board
  • The winner of the slow challenge after 1 more
    month was NOT the leader of the board

88
Overfitting Example KDD CUP 2008
  • Data
  • 105,000 candidates
  • 117 numeric features
  • Sounds good right?
  • Overfitting is NOT just about the training size
    and model complexity!
  • Linear models overfit too!
  • How robust is the evaluation measure?
  • AUC ?
  • FROC ?
  • Number of healthy patients ???
  • What is the base rate?
  • 600 positives ?

89
Factors of Success in Competitions and Real Life
  • 1. Data and domain understanding
  • Generation of data and task
  • Cleaning and representation/transformation
  • 2. Statistical insights
  • Statistical properties
  • Test validity of assumptions
  • Performance measure
  • 3. Modeling and learning approach
  • Most publishable part
  • Choice or development of most suitable algorithm

(Slide annotation: spectrum from "Real" to "Sterile" competitions)
90
Success Factor 1: Data and Domain Understanding
  • Task and data generation
  • Formulate analytical problem (MAP)
  • EDA
  • Check for Leakage

91
Success Factor 2: Statistical Insights
  • Properties of Evaluation Measures
  • Does it measure what you care about?
  • Robustness
  • Invariance to transformations
  • Linkage between model optimization, statistics, and
    performance

92
Success Factor 3: Models and Approach
  • How much complexity do you need?
  • Often linear does just fine with correctly
    constructed features (actually, my wins have
    been with linear models)
  • Feature selection
  • Can you optimize what you want to optimize?
  • How does the model relate to your evaluation
    metrics
  • Regression approaches predict conditional mean
  • Accuracy vs AUC vs Log likelihood
  • Does it scale to your problem?
  • Some cool methods just do not run on 100K

93
Summary comparison of case studies
94
Invitation: Please join us in another data mining
competition!
  • INFORMS Data Mining contest on health care data
  • Register at www.informsdmcontest2009.org
  • Real data of hospital visits for patients with
    severe heart disease
  • Real tasks for ongoing project
  • Transfer to specialized hospitals
  • Severity / death
  • Relational (multiple hospital stays per patient)
  • Evaluation
  • AUC
  • Publication and workshop at INFORMS 2009