KDD Cup 2009 — Transcript and Presenter's Notes
1
KDD Cup 2009
  • Fast Scoring on a Large Database
  • Presentation of the Results at the KDD Cup
    Workshop
  • June 28, 2009
  • The Organizing Team

2
KDD Cup 2009 Organizing Team
  • Project team at Orange Labs R&D
  • Vincent Lemaire
  • Marc Boullé
  • Fabrice Clérot
  • Raphaël Féraud
  • Aurélie Le Cam
  • Pascal Gouzien
  • Beta testing and proceedings editor
  • Gideon Dror
  • Web site design
  • Olivier Guyon (MisterP.net, France)
  • Coordination (KDD cup co-chairs)
  • Isabelle Guyon
  • David Vogel

3
Thanks to our sponsors
  • Orange
  • ACM SIGKDD
  • Pascal
  • Unipen
  • Google
  • Health Discovery Corp
  • Clopinet
  • Data Mining Solutions
  • MPS

4
Record KDD Cup Participation
Year Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
5
Participation Statistics
  • 1299 registered teams
  • 7865 entries
  • 46 countries

Argentina Germany Malaysia South Korea
Australia Greece Mexico Spain
Austria Hong Kong Netherlands Sweden
Belgium Hungary New Zealand Switzerland
Brazil India Pakistan Taiwan
Bulgaria Iran Portugal Turkey
Canada Ireland Romania Uganda
Chile Israel Russian Federation United Kingdom
China Italy Singapore Uruguay
Fiji Japan Slovak Republic United States
Finland Jordan Slovenia
France Latvia South Africa
6
A worldwide operator
  • One of the main telecommunication operators in
    the world
  • Providing services to more than 170 million
    customers over five continents
  • Including 120 million under the Orange brand

7
KDD Cup 2009 organized by Orange: Customer
Relationship Management (CRM)
  • Three marketing tasks: predict the propensity of
    customers
  • to switch provider (churn)
  • to buy new products or services (appetency)
  • to buy upgrades or new options proposed to them
    (up-selling)
  • Objective: improve the return on investment
    (ROI) of marketing campaigns
  • Increase the efficiency of the campaign for a
    given campaign cost
  • Decrease the campaign cost for a given marketing
    objective
  • Better prediction leads to better ROI (a sketch
    of the three tasks follows this list)
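As an illustration, here is a minimal Python sketch, not the Orange pipeline: the feature matrix, targets and random-forest choice are all assumptions. It treats churn, appetency and up-selling as three independent binary classification problems over the same customer features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # stand-in for customer features
targets = {
    "churn":      rng.integers(0, 2, 1000),  # switch provider?
    "appetency":  rng.integers(0, 2, 1000),  # buy new products or services?
    "up_selling": rng.integers(0, 2, 1000),  # buy upgrades or new options?
}

models = {}
for task, y in targets.items():              # one independent model per task
    models[task] = RandomForestClassifier(n_estimators=100,
                                          random_state=0).fit(X, y)

Each model outputs a propensity score per customer, which is what the campaign targeting described above consumes.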

8
Data, constraints and requirements
  • Train and deploy requirements
  • About one hundred models per month
  • Fast data preparation and modeling
  • Fast deployment
  • Model requirements
  • Robust
  • Accurate
  • Understandable
  • Business requirement
  • Return on investment for the whole process
  • Input data
  • Relational databases
  • Numerical or categorical
  • Noisy
  • Missing values
  • Heavily unbalanced distribution
  • Train data
  • Hundreds of thousands of instances
  • Tens of thousands of variables
  • Deployment
  • Tens of millions of instances

9
In-house system: from raw data to scoring models
  • Data warehouse
  • Relational database
  • Data mart
  • Star schema
  • Feature construction
  • PAC technology
  • Generates tens of thousands of variables
  • Data preparation and modeling
  • Khiops technology

[Diagram: data feeding → PAC (feature construction) → Khiops (data
preparation and modeling)]
10
Design of the challenge
  • Orange business objective
  • Benchmark the in-house system against
    state-of-the-art techniques
  • Data
  • Data store
  • Not an option
  • Data warehouse
  • Confidentiality and scalability issues
  • Relational data requires domain knowledge and
    specialized skills
  • Tabular format
  • Standard format for the data mining community
  • Domain knowledge incorporated using feature
    construction (PAC)
  • Easy anonymization
  • Tasks
  • Three representative marketing tasks
  • Requirements
  • Fast data preparation and modeling (fully
    automatic)

11
Data sets extraction and preparation
  • Input data
  • 10 relational tables
  • A few hundred fields
  • One million customers
  • Instance selection
  • Resampling given the three marketing tasks
  • Keep 100,000 instances, with less unbalanced
    target distributions
  • Variable construction
  • Using PAC technology
  • 20,000 constructed variables to get a tabular
    representation
  • Keep 15,000 variables (discard constant
    variables)
  • Small track: subset of 230 variables related to
    classical domain knowledge
  • Anonymization (a sketch follows this list)
  • Discard variable names, discard identifiers
  • Randomize order of variables
  • Rescale each numerical variable by a random
    factor
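A minimal sketch of the anonymization steps listed above, assuming the constructed variables sit in a pandas DataFrame; the column names and data are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"customer_id": [101, 102, 103],
                   "age": [25.0, 40.0, 31.0],
                   "plan": ["A", "B", "A"]})

df = df.drop(columns=["customer_id"])                     # discard identifiers
df.columns = [f"Var{i + 1}" for i in range(df.shape[1])]  # discard variable names
df = df[rng.permutation(df.columns)]                      # randomize variable order
for col in df.select_dtypes(include="number"):
    df[col] = df[col] * rng.uniform(0.5, 2.0)             # rescale by a random factor

Rescaling by a positive factor is invertible, so no predictive information is lost while the original units stay hidden.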

12
Scientific and technical challenge
  • Scientific objective
  • Fast data preparation and modeling within five
    days
  • Large scale: 50,000 train and test instances,
    15,000 variables
  • Heterogeneous data (a sketch follows this list)
  • Numerical with missing values
  • Categorical with hundreds of values
  • Heavily unbalanced distribution
  • KDD social meeting objective
  • Attract as many participants as possible
  • Additional small track and slow track
  • Online feedback on validation dataset
  • Toy problem (only one informative input variable)
  • Low challenge protocol overhead
  • One month to explore descriptive data and test
    the submission protocol
  • Attractive conditions
  • No intellectual property conditions
  • Money prizes
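A minimal sketch of coping with the two heterogeneous-data traits above, numerical variables with missing values and categorical variables with hundreds of modalities; the data and thresholds are purely illustrative.

import pandas as pd

df = pd.DataFrame({"num": [1.0, None, 3.0, None],
                   "cat": ["x", "y", "x", "rare"]})

# Numerical: impute the median and keep a missingness indicator.
for col in df.select_dtypes(include="number").columns:
    df[col + "_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())

# Categorical: pool infrequent modalities into one "OTHER" level.
for col in df.select_dtypes(include="object").columns:
    counts = df[col].value_counts()
    rare = counts[counts < 2].index
    df[col] = df[col].where(~df[col].isin(rare), "OTHER")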

13
Business impact of the challenge
  • Bring Orange datasets to the data mining
    community
  • Benefit for the community
  • Access to challenging data
  • Benefit for Orange
  • Benchmark of numerous competing techniques
  • Drive research efforts towards Orange needs
  • Evaluate the Orange in-house system
  • High number of participants and high quality of
    results
  • Orange in-house results
  • Improved by a significant margin when leveraging
    all business requirements
  • Almost Pareto optimal when other criteria are
    considered
    (automation, very fast train and deploy,
    robustness and understandability)
  • Need to study the best challenge methods to get
    more insights

14
KDD Cup 2009 Result Analysis
[Figure: test AUC over the challenge period, showing the best result
in the period considered, the in-house system (downloadable at
www.khiops.com), and the baseline (Naïve Bayes); an AUC computation
sketch follows]
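The evaluation metric throughout is the test AUC. Here is a minimal sketch on synthetic data, with GaussianNB standing in for the Naïve Bayes baseline named in the figure.

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)  # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
baseline = GaussianNB().fit(X_tr, y_tr)
scores = baseline.predict_proba(X_te)[:, 1]            # class-1 probabilities
print("Test AUC:", roc_auc_score(y_te, scores))        # 0.5 = random ranking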
15
Overall Test AUC (Fast Track)
[Figure: best results on each dataset vs. all submissions; good
results were obtained very quickly]
16
Overall Test AUC (Fast Track)
[Figure: best results on each dataset vs. all submissions; good
results were obtained very quickly]
  • In-house (Orange) system
  • No parameters
  • On one standard laptop (single processor)
  • If handled as 3 different problems

17
Overall Test AUC (Fast Track)
Very fast good results
[Figure: small improvement after the first day (83.85 → 84.93)]
18
Overall Test AUC (Slow Track)
[Figure: very small improvement after the 5th day (84.93 → 85.2).
Improvement due to unscrambling?]
19
Overall Test AUC Submissions
23.24 of the submissions (gt0.5)lt Baseline
? 15.25 of the submissions (gt0.5)gt In House
? 84.75 of the submissions (gt0.5)lt In House
20
Overall Test AUC: 'Correlation' Test / Valid
[Figure: scatter of test AUC vs. validation AUC]
21
Overall Test AUC: 'Correlation' Test / Train
[Figure: scatter of test AUC vs. train AUC, with annotated regions:
random values submitted; boosting method or train target submitted;
overfitting]
22
Overall Test AUC
[Figure: four panels showing test AUC after 12 hours, 24 hours,
5 days and 36 days]
23
Overall Test AUC
  • Δ = 1.35: difference between the best result at
    the end of the first day and the best result at
    the end of the 36 days
  • What did the extra time buy?
  • time to adjust model parameters?
  • time to train ensemble methods?
  • time to find more processors?
  • time to test more methods?
  • time to unscramble?
[Figure: test AUC after 12 hours vs. after 36 days]
24
Test AUC = f(time)
[Figure: three panels showing churn, appetency and up-selling test
AUC over days 0–36. Which task is easier, which is harder?]
25
Test AUC = f(time)
[Figure: three panels showing churn, appetency and up-selling test
AUC over days 0–36. Which task is easier, which is harder?]
  • Δ, the difference between the best result at the
    end of the first day and the best result at the
    end of the 36 days:
  • Churn: 1.84
  • Appetency: 1.38
  • Up-selling: 0.11
26
Correlation: Test AUC / Valid AUC (5 days)
[Figure: churn, appetency and up-selling test AUC vs. validation AUC
over days 0–5. Which task is easier, which is harder?]
27
Correlation: Train AUC / Valid AUC (36 days)
[Figure: churn, appetency and up-selling test AUC vs. train AUC over
days 0–36]
Difficult to draw any conclusion from this
28
Histogram: Test AUC / Valid AUC (days 0–5 vs. days 5–36)
[Figure: churn, appetency and up-selling test AUC histograms, days
0–36]
Does knowledge (parameters?) found during the first 5 days help
afterwards?
29
Histogram: Test AUC / Valid AUC (days 0–5 vs. days 5–36)
[Figure: churn, appetency and up-selling test AUC histograms, days
0–36 and days 5–36]
Does knowledge (parameters?) found during the first 5 days help
afterwards? YES!
30
Fact Sheets: Preprocessing and Feature Selection
PREPROCESSING (overall usage: 95%)
  • Replacement of the missing values
  • Discretization
  • Normalizations
  • Grouping modalities
  • Other preprocessing
  • Principal Component Analysis
[Bar chart: percent of participants per preprocessing technique]
FEATURE SELECTION (overall usage: 85%)
  • Feature ranking
  • Filter method
  • Other FS
  • Forward / backward wrapper
  • Embedded method
  • Wrapper with search
[Bar chart: percent of participants per feature selection technique;
a filter-style ranking is sketched below]
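A minimal sketch of the dominant pattern above, filter-style feature ranking, using univariate mutual information; the scoring function is an assumption, since participants used many different filters.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = (X[:, 3] > 0).astype(int)                       # only feature 3 is informative

scores = mutual_info_classif(X, y, random_state=0)  # score variables one by one
top = np.argsort(scores)[::-1][:10]                 # keep the 10 best-ranked features
X_selected = X[:, top]

Filters like this scale to the 15,000 variables of the large track because each variable is scored independently of the classifier.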
31
Fact Sheets: Classifier
CLASSIFIER (overall usage: 93%)
  • Decision tree
  • Linear classifier
  • Non-linear kernel
  • Other classifier
  • Neural network
  • Naïve Bayes
  • Nearest neighbors
  • Bayesian network
  • Bayesian neural network
[Bar chart: percent of participants per classifier]
  • About 30% logistic loss, >15% exp loss, >15%
    squared loss, 10% hinge loss.
  • Less than 50% used regularization (20% 2-norm,
    10% 1-norm); a sketch follows.
  • Only 13% used unlabeled data.
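A minimal sketch matching the most common loss/regularization profile above: logistic loss with 2-norm regularization (scikit-learn's default penalty). The data are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)  # C = 1/lambda
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]           # scores directly usable for AUC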
32
Fact Sheets: Model Selection
MODEL SELECTION (overall usage: 90%)
  • 10% test set
  • K-fold or leave-one-out
  • Out-of-bag estimate
  • Bootstrap estimate
  • Other MS
  • Other cross-validation
  • Virtual leave-one-out
  • Penalty-based
  • Bi-level
  • Bayesian
[Bar chart: percent of participants per model selection technique; a
cross-validation sketch follows]
  • About 75% ensemble methods
    (1/3 boosting, 1/3 bagging, 1/3 other).
  • About 10% used unscrambling.
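A minimal sketch combining two of the most-used ingredients on this slide, K-fold cross-validation for model selection and a bagged ensemble of decision trees; the data and hyperparameters are illustrative.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] > 0).astype(int)

bagger = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                           n_estimators=50, oob_score=True, random_state=0)
auc = cross_val_score(bagger, X, y, cv=10, scoring="roc_auc")  # 10-fold CV
print("CV AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))

The out-of-bag estimate (oob_score) gives a second, nearly free validation signal, which is one reason bagging shows up in both the classifier and model-selection fact sheets.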
33
Fact Sheets: Implementation
[Bar charts: memory, parallelism (none / multi-processor / run in
parallel), operating system, and software platform used by
participants]
34
Winning methods
  • Fast track
  • IBM Research, USA: ensemble of a wide variety
    of classifiers. Effort put into coding (most
    frequent values coded with binary features,
    missing values replaced by the mean, extra
    features constructed, etc.)
  • ID Analytics, Inc., USA: filter + wrapper FS.
    TreeNet by Salford Systems, an additive boosting
    decision tree technology; bagging also used.
  • David Slate & Peter Frey, USA: grouping of
    modalities/discretization, filter FS, ensemble of
    decision trees.
  • Slow track
  • University of Melbourne: CV-based FS targeting
    AUC. Boosting with classification trees and
    shrinkage, using Bernoulli loss (a sketch in this
    spirit follows the list).
  • Financial Engineering Group, Inc., Japan:
    grouping of modalities, filter FS using AIC,
    gradient tree-classifier boosting.
  • National Taiwan University: average of 3
    classifiers: (1) the joint multiclass problem
    solved with an l1-regularized maximum entropy
    model; (2) AdaBoost with tree-based weak
    learners; (3) selective Naïve Bayes
    (+ small dataset unscrambling).
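A minimal sketch in the spirit of the tree-boosting winners, boosted classification trees with shrinkage and a Bernoulli (log) loss, using scikit-learn rather than the proprietary TreeNet; data and hyperparameters are illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=200,
                                 learning_rate=0.05,  # shrinkage
                                 max_depth=3,         # shallow trees as weak learners
                                 random_state=0)
gbm.fit(X_tr, y_tr)
print("Test AUC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))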

35
Conclusion
  • Participation exceeded our expectations. We thank
    the participants for their hard work, our
    sponsors, and Orange, who offered
  • a problem of real industrial interest with
    challenging scientific and technical aspects
  • prizes.
  • Lessons learned
  • Do not underestimate the participants: five days
    were given for the fast challenge, but a few
    hours sufficed for some participants.
  • Ensemble methods are effective.
  • Ensembles of decision trees offer off-the-shelf
    solutions to problems with large numbers of
    samples and attributes, mixed types of variables,
    and lots of missing values.