Scaling Decision Tree Induction - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Scaling Decision Tree Induction


1
Scaling Decision Tree Induction
2
Outline
  • Why do we need scaling?
  • Cover state of the art methods
  • Details on my research (which is one of the state
    of the art methods)

3
Problems Scaling Decision Trees
  • Data doesn't fit in RAM
  • Numeric attributes require repeated sorting
  • Noisy datasets lead to very large trees
  • Large datasets fundamentally different from
    smaller ones
  • Can't store the entire dataset
  • Underlying phenomenon changes over time

4
Current State-Of-The-Art
  • Disk based methods
    • SPRINT
    • SLIQ
  • Sampling methods
    • BOAT
    • VFDT & CVFDT
  • Data Stream Methods
    • VFDT & CVFDT

5
SPRINT/SLIQ
  • Shafer, Agrawal, Mehta
  • In the IBM Intelligent Miner for Data
  • Learns the same tree as traditional method but
    works with data on disk
  • One scan over the data per level of the induced
    tree

6
SPRINT/SLIQ Details
  • Split the dataset into one file per attribute
    • (value, record ID)
  • Pre-sort each numeric attribute's file
  • Do one scan over each file, find best split point
  • Use hash-tables to split the files maintaining
    sort order
  • Recur
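The single scan over a pre-sorted attribute file can be sketched as follows. This is a minimal illustration, not the SPRINT/SLIQ implementation: the function name best_split, the in-memory list standing in for a file, the labels lookup by record ID, the class labels themselves, and the choice of the Gini index as the split criterion are all assumptions made for the example.

```python
from collections import Counter

def gini(counts, total):
    """Gini impurity of a class-count table."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_split(attr_list, labels):
    """One pass over a pre-sorted (value, record ID) list.

    Returns (threshold, weighted_gini) for the best test 'value <= threshold'.
    """
    n = len(attr_list)
    below = Counter()                                      # class counts seen so far
    above = Counter(labels[rid] for _, rid in attr_list)   # class counts not yet seen
    best = (None, float("inf"))
    for i, (value, rid) in enumerate(attr_list[:-1]):
        cls = labels[rid]
        below[cls] += 1
        above[cls] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # cannot place a split point between equal values
        n_below = i + 1
        score = (n_below * gini(below, n_below)
                 + (n - n_below) * gini(above, n - n_below)) / n
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best

# Hypothetical data: a sorted attribute list and class labels keyed by record ID.
attr_list = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
labels = {1: "yes", 2: "no", 3: "no", 4: "yes", 5: "no", 6: "yes"}
threshold, score = best_split(attr_list, labels)
```

Because the list is already sorted, class counts on each side of every candidate split point are maintained incrementally, so no re-sorting is needed at any node.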

7
SPRINT/SLIQ Splitting Example
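[Figure: splitting one attribute list by a test on another]
Test Attrib (val, rec), sorted: (3, 3) (5, 2) (6, 5) (9, 1) (10, 4) (12, 6)
hashtable (rec -> side): 1 gt, 2 lt, 3 lt, 4 gt, 5 lt, 6 gt
To Split (val, rec), sorted: (10, 1) (14, 6) (20, 2) (25, 4) (30, 3) (40, 5)
gt: (10, 1) (14, 6) (25, 4)
lt: (20, 2) (30, 3) (40, 5)
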
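Using the values from the splitting example, the hash-table step might look like the sketch below. The function name split_attribute_list and the threshold of 8 are assumptions (any threshold between 6 and 9 produces the same hash table); the attribute lists are held in memory here, whereas the disk-based methods stream them from files.

```python
def split_attribute_list(attr_list, side_of):
    """Partition a sorted (value, record ID) list in one pass.

    Records are appended in scan order, so both outputs stay sorted.
    """
    parts = {"lt": [], "gt": []}
    for value, rid in attr_list:
        parts[side_of[rid]].append((value, rid))
    return parts["lt"], parts["gt"]

# The attribute the node tests on, already sorted by value.
test_attr = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
threshold = 8  # assumed split point

# Hash table: record ID -> side of the split.
side_of = {rid: ("lt" if value < threshold else "gt") for value, rid in test_attr}

# Another attribute's list, split via the hash table without re-sorting.
to_split = [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]
lt, gt = split_attribute_list(to_split, side_of)
```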
8
BOAT
  • Gehrke, Ganti, Ramakrishnan, Loh
  • Learns the same tree as traditional methods but
    can be as much as 3x faster than SPRINT/SLIQ
  • When things work out, it learns more than one level
    of tree in one scan over the database

9
BOAT Details
  • Read a sample of data into memory
  • Learn N trees via traditional methods on
    bootstrap samples from this sample
  • Keep any subset of the N trees that is exactly
    the same
  • Verify the subtree with a scan over all data
  • When this fails revert to SPRINT/SLIQ
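The bootstrap agreement step can be illustrated with a heavily simplified sketch that looks at a single split (a one-level tree) rather than whole subtrees. Everything here is an assumption made for illustration: the names best_stump and agreed_root_split, the Gini criterion, binary 0/1 labels, and the synthetic sample. The verification scan over the full data is not shown.

```python
import random

def gini(pos, total):
    """Gini impurity for binary labels, given the count of positives."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)

def best_stump(rows):
    """Best single split (attribute index, threshold) on (features, label) rows."""
    n = len(rows)
    n_pos = sum(label for _, label in rows)
    best = (None, None, float("inf"))            # (attribute, threshold, score)
    for a in range(len(rows[0][0])):
        ordered = sorted(rows, key=lambda r: r[0][a])
        pos_below = 0
        for i in range(n - 1):
            pos_below += ordered[i][1]
            v, nxt = ordered[i][0][a], ordered[i + 1][0][a]
            if v == nxt:
                continue
            k = i + 1
            score = (k * gini(pos_below, k)
                     + (n - k) * gini(n_pos - pos_below, n - k)) / n
            if score < best[2]:
                best = (a, (v + nxt) / 2, score)
    return best[0], best[1]

def agreed_root_split(sample, n_trees=10, seed=0):
    """Learn one stump per bootstrap sample; keep the split only if all agree.

    Returns (attribute, lowest threshold, highest threshold), or None when
    the bootstrap trees disagree on the attribute.
    """
    rng = random.Random(seed)
    stumps = [best_stump(rng.choices(sample, k=len(sample))) for _ in range(n_trees)]
    attributes = {a for a, _ in stumps}
    if len(attributes) != 1:
        return None                              # disagreement: fall back
    thresholds = [t for _, t in stumps]
    return attributes.pop(), min(thresholds), max(thresholds)

# Synthetic in-memory sample: attribute 0 determines the class, attribute 1 does not.
sample = [((i, (i * 37) % 100), int(i >= 50)) for i in range(100)]
result = agreed_root_split(sample)
```

For a numeric attribute the bootstrap trees rarely pick the identical threshold, so the sketch returns a range of thresholds that a later scan over all the data would have to narrow down and confirm.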

10
BOAT Example
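[Figure: three trees learned from bootstrap samples. Each tests x1 (male / female) at the root and then x2, with thresholds 65, 67, and 61 respectively; one tree also tests x3 (no / yes). The combined tree keeps the x1 test and the x2 test, with the x2 threshold somewhere between 61 and 67, and leaves the part where the trees disagree undetermined (?).]
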
11
VFDT/CVFDT
  • Hulten, Spencer, Domingos
  • With high probability learns what traditional
    methods would learn, but much faster
  • Learns from a data stream instead of a database
  • CVFDT is an extension to time-changing concepts

12
Motivation
  • Why use a data stream model?
    • High data rate
    • Essentially infinite data
    • Data collected in varied circumstances
  • Need algorithms that are
    • Constant time per example; use each example once
    • Incremental
    • Anytime
    • Produce results equivalent to traditional
      methods

13
Hoeffding Trees
  • In order to pick split attribute for a node
    looking at a few examples may be sufficient
  • Given a stream of examples
  • Use the first to pick the split at the root
  • Sort succeeding ones to the leaves
  • Pick best attribute there
  • Continue
  • Leaves predict most common class

14
How Much Data?
  • Make sure best attribute is better than second
  • That is: G(best) - G(2nd best) > 0
  • Using a statistical result: the Hoeffding bound
  • Collect data till: observed G(best) - G(2nd best) > e
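The Hoeffding bound says that, with probability 1 - d, the true mean of a random variable with range R lies within e = sqrt(R^2 ln(1/d) / (2n)) of the mean observed over n independent samples. A small sketch of that calculation follows; the function names are illustrative, and d and e stand for the quantities usually written as delta and epsilon.

```python
import math

def hoeffding_bound(value_range, d, n):
    """e such that the observed mean of n samples is within e of the true
    mean with probability 1 - d."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / d) / (2.0 * n))

def examples_needed(value_range, d, gap):
    """Smallest n for which the bound is no larger than 'gap'."""
    return math.ceil(value_range ** 2 * math.log(1.0 / d) / (2.0 * gap ** 2))

e_at_1000 = hoeffding_bound(1.0, 1e-7, 1000)
n_for_gap = examples_needed(1.0, 1e-7, 0.1)
```

The bound shrinks with the square root of n, so a large observed gap between the two best attributes is confirmed after few examples, while a small gap needs many more.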

15
Hoeffding Tree Algorithm
  • Procedure HoeffdingTree(Stream, d)
  • Let HT = Tree with single leaf (root)
  • Initialize sufficient statistics at root
  • For each example (X, y) in Stream
    • Sort (X, y) to leaf using HT
    • Update sufficient statistics at leaf
    • Compute G for each attribute
    • If G(best) - G(2nd best) > e, then
      • Split leaf on best attribute
      • For each branch
        • Start new leaf, init sufficient statistics
  • Return HT

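[Figure: example Hoeffding tree. The root tests x1 (male / female); one branch is a leaf predicting y=0, the other tests x2 (gt 65 / lt 65) with leaves y=0 and y=1.]
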
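The procedure on this slide can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' code: attributes are categorical, G is information gain, the range of G is taken as log2 of the number of classes, a leaf with only one unused attribute is never split, and the names Leaf, Split, and hoeffding_tree are made up for the example.

```python
import math
from collections import defaultdict

def entropy(class_counts):
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total)
                for c in class_counts.values() if c > 0)

class Leaf:
    def __init__(self, attributes):
        self.attributes = attributes            # attributes not yet used on this path
        self.class_counts = defaultdict(int)
        # Sufficient statistics: counts[attribute][value][class]
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}

    def update(self, x, y):
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1

    def gain(self, a):
        n = sum(self.class_counts.values())
        remainder = sum(sum(by_class.values()) / n * entropy(by_class)
                        for by_class in self.counts[a].values())
        return entropy(self.class_counts) - remainder

    def predict(self):
        """Most common class seen at this leaf (needs at least one example)."""
        return max(self.class_counts, key=self.class_counts.get)

class Split:
    def __init__(self, attribute, remaining):
        self.attribute = attribute
        self.children = defaultdict(lambda: Leaf(remaining))

def hoeffding_tree(stream, attributes, n_classes, d):
    root = Leaf(list(attributes))
    value_range = math.log2(n_classes)          # assumed range of information gain
    for x, y in stream:
        parent, node = None, root
        while isinstance(node, Split):          # sort (x, y) to a leaf
            parent, node = node, node.children[x[node.attribute]]
        node.update(x, y)
        if len(node.attributes) < 2:
            continue                            # no second-best attribute to compare
        gains = sorted(((node.gain(a), a) for a in node.attributes), reverse=True)
        n = sum(node.class_counts.values())
        e = math.sqrt(value_range ** 2 * math.log(1 / d) / (2 * n))
        if gains[0][0] - gains[1][0] > e:
            best = gains[0][1]
            new_node = Split(best, [a for a in node.attributes if a != best])
            if parent is None:
                root = new_node
            else:
                parent.children[x[parent.attribute]] = new_node
    return root

# Toy stream: the class equals attribute "a"; attribute "b" is uninformative.
stream = [({"a": i % 2, "b": (i // 2) % 2}, i % 2) for i in range(2000)]
tree = hoeffding_tree(stream, ["a", "b"], n_classes=2, d=1e-7)
```

Each example is touched once and then discarded; only the per-leaf counts are kept, which is what makes the approach suitable for a stream.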
16
Properties of Hoeffding Trees
  • Model may contain incorrect splits, useful?
  • Bound the difference with infinite data tree
  • Chance an arbitrary example takes different path
  • Intuition: an example on level i of the tree has i
    chances to go through a mistaken node

17
VFDT (Very Fast Decision Tree)
  • Memory management
    • Memory dominated by sufficient statistics
    • Deactivate less promising leaves when needed
  • Ties
    • Wasteful to decide between identical attributes
  • Check for splits periodically
  • Pre-pruning (optional)
    • Only make splits that improve the value of G(.)
  • Early stop on bad attributes
  • Bootstrap with traditional learner
  • Rescan old data when time available
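The tie handling and periodic checking can be sketched as a pair of small helpers. The rule used here (split once the Hoeffding bound e falls below a user threshold t, even without a clear winner) and the default of checking every 200 examples are assumptions drawn from the usual description of VFDT, not details stated on this slide; the function names are illustrative.

```python
import math

def should_split(best_gain, second_gain, n, d, t, value_range=1.0):
    """Split when the best attribute clearly wins, or when the bound is so
    tight (e < t) that the top candidates are effectively tied."""
    e = math.sqrt(value_range ** 2 * math.log(1.0 / d) / (2.0 * n))
    clear_winner = best_gain - second_gain > e
    tie = e < t
    return clear_winner or tie

def due_for_check(n_seen, n_min=200):
    """Only re-evaluate G at a leaf every n_min examples."""
    return n_seen % n_min == 0

clear = should_split(0.50, 0.10, n=1000, d=1e-7, t=0.05)
undecided = should_split(0.30, 0.29, n=1000, d=1e-7, t=0.05)
tied = should_split(0.30, 0.29, n=5000, d=1e-7, t=0.05)
```

Without the tie rule, two near-identical attributes could hold a leaf back indefinitely, since the gap between them never exceeds e.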

18
Experiments
  • Compared VFDT and C4.5 (Quinlan, 1993)
  • Same memory limit for both (40 MB)
  • 100k examples for C4.5
  • VFDT settings: d = 10^-7, t = 5%
  • Domains: 2 classes, 100 binary attributes
  • Fifteen synthetic trees, 2.2k to 500k leaves
  • Noise from 0% to 30%

21
Running Times
  • Pentium III at 500 MHz running Linux
  • C4.5 takes 35 seconds to read and process 100k
    examples; VFDT takes 47 seconds
  • VFDT takes 6377 seconds for 20 million examples:
    5752 s to read, 625 s to process
  • VFDT processes 32k examples per second (excluding
    I/O)
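The throughput figure follows directly from the timings on this slide, as the short check below shows. The variable names are illustrative; the numbers are the ones quoted above.

```python
read_s, process_s = 5752, 625        # seconds spent reading and processing
examples = 20_000_000

total_s = read_s + process_s
rate_excluding_io = examples / process_s   # examples per second, processing only
rate_including_io = examples / total_s     # examples per second, end to end
```

Reading the data accounts for about 90% of the run, so I/O rather than learning limits the end-to-end rate, which comes to roughly 3.1k examples per second.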

23
Time-Changing Data Streams
  • Underlying concept often changes over time
  • Seasonal effects
  • Economic cycles
  • Etc.
  • Many KDD systems assume data is a sample from a
    stationary distribution
  • CVFDT: extends VFDT for time-changing data
    streams

24
Dealing with Time Changing Concepts
  • Out-of-date data misleads learner and results in
    larger or less accurate models
  • Maintain a window of the most recent examples
  • When new data arrives update the window and
    reapply the learner
  • Effective when window size similar to concept
    drift rate
  • Extremely inefficient!

25
Concept adapting VFDT
  • Keep up to date with a window of size w
  • Incrementally incorporate and forget examples
  • Smoothly change the induced tree
  • Grow speculative structure
  • Change structure when more accurate
  • Incorporates new examples in constant time
    instead of relearning on the window in O(w) time

26
Window (Forgetting Examples)
  • Keep sufficient statistics at every node
  • Update with new and old examples
  • Keep an ID and only forget where needed
  • Quickly update leaf predictions
  • Periodically check for any invalid splits
  • Some portion due to incorrect initial splits
  • The rest due to changes in the data stream
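Incorporating a new example and forgetting the oldest one can be sketched for a single node as below. The class name WindowedCounts and the flat (attribute, value, class) key are assumptions for illustration; the slide describes keeping such statistics at every node and using an ID to forget an example only at the nodes that saw it, which this sketch does not attempt.

```python
from collections import Counter, deque

class WindowedCounts:
    """Sufficient statistics over the most recent window_size examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()          # (attribute, value, class) -> count

    def add(self, x, y):
        """Add the new example and, if the window is full, forget the oldest."""
        self.window.append((x, y))
        self._apply(x, y, +1)
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()
            self._apply(old_x, old_y, -1)

    def _apply(self, x, y, sign):
        for attribute, value in x.items():
            self.counts[(attribute, value, y)] += sign

stats = WindowedCounts(window_size=2)
stats.add({"a": 1}, 0)
stats.add({"a": 1}, 1)
stats.add({"a": 0}, 1)   # pushes the first example out of the window
```

Each update touches only the counts for one new and one old example, so its cost does not depend on the window size.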

27
Alternate Sub-Trees
  • When a new test looks better, grow an alternate
    sub-tree
  • Replace the old when new is more accurate
  • This smoothly adjusts to changing concepts

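[Figure: tree with a Gender? test at the root and Pets?, College?, and Hair? tests below it, each leading to true / false leaves, illustrating an alternate sub-tree.]
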
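The replacement rule can be sketched by scoring the current sub-tree and its alternate on the same incoming examples. This is an assumed simplification: sub-trees are stood in for by plain prediction functions, and the class name AlternateTracker, the min_examples threshold, and the toy attributes are hypothetical.

```python
class AlternateTracker:
    """Compares a current sub-tree with a speculative alternate on new examples."""

    def __init__(self, current, alternate):
        self.current = current            # callable: x -> predicted class
        self.alternate = alternate        # callable: x -> predicted class
        self.current_hits = 0
        self.alternate_hits = 0
        self.seen = 0

    def observe(self, x, y):
        self.seen += 1
        self.current_hits += self.current(x) == y
        if self.alternate is not None:
            self.alternate_hits += self.alternate(x) == y

    def maybe_swap(self, min_examples=100):
        """Promote the alternate once it has proven more accurate."""
        if (self.alternate is not None and self.seen >= min_examples
                and self.alternate_hits > self.current_hits):
            self.current, self.alternate = self.alternate, None
            return True
        return False

# The concept has drifted: the class now follows "college", not "gender".
tracker = AlternateTracker(lambda x: x["gender"], lambda x: x["college"])
for i in range(100):
    x = {"gender": i % 2, "college": (i // 2) % 2}
    tracker.observe(x, x["college"])
swapped = tracker.maybe_swap()
```

Because the old sub-tree keeps making predictions until the alternate has proven itself, the model changes gradually rather than being rebuilt from scratch.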
28
CVFDT Details
  • Memory Requirements
  • When drift present, CVFDT uses fewer nodes than
    VFDT
  • Observed good results with relatively few
    alternate-trees
  • Update time
  • O(attribs * values * classes * path length)
  • Independent of training set and window size!

29
Other things
  • Dynamic window size
  • Drastic changes in the data stream
  • Drastic changes in the induced model
  • No apparent changes (learn more detail)

30
Synthetic Experiments
  • Concept based on parallel hyper-planes
  • The more closely a hyper-plane is aligned with an
    axis, the better that attribute is as a split;
    rotate the hyper-planes to change structure of
    true tree
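Drift by rotation can be illustrated with a single separating hyper-plane in two dimensions. This is a simplified stand-in for the parallel hyper-planes used in the experiments: the function names, the 10 x 10 grid of test points, and the 90 degree rotation are all assumptions chosen so the result is easy to check.

```python
import math

def make_concept(angle, center=(0.5, 0.5)):
    """Label points by which side of a line through `center` they fall on.

    The line's normal vector is (cos(angle), sin(angle)).
    """
    w = (math.cos(angle), math.sin(angle))

    def label(x):
        return int(sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, center)) >= 0)

    return label

def fraction_changed(before, after, points):
    """Share of points whose label differs between two concepts."""
    return sum(before(p) != after(p) for p in points) / len(points)

grid = [((i + 0.5) / 10, (j + 0.5) / 10) for i in range(10) for j in range(10)]
axis_aligned = make_concept(0.0)          # depends only on the first attribute
rotated = make_concept(math.pi / 2)       # depends only on the second attribute
drift = fraction_changed(axis_aligned, rotated, grid)
```

With the plane aligned to the first axis, one test on the first attribute separates the classes; after the rotation the best attribute changes, so a tree learned before the drift no longer matches the true concept. Smaller rotation angles change fewer labels.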



[Figure: Concept Drift]

31
Synthetic Experiments (cont.)
  • Compare CVFDT with VFDT
  • 5 million training examples
  • Drift inserted by periodically rotating
    hyper-planes
  • About 8% of test points change label each drift
  • 100,000 examples in window
  • 5% noise
  • Results sampled every 10k examples throughout the
    run and averaged

32
Error Rate vs. Attributes
33
Tree Size vs. Attributes
34
Detailed View of Single Run
35
Varying Levels of Drift
36
Details of Adaptation
37
Comparison With VFDT-window
  • CVFDT gets most of the accuracy gain
  • VFDT: 10 min
  • CVFDT: 46 min
  • VFDT-window: est. 548 days!

[Figure legend: VFDT-Window, VFDT, CVFDT]
38
Application: Web Data
  • Trace of all web requests from UW campus
  • 82.8 million requests over one-week period
  • Goal: predict which pages to cache
  • CVFDT does better for first 70% of run
  • VFDT's performance improved near end
  • Data seems to contain drift, but more study is
    needed

39
Open Issues
  • Continuous Attributes
  • Batch version of VFDT
  • Very Fast Post Pruning
  • Extending general method to other algorithms

40
Summary
  • Decision trees important, need some more work to
    scale to today's problems
  • Disk based methods
  • About one scan per level of tree
  • Sampling can produce equivalent trees much faster