Scaling Decision Tree Induction presentation

1
Scaling Decision Tree Induction
2
Outline

Why do we need scaling?
Cover state of the art methods
Details on my research (which is one of the state
of the art methods)

3
Problems Scaling Decision Trees

Data doesn't fit in RAM
Numeric attributes require repeated sorting
Noisy datasets lead to very large trees
Large datasets are fundamentally different from
smaller ones
Can't store the entire dataset
Underlying phenomenon changes over time

4
Current State-Of-The-Art

Disk based methods
SPRINT
SLIQ
Sampling methods
BOAT
VFDT and CVFDT
Data stream methods
VFDT and CVFDT

5
SPRINT/SLIQ

Shafer, Agrawal, Mehta
In the IBM Intelligent Miner for Data
Learns the same tree as traditional method but
works with data on disk
One scan over the data per level of the induced
tree

6
SPRINT/SLIQ Details

Split the dataset into one file per attribute
(value, record ID)
Pre-sort each numeric attribute's file
Do one scan over each file, find best split point
Use hash-tables to split the files maintaining
sort order
Recur
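
The splitting step above can be sketched in Python. This is a minimal in-memory illustration, not the SPRINT or SLIQ implementation: the names `build_side_table` and `split_attribute_list`, the toy values, and the split point 7.5 are all assumptions made for the example, and real systems stream the attribute files from disk.

```python
from typing import Dict, List, Tuple

# One attribute file: (value, record ID) pairs, pre-sorted by value.
AttrList = List[Tuple[float, int]]


def build_side_table(test_list: AttrList, threshold: float) -> Dict[int, str]:
    """Hash table mapping each record ID to the branch chosen by the test attribute."""
    return {rec: ("gt" if val > threshold else "lt") for val, rec in test_list}


def split_attribute_list(attr_list: AttrList, side: Dict[int, str]) -> Dict[str, AttrList]:
    """Split another attribute's file in one sequential pass.

    Entries are appended in scan order, so each output file stays sorted
    and never needs to be re-sorted at deeper levels of the tree.
    """
    parts: Dict[str, AttrList] = {"lt": [], "gt": []}
    for val, rec in attr_list:
        parts[side[rec]].append((val, rec))
    return parts


# Toy data (hypothetical): the attribute being tested and another attribute to split.
test_attr = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
to_split = [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]

side = build_side_table(test_attr, threshold=7.5)  # 7.5 is an assumed split point
parts = split_attribute_list(to_split, side)
```

The hash table is the only structure that must fit in memory; each attribute file is read once, front to back.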

7
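SPRINT/SLIQ Splitting Example

Test attribute file (val, rec), sorted by value:
3 3, 5 2, 6 5, 9 1, 10 4, 12 6
Hash table built from the test (rec, branch):
1 gt, 2 lt, 3 lt, 4 gt, 5 lt, 6 gt
Attribute file to split (val, rec), sorted by value:
10 1, 14 6, 20 2, 25 4, 30 3, 40 5
Resulting gt file: 10 1, 14 6, 25 4
Resulting lt file: 20 2, 30 3, 40 5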
8
BOAT

Gehrke, Ganti, Ramakrishnan, Loh
Learns the same tree as traditional methods but
can be as much as 3x faster than SPRINT/SLIQ
When things work out, learns more than one level
of the tree in one scan over the database

9
BOAT Details

Read a sample of data into memory
Learn N trees via traditional methods on
bootstrap samples from this sample
Keep any part (subtree) of the N trees that is
exactly the same in all of them
Verify the subtree with a scan over all data
When this fails revert to SPRINT/SLIQ
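
The bootstrap-and-verify idea can be sketched in Python as follows. This is a heavily simplified illustration, not the BOAT algorithm itself: it only decides the root split, handles binary attributes only, uses information gain as the split measure, and the names `best_attribute`, `agreed_root_split`, and `verify` are hypothetical. BOAT also handles numeric split points and deeper subtrees, which are omitted here.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_attribute(rows):
    """Index of the binary attribute with the highest information gain.

    rows is a list of (attribute_tuple, label) pairs.
    """
    base = entropy([y for _, y in rows])

    def gain(a):
        remainder = 0.0
        for v in (0, 1):
            part = [y for x, y in rows if x[a] == v]
            if part:
                remainder += len(part) / len(rows) * entropy(part)
        return base - remainder

    return max(range(len(rows[0][0])), key=gain)


def agreed_root_split(sample, n_trees=20, seed=0):
    """Choose a root split on n_trees bootstrap resamples of the in-memory sample.

    Returns the attribute index if every resample agrees, otherwise None.
    """
    rng = random.Random(seed)
    choices = set()
    for _ in range(n_trees):
        boot = [rng.choice(sample) for _ in sample]
        choices.add(best_attribute(boot))
    return choices.pop() if len(choices) == 1 else None


def verify(full_data, attr):
    """One pass over all data: does the agreed split match the true best split?"""
    return best_attribute(full_data) == attr
```

If `agreed_root_split` returns `None`, or `verify` fails, the fallback described above applies.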

10
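BOAT Example

Figure: three bootstrap trees that all test x1? (male / female) at the root and then x2?, with x2 split points of 65, 67, and 61; one tree goes on to test x3? (no / yes). The combined tree keeps x1? and x2?, with the x2 split point somewhere between 61 and 67, and leaves the rest of the tree undecided (?).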
11
VFDT/CVFDT

Hulten, Spencer, Domingos
With high probability learns what traditional
methods would learn, but much faster
Learns from a data stream instead of a database
CVFDT is an extension to time-changing concepts

12
Motivation

Why use a data stream model?
High data rate
Essentially infinite data
Data collected in varied circumstances
Need algorithms that are
Constant time per example (use each example once)
Incremental
Anytime
Produce results equivalent to traditional
methods

13
Hoeffding Trees

In order to pick the split attribute for a node,
looking at a few examples may be sufficient
Given a stream of examples
Use the first ones to pick the split at the root
Sort succeeding ones to the leaves
Pick best attribute there
Continue
Leaves predict most common class

14
How Much Data?

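Make sure the best attribute is better than the second best
That is: G(best) - G(2nd best) > 0
Using a statistical result, the Hoeffding bound: ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of G, n is the number of examples seen, and 1 - δ is the desired confidence
Collect data till G(best) - G(2nd best) > ε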
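
As a rough numerical illustration, the Hoeffding bound states that after n independent observations of a quantity with range R, the true mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability at least 1 - δ. The Python sketch below evaluates that formula and inverts it; the function names are assumptions made for this example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound drops to epsilon or below."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# Example: information gain with two classes has range R = log2(2) = 1.
n = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.05)
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=n)
```

Because ε shrinks as 1/sqrt(n), attributes that are clearly better are chosen after few examples, while near-ties need many more.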

15
Hoeffding Tree Algorithm

Procedure HoeffdingTree(Stream, δ)
  Let HT = tree with a single leaf (the root)
  Initialize sufficient statistics at root
  For each example (X, y) in Stream
    Sort (X, y) to leaf using HT
    Update sufficient statistics at leaf
    Compute G for each attribute
    If G(best) - G(2nd best) > ε, then
      Split leaf on best attribute
      For each branch
        Start new leaf, init sufficient statistics
  Return HT
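
The procedure above can be sketched in Python. This is a minimal illustration under several assumptions, not the authors' VFDT code: attributes are discrete, G is information gain (so its range R is log2 of the number of classes), the split check runs after every example, and the class names `Leaf`, `Split`, and `HoeffdingTree` are invented for the example.

```python
import math
from collections import defaultdict


class Leaf:
    def __init__(self, attrs):
        self.attrs = set(attrs)            # attributes still available for splitting
        self.n = 0                         # examples seen at this leaf
        self.class_counts = defaultdict(int)
        # Sufficient statistics: counts[attr][value][label]
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in self.attrs}

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attrs:
            self.counts[a][x[a]][y] += 1

    def predict(self):
        # Most common class; None if the leaf has seen no examples yet.
        return max(self.class_counts, key=self.class_counts.get) if self.class_counts else None


class Split:
    def __init__(self, attr, attrs_left):
        self.attr = attr
        self.children = defaultdict(lambda: Leaf(attrs_left))  # one new leaf per branch


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def info_gain(leaf, attr):
    remainder = sum(sum(by_label.values()) / leaf.n * entropy(by_label.values())
                    for by_label in leaf.counts[attr].values())
    return entropy(leaf.class_counts.values()) - remainder


class HoeffdingTree:
    def __init__(self, attrs, n_classes, delta=1e-7):
        self.root = Leaf(attrs)
        self.delta = delta
        self.value_range = math.log2(n_classes)   # range R of information gain

    def _find(self, x):
        parent, node = None, self.root
        while isinstance(node, Split):
            parent, node = node, node.children[x[node.attr]]
        return parent, node

    def predict(self, x):
        return self._find(x)[1].predict()

    def learn_one(self, x, y):
        parent, leaf = self._find(x)               # sort (x, y) to a leaf
        leaf.update(x, y)                          # update sufficient statistics
        if len(leaf.attrs) < 2 or len(leaf.class_counts) < 2:
            return                                 # nothing to compare, or leaf is pure
        gains = sorted(((info_gain(leaf, a), a) for a in leaf.attrs), reverse=True)
        eps = math.sqrt(self.value_range ** 2 * math.log(1 / self.delta) / (2 * leaf.n))
        if gains[0][0] - gains[1][0] > eps:        # best beats second best by more than eps
            split = Split(gains[0][1], leaf.attrs - {gains[0][1]})
            if parent is None:
                self.root = split
            else:
                parent.children[x[parent.attr]] = split


tree = HoeffdingTree(attrs=range(3), n_classes=2)
for i in range(2000):
    x = (i % 2, (i // 2) % 2, (i // 4) % 2)
    tree.learn_one(x, x[0])                        # toy concept: the label equals attribute 0
```

Each example is used once and then discarded; only the per-leaf counts are kept.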

Figure: example tree that tests x1? (male / female) at the root and x2? (gt 65 / lt 65) below it, with leaves predicting y0 or y1.
16
Properties of Hoeffding Trees

Model may contain incorrect splits; is it still useful?
Bound the difference from the infinite-data tree
Chance that an arbitrary example takes a different path
Intuition: an example on level i of the tree has i
chances to go through a mistaken node

17
VFDT (Very Fast Decision Tree)

Memory management
Memory dominated by sufficient statistics
Deactivate less promising leaves when needed
Ties
Wasteful to decide between identical attributes
Check for splits periodically
Pre-pruning (optional)
Only make splits that improve the value of G(.)
Early stop on bad attributes
Bootstrap with traditional learner
Rescan old data when time available

18
Experiments

Compared VFDT and C4.5 (Quinlan, 1993)
Same memory limit for both (40 MB)
100k examples for C4.5
VFDT settings: δ = 10^-7, τ = 5%
Domains: 2 classes, 100 binary attributes
Fifteen synthetic trees, 2.2k to 500k leaves
Noise from 0% to 30%

21
Running Times

Pentium III at 500 MHz running Linux
C4.5 takes 35 seconds to read and process 100k
examples; VFDT takes 47 seconds
VFDT takes 6377 seconds for 20 million examples:
5752 s to read, 625 s to process
VFDT processes 32k examples per second (excluding
I/O)

23
Time-Changing Data Streams

Underlying concept often changes over time
Seasonal effects
Economic cycles
Etc.
Many KDD systems assume data is a sample from a
stationary distribution
CVFDT -- Extends VFDT for time-changing data
streams

24
Dealing with Time Changing Concepts

Out-of-date data misleads learner and results in
larger or less accurate models
Maintain a window of the most recent examples
When new data arrives update the window and
reapply the learner
Effective when window size is similar to concept
drift rate
Extremely inefficient!

25
Concept adapting VFDT

Keep up to date with a window of size w
Incrementally incorporate and forget examples
Smoothly change the induced tree
Grow speculative structure
Change structure when more accurate
Incorporates new examples in constant time
instead of relearning on the window in O(w) time

26
Window (Forgetting Examples)

Keep sufficient statistics at every node
Update with new and old examples
Keep an ID and only forget where needed
Quickly update leaf predictions
Periodically check for any invalid splits
Some portion due to incorrect initial splits
The rest due to changes in the data stream
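
A minimal Python sketch of the add-and-forget bookkeeping, assuming discrete attributes and a single set of counts. The class name `WindowedStats` is hypothetical, and CVFDT keeps such counts at every node along an example's path rather than in one place as shown here.

```python
from collections import Counter, deque


class WindowedStats:
    """Sufficient statistics over the w most recent examples.

    Adding an example increments its counts; once the window is full, the
    oldest example is forgotten by decrementing the same counts. The cost
    per example does not depend on w.
    """

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.class_counts = Counter()   # label -> count
        self.attr_counts = Counter()    # (attribute index, value, label) -> count

    def _apply(self, x, y, sign):
        self.class_counts[y] += sign
        for a, v in enumerate(x):
            self.attr_counts[(a, v, y)] += sign

    def add(self, x, y):
        self.window.append((x, y))
        self._apply(x, y, +1)                  # incorporate the new example
        if len(self.window) > self.w:
            old_x, old_y = self.window.popleft()
            self._apply(old_x, old_y, -1)      # forget the oldest example

    def predict(self):
        """Most common class among the examples currently in the window."""
        return self.class_counts.most_common(1)[0][0] if self.window else None


stats = WindowedStats(w=3)
for x, y in [((0,), "a"), ((0,), "a"), ((1,), "b"), ((1,), "b"), ((1,), "b")]:
    stats.add(x, y)
```

After the loop only the three most recent examples contribute to the counts, so the prediction follows the newer concept.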

27
Alternate Sub-Trees

When a new test looks better, grow an alternate
sub-tree
Replace the old when new is more accurate
This smoothly adjusts to changing concepts
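
The replacement rule can be sketched in Python as below. This is an illustrative simplification, not CVFDT's actual procedure: sub-trees are stood in for by plain prediction functions, accuracy is compared over a fixed number of examples (`check_every`, an assumed parameter), and the class name `NodeWithAlternate` is hypothetical.

```python
class NodeWithAlternate:
    """Holds the current sub-tree and one alternate; swaps when the alternate wins."""

    def __init__(self, current, alternate, check_every=100):
        self.current = current          # callable: x -> predicted label
        self.alternate = alternate      # callable, or None if no alternate is growing
        self.check_every = check_every
        self.seen = 0
        self.current_hits = 0
        self.alternate_hits = 0

    def observe(self, x, y):
        """Score both sub-trees on a labelled example; swap at the end of each period."""
        if self.alternate is None:
            return
        self.seen += 1
        self.current_hits += int(self.current(x) == y)
        self.alternate_hits += int(self.alternate(x) == y)
        if self.seen >= self.check_every:
            if self.alternate_hits > self.current_hits:
                # The alternate is more accurate on recent data: replace the old sub-tree.
                self.current, self.alternate = self.alternate, None
            self.seen = self.current_hits = self.alternate_hits = 0

    def predict(self, x):
        return self.current(x)


# Toy usage: the old sub-tree always predicts 0, the alternate follows attribute 0.
node = NodeWithAlternate(current=lambda x: 0, alternate=lambda x: x[0])
for i in range(100):
    x = (i % 2,)
    node.observe(x, x[0])
```

Predictions keep coming from the old sub-tree until the swap, which is what makes the adjustment smooth rather than abrupt.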

Figure: example tree with tests Gender?, Pets?, College?, and Hair? and true / false leaves, illustrating an alternate sub-tree.
28
CVFDT Details

Memory Requirements
When drift present, CVFDT uses fewer nodes than
VFDT
Observed good results with relatively few
alternate-trees
Update time
O(attribs × values × classes × path length)
Independent of training set and window size!

29
Other things

Dynamic window size
Drastic changes in the data stream
Drastic changes in the induced model
No apparent changes (learn more detail)

30
Synthetic Experiments

Concept based on parallel hyper-planes
Better-aligned axes make better split attributes; rotate the
hyper-planes to change the structure of the true tree

Figure: Concept Drift
31
Synthetic Experiments (cont.)

Compare CVFDT with VFDT
5 million training examples
Drift inserted by periodically rotating
hyper-planes
About 8% of test points change label at each drift
100,000 examples in window
5% noise
Results sampled every 10k examples throughout the
run and averaged
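
A drifting synthetic stream of this kind might be generated as in the Python sketch below. It is a simplified stand-in, not the generator used in the experiments: it uses a single hyper-plane rather than several parallel ones, draws attributes uniformly from [0, 1], and models drift by swapping in a new weight vector. The function name `hyperplane_stream` and its parameters are assumptions.

```python
import random


def hyperplane_stream(n, weights, noise=0.05, seed=0):
    """Yield n examples (x, y) labelled by which side of a hyper-plane x falls on.

    The label is 1 when sum(w_i * x_i) >= sum(w_i) / 2, otherwise 0, and it
    is flipped with probability `noise`.
    """
    rng = random.Random(seed)
    threshold = sum(weights) / 2
    for _ in range(n):
        x = [rng.random() for _ in weights]
        y = int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)
        if rng.random() < noise:
            y = 1 - y
        yield x, y


# Drift: "rotate" the hyper-plane by changing the weights between segments.
before = list(hyperplane_stream(1000, weights=[1.0, 0.0], noise=0.05, seed=1))
after = list(hyperplane_stream(1000, weights=[0.0, 1.0], noise=0.05, seed=2))
stream = before + after
```

Attributes with larger weights are better split attributes, so changing the weights changes which tests belong near the root of the true tree.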

32
Error Rate vs. Attributes
33
Tree Size vs. Attributes
34
Detailed View of Single Run
35
Varying Levels of Drift
36
Details of Adaptation
37
Comparison With VFDT-window

CVFDT gets most of the accuracy gain
Running times:
VFDT: 10 min
CVFDT: 46 min
VFDT-window: est. 548 days!
38
Application: Web Data

Trace of all web requests from UW campus
82.8 million requests over a one-week period
Goal: predict which pages to cache
CVFDT does better for first 70% of run
VFDT's performance improved near end
Data seems to contain drift, but more study is
needed

39
Open Issues

Continuous Attributes
Batch version of VFDT
Very Fast Post Pruning
Extending general method to other algorithms

40
Summary

Decision trees important, need some more work to
scale to today's problems
Disk based methods
About one scan per level of tree
Sampling can produce equivalent trees much faster
