1
Data Mining, Session 4: Scaling up decision tree learners
  • Luc Dehaspe
  • K.U.L. Computer Science Department

2
Data Mining process model: CRISP-DM
[Figure: the CRISP-DM process model.]
3
Course overview
Sessions 2-3: Data preparation
Session 4: Data Mining
4
Scaling up Decision trees
  • Decision trees
  • Scaling up
  • Scaling up Decision trees

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000. Chapter 4.3, Divide-and-Conquer: Constructing Decision Trees.
F. Provost and V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 2:131-169, 1999.
A. Srivastava, E. Han, V. Kumar, V. Singh. Parallel Formulations of Decision-Tree Classification Algorithms. Data Mining and Knowledge Discovery, 3:237-261, 2000.
5
Classification
  • the process of assigning new objects to predefined categories or classes:
  • given a set of labeled records,
  • build a model (a decision tree),
  • predict labels for future unlabeled records (see the sketch below)
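
As a hedged illustration (not from the original slides), the same three steps with scikit-learn's decision tree learner; the tiny label-encoded data set here is made up for the example:

    # Sketch of the classification process: labeled records in, model out,
    # predictions for unlabeled records. Assumes scikit-learn is installed;
    # the data and encoding below are illustrative only.
    from sklearn.tree import DecisionTreeClassifier

    # outlook encoded as 0=sunny, 1=overcast, 2=rainy; windy as 0=FALSE, 1=TRUE
    X_train = [[0, 0], [0, 1], [1, 0], [2, 0]]   # labeled records
    y_train = ["no", "no", "yes", "yes"]         # their class labels

    model = DecisionTreeClassifier(criterion="entropy")  # information gain
    model.fit(X_train, y_train)                  # build the model (a tree)

    print(model.predict([[1, 1]]))               # label a future record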

6
Decision trees
7
Decision trees: The Weka tool
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
http://www.cs.waikato.ac.nz/ml/weka/
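
As an aside not in the original deck, this ARFF file can be read directly in Python; scipy.io.arff.loadarff accepts a file-like object, so a minimal hedged sketch is:

    # Sketch: parsing the weather.symbolic ARFF with SciPy's ARFF reader.
    from io import StringIO
    from textwrap import dedent
    from scipy.io import arff

    ARFF_TEXT = dedent("""\
        @relation weather.symbolic
        @attribute outlook {sunny, overcast, rainy}
        @attribute temperature {hot, mild, cool}
        @attribute humidity {high, normal}
        @attribute windy {TRUE, FALSE}
        @attribute play {yes, no}
        @data
        sunny,hot,high,FALSE,no
        overcast,hot,high,FALSE,yes
        rainy,mild,high,TRUE,no
        """)  # first rows only; the full data set above has 14 records

    data, meta = arff.loadarff(StringIO(ARFF_TEXT))
    print(meta)                            # attribute names and types
    print([row["play"] for row in data])   # class labels, as byte strings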
8
Decision trees: Attribute selection
http://www-lmmb.ncifcrf.gov/toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
9
Decision trees: Attribute selection
[Figure: selecting the split with maximal information gain. The entropy of the full data set is 0.94 bits, i.e., the amount of information required to specify the class of an example given that it reaches the node; the sunny and rainy subsets each have entropy 0.97 bits with weight 5/14, and the resulting gain for outlook is 0.25 bits.]
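
Written out (a reconstruction consistent with the figure's numbers; 9 yes and 5 no records in total):

\begin{align*}
\mathrm{info}(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.94\ \text{bits}\\
\mathrm{info}_{\text{outlook}}(D) &= \tfrac{5}{14}\cdot 0.97 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.97 \approx 0.69\ \text{bits}\\
\mathrm{gain}(\text{outlook}) &= 0.94 - 0.69 \approx 0.25\ \text{bits}
\end{align*}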
10
Decision trees: Building
[Figure: the tree under construction. The root splits on outlook into sunny / overcast / rainy; the sunny branch (entropy 0.97 bits) is expanded next by again choosing the attribute with maximal information gain.]
11
Decision trees: Building
[Figure: the sunny branch of the outlook split (0.97 bits) is now split on humidity, with branches high / normal.]
12
Decision trees: Final tree
outlook = sunny:
    humidity = high:   no
    humidity = normal: yes
outlook = overcast:    yes
outlook = rainy:
    windy = false: yes
    windy = true:  no
13
Decision tree: Basic Algorithm
  • Initialize the top node to hold all examples
  • While impure leaves are available:
  • select the next impure leaf L
  • find the splitting attribute A with maximal information gain
  • for each value of A, add a child to L (see the sketch below)
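
A minimal Python sketch of this algorithm in its usual recursive, depth-first form (helper names are mine, not from the slides):

    # Sketch of the basic divide-and-conquer tree builder. Examples are
    # dicts; the class label is stored under the key "class".
    import math
    from collections import Counter

    def entropy(examples):
        counts = Counter(e["class"] for e in examples)
        n = len(examples)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def info_gain(examples, attr):
        n = len(examples)
        remainder = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e for e in examples if e[attr] == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(examples) - remainder

    def build_tree(examples, attrs):
        if entropy(examples) == 0 or not attrs:   # pure leaf, or nothing to split on
            return Counter(e["class"] for e in examples).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(examples, a))
        children = {value: build_tree([e for e in examples if e[best] == value],
                                      [a for a in attrs if a != best])
                    for value in {e[best] for e in examples}}
        return (best, children)                   # one child per value of best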

14
Decision tree: Find a good split
  • Sufficient statistics to compute information gain: the count matrix, i.e., class counts per value of the candidate attribute (see the sketch below)
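
A hedged sketch of that computation, working from the count matrix alone:

    # Information gain computed from a count matrix: one row of class
    # counts per attribute value. No pass over the raw data is needed.
    import math

    def entropy_from_counts(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def gain_from_count_matrix(matrix):
        # matrix: {attribute value: [count of class 1, count of class 2, ...]}
        class_totals = [sum(col) for col in zip(*matrix.values())]
        n = sum(class_totals)
        remainder = sum(sum(row) / n * entropy_from_counts(row)
                        for row in matrix.values())
        return entropy_from_counts(class_totals) - remainder

    # Count matrix for outlook on the weather data ([yes, no] per value):
    outlook = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}
    print(round(gain_from_count_matrix(outlook), 2))   # -> 0.25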

15
Decision trees: ID3 - C4.5 - C5 (Quinlan)
  • Simple depth-first construction
  • Needs the entire data set to fit in memory
  • Unsuitable for large data sets
  • Need to scale up

16
Scaling up Decision trees
  • Decision trees
  • Scaling up
  • Scaling up Decision trees

17
Scaling up: Why?
  • Terabyte databases exist (e.g., 100 M records × 2000 fields × 5 bytes = 1 TB)
  • Increasing the size of the training set often increases the accuracy of learned classification models
  • Overfitting occurs with small data sets due to:
  • small disjuncts
  • large numbers of features (sparsely populated model spaces)
  • In a discovery setting, special cases should occur frequently enough
  • Other motivations for fast algorithms: interaction, cross-validation, multiple models

18
Scaling up: Very large?
  • for the database practitioner: 100 GB
  • for data mining: 100 MB - 1 GB
  • "Somewhere around data sizes of 100 MB or so, qualitatively new, very serious scaling problems begin to arise, both on the human and on the algorithmic side" (Huber, 1997)

19
Scaling up: What?
  • pragmatic: turning an impractical algorithm into a practical one; how large a problem can you deal with?
  • time complexity: what is the growth rate of the algorithm's run time with an increasing number of:
  • examples
  • attributes per example
  • values per attribute
  • space complexity:
  • main memory limitation
  • no substantial loss of accuracy

20
Scaling up: Methods
  • Fast algorithms:
  • algorithm/programming optimizations
  • parallelization
  • Data partitioning (horizontal, vertical)
  • Relational representations:
  • e.g., a 3-table database (5 bytes/value):
  • customer table: 1 million customers, 20 attributes
  • state table: 50 states, 80 attributes
  • product table: 10 products, 400 attributes
  • 100 MB database → 2.5 GB single table (checked below)
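
The blowup follows because joining the three tables repeats each customer's state and product attributes in every customer row (assuming one state and one product per customer); a quick arithmetic check:

    # Back-of-the-envelope check of the relational vs. single-table sizes.
    BYTES_PER_VALUE = 5
    customers, customer_attrs = 1_000_000, 20
    states, state_attrs = 50, 80
    products, product_attrs = 10, 400

    relational = BYTES_PER_VALUE * (customers * customer_attrs
                                    + states * state_attrs
                                    + products * product_attrs)
    # Joined: each of the 1 M customer rows carries all 500 attributes.
    single_table = BYTES_PER_VALUE * customers * (customer_attrs
                                                  + state_attrs
                                                  + product_attrs)

    print(relational / 10**6, "MB")    # ~100 MB
    print(single_table / 10**9, "GB")  # 2.5 GB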

21
Scaling up Decision trees
  • Decision trees
  • Scaling up
  • Scaling up Decision trees

22
Decision trees, algorithm optimizations: SPRINT (Shafer, Agrawal, Mehta, 1996)
[Figure: the horizontal data format (one row per record) is converted into one attribute list per attribute.]
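
A hedged sketch of that conversion: each attribute list holds (attribute value, class label, record id) triples and can be processed independently, so no single structure has to hold the whole table (the names below are illustrative):

    # Sketch: building SPRINT-style attribute lists from horizontal records.
    records = [
        {"rid": 0, "outlook": "sunny",    "windy": "FALSE", "play": "no"},
        {"rid": 1, "outlook": "overcast", "windy": "FALSE", "play": "yes"},
        {"rid": 2, "outlook": "rainy",    "windy": "TRUE",  "play": "no"},
    ]

    def attribute_lists(records, attrs):
        # One independent list per attribute: (value, class label, record id).
        # Numeric attribute lists would additionally be pre-sorted by value.
        return {a: [(r[a], r["play"], r["rid"]) for r in records] for a in attrs}

    lists = attribute_lists(records, ["outlook", "windy"])
    print(lists["outlook"])   # [('sunny', 'no', 0), ('overcast', 'yes', 1), ...]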
23
Decision trees: SPRINT
[Figure: performing a split. The entries with outlook = sunny form the attribute list L(sunny); a hash table on record ids routes the entries of the other attribute lists to the matching child.]
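
In sketch form (continuing the illustrative lists built above):

    # Sketch: splitting every attribute list using a hash table of record ids.
    lists = {
        "outlook": [("sunny", "no", 0), ("overcast", "yes", 1), ("rainy", "no", 2)],
        "windy":   [("FALSE", "no", 0), ("FALSE", "yes", 1), ("TRUE", "no", 2)],
    }

    def split_lists(lists, split_attr, left_value):
        # Hash table: record id -> goes left?  Its size is O(N), which is
        # exactly the memory problem noted on the next slide.
        goes_left = {rid: value == left_value
                     for (value, _, rid) in lists[split_attr]}
        left = {a: [e for e in entries if goes_left[e[2]]]
                for a, entries in lists.items()}
        right = {a: [e for e in entries if not goes_left[e[2]]]
                 for a, entries in lists.items()}
        return left, right

    left, right = split_lists(lists, "outlook", "sunny")
    print(left["windy"])   # [('FALSE', 'no', 0)] -- only record id 0 went left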
24
Decision trees: SPRINT
  • Advantages:
  • attribute lists might fit in memory when the total data set doesn't
  • Disadvantages:
  • the size of the hash table is O(N) for the top levels of the tree
  • if the hash table does not fit in memory (mostly true for large data sets), it must be built in parts, so that each part fits

25
Parallelism: Task vs. Data Parallelism
  • Data Parallelism:
  • the data is partitioned among P processors
  • each processor performs the same work on its local partition
  • Task Parallelism:
  • each processor performs a different computation
  • the data may be (selectively) replicated or partitioned

26
Parallelism: Static vs. Dynamic Load Balancing
  • Static Load Balancing:
  • the work is divided once, initially
  • no subsequent computation or data movement
  • Dynamic Load Balancing:
  • steal work from heavily loaded processors
  • reassign it to lightly loaded processors

27
Decision trees: Data Parallelism
28
Decision trees: Data Parallelism
  • Synchronous Tree Construction:
  • initially partition the data across processors
  • Pro:
  • no data movement is required
  • Con:
  • load imbalance
  • communication cost becomes too high in the lower parts of the tree
[Figure: all processors (proc 0, ...) expand the same node together, as in the sketch below.]
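
A hedged sketch of one synchronous step, assuming MPI via mpi4py (the data loader is hypothetical): each processor computes the count matrix on its local partition, an all-reduce sums the matrices, and every processor then picks the same split:

    # Sketch: data-parallel (synchronous) computation of split statistics.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def local_count_matrix(partition, attr, values, classes):
        m = np.zeros((len(values), len(classes)), dtype=np.int64)
        for record in partition:
            m[values.index(record[attr]), classes.index(record["class"])] += 1
        return m

    # Hypothetical loader: each processor reads only its own partition.
    partition = load_local_partition(comm.Get_rank())

    local = local_count_matrix(partition, "outlook",
                               ["sunny", "overcast", "rainy"], ["yes", "no"])

    # Sum count matrices across all processors; afterwards every processor
    # holds the global sufficient statistics and chooses the same split.
    global_counts = comm.allreduce(local, op=MPI.SUM)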
29
Decision trees: Task parallelism
[Figure: partitioned tree construction; different subtrees are handled by different processors (proc 2, proc 3, ...).]
30
Decision trees: Task parallelism
  • Partitioned Tree Construction:
  • Pro:
  • highly concurrent
  • no data movement once one processor is responsible for an entire subtree
  • Con:
  • excessive data movement until one processor is responsible
  • load imbalance:
  • portions of the tree die out (nodes are assigned to processors based on the number of cases, not the actual work)
  • solution: dynamic load balancing
  • exchange data between processor partitions
  • load balance within processor partitions
31
Decision tree: Hybrid parallel formulation
  • Start with synchronous tree construction
  • Detect when the cost of communication becomes too high and redistributing the data is cheaper:
  • Σ(communication cost) ≥ (moving cost) + (load balancing cost)
  • Proceed with the partitioned tree construction approach (see the sketch below)
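
As a sketch (function and variable names are mine, criterion as stated above):

    # Sketch of the hybrid switch rule: stay synchronous while communication
    # is cheap; once its accumulated cost reaches the one-time cost of moving
    # the data plus load balancing, repartition and continue task-parallel.
    def should_switch(comm_costs_so_far, moving_cost, load_balancing_cost):
        return sum(comm_costs_so_far) >= moving_cost + load_balancing_cost

    if should_switch([1.5, 2.0, 3.5], moving_cost=4.0, load_balancing_cost=2.0):
        pass  # split the processors into two partitions; proceed partitioned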

32
Decision tree: Hybrid parallel formulation (detail)
  • The database is split among P processors; initially all processors are assigned to one partition, and the tree root node is allocated to that partition
  • Nodes at the frontier of the tree within a partition are processed with synchronous tree construction
  • When the communication cost becomes prohibitive → the processors in the partition are divided into 2 partitions, and the current set of frontier nodes is split and allocated accordingly (partitioned tree construction)
  • The above steps are repeated within each of the partitions for the subtrees
  • Idle processors are recycled

33
Speedup comparisons
  • Data set: 1.6 million examples, 11 attributes

34
Scaling up Decision trees: Summary
  • Decision trees
  • Scaling up:
  • why? what?
  • methods:
  • fast algorithms
  • data partitioning
  • relational representations
  • Scaling up Decision trees:
  • SPRINT: attribute lists
  • Parallelism (data, task, hybrid)