BOAT - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: BOAT


1
BOAT
  • Bootstrapped Optimistic Algorithm for Tree
    Construction

2
CIS 595, FALL 2000
  • Presentation
  • Prashanth Saka

3
  • BOAT is a new algorithm for decision tree construction that improves on earlier approaches in both functionality and performance, yielding a performance gain of around 300%.
  • The reason: it requires only two scans over the entire training data set.
  • It is the first scalable algorithm able to incrementally update the tree with respect to both insertions and deletions over the dataset.

4
  • Take a sample D′ ⊆ D from the training database and construct a sample tree with coarse splitting criteria at each node using bootstrapping.
  • Make one scan over the database D and process each tuple t by streaming it down the tree (see the sketch below).
  • At the root node n, update the counts of the buckets for each numerical predictor attribute.
  • If t falls in the confidence interval at node n, t is written into a temporary file Sn at n; otherwise it is sent further down the tree.
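
A minimal Python sketch of this single scan, assuming hypothetical node fields (split_attr, split_point, conf_low/conf_high for the confidence interval, and an in-memory buffer S standing in for the temporary file Sn); the bucket-count updates are omitted for brevity:

    def process_tuple(t, n):
        """Stream one tuple t of D down the sample tree during the single scan."""
        while n is not None and n.split_attr is not None:
            v = t[n.split_attr]
            if n.conf_low <= v <= n.conf_high:
                n.S.append(t)   # t may determine the final split point at node n
                return
            # Outside the interval the side of the final split is already known,
            # so t is simply sent further down the tree.
            n = n.left if v <= n.split_point else n.right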

5
  • The tree is then processed top-down.
  • At each node, a lower-bounding technique is used to check whether the global minimum value of the impurity function could be lower than i*, the minimum impurity value found at that node (see the sketch below).
  • If the check succeeds, we are done with the node n; otherwise, we discard n and its subtree during the current construction.
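
The slides do not describe the lower-bounding technique itself; the sketch below only captures the decision it feeds, with outside_lower_bound standing in for a (hypothetical) lower bound on the impurity achievable by any split point outside the confidence interval:

    def split_confirmed(i_star, outside_lower_bound):
        """True if no split outside the confidence interval can beat the best
        impurity i_star found inside it; otherwise node n and its subtree are
        discarded during the current construction."""
        return outside_lower_bound >= i_star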

6
  • Each node of the decision tree has exactly one incoming edge (except the root) and either zero or two outgoing edges.
  • Each leaf is labeled with one class label.
  • Each internal node is labeled with one predictor attribute Xn, called the splitting attribute.
  • Each internal node has a splitting predicate qn associated with it.
  • If Xn is numerical, then qn is of the form Xn ≤ xn, where xn ∈ dom(Xn); xn is the split point at node n (a minimal sketch of this structure follows).
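
A minimal Python sketch of this node structure; the field names are illustrative, not from the presentation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        """Decision tree node: internal nodes carry X_n and the split point x_n
        of the predicate X_n <= x_n; leaves carry a class label."""
        split_attr: Optional[str] = None     # splitting attribute X_n (None for a leaf)
        split_point: Optional[float] = None  # split point x_n
        left: Optional["Node"] = None        # child followed when X_n <= x_n holds
        right: Optional["Node"] = None       # child followed otherwise
        label: Optional[str] = None          # class label (leaves only)

        def is_leaf(self):
            return self.split_attr is None

    # Example: the root splits on a hypothetical numerical attribute "age" at 30.
    root = Node(split_attr="age", split_point=30.0,
                left=Node(label="yes"), right=Node(label="no"))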

7
  • The combined information of splitting attribute and splitting predicate at a node n is called the splitting criterion at n.

8
  • We associate with each node n ∈ T a predicate fn : dom(X1) × ... × dom(Xm) → {true, false}, called its node predicate, as follows:
  • For the root node n, fn = true.
  • Let n be a non-root node with parent p, whose splitting predicate is qp.
  • If n is the left child of p, then fn = fp ∧ qp.
  • If n is the right child of p, then fn = fp ∧ ¬qp (see the sketch below).
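
Continuing the Node sketch above, fn(t) can be evaluated by checking whether tuple t reaches n when streamed down from the root, since every qp or its negation along the path must hold; this is a hedged sketch, not the paper's code:

    def node_predicate(root, n, t):
        """f_n(t): true iff tuple t, streamed down from the root, reaches node n."""
        cur = root
        while cur is not None:
            if cur is n:
                return True           # every q_p / not-q_p on the path held
            if cur.is_leaf():
                return False
            cur = cur.left if t[cur.split_attr] <= cur.split_point else cur.right
        return False

    # f_root is always true; the left child additionally requires age <= 30.
    assert node_predicate(root, root, {"age": 45})
    assert node_predicate(root, root.left, {"age": 25})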

9
  • Since each leaf node n ∈ T is labeled with a class label, it encodes a classification rule fn → c, where c is the label of n.
  • Thus T : dom(X1) × ... × dom(Xm) → dom(C), and is therefore a classifier, called a decision tree classifier.
  • For a node n ∈ T with parent p, Fn is the set of records in D that follow the path from the root to node n when being processed by the tree (see the sketch below).
  • Formally, Fn = {t ∈ D : fn(t) = true}.
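
Fn is then just a filter of D by fn; a one-line sketch reusing node_predicate from above:

    def records_at_node(D, root, n):
        """F_n = { t in D : f_n(t) = true }: the records of D that reach node n."""
        return [t for t in D if node_predicate(root, n, t)]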

10
  • Here, impurity-based split selection methods are considered, which produce binary splits.
  • Impurity-based split selection methods compute the splitting criterion by trying to minimize a concave impurity function imp.
  • At each node, every predictor attribute X is examined, the impurity of the best split on X is calculated, and the final split is chosen such that the value of imp is minimized (a Gini-based sketch follows).
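
The slides do not fix a particular impurity function; the sketch below uses the Gini index as one example of a concave impurity function and assumes tuples are dicts with a "class" key:

    from collections import Counter

    def gini(labels):
        """Gini index, one common concave impurity function."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_numerical_split(tuples, attr, label_key="class"):
        """Best binary split X <= x on a numerical attribute: try every observed
        value as a split point and keep the one minimizing the weighted impurity."""
        best_imp, best_x = float("inf"), None
        n = len(tuples)
        for x in sorted({t[attr] for t in tuples})[:-1]:
            left = [t[label_key] for t in tuples if t[attr] <= x]
            right = [t[label_key] for t in tuples if t[attr] > x]
            imp = len(left) / n * gini(left) + len(right) / n * gini(right)
            if imp < best_imp:
                best_imp, best_x = imp, x
        return best_imp, best_x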

11
  • Let T be the final tree constructed using the split selection method CL on the training database D.
  • Since D does not fit into memory, consider a sample D′ ⊆ D such that D′ fits into memory.
  • Compute a sample tree T′ from D′.
  • Each node n ∈ T′ has a sample splitting criterion consisting of a sample splitting attribute and a sample split point.
  • We can use this knowledge of T′ to guide us in the construction of T, our final goal.

12
  • Consider a node n in the sample tree T′ with numerical sample splitting attribute Xn and sample splitting predicate Xn ≤ x.
  • By T′ being close to T we mean that the final splitting attribute at node n is Xn and that the final split point is inside a confidence interval around x.
  • For categorical attributes, both the splitting attribute and the splitting subset have to match.

13
  • Bootstrapping: the bootstrapping method can be applied to the in-memory sample D′ to obtain a tree T′ that is close to T with high probability (see the sketch below).
  • In addition to T′, we also obtain the confidence intervals that contain the final split points for nodes with numerical splitting attributes.
  • We call the information at node n obtained through bootstrapping the coarse splitting criterion at node n.
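
A hedged sketch of the bootstrapping step for a single node and a single numerical attribute, reusing best_numerical_split from the earlier sketch; the trial count and interval width are illustrative assumptions, not the paper's exact procedure:

    import random

    def bootstrap_split_interval(sample, attr, trials=100, alpha=0.05):
        """Resample D' with replacement, recompute the best split point each time,
        and take an empirical (1 - alpha) interval of those points as the
        confidence interval of the coarse splitting criterion."""
        points = []
        for _ in range(trials):
            resample = [random.choice(sample) for _ in range(len(sample))]
            _, x = best_numerical_split(resample, attr)
            if x is not None:
                points.append(x)
        points.sort()
        lo = points[int(alpha / 2 * len(points))]
        hi = points[int((1 - alpha / 2) * len(points)) - 1]
        return lo, hi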

14
  • After finding the final splitting attribute at each node n, and the confidence interval of attribute values that contains the final split point, we only need to examine the value of the impurity function at the attribute values inside the confidence interval to decide on the final split point.
  • If we had all the tuples that fall inside the confidence interval of n in memory, we could calculate the final split point exactly by evaluating the impurity function at these points only (see the sketch below).
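
A sketch of computing the exact split point once the in-interval tuples are in memory. The slides only mention bucket counts at the root, so treating the per-class counts of the out-of-interval tuples on each side (left_outside, right_outside) as known is an assumption of this sketch:

    from collections import Counter

    def exact_split_point(in_interval, attr, left_outside, right_outside,
                          label_key="class"):
        """Evaluate the impurity function only at attribute values inside the
        confidence interval; tuples outside the interval contribute fixed
        per-class counts gathered during the scan."""
        def gini_counts(c):
            n = sum(c.values())
            return 0.0 if n == 0 else 1.0 - sum((v / n) ** 2 for v in c.values())

        best_imp, best_x = float("inf"), None
        for x in sorted({t[attr] for t in in_interval}):
            left = left_outside + Counter(t[label_key] for t in in_interval if t[attr] <= x)
            right = right_outside + Counter(t[label_key] for t in in_interval if t[attr] > x)
            n = sum(left.values()) + sum(right.values())
            imp = (sum(left.values()) / n * gini_counts(left)
                   + sum(right.values()) / n * gini_counts(right))
            if imp < best_imp:
                best_imp, best_x = imp, x
        return best_x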

15
  • To bring these tuples into memory, we make one scan over D and keep in memory all tuples that fall inside the confidence interval at any node.
  • We then post-process each node with a numerical splitting attribute to find the exact value of the split point, using the tuples collected during the database scan.
  • This phase is called the clean-up phase.
  • The coarse splitting criterion at node n obtained from the sample D′ through bootstrapping is only correct with high probability.

16
  • Whenever the coarse splitting criterion at n is not correct, we detect it during the clean-up phase and can take the necessary corrective action.
  • Hence, the method is guaranteed to find exactly the same tree as if a traditional main-memory algorithm were run on the complete training set.