BOAT - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: BOAT


1
BOAT
  • Bootstrapped Optimistic Algorithm for Tree
    Construction

2
CIS 595, FALL 2000
  • Presentation
  • Prashanth Saka

3
  • BOAT is a new algorithm for decision tree construction that improves on earlier approaches in both functionality and performance, yielding a performance gain of around 300%.
  • The reason: it requires only two scans over the entire training data set.
  • It is the first scalable algorithm able to incrementally update the tree with respect to both insertions and deletions over the dataset.

4
  • Take a sample D′ ⊆ D from the training database and construct a sample tree with coarse splitting criteria at each node using bootstrapping.
  • Make one scan over the database D and process each tuple t by streaming it down the tree (see the sketch below).
  • At the root node n, update the counts of the buckets for each numerical predictor attribute.
  • If t falls in the confidence interval at node n, t is written into a temporary file Sn at n; otherwise it is sent further down the tree.
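
A minimal Python sketch of this single scan, assuming hypothetical node fields (split_attr, split_point, conf_low/conf_high for the confidence interval, and an in-memory buffer S standing in for the temporary file Sn); the bucket-count updates are omitted for brevity:

    def process_tuple(t, n):
        """Stream one tuple t of D down the sample tree during the single scan."""
        while n is not None and n.split_attr is not None:
            v = t[n.split_attr]
            if n.conf_low <= v <= n.conf_high:
                n.S.append(t)   # t may determine the final split point at node n
                return
            # Outside the interval the side of the final split is already known,
            # so t is simply sent further down the tree.
            n = n.left if v <= n.split_point else n.right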

5
  • The tree is then processed top-down.
  • At each node, a lower-bounding technique is used to check whether the global minimum value of the impurity function could be lower than i*, the minimum impurity value found at that node (see the sketch below).
  • If the check succeeds, we are done with the node n; otherwise, we discard n and its subtree during the current construction.
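
The slides do not describe the lower-bounding technique itself; the sketch below only captures the decision it feeds, with outside_lower_bound standing in for a (hypothetical) lower bound on the impurity achievable by any split point outside the confidence interval:

    def split_confirmed(i_star, outside_lower_bound):
        """True if no split outside the confidence interval can beat the best
        impurity i_star found inside it; otherwise node n and its subtree are
        discarded during the current construction."""
        return outside_lower_bound >= i_star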

6
  • Each node of the decision tree has exactly one incoming edge (except the root) and either zero or two outgoing edges.
  • Each leaf is labeled with one class label.
  • Each internal node is labeled with one predictor attribute Xn, called the splitting attribute.
  • Each internal node has a splitting predicate qn associated with it.
  • If Xn is numerical, then qn is of the form Xn ≤ xn, where xn ∈ dom(Xn); xn is the split point at node n (a minimal sketch of this structure follows).
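
A minimal Python sketch of this node structure; the field names are illustrative, not from the presentation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        """Decision tree node: internal nodes carry X_n and the split point x_n
        of the predicate X_n <= x_n; leaves carry a class label."""
        split_attr: Optional[str] = None     # splitting attribute X_n (None for a leaf)
        split_point: Optional[float] = None  # split point x_n
        left: Optional["Node"] = None        # child followed when X_n <= x_n holds
        right: Optional["Node"] = None       # child followed otherwise
        label: Optional[str] = None          # class label (leaves only)

        def is_leaf(self):
            return self.split_attr is None

    # Example: the root splits on a hypothetical numerical attribute "age" at 30.
    root = Node(split_attr="age", split_point=30.0,
                left=Node(label="yes"), right=Node(label="no"))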

7
  • The combined information of splitting attribute and splitting predicate at a node n is called the splitting criterion at n.

8
  • We associate with each node n ∈ T a predicate fn : dom(X1) × ... × dom(Xm) → {true, false}, called its node predicate, as follows:
  • For the root node n, fn = true.
  • Let n be a non-root node with parent p, whose splitting predicate is qp.
  • If n is the left child of p, then fn = fp ∧ qp.
  • If n is the right child of p, then fn = fp ∧ ¬qp (see the sketch below).
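
Continuing the Node sketch above, fn(t) can be evaluated by checking whether tuple t reaches n when streamed down from the root, since every qp or its negation along the path must hold; this is a hedged sketch, not the paper's code:

    def node_predicate(root, n, t):
        """f_n(t): true iff tuple t, streamed down from the root, reaches node n."""
        cur = root
        while cur is not None:
            if cur is n:
                return True           # every q_p / not-q_p on the path held
            if cur.is_leaf():
                return False
            cur = cur.left if t[cur.split_attr] <= cur.split_point else cur.right
        return False

    # f_root is always true; the left child additionally requires age <= 30.
    assert node_predicate(root, root, {"age": 45})
    assert node_predicate(root, root.left, {"age": 25})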

9
  • Since each leaf node n ∈ T is labeled with a class label, it encodes a classification rule fn → c, where c is the label of n.
  • Thus T : dom(X1) × ... × dom(Xm) → dom(C), and is therefore a classifier, called a decision tree classifier.
  • For a node n ∈ T with parent p, Fn is the set of records in D that follow the path from the root to node n when being processed by the tree (see the sketch below).
  • Formally, Fn = {t ∈ D : fn(t) = true}.
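
Fn is then just a filter of D by fn; a one-line sketch reusing node_predicate from above:

    def records_at_node(D, root, n):
        """F_n = { t in D : f_n(t) = true }: the records of D that reach node n."""
        return [t for t in D if node_predicate(root, n, t)]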

10
  • Here, impurity-based split selection methods are considered, which produce binary splits.
  • Impurity-based split selection methods compute the splitting criterion by trying to minimize a concave impurity function imp.
  • At each node, every predictor attribute X is examined, the impurity of the best split on X is calculated, and the final split is chosen such that the value of imp is minimized (a Gini-based sketch follows).
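
The slides do not fix a particular impurity function; the sketch below uses the Gini index as one example of a concave impurity function and assumes tuples are dicts with a "class" key:

    from collections import Counter

    def gini(labels):
        """Gini index, one common concave impurity function."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_numerical_split(tuples, attr, label_key="class"):
        """Best binary split X <= x on a numerical attribute: try every observed
        value as a split point and keep the one minimizing the weighted impurity."""
        best_imp, best_x = float("inf"), None
        n = len(tuples)
        for x in sorted({t[attr] for t in tuples})[:-1]:
            left = [t[label_key] for t in tuples if t[attr] <= x]
            right = [t[label_key] for t in tuples if t[attr] > x]
            imp = len(left) / n * gini(left) + len(right) / n * gini(right)
            if imp < best_imp:
                best_imp, best_x = imp, x
        return best_imp, best_x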

11
  • Let T be the final tree constructed using the split selection method CL on the training database D.
  • Since D does not fit into memory, consider a sample D′ ⊆ D such that D′ fits into memory.
  • Compute a sample tree T′ from D′.
  • Each node n ∈ T′ has a sample splitting criterion consisting of a sample splitting attribute and a sample split point.
  • We can use this knowledge of T′ to guide us in the construction of T, our final goal.

12
  • Consider a node n in the sample tree T′ with numerical sample splitting attribute Xn and sample splitting predicate Xn ≤ x.
  • By T′ being close to T we mean that the final splitting attribute at node n is Xn and that the final split point is inside a confidence interval around x.
  • For categorical attributes, both the splitting attribute and the splitting subset have to match.

13
  • Bootstrapping: the bootstrapping method can be applied to the in-memory sample D′ to obtain a tree T′ that is close to T with high probability (see the sketch below).
  • In addition to T′, we also obtain the confidence intervals that contain the final split points for nodes with numerical splitting attributes.
  • We call the information at node n obtained through bootstrapping the coarse splitting criterion at node n.
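
A hedged sketch of the bootstrapping step for a single node and a single numerical attribute, reusing best_numerical_split from the earlier sketch; the trial count and interval width are illustrative assumptions, not the paper's exact procedure:

    import random

    def bootstrap_split_interval(sample, attr, trials=100, alpha=0.05):
        """Resample D' with replacement, recompute the best split point each time,
        and take an empirical (1 - alpha) interval of those points as the
        confidence interval of the coarse splitting criterion."""
        points = []
        for _ in range(trials):
            resample = [random.choice(sample) for _ in range(len(sample))]
            _, x = best_numerical_split(resample, attr)
            if x is not None:
                points.append(x)
        points.sort()
        lo = points[int(alpha / 2 * len(points))]
        hi = points[int((1 - alpha / 2) * len(points)) - 1]
        return lo, hi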

14
  • After finding the final splitting attribute at each node n, and the confidence interval of attribute values that contains the final split point, we only need to examine the value of the impurity function at the attribute values inside the confidence interval to decide on the final split point.
  • If we had all the tuples that fall inside the confidence interval of n in memory, we could calculate the final split point exactly by evaluating the impurity function at these points only (see the sketch below).
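
A sketch of computing the exact split point once the in-interval tuples are in memory. The slides only mention bucket counts at the root, so treating the per-class counts of the out-of-interval tuples on each side (left_outside, right_outside) as known is an assumption of this sketch:

    from collections import Counter

    def exact_split_point(in_interval, attr, left_outside, right_outside,
                          label_key="class"):
        """Evaluate the impurity function only at attribute values inside the
        confidence interval; tuples outside the interval contribute fixed
        per-class counts gathered during the scan."""
        def gini_counts(c):
            n = sum(c.values())
            return 0.0 if n == 0 else 1.0 - sum((v / n) ** 2 for v in c.values())

        best_imp, best_x = float("inf"), None
        for x in sorted({t[attr] for t in in_interval}):
            left = left_outside + Counter(t[label_key] for t in in_interval if t[attr] <= x)
            right = right_outside + Counter(t[label_key] for t in in_interval if t[attr] > x)
            n = sum(left.values()) + sum(right.values())
            imp = (sum(left.values()) / n * gini_counts(left)
                   + sum(right.values()) / n * gini_counts(right))
            if imp < best_imp:
                best_imp, best_x = imp, x
        return best_x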

15
  • To bring these tuples into memory, we make one scan over D and keep in memory all tuples that fall inside the confidence interval at any node.
  • We then post-process each node with a numerical splitting attribute to find the exact value of the split point, using the tuples collected during the database scan.
  • This phase is called the clean-up phase.
  • The coarse splitting criterion at node n obtained from the sample D′ through bootstrapping is only correct with high probability.

16
  • Whenever the coarse splitting criterion at n is not correct, we detect it during the clean-up phase and can take the necessary corrective action.
  • Hence, the method is guaranteed to find exactly the same tree as if a traditional main-memory algorithm were run on the complete training set.