1
Decision Trees and more!
2
Learning OR with few attributes
  • Target function: an OR of k literals
  • Goal: learn in time
  • polynomial in k and log n
  • ε and δ are constants
  • ELIM makes slow progress
  • might disqualify only one literal per round
  • might remain with O(n) candidate literals

3
ELIM Algorithm for learning OR
  • Keep a list of all candidate literals
  • For every example whose classification is 0
  • Erase all the literals that are 1.
  • Correctness
  • Our hypothesis h: an OR of our set of literals.
  • Our set of literals includes the target OR
    literals.
  • Every time h predicts zero we are correct.
  • Sample size
  • m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
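
A minimal Python sketch of ELIM under one concrete representation (examples as n-bit tuples, literals as (index, polarity) pairs); the names elim and predict are illustrative, not from the deck.

    from itertools import product

    def elim(examples, n):
        """ELIM: learn an OR of literals.
        examples: list of (x, y) with x a tuple of n bits and y in {0, 1}.
        Returns the surviving literals as (index, polarity) pairs, where
        polarity True stands for x_i and False for its negation."""
        # Start with the list of all 2n candidate literals.
        literals = {(i, b) for i in range(n) for b in (True, False)}
        for x, y in examples:
            if y == 0:
                # A literal that is 1 on a negative example cannot be in
                # the target OR, so erase it.
                literals -= {(i, b) for (i, b) in literals if (x[i] == 1) == b}
        return literals

    def predict(literals, x):
        """The hypothesis h: the OR of the surviving literals."""
        return int(any((x[i] == 1) == b for (i, b) in literals))

    # Tiny check: the target is x0 OR not(x2) over n = 3 variables.
    n = 3
    sample = [(x, int(x[0] == 1 or x[2] == 0)) for x in product((0, 1), repeat=n)]
    lits = elim(sample, n)
    assert all(predict(lits, x) == y for x, y in sample)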

4
Set Cover - Definition
  • Input: S1, …, St with each Si ⊆ U
  • Output: Si1, …, Sik such that ∪j Sij = U
  • Question: are there k sets that cover U?
  • NP-complete

5
Set Cover: Greedy Algorithm
  • j = 1, U1 = U, C = ∅
  • While Uj ≠ ∅:
  • Let Si be arg max |Si ∩ Uj|
  • Add Si to C
  • Let Uj+1 = Uj \ Si
  • j = j + 1
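
The greedy rule above as a short Python sketch (the dict-of-sets representation and the function name are just for illustration):

    def greedy_set_cover(sets, universe):
        """Repeatedly take the set covering the most still-uncovered elements.
        sets: dict mapping a name to a set of elements; universe: the set U."""
        uncovered = set(universe)
        cover = []
        while uncovered:
            # arg max over |Si ∩ Uj|
            best = max(sets, key=lambda name: len(sets[name] & uncovered))
            if not sets[best] & uncovered:
                raise ValueError("the input sets do not cover the universe")
            cover.append(best)
            uncovered -= sets[best]
        return cover

    # Usage
    U = {1, 2, 3, 4, 5, 6}
    S = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1, 6}}
    print(greedy_set_cover(S, U))   # e.g. ['A', 'C']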

6
Set Cover: Greedy Analysis
  • At termination, C is a cover.
  • Assume there is a cover C* of size k.
  • C* is a cover for every Uj
  • Some S in C* covers at least |Uj|/k elements of Uj
  • Analysis: |Uj+1| ≤ |Uj| − |Uj|/k
  • Solving the recursion:
  • Number of sets: j ≤ k ln |U1|
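
Spelling out the recursion (a standard calculation, filled in for completeness): each step removes at least a 1/k fraction of the still-uncovered elements, so

    |U_{j+1}| \le |U_j|\Bigl(1 - \tfrac{1}{k}\Bigr)
             \le |U_1|\Bigl(1 - \tfrac{1}{k}\Bigr)^{j}
             \le |U_1|\, e^{-j/k},

which drops below 1 once j ≥ k ln |U1|; hence the greedy algorithm uses at most k ln |U1| sets.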

7
Building an Occam algorithm
  • Given a sample T of size m
  • Run ELIM on T
  • Let LIT be the set of remaining literals
  • Assume there exist k literals in LIT that
    classify the entire sample T correctly
  • Negative examples T−:
  • any subset of LIT classifies T− correctly

8
Building an Occam algorithm
  • Positive examples T+:
  • Search for a small subset of LIT which classifies
    T+ correctly
  • For a literal z build Sz = {x : z satisfies x}
  • Our assumption: there are k sets that cover T+
  • Greedy finds k ln m sets that cover T+
  • Output h: the OR of these k ln m literals
  • size(h) < k ln m · log(2n)
  • Sample size: m = O(k log n · log(k log n))
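
Putting slides 7 and 8 together as code: a sketch that reuses the elim and greedy_set_cover sketches from the earlier slides, with the same (index, polarity) encoding of literals (all names are illustrative).

    def occam_or(sample, n):
        """Occam algorithm for an OR of k literals: run ELIM, then greedily
        cover the positive examples with the sets Sz = {x : z satisfies x}."""
        lit = elim(sample, n)                          # surviving literals
        positives = [x for x, y in sample if y == 1]
        universe = set(range(len(positives)))          # index the positives
        sets = {(i, b): {j for j, x in enumerate(positives) if (x[i] == 1) == b}
                for (i, b) in lit}
        chosen = greedy_set_cover(sets, universe)      # at most k ln m literals
        return chosen                                  # hypothesis: the OR of these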

9
k-DNF
  • Definition
  • A disjunction of terms, each of at most k literals
  • Example term: T = x3 ∧ x1 ∧ x5
  • Example DNF: T1 ∨ T2 ∨ T3 ∨ T4

10
Learning k-DNF
  • Extended input:
  • For each AND of k literals define a new input T
  • Example: T = x3 ∧ x1 ∧ x5
  • Number of new inputs: at most (2n)^k
  • Can compute the new inputs easily in time k(2n)^k
  • The k-DNF is an OR over the new inputs.
  • Run the ELIM algorithm over the new inputs.
  • Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
  • Running time: similar.
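
A sketch of the feature expansion in Python (Boolean examples as bit-tuples; the function name expand_kdnf is illustrative):

    from itertools import combinations

    def expand_kdnf(x, k):
        """Map an n-bit example x to its extended input: one new coordinate
        per AND of at most k literals, set to 1 iff x satisfies that term."""
        n = len(x)
        # (i, True) stands for x_i, (i, False) for its negation
        literals = [(i, b) for i in range(n) for b in (True, False)]
        features = []
        for size in range(1, k + 1):
            for term in combinations(literals, size):
                # skip terms that use the same variable twice
                if len({i for i, _ in term}) < size:
                    continue
                features.append(int(all((x[i] == 1) == b for i, b in term)))
        return tuple(features)

    # With n = 3 and k = 2 the extended input has at most (2n)^k coordinates.
    print(len(expand_kdnf((1, 0, 1), 2)))

Running the ELIM sketch from the earlier slide on the expanded examples then learns the k-DNF as an OR over the new coordinates.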

11
Learning Decision Lists
  • Definition

[Figure: a decision list. The nodes test the literals x4, x7, x1 in order; when a node's test is satisfied the list outputs that node's value (1 or −1), otherwise evaluation passes to the next node, ending in a default value.]
12
Learning Decision Lists
  • Similar to ELIM.
  • Input: a sample S of size m.
  • While S is not empty:
  • For each literal z build Tz = {x ∈ S : z satisfies x}
  • Find a Tz whose examples all have the same classification c
  • Add the node (z, c) to the decision list
  • Update S = S − Tz
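
A Python sketch of this greedy decision-list learner, with the same (index, polarity) encoding of literals as in the earlier sketches (names are illustrative):

    def learn_dl(sample, n):
        """Repeatedly find a literal z whose covered examples Tz all share one
        label, emit the node (z, label), and remove them from S."""
        S = list(sample)                 # (x, y) pairs, x an n-bit tuple
        literals = [(i, b) for i in range(n) for b in (True, False)]
        rules = []
        while S:
            for (i, b) in literals:
                T = [(x, y) for x, y in S if (x[i] == 1) == b]
                if T and len({y for _, y in T}) == 1:
                    rules.append(((i, b), T[0][1]))
                    S = [(x, y) for x, y in S if (x[i] == 1) != b]
                    break
            else:
                raise ValueError("sample is not consistent with any 1-decision list")
        return rules

    def dl_predict(rules, x, default=0):
        """Evaluate the decision list: the first satisfied literal decides."""
        for (i, b), label in rules:
            if (x[i] == 1) == b:
                return label
        return default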

13
DL algorithm correctness
  • The output decision list is consistent.
  • Number of decision lists:
  • Length ≤ n + 1
  • Node: one of 2n literals
  • Leaf: one of 2 values
  • Total bound: (2 · 2n)^(n+1)
  • Sample size:
  • m = O(n log n/ε + (1/ε) ln(1/δ))
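
For completeness, the sample bound follows from the standard consistency (Occam) argument combined with the count above:

    m \;\ge\; \frac{1}{\epsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr),
    \qquad
    \ln|H| \;\le\; (n+1)\ln(2 \cdot 2n) \;=\; O(n \log n),

which gives m = O(n log n/ε + (1/ε) ln(1/δ)).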

14
k-DL
  • Each node tests a conjunction of at most k literals
  • Includes k-DNF (and k-CNF)

15
Learning k-DL
  • Extended input:
  • For each AND of k literals define a new input
  • Example: T = x3 ∧ x1 ∧ x5
  • Number of new inputs: at most (2n)^k
  • Can compute the new inputs easily in time k(2n)^k
  • The k-DL is a DL over the new inputs.
  • Run the DL algorithm over the new inputs.
  • Sample size: as for DL, with n replaced by the (2n)^k new inputs
  • Running time: polynomial in (2n)^k and the sample size

16
Open Problems
  • Attribute-efficient learning:
  • Decision lists: very limited results
  • Parity functions: negative?
  • k-DNF and k-DL

17
Decision Trees
[Figure: a small decision tree. The root tests x1, with edges labeled 1 and 0; one child tests x6, again with edges labeled 1 and 0, leading to leaves.]
18
Learning Decision Trees Using DL
  • Consider a decision tree T of size r.
  • Theorem:
  • There exists a log(r+1)-DL L that computes T.
  • Claim: there exists a leaf in T of depth at most log(r+1).
  • Learn a decision tree using a decision list:
  • Running time: n^(log s)
  • n: number of attributes
  • s: tree size.

19
Decision Trees
[Figure: a decision tree with threshold predicates; one node tests x1 > 5, another tests x6 > 2.]
20
Decision Trees: Basic Setup
  • Basic class of hypotheses H.
  • Input: a sample of examples
  • Output: a decision tree
  • Each internal node: a predicate from H
  • Each leaf: a classification value
  • Goal (Occam's Razor):
  • a small decision tree
  • that classifies all (most) examples correctly.

21
Decision Trees: Why?
  • Efficient algorithms:
  • Construction
  • Classification
  • Performance: comparable to other methods
  • Software packages
  • CART
  • C4.5 and C5

22
Decision Trees: This Lecture
  • Algorithms for constructing DTs
  • A theoretical justification
  • using boosting
  • Future lecture:
  • DT pruning.

23
Decision Trees: Algorithm Outline
  • A natural recursive procedure.
  • Decide on a predicate h at the root.
  • Split the data using h.
  • Build the right subtree (for h(x)=1).
  • Build the left subtree (for h(x)=0).
  • Running time:
  • T(s) = O(s) + T(s+) + T(s−) = O(s log s)
  • s: tree size

24
DT: Selecting a Predicate
  • Basic setting (see figure): a node v is split by a
    predicate h into children v1 (h=0) and v2 (h=1)
  • Notation: Pr[f=1] = q, Pr[h=0] = u, Pr[h=1] = 1−u,
    Pr[f=1 | h=0] = p, Pr[f=1 | h=1] = r
  • Clearly q = u·p + (1−u)·r
25
Potential function setting
  • Compare predicates using a potential function.
  • Inputs: q, u, p, r
  • Output: a value
  • Node-dependent:
  • for each node and predicate assign a value.
  • Given a split: u·val(v1) + (1−u)·val(v2)
  • For a tree: the weighted sum over the leaves.

26
PF: Classification Error
  • Let val(v) = min{q, 1−q}
  • This is the classification error.
  • The average potential only drops.
  • Termination:
  • when the average is zero
  • perfect classification

27
PF: Classification Error
  • Is this a good split?
  • Initial error: 0.2
  • After the split: 0.4 · (1/2) + 0 · (1/2) = 0.2, so there is
    no drop even though the split is informative

28
Potential Function requirements
  • When zero: perfect classification.
  • Strictly concave.

29
Potential Function requirements
  • Every change is an improvement
[Figure: since q = u·p + (1−u)·r and val is strictly concave, the split potential u·val(p) + (1−u)·val(r) lies strictly below val(q) whenever p ≠ r, so every non-trivial split lowers the potential.]
30
Potential Function Candidates
  • Potential functions:
  • val(q) = Gini(q) = 2q(1−q)  (CART)
  • val(q) = entropy(q) = −q log q − (1−q) log(1−q)
    (C4.5)
  • val(q) = 2·sqrt(q(1−q))
  • Assumptions:
  • Symmetric: val(q) = val(1−q)
  • Concave
  • val(0) = val(1) = 0 and val(1/2) = 1
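
The three candidates as small Python functions, for quick reference (the normalizations follow the formulas above; base-2 logs are assumed for the entropy, and the names are illustrative):

    import math

    def gini(q):
        """CART: val(q) = 2q(1-q)."""
        return 2.0 * q * (1.0 - q)

    def entropy(q):
        """C4.5: val(q) = -q log q - (1-q) log(1-q), with base-2 logs."""
        if q in (0.0, 1.0):
            return 0.0
        return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

    def sqrt_criterion(q):
        """val(q) = 2*sqrt(q(1-q))."""
        return 2.0 * math.sqrt(q * (1.0 - q))

    # All three are symmetric and concave, and vanish at q = 0 and q = 1.
    for val in (gini, entropy, sqrt_criterion):
        assert val(0.0) == val(1.0) == 0.0
        assert abs(val(0.3) - val(0.7)) < 1e-12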

31
DT Construction Algorithm
  • Procedure DT(S), where S is the sample:
  • If all the examples in S have the same classification b:
  • create a leaf of value b and return
  • For each h compute val(h, S):
  • val(h, S) = uh·val(ph) + (1−uh)·val(rh)
  • Let h* = arg minh val(h, S)
  • Split S using h* into S0 and S1
  • Recursively invoke DT(S0) and DT(S1)
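
A runnable sketch of the DT(S) procedure under simple assumptions: binary labels in {0, 1}, predicates given as Boolean functions, and the Gini potential as the default val (all names are illustrative).

    def gini(q):
        """val(q) = 2q(1-q)."""
        return 2.0 * q * (1.0 - q)

    def build_dt(S, predicates, val=gini):
        """DT(S): pick the predicate minimizing u*val(p) + (1-u)*val(r),
        split S into S0/S1 and recurse.  S is a non-empty list of (x, y)."""
        labels = {y for _, y in S}
        if len(labels) == 1:                       # all examples agree: a leaf
            return labels.pop()

        def split_value(h):
            S0 = [(x, y) for x, y in S if not h(x)]
            S1 = [(x, y) for x, y in S if h(x)]
            if not S0 or not S1:                   # h does not split S
                return float("inf"), S0, S1
            u = len(S0) / len(S)                   # Pr[h=0]
            p = sum(y for _, y in S0) / len(S0)    # Pr[f=1 | h=0]
            r = sum(y for _, y in S1) / len(S1)    # Pr[f=1 | h=1]
            return u * val(p) + (1 - u) * val(r), S0, S1

        value, S0, S1, h = min((split_value(h) + (h,) for h in predicates),
                               key=lambda t: t[0])
        if value == float("inf"):                  # no predicate separates S
            return max(labels, key=lambda b: sum(1 for _, y in S if y == b))
        return {"predicate": h,
                "left": build_dt(S0, predicates, val),    # h(x) = 0 side
                "right": build_dt(S1, predicates, val)}   # h(x) = 1 side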

32
DT Analysis
  • Potential function:
  • val(T) = Σ over leaves v of T of Pr[v] · val(qv)
  • For simplicity, use the true probabilities
  • Bounding the classification error:
  • error(T) ≤ val(T)
  • Study how fast val(T) drops
  • Given a tree T, define T(l,h), the tree obtained from
    T by splitting leaf l with predicate h, where
  • h: a predicate
  • l: a leaf.

[Figure: the tree T with leaf l replaced by an internal node testing h.]
33
Top-Down algorithm
  • Input: s (size), H (predicates), val()
  • T0 = a single-leaf tree
  • For t from 1 to s do:
  • Let (l,h) = arg max(l,h) [ val(Tt) − val(Tt(l,h)) ]
  • Tt+1 = Tt(l,h)
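
A compact sketch of the top-down procedure. To stay short it tracks only the partition of the sample into leaves (empirical probabilities stand in for the true ones) rather than the tree structure itself; names are illustrative.

    def topdown(examples, predicates, val, s):
        """At each of s steps, split the (leaf, predicate) pair that gives
        the largest drop in the weighted potential val(T)."""
        leaves = [list(examples)]        # each leaf = the examples reaching it
        m = len(examples)

        def leaf_potential(S):
            if not S:
                return 0.0
            q = sum(y for _, y in S) / len(S)
            return (len(S) / m) * val(q)           # Pr[leaf] * val(q_leaf)

        for _ in range(s):
            best = None                            # (drop, leaf index, predicate)
            for i, S in enumerate(leaves):
                for h in predicates:
                    S0 = [(x, y) for x, y in S if not h(x)]
                    S1 = [(x, y) for x, y in S if h(x)]
                    drop = leaf_potential(S) - leaf_potential(S0) - leaf_potential(S1)
                    if best is None or drop > best[0]:
                        best = (drop, i, h)
            if best is None or best[0] <= 0:       # no split helps any more
                break
            _, i, h = best
            S = leaves.pop(i)
            leaves.append([(x, y) for x, y in S if not h(x)])
            leaves.append([(x, y) for x, y in S if h(x)])
        return leaves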

34
Theoretical Analysis
  • Assume H satisfies the weak learning hypothesis:
  • for each distribution D there is an h s.t. error(h) < 1/2 − γ
  • Show that in every step
  • there is a significant drop in val(T)
  • Results are weaker than AdaBoost's
  • but the algorithm was never intended for that!
  • Use weak learning to
  • show a large drop in val(T) at each step
  • Modify the initial distribution to be unbiased.

35
Theoretical Analysis
  • Let val(q) = 2q(1−q)
  • Local drop at a node: at least 16γ²[q(1−q)]²
  • Claim: at every step t there is a leaf l s.t.
  • Pr[l] ≥ εt/(2t)
  • error(l) = min{ql, 1−ql} ≥ εt/2
  • where εt is the error at stage t
  • Proof!

36
Theoretical Analysis
  • Drop at time t: at least
  • Pr[l] · γ² · [ql(1−ql)]² ≥ γ² εt³ / t
  • For the Gini index:
  • val(q) = 2q(1−q)
  • min{q, 1−q} ≥ q(1−q) = val(q)/2
  • Drop at least Ω(γ² [val(Tt)]³ / t)

37
Theoretical Analysis
  • Need to solve for when val(Tk) < ε
  • Bound k.
  • Time: exp(O(1/γ² · 1/ε²))

38
Something to think about
  • AdaBoost: very good bounds
  • DT with the Gini index: exponential
  • Comparable results in practice
  • How can that be?