Title: Decision Trees
Contingency Tables
- A better name for a histogram: a one-dimensional contingency table
- Recipe for making a k-dimensional contingency table:
  - Pick k attributes from your dataset. Call them a1, a2, ..., ak.
  - For every possible combination of values a1 = x1, a2 = x2, ..., ak = xk, record how frequently that combination occurs (a sketch follows below)
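Following this recipe, a k-dimensional contingency table is just a frequency count over value combinations. A minimal Python sketch; the list-of-dicts record format and the sample values are assumptions for illustration:

```python
from collections import Counter

def contingency_table(records, attributes):
    """Count how often each combination of values of the chosen
    attributes occurs across the records."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

# Hypothetical records, keyed by attribute name
records = [
    {"age": "30s", "wealth": "rich"},
    {"age": "30s", "wealth": "poor"},
    {"age": "30s", "wealth": "rich"},
]
print(contingency_table(records, ["age", "wealth"]))
# Counter({('30s', 'rich'): 2, ('30s', 'poor'): 1})
```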
A 2-d Contingency Table
For each pair of values of the attributes (age, wealth), we can see how many records match.
A 3-d Contingency Table
Contingency Tables
- With 16 attributes, how many 1-d contingency tables are there?
- How many 2-d contingency tables?
- How many 3-d tables?
- With 100 attributes, how many 3-d tables are there?
Contingency Tables
- With 16 attributes, how many 1-d contingency tables are there? 16
- How many 2-d contingency tables? 16 choose 2 = 16 × 15 / 2 = 120
- How many 3-d tables? 16 choose 3 = 560
- With 100 attributes, how many 3-d tables are there? 100 choose 3 = 161,700
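These counts are just binomial coefficients: we are choosing which k of the attributes to tabulate. Python's math.comb reproduces them:

```python
from math import comb

print(comb(16, 1))   # 16      1-d tables from 16 attributes
print(comb(16, 2))   # 120     2-d tables
print(comb(16, 3))   # 560     3-d tables
print(comb(100, 3))  # 161700  3-d tables from 100 attributes
```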
Manually Looking at Contingency Tables
- Looking at one contingency table can be as much fun as reading an interesting book
- Looking at ten tables: as much fun as watching CNN
- Looking at 100 tables: as much fun as watching an infomercial
- Looking at 100,000 tables: as much fun as a three-week November vacation in Duluth with a dying weasel
Searching for High Info Gains
Given something you are trying to predict (e.g. wealth), it is easy to ask the computer to find the attribute with the highest information gain for it.
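One way that search can be sketched in Python. The entropy and gain formulas are the standard ones; the record format (a list of dicts keyed by attribute name) and the "wealth" target are assumptions for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus the expected entropy after
    splitting the records on the given attribute."""
    base = entropy([r[target] for r in records])
    n = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

# Ask the computer which attribute best predicts the target:
# best = max(attributes, key=lambda a: information_gain(records, a, "wealth"))
```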
Decision Trees
- A decision tree is a graph of decisions and their possible consequences (including resource costs and risks), used to create a plan to reach a goal.
- Decision trees are constructed to help with making decisions. A decision tree is a special form of tree structure.
Sample Tree
- Interior nodes represent attributes
- Arcs between nodes represent possible values of the attributes
- Leaf nodes represent the value of the outcome variable, given the values of the attributes on the path to the leaf node (a minimal data structure is sketched below)
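This picture maps directly onto a small data type. A minimal sketch; the names are illustrative, not from the slides:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    attribute: Optional[str] = None               # interior node: attribute tested here
    children: dict = field(default_factory=dict)  # arcs: attribute value -> child subtree
    outcome: Optional[str] = None                 # leaf node: value of the outcome variable

def classify(node, record):
    """Follow the arcs matching the record's attribute values down to a leaf."""
    while node.outcome is None:
        node = node.children[record[node.attribute]]
    return node.outcome
```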
Types of Decision Trees
- Classification tree
  - Outcome variable is a categorical variable
- Regression tree
  - Outcome variable is a continuous variable
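In scikit-learn, for instance, the two types correspond to two estimators; this is only a sketch, with the training data omitted:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier()  # classification tree: categorical outcome (e.g. Play / Don't Play)
reg = DecisionTreeRegressor()   # regression tree: continuous outcome (e.g. MPG as a number)
```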
Play Golf Dataset
Decision Tree of Golf Data
[Figure: decision tree for the golf data; the root splits on Outlook, with branches such as Sunny]
Conclusion
- The best way to explain the attribute Play is with the attribute Outlook
- First conclusion: people always play when it's overcast
- On days it rains, the attribute Windy explains whether people play or not
- On days when it's sunny, the attribute Humidity explains when people play
Decision Tree as Rules
If Outlook = Overcast Then Play
If Outlook = Sunny and Humidity < 70 Then Play
ElseIf Outlook = Sunny and Humidity > 70 Then Don't Play
If Outlook = Rain and Windy = True Then Don't Play
ElseIf Outlook = Rain and Windy = False Then Play
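The same rules translate line for line into code. A sketch; the slide does not say which branch Humidity = 70 falls into, so that boundary is an assumption here:

```python
def play_golf(outlook, humidity, windy):
    if outlook == "Overcast":
        return "Play"
    if outlook == "Sunny":
        # Humidity exactly 70 falls in the Don't Play branch here;
        # the slide leaves that boundary unspecified.
        return "Play" if humidity < 70 else "Don't Play"
    if outlook == "Rain":
        return "Don't Play" if windy else "Play"

print(play_golf("Sunny", 65, False))  # Play
```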
Learning Decision Trees
- To decide which attribute should be tested first, simply find the one with the highest information gain
- Then recurse (a full sketch follows below)
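Putting the two steps together gives an ID3-style learner. This sketch reuses the TreeNode and information_gain helpers from the earlier snippets and the same list-of-dicts record format:

```python
from collections import Counter

def build_tree(records, attributes, target):
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if the node is pure or no attributes remain to test.
    if len(set(labels)) == 1 or not attributes:
        return TreeNode(outcome=majority)
    # Test the highest-information-gain attribute first...
    best = max(attributes, key=lambda a: information_gain(records, a, target))
    node = TreeNode(attribute=best)
    remaining = [a for a in attributes if a != best]
    # ...then recurse on each non-empty subset of matching records.
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        node.children[value] = build_tree(subset, remaining, target)
    return node
```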
Example Data
Decision Tree
[Figure: root of the tree, splitting on Cylinders with branches Cylinders = 3, 4, 5, 6, 8; MPG is the outcome variable]
Recursion Step
Decision Tree
Second Level of Tree
Final Tree
Final Tree
Don't split a node if all matching instances have the same outcome
Final Tree
Don't split a node if none of the attributes can create multiple non-empty children
Final Tree
No attribute can distinguish the remaining records, because none provides any information gain
Confidence and Support
- Confidence refers to the relative frequency with which an event occurs
  - If golfers play 8 out of the 10 days it's overcast, then we have 8/10 confidence that golfers will play on overcast days
- Support refers to the number of times an event occurs out of all instances
  - If it's only overcast 1 day in 100, then there is only 1/100 support for the rule given above (both are computed in the sketch below)
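Both numbers fall out of simple counting. A sketch using the slide's definitions; the predicate-based interface is an illustrative choice, not from the slides:

```python
def confidence_and_support(records, condition, outcome):
    """Confidence: among records matching the condition, the fraction
    with the outcome. Support (as defined on this slide): the fraction
    of all records matching the condition."""
    matching = [r for r in records if condition(r)]
    hits = sum(1 for r in matching if outcome(r))
    confidence = hits / len(matching) if matching else 0.0
    support = len(matching) / len(records)
    return confidence, support

# e.g. confidence_and_support(days,
#          lambda d: d["Outlook"] == "Overcast",
#          lambda d: d["Play"] == "Yes")
```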
Pruning the Tree
- We can set limits on how deep we want to build the tree (see the sketch below)
- If there is insufficient support for a new branch, we stop growing it
  - Not enough instances to make it worthwhile
  - We have to set a cutoff value for the algorithm
- We want to avoid data which is actually irrelevant but appears to be relevant in the data used to build the tree (i.e. overfitting)
- There are statistical techniques to identify such noisy data
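Libraries expose these limits as pre-pruning parameters. For example, scikit-learn's decision tree takes a depth limit and minimum-instance cutoffs; the threshold values below are arbitrary illustrations:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=3,           # limit on how deep the tree is built
    min_samples_split=10,  # cutoff: don't split a node with fewer instances
    min_samples_leaf=5,    # every branch must keep this much support
)
```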