A systematic overview of Data Mining Algorithms - PowerPoint PPT Presentation

About This Presentation

Title:

A systematic overview of Data Mining Algorithms

Slides: 25

Provided by: sve2

Transcript and Presenter's Notes

Title: A systematic overview of Data Mining Algorithms

1
A systematic overview of Data Mining Algorithms

Data Mining, David Hand Ch 5

Input data
Output models or patterns

3
Data Mining algorithm specification

Task (visualization, classification, clustering, regression)
Structure (functional form) of the model that we are fitting to the data (linear regression, hierarchical clustering)
Score function
Goodness of fit
Generalization
Search/optimization method (hill climbing, simulated annealing, convergence specification)

4
CART classification

Mapping an input vector x to a categorical class y
Task: prediction (classification)
Model structure: tree
Score function: cross-validated loss function
Search method: greedy local search

5
CART Tree

Internal node: binary test
Leaf node: class
(Tree picture, p. 146)
Data vector x descends a unique path from root to leaf.
Structure of the tree is derived from the data
Choose the best variable to split the data
Score function is misclassification loss
$\sum_{i} C(\hat{y}(i), y(i))$, where $\hat{y}(i)$ is the predicted class and y(i) the true class of the i-th data point
C is an m x m matrix (m is the number of classes)
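
The misclassification loss on this slide can be illustrated with a short sketch. It assumes classes are coded as integers 0..m-1 and that the cost matrix is indexed as cost[predicted][true]; the function name and the example data are hypothetical, not taken from the presentation.

```python
def misclassification_loss(predicted, actual, cost):
    """Sum of C(predicted class, true class) over all data points."""
    return sum(cost[p][t] for p, t in zip(predicted, actual))

# Hypothetical 3-class problem with 0/1 loss: every error costs 1.
m = 3
zero_one = [[0 if p == t else 1 for t in range(m)] for p in range(m)]

predicted = [0, 1, 2, 2, 1]
actual = [0, 1, 1, 2, 0]
loss = misclassification_loss(predicted, actual, zero_one)
```

With a 0/1 cost matrix the loss is simply the number of misclassified points; unequal entries in C let some kinds of error count more than others.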

6
Tree classification vs. Linear classification

Tree can deal with mixed data types (combination
of categorical and real-valued data)
Easier to deal with a large number of variables
(because variables are processed one at a time)

7
CART tree classification

Goal of CART is to find a tree as close as possible to the optimal tree
Complex enough to capture the structure in the data
Not so complex that it overfits
Uses cross-validation to choose the tree complexity

8
CART

Search space is all possible trees (combinatorially large).
Approximation algorithm: uses greedy local search to identify good candidate tree structures
Two phases:
Recursively expands the tree from the root
Prunes branches
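
A minimal sketch of the first (growing) phase only. It assumes numeric inputs, binary tests of the form x[j] <= t, and the misclassification count as the split criterion; CART implementations often grow the tree with an impurity measure such as the Gini index instead, and the pruning phase is omitted here. All function names and the max_depth parameter are illustrative.

```python
from collections import Counter

def majority(labels):
    """Most frequent class among the labels at a node."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Misclassifications if the node predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(X, y):
    """Greedy step: try every variable and threshold, keep the lowest-error split."""
    best = None  # (error, variable index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            err = errors(left) + errors(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    """Recursively expand the tree from the root; stop at pure nodes or max_depth."""
    split = best_split(X, y) if depth < max_depth and errors(y) > 0 else None
    if split is None:
        return {"leaf": majority(y)}
    _, j, t = split
    left = [(r, yi) for r, yi in zip(X, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(X, y) if r[j] > t]
    return {
        "var": j,
        "thr": t,
        "left": grow([r for r, _ in left], [yi for _, yi in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [yi for _, yi in right], depth + 1, max_depth),
    }

def predict(tree, x):
    """A data vector descends a unique path from root to leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

# Hypothetical toy data: two numeric variables, two classes.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
```

Each call to best_split looks only one step ahead, which is what makes the search greedy: it never revisits an earlier split, so the resulting tree is a good candidate rather than a guaranteed optimum.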

9
Reductionist view on data mining algorithms

Algorithms are tuples of
model structure, score function, search method,
data management
When deciding on an algorithm for an application,
think of which components are suitable

10
Multilayer Perceptrons (MLP) for Regression and
Classification

Non-linear mapping from a real-valued input
vector x to a real-valued output y
Can be used as a nonlinear model for
Regression
Classification

11
MLP: basic idea

12
MLP algorithm tuple

Task: prediction
Structure: layers of nonlinear transformations of weighted sums of the inputs
Score function: sum of squared errors
Search method: steepest descent from randomly chosen initial parameter values

13
MLP

No simple summary (as with a tree)
Nonlinear nature of its model structure
y is a nonlinear function of the inputs
Parameters $\theta$ (the weights) appear nonlinearly in the score function $S(\theta)$

14
Score function

Commonly used score function:
$S = \sum_{i} \left( y(i) - \hat{y}(i) \right)^2$
y(i) is the true target value and $\hat{y}(i)$ is the output of the network for the i-th data point
$\hat{y}(i)$ is a function of the input vector x(i) and the weights $\theta$
Training the network: minimizing S
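
To make the score function concrete, here is a small sketch of a network output and its sum of squared errors. It assumes a single hidden layer with tanh units and a linear output unit; these choices, the function names, and the example numbers are assumptions for illustration, not details given in the presentation.

```python
import math

def mlp_output(x, W_hidden, w_out):
    """One hidden layer: tanh of weighted sums of the inputs, then a weighted sum."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def sse(X, y, W_hidden, w_out):
    """Score S: sum over data points of (target - network output) squared."""
    return sum((yi - mlp_output(xi, W_hidden, w_out)) ** 2 for xi, yi in zip(X, y))

# Hypothetical data: two inputs per point, two hidden units.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
W_hidden = [[0.1, -0.2], [0.3, 0.05]]
w_out = [0.5, -0.5]
S = sse(X, y, W_hidden, w_out)
```

Because the weights sit inside the tanh, S is not a quadratic function of them, which is why no closed-form minimizer exists and an iterative search is needed.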

15
Training methods

Back-propagation
Steepest descent on the score function descends
to a local minimum from a randomly chosen starting
point in the parameter space.
Non-linear optimization techniques
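
The sketch below shows steepest descent with gradients obtained by back-propagation (the chain rule applied layer by layer). It assumes one tanh hidden layer, a linear output, a fixed step size, and a constant first input acting as a bias; the step size, number of steps, initial weight range, and all names are illustrative assumptions.

```python
import math
import random

def forward(x, W, v):
    """Return hidden activations and the network output for one input vector."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return h, sum(vk * hk for vk, hk in zip(v, h))

def score(X, y, W, v):
    """Sum of squared errors over the data set."""
    return sum((yi - forward(xi, W, v)[1]) ** 2 for xi, yi in zip(X, y))

def gradients(X, y, W, v):
    """Back-propagation: derivatives of the score with respect to every weight."""
    gW = [[0.0] * len(W[0]) for _ in W]
    gv = [0.0] * len(v)
    for xi, yi in zip(X, y):
        h, out = forward(xi, W, v)
        d_out = -2.0 * (yi - out)                         # dS/d(output)
        for k in range(len(v)):
            gv[k] += d_out * h[k]                         # output-layer weight
            d_hidden = d_out * v[k] * (1.0 - h[k] ** 2)   # through tanh
            for j in range(len(xi)):
                gW[k][j] += d_hidden * xi[j]              # hidden-layer weight
    return gW, gv

def train(X, y, n_hidden=3, rate=0.01, steps=500, seed=0):
    """Steepest descent from randomly chosen initial weights."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in X[0]] for _ in range(n_hidden)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    for _ in range(steps):
        gW, gv = gradients(X, y, W, v)
        W = [[w - rate * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
        v = [vk - rate * g for vk, g in zip(v, gv)]
    return W, v

# Hypothetical data: first input is a constant 1.0 (bias), target is x squared.
X = [[1.0, -1.0], [1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
y = [1.0, 0.25, 0.0, 0.25, 1.0]
W, v = train(X, y)
```

Different seeds start the descent at different points and can end in different local minima, so in practice the search is often repeated from several random starts.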

16
Tree vs. MLP

Tree algorithms search through models of different
complexities in a relatively automated manner.
No accepted procedure for choosing the network structure
of an MLP (number of hidden nodes, layers)
In practice, chosen by trial and error

17
Apriori Algorithm for Association Rule Learning

Association Rule
IF A = 1 AND B = 1 THEN C = 1 with probability p
A, B, C are binary variables.
Accuracy $p_a$ is the conditional probability $p(C = 1 \mid A = 1, B = 1)$
Support $p_s = p(A = 1, B = 1, C = 1)$
Specify thresholds on $p_a$ and $p_s$
Goal is to find all rules that satisfy the threshold values
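
As a worked illustration, the support and accuracy of a single rule can be estimated from binary data by counting. The sketch assumes each record is a dict mapping variable names to 0 or 1; the function name and the five example records are hypothetical.

```python
def rule_stats(rows, antecedent, consequent):
    """Estimate support and accuracy of: IF all antecedent vars = 1 THEN consequent = 1."""
    n = len(rows)
    n_ante = sum(1 for r in rows if all(r[a] == 1 for a in antecedent))
    n_both = sum(
        1 for r in rows
        if all(r[a] == 1 for a in antecedent) and r[consequent] == 1
    )
    support = n_both / n                              # p(A=1, B=1, C=1)
    accuracy = n_both / n_ante if n_ante else 0.0     # p(C=1 | A=1, B=1)
    return support, accuracy

# Hypothetical records over three binary variables.
rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 0},
]
support, accuracy = rule_stats(rows, ["A", "B"], "C")
```

Here the rule holds in 2 of 5 records (support) and in 2 of the 3 records where A = 1 and B = 1 (accuracy); a rule is reported only if both values meet their thresholds.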

Summaries of co-occurrence patterns in the
observed data
Rules describe correlation, not causation.

19
Association Rules

Task: description (associations between variables)
Structure: probabilistic association rules
Score function: thresholds on accuracy and support
Search method: systematic (breadth-first search) with pruning
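
The breadth-first search with pruning can be sketched as a level-wise search for frequent itemsets, from which rules are then derived. This is a simplified illustration, not an optimized implementation: it assumes transactions are given as sets of item names, and the function names and example data are hypothetical.

```python
from itertools import combinations

def count_support(candidates, transactions):
    """One linear scan of the data counts every candidate of the current level."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

def frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) search: size-1 itemsets, then size 2, and so on."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        counts = count_support(candidates, transactions)
        level = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update({c: counts[c] / n for c in level})
        k += 1
        # Join frequent (k-1)-itemsets into size-k candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

Pruning relies on the fact that an itemset cannot be more frequent than any of its subsets, so whole branches of the search are discarded without touching the data, and each level costs a single scan.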

20
Example

21
Association Rules Summary

Systematic search explicitly tries to minimize the
number of linear scans through the database
Designed to operate on very large data sets.
Focus on computational efficiency

22
Vector Space Algorithms for Text Retrieval

Given Query object Q and a large database
Find k objects that are most similar to Q
How is similarity defined?
Reduce documents to a vector representation
Angle measures similarity

Task: retrieval of the k most similar documents in a database relative to a given query
Representation: vector of term occurrences
Score function: angle between two vectors
Search method
Model representation is the key idea (which terms
to use in a vector)
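
A minimal sketch of retrieval by angle. It assumes whitespace tokenization and raw term counts as the vector representation, and it ranks by the cosine of the angle (a larger cosine means a smaller angle); the function names and example documents are hypothetical, and the search here is a plain scan over all documents.

```python
import math
from collections import Counter

def term_vector(text):
    """Vector of term occurrences, stored sparsely as term -> count."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query, documents, k):
    """Return the k documents whose vectors make the smallest angle with the query."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

# Hypothetical document collection.
documents = [
    "data mining algorithms",
    "decision tree classification",
    "mining association rules in data",
]
results = top_k("data mining", documents, k=2)
```

Using the angle rather than a raw dot product means a long document is not favored merely for containing more words.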

Statistical approach: emphasizes theoretical
aspects of inference procedures (parameter
estimation, model selection)
Computer science approach: focuses on more
efficient search and data management.
