Title: A systematic overview of Data Mining Algorithms
1 A systematic overview of Data Mining Algorithms
- Data Mining, David Hand, Ch. 5
2
- Input: data
- Output: models or patterns
3 Data Mining algorithm specification
- Task: visualization, classification, clustering, regression
- Structure: functional form of the model that we are fitting to the data (e.g. linear regression, hierarchical clustering)
- Score function
- Goodness of fit
- Generalization
- Search/optimization method: hill climbing, simulated annealing; convergence specification
4 CART classification
- Mapping an input vector x to a categorical class y
- Task: prediction (classification)
- Model structure: tree
- Score function: cross-validated loss function
- Search method: greedy local search
5 CART Tree
- Internal node: binary test
- Leaf node: class
- [tree picture, p. 146]
- Data vector x descends a unique path from root to leaf.
- Structure of the tree is derived from the data
- Choose the best variable to split the data
- Score function is misclassification loss
- $S = \sum_{i=1}^{n} C(y_i, \hat{y}_i)$
- C is an m x m matrix, where m is the number of classes
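As a minimal sketch of the misclassification loss on this slide, the following sums the cost matrix entry for each (true class, predicted class) pair. The function name, the zero-one cost matrix, and the toy labels are illustrative assumptions, not taken from the slides.

```python
def misclassification_loss(y_true, y_pred, cost):
    """Sum cost[true][predicted] over all data points."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))

# Assumed zero-one cost for m = 3 classes: 0 on the diagonal, 1 elsewhere.
m = 3
zero_one = [[0 if i == j else 1 for j in range(m)] for i in range(m)]

# Hypothetical true and predicted class indices for five data points.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 1, 1]

loss = misclassification_loss(y_true, y_pred, zero_one)
print(loss)  # 2 (two of the five points are misclassified)
```

A non-uniform cost matrix would let some kinds of error count more than others without changing the function.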
6 Tree classification vs. linear classification
- Trees can deal with mixed data types: a combination of categorical and real-valued data
- Easier to deal with a large number of variables because they are processed one at a time
7 CART tree classification
- Goal of CART is to find a tree as close as possible to the optimal tree
- Complex enough to capture the structure
- Not so complex that it overfits
- Uses cross-validation
8 CART
- Search space is all possible trees: combinatorially large.
- Approximation algorithm: uses greedy local search to identify good candidate tree structures
- Two phases
- Recursively expands the tree from the root
- Prunes branches
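The growing phase can be sketched as below: at each node, try every (variable, threshold) pair and keep the one with the fewest misclassified points, then recurse. This is only an illustration under stated assumptions. It uses the raw misclassification count as the split criterion, handles real-valued variables only, stops at an assumed `max_depth`, and omits the pruning phase entirely; practical CART implementations typically use an impurity measure such as the Gini index for splitting. All function names are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most common class among the labels at a node."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Points misclassified if the node predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(X, y):
    """Greedy step: lowest-error (variable, threshold) pair, or None."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must send points to both children
            score = errors(left) + errors(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    """Recursively expand the tree from the root (no pruning here)."""
    split = None
    if depth < max_depth and errors(y) > 0:
        split = best_split(X, y)
    if split is None:
        return {"leaf": majority(y)}
    _, j, t = split
    left = [(r, yi) for r, yi in zip(X, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(X, y) if r[j] > t]
    return {
        "var": j,
        "thr": t,
        "left": grow([r for r, _ in left], [c for _, c in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [c for _, c in right], depth + 1, max_depth),
    }

def predict(tree, x):
    """Descend the unique root-to-leaf path for the data vector x."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

# Hypothetical toy data: two real-valued variables, two classes.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
print(predict(tree, [1.5, 0.0]), predict(tree, [3.5, 0.0]))  # a b
```

Each split is chosen locally and never revisited, which is what makes the search greedy rather than exhaustive over all trees.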
9 Reductionist view of data mining algorithms
- Algorithms are tuples of
- (model structure, score function, search method, data management)
- When deciding on an algorithm for an application, think about which components are suitable
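One way to make the tuple view concrete is a small record type with one field per component. The class name `AlgorithmSpec`, its field names, and the `"unspecified"` default are assumptions made for illustration; the field values are the components these slides list for CART and the MLP, and the slides do not state a data management technique for either.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class AlgorithmSpec:
    """Hypothetical record for the components of a data mining algorithm."""
    task: str
    structure: str
    score_function: str
    search_method: str
    data_management: str = "unspecified"  # not given in the slides

CART = AlgorithmSpec(
    task="prediction (classification)",
    structure="tree",
    score_function="cross-validated loss function",
    search_method="greedy local search",
)

MLP = AlgorithmSpec(
    task="prediction",
    structure="layers of nonlinear transformations of weighted sums",
    score_function="sum of squared errors",
    search_method="steepest descent from random initial values",
)

def differing_components(a, b):
    """Names of the components on which two specifications differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]

print(differing_components(CART, MLP))
```

Comparing two specifications component by component mirrors the advice on this slide: choose each component for the application rather than treating an algorithm as a monolithic unit.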
10 Multilayer Perceptrons (MLP) for Regression and Classification
- Nonlinear mapping from a real-valued input vector x to a real-valued output y
- Can be used as
- Nonlinear model for regression
- Classification
11 MLP: basic idea
12 MLP algorithm tuple
- Task: prediction
- Structure: layers of nonlinear transformations of weighted sums of the inputs
- Score function: sum of squared errors
- Search method: steepest descent from randomly chosen initial parameter values
13 MLP
- No simple summary as with a tree
- Nonlinear nature of its model structure
- y is a nonlinear function of the inputs
- The parameters θ (the weights) appear nonlinearly in the score function
14 Score function
- Commonly used score function
- $S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- $y_i$ is the true target value and $\hat{y}_i$ is the output of the network for the i-th data point
- $\hat{y}_i$ is a function of the input vector $x_i$ and the weights θ
- Training the network: minimizing S
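A minimal sketch of this score function for a network with one hidden layer follows. The architecture (tanh hidden units, a single linear output), the parameter layout, and the function names are assumptions chosen to keep the example short; the slides do not fix them.

```python
import math

def mlp_output(x, W1, b1, w2, b2):
    """One hidden layer: tanh of weighted sums of the inputs, then a weighted sum."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def sum_of_squared_errors(X, y, params):
    """S = sum over data points of (target - network output) squared."""
    return sum((yi - mlp_output(xi, *params)) ** 2 for xi, yi in zip(X, y))

# Hypothetical weights: with all weights zero the network outputs the bias b2.
params = ([[0.0, 0.0]], [0.0], [0.0], 0.5)
X = [[1.0, 2.0], [3.0, 4.0]]
y = [0.5, 1.5]
print(sum_of_squared_errors(X, y, params))  # 1.0
```

Because the weights sit inside `tanh`, S is not a quadratic function of them, which is why training needs an iterative search rather than a closed-form solution.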
15 Training methods
- Backpropagation
- Steepest descent on the score function descends to a local minimum from a randomly chosen starting point in the parameter space.
- Nonlinear optimization techniques
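The steepest-descent idea can be sketched on a tiny two-weight model. Several things here are assumptions for illustration: the gradient is approximated by central differences rather than computed by backpropagation, the step is halved whenever it fails to lower the score, and the model `w2 * tanh(w1 * x)`, the toy data, and the random seed are made up.

```python
import math
import random

def numerical_gradient(score, theta, eps=1e-6):
    """Central-difference approximation of the gradient of score at theta."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grad.append((score(up) - score(down)) / (2 * eps))
    return grad

def steepest_descent(score, theta, step=0.1, iters=200):
    """Move against the gradient; halve the step when a move does not help."""
    current = score(theta)
    for _ in range(iters):
        grad = numerical_gradient(score, theta)
        candidate = [t - step * g for t, g in zip(theta, grad)]
        new = score(candidate)
        if new < current:
            theta, current = candidate, new
        else:
            step /= 2  # overshot: retry from the same point with a smaller step
    return theta, current

# Hypothetical toy data generated from w1 = 0.8, w2 = 1.5.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.5 * math.tanh(0.8 * x) for x in xs]

def score(theta):
    """Sum of squared errors of the model w2 * tanh(w1 * x)."""
    w1, w2 = theta
    return sum((y - w2 * math.tanh(w1 * x)) ** 2 for x, y in zip(xs, ys))

random.seed(0)
start = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random initial point
theta, final = steepest_descent(score, start)
print(score(start), final)
```

The procedure only guarantees that the score does not increase; which local minimum it reaches depends on the starting point, so restarting from several random points is a common precaution.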
16 Tree vs. MLP
- Tree algorithms search through models of different complexities in a relatively automated manner.
- No accepted procedure for choosing the network structure of an MLP (number of hidden nodes, layers)
- In practice: trial and error
17 A Priori Algorithm for Association Rule Learning
- Association rule
- IF A = 1 AND B = 1 THEN C = 1 with probability p
- A, B, C are binary variables.
- Accuracy $p_a$ is the conditional probability $p(C = 1 \mid A = 1, B = 1)$
- Support $p_s = p(A = 1, B = 1, C = 1)$
- Specify $p_a$, $p_s$ as thresholds
- Goal is to find all rules that satisfy the threshold values
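To illustrate the two quantities, the sketch below estimates support and accuracy for the rule IF A = 1 AND B = 1 THEN C = 1 from relative frequencies in a table of binary records. The function name and the five toy records are assumptions for illustration.

```python
def support_and_accuracy(rows, antecedent, consequent):
    """Estimate support and accuracy of a rule from binary records."""
    n = len(rows)
    # Records where every antecedent variable equals 1.
    n_ante = sum(all(r[v] == 1 for v in antecedent) for r in rows)
    # Records where the antecedent variables and the consequent all equal 1.
    n_both = sum(
        all(r[v] == 1 for v in list(antecedent) + [consequent]) for r in rows
    )
    support = n_both / n
    accuracy = n_both / n_ante if n_ante else 0.0
    return support, accuracy

# Hypothetical data set of five records over the binary variables A, B, C.
rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]

support, accuracy = support_and_accuracy(rows, ["A", "B"], "C")
print(support, accuracy)  # support 2/5, accuracy 2/3
```

A rule is reported only if both estimates reach their thresholds, so a rule can be highly accurate and still be dropped because too few records support it.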
18
- Summaries of co-occurrence patterns in the observed data
- Correlation, not causation.
19 Association Rules
- Task: description (associations between variables)
- Structure: probabilistic association rules
- Score function: thresholds on accuracy and support
- Search method: systematic breadth-first search with pruning
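The breadth-first search with pruning can be sketched as a level-wise search for frequent itemsets: sets of size k are built only from frequent sets of size k - 1, and a candidate is discarded if any of its subsets is infrequent. This is a simplified illustration, not a full implementation. It recomputes support by rescanning the data for every candidate, whereas an efficient version would count all candidates of one level in a single pass, and it stops at frequent itemsets without generating the rules themselves. The function name and the toy transactions are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) search for itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Candidates of size k are unions of two frequent sets of size k - 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k - 1)-subset of a candidate must itself be frequent.
        previous = set(level)
        candidates = [
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        ]
        level = [c for c in candidates if support(c) >= min_support]
    return result

# Hypothetical transactions, each the set of variables equal to 1 in a record.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
found = frequent_itemsets(transactions, min_support=0.6)
print(sorted("".join(sorted(s)) for s in found))
# ['A', 'AB', 'AC', 'B', 'BC', 'C']
```

The pruning is sound because support can only fall as items are added, so no superset of an infrequent itemset can be frequent.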
20 Example
21 Association Rules: Summary
- Systematic search explicitly tries to minimize the number of linear scans through the database
- Designed to operate on very large data sets.
- Focus on computational efficiency
22 Vector Space Algorithms for Text Retrieval
- Given a query object Q and a large database
- Find the k objects that are most similar to Q
- How is similarity defined?
- Reduce documents to a vector representation
- The angle between two vectors measures their similarity
23
- Task: retrieval of the k most similar documents in a database relative to a given query
- Representation: vector of term occurrences
- Score function: angle between two vectors
- Search method
- Model representation is the key idea: which terms to use in a vector
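A minimal sketch of vector space retrieval follows. Documents and the query become term-count vectors, and documents are ranked by the cosine of the angle to the query, which gives the same ordering as ranking by smallest angle. The whitespace tokenizer, the linear scan over all documents, the function names, and the three toy documents are assumptions; the slide leaves the search method open.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-occurrence vector: count of each lowercased whitespace token."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query, documents, k):
    """Score every document against the query and return the k best."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical three-document database.
documents = [
    "data mining algorithms",
    "tree classification with CART",
    "mining association rules from data",
]
for similarity, doc in top_k("data mining", documents, k=2):
    print(round(similarity, 3), doc)
```

Using the angle rather than the raw dot product keeps a long document from ranking highly merely because it contains more words.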
24
- Statistical approach: emphasizes theoretical aspects of inference procedures (parameter estimation, model selection)
- Computer science approach: focuses on more efficient search and data management.