Title: A systematic overview of Data Mining Algorithms
1 A systematic overview of Data Mining Algorithms
- Data Mining, David Hand, Ch. 5
2
- Input: data
- Output: models or patterns
3 Data Mining algorithm specification
- Task (visualization, classification, clustering, regression)
- Structure (functional form) of the model that we are fitting to the data (e.g. linear regression, hierarchical clustering)
- Score function
  - Goodness of fit
  - Generalization
- Search/optimization method (hill climbing, simulated annealing, convergence specification)
4 CART classification
- Mapping an input vector x to a categorical class y
- Task: prediction (classification)
- Model structure: tree
- Score function: cross-validated loss function
- Search method: greedy local search
5 CART Tree
- Internal node: binary test
- Leaf node: class
- Tree picture: p. 146
- A data vector x descends a unique path from root to leaf.
- Structure of the tree is derived from the data
- Choose the best variable to split the data
- Score function is misclassification loss
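  - $S = \sum_{i=1}^{n} C\big(y(i), \hat{y}(i)\big)$, where y(i) is the true class and $\hat{y}(i)$ the predicted class of the ith of n data points
  - C is an m x m matrix (m is the number of classes)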
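The misclassification loss above can be sketched in a few lines. This is a minimal illustration, assuming classes are encoded as integers 0..m-1 and that `cost[t][p]` holds the loss for predicting class p when the true class is t; the function name, the 0/1 cost matrix, and the labels are hypothetical example values, not taken from the slides.

```python
def misclassification_loss(y_true, y_pred, cost):
    """Sum cost[true][predicted] over all data points."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))

# Hypothetical 3-class example with 0/1 loss:
# cost 0 on the diagonal (correct), 1 everywhere else (any error).
m = 3
zero_one = [[0 if i == j else 1 for j in range(m)] for i in range(m)]

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 1, 1]
loss = misclassification_loss(y_true, y_pred, zero_one)
print(loss)
```

Replacing `zero_one` with an asymmetric matrix lets some kinds of error cost more than others.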
6 Tree classification vs. Linear classification
- Trees can deal with mixed data types (a combination of categorical and real-valued data)
- Easier to deal with a large number of variables (because they are processed one at a time)
7 CART tree classification
- Goal of CART is to find a tree as close as possible to the optimal tree
  - Complex enough to capture the structure
  - Not so complex that it overfits
- Uses cross-validation
8 CART
- Search space is all possible trees (combinatorially large).
- Approximation algorithm: uses greedy local search to identify good candidate tree structures
- 2 phases
  - Recursively expands the tree from the root
  - Prunes branches
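The growing phase can be sketched as follows. This is an illustration under stated assumptions, not CART itself: it uses the raw misclassification count as the split criterion (real CART implementations typically use an impurity measure such as the Gini index), handles only numeric variables, stops at a hypothetical `max_depth`, and omits the pruning phase entirely. All function names are made up for this example, and class labels are assumed to be strings.

```python
from collections import Counter

def majority(labels):
    """Most frequent class among the labels."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Points misclassified when predicting the majority class."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(X, y):
    """Greedy step: try every variable and threshold, keep the lowest-error binary test."""
    best = None  # (error, variable index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            err = errors(left) + errors(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    """Recursively expand the tree from the root (no pruning in this sketch)."""
    split = best_split(X, y) if depth < max_depth and len(set(y)) > 1 else None
    if split is None:
        return majority(y)  # leaf node: a class
    _, j, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    return (j, t,  # internal node: binary test x[j] <= t
            grow([r for r, _ in left], [c for _, c in left], depth + 1, max_depth),
            grow([r for r, _ in right], [c for _, c in right], depth + 1, max_depth))

def predict(tree, x):
    """Descend the unique path from root to leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
```

Each call to `best_split` looks only one step ahead, which is what makes the search greedy: a split that looks poor now but would enable a very good split later is never found.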
9 Reductionist view on data mining algorithms
- Algorithms are tuples of
  - model structure, score function, search method, data management
- When deciding on an algorithm for an application, think of which components are suitable
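One way to make the tuple view concrete is a small record type. The sketch below is only an illustration: the class name `AlgorithmSpec` and its field names are hypothetical, the entries are copied from the component lists given on the slides for each algorithm, and the data management field is left empty because the slides do not specify it.

```python
from typing import NamedTuple, Optional

class AlgorithmSpec(NamedTuple):
    task: str
    structure: str
    score_function: str
    search_method: str
    data_management: Optional[str] = None  # not specified on the slides

CART = AlgorithmSpec(
    task="prediction (classification)",
    structure="tree",
    score_function="cross-validated loss function",
    search_method="greedy local search",
)
MLP = AlgorithmSpec(
    task="prediction",
    structure="layers of nonlinear transformations of weighted sums of the inputs",
    score_function="sum of squared errors",
    search_method="steepest descent from randomly chosen initial parameter values",
)
ASSOCIATION_RULES = AlgorithmSpec(
    task="description",
    structure="probabilistic association rules",
    score_function="thresholds on accuracy and support",
    search_method="systematic (breadth-first) search with pruning",
)

# Compare algorithms component by component.
for spec in (CART, MLP, ASSOCIATION_RULES):
    print(spec.structure, "|", spec.search_method)
```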
10 Multilayer Perceptrons (MLP) for Regression and Classification
- Nonlinear mapping from a real-valued input vector x to a real-valued output y
- Can be used as
  - Nonlinear model for regression
  - Classification
11 MLP Basic idea
12 MLP algorithm tuple
- Task: prediction
- Structure: layers of nonlinear transformations of weighted sums of the inputs
- Score function: sum of squared errors
- Search method: steepest descent from randomly chosen initial parameter values
13 MLP
- No simple summary (as with a tree)
- Nonlinear nature of its model structure
  - y is a nonlinear function of the inputs
  - Parameters (the weights) appear nonlinearly in the score function
14 Score function
- Commonly used score function
  - $S = \sum_{i=1}^{n} \big(y(i) - \hat{y}(i)\big)^2$
  - y(i) is the true target value and $\hat{y}(i)$ is the output of the network for the ith data point
  - $\hat{y}(i)$ is a function of the input vector x(i) and the weights
- Training the network: minimizing S
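To show how S depends on the weights, here is a minimal sketch of a network output and its sum-of-squared-errors score. It assumes a single hidden layer with tanh units and no bias terms; the slides do not fix the nonlinearity or the number of layers, so these choices, the function names, and the example data are all illustrative.

```python
import math

def mlp_output(x, W_hidden, w_out):
    """tanh of weighted sums of the inputs, then a weighted sum of the hidden units."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def sse(X, y, W_hidden, w_out):
    """Score S: sum over data points of (target - network output)^2."""
    return sum((yi - mlp_output(xi, W_hidden, w_out)) ** 2 for xi, yi in zip(X, y))

X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]

# With all weights zero the network outputs 0, so S is just the sum of y(i)^2.
zero_hidden = [[0.0, 0.0], [0.0, 0.0]]
zero_out = [0.0, 0.0]
score_at_zero = sse(X, y, zero_hidden, zero_out)
```

Because the hidden weights sit inside `tanh`, S is not a quadratic function of them, which is why no closed-form minimizer is available.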
15 Training methods
- Back-propagation
- Steepest descent on the score function descends to a local minimum from a randomly chosen starting point in the parameter space.
- Nonlinear optimization techniques
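A sketch of steepest descent with gradients computed by back-propagation is given below. It is an illustration under stated assumptions: one hidden layer of tanh units, a constant input of 1.0 standing in for a bias, and a fixed learning rate. The hyperparameters (`n_hidden`, `rate`, `steps`, `seed`) and the toy data are hypothetical choices, not values from the slides.

```python
import math
import random

def forward(x, W, v):
    """Return hidden activations and the network output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return h, sum(vk * hk for vk, hk in zip(v, h))

def score(X, y, W, v):
    """Sum of squared errors S."""
    return sum((yi - forward(xi, W, v)[1]) ** 2 for xi, yi in zip(X, y))

def train(X, y, n_hidden=3, rate=0.01, steps=500, seed=0):
    rng = random.Random(seed)
    # Randomly chosen starting point in the parameter space.
    W = [[rng.uniform(-0.5, 0.5) for _ in X[0]] for _ in range(n_hidden)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    history = [score(X, y, W, v)]
    for _ in range(steps):
        gW = [[0.0] * len(X[0]) for _ in range(n_hidden)]
        gv = [0.0] * n_hidden
        for xi, yi in zip(X, y):
            h, out = forward(xi, W, v)
            d_out = -2.0 * (yi - out)  # dS/d(output) for this point
            for k in range(n_hidden):
                gv[k] += d_out * h[k]
                # Propagate the error back through tanh: d tanh(a)/da = 1 - tanh(a)^2.
                d_hidden = d_out * v[k] * (1.0 - h[k] ** 2)
                for j, xj in enumerate(xi):
                    gW[k][j] += d_hidden * xj
        # Steepest-descent step: move against the gradient.
        W = [[w - rate * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
        v = [vk - rate * g for vk, g in zip(v, gv)]
        history.append(score(X, y, W, v))
    return W, v, history

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
X = [[x, 1.0] for x in xs]          # second input is a constant bias term
y = [math.sin(x) for x in xs]
W, v, history = train(X, y)
print(history[0], history[-1])
```

A different `seed` gives a different starting point and may end in a different local minimum, so in practice several restarts are often compared.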
16 Tree vs. MLP
- Tree algorithms search through models of different complexities in a relatively automated manner.
- No accepted procedure for choosing the network structure of an MLP (number of hidden nodes, layers)
  - In practice: trial and error
17 A Priori Algorithm for Association Rule Learning
- Association rule
  - IF A = 1 AND B = 1 THEN C = 1 with probability p
  - A, B, C are binary variables.
  - Accuracy $p_a$ is the conditional probability $p(C = 1 \mid A = 1, B = 1)$
  - Support $p_s = p(A = 1, B = 1, C = 1)$
- Specify $p_a$, $p_s$ as thresholds
- Goal is to find all rules that satisfy the threshold values
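Support and accuracy can be estimated from data as simple relative frequencies. The sketch below assumes each record is a dict of 0/1 variables; the function name and the five example rows are hypothetical.

```python
def rule_stats(rows, antecedent, consequent):
    """Return (support, accuracy) for IF antecedent THEN consequent."""
    n = len(rows)
    n_ante = sum(all(r[v] == 1 for v in antecedent) for r in rows)
    n_both = sum(all(r[v] == 1 for v in antecedent + consequent) for r in rows)
    support = n_both / n                          # estimate of p(A=1, B=1, C=1)
    accuracy = n_both / n_ante if n_ante else 0.0  # estimate of p(C=1 | A=1, B=1)
    return support, accuracy

rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]
support, accuracy = rule_stats(rows, ["A", "B"], ["C"])
print(support, accuracy)
```

A rule is reported only if both numbers meet their thresholds: high accuracy with tiny support describes very few records, and high support with low accuracy is a weak rule.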
18
- Summaries of co-occurrence patterns in the observed data
- A correlational relationship, not causation.
19 Association Rules
- Task: description (associations between variables)
- Structure: probabilistic association rules
- Score function: thresholds on accuracy and support
- Search method: systematic (breadth-first search) with pruning
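The breadth-first search with pruning can be sketched as a level-wise search for frequent itemsets. This is a simplified illustration, assuming transactions are Python sets of item names: it stops at the itemsets that meet the support threshold and leaves out the final step of turning them into rules filtered by accuracy. The function name and the toy transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) search over itemsets of size 1, 2, 3, ..."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Count every size-k candidate. This sketch rescans the data per
        # candidate; a real implementation counts all of them in one pass.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Build size-(k+1) candidates from frequent k-sets, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Pruning is sound because support can only shrink as items are added: if {butter, milk} is below the threshold, no superset of it can reach it, so {bread, butter, milk} is never counted.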
20 Example
21 Association Rules Summary
- Systematic search explicitly tries to minimize the number of linear scans through the database
- Designed to operate on very large data sets.
- Focus on computational efficiency
22 Vector Space Algorithms for Text Retrieval
- Given: a query object Q and a large database
- Find the k objects that are most similar to Q
- How is similarity defined?
  - Reduce documents to a vector representation
  - Angle between vectors measures similarity
23
- Task: retrieval of the k most similar documents in a database relative to a given query
- Representation: vector of term occurrences
- Score function: angle between two vectors
- Search method
- Model representation is the key idea (which terms to use in a vector)
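A minimal sketch of vector-space retrieval follows. It assumes whitespace tokenization and raw term counts as the vector representation, and it ranks by the cosine of the angle (a larger cosine means a smaller angle). The exhaustive scan over all documents is a placeholder chosen for this example only, since the slides leave the search method unspecified; the function names and documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Vector of term occurrences, stored sparsely as term -> count."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def top_k(query, documents, k):
    """Return the k documents with the smallest angle to the query."""
    q = term_vector(query)
    ranked = sorted(documents, key=lambda d: cosine(q, term_vector(d)), reverse=True)
    return ranked[:k]

documents = [
    "data mining algorithms",
    "decision tree classification",
    "mining association rules in large data sets",
]
hits = top_k("data mining", documents, k=2)
print(hits)
```

Using the angle rather than Euclidean distance means a long document is not penalized merely for containing more words; only the relative mix of terms matters.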
24
- Statistical approach: emphasizes theoretical aspects of inference procedures (parameter estimation, model selection)
- Computer science approach: focuses on more efficient search and data management.