A systematic overview of Data Mining Algorithms

Provided by: sve2

1
A systematic overview of Data Mining Algorithms
  • Data Mining, David Hand Ch 5

2
  • Input: data
  • Output: models or patterns

3
Data Mining algorithm specification
  • Task (visualization, classification, clustering,
    regression)
  • Structure (functional form) of the model that we
    are fitting to the data (linear regression,
    hierarchical clustering)
  • Score function
  • Goodness of fit
  • Generalization
  • Search/optimization method (hill climbing,
    simulated annealing, convergence specification)

4
CART classification
  • Mapping an input vector x to a categorical class
    y
  • Task: prediction (classification)
  • Model structure: tree
  • Score function: cross-validated loss function
  • Search method: greedy local search

5
CART Tree
  • Internal node: binary test
  • Leaf node: class
  • Tree picture: p. 146
  • A data vector x descends a unique path from root to
    leaf.
  • The structure of the tree is derived from the data
  • Choose the best variable to split the data
  • Score function is misclassification loss
  • $S = \sum_{i} C(y(i), \hat{y}(i))$, where y(i) is the
    true class and $\hat{y}(i)$ the predicted class
  • C is an m x m matrix (m is the number of classes)
6
Tree classification vs. Linear classification
  • Tree can deal with mixed data types (combination
    of categorical and real-valued data)
  • Easier to deal with large number of variables
    (because process one at a time)

7
CART tree classification
  • Goal of CART is to find a tree as close as possible
    to the optimal tree
  • Complex enough to capture the structure
  • Not so complex that it overfits
  • Uses cross-validation

8
CART
  • Search space is all possible trees (combinatorially
    large).
  • Approximation algorithm: uses greedy local
    search to identify good candidate tree
    structures
  • Two phases:
  • Recursively expands the tree from the root
  • Prunes branches
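
The growing phase can be sketched as a greedy recursive search. This is a simplified illustration and not the CART procedure itself: it scores splits by raw misclassification count (CART implementations typically use an impurity measure such as the Gini index), it stops at an assumed `max_depth`, and it omits the pruning phase. All function names are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most frequent class label in a node."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Points misclassified if the node predicts its majority class."""
    return (len(labels) - Counter(labels).most_common(1)[0][1]) if labels else 0

def best_split(X, y):
    """Greedy step: try every (variable, threshold) and keep the best one."""
    best = None  # (errors, variable index, threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must send data both ways
            e = errors(left) + errors(right)
            if best is None or e < best[0]:
                best = (e, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    """Recursively expand the tree from the root; leaves are class labels."""
    split = best_split(X, y) if depth < max_depth and errors(y) > 0 else None
    if split is None:
        return majority(y)
    _, j, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[j] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[j] > t]
    return (j, t,
            grow([r for r, _ in left], [lab for _, lab in left], depth + 1, max_depth),
            grow([r for r, _ in right], [lab for _, lab in right], depth + 1, max_depth))

def predict(tree, x):
    """Send x down its unique path from root to leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
print(tree, predict(tree, [1.5, 0.0]))
```

Each internal node is a tuple (variable index, threshold, left subtree, right subtree), so the binary test at a node is `x[j] <= t`.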

9
Reductionist view on data mining algorithms
  • Algorithms are tuples of
  • model structure, score function, search method,
    data management
  • When deciding on an algorithm for an application,
    think of which components are suitable
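
One way to make the tuple view concrete is a small record type. The field values below are copied from the component lists on the CART and MLP slides; the class name `AlgorithmSpec` and the default for `data_management` are assumptions, since the slides do not state a data management component for these algorithms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlgorithmSpec:
    """A data mining algorithm described by its components."""
    task: str
    structure: str
    score_function: str
    search_method: str
    data_management: str = "not specified"  # assumed placeholder

cart = AlgorithmSpec(
    task="prediction (classification)",
    structure="tree",
    score_function="cross-validated loss function",
    search_method="greedy local search",
)
mlp = AlgorithmSpec(
    task="prediction",
    structure="layers of nonlinear transformations of weighted sums",
    score_function="sum of squared errors",
    search_method="steepest descent from random initial parameters",
)

# Components that differ between the two algorithms.
differing = [name for name in ("task", "structure", "score_function", "search_method")
             if getattr(cart, name) != getattr(mlp, name)]
print(differing)
```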

10
Multilayer Perceptrons (MLP) for Regression and
Classification
  • Non-linear mapping from a real-valued input
    vector x to a real-valued output y
  • Can be used as:
  • Nonlinear model for regression
  • Classifier

11
MLP Basic idea

12
MLP algorithm tuple
  • Task: prediction
  • Structure: layers of nonlinear transformations
    of weighted sums of the inputs
  • Score function: sum of squared errors
  • Search method: steepest descent from randomly
    chosen initial parameter values
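
The model structure can be sketched as a forward pass through one hidden layer. The choice of a single hidden layer, the tanh nonlinearity, the linear output unit, and the name `mlp_forward` are assumptions made for illustration; the slides do not fix these details.

```python
import math

def mlp_forward(x, W1, b1, w2, b2):
    """Output of a one-hidden-layer network for input vector x.

    Each hidden unit applies tanh to a weighted sum of the inputs;
    the output is a weighted sum of the hidden unit values.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Two inputs, two hidden units, one real-valued output.
W1 = [[0.5, -0.3], [0.1, 0.8]]   # hidden-layer weights (one row per unit)
b1 = [0.0, 0.1]                  # hidden-layer biases
w2 = [1.0, -1.0]                 # output weights
b2 = 0.2                         # output bias
y_hat = mlp_forward([1.0, 2.0], W1, b1, w2, b2)
print(y_hat)
```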

13
MLP
  • No simple summary (as with a tree)
  • Nonlinear nature of its model structure
  • y is a nonlinear function of the inputs
  • The parameters $\theta$ (the weights) appear
    nonlinearly in the score function $S(\theta)$

14
Score function
  • Commonly used score function:
  • $S = \sum_{i} (y(i) - \hat{y}(i))^2$
  • y(i) is the true target value and $\hat{y}(i)$ is the
    output of the network for the ith data point
  • $\hat{y}(i)$ is a function of the input vector x(i)
    and the weights $\theta$
  • Training the network: minimizing S

15
Training methods
  • Back-propagation
  • Steepest descent on the score function descends
    to a local minimum from a randomly chosen starting
    point in the parameter space.
  • Non-linear optimization techniques
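
Steepest descent on the sum of squared errors can be sketched for a tiny network with one scalar input. This is a minimal sketch under stated assumptions: one hidden layer of tanh units, a fixed step size `rate`, a fixed number of steps instead of a convergence test, and toy data; none of these choices come from the slides. The gradient is computed by back-propagating the output error through the hidden layer.

```python
import math
import random

def forward(x, p):
    """Hidden activations and network output for scalar input x."""
    h = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    return h, sum(v * hk for v, hk in zip(p["w2"], h)) + p["b2"]

def sse(data, p):
    """Score function: sum of squared errors over the data."""
    return sum((y - forward(x, p)[1]) ** 2 for x, y in data)

def gradient(data, p):
    """Gradient of the SSE with respect to every weight (back-propagation)."""
    k = len(p["w1"])
    g = {"w1": [0.0] * k, "b1": [0.0] * k, "w2": [0.0] * k, "b2": 0.0}
    for x, y in data:
        h, y_hat = forward(x, p)
        d = 2.0 * (y_hat - y)                 # dS/d(output)
        g["b2"] += d
        for j in range(k):
            g["w2"][j] += d * h[j]
            back = d * p["w2"][j] * (1.0 - h[j] ** 2)  # through tanh
            g["w1"][j] += back * x
            g["b1"][j] += back
    return g

def train(data, hidden=3, rate=0.01, steps=500, seed=0):
    """Steepest descent from randomly chosen initial weights."""
    rng = random.Random(seed)
    p = {"w1": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
         "b1": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
         "w2": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
         "b2": 0.0}
    history = [sse(data, p)]
    for _ in range(steps):
        g = gradient(data, p)
        for name in ("w1", "b1", "w2"):
            p[name] = [v - rate * gv for v, gv in zip(p[name], g[name])]
        p["b2"] -= rate * g["b2"]
        history.append(sse(data, p))
    return p, history

# Toy regression data: y = x^2 at five points in [-1, 1].
data = [(x / 2, (x / 2) ** 2) for x in range(-2, 3)]
p, history = train(data)
print(history[0], history[-1])
```

A different `seed` gives a different starting point and may end in a different local minimum, which is why results from a single run should be read with care.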

16
Tree vs. MLP
  • Tree algorithms search through models of different
    complexities in a relatively automated manner.
  • No accepted procedure for choosing the network
    structure of an MLP (number of hidden nodes, layers)
  • In practice: trial and error

17
Apriori Algorithm for Association Rule Learning
  • Association rule:
  • IF A = 1 AND B = 1 THEN C = 1 with probability p
  • A, B, C are binary variables.
  • Accuracy $p_a$ is the conditional probability
    $p(C = 1 \mid A = 1, B = 1)$
  • Support $p_s = p(A = 1, B = 1, C = 1)$
  • Specify thresholds on $p_a$ and $p_s$
  • Goal is to find all rules that satisfy the threshold
    values
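
Support and accuracy of a single rule can be estimated from binary data by counting. The sketch below assumes each record is a dictionary of 0/1 values; the function name `rule_stats` and the four example records are made up for illustration.

```python
def rule_stats(rows, antecedent, consequent):
    """Estimate (support, accuracy) of IF antecedent THEN consequent.

    support  = fraction of rows where antecedent and consequent are all 1
    accuracy = that count divided by the rows where the antecedent holds
    """
    n_ante = sum(1 for r in rows if all(r[v] == 1 for v in antecedent))
    n_both = sum(1 for r in rows
                 if all(r[v] == 1 for v in list(antecedent) + [consequent]))
    support = n_both / len(rows)
    accuracy = n_both / n_ante if n_ante else 0.0
    return support, accuracy

rows = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
]
support, accuracy = rule_stats(rows, ["A", "B"], "C")
print(support, accuracy)
```

A rule is reported only if both numbers meet their thresholds.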

18
  • Summaries of co-occurrence patterns in the
    observed data
  • Correlation, not causation.

19
Association Rules
  • Task: description (associations between
    variables)
  • Structure: probabilistic association rules
  • Score function: thresholds on accuracy and
    support
  • Search method: systematic (breadth-first) with
    pruning
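
The breadth-first search with pruning can be sketched as a level-wise search for frequent itemsets. This covers only the support threshold (rules and their accuracy would be derived from the frequent itemsets afterwards), and it is a simplified sketch, not an optimized implementation. The function name and the example transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]   # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # One linear scan of the data per level counts all candidates.
        counts = {c: 0 for c in level}
        for t in transactions:
            for c in level:
                if c <= t:
                    counts[c] += 1
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Next level: join kept sets into size k+1 candidates, then prune
        # any candidate that has an infrequent subset of size k.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Pruning works because an itemset cannot be frequent unless all of its subsets are, so the three-item candidate here is discarded without another scan.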

20
Example

21
Association Rules Summary
  • Systematic search explicitly tries to minimize the
    number of linear scans through the database
  • Designed to operate on very large data sets.
  • Focus on computational efficiency

22
Vector Space Algorithms for Text Retrieval
  • Given: a query object Q and a large database
  • Find the k objects that are most similar to Q
  • How is similarity defined?
  • Reduce documents to a vector representation
  • The angle between vectors measures similarity

23
  • Task: retrieval of the k most similar documents in
    a database relative to a given query
  • Representation: vector of term occurrences
  • Score function: angle between two vectors
  • Search method
  • Model representation is the key idea (which terms
    to use in a vector)
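
Retrieval by angle can be sketched with raw term counts and cosine similarity (a larger cosine means a smaller angle). Splitting on whitespace, using unweighted counts, and scanning every document are simplifying assumptions; the function names and the three example documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a vector of term occurrence counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def top_k(query, documents, k):
    """Return the k documents with the smallest angle to the query."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

documents = [
    "data mining algorithms",
    "cooking with tomatoes",
    "mining data streams",
]
result = top_k("data mining", documents, k=2)
print(result)
```

Which terms go into the vector decides what counts as similar, which is why the representation is the key modelling choice here.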

24
  • Statistical approach: emphasizes theoretical
    aspects of inference procedures (parameter
    estimation, model selection)
  • Computer science approach: focuses on more
    efficient search and data management.