Database Implementation of a Model-Free Classifier - PowerPoint PPT Presentation

About This Presentation
Title: Database Implementation of a Model-Free Classifier

Description: Based on simple SQL queries. Converges to optimal Bayes Classifier ... Efficient (based on simple SQL queries) Reliable (converging to optimal) Parallelizable ...

Slides: 63
Provided by: adb3
Learn more at: http://www.adbis.org

Transcript and Presenter's Notes

Title: Database Implementation of a Model-Free Classifier


1
Database Implementation of a Model-Free Classifier
ADBIS 2007
  • Konstantinos Morfonios

2
Introduction
Motivation
LOCUS
Parallel Execution
Experimental Evaluation
Conclusions & Future Work
4
Introduction
Classification
x = <x1, x2, ..., xD>
ω = f(x)
5
Introduction
<x1,1, x1,2, ..., x1,D, ω1>
<x2,1, x2,2, ..., x2,D, ω2>
<x3,1, x3,2, ..., x3,D, ω1>
<x4,1, x4,2, ..., x4,D, ω1>
. . .
x1 = <x1, x2, ..., xD>
x2 = <x1, x2, ..., xD>
Lazy (Nearest Neighbors)
Eager (Decision Trees)
  (+) Faster decisions
  (-) Large/complex datasets
  (-) Dynamic datasets
  (-) Dynamic models
8
Motivation
  • Large/complex datasets

9
Motivation
10
Motivation
  • Large/complex datasets
  • Dynamic datasets

11
Motivation
12
Motivation
  • Large/complex datasets
  • Dynamic datasets
  • Dynamic models

13
Motivation
14
Motivation
  • Large/complex datasets
  • Dynamic datasets
  • Dynamic models

Lazy (model-free)
15
Motivation
  • Large/complex datasets
  • Dynamic datasets
  • Dynamic models

Lazy (model-free)
Nearest Neighbors
16
Motivation
Nearest Neighbors
Suffers from the curse of dimensionality
  • Not reliable [Beyer et al., ICDT 1999]
  • Not indexable [Shaft et al., ICDT 2005]

LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
17
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Category?

18
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy

19
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Scaling?

20
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Based on simple SQL queries

21
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Based on simple SQL queries
  • Accuracy?

22
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Based on simple SQL queries
  • Converges to optimal Bayes Classifier

23
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Based on simple SQL queries
  • Converges to optimal Bayes Classifier
  • Other features?

24
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
  • Lazy
  • Based on simple SQL queries
  • Converges to optimal Bayes Classifier
  • Parallelizable

27
LOCUS
Example
28
LOCUS
Ideally: Dense space
29
LOCUS
Ideally: Dense space
[Figure: f1 vs. f2 plot; query point ω(<7, 4>) = ?]
30
LOCUS
[Figure: f1 vs. f2 plot; query point ω(<7, 4>)]
31
LOCUS
Reality
  • Many features
  • Large domains

⇒ Sparse space
[Figure: f1 vs. f2 plot]
32
LOCUS
[Figure: f1 vs. f2 plot; query point ω(<7, 4>) = ?]
33
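LOCUS
[Figure: f1 vs. f2 plot; the 3 nearest neighbors of the query point]
ω1: 2
ω2: 1
ω(<7, 4>) = ?
3-NN
34
LOCUS
[Figure: f1 vs. f2 plot; the 3 nearest neighbors of the query point]
ω1: 2
ω2: 1
ω(<7, 4>) = ω1
3-NN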
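For comparison, a 3-NN majority vote like the one in the figure can be sketched in Python. The training points, the placeholder class names `w1` and `w2`, and the helper `knn_classify` are assumptions made for illustration; only the query point <7, 4> and the 2-versus-1 vote come from the slide.

```python
from collections import Counter
import math

def knn_classify(points, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical training points: ((f1, f2), class label)
points = [
    ((6, 5), "w1"), ((8, 4), "w1"), ((7, 2), "w2"),
    ((1, 9), "w2"), ((2, 8), "w2"),
]

label, votes = knn_classify(points, query=(7, 4))
print(label, dict(votes))
```

The sort touches every stored point for each query, which hints at why nearest-neighbor search is hard to scale without a usable index.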
35
LOCUS
[Figure: f1 vs. f2 plot, method: LOCUS]
ω(<7, 4>) = ?
36
LOCUS
[Figure: f1 vs. f2 plot, method: LOCUS]
ω1: 7
ω2: 3
ω(<7, 4>) = ?
37
LOCUS
[Figure: f1 vs. f2 plot, method: LOCUS]
ω1: 7
ω2: 3
ω(<7, 4>) = ω1
38
LOCUS
Disk-based implementation
[Figure: f1 vs. f2 plot, method: LOCUS]
39
LOCUS
R(f1, f2, ω)
SELECT ω, count(*)
FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
  AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
ω1: 7
ω2: 3
ω(<7, 4>) = ω1
<x1, x2>
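A minimal sketch of this query using Python's built-in `sqlite3` module. The toy rows, the column name `label` (standing in for the class attribute of R), the half-widths `d1` and `d2`, and the helper `locus_classify` are illustrative assumptions, not taken from the presentation. Picking the class with the largest count is likewise an assumed decision rule.

```python
import sqlite3

def locus_classify(conn, x1, x2, d1, d2):
    """Count class labels inside the box [x1-d1, x1+d1] x [x2-d2, x2+d2].

    Returns (winning label, per-label counts), or (None, {}) if the box
    contains no rows. Ties go to whichever label the query returns first.
    """
    rows = conn.execute(
        """
        SELECT label, count(*)
        FROM R
        WHERE f1 >= ? AND f1 <= ?
          AND f2 >= ? AND f2 <= ?
        GROUP BY label
        """,
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    counts = dict(rows)
    if not counts:
        return None, counts
    return max(counts, key=counts.get), counts

# Hypothetical toy relation R(f1, f2, label)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(6, 4, "a"), (7, 5, "a"), (8, 3, "a"),
     (7, 3, "b"), (1, 1, "b"), (9, 9, "b")],
)

winner, counts = locus_classify(conn, x1=7, x2=4, d1=1, d2=1)
print(winner, counts)
```

No model is built ahead of time: every classification is one aggregate query over the stored rows, so newly inserted rows are taken into account by the next query.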
40
LOCUS
R(f1, f2, ω)
SELECT ω, count(*)
FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
  AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
What if R is large?
Classical optimization techniques for a well-known type of aggregate queries
  • Indexing
  • Materialized views
  • Presorting
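As one illustration of the indexing option, the sketch below builds a composite index in SQLite and asks the engine for its query plan. The index name `idx_r_f1_f2`, the column name `label`, the choice of indexed columns, and the generated rows are all assumptions; a different DBMS or workload may call for a different index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")

# Hypothetical data: 1000 rows on a 10 x 10 grid of feature values
data = [(i % 10, (i // 10) % 10, "a" if i % 3 else "b") for i in range(1000)]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", data)

# Assumed index: composite B-tree over both features plus the class label
conn.execute("CREATE INDEX idx_r_f1_f2 ON R (f1, f2, label)")

query = """
    SELECT label, count(*) FROM R
    WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ?
    GROUP BY label
"""
params = (6, 8, 3, 5)

# Print the plan SQLite chooses for the range-count query
for row in conn.execute("EXPLAIN QUERY PLAN " + query, params):
    print(row[-1])

counts = dict(conn.execute(query, params).fetchall())
print(counts)
```

The index changes only how the rows are located, so the counts are the same with or without it.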

41
LOCUS
R(f1, f2, ω)
SELECT ω, count(*)
FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
  AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
Method reliability?
LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper)
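The intuition behind this claim can be sketched with the standard nonparametric density estimate from pattern recognition. The notation below (n, V, k_i, k, and \omega_i for the i-th class) is assumed for illustration, and this is not the proof from the paper.

```latex
% n: dataset size, V: volume of the query box around x,
% k_i: rows of class \omega_i inside the box, k = \sum_j k_j
\hat{p}(x, \omega_i) = \frac{k_i / n}{V}
\qquad\Longrightarrow\qquad
\hat{P}(\omega_i \mid x)
  = \frac{\hat{p}(x, \omega_i)}{\sum_j \hat{p}(x, \omega_j)}
  = \frac{k_i}{k}
```

Under this estimate, choosing the class with the largest count k_i is the Bayes decision rule applied to the estimated posteriors. The estimate is expected to improve as n grows, since the box can shrink while still containing many rows.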
42
LOCUS
R(f1, f2, ω)
SELECT ω, count(*)
FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
  AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
What if a feature, say f2, is categorical (e.g., sex)?
43
LOCUS
R(f1, f2, ω)
SELECT ω, count(*)
FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
  AND f2 = x2
GROUP BY ω
What if a feature, say f2, is categorical (e.g., sex)?
Not a problem, since generally in practice
  • Combinations of categorical and numeric features
  • Categorical features have small domains

Hence, they do not contribute to sparsity
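The mixed case can be sketched the same way: a range predicate on the numeric feature and an equality predicate on the categorical one. The schema, the values "F" and "M", the column name `label`, and the helper `classify_mixed` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: f1 numeric, f2 categorical, label = class attribute
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        (6.5, "F", "a"), (7.0, "F", "a"), (7.5, "F", "b"),
        (7.0, "M", "b"), (7.2, "M", "b"), (2.0, "F", "b"),
    ],
)

def classify_mixed(conn, x1, x2, d1):
    """Range predicate on numeric f1, equality on categorical f2."""
    rows = conn.execute(
        """
        SELECT label, count(*) FROM R
        WHERE f1 >= ? AND f1 <= ? AND f2 = ?
        GROUP BY label
        """,
        (x1 - d1, x1 + d1, x2),
    ).fetchall()
    return dict(rows)

counts = classify_mixed(conn, x1=7.0, x2="F", d1=1.0)
print(counts)
```

No neighborhood width is needed for f2: the equality predicate keeps only the rows that share the query's category.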
46
Parallel Execution
R = R1 ∪ R2 ∪ R3 ∪ R4
47
Parallel Execution
Count: distributive function
Total: ω1: 23, ω2: 4
Partial counts: (ω1: 7, ω2: 1), (ω1: 5, ω2: 2), (ω1: 6, ω2: 0), (ω1: 5, ω2: 1)
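Because count is distributive, the global result is the sum of the per-partition results. The sketch below merges the partial counts shown on the slide. The class names `w1` and `w2` are placeholders, and the pairing of the numbers into four partitions is inferred from the order in which they appear.

```python
from collections import Counter

# Partial counts, one Counter per partition R1..R4 (pairing is an assumption)
partials = [
    Counter({"w1": 7, "w2": 1}),
    Counter({"w1": 5, "w2": 2}),
    Counter({"w1": 6, "w2": 0}),
    Counter({"w1": 5, "w2": 1}),
]

# The main server only adds up small per-class counters
total = Counter()
for partial in partials:
    total.update(partial)

print(dict(total))
```

Each partition returns one number per class, which is in line with the small network traffic and lightweight merge step the deck lists as benefits.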
48
Parallel Execution
  • Small network traffic
  • Load balancing
  • Lightweight operations on the main server

Partial counts: (ω1: 7, ω2: 1), (ω1: 5, ω2: 2), (ω1: 6, ω2: 0), (ω1: 5, ω2: 1)
51
Experimental Evaluation
  • LOCUS vs. DTs and NNs (Weka)
  • Synthetic datasets
    • Ten functions [Agrawal et al., IEEE TKDE 1993]
    • D = 9
    • N ∈ [5×10^3, 5×10^6]
  • Real-world datasets
    • UCI Repository

52
Experimental Evaluation
Classification error rate (synthetic datasets, N = 5×10^4)
53
Experimental Evaluation
Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
54
Experimental Evaluation
Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
55
Experimental Evaluation
Classification error rate (real-world datasets)
56
Experimental Evaluation
Effect of dataset size on classification error rate (dataset CovType, N ∈ [5×10^3, 5×10^5])
59
Conclusions & Future Work
  • LOCUS
    • Lazy (complex/dynamic datasets and models)
    • Efficient (based on simple SQL queries)
    • Reliable (converging to optimal)
    • Parallelizable

60
Conclusions & Future Work
  • Similar techniques for
    • feature selection
    • regression
  • Implementation of a parallel version

61
Questions?
62
Thank you!