Efficient Density-Based Clustering of Complex Objects

- Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
- University of Munich, Institute for Computer Science
- Brighton, UK
- November 01-04, 2004

Outline

- Density-Based Clustering
  - Core Object, Density-Reachability
  - DBSCAN, OPTICS
- Clustering of Complex Objects
- Experimental Evaluation

Data Mining

- Larger and larger amounts of data collected automatically
- Too large for humans to analyze manually
- Tools to assist analysis necessary → KDD / Data Mining

[Figures: Hubble Space Telescope, Telecommunication Data, Market-Basket Data]

Clustering

- Clustering
  - Efficiently grouping the database into sub-groups (clusters) such that
  - similarity within clusters maximized
  - similarity between clusters minimized
- Flat Clustering
  - one level of clusters
  - e.g. density-based clustering algorithm DBSCAN (KDD 96)
- Hierarchical Clustering
  - nested clusters
  - e.g. density-based clustering algorithm OPTICS (SIGMOD 99)

Density-Based Clustering I

- Parameters
  - range ε and minimal weight MinPts
- Definition: core object
  - q is a core object if |rangeQuery(q, ε)| ≥ MinPts
- Definition: directly density-reachable
  - p is directly density-reachable from q if q is a core object and p ∈ rangeQuery(q, ε)
- Definition: density-reachable
  - density-reachable is the transitive closure of directly density-reachable
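The three definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes points in Euclidean space, a linear-scan range query, and that an object counts as a member of its own ε-range. All function names are hypothetical.

```python
from collections import deque
from math import dist


def range_query(db, q, eps):
    """Indices of all objects within distance eps of db[q] (q itself included)."""
    return [i for i, o in enumerate(db) if dist(db[q], o) <= eps]


def is_core(db, q, eps, min_pts):
    """q is a core object if its eps-range contains at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts


def density_reachable(db, q, eps, min_pts):
    """All objects density-reachable from q: the transitive closure of
    'directly density-reachable', expanded only through core objects."""
    if not is_core(db, q, eps, min_pts):
        return set()
    reached, queue = {q}, deque([q])
    while queue:
        c = queue.popleft()              # c is always a core object here
        for p in range_query(db, c, eps):
            if p not in reached:
                reached.add(p)
                if is_core(db, p, eps, min_pts):
                    queue.append(p)      # only core objects extend the chain
    return reached


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
cluster = density_reachable(points, 0, eps=1.0, min_pts=3)
```

In this toy data set the border point `(2, 1)` is reached from the core object `(1, 1)` but is not a core object itself, so nothing is reachable from it.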

Density-Based Clustering II

- Core Idea of Hierarchical Cluster Ordering
  - Order the objects linearly such that objects of a cluster are adjacent in the ordering.
- Definition: core-distance
- Definition: reachability-distance

[Figure: object o with its ε-range and core-distance(o) for MinPts = 5, and a second object p]
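The formulas behind the two definitions did not survive on this slide. The sketch below uses the usual OPTICS formulation: the core-distance of o is the distance to its MinPts-th nearest neighbor if o is a core object and undefined otherwise, and the reachability-distance of p with respect to o is the larger of core-distance(o) and the distance between o and p. It assumes Euclidean points, counts o as its own neighbor, and uses `inf` for "undefined". The function names are hypothetical.

```python
from math import dist, inf


def core_distance(db, o, eps, min_pts):
    """Distance from o to its min_pts-th nearest neighbor (o counts as its own
    neighbor), or inf if o is not a core object with respect to eps."""
    d = sorted(dist(o, x) for x in db)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return inf
    return d[min_pts - 1]


def reachability_distance(db, p, o, eps, min_pts):
    """max(core-distance(o), dist(o, p)); inf if o is not a core object."""
    return max(core_distance(db, o, eps, min_pts), dist(o, p))


db = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
cd = core_distance(db, (0, 0), eps=5, min_pts=3)
near = reachability_distance(db, (1, 0), (0, 0), eps=5, min_pts=3)
far = reachability_distance(db, (3, 0), (0, 0), eps=5, min_pts=3)
```

Objects closer to o than its core-distance all receive the same reachability-distance, which smooths the reachability values inside a dense region.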

OPTICS Algorithm

[Figure sequence: OPTICS run on example objects labeled A through R; the reachability plot (axis "reach", marked value 44) grows as objects are processed]

- Seedlist contents in successive steps:
  - (B, 40) (I, 40)
  - (I, 40) (C, 40)
  - (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
  - (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
  - (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
  - empty
- Resulting cluster ordering: A B I J L M K N R P C D F G E H
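The seedlist mechanics shown in this animation can be sketched as follows. This is a simplified OPTICS loop under stated assumptions, not the code behind the slides: it uses Euclidean points, a linear-scan neighborhood, a binary heap with lazy deletion as the seedlist, and counts each object as its own neighbor. All names are hypothetical.

```python
import heapq
from math import dist, inf


def optics(db, eps, min_pts):
    """Return the cluster ordering as a list of (index, reachability)."""
    n = len(db)
    processed = [False] * n
    reach = [inf] * n
    ordering = []

    def neighbors(i):
        ds = ((dist(db[i], db[j]), j) for j in range(n))
        return [(d, j) for d, j in ds if d <= eps]

    def core_dist(nbrs):
        ds = sorted(d for d, _ in nbrs)
        return ds[min_pts - 1] if len(ds) >= min_pts else inf

    for start in range(n):
        if processed[start]:
            continue
        seedlist = [(inf, start)]            # min-heap of (reachability, object)
        while seedlist:
            r, o = heapq.heappop(seedlist)
            if processed[o]:
                continue                     # stale entry: o was taken earlier
            processed[o] = True
            ordering.append((o, r))
            nbrs = neighbors(o)
            cd = core_dist(nbrs)
            if cd == inf:
                continue                     # o is not a core object
            for d, p in nbrs:
                new_r = max(cd, d)           # reachability of p from o
                if not processed[p] and new_r < reach[p]:
                    reach[p] = new_r
                    heapq.heappush(seedlist, (new_r, p))
    return ordering


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
order = optics(data, eps=2.0, min_pts=3)
```

Each cluster shows up as a run of small reachability values in `order`, separated by an `inf` entry where the seedlist ran empty and a new start object was picked.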

Outline

- Foundations of Density-Based Clustering
  - Core Object, Density-Reachability
  - DBSCAN, OPTICS
- Clustering of Complex Objects
  - Direct Integration of the Multi-Step Query Processing Paradigm
- Experimental Evaluation

Complex Objects

[Figure: complex objects, complex models, complex distance measure]

Single-Step Clustering Approach

Density-based clustering algorithms, like DBSCAN and OPTICS

- Performance Problems
  - For each database object q, we perform one range query.
  - Expensive exact distance computation do(o,q) for each object o of the database, independent of the ε-range

[Figure: (1) Query Q(q,ε), (2) Result R(q,ε)]

Multi-Step Query Processing

- Multi-Step Similarity Search
  - Range Queries (Faloutsos et al. 94)
  - k-Nearest Neighbor Queries (Korn et al. 96)
  - Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
- No False Drops → Lower-Bounding Property
  - filter distance ≤ object distance, i.e. df(o,q) ≤ do(o,q)
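A short worked argument shows why the lower-bounding property rules out false drops in a range query. The set names C(q,ε) for the filter candidates and R(q,ε) for the exact result follow the notation used on the neighboring slides. The derivation is the standard one and is given here only as a sketch.

```latex
\begin{align*}
C(q,\varepsilon) &= \{\, o \mid d_f(o,q) \le \varepsilon \,\}, \qquad
R(q,\varepsilon) = \{\, o \mid d_o(o,q) \le \varepsilon \,\} \\
o \in R(q,\varepsilon)
  &\;\Rightarrow\; d_f(o,q) \le d_o(o,q) \le \varepsilon
  \;\Rightarrow\; o \in C(q,\varepsilon) \\
\text{hence}\quad R(q,\varepsilon) &\subseteq C(q,\varepsilon)
\end{align*}
```

Every true result therefore survives the filter step, and the refinement step only has to discard candidates, never recover missed objects.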

Traditional Multi-Step Clustering Approach

Density-based clustering algorithms, like DBSCAN and OPTICS

- Performance Problems
  - For each database object q, we perform one range query (1).
  - The range query is first performed on the filter information (2,3).
  - One expensive exact distance computation do(o,q) for each object o of the candidate set C(q,ε) is performed (4). This refinement step is very expensive for non-selective filters or high ε values.
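The traditional filter/refine range query can be sketched as below, together with a count of the exact distance computations it triggers. The distance functions are stand-ins chosen for illustration: `d_o` is the Euclidean distance and `d_f` looks only at the first coordinate, which lower-bounds it. Neither is taken from the slides.

```python
from math import dist


def multi_step_range_query(db, q, eps, d_f, d_o):
    """Traditional filter/refine range query.
    Returns (result, number of exact distance computations)."""
    candidates = [o for o in db if d_f(o, q) <= eps]       # filter step
    result = [o for o in candidates if d_o(o, q) <= eps]   # refinement step
    return result, len(candidates)


def d_o(a, b):
    """Stand-in for the expensive exact distance (here: Euclidean)."""
    return dist(a, b)


def d_f(a, b):
    """Stand-in filter: difference in the first coordinate only, which can
    never exceed the Euclidean distance (lower-bounding property)."""
    return abs(a[0] - b[0])


db = [(0, 0), (1, 5), (2, 0), (9, 9)]
result, n_exact = multi_step_range_query(db, (0, 0), 3.0, d_f, d_o)
_, n_exact_large_eps = multi_step_range_query(db, (0, 0), 100.0, d_f, d_o)
```

With a large ε every object becomes a candidate, so the refinement step degenerates into a full scan. This is the weakness the slide points out for non-selective filters and high ε values.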

Integrated Multi-Step Clustering Approach

Extended density-based clustering algorithms, like DBSCAN and OPTICS

- Proposed Solution
  - For each database object q, we perform one range query on the filter information (1,2).
  - Only those exact distances do(o,q) are computed which are necessary to determine the core-properties of q (3).
  - A beneficial heuristic for determining the reachability-properties is applied which saves on exact distance computations (4).
- Direct integration of the multi-step query processing paradigm into the clustering algorithm
- postponing expensive exact distance computations as long as possible

[Figure: (1) Query Q(q,ε) using df; (2) Candidates C(q,ε); (3) computation of do(o,q) for core-properties of q; (4) postponed computations of do(o,q) for reachability-properties of o]

Integrated Multi-Step Clustering Approach

Determination of Core-Properties

[Figure: sorted distance list (filter information) for query object Q, with MinPts = 3 and ε = 75]

- Filter distances: df(K,Q) = 10, df(Z,Q) = 12, df(R,Q) = 18, df(M,Q) = 55, df(A,Q) = 58, df(I,Q) = 65
- Exact distances computed: do(K,Q) = 53, do(Z,Q) = 69, do(R,Q) = 49 (the figure also shows do(R,Q) = 53)

- First, we carry out a range query on the filter for each query object Q.
- Second, we order the resulting candidate set in ascending order according to the filter distance.
- Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
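The three steps can be sketched with the numbers from this slide. This is a hedged illustration rather than the authors' code. It assumes that Q counts as its own neighbor (filter and exact distance 0) and that do(R,Q) = 49, one of the two values shown for R. The function name and the dictionary of exact distances are hypothetical.

```python
import heapq


def core_distance_multistep(candidates, exact_dist, min_pts, eps):
    """candidates: (filter_distance, object) pairs from the filter range query.
    Returns (core-distance or None, number of exact distance computations)."""
    nearest = []      # negated max-heap holding the min_pts smallest exact distances
    computed = 0
    for df, obj in sorted(candidates):           # ascending filter distance
        if len(nearest) == min_pts and df >= -nearest[0]:
            break     # df lower-bounds do, so no remaining object can be closer
        d = exact_dist(obj)
        computed += 1
        if len(nearest) < min_pts:
            heapq.heappush(nearest, -d)
        elif d < -nearest[0]:
            heapq.heapreplace(nearest, -d)
    if len(nearest) < min_pts or -nearest[0] > eps:
        return None, computed                    # Q is not a core object
    return -nearest[0], computed


# Filter distances from the slide; the entry for Q itself is an assumption.
filter_list = [(0, "Q"), (10, "K"), (12, "Z"), (18, "R"),
               (55, "M"), (58, "A"), (65, "I")]
# Only these exact distances are known; a lookup of M, A or I would fail.
exact = {"Q": 0, "K": 53, "Z": 69, "R": 49}

core_dist, n_exact = core_distance_multistep(
    filter_list, exact.__getitem__, min_pts=3, eps=75)
```

Under these assumptions the walk stops at M: its filter distance 55 is already at least as large as the current third-smallest exact distance 53, so the exact distances to M, A and I are never requested.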

Integrated Multi-Step Clustering Approach

Extended Seedlist

- Data Structure: List of Lists
  - Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
  - The first elements are ordered ascendingly.
  - Each list of predecessor objects is ordered ascendingly.

[Figure sequence, regrouped from the slide labels: the result list of the current query object Q is inserted into the extended seedlist]

- Extended seedlist before the insertion:
  - R: df(R,B) = 18, df(R,D) = 34
  - K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, df(K,C) = 55
  - M: do(M,C) = 65
- Result list of Q: do(K,Q) = 53, do(Z,Q) = 69, do(R,Q) = 53, df(M,Q) = 55, df(A,Q) = 58, df(I,Q) = 65
- Extended seedlist after the insertion (df(K,C) = 55 no longer appears):
  - R: df(R,B) = 18, df(R,D) = 34, do(R,Q) = 53
  - K: df(K,B) = 20, df(K,L) = 30, df(K,G) = 43, do(K,Q) = 53
  - M: df(M,Q) = 55, do(M,C) = 65
  - A: df(A,Q) = 58
  - I: df(I,Q) = 65
  - Z: do(Z,Q) = 69

Integrated Multi-Step Clustering Approach

Determination of Next Query Object

- Data Structure: List of Lists
  - Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.

[Figure sequence: extended seedlist while the next query object is determined; the exact distances do(R,B) = 44 and do(K,B) = 25 appear next to the filter distances df(R,B) = 18 and df(K,B) = 20]
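One way to read the extended seedlist and the determination of the next query object is sketched below, using the distance labels from the slides. This is a simplified interpretation, not the authors' data structure. It stores plain distances as in the figures and ignores both the ε cut-off and the predecessor's core-distance. The pruning rule is inferred from the figures, where df(K,C) = 55 disappears once do(K,Q) = 53 is inserted. The class and method names are hypothetical.

```python
import bisect


class ExtendedSeedlist:
    """List of lists: one ascending list of (distance, is_exact, predecessor)
    entries per object; is_exact=False marks a filter distance df."""

    def __init__(self):
        self.lists = {}

    def insert(self, obj, distance, is_exact, pred):
        entries = self.lists.setdefault(obj, [])
        bisect.insort(entries, (distance, is_exact, pred))
        # Entries behind the first exact distance can never become the
        # smallest value for this object, so they are dropped.
        for i, entry in enumerate(entries):
            if entry[1]:
                del entries[i + 1:]
                break

    def next_object(self, exact_dist):
        """Return (object, predecessor, distance) for the smallest exact
        distance, refining a filter distance only when it reaches the front."""
        while self.lists:
            obj = min(self.lists, key=lambda o: self.lists[o][0])
            distance, is_exact, pred = self.lists[obj][0]
            if is_exact:
                del self.lists[obj]
                return obj, pred, distance
            self.lists[obj].pop(0)               # replace df by the exact do
            self.insert(obj, exact_dist(obj, pred), True, pred)
        return None


sl = ExtendedSeedlist()
for obj, d, is_exact, pred in [
    ("R", 18, False, "B"), ("R", 34, False, "D"),
    ("K", 20, False, "B"), ("K", 30, False, "L"),
    ("K", 43, False, "G"), ("K", 55, False, "C"),
    ("M", 65, True, "C"),
    # result list of the current query object Q
    ("K", 53, True, "Q"), ("Z", 69, True, "Q"), ("R", 53, True, "Q"),
    ("M", 55, False, "Q"), ("A", 58, False, "Q"), ("I", 65, False, "Q"),
]:
    sl.insert(obj, d, is_exact, pred)

k_list = list(sl.lists["K"])
exact = {("R", "B"): 44, ("K", "B"): 25}     # exact distances shown on the slides
nxt = sl.next_object(lambda o, p: exact[(o, p)])
```

In this reading only two exact distances are computed before the next query object is known: do(R,B) = 44 and do(K,B) = 25. All other filter entries stay unrefined, which is the postponement the slides describe.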

Experimental Evaluation

Test Data Sets

- Graphs representing images (DAWAK 03)
  - Expensive exact distance function
  - Selective filter used
- High-dimensional feature vectors representing CAD objects (DASFAA 03)
  - not very selective filter used (Euclidean norm)

Experimental Evaluation

DBSCAN

[Figure: two runtime plots, "Feature vectors" and "Graphs"; x-axis: no. of objects, y-axis: runtime (sec.)]

- Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when the new integrated multi-step query processing approach is used.
- The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε-values are used.
- When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

Experimental Evaluation

OPTICS

[Figure: runtime plots for "Feature vectors" and "Graphs"; x-axis: no. of objects, y-axis: runtime (sec.)]

- When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
- For high ε-values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.

Conclusions

- Summary: Efficient Density-Based Clustering of Complex Objects
  - direct integration of the multi-step query processing paradigm into the clustering algorithm
  - MinPts-nearest neighbor queries on the exact information
  - postponing expensive exact distance computations as long as possible
- Future Work
  - integration of the multi-step query processing paradigm into other data mining algorithms