1
PENS: An Algorithm for Density-Based Clustering in Peer-to-Peer Systems

Mei Li
Wang-Chien Lee
Anand Sivasubramaniam
Pennsylvania State University
June 2006

Guanling Lee, National Dong Hwa University
at InfoScale06
2
Roadmap

Introduction
PENS
Analysis
Summary

3
Motivation

Huge amounts of data are distributed in large-scale dynamic networks (e.g., P2P systems)
Knowledge hidden in the data is valuable for various applications, e.g., market analysis, smart query processing, and scientific exploration
Clustering: group a set of data objects into clusters of similar data objects
Applications: pattern recognition, spatial data analysis, market analysis, document classification, and access pattern discovery on the WWW

4
Introduction: Peer-to-peer (P2P) systems

Different from client-server computing model
No central control
Each peer has equal functionality: a peer is both a server and a client
Dynamic: peers join and leave
Large scale
Existing systems
Unstructured overlays: Gnutella
Structured overlays: CAN (Ratnasamy01), Chord (Stoica01), SSW (Li04)

5
P2P systems: Unstructured overlays
Search: flooding with TTL
Simple and maintenance-free, but incurs excessive communication cost
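
A minimal sketch of TTL-limited flooding, to make the communication cost concrete. The overlay is assumed to be an adjacency dictionary, and the function name `flood` is hypothetical. As a simplification, each peer forwards to all of its neighbors, including the one it heard the query from.

```python
from collections import deque

def flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}
    queue = deque([(source, ttl)])
    messages = 0
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward
        for neighbor in graph[peer]:
            messages += 1  # every forwarded copy costs one message
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return seen, messages
```

Even on a small overlay the message count grows faster than the number of peers reached, because duplicate copies are still sent to peers that have already seen the query.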
6
P2P systems: Structured overlays

Each data object is assigned a key; each peer is assigned an ID
A data object is mapped to the peer whose ID is close to its key
Search: efficient
[Figure: key space with corners labeled (0,0) and (1,1)]
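
The key-to-peer mapping can be sketched as follows. This is an illustration under assumptions, not the scheme of any particular overlay: the one-dimensional identifier space, its size `KEY_SPACE`, the SHA-1 based `key_of`, and the absolute-difference notion of "close" are all hypothetical choices.

```python
import hashlib

KEY_SPACE = 2 ** 16  # assumed size of the identifier space

def key_of(name):
    """Derive a key in [0, KEY_SPACE) from a data object's name (assumed scheme)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")

def responsible_peer(key, peer_ids):
    """Return the peer ID numerically closest to the key."""
    return min(peer_ids, key=lambda pid: abs(pid - key))
```

For example, `responsible_peer(100, [10, 90, 500])` selects peer 90. Because every peer can compute the same mapping, a lookup can be routed toward the responsible peer instead of being flooded.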
7
Clustering in centralized systems

Well-known algorithms
Partition-based: k-means (MacQueen67)
Hierarchical: BIRCH (Zhang96), CURE (Guha98)
Grid-based: WaveCluster (Sheikholeslami98), STING (Wang97)
Density-based: DBSCAN (Ester96)
Model-based: AutoClass (Cheeseman96)
Applicable to P2P systems?
They require data to be transmitted to a central site: not feasible (no central server, costly communication, privacy concerns)
They aim to minimize disk access, whereas communication cost is the main concern in P2P systems

8
Clustering in distributed systems

Two steps
Clustering on local sites
Combine local results
Forward to a central site (Johnson99, Samatova02, Januzaj04, Xu99)
Flood to the whole system (Bandyopadhyay05, Dhillon99, Forman00)
Applicable to P2P systems?
No central site
System-wide flooding is not feasible

9
Our approach

Clustering on local sites
Hierarchical cluster assembling

Peer density-based clustering (PENS)
Follows the design principle of DBSCAN
Illustrated on top of CAN

10
Background: DBSCAN

Density-based clustering treats regions densely populated by data as clusters
Efficient on very large datasets
Discovers clusters with arbitrary shapes
Insensitive to noise
Basic idea
If the neighborhood of a given radius (ε) of a data object has a cardinality of at least a preset threshold (T), this data object belongs to a cluster

11
Background: DBSCAN (cont.)

Clustering algorithm
Start from an arbitrary data object and expand its cluster.
If a cluster cannot be expanded any further, move on to another data object that is not yet clustered.
Terminate when no cluster can be expanded and all data objects are either clustered or labeled as noise.
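
The expansion loop described above can be sketched as a small centralized DBSCAN. This is a minimal sketch, not an optimized implementation: it assumes points are coordinate tuples, uses Euclidean distance, scans all points for each neighborhood query, and counts an object as part of its own neighborhood. The names `region_query` and `dbscan` are illustrative.

```python
from math import dist

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within radius eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, threshold):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already clustered or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < threshold:
            labels[i] = NOISE  # may later be absorbed as a border object
            continue
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:  # expand the cluster until it cannot grow
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object reached from a dense one
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= threshold:
                frontier.extend(j_neighbors)  # j is dense: keep expanding
        cluster_id += 1
    return labels
```

With `points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]`, `eps = 1.5` and `threshold = 3`, the four nearby points form one cluster and the distant point is labeled as noise.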

12
Issues in PENS

Hierarchy formation
How to form a hierarchy for cluster assembly?
Cluster Expansion Checking
How to determine and represent information
necessary for cluster assembly?
Cluster Merging
How to merge clusters along the hierarchy?

13
Hierarchy formation
14
Hierarchy Formation (VPtree)
[Figure: virtual partition tree (VPtree) over binary zone IDs, from 0 and 1 at the top level down to 11110 and 11111; sibling subtrees are buddy regions; the leaves are the zones of peers A to N]
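
One way to read the tree of binary zone IDs is that two regions are buddies when their IDs differ only in the last bit, and that merging them yields the region named by their common prefix. The sketch below derives a bottom-up merge order under that assumption; the function name `merge_schedule` is hypothetical and the rule is inferred from the figure labels, not stated in the slides.

```python
def merge_schedule(leaf_ids):
    """Bottom-up list of (region, buddy, parent) merges over binary zone IDs."""
    regions = set(leaf_ids)
    schedule = []
    # Deepest regions first, so children are merged before their parents.
    for depth in range(max(map(len, leaf_ids)), 0, -1):
        candidates = sorted(r for r in regions
                            if len(r) == depth and r.endswith("0"))
        for rid in candidates:
            buddy = rid[:-1] + "1"  # assumed buddy: same prefix, last bit flipped
            if buddy in regions:
                parent = rid[:-1]
                schedule.append((rid, buddy, parent))
                regions -= {rid, buddy}
                regions.add(parent)
    return schedule
```

For the leaf zones `["00", "010", "011", "1"]` this gives three merges: 010 with 011, then 00 with 01, then 0 with 1, which is one merge per internal tree node.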
15
Hierarchy formation (cont.)

Each leaf node is associated with one zone, which is managed by a peer.
Question
Which peer should act as the internal tree node (arbiter) to merge the clusters in two buddy regions?

[Figure: VPtree in which each internal node is labeled with the peer (among A to N) acting as arbiter]
16
Cluster Expansion Check

Observation
If all data objects of a local cluster within the zone are at least ε (the neighborhood radius) away from the boundary of the zone, this cluster is also a global cluster (non-expandable).
If a local cluster within a zone can be expanded to include data objects outside of the zone (expandable), there is at least one data object of this local cluster in the ε-inner boundary whose ε-neighborhood contains some data objects in the ε-outer boundary.

[Figure: zones A and B, showing the ε-inner boundary and the ε-outer boundary of a zone]

Algorithm
1. Region query: obtain the data in the ε-outer boundary
2. Local computation: examine the data in the ε-inner boundary
Optimization: each peer stores the index of the data mapped to its ε-outer boundary, which eliminates the need for the range query
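
The two observations can be turned into a local test. The sketch below is an illustration under assumptions: zones are axis-aligned rectangles in two dimensions, the data in the outer boundary has already been obtained (step 1), and the function names are hypothetical. It checks the condition stated above for a cluster to be expandable and does not decide how the expansion is then carried out.

```python
from math import dist

def distance_to_boundary(p, zone):
    """Distance from a point inside a rectangular zone to the zone's boundary."""
    (x0, y0), (x1, y1) = zone
    return min(p[0] - x0, x1 - p[0], p[1] - y0, y1 - p[1])

def may_expand(cluster_points, outer_points, zone, eps):
    """True if some cluster object in the inner boundary has an
    outer-boundary object within radius eps."""
    # Objects at least eps away from the boundary cannot reach outside the zone.
    inner = [p for p in cluster_points if distance_to_boundary(p, zone) < eps]
    return any(dist(p, q) <= eps for p in inner for q in outer_points)
```

A cluster whose objects all lie at least eps from the boundary yields an empty inner set, so the test returns False without looking at any remote data.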
17
Cluster expansion check
[Figure: cases for an object p with T = 5: noise, expansion, new cluster, expansion merging]

Local cluster or noise: localClusterID, ZoneID
Cluster expansion set (CES)
Present coverage (Pcoverage)
Expandable coverage (Ecoverage)

18
Cluster merging

Merge regions A and B, and store the result back in B
Merge one or more clusters in A with a cluster in B
For each local cluster i in A
S = i.Ecoverage ∩ j.Pcoverage (j is a cluster in B)
If S is non-empty
merge i and j
The above procedure does not handle the following case
A cluster in A can be merged with one or more clusters in B
For each local cluster i in B
S = i.Ecoverage ∩ j.Pcoverage (j is another cluster in B)
If S is non-empty
merge i and j
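
The two passes above can be sketched as follows. Several details are assumptions rather than part of the slides: coverages are modeled as Python sets of hashable items, a merged cluster takes the union of both coverages, and a cluster of A that matches nothing in B is carried over unchanged. The names `Cluster`, `absorb`, and `merge_regions` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    ids: set        # (localClusterID, ZoneID) pairs folded into this cluster
    pcoverage: set  # present coverage
    ecoverage: set  # expandable coverage

def absorb(target, other):
    """Fold `other` into `target` (assumed: coverages are unioned)."""
    target.ids |= other.ids
    target.pcoverage |= other.pcoverage
    target.ecoverage |= other.ecoverage

def merge_regions(region_a, region_b):
    merged = list(region_b)  # the result is stored back in B
    # Pass 1: merge each cluster of A into a cluster of B that it can expand into.
    for i in region_a:
        for j in merged:
            if i.ecoverage & j.pcoverage:
                absorb(j, i)
                break
        else:
            merged.append(i)  # no match: carry the cluster over unchanged
    # Pass 2: clusters now in B may have become mergeable with each other.
    changed = True
    while changed:
        changed = False
        for x in range(len(merged)):
            for y in range(x + 1, len(merged)):
                a, b = merged[x], merged[y]
                if (a.ecoverage & b.pcoverage) or (b.ecoverage & a.pcoverage):
                    absorb(a, b)
                    del merged[y]
                    changed = True
                    break
            if changed:
                break
    return merged
```

The second pass is what handles a cluster of A that bridges two clusters of B: after it is absorbed into the first one, the enlarged expandable coverage overlaps the second, and the two are merged.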

19
Summary of PENS

Although a hierarchy is used, no single peer is overloaded
Each node receives 2 messages and sends 1 message
The message size does not increase monotonically

20
Analysis of PENS

Message complexity: O(2k N)
With optimization: O(N)

21
Conclusion

We propose a fully distributed clustering
algorithm, PENS, that adopts hierarchical cluster
assembling.
No flooding, no central site
Analysis demonstrates the efficiency of PENS.

22
Future work

Conduct extensive performance evaluation
Explore other clustering algorithms and data
mining tasks in P2P systems

23
References

[Bandyopadhyay05] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta. Clustering Distributed Data Streams in Peer-to-Peer Environments. Information Science Journal (In Press).
[Cheeseman96] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI/MIT Press, 1996.
[Dhillon99] I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 245-260, August 1999.
[Ester96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of Knowledge Discovery in Database (KDD), pages 226-231, 1996.
[Forman00] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34-38, 2000.
[Guha98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of SIGMOD, pages 73-84, June 1998.
[Januzaj04] E. Januzaj, H.-P. Kriegel, and M. Pfeifle. DBDC: Density based distributed clustering. In Proceedings of International Conference on Extending Database Technology (EDBT), pages 88-105, March 2004.
[Johnson99] E. L. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Proceedings of Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 221-244, August 1999.
[Li04] M. Li, W.-C. Lee, and A. Sivasubramaniam. Semantic small world: An overlay network for peer-to-peer search. In Proceedings of International Conference on Network Protocols (ICNP), pages 228-238, October 2004.

24
References (cont.)

[MacQueen67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, 1967.
[Ratnasamy01] S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM, pages 161-172, August 2001.
[Samatova02] N. F. Samatova, G. Ostrouchov, A. Geist, and A. V. Melechko. RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Distributed and Parallel Databases, 11(2):157-180, 2002.
[Sheikholeslami98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of VLDB, pages 428-439, August 1998.
[Stoica01] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM, pages 149-160, August 2001.
[Xu99] X. Xu, J. Jager, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263-290, 1999.
[Wang97] W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of VLDB, pages 186-195, August 1997.
[Zhang96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of SIGMOD, pages 103-114, June 1996.

26
Backup slides: DBSCAN definitions

p is directly density reachable from q wrt. ε and T in D ($p >_D q$) if 1) $p \in N_\varepsilon(q)$ and 2) $|N_\varepsilon(q)| \ge T$
p is density reachable from q wrt. ε and T in D if there is a chain of objects $p_1, p_2, \ldots, p_n$ such that $p_1 = q$, $p_n = p$, and $p_{i+1} >_D p_i$
p and q are density connected wrt. ε and T in D if there is an object o in D such that both p and q are density reachable from o in D.

[Figure: example with T = 5]
27
Backup slides: DBSCAN definitions (cont.)