Title: PENS: An Algorithm for Density-Based Clustering in Peer-to-Peer Systems
1PENS: An Algorithm for Density-Based Clustering in Peer-to-Peer Systems
- Mei Li
- Wang-Chien Lee
- Anand Sivasubramaniam
- Pennsylvania State University
- Guanling Lee, National Dong Hwa University
- Presented at InfoScale06, June 2006
2Roadmap
- Introduction
- PENS
- Analysis
- Summary
3Motivation
- Huge amounts of data are distributed in large-scale dynamic networks (e.g., P2P systems)
- Knowledge hidden among the data is valuable for various applications
- Market analysis, smart query processing, scientific exploration
- Clustering
- Groups a set of data objects into clusters of similar data objects
- Applications
- Pattern recognition, spatial data analysis, market analysis, document classification, and access pattern discovery in the WWW
4Introduction: Peer-to-peer (P2P) systems
- Different from the client-server computing model
- No central control
- Each peer has equal functionality: a peer is both a server and a client
- Dynamic
- peer join/leave
- Large scale
- Existing systems
- Unstructured overlays
- Gnutella
- Structured overlays
- CAN (Ratnasamy01)
- CHORD (Stoica01)
- SSW (Li04)
5P2P systems: Unstructured overlays
- Search: flooding with TTL
- Simple, maintenance free
- Excessive communication cost
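The trade-off on this slide can be made concrete with a small simulation. The following is a minimal sketch, assuming the overlay is given as an adjacency dictionary; the function name `flood_search` and the five-peer example overlay are hypothetical and not taken from any real Gnutella implementation. It counts every forwarded copy of the query, including duplicates that are dropped on arrival.

```python
from collections import deque

def flood_search(neighbors, start, has_item, ttl):
    """Flood a query from `start`; return (peers holding the item, messages sent)."""
    visited = {start}
    hits = [start] if has_item(start) else []
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for nxt in neighbors[peer]:
            messages += 1  # every forwarded copy costs one message
            if nxt in visited:
                continue  # duplicate copy is dropped on arrival
            visited.add(nxt)
            if has_item(nxt):
                hits.append(nxt)
            queue.append((nxt, remaining - 1))
    return hits, messages

# Hypothetical overlay: only peer E holds the requested item.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
           "D": ["B", "C", "E"], "E": ["D"]}
print(flood_search(overlay, "A", lambda p: p == "E", ttl=2))
print(flood_search(overlay, "A", lambda p: p == "E", ttl=3))
```

With a TTL of 2 the query never reaches E, and with a TTL of 3 it does, at the price of more messages. Most of those messages are duplicates, which is the communication cost the slide refers to.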
6P2P systems: Structured overlays
- Each data object is assigned a key; each peer is assigned an ID
- A data object is mapped to the peer whose ID is close to its key
- Search: efficient
[Figure: coordinate space with corners (0,0) and (1,1)]
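For a CAN-style overlay, "close to its key" can be read as "the peer whose zone contains the point derived from the key". The following is a minimal sketch under that reading. The zone layout, the peer names, and the hash-based `key_to_point` helper are assumptions for illustration; the slides do not say how keys are derived from data objects.

```python
import hashlib

def key_to_point(key):
    """Map a string key to a point in the unit square [0, 1) x [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big") / 2**64
    y = int.from_bytes(digest[8:16], "big") / 2**64
    return x, y

def owner_of(point, zones):
    """Return the peer whose half-open zone (x0, y0, x1, y1) contains the point."""
    x, y = point
    for peer, (x0, y0, x1, y1) in zones.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return peer
    raise LookupError("zones do not cover the point")

# Hypothetical partition of the unit square among three peers.
zones = {"A": (0.0, 0.0, 0.5, 1.0),
         "B": (0.5, 0.0, 1.0, 0.5),
         "C": (0.5, 0.5, 1.0, 1.0)}
print(owner_of((0.25, 0.9), zones))
print(owner_of(key_to_point("some-object"), zones))
```

Because every peer can compute the owner of a key locally, a lookup is routed toward one zone instead of being flooded.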
7Clustering in centralized systems
- Well known algorithms
- Partition-based
- k-means (MacQueen67)
- Hierarchical
- BIRCH (Zhang96), CURE (Guha98)
- Grid-based
- WaveCluster (Sheikholeslami98), STING (Wang97)
- Density-based
- DBSCAN (Ester96)
- Model-based
- AutoClass (Cheeseman96)
- Applicable to P2P systems?
- Require data to be transmitted to a central site
- Not feasible (no central server, costly communication, privacy concerns)
- Designed to minimize disk access
- Communication cost is the main concern in P2P systems
8Clustering in distributed systems
- Two steps
- Clustering on local sites
- Combine local results
- forward to a central site
- Johnson99, Samatova02, Januzaj04, Xu99
- flood to the whole system
- Bandyopadhyay05, Dhillon99, Forman00
- Applicable to P2P systems?
- No central site
- System-wide flooding is not feasible
9Our approach
- Clustering on local site
- Hierarchical cluster assembling
- Peer density-based clustering (PENS)
- follows the design principle of DBSCAN
- Illustrate on top of CAN
10Background: DBSCAN
- Density-based clustering treats regions densely populated by data as clusters
- Efficient on very large datasets
- Discovers clusters with arbitrary shapes
- Insensitive to noise
- Basic idea
- If the neighborhood of a given radius (ε) around a data object has a cardinality of at least a preset threshold (T), this data object belongs to a cluster
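The basic idea amounts to a neighborhood count. The following is a minimal sketch, assuming Euclidean distance and data objects given as coordinate tuples; the function names are illustrative. As in the usual DBSCAN convention, an object counts as a member of its own neighborhood.

```python
import math

def eps_neighborhood(data, p, eps):
    """All objects of `data` within distance eps of p (p itself included)."""
    return [q for q in data if math.dist(p, q) <= eps]

def is_core(data, p, eps, threshold):
    """True if the eps-neighborhood of p holds at least `threshold` objects."""
    return len(eps_neighborhood(data, p, eps)) >= threshold

data = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (5, 5)]
print(is_core(data, (0, 0), eps=1.0, threshold=4))   # dense region
print(is_core(data, (5, 5), eps=1.0, threshold=2))   # isolated object
```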
11Background: DBSCAN (cont.)
- Clustering algorithm
- Starting from an arbitrary data object, expand the cluster.
- If a cluster cannot be expanded any more, move on to another data object that is not yet clustered.
- Terminate when no data object can be expanded and all data objects are either clustered or labeled as noise.
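The loop described on this slide can be sketched in a few lines. This is a simplified, centralized version with a quadratic-time neighborhood scan, assuming Euclidean distance; it is meant to show the control flow, not the indexed implementation of (Ester96). The names `dbscan` and `NOISE` are illustrative.

```python
import math

NOISE = -1

def dbscan(data, eps, threshold):
    """Return one label per object: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(data)

    def region(i):
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    cluster_id = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue                      # already clustered or marked as noise
        seeds = region(i)
        if len(seeds) < threshold:
            labels[i] = NOISE             # may later be claimed as a border object
            continue
        labels[i] = cluster_id            # i is dense enough: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                      # expand until nothing more can be added
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border object joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = region(j)
            if len(nbrs) >= threshold:
                queue.extend(nbrs)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(points, eps=1.5, threshold=3))
```

The two squares come out as two clusters and the far-away object is labeled as noise.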
12Issues in PENS
- Hierarchy formation
- How to form a hierarchy for cluster assembly?
- Cluster Expansion Checking
- How to determine and represent the information necessary for cluster assembly?
- Cluster Merging
- How to merge clusters along the hierarchy?
13Hierarchy formation
14Hierarchy Formation (VPtree)
[Figure: VPtree over the zone partition. Regions are labeled with binary codes (0 and 1 at the top level, down to five-bit codes such as 11110 and 11111), a pair of regions is marked as buddy regions, and zones are labeled with peers A to N.]
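One way to read the buddy-region picture: if each zone carries a binary code, two regions are buddies when their codes differ only in the last bit, and merging them yields the region named by the shared prefix. The following is a minimal sketch under that assumption. The helper names and the five-zone example are hypothetical, and the sketch only derives the merge order; it does not decide which peer acts as arbiter.

```python
def buddy(code):
    """Code of the sibling region: same prefix, last bit flipped."""
    return code[:-1] + ("1" if code[-1] == "0" else "0")

def parent(code):
    """Code of the region obtained by merging `code` with its buddy."""
    return code[:-1]

def merge_schedule(leaf_codes):
    """Bottom-up list of (left, right, merged) steps for a set of leaf zone codes."""
    regions = set(leaf_codes)
    schedule = []
    for depth in range(max(len(c) for c in regions), 0, -1):
        for left in sorted(c for c in regions if len(c) == depth and c.endswith("0")):
            right = buddy(left)
            if right not in regions:
                raise ValueError(f"region {left} has no buddy {right}")
            schedule.append((left, right, parent(left)))
            regions -= {left, right}
            regions.add(parent(left))
        if any(len(c) == depth for c in regions):
            raise ValueError("a region at this depth was left without a buddy")
    return schedule

# Hypothetical zone codes; zone "11" has been split once more than the others.
for step in merge_schedule(["00", "01", "10", "110", "111"]):
    print(step)
```

The deepest buddies are merged first, and the final step merges regions "0" and "1" into the whole space (the empty code).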
15Hierarchy formation (cont.)
- Each leaf node is associated with one zone, which is managed by a peer.
- Question
- Which peer should act as the internal tree node (arbiter) to merge the clusters in two buddy regions?
[Figure: partition tree with nodes labeled by peers A to N]
16Cluster Expansion Check
- Observation
- If all data objects of a local cluster within the zone are at least ε away from the boundary of the zone, this cluster is also a global cluster (non-expandable).
- If a local cluster within a zone can be expanded to include data objects outside of the zone (expandable), there is at least one data object of this local cluster in the ε-inner boundary whose ε-neighborhood contains some data objects in the ε-outer boundary.
[Figure: ε-inner boundary and ε-outer boundary of a zone, with labels A and B]
- Algorithm
- 1. Region query: obtain the data in the ε-outer boundary
- 2. Local computation: examine the data in the ε-inner boundary
- Optimization: each peer stores the index of the data mapped to its ε-outer boundary, which eliminates the need for the range query
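The two observations translate into a local test. The following is a minimal sketch, assuming two-dimensional data, an axis-aligned rectangular zone, and Euclidean distance; the function names are illustrative. It follows the slide's wording literally and does not additionally check whether the boundary object is a core object.

```python
import math

def in_inner_boundary(p, zone, eps):
    """True if p lies less than eps away from the border of zone (x0, y0, x1, y1)."""
    x, y = p
    x0, y0, x1, y1 = zone
    return min(x - x0, x1 - x, y - y0, y1 - y) < eps

def is_expandable(cluster, outer_points, zone, eps):
    """True if some cluster object in the inner boundary has an outer-boundary
    object (obtained by the region query) within its eps-neighborhood."""
    for p in cluster:
        if not in_inner_boundary(p, zone, eps):
            continue  # at least eps from the border: cannot reach outside data
        if any(math.dist(p, q) <= eps for q in outer_points):
            return True
    return False

zone = (0.0, 0.0, 1.0, 1.0)
outer = [(1.02, 0.5)]                       # data just outside the zone
print(is_expandable([(0.5, 0.5), (0.4, 0.6)], outer, zone, eps=0.1))
print(is_expandable([(0.5, 0.5), (0.95, 0.5)], outer, zone, eps=0.1))
```

The first cluster stays well inside the zone and is already a global cluster; the second has an object near the border with outside data in reach, so it must be passed up for merging.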
17Cluster expansion check
[Figure: examples for T = 5 around an object p, labeled noise, expansion, new cluster, and expansion merging]
- Local cluster or noise: localClusterID, ZoneID
- Cluster expansion set (CES)
- Present coverage (Pcoverage)
- Expandable coverage (Ecoverage)
18Cluster merging
- Merge regions A and B, store the result back in B
- Merge one or more clusters in A with a cluster in B
- For each local cluster i in A
- S = i.Ecoverage ∩ j.Pcoverage (j is a cluster in B)
- If S is non-empty
- merge i and j
- The above procedure does not handle the following case
- A cluster in A can be merged with one or more clusters in B
- For each local cluster i in B
- S = i.Ecoverage ∩ j.Pcoverage (j is another cluster in B)
- If S is non-empty
- merge i and j
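Both passes of the merging step can be folded into one union-find pass. The following is a minimal sketch, assuming each coverage is represented as a set of hashable items (for example grid cells); the slides do not specify the representation. The `Cluster` class, the `merge_regions` name, and the choice to simply union the coverages of merged clusters are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    cid: str
    pcoverage: set   # present coverage
    ecoverage: set   # expandable coverage

def merge_regions(region_a, region_b):
    """Merge clusters of A and B whenever i.Ecoverage intersects j.Pcoverage
    for a cluster j in B; clusters linked through a common partner end up together."""
    clusters = list(region_a) + list(region_b)
    root = list(range(len(clusters)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    b_start = len(region_a)
    for i, ci in enumerate(clusters):
        for j in range(b_start, len(clusters)):
            if i != j and ci.ecoverage & clusters[j].pcoverage:
                root[find(i)] = find(j)

    groups = {}
    for idx, c in enumerate(clusters):
        groups.setdefault(find(idx), []).append(c)
    return [Cluster(cid="+".join(sorted(m.cid for m in members)),
                    pcoverage=set().union(*(m.pcoverage for m in members)),
                    ecoverage=set().union(*(m.ecoverage for m in members)))
            for members in groups.values()]

# Hypothetical coverages: a1 can expand into both b1 and b2, so all three merge.
A = [Cluster("a1", {1, 2}, {3, 7})]
B = [Cluster("b1", {3, 4}, {2}), Cluster("b2", {7, 8}, set()), Cluster("b3", {20}, set())]
print(sorted(c.cid for c in merge_regions(A, B)))
```

Using union-find covers the case the slide singles out: when one cluster in A touches several clusters in B, those clusters in B are merged with each other as well.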
19Summary of PENS
- Although a hierarchy is used, no single peer is overloaded
- Each node receives 2 messages and sends 1 message
- The message size is not monotonically increasing
20Analysis of PENS
- Message complexity O(2k N)
- Optimization O(N)
21Conclusion
- We propose a fully distributed clustering algorithm, PENS, that adopts hierarchical cluster assembling.
- No flooding, no central site
- Analysis demonstrates the efficiency of PENS.
22Future work
- Conduct extensive performance evaluation
- Explore other clustering algorithms and data
mining tasks in P2P systems
23References
- (Bandyopadhyay05) S. Bandyopadhyay, C. Gianella, U. Maulik, H. Kargupta, K. Liu, and S. Datta. Clustering Distributed Data Streams in Peer-to-Peer Environments. Information Science Journal (In Press).
- (Cheeseman96) P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI/MIT Press, 1996.
- (Dhillon99) I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 245-260, August 1999.
- (Ester96) M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of Knowledge Discovery in Database (KDD), pages 226-231, 1996.
- (Forman00) G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34-38, 2000.
- (Guha98) S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of SIGMOD, pages 73-84, June 1998.
- (Januzaj04) E. Januzaj, H.-P. Kriegel, and M. Pfeifle. DBDC: Density based distributed clustering. In Proceedings of International Conference on Extending Database Technology (EDBT), pages 88-105, March 2004.
- (Johnson99) E. L. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Proceedings of Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 221-244, August 1999.
- (Li04) M. Li, W.-C. Lee, and A. Sivasubramaniam. Semantic small world: An overlay network for peer-to-peer search. In Proceedings of International Conference on Network Protocols (ICNP), pages 228-238, October 2004.
24References (cont.)
- (MacQueen67) J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, 1967.
- (Ratnasamy01) S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Schenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM, pages 161-172, August 2001.
- (Samatova02) N. F. Samatova, G. Ostrouchov, A. Geist, and A. V. Melechko. RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Distributed and Parallel Databases, 11(2):157-180, 2002.
- (Sheikholeslami98) G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of VLDB, pages 428-439, August 1998.
- (Stoica01) I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM, pages 149-160, August 2001.
- (Xu99) X. Xu, J. Jager, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263-290, 1999.
- (Wang97) W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of VLDB, pages 186-195, August 1997.
- (Zhang96) T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of SIGMOD, pages 103-114, June 1996.
26Backup slides: DBSCAN definitions
- p is directly density-reachable from q wrt. ε and T in D (written $p >_D q$) if 1) $p \in N_\varepsilon(q)$ and 2) $|N_\varepsilon(q)| \ge T$
- p is density-reachable from q wrt. ε and T in D if there is a chain of objects $p_1, p_2, \ldots, p_n$ such that $p_1 = q$, $p_n = p$, and $p_{i+1} >_D p_i$
- p and q are density-connected wrt. ε and T in D if there is an object o in D such that both p and q are density-reachable from o in D.
[Figure: example with T = 5]
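The three definitions build on each other, which a short sketch makes visible. It assumes Euclidean distance and refers to objects by their index in a list; the function names and the five-point example are illustrative. Note that density-reachability is not symmetric, while density-connectivity is.

```python
import math
from collections import deque

def density_reachable(data, q, eps, threshold):
    """Indices of all objects that are density-reachable from object q."""
    def region(i):
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    reached, expanded = set(), set()
    queue = deque([q])
    while queue:
        i = queue.popleft()
        if i in expanded:
            continue
        expanded.add(i)
        nbrs = region(i)
        if len(nbrs) < threshold:
            continue  # i is not dense enough: no chain can continue through it
        for j in nbrs:  # every j here is directly density-reachable from i
            if j not in reached:
                reached.add(j)
                queue.append(j)
    return reached

def density_connected(data, p, q, eps, threshold):
    """True if some object o has both p and q density-reachable from it."""
    for o in range(len(data)):
        reach = density_reachable(data, o, eps, threshold)
        if p in reach and q in reach:
            return True
    return False

line = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(density_reachable(line, 1, eps=1.0, threshold=3))   # from a dense object
print(density_reachable(line, 0, eps=1.0, threshold=3))   # from an end object
print(density_connected(line, 0, 3, eps=1.0, threshold=3))
```

The two end objects of the chain cannot reach each other directly, yet they are density-connected through the dense objects in the middle.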
27Backup slides: DBSCAN definitions (cont.)
- A cluster C in D is a non-empty subset of D satisfying
- Maximality: $\forall p, q \in D$: if $q \in C$ and $p >_D q$, then $p \in C$
- Connectivity: $\forall p, q \in C$: p and q are density-connected in D
- Noise is the set of data objects in D that do not belong to any cluster.