PENS: An Algorithm for Density-Based Clustering in Peer-to-Peer Systems

1
PENS: An Algorithm for Density-Based Clustering
in Peer-to-Peer Systems
  • Mei Li
  • Wang-Chien Lee
  • Anand Sivasubramaniam
  • Pennsylvania State University
  • June 2006

Guanling Lee, National Dong Hwa University
@ InfoScale'06
2
Roadmap
  • Introduction
  • PENS
  • Analysis
  • Summary

3
Motivation
  • Huge amounts of data are distributed in
    large-scale dynamic networks (e.g., P2P systems)
  • Knowledge hidden among the data is valuable for
    various applications
    • Market analysis, smart query processing,
      scientific exploration
  • Clustering
    • Groups a set of data objects into clusters of
      similar data objects
    • Applications
      • Pattern recognition, spatial data analysis,
        market analysis, document classification, and
        access pattern discovery on the WWW

4
Introduction: Peer-to-peer (P2P) systems
  • Different from the client-server computing model
    • No central control
    • Each peer has equal functionality: a peer is both a
      server and a client
  • Dynamic
    • Peers join and leave
  • Large scale
  • Existing systems
    • Unstructured overlays
      • Gnutella
    • Structured overlays
      • CAN (Ratnasamy01)
      • CHORD (Stoica01)
      • SSW (Li04)

5
P2P systems: Unstructured overlays
Search: flooding with TTL
Simple and maintenance free, but with excessive
communication cost
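The trade-off on this slide can be seen in a minimal sketch of TTL-limited flooding. The overlay is modelled as a plain adjacency dictionary, and `flood_search`, `graph`, and `has_item` are illustrative names, not part of Gnutella or of the original slides. For simplicity a peer forwards the query to every neighbor, including the one it came from, so the message count is an upper bound.

```python
from collections import deque


def flood_search(graph, start, has_item, ttl):
    """Flood a query from `start`; return (peers holding the item, messages sent)."""
    visited = {start}
    hits = [start] if has_item(start) else []
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for neighbor in graph[peer]:
            messages += 1  # every forward costs a message, even a duplicate
            if neighbor not in visited:
                visited.add(neighbor)
                if has_item(neighbor):
                    hits.append(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits, messages


# Toy overlay: the item lives on peer "e", three hops away from "a".
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
print(flood_search(graph, "a", lambda p: p == "e", ttl=2))  # item not reached
print(flood_search(graph, "a", lambda p: p == "e", ttl=3))  # item found on "e"
```

Raising the TTL finds the item but sends more messages, many of them duplicates, which is the communication cost the slide refers to.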
6
P2P systems: Structured overlays
  • Each data object is assigned a key, and each peer is
    assigned an ID
  • A data object is mapped to the peer whose ID is
    close to its key

(Figure: coordinate space with corners (0,0) and (1,1))
Search: efficient
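The mapping rule can be illustrated with a small sketch, assuming that keys and peer IDs are both points in the unit square and that "close" means Euclidean distance. `closest_peer` and the peer table are hypothetical names for illustration; a real overlay such as CAN resolves this by routing between zones rather than by scanning all peers.

```python
import math


def closest_peer(key, peer_ids):
    """Return the name of the peer whose ID point is nearest to `key`."""
    return min(peer_ids, key=lambda name: math.dist(key, peer_ids[name]))


# Hypothetical peers with IDs in the unit square from (0,0) to (1,1).
peers = {"A": (0.25, 0.25), "B": (0.75, 0.25), "C": (0.5, 0.75)}
print(closest_peer((0.7, 0.3), peers))
print(closest_peer((0.4, 0.8), peers))
```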
7
Clustering in centralized systems
  • Well-known algorithms
    • Partition-based
      • k-means (MacQueen67)
    • Hierarchical
      • BIRCH (Zhang96), CURE (Guha98)
    • Grid-based
      • WaveCluster (Sheikholeslami98), STING (Wang97)
    • Density-based
      • DBSCAN (Ester96)
    • Model-based
      • AutoClass (Cheeseman96)
  • Applicable to P2P systems?
    • They require data to be transmitted to a central site
      • Not feasible (no central server, costly
        communication, privacy concerns)
    • They aim to minimize disk access
      • Communication cost is the concern in P2P systems

8
Clustering in distributed systems
  • Two steps
    • Clustering on local sites
    • Combine local results
      • Forward to a central site
        (Johnson99, Samatova02, Januzaj04, Xu99)
      • Flood to the whole system
        (Bandyopadhyay05, Dhillon99, Forman00)
  • Applicable to P2P systems?
    • No central site
    • System-wide flooding is not feasible

9
Our approach
  • Clustering on local site
  • Hierarchical cluster assembling
  • → Peer density-based clustering (PENS)
    • Follows the design principle of DBSCAN
    • Illustrated on top of CAN

10
Background: DBSCAN
  • Density-based clustering treats regions densely
    populated by data as clusters
    • Efficient on very large datasets
    • Discovers clusters with arbitrary shapes
    • Insensitive to noise
  • Basic idea
    • If the neighborhood of a given radius (e) of a
      data object has a cardinality of at least a
      preset threshold (T), this data object belongs to
      a cluster

11
Background: DBSCAN (cont.)
  • Clustering algorithm
    • Starting from an arbitrary data object, expand
      the cluster.
    • If a cluster cannot be expanded any more,
      move on to another data object that has not been
      clustered.
    • Terminate when no data object can be expanded
      and all data objects are either clustered or
      labeled as noise.
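The procedure above can be sketched as follows. This is a minimal brute-force version of centralized DBSCAN with a linear-scan neighborhood query, intended only to make the steps concrete; the names `region_query`, `dbscan`, `eps` (the radius e), and `threshold` (T) are chosen for illustration, and the neighborhood is assumed to include the object itself.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, threshold):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already clustered or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < threshold:
            labels[i] = NOISE  # may later be claimed as a border object
            continue
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:  # expand the cluster until nothing more can be added
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= threshold:
                seeds.extend(j_neighbors)  # j is dense enough to expand from
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, threshold=3))
```

In the toy data the two squares form two clusters and the isolated point at (5, 5) is labeled as noise.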

12
Issues in PENS
  • Hierarchy formation
    • How to form a hierarchy for cluster assembly?
  • Cluster expansion checking
    • How to determine and represent information
      necessary for cluster assembly?
  • Cluster merging
    • How to merge clusters along the hierarchy?

13
Hierarchy formation
14
Hierarchy Formation (VPtree)
(Figure: binary tree with node labels running from 0 and 1
down to 11110 and 11111; the leaves are the zones of peers A
through N, and sibling subtrees form buddy regions.)
15
Hierarchy formation (cont.)
  • Each leaf node is associated with one zone, which is
    managed by a peer.
  • Question
    • Which peer should act as the internal tree node
      (arbiter) to merge the clusters in two buddy
      regions?
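The buddy-region relationship can be sketched as follows, assuming zones are identified by binary prefix strings as in the hierarchy figure, so that two regions are buddies when their identifiers differ only in the last bit. `buddy` and `merge_schedule` are hypothetical helper names, and the leaf list is read off the figure labels. The sketch only derives a bottom-up merge order; it does not answer the question of which peer acts as arbiter.

```python
def buddy(zone_id):
    """Sibling region: same parent prefix, last bit flipped."""
    if not zone_id:
        raise ValueError("the root region has no buddy")
    return zone_id[:-1] + ("1" if zone_id[-1] == "0" else "0")


def merge_schedule(leaves):
    """Return (region, buddy region, parent) triples, deepest merges first."""
    pending = set(leaves)
    schedule = []
    while len(pending) > 1:
        # Take the deepest pending region; ties are broken by the label.
        deepest = max(pending, key=lambda z: (len(z), z))
        sibling = buddy(deepest)
        if sibling not in pending:
            raise ValueError(f"region {sibling} is missing from the tree")
        pending -= {deepest, sibling}
        parent = deepest[:-1]
        pending.add(parent)
        first, second = sorted([deepest, sibling])
        schedule.append((first, second, parent))
    return schedule


# Leaf zones as labelled in the hierarchy figure (14 zones).
leaves = ["000", "0010", "0011", "0100", "0101", "011", "100",
          "1010", "1011", "1100", "1101", "1110", "11110", "11111"]
for step in merge_schedule(leaves):
    print(step)
```

With 14 leaf zones the schedule contains 13 merges, one per internal tree node, and the last merge combines regions 0 and 1 at the root.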

(Figure: the hierarchy with tree nodes labeled by peers A
through N.)
16
Cluster Expansion Check
  • Observation
  • If all data objects of a local cluster within the
    zone are at least e away from the boundary of the
    zone, this cluster is also a global cluster
    (non-expandable).
  • If a local cluster within a zone can be expanded
    to include data objects outside of the zone
    (expandable), there is at least one data object
    of this local cluster in the e-inner boundary
    whose e-neighborhood contains some data objects
    in the e-outer boundary.

(Figure: zones A and B, with the e-inner boundary and the
e-outer boundary marked.)
Algorithm
  1. Region query: obtain data in the e-outer boundary
  2. Local computation: examine data in the e-inner boundary
Optimization: store the index of the data mapped to the
e-outer boundary, which eliminates the need for the range query
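A minimal sketch of the expansion check, assuming two-dimensional data and an axis-aligned rectangular zone given as (xmin, ymin, xmax, ymax). The names `boundary_distance` and `is_expandable` are illustrative, and `outer_points` stands for the data returned by the region query on the e-outer boundary. The check follows the observation literally: only objects in the e-inner boundary are examined.

```python
import math


def boundary_distance(p, zone):
    """Distance from point p (inside the zone) to the nearest zone edge."""
    xmin, ymin, xmax, ymax = zone
    x, y = p
    return min(x - xmin, xmax - x, y - ymin, ymax - y)


def is_expandable(cluster, outer_points, zone, eps):
    """True if some object of the local cluster in the e-inner boundary
    has an object of the e-outer boundary within distance eps."""
    for p in cluster:
        if boundary_distance(p, zone) >= eps:
            continue  # at least eps away from the boundary: cannot reach outside
        if any(math.dist(p, q) <= eps for q in outer_points):
            return True
    return False


zone = (0.0, 0.0, 10.0, 10.0)
interior_cluster = [(5.0, 5.0), (5.5, 5.0)]
edge_cluster = [(9.5, 5.0), (9.0, 5.0)]
print(is_expandable(interior_cluster, [(10.3, 5.0)], zone, eps=1.0))
print(is_expandable(edge_cluster, [(10.3, 5.0)], zone, eps=1.0))
print(is_expandable(edge_cluster, [(11.0, 5.0)], zone, eps=1.0))
```

A cluster for which the check is false everywhere is already a global cluster and needs no further merging.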
17
Cluster expansion check
(Figure: four cases for an object p with T = 5: noise,
expansion, new cluster, and expansion with merging.)
  • Local cluster or noise: localClusterID, ZoneID
  • Cluster expansion set (CES)
    • Present coverage (Pcoverage)
    • Expandable coverage (Ecoverage)

18
Cluster merging
  • Merge regions A and B, and store the result back in B
  • Merge one or more clusters in A with a cluster
    in B
    • For each local cluster i in A
      • $S = i.\mathrm{Ecoverage} \cap j.\mathrm{Pcoverage}$ (j is a cluster in
        B)
      • If S is non-empty
        • merge i and j
  • The above procedure does not handle the following case
    • A cluster in A can be merged with one or more
      clusters in B
    • For each local cluster i in B
      • $S = i.\mathrm{Ecoverage} \cap j.\mathrm{Pcoverage}$ (j is another
        cluster in B)
      • If S is non-empty
        • merge i and j
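The two passes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Pcoverage and Ecoverage are modelled as Python sets of data object identifiers, merging two clusters takes the union of both coverages, and a cluster in A that matches nothing in B is simply carried over into the result. `Cluster`, `absorb`, and `merge_regions` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cid: str
    pcoverage: set = field(default_factory=set)  # assumed: ids of covered objects
    ecoverage: set = field(default_factory=set)  # assumed: ids it can expand to


def absorb(target, other):
    """Merge `other` into `target` by taking the union of both coverages."""
    target.pcoverage |= other.pcoverage
    target.ecoverage |= other.ecoverage


def merge_regions(a_clusters, b_clusters):
    """Merge region A into region B and return B's updated cluster list."""
    result = list(b_clusters)
    # Pass 1: each cluster i in A against the clusters j in B.
    for i in a_clusters:
        for j in result:
            if i.ecoverage & j.pcoverage:  # S is non-empty
                absorb(j, i)
                break
        else:
            result.append(i)  # assumption: unmatched clusters are carried over
    # Pass 2: clusters in B that became connected through a cluster from A.
    changed = True
    while changed:
        changed = False
        for i in result:
            for j in result:
                if i is not j and i.ecoverage & j.pcoverage:
                    absorb(j, i)
                    result = [c for c in result if c is not i]
                    changed = True
                    break
            if changed:
                break
    return result


a = [Cluster("a1", {"a"}, {"x", "y"})]
b = [Cluster("b1", {"x"}, {"a"}), Cluster("b2", {"y"}), Cluster("b3", {"z"})]
merged = merge_regions(a, b)
print([(c.cid, sorted(c.pcoverage)) for c in merged])
```

In the example, a1 can expand into both b1 and b2. The first pass merges a1 into b1 only, and the second pass then joins the enlarged b1 with b2, which is the case the slide says the first procedure alone does not handle.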

19
Summary of PENS
  • Although a hierarchy is used, no single peer is
    overloaded
    • Each node receives 2 messages and sends 1 message
    • The message size is not monotonically increasing

20
Analysis of PENS
  • Message complexity: O(2k N)
  • With the optimization: O(N)

21
Conclusion
  • We propose a fully distributed clustering
    algorithm, PENS, that adopts hierarchical cluster
    assembling.
  • No flooding, no central site
  • Analysis demonstrates the efficiency of PENS.

22
Future work
  • Conduct extensive performance evaluation
  • Explore other clustering algorithms and data
    mining tasks in P2P systems

23
References
  • [Bandyopadhyay05] S. Bandyopadhyay, C. Gianella,
    U. Maulik, H. Kargupta, K. Liu, and S. Datta.
    Clustering Distributed Data Streams in
    Peer-to-Peer Environments. Information Science
    Journal (In Press).
  • [Cheeseman96] P. Cheeseman and J. Stutz.
    Bayesian classification (AutoClass): Theory and
    results. In Advances in Knowledge Discovery and
    Data Mining, pages 153-180. AAAI/MIT Press, 1996.
  • [Dhillon99] I. S. Dhillon and D. S. Modha. A
    data-clustering algorithm on distributed memory
    multiprocessors. In Proceedings of Workshop on
    Large-Scale Parallel KDD Systems (in conjunction
    with SIGKDD), pages 245-260, August 1999.
  • [Ester96] M. Ester, H.-P. Kriegel, J. Sander, and
    X. Xu. A density based algorithm for discovering
    clusters in large spatial databases with noise.
    In Proceedings of Knowledge Discovery in Database
    (KDD), pages 226-231, 1996.
  • [Forman00] G. Forman and B. Zhang. Distributed
    data clustering can be efficient and exact.
    SIGKDD Explorations, 2(2):34-38, 2000.
  • [Guha98] S. Guha, R. Rastogi, and K. Shim. CURE:
    An efficient clustering algorithm for large
    databases. In Proceedings of SIGMOD, pages 73-84,
    June 1998.
  • [Januzaj04] E. Januzaj, H.-P. Kriegel, and M.
    Pfeifle. DBDC: Density based distributed
    clustering. In Proceedings of International
    Conference on Extending Database Technology
    (EDBT), pages 88-105, March 2004.
  • [Johnson99] E. L. Johnson and H. Kargupta.
    Collective, hierarchical clustering from
    distributed, heterogeneous data. In Proceedings
    of Workshop on Large-Scale Parallel KDD Systems
    (in conjunction with SIGKDD), pages 221-244,
    August 1999.
  • [Li04] M. Li, W.-C. Lee, and A. Sivasubramaniam.
    Semantic small world: An overlay network for
    peer-to-peer search. In Proceedings of
    International Conference on Network Protocols
    (ICNP), pages 228-238, October 2004.

24
References (cont.)
  • [MacQueen67] J. MacQueen. Some methods for
    classification and analysis of multivariate
    observations. In Proceedings of the Fifth
    Berkeley Symposium on Mathematical Statistics and
    Probability, pages 281-297, 1967.
  • [Ratnasamy01] S. Ratnasamy, P. Francis, M.
    Handley, R. M. Karp, and S. Schenker. A scalable
    content-addressable network. In Proceedings of
    ACM SIGCOMM, pages 161-172, August 2001.
  • [Samatova02] N. F. Samatova, G. Ostrouchov, A.
    Geist, and A. V. Melechko. RACHET: An efficient
    cover-based merging of clustering hierarchies
    from distributed datasets. Distributed and
    Parallel Databases, 11(2):157-180, 2002.
  • [Sheikholeslami98] G. Sheikholeslami, S.
    Chatterjee, and A. Zhang. WaveCluster: A
    multi-resolution clustering approach for very
    large spatial databases. In Proceedings of VLDB,
    pages 428-439, August 1998.
  • [Stoica01] I. Stoica, R. Morris, D. Karger, M. F.
    Kaashoek, and H. Balakrishnan. Chord: A scalable
    peer-to-peer lookup service for Internet
    applications. In Proceedings of ACM SIGCOMM,
    pages 149-160, August 2001.
  • [Xu99] X. Xu, J. Jager, and H.-P. Kriegel. A
    fast parallel clustering algorithm for large
    spatial databases. Data Mining and Knowledge
    Discovery, 3(3):263-290, 1999.
  • [Wang97] W. Wang, J. Yang, and R. R. Muntz.
    STING: A statistical information grid approach to
    spatial data mining. In Proceedings of VLDB,
    pages 186-195, August 1997.
  • [Zhang96] T. Zhang, R. Ramakrishnan, and M.
    Livny. BIRCH: An efficient data clustering method
    for very large databases. In Proceedings of
    SIGMOD, pages 103-114, June 1996.

26
Backup slides: DBSCAN definitions
  • p is directly density-reachable from q wrt. e and
    T in D ($p >_D q$) if 1) $p \in N_e(q)$ and 2) $|N_e(q)| \ge T$
  • p is density-reachable from q wrt. e and T in D
    if there is a chain of objects $p_1, p_2, \ldots, p_n$ such
    that $p_1 = q$, $p_n = p$, and $p_{i+1} >_D p_i$
  • p and q are density-connected wrt. e and T in D
    if there is an object o in D such that both p and q
    are density-reachable from o in D.

(Figure: example with T = 5.)
27
Backup slides: DBSCAN definitions (cont.)
  • A cluster C in D is a non-empty set satisfying
    • Maximality: $\forall p, q \in D$: if $q \in C$ and $p >_D q$, then
      $p \in C$
    • Connectivity: $\forall p, q \in C$: p and q are
      density-connected in D
  • Noise is the set of data objects in D that do not belong
    to any cluster.