Network clustering presentation

1
Network clustering

Presented by Wooyoung Kim
2/6/2009
CSc 8910 Analysis of Biological Network, Spring 2009
Dr. Yi Pan

2
Outline
Introduction
Definitions and Basic Concepts
Network Clustering Problem
Hierarchical clustering
Clique-based clustering
Center-based clustering
Conclusion
3
Introduction

Clustering is
Loosely defined as the process of grouping
objects into sets called clusters so that each
cluster consists of elements that are similar in
some way.
Examples
Distance-based clustering: objects are grouped when
they are close under a given distance metric
Conceptual clustering: objects are grouped based on
descriptive concepts

4
Introduction

Clustering is
used for multiple purposes, including
Finding natural clusters (modules) and
describing their properties
Classifying the data
Detecting unusual data (outliers)
Data reduction by treating a cluster or one of
its elements as a single representative unit

5
Introduction

Network clustering
deals with clustering the data represented as a
network or a graph
Link analysis
Data points are represented by vertices
An edge exists if two data points are similar or
related in a certain way
Similarity criterion
Pairwise relations define the network model
Cohesiveness measures similarity within a cluster

6
Introduction

Network clustering approaches are used to perform
Distance-based clustering
Vertices are data points, and edges connect close
points
Alternatively, distances are used to weight the
edges of a complete graph

7
Introduction

Conceptual clustering
Generating a concept description for each
generated cluster
Design a matching field in database networks,
then vertices are connected by an edge if the
two matching fields are close
Example
Protein interaction networks: proteins are
vertices, and a pair is connected by an edge if
they are known to interact
Gene co-expression networks: genes are vertices,
and an edge indicates that the pair of genes (end
points) is co-expressed over some cut-off value,
based on microarray experiments.

8
Introduction

Application of network clustering
Understand the structure and function of
proteins based on protein interaction maps of
organisms
Clustering protein interaction networks (PINs)
using cliques to decompose the Protein
interaction network into functional modules and
protein complexes
Use of cliques and other high density subgraphs
to identify protein complexes (splicing
machinery, transcription factors, etc.) and
functional modules (signalling cascades, cell
cycle regulation)

9
Introduction

Application of network clustering
Protein complexes: groups of proteins that
interact with each other at the same time and
place.
Functional modules: groups of proteins that
are known to have pairwise interactions by
binding with each other to participate in
different cellular processes at different times

10
Definition and basic concepts

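$G = (V, E)$ is a simple, undirected graph
$n = |V|$, $e = |E|$
$\bar{G} = (V, \bar{E})$ is the complement graph of $G$
$\bar{E}$ is the complement set of edges
$G[S]$ is the induced subgraph of $G$ (induced by a
subset $S$ of $V$)
$N(v)$ is the set of neighbours of a vertex $v$ in $G$
(excluding $v$)
Degree $\deg(v) = |N(v)|$
Closed neighbourhood $N[v] = N(v) \cup \{v\}$
$d(u, v)$ is the distance between $u$ and $v$
Length of the shortest path from $u$ to $v$
$\mathrm{diam}(G) = \max d(i, j)$ over all vertex pairs
$i$ and $j$ is the diameter of the graph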
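The definitions above can be tried out on a small graph. The following is a minimal sketch, assuming the graph is stored as a dictionary that maps each vertex to the set of its neighbours; the function names and the four-vertex path used as an example are illustrative and do not come from the slides.

```python
from collections import deque


def neighbours(adj, v):
    """Open neighbourhood N(v): vertices adjacent to v, excluding v."""
    return set(adj[v])


def degree(adj, v):
    """deg(v) = |N(v)|."""
    return len(adj[v])


def distance(adj, u, v):
    """Length of a shortest u-v path found by BFS, or None if v is unreachable."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


def diameter(adj):
    """Largest distance over all vertex pairs (the graph is assumed connected)."""
    return max(distance(adj, u, v) for u in adj for v in adj)


# Hypothetical example graph: the path 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

On this path, vertex 2 has degree 2 and the diameter is the distance between the two end vertices.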

11
Definition and basic concepts

Edge connectivity $\kappa'(G)$ of a graph: the
minimum number of edges that must be removed to
disconnect the graph
Vertex connectivity (or connectivity) $\kappa(G)$ of a
graph: the minimum number of vertices that must
be removed to disconnect the graph (or to leave
a trivial graph)
Trivial graph: one vertex, no edges
Connected graph: every pair of vertices is
joined by a path.

12
Definition and basic concepts
Example: (vertex) connectivity $\kappa(G) = 2$; removal of
vertices 3 and 5 would disconnect the graph
[Figure: example graph on vertices 1 to 12]
13
Definition and basic concepts
Example: edge connectivity $\kappa'(G) = 2$; removal of
edges (9,11) and (10,12) would disconnect the
graph
[Figure: example graph on vertices 1 to 12]
14
Definition and basic concepts

A clique C is a subset of vertices such that an
edge exists between every pair of vertices in C
The induced subgraph $G[C]$ is a complete graph
A clique is maximal if it is not a subset of
any larger clique
A clique is maximum if there are no larger
cliques in the graph
A subset of vertices I is called an independent
set (also called a stable set) if for every pair
of vertices in I, (i, j) is not an edge
The induced subgraph $G[I]$ is edgeless
An independent set is maximal if it is not a
subset of any larger independent set
An independent set is maximum if there are no
larger independent sets in the graph

15
Definition and basic concepts
Example (maximal clique): {1, 2, 3} is a maximal
clique
[Figure: example graph on vertices 1 to 12]
16
Definition and basic concepts
Example (maximum clique): {7, 8, 9, 10} is the maximum
clique
[Figure: example graph on vertices 1 to 12]
17
Definition and basic concepts
Example (maximal independent set): I = {3, 7, 11} is a
maximal independent set as there is no larger
independent set containing it
18
Definition and basic concepts
Example (maximum independent set): the set
{1, 4, 5, 10, 11} is a maximum independent set, one
of largest cardinality in the graph
19
Definition and basic concepts
20
Definition and basic concepts
Algorithm for maximal independent set
21
Definition and basic concepts
Algorithm for maximal independent set
[Figure: example graph on vertices 1 to 7]
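The pseudocode on these slides did not survive in the transcript. The following is a minimal sketch of one common greedy procedure for a maximal independent set, assuming a minimum-degree selection rule and an adjacency-set dictionary as input; both the rule and the name `greedy_maximal_independent_set` are assumptions rather than the presenter's exact algorithm.

```python
def greedy_maximal_independent_set(adj):
    """Greedy sketch: repeatedly pick a minimum-degree vertex of the remaining
    graph, add it to the set, and delete it together with its neighbours."""
    # Work on a copy so the caller's graph is not modified.
    remaining = {v: set(nb) for v, nb in adj.items()}
    independent = set()
    while remaining:
        # Ties are broken by the smallest vertex label (sorted order).
        v = min(sorted(remaining), key=lambda x: len(remaining[x]))
        independent.add(v)
        removed = remaining[v] | {v}
        for x in removed:
            del remaining[x]
        for nb in remaining.values():
            nb -= removed
    return independent


# Hypothetical example graph: the path 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
result = greedy_maximal_independent_set(path)
```

Because every deleted vertex is adjacent to a chosen vertex, the returned set is maximal, but it is not guaranteed to be maximum.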
22
Definition and basic concepts

A dominating set is a set of
vertices such that every vertex in the graph is
either in this set or has a neighbour in this set
Dominating set is minimal if it contains no
proper subset which is dominating
Dominating set is a minimum dominating set if
it is of the smallest cardinality
The cardinality of a minimum dominating set is
called the domination number $\gamma(G)$ of a graph

23
Definition and basic concepts
D = {7, 11, 3} is a minimal and minimum
dominating set
[Figure: example graph on vertices 1 to 12]
24
Definition and basic concepts

A connected dominating set is one in which the
subgraph induced by the dominating set is
connected
An independent dominating set is one in which
the dominating set is also independent

25
Network clustering problem

Given a graph $G = (V, E)$, find subsets (not
necessarily disjoint) $V_1, \dots, V_r$ of $V$ such that
$V = \bigcup_{i=1}^{r} V_i$ and
each subset is a cluster modelled by structures
such as cliques or other distance and
diameter-based models
The model used as a cluster represents the
cohesiveness required of the cluster

26
Network clustering problem

The clustering models can be classified
By the constraints on relations between
clusters (clusters may be disjoint or
overlapping)
The objective function used to achieve the goal
of clustering (minimizing the number of clusters
or maximizing the cohesiveness)
When clusters are required to be disjoint,
{V1,...,Vr} is a cluster-partition (exclusive
clustering)
When clusters are allowed to overlap,
{V1,...,Vr} is a cluster-cover (overlapping
clustering)

27
Network clustering problem

Assume that there is a measure of cohesiveness of
the cluster that can be varied for a graph G. We
can then define two types of optimization problems
Type I: Minimize the number of clusters while
ensuring that every cluster formed has
cohesiveness over a prescribed threshold
Example: the problem of clustering an incomplete
graph with cliques used as clusters and the
objective of minimizing the number of clusters

28
Network clustering problem

Type II: Maximize the cohesiveness of each
cluster formed, while the number of clusters is K
(the last requirement may be relaxed by setting K
to infinity)
Example: assume that $G$ has non-negative edge
weights $w$; for a cluster $V_i$, let $E_i$ denote the
edges in the subgraph induced by $V_i$
Use $w$ as a dissimilarity measure (distance)
For example, $w(E_i) = \sum_{e \in E_i} w(e)$ is a
meaningful measure of cohesiveness that
can be used to formulate a Type II clustering
problem
We will refer to problems as Type I and Type II
based on their objective

29
Hierarchical clustering

After performing clustering, we can abstract the
graph $G^0$ to a graph
$G^1 = (V^1, E^1)$ as follows
There exists a vertex $v_i^1$ in $V^1$ for every subset
(cluster) $V_i^0$
There exists an edge between $v_i^1$ and $v_j^1$ if and
only if there exist a vertex $x$ in the cluster $V_i^0$
and a vertex $y$ in the cluster $V_j^0$ such that
$(x, y)$ is an edge of $G^0$
In other words, if any two vertices from
different clusters have an edge between them in
the original graph $G^0$, then the clusters containing
them are made adjacent in $G^1$
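The abstraction step can be sketched in a few lines. This is a minimal illustration, assuming the original graph is given as a list of edges and the clustering as a dictionary from hypothetical cluster labels to vertex sets; the name `abstract_graph` and the example data are not from the slides.

```python
def abstract_graph(edges, clusters):
    """Build the abstracted graph: one vertex per cluster, and an edge between
    two clusters whenever some original edge joins a vertex of one cluster to
    a vertex of the other."""
    # Map every vertex to the clusters containing it (clusters may overlap).
    membership = {}
    for label, members in clusters.items():
        for v in members:
            membership.setdefault(v, set()).add(label)
    abstract_edges = set()
    for x, y in edges:
        for ci in membership.get(x, ()):
            for cj in membership.get(y, ()):
                if ci != cj:
                    abstract_edges.add(frozenset((ci, cj)))
    return set(clusters), abstract_edges


# Hypothetical example: the path 1 - 2 - 3 - 4 - 5 with three clusters.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
clusters = {"A": {1, 2}, "B": {3}, "C": {4, 5}}
vertices_1, edges_1 = abstract_graph(edges, clusters)
```

Applying the same function again to the abstracted graph and a clustering of its vertices gives the next level of the hierarchy.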

30
Hierarchical clustering

We can recursively cluster the abstracted graph
G1 in a similar fashion to obtain a multilevel
hierarchy
Process is called hierarchical clustering
Example
The following subsets form clusters in this graph
C1 = {7,8,9,10}, C2 = {1,2,3}, C3 = {4,6}, C4 = {11,12}, C5 = {5}
Given the clusters of the example graph $G^0$, we
can construct an abstracted graph $G^1$

31
Clique-based clustering

Natural choice for a highly cohesive cluster
Cliques have
Minimum possible diameter
Maximum connectivity
Maximum possible degree for each vertex
Given an arbitrary graph
Type I approaches try to partition it into (or
cover it using) a minimum number of cliques
Type II approaches usually work with a weighted
complete graph, and hence every partition of the
vertex set is a clique partition

32
Clique-based clustering

Minimum clique partitioning
Type I clique partitioning and clique covering
problems are both NP-hard [Garey79]
Heuristic approaches are preferred for large
graphs
Note that clique-partitioning and
clique-covering problems are closely related
The minimum numbers of clusters produced in clique
covering and clique partitioning are the same

33
Clique-based clustering
34
Clique-based clustering
Simple heuristic for clique partitioning
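The heuristic's pseudocode is missing from the transcript. As a stand-in, here is a minimal sketch of one simple greedy approach: grow a maximal clique among the unassigned vertices, remove it, and repeat. The selection rules and the name `greedy_clique_partition` are assumptions, and the presenter's heuristic may differ.

```python
def greedy_clique_partition(adj):
    """Greedy sketch: repeatedly grow a maximal clique among the vertices that
    are still unassigned, then remove it from the graph."""
    unassigned = set(adj)
    cliques = []
    while unassigned:
        # Seed with a vertex of highest degree among unassigned vertices.
        seed = max(sorted(unassigned), key=lambda v: len(adj[v] & unassigned))
        clique = {seed}
        # Candidates are unassigned vertices adjacent to every clique member.
        candidates = adj[seed] & unassigned
        while candidates:
            v = max(sorted(candidates), key=lambda x: len(adj[x] & candidates))
            clique.add(v)
            candidates = candidates & adj[v]
        cliques.append(clique)
        unassigned -= clique
    return cliques


# Hypothetical example: triangle {1, 2, 3} with a tail 3 - 4 - 5.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
partition = greedy_clique_partition(graph)
```

The result is always a valid clique partition, but the number of cliques is not guaranteed to be minimum.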
35
Clique-based clustering
Example
[Figure: example graph on vertices 1 to 12]
36
Clique-based clustering
Example
37
Clique-based clustering
Example
38
Clique-based clustering

Min-Max k-Clustering
A Type II clique partitioning problem with
min-max objective
Consider a weighted complete graph $G = (V, E)$
with weights $w(e_1) \le w(e_2) \le \dots \le w(e_m)$, $m = n(n-1)/2$
Partition the graph into no more than $k$ cliques
such that the maximum weight of an edge between two
vertices inside a clique is minimized
In other words, if $V_1, \dots, V_k$ is the clique
partition, then we wish to minimize
$\max_{i=1,\dots,k} \max_{u,v \in V_i} w(u, v)$
The weight $w(i, j)$ can be thought of as a measure
of dissimilarity
A larger $w(i, j)$ means that $i$ and $j$ are more dissimilar
Problem tries to cluster the graph into at most
k cliques such that the maximum dissimilarity
between any two vertices inside a clique is
minimized
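To make the min-max objective concrete, the following minimal sketch evaluates it for a given partition. It assumes that the weights of the complete graph are stored in a dictionary keyed by unordered vertex pairs; the function name and the example weights are hypothetical.

```python
from itertools import combinations


def min_max_objective(weights, partition):
    """Largest weight of an edge whose two end vertices lie in the same cluster.
    A partition made only of singletons has objective 0."""
    worst = 0
    for cluster in partition:
        for u, v in combinations(sorted(cluster), 2):
            worst = max(worst, weights[frozenset((u, v))])
    return worst


# Hypothetical dissimilarities on the complete graph with vertices 1 to 4.
raw = {(1, 2): 1, (1, 3): 5, (1, 4): 6, (2, 3): 7, (2, 4): 8, (3, 4): 2}
weights = {frozenset(pair): w for pair, w in raw.items()}

good = min_max_objective(weights, [{1, 2}, {3, 4}])
bad = min_max_objective(weights, [{1, 3}, {2, 4}])
```

With k = 2, the partition {1, 2}, {3, 4} keeps the two most similar pairs together and gives the smaller objective value.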

39
Clique-based clustering

Min-Max k-Clustering
Given any graph $G = (V, E)$, the required
edge-weighted complete graph can be obtained in
different ways using meaningful measures of
dissimilarity
The weight $w(i, j)$ could be $d(i, j)$, the shortest
distance between $i$ and $j$ in $G$
The weight could be based on $\kappa(i, j)$ or $\kappa'(i, j)$
(the minimum number of vertices and edges,
respectively, that need to be removed from $G$ to
disconnect $i$ and $j$)
Since these are measures of similarity, we
could obtain the required weights as
$w(i, j) = |V| - \kappa(i, j)$ or $w(i, j) = |E| - \kappa'(i, j)$
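The first option, shortest-path distance as dissimilarity, can be sketched as follows. This assumes an unweighted, connected graph stored as a dictionary of neighbour sets; the function name `shortest_path_weights` is illustrative.

```python
from collections import deque


def shortest_path_weights(adj):
    """Return w(i, j) = d(i, j) for every pair of distinct vertices, using one
    breadth-first search per source vertex."""
    weights = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for target, d in dist.items():
            if target != source:
                weights[frozenset((source, target))] = d
    return weights


# Hypothetical example graph: the path 1 - 2 - 3.
path = {1: {2}, 2: {1, 3}, 3: {2}}
w = shortest_path_weights(path)
```

The resulting dictionary has one entry per vertex pair, so it describes a weighted complete graph on the same vertex set.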

40
Clique-based clustering
41
Clique-based clustering
Bottleneck heuristic for the min-max k-clustering
problem
42
Clique-based clustering

Procedure bottleneck() returns the bottleneck
graph G()
MIS() is an arbitrary procedure for finding a
maximal independent set (MIS) in G
This algorithm would be optimal if we managed to
find a maximum independent set (one of largest
size) in every iteration
However, finding a maximum independent set is NP-hard
To have a polynomial-time algorithm, we have to
restrict ourselves to finding a maximal
independent set using heuristic approaches such
as the greedy approach described earlier
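Since the pseudocode slide is empty in this transcript, the following is only a sketch of how a bottleneck heuristic of this kind can be organised, not the presenter's exact procedure. It assumes that the bottleneck graph for a threshold keeps the edges of weight at most that threshold, that a greedy minimum-degree rule finds the maximal independent set, and that each remaining vertex joins an adjacent cluster-head; all names are hypothetical.

```python
def greedy_mis(adj):
    """Greedy maximal independent set using a minimum-degree rule."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    chosen = set()
    while remaining:
        v = min(sorted(remaining), key=lambda x: len(remaining[x]))
        chosen.add(v)
        removed = remaining[v] | {v}
        for x in removed:
            del remaining[x]
        for nb in remaining.values():
            nb -= removed
    return chosen


def bottleneck_k_clustering(vertices, weights, k):
    """Try thresholds in increasing order; stop at the first bottleneck graph
    whose maximal independent set has at most k vertices."""
    vertices = sorted(vertices)
    for threshold in sorted(set(weights.values())):
        # Bottleneck graph: keep only edges of weight <= threshold.
        adj = {v: set() for v in vertices}
        for pair, w in weights.items():
            if w <= threshold:
                u, v = tuple(pair)
                adj[u].add(v)
                adj[v].add(u)
        heads = greedy_mis(adj)
        if len(heads) <= k:
            clusters = {h: {h} for h in sorted(heads)}
            for v in vertices:
                if v not in heads:
                    # A maximal independent set dominates the graph, so every
                    # other vertex has at least one neighbouring head.
                    clusters[min(adj[v] & heads)].add(v)
            return threshold, list(clusters.values())
    return None


# Hypothetical dissimilarities on the complete graph with vertices 1 to 4.
raw = {(1, 2): 1, (1, 3): 5, (1, 4): 6, (2, 3): 7, (2, 4): 8, (3, 4): 2}
weights = {frozenset(pair): w for pair, w in raw.items()}
threshold, clusters = bottleneck_k_clustering([1, 2, 3, 4], weights, 2)
```

In this sketch every cluster member is within the returned threshold of its head, so if the weights satisfy the triangle inequality, two members of one cluster are at most twice the threshold apart.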

43
Clique-based clustering
Example: clustering output of the bottleneck
min-max k-clustering algorithm with k = 2 for the
following graph
44
Clique-based clustering
Result
45
Center-based clustering

In center-based clustering models, the elements
of a cluster are determined based on their
similarity with the cluster's center (or
cluster-head)
Center-based clustering algorithms usually
consist of two steps
First, an optimization procedure is used to
determine the cluster-heads
Second, the cluster-heads are then used to form
clusters around them

46
Center-based clustering

Clustering with dominating sets
Minimum dominating set and related problems
provide a modelling tool for centre-based
clustering of Type I
The minimum dominating set problem is NP-hard, so
heuristic approaches and approximation algorithms
are used to find a small dominating set
If D denotes a dominating set, then for each
vertex v in D the closed neighbourhood $N[v]$ forms
a cluster
By the definition of domination, every vertex
not in the dominating set has a neighbour in it
and hence is assigned to some cluster
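Forming clusters from a dominating set is direct once the set is known. The following minimal sketch assumes an adjacency-set dictionary and a dominating set supplied by the caller; the function names and the example are illustrative only.

```python
def is_dominating(adj, heads):
    """True if every vertex is in `heads` or has a neighbour in `heads`."""
    return all(v in heads or adj[v] & heads for v in adj)


def clusters_from_dominating_set(adj, heads):
    """One cluster per cluster-head v: its closed neighbourhood N[v]."""
    if not is_dominating(adj, heads):
        raise ValueError("the given set does not dominate the graph")
    return {v: adj[v] | {v} for v in heads}


# Hypothetical example graph: the path 1 - 2 - 3 - 4 - 5 with heads 2 and 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
clusters = clusters_from_dominating_set(path, {2, 4})
```

Vertex 3 is adjacent to both heads and therefore appears in both clusters, which is why the result is a cover rather than a partition.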

47
Center-based clustering

Each v in D is called a cluster-head and the
number of clusters that result is exactly the
size of the dominating set
Minimizing the size of the dominating set
minimizes the number of clusters produced,
resulting in a Type I clustering problem
This approach results in a cluster cover since
the resulting clusters need not be disjoint

48
Center-based clustering

Each cluster has diameter at most two as every
vertex in the cluster is adjacent to its
cluster-head and the cluster-head is similar to
all the other vertices in its cluster
However, neighbours of the cluster-head may be
poorly connected among themselves
Some post-processing may be required as a cluster
formed in this fashion from an arbitrary
dominating set could completely contain another
cluster
Clustering with dominating sets is especially
suited for clustering protein interaction
networks, because it can
reveal groups of proteins that interact
through a central protein, which could be
identified as a cluster-head in this method

49
Center-based clustering

Independent Dominating Sets
Finding a maximal independent set also yields
a minimal independent dominating set
Can be used in clustering the graph
Here, no cluster formed can contain another
cluster completely, as the cluster-heads are
independent and different

50
Center-based clustering

Greedy algorithm for minimal independent
dominating sets
Proceeds by adding a maximum degree vertex to the
current independent set and then deleting that
vertex along with its neighbours
Greedy because it adds a maximum degree vertex so
that a larger number of vertices are removed in
each iteration, yielding a small independent
dominating set
This is repeated until no more vertices exist
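The greedy procedure described above can be sketched directly. This is a minimal illustration, assuming an adjacency-set dictionary as input and ties broken by the smallest vertex label; the function name is hypothetical.

```python
def greedy_independent_dominating_set(adj):
    """Repeatedly add a maximum-degree vertex of the remaining graph to the
    set, then delete that vertex together with its neighbours."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    heads = set()
    while remaining:
        # Ties are broken by the smallest vertex label (sorted order).
        v = max(sorted(remaining), key=lambda x: len(remaining[x]))
        heads.add(v)
        removed = remaining[v] | {v}
        for x in removed:
            del remaining[x]
        for nb in remaining.values():
            nb -= removed
    return heads


# Hypothetical example graph: the path 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
heads = greedy_independent_dominating_set(path)
```

On this path the procedure first picks vertex 2, which removes vertices 1 to 3, and then picks vertex 4.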

51
Center-based clustering
Progress of greedy minimal independent dominating
set algorithm on the example graph
52
Center-based clustering

Connected Dominating Sets
In some situations, cluster-heads are required to
be connected rather than independent.
A Connected Dominating Set (CDS) is a dominating
set D such that $G[D]$ is connected.
Finding a minimum CDS is NP-hard.
Approximation algorithms or heuristics are needed

53
Center-based clustering

Greedy vertex elimination type heuristic for
finding a CDS [Butenko04]
Let $D = V$, and let $F$ be empty initially.
Pick a minimum degree vertex $u$ in $G[D]$ that is
not yet fixed and delete it from $D$ if the
deletion does not disconnect the graph. Otherwise
$u$ is added to the set $F$ ($u$ is fixed).
If $u$ is deleted and $N(u) \cap F$ is empty, a vertex
of maximum degree in $G[D]$ that is also in $N(u)$ is
fixed, ensuring that $u$ is dominated.
In every iteration, $G[D]$ is connected and $D$ is a
dominating set in $G$.
The algorithm terminates when all the vertices
left in $D$ are fixed ($D = F$), and $D$ is the output
CDS of the algorithm.
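The steps above can be sketched as follows. This is a minimal illustration rather than the reference implementation from [Butenko04]: it assumes a connected input graph stored as a dictionary of neighbour sets, breaks ties by the smallest vertex label, and uses hypothetical function names.

```python
def is_connected(adj, vertices):
    """True if the subgraph induced by `vertices` is non-empty and connected."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x] & vertices:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices


def greedy_cds(adj):
    """Vertex-elimination sketch: D starts as V and F (fixed vertices) as empty."""
    D = set(adj)
    F = set()
    while D - F:
        # Minimum-degree non-fixed vertex, with degrees measured in G[D].
        u = min(sorted(D - F), key=lambda x: len(adj[x] & D))
        if is_connected(adj, D - {u}):
            D.remove(u)
            if not adj[u] & F:
                # Fix a maximum-degree neighbour of u so that u stays dominated.
                w = max(sorted(adj[u] & D), key=lambda x: len(adj[x] & D))
                F.add(w)
        else:
            # Deleting u would disconnect G[D], so u must stay: fix it.
            F.add(u)
    return D


# Hypothetical example graph: the path 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
cds = greedy_cds(path)
```

On the path, the two end vertices are deleted, their neighbours 2 and 4 are fixed, and vertex 3 is then fixed because removing it would disconnect the remaining vertices.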

54
Center-based clustering
greedy_CDS_algorithm (connected graph G = (V, E))
55
Center-based clustering
CDS
[Figure: example graph on vertices 1 to 12]
56
Center-based clustering
Cluster cover
[Figure: example graph on vertices 1 to 12]
57
Center-based clustering

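D | F | u | U | W
V | {} | 1 | {2, 3} | 3
V \ {1} | {3} | 2 | --- | ---
V \ {1, 2} | {3} | 4 | --- | ---
V \ {1, 2, 4} | {3} | 6 | {7} | 7
V \ {1, 2, 4, 6} | {3, 7} | 11 | {9, 12} | 9
{3, 5, 7, 8, 9, 10, 12} | {3, 7, 9} | 12 | {10} | 10
{3, 5, 7, 8, 9, 10} | {3, 7, 9, 10} | 5 | --- | ---
{3, 5, 7, 8, 9, 10} | {3, 7, 9, 10, 5} | 8 | --- | ---
{3, 5, 7, 9, 10} | {3, 7, 9, 10, 5} | | |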

58
Center-based clustering

k-Center clustering
Variant of the k-means clustering approach
[Birnhaum03]
Seeks to identify k cluster-heads such that some
objective function measuring the dissimilarity of
the members of a cluster is minimized.
Different choices of dissimilarity measures and
objectives often yield different clustering
problems and solutions.

59
Center-based clustering

The k-center clustering problem is a Type II
center-based clustering model with a min-max
objective [Hochbaum85].

bottleneck_k-center_algorithm (connected graph G
(V,E), with weights)
60
Center-based clustering

bottleneck_k-center_algorithm
Let I be MIS( ). Then the distance between every
pair of vertices in I is at least three in the
original graph G.
No vertex of G can dominate two vertices of I,
since the vertices of I are at distance at least
three from each other.
Therefore, any dominating set in G is at least
as large as I.
In the algorithm, if we find an MIS of
size at least k + 1, then no dominating set of
size k exists in G.
Thus, we proceed until we find an MIS of size at
most k and terminate.
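The pseudocode for this algorithm is not preserved in the transcript. The sketch below follows the general bottleneck idea associated with [Hochbaum85], assuming that the maximal independent set is taken in the square of the bottleneck graph (two vertices are adjacent in the square if they are within two hops), which is consistent with the distance-three argument above. The function names, the greedy rule and the example are assumptions.

```python
def greedy_mis(adj):
    """Greedy maximal independent set using a minimum-degree rule."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    chosen = set()
    while remaining:
        v = min(sorted(remaining), key=lambda x: len(remaining[x]))
        chosen.add(v)
        removed = remaining[v] | {v}
        for x in removed:
            del remaining[x]
        for nb in remaining.values():
            nb -= removed
    return chosen


def bottleneck_k_center(vertices, weights, k):
    """Try thresholds in increasing order; return the first threshold whose
    squared bottleneck graph has a maximal independent set of size <= k."""
    vertices = sorted(vertices)
    for threshold in sorted(set(weights.values())):
        # Bottleneck graph: keep only edges of weight <= threshold.
        adj = {v: set() for v in vertices}
        for pair, w in weights.items():
            if w <= threshold:
                u, v = tuple(pair)
                adj[u].add(v)
                adj[v].add(u)
        # Square of the bottleneck graph: join vertices within two hops.
        square = {v: set(adj[v]) for v in vertices}
        for v in vertices:
            for x in adj[v]:
                square[v] |= adj[x]
            square[v].discard(v)
        centers = greedy_mis(square)
        if len(centers) <= k:
            return threshold, centers
    return None


# Hypothetical points on a line at positions 0, 1, 10, 11 (vertices 1 to 4),
# with the absolute difference of positions as the weight.
position = {1: 0, 2: 1, 3: 10, 4: 11}
weights = {frozenset((a, b)): abs(position[a] - position[b])
           for a in position for b in position if a < b}
threshold, centers = bottleneck_k_center(position, weights, 2)
```

Every vertex ends up within two hops of a center in the bottleneck graph, so under the triangle inequality it is at most twice the returned threshold away from its nearest center.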

61
Conclusion

Numerous models exist for clustering, such as
clique-based, graph partitioning, min-cut,
connectivity-based, etc.
However, more sophisticated approaches require
a rigorous background in optimization and
algorithms.
Commercial software packages are available for
clustering problems of moderate sizes.
For large-scale instances, meta-heuristic
approaches such as simulated annealing, tabu
search, or GRASP are used.
The basic algorithms are too restrictive for
real-life data, so the distance-k neighborhood
method is used as an alternative. More detail is in
[Holme05]

62
Conclusion

Distance-k neighborhood $N_k(v)$
$N_k(v)$: the vertices that are at distance k or
less from v, excluding v itself.
BFS can be used to find $N_k(v)$.
Used to identify molecular complexes starting
with a seed vertex.
The vertex weights are based on k-cores
(subgraphs of minimum degree at least k) in the
neighborhood of a vertex.
k-cores were first introduced in social network
analysis
They identify dense regions of the network.
They resemble cliques if k is large enough in
relation to the size of the k-core found.
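Both notions on this slide are easy to compute. The following is a minimal sketch, assuming an adjacency-set dictionary as input: a breadth-first search that stops at depth k gives the distance-k neighbourhood, and repeatedly deleting vertices of degree below k gives the k-core. The function names and the example graph are illustrative.

```python
from collections import deque


def distance_k_neighbourhood(adj, v, k):
    """Vertices at distance k or less from v, excluding v itself (BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == k:
            continue  # do not expand beyond depth k
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist) - {v}


def k_core(adj, k):
    """Vertex set of the k-core: repeatedly delete vertices of degree < k."""
    core = {v: set(nb) for v, nb in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(core):
            if len(core[v]) < k:
                for x in core[v]:
                    core[x].discard(v)
                del core[v]
                changed = True
    return set(core)


# Hypothetical example: triangle {1, 2, 3} with a tail 3 - 4 - 5.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
```

In this example the 2-core is the triangle, since the tail vertices are peeled off one after another.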

63
Conclusion

Several models that relax cliques and dominating
sets based on distance-k neighborhood also exist
such as quasi-cliques.
Those relaxations are more robust in clustering
real-life data containing errors (noise)

64
Conclusion

Dilemma for clustering
Very rarely does real-life data present a unique
clustering solution, because
Deciding the best model is hard and it requires
experimentation with different models.
Even under one model, several optimal solutions
are possible.
These issues are in addition to the general
issues of clustering which are
The interpretation of clusters
What they represent.
The whole is more than the sum of its parts --
Aristotle

65
References
[Junker08] Junker and Schreiber. Analysis of Biological Networks. Wiley-Interscience publication, 2008.
[Butenko04] Butenko, Cheng, Oliveira, and Pardalos. A new heuristic for the minimum connected dominating set problem on ad hoc wireless networks. Cooperative Control and Optimization, pp. 61-73, 2004.
[Garey79] Garey and Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York, 1979.
[Hochbaum85] Hochbaum and Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10:180-184, 1985.
[Holme05] Core-periphery organization of complex networks. Physical Review E, 72:046111-1-046111-4, 2005.
