
Techniques of Classification and Clustering

Problem Description

- Assume
- A = {A1, A2, ..., Ad}: (ordered or unordered) domains
- S = A1 × A2 × ... × Ad: a d-dimensional (numerical or non-numerical) space
- Input
- V = {v1, v2, ..., vm}: d-dimensional points, where vi = (vi1, vi2, ..., vid). The jth component of vi is drawn from domain Aj.
- Output
- G = {g1, g2, ..., gk}: a set of groups of the points in V, each with a label, where gi ⊆ V.

Classification

- Supervised classification
- Discriminant analysis, or simply classification
- A collection of labeled (pre-classified) training patterns is provided
- Aims to label a newly encountered, yet unlabeled, pattern
- Unsupervised classification
- Clustering
- Aims to group a given collection of unlabeled patterns into meaningful clusters
- Category labels are data-driven

Methods for Classification

- Neural nets
- Classification functions are obtained by making multiple passes over the training set
- Poor generation (training) efficiency
- Inefficient handling of non-numerical data
- Decision trees
- If E contains only objects of one group, the decision tree is just a leaf labeled with that group.
- Construct a DT that correctly classifies objects in the training data set.
- Test: classify the unseen objects in the test data set.

Decision Trees (Ex Credit Analysis)

salary < 20000?
  no  → accept
  yes → education in graduate?
          yes → accept
          no  → reject
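Read as a rule, the tree is just a pair of nested tests. A minimal sketch in Python (the function name is hypothetical; thresholds and labels mirror the figure):

def credit_decision(salary, education):
    # Root split: salary < 20000?
    if salary < 20000:
        # Low salary: fall through to the education test
        return "accept" if education == "graduate" else "reject"
    # salary >= 20000: accept directly
    return "accept"

print(credit_decision(40000, "under"))      # accept
print(credit_decision(15000, "under"))      # reject
print(credit_decision(18000, "graduate"))   # accept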

Decision Trees

- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts

Decision Tree Algorithms

- Classifiers from the machine learning community
- ID3: J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1, 1986.
- C4.5: J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
- CART: L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
- Classifiers for large databases
- SLIQ [MAR96]; SPRINT: John Shafer, Rakesh Agrawal, and Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of VLDB Conf., Bombay, India, September 1996.
- SONAR: Takeshi Fukuda, Yasuhiko Morimoto, and Shinichi Morishita, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., Bombay, India, 1996.
- RainForest: J. Gehrke, R. Ramakrishnan, V. Ganti, RainForest: A Framework for Fast Decision Tree Construction of Large Datasets, Proc. of VLDB Conf., 1998.
- Building phase followed by a pruning phase

Decision Tree Algorithms

- Building phase
- Recursively split nodes using the best splitting attribute for the node
- Pruning phase
- A smaller, imperfect decision tree generally achieves better accuracy
- Prune leaf nodes recursively to prevent over-fitting

Preliminaries

- Theoretical background
- Entropy
- Similarity measures
- Advanced terms

Information Theory Concepts

- Entropy of a random variable X with probability distribution p(x): H(X) = -Σx p(x) log p(x)
- The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q: D(p||q) = Σx p(x) log(p(x)/q(x))
- Mutual information between random variables X and Y: I(X;Y) = Σx Σy p(x,y) log(p(x,y)/(p(x)p(y)))
- (All three are sketched in code below.)
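These quantities compute directly from their definitions. A minimal sketch (base-2 logs, so results are in bits):

import numpy as np

def entropy(p):
    # H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    # D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(pxy):
    # I(X;Y) = KL(p(x,y) || p(x)p(y)) for a joint distribution as a 2-D array
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz]))

print(entropy([0.5, 0.5]))                                # 1.0 bit
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))              # > 0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # 0.0 (independent)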

What is Entropy?

- S is a sample of the training data set
- Entropy measures the impurity of S
- H(X): the entropy of X
- If H(X) = 0, X takes a single value; as H(X) increases, the values of X become more heterogeneous.
- For the same number of X values,
- High entropy means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so values sampled from it would be all over the place.
- Low entropy means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so values sampled from it would be more predictable.

Entropy-Based Data Segmentation

T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996.

- The attribute has three categories: 40 in C1, 30 in C2, 30 in C3.

       Total  C1  C2  C3
All    100    40  30  30

- Splitting (two candidate splits):

S1     60     40  10  10
S2     40      0  20  20

S3     60     20  20  20
S4     40     20  10  10

Information Theoretic Measure

R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami, An Interval Classifier for Database Mining Applications, Proc. of VLDB Conf., 1992.

- Information gain by branching on Ai: gain(Ai) = E - Ei
- The entropy E of an object set S containing ek objects of group Gk: E = -Σk (ek/|S|) log(ek/|S|)
- The expected entropy Ei for the tree with Ai as the root: Ei = Σj (|Sij|/|S|) Eij, where Eij is the expected entropy for the subtree over the jth object subset
- Information content of the value of Ai

Ex

       Total  C1  C2  C3
All    100    40  30  30

- Splitting (two candidate splits):

S1     60     20  20  20
S2     40     20  10  10

S3     40     40   0   0
S4     30      0  30   0
S5     30      0   0  30

- Gain: gain(Ai) = E - Ei (entropies here use natural logarithms; a sketch checking the numbers follows)

gain1 = E - E1 ≈ 0.015
gain2 = E - E2 ≈ 1.09
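The gains can be re-derived from the class counts; a minimal sketch (natural-log entropies; gain1 comes out as ≈0.014, which the slide rounds to 0.015):

import math

def ent(counts):
    # Entropy (natural log) of a class-count vector
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def expected_ent(split):
    # Weighted entropy of a list of class-count vectors
    n = sum(sum(s) for s in split)
    return sum(sum(s) / n * ent(s) for s in split)

E = ent([40, 30, 30])                                           # ~1.09
gain1 = E - expected_ent([[20, 20, 20], [20, 10, 10]])          # ~0.015
gain2 = E - expected_ent([[40, 0, 0], [0, 30, 0], [0, 0, 30]])  # ~1.09
print(round(gain1, 3), round(gain2, 2))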

Distributional Similarity Measures

- Cosine
- Jaccard coefficient
- Dice coefficient
- Overlap coefficient
- L1 distance (City block distance)
- Euclidean distance (L2 distance)
- Hellinger distance
- Information Radius (Jensen-Shannon divergence)
- Skew divergence
- Confusion Probability
- Lin's Similarity Measure

Similarity Measures

- Minkowski distance: d_p(x, y) = (Σi |xi - yi|^p)^(1/p)
- Euclidean distance
- p = 2
- Manhattan distance
- p = 1
- Mahalanobis distance: d_M(x, y) = sqrt((x - y)^T Σ^(-1) (x - y))
- Normalization due to weighting schemes
- Σ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process
- (Both families are sketched in code below.)
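A minimal sketch of the two distance families (with the identity covariance, Mahalanobis reduces to Euclidean):

import numpy as np

def minkowski(x, y, p=2):
    # (sum_i |x_i - y_i|^p)^(1/p); p = 2 Euclidean, p = 1 Manhattan
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    # sqrt((x - y)^T cov^{-1} (x - y))
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x, y = [1.0, 2.0], [4.0, 6.0]
print(minkowski(x, y, p=2))          # 5.0
print(minkowski(x, y, p=1))          # 7.0
print(mahalanobis(x, y, np.eye(2)))  # 5.0 with identity covariance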

General form

- I(common(A, B)): information content associated with the statement describing what A and B have in common
- I(description(A, B)): information content associated with the statement describing A and B
- π(s): probability of the statement within the world of the objects in question, i.e., the fraction of objects exhibiting feature s

IT-Sim(A, B) = I(common(A, B)) / I(description(A, B))

Similarity Measures

- The Set/Bag Model: let X and Y be two collections of XML documents
- Jaccard's Coefficient: |X ∩ Y| / |X ∪ Y|
- Dice's Coefficient: 2|X ∩ Y| / (|X| + |Y|)

Similarity Measures

- Cosine-Similarity Measure (CSM)
- The Vector-Space Model: Cosine-Similarity Measure (CSM)

Query Processing: computing a single cosine

- For every term i, with each doc j, store the term frequency tfij.
- Some tradeoffs on whether to store the term count, the term weight, or the weight scaled by idfi.
- At query time, accumulate the component-wise sum (sketched below).
- If you're indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas?
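One standard answer is to keep accumulators only for documents that actually occur in some query term's postings. A minimal sketch (the tiny index is a made-up stand-in; a full cosine would also divide by document norms precomputed at index time):

from collections import defaultdict

# Hypothetical in-memory inverted index: term -> list of (doc_id, weight)
index = {
    "cluster": [(1, 0.8), (3, 0.5)],
    "entropy": [(1, 0.3), (2, 0.9)],
}

def cosine_scores(query_weights):
    # Term-at-a-time scoring: accumulate component-wise products per document
    acc = defaultdict(float)                  # doc_id -> partial dot product
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, []):
            acc[doc_id] += q_w * d_w
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(cosine_scores({"cluster": 1.0, "entropy": 0.5}))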

Similarity Measures (2)

- The Generalized Cosine-Similarity Measure (GCSM): let X and Y be vectors
- Hierarchical model
- Why only for depth?

2 Dim Similarities

- Cosine Measure
- Hellinger Measure
- Tanimoto Measure
- Clarity Measure

Advanced Terms

- Conditional Entropy
- Information Gain

Specific Conditional Entropy

- H(Y|X = v)
- Suppose I'm trying to predict output Y and I have input X
- X = College Major, Y = likes "Gladiator"
- Let's assume this reflects the true probabilities

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

- From this data we estimate
- P(LikeG = Yes) = 0.5
- P(Major = Math and LikeG = No) = 0.25
- P(Major = Math) = 0.5
- P(LikeG = Yes | Major = History) = 0
- Note
- H(X) = 1.5, H(Y) = 1
- H(Y|X = Math) = 1, H(Y|X = History) = 0, H(Y|X = CS) = 0

Conditional Entropy

- Definition of conditional entropy:
- H(Y|X) = the average specific conditional entropy of Y
- If you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X?
- Expected number of bits to transmit Y if both sides know the value of X
- (Reproduced in code below.)

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

vj       P(X = vj)   H(Y|X = vj)
Math     0.5         1
History  0.25        0
CS       0.25        0

H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
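A minimal sketch that reproduces the table's numbers from the eight (X, Y) records:

from collections import Counter
from math import log2

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def H(labels):
    # Entropy of a label list, in bits
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def H_cond(pairs):
    # H(Y|X) = sum_v P(X = v) * H(Y|X = v)
    n = len(pairs)
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, []).append(y)
    return sum(len(ys) / n * H(ys) for ys in by_x.values())

print(H([y for _, y in data]))  # H(Y)   = 1.0
print(H_cond(data))             # H(Y|X) = 0.5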

Information Gain

- Definition of information gain:
- IG(Y|X): I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
- IG(Y|X) = H(Y) - H(Y|X)

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
Thus, IG(Y|X) = 1 - 0.5 = 0.5

Relative Information Gain

- Definition of relative information gain:
- RIG(Y|X): I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?
- RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y)

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
Thus, RIG(Y|X) = (1 - 0.5)/1 = 0.5

What is Information Gain used for?

- Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find
- IG(LongLife | HairColor) = 0.01
- IG(LongLife | Smoker) = 0.2
- IG(LongLife | Gender) = 0.25
- IG(LongLife | LastDigitOfSSN) = 0.00001
- IG tells you how interesting a 2-d contingency table is going to be.

Clustering

- Given
- Data points and the number of desired clusters K
- Group the data points into K clusters
- Data points within a cluster are more similar than across clusters
- Sample applications
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth

A Clustering Example

Cluster 1: Income: High, Children: 1, Car: Luxury
Cluster 2: Income: Medium, Children: 2, Car: Truck
Cluster 3: Income: Medium, Children: 3, Car: Sedan
Cluster 4: Income: Low, Children: 0, Car: Compact

Different ways of representing clusters

(Figure: the same clusters over points b, c, e, f, g, i shown in several alternative representations.)

Clustering Methods

- Partitioning
- Given a set of objects and a clustering criterion, partitional clustering obtains a partition of the objects into clusters such that the objects in a cluster are more similar to each other than to objects in different clusters.
- K-means and K-medoid methods determine K cluster representatives and assign each object to the cluster whose representative is closest to the object, such that the sum of the squared distances between the objects and their representatives is minimized.
- Hierarchical
- A nested sequence of partitions.
- Agglomerative: starts by placing each object in its own cluster, then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster.
- Divisive: starts with all objects in one cluster and subdivides it into smaller pieces.

Algorithms

- k-Means
- Fuzzy C-Means Clustering
- Hierarchical Clustering
- Probabilistic Clustering

Similarity Measures (2)

- Mutual Neighbor Distance (MND)
- MND(xi, xj) NN(xi, xj)NN(xj, xi), where NN(xi,

xj) is the neighbor number xj with respect to xi. - Distance under context
- s(xi, xj)f(xi, xj, e), where e is the context

K-Means Clustering Algorithm

- Choose k cluster centers to coincide with k randomly chosen patterns
- Assign each pattern to its closest cluster center.
- Recompute the cluster centers using the current cluster memberships.
- If a convergence criterion is not met, go to step 2.
- Typical convergence criteria
- No (or minimal) reassignment of patterns to new cluster centers, or a minimal decrease in squared error. (A runnable sketch of these steps follows.)
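A minimal numpy sketch of the four steps (random initial centers; stops when no pattern is reassigned; the empty-cluster guard reflects the remark made later that the set of samples closest to a center can be empty):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)               # step 2: closest center
        if labels is not None and np.array_equal(new_labels, labels):
            break                                       # step 4: no reassignment
        labels = new_labels
        for j in range(k):                              # step 3: recompute centers
            if np.any(labels == j):                     # a cluster may be empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))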

Objective Function

- The k-means algorithm aims at minimizing the following objective (square-error) function: J = Σ_{j=1..k} Σ_{x ∈ Cj} ||x - cj||²

K-Means Algorithm (Ex)

(Figure: points A-J in the plane; k-means iterations move the centers and reassign the points.)

Distortion

- Given a clustering Ω, we denote by Ω(x) the centroid this clustering associates with an arbitrary point x. A measure of quality for Ω:
- Distortion = Σx d²(x, Ω(x)) / R
- where R is the total number of points and x ranges over all input points.
- Improvement: penalize model size
- Distortion + (number of parameters) · log R
- Distortion + mk · log R

Remarks

- How to initialize the means is the problem. One popular way to start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means.
- It can happen that the set of samples closest to mi is empty, so that mi cannot be updated.
- The results depend on the metric used to measure distance.
Related Work: Clustering

- Graph-based clustering
- For an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document(s) in C (b can be an element or an attribute).
- For two sets, C1 and C2, of XML documents, the distance between them is dist(C1, C2) = 1 - |sg(C1) ∩ sg(C2)| / max{|sg(C1)|, |sg(C2)|}, where |sg(Ci)| is the number of edges in sg(Ci)

Fuzzy C-Means Clustering

- FCM is a method of clustering which allows one piece of data to belong to two or more clusters.
- Objective function: Jm = Σ_{i=1..N} Σ_{j=1..C} uij^m ||xi - cj||², with 1 ≤ m < ∞
- Fuzzy partitioning is carried out through an iterative optimization of the objective function above, with the update of the membership u and the cluster center c by (sketched in code below)
- uij = 1 / Σ_{k=1..C} (||xi - cj|| / ||xi - ck||)^(2/(m-1))
- cj = Σ_{i=1..N} uij^m xi / Σ_{i=1..N} uij^m

Membership

- The iteration stops when max_{ij} |uij^(k+1) - uij^(k)| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of Jm.
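A minimal sketch of the alternating updates, using the standard FCM formulas above (random initial memberships; the small epsilon added to distances is a numerical guard, an implementation choice rather than part of the method):

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(max_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]           # c_j update
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)             # u_ij update
        if np.abs(u_new - u).max() < eps:        # termination criterion
            break
        u = u_new
    return centers, u

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers, u = fcm(X, c=2)
print(centers.round(2))
print(u.round(2))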

Fuzzy Clustering

- Properties
- uij ∈ [0, 1] for all i, j
- Σ_j uij = 1 for all i
- 0 < Σ_i uij < N for all j

Speculations

- Correlation between m and ε
- More iterations k for a smaller ε.

Hierarchical Clustering

- Basic process (sketched in code below)
- Start by assigning each item to a cluster: N clusters for N items. (Let the distances between the clusters equal the distances between the items they contain.)
- Find the closest (most similar) pair of clusters and merge them into a single cluster.
- Compute distances between the new cluster and each of the old clusters.
- Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
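A minimal sketch using SciPy's hierarchical-clustering routines: linkage computes the full merge sequence (the dendrogram), and fcluster cuts it to a flat clustering. The data is illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [10.0, 0.0]])

# Merge sequence; method can be 'single', 'complete', 'average', 'ward', ...
Z = linkage(X, method='single')

# Cut the dendrogram to obtain, e.g., 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)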

Hierarchical Clustering (Ex)

(Figure: a dendrogram showing the nested merge sequence.)

Hierarchical Clustering Algorithms

- Single-linkage clustering
- The distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second).
- Complete-linkage clustering
- The distance between two clusters is the maximum of the distances between all pairs of patterns drawn from the two clusters
- Average-linkage clustering
- Minimum-variance algorithm

Single-/Complete-Link Clustering

(Figure: a two-class point set, with points labeled 1 and 2, clustered by single-link and by complete-link.)

Single-Linkage Hierarchical Clustering

- Steps
- Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
- Find the least dissimilar pair of clusters in the current clustering, d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
- Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set L(m) = d[(r),(s)].
- Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
- If all objects are in one cluster, stop. Else go to step 2.

Ex: Single-Linkage

- Cities and states clustered by pairwise distance (the proximity matrix and dendrogram are not reproduced here)

Agglomerative Hierarchical Clustering

ALGORITHM: Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector Bi in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = B1, B2, ..., Bn.
(2) Compute a merging cost function between every pair of elements in L to find the two closest clusters Bi, Bj, which will be the cheapest pair to merge.
(3) Remove Bi and Bj from L.
(4) Merge Bi and Bj to create a new internal node Bij in T, which will be the parent of Bi and Bj in the result tree.
(5) Repeat from (2) until there is only one set remaining.

Graph-Theoretic Clustering

- Construct the minimal spanning tree (MST)
- Delete the MST edges with the largest lengths (see the sketch below)

(Figure: an MST over points A-G in the (x1, x2) plane with edge lengths 0.5, 1.5, 1.5, 1.7, 3.5, and 6.5; deleting the longest edge, 6.5, separates the two clusters.)
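A minimal SciPy sketch of the MST approach: build the MST over pairwise distances, delete the k-1 longest edges, and read clusters off the connected components. The points are illustrative:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]])
k = 2

mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()

# Keep all but the k-1 longest MST edges, then take connected components
keep = np.argsort(mst.data)[: len(mst.data) - (k - 1)]
pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                    shape=mst.shape)
n_clusters, labels = connected_components(pruned, directed=False)
print(n_clusters, labels)   # 2 clusters: [0 0 0 1 1]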

Improving k-Means

D. Pelleg and A. Moore, Accelerating Exact k-means Algorithms with Geometric Reasoning, Proc. of ACM SIGKDD Conf., 1999.

- Definitions
- Center of clusters → (Th. 2) center of rectangle owner(h)
- c1 dominates c2 w.r.t. h if h lies entirely on c1's side of the bisector of c1 and c2 (pp. 7, 9)
- Update centroid
- If for all other centers c', c dominates c' w.r.t. h (so c = owner(h), p. 10) → assign h to owner(h), or split h
- (Blacklisting version) c1 dominates c2 w.r.t. any h' contained in h (p. 11)

Clustering Categorical Data: ROCK

- S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, Proc. of IEEE Conf. on Data Engineering, 1999
- Uses links to measure similarity/proximity
- Not distance-based
- Computational complexity
- Basic ideas
- Similarity function and neighbors
- Let T1 = {1,2,3}, T2 = {3,4,5}: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = 1/5

Using the Jaccard Coefficient

- According to the Jaccard coefficient, the distance between {1,2,3} and {1,2,6} is the same as the one between {1,2,3} and {1,2,4}, although the former pair comes from two different clusters.

<{1,2,3,4,5}> CLUSTER 1: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}

<{1,2,6,7}> CLUSTER 2: {1,2,6} {1,2,7} {1,6,7} {2,6,7}

ROCK

- Inducing LINK: the main problem with pairwise similarity alone is that only local properties involving the two points themselves are considered
- Neighbor: if two points are similar enough to each other, they are neighbors
- Link: the link count for a pair of points is the number of common neighbors (sketched below).
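A minimal sketch of the neighbor and link computations with Jaccard similarity and threshold theta. (Unlike the paper's convention, a point is not counted as its own neighbor here, so link counts on tiny examples can differ from the slide's; the point sets below are from the example.)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbor_sets(points, theta):
    # Two points are neighbors if their similarity is at least theta
    return [{j for j, q in enumerate(points)
             if i != j and jaccard(p, q) >= theta}
            for i, p in enumerate(points)]

def link(nbrs, i, j):
    # link(p_i, p_j) = number of common neighbors
    return len(nbrs[i] & nbrs[j])

points = [frozenset(s) for s in
          ({1, 2, 6}, {1, 2, 7}, {1, 6, 7}, {2, 6, 7}, {1, 2, 3})]
nbrs = neighbor_sets(points, theta=0.5)
print(link(nbrs, 0, 1))   # common neighbors of {1,2,6} and {1,2,7} in this toy set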

Rock Algorithm

S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, Proc. of IEEE Conf. on Data Engineering, 1999

- Links: the number of common neighbors of the two points.
- Algorithm
- Draw a random sample
- Cluster with links
- Label the data on disk

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}

link({1,2,3}, {1,2,4}) = 3

Rock Algorithm

S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, Proc. of IEEE Conf. on Data Engineering, 1999

- Criterion function to maximize for the k clusters: El = Σ_{i=1..k} ni · Σ_{pq, pr ∈ Ci} link(pq, pr) / ni^(1 + 2 f(θ))
- Ci denotes cluster i of size ni.

For the similarity threshold 0.5:
link({1,2,6}, {1,2,7}) = 4
link({1,2,6}, {1,2,3}) = 3
link({1,6,7}, {1,2,3}) = 2
link({1,2,3}, {1,4,5}) = 3

CLUSTER 1: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
CLUSTER 2: {1,2,6} {1,2,7} {1,6,7} {2,6,7}

More on Hierarchical Clustering Methods

- Major weaknesses of agglomerative clustering methods
- Do not scale well: time complexity of at least O(n²), where n is the total number of objects
- Can never undo what was done previously
- Integration of hierarchical with distance-based clustering
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

BIRCH

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

- Pre-cluster data points using a CF-tree
- For each point
- The CF-tree is traversed to find the closest cluster
- If the threshold criterion is satisfied, the point is absorbed into the cluster
- Otherwise, it forms a new cluster
- Requires only a single scan of the data
- Cluster summaries stored in the CF-tree are handed to a main-memory hierarchical clustering algorithm

Initialization of BIRCH

- The CF of a cluster of n d-dimensional vectors, V1, ..., Vn, is defined as (n, LS, SS)
- n is the number of vectors
- LS is the sum of the vectors
- SS is the sum of squares of the vectors
- CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
- This additivity property is used for incrementally maintaining cluster features
- The distance between two clusters CF1 and CF2 is defined to be the distance between their centroids.

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: linear sum of the N data points, Σ_{i=1..N} Xi
- SS: square sum of the N data points, Σ_{i=1..N} Xi²

For the points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
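The CF triple and its additivity are a few lines of code; a minimal sketch reproducing the example's numbers (function names are ours):

import numpy as np

def cf_of(points):
    # CF = (N, LS, SS) for a set of d-dimensional points
    pts = np.asarray(points, dtype=float)
    return (len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))

def cf_merge(cf1, cf2):
    # Additivity: CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    return cf[1] / cf[0]

cf = cf_of([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)            # (5, array([16., 30.]), array([ 54., 190.]))
print(centroid(cf))  # [3.2 6. ]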

Notations

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

- Given N d-dimensional data points in a cluster: Xi
- Centroid X0, radius R, diameter D, centroid Euclidean distance D0, centroid Manhattan distance D1

Notations (2)

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

- Given N d-dimensional data points in a cluster: Xi
- Average inter-cluster distance D2, average intra-cluster distance D3, variance increase distance D4

CF Tree

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

(Figure: a CF-tree with branching factor B = 7 and leaf factor L = 6. The root and non-leaf nodes hold entries CF1, CF2, ..., each with a child pointer child1, child2, ...; leaf nodes hold up to L CF entries and are chained by prev/next pointers.)

Example

- Given threshold T (= 2?) and branching factor B = 3, for the 1-d points 3, 6, 8, and 1 (CFs written as (n, (LS, SS)); "|" separates leaf nodes):
- (2,(9,45)) → (2,(4,10)), (2,(14,100))
- For 2 inserted → (1,(2,4))
- (3,(6,14)), (2,(14,100))
- (2,(3,5)), (1,(3,9)) | (2,(14,100))
- For 5 inserted → (1,(5,25))
- (3,(6,14)), (3,(19,125))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (1,(8,64))
- For 7 inserted → (1,(7,49))
- (3,(6,14)), (4,(26,174))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (2,(15,113))

Evaluation of BIRCH

- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weaknesses: handles only numeric data, and is sensitive to the order of the data records.

Data Summarization

- To compress the data into suitable representative objects
- OPTICS, Data Bubbles

Finding clusters from a hierarchical clustering depends on the resolution.

OPTICS

M. Ankerst, M. Breunig, H. Kriegel, J. Sander, OPTICS: Ordering Points to Identify the Clustering Structure, Proc. of ACM SIGMOD Conf., 1999.

- Preliminary: Nε(q) is the subset of D contained in the ε-neighborhood of q (ε is a radius).
- Definition 1 (directly density-reachable): object p is directly density-reachable from object q w.r.t. ε and MinPts in a set of objects D if 1) p ∈ Nε(q), and 2) Card(Nε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N)
- Definitions
- Directly density-reachable (p. 51, Figure 2) → density-reachable: the transitive closure of direct density-reachability
- Density-connected (p → o ← q: both p and q are density-reachable from some o)
- Core-distance_{ε,MinPts}(p) = MinPts-distance(p)
- Reachability-distance_{ε,MinPts}(p, o) w.r.t. o = max(core-distance(o), dist(o, p)) → Figure 4
- Ex) cluster ordering → reachability values, Fig. 12

Data Bubbles

M. Breunig, H. Kriegel, P. Kroger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering, Proc. of ACM SIGMOD Conf., 2001.

- ε-neighborhood of P
- k-distance of P: the distance d(P, O) such that for at least k objects O' ∈ D it holds that d(P, O') ≤ d(P, O), and for at most k-1 objects O' ∈ D it holds that d(P, O') < d(P, O).
- k-nearest neighbors of P
- MinPts-dist(P): a distance within which there are at least MinPts objects in the neighborhood of P.

Data Bubbles

M. Breunig, H. Kriegel, P. Kroger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering, Proc. of ACM SIGMOD Conf., 2001.

- Structural distortion
- Figure 11
- Data Bubble: B = (n, rep, extent, nnDist)
- n: number of objects in X; rep: a representative object for X; extent: an estimate of the radius of X; nnDist: a partial function estimating k-nearest-neighbor distances in X.
- Distance between bubbles B and C (p. 6):
- dist(B.rep, C.rep) - (B.extent + C.extent) + B.nnDist(1) + C.nnDist(1), if the bubbles do not overlap
- max(B.nnDist(1), C.nnDist(1)), otherwise

K-Means in SQL

C. Ordonez, Integrating K-Means Clustering with a Relational DBMS Using SQL, IEEE TKDE, 2006.

- Dataset Y = {y1, y2, ..., yn}: a d×n matrix, where each yi is a d×1 column vector
- K-Means: find k clusters by minimizing the squared error from the centers.
- Squared distance, Eq. (1), and objective function, Eq. (2)
- Matrices
- W: k weights (fractions of n)
- C: k means (centroids), a d×k matrix
- R: k variances (squared distances)
- Matrices
- Mj contains the d sums of point dimension values in cluster j: a d×k matrix
- Qj contains the d sums of squared dimension values in cluster j: a d×k matrix
- Nj contains the number of points in cluster j: a k×1 matrix
- Intermediate tables YH, YV, YD, YNN, NMQ, WCR, shown below for a running example

Y (d = 3, n = 5): the points (1,2,3), (1,2,3), (9,8,7), (9,8,7), (9,8,7)

YH (horizontal layout):

i  Y1  Y2  Y3
1   1   2   3
2   1   2   3
3   9   8   7
4   9   8   7
5   9   8   7

YV (vertical layout):

i  l  val
1  1  1
1  2  2
1  3  3
2  1  1
2  2  2
2  3  3
3  1  9
3  2  8
3  3  7
4  1  9
4  2  8
4  3  7
5  1  9
5  2  8
5  3  7

CH (centroids, horizontal):

j  Y1  Y2  Y3
1   1   2   3
2   9   8   7

C (centroids, by dimension l and cluster):

l  C1  C2
1   1   9
2   2   8
3   3   7

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k

YD (squared distance from each point to each centroid):

i  d1   d2
1    0  116
2    0  116
3  116    0
4  116    0
5  116    0

INSERT INTO YD
SELECT i, sum((YV.val - C.C1)*(YV.val - C.C1)) AS d1,
          ...,
          sum((YV.val - C.Ck)*(YV.val - C.Ck)) AS dk
FROM YV, C
WHERE YV.l = C.l
GROUP BY i

YNN (index of the nearest centroid):

i  j
1  1
2  1
3  2
4  2
5  2

INSERT INTO YNN
SELECT i, CASE WHEN d1 <= d2 AND ... AND d1 <= dk THEN 1
               WHEN d2 <= d3 AND ... THEN 2
               ...
               ELSE k END AS j
FROM YD

NMQ (sufficient statistics per dimension l and cluster j):

l  j  N   M    Q
1  1  2   2    2
2  1  2   4    8
3  1  2   6   18
1  2  3  27  243
2  2  3  24  192
3  2  3  21  147

INSERT INTO NMQ
SELECT l, j, sum(1.0) AS N, sum(YV.val) AS M, sum(YV.val * YV.val) AS Q
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j

WCR (weights, centroids, variances):

l  j  W    C  R
1  1  0.4  1  0
2  1  0.4  2  0
3  1  0.4  3  0
1  2  0.6  9  0
2  2  0.6  8  0
3  2  0.6  7  0

Incremental Data Summarization

S. Nassar, J. Sander, C. Cheng, Incremental and Effective Data Summarization for Dynamic Hierarchical Clustering, ACM SIGMOD, 2004.

- For D = {Xi}, 1 ≤ i ≤ N, and a data bubble of n objects, the data index is ρ = n/N.
- For D = {Xi}, with the mean μ and standard deviation σ of the data index,
- ρ is
- good iff ρ ∈ [μ - σ, μ + σ],
- under-filled iff ρ < μ - σ, and
- over-filled iff ρ > μ + σ.

Research Issues

- Dimensionality reduction
- Approximation


Cure: The Algorithm

Guha, Rastogi, Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998

- Draw a random sample s.
- Partition the sample into p partitions of size s/p
- Partially cluster each partition into s/(pq) clusters
- Eliminate outliers
- By random sampling
- If a cluster grows too slowly, eliminate it.
- Cluster the partial clusters.
- Label the data on disk

Data Partitioning and Clustering

Guha, Rastogi, Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998

- s = 50
- p = 2
- s/p = 25
- s/(pq) = 5

(Figure: the sample split into two partitions, each partially clustered into 5 clusters; outliers marked with x.)

Cure: Shrinking Representative Points

Guha, Rastogi, Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998

- Shrink the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster

Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features
- Discovers clusters of arbitrary shape
- Handles noise
- One scan
- Needs density parameters as a termination condition
- Several interesting studies
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg and Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)

CLIQUE (Clustering In QUEst)

- Agrawal, Gehrke, Gunopulos, Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. of ACM SIGMOD Conf., 1998.
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based
- It partitions each dimension into the same number of equal-length intervals
- It partitions a d-dimensional data space into non-overlapping rectangular units
- A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
- A cluster is a maximal set of connected dense units within a subspace

(Figure: points in the (age, salary) plane, with age from 20 to 60 on one axis and salary in units of $10,000 (1 to 7) on the other; density threshold τ = 3.)

CLIQUE: The Major Steps

Agrawal, Gehrke, Gunopulos, Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. of ACM SIGMOD Conf., 1998.

- Partition the data space and find the number of points that lie inside each cell of the partition.
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
- Determine dense units in all subspaces of interest
- Determine connected dense units in all subspaces of interest.
- Generate a minimal description for the clusters
- Determine maximal regions that cover a cluster of connected dense units for each cluster
- Determine a minimal cover for each cluster
Strength and Weakness of CLIQUE

- Strength
- It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
- It is insensitive to the order of records in the input and does not presume some canonical data distribution
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness
- The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Model-based clustering

- Assume the data are generated from K probability distributions
- Typically Gaussian distributions: a soft or probabilistic version of K-means clustering
- Need to find the distribution parameters.
- EM Algorithm

EM Algorithm

- Initialize K cluster centers
- Iterate between two steps (sketched below)
- Expectation step: assign points to clusters
- Maximization step: estimate model parameters
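A minimal 1-d sketch of the two steps for a mixture of Gaussians: responsibilities in the E-step, re-estimated weights, means, and variances in the M-step. Initialization and the per-component variances are simplifying assumptions of this sketch:

import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)        # initial centers
    pi, var = np.full(k, 1.0 / k), np.var(x) * np.ones(k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(cluster j | x_i)
        d = (x[:, None] - mu[None, :]) ** 2
        r = pi * np.exp(-0.5 * d / var) / np.sqrt(2 * np.pi * var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(6, 1, 200)])
print(em_gmm_1d(x, k=2))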

CURE (Clustering Using REpresentatives)

- Guha, Rastogi, Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998
- Stops the creation of a cluster hierarchy if a level consists of k clusters
- Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect

Drawbacks of Distance-Based Methods

Guha, Rastogi, Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998

- Drawbacks of square-error-based clustering methods
- Consider only one point as the representative of a cluster
- Good only for clusters that are convex-shaped and of similar size and density, and only if k can be reasonably estimated
BIRCH

Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996.

- Dependent on the order of insertions
- Works for convex, isotropic clusters of uniform size
- Labeling problem
- Centroid approach
- Labeling problem: even with correct centers, we cannot label correctly

Jensen-Shannon Divergence

- Jensen-Shannon (JS) divergence between two probability distributions: JS_π(p1, p2) = π1 KL(p1 || m) + π2 KL(p2 || m)
- where m = π1 p1 + π2 p2 is the weighted mean distribution
- Jensen-Shannon (JS) divergence between a finite number of probability distributions: JS_π(p1, ..., pn) = Σi πi KL(pi || m), with m = Σi πi pi

Information-Theoretic Clustering (preserving mutual information)

- (Lemma) The loss in mutual information equals the weighted sum of within-cluster Jensen-Shannon divergences
- Interpretation: the quality of each cluster is measured by the Jensen-Shannon divergence between the individual distributions in the cluster.
- Can rewrite the above accordingly
- Goal: find a clustering that minimizes this loss

Information-Theoretic Co-clustering (preserving mutual information)

- (Lemma) The loss in mutual information equals KL(p(x,y) || q(x,y))
- where q(x,y) is the distribution induced by the co-clustering
- It can be shown that q(x,y) is a maximum-entropy approximation to p(x,y).
- q(x,y) preserves the marginals: q(x) = p(x) and q(y) = p(y)
- The number of parameters that determine q is (m-k) + (kl-1) + (n-l)

Preserving Mutual Information

- Lemma
- Note that q(y | x̂) may be thought of as the prototype of row cluster x̂ (the usual centroid of the cluster is the weighted average of its rows)
- Similarly for column clusters
Example Continued

Co-Clustering Algorithm

Properties of Co-clustering Algorithm

- Theorem: the co-clustering algorithm monotonically decreases the loss in mutual information (the objective function value)
- The marginals p(x) and p(y) are preserved at every step (q(x) = p(x) and q(y) = p(y))
- Can be generalized to higher dimensions


Applications -- Text Classification

- Assigning class labels to text documents
- Training and testing phases

(Diagram: a document collection grouped into classes Class-1 ... Class-m serves as training data; the classifier learns from it and assigns a class to each new document.)

Dimensionality Reduction

- Feature Selection
- Feature Clustering

Feature selection (m words → k best words):
- Select the best words
- Throw away the rest
- Frequency-based pruning
- Information-criterion-based pruning
- Document bag-of-words → vector of words (Word1, ..., Wordk)

Feature clustering (m words → k word clusters):
- Do not throw away words
- Cluster words instead
- Use clusters as features
- Document bag-of-words → vector of word clusters (Cluster1, ..., Clusterk)

Experiments

- Data sets
- 20 Newsgroups data: 20 classes, 20,000 documents
- Classic3 data set: 3 classes (cisi, med and cran), 3,893 documents
- Dmoz Science HTML data: 49 leaves in the hierarchy; 5,000 documents with 14,538 words
- Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation details
- Bow for indexing, co-clustering, clustering and classifying

Naïve Bayes with word clusters

- Naïve Bayes classifier
- Assign document d to the class with the highest score
- Relation to KL divergence
- Using word clusters instead of words
- where the parameters for clusters are estimated according to joint statistics

Selecting Correlated Attributes

T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996.

- Decide that attributes A and A' are strongly correlated iff the correlation criterion exceeds a threshold θ ≥ 1

MDL-based Decision Tree Pruning

- M. Mehta, J. Rissanen, R. Agrawal, MDL-based Decision Tree Pruning, Proc. of KDD Conf., 1995.
- Two steps for the induction of decision trees
- Construct a DT using the training data
- Reduce the DT by pruning to prevent overfitting
- Possible approaches
- Cost-complexity pruning, using a separate set of samples for pruning
- DT pruning using the same training data set for testing
- MDL-based pruning, using the Minimum Description Length (MDL) principle.

Pruning Using MDL Principle

M. Mehta, J. Rissanen, R. Agrawal, MDL-based Decision Tree Pruning, Proc. of KDD Conf., 1995.

- View a decision tree as a means for efficiently encoding the classes of records in the training set
- MDL principle: the best tree is the one that can encode the records using the fewest bits
- The cost of encoding the tree includes
- 1 bit for encoding the type of each node (e.g., leaf or internal)
- Csplit: the cost of encoding the attribute and value for each split
- n·E: the cost of encoding the n records in each leaf (E is the entropy)

Pruning Using MDL Principle

M. Mehta, J. Rissanen, R. Agrawal, MDL-based Decision Tree Pruning, Proc. of KDD Conf., 1995.

- Problem: compute the minimum-cost subtree at the root of the built tree
- Let minCN be the cost of encoding the minimum-cost subtree rooted at N
- Prune the children of a node N if minCN = n·E + 1
- Compute minCN as follows
- N is a leaf: n·E + 1
- N has children N1 and N2: min{n·E + 1, Csplit + 1 + minCN1 + minCN2}
- Prune the tree in a bottom-up fashion

MDL Pruning - Example

R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998.

- Cost of encoding the records in N: n·E + 1 = 3.8
- Csplit = 2.6
- minCN = min{3.8, 2.6 + 1 + 1 + 1} = 3.8
- Since minCN = n·E + 1, N1 and N2 are pruned (see the sketch below)
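A minimal sketch of the bottom-up computation. The dict-based node representation is hypothetical; the costs follow the minCN recurrence above:

def min_cost(node):
    # Minimum MDL cost of the subtree rooted at node; prunes the children
    # whenever encoding the node as a leaf is no more expensive than splitting
    leaf_cost = node['n'] * node['E'] + 1              # n*E + 1 bit for node type
    if 'left' not in node:                             # already a leaf
        return leaf_cost
    split_cost = (node['Csplit'] + 1
                  + min_cost(node['left']) + min_cost(node['right']))
    if leaf_cost <= split_cost:
        del node['left'], node['right']                # prune, bottom-up
        return leaf_cost
    return split_cost

# The example above: n*E + 1 = 3.8, Csplit = 2.6, two pure leaves (cost 1 each)
root = {'n': 10, 'E': 0.28, 'Csplit': 2.6,
        'left': {'n': 4, 'E': 0.0}, 'right': {'n': 6, 'E': 0.0}}
print(min_cost(root))   # 3.8 -> the children are pruned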

PUBLIC

- R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998.
- Prune the tree during (not after) the building phase
- Execute the pruning algorithm (periodically) on the partial tree
- Problem: how to compute minCN for a yet-to-be-expanded leaf N in a partial tree
- Solution: compute a lower bound on the subtree cost at N and use this as minCN when pruning
- minCN is thus a lower bound on the cost of the subtree rooted at N
- Prune the children of a node N if minCN = n·E + 1
- Guaranteed to generate a tree identical to that generated by SPRINT

PUBLIC(1)

R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998.

sal   education    Label
10K   High-school  Reject
40K   Under        Accept
15K   Under        Reject
75K   Grad         Accept
18K   Grad         Accept

- Simple lower bound for a subtree: 1
- Cost of encoding the records in N: n·E + 1 = 5.8
- Csplit = 4
- minCN = min{5.8, 4 + 1 + 1 + 1} = 5.8
- Since minCN = n·E + 1, N1 and N2 are pruned

PUBLIC(S)

- Theorem: the cost of any subtree with s splits (s ≥ 1) rooted at node N is at least 2s + 1 + s·log a + Σ_{i=s+2..k} ni
- a is the number of attributes
- k is the number of classes
- ni (≥ ni+1) is the number of records belonging to class i
- The lower bound on the subtree cost at N is thus the minimum of
- n·E + 1 (cost with zero splits)
- min_{s ≥ 1} { 2s + 1 + s·log a + Σ_{i=s+2..k} ni }

What's Clustering?

- Clustering is a kind of unsupervised learning.
- Clustering is a method of grouping data that share similar trends and patterns.
- Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
- Example

(Figure: a scattered point set before and after clustering.)

Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.

Partitional Algorithms

- Enumerate K partitions optimizing some criterion
- Example: the square-error criterion e² = Σ_{j=1..K} Σ_{i=1..nj} ||xi(j) - cj||²
- where xi(j) is the ith pattern belonging to the jth cluster and cj is the centroid of the jth cluster.

Squared Error Clustering Method

- Select an initial partition of the patterns with a fixed number of clusters and cluster centers
- Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.
- Merge and split clusters based on some heuristic information, optionally repeating step 2.

Agglomerative Clustering Algorithm

- Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
- Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns where pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.
- The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level, forming a partition identified by the simply connected components in the corresponding graph.

Agglomerative Hierarchical Clustering

- The most widely used hierarchical clustering approach
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes K
- Closest: dmean(Ci, Cj), the distance between cluster means
- dmin(Ci, Cj), the minimum distance over pairs with one point from each cluster
- Likewise dave(Ci, Cj) and dmax(Ci, Cj)

Clustering

- Summary of drawbacks of traditional methods
- Partitional algorithms split large clusters
- Centroid-based methods split large and non-hyperspherical clusters
- Centers of subclusters can be far apart
- The minimum spanning tree algorithm is sensitive to outliers and slight changes in position
- Exhibits a chaining effect on strings of outliers
- Cannot scale up for large databases

Model-based Clustering

- Mixture of Gaussians
- Gaussian pdf P(ωi)
- Data point ~ N(μi, σ²I)
- Consider
- Data points x1, x2, ..., xN
- P(ω1), ..., P(ωk), μ
- Likelihood function
- Maximize the likelihood function by setting its derivatives with respect to the parameters to zero
Overview of EM Clustering

- Extensions and generalizations: the EM (expectation maximization) algorithm extends the k-means clustering technique in two important ways
- Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster membership based on one or more probability distributions. The goal of the clustering algorithm is then to maximize the overall probability or likelihood of the data, given the (final) clusters.
- Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).

EM Algorithm

- The EM algorithm for clustering is described in detail in Witten and Frank (2001).
- The basic approach and logic of this clustering method is as follows.
- Suppose you measure a single continuous variable in a large sample of observations.
- Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution.
- The resulting distribution of values (in the population) may look like this:

EM vs. k-Means

- Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.

Finding k

- V-fold cross-validation. This type of cross-validation is useful when no test sample is available and the learning sample is too small to have the test sample taken from it. A specified V value for V-fold cross-validation determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample. The classification tree of the specified size is computed V times, each time leaving out one of the subsamples from the computations and using that subsample as a test sample for cross-validation, so that each subsample is used V - 1 times in the learning sample and just once as the test sample. The CV costs computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs.

Expectation Maximization

- A mixture of Gaussians
- Ex: x1 = 30, P(x1) = 1/2; x2 = 18, P(x2) = u; x3 = 0, P(x3) = 2u; x4 = 23, P(x4) = 1/2 - 3u
- Likelihood, for x1: a students, x2: b students, x3: c students, x4: d students: L = (1/2)^a · u^b · (2u)^c · (1/2 - 3u)^d
- To maximize L, calculate the log-likelihood L'; the maximizer is u = (b + c)/(6(b + c + d)). Supposing a = 14, b = 6, c = 9, d = 10, then u = 1/10.
- If x1 and x2 are observed only as a combined count of h students (a + b = h), then a = h/(2u + 1) and b = 2uh/(2u + 1), which is the E-step (sketched below).
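A minimal sketch of the E/M alternation for this example (h is the combined count of x1 and x2; the update formulas are the ones derived above):

def em_u(h, c, d, iters=20, u=0.05):
    # EM for the one-parameter multinomial: P = (1/2, u, 2u, 1/2 - 3u)
    for _ in range(iters):
        # E-step: split h into expected counts a and b given the current u
        a = h * 0.5 / (0.5 + u)
        b = h * u / (0.5 + u)
        # M-step: maximum-likelihood u from the expected counts
        u = (b + c) / (6 * (b + c + d))
    return u

print(em_u(h=20, c=9, d=10))   # converges to a fixed point of the EM updates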

Gaussian (Normal) pdf

- The Gaussian function with mean μ and standard deviation σ. Properties of the function:
- Symmetric about the mean
- Attains its maximum value at the mean, and its minimum value at plus and minus infinity
- The distribution is often referred to as bell-shaped
- At one standard deviation from the mean the function has dropped to about 2/3 of its maximum value; at two standard deviations it has fallen to about 1/7.
- The area under the function within one standard deviation of the mean is about 0.682; within two standard deviations it is 0.9545, and within three it is 0.9973. The total area under the curve is 1.

Gaussian

Think of the cumulative distribution, F_{μ,σ²}(x)
Multi-variate Density Estimation

Mixture of Gaussians

- θ contains all the parameters of the mixture model; the πi are known as mixing proportions or coefficients.
- A mixture of Gaussians model
- Generic mixture

(Diagram: generative view of a two-component mixture; choose component y ∈ {y1, y2} with probability P(y), then draw x from P(x|y1) or P(x|y2).)

Mixture Density

- If we are given just x, we do not know which mixture component this example came from
- We can evaluate the posterior probability that an observed x was generated from the first mixture component: P(y1 | x) = P(y1) P(x | y1) / Σj P(yj) P(x | yj)