
A Closer Look at Clustering in S-Plus

Getting Your Data Into S-PLUS

- mm <- matrix(scan("mfile"), ncol=5, byrow=TRUE)
- reads all rows from the file, where there are 5 columns in each row
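As a rough illustration of what scan plus matrix(..., byrow=TRUE) does, here is a small Python sketch (the function name scan_matrix is ours, not part of S-PLUS): read every whitespace-separated number, then fill rows of 5 left to right.

```python
from io import StringIO

def scan_matrix(stream, ncol):
    """Read all whitespace-separated numbers and fill rows of ncol values
    left to right -- the byrow=TRUE behaviour of the S-PLUS call above."""
    values = [float(tok) for tok in stream.read().split()]
    assert len(values) % ncol == 0, "file length must be a multiple of ncol"
    return [values[i:i + ncol] for i in range(0, len(values), ncol)]

demo = StringIO("1 2 3 4 5\n6 7 8 9 10")
print(scan_matrix(demo, 5))  # [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
```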

Reading In Tabular Data

- read.table(file, header=<<see below>>, sep=, row.names=, col.names=, as.is=F, na.strings="NA", skip=0)
- header=T means that the first line of the file is used for the variable names in our data frame
- row.names: a variable that indicates the row names. It can be a vector of the same length as our table, or a number that points to a particular column where the row names reside; otherwise the numbers from 1 to the length of our table are used.
- col.names: a variable that provides the column names in the absence of header=T. If nothing is provided then S-PLUS uses V concatenated with the field number.
- as.is: logical that determines whether non-numeric variables are turned into character strings or not.

Example 1 - Reading Tabular Data

- Price Floor Area Rooms Age Cent.heat
- 01 52.00 111.0 830 5 6.2 no
- 02 54.75 128.0 710 5 7.5 no
- > floor<-read.table("c:/floor.txt")
- > attributes(floor)
- $names
- [1] "Price" "Floor" "Area" "Rooms"
- [5] "Age" "Cent.heat"
- $class
- [1] "data.frame"
- $row.names
- [1] "1" "2"

Example 2 - Reading Tabular Data

- Price Floor Area Rooms Age Cent.heat
- 52.00 111.0 830 5 6.2 no
- 54.75 128.0 710 5 7.5 no
- > floor<-read.table("c:/floor.txt")
- > attributes(floor)
- $names
- [1] "V2" "V3" "V4" "V5" "V6"
- $class
- [1] "data.frame"
- $row.names
- [1] "Price" "52.00" "54.75"
- > floor
- V2 V3 V4 V5 V6
- Price Floor Area Rooms Age Cent.heat
- 52.00 111.0 830 5 6.2 no

Example 3 - Reading Tabular Data

- Price Floor Area Rooms Age Cent.heat
- 52.00 111.0 830 5 6.2 no
- 54.75 128.0 710 5 7.5 no
- > floor<-read.table("c:/floor.txt",header=T,row.names=NULL)
- > attributes(floor)
- $names
- [1] "Price" "Floor" "Area" "Rooms"
- [5] "Age" "Cent.heat"
- $class
- [1] "data.frame"
- $row.names
- [1] "1" "2"

Example Data Generation

- > x1<-rmvnorm(100, mean=c(2,2), cov=matrix(c(1,0,0,1), 2))
- > x2<-rmvnorm(100, mean=c(-2,-2), cov=matrix(c(1,0,0,1), 2))
- > x<-matrix(nrow=200,ncol=2)
- > x[1:100,]<-x1
- > x[101:200,]<-x2
- > pairs(x)


Computing the Distance Matrix

- dist(x, metric = "euclidean")
- metric: character string specifying the distance metric to be used. The currently available options are "euclidean", "maximum", "manhattan", and "binary". Euclidean distances are root sum-of-squares of differences, "maximum" is the maximum difference, "manhattan" is the sum of absolute differences, and "binary" is the proportion of non-zeros that two vectors do not have in common (the number of occurrences of a zero and a one, or a one and a zero, divided by the number of times at least one vector has a one).
- Since there are many distances and since the result of dist is typically an argument to hclust or cmdscale, a vector is returned rather than a symmetric matrix. For i less than j, the distance between row i and row j is element nrow(x)*(i-1) - i*(i-1)/2 + j-i of the result. The returned object has an attribute, Size, giving the number of objects, that is, nrow(x). The length of the vector that is returned is nrow(x)*(nrow(x)-1)/2, that is, it is of order nrow(x)^2.

Example Distance Matrix Computation

- > x.dist<-dist(x)
- > length(x.dist)
- [1] 19900

hclust

- hclust(dist, method = "compact", sim =)
- dist: a distance structure or distance matrix. Normally this will be the result of the function dist, but it can be any data of the form returned by dist, or a full, symmetric matrix. Missing values are not allowed.
- method: a character string giving the clustering method. The three methods currently implemented are "average", "connected" (single linkage) and "compact" (complete linkage). (The first three characters of the method are sufficient.)

Complete Linkage Clustering with hclust

> plclust(hclust(x.dist))

Single Linkage Clustering with hclust

> plclust(hclust(x.dist,method="connected"))

Average Linkage Clustering with hclust

> plclust(hclust(x.dist,method="average"))

Pruning Our Trees

- cutree(tree, k = 0, h = 0)
- k: the desired number of groups. Default is 0.
- h: the height at which to cut the tree in order to produce the groups. Groups will be defined by the structure of the tree above the cut. Default is 0.

Example Pruning

- > x.cl2<-cutree(hclust(x.dist),k=2)
- > x.cl2[1:10]
- [1] 2 2 2 1 2 2 2 2 2 2
- > x.cl2[190:200]
- [1] 1 1 1 1 1 1 1 1 1 1 1
- > attributes(x.cl2)
- $height
- [1] 7.102939 5.142965
- recall this is the height of the last merge making up the group

Identifying the Number of Clusters

- As indicated previously, we really have no way of identifying the true cluster structure unless we have divine intervention.
- In the next several slides we present some well-known methods.

Method of Mojena

- Select the number of groups based on the first stage j of the dendrogram that satisfies a_{j+1} > abar + k*s_a.
- The a0, a1, a2, ..., a_{n-1} are the fusion levels corresponding to stages with n, n-1, ..., 1 clusters. abar and s_a are the mean and unbiased standard deviation of these fusion levels, and k is a constant.
- Mojena (1977): 2.75 < k < 3.5
- Milligan and Cooper (1985): k = 1.25
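A hedged Python sketch of the rule (function name ours; we take the number of groups to be n minus the first stage whose standardized fusion level exceeds k, which matches cutting at the first offending fusion level as done on the following slides):

```python
import statistics

def mojena_groups(heights, k=1.25):
    """heights: fusion levels from the n-1 merges of a hierarchical clustering.
    Returns n - j for the first stage j whose standardized level exceeds k."""
    m = statistics.mean(heights)
    s = statistics.stdev(heights)      # unbiased standard deviation
    n = len(heights) + 1               # n observations give n-1 fusion levels
    for j, h in enumerate(heights, start=1):
        if (h - m) / s > k:
            return n - j
    return 1                           # no stage flagged: a single cluster

toy = [1.0] * 8 + [6.0, 9.0]           # made-up fusion levels, 11 observations
print(mojena_groups(toy, k=1.25))      # 2
```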

Method of Mojena Applied to Our Data Set - I

- > x.clfl<-hclust(x.dist)$height
- assign the fusion levels
- > x.clm<-mean(x.clfl)
- compute the mean
- > x.cls<-sqrt(var(x.clfl))
- compute the standard deviation
- > print((x.clfl-x.clm)/x.cls)
- output the results for comparison with k

Method of Mojena Applied to Our Data Set - II

- > print((x.clfl-x.clm)/x.cls)
- [1] -0.60697193 -0.58746665 -0.58678547 -0.58049331
- [5] -0.57679720 -0.57163306 -0.56496595 -0.56353931
- ...
- [185] 1.21499989 1.28188441 1.48833552 1.60550442
- [189] 1.64120781 1.83945221 1.91133195 2.25999297
- [193] 2.51916087 2.63885648 2.99170110 3.39950673
- [197] 3.98513994 4.92839223 8.13577250

Method of Mojena Applied to Our Data Set - III

- > x.clfl[186]
- [1] 2.428254
- > x.clfl[197]
- [1] 5.893725

Visualizing Our Cluster Structure

- > x.clmojena<-cutree(hclust(x.dist),h=x.clfl[186])
- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.clmojena))

More Visualizing Our Cluster Structure

- > x.clmillcoop<-cutree(hclust(x.dist),h=x.clfl[197])
- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.clmillcoop))

One Last Time

- > x.cllastsplit<-cutree(hclust(x.dist),h=x.clfl[199])
- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.cllastsplit))

To Get One Cluster

- > plclust(hclust(x.dist))
- > x.cljust1<-cutree(hclust(x.dist),h=11.25)
- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.cljust1))

Hartigan's k-means Clustering

- kmeans(x, centers, iter.max=10)
- x: matrix of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values are not accepted.
- centers: matrix of initial guesses for the cluster centers, or integer giving the number of clusters. If centers is an integer, hclust and cutree will be used to get initial values. If centers is a matrix, each row represents a cluster center, and thus centers must have the same number of columns as x. The number of rows in centers (there must be at least two) is the number of clusters that will be formed. Missing values are not accepted.
- OPTIONAL ARGUMENTS
- iter.max: maximum number of iterations.

Outputs of the S-PLUS kmeans function

- An object of class kmeans with the following components:
- cluster: vector of integers, ranging from 1 to nrow(centers), with length the same as the number of rows of x. The ith value indicates the cluster in which the ith data point belongs.
- centers: matrix like the input centers containing the locations of the final cluster centers. Each row is a cluster center location.
- withinss: vector of length nrow(centers). The ith value gives the within cluster sum of squares for the ith cluster.
- size: vector of length nrow(centers). The ith value gives the number of data points in cluster i.
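The components above can be mimicked with a minimal Lloyd-style k-means in Python (an assumed sketch: the S-PLUS function uses Hartigan's algorithm and hclust-based starting values, neither of which is reproduced here):

```python
import math

def kmeans(x, centers, iter_max=10):
    """Lloyd's algorithm; x is a list of points, centers a list of starts."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iter_max):
        # assignment step: each point joins its nearest center
        labels = [min(range(len(centers)),
                      key=lambda j: math.dist(p, centers[j])) for p in x]
        # update step: each center moves to the mean of its members
        new = []
        for j in range(len(centers)):
            members = [p for p, l in zip(x, labels) if l == j]
            new.append(tuple(sum(v) / len(members) for v in zip(*members))
                       if members else centers[j])
        if new == centers:
            break
        centers = new
    withinss = [sum(math.dist(p, centers[l]) ** 2
                    for p, l in zip(x, labels) if l == j)
                for j in range(len(centers))]
    size = [labels.count(j) for j in range(len(centers))]
    return {"cluster": labels, "centers": centers,
            "withinss": withinss, "size": size}

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
fit = kmeans(pts, centers=[(0, 0), (10, 10)])
print(fit["cluster"], fit["size"])  # [0, 0, 0, 1, 1, 1] [3, 3]
```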

Hartigan's k-means Theory

- When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests the following rough rule of thumb. If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when
- (sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k-1)
- is greater than 10.

kmeans Applied to our Data Set

- Here we perform kmeans clustering for a sequence of model sizes
- > x.km2<-kmeans(x,2)
- > x.km3<-kmeans(x,3)
- > x.km4<-kmeans(x,4)
- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.km2$cluster))

The 3 term kmeans solution

- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.km3$cluster))

The 4 term kmeans Solution

- > plot(x[,1],x[,2],type="n")
- > text(x[,1],x[,2],labels=as.character(x.km4$cluster))

Determination of the Number of Clusters Using the Hartigan Criterion

- > sum(x.km2$withinss)/((sum(x.km3$withinss)-1)*(200-2-1))
- [1] 0.006476385
- > sum(x.km3$withinss)/((sum(x.km4$withinss)-1)*(200-3-1))
- [1] 0.005889223
- (Note: as parenthesized, these expressions do not match Hartigan's statistic from the previous slide, (sum(k$withinss)/sum(kplus1$withinss)-1)*(n-k-1).)
- > x.km1<-kmeans(x,1)
- Error in switch(values$ifault, nrow(centers) < 1 or > nrow(x)
- Dumped
- So it seems that in evaluating the k=1 model vs. the k=2 model we need to compute the sum of squared deviations from the mean by hand.

Model Based Clustering

- The idea behind model-based clustering is that the data are independent samples from a series of group populations, but the group labels have been lost. So if we knew that the vector g gave the group labels and that each group had a class-conditional pdf f(x|theta), then the likelihood would be given by L(theta, g) = prod_i f_{g_i}(x_i | theta).
- Since the labels are unknown, these are treated as parameters and the likelihood in the above equation is maximized over (theta, g).
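As a hedged illustration of the classification likelihood, here is a Python sketch with one-dimensional, unit-variance Gaussian class-conditional densities (the data, labels, and means are made up):

```python
import math

def log_likelihood(x, g, means):
    """log L(theta, g) = sum_i log f_{g_i}(x_i | theta) for unit-variance
    1-D Gaussian components with the given means."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - means[gi]) ** 2
               for xi, gi in zip(x, g))

x = [0.1, -0.2, 4.9, 5.3]
g = [0, 0, 1, 1]           # group labels
print(log_likelihood(x, g, means=[0.0, 5.0]))  # highest when labels fit groups
```

Maximizing over g amounts to trying different label assignments; the correct labels give a larger value than, say, the swapped ones.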

Model Based Clustering in S-PLUS - Inputs

- mclust(x, method = "S*", signif = rep(0, dim(x)[2]), noise = F, scale = rep(1, dim(x)[2]), shape = c(1, rep(0.2, (dim(x)[2]-1))), workspace = <<see below>>)
- method: a character string to select the clustering criterion. Possible values are "S", "S*" or "spherical" (with varying sizes), "sum of squares" or "trace" (Ward's method), "unconstrained", "determinant", "centroid", "weighted average link", "group average link", "complete link" or "farthest neighbor", "single link" or "nearest neighbor". Only enough of the string to determine a unique match is required.
- signif: vector giving the number of significant decimal places in each column of x. Nonpositive components are allowed. Used in initializing clustering in some methods.
- noise: indicates whether or not Poisson noise should be assumed.
- scale: vector for scaling the observations. The ith column of x is multiplied by scale[i] before cluster analysis begins.
- shape: vector determining the shape of clusters for methods "S" and "S*".
- workspace: size of the workspace provided to the underlying Fortran program. The default is (dim(x)[1]*(dim(x)[1]-1)) + 10*dim(x)[1].

Model Based Clustering in S-PLUS - Outputs

- tree: list with components merge, height, and order, conforming to the output of the function hclust, but here height is just the stage of the merge. This output can be used with several functions such as plclust and subtree.
- lr: list of objects merged at each stage, in which a new cluster inherits the number of the lowest-numbered object or cluster from which it is formed (used for classification by function mclass).
- awe: a vector in which the kth element is the approximate weight of evidence for k clusters. This component is present only for the model-based methods "S", "S*" or "spherical" (with varying sizes), "sum of squares" or "trace" (Ward's method), "unconstrained", and "determinant".
- call: a copy of the call to mclust.

mclust applied to our data set

- > x.mcl<-mclust(x)
- > plclust(x.mcl$tree)

Assessing the Number of Components in mclust

- > plot(seq(from=1,to=200),x.mcl$awe,type='b')

A Closer Look at the Beginning of the Curve

Examining the mclust Cluster Structure

- We can use the same techniques that we used with kmeans to examine the cluster structure imposed by mclust.
- These pictures are helpful in assessing the appropriateness of the cluster structure.