1
A Closer Look at Clustering in S-Plus
2
Getting Your Data Into S-PLUS
  • mm <- matrix(scan("mfile"), ncol=5, byrow=TRUE)
  • This reads all the values from the file into a matrix with 5
    columns per row, filling row by row (see the sketch below).
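
A minimal sketch of the result, assuming a hypothetical "mfile"
containing ten whitespace-separated numbers (1 through 10):

  > # mfile (hypothetical) contains: 1 2 3 4 5 6 7 8 9 10
  > mm <- matrix(scan("mfile"), ncol=5, byrow=TRUE)
  > dim(mm)   # two rows of five columns
  [1] 2 5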

3
Reading In Tabular Data
  • read.table(file, header=<<see below>>, sep, row.names,
    col.names, as.is=F, na.strings="NA", skip=0)
  • header=T means that the first line of the file is used for the
    variable names in our data frame.
  • row.names: a variable that indicates the row names. It can be a
    vector of the same length as our table, or a number that points
    to a particular column where the row names reside; otherwise the
    numbers from 1 to the length of our table are used.
  • col.names: a variable that provides the column names in the
    absence of header=T. If nothing is provided then S-PLUS uses V
    concatenated with the field number.
  • as.is: logical that determines whether non-numeric variables are
    left as character strings rather than converted to factors.

4
Example 1 - Reading Tabular Data
  • File contents (with row labels in the first column):
  •   Price Floor Area Rooms Age Cent.heat
  •   01 52.00 111.0 830 5 6.2 no
  •   02 54.75 128.0 710 5 7.5 no
  • > floor <- read.table("c:/floor.txt")
  • > attributes(floor)
  • $names
  • [1] "Price" "Floor" "Area" "Rooms"
  • [5] "Age" "Cent.heat"
  • $class
  • [1] "data.frame"
  • $row.names
  • [1] "1" "2"

5
Example 2 - Reading Tabular Data
  • File contents (no row labels):
  •   Price Floor Area Rooms Age Cent.heat
  •   52.00 111.0 830 5 6.2 no
  •   54.75 128.0 710 5 7.5 no
  • > floor <- read.table("c:/floor.txt")
  • > attributes(floor)
  • $names
  • [1] "V2" "V3" "V4" "V5" "V6"
  • $class
  • [1] "data.frame"
  • $row.names
  • [1] "Price" "52.00" "54.75"
  • > floor
  •           V2   V3    V4  V5        V6
  • Price  Floor Area Rooms Age Cent.heat
  • 52.00  111.0  830     5 6.2        no

6
Example 3 - Reading Tabular Data
  • File contents (no row labels):
  •   Price Floor Area Rooms Age Cent.heat
  •   52.00 111.0 830 5 6.2 no
  •   54.75 128.0 710 5 7.5 no
  • > floor <- read.table("c:/floor.txt", header=T, row.names=NULL)
  • > attributes(floor)
  • $names
  • [1] "Price" "Floor" "Area" "Rooms"
  • [5] "Age" "Cent.heat"
  • $class
  • [1] "data.frame"
  • $row.names
  • [1] "1" "2"

7
Example Data Generation
  • > x1 <- rmvnorm(100, mean=c(2,2), cov=matrix(c(1,0,0,1), 2))
  • > x2 <- rmvnorm(100, mean=c(-2,-2), cov=matrix(c(1,0,0,1), 2))
  • > x <- matrix(nrow=200, ncol=2)
  • > x[1:100,] <- x1
  • > x[101:200,] <- x2
  • > pairs(x)

8
Example Data Generation
  [pairs(x) scatterplot matrix of the simulated data]

9
Computing the Distance Matrix
  • dist(x, metric = "euclidean")
  • metric: character string specifying the distance metric to be
    used. The currently available options are "euclidean",
    "maximum", "manhattan", and "binary". Euclidean distances are
    root sum-of-squares of differences, "maximum" is the maximum
    difference, "manhattan" is the sum of absolute differences, and
    "binary" is the proportion of non-zeros that two vectors do not
    have in common (the number of occurrences of a zero and a one,
    or a one and a zero, divided by the number of times at least one
    vector has a one).
  • Since there are many distances and since the result of dist is
    typically an argument to hclust or cmdscale, a vector is
    returned, rather than a symmetric matrix. For i less than j, the
    distance between row i and row j is element
    nrow(x)*(i-1) - i*(i-1)/2 + j-i of the result. The returned
    object has an attribute, Size, giving the number of objects,
    that is, nrow(x). The length of the vector that is returned is
    nrow(x)*(nrow(x)-1)/2, that is, it is of order nrow(x)^2.

10
Example Distance Matrix Computation
  • > x.dist <- dist(x)
  • > length(x.dist)
  • [1] 19900

11
hclust
  • hclust(dist, method = "compact", sim)
  • dist: a distance structure or distance matrix. Normally this
    will be the result of the function dist, but it can be any data
    of the form returned by dist, or a full, symmetric matrix.
    Missing values are not allowed.
  • method: a character string giving the clustering method. The
    three methods currently implemented are "average", "connected"
    (single linkage), and "compact" (complete linkage). (The first
    three characters of the method are sufficient.)
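
A quick look at what hclust returns (a sketch, assuming x.dist from
the previous slide; per the mclust output slide later on, the
components are merge, height, and order):

  > x.hc <- hclust(x.dist)   # default method is "compact"
  > names(x.hc)              # "merge", "height", "order"
  > plclust(x.hc)            # plot the dendrogram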

12
Complete Linkage Clustering with hclust
> plclust(hclust(x.dist))
13
Single Linkage Clustering with hclust
> plclust(hclust(x.dist, method="connected"))
14
Average Linkage Clustering with hclust
> plclust(hclust(x.dist, method="average"))
15
Pruning Our Trees
  • cutree(tree, k = 0, h = 0)
  • k: the desired number of groups. Default is 0.
  • h: the height at which to cut the tree in order to produce the
    groups. Groups will be defined by the structure of the tree
    above the cut. Default is 0.

16
Example Pruning
  • > x.cl2 <- cutree(hclust(x.dist), k=2)
  • > x.cl2[1:10]
  • [1] 2 2 2 1 2 2 2 2 2 2
  • > x.cl2[190:200]
  • [1] 1 1 1 1 1 1 1 1 1 1 1
  • > attributes(x.cl2)
  • $height
  • [1] 7.102939 5.142965
  • Recall that these are the heights of the last merges making up
    each group.
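
The resulting group sizes can be checked directly (a sketch,
assuming the two-group solution above):

  > table(x.cl2)   # number of points assigned to each cluster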

17
Identifying the Number of Clusters
  • As indicated previously, we really have no way of identifying
    the true cluster structure unless we have divine intervention.
  • In the next several slides we present some well-known methods
    for choosing the number of clusters.

18
Method of Mojena
  • Select the number of groups based on the first stage j of the
    dendrogram that satisfies
  • a_(j+1) > abar + k * s_a
  • The a_0, a_1, a_2, ..., a_(n-1) are the fusion levels
    corresponding to stages with n, n-1, ..., 1 clusters; abar and
    s_a are the mean and unbiased standard deviation of these fusion
    levels, and k is a constant.
  • Mojena (1977): 2.75 < k < 3.5
  • Milligan and Cooper (1985): k = 1.25

19
Method of Mojena Applied to Our Data Set - I
  • > x.clfl <- hclust(x.dist)$height    # extract the fusion levels
  • > x.clm <- mean(x.clfl)              # compute the mean
  • > x.cls <- sqrt(var(x.clfl))         # compute the standard deviation
  • > print((x.clfl-x.clm)/x.cls)        # standardized levels, for comparison with k

20
Method of Mojena Applied to Our Data Set - II
  • > print((x.clfl-x.clm)/x.cls)
  •   [1] -0.60697193 -0.58746665 -0.58678547 -0.58049331
  •   [5] -0.57679720 -0.57163306 -0.56496595 -0.56353931
  •   ...
  • [185]  1.21499989  1.28188441  1.48833552  1.60550442
  • [189]  1.64120781  1.83945221  1.91133195  2.25999297
  • [193]  2.51916087  2.63885648  2.99170110  3.39950673
  • [197]  3.98513994  4.92839223  8.13577250

21
Method of Mojena Applied to Our Data Set - III
  • > x.clfl[186]
  • [1] 2.428254
  • > x.clfl[197]
  • [1] 5.893725
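
The cut stages can also be located programmatically (a sketch,
assuming x.clfl, x.clm, and x.cls from the earlier slides):

  > z <- (x.clfl - x.clm)/x.cls
  > min((1:length(z))[z > 1.25])   # first stage exceeding k = 1.25 (here 186)
  > min((1:length(z))[z > 3.5])    # first stage exceeding k = 3.5 (here 197)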

22
Visualizing Our Cluster Structure
  • > x.clmojena <- cutree(hclust(x.dist), h=x.clfl[186])
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.clmojena))

23
More Visualizing Our Cluster Structure
  • > x.clmillcoop <- cutree(hclust(x.dist), h=x.clfl[197])
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.clmillcoop))

24
One Last Time
  • > x.cllastsplit <- cutree(hclust(x.dist), h=x.clfl[199])
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.cllastsplit))

25
To Get One Cluster
  • > plclust(hclust(x.dist))
  • > x.cljust1 <- cutree(hclust(x.dist), h=11.25)
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.cljust1))
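
Cutting above the final merge height leaves everything in one group,
which can be confirmed directly (a sketch):

  > table(x.cljust1)   # all 200 points fall in a single cluster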

26
Hartigan's k-means Clustering
  • kmeans(x, centers, iter.max=10)
  • x: matrix of multivariate data. Each row corresponds to an
    observation, and each column corresponds to a variable. Missing
    values are not accepted.
  • centers: matrix of initial guesses for the cluster centers, or
    an integer giving the number of clusters. If centers is an
    integer, hclust and cutree will be used to get initial values.
    If centers is a matrix, each row represents a cluster center,
    and thus centers must have the same number of columns as x. The
    number of rows in centers (there must be at least two) is the
    number of clusters that will be formed. Missing values are not
    accepted.
  • OPTIONAL ARGUMENTS
  • iter.max: maximum number of iterations.

27
Outputs of the S-PLUS kmeans function
  • An object of class "kmeans" with the following components:
  • cluster: vector of integers, ranging from 1 to nrow(centers),
    with length the same as the number of rows of x. The ith value
    indicates the cluster to which the ith data point belongs.
  • centers: matrix like the input centers, containing the locations
    of the final cluster centers. Each row is a cluster center
    location.
  • withinss: vector of length nrow(centers). The ith value gives
    the within-cluster sum of squares for the ith cluster.
  • size: vector of length nrow(centers). The ith value gives the
    number of data points in cluster i.

28
Hartigan's k-means Theory
  • When deciding on the number of clusters, Hartigan (1975, pp.
    90-91) suggests the following rough rule of thumb. If k is the
    result of kmeans with k groups and kplus1 is the result with k+1
    groups, then it is justifiable to add the extra group when
  • (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x)-k-1)
  • is greater than 10 (see the sketch below).
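
Using the fits constructed on the following slides (a sketch,
assuming x.km2 and x.km3 below), the k = 2 vs. k = 3 comparison
would read:

  > (sum(x.km2$withinss)/sum(x.km3$withinss) - 1) * (nrow(x)-2-1)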

29
kmeans Applied to our Data Set
  • Here we perform kmeans clustering for a sequence of model sizes:
  • > x.km2 <- kmeans(x, 2)
  • > x.km3 <- kmeans(x, 3)
  • > x.km4 <- kmeans(x, 4)
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.km2$cluster))

30
The 3-Cluster kmeans Solution
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.km3$cluster))

31
The 4-Cluster kmeans Solution
  • > plot(x[,1], x[,2], type="n")
  • > text(x[,1], x[,2], labels=as.character(x.km4$cluster))

32
Determination of the Number of Clusters Using the Hartigan Criterion
  • > sum(x.km2$withinss)/((sum(x.km3$withinss)-1)*(200-2-1))
  • [1] 0.006476385
  • > sum(x.km3$withinss)/((sum(x.km4$withinss)-1)*(200-3-1))
  • [1] 0.005889223
  • > x.km1 <- kmeans(x, 1)
  • Error in switch(values$ifault, ..): nrow(centers) < 1 or > nrow(x)
  • Dumped
  • So it seems that in evaluating the k=1 model vs. the k=2 model we
    need to compute the sum of squared deviations from the mean by
    hand (see the sketch below).
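
Note that the parenthesization above differs from Hartigan's rule as
stated on the earlier slide. For the k = 1 vs. k = 2 test, the k = 1
"within" sum of squares is simply the total sum of squared
deviations about the column means; a sketch, assuming the simulated
x and the fit x.km2:

  > wss1 <- sum(scale(x, center=T, scale=F)^2)   # k = 1 within SS by hand
  > (wss1/sum(x.km2$withinss) - 1) * (200-1-1)   # Hartigan statistic, k = 1 vs. 2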

33
Model Based Clustering
  • The idea behind model-based clustering is that the data are
    independent samples from a series of group populations, but the
    group labels have been lost. So if we knew that the vector g
    gave the group labels and that each group had a class-conditional
    pdf f(x | theta), then the likelihood would be given by
  • L(theta, g) = f(x_1 | theta_(g_1)) * f(x_2 | theta_(g_2)) * ... * f(x_n | theta_(g_n))
  • Since the labels are unknown, they are treated as parameters and
    the likelihood above is maximized over (theta, g).
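
A minimal sketch of this classification likelihood in the bivariate
case, assuming spherical unit-variance normal components (loglik is
a hypothetical helper, not part of S-PLUS):

  > loglik <- function(x, g, mu) {
  +   # x: n x 2 data matrix, g: label vector, mu: k x 2 matrix of means
  +   sum(log(dnorm(x[,1], mu[g,1])) + log(dnorm(x[,2], mu[g,2])))
  + }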

34
Model Based Clustering in S-PLUS - Inputs
  • mclust(x, method = "S", signif = rep(0, dim(x)[2]), noise = F,
    scale = rep(1, dim(x)[2]), shape = c(1, rep(0.2, (dim(x)[2]-1))),
    workspace = <<see below>>)
  • method: a character string to select the clustering criterion.
    Possible values are "S", "S*", "spherical" (with varying sizes),
    "sum of squares" or "trace" (Ward's method), "unconstrained",
    "determinant", "centroid", "weighted average link", "group
    average link", "complete link" or "farthest neighbor", "single
    link" or "nearest neighbor". Only enough of the string to
    determine a unique match is required.
  • signif: vector giving the number of significant decimal places
    in each column of x. Nonpositive components are allowed. Used in
    initializing clustering in some methods.
  • noise: indicates whether or not Poisson noise should be assumed.
  • scale: vector for scaling the observations. The ith column of x
    is multiplied by scale[i] before cluster analysis begins.
  • shape: vector determining the shape of clusters for methods "S"
    and "S*".
  • workspace: size of the workspace provided to the underlying
    Fortran program. The default is
    (dim(x)[1]*(dim(x)[1]-1)) + 10*dim(x)[1].

35
Model Based Clustering in S-PLUS - Outputs
  • tree: list with components merge, height, and order conforming
    to the output of the function hclust, but here height is just
    the stage of the merge. This output can be used with several
    functions such as plclust and subtree.
  • lr: list of objects merged at each stage, in which a new cluster
    inherits the number of the lowest-numbered object or cluster
    from which it is formed (used for classification by the function
    mclass).
  • awe: a vector in which the kth element is the approximate weight
    of evidence for k clusters. This component is present only for
    the model-based methods "S", "S*", "spherical" (with varying
    sizes), "sum of squares" or "trace" (Ward's method),
    "unconstrained", and "determinant".
  • call: a copy of the call to mclust.

36
mclust applied to our data set
  • > x.mcl <- mclust(x)
  • > plclust(x.mcl$tree)

37
Assessing the Number of Components in mclust
  • > plot(seq(from=1, to=200), x.mcl$awe, type='b')
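
The number of clusters suggested by the AWE is the location of its
maximum, which can be read off directly (a sketch):

  > (1:length(x.mcl$awe))[x.mcl$awe == max(x.mcl$awe)]   # k maximizing the AWE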

38
A Closer Look at the Beginning of the Curve
  [zoomed view of the AWE curve for small numbers of clusters]
39
Examining the mclust Cluster Structure
  • We can use the same techniques that we used with
    kmeans to examine the cluster structure imposed
    by mclust
  • These pictures are helpful in assessing the
    appropriateness of the cluster structure.