Title: Structuring Interactive Cluster Analysis
Subject: Interactive data mining
Author: Wayne Oldford
Keywords: clustering, k-means, visual empirical regions of ...

Learn more at: http://www.stats.uwaterloo.ca

1
Structuring Interactive Cluster Analysis
  • Wayne Oldford
  • University of Waterloo

2
Structuring Interactive Cluster Analysis
This talk is about interactive cluster analysis,
that is, about interactive tools for finding and
identifying groups in data. But more than
that, it is about stepping back and understanding
the structure of this process so that software
tools can be organized to simplify and aid the
analysis.
  • Wayne Oldford
  • University of Waterloo

3
Overview
The problem of 'cluster analysis' or of 'finding
groups in data' is ill defined. So there can be
no universal solution and any claimed solution
must necessarily solve some other suitably
constrained problem and not the more general
one. What we need instead are highly interactive
tools which allow us to adapt to the
peculiarities of the data and the problem at
hand. These tools are usefully organized and
integrated if we step back and consider the
problem as one of exploratory data analysis,
except that now, in addition to the data itself,
the exploration is to take place as well on the
space of partitions of the data. Existing
algorithms need to be recast, and new ones
developed, in terms of exploring the space of
partitions. The algorithms can then be easily
integrated with other interactive tools so that
jointly they provide a broadly useful and easily
adapted tool-set for finding and identifying
groups in data.
Argument
  • ill-defined problem
  • high-interaction desirable
  • explore partitions
  • recast algorithms

4
Overview
Develop by example
Argument
  • ill-defined problem
  • high-interaction desirable
  • explore partitions
  • recast algorithms
  • problems
  • resources
  • interactive clustering
  • partition moves
  • implications
  • prototype interface

5
Problem
geometric/visual structure

Visual system easily identifies groups
algorithms are often motivated and/or understood
via visual intuition and geometric structure
7
Problem
Consider visually grouping here

Context matters: each point is a document,
located by each word's frequency within the
document.
8
Problem

Two similar documents of different lengths
should be closer; one of these has more text
than the other.
9
Problem

Is green closer to orange than to red?
Should distance be measured by angle?
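The question of measuring distance by angle can be made concrete with a small sketch. It assumes each document is a vector of word counts; the function names and the three example vectors are hypothetical and are not taken from the talk.

```python
import math

def angular_distance(u, v):
    """Angle (in radians) between two word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

def euclidean_distance(u, v):
    """Ordinary straight-line distance between the same vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical word counts for three documents.
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]    # same proportions as short_doc, ten times the text
other_doc = [0, 1, 2]     # different vocabulary mix, similar length
```

Under Euclidean distance the short document sits nearer to other_doc than to long_doc; under the angle it is essentially on top of long_doc, which matches the intuition that document length should not separate similar documents.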
10
Problem
structure in context
segmentation in MRI
groups are spatially contiguous in the plane
of the image and nearby in the intensity.
shape is not defined a priori
11
Problem
context specific structure
aneurysm presents as intensity in blood
vessels
groups are spatially contiguous tubes of
similar intensity
shape is restricted a priori to be 3-d tubes
12
Problem
some specific, some not
same slice, five different measurements at
each location
spatial grouping as before, additional
grouping possible across measurements
13
Problem
some specific, some not
4 dimensional data from connected images
2d spatial with clear biological grouping,
connected to
2d intensity measures with abstract
structure/grouping
14
Problem
  • Find groups in data
  • Similar objects are together
  • Groups are separated
  • Problem is ill defined
  • What do you mean by similar?
  • E.g. what is contiguous structure?
  • When are groups separate?
  • Can we believe it?

15
Computational resources
  • 1. Processing
  • 2. Memory
  • 3. Display
16
Computational resources (and response)
  • 1. Processing
    • Gflops, Tflops, multiple processors
    • computationally intensive methods
    • problem constrained and optimized
  • 2. Memory
  • 3. Display
17
Computational resources (and response)
  • 1. Processing
  • 2. Memory
    • GBs, TBs, disk and RAM
    • try to analyze huge data-sets
    • data-sets larger than necessary?
  • 3. Display
18
Computational resources (and response)
  • 1. Processing
  • 2. Memory
  • 3. Display
    • high resolution, large
    • graphics processors, digital video
    • more data, more visual detail

19
Computational resources
  • 1. Processing
  • 2. Memory
  • 3. Display

Exploit no one resource exclusively. Balance and
integrate.
20
High interaction (much overlooked by researchers)
  • assume multiple displays
  • integrate computational resources
  • challenge is to design software to be simple,
    understandable, integrated and extensible

21
Example: image analysis. Find groups via
intensity (contours and two small unusual
structures revealed).
22
Example: image analysis. Other measurements may
contain interesting structure.
23
Example: image analysis. Identify the new
structure's location in the original image.
24
Example: image analysis. Mark new groups by
colour (hue, preserving lightness in original
image).
25
Example: image analysis. Explore the relation
between old and new groups via contours in the
image itself.
26
Example: 8 dimensions from teeth
measurements on species (and sex)
27
Example: apes, hominids, modern humans
  • multiple and very different views
  • 3-d point clouds (of first 3 discriminant
    co-ordinates)
  • cases identified in a list
  • each point represented as a smooth curve by
    projecting it on a direction vector smoothly
    moving around the surface of an 8-d sphere
  • all linked via colour by cases being displayed
  • context helps
  • knowing the species encourages grouping
  • grouping based on context and the visual
    information
  • grouping is confirmed across different kinds of
    display

28
Example: mutual support and shapes
a 3-d projection
Shape from all dimensions
How many groups?
29
Example: mutual support and shapes
Groups found here
Same in all dimensions?
How many groups?
30
Example: mutual support and shapes
Observe effect here
Split black group by shape
How many groups?
31
Example: mutual support and shapes
Get new 3-d projection
Coloured by shape
Five groups corroborated
32
Example: exploratory data analysis
How many groups?
33
Example: exploratory data analysis
Choose data to cut away
Explore the rest
Distinguish groups
34
Example: exploratory data analysis
Bring data back
Explore all together
Some black with red?
Focus on centre
35
Example: exploratory data analysis
Explore separately
Mark group
Discard new view
Explore all together
Two groups
36
Interactive clustering
  • visual grouping
  • location, motion, shape, texture, ...
  • linking across displays
  • manual
  • selection
  • cases, variates, groups, ...
  • colouring
  • focus
  • immediate and incremental
  • context can be used to form groups
  • multiple partitions

37
Automated clustering: typical software
  • resources dedicated to numerical computation
  • teletype interaction
  • runs to completion
  • graphical output
  • doesn't always work so well (no universal solution)
  • confirm via exploratory data analysis

Must be integrated with interactive methods
38
Example: K-means clustering
K = 2 groups
Starting groups as shown have centre ball in one
group
K-means moves one point at a time to improve 2
groups
39
Example: K-means clustering
K = 2 groups
Final groups shown maximize F-like statistic
(between/within)
Central ball is lost
K-means poor for this data configuration
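The one-point-at-a-time move described above can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes the F-like statistic is the ratio of between-group to within-group sums of squares, and the function names and the toy data are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def f_like(points, labels):
    """Between-group over within-group sum of squares (assumed form of the statistic)."""
    overall = centroid(points)
    between = within = 0.0
    for g in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == g]
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return between / within if within > 0 else float("inf")

def reassign(points, labels):
    """Move one point at a time to another group whenever that raises the statistic."""
    labels = list(labels)
    groups = sorted(set(labels))
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            for g in groups:
                if g == labels[i] or labels.count(labels[i]) == 1:
                    continue  # skip the point's own group; never empty a group
                trial = labels[:i] + [g] + labels[i + 1:]
                if f_like(points, trial) > f_like(points, labels):
                    labels, improved = trial, True
    return labels

# Toy data: two tight clumps, with one point starting in the wrong group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [0, 0, 0, 0, 1, 1]
final = reassign(points, start)
```

On well-separated clumps like these the move repairs the starting partition. The statistic rewards compact, well-separated groups, which is why a ball nested inside a sphere defeats it.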
40
Example: VERI (Visual Empirical Regions of
Influence)

Join points if no third point falls in this
region.
42
Visual Empirical Regions of Influence
  • psychophysical experiments of human visual
    perception to join data points
  • very special circumstances (two lines of three
    equi-spaced points each)
  • works well on demonstration 2-d cases
  • extends to higher dimensions
  • two points are joined or not depending on their
    joint configuration with a third point
  • each third point examined forms a plane with the
    candidate pair and so VERI shape applies
  • works in high-d with published demonstration cases

43
Example: VERI
Each colour is a different group found by VERI.
Central ball is lost.
VERI fails for this data configuration (also for
small perturbations of demonstration cases).
There is no universal method, nor can there be.
44
Example: VERI (with parameters)
VERI algorithm, but parameterized now to shrink
region size. Becomes minimal spanning tree in the
limit (MST gets 2 groups here).
Again, no universal method is possible, but methods
can be parameterized.
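The limiting case mentioned above, grouping by cutting a minimal spanning tree, can be sketched as below. It is an illustration only: it uses Prim's algorithm and cuts the longest tree edge, and the function names and the ball-in-ring data are hypothetical stand-ins for the data in the talk.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the (length, i, j) edges of the minimal spanning tree."""
    # best[j] = (distance from j to the tree so far, tree vertex achieving it)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        length, i = best.pop(j)
        edges.append((length, i, j))
        for k in best:
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def split_mst(points, n_groups=2):
    """Cut the (n_groups - 1) longest tree edges and label the connected pieces."""
    edges = sorted(mst_edges(points))
    kept = edges[: len(edges) - (n_groups - 1)]
    labels = list(range(len(points)))      # each point starts in its own group
    for _, i, j in kept:                   # merge groups along every kept edge
        old, new = labels[j], labels[i]
        labels = [new if lab == old else lab for lab in labels]
    return labels

# Hypothetical 2-d analogue of the example: a small ball inside a ring.
ball = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3)]
ring = [(5 * math.cos(math.radians(a)), 5 * math.sin(math.radians(a)))
        for a in range(0, 360, 30)]
labels = split_mst(ball + ring)
```

Because neighbouring ring points are closer to each other than the ball is to the ring, the longest tree edge is the one joining ball to ring, and cutting it recovers the two groups.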
45
Integrating automatic methods
  • Move about the space of partitions
  • Pa --> Pb --> Pc --> ...

Which operators f, where f(Pa) --> Pb,
are of interest?
46
Refine
Need not be nested. Nesting produces hierarchy
Reduce
47
Reassign
48
Refinement sequence
  • 1

Begin with partition containing all points in one
group.
49
Refinement sequence
  • 1 -> 2

Refine partition to move to a new partition
containing two groups.
This refinement was obtained by projecting all
points onto the eigenvector of the largest eigenvalue
of the sample variance-covariance matrix and
splitting at the largest gap between projected
points.
Blue points are on the outer sphere.
50
Refinement sequence
  • 1 -> 2 -> 3

Refine partition (2) to move to a new partition
containing three groups.
  • Refinement move
  • select group whose sample var-cov matrix has
    largest eigenvalue
  • for that group, project and split as before.

Green points are also on the outer sphere.
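The refinement move just described (pick the group with the largest leading eigenvalue, project onto the corresponding eigenvector, split at the largest gap) can be sketched in plain Python. The sketch makes assumptions: power iteration stands in for a proper eigen-solver, labels are integers, and all function names and the toy data are hypothetical.

```python
def cov_matrix(points):
    """Sample variance-covariance matrix (needs at least two points)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
             for b in range(d)] for a in range(d)]

def top_eigen(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector by power iteration (a simple stand-in)."""
    d = len(matrix)
    v = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero starting vector
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        value = sum(x * x for x in w) ** 0.5
        if value == 0.0:
            break
        v = [x / value for x in w]
    return value, v

def refine(points, labels):
    """Split the group with the largest leading eigenvalue at its largest projected gap."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    best = None
    for idx in groups.values():
        if len(idx) < 2:
            continue                       # a single point cannot be split
        value, vec = top_eigen(cov_matrix([points[i] for i in idx]))
        if best is None or value > best[0]:
            best = (value, vec, idx)
    if best is None:
        return list(labels)
    _, vec, idx = best
    proj = sorted((sum(a * b for a, b in zip(points[i], vec)), i) for i in idx)
    gaps = [proj[k + 1][0] - proj[k][0] for k in range(len(proj) - 1)]
    cut = gaps.index(max(gaps))
    out = list(labels)
    for _, i in proj[cut + 1:]:
        out[i] = max(labels) + 1           # new integer label for the split-off part
    return out

points = [(0, 0), (0.5, 0.2), (1, 0), (10, 0), (10.5, 0.1), (11, 0)]
two = refine(points, [0] * len(points))
three = refine(points, two)
```

Each call adds exactly one group, so repeated calls trace a nested sequence of partitions. Nothing prevents the split-off part from being a single point, which is the behaviour the later refinement steps in the talk run into.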
51
Refinement sequence
  • 1 -> 2 -> 3 -> 4

Refine partition (3) to move to a new partition
containing four groups.
Refinement move as before, again splits red group.
New group contains a single (magenta) point on
the outer sphere (middle right, up).
Exploration of the data shows this to be a very
poor partition with that single isolated point.
52
Refinement sequence
  • 1 -> 2 -> 3 -> 4 -> 5

Refine partition (4) to move to a new partition
containing five groups.
Refinement move as before, again splits red group.
New group contains a single (black) point on the
outer sphere (bottom left).
Again a poor partition; no further refinement
step is taken at this point.
53
Reassign, reduce sequence
  • 5 -> 5

A reassign move from one partition of five to
another.
Reassignment move: k-means maximizing an F
statistic.
Seems a better partition than before; explore to
confirm.
54
Explore present partition
  • 5

Reassignment seems to have isolated central red
ball.
Remaining groups distributed around a spherical
surface.
Consider reduction moves from this partition to
nearby partitions with fewer groups.
55
Partition to be reduced
  • 5

Same partition - back in the original position to
make subsequent reduction moves visually
comparable with previous refinement and
reassignment moves.
Choice of reduction move can be based on what we
have learned from exploring this partition.
56
Reduce sequence
  • 5 -> 4

Reduce partition (5) to move to a new partition
containing four groups.
Reduction move: single linkage between
groups, i.e. join the closest two groups as measured
by Euclidean distance between nearest points in
each group.
Seems a reasonable choice given the structure observed
in previous exploration.
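A single-linkage reduction move of this kind can be sketched as below. The function name and the toy data are hypothetical; the partition is assumed to be stored as one label per point.

```python
import math
from itertools import combinations

def reduce_single_linkage(points, labels):
    """Join the two groups whose nearest points are closest (Euclidean distance)."""
    groups = sorted(set(labels))
    if len(groups) < 2:
        return list(labels)                # nothing left to join

    def gap(g, h):
        # Smallest distance between any point of group g and any point of group h.
        return min(math.dist(p, q)
                   for p, lp in zip(points, labels) if lp == g
                   for q, lq in zip(points, labels) if lq == h)

    g, h = min(combinations(groups, 2), key=lambda pair: gap(*pair))
    return [g if lab == h else lab for lab in labels]

points = [(0, 0), (1, 0), (2.5, 0), (10, 0)]
three_groups = [0, 0, 1, 2]
two_groups = reduce_single_linkage(points, three_groups)
one_group = reduce_single_linkage(points, two_groups)
```

Each call lowers the number of groups by one, so repeated calls give a sequence such as 5 -> 4 -> 3 -> 2.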
57
Reduce sequence
  • 5 -> 4 -> 3

Reduce partition (4) to move to a new partition
containing three groups.
Reduction move: as before.
Red ball remains.
Exploration suggests one more reduction move.
58
Reduce sequence
  • 5 -> 4 -> 3 -> 2

Reduce partition (3) to move to a new partition
containing two groups.
Reduction move: as before.
This partition seems best.
Interactive exploration important to choose type
and details of potentially interesting moves from
one partition to another.
59
Moves (generic functions)
examples
  • refine (Pold) --> Pnew
    break minimal spanning tree
  • reduce (Pold) --> Pnew
    join near centres
  • reassign (Pold) --> Pnew
    k-means maximize F
  • partition (graphic) --> Pnew
    colours from point cloud
60
Challenges
  • varying focus
  • subsets (selected manually and at random)
  • merging new data into partition
  • exploring multiple partitions
  • interactive display and comparison
  • resolving many to one
  • interface design
  • control panels, options
  • interaction

61
A prototype interface
  • cluster analysis hub
  • an analysis hub (Oldford, 1997) created on
    demand for partition
  • having all points in one group for named
    data-set, or
  • as defined by colours of all points in topmost
    plot, or
  • as defined by colours of selected points in
    topmost plot
  • new hub can always be created for any subset
  • maintains list of saved partitions
  • offers moves from current partition via one of
  • reduce, refine, or reassign
  • manually from current colours (so as to capture
    interactive modification of existing partition)
  • Other operations on one or more partitions (e.g.
    cluster plot, dendrogram, ...)

62
Interface illustration details of moves
  • Each move - refine, reduce, reassign - is an
    entire collection of possible moves, each with
    many possible choices.
  • The next few slides illustrate the prototype
    implementation where
  • Buttons for refine, reduce, and reassign are
    given at the topmost level.
  • Once selected, each button pops up its own
    control panel where various different kinds of
    moves and parameter choices can be made. E.g.
    the analyst might choose to reduce by any of
  • Join groups with closest centres using Euclidean
    distance
  • Join groups whose farthest points are closest
    (i.e. complete linkage)
  • Choose group with greatest spread and disperse
    its points among the remaining groups.
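The last reduce option listed above, dispersing the group with greatest spread, might look like the following sketch. Two details are assumptions rather than anything stated in the talk: spread is taken to be the mean squared distance to the group centre, and each dispersed point goes to the remaining group with the nearest centre. All names are hypothetical.

```python
def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(pts) for c in zip(*pts)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def reduce_by_dispersal(points, labels):
    """Dissolve the most spread-out group into the remaining groups."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    if len(groups) < 2:
        return list(labels)
    centres = {lab: centroid(pts) for lab, pts in groups.items()}
    # Assumed spread measure: mean squared distance to the group centre.
    spread = {lab: sum(sq_dist(p, centres[lab]) for p in pts) / len(pts)
              for lab, pts in groups.items()}
    target = max(spread, key=spread.get)
    others = [lab for lab in groups if lab != target]
    # Assumed rule: each dispersed point joins the group with the nearest centre.
    return [min(others, key=lambda g: sq_dist(p, centres[g])) if lab == target else lab
            for p, lab in zip(points, labels)]

points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 5), (10, 5)]
before = [0, 0, 1, 1, 2, 2]
after = reduce_by_dispersal(points, before)
```

Unlike the joining moves, this one generally produces a partition that is not nested in the one it started from.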

63
Interface - reduce
64
Interface - refine
65
Interface - reassign
66
Interface illustration example of use
  • The next few slides illustrate the prototype
    implementation applied to a ball in a sphere
    data-set (a different one from before).
  • Moves are made about the partition space (refines
    and reassign)
  • Partitions are saved (can be named, deleted,
    revisited, etc.)
  • Nested partitions compared via a dendrogram
  • Non-nested partition compared with nested ones
  • N.B. at any time, the analyst could have
    interacted with any graphic
  • to create a new partition by colouring - using
    manual button
  • focus on a subset to examine via a new cluster
    analysis hub and subsequently incorporate that
    into the partition of the whole data-set.

67
Interaction
Start with partition having all points in a
single group.
Selecting refine pops up the refinement panel.
Choose refinement details.
  • Refinement move
  • Choose group with var-cov having largest eigen
    value.
  • Project these points onto corresponding
    eigen-vector.
  • Split this group where the projected gap is
    largest.

68
Interaction
New partition appears as Refine Dataset in
panel at left.
Refinement details unchanged.
Refine produces new partition having two groups
as shown by different colours in all graphics.
69
name and save partition
Saved partition list.
New partition is named and saved.
Refinement details unchanged.
New partition has three groups.
70
prototype - refine to 4
Refinement details unchanged.
New partition has four groups.
71
prototype - refine to 5
Refinement details unchanged.
No further refinement pursued beyond this one.
New partition has five groups. The fifth group
contains a single point (blue, top right).
72
Select nested partitions and view dendrogram
1. Select nested partitions
2. Dendrogram button.
3. Dendrogram
  • Dendrogram shows 5 nested partitions
  • Each block is a group; a horizontal cut at each
    vertical level is a partition.
  • Size and colour proportions vary with number of
    points.
  • Colouring is as displayed in point cloud (here
    showing the current partition) .

73
Reassign, dendrogram updated
New partition appears as Reassign Dataset in
panel at left.
  • Reassign move to new partition.
  • Details
  • k-means
  • max F statistic
  • Colours update in all graphics including the
    dendrogram
  • Reassignment partition can be explored as usual.
  • This partition can be visually compared with
    previous partitions via the updated colours in
    the dendrogram.

74
Cluster plot and dendrogram interaction (movie)
Cluster plot button operates on selected partition
  • Cluster plot
  • groups as boxes
  • close groups are visually close (via
    multi-dimensional scaling)

Nested and non-nested partitions can be visually
compared simultaneously through interaction.
75
Other operators
  • dissimilarity (Pi, Pj) --> d_ij
  • display (P1, ..., Pm)
  • dendrogram if P1 < ... < Pm
  • mds plot of all clusters in P1, ..., Pm
  • mds plot of all partitions P1, ..., Pm
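Two of these operators can be sketched with partitions stored as label lists. The choices here are assumptions: the talk does not say which dissimilarity is used, so one minus the Rand index serves as a stand-in, and "<" is read as "is a refinement of". Function names are hypothetical.

```python
from itertools import combinations

def is_refinement(fine, coarse):
    """True if every group of `fine` lies inside a single group of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        if parent.setdefault(f, c) != c:
            return False
    return True

def is_nested(partitions):
    """True if each partition refines the next, so a dendrogram can be drawn."""
    return all(is_refinement(a, b) for a, b in zip(partitions, partitions[1:]))

def rand_dissimilarity(pa, pb):
    """Fraction of point pairs on which two partitions disagree (1 - Rand index)."""
    pairs = list(combinations(range(len(pa)), 2))
    disagree = sum((pa[i] == pa[j]) != (pb[i] == pb[j]) for i, j in pairs)
    return disagree / len(pairs)

p1 = [0, 1, 2, 3]   # every point alone
p2 = [0, 0, 1, 1]   # two groups
p3 = [0, 0, 0, 0]   # all points together
p4 = [0, 1, 0, 1]   # two groups, not nested with p2
```

A matrix of such pairwise dissimilarities is the kind of input a multi-dimensional scaling plot of partitions would need.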

76
Creation
  • partition (Data ...) --> Pnew
  • manually from colours
  • k-means, random start, mst, veri, etc
  • from existing classifier.
  • partition-path (Data ...) --> P1, P2, ..., Pn
  • partition-path (Pold ...)
    --> Pold, P1, P2, ..., Pn
  • e.g. nested sequence from hierarchical clustering

77
Composition
  • resolve (P1, ..., Pm) --> Pnew
  • combine different partitions of the same data
  • merge (Data, Pold) --> Pnew
  • classify additional points
  • merge (Pa, Pb) --> Pnew
  • combine non-overlapping partitions
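The merge of additional points into an existing partition could be sketched as follows. The classification rule is an assumption, since the talk does not specify one: each new point joins the group with the nearest centre. The function signature and data are hypothetical.

```python
def merge(new_points, points, labels):
    """Classify additional points into an existing partition (nearest-centre rule, assumed)."""
    centres = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centres[lab] = [sum(c) / len(members) for c in zip(*members)]

    def nearest(p):
        # Label of the group whose centre is closest to p.
        return min(centres,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(p, centres[lab])))

    all_points = list(points) + list(new_points)
    all_labels = list(labels) + [nearest(p) for p in new_points]
    return all_points, all_labels

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = ["red", "red", "blue", "blue"]
merged_points, merged_labels = merge([(1, 1), (9, 1)], points, labels)
```

Any classifier built from the old partition could replace the nearest-centre rule; the point is only that the move takes data plus a partition and returns a new partition.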

78
Implications
  • Algorithms (re)cast in terms of moves
  • refine, reduce
  • reassign
  • partition, partition-path
  • easily understandable (e.g. geometric structures)
  • specify required data structures
  • e.g. ms tree, triangulation, var-cov matrix, ...

79
New problems
  • interface design
  • multiple partitions
  • comparison and/or resolution
  • multiple display
  • inference

80
Summary
  • Cluster analysis is naturally exploratory and
    needs integration with modern interactive data
    analysis.
  • Enlarging the problem to partitions
  • simplifies and gives structure
  • encourages exploratory approach
  • integrates naturally
  • introduces new possibilities (analysis and
    research)

81
Related references
  • Interactive clustering: CASI talk, Oldford (2001)
  • Quail: overview (Interface 1998), graphics
    (Hurley and Oldford, ISI 1999) and code.
  • Design principles: Oldford (Interface 1999)
  • Analysis hubs: Oldford (Interface 1997)

82
Acknowledgements
  • Catherine Hurley, Erin McLeish, Rayan Yahfoufi,
    Natasha Wiebe
  • U(W) students in statistical computing
  • Quail: Quantitative Analysis in Lisp
  • http://www.stats.uwaterloo.ca/Quail