Structuring Interactive Cluster Analysis

- Wayne Oldford
- University of Waterloo

This talk is about interactive cluster analysis, that is, about
interactive tools for finding and identifying groups in data. But more
than that, it's about stepping back and understanding the structure of
this process so that software tools can be organized to simplify and to
aid the analysis.

Overview

The problem of 'cluster analysis', or of 'finding groups in data', is
ill defined. So there can be no universal solution, and any claimed
solution must necessarily solve some other suitably constrained problem
and not the more general one. What we need instead are highly
interactive tools which allow us to adapt to the peculiarities of the
data and the problem at hand. These tools are usefully organized and
integrated if we step back and consider the problem as one of
exploratory data analysis, except that now, in addition to the data
itself, the exploration is to take place as well on the space of
partitions of the data. Existing algorithms need to be recast, and new
ones developed, in terms of exploring the space of partitions. The
algorithms can then be easily integrated with other interactive tools so
that jointly they provide a broadly useful and easily adapted tool-set
for finding and identifying groups in data.

Argument

- ill-defined problem
- high-interaction desirable
- explore partitions
- recast algorithms

Overview

Develop by example

- problems
- resources
- interactive clustering
- partition moves
- implications
- prototype interface

Problem

geometric/visual structure

Visual system easily identifies groups

algorithms are often motivated and/or understood

via visual intuition and geometric structure

Problem

Consider visually grouping here.

Context matters: each point is a document, located by each word's
frequency within the document.

Problem

Two similar documents of different lengths should be closer. One of
these has more text than the other.

Problem

green closer to orange than to red?

distance measured by angle?
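
The question of measuring distance by angle can be made concrete with a small sketch. The word counts below are hypothetical (they are not the data in the slide), and the function names are illustrative only. The point is that a document and a longer copy of itself are far apart in Euclidean distance but essentially at angle zero.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two word-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angular_distance(u, v):
    """Angle in radians between two non-zero word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

# Hypothetical counts: long_doc is short_doc with five times the text.
short_doc = [2, 1, 0]
long_doc = [10, 5, 0]
other_doc = [0, 1, 2]

# Euclidean distance puts short_doc nearer to the unrelated document,
# while the angle puts it nearer to its longer counterpart.
print(euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
print(angular_distance(short_doc, long_doc), angular_distance(short_doc, other_doc))
```

Which of the two measures is appropriate depends on the context, which is the point the slide is making.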

Problem

structure in context

segmentation in MRI

groups are spatially contiguous in the plane of the image and nearby in
intensity

shape is not defined a priori

Problem

context-specific structure

aneurysm presents as intensity in blood vessels

groups are spatially contiguous tubes of similar intensity

shape is restricted a priori to be 3-d tubes

Problem

some specific, some not

same slice, five different measurements at each location

spatial grouping as before, additional grouping possible across
measurements

Problem

some specific, some not

4-dimensional data from connected images

2-d spatial with clear biological grouping, connected to 2-d intensity
measures with abstract structure/grouping

Problem

- Find groups in data
- Similar objects are together
- Groups are separated

- Problem is ill defined

- What do you mean similar?

- E.g. what is contiguous structure?

- When are groups separate?

- Can we believe it?

Computational resources (and response)

1. Processing
  - Gflops, Tflops, multiple processors
  - computationally intensive methods
  - problem constrained and optimized

2. Memory
  - GBs, TBs, disk and RAM
  - try to analyze huge data-sets
  - data-sets larger than necessary?

3. Display
  - high resolution, large
  - graphics processors, digital video
  - more data, more visual detail

Exploit no one resource exclusively. Balance and integrate.

High interaction (much overlooked by researchers)

- assume multiple displays

- integrate computational resources

- challenge is to design software to be simple, understandable,
  integrated and extensible

Example: image analysis. Find groups via intensity (contours and two
small unusual structures revealed).

Example: image analysis. Other measurements may contain interesting
structure.

Example: image analysis. Identify the new structure's location in the
original image.

Example: image analysis. Mark new groups by colour (hue, preserving
lightness in original image).

Example: image analysis. Explore relation between old and new groups via
contours in the image itself.

Example: 8 dimensions from teeth measurements on species (sex)

Example: apes, hominids, modern humans

- multiple and very different views
  - 3-d point clouds (of first 3 discriminant co-ordinates)
  - cases identified in a list
  - each point represented as a smooth curve by projecting it on a
    direction vector smoothly moving around the surface of an 8-d sphere
  - all linked via colour by cases being displayed
- context helps
  - knowing the species encourages grouping
  - grouping based on context and the visual information
- grouping is confirmed across different kinds of display

Example: mutual support and shapes

a 3-d projection

Shape from all dimensions

How many groups?

Example: mutual support and shapes

Groups found here

Same in all dimensions?

How many groups?

Example: mutual support and shapes

Observe effect here

Split black group by shape

How many groups?

Example: mutual support and shapes

Get new 3-d projection

Coloured by shape

Five groups corroborated

Example: exploratory data analysis

How many groups?

Example: exploratory data analysis

Choose data to cut away

Explore the rest

Distinguish groups

Example: exploratory data analysis

Bring data back

Explore all together

Some black with red?

Focus on centre

Example: exploratory data analysis

Explore separately

Mark group

Discard new view

Explore all together

Two groups

Interactive clustering

- visual grouping
- location, motion, shape, texture, ...
- linking across displays
- manual
- selection
- cases, variates, groups, ...
- colouring
- focus
- immediate and incremental
- context can be used to form groups
- multiple partitions

Automated clustering: typical software

- resources dedicated to numerical computation
- teletype interaction
- runs to completion
- graphical output
- doesn't always work so well (no universal solution)
- confirm via exploratory data analysis

Must be integrated with interactive methods

Example: K-means clustering

K = 2 groups

Starting groups as shown have centre ball in one group

K-means moves one point at a time to improve 2 groups

Example: K-means clustering

K = 2 groups

Final groups shown maximize F-like statistic (between/within)

Central ball is lost

K-means poor for this data configuration
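
Why the central ball is lost can be seen from the criterion itself. The sketch below assumes the F-like statistic is the usual ratio of between-group to within-group mean squares; the slide only says "between/within", so the exact scaling is an assumption. The data are a hypothetical 2-d ring around a small ball, not the data-set in the slides.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def f_like(points, labels):
    """Between-group over within-group mean square (assumed form of the statistic)."""
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in groups.values())
    within = sum(sq_dist(p, centroid(m)) for m in groups.values() for p in m)
    k, n = len(groups), len(points)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical data: 12 points on a ring of radius 5 around a 4-point ball.
ring = [(5 * math.cos(2 * math.pi * i / 12), 5 * math.sin(2 * math.pi * i / 12))
        for i in range(12)]
ball = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
points = ring + ball

by_shape = ["ring"] * 12 + ["ball"] * 4                      # the visually obvious grouping
by_side = ["left" if x < 0 else "right" for x, _ in points]  # cuts straight through the ball

print(f_like(points, by_shape), f_like(points, by_side))
```

Because the ball and the ring share a centre, the visually obvious partition has essentially no between-group variation, so any method that maximizes this ratio prefers to cut the configuration in half.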

Example: VERI (Visual Empirical Regions of Influence)

join points if no third point falls in this region

Visual Empirical Regions of Influence

- psychophysical experiments of human visual perception to join data
  points
  - very special circumstances (two lines of three equi-spaced points
    each)
  - works well on demonstration 2-d cases
- extends to higher dimensions
  - two points are joined or not depending on their joint configuration
    with a third point
  - each third point examined forms a plane with the candidate pair and
    so VERI shape applies
  - works in high-d with published demonstration cases
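
The rule "join two points if no third point falls in their region" can be sketched generically. The empirical VERI region itself is not reproduced here. As a stand-in, the sketch uses the Gabriel-graph region (the disc whose diameter is the candidate pair), which is a different and much simpler region of influence. All function names are illustrative.

```python
def in_disc(p, q, r):
    """True if r lies strictly inside the disc (ball) with diameter pq.

    Stand-in region only: this is the Gabriel-graph rule, not the VERI shape.
    r is inside exactly when the angle p-r-q is obtuse, i.e. (p-r).(q-r) < 0.
    """
    return sum((a - c) * (b - c) for a, b, c in zip(p, q, r)) < 0

def region_graph(points, in_region=in_disc):
    """Join i and j when no third point falls in the region defined by the pair."""
    n = len(points)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if not any(in_region(points[i], points[j], points[k])
                   for k in range(n) if k not in (i, j))
    ]

# Three hypothetical points: the third sits between the first two,
# so the first two are not joined directly.
edges = region_graph([(0, 0), (2, 0), (1, 0.1)])
print(edges)  # [(0, 2), (1, 2)]
```

The test only ever involves the candidate pair and one third point, so it works unchanged in any dimension, in the same spirit as the planar argument above. Groups would then be the connected components of the resulting graph.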

Example: VERI

Each colour is a different group found by VERI.

Central ball is lost.

VERI fails for this data configuration (also for small perturbations of
demonstration cases).

There is no universal method, nor can there be.

Example: VERI (with parameters)

VERI algorithm, but parameterized now to shrink region size. Becomes
minimal spanning tree in the limit (MST gets 2 groups here).

Again, no universal method is possible, but methods can be
parameterized.
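
A minimal sketch of getting groups from a minimal spanning tree follows, assuming the groups are what remains connected after the longest tree edges are removed. The data are a hypothetical 2-d ring around a small ball, not the data-set in the slides, and the function names are illustrative.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the tree edges as (length, i, j) tuples."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n) if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges

def mst_groups(points, k):
    """Drop the k-1 longest MST edges and label the connected pieces."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    labels = list(range(len(points)))
    for _, i, j in kept:
        old, new = labels[j], labels[i]
        labels = [new if g == old else g for g in labels]
    return labels

# Hypothetical data: 24 points on a ring of radius 5 around a 4-point ball.
ring = [(5 * math.cos(2 * math.pi * i / 24), 5 * math.sin(2 * math.pi * i / 24))
        for i in range(24)]
ball = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
points = ring + ball

labels = mst_groups(points, k=2)
print(set(labels[:24]), set(labels[24:]))
```

Here the single long edge joining ball to ring is the one removed, so the two groups are recovered. The same rule would fail on data where noise points bridge the gap, which is again the reason for keeping such methods inside an interactive setting.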

Integrating automatic methods

- Move about the space of partitions
- Pa --> Pb --> Pc --> ...

Which operators f, with f(Pa) --> Pb, are of interest?

Refine

Need not be nested. Nesting produces hierarchy

Reduce

Reassign

Refinement sequence

- 1

Begin with partition containing all points in one

group.

Refinement sequence

- 1 -> 2

Refine partition to move to a new partition containing two groups.

This refinement was obtained by projecting all points onto the
eigenvector belonging to the largest eigenvalue of the sample
variance-covariance matrix and splitting at the largest gap between
projected points.

Blue points are on the outer sphere.
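
The refinement move just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the leading eigenvector is found by plain power iteration (an implementation choice made here, which assumes the starting vector is not orthogonal to it and that the data are not all identical), and the points are hypothetical.

```python
def covariance(points):
    """Sample variance-covariance matrix as a list of rows."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the eigenvector of the largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start (assumption)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def split_at_largest_gap(points):
    """Project on the leading eigenvector; split where the projected gap is largest."""
    direction = leading_eigenvector(covariance(points))
    scores = [sum(a * b for a, b in zip(p, direction)) for p in points]
    order = sorted(range(len(points)), key=scores.__getitem__)
    gaps = [scores[order[i + 1]] - scores[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return sorted(order[:cut]), sorted(order[cut:])

# Hypothetical points: two clumps separated along the first co-ordinate.
points = [(0, 0), (1, 0.1), (2, -0.1), (10, 0), (11, 0.1), (12, -0.1)]
low, high = split_at_largest_gap(points)
print(low, high)
```

The function returns the case indices on either side of the gap, so a single isolated point far along the projection will be split off on its own.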

Refinement sequence

- 1 -> 2 -> 3

Refine partition (2) to move to a new partition containing three groups.

- Refinement move
  - select group whose sample var-cov matrix has largest eigenvalue
  - for that group, project and split as before.

Green points are also on the outer sphere.

Refinement sequence

- 1 -> 2 -> 3 -> 4

Refine partition (3) to move to a new partition containing four groups.

Refinement move as before, again splits red group.

New group contains a single (magenta) point on the outer sphere (middle
right, up).

Exploration of the data shows this to be a very poor partition with that
single isolated point.

Refinement sequence

- 1 -> 2 -> 3 -> 4 -> 5

Refine partition (4) to move to a new partition containing five groups.

Refinement move as before, again splits red group.

New group contains a single (black) point on the outer sphere (bottom
left).

Again a poor partition; no further refinement step taken at this point.

Reassign, reduce sequence

- 5 -> 5

A reassign move from one partition of five to another.

Reassignment move: k-means maximizing an F statistic.

Seems a better partition than before; explore to confirm.
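
A reassign move keeps the number of groups fixed and only changes membership. The sketch below is a simplified k-means-style variant, not necessarily the one used in the prototype: points are visited one at a time and moved to the group with the nearest centre. With the number of groups and the total sum of squares fixed, lowering the within-group sum of squares raises a between/within ratio. The data and labels are hypothetical.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def reassign(points, labels):
    """Move points one at a time to the nearest group centre until nothing changes."""
    labels = list(labels)
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            # Centres are recomputed after every move (one point at a time).
            centres = {
                g: centroid([q for q, h in zip(points, labels) if h == g])
                for g in set(labels)
            }
            # Ties are broken in favour of staying in the current group.
            best = min(centres, key=lambda g: (sq_dist(p, centres[g]), g != labels[i]))
            if best != labels[i]:
                labels[i] = best
                changed = True
    return labels

# Hypothetical start: the fourth point begins in the wrong group.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
new_labels = reassign(points, [0, 0, 0, 0, 1, 1])
print(new_labels)  # [0, 0, 0, 1, 1, 1]
```

A point that is alone in its group sits exactly on its own centre, so it never moves and the number of groups is preserved.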

Explore present partition

- 5

Reassignment seems to have isolated central red

ball.

Remaining groups distributed around a spherical

surface.

Consider reduction moves from this partition to

nearby partitions with fewer groups.

Partition to be reduced

- 5

Same partition - back in the original position to

make subsequent reduction moves visually

comparable with previous refinement and

reassignment moves.

Choice of reduction move can be based on what we

have learned from exploring this partition.

Reduce sequence

- 5 -> 4

Reduce partition (5) to move to a new partition containing four groups.

Reduction move: single linkage between groups, i.e. join the closest two
groups as measured by Euclidean distance between nearest points in each
group.

Seems a reasonable choice given the structure observed in the previous
exploration.
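
The single-linkage reduction move can be sketched as one function from a partition to a partition with one group fewer. The representation (a dict from group label to its points) and the example groups are assumptions made for illustration.

```python
import math
from itertools import combinations

def single_linkage_distance(group_a, group_b):
    """Euclidean distance between the nearest points of two groups."""
    return min(math.dist(p, q) for p in group_a for q in group_b)

def reduce_partition(groups):
    """Join the two closest groups; `groups` maps label -> list of points."""
    a, b = min(
        combinations(sorted(groups), 2),
        key=lambda pair: single_linkage_distance(groups[pair[0]], groups[pair[1]]),
    )
    merged = {g: list(pts) for g, pts in groups.items() if g != b}
    merged[a] = groups[a] + groups[b]  # the joined group keeps the first label
    return merged

# Hypothetical partition: a central clump and two neighbouring outer groups.
groups = {
    "red": [(0, 0), (0.5, 0)],
    "blue": [(5, 0), (5, 1)],
    "green": [(5, 2), (4, 3)],
}
smaller = reduce_partition(groups)
print(sorted(smaller))  # ['blue', 'red']
```

Applying the same function repeatedly gives a reduce sequence such as 5 -> 4 -> 3 -> 2, with exploration between steps deciding whether to continue.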

Reduce sequence

- 5 -> 4 -> 3

Reduce partition (4) to move to a new partition containing three groups.

Reduction move: as before.

Red ball remains.

Exploration suggests one more reduction move.

Reduce sequence

- 5 -> 4 -> 3 -> 2

Reduce partition (3) to move to a new partition containing two groups.

Reduction move: as before.

This partition seems best.

Interactive exploration is important for choosing the type and details
of potentially interesting moves from one partition to another.

Moves (generic functions)

- refine (Pold) --> Pnew
  - example: break minimal spanning tree
- reduce (Pold) --> Pnew
  - example: join near centres
- reassign (Pold) --> Pnew
  - example: k-means, maximize F
- partition (graphic) --> Pnew
  - example: colours from point cloud
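
One way to read "generic functions" is that every move shares a single shape, partition in and partition out, so moves can be chained freely. The sketch below is an assumption about how that might look. The type names, the dict representation, and the toy reduce move are all hypothetical and are not taken from the prototype.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Partition = Dict[Hashable, List[int]]    # group label -> case indices (assumed representation)
Move = Callable[[Partition], Partition]  # refine, reduce and reassign all share this shape

def partition_from_colours(colours) -> Partition:
    """partition (graphic): one group per colour currently shown on the points."""
    groups = defaultdict(list)
    for case, colour in enumerate(colours):
        groups[colour].append(case)
    return dict(groups)

def apply_moves(start: Partition, moves: List[Move]) -> List[Partition]:
    """Return the path Pa, Pb, Pc, ... visited by applying the moves in order."""
    path = [start]
    for move in moves:
        path.append(move(path[-1]))
    return path

def join_two_smallest(p: Partition) -> Partition:
    """Hypothetical toy reduce move: join the two groups with fewest cases."""
    a, b = sorted(p, key=lambda g: (len(p[g]), str(g)))[:2]
    out = {g: m for g, m in p.items() if g not in (a, b)}
    out[a] = sorted(p[a] + p[b])
    return out

start = partition_from_colours(["red", "red", "blue", "green", "red"])
path = apply_moves(start, [join_two_smallest])
print(path[-1])  # {'red': [0, 1, 4], 'blue': [2, 3]}
```

Keeping every visited partition in the path is what makes it possible to save, revisit and compare partitions later.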

Challenges

- varying focus
- subsets (selected manually and at random)
- merging new data into partition

- exploring multiple partitions
- interactive display and comparison
- resolving many to one

- interface design
- control panels, options
- interaction

A prototype interface

- cluster analysis hub
  - an analysis hub (Oldford, 1997) created on demand for partition
    - having all points in one group for named data-set, or
    - as defined by colours of all points in topmost plot, or
    - as defined by colours of selected points in topmost plot
  - new hub can always be created for any subset
- maintains list of saved partitions
- offers moves from current partition via one of
  - reduce, refine, or reassign
  - manually from current colours (so as to capture interactive
    modification of existing partition)
- other operations on one or more partitions (e.g. cluster plot,
  dendrogram, ...)

Interface illustration: details of moves

- Each move - refine, reduce, reassign - is an entire collection of
  possible moves, each with many possible choices.
- The next few slides illustrate the prototype implementation where
  - Buttons for refine, reduce, and reassign are given at the topmost
    level.
  - Once selected, each button pops up its own control panel where
    various different kinds of moves and parameter choices can be made.
    E.g. the analyst might choose to reduce by any of
    - Join groups with closest centres using Euclidean distance
    - Join groups whose farthest points are closest (i.e. complete
      linkage)
    - Choose group with greatest spread and disperse its points among
      the remaining groups.

Interface - reduce

Interface - refine

Interface - reassign

Interface illustration: example of use

- The next few slides illustrate the prototype implementation applied to
  a ball in a sphere data-set (a different one from before).
  - Moves are made about the partition space (refines and reassign)
  - Partitions are saved (can be named, deleted, revisited, etc.)
  - Nested partitions compared via a dendrogram
  - Non-nested partition compared with nested ones
- N.B. at any time, the analyst could have interacted with any graphic
  - to create a new partition by colouring (using manual button)
  - to focus on a subset to examine via a new cluster analysis hub and
    subsequently incorporate that into the partition of the whole
    data-set.

Interaction

Start with partition having all points in a

single group.

Selecting refine pops up the refinement panel.

Choose refinement details.

- Refinement move
  - Choose group with var-cov having largest eigenvalue.
  - Project these points onto corresponding eigenvector.
  - Split this group where the projected gap is largest.

Interaction

New partition appears as "Refine Dataset" in panel at left.

Refinement details unchanged.

Refine produces new partition having two groups

as shown by different colours in all graphics.

name and save partition

Saved partition list.

New partition is named and saved.

Refinement details unchanged.

New partition has three groups.

prototype - refine to 4

Refinement details unchanged.

New partition has four groups.

prototype - refine to 5

Refinement details unchanged.

No further refinement pursued beyond this one.

New partition has five groups. The fifth group

contains a single point (blue, top right).

Select nested partitions and view dendrogram

1. Select nested partitions

2. Dendrogram button.

3. Dendrogram shows 5 nested partitions
  - Each block is a group; a horizontal cut at each vertical level is a
    partition.
  - Size and colour proportions vary with number of points.
  - Colouring is as displayed in point cloud (here showing the current
    partition).

Reassign, dendrogram updated

New partition appears as "Reassign Dataset" in panel at left.

- Reassign move to new partition.
  - Details
    - k-means
    - max F statistic
- Colours update in all graphics including the dendrogram
- Reassignment partition can be explored as usual.
- This partition can be visually compared with previous partitions via
  the updated colours in the dendrogram.

Cluster plot and dendrogram interaction (movie)

Cluster plot button operates on selected partition

- Cluster plot
  - groups as boxes
  - close groups are visually close (via multi-dimensional scaling)

Nested and non-nested partitions can be visually compared simultaneously
through interaction.

Other operators

- dissimilarity (Pi, Pj) --> di,j

- display (P1, ..., Pm)
  - dendrogram if P1 < ... < Pm
  - mds plot of all clusters in P1, ..., Pm
  - mds plot of all partitions P1, ..., Pm
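
Two of these operators are easy to make concrete. The slides do not say which dissimilarity between partitions is intended; the sketch assumes one minus the Rand index (the fraction of case pairs on which two partitions disagree) purely as an example. The nesting check is what decides whether a dendrogram display applies. Partitions are written as label lists, which is an assumed representation.

```python
from itertools import combinations

def rand_dissimilarity(labels_i, labels_j):
    """1 - Rand index: share of case pairs grouped differently by the two partitions."""
    pairs = list(combinations(range(len(labels_i)), 2))
    disagree = sum(
        (labels_i[a] == labels_i[b]) != (labels_j[a] == labels_j[b])
        for a, b in pairs
    )
    return disagree / len(pairs)

def is_refinement(fine, coarse):
    """True if every group of `fine` lies inside a single group of `coarse`."""
    parent = {}
    return all(parent.setdefault(f, c) == c for f, c in zip(fine, coarse))

def is_nested(partitions):
    """True if the partitions, ordered coarse to fine, form a nested sequence."""
    return all(is_refinement(fine, coarse)
               for coarse, fine in zip(partitions, partitions[1:]))

# Hypothetical partitions of four cases.
p1 = [0, 0, 0, 0]
p2 = [0, 0, 1, 1]
p3 = [0, 1, 2, 2]
print(is_nested([p1, p2, p3]), rand_dissimilarity(p1, p2))
```

A matrix of such pairwise dissimilarities is the kind of input a multi-dimensional scaling plot of partitions would need.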

Creation

- partition (Data ...) --> Pnew
  - manually from colours
  - k-means, random start, mst, veri, etc.
  - from existing classifier

- partition-path (Data ...) --> P1, P2, ..., Pn

- partition-path (Pold ...) --> Pold, P1, P2, ..., Pn
  - e.g. nested sequence from hierarchical clustering

Composition

- resolve (P1, ..., Pm) --> Pnew
  - combine different partitions of the same data

- merge (Data, Pold) --> Pnew
  - classify additional points

- merge (Pa, Pb) --> Pnew
  - combine non-overlapping partitions
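
How several partitions should be resolved into one is left open in the slides. As one simple possibility, the sketch below takes the common refinement: two cases stay together only if every partition puts them together. This rule is an assumption for illustration, as are the function names and the representations. Merging non-overlapping partitions is sketched as a plain union that keeps groups from the two sources distinct.

```python
def resolve_by_intersection(*labelings):
    """Common refinement of label lists over the same cases (one possible resolve)."""
    ids = {}
    # Each distinct combination of labels becomes its own new group.
    return [ids.setdefault(key, len(ids)) for key in zip(*labelings)]

def merge_disjoint(partition_a, partition_b):
    """Union of two partitions given as case -> label dicts on disjoint case sets."""
    if set(partition_a) & set(partition_b):
        raise ValueError("case sets overlap")
    merged = {case: ("a", g) for case, g in partition_a.items()}
    merged.update({case: ("b", g) for case, g in partition_b.items()})
    return merged

resolved = resolve_by_intersection([0, 0, 1, 1], [0, 1, 1, 1])
print(resolved)  # [0, 1, 2, 2]

merged = merge_disjoint({"case1": 0, "case2": 0}, {"case3": 0})
print(merged)
```

The common refinement can only increase the number of groups, so in practice it would likely be followed by reduce moves chosen through exploration.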

Implications

- Algorithms (re)cast in terms of moves
- refine, reduce
- reassign
- partition, partition-path
- easily understandable (e.g. geometric structures)
- specify required data structures
- e.g. ms tree, triangulation, var-cov matrix, ...

New problems

- interface design
- multiple partitions
- comparison and/or resolution
- multiple display
- inference

Summary

- Cluster analysis is naturally exploratory and needs integration with
  modern interactive data analysis.
- Enlarging the problem to partitions
  - simplifies and gives structure
  - encourages exploratory approach
  - integrates naturally
  - introduces new possibilities (analysis and research)

Related references

- Interactive clustering: CASI talk, Oldford (2001)
- Quail: overview (Interface 1998), graphics (Hurley and Oldford, ISI
  1999) and code.
- Design principles: Oldford (Interface 1999)
- Analysis hubs: Oldford (Interface 1997)

Acknowledgements

- Catherine Hurley, Erin McLeish, Rayan Yahfoufi, Natasha Wiebe
- U(W) students in statistical computing
- Quail: Quantitative Analysis in Lisp
- http://www.stats.uwaterloo.ca/Quail