Why Not Store Everything in Main Memory? Why use disks?

Research of William Perrizo, C.S. Department, NDSU

I data-mine big data (big data = trillions of rows and, sometimes, thousands of columns, which can complicate mining trillions of rows). How do I do it? I structure the data table as compressed vertical bit columns (called "predicate trees" or "pTrees"). I process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.

What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are also important areas of data mining, and they are related to classification. The purpose of clustering is usually to create or improve a training table. It is also used for anomaly detection, a huge area in data mining. ARM is used to mine more complex data (relationship matrices between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rentals (or based on their ratings of items).
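As a loose illustration of the vertical layout (a toy sketch only; the authors' pTrees are compressed tree structures, and the column values and bit width here are made up), one numeric column can be stored as bit slices and a predicate evaluated with bitwise logic across the slices instead of scanning rows:

```python
import numpy as np

def bit_slices(col, width):
    """Return a list of 0/1 arrays; slice b holds bit b of every row's value."""
    return [(col >> b) & 1 for b in reversed(range(width))]

col = np.array([5, 3, 7, 2, 6], dtype=np.uint8)   # hypothetical 3-bit column
s2, s1, s0 = bit_slices(col, 3)                   # high-order bit first

# Predicate "value >= 6" for 3-bit values is just: bit2 AND bit1.
mask = s2 & s1
print(mask)   # 1 exactly at the rows holding 7 and 6
```

The point of the vertical form is that the predicate cost depends on the number of bit slices (column width), not on the number of rows.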

To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others); i.e., we let near-neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology, written by cognitive psychologist George A. Miller of Princeton University's Department of Psychology and published in Psychological Review. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
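The near-neighbor voting idea above can be sketched as plain k-nearest-neighbor voting (a hedged toy version; the function name, the squared-Euclidean metric, and the tiny dataset are illustrative assumptions, not the pTree implementation):

```python
from collections import Counter

def knn_vote(training, labels, query, k=7):
    """Let the k nearest previously classified cases vote on the query's class."""
    dist = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in training]
    nearest = sorted(range(len(training)), key=dist.__getitem__)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
labs  = ["a", "a", "a", "b", "b", "b", "b"]
print(knn_vote(train, labs, (2, 2), k=3))  # "a": all 3 nearest cases are class a
```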

Some current pTree data mining research projects:

- FAUST pTree PREDICTOR/CLASSIFIER (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching).
- FAUST pTree CLUSTER/ANOMALASER.
- pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing big data.
- pTree Text Mining: capture the reading sequence, not just the term-frequency matrix, of a text corpus (lossless capture).
- Secure pTreeBases: this involves anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.
- pTree Algorithmic Tools: an expanded algorithmic tool set is being developed to include quadratic tools and even higher-degree tools.
- pTree Alternative Algorithm Implementation: implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders-of-magnitude performance increases.
- pTree O/S Infrastructure: computers and operating systems are designed to do logical operations (AND, OR, ...) rapidly. Exploit this for pTree processing speed.
- pTree Recommenders: this includes Singular Value Decomposition (SVD) recommenders, pTree near-neighbor recommenders, and pTree ARM recommenders.

FAUST clustering (the unsupervised part of FAUST): this class of partitioning or clustering methods relies on choosing a dot-product projection F so that, if we find a gap in the F-values, we know that the two sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.
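The gap idea above can be sketched directly (a minimal dense-array version, not the pTree implementation; the sample points and the gap threshold of 4 are made up): project onto a unit vector d and split wherever consecutive sorted F-values differ by more than the threshold.

```python
import numpy as np

def gap_split(X, d, gap=4.0):
    """Split X into clusters at gaps > `gap` in the projections F = X.d."""
    F = X @ d                         # F-values: dot-product projections
    order = np.argsort(F)
    clusters, current = [], [order[0]]
    for i, j in zip(order, order[1:]):
        if F[j] - F[i] > gap:         # points across this gap are >= gap apart
            clusters.append(current); current = []
        current.append(j)
    clusters.append(current)
    return clusters

X = np.array([[0., 0.], [1., 0.], [2., 1.], [10., 0.], [11., 1.]])
d = np.array([1., 0.])
print([sorted(c) for c in gap_split(X, d, gap=4.0)])  # [[0, 1, 2], [3, 4]]
```

Because F is 1-dimensional, a gap in F is a lower bound on the true separation: two points on opposite sides of the gap differ by at least the gap width in the d direction alone.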

The Coordinate Projection Functionals (ej): check gaps in ej(y) = y∘ej = yj.

The Square Distance Functional (SD): check gaps in SDp(y) = (y−p)∘(y−p) (parameterized over a p ∈ Rn grid).

The Square Dot Product Radius (SDPR): SDPRpq(y) = SDp(y) − DPPpq(y)² (easier pTree processing).

DPP-KM: 1. Check gaps in DPPp,d(y) (over grids of p and d). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).

DPP-DA: 2. Check gaps in DPPp,d(y) (grids of p and d) against the density of the subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once DPP ceases to be effective.

DPP-SD: 3. Check gaps in DPPp,d(y) (over a p-grid and a d-grid) and SDp(y) (over a p-grid). 3.1 Check sparse-ends distance against subcluster density. (DPPpd and SDp share construction steps!)

SD-DPP-SDPR: DPPpq, SDp and SDPRpq share construction steps!

SDp(y) = (y−p)∘(y−p) = y∘y − 2·y∘p + p∘p

DPPpq(y) = (y−p)∘d = y∘d − p∘d, where y∘d = (1/|p−q|)·y∘p − (1/|p−q|)·y∘q

Calculate y∘y, y∘p, y∘q concurrently. Then do the constant multiplies 2·y∘p and (1/|p−q|)·y∘p concurrently. Then add/subtract.

Calculate DPPpq(y)². Then subtract it from SDp(y).
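A quick numeric check of these identities (my notation: `@` is the dot product ∘, and I take d = (p−q)/|p−q| as an assumption consistent with the sign convention above; the random vectors are just test data):

```python
import numpy as np

rng = np.random.default_rng(0)
y, p, q = rng.random(4), rng.random(4), rng.random(4)

# SD_p(y) = (y-p).(y-p) = y.y - 2 y.p + p.p
sd = (y - p) @ (y - p)
assert np.isclose(sd, y @ y - 2 * (y @ p) + p @ p)

# SDPR_pq(y) = SD_p(y) - DPP_pq(y)^2 is the squared distance from y
# to the pq-line, so it can never be negative.
d = (p - q) / np.linalg.norm(p - q)
dpp = (y - p) @ d
sdpr = sd - dpp ** 2
print(sdpr >= 0)   # True
```

This is why the three functionals share construction steps: y∘y, y∘p and y∘q are computed once and combined.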

FAUST DPP CLUSTER on IRIS with DPP(y) = (y−p)∘(q−p)/|q−p|, where p is the min (n) corner and q is the max (x) corner of the circumscribing rectangle (midpoints, or the average (a), are used also).

Checking 0–4 distances (s42 is a Setosa outlier):

F     0    1    2    3    3    3    4
      s14  s42  s45  s23  s16  s43  s3
s14   0    8    14   7    20   3    5
s42   8    0    17   13   24   9    9
s45   14   17   0    11   9    11   10
s23   7    13   11   0    15   5    5
s16   20   24   9    15   0    18   16
s43   3    9    11   5    18   0    3
s3    5    9    10   5    16   3    0

IRIS: 150 irises (rows), 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). The first 50 are Setosa (s), the next 50 are Versicolor (e), and the next 50 are Virginica (i).

CL1: F < 17 (50 Setosa).
17 < F < 23: CL2 (e8, e11, e44, e49, i39), gap > 4, p = nnnn, q = xxxx.
23 < F: CL3 (46 Versicolor, 49 Virginica).

(F, Count) distribution:
(0,1) (1,1) (2,1) (3,3) (4,1) (5,6) (6,4) (7,5) (8,7) (9,3) (10,8) (11,5) (12,1) (13,2) (14,1) (15,1) (19,1) (20,1) (21,3) (26,2) (28,1) (29,4) (30,2) (31,2) (32,2) (33,4) (34,3) (36,5) (37,2) (38,2) (39,2) (40,5) (41,6) (42,5) (43,7) (44,2) (45,1) (46,3) (47,2) (48,1) (49,5) (50,4) (51,1) (52,3) (53,2) (54,2) (55,3) (56,2) (57,1) (58,1) (59,1) (61,2) (64,2) (66,2) (68,1)

Thinning at 6–7: CL3.1 (F < 6.5): 44 Versicolor, 4 Virginica. CL3.2 (F > 6.5): 2 Versicolor, 39 Virginica. No sparse ends.

Check distances in [12, 28]: s16, i39, e49, e11, e8, e44, i6, i10, i18, i19, i23, i32 are outliers.

F     12   13   13   14   15   19   20   21   21   21   26   26   28
      s34  s6   s45  s19  s16  i39  e49  e8   e11  e44  e32  e30  e31
s34   0    5    8    5    4    21   25   28   32   28   30   28   31
s6    5    0    4    3    6    18   21   23   27   24   26   23   27
s45   8    4    0    6    9    18   18   21   25   21   24   22   25
s19   5    3    6    0    6    17   21   24   27   24   25   23   27
s16   4    6    9    6    0    20   26   29   33   29   30   28   31
i39   21   18   18   17   20   0    17   21   24   21   22   19   23
e49   25   21   18   21   26   17   0    4    7    4    8    8    9
e8    28   23   21   24   29   21   4    0    5    1    7    8    8
e11   32   27   25   27   33   24   7    5    0    4    7    9    7
e44   28   24   21   24   29   21   4    1    4    0    6    8    7
e32   30   26   24   25   30   22   8    7    7    6    0    3    1
e30   28   23   22   23   28   19   8    8    9    8    3    0    4
e31   31   27   25   27   31   23   9    8    7    7    1    4    0

Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the standard deviation of the F-values to ensure strong gaps (a heuristic method).

Checking 57–68 distances: i10, i36, i19, i32, i18, i6, i23 are outliers.

F     57   58   59   61   61   64   64   66   66   68
      i26  i31  i8   i10  i36  i6   i23  i19  i32  i18
i26   0    5    4    8    7    8    10   13   10   11
i31   5    0    3    10   5    6    7    10   12   12
i8    4    3    0    10   7    5    6    9    11   11
i10   8    10   10   0    8    10   12   14   9    9
i36   7    5    7    8    0    5    7    9    9    10
i6    8    6    5    10   5    0    3    5    9    8
i23   10   7    6    12   7    3    0    4    11   10
i19   13   10   9    14   9    5    4    0    13   12
i32   10   12   11   9    9    9    11   13   0    4
i18   11   12   11   9    10   8    10   12   4    0

"Gap Hill Climbing": mathematical analysis.

1. To increase gap size, we hill-climb the standard deviation of the functional F, hoping that a "rotation" of d toward a higher StDev increases the likelihood of larger gaps, since more dispersion allows for more and/or larger gaps. This is very heuristic, but it works.

2. We are more interested in growing the largest gap(s) of interest (or the largest thinning). F-slices are hyperplanes (assuming F = dot d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the two entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the F-slices, the (n−1)-dimensional hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place).

The d2-gap is much larger than the d1-gap, though it is still not the optimal gap. Would it be better to use a weighted mean, weighted by the distance from the gap, that is, by the d-barrel radius (from the center of the gap) on which each point lies?

In this example that seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous case.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.

Maximizing the Variance

How do we use this theory? For dot-product gap-based clustering, we can hill-climb akk (below) to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

Given any table, X(X1, ..., Xn), and any unit vector, d, in n-space, let the matrix A be defined as on the slide. We can separate out the diagonal or not.

These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.
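The hill-climb itself can be sketched as gradient ascent on the unit sphere (a hedged toy version: I climb the variance d'Cd of the projections of all points, using the sample covariance C of X, rather than the slide's class-means matrix A, which is not transcribed here; the data are synthetic):

```python
import numpy as np

def climb_d(X, steps=200, lr=0.1):
    """Gradient-ascend Var(X.d) = d'Cd over unit vectors d."""
    C = np.cov(X.T)
    d = np.ones(X.shape[1]) / np.sqrt(X.shape[1])   # arbitrary start direction
    for _ in range(steps):
        g = 2 * C @ d                 # gradient of d'Cd
        d = d + lr * g
        d = d / np.linalg.norm(d)     # project back onto the unit sphere
    return d, d @ C @ d

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.5])  # axis 0 dominates
d, v = climb_d(X)
print(abs(d[0]) > 0.95)   # climbs to roughly (±1, 0, 0), the high-variance axis
```

Since Var(X∘d) = d'Cd, the maximizer over unit d is the dominant eigenvector of C; the iteration above converges to it.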

FAUST Classifier MVDI (Maximized Variance Definite Indefinite)

Given d0, one can hill-climb it to locally maximize the variance, V, as follows: d1 = ∇(V(d0)), d2 = ∇(V(d1)), ...

Build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply DI each round (see next slide).

FAUST DI: K-class training set, TK, and a given d (e.g., from D = MeanTK→MedianTK).

Let mi = mean(Ci), ordered so that d∘m1 ≤ d∘m2 ≤ ... ≤ d∘mK.

Mni = Min(d∘Ci); Mxi = Max(d∘Ci); Mn>i = Min over j>i of Mnj; Mx<i = Max over j<i of Mxj.

Definite_i = (Mx<i, Mn>i). Indefinite_i,i+1 = [Mn>i, Mx<i+1].

Then recurse on each indefinite set.
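The definite/indefinite step above can be sketched as follows (a simplified, hedged version: `di_intervals` is a hypothetical name, classes are NumPy arrays, and I clip each class's definite interval to its own projection span; the recursion on indefinite sets is omitted):

```python
import numpy as np

def di_intervals(classes, d):
    """Per class: the part of its projection span no other class overlaps."""
    spans = sorted(((c @ d).min(), (c @ d).max(), k) for k, c in classes.items())
    out = {}
    for i, (mn, mx, k) in enumerate(spans):
        lo = max((s[1] for s in spans[:i]), default=-np.inf)   # Mx of lower classes
        hi = min((s[0] for s in spans[i+1:]), default=np.inf)  # Mn of higher classes
        out[k] = {"definite": (max(mn, lo), min(mx, hi))}
    return out

d = np.array([1.0, 0.0])
classes = {"s": np.array([[0., 0.], [2., 1.]]),
           "e": np.array([[1.5, 0.], [5., 1.]])}
print(di_intervals(classes, d))
```

Here the two spans overlap on (1.5, 2), so that strip would be the indefinite set to recurse on; outside it, each class's interval is definite.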

For IRIS, 15 records were extracted from each class for testing. The rest are the training set, TK. D = MEANs→MEANe.

Definite_i (Mx<i, Mn>i):     Indefinite_i,i+1 (Mn>i, Mx<i+1):
s (i=1): (-1, 25)            s_e: (25, 10), empty
e (i=2): (10, 37)            e_i: (37, 48)
i (i=3): (48, 128)

Class means: s-Mean = (50.49, 34.74, 14.74, 2.43); e-Mean = (63.50, 30.00, 44.00, 13.50); i-Mean = (61.00, 31.50, 55.50, 21.50).

1ST ROUND, D = Means→Meane:
F < 18 → Setosa (35 Setosa)
18 < F < 37 → Versicolor (15 Versicolor)
37 ≤ F ≤ 48 → IndefiniteSet2 (20 Versicolor, 10 Virginica)
48 < F → Virginica (25 Virginica)

IndefSet2 ROUND, D = Meane→Meani:
F < 7 → Versicolor (17 Versicolor, 0 Virginica)
7 ≤ F ≤ 10 → IndefSet3 (3 Versicolor, 5 Virginica)
10 < F → Virginica (0 Versicolor, 5 Virginica)

IndefSet3 ROUND, D = Meane→Meani:
F < 3 → Versicolor (2 Versicolor, 0 Virginica)
3 ≤ F ≤ 7 → IndefSet4 (2 Versicolor, 1 Virginica); here we will assign 0 ≤ F ≤ 7 → Versicolor
7 < F → Virginica (0 Versicolor, 3 Virginica)

Test, 1ST ROUND, D = Means→Meane:
F < 15 → Setosa (15 Setosa)
15 < F < 15 → Versicolor (0 Versicolor, 0 Virginica)
15 ≤ F ≤ 41 → IndefiniteSet2 (15 Versicolor, 1 Virginica)
41 < F → Virginica (14 Virginica)

IndefSet2 ROUND, D = Meane→Meani:
F < 20 → Versicolor (15 Versicolor, 0 Virginica)
20 < F → Virginica (0 Versicolor, 1 Virginica)

100% accuracy.

Option-1: The sequence of D's is Mean(Class_k)→Mean(Class_{k+1}), k = 1, ... (and Mean could be replaced by VOM, or ...?)

Option-2: The sequence of D's is Mean(Class_k)→Mean(∪ over h=k+1..n of Class_h), k = 1, ... (and Mean could be replaced by VOM, or ...?)

Option-3: D sequence: Mean(Class_k)→Mean(∪ over h not used yet of Class_h), where k is the class with max count in the subcluster (VOM instead?)

Option-2': D sequence: Mean(Class_k)→Mean(∪ over h=k+1..n of Class_h) (VOM?), where k is the class with max count in the subcluster.

Option-4: D sequence: always pick the means pair which are furthest separated from each other.

Option-5: D: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(mean_i), F(mean_j).

Option-6: D: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS = X).

FAUST MVDI on IRIS: 15 records from each class for testing (vir39 was removed as an outlier).

Definite:            Indefinite:
s: (-1, 10)          s_e: (23, 10), empty
e: (23, 48)          se_i: (38, 48)
i: (38, 70)

Class means: s-Mean = (50.49, 34.74, 14.74, 2.43); e-Mean = (63.50, 30.00, 44.00, 13.50); i-Mean = (61.00, 31.50, 55.50, 21.50).

In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in the decision tree.

FAUST MVDI on SatLog: 413 train, 4 attributes, 6 classes, 127 test.

Using class means (class mean, then F∘MN, Ct, min, max, max+1):
mn4: 83 101 104 82 | 113, 8, 110, 121, 122
mn3: 85 103 108 85 | 117, 79, 105, 128, 129
mn1: 69 106 115 94 | 133, 12, 123, 148, 149

Using full data (much better!):
mn4: 83 101 104 82 | 59, 8, 56, 65, 66
mn3: 85 103 108 85 | 62, 79, 52, 74, 75
mn1: 69 106 115 94 | 81, 12, 73, 95, 96

Gradient hill climb of Variance(d) (d1, d2, d3, d4 → V(d)):
0.00 0.00 1.00 0.00 → 282
0.13 0.38 0.64 0.65 → 700
0.20 0.51 0.62 0.57 → 742
0.26 0.62 0.57 0.47 → 781
0.30 0.70 0.53 0.38 → 810
0.34 0.76 0.48 0.30 → 830
0.36 0.79 0.44 0.23 → 841
0.37 0.81 0.40 0.18 → 847
0.38 0.83 0.38 0.15 → 850
0.39 0.84 0.36 0.12 → 852
0.39 0.84 0.35 0.10 → 853

F∘mn, Ct, min, max, max+1:
mn2: 49 40 115 119 | 106, 108, 91, 155, 156
mn5: 58 58 76 64 | 108, 61, 92, 145, 146
mn7: 69 77 81 64 | 131, 154, 104, 160, 161
mn4: 78 91 96 74 | 152, 60, 127, 178, 179
mn1: 67 103 114 94 | 167, 27, 118, 189, 190
mn3: 89 107 112 88 | 178, 155, 157, 206, 207

Gradient hill climb of Var(d) on t25 (d1, d2, d3, d4 → V(d)):
0.00 0.00 0.00 1.00 → 1137
-0.11 -0.22 0.54 0.81 → 1747

MN∘d, Ct, ClMn, ClMx, ClMx+1:
mn2: 45 33 115 124 | 150, 54, 102, 177, 178
mn5: 55 52 72 59 | 69, 33, 45, 88, 89

Gradient hill climb of Var(d) on t257:
0.00 0.00 1.00 0.00 → 496
-0.15 -0.29 0.56 0.76 → 1595
Same using class means or the training subset.

Gradient hill climb of Var(d) on t75:
0.00 0.00 1.00 0.00 → 12
0.04 -0.09 0.83 0.55 → 20
-0.01 -0.19 0.70 0.69 → 21

Gradient hill climb of Var(d) on t13:
0.00 0.00 1.00 0.00 → 29
-0.83 0.17 0.42 0.34 → 166
0.00 0.00 1.00 0.00 → 25
-0.66 0.14 0.65 0.36 → 81
-0.81 0.17 0.45 0.33 → 88

On the 127-sample SatLog test set: 4 errors, or 96.8% accuracy.

Speed? With horizontal data, DTI is applied to one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire test set (a PTreeSet), create the various dot-product SPTS (one for each internal node), and create the cut SPTS masks. These masks mask the results for the entire test set.

Gradient hill climb of Var(d) on t143:
0.00 0.00 1.00 0.00 → 19
-0.66 0.19 0.47 0.56 → 95
0.00 0.00 1.00 0.00 → 27
-0.17 0.35 0.75 0.53 → 54
-0.32 0.36 0.65 0.58 → 57
-0.41 0.34 0.62 0.58 → 58

For WINE (min, max+1):
8.40 10.33 27.00 9.63 | 28.65, 9.9, 53.4
7.56 11.19 32.61 10.38 | 34.32, 7.7, 111.8
8.57 12.84 30.55 11.65 | 32.72, 8.7, 108.4
8.91 13.64 34.93 11.97 | 37.16, 13.1, 92.2
Awful results!

Gradient hill climb of Var on t156161:
0.00 0.00 1.00 0.00 → 5
-0.23 -0.28 0.89 0.28 → 19
-0.02 -0.06 0.12 0.99 → 157
0.02 -0.02 0.02 1.00 → 159
0.00 0.00 1.00 0.00 → 1
-0.46 -0.53 0.57 0.43 → 2
Inconclusive both ways, so predict plurality = 4 (17) (3: ct 3; t: ct 6).

Gradient hill climb of Var on t146156:
0.00 0.00 1.00 0.00 → 0
0.03 -0.08 0.81 -0.58 → 1
0.00 0.00 1.00 0.00 → 13
0.02 0.20 0.92 0.34 → 16
0.02 0.25 0.86 0.45 → 17
Inconclusive both ways, so predict plurality = 4 (17) (7: ct 15; 2: ct 2).

Gradient hill climb of Var on t127:
0.00 0.00 1.00 0.00 → 41
-0.01 -0.01 0.70 0.71 → 90
-0.04 -0.04 0.65 0.75 → 91
0.00 0.00 1.00 0.00 → 35
-0.32 -0.14 0.59 0.73 → 105
Inconclusive; predict plurality = 7 (62), 4 (15), 1 (5), 2 (8), 5 (7).

FAUST MVDI: Concrete

d0 = (-0.34, -0.16, 0.81, -0.45). 7 test errors / 30 = 77% accuracy.

For Concrete (min, max+1), train: 335.3 657.1, 0 l; 120.5 611.6, 12 m; 321.1 633.5, 0 h. Test: 0 l, 1 m, 0 h; 0 321. Then: 3.0 57.0, 0 l; 3.0 361.0, 11 m; 28.0 92.0, 0 h. Test: 0 l, 2 m, 0 h; 92 999.

Seeds

d0 = (.97, .17, -.02, .15): 13.3 19.3, 0 0 l; 16.4 23.5, 0 0 m; 12.2 15.2, 25 5 h; 0 13.2 19.3 23.5.
d3: 547.9 860.9, 4 l; 617.1 957.3, 0 m; 762.5 867.7, 0 h. Test: 0 l, 0 m, 0 h; 0 617.
8 test errors / 32 = 75% accuracy.
d2: 544.2 651.5, 0 l; 515.7 661.1, 0 m; 591.0 847.4, 40 h. Test: 1 l, 0 m, 11 h; 662 999.

FAUST Oblique Classifier: formula P(X∘D) > a, X any set of vectors, D the oblique vector. (Note: if D = ei, this is P(Xi > a).)

E.g., let D = the vector connecting the class means, and d = D/|D|.

To separate r from v: D = (mv→mr), a = (mv+mr)/2 ∘ d, the midpoint of D projected onto d.

FAUST-Oblique: create a table, TBL(class_i, class_j, medoid_vector_i, medoid_vector_j). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or max_std_Int_pt instead of max_gap_midpt. Then we need std_j (or variance_j) in TBL.

Best cut-point? Mean, vector of medians, outmost, outmost non-outlier?

P over (mb→mr)∘X > (mr+mb)/2 ∘ d.

"Outermost" = furthest from the means (their projections on the D-line); best rank-K points, best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex.

In higher dimensions, the same holds (if the classes cluster convexly, FAUST div,oblique_gap finds them).


Separate classR, classV using the midpoint-of-means (mom) method: calculate a.

FAUST Oblique PR: P(X∘d) < a. D = mR→mV, the oblique vector; d = D/|D|.

View mR, mV as vectors (mR = the vector from the origin to pt_mR); a = (mR + (mV−mR)/2)∘d = (mR+mV)/2 ∘ d.

(The very same formula works when D = mV→mR, i.e., when it points to the left.)

Training is choosing the "cut-hyper-plane" (CHP), which is always an (n−1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification). Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use: 1. the vector of medians, vom, to represent each class rather than mV, where vomV = (median{v1 : v ∈ V}, median{v2 : v ∈ V}, ...); 2. project each class onto the d-line (e.g., the R class below), then calculate the std (one horizontal formula per class using Md's method), then use the std ratio to place the CHP (no longer at the midpoint between mR/vomR and mV/vomV).
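The mean-midpoint cut can be sketched as follows (a hedged dense-array toy, not the pTree bulk classification; the two point sets are made up): with D = mV − mR and d = D/|D|, the cut value is a = ((mR + mV)/2)∘d, and class R is predicted when x∘d < a.

```python
import numpy as np

def mom_cut(R, V):
    """Midpoint-of-means cut: unit direction d and cut value a."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    a = (mR + mV) / 2 @ d            # midpoint of the means, projected onto d
    return d, a

R = np.array([[0., 0.], [1., 1.], [0., 1.]])
V = np.array([[5., 5.], [6., 5.], [5., 6.]])
d, a = mom_cut(R, V)
print((R @ d < a).all() and (V @ d > a).all())  # True: one cut separates them
```

Replacing the means with vectors of medians, or shifting a by the std ratio as described above, changes only the two summary vectors fed into the same formula.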

[Figure: 2-D scatter (axes dim 1, dim 2) of class R points (r) and class V points (v), with their means mR and mV marked.]

L1(x,y) Value Array z1 0 2 4 5 10 13 14 15 16

17 18 19 20 z2 0 2 3 8 11 12 13 14 15 16 17

18 z3 0 2 3 8 11 12 13 14 15 16 17 18 z4 0

2 3 4 6 9 11 12 13 14 15 16 z5 0 3 5 8 9

10 11 12 13 14 15 z6 0 5 6 7 8 9 10 z7 0

2 5 8 11 12 13 14 15 16 z8 0 2 3 6 9 11 12

13 14 z9 0 2 3 6 11 12 13 14 16 z10 0 3 5

8 9 10 11 13 15 z11 0 2 3 4 7 8 11 12 13 15

17 z12 0 1 2 3 6 8 9 11 13 14 15 17 19 z13

0 2 3 5 8 11 13 14 16 18 z14 0 1 2 3 7 9

10 12 14 15 16 18 20 z15 0 4 5 6 7 8 9 10

11 13 15

12/8/12

L1(x,y) Count Array z1 1 2 1 1 1 1 2 1 1

1 1 1 1 z2 1 3 1 1 1 2 1 1 1 1 1

1 z3 1 3 1 1 1 1 1 2 1 1 1 1 z4 1 2

1 1 1 1 1 2 1 2 1 1 z5 1 3 2 1 1

1 2 1 1 1 1 z6 1 2 3 2 4 1 2 z7 1 2

1 1 1 1 2 4 1 1 z8 1 2 1 1 1 2 4

1 2 z9 1 2 1 1 3 2 1 3 1 z10 1 2 2 2

1 2 2 2 1 z11 1 1 2 1 1 1 2 1 2 2

1 z12 1 1 1 1 1 1 1 2 1 1 1 2 1 z13 1

1 2 1 1 1 1 3 3 1 z14 1 1 1 1 1 1

1 2 1 1 1 2 1 z15 1 1 1 1 2 1 1 1 2

3 1

x y x\y 1 2 3 4 5 6 7 8 9 a b 1 1 1 1 3 1 2

3 2 2 3 2 4 3 3 4 5 2 5 5 9 3 6 15

1 7 f 14 2 8 15 3 9 6

d 13 4 a b 10 9 b

c e 1110 c 9 11 d a 1111 e

8 7 8 f 7 9


This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. It likewise confirms zf as an anomaly or outlier, since it too was already declared so during the linear gap analysis.

After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps, to determine whether there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
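The round-gap check described above can be sketched as follows (a hedged toy version; `radial_gap_anomalies` is a hypothetical name, the sample points are made up, and the gap threshold of 2 comes from the text): rank all points by distance from a chosen point f and split where consecutive distances differ by more than the threshold; tiny groups past a gap are candidate anomalies.

```python
import numpy as np

def radial_gap_anomalies(X, f, gap=2.0):
    """Groups of size <= 2 isolated by a gap > `gap` in distances from f."""
    ranked = sorted((np.linalg.norm(x - f), i) for i, x in enumerate(X))
    groups, cur = [], [ranked[0][1]]
    for (ra, _), (rb, j) in zip(ranked, ranked[1:]):
        if rb - ra > gap:
            groups.append(cur); cur = []
        cur.append(j)
    groups.append(cur)
    return [g for g in groups if len(g) <= 2]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [8., 8.]])
f = X.mean(axis=0)
print(radial_gap_anomalies(X, f))  # [[4]]: the far point is a singleton
```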

y∘(x−M)/|x−M| Value Arrays: z1 0 1 2 5 6 10

11 12 14 z2 0 1 2 5 6 10 11 12 14 z3 0 1

2 5 6 10 11 12 14 z4 0 1 3 6 10 11 12 14 z5

0 1 2 3 5 6 10 11 12 14 z6 0 1 2 3 7

8 9 10 z7 0 1 2 3 4 6 9 11 12 z8 0 1 2

3 4 6 9 11 12 z9 0 1 2 3 4 6 7 10 12

13 z10 0 1 2 3 4 5 7 11 12 13 z11 0 1 2

3 4 6 8 10 11 12 z12 0 1 2 3 5 6 7 8 9

11 12 13 z13 0 1 2 3 7 8 9 10 z14 0 1 2

3 5 7 9 11 12 13 z15 0 1 3 5 6 7 8 9 10

11

Cluster by splitting at gaps > 2

x, y, F table with x = z1: z1 14, z2 12, z3 12, z4 11, z5 10, z6 6, z7 1, z8 2, z9 0, z10 2, z11 2, z12 1, z13 2, z14 0, z15 5. Mean M = (9, 5).


y∘(x−M)/|x−M| Count Arrays: z1 2 2 4 1 1 1

1 2 1 z2 2 2 4 1 1 1 1 2 1 z3 1 5 2

1 1 1 1 2 1 z4 2 4 2 2 1 1 2 1 z5

2 2 3 1 1 1 1 1 2 1 z6 2 1 1 1 1 3

3 3 z7 1 4 1 3 1 1 1 2 1 z8 1 2 3

1 3 1 1 2 1 z9 2 1 1 2 1 3 1 1 2

1 z10 2 1 1 1 1 1 4 1 1 2 z11 1 2 1 1

3 2 1 1 1 2 z12 1 1 1 2 2 1 1 1 1

1 1 2 z13 3 3 3 1 1 1 1 2 z14 1 1 2 1

3 2 1 1 2 1 z15 1 2 1 1 2 1 2 2 2 1

gap 10-6

gap 5-2

cluster PTree Masks (by ORing)

z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1

z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0


gap 6-9

z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1

z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0


gap 3-7


zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1

zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0


z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1

z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0

z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1

z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0

zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1

zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

AND each red mask with each blue mask and each green mask to get the subcluster masks (12 ANDs).

FAUST Clustering Methods: MCR (Using Midlines of the circumscribing Coordinate Rectangle)

For any FAUST clustering method, we proceed in one of two ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and another point, g, usually).

Given d, take f = MinPt(x∘d) and g = MaxPt(x∘d). Given f and g, take d = (f−g)/|f−g|.

So we can use any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ...

Define a sequence fk, gk, dk (writing minXj and maxXj for the min and max of column j):
fk = ((minX1+maxX1)/2, ..., minXk, ..., (minXn+maxXn)/2)
gk = ((minX1+maxX1)/2, ..., maxXk, ..., (minXn+maxXn)/2)
dk = ek, and SpS(x∘dk) = Xk.

f, g, d, SpS(x∘d) require no processing (gap-finding is the only cost). MCR(fg) adds the cost of SpS((x−f)∘(x−f)) and SpS((x−g)∘(x−g)).

MCR(dfg) on Iris150

Do SpS(x∘d) linear gap analysis (since it is processing-free). Then do SpS((x−f)∘(x−f)) and SpS((x−g)∘(x−g)) round-gap analysis, sequencing through the f, g pairs, on what's left (look for outliers in subclus1, subclus2).

d3: 0 10 set23 ... 1 19 set45; 0 30 ver49 ... 0 69 vir19

SubClus2

SubClus1

d1 none

d2 none

f1 none

f1 none

g1 none

g1 none

f2 1 41 vir23 0 47 vir18 0 47 vir32

f2 none

SubClus1

g2 none

d4 1 6 set44 0 18 vir39 Leaves exactly the

50 setosa.

f3 none

g2 none

g3 none

f4 none

f3 none

g3 none

g4 none

SubClus2

f4 none

d4 none Leaves 50 ver and 49 vir

g4 none

MCR(d) on Iris150+Outlier30, gap > 4

Do SpS(x∘dk) linear gap analysis, k = 1, 2, 3, 4. Declare subclusters of size one or two to be outliers. Create the full pairwise distance table for any subcluster of size ≤ 10, and declare any point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).
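The pairwise-table outlier rule just stated can be sketched directly (a hedged dense-array toy; `table_outliers` is a hypothetical name, the sample points are made up, and the threshold of 4 comes from the text):

```python
import numpy as np

def table_outliers(X, thresh=4.0):
    """Points whose distances to all other members exceed `thresh`."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # full table
    n = len(X)
    return [i for i in range(n)
            if all(D[i, j] > thresh for j in range(n) if j != i)]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [9., 9.]])
print(table_outliers(X, thresh=4.0))  # [3]: only the far point qualifies
```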

d3 0 10 set23... 1 19 set25 0 30

ver49... 1 69 vir19 Same split (expected)

d1 0 17 t124 0 17 t14 0 17 tal 1 17

t134 0 23 t13 0 23 t12 0 23 t1 1 23

t123 0 38 set14 ... 1 79 vir32 0 84

b12 0 84 b1 0 84 b13 1 84 b123 0 98

b124 0 98 b134 0 98 b14 0 98 ball

SubClus1 d4 1 6 set44 0 18 vir39 Leaves exactly

the 50 setosa as SubCluster1.

SubClus2 d4 0 0 t4 1 0 t24 0 10 ver18 ... 1

25 vir45 0 40 b4 0 40 b24 Leaves the 49

virginica (vir39 declared an outlier) and

the 50 versicolor as SubCluster2.

MCR(d) performs well on this dataset.

Accuracy: We can't expect a clustering method to separate Versicolor from Virginica, because there is no gap between them. This method does separate off Setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It finds the Virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other Virginica irises, whereas no other iris is more than 9.1 from its classmates). Speed: dk = ek, so there is zero calculation cost for the d's. SpS(x∘dk) = SpS(x∘ek) = SpS(Xk), so there is zero calculation cost for it. The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!

d2 0 5 t2 0 5 t23 0 5 t24 1 5

t234 0 20 ver1 ... 1 44 set16 0 60 b24 0

60 b2 0 60 b234 0 60 b23

CCR(fgd) (Corners of the Circumscribing Coordinate Rectangle): f1 = minVecX = (minXx1, ..., minXxn) = (0000); g1 = MaxVecX = (MaxXx1, ..., MaxXxn) = (1111); d = (g−f)/|g−f|.

Start: f1 = MnVec, RnGp>4: none.

Sequence through the main-diagonal pairs, f, g, lexicographically. For each, create d.

g1 = MxVec, RnGp>4: 0 7 vir18 ... 1 47 ver30; 0 53 ver49 ... 0 74 set14

CCR(f): do SpS((x−f)∘(x−f)) round gap analysis. CCR(g): do SpS((x−g)∘(x−g)) round gap analysis. CCR(d): do SpS(x∘d) linear gap analysis.

Notes: No calculation is required to find f and g (assuming MaxVecX and minVecX have been calculated and residualized when PTreeSetX was captured). If the dimension is high, the main-diagonal corners are likely far from X, and thus the large radii make the round gaps nearly linear.

SubClus1: Lin>4 none
SubCluster2:
f2=0001 RnGp>4 none
g2=1110 RnGp>4 none
This ends SubClus2: 47 Setosa only.
g1=1111 RnGp>4 none
Lin>4 none
f1=0000 RnGp>4 none
Lin>4 none
f3=0010 RnGp>4 none
g2=1110 RnGp>4 none
f2=0001 RnGp>4 none
Lin>4 none
g3=1101 RnGp>4 none
f3=0010 RnGp>4 none
g3=1101 RnGp>4 none
Lin>4 none
Lin>4 none
f4=0011 RnGp>4 none
g4=1100 RnGp>4 none
f4=0011 RnGp>4 none
Lin>4 none
g4=1100 RnGp>4 none
f5=0100 RnGp>4 none
g5=1011 RnGp>4 none
Lin>4 none
Lin>4 none
f6=0101 RnGp>4: 1 19 set26, 0 28 ver49, 0 31 set42, 0 31 ver8, 0 32 set36, 0 32 ver44, 1 35 ver11, 0 41 ver13
f5=0100 RnGp>4 none
g5=1011 RnGp>4 none
Lin>4 none
f6=0101 RnGp>4 none
g6=1010 RnGp>4 none
g6=1010 RnGp>4 none
Lin>4 none
Lin>4 none
f7=0110 RnGp>4 none
f7=0110 RnGp>4: 1 28 ver13, 0 33 vir49
g7=1001 RnGp>4 none
Lin>4 none
g7=1001 RnGp>4 none
Lin>4 none
Lin>4 none
g8=1000 RnGp>4 none
f8=0111 RnGp>4 none
f8=0111 RnGp>4 none
g8=1000 RnGp>4 none
Lin>4 none. This ends SubClus1: 95 ver and vir samples only.


FM(fgd) (Furthest-from-the-Medoid)

FMO (FM using a Gram-Schmidt Orthonormal basis). X ⊆ Rn. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. And, by the way, use residualized STD calculations to guide the choice of good gap-width thresholds (which define what an outlier is going to be and also determine when we divide into sub-clusters).

f=M Gp>4: 1 53 b13, 0 58 t123, 0 59 b234, 0 59 tal, 0 60 b134, 1 61 b123, 0 67 ball

f0=t123 RnGp>4: 1 0 t123, 0 25 t13, 1 28 t134, 0 34 set42 ... 1 103 b23, 0 108 b13

d1 = (M−f1)/|M−f1|. f1 = MaxPt(SpS((M−x)∘(M−x))).

SubClust-1: f0=b2 RnGp>4: 1 0 b2, 0 28 ver36
SubClust-2: f0=t3 RnGp>4: none

If d1∘e1 ≠ 0, Gram-Schmidt d1, e1, ..., ek−1, ek+1, ..., en:
d2 = (e2 − (e2∘d1)d1) / |e2 − (e2∘d1)d1|
d3 = (e3 − (e3∘d1)d1 − (e3∘d2)d2) / |e3 − (e3∘d1)d1 − (e3∘d2)d2|
...
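The Gram-Schmidt recurrence above can be sketched generically (a hedged toy version; `gram_schmidt` is a hypothetical name, and skipping near-zero residuals is my assumption for the degenerate case):

```python
import numpy as np

def gram_schmidt(d1, dim):
    """Orthonormalize the standard basis against d1: returns d1, d2, d3, ..."""
    basis = [d1 / np.linalg.norm(d1)]
    for k in range(dim):
        e = np.zeros(dim); e[k] = 1.0
        v = e - sum((e @ b) * b for b in basis)   # subtract projections
        if np.linalg.norm(v) > 1e-10:             # skip e_k already in the span
            basis.append(v / np.linalg.norm(v))
        if len(basis) == dim:
            break
    return basis

B = gram_schmidt(np.array([1., 1., 0.]), 3)
G = np.array([[u @ v for v in B] for u in B])     # Gram matrix
print(np.allclose(G, np.eye(3)))                  # True: the basis is orthonormal
```

Each dk is then used for one round of linear gap analysis in a direction orthogonal to all earlier ones.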

SubClust-1 f0b3 RnGpgt4 1 0 b3 0 23

vir8 ... 1 54 b1 0 62 vir39

SubClust-2 f0t3 LinGapgt4 1 0 t3 0 12 t34

f0b23 RnGpgt4 1 0 b23 0 30 b3... 1 84 t34 0

95 t23 0 96 t234

Thm: MxPt[SpS((M−x)∘d)] = MxPt[SpS((−x)∘d)] (shift by M∘d; the MxPts are the same).

Repick f1 ← MnPt[SpS(x∘d1)]. Pick g1 ← MxPt[SpS(x∘d1)].

SubClust-2 f0t34 LinGapgt4 1 0 t34 0 13 set36

Pick fh ← MnPt[SpS(x∘dh)]. Pick gh ← MxPt[SpS(x∘dh)].

f0b124 RnGpgt4 1 0 b124 0 28 b12 0 30 b14 1

32 b24 0 41 vir10... 1 75 t24 1 81 t1 1 86

t14 1 93 t12 0 98 t124

SubClust-1 f0t24 RnGpgt4 1 0 t24 1 12 t2 0 20

ver13

SubClust-2 f0set16 LnGpgt4 none

SubClust-1 f1ver49 RdGpgt4 none

SubClust-1 f0b1 RnGpgt4 1 0 b1 0 23 ver1

SubClust-2 f1set42 RdGpgt4 none

SubClust-1 f1ver49 LnGpgt4 none

1. Choose f0 (high outlier potential? e.g., furthest from the mean, M?).
2. Do f0-rnd-gap analysis (+ subcluster analysis?).
3. Let f1 be such that no x is further away from f0 (in some direction) (all d1 dot products ≥ 0).
4. Do f1-rnd-gap analysis (+ subcluster analysis?).
5. Do d1-linear-gap analysis, d1 = (f0−f1)/‖f0−f1‖.
6. Let f2 be such that no x is further away (in some direction) from the d1-line than f2.
7. Do f2-round-gap analysis.
8. Do d2-linear-gap analysis, d2 = (f0−f2 − ((f0−f2)∘d1)d1) / len...
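Steps 2, 4 and 7 (round-gap analysis about a pivot f) can be sketched as below. This is my simplification: plain sorting instead of pTree gap finding, and an illustrative gap threshold:

```python
# Sketch of one f-round-gap step: rank every x by squared distance from the
# pivot f, then cut the ranking wherever consecutive values differ by more
# than the gap threshold (GT). Singleton pieces are outlier candidates.

def round_gap_split(points, f, gap_threshold):
    d2 = sorted((sum((a - b) ** 2 for a, b in zip(x, f)), x) for x in points)
    clusters, current = [], [d2[0][1]]
    for (p, _), (q, x) in zip(d2, d2[1:]):
        if q - p > gap_threshold:        # a round gap: start a new piece
            clusters.append(current)
            current = []
        current.append(x)
    clusters.append(current)
    return clusters
```

With f at the origin and one far point, the far point falls out as its own piece.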

SubClust-1 f0ver19 RnGpgt4 none

SubClust-2 f1set42 LnGpgt4 none. SubClust-2 is 50 setosa! Likely f2, f3 and f4 analysis will find none.

f0b34 RnGpgt4 1 0 b34 0 26 vir1 ... 1 66

vir39 0 72 set24 ... 1 83 t3 0 88 t34

SubClust-1 f0ver19 LinGpgt4 none

FMO(d)

f1ball g1tall LnGpgt4 1 -137 ball 0 -126

b123 0 -124 b134 1 -122 b234 0 -112 b13 ... 1

-29 t13 1 -24 t134 1 -18 t123 1 -13 tal

f2vir11 g2set16 Lngt4 none

f3t34 g3vir18 Lngt4 none

f4t4 g4b4 Lngt4 1 24 vir1 0 39 b4 0

39 b14

f4t4 g4vir1 Lngt4 none. This ends the process. We found all (and only) the added anomalies, but missed t34, t14, t4, t1, t3, b1, b3.

f1b13 g1b2 LnGpgt4 none

f2t2 g2b2 LnGpgt4 1 21 set16 0 26 b2

f2t2 g2t234 Lngt4 0 5 t23 0 5 t234 0

6 t12 0 6 t24 0 6 t124 1 6 t2 0 21

ver11

CRC method: g1 = MaxVector. [Figure: point-cloud scatter illustrating the CRC (circumscribing-rectangle corner) directions.]

f2vir11 g2b23 Lngt4 1 43 b12 0 50 b34 0

51 b124 0 51 b23 0 52 t13 0 53 b13

MCR f, MCR g (corner labels from the figure).

f2vir11 g2b12 Lngt4 1 45 set16 0 61 b24 0 61

b2 0 61 b12

CRC: f1 = MinVector.

f1bal RnGpgt4 1 0 ball 0 28 b123... 1

73 t4 0 78 vir39... 1 98 t34 0 103

t12 0 104 t23 0 107 t124 1 108 t234 0 113

t13 1 116 t134 0 122 t123 0 125 tal

Finally we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but we know we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).

FMO(fg) start: f1 ← MxPt(SpS((M−x)∘(M−x))). Round gaps first, then linear gaps.

We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: we should really do the outlier analysis and subsequent classification on the original 150 (the real iris samples). We already know (assuming the "other training set" has the same means as these 150 do) that we can separate Setosa, Versicolor and Virginica perfectly using FAUST Classify.

SubClus2 f1t14 Rngt4 0 0 t1 1 0 t14 0

30 ver8 ... 1 47 set15 0 52 t3 0 52 t34

SubClus1 f1b123 Rngt4 1 0 b123 0 30 b13 0

30 vir32 0 30 vir18 1 32 b23 0 37 vir6

If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round-gap analysis is more productive than linear dot-product-projection gap analysis! FFG (Furthest to Furthest) computes SpS((M−x)∘(M−x)) for f1 (expensive? Grab any pt? A corner pt?), then computes SpS((x−f1)∘(x−f1)) for f1-round-gap analysis. Then it computes SpS(x∘d1) to get g1, the point whose projection is furthest from that of f1 (for d1 linear-gap analysis). Too expensive? Since gk-round-gap analysis and linear analysis contributed very little! But we need it to get f2, etc. Are there other, cheaper ways to get a good f2? We also need SpS((x−g1)∘(x−g1)) for g1-round-gap analysis (too expensive!).

SubClus2 f1set23 Rngt4 1 17 vir39 0 23

ver49 0 26 ver8 0 27 ver44 1 30 ver11 0

43 t24 0 43 t2

SubClus1 f1b134 Rngt4 1 0 b134 0 24 vir19

SC1 f2ver13 Rngt4 1 0 ver13 0 5 ver43

SubClus1 f1b234 Rngt4 1 0 b234 1 30 b34 0

37 vir10

SC1 g2vir10 Rngt4 1 0 vir10 0 6 vir44

SubClus1 f1b124 Rngt4 1 0 b124 0 28 b12 0

30 b14 1 32 b24 0 41 b1... 1 59 t4 0

68 b3

SbCl_2.1 g1ver39 Rngt4 1 0 vir39 0 7 set21

Note: what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.

SC1 f4b1 Rngt4 1 0 b1 0 23 ver1

SbCl_2.1 g1set19 Rngt4 none

SbCl_2.1 f3set16 Rngt4 none

SbCl_2.1 LnGgt4 none

SbCl_2.1 g3set9 Rngt4 none

SbCl_2.1 f2set42 Rngt4 1 0 set42 0 6 set9

SC1 f1vir19 Rngt4 1 44 t4 0 52 b2

SC1 g4b4 Rngt4 1 0 b4 0 21 vir15

SbCl_2.1 LnGgt4 none

SbCl_2.1 f4set Rngt4 none

SbCl_2.1 f2set9 Rngt4 none

SbCl_2.1 g4set Rngt4 none

SC1 g1b2 Rngt4 1 0 t4 0 28 ver36

SubCluster1 has 91 samples, only versicolor and virginica.

SbCl_2.1 g2set16 Rngt4 none

SbCl_2.1 LnGgt4 none

SbCl_2.1 LnGgt4 none

For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by the functional (e.g., Xk, SpS((x−M)∘(x−M)), SpS((x−f)∘(x−f)), SpS(x∘d), etc.). The STDs of the columns, Xk, can be precomputed up front, once and for all. STDs of projection and squared-distance functionals must be computed after they are generated (could be done upon capture too). Good functionals produce many large gaps. In Iris150 and Iris150Out30, I find that the precomputed STD is a good indicator of that. A text-mining scheme might be: 1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median and STD of every column (content word stem). 2. Throw out low-STD columns. 4'. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)
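Step 2 (throwing out low-STD columns) might look like this in Python; the function name, table layout and threshold are illustrative, not from the deck:

```python
# Sketch: keep only columns whose (population) STD clears a threshold,
# since a low-STD column cannot contain many large gaps.
import statistics

def prune_low_std(table, threshold):
    """table: list of rows; returns indices of columns with STD >= threshold."""
    cols = list(zip(*table))
    return [k for k, col in enumerate(cols)
            if statistics.pstdev(col) >= threshold]

table = [[1, 10, 5],
         [1, 20, 6],
         [1, 30, 5],
         [1, 40, 6]]
keep = prune_low_std(table, 1.0)   # column 0 is constant, column 2 barely varies
```

Only the middle column survives; in the pTree setting the same STDs would come from precomputed slice counts rather than materialized columns.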

A possible Attribute Selection algorithm: 1. Peel from X the outliers, using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (Xin = X − Xout). 2. Calculate the width of each Xin-Circumscribing-Rectangle edge, crewk. 4. Look for wide gaps top down (or, very simply, order by STD). 4'. Divide crewk by count{xk | x ∈ Xin} (but that doesn't account for duplicates). 4''. Look for a preponderance of wide thin-gaps top down. 4'''. Look for high projection-interval-count dispersion (STD). Notes: 1. Maybe an inlier sub-cluster needs to occur in more than one functional projection to be declared an inlier sub-cluster? 2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis.

For FAUST Cluster-d (pick d, then f = MnPt(x∘d) and g = MxPt(x∘d)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles θ1, ..., θm, each equi-width partitioned on [0, 180), with the formulas

d = e1·∏k=n..2 cos θk + e2·sin θ2·∏k=n..3 cos θk + e3·sin θ3·∏k=n..4 cos θk + ... + en·sin θn, where the θi start at 0 and increment by Δ.

So, d(i1..in) = Σj=1..n ej·sin((ij−1)Δ)·∏k=n..j+1 cos((ik−1)Δ), i0 = 0, where Δ divides 180 (e.g., 90, 45, 22.5, ...).
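A hedged sketch of that direction grid: the formula reads like the standard spherical-coordinate product of sines and cosines, which is what is generated below (my reading of the garbled slide; directions repeat where a sine factor is 0):

```python
# Sketch: equally spaced unit vectors in R^n from all combinations of n-1
# angles stepped by delta over [0, 180). Spherical-coordinate construction.
import itertools
import math

def direction_grid(n, delta_deg):
    steps = int(180 / delta_deg)
    dirs = []
    for angles in itertools.product(range(steps), repeat=n - 1):
        th = [math.radians(a * delta_deg) for a in angles]
        d, rem = [], 1.0
        for t in th:                  # peel off one coordinate per angle
            d.append(rem * math.cos(t))
            rem *= math.sin(t)
        d.append(rem)                 # last coordinate: product of sines
        dirs.append(d)
    return dirs
```

Every vector returned has unit length, so each is a valid d for the f = MnPt(x∘d), g = MxPt(x∘d) scan.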

CRMSTD(dfg): eliminate all columns with STD < threshold.

d3 0 10 set23...50setvir39 1 19 set25 0

30 ver49...50ver_49vir 0 69 vir19

(d3d4)/sqr(2) clus1 none (d3d4)/sqr(2) clus2

none

d5 (f5vir19, g5set14) none f5 1 0.0 vir19

clus2 0 4.1 vir23 g5 none

Just about all the high-STD columns find the subcluster split. In addition, they find the four outliers as well.

(d1d3d4)/sqr(3) clus1 1 44.5 set19 0 55.4

vir39 (d1d3d4)/sqr(3) clus2 none

d5 (f5vir23, g5set14) none,f5 none, g5 none

d5 (f5vir32, g5set14) none, f5 none, g5 none

d5 (f5vir18, g5set14) none f5 1 0.0 vir18

clus2 1 4.1 vir32 0 8.2 vir6 g5 none

d5 (f5vir6, g5set14) none, f5 none, g5 none

(d1d2d3d4)/sqr(4) clus1 (d1d2d3d4)/sqr(4)

clus2 none

(d1d3)/sqr(2) clus1 none (d1d3)/sqr(2) clus2 0

57.3 ver49 0 58.0 ver8 0 58.7 ver44 1 60.1

ver11 0 64.3 ver10 none

CRMSTD(dfg) using the IRIS rectangle on Satlog (1805 rows of R, G, IR1, IR2 with classes 1,2,3,4,5,7). Here I made a mistake and left MinVec, MaxVec and M as they were for IRIS (so probably far from the Satlog dataset). The results were good??? Suggests random f and g?

d2 STD23.7 gpgt3 val cl num 1 121 1 297 0 126

3 361 0 127 3 84 0 128 3 100 0 128 3 315

(d2d3)/sqr2 STD23.6 1 173.2 3 244 0 183.8 3

361 0 181.7 3 84 0 184.6 3 100 0 180.3 3 315

(d1d2)/sqr2 STD25.3 1 153.4 3 200 0 157.7 3

315 1 157.7 3 84 0 161.2 3 361

(d1d4)/sqr2 STD15.5 0 59.4 5 75 1 60.1 5

24 0 64.3 5 149... 1 142.1 3 84 0 145.7 3 361

d4 STD20.3 gpgt3 val cl num 1 29 5 75 1 33

5 24 0 37 5 73... 1 150 2 85 0 154 2 191

SQRT(x-f2)o(x-f2) STD26.7 val cl num 1 41.6

5 75 0 45.9 5 24... 1 168.8 3 244 0 180.4 3

361 0 178.1 3 84 0 179.2 3 100 0 176.1 3 315

(d1d3)/sqr2 STD16.8 1 159.8 3 84 0 166.9 3 361

d3d4)/sqr2 STD25.7 1 39.5 5 75 0 44.5 5

24... 1 142.5 2 119 0 146.5 2 191 0 147.5 2 85

(d2d4)/sqr2 STD20.4 0 40.0 5 75 1 41.0 5

24... 1 109.5 3 45 0 115.0 3 361 0 115.5 3 315 0

116.0 3 84 0 117.5 3 100 same

d3 STD17.2 gpgt3 val cl num 1 139 2 191 0 145

2 85

d1d2d3d4)/sqr4 STD25.9 0 92.5 5 75 1 95.0

5 24 0 99.0 5 149 1 101.5 5 73 0 105.0 5

121... 1 222.0 3 244 0 226.5 3 315 0 227.0 3

100 1 229.0 3 84 0 233.0 3 361 same

d1d2d3)/sqr3 STD25.3 1 203.8 3 84 0 209.0 3

361

SQRT(x-g2)o(x-g2) STD26.8 val cl num 1 15.6

5 75 0 22.5 5 149 0 22.9 5 24 0 24.1 5 73 1

26.6 5 168 0 29.6 5 121... 1 162.1 2 119 0

168.7 2 191 0 169.7 2 85

d1 STD13.6 ggt3 none

SQRT(x-M)o(x-M) STD28 val cl num 1 29.6 5

75 1 34.2 5 24 0 38.7 5 149 1 39.7 5 73 0

43.7 5 168

(d1d2d4)/sqr3 STD21.9 0 67.0 5 24 1 67.5 5

75 0 72.5 5 149

sqr(x-f4)o(x-f4 STD27.8 val cl num 1 35.6 5

75 1 39.9 5 24 0 45.3 5 149 1 45.7 5 73 0

50.8 5 168... 1 176.2 2 119 0 182.9 2 191 0 182.9

2 85

d1d3d4/sq3 STD22.1 0 81.4 5 24 1 77.4 5 75

SQRT(x-f1)o(x-f1) STD27 val cl num 0 41.1 5

24 1 41.6 5 75 0 44.9 5 149... 1 172.8 3

84 0 176.6 3 361

SQRT(x-f5)o(x-f5) STD25 val cl num 1 147.1 3

100 0 151.7 2 85 0 152.3 2 191

Skip STD < 25; same outliers: 2_85, 2_191, 3_361, 3_84, 3_100, 3_315, 5_24, 5_73, 5_75, 5_149, 5_168.

SQRT(x-f3)o(x-f3) STD27.5 val cl num 1 52.2 5

75 0 58.0 5 24 1 58.2 5 149 0 61.5 5 73 1

62.5 5 168 0 66.0 5 121... 1 188.2 3 361 0 192.0

2 191 0 193.6 2 85

SQRT(x-g4)o(x-g4)

STD27.7 val cl num 1 144.8 2 119 0 148.6 3

315 0 150.7 2 191 0 150.9 3 84 0 151.8 3 100 0

151.8 2 85 0 153.9 3 361

SQRTx-g5ox-g5 STD27.4 val cl num 0 27.8 5

75 1 29.4 5 24 0 35.1 5 73 1 35.6 5 149 0

39.4 5 71

SQRT(x-g1)o(x-g1) STD26.3 val cl num 1

41.6 5 75 0 45.9 5 24... 1 166.1 2 119 0 172.3

2 191 0 172.8 2 85

SQRT(x-g3)o(x-g3) STD24.9 none

CRMSTD(dfg) Satlog corners on Satlog

Class means (R, G, IR1, IR2):
c1M  63.6  98.4 110.3  90.2
c2M  48.4  38.5 114.5 119.9
c3M  87.8 106.1 111.0  87.8
c4M  77.1  90.2  94.7  73.9
c5M  59.8  62.2  80.4  66.7
c7M  69.2  77.9  82.3  64.5

1=red soil, 2=cotton, 3=grey soil, 4=damp grey soil, 5=soil w/ stubble, 6=mixture, 7=very damp grey soil. Classes 2, 5 isolated from the rest (and each other)? 2 and 5 produced the greatest number of outliers. Take f5 = c2M and g5 = c7M (other class means).

d5(f5c2M,g5c7M) ggt3 STD26 val cl num 0

-139.9 2 85 1 -138.8 2 191 0 -134.4 2 186 0

-132.1 2 119 0 -131.7 2 224 0 -130.9 2 23 1

-74.5 2 200 0 -70.2 2 160 0 -68.9 2 165 0

-68.2 2 86 0 -68.1 2 194 0 -67.3 2 138 0

-67.0 2 19 1 -67.0 2 223 0 -62.9 2 60 0

-62.5 2 132 0 -59.8 5 45 0 -14.1 7 602 0

-14.0 7 412 0 -14.0 7 420 0 -13.9 7 306 0

-13.9 7 244 0 -13.7 5 175 0 -13.2 5 15 0

-13.1 7 562 0 -13.1 7 359 0 -13.0 7 532 0

-13.0 7 530 0 -12.9 7 414 0 -12.8 5 71 0

-12.7 5 121 0 -12.2 7 636 0 -11.4 5 144 1

-11.0 7 470 0 -8.0 5 168 0 -7.9 5 24 0

-7.9 5 73 0 -7.5 5 149 1 -4.9 5 190 0 -0.8

5 75

d2 STD23.7 val cl num 1 121 1 297 0 126 3

361 0 127 3 84 0 128 3 100 0 128 3 315

Lots of outliers found, but this did not separate the classes as subclusters (keeping in mind that they may butt up against each other (no gap), so that they would never appear as subclusters via gap-analysis methods). Suppose we have a high-quality training set for this dataset, i.e., reliably accurate class means. Next, find any class gaps that might exist by using those as our f and g points.

(d1d2)/sqr2 STD25.2 none

d4 STD20.3 val cl num 1 29 5 75 1 33 5

24 0 37 5 73.. 1 150 2 85 0 154 2 191

(d1d3)/sqr2 STD16.6 none

(d1d4)/sqr2 STD15.3 none

(d2d3)/sqr2 STD23.4 none

(d2d4)/sqr2 STD23.4 none

SubCluster1 consists of 191 class-2 samples. SubCluster3 contains every subcluster. Next, on SubCluster3 we use f5 = c1M and g5 = c7M.

(d3d4)/sqr2 STD25.3 1 68.6 5 168 0 72.1 5 121

Pairwise distances among the class-2 samples:

        2_160 2_165  2_86 2_194 2_138  2_19 2_223
2_160     0.0  20.6   4.6   9.9   5.8  20.9  15.4
2_165    20.6   0.0  22.4  11.6  23.3   5.0  12.2
2_86      4.6  22.4   0.0  12.6   4.1  21.7  18.6
2_194     9.9  11.6  12.6   0.0  12.8  13.0   6.9
2_138     5.8  23.3   4.1  12.8   0.0  22.9  18.2
2_19     20.9   5.0  21.7  13.0  22.9   0.0  15.4
2_223    15.4  12.2  18.6   6.9  18.2  15.4   0.0

(d1d2d3)/sqr3 STD25.2 none

d3 STD17.2 val cl num 1 139 2 191 0 145 2

85

(d1d2d4)/sqr3 STD21.6 none

(d1d3d4)/sqr3 STD21.8 none

(d2d3d4)/sqr3 STD25.4 none

(d1d2d3d4)/sqr4 STD25.4 none

d1 STD13.6 none

d2 STD23.7 val cl num val dis(1 297) 0

118 3 242 153.3 35.128 0 118 3 73 148.4

35.707 0 118 3 343 152.3 31.144 0 118 3 263

148.4 35.707 0 118 3 155 147.4 31.796 0

118 1 36 153.5 9.2736 0 118 3 221 152.3

31.144 0 118 3 244 158.3 35.707 0 120 3 50

155.6 33.090 0 120 3 344 148.1 24.617 0

120 3 200 151.8 33.136 0 120 3 310 151.9

29.189 0 120 3 202 154.0 33.136 1 121 1 297

149.8 0

dis(2_200, 2_160) = 12.4: outlier.

f1 STD11.8 none

dis(2_60, 2_132) = 3.9

g1 STD14.5 none

dis(2_132, 5_45) = 33.6: outliers.

f2 STD14.9 none

g2 STD23.6 none

Pairwise distances among the class-5 samples:

        5_168  5_24  5_73 5_149 5_190  5_75
5_168     0.0  14.0   7.3   8.1  16.5  15.7
5_24     14.0   0.0   7.1   7.7  26.2   8.1
5_73      7.3   7.1   0.0   4.6  19.7  11.0
5_149     8.1   7.7   4.6   0.0  22.7  10.1
5_190    16.5  26.2  19.7  22.7   0.0  27.9
5_75     15.7   8.1  11.0  10.1  27.9   0.0

f3 STD16.9 none

g3 STD12.7 val cl num 1 101.9 5 73 0 105.0 5

149

SubClus3: f5 = c1M, g5 = c7M.

f4 STD22.3 none

d5(f5c2M,g5c7M) ggt2 STD68 val cl num 0

4.9 3 70 1 90.2 5 33 0 92.3 5 121 1 92.5

5 179 0 187.5 1 110 1 216.5 3 244 1 223.3 3

315 0 225.6 3 84 0 226.6 3 100

g4 STD11.6 val cl num 1 42.1 2 10 0 48.0 2

143.. 1 114.9 5 168 0 119.8 5 73

g4 STD11.6 val cl num 0 52.1 2 143 52.1 0

54.6 2 145 54.6 16.278

f5 STD24.8 none

g5 STD27.1 none

Density: A set is T-dense iff it has no distance gaps greater than T. 10/20/12 (Equivalently, every point has neighbors in its T-neighborhood.) We can use L1 or HOB or L∞ distance, since disL1(x,y) ≥ disL2(x,y), disL2(x,y) ≤ 2·disHOB(x,y) and disL2(x,y) ≤ √n·disL∞(x,y).

Definition: Y ⊆ X is T-dense iff there does not exist y ∈ Y such that dis2(y, Y−y) > T.

Theorem-1: If for every y ∈ Y, dis2(y, Y−y) ≤ T, then Y is T-dense.

Using L1 distance, not L2 (Euclidean): Theorem-2: disL1(x,y) ≥ disL2(x,y) (from here on we will use disk to mean disLk). Therefore if, for every y ∈ Y, dis1(y, Y−y) ≤ T, then Y is T-dense. (Proof: dis2(y, Y−y) ≤ dis1(y, Y−y) ≤ T.)

dis2(x,y) ≤ 2·disHOB(x,y). (Proof: let the bit pattern of dis2(x,y) be 0...01b(k−1)...b0; then disHOB(x,y) = 2^k, and the most b(k−1)...b0 can contribute is 2^k − 1 (if it's all 1-bits). So dis2(x,y) ≤ 2^k + (2^k − 1) < 2·2^k = 2·disHOB(x,y).)

Theorem-3: If, for every y ∈ Y, disHOB(y, Y−y) ≤ T/2, then Y is T-dense. Proof: dis2(y, Y−y) ≤ 2·disHOB(y, Y−y) ≤ 2·T/2 = T.

Theorem-4: If, for every y ∈ Y, dis∞(y, Y−y) ≤ T/√n, then Y is T-dense. Proof: dis2(y, Y−y) ≤ √n·dis∞(y, Y−y) ≤ √n·T/√n = T.

Pick T' based on T and the dimension, n (it can be done!). If MaxGap(y∘ek) = MaxGap(Yk) < T' ∀k = 1..n, then Y is T-dense. (Recall, y∘ek is just Yk as a column of values.) Note: we use the log(n) pTreeGapFinder to avoid sorting. Unfortunately, it doesn't immediately find all gaps precisely at their full width (because it descends using power-of-2 widths), but if we find all pTreeGaps, we can be assured that MaxPTreeGap(Y) ≤ MaxGap(Y), or we can keep track of "thin gaps" and thereby actually identify all gaps (see the slide on pTreeGapFinder).

Theorem-5: If Σk=1..n MaxGap(Yk) ≤ T, then Y is T-dense. Proof: dis1(y,x) = Σk=1..n |yk − xk|, and |yk − xk| ≤ MaxGap(Yk) ∀x ∈ Y. So dis2(y, Y−y) ≤ dis1(y, Y−y) ≤ Σk=1..n MaxGap(Yk) ≤ T.
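Theorem-5 gives a cheap sufficient test for T-density, since it only needs per-column max gaps. A minimal sketch (plain sorting stands in for pTreeGapFinder; names are mine):

```python
# Sketch of the Theorem-5 test: if the per-dimension max gaps sum to <= T,
# the set Y is T-dense. Sufficient, not necessary.

def max_gap(vals):
    s = sorted(set(vals))
    return max((b - a for a, b in zip(s, s[1:])), default=0)

def t_dense_by_thm5(Y, T):
    n = len(Y[0])
    return sum(max_gap([y[k] for y in Y]) for k in range(n)) <= T
```

A False here does not prove a gap > T exists; it only means this bound is inconclusive.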

p: (x, y) —
1:(6,36) 2:(7,39) 3:(8,41) 4:(9,34) 5:(9,38) 6:(10,42) 7:(12,34) 8:(12,38)
9:(13,35) 10:(13,40) 11:(19,38) 12:(25,38) 13:(22,22) 14:(26,16) 15:(26,25)
16:(29,11) 17:(31,18) 18:(32,26) 19:(34,11) 20:(34,23) 21:(35,20) 22:(37,10)
23:(37,23) 24:(38,13) 25:(38,21) 26:(39,24) 27:(40,9) 28:(42,9) 29:(38,39)
30:(38,42) 31:(39,44) 32:(41,41) 33:(41,45) 34:(42,39) 35:(42,43) 36:(44,43)
37:(45,40)

No gaps (ct=0 intervals) on the furthest-to-Mean line, but 3 ct=1 intervals. Declare p = p12, p16, p18 an anomaly if p∘fM is far enough from the boundary points of its interval? Round 2 is straightforward. So: 1. Given gaps, find ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps?). For rounded gaps, use x∘x? Note: in this example, vom (vector of medians) works better than the mean.

Using vector lengths: however, if the data happens to be shifted, as it is on the right, using lengths no longer works in this example. That is, dot product with a fixed vector, like fM, is independent of the placement of the points with respect to the origin; length-based gapping is dependent on it. A squared pattern does not lend itself to rounded gap boundaries. Distance from the origin is in red; distance from (7,0) is in blue. [Figure: a filled 16-by-9 square grid of points over axes 0–f.]

- FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information) 6/9/12
- FAUST CLUSTER-fmg (furthest-to-mean gaps for finding round clusters): C = X (e.g., X = {p1, ..., pf}, the 15-pixel dataset).
- While an incomplete cluster, C, remains, find M = Medoid(C) (Mean or Vector_of_Medians or ...?).
- Pick f ∈ C furthest from M from SPTreeSet(D(x,M)) (e.g., HOBbit-furthest f: take any from the highest-order S-slice).
- If ct(C)/dis²(f,M) > DT (DensThresh), C is complete; else split C where PTreeSet(c∘fM/‖fM‖) gap > GT (GapThresh).
- End While.
- Notes: a. Euclidean or HOBbit furthest. b. fM/‖fM‖ or just fM in P. c. Find gaps by sorting P, or by the O(log n) pTree method?
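The loop above can be sketched in runnable form. This is my simplification, not the deck's implementation: Euclidean distance, the mean as Medoid, plain lists instead of pTrees (so no HOBbit slices or O(log n) gap finding), and illustrative DT/GT values:

```python
# Sketch of the CLUSTER-fmg loop: pop an incomplete cluster, test its
# density ct(C)/dis(f,M)^2 against DT, and otherwise split it at the
# largest gap (> GT) of the projections onto the f-M line.
import math

def mean(C):
    n = len(C[0])
    return [sum(x[k] for x in C) / len(C) for k in range(n)]

def cluster_fmg(X, DT, GT):
    work, done = [[tuple(x) for x in X]], []
    while work:
        C = work.pop()
        M = mean(C)
        f = max(C, key=lambda x: math.dist(x, M))      # furthest from M
        fd = math.dist(f, M)
        if len(C) == 1 or fd == 0 or len(C) / fd ** 2 > DT:
            done.append(C)                             # dense enough: complete
            continue
        d = [a - b for a, b in zip(f, M)]              # direction f - M
        L = math.hypot(*d)
        proj = sorted((sum(x[k] * d[k] for k in range(len(d))) / L, x)
                      for x in C)
        cut = max(range(1, len(proj)),
                  key=lambda i: proj[i][0] - proj[i - 1][0])
        if proj[cut][0] - proj[cut - 1][0] > GT:       # split at widest gap
            work.append([x for _, x in proj[:cut]])
            work.append([x for _, x in proj[cut:]])
        else:
            done.append(C)
    return done
```

On two well-separated blobs this terminates with one complete cluster per blob; tuning DT/GT, as on the slides, decides what becomes a singleton outlier.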

C2 = {p5}: complete (singleton outlier). C3 = {p6, pf} will split (details omitted), so p6, pf are complete (outliers). That leaves C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) = 4/2² > DT = .3?), thus C1 is complete. Applying the algorithm to C4: pa is an outlier; C2 splits into p9 and pb, pc, pd: complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!

[Figure: the 15 points p1–pf plotted on a 0–f grid.]

M0 = (8.3, 4.2), M1 = (6.3, 3.5)

f1 = p3; C1 doesn't split (complete).


D(x,M0): 2.2 3.9 6.3 5.4 3.2 1.4 0.8 2.3 4.9 7.3 3.8 3.3 3.3 1.8 1.5

X (x1, x2): p1(1,1) p2(3,1) p3(2,2) p4(3,3) p5(6,2) p6(9,3) p7(15,1) p8(14,2) p9(15,3) pa(13,4) pb(10,9) pc(11,10) pd(9,11) pe(11,11) pf(7,8)


FAUST CLUSTER-fmg: the O(log n) pTree method for finding P-gaps. P = ScalarPTreeSet(c∘fM/‖fM‖).

D(x,M) 8 7 7 6 4 2 7 6 7 4 4 6 6 7 4

X (x1, x2): p1(1,1) p2(3,1) p3(2,2) p4(3,3) p5(6,2) p6(9,3) p7(15,1) p8(14,2) p9(15,3) pa(13,4) pb(10,9) pc(11,10) pd(9,11) pe(11,11) pf(7,8)

D3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

D2 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1

D1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0

D0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0

x∘U(p1,M): 1 3 3 4 6 9 14 13 15 13 13 14 13 15 10

P3 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

P2 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

P1 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1

P0 1 1 1 0 0 1 0 1 1 1 1 0 1 1 0
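The projection column is just the weighted sum of the bit slices P3..P0 above; a quick check against the deck's own data (my sketch, plain lists instead of pTrees):

```python
# Rebuild the x o U values from the vertical bit slices: value = 8*P3 + 4*P2
# + 2*P1 + P0 per row. Bit patterns copied from the slide.
P3 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
P2 = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
P1 = [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
P0 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

vals = [8 * a + 4 * b + 2 * c + d for a, b, c, d in zip(P3, P2, P1, P0)]
# vals == [1, 3, 3, 4, 6, 9, 14, 13, 15, 13, 13, 14, 13, 15, 10]
```

This also confirms the last projection value is 10 (line-wrapped as "1 0" in the dump).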

HOBbit Furthest point list: p1. Pick f = p1.

dens(C) = 16/8² = 16/64 = .25

If GT = 2^k, then add 0, 1, ..., 2^k − 1; check all k of these down to level 2^k.

P3 [8,15]: 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1, ct = 10

P3' [0,7]: 1 1 1 1 1 0 0 0 0 0 0