Title: Why Not Store Everything in Main Memory? Why use disks?


1
Research of William Perrizo, C.S. Department, NDSU
I datamine big data (big data = trillions of rows and, sometimes, thousands of columns, which can complicate data mining trillions of rows). How do I do it? I structure the data table as compressed vertical bit columns (called "predicate Trees" or "pTrees"). I process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.
What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are also important areas of data mining, and they are related to classification. The purpose of clustering is usually to create or improve a training table. It is also used for anomaly detection, a huge area in data mining. ARM is used to mine more complex data (relationship matrixes between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rentals (or based on their ratings of items).
To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others). I.e., we let near-neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology, by cognitive psychologist George A. Miller of Princeton University's Department of Psychology, published in Psychological Review. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
Some current pTree data mining research projects:
FAUST pTree PREDICTOR/CLASSIFIER (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching).
FAUST pTree CLUSTER/ANOMALASER.
pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing BigData.
pTree Text Mining: capture the reading sequence, not just the term-frequency matrix (lossless capture), of a text corpus.
Secure pTreeBases: this involves anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.
pTree Algorithmic Tools: an expanded algorithmic tool set is being developed to include quadratic and even higher-degree tools.
pTree Alternative Algorithm Implementation: implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders-of-magnitude performance increases.
pTree O/S Infrastructure: computers and operating systems are designed to do logical operations (AND, OR, ...) rapidly; exploit this for pTree processing speed.
pTree Recommender: this includes Singular Value Decomposition (SVD) recommenders, pTree near-neighbor recommenders, and pTree ARM recommenders.
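To make the vertical-bit-column idea concrete, here is a minimal Python sketch (my illustration, not the NDSU pTree library; the column values and bit width are made up). It shows a column stored as bit slices and a predicate answered by horizontal bitwise logic instead of a row scan:

import numpy as np

def bit_slices(col, nbits):
    # Vertical decomposition: slice k is the k-th bit of every row.
    return [((col >> k) & 1).astype(bool) for k in range(nbits)]

col = np.array([5, 1, 7, 4, 2, 6, 3, 5], dtype=np.uint8)  # one 3-bit column
s = bit_slices(col, 3)

# "value >= 4" is just "bit 2 is set" -- one mask, no row scan.
mask_ge4 = s[2]
# "value == 5" (binary 101): AND/NOT across the three slices.
mask_eq5 = s[2] & ~s[1] & s[0]

print(mask_ge4.astype(int))  # [1 0 1 1 0 1 0 1]
print(mask_eq5.astype(int))  # [1 0 0 0 0 0 0 1]

The resulting boolean masks play the role of pTree masks: ANDing a few of them answers a predicate over every row at once.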
2
FAUST clustering (the unsupervised part of FAUST).
This class of partitioning or clustering methods relies on choosing a dot-product projection so that if we find a gap in the F-values, we know that the two sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.
The Coordinate Projection Functionals (ej): check gaps in ej(y) = y·ej = yj.
The Square Distance Functional (SD): check gaps in SDp(y) = (y−p)·(y−p), parameterized over a grid of p ∈ Rⁿ.
The Square Dot Product Radius (SDPR): SDPRpq(y) = SDp(y) − DPPpq(y)² (easier pTree processing).
DPP-KM: 1. Check gaps in DPPp,d(y) (over grids of p and d). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 2. Check gaps in DPPp,d(y) (over grids of p and d) against the density of the subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once DPP ceases to be effective.
DPP-SD: 3. Check gaps in DPPp,d(y) (over a p-grid and a d-grid) and in SDp(y) (over a p-grid). 3.1 Check sparse-end distances against subcluster density. (DPPpd and SDp share construction steps!)
SD-DPP-SDPR: DPPpq, SDp and SDPRpq share construction steps!
SDp(y) = (y−p)·(y−p) = y·y − 2 y·p + p·p
DPPpq(y) = (y−p)·d = y·d − p·d, with y·d = (1/|p−q|) y·p − (1/|p−q|) y·q
Calculate y·y, y·p, y·q concurrently. Then do the constant multiplies 2·y·p and (1/|p−q|)·y·p concurrently. Then add and subtract.
Calculate DPPpq(y)². Then subtract it from SDp(y).
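A minimal sketch of the gap step these functionals feed (assuming only the definition above: sorted F-values are split wherever consecutive values differ by more than the threshold; the data and threshold here are made up):

import numpy as np

def gap_split(X, d, gap):
    F = X @ d                        # F-values: dot-product projections
    order = np.argsort(F)
    clusters, current = [], [order[0]]
    for i, j in zip(order, order[1:]):
        if F[j] - F[i] > gap:        # points across the gap are >= gap apart
            clusters.append(current)
            current = []
        current.append(j)
    clusters.append(current)
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
d = np.array([1.0, 0.0])
print([len(c) for c in gap_split(X, d, gap=4.0)])  # -> [20, 20]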
3
FAUST DPP CLUSTER on IRIS with DPP(y) = (y−p)·(q−p)/|q−p|, where p is the min (n) corner and q is the max (x) corner of the circumscribing rectangle (midpoints, or the average (a), are used also).
Checking [0,4] distances (s42 is a Setosa outlier):
F      0    1    2    3    3    3    4
      s14  s42  s45  s23  s16  s43  s3
s14    0    8   14    7   20    3    5
s42    8    0   17   13   24    9    9
s45   14   17    0   11    9   11   10
s23    7   13   11    0   15    5    5
s16   20   24    9   15    0   18   16
s43    3    9   11    5   18    0    3
s3     5    9   10    5   16    3    0
IRIS: 150 irises (rows), 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). The first 50 are Setosa (s), the next 50 are Versicolor (e), and the last 50 are Virginica (i) irises.
CL1: F < 17 (50 Setosa).
CL2: 17 < F < 23 (e8, e11, e44, e49, i39); gap > 4; p = nnnn, q = xxxx.
F:Count pairs: 0:1 1:1 2:1 3:3 4:1 5:6 6:4 7:5 8:7 9:3 10:8 11:5 12:1 13:2 14:1 15:1 19:1 20:1 21:3 26:2 28:1 29:4 30:2 31:2 32:2 33:4 34:3 36:5 37:2 38:2 39:2 40:5 41:6 42:5 43:7 44:2 45:1 46:3 47:2 48:1 49:5 50:4 51:1 52:3 53:2 54:2 55:3 56:2 57:1 58:1 59:1 61:2 64:2 66:2 68:1
Thinning at [6,7]: CL3.1 (F < 6.5): 44 Versicolor, 4 Virginica; CL3.2 (F > 6.5): 2 Versicolor, 39 Virginica. No sparse ends.
CL3: 23 < F (46 Versicolor, 49 Virginica).
Check distances in [12,28]: s16, i39, e49, e11, e8, e44, i6, i10, i18, i19, i23, i32 are outliers.
F     12   13   13   14   15   19   20   21   21   21   26   26   28
     s34   s6  s45  s19  s16  i39  e49   e8  e11  e44  e32  e30  e31
s34    0    5    8    5    4   21   25   28   32   28   30   28   31
s6     5    0    4    3    6   18   21   23   27   24   26   23   27
s45    8    4    0    6    9   18   18   21   25   21   24   22   25
s19    5    3    6    0    6   17   21   24   27   24   25   23   27
s16    4    6    9    6    0   20   26   29   33   29   30   28   31
i39   21   18   18   17   20    0   17   21   24   21   22   19   23
e49   25   21   18   21   26   17    0    4    7    4    8    8    9
e8    28   23   21   24   29   21    4    0    5    1    7    8    8
e11   32   27   25   27   33   24    7    5    0    4    7    9    7
e44   28   24   21   24   29   21    4    1    4    0    6    8    7
e32   30   26   24   25   30   22    8    7    7    6    0    3    1
e30   28   23   22   23   28   19    8    8    9    8    3    0    4
e31   31   27   25   27   31   23    9    8    7    7    1    4    0
Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the STD of the F-values to ensure strong gaps (a heuristic method).
Checking [57,68] distances: i10, i36, i19, i32, i18, i6, i23 are outliers.
F     57   58   59   61   61   64   64   66   66   68
     i26  i31   i8  i10  i36   i6  i23  i19  i32  i18
i26    0    5    4    8    7    8   10   13   10   11
i31    5    0    3   10    5    6    7   10   12   12
i8     4    3    0   10    7    5    6    9   11   11
i10    8   10   10    0    8   10   12   14    9    9
i36    7    5    7    8    0    5    7    9    9   10
i6     8    6    5   10    5    0    3    5    9    8
i23   10    7    6   12    7    3    0    4   11   10
i19   13   10    9   14    9    5    4    0   13   12
i32   10   12   11    9    9    9   11   13    0    4
i18   11   12   11    9   10    8   10   12    4    0
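The outlier confirmations above can be reproduced with an ordinary pairwise distance table. A hedged sketch (plain NumPy, not pTrees; the threshold is an assumed parameter): a point is flagged when its distance to every other point in the small F-interval exceeds the threshold.

import numpy as np

def distance_outliers(pts, labels, thresh=4.0):
    # Full pairwise distance table for the points of one F-interval.
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    outliers = []
    for i, lab in enumerate(labels):
        others = np.delete(D[i], i)          # skip the zero diagonal entry
        if others.min() > thresh:            # far from every other point
            outliers.append(lab)
    return outliers

pts = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
print(distance_outliers(pts, ["a", "b", "c"]))   # ['c']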
4
"Gap Hill Climbing" mathematical analysis
1. To increase gap size, we hill climb the
standard deviation of the functional, F (hoping
that a "rotation" of d toward a higher StDev
would increase the likelihood that gaps would be
larger since more dispersion allows for more
and/or larger gaps. This is very heuristic but
it works. 2. We are more interested in growing
the largest gap(s) of interest ( or largest
thinning). To do this we could do
F-slices are hyperplanes (assuming Fdotd) so it
would makes sense to try to "re-orient" d so that
the gap grows. Instead of taking the "improved" p
and q to be the means of the entire n-dimensional
half-spaces which is cut by the gap (or
thinning), take as p and q to be the means of the
F-slice (n-1)-dimensional hyperplanes defining
the gap or thinning. This is easy since our
method produces the pTree mask of each F-slice
ordered by increasing F-value (in fact it is the
sequence of F-values and the sequence of counts
of points that give us those value that we use to
find large gaps in the first place.).
The d2 gap is much larger than the d1 gap, though it is still not the optimal gap. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, by the d-barrel radius, from the center of the gap, on which each point lies)? In this example that seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous method.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, that are closest together) as p and q (in this case points 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.
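One reading of the re-orientation step above, as a sketch (the slice width, and using plain slice means rather than weighted means, are my assumptions):

import numpy as np

def reorient_d(X, d, gap_lo, gap_hi, slice_width=0.5):
    F = X @ d
    # Means of the two F-slices that bound the gap (not the whole half-spaces).
    p = X[np.abs(F - gap_lo) <= slice_width].mean(axis=0)
    q = X[np.abs(F - gap_hi) <= slice_width].mean(axis=0)
    d_new = q - p
    return d_new / np.linalg.norm(d_new)   # re-oriented unit vector

Iterating this while the gap keeps growing is the hill-climb; the support-vector variant would replace the slice means with the closest cross-gap pair.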
5
Maximizing the Variance
How do we use this theory? For dot-product-gap based clustering, we can hill-climb the matrix A (below) to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.
Given any table, X = (X1, ..., Xn), and any unit vector d in n-space, the variance of the projections is V(d) = Var(X·d) = dᵀAd, where A is the covariance matrix. We can separate out the diagonal or not. These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.
FAUST Classifier MVDI (Maximized Variance Definite Indefinite):
Given d0, one can hill-climb it to locally maximize the variance, V, as follows: d1 = ∇V(d0), d2 = ∇V(d1), ... (renormalizing at each step).
Build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply DI each round (see next slide).
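A minimal sketch of the hill-climb, under the standard reading V(d) = dᵀAd with A the covariance matrix (so ∇V ∝ Ad, and repeated normalized steps converge to the top eigenvector); the data here is made up:

import numpy as np

def hill_climb_variance(X, d0, steps=10):
    A = np.cov(X, rowvar=False)        # V(d) = d'Ad for unit d
    d = d0 / np.linalg.norm(d0)
    for _ in range(steps):
        d = A @ d                      # gradient direction of V
        d /= np.linalg.norm(d)         # stay on the unit sphere
    return d, float(d @ A @ d)         # maximizing d and its variance

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (100, 4)) * np.array([1.0, 5.0, 2.0, 0.5])
d, V = hill_climb_variance(X, np.array([0.0, 0.0, 1.0, 0.0]))
print(np.round(d, 2), round(V, 1))     # d aligns with the high-variance axis

The "Gradient Hill Climb of Var(d)" tables on the SatLog slide below show exactly this kind of trajectory: a d-vector updated each step with V(d) rising to a plateau.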
6
FAUST DI: K-class training set, TK, and a given d (e.g., from D = MeanTK → MedianTK).
Let mi = mean(Ci), indexed so that d·m1 ≤ d·m2 ≤ ... ≤ d·mK.
Mni = Min(d·Ci), Mxi = Max(d·Ci), Mn>i = Min{j>i} Mnj, Mx<i = Max{j<i} Mxj.
Definite_i = ( Mx<i, Mn>i ).
Indefinite_i,i+1 = [ Mn>i, Mx<i+1 ].
Then recurse on each Indefinite.
For IRIS, 15 records were extracted from each class for testing; the rest are the training set, TK. D = MEANs → MEANe.
            class mean (4 attrs)         Definite (Mx<i, Mn>i)   Indefinite (Mn>i, Mx<i+1)
s-Mean   50.49 34.74 14.74  2.43         s (i=1):  -1,  25       s-e:  25, 10  (empty)
e-Mean   63.50 30.00 44.00 13.50         e (i=2):  10,  37       e-i:  37, 48
i-Mean   61.00 31.50 55.50 21.50         i (i=3):  48, 128
1st round, D = Means → Meane:
  F < 18 → Setosa (35 Setosa).
  18 < F < 37 → Versicolor (15 Versicolor).
  37 ≤ F ≤ 48 → IndefiniteSet2 (20 Versicolor, 10 Virginica).
  48 < F → Virginica (25 Virginica).
IndefSet2 round, D = Meane → Meani:
  F < 7 → Versicolor (17 Versicolor, 0 Virginica).
  7 ≤ F ≤ 10 → IndefSet3 (3 Versicolor, 5 Virginica).
  10 < F → Virginica (0 Versicolor, 5 Virginica).
IndefSet3 round, D = Meane → Meani:
  F < 3 → Versicolor (2 Versicolor, 0 Virginica).
  3 ≤ F ≤ 7 → IndefSet4 (2 Versicolor, 1 Virginica). Here we will assign 0 ≤ F ≤ 7 → Versicolor.
  7 < F → Virginica (0 Versicolor, 3 Virginica).
Test, 1st round, D = Means → Meane:
  F < 15 → Setosa (15 Setosa).
  15 < F < 15 → Versicolor (0 Versicolor, 0 Virginica; empty interval).
  15 ≤ F ≤ 41 → IndefiniteSet2 (15 Versicolor, 1 Virginica).
  41 < F → Virginica (14 Virginica).
Test, IndefSet2 round, D = Meane → Meani:
  F < 20 → Versicolor (15 Versicolor, 0 Virginica).
  20 < F → Virginica (0 Versicolor, 1 Virginica).
100% accuracy.
Option-1: the sequence of D's is Mean(Classk) → Mean(Classk+1), k = 1, ... (and Mean could be replaced by VOM, or?).
Option-2: the sequence of D's is Mean(Classk) → Mean(∪h=k+1..n Classh), k = 1, ... (and Mean could be replaced by VOM, or?).
Option-3: D sequence: Mean(Classk) → Mean(∪h not used yet Classh), where k is the class with max count in the subcluster (VOM instead?).
Option-2: D sequence: Mean(Classk) → Mean(∪h=k+1..n Classh) (VOM?), where k is the class with max count in the subcluster.
Option-4: D sequence: always pick the pair of means furthest separated from each other.
Option-5: D: start with Median→Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(meani), F(meanj).
Option-6: D: always use Median→Mean of the IndefiniteSet, IS (initially, IS = X).
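A hedged sketch of one DI round from the definitions above (my translation to plain NumPy; class data is passed as arrays keyed by label):

import numpy as np

def di_round(classes, d):
    """classes: dict label -> (rows, attrs) array; d: unit vector."""
    stats = sorted(
        (float(C.mean(axis=0) @ d), lab, float((C @ d).min()), float((C @ d).max()))
        for lab, C in classes.items())                    # order classes by d·m_i
    mins = [s[2] for s in stats]
    maxs = [s[3] for s in stats]
    definite, indefinite = {}, {}
    for i, (_, lab, mn, mx) in enumerate(stats):
        mx_lt = max(maxs[:i], default=-np.inf)            # Mx_{<i}
        mn_gt = min(mins[i + 1:], default=np.inf)         # Mn_{>i}
        definite[lab] = (mx_lt, mn_gt)                    # only class i projects here
        if i + 1 < len(stats):
            # Indefinite_{i,i+1} = [Mn_{>i}, Mx_{<i+1}]; recurse on these intervals.
            indefinite[(lab, stats[i + 1][1])] = (mn_gt, max(maxs[:i + 1]))
    return definite, indefinite

Each indefinite interval becomes the input of the next round with a new d, which is exactly the recursion shown in the IRIS rounds above.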
7
FAUST MVDI on IRIS: 15 records from each class for testing (Virg39 was removed as an outlier).
            class mean (4 attrs)         Definite        Indefinite
s-Mean   50.49 34.74 14.74  2.43         s:  -1, 10      s-e:  23, 10  (empty)
e-Mean   63.50 30.00 44.00 13.50         e:  23, 48      e-i:  38, 48
i-Mean   61.00 31.50 55.50 21.50         i:  38, 70
In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in a decision tree.
8
FAUST MVDI
SatLog: 413 train, 4 attributes, 6 classes, 127 test.
Using class means (columns: class mean (4 attrs), F·mn, Ct, min, max, max+1):
mn4   83 101 104  82    113    8   110  121  122
mn3   85 103 108  85    117   79   105  128  129
mn1   69 106 115  94    133   12   123  148  149
Using full data (much better!):
mn4   83 101 104  82     59    8    56   65   66
mn3   85 103 108  85     62   79    52   74   75
mn1   69 106 115  94     81   12    73   95   96
Gradient hill climb of Variance(d):
 d1    d2    d3    d4    V(d)
0.00  0.00  1.00  0.00    282
0.13  0.38  0.64  0.65    700
0.20  0.51  0.62  0.57    742
0.26  0.62  0.57  0.47    781
0.30  0.70  0.53  0.38    810
0.34  0.76  0.48  0.30    830
0.36  0.79  0.44  0.23    841
0.37  0.81  0.40  0.18    847
0.38  0.83  0.38  0.15    850
0.39  0.84  0.36  0.12    852
0.39  0.84  0.35  0.10    853
Per class mean (columns: class mean (4 attrs), F·mn, Ct, min, max, max+1):
mn2   49  40 115 119    106  108    91  155  156
mn5   58  58  76  64    108   61    92  145  146
mn7   69  77  81  64    131  154   104  160  161
mn4   78  91  96  74    152   60   127  178  179
mn1   67 103 114  94    167   27   118  189  190
mn3   89 107 112  88    178  155   157  206  207
Gradient hill climb of Var(d) on t25:
 d1     d2    d3    d4    V(d)
 0.00   0.00  0.00  1.00  1137
-0.11  -0.22  0.54  0.81  1747
MN·d (columns: class mean (4 attrs), MN·d, Ct, ClMn, ClMx, ClMx+1):
mn2   45  33 115 124    150   54   102  177  178
mn5   55  52  72  59     69   33    45   88   89
Gradient hill climb of Var(d) on t257:
 0.00   0.00  1.00  0.00   496
-0.15  -0.29  0.56  0.76  1595
Same result using class means or the training subset.
Gradient hill climb of Var(d) on t75:
 0.00   0.00  1.00  0.00   12
 0.04  -0.09  0.83  0.55   20
-0.01  -0.19  0.70  0.69   21
Gradient hill climb of Var(d) on t13:
 0.00   0.00  1.00  0.00   29
-0.83   0.17  0.42  0.34  166
 0.00   0.00  1.00  0.00   25
-0.66   0.14  0.65  0.36   81
-0.81   0.17  0.45  0.33   88
On the 127-sample SatLog test set: 4 errors, or 96.8% accuracy.
Speed? With horizontal data, decision-tree induction is applied to one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire test set (a PTreeSet), create the various dot-product SPTSs (one for each internal node), and create the cut-point SPTS masks. These masks mask the results for the entire test set.
Gradient hill climb of Var(d) on t143:
 0.00   0.00  1.00  0.00   19
-0.66   0.19  0.47  0.56   95
 0.00   0.00  1.00  0.00   27
-0.17   0.35  0.75  0.53   54
-0.32   0.36  0.65  0.58   57
-0.41   0.34  0.62  0.58   58
For WINE (min, max+1):
8.40 10.33 27.00  9.63   28.65   9.9   53.4
7.56 11.19 32.61 10.38   34.32   7.7  111.8
8.57 12.84 30.55 11.65   32.72   8.7  108.4
8.91 13.64 34.93 11.97   37.16  13.1   92.2
Awful results!
Gradient hill climb of Var on t156161:
 0.00   0.00  1.00  0.00    5
-0.23  -0.28  0.89  0.28   19
-0.02  -0.06  0.12  0.99  157
 0.02  -0.02  0.02  1.00  159
 0.00   0.00  1.00  0.00    1
-0.46  -0.53  0.57  0.43    2
Inconclusive both ways, so predict the plurality class, 4 (17) (3: ct=3; t: ct=6).
Gradient hill climb of Var on t146156:
 0.00   0.00  1.00  0.00    0
 0.03  -0.08  0.81 -0.58    1
 0.00   0.00  1.00  0.00   13
 0.02   0.20  0.92  0.34   16
 0.02   0.25  0.86  0.45   17
Inconclusive both ways, so predict the plurality class, 4 (17) (7: ct=15; 2: ct=2).
Gradient hill climb of Var on t127:
 0.00   0.00  1.00  0.00   41
-0.01  -0.01  0.70  0.71   90
-0.04  -0.04  0.65  0.75   91
 0.00   0.00  1.00  0.00   35
-0.32  -0.14  0.59  0.73  105
Inconclusive; predict the plurality class, 7 (62); also 4 (15), 1 (5), 2 (8), 5 (7).
9
FAUST MVDI: Concrete.
d0 = (-0.34, -0.16, 0.81, -0.45). 7 test errors / 30 = 77% accuracy.
For Concrete (min, max+1), train: l 335.3 657.1 (0); m 120.5 611.6 (12); h 321.1 633.5 (0). Test: 0 l, 1 m, 0 h on [0, 321). Then l 3.0 57.0 (0); m 3.0 361.0 (11); h 28.0 92.0 (0). Test: 0 l, 2 m, 0 h on [92, 999).
Seeds.
d0 = (.97, .17, -.02, .15): l 13.3 19.3 (0, 0); m 16.4 23.5 (0, 0); h 12.2 15.2 (25, 5); cuts 0, 13.2, 19.3, 23.5.
d3: l 547.9 860.9 (4); m 617.1 957.3 (0); h 762.5 867.7 (0); test 0 l, 0 m, 0 h on [0, 617).
8 test errors / 32 = 75% accuracy.
d2: l 544.2 651.5 (0); m 515.7 661.1 (0); h 591.0 847.4 (40); test 1 l, 0 m, 11 h on [662, 999).
10
FAUST Oblique Classifier formula: P(X·D) > a, where X is any set of vectors and D is the oblique vector (note: if D = ei, this is just P(Xi > a)).
E.g., let D = the vector connecting the class means, and d = D/|D|.
To separate r from v: D = (mv − mr), a = ((mv + mr)/2)·d = the midpoint of D projected onto d.
FAUST-Oblique: create a table, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or max_std_Int_pt instead of max_gap_midpt. Then we need stdj (or variancej) in TBL.
Best cut-point? Mean, vector of medians, outmost, outmost non-outlier?
P: (mv − mr)·x > ((mr + mv)/2)·d.
"Outermost" = furthest from the means (their projections on the D-line); best rank-K points, best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex.
In higher dimensions the same holds: if the classes cluster convexly, FAUST divisive oblique gap analysis finds them.
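A minimal sketch of the mean-midpoint cut just described (my plain-NumPy reading, not the pTree implementation): d is the unit vector from m_r to m_v, a is the midpoint of the two means projected onto d, and classification is a single vectorized pass.

import numpy as np

def oblique_cut(R, V):
    m_r, m_v = R.mean(axis=0), V.mean(axis=0)
    d = (m_v - m_r) / np.linalg.norm(m_v - m_r)   # unit vector along D = m_v - m_r
    a = ((m_r + m_v) / 2) @ d                     # midpoint of means, projected on d
    return d, a

def classify(X, d, a):
    # One horizontal pass: everything above the cut-point is class v.
    return np.where(X @ d > a, "v", "r")

rng = np.random.default_rng(2)
R = rng.normal([0, 0], 1, (50, 2))
V = rng.normal([6, 6], 1, (50, 2))
d, a = oblique_cut(R, V)
print((classify(V, d, a) == "v").mean())          # ~1.0 on separated classes

In pTree terms the comparison X·d > a is the mask computed by one AND/OR program across bit slices, which is what makes this a bulk classifier.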
11
Separate classR and classV using the midpoint-of-means ("mom") method: calculate a.
FAUST Oblique: PR = P(X·d) < a.
D = mR → mV, the oblique vector; d = D/|D|.
View mR and mV as vectors (mR = the vector from the origin to point mR); then a = (mR + (mV − mR)/2)·d = ((mR + mV)/2)·d.
(The very same formula works when D = mV → mR, i.e., when it points to the left.)
Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n−1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across the pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use: 1. the vector of medians, vomV = (median{v1 : v ∈ V}, median{v2 : v ∈ V}, ...), to represent each class, rather than mV; 2. project each class onto the d-line (e.g., the R class below), then calculate the std (one horizontal formula per class using Md's method), then use the std ratio to place the CHP (no longer at the midpoint between mR = vomR and mV = vomV).
(Figure: 2-D scatter, axes dim 1 and dim 2, showing R-class points (r) around mR and V-class points (v) around mV, with the d-line between the two classes.)
12
L1(x,y) Value Arrays:
z1: 0 2 4 5 10 13 14 15 16 17 18 19 20
z2: 0 2 3 8 11 12 13 14 15 16 17 18
z3: 0 2 3 8 11 12 13 14 15 16 17 18
z4: 0 2 3 4 6 9 11 12 13 14 15 16
z5: 0 3 5 8 9 10 11 12 13 14 15
z6: 0 5 6 7 8 9 10
z7: 0 2 5 8 11 12 13 14 15 16
z8: 0 2 3 6 9 11 12 13 14
z9: 0 2 3 6 11 12 13 14 16
z10: 0 3 5 8 9 10 11 13 15
z11: 0 2 3 4 7 8 11 12 13 15 17
z12: 0 1 2 3 6 8 9 11 13 14 15 17 19
z13: 0 2 3 5 8 11 13 14 16 18
z14: 0 1 2 3 7 9 10 12 14 15 16 18 20
z15: 0 4 5 6 7 8 9 10 11 13 15
12/8/12
L1(x,y) Count Arrays:
z1: 1 2 1 1 1 1 2 1 1 1 1 1 1
z2: 1 3 1 1 1 2 1 1 1 1 1 1
z3: 1 3 1 1 1 1 1 2 1 1 1 1
z4: 1 2 1 1 1 1 1 2 1 2 1 1
z5: 1 3 2 1 1 1 2 1 1 1 1
z6: 1 2 3 2 4 1 2
z7: 1 2 1 1 1 1 2 4 1 1
z8: 1 2 1 1 1 2 4 1 2
z9: 1 2 1 1 3 2 1 3 1
z10: 1 2 2 2 1 2 2 2 1
z11: 1 1 2 1 1 1 2 1 2 2 1
z12: 1 1 1 1 1 1 1 2 1 1 1 2 1
z13: 1 1 2 1 1 1 1 3 3 1
z14: 1 1 1 1 1 1 1 2 1 1 1 2 1
z15: 1 1 1 1 2 1 1 1 2 3 1
(Grid plot: the 15 points z1..zf at their (x,y) positions on the coordinate grid.)
13
(The L1(x,y) value and count arrays and the point grid are repeated from the previous slide, now with the mean M marked on the grid.)
This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. It likewise confirms zf as an anomaly or outlier, since it too was already declared so during the linear gap analysis.
After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps, to determine whether there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
14
y·(x−M)/|x−M| Value Arrays:
z1: 0 1 2 5 6 10 11 12 14
z2: 0 1 2 5 6 10 11 12 14
z3: 0 1 2 5 6 10 11 12 14
z4: 0 1 3 6 10 11 12 14
z5: 0 1 2 3 5 6 10 11 12 14
z6: 0 1 2 3 7 8 9 10
z7: 0 1 2 3 4 6 9 11 12
z8: 0 1 2 3 4 6 9 11 12
z9: 0 1 2 3 4 6 7 10 12 13
z10: 0 1 2 3 4 5 7 11 12 13
z11: 0 1 2 3 4 6 8 10 11 12
z12: 0 1 2 3 5 6 7 8 9 11 12 13
z13: 0 1 2 3 7 8 9 10
z14: 0 1 2 3 5 7 9 11 12 13
z15: 0 1 3 5 6 7 8 9 10 11
Cluster by splitting at gaps > 2.
x = z1; F(y) = y·(x−M)/|x−M|: z1:14, z2:12, z3:12, z4:11, z5:10, z6:6, z7:1, z8:2, z9:0, z10:2, z11:2, z12:1, z13:2, z14:0, z15:5. Mean M = (9,5).
(Grid plot: the 15 points z1..zf and the mean M at their (x,y) positions.)
y·(x−M)/|x−M| Count Arrays:
z1: 2 2 4 1 1 1 1 2 1
z2: 2 2 4 1 1 1 1 2 1
z3: 1 5 2 1 1 1 1 2 1
z4: 2 4 2 2 1 1 2 1
z5: 2 2 3 1 1 1 1 1 2 1
z6: 2 1 1 1 1 3 3 3
z7: 1 4 1 3 1 1 1 2 1
z8: 1 2 3 1 3 1 1 2 1
z9: 2 1 1 2 1 3 1 1 2 1
z10: 2 1 1 1 1 1 4 1 1 2
z11: 1 2 1 1 3 2 1 1 1 2
z12: 1 1 1 2 2 1 1 1 1 1 1 2
z13: 3 3 3 1 1 1 1 2
z14: 1 1 2 1 3 2 1 1 2 1
z15: 1 2 1 1 2 1 2 2 2 1
The gap [6,10] and the gap [2,5] in F split the points into three clusters. Cluster pTree masks (by ORing):
z1_3: 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z1_2: 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z1_1: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
15
(The y·(x−M)/|x−M| value and count arrays, the F table, and the point grid are repeated from the previous slide.)
z1_3: 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z1_2: 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z1_1: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
The gap [6,9] in the z7 functional splits the points:
z7_1: 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z7_2: 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
16
(The y·(x−M)/|x−M| value and count arrays, the F table, and the point grid are repeated from the previous slides.)
z1_3: 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z1_2: 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z1_1: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
z7_1: 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z7_2: 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
The gap [3,7] in the zd functional splits the points:
zd_1: 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
zd_2: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
17
(The y·(x−M)/|x−M| value and count arrays, the F table, and the point grid are repeated from the previous slides.)
z1_3: 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z1_2: 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z1_1: 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
z7_1: 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z7_2: 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
zd_1: 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
zd_2: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
AND each z1 mask with each z7 mask and each zd mask to get the subcluster masks (3 × 2 × 2 = 12 ANDs).
18
FAUST Clustering Methods: MCR (using midlines of the circumscribing coordinate rectangle).
For any FAUST clustering method, we proceed in one of two ways: gap analysis of the projections onto a unit vector d, and/or gap analysis of the distances from a point f (and usually another point g).
Given d, take f = MinPt(x·d) and g = MaxPt(x·d). Given f and g, take d = (f−g)/|f−g|.
So we can do any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ...
Define a sequence fk, gk, dk:
fk = ((minXv1+MaxXv1)/2, ..., minXvk, ..., (minXvn+MaxXvn)/2)
gk = ((minXv1+MaxXv1)/2, ..., MaxXvk, ..., (minXvn+MaxXvn)/2)
dk = ek, and SpS(x·dk) = Xk.
f, g, d, SpS(x·d) require no processing (gap-finding is the only cost). MCR(fg) adds the cost of SpS((x−f)·(x−f)) and SpS((x−g)·(x−g)).
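A sketch of the MCR sequence as defined above (my reconstruction of the garbled fk/gk formulas): fk and gk are midpoints of opposite faces of the circumscribing rectangle, differing only in coordinate k, and dk = ek, so SpS(x·dk) is simply column Xk and costs nothing to build.

import numpy as np

def mcr_sequence(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    mid = (lo + hi) / 2
    n = X.shape[1]
    for k in range(n):
        f = mid.copy(); f[k] = lo[k]      # f_k: midpoint of the min_k face
        g = mid.copy(); g[k] = hi[k]      # g_k: midpoint of the max_k face
        d = np.zeros(n); d[k] = 1.0       # d_k = e_k, so x.d_k is just column X_k
        yield f, g, d

Only the two squared-distance series per k (for the fg round gaps) add real computation, which matches the cost remark above.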
MCR(dfg) on Iris150.
Do SpS(x·d) linear gap analysis first (since it is processing-free), then SpS((x−f)·(x−f)) and SpS((x−g)·(x−g)) round-gap analysis, sequencing through the f, g pairs. On what's left, look for outliers in SubClus1 and SubClus2.
d3: 0 10 set23 ... 1 19 set45 | 0 30 ver49 ... 0 69 vir19 (splits SubClus1 from SubClus2).
SubClus1: d1: none. d2: none. f1: none. g1: none. f2: 1 41 vir23 | 0 47 vir18, 0 47 vir32. g2: none. f3: none. g3: none. f4: none. g4: none. d4: 1 6 set44 | 0 18 vir39. Leaves exactly the 50 Setosa.
SubClus2: f1: none. g1: none. f2: none. g2: none. f3: none. g3: none. f4: none. g4: none. d4: none. Leaves 50 Versicolor and 49 Virginica.
19
MCR(d) on Iris150+Outlier30, gap > 4.
Do SpS(x·dk) linear gap analysis, k = 1,2,3,4. Declare subclusters of size 1 or 2 to be outliers. Create the full pairwise distance table for any subcluster of size ≤ 10 and declare any point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).
d3: 0 10 set23 ... 1 19 set25 | 0 30 ver49 ... 1 69 vir19. Same split (expected).
d1: 0 17 t124, 0 17 t14, 0 17 tal, 1 17 t134 | 0 23 t13, 0 23 t12, 0 23 t1, 1 23 t123 | 0 38 set14 ... 1 79 vir32 | 0 84 b12, 0 84 b1, 0 84 b13, 1 84 b123 | 0 98 b124, 0 98 b134, 0 98 b14, 0 98 ball.
SubClus1: d4: 1 6 set44 | 0 18 vir39. Leaves exactly the 50 Setosa as SubCluster1.
SubClus2: d4: 0 0 t4, 1 0 t24 | 0 10 ver18 ... 1 25 vir45 | 0 40 b4, 0 40 b24. Leaves the 49 Virginica (vir39 declared an outlier) and the 50 Versicolor as SubCluster2.
MCR(d) performs well on this dataset.
Accuracy: we can't expect a clustering method to separate Versicolor from Virginica, because there is no gap between them. This method does separate off Setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It finds the Virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other Virginica irises, whereas no other iris is more than 9.1 from its classmates). Speed: dk = ek, so there is zero calculation cost for the d's. SpS(x·dk) = SpS(x·ek) = SpS(Xk), so there is zero calculation cost for it too. The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!
d2: 0 5 t2, 0 5 t23, 0 5 t24, 1 5 t234 | 0 20 ver1 ... 1 44 set16 | 0 60 b24, 0 60 b2, 0 60 b234, 0 60 b23.
20
CCR(fgd) (corners of the circumscribing coordinate rectangle): f1 = minVecX = (minXx1, ..., minXxn) ("0000"); g1 = MaxVecX = (MaxXx1, ..., MaxXxn) ("1111"); d = (g−f)/|g−f|.
Sequence through the main-diagonal corner pairs, f, g, lexicographically; for each, create d.
CCR(f): do SpS((x−f)·(x−f)) round gap analysis. CCR(g): do SpS((x−g)·(x−g)) round gap analysis. CCR(d): do SpS(x·d) linear gap analysis.
Notes: no calculation is required to find f and g (assuming MaxVecX and minVecX have been calculated and residualized when PTreeSetX was captured). If the dimension is high, the main-diagonal corners are likely far from X, and the large radii make the round gaps nearly linear.
Start: f1 = MnVec, RnGp>4: none. g1 = MxVec, RnGp>4: 0 7 vir18 ... 1 47 ver30 | 0 53 ver49 ... 0 74 set14.
SubClus1: Lin>4: none. SubCluster2: f2=0001 RnGp>4: none; g2=1110 RnGp>4: none. This ends SubClus2: 47 Setosa only.
g1=1111 RnGp>4: none; Lin>4: none. f1=0000 RnGp>4: none; Lin>4: none.
f3=0010 RnGp>4: none; g3=1101 RnGp>4: none; Lin>4: none.
f4=0011 RnGp>4: none; g4=1100 RnGp>4: none; Lin>4: none.
f5=0100 RnGp>4: none; g5=1011 RnGp>4: none; Lin>4: none.
f6=0101 RnGp>4: 1 19 set26 | 0 28 ver49, 0 31 set42, 0 31 ver8, 0 32 set36, 0 32 ver44, 1 35 ver11 | 0 41 ver13. g6=1010 RnGp>4: none; Lin>4: none.
f7=0110 RnGp>4: 1 28 ver13 | 0 33 vir49. g7=1001 RnGp>4: none; Lin>4: none.
f8=0111 RnGp>4: none; g8=1000 RnGp>4: none; Lin>4: none. (Each fk, gk pair was checked in both subclusters; all remaining round and linear gaps were none.) This ends SubClus1: 95 Versicolor and Virginica samples only.
22
FM(fgd) (furthest-from-the-medoid); FMO = FM using a Gram-Schmidt orthonormal basis. X ⊆ Rⁿ. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. And, by the way, use residualized STD calculations to guide the choice of good gap-width thresholds (which define what an outlier is going to be, and also determine when we divide into subclusters).
f=M Gp>4: 1 53 b13 | 0 58 t123, 0 59 b234, 0 59 tal, 0 60 b134, 1 61 b123 | 0 67 ball.
f0=t123 RnGp>4: 1 0 t123 | 0 25 t13, 1 28 t134 | 0 34 set42 ... 1 103 b23 | 0 108 b13.
d1 = (M−f1)/|M−f1|, with f1 = MaxPt(SpS((M−x)·(M−x))).
SubClust-1: f0=b2 RnGp>4: 1 0 b2 | 0 28 ver36. SubClust-2: f0=t3 RnGp>4: none.
If d1,k ≠ 0, apply Gram-Schmidt to d1, e1, ..., ek−1, ek+1, ..., en:
d2 = (e2 − (e2·d1)d1) / |e2 − (e2·d1)d1|
d3 = (e3 − (e3·d1)d1 − (e3·d2)d2) / |e3 − (e3·d1)d1 − (e3·d2)d2|
...
SubClust-1: f0=b3 RnGp>4: 1 0 b3 | 0 23 vir8 ... 1 54 b1 | 0 62 vir39.
SubClust-2: f0=t3 LinGap>4: 1 0 t3 | 0 12 t34.
f0=b23 RnGp>4: 1 0 b23 | 0 30 b3 ... 1 84 t34 | 0 95 t23, 0 96 t234.
Thm: MxPt[SpS((x−M)·d)] = MxPt[SpS(x·d)] (the shift by M·d leaves the MaxPts unchanged). Repick f1 = MinPt(SpS(x·d1)); pick g1 = MaxPt(SpS(x·d1)). Pick fh = MinPt(SpS(x·dh)); pick gh = MaxPt(SpS(x·dh)).
SubClust-2: f0=t34 LinGap>4: 1 0 t34 | 0 13 set36.
f0=b124 RnGp>4: 1 0 b124 | 0 28 b12, 0 30 b14, 1 32 b24 | 0 41 vir10 ... 1 75 t24, 1 81 t1, 1 86 t14, 1 93 t12 | 0 98 t124.
SubClust-1: f0=t24 RnGp>4: 1 0 t24, 1 12 t2 | 0 20 ver13.
SubClust-2: f0=set16 LnGp>4: none.
SubClust-1: f1=ver49 RdGp>4: none. SubClust-1: f0=b1 RnGp>4: 1 0 b1 | 0 23 ver1. SubClust-2: f1=set42 RdGp>4: none. SubClust-1: f1=ver49 LnGp>4: none.
1. Choose f0 (high outlier potential? e.g., furthest from the mean, M?).
2. Do f0-round-gap analysis (and subcluster analysis?).
3. Let f1 be such that no x is further away from f0 (in some direction), i.e., all d1 dot products ≥ 0.
4. Do f1-round-gap analysis (and subcluster analysis?).
5. Do d1-linear-gap analysis, d1 = (f0−f1) / |f0−f1|.
6. Let f2 be such that no x is further away (in some direction) from the d1-line than f2.
7. Do f2-round-gap analysis.
8. Do d2-linear-gap analysis, d2 = (f0−f2 − ((f0−f2)·d1)d1) / |f0−f2 − ((f0−f2)·d1)d1|.
SubClust-1: f0=ver19 RnGp>4: none.
SubClust-2: f1=set42 LnGp>4: none. SubClust-2 is the 50 Setosa! Likely f2, f3 and f4 analysis will find nothing.
f0=b34 RnGp>4: 1 0 b34 | 0 26 vir1 ... 1 66 vir39 | 0 72 set24 ... 1 83 t3 | 0 88 t34.
SubClust-1: f0=ver19 LinGp>4: none.
23
FMO(d):
f1=ball, g1=tall, LnGp>4: 1 −137 ball | 0 −126 b123, 0 −124 b134, 1 −122 b234 | 0 −112 b13 ... 1 −29 t13, 1 −24 t134, 1 −18 t123, 1 −13 tal.
f2=vir11, g2=set16, Ln>4: none.
f3=t34, g3=vir18, Ln>4: none.
f4=t4, g4=b4, Ln>4: 1 24 vir1 | 0 39 b4, 0 39 b14.
f4=t4, g4=vir1, Ln>4: none. This ends the process. We found all (and only) the added anomalies, but missed t34, t14, t4, t1, t3, b1, b3.
f1=b13, g1=b2, LnGp>4: none.
f2=t2, g2=b2, LnGp>4: 1 21 set16 | 0 26 b2.
f2=t2, g2=t234, Ln>4: 0 5 t23, 0 5 t234, 0 6 t12, 0 6 t24, 0 6 t124, 1 6 t2 | 0 21 ver11.
CRC method: (figure: the point cloud X with the CRC corners f1 = MinVector and g1 = MaxVector on the main diagonal, and the MCR f and g midline points marked.)
f2=vir11, g2=b23, Ln>4: 1 43 b12 | 0 50 b34, 0 51 b124, 0 51 b23, 0 52 t13, 0 53 b13.
f2=vir11, g2=b12, Ln>4: 1 45 set16 | 0 61 b24, 0 61 b2, 0 61 b12.
24
f1=ball RnGp>4: 1 0 ball | 0 28 b123 ... 1 73 t4 | 0 78 vir39 ... 1 98 t34 | 0 103 t12, 0 104 t23, 0 107 t124, 1 108 t234 | 0 113 t13, 1 116 t134 | 0 122 t123, 0 125 tal.
Finally, we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).
FMO(fg): start with f1 = MaxPt(SpS((M−x)·(M−x))). Round gaps first, then linear gaps.
We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples, so we should not do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do) that we can separate Setosa, Versicolor and Virginica perfectly using FAUST Classify.
SubClus2: f1=t14 Rn>4: 0 0 t1, 1 0 t14 | 0 30 ver8 ... 1 47 set15 | 0 52 t3, 0 52 t34.
SubClus1: f1=b123 Rn>4: 1 0 b123 | 0 30 b13, 0 30 vir32, 0 30 vir18, 1 32 b23 | 0 37 vir6.
If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot-product projection gap analysis! FFG (furthest-to-furthest) computes SpS((M−x)·(M−x)) for f1 (expensive? grab any point? a corner point?), then computes SpS((x−f1)·(x−f1)) for f1-round-gap analysis. Then it computes SpS(x·d1) to get the g1 whose projection is furthest from that of f1 (for d1 linear gap analysis). (Too expensive? since gk-round-gap analysis and linear analysis contributed very little! But we need it to get f2, etc. Are there other, cheaper ways to get a good f2? We need SpS((x−g1)·(x−g1)) for g1-round-gap analysis — too expensive!)
SubClus2: f1=set23 Rn>4: 1 17 vir39 | 0 23 ver49, 0 26 ver8, 0 27 ver44, 1 30 ver11 | 0 43 t24, 0 43 t2.
SubClus1: f1=b134 Rn>4: 1 0 b134 | 0 24 vir19.
SC1: f2=ver13 Rn>4: 1 0 ver13 | 0 5 ver43.
SubClus1: f1=b234 Rn>4: 1 0 b234, 1 30 b34 | 0 37 vir10.
SC1: g2=vir10 Rn>4: 1 0 vir10 | 0 6 vir44.
SubClus1: f1=b124 Rn>4: 1 0 b124 | 0 28 b12, 0 30 b14, 1 32 b24 | 0 41 b1 ... 1 59 t4 | 0 68 b3.
SbCl_2.1: g1=vir39 Rn>4: 1 0 vir39 | 0 7 set21. Note: what remains in SubClus2.1 is exactly the 50 Setosa, but we wouldn't know that, so we continue to look for outliers and subclusters.
SC1: f4=b1 Rn>4: 1 0 b1 | 0 23 ver1.
SbCl_2.1: g1=set19 Rn>4: none. f3=set16 Rn>4: none. LnG>4: none. g3=set9 Rn>4: none. f2=set42 Rn>4: 1 0 set42 | 0 6 set9.
SC1: f1=vir19 Rn>4: 1 44 t4 | 0 52 b2. g4=b4 Rn>4: 1 0 b4 | 0 21 vir15.
SbCl_2.1: LnG>4: none. f4=set Rn>4: none. f2=set9 Rn>4: none. g4=set Rn>4: none. g2=set16 Rn>4: none. LnG>4: none.
SC1: g1=b2 Rn>4: 1 0 t4 | 0 28 ver36. SubClus1 has 91 points, only Versicolor and Virginica.
25
For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by the functional (e.g., Xk, SpS((x−M)·(x−M)), SpS((x−f)·(x−f)), SpS(x·d), etc.). The STDs of the columns, Xk, can be precomputed up front, once and for all. STDs of projection and square-distance functionals must be computed after those functionals are generated (that could be done upon capture too). Good functionals produce many large gaps. In Iris150 and Iris150Out30, I find that the precomputed STD is a good indicator of that. A text mining scheme might be: 1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median, and STD of every column (content word stem). 2. Throw out low-STD columns. 3. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)
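Step 2 of the scheme above as a two-line sketch (the keep fraction is an assumed knob; the per-column STDs are exactly the quantities the slide says to precompute):

import numpy as np

def prune_low_std(X, keep_frac=0.5):
    stds = X.std(axis=0)                            # precomputable once per column
    keep = stds >= np.quantile(stds, 1 - keep_frac) # keep the high-STD columns
    return X[:, keep], keep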
A possible attribute selection algorithm:
1. Peel the outliers from X, using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (Xin = X − Xout).
2. Calculate the width of each Xin-circumscribing-rectangle edge, crewk.
3. Look for wide gaps top down (or, very simply, order by STD).
3'. Divide crewk by count{xk : x ∈ Xin} (but that doesn't account for duplicates).
3''. Look for a preponderance of wide thin-gaps top down.
3'''. Look for high projection-interval count dispersion (STD).
Notes: 1. Maybe an inlier subcluster needs to occur in more than one functional projection to be declared an inlier subcluster? 2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis.
For FAUST Cluster-d (pick d, then f = MinPt(x·d) and g = MaxPt(x·d)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles θ1, ..., θm, each equi-width partitioned on [0, 180), with the formula
d = e1·Π(k=n..2) cos θk + e2·sin θ2·Π(k=n..3) cos θk + e3·sin θ3·Π(k=n..4) cos θk + ... + en·sin θn,
where the θi start at 0 and increment by Δ. So d(i1..in) = Σ(j=1..n) ej · sin((ij−1)Δ) · Π(k=n..j+1) cos((ik−1)Δ), with i0 = 0 and Δ dividing 180 (e.g., 90, 45, 22.5, ...).
CRMSTD(dfg): eliminate all columns with STD < threshold.
d3: 0 10 set23 (... 50 Setosa and vir39), 1 19 set25 | 0 30 ver49 (... 50 Versicolor, 49 Virginica) | 0 69 vir19.
(d3+d4)/√2: clus1: none; clus2: none.
d5 (f5=vir19, g5=set14): none. f5: 1 0.0 vir19 | clus2 0 4.1 vir23. g5: none.
Just about all the high-STD columns find the subcluster split. In addition, they find the four outliers as well.
(d1+d3+d4)/√3: clus1: 1 44.5 set19 | 0 55.4 vir39; clus2: none.
d5 (f5=vir23, g5=set14): none; f5: none; g5: none.
d5 (f5=vir32, g5=set14): none; f5: none; g5: none.
d5 (f5=vir18, g5=set14): none. f5: 1 0.0 vir18 | clus2 1 4.1 vir32 | 0 8.2 vir6. g5: none.
d5 (f5=vir6, g5=set14): none; f5: none; g5: none.
(d1+d2+d3+d4)/√4: clus1: none; clus2: none.
(d1+d3)/√2: clus1: none; clus2: 0 57.3 ver49, 0 58.0 ver8, 0 58.7 ver44, 1 60.1 ver11 | 0 64.3 ver10; none.
26
CRMSTD(dfg) using the IRIS rectangle on SatLog (1805 rows of R, G, IR1, IR2 with classes 1, 2, 3, 4, 5, 7). Here I made a mistake and left MinVec, MaxVec and M as they were for IRIS (so probably far from the SatLog dataset). The results were good??? This suggests random f and g?
d2 (STD=23.7), gap>3 (val cl num): 1 121 1_297 | 0 126 3_361, 0 127 3_84, 0 128 3_100, 0 128 3_315.
(d2+d3)/√2 (STD=23.6): 1 173.2 3_244 | 0 183.8 3_361, 0 181.7 3_84, 0 184.6 3_100, 0 180.3 3_315.
(d1+d2)/√2 (STD=25.3): 1 153.4 3_200 | 0 157.7 3_315, 1 157.7 3_84 | 0 161.2 3_361.
(d1+d4)/√2 (STD=15.5): 0 59.4 5_75, 1 60.1 5_24 | 0 64.3 5_149 ... 1 142.1 3_84 | 0 145.7 3_361.
d4 (STD=20.3), gap>3: 1 29 5_75, 1 33 5_24 | 0 37 5_73 ... 1 150 2_85 | 0 154 2_191.
SQRT((x−f2)·(x−f2)) (STD=26.7): 1 41.6 5_75 | 0 45.9 5_24 ... 1 168.8 3_244 | 0 180.4 3_361, 0 178.1 3_84, 0 179.2 3_100, 0 176.1 3_315.
(d1+d3)/√2 (STD=16.8): 1 159.8 3_84 | 0 166.9 3_361.
(d3+d4)/√2 (STD=25.7): 1 39.5 5_75 | 0 44.5 5_24 ... 1 142.5 2_119 | 0 146.5 2_191, 0 147.5 2_85.
(d2+d4)/√2 (STD=20.4): 0 40.0 5_75, 1 41.0 5_24 ... 1 109.5 3_45 | 0 115.0 3_361, 0 115.5 3_315, 0 116.0 3_84, 0 117.5 3_100. Same.
d3 (STD=17.2), gap>3: 1 139 2_191 | 0 145 2_85.
(d1+d2+d3+d4)/√4 (STD=25.9): 0 92.5 5_75, 1 95.0 5_24 | 0 99.0 5_149, 1 101.5 5_73 | 0 105.0 5_121 ... 1 222.0 3_244 | 0 226.5 3_315, 0 227.0 3_100, 1 229.0 3_84 | 0 233.0 3_361. Same.
(d1+d2+d3)/√3 (STD=25.3): 1 203.8 3_84 | 0 209.0 3_361.
SQRT((x−g2)·(x−g2)) (STD=26.8): 1 15.6 5_75 | 0 22.5 5_149, 0 22.9 5_24, 0 24.1 5_73, 1 26.6 5_168 | 0 29.6 5_121 ... 1 162.1 2_119 | 0 168.7 2_191, 0 169.7 2_85.
d1 (STD=13.6), gap>3: none.
SQRT((x−M)·(x−M)) (STD=28): 1 29.6 5_75, 1 34.2 5_24 | 0 38.7 5_149, 1 39.7 5_73 | 0 43.7 5_168.
(d1+d2+d4)/√3 (STD=21.9): 0 67.0 5_24, 1 67.5 5_75 | 0 72.5 5_149.
SQRT((x−f4)·(x−f4)) (STD=27.8): 1 35.6 5_75, 1 39.9 5_24 | 0 45.3 5_149, 1 45.7 5_73 | 0 50.8 5_168 ... 1 176.2 2_119 | 0 182.9 2_191, 0 182.9 2_85.
(d1+d3+d4)/√3 (STD=22.1): 0 81.4 5_24, 1 77.4 5_75.
SQRT((x−f1)·(x−f1)) (STD=27): 0 41.1 5_24, 1 41.6 5_75 | 0 44.9 5_149 ... 1 172.8 3_84 | 0 176.6 3_361.
SQRT((x−f5)·(x−f5)) (STD=25): 1 147.1 3_100 | 0 151.7 2_85, 0 152.3 2_191.
Skipping STD < 25 gives the same outliers: 2_85, 2_191, 3_361, 3_84, 3_100, 3_315, 5_24, 5_73, 5_75, 5_149, 5_168.
SQRT((x−f3)·(x−f3)) (STD=27.5): 1 52.2 5_75 | 0 58.0 5_24, 1 58.2 5_149 | 0 61.5 5_73, 1 62.5 5_168 | 0 66.0 5_121 ... 1 188.2 3_361 | 0 192.0 2_191, 0 193.6 2_85.
SQRT((x−g4)·(x−g4)) (STD=27.7): 1 144.8 2_119 | 0 148.6 3_315, 0 150.7 2_191, 0 150.9 3_84, 0 151.8 3_100, 0 151.8 2_85, 0 153.9 3_361.
SQRT((x−g5)·(x−g5)) (STD=27.4): 0 27.8 5_75, 1 29.4 5_24 | 0 35.1 5_73, 1 35.6 5_149 | 0 39.4 5_71.
SQRT((x−g1)·(x−g1)) (STD=26.3): 1 41.6 5_75 | 0 45.9 5_24 ... 1 166.1 2_119 | 0 172.3 2_191, 0 172.8 2_85.
SQRT((x−g3)·(x−g3)) (STD=24.9): none.
27
CRMSTD(dfg): SatLog corners on SatLog.
Class means: c1M = (63.6, 98.4, 110.3, 90.2); c2M = (48.4, 38.5, 114.5, 119.9); c3M = (87.8, 106.1, 111.0, 87.8); c4M = (77.1, 90.2, 94.7, 73.9); c5M = (59.8, 62.2, 80.4, 66.7); c7M = (69.2, 77.9, 82.3, 64.5).
1 = red soil, 2 = cotton, 3 = grey soil, 4 = damp grey soil, 5 = soil with stubble, 6 = mixture, 7 = very damp grey soil. Are classes 2 and 5 isolated from the rest (and from each other)? Classes 2 and 5 produced the greatest number of outliers. Take f5 = c2M, and g5 = the other class means in turn.
d5 (f5=c2M, g5=c7M), gap>3 (STD=26), val cl num: 0 −139.9 2_85, 1 −138.8 2_191 | 0 −134.4 2_186, 0 −132.1 2_119, 0 −131.7 2_224, 0 −130.9 2_23, 1 −74.5 2_200 | 0 −70.2 2_160, 0 −68.9 2_165, 0 −68.2 2_86, 0 −68.1 2_194, 0 −67.3 2_138, 0 −67.0 2_19, 1 −67.0 2_223 | 0 −62.9 2_60, 0 −62.5 2_132, 0 −59.8 5_45, 0 −14.1 7_602, 0 −14.0 7_412, 0 −14.0 7_420, 0 −13.9 7_306, 0 −13.9 7_244, 0 −13.7 5_175, 0 −13.2 5_15, 0 −13.1 7_562, 0 −13.1 7_359, 0 −13.0 7_532, 0 −13.0 7_530, 0 −12.9 7_414, 0 −12.8 5_71, 0 −12.7 5_121, 0 −12.2 7_636, 0 −11.4 5_144, 1 −11.0 7_470 | 0 −8.0 5_168, 0 −7.9 5_24, 0 −7.9 5_73, 0 −7.5 5_149, 1 −4.9 5_190 | 0 −0.8 5_75.
d2 (STD=23.7), val cl num: 1 121 1_297 | 0 126 3_361, 0 127 3_84, 0 128 3_100, 0 128 3_315.
Lots of outliers were found, but this did not separate the classes as subclusters (keeping in mind that they may butt up against each other (no gap), so that they would never appear as subclusters via gap-analysis methods). Suppose we have a high-quality training set for this dataset, i.e., reliably accurate class means. Then we can look for any class gaps that might exist by using those means as our f and g points.
(d1+d2)/√2 (STD=25.2): none.
d4 (STD=20.3), val cl num: 1 29 5_75, 1 33 5_24 | 0 37 5_73 ... 1 150 2_85 | 0 154 2_191.
(d1+d3)/√2 (STD=16.6): none. (d1+d4)/√2 (STD=15.3): none. (d2+d3)/√2 (STD=23.4): none. (d2+d4)/√2 (STD=23.4): none.
SubCluster1 consists of 191 class-2 samples. SubCluster3 contains every subcluster. Next, on SubCluster3 we use f5=c1M and g5=c7M.
(d3+d4)/√2 (STD=25.3): 1 68.6 5_168 | 0 72.1 5_121.
Pairwise distances among the class-2 points 2_160, 2_165, 2_86, 2_194, 2_138, 2_19, 2_223:
        2_160 2_165  2_86 2_194 2_138  2_19 2_223
2_160    0.0  20.6   4.6   9.9   5.8  20.9  15.4
2_165   20.6   0.0  22.4  11.6  23.3   5.0  12.2
2_86     4.6  22.4   0.0  12.6   4.1  21.7  18.6
2_194    9.9  11.6  12.6   0.0  12.8  13.0   6.9
2_138    5.8  23.3   4.1  12.8   0.0  22.9  18.2
2_19    20.9   5.0  21.7  13.0  22.9   0.0  15.4
2_223   15.4  12.2  18.6   6.9  18.2  15.4   0.0
(d1+d2+d3)/√3 (STD=25.2): none.
d3 (STD=17.2), val cl num: 1 139 2_191 | 0 145 2_85.
(d1+d2+d4)/√3 (STD=21.6): none. (d1+d3+d4)/√3 (STD=21.8): none. (d2+d3+d4)/√3 (STD=25.4): none. (d1+d2+d3+d4)/√4 (STD=25.4): none.
d1 (STD=13.6): none.
d2 (STD=23.7), val cl num (val, dis to 1_297): 0 118 3_242 (153.3, 35.128), 0 118 3_73 (148.4, 35.707), 0 118 3_343 (152.3, 31.144), 0 118 3_263 (148.4, 35.707), 0 118 3_155 (147.4, 31.796), 0 118 1_36 (153.5, 9.2736), 0 118 3_221 (152.3, 31.144), 0 118 3_244 (158.3, 35.707), 0 120 3_50 (155.6, 33.090), 0 120 3_344 (148.1, 24.617), 0 120 3_200 (151.8, 33.136), 0 120 3_310 (151.9, 29.189), 0 120 3_202 (154.0, 33.136), 1 121 1_297 (149.8, 0).
dis(2_200, 2_160) = 12.4: outlier. f1 (STD=11.8): none. dis(2_60, 2_132) = 3.9. g1 (STD=14.5): none. dis(2_132, 5_45) = 33.6: outliers. f2 (STD=14.9): none. g2 (STD=23.6): none.
Pairwise distances among the class-5 points 5_168, 5_24, 5_73, 5_149, 5_190, 5_75:
        5_168  5_24  5_73 5_149 5_190  5_75
5_168    0.0  14.0   7.3   8.1  16.5  15.7
5_24    14.0   0.0   7.1   7.7  26.2   8.1
5_73     7.3   7.1   0.0   4.6  19.7  11.0
5_149    8.1   7.7   4.6   0.0  22.7  10.1
5_190   16.5  26.2  19.7  22.7   0.0  27.9
5_75    15.7   8.1  11.0  10.1  27.9   0.0
f3 (STD=16.9): none. g3 (STD=12.7), val cl num: 1 101.9 5_73 | 0 105.0 5_149.
SubClus3: f5=c1M, g5=c7M. f4 (STD=22.3): none.
d5 (f5=c2M, g5=c7M), gap>2 (STD=68), val cl num: 0 4.9 3_70, 1 90.2 5_33 | 0 92.3 5_121, 1 92.5 5_179 | 0 187.5 1_110, 1 216.5 3_244 | 1 223.3 3_315 | 0 225.6 3_84, 0 226.6 3_100.
g4 (STD=11.6), val cl num: 1 42.1 2_10 | 0 48.0 2_143 ... 1 114.9 5_168 | 0 119.8 5_73.
g4 (STD=11.6), val cl num (val, dis): 0 52.1 2_143 (52.1), 0 54.6 2_145 (54.6, 16.278).
f5 (STD=24.8): none. g5 (STD=27.1): none.
28
Density: a set is T-dense iff it has no distance gaps greater than T. 10/20/12. (Equivalently, every point has neighbors in its T-neighborhood.) We can use L1, HOB, or L∞ distance, since disL1(x,y) ≥ disL2(x,y), disL2(x,y) ≤ 2·disHOB(x,y), and disL2(x,y) ≤ n·disL∞(x,y).
Definition: Y ⊆ X is T-dense iff there does not exist y ∈ Y such that dis2(y, Y−{y}) > T.
Theorem-1: If, for every y ∈ Y, dis2(y, Y−{y}) ≤ T, then Y is T-dense.
Using L1 distance, not L2 = Euclidean. Theorem-2: disL1(x,y) ≥ disL2(x,y) (from here on we will use disk to mean disLk). Therefore: if, for every y ∈ Y, dis1(y, Y−{y}) ≤ T, then Y is T-dense. (Proof: dis2(y, Y−{y}) ≤ dis1(y, Y−{y}) ≤ T.)
2·disHOB(x,y) ≥ dis2(x,y). (Proof: let the bit pattern of dis2(x,y) be 001bk−1...b0; then disHOB(x,y) = 2^k, and the most that bk−1...b0 can contribute is 2^k − 1 (if it's all 1-bits). So dis2(x,y) ≤ 2^k + (2^k − 1) ≤ 2·2^k = 2·disHOB(x,y).)
Theorem-3: If, for every y ∈ Y, disHOB(y, Y−{y}) ≤ T/2, then Y is T-dense. (Proof: dis2(y, Y−{y}) ≤ 2·disHOB(y, Y−{y}) ≤ 2·T/2 = T.)
Theorem-4: If, for every y ∈ Y, dis∞(y, Y−{y}) ≤ T/n, then Y is T-dense. (Proof: dis2(y, Y−{y}) ≤ n·dis∞(y, Y−{y}) ≤ n·T/n = T.)
Pick T' based on T and the dimension, n (it can be done!). If MaxGap(y·ek) = MaxGap(Yk) < T' for all k = 1..n, then Y is T-dense. (Recall, y·ek is just Yk as a column of values.) Note: we use the log(n) pTreeGapFinder to avoid sorting. Unfortunately, it doesn't immediately find all gaps precisely at their full width (because it descends using power-of-2 widths), but if we find all pTreeGaps, we can be assured that MaxPTreeGap(Y) ≤ MaxGap(Y), or we can keep track of "thin gaps" and thereby actually identify all gaps (see the slide on pTreeGapFinder).
Theorem-5: If Σ(k=1..n) MaxGap(Yk) ≤ T, then Y is T-dense. (Proof: dis1(y,x) = Σ(k=1..n) |yk−xk|, and |yk−xk| ≤ MaxGap(Yk) for the appropriate neighbor x ∈ Y in each coordinate. So dis2(y, Y−{y}) ≤ dis1(y, Y−{y}) ≤ Σ(k=1..n) MaxGap(Yk) ≤ T.)
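Theorem-5 as a runnable test, a minimal sketch assuming MaxGap(Yk) means the largest gap between consecutive sorted values of column k (it is a sufficient condition only, per the theorem):

import numpy as np

def max_gap(col):
    s = np.sort(col)
    return float(np.diff(s).max()) if len(s) > 1 else 0.0

def t_dense_by_thm5(Y, T):
    # Sufficient (not necessary): per-column max gaps sum to <= T.
    return sum(max_gap(Y[:, k]) for k in range(Y.shape[1])) <= T

Y = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
print(t_dense_by_thm5(Y, T=2.0))   # True: column max gaps are 1.0 and 0.5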
29
p:(x,y): 1:(6,36) 2:(7,39) 3:(8,41) 4:(9,34) 5:(9,38) 6:(10,42) 7:(12,34) 8:(12,38) 9:(13,35) 10:(13,40) 11:(19,38) 12:(25,38) 13:(22,22) 14:(26,16) 15:(26,25) 16:(29,11) 17:(31,18) 18:(32,26) 19:(34,11) 20:(34,23) 21:(35,20) 22:(37,10) 23:(37,23) 24:(38,13) 25:(38,21) 26:(39,24) 27:(40,9) 28:(42,9) 29:(38,39) 30:(38,42) 31:(39,44) 32:(41,41) 33:(41,45) 34:(42,39) 35:(42,43) 36:(44,43) 37:(45,40)
There are no gaps (ct=0 intervals) on the furthest-to-mean line, but there are 3 ct=1 intervals. Declare p12, p16 and p18 anomalies if p·fM is far enough from the boundary points of its interval?
Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps?). For rounded gaps, use x·x?
Note: in this example, VOM works better than the mean.
30
Using vector lengths: however, if the data happens to be shifted, as it is on the right, using lengths no longer works in this example. That is, dot product with a fixed vector, like fM, is independent of the placement of the points with respect to the origin; length-based gapping is dependent on it.
A squared pattern does not lend itself to rounded gap boundaries.
(Figure: a 16×16 square grid of points, with distance-from-the-origin contours shown in red and distance-from-(7,0) contours in blue.)
31
  • FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information). 6/9/12
  • FAUST CLUSTER-fmg (furthest-to-mean gaps for finding round clusters): C = X (e.g., X = {p1, ..., pf}, a 15-pixel dataset).
  • While an incomplete cluster, C, remains: find M = Medoid(C) (Mean or Vector_of_Medians or ?).
  • Pick f ∈ C furthest from M, from SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice).
  • If ct(C)/dis²(f,M) > DT (DensityThreshold), C is complete; else split C where P = PTreeSet(c·fM/|fM|) has a gap > GT (GapThreshold).
  • End While.
  • Notes: a. Euclidean or HOBbit furthest. b. Use fM/|fM| or just fM in P. c. Find gaps by sorting P, or by the O(log n) pTree method? (A runnable sketch of this loop follows the walk-through below.)

C2 = {p5} is complete (a singleton outlier). C3 = {p6, pf} will split (details omitted), so p6 and pf are complete (outliers). That leaves C1 = {p1, p2, p3, p4} and C4 = {p7, p8, p9, pa, pb, pc, pd, pe} still incomplete. C1 is dense (density(C1) = 4/22.5 > DT = .3?), thus C1 is complete. Applying the algorithm to C4: pa is an outlier, and C2 splits into {p9} and {pb, pc, pd}, complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!
(Grid plot: the 15 points p1..pf at their (x,y) positions, with p1..p4 lower left, p7..pa lower right, and pf, pb, pc, pd, pe in the upper region.)
M0 = (8.3, 4.2); M1 = (6.3, 3.5). f1 = p3; C1 doesn't split (complete).
D(x,M0): 2.2 3.9 6.3 5.4 3.2 1.4 0.8 2.3 4.9 7.3 3.8 3.3 3.3 1.8 1.5
X (x1, x2): p1 (1,1), p2 (3,1), p3 (2,2), p4 (3,3), p5 (6,2), p6 (9,3), p7 (15,1), p8 (14,2), p9 (15,3), pa (13,4), pb (10,9), pc (11,10), pd (9,11), pe (11,11), pf (7,8).
Clusters C1, C2, C3, C4.
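A runnable sketch of FAUST CLUSTER-fmg per the bullets above (the thresholds DT and GT and the medoid/furthest choices are the slide's; the plain-Python loop, the mean-as-medoid simplification, and stopping when no gap exists are my assumptions, not the pTree implementation):

import numpy as np

def cluster_fmg(X, DT=0.3, GT=2.0):
    done, todo = [], [np.arange(len(X))]
    while todo:
        C = todo.pop()
        M = X[C].mean(axis=0)                          # medoid ~ mean here
        dists = np.linalg.norm(X[C] - M, axis=1)
        f = X[C][dists.argmax()]                       # furthest point from M
        if len(C) / max(dists.max() ** 2, 1e-9) > DT:  # dense enough: complete
            done.append(C)
            continue
        fM = f - M
        P = (X[C] - M) @ fM / np.linalg.norm(fM)       # projection onto the fM line
        order = np.argsort(P)
        cuts = np.where(np.diff(P[order]) > GT)[0]     # gaps wider than GT
        if len(cuts) == 0:                             # no gap: stop splitting
            done.append(C)
            continue
        todo.extend(C[part] for part in np.split(order, cuts + 1))
    return done

pts = np.array([[1, 1], [3, 1], [2, 2], [3, 3], [15, 1], [14, 2], [15, 3],
                [10, 9], [11, 10], [9, 11], [11, 11]], dtype=float)
print([sorted(map(int, c)) for c in cluster_fmg(pts)])  # three round clusters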
32
FAUST CLUSTER-fmg: O(log n) pTree method for finding P-gaps. P = ScalarPTreeSet(c·fM/|fM|).
X (x1, x2): p1 (1,1), p2 (3,1), p3 (2,2), p4 (3,3), p5 (6,2), p6 (9,3), p7 (15,1), p8 (14,2), p9 (15,3), pa (13,4), pb (10,9), pc (11,10), pd (9,11), pe (11,11), pf (7,8).
D(x,M): 8 7 7 6 4 2 7 6 7 4 4 6 6 7 4
D3: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2: 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
D1: 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0
D0: 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0
x·U, with U = p1M/|p1M|: 1 3 3 4 6 9 14 13 15 13 13 14 13 15 10
P3: 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
P2: 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
P1: 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1
P0: 1 1 1 0 0 1 0 1 1 1 1 0 1 1 0
HOBbit furthest-point list: {p1}. Pick f = p1. dens(C) = 16/8² = 16/64 = .25.
If GT = 2^k, then add the intervals 0, 1, ..., 2^k − 1 and check all k of these, down to level 2^k.
P3 = [8,15]: 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1, ct = 10. P3' = [0,7]: 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0, ct = 5.
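A sketch of the bit-slice descent (my plain-Python reading of the O(#bits) idea; the masks P3..P0 above correspond to the slices below). It reports power-of-2-aligned empty intervals, which is why thin gaps can surface piecemeal, as noted on the density slide:

import numpy as np

def ptree_gaps(P, nbits):
    """Find empty value-intervals of P in [0, 2^nbits) from bit-slice
    counts, descending power-of-2 widths instead of sorting."""
    slices = [((P >> k) & 1).astype(bool) for k in range(nbits)]
    gaps = []
    def descend(mask, lo, level):
        if not mask.any():                       # no point lands in this interval
            gaps.append((lo, lo + (1 << (level + 1)) - 1))
            return
        if level < 0:
            return
        hi_bit = slices[level]
        descend(mask & ~hi_bit, lo, level - 1)                # lower half
        descend(mask & hi_bit, lo + (1 << level), level - 1)  # upper half
    descend(np.ones(len(P), bool), 0, nbits - 1)
    return gaps

P = np.array([1, 3, 3, 4, 6, 9, 14, 13, 15, 13, 13, 14, 13, 15, 10])
print(ptree_gaps(P, 4))
# [(0, 0), (2, 2), (5, 5), (7, 7), (8, 8), (11, 11), (12, 12)]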