
PPT – Why Not Store Everything in Main Memory? Why use disks?


- Machine Teaching: teaching big data to reveal its information.
- FAUST: Fast, Accurate Unsupervised and Supervised Machine Teaching.
- FAUST UMF (Unsupervised, using the Medoid-to-Furthest line). Let C = X be the initial incomplete cluster.
- While an incomplete cluster, C, remains: pick M ∈ C and compute F (furthest from M).
- If Density(C) = count(C)/distanceⁿ(F,M) > DT (DensityThreshold), declare C complete and continue;
- else split C at each PTreeSet(x∘FM) gap > GWT (GWT = GapWidthThreshold).

C2 = {p5} is complete (singleton outlier). C3 = {p6,pf} splits (for doubletons, if dist > GT, split), so p6 and pf are complete (outliers). C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} remain incomplete. C1 is dense (dens(C1) = 4/2.8² ≈ .5 > DT = .3), thus C1 is complete. Applying the algorithm to C4:

[Scatter plot: clusters C1-C4 with M0 = (8.3, 4.2) and M1 = (6.3, 3.5) marked]

pa is an outlier. C2 splits into {p9} and {pb,pc,pd}, which are complete. f1 = p3, and C1 does not split (complete).

F = p1 and x∘fM, T = 2³. Illustration of the first round of finding gaps: the pTree gap finder using PTreeSet(x∘fM).

x∘fM  11  27  23  34  53  80 118 114 125 114 110 121 109 125  83
p6     0   0   0   0   0   1   1   1   1   1   1   1   1   1   1
p5     0   0   0   1   1   0   1   1   1   1   1   1   1   1   0
p4     0   1   1   0   1   1   1   1   1   1   0   1   0   1   1
p3     1   1   0   0   0   0   0   0   1   0   1   1   1   1   0
p2     0   0   1   0   1   0   1   0   1   0   1   0   1   1   0
p1     1   1   1   1   0   0   1   1   0   1   1   0   0   0   1
p0     1   1   1   0   1   0   0   0   1   0   0   1   1   1   1
p6'    1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
p5'    1   1   1   0   0   1   0   0   0   0   0   0   0   0   1
p4'    1   0   0   1   0   0   0   0   0   0   1   0   1   0   0
p3'    0   0   1   1   1   1   1   1   0   1   0   0   0   0   1
p2'    1   1   0   1   0   1   0   1   0   1   0   1   0   0   1
p1'    0   0   0   0   1   1   0   0   1   0   0   1   1   1   0
p0'    0   0   0   1   0   1   1   1   0   1   1   0   0   0   0
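The bit-slice gap search can be imitated without pTrees: a gap of width 2^w exists exactly where no value carries a given high-order bit prefix, which is what ANDing/ORing the top bit slices detects. A sketch (a Python set of occupied prefixes stands in for the pTree ANDs; the name `bit_slice_gaps` is illustrative):

```python
def bit_slice_gaps(values, bits=7, w=3):
    """Report all empty intervals [k*2**w, (k+1)*2**w): a width-2**w gap
    exists at prefix k iff no value has k as its top (bits - w) bits."""
    prefixes = {v >> w for v in values}          # occupied high-bit prefixes
    return [(k << w, (k + 1) << w)               # empty intervals
            for k in range(1 << (bits - w)) if k not in prefixes]
```

On the x∘fM column above this reports [0,8), [40,48), [56,64), plus the adjacent empty octaves [64,72), [72,80) and [88,96), [96,104), which merge into the width-2⁴ gaps [64,80) and [88,104) listed below.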

X (x1, x2): p1(1,1) p2(3,1) p3(2,2) p4(3,3) p5(6,2) p6(9,3) p7(15,1) p8(14,2) p9(15,3) pa(13,4) pb(10,9) pc(11,10) pd(9,11) pe(11,11) pf(7,8)

For FAUST SMM (Oblique), do a similar thing on the Mr-Mv line: record the number of r and v errors if the RtEndPt is used to split, and take the RtEndPt where the sum is minimized. Parallelizes easily. Useful in pTree sorting?

OR between gaps 2 and 3 for cluster C2 = {p5}.

width = 2³ = 8   gap: [010 1000, 010 1111] = [40, 48)
width = 2³ = 8   gap: [011 1000, 011 1111] = [56, 64)
width = 2³ = 8   gap: [000 0000, 000 0111] = [0, 8)
width = 2⁴ = 16  gap: [100 0000, 100 1111] = [64, 80)
width = 2⁴ = 16  gap: [101 1000, 110 0111] = [88, 104)

OR between gaps 1 and 2 for cluster C1 = {p1,p2,p3,p4}; between gaps 3 and 4 for cluster C3 = {p6,pf}; OR beyond for cluster C4 = {p7,p8,p9,pa,pb,pc,pd,pe}.

- FAUST UFF (Unsupervised, using the Furthest-to-Furthest line). Let C = X be the initial incomplete cluster.
- While an incomplete cluster, C, remains: pick M ∈ C, compute F (furthest from M) and compute G (furthest from F).
- If Density(C) = count(C)/distanceⁿ(F,M) > DT (DensityThreshold), declare C complete and continue; else split C into Ci at each PTreeSet(x∘FG) gap > GWT (GapWidthThreshold).
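The UFF line selection can be sketched in plain Python (the function name is illustrative, and M is passed in; note that when several points tie for furthest, the choice of F or G is not unique, so the test below asserts only the projection values):

```python
import math

def uff_projections(C, M):
    """Pick F = furthest point from M, then G = furthest point from F,
    and return (F, G, projections of every x in C onto the F-to-G line)."""
    F = max(C, key=lambda x: math.dist(x, M))
    G = max(C, key=lambda x: math.dist(x, F))
    FG = [g - f for f, g in zip(F, G)]
    n = math.hypot(*FG)
    return F, G, [sum((xi - fi) * di for xi, fi, di in zip(x, F, FG)) / n
                  for x in C]
```

On the dataset X with M taken as the mean (8.6, 4.7), F comes out as p1 = (1,1), matching the "dis to M0" row below where p1's distance 8.4 is the maximum.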

[Scatter plot of p1-pf on the 0-f grid, with M0 and M1 marked]

dis to M0:    8.4  6.7  7.1  5.8  3.7  1.7  7.4  6.0  6.6  4.4  4.4  5.7  6.2  6.7  3.6
dis to F(p1): 0    2    1.41 2.82 5.09 8.24 14   13.0 14.1 12.3 12.0 13.4 12.8 14.1 9.21
dis to G(pe): 14.1 12.8 12.7 11.3 10.2 8.24 10.7 9.48 8.94 7.28 2.23 1    2    0    5

For the subcluster {p1,...,p5} (M2):
dis to M: 2.1  0.8  1.0  1.2  3.0
dis to F: 5.09 3.16 4    3.16 0
dis to G: 0    2    1.41 2.82 5.09


[Further distance rows for the subclusters in later UFF rounds omitted]

M0 = (8.6, 4.7), M1 = (3, 1.8), M2 = (11, 6.2)

dens(X) = 16/8.4² = .23 < DT = 1.5: incomplete.
dens(C212) = 3/1² = 3 > DT: complete.
dens(C1) = 5/3² = .55 < DT: C1 incomplete.
C2122 = {pa}: complete.
M21 = (13.2, 2.6).
C12 = {p5} complete. dens(C11) = 4/1.4² ≈ 2 > DT: complete.
M212 = (14.2, 2.5).
C222 = {pf}: singleton, so complete (outlier).
dens(C2) = ct/d(M2,F2)² = 10/6.3² = .25 < DT: incomplete.
M22 = (9.6, 9.8).
dens(C221) = 4/1.4² = 2.04 > DT: complete.
dens(C21) = 5/4.2² = .28 < DT: incomplete.
M221 = (10.2, 10.2).

FAUST SMM, Supervised Medoid-to-Medoid version (AKA FAUST Oblique): P_R = P(X∘d_R < a_R). One pass gives the class-R pTree. D = mR→mV, d = D/|D|. Separate class R using the midpoint-of-means method: calculate a = (mR + (mV - mR)/2)∘d = ((mR + mV)/2)∘d (works also if D = mV→mR).
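The training step above reduces to a few vector operations. A minimal sketch in plain Python (the function name and the class labels 'R'/'V' are illustrative; class means are computed from ordinary point lists rather than pTrees):

```python
import math

def oblique_classifier(R, V):
    """FAUST Oblique / midpoint-of-means sketch: D = mR -> mV, d = D/|D|,
    a = ((mR + mV)/2) . d; x is declared class R when x . d < a."""
    mR = [sum(c) / len(R) for c in zip(*R)]      # class means
    mV = [sum(c) / len(V) for c in zip(*V)]
    D = [v - r for r, v in zip(mR, mV)]
    norm = math.hypot(*D)
    d = [di / norm for di in D]                  # unit direction
    a = sum((r + v) / 2 * di for r, v, di in zip(mR, mV, d))
    return lambda x: 'R' if sum(xi * di for xi, di in zip(x, d)) < a else 'V'
```

Applied over pTrees, the comparison X∘d < a is evaluated for all rows at once, which is the one-pass classification the slide describes.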

Training = placing cut-hyper-plane(s) (CHP), an (n-1)-dim hyperplane cutting space in two. Classification is one horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at-a-time). Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., use:
1. vectors_of_median, vom, to represent each class rather than the mean mV, where vomV = (median{v1 : v ∈ V}, median{v2 : v ∈ V}, ...);
2. midpt_std, vom_std methods: project each class on the d-line, then calculate the std (one horizontal formula per class, using Md's method), then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).


Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a and d (one time), then apply the formula to test data (as pTrees).

[2-D scatter (dim 1 × dim 2): class r and class v points with means mR and mV marked]

Next, use the "Gap Finder" to find all gaps with different endpoints (r-v or v-r). Record the number of r and v errors if the GapMidPt is used to split. Select as the split point the GMP where the errors are minimized.
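That selection rule can be sketched on already-projected values (a minimal sketch: labels, the candidate list, and the assumption that class r lies below class v on the line are illustrative):

```python
def best_split(points, labels, candidates):
    """Return the candidate cut with the fewest total errors, counting an
    'r' point at or above the cut and a 'v' point below it as errors."""
    def errors(c):
        return sum(1 for p, lab in zip(points, labels)
                   if (lab == 'r' and p >= c) or (lab == 'v' and p < c))
    return min(candidates, key=errors)
```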

FAUST SMM: P_R = P(X∘d_R < a_R), D = mR→mB. Use the "Gap Finder" to find all gaps with different EndPoints. Record the number of R and B errors if the GapEndPoint is used to split. Select as the split point the GEP where the sum of the R errors and the B errors is minimized. That is, on the MR-MB line, record the sum of the R and B errors if the RtEndPt is used to split.

X (x1, x2): p1(3,6) p2(6,1) p3(4,2) p4(3,4) p5(6,2) p6(9,3) p7(15,1) p8(14,2) p9(12,3) pa(10,3) pb(10,6) pc(9,7) pd(9,8) pe(12,6) pf(12,5)

[Scatter plot of this dataset with the class means M marked]

- APPENDIX: FAUST UMF (no density). Initially C = X.
- While an incomplete C remains, find M = mean(C) (1 pTree calculation).
- Create PTreeSet(D(x,M)) (1 pTree calculation). Pick F to be a furthest point from M.
- Create PTreeSet(x∘FM) (1 pTree calculation).
- Split at each PTS(x∘FM) gap > T (1 pTree calculation). If there are none, continue (declaring C complete).

C2 = {p5} is complete (singleton outlier). C3 = {p6,pf} will split (details omitted), so p6 and pf are complete (outliers). That leaves C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 doesn't split and is complete. Applying the algorithm to C4: this algorithm takes only 4 pTree calculations. If we use "any point" rather than M = mean, that eliminates creating the mean (next slide: M = bottom point rather than the mean). pa is an outlier. C2 splits into {p9} and {pb,pc,pd}, which doesn't split, so it is complete.

[Scatter plot of p1-pf; M0 = (8.3, 4.2), M1 = (6.3, 3.5)]

f1 = p3; C1 doesn't split, so it is complete.




- FAUST UMF (no density, M = bottom point). Initially C = X.
- While an incomplete cluster, C, remains: find M (no pTree calculations).
- Create S = PTreeSet(D(x,M)). Pick F to be a furthest point from M (1 pTree calculation).
- Split at each PTreeSet(x∘FM) gap > T. If none, continue (C complete) (1 pTree calculation).

This FAUST CLUSTER is minimal, with just 3 pTree calculations. Pick a point (e.g., the bottom point): 0 pTree calculations. 1. Find the furthest point (e.g., using ScalarPTreeSet(distance(x,M))): 1 pTree calculation. 2. Find gaps (e.g., using ScalarPTreeSet(x∘fM)): 1 pTree calculation. Split when the gap > GT: 1 pTree calculation. Continue when there are no gaps (declare C complete): 0 pTree calculations. However, we may want a density-based stop condition (or a combination). Even if we don't create the mean, we can get a "radius" (for the n-dim volume rⁿ) from the length of fM. So with a density stop condition it is 3 pTree calcs plus a 1-count.

Note: M = bottom point is likely better than M = mean, because the point M will then be on one side of the mean and closer to an edge. Therefore the FM line may be more of a diameter than with F-to-mean.

[Scatter plot: the remaining subclusters show no gaps, so they are complete]

Note on stop conditions: "dense" ⇒ "no gaps", but not vice versa.



- FAUST UMF (M = top, F = bottom, FM-affinity splitting). Initially C = X.
- While an incomplete cluster, C, remains: find the top and bottom points, F and M (0 pTree calculations).
- Split C into C1 = PTree(x∘FM < FM∘FM/2) and C2 = C - C1 (uses midpt(F,M) as in Oblique FAUST) (1 pTree calculation).
- If Ci is dense (ct(Ci)/disⁿ(F,M) > DT), declare Ci complete (1 pTree 1-count).
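The single-cut step above can be sketched directly (plain lists instead of the pTree mask; the function name is illustrative, and measuring the projection from F is an assumption consistent with cutting at the F-to-M midpoint):

```python
def fm_affinity_split(C, F, M):
    """Split C at the midpoint of the F-to-M line:
    C1 = {x : (x - F) . FM < FM . FM / 2}, C2 = C - C1."""
    FM = [m - f for f, m in zip(F, M)]
    half = sum(d * d for d in FM) / 2            # FM.FM / 2, the midpoint cut
    C1 = [x for x in C
          if sum((xi - fi) * di for xi, fi, di in zip(x, F, FM)) < half]
    C2 = [x for x in C if x not in C1]
    return C1, C2
```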

Note: pb is found to be an outlier but isn't; otherwise this version works. An absolutely minimal version takes 1 pTree calculation: PT(x∘FM < FM∘FM/2); split iff FM∘FM/2 > Threshold ("large gap" splitting, "no gap" stopping; if density stopping is used, then add a 1-count). My "best UMF version" choice: Top-to-Furthest splitting with the pTree gap finder and density stopping (3 pTree calculations, 1 one-count). Top research need: a better pTree "gap finder". It is useful in both FAUST UMF clustering and FAUST SMM classification.




X (x1, x2): p1(8,2) p2(5,2) p3(2,4) p4(3,3) p5(6,2) p6(9,3) p7(9,4) p8(6,4) p9(13,3) pa(15,7) pb(12,5) pc(11,6) pd(10,7) pe(8,6) pf(7,5)


FAUST UMF density parameter settings, effects: if DensityThreshold DT = 1.1, then pa joins {p7,p8,p9}. If DT = 0.5, then also pf joins {pb,pc,pd,pe} and p5 joins {p1,p2,p3,p4}.

We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically, and k (the number of clusters) is dynamically determined. Improvements? A better stop condition? Is UMF better than UFF? In affinity splitting, what if k overshoots its optimal value? Add a fusion step each round? As Mark points out, having k too large can be problematic.

The proper definition of outlier or anomaly is a huge question. An outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (and give users options and advice for their use). Md: create F = furthest point from M, and d(F,M), while creating PTreeSet(d(x,M))? Or, as a separate procedure, start with P = Dh (h = high bit position), then recursively Pk ← P AND Dh-k until the 1-count of Pk is 0. Then back up to the previous Pk and take any of those points as f; that bit pattern is d(f,M). Note that this doesn't necessarily give the furthest point from M, but it gives a point sufficiently far from M. Or use the HOBbit distance? Modify to get the absolute furthest point by jumping (when the AND gives zero) to Pk2 and continuing the AND from there. (Dh gives a decent f, at furthest HOBbit distance.)
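One common way to realize a HOBbit-style distance is the position of the highest-order bit where two values differ; the exact definition in the pTree literature varies, so this sketch is one reading rather than the canonical formula:

```python
def hobbit_dis(a, b):
    """Position of the highest-order differing bit of two non-negative
    ints (0 when a == b): values sharing a long high-bit prefix are
    'HOBbit close', matching the bit-slice AND procedure above."""
    return (a ^ b).bit_length()
```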


Centroid = mean, h = 1, DT = 1.5 gives 4 outliers and 3 non-outlier clusters.