Why Not Store Everything in Main Memory? Why use disks?

FAUST: Using std or rankK instead of just gap size to determine the best gap, and/or using multiple-attribute cutting, can improve accuracy.

We need a pTree ALGEBRA (already well started with the pTree Algebra paper); it involves the pTree operators, AND, OR, COMP, XOR, ..., and their algebraic properties (commutativity, associativity, distributivity, ...). We need a pTree CALCULUS (functions that produce the pTree mask for just about any pTree-defining predicate).

Note that in FAUST_div,gap we cut perpendicular to the single attribute's line which contains the maximum consecutive-class-mean gap, and we use the midpoint of that gap (or the confluence point of maximal stds) as the cut_point to separate the entire remaining space of pixels into 2 big boxes, one containing one partition of the remaining classes and the other the balance of the classes. We never cut oblique to attribute lines! Then we do it again on one of those sub-partitions, until we reach a single class. Can that be improved? With respect to speed, probably not. Can accuracy be improved without sacrificing speed (too much)? Here is a way, at about the same speed. As motivation, think

about a blue-red cars class (define it, e.g., as 2 parts red, 1 part blue). We want to do a cut (at the midpoint of the maximal gap) maximizing over all oblique directions, not just along dimensions (since the dimensions form a measure-zero set of all possible directions). E.g., a blue-red cut would define a line at a 30 degree angle from the red axis toward the blue points, in the blue-red direction.

If D is any unit vector, X dot D = Σ_{i=1..n} X_i·D_i, and X dot D > cut_point defines an oblique big box. We ought to consider all D-lines (noting that dimension lines ARE D-lines). For this we will need a multi-attribute "EIN-Oblique" mask pTree formula, P_{(X dot D) > a}, where X is any vector and D is an oblique unit vector. (NOTE: if D = e_i = (0,...,1,...,0), then this is just the existing EIN formula for the ith dimension, P_{Xi > a}.) The pTree formula for the dot product is in the pTree book, pages 133-134 (and Mohammad is developing a better one?). We would like a recursive, exhaustive search for the vector D that gives us the maximal gap among the consecutive training-class means for the classes that remain (not just over all attribute directions, but over all combination directions).

How can we find it? First examples: using a quadratic hyper-surface (instead of a hyper-plane)?

Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue class training points and the 10 bluish-red class training points.

[Scatter plot: b points and r points in the (r,b) plane, with the D-line and the D-line means for the b class and the r class marked.]

Take the r and the b points that project closest to the D-line as the "best" support pair; similarly for the "next best" or "second best" support pair, and similarly for the "third best" pair. Form the quadratic support curve from the three r-support points for class-r. Form the quadratic support curve from the three b-support points for class-b (or move each point in each pair 1/3 of the way toward the other and then do the above, or ????).
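One concrete reading of the quadratic support curve: the parabola through a class's three support points is determined exactly by a 3x3 linear solve. A sketch (function name and point values are illustrative):

```python
import numpy as np

def quadratic_support_curve(p1, p2, p3):
    """Fit the parabola y = a*x^2 + b*x + c through three support points.
    Returns (a, b, c).  This is one reading of forming the 'quadratic
    support curve' from a class's three support points."""
    xs = np.array([p1[0], p2[0], p3[0]], dtype=float)
    ys = np.array([p1[1], p2[1], p3[1]], dtype=float)
    A = np.vander(xs, 3)               # columns: x^2, x, 1
    a, b, c = np.linalg.solve(A, ys)   # solve the 3x3 system exactly
    return a, b, c

coeffs = quadratic_support_curve((0, 1), (1, 0), (2, 1))  # y = x^2 - 2x + 1
```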

The same (r,b)-scatter plot of the 10 reddish-blue and 10 bluish-red class training points, repeated.

[Scatter plot repeated, now with a parabolic separating curve fitted between the two classes.]

Fitting a parabolic hyper-surface

APPENDIX: FAUST is a Near Neighbor Classifier. It is not a voting NNC like pCkNN (where, for each unclassified sample, pCkNN builds around that sample a neighborhood of TrainingSet voters, who then classify the sample through majority, plurality or weighted (in PINE) vote). pCkNN classifies one unclassified sample at a time. FAUST is meant for speed, and therefore FAUST attempts to classify all unclassified samples at one time. FAUST builds a Big Box Neighborhood (BBN) for each class and then classifies all unclassified samples in the BBN into that class (constructing said class-BBNs with one EIN pTree calculation per class). The BBNs can overlap, so the classification needs to be done one class at a time, sequentially, in maximum-gap, maximum-number-of-stds-in-gap, or minimum-rankK-in-gap order. The whole process can be iterated, as in k-means classification, using the predicted classes (or subsets of them) as the new training set. This can be continued until convergence.

A BBN can be a coordinate box: for coordinate R, cb(R, class, a_R, b_R) is all x such that a_R < x_R < b_R (either or both of the < can be ≤). a_R and b_R are what were called the cut_points of the class.
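A coordinate-box BBN, and the multi-coordinate intersection of several of them, can be simulated with boolean masks (the AND mirroring pTree ANDing); the function name and bounds below are mine:

```python
import numpy as np

def coordinate_box_mask(X, bounds):
    """Mask for a multi-coordinate Big Box Neighborhood: the INTERSECTION
    (logical AND, as with pTree ANDing) of per-coordinate boxes.
    bounds: dict {coordinate index R: (a_R, b_R)}, strict inequalities."""
    mask = np.ones(len(X), dtype=bool)
    for r, (a, b) in bounds.items():
        mask &= (X[:, r] > a) & (X[:, r] < b)   # AND in one more cb
    return mask

X = np.array([[5.0, 1.0], [7.0, 2.0], [9.0, 2.0]])
inside = coordinate_box_mask(X, {0: (4.0, 8.0), 1: (0.0, 3.0)})
```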

Or BBNs can be multi-coordinate boxes, which are INTERSECTIONs of the best k (k ≤ n−1, assuming n classes) cb's for a given class ("best" can be with respect to any of the above maximizations). And instead of using a fixed number of coordinates, k, we could use only those coordinates in which the "quality" of the cb is higher than a threshold, where "quality" might be measured using the dimensions of the gaps (or in other ways?).

FAUST could be combined with pCkNN (probably in many ways) as follows: FAUST multi-coordinate BBNs could be used first to classify the "easy points" (those that fall in an intersection of high-quality BBNs and are therefore fairly certain to be correctly classified). Then the remaining "difficult points" could be classified using the original training set (or the union of each original TrainingSet class with the new "easy points" of that same class) and using L∞ or L_p, p = 1 or 2.

A Multi-attribute EIN Oblique (EINO) based heuristic: Instead of finding the best D, take the vector connecting a class mean to another class mean as D. To separate r from v: D = (m_v − m_r) and a = (m_v + m_r)/2. To separate r from b: D = (m_b − m_r) and a = (m_b + m_r)/2. Question: what's the best a as cut point? mean, vector_of_medians, outermost, outermost_non-outlier?

P_{(m_b − m_r)∘X > (m_b − m_r)∘(m_r + m_b)/2}
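A sketch of the mean-to-mean mask (names mine). I read the cut point as the projection of the midpoint (m_r + m_b)/2 onto D = m_b − m_r, consistent with the a = (d∘m_r + d∘m_v)/2 option stated on the later version of this slide:

```python
import numpy as np

def mean_to_mean_mask(X, m_r, m_b):
    """EINO heuristic mask: D runs from class mean m_r to class mean m_b,
    and the cut point a is the projection of the midpoint onto D.
    Rows where D.X > a fall on the b side of the separating hyperplane."""
    D = np.asarray(m_b, float) - np.asarray(m_r, float)
    a = D @ ((np.asarray(m_r, float) + np.asarray(m_b, float)) / 2)
    return X @ D > a

m_r, m_b = np.array([1.0, 1.0]), np.array([5.0, 5.0])
side_b = mean_to_mean_mask(np.array([[1.5, 1.0], [4.0, 5.0]]), m_r, m_b)
```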

By "outermost" I mean the furthest points away from the means in each class (in terms of their projections onto the D-line). By "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rankK points, the best std points, etc. Comments on where to go from here (assuming we can do the above): I think the "medoid-to-medoid" method on this page is close to optimal, provided the classes are convex. If they are not convex, then some sort of Support Vector Machine, SVM, would be the next step. In SVMs the space is translated to higher dimensions in such a way that the classes ARE convex. The inner product in that space is equivalent to a kernel function in the original space, so that one need not even do the translation to get inner-product based results (the genius of the method). Final note: I should say "linearly separable" instead of "convex" (a slightly weaker condition).

A Multi-attribute EIN Oblique (EINO) based heuristic: Instead of finding the best D, take the vector connecting a class mean to another class mean as D (with unit vector d = D/|D|), where a can be calculated as: 1. a = (d∘m_r + d∘m_v)/2; or 2. letting a_r = max(d∘r) and a_v = min(d∘v) (if d∘m_r < d∘m_v, else reverse max and min), for r take a = a_v; or 3. using variance fits (or rankK fits). (Note: apply to all other classes, or only to those for which there is a positive gap.)

FAUST: For isolating class_i: 1. Create a table, TBL(class_i, meanvector_i) = (class_j, meanvector_j). 2. Apply the pTree mask formula at left. Note: if we take the fastest route and just pick the one class which, when paired with r, gives the max gap, then we can use the max gap; if the maximum-std-point is used instead of the midpoint of the max gap, then we need std_j (or variance_j) in TBL.

P_{((m_r − m_v)/|m_r − m_v|)∘X < a}

[Scatter plot: r and v training points with their means m_r and m_v and the direction D marked.]

D = m_r − m_v

For classes r and b: Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue class training points and the 10 bluish-red class training points.

[Scatter plot: rb points toward the blue axis, br points toward the red axis, with the D-line means for the rb and br classes, the consecutive-class-mean mid-point Cut_Point, and the Cut-HyperPlane, CHP (what we are after), marked.]

Clearly we would want to find a 45 degree unit vector, D, then calculate the means of the projections of the two training sets onto the D-line, then use the midpoint of the gap between those two means as the cut_point (erecting a perpendicular-bisector "hyperplane" to D there, which separates the space into the two class big boxes on each side of the hyperplane). Can it be masked using one EIN formula??

[The same scatter plot, repeated with the diagonal Cut-HyperPlane drawn between the rb and br points.]

The above "diagonal" cutting produces a perfect classification (of the training points). If we had considered only cut_points along coordinate axes, it would have been very imperfect!

How do we search through all possible angles for the D that will maximize that gap? We would have to develop the formula (a pTree-only formula) for the class means for any D, and then maximize the gap (the distance between consecutive D-projected means). Take a look at the formulas in the book, think about it, take a look at Mohammad's formulas, and see if you can come up with the mega-formula above. Let D = (D1, ..., Dn) be a unit vector (our cut_line direction vector). D dot X = D1·X1 + ... + Dn·Xn is the length of the perpendicular projection of X on D (the length of the high-noon shadow that X makes on the D line, as if D were the earth). So, we project every training point, X_c,i (class c, i = 1..10), onto D (i.e., D dot X_c,i). Calculate the D-line class means, (1/n)·Σ_i (D dot X_c,i), select the max consecutive mean gap along D (call it best_gap(D) = bg(D)), and maximize bg(D) over all possible D. Harder? Calculate it for a polar grid of D's! Maximize over that grid. Then use continuity and hill climbing to improve it.
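The polar-grid search just described can be sketched in 2-D: try unit vectors D at evenly spaced angles, compute bg(D) as the largest gap between consecutive projected class means, and keep the best D (hill climbing around the winner would refine it further). Function names and data are illustrative:

```python
import numpy as np

def best_gap(X_by_class, D):
    """bg(D): largest gap between consecutive class means of the
    projections onto unit vector D."""
    means = sorted(float(np.mean(pts @ D)) for pts in X_by_class.values())
    return max(b - a for a, b in zip(means, means[1:]))

def grid_search_D(X_by_class, n_angles=180):
    """2-D polar-grid search: try unit vectors at evenly spaced angles
    in [0, pi) and keep the one maximizing bg(D)."""
    best_D, best_bg = None, -np.inf
    for t in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        D = np.array([np.cos(t), np.sin(t)])
        g = best_gap(X_by_class, D)
        if g > best_bg:
            best_D, best_bg = D, g
    return best_D, best_bg

rng = np.random.default_rng(0)
rb = rng.normal([1, 4], 0.2, size=(10, 2))   # reddish-blue class
br = rng.normal([4, 1], 0.2, size=(10, 2))   # bluish-red class
D, bg = grid_search_D({'rb': rb, 'br': br})  # D near the 135-degree diagonal
```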

More likely, the situation would be that rb's are more blue than red and br's are more red than blue.

Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter plot the 10 reddish-blue class training points and the 10 bluish-red class training points.

[Scatter plot: rb points more blue than red, br points more red than blue, with the D-line means for the rb and br classes and the Cut_point marked.]

What if the training points are shifted away from the origin? This should convince you that it still works. In higher dimensions, nothing changes. (If there are "convex" clustered classes, FAUST_div,oblique_gap can find them; consider greenish-reddish-blue and bluish-greenish-red.)

Before considering the pTree formulas for the above, we note again that any pair of classes (or multi-classes, as in the divisive version) that are convex can be separated by this method. What if they are not convex? A 2-D example:

A couple of comments: FAUST resembles the SVM (Support Vector Machine) method in that it constructs a separating hyperplane in the "margin" between classes. The beauty of SVM (over FAUST and all other methods) is that it is provable that there is a transformation to higher dimensions that renders two non-hyperplane-separable classes hyperplane-separable (and you don't actually have to do the transformation; just determine the kernel that produces it). The problem with SVM is that it is computationally intensive. I think we want to keep FAUST simple (and fast!). If we can do this generalization, I think it will be a real winner!

How do we search over all possible oblique vectors, D, for the one that is "best"? Or, if we are to use multi-box neighborhoods, how do we do that? A heuristic method follows.

[Slide shows the raw training-set pTree bit vectors: rows of 0/1 bits, one row per bit-slice pTree. Omitted here.]

FAUST_pdq_std (using std's): 1.1 Create attribute tables with cl = class, mn = mean, std, n = max_stds_in_gap, cp = cut_point (the value in the gap which allows the max number of stds, n, to fit forward from the mean (using its std) and backward from the next mean (using its std)). n satisfies mn + n·std = mn_G − n·std_G, so n = (mn_G − mn)/(std + std_G).

TpLN: cl mn  std n   cp
      se 15  1.0 4.5 19
      ve ...

The TA record with max n is the TpLN se row (n = 4.5, cp = 19); cutting pLN at 19 peels off se. Note: since there is also a case with n = 4.1 which results in the same partition (into se and ve,vi), we might use both for improved accuracy; certainly we can do this with the sequential version!
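The n and cp computations of step 1.1 are one line each. This sketch (function name mine) reproduces the TpLN se/ve row from the slide's attribute tables, up to rounding:

```python
def stds_in_gap(mn, std, mn_next, std_next):
    """Number of standard deviations, n, that fit in the gap between two
    consecutive class means: mn + n*std = mn_next - n*std_next,
    hence n = (mn_next - mn)/(std + std_next).
    The cut point is then cp = mn + n*std."""
    n = (mn_next - mn) / (std + std_next)
    return n, mn + n * std

# pLN attribute, classes se (mn=15, std=1.0) and ve (mn=43, std=5.1):
n, cp = stds_in_gap(15.0, 1.0, 43.0, 5.1)
```

This gives n ≈ 4.59 and cp ≈ 19.6, matching the tabulated (rounded) values n = 4.5, cp = 19.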

[pTree bit-vector rows for the se mask and the remaining-class pTrees omitted.]

TsLN: cl mn  std n   cp
      se 49  3.5 0.9 53
      ve 59  6.9 0.5 62
      vi 66  7.6

TsWD: cl mn  std n   cp
      ve 28  3.9 0.3 29
      vi 29  3.1 1.3 33
      se 33  3.1

TpLN: cl mn  std n   cp
      se 15  1.0 4.5 19
      ve 43  5.1 1.3 49
      vi 57  6.0

TpWD: cl mn  std n   cp
      se  2  0.7 4.1  5
      ve 13  2.0 1.5 16
      vi 20  2.3

(attribute order: sLN, sWD, pLN, pWD)
se_means: 49.3 33.3 14.6  2.2    se_std:   3.5  3.1  1.0  0.7
se_ve_n:   0.9 -0.8  4.5  4.1    se_vi_n:  1.5 -0.6  6.0  5.8
se_ve_cp: 52.6 30.7 19.2  5.3    se_vi_cp: 54.5 31.3 20.8  6.5
ve_means: 59.0 27.5 42.5 13.4    ve_std:   6.9  3.9  5.1  2.0
ve_vi_n:   0.5  0.3  1.3  1.5    ve_se_n: -0.9  0.8 -4.5 -4.1
ve_vi_cp: 62.3 28.5 49.1 16.4    ve_se_cp: 52.6 30.7 19.2  5.3
vi_means: 65.9 29.3 56.8 19.9    vi_std:   7.6  3.1  6.0  2.3
vi_se_n:  -1.5  0.6 -6.0 -5.8    vi_ve_n: -0.5 -0.3 -1.3 -1.5
vi_se_cp: 54.5 31.3 20.8  6.5    vi_ve_cp: 62.3 28.5 49.1 16.4

Remove se from RC (leaving ve, vi) and from all TA's.

[Raw pTree bit-vector rows repeated for the remaining classes (ve, vi); omitted.]

FAUST_pdq using std's: 1.2 Use the 4 attribute tables with rv = mean, stds, and n = max_stds_in_gap, and cut value cp (cp = the value in the gap which allows the max number of stds, n, to fit forward from that mean (using its std) and backward from the next mean, mean_G (using std_G)). n satisfies mean + n·std = mean_G − n·std_G, so n = (mean_G − mean)/(std + std_G).

TpWD: cl mn  std n   cp
      ve 13  2.0 1.5 16
      vi 20  2.3

The TA record with max n is the TpWD ve row (n = 1.5, cp = 16).

P_vi = P_{pWD > 16}

[pTree bit-vector rows for P_vi and the remaining masks omitted.]

Note that we get perfect accuracy with one epoch using stds this way!!!

TsLN: cl mn  std n   cp
      se 49  3.5 0.9 53
      ve 59  6.9 0.5 62
      vi 66  7.6

TsWD: cl mn  std n   cp
      ve 28  3.9 0.3 29
      vi 29  3.1 1.3 33
      se 33  3.1

TpLN: cl mn  std n   cp
      se 15  1.0 4.5 19
      ve 43  5.1 1.3 49
      vi 57  6.0

TpWD: cl mn  std n   cp
      se  2  0.7 4.1  5
      ve 13  2.0 1.5 16
      vi 20  2.3

FAUST_pdq SUMMARY: We conclude that FAUST_pdq will be fast (no loops, one pTree mask per step, and it may converge in one or just a few epochs??), and is fairly accurate (completely accurate in this example using the std method!). FAUST_pdq is improved (accuracy-wise) by using standard-deviation-based gap measurements and choosing the maximum number of stds as the attribute-relevancy choice. There may be many other such improvements one can think of, e.g., using an outlier identification method (see Dr. Dongmei Ren's thesis) to determine the set of non-outliers in each attribute and class. Within each attribute, order by means and define gaps to be between the maximum non-outlier value in one class and the minimum non-outlier value in the next (allowing these gap measurements to be negative if the max of one exceeds the minimum of the next). Also, there are many ways of defining representative values (means, medians, rank-points, ...). In conclusion, FAUST_pdq is intended to be very fast (if raw speed is the need, as it might be for initial processing of the massive and numerous image datasets that the DoD has to categorize and store). It may be fairly accurate as well, depending upon the dataset, but since it uses only one attribute or feature for each division, it is not likely to be of maximal accuracy compared to other methods (such as the FAUST_pms coming up). Next look at FAUST_pms (pTree-based, m-attribute cut_points, sequential (1 class divided off at a time)), so we can explore the various choices for m (from 1 to the table width) and alternate distance measures.

For i = 4..0: { c = rc(P_c & P_att,i); if (c ≥ K) { add 2^i to the rankK value (equivalently, to rank(n−K+1)); P_c = P_c & P_att,i } else { K = K − c; P_c = P_c & P'_att,i } }

[Worked bit-slice counts from the slide (the K = 10 pass) omitted.]
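The loop above is, as far as it can be reconstructed, the standard bit-slice descent for rank-K: at each bit position, count the selected rows with that bit set; if the count reaches K the K-th largest value has that bit, otherwise it doesn't and K is reduced by the count. A plain-Python sketch (names mine; a pTree engine would do the AND and the root count rc() on compressed trees):

```python
def rank_k_bitslices(slices, mask, K):
    """Find the K-th largest value among the rows selected by `mask`,
    descending the bit slices most-significant-first.
    `slices` is a list of bit lists, slices[0] = most significant."""
    value = 0
    n_bits = len(slices)
    for i, sl in enumerate(slices):
        hi = [m & b for m, b in zip(mask, sl)]      # mask AND bit-slice
        c = sum(hi)                                  # rc(): root count
        if c >= K:                                   # K-th largest has this bit
            value += 1 << (n_bits - 1 - i)
            mask = hi
        else:                                        # it is among the 0-bit rows
            K -= c
            mask = [m & (1 - b) for m, b in zip(mask, sl)]
    return value

# values 5, 3, 7, 2 stored as 3 bit slices (MSB first); the 2nd largest is 5
slices = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 1, 0]]
r = rank_k_bitslices(slices, [1, 1, 1, 1], K=2)
```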

pWD_vi_LO = 16 ≥ pWD_se_HI = 0 and pWD_ve_HI = 0. So the highest pWD_se_HI and pWD_ve_HI can get is 15, and the lowest pWD_vi_LO will ever be is 16. So cutting at ≥ 16 will separate all vi from se,ve. This is, of course, with reference to the training set only, and it may not carry over to the test set (a much bigger set?), especially since the gap may be small (1). Here we will use the pWD cut point ≥ 16 to peel off vi! We need a theorem and proof here!!!

[Worked rank-K bit-slice counts for attributes sLN=1, sWD=2, pLN=3, pWD=4 (the K = 10 and K = 3 passes), and a repeat of the rank-K loop; omitted.]

pLN_ve_LO = 32 ≥ pLN_se_HI = 0. So the highest pLN_se_HI can get is 31, and the lowest pLN_ve_LO will ever be is 32. So cutting at ≥ 32 will separate all ve from se! Greater accuracy can be gained by continuing the process for all i and for all K, and then looking for the best gaps! (all gaps? all gaps weighted?)


APPENDIX

FAUST_pdq,mrk (FAUST_pdq with max rank_k): rank_k(S) is the value at rank k in S (k = 1/2 gives the median, k = 1 the maximum, k = 0 the minimum).

FAUST_pdq,gap (divisive, quiet (no noise), with gaps):
0. For each attribute, A, TA(class, rv, gap) is its attribute table, ordered on rv ascending (rv = class representative, gap = distance to the next rv). For the mrk variant, TA(class, md, k, cp) is ordered on md ascending, where k is the max value such that set_rank_k of the class and set_rank_(1−k) of the next class still leave a gap. The same algorithm can clearly be used as pms: FAUST_pms,mrk.
1. Find the TA record with maximum gap.
WHILE RC not empty, DO
2. Use P_{A>c} (c = rv + gap/2) to divide RC at c into LT and GT (pTree masks, P_LT and P_GT).
3. If LT or GT is a singleton, remove that class.
END_DO
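The FAUST_pdq,gap loop can be sketched with numpy boolean masks standing in for the pTree masks P_LT/P_GT (all names are mine; class means stand in for the representatives rv):

```python
import numpy as np

def faust_pdq_gap(X, y):
    """Sketch of FAUST_pdq,gap: repeatedly find the attribute-table record
    with the largest gap between consecutive class representatives (means),
    cut at c = rv + gap/2, and peel off any side that is a single class.
    Returns a list of (attribute, cut, classes_below, classes_above)."""
    cuts, remaining = [], set(np.unique(y))
    while len(remaining) > 1:
        best = None   # (gap, attr, cut, low_classes, high_classes)
        for a in range(X.shape[1]):
            reps = sorted((X[y == c, a].mean(), c) for c in remaining)
            for (rv1, _), (rv2, _) in zip(reps, reps[1:]):
                gap = rv2 - rv1
                if best is None or gap > best[0]:
                    cut = rv1 + gap / 2
                    low = {c for rv, c in reps if rv <= rv1}
                    best = (gap, a, cut, low, {c for _, c in reps} - low)
        gap, a, cut, low, high = best
        cuts.append((a, cut, low, high))
        for side in (low, high):          # a singleton side is fully classified
            if len(side) == 1:
                remaining -= side
        if len(low) > 1 and len(high) > 1:
            break   # (divisive recursion on both sides omitted in this sketch)
    return cuts

X = np.array([[1.0], [2.0], [10.0], [11.0], [30.0], [31.0]])
y = np.array([0, 0, 1, 1, 2, 2])
cuts = faust_pdq_gap(X, y)   # first cut peels off class 2 at 20.5
```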

FAUST_pdq,std (FAUST_pdq using the number of gap standard devs):
0. For each attribute, A, TA(class, mn, std, n, cp) is its attribute table ordered on n ascending, where cp = the value in the gap allowing the max number of stds, n. n satisfies mn + n·std = mn_G − n·std_G, so n = (mn_G − mn)/(std + std_G).
1. Find the TA record with maximum n.
WHILE RC not empty, DO
2. Use P_{A>cp} to divide RC at the cut point cp into LT and GT (pTree masks, P_LT and P_GT).
3. If LT or GT is a singleton, remove that class from RC and from all TA's.
END_DO

FAUST_pms,gap (FAUST_p, m attribute cut_points, sequential class separation (1 class at a time), m = 1):
0. For each attribute, A, TA(class, rv, gap, avgap), where avgap is the average of gap and previous_gap (if 1st, avgap = gap). Let x = the number of classes.
1. Find the TA record with maximum avgap.
DO x−1 times
2. c_L = rv − prev_gap/2, c_G = rv + gap/2; masks: P_class = P_{A>cL} & P_{A≤cG} & P_RC, then P_RC = P'_class & P_RC. (If the class is 1st in TA (no prev_gap), P_class = P_{A≤cG} & P_RC. If last, P_class = P_{A>cL} & P_RC.)
3. Remove that class from RC and from all TA's.
END_DO
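A sketch of one FAUST_pms,gap step (names mine; I take avgap for the last class in a table to be its previous gap, symmetric to the stated first-class rule, and boolean arrays stand in for the pTree masks):

```python
import numpy as np

def faust_pms_gap_step(X, y, remaining):
    """One FAUST_pms,gap step: pick the (attribute, class) with the largest
    avgap, then mask off that class with c_L = rv - prev_gap/2 and
    c_G = rv + gap/2 (open-ended if the class is first or last)."""
    best = None  # (avgap, attr, class, c_L, c_G)
    for a in range(X.shape[1]):
        reps = sorted((X[y == c, a].mean(), c) for c in remaining)
        rvs = [rv for rv, _ in reps]
        for j, (rv, c) in enumerate(reps):
            prev_gap = rv - rvs[j - 1] if j > 0 else None
            next_gap = rvs[j + 1] - rv if j + 1 < len(rvs) else None
            gaps = [g for g in (prev_gap, next_gap) if g is not None]
            avgap = sum(gaps) / len(gaps)
            if best is None or avgap > best[0]:
                cL = rv - prev_gap / 2 if prev_gap is not None else -np.inf
                cG = rv + next_gap / 2 if next_gap is not None else np.inf
                best = (avgap, a, c, cL, cG)
    _, a, c, cL, cG = best
    P_class = (X[:, a] > cL) & (X[:, a] <= cG)   # P_{A>cL} & P_{A<=cG}
    return c, P_class

X = np.array([[1.0], [2.0], [10.0], [11.0], [30.0], [31.0]])
y = np.array([0, 0, 1, 1, 2, 2])
c, P_class = faust_pms_gap_step(X, y, {0, 1, 2})
```

Repeating the step x−1 times, ANDing each new P_class with the shrinking P_RC, gives the sequential peel-off described above.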

FAUSTpms,std (FAUSTpms using gap std

0. ?attr, A TA(class, mn, std, n, avgn, cp)

ordered avgn asc

cpcut_point (value in gap which allows max of

stds, n, (n satisfies mnnstdmnnext-nstdnext

so n(mnnext-mn)/(stdstdt)

1. Find the TA record with maximum avgn

DO x-1 times

2. cLrv-prev_gap/2. cGrvgap/2 and pTree masks

PclassPAgtcL PA?cGPRC PRC P'classPRC

(If class 1st in TA (has no prev_gap), then

Pclass PA?cGPRC. If last, Pclass PAgtcLPRC.)

3. Remove that class from RC and from all TA's

END_DO