
Data Mining 2

- Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend end of Query Processing, rather than the "please find" or straightforward end).
- To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum, rather than the standard report generation or "retrieve all records matching a criterion" or SQL side.
- Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely:
- 1. SCAN and PARSE (SCANNER-PARSER): A Scanner identifies the tokens or language elements of the DM query. The Parser checks for syntax or grammar validity.
- 2. VALIDATE: The Validator checks for valid names and semantic correctness.
- 3. CONVERT: The Converter converts to an internal representation.
- 4. QUERY OPTIMIZE: The Optimizer devises a strategy for executing the DM query (chooses among alternative query internal representations).
- 5. CODE GENERATION: generates code to implement each operator in the selected DM query plan (the optimizer-selected internal representation).
- 6. RUNTIME DATABASE PROCESSOR: runs the plan code.
- Developing new, efficient and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!).

Database analysis can be broken down into 2 areas, Querying and Data Mining.

Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining.

Machine Learning can be broken down into 2 areas, Clustering and Classification.

Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based.

Classification can be broken down into 2 types, Model-based and Neighbor-based.

- Machine Learning is almost always based on Near Neighbor Set(s), NNS.
- Clustering, even density based, identifies near neighbor cores 1st (round NNSs, ε-disks about a center).
- Classification is continuity based and Near Neighbor Sets (NNS) are the central concept in continuity:
- ∀ε>0 ∃δ>0 such that d(x,a)<δ ⟹ d(f(x),f(a))<ε, where f assigns a class to a feature vector, or
- ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 such that d(x,a)<δ ⟹ f(x)=f(a).

Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).

Finding NNS in a lower dimension may still be the 1st step. E.g., points 1,2,3,4,5,6,7,8 are all within ε of a (the unclassified sample); 1,2,3,4 are red-class and 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace, then taking ε/2, we see that the ε/2-neighborhood about a contains only blue-class (5,6) votes.

Using horizontal data, NNS derivation requires 1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2(n)) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)

Association Rule Mining (ARM)

- Assume a relationship between two entities,
- T (e.g., a set of Transactions an enterprise performs) and
- I (e.g., a set of Items which are acted upon by those transactions).
- In Market Basket Research (MBR) a transaction is a checkout transaction and an item is an item in that customer's market basket going thru checkout.
- An I-Association Rule, A⇒C, relates 2 disjoint subsets of I (itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent).

The support of an itemset, A, is the fraction of T-instances related to every I-instance in A, e.g., if A={i1,i2} and C={i4}, then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. Note | | means set size, or count of elements in the set. I.e., t2 and t4 are the only transactions from the total transaction set, T = {t1,t2,t3,t4,t5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time).

The support of the rule, A⇒C, is defined as supp(A∪C) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5.

The confidence of the rule, A⇒C, is supp(A∪C) / supp(A) = (2/5) / (2/5) = 1.

DM queriers typically want STRONG RULES: supp ≥ minsupp, conf ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A⇒C) is also just the conditional probability of t being related to C, given that t is related to A.
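To make the two measures concrete, here is a minimal Python sketch over the toy example above (the item memberships of t1, t3 and t5 are hypothetical fillers, chosen only so that t2 and t4 are the transactions related to both i1 and i2):

# Minimal sketch of support and confidence on the toy example above.
# The relation is a dict: transaction -> set of related items.
related = {
    "t1": {"i3"},             # filler rows are hypothetical
    "t2": {"i1", "i2", "i4"},
    "t3": {"i2"},
    "t4": {"i1", "i2", "i4"},
    "t5": {"i4"},
}

def supp(itemset):
    """Fraction of transactions related to every item in the itemset."""
    matching = [t for t, items in related.items() if itemset <= items]
    return len(matching) / len(related)

def conf(antecedent, consequent):
    """supp(A u C) / supp(A): the conditional probability of C given A."""
    return supp(antecedent | consequent) / supp(antecedent)

A, C = {"i1", "i2"}, {"i4"}
print(supp(A))      # 2/5 = 0.4
print(supp(A | C))  # 0.4
print(conf(A, C))   # 1.0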

There are also the dual concepts of T-association rules (just reverse the roles of T and I above). Examples of Association Rules include:

In MBR, the relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought by that customer during that cash-register transaction).

In Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module, i, is part of the aspect, t).

In Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene, i, expresses at a threshold level during experiment, t).

In ER diagramming, any part-of relationship in which i∈I is part of t∈T (t is related to i iff i is part of t), and any ISA relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . .

Finding Strong Assoc Rules

- The relationship between Transactions and Items can be expressed in a
- Transaction Table, where each transaction is a row containing its ID
- and the list of the items that are related to that transaction.

Or the Transaction Table's items can be expressed using item bit vectors:

  T-ID  A B C D E F
  2000  1 1 1 0 0 0
  1000  1 0 1 0 0 0
  4000  1 0 0 1 0 0
  5000  0 1 0 0 1 1

If minsupp is set by the querier at .5 and minconf at .75:

To find frequent or Large itemsets (support ≥ minsupp):

- PseudoCode: Assume the items in Lk-1 are ordered.
- Step 1: self-joining Lk-1
-   insert into Ck
-   select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
-   from Lk-1 p, Lk-1 q
-   where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
-   forall itemsets c in Ck do
-     forall (k-1)-subsets s of c do
-       if (s is not in Lk-1) delete c from Ck

FACT: Any subset of a large itemset is large. Why? (e.g., if {A,B} is large, {A} and {B} must be large.)
APRIORI METHOD: Iteratively find the large k-itemsets, k = 1, 2, ... Find all association rules supported by each large itemset.
Ck denotes the candidate k-itemsets generated at each step; Lk denotes the Large k-itemsets.
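The self-join and prune steps above, as a small Python sketch (itemsets are kept as sorted tuples, mirroring the assumption that the items in Lk-1 are ordered; the function and variable names are mine, not from the slide):

from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets Ck from the large (k-1)-itemsets L_prev."""
    L_prev = set(L_prev)
    k_minus_1 = len(next(iter(L_prev)))
    Ck = set()
    # Step 1: self-join -- p and q agree on the first k-2 items, p's last item < q's last item
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Step 2: prune -- drop any candidate with a (k-1)-subset that is not large
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k_minus_1))}

# Usage: the L2 from the example database later in these slides yields one candidate, {2,3,5}
print(apriori_gen({(1, 3), (2, 3), (2, 5), (3, 5)}))   # {(2, 3, 5)}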

Vertical basic binary Predicate-tree (P-tree): vertically partition the table, then compress each vertical bit slice into a basic binary P-tree as follows.

P-tree Review: A data table, R(A1..An), containing horizontal structures (records), is processed vertically (vertical scans), then processed using multi-operand logical ANDs.

R11 = 0 0 0 0 1 0 1 1

The basic binary P-tree, P1,1, for R11 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity is reached.

But the left half is pure (pure0), so that branch ends.

R11 = 0 0 0 0 1 0 1 1

Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient.

Bottom-up construction of P11 is done using in-order tree traversal and the collapsing of pure siblings, as follows:

R(A1 A2 A3 A4) =
  2 7 6 1
  6 7 6 0
  2 7 5 1
  2 7 5 7
  5 2 1 4
  2 2 1 5
  7 0 1 4
  7 0 1 4

[Slide figure: the 12 vertical bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 of this table, and the bottom-up construction of P11.]

Processing Efficiencies? (prefixed leaf-sizes have been removed)

The 2^1 level has the only 1-bit, so the 1-count = 1·2^1 = 2.
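A minimal Python sketch of the bottom-up construction of a basic binary P-tree (Pure1-tree) for one bit column, using the R11 column shown earlier; the fanout-2 tuple representation of nodes is an assumption of this sketch:

def build_p1_tree(bits):
    """Bottom-up fanout-2 Pure1-tree for one bit column.
    A node is 1 (pure1), 0 with no children (pure0), or (left, right) if mixed."""
    n = 1
    while n < len(bits):
        n *= 2
    nodes = list(bits) + [0] * (n - len(bits))   # pad to a power of 2 (an assumption here)
    while len(nodes) > 1:
        nxt = []
        for left, right in zip(nodes[0::2], nodes[1::2]):
            if left == 1 and right == 1:
                nxt.append(1)                    # collapse pure1 siblings
            elif left == 0 and right == 0:
                nxt.append(0)                    # collapse pure0 siblings
            else:
                nxt.append((left, right))        # mixed: keep both children
        nodes = nxt
    return nodes[0]

# R11 = 0 0 0 0 1 0 1 1 -> (0, ((1, 0), 1)): left half pure0, right half mixed
print(build_p1_tree([0, 0, 0, 0, 1, 0, 1, 1]))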

Database D

Example: ARM using uncompressed P-trees (note I have placed the 1-count at the root of each P-tree).

  TID  1 2 3 4 5
  100  1 0 1 1 0
  200  0 1 1 0 1
  300  1 1 1 0 1
  400  0 1 0 0 1

From D, the Large itemsets L1, L2 and L3 are found (shown on the slide).

1-Itemsets don't support Association Rules (they will have no antecedent or no consequent).

2-Itemsets do support ARs.

Are there any Strong Rules supported by Large 2-ItemSets (at minconf = .75)?

{1,3}: conf(1⇒3) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG
       conf(3⇒1) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf(2⇒3) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf(3⇒2) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf(2⇒5) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG!
       conf(5⇒2) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG!
{3,5}: conf(3⇒5) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf(5⇒3) = supp{3,5}/supp{5} = 2/3 = .67 < .75

Are there any Strong Rules supported by Large 3-ItemSets?

{2,3,5}: conf(2,3⇒5) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG!
         conf(2,5⇒3) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75

No subset antecedent can yield a strong rule either (i.e., no need to check conf(2⇒3,5) or conf(5⇒2,3), since both denominators will be at least as large and therefore both confidences will be at least as low).

         conf(3,5⇒2) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75

No need to check conf(3⇒2,5) or conf(5⇒2,3).

DONE!
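The same counts can be read off "vertically": each item is a bit vector over the transactions and a support count is the root count (1-count) of an AND of bit vectors - the uncompressed P-tree view used above. A small Python sketch using Database D (names are mine):

# Database D as item bit vectors over TIDs 100, 200, 300, 400 (from the table above)
item = {
    1: [1, 0, 1, 0],
    2: [0, 1, 1, 1],
    3: [1, 1, 1, 0],
    4: [1, 0, 0, 0],
    5: [0, 1, 1, 1],
}

def root_count(itemset):
    """1-count of the AND of the item bit vectors, i.e., the itemset's support count."""
    acc = [1, 1, 1, 1]
    for i in itemset:
        acc = [a & b for a, b in zip(acc, item[i])]
    return sum(acc)

def conf(antecedent, consequent):
    return root_count(antecedent | consequent) / root_count(antecedent)

print(root_count({2, 3, 5}))   # 2
print(conf({2, 3}, {5}))       # 2/2 = 1.0  -> STRONG at minconf .75
print(conf({2, 5}, {3}))       # 2/3 = 0.67 -> not strong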

P-tree-ARM versus Apriori on aerial photo (RGB) data together with yield data

- P-ARM compared to Horizontal Apriori (classical) and FP-growth (an improvement of it).
- In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness).
- Aerial TIFF images (R,G,B) with synchronized yield (Y).

Scalability with number of transactions; scalability with support threshold:
- Identical results.
- P-ARM is more scalable for lower support thresholds.
- The P-ARM algorithm is more scalable to large spatial datasets.
- 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is 1,700,000).

P-ARM versus FP-growth (see the literature for its definition)

17,424,000 pixels (transactions). Scalability with support threshold; scalability with number of transactions:
- FP-growth: an efficient, tree-based frequent pattern mining method (details later).
- For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance.
- P-ARM achieves better performance in the case of low support thresholds.

Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail):

- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

- The core of the Apriori algorithm:
- Use only large (k-1)-itemsets to generate candidate large k-itemsets.
- Use database scans and pattern matching to collect counts for the candidate itemsets.
- The bottleneck of Apriori: candidate generation.
- 1. Huge candidate sets:
- 10^4 large 1-itemsets may generate 10^7 candidate 2-itemsets.
- To discover a large pattern of size 100, e.g., {a1,...,a100}, we need to generate 2^100 ≈ 10^30 candidates.
- 2. Multiple scans of the database (needs n+1 scans, n = length of the longest pattern).

Classification

Using a Training Data Set (TDS) in which each feature tuple is already classified (has a class value attached to it in the class column, called its class label):

1. Build a model of the TDS (called the TRAINING PHASE).
2. Use that model to classify unclassified feature tuples (unclassified samples). E.g., TDS = last year's aerial image of a crop field (the feature columns are the R,G,B columns together with last year's crop yields attached in a class column, e.g., class values = Hi, Med, Lo yield). The unclassified samples are the RGB tuples from this year's aerial image.
3. Predict the class of each unclassified tuple (in the e.g., predict the yield for each point in the field).

- 3 steps: Build a Model of the TDS feature-to-class relationship, Test that Model, Use the Model
- (to predict the most likely class of each unclassified sample). Note other names for this process: regression analysis, case-based reasoning, ...
- Other Typical Applications:
- Targeted Product Marketing (the so-called classical Business Intelligence problem)
- Medical Diagnosis (the so-called Computer Aided Diagnosis, or CAD)
- Nearest Neighbor Classifiers (NNCs) use a portion of the TDS as the model (neighboring tuples vote): finding the neighbor set is much faster than building other models, but it must be done anew for each unclassified sample. (NNC is called a lazy classifier because it gets lazy and doesn't take the time to build a concise model of the relationship between feature tuples and class labels ahead of time.)
- Eager Classifiers (all other classifiers) build 1 concise model once and for all, then use it for all unclassified samples. The model building can be very costly, but that cost can be amortized over all the classifications of a large number of unclassified samples (e.g., all RGB points in a field).

Eager Classifiers

Test Process (2): Usually some of the Training Tuples are set aside as a Test Set and, after a model is constructed, the Test Tuples are run through the Model. The Model is acceptable if, e.g., the % correct > 60%. If not, the Model is rejected (never used).

Correct classifications?

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Associate Prof  5      yes
  Joseph   Assistant Prof  7      no

Correct = 3, Incorrect = 1: 75%

Since 75% is above the acceptability threshold, accept the model!
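A minimal Python sketch of this test step. The slide does not show the trained model itself, so the rule used below (tenured iff rank is Associate Prof) is only a stand-in that happens to reproduce the 3-correct / 1-incorrect outcome:

# Test tuples from the slide: (name, rank, years, actual tenured)
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Associate Prof", 5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "no"),
]

def model(rank, years):
    # Stand-in rule (hypothetical; the actual trained model is not shown on the slide)
    return "yes" if rank == "Associate Prof" else "no"

correct = sum(1 for _, rank, years, actual in test_set
              if model(rank, years) == actual)
accuracy = correct / len(test_set)
print(correct, len(test_set) - correct, accuracy)   # 3 1 0.75
assert accuracy > 0.60   # acceptability threshold from the slide: accept the model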

Classification by Decision Tree Induction

- Decision tree (instead of a simple case statement

of rules, the rules are prioritized into a tree) - Each Internal node denotes a test or rule on an

attribute (test attribute for that node) - Each Branch represents an outcome of the test

(value of the test attribute) - Leaf nodes represent class label decisions

(plurality leaf class is predicted class) - Decision tree model development consists of two

phases - Tree construction
- At start, all the training examples are at the

root - Partition examples recursively based on selected

attributes - Tree pruning
- Identify and remove branches that reflect noise

or outliers - Decision tree use Classifying unclassified

samples by filtering them down the decision tree

to their proper leaf, than predict the plurality

class of that leaf (often only one, depending

upon the stopping condition of the construction

phase)

Algorithm for Decision Tree Induction

- Basic ID3 algorithm (a simple greedy top-down

algorithm) - At start, the current node is the root and all

the training tuples are at the root - Repeat, down each branch, until the stopping

condition is true - At current node, choose a decision attribute

(e.g., one with largest information gain). - Each value for that decision attribute is

associated with a link to the next level down and

that value is used as the selection criterion of

that link. - Each new level produces a partition of the parent

training subset based on the selection value

assigned to its link. - stopping conditions
- When all samples for a given node belong to the

same class - When there are no remaining attributes for

further partitioning majority voting is

employed for classifying the leaf - When there are no samples left

Bayesian Classification (eager: the model is based on conditional probabilities; prediction is done by taking the highest conditionally probable class)

- A Bayesian classifier is a statistical classifier, based on the following theorem, known as Bayes' theorem:
- Bayes' theorem:
- Let X be a data sample whose class label is unknown.
- Let H be the hypothesis that X belongs to class H.
- P(H|X) is the conditional probability of H given X, and P(H) is the probability of H. Then
- P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification

- Given a training set, R(f1..fn, C), where C = {C1..Cm} is the class label attribute.
- A Naive Bayesian Classifier will predict the class of an unknown data sample, X = (x1..xn), to be the class, Cj, having the highest conditional probability, conditioned on X.
- That is, it will predict the class to be Cj iff P(Cj|X) ≥ P(Ci|X), ∀ i ≠ j (a tie-handling algorithm may be required).
- From Bayes' theorem, P(Cj|X) = P(X|Cj) P(Cj) / P(X).
- P(X) is constant for all classes, so we need only maximize P(X|Cj) P(Cj).
- The P(Ci)'s are known.
- To reduce the computational complexity of calculating all the P(X|Cj)'s, the naive assumption is to assume class conditional independence: P(X|Ci) is the product of the P(xi|Ci)'s.
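A minimal Python sketch of that prediction rule for categorical features: maximize P(X|Cj)P(Cj) under the class-conditional-independence assumption (the add-one smoothing is an addition of this sketch, not something stated on the slide):

from collections import Counter, defaultdict

def train_nb(rows, labels):
    """rows: list of feature tuples (x1..xn); returns priors P(Cj) and per-feature value counts."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    counts = defaultdict(Counter)            # (class, feature index) -> value counts
    for x, c in zip(rows, labels):
        for i, v in enumerate(x):
            counts[(c, i)][v] += 1
    return priors, counts

def predict_nb(x, priors, counts):
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(x):
            total = sum(counts[(c, i)].values())
            # naive assumption: P(X|Cj) = product of P(xi|Cj); +1 is add-one smoothing
            score *= (counts[(c, i)][v] + 1) / (total + len(counts[(c, i)]) + 1)
        if score > best_score:                # ties broken by first-seen class here
            best_class, best_score = c, score
    return best_class

priors, counts = train_nb([("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")],
                          ["no", "yes", "yes"])
print(predict_nb(("sunny", "cool"), priors, counts))   # 'yes'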

Neural Network Classification

- A Neural Network is trained to make the

prediction - Advantages
- prediction accuracy is generally high
- it is generally robust (works when training

examples contain errors) - output may be discrete, real-valued, or a vector

of several discrete or real-valued attributes - It provides fast classification of unclassified

samples. - Criticism
- It is difficult to understand the learned

function (involves complex and almost magic

weight adjustments.) - It makes it difficult to incorporate domain

knowledge - long training time (for large training sets, it

is prohibitive!)

A Neuron

- The input feature vector x = (x0..xn) is mapped into the variable y by means of the scalar product (with the weight vector), a nonlinear function mapping, f (called the damping function), and a bias, θ.

Neural Network Training

- The ultimate objective of training:
- obtain a set of weights that makes almost all the tuples in the training data classify correctly (usually using a time-consuming "back propagation" procedure which is based, ultimately, on Newton's method; see the literature or Other Materials - 10datamining.html for examples and alternate training techniques).
- Steps:
- Initialize weights with random values
- Feed the input tuples into the network
- For each unit:
- Compute the net input to the unit as a linear combination of all the inputs to the unit
- Compute the output value using the activation function
- Compute the error
- Update the weights and the bias
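A small Python sketch of those steps for a single unit (the logistic function stands in for the damping function f, and the simplified delta-style update stands in for full back-propagation):

import math, random

def activate(weights, bias, x):
    """One unit: net input = weighted sum of the inputs, output = f(net - bias)."""
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-(net - bias)))    # logistic damping function

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]   # initialize weights randomly
bias = random.uniform(-0.5, 0.5)

x, target, rate = [1, 0, 1], 1.0, 0.5
for _ in range(20):                      # feed the tuple in, compute error, update
    out = activate(weights, bias, x)
    err = target - out                   # error term for this single-unit "network"
    weights = [w + rate * err * xi for w, xi in zip(weights, x)]
    bias = bias - rate * err             # bias moves opposite to the weights here
print(round(activate(weights, bias, x), 3))   # output approaches the target 1.0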

Neural Multi-Layer Perceptron

[Figure: a multi-layer perceptron. Input nodes take the input vector xi; weights wij connect them to hidden nodes, which connect to output nodes producing the output vector.]

These next 3 slides treat the concept of Distance in great detail. You may feel you don't need this much detail; if so, skip what you feel you don't need. For Nearest Neighbor Classification, a distance is needed (to make sense of "nearest"; other classifiers also use distance).

A distance is a function, d, applied to two n-dimensional points X and Y, such that:
  d(X, Y) is positive definite: if X ≠ Y, then d(X, Y) > 0; if X = Y, then d(X, Y) = 0
  d(X, Y) is symmetric: d(X, Y) = d(Y, X)
  d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)

An Example: in a two-dimensional space, d1 ≥ d2 ≥ d∞ always (Manhattan ≥ Euclidean ≥ max).
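A quick Python check of the three usual metrics and the ordering d1 ≥ d2 ≥ d∞:

def d_manhattan(x, y):      # d1
    return sum(abs(a - b) for a, b in zip(x, y))

def d_euclidean(x, y):      # d2
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def d_max(x, y):            # d-infinity (L_max)
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (1, 5, 2), (4, 1, 2)
print(d_manhattan(X, Y), d_euclidean(X, Y), d_max(X, Y))        # 7  5.0  4
assert d_manhattan(X, Y) >= d_euclidean(X, Y) >= d_max(X, Y)    # d1 >= d2 >= d-inf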

Neighborhoods of a Point

A Neighborhood (disk neighborhood) of a point, T, is a set of points, S, such that X ∈ S iff d(T, X) ≤ r.

If X is a point on the boundary, d(T, X) = r.

Classical k-Nearest Neighbor Classification

- Select a suitable value for k (how many Training

Data Set (TDS) neighbors do you want to vote as

to the best predicted class for the unclassified

feature sample? ) - Determine a suitable distance metric (to give

meaning to neighbor) - Find the k nearest training set points to the

unclassified sample. - Let them vote (tally up the counts of TDS

neighbors that for each class.) - Predict the highest class vote (plurality class)

from among the k-nearest neighbor set.
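Those five steps as a short Python sketch (plain kNN with Euclidean distance, i.e., exactly k voters; the names are mine):

from collections import Counter

def knn_predict(training, sample, k, dist):
    """training: list of (feature_tuple, class_label) pairs. Steps 3-5 of the list above."""
    ranked = sorted(training, key=lambda tc: dist(tc[0], sample))
    votes = Counter(label for _, label in ranked[:k])     # exactly k voters
    return votes.most_common(1)[0][0]                     # plurality class

euclid = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
tds = [((1, 1), "red"), ((2, 1), "red"), ((8, 8), "blue"), ((9, 7), "blue"), ((7, 9), "blue")]
print(knn_predict(tds, (8, 7), k=3, dist=euclid))   # 'blue'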

Closed-KNN

Example: assume 2 features (one in the x-direction and one in the y-direction). T is the unclassified sample. Using k = 3, find the three nearest neighbors. kNN arbitrarily selects one point from the boundary line shown; Closed-kNN includes all points on the boundary.

Closed-kNN yields higher classification accuracy than traditional kNN (thesis of Md Maleq Khan, NDSU, 2001). The P-tree method always produces closed neighborhoods (and is faster!).

k-Nearest Neighbor (kNN) Classification

and Closed-k-Nearest Neighbor (CkNN)

Classification

1) Select a suitable value for k.
2) Determine a suitable distance or similarity notion.
3) Find the (closed) k nearest neighbor set of the unclassified sample.
4) Find the plurality class in the nearest neighbor set.
5) Assign the plurality class as the predicted class of the sample.

T is the unclassified sample. Use Euclidean distance. k = 3: find the 3 closest neighbors, moving out from T until 3 neighbors are found (that's 1, that's 2, then the boundary circle holds more than 3).

kNN arbitrarily selects one point from that boundary line as the 3rd nearest neighbor, whereas CkNN includes all points on that boundary line.

CkNN yields higher classification accuracy than traditional kNN. At what additional cost? Actually, at negative cost (faster and more accurate!!).
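A sketch of the closed variant in the same style: instead of arbitrarily cutting at k, keep every training point whose distance equals the k-th smallest distance (the boundary ties), then take the plurality:

from collections import Counter

def closed_knn_predict(training, sample, k, dist):
    """Closed-kNN: all points tied with the k-th nearest distance stay in as voters."""
    ranked = sorted((dist(x, sample), label) for x, label in training)
    cutoff = ranked[k - 1][0]                                 # distance of the k-th nearest point
    voters = [label for d, label in ranked if d <= cutoff]    # includes boundary ties
    return Counter(voters).most_common(1)[0][0], len(voters)

euclid = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
tds = [((0, 1), "red"), ((1, 0), "blue"), ((0, -1), "blue"),
       ((-1, 0), "blue"), ((3, 0), "red")]
# k = 3, sample at the origin: four points lie on the same unit circle,
# so the closed neighbor set has 4 voters rather than an arbitrary 3.
print(closed_knn_predict(tds, (0, 0), k=3, dist=euclid))   # ('blue', 4)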

The slides numbered 28 through 93 give great detail on the relative performance of kNN and CkNN, on the use of other distance functions, and some examples, etc. There may be more detail on these issues than you want/need. If so, just scan for what you are most interested in, or just skip ahead to slide 94 on CLUSTERING.

Experimented on two sets of (aerial) Remotely Sensed Images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label). Band values range from 0 to 255 (8 bits). We consider 8 classes, or levels, of yield values.

Performance - Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate).

[Chart: 1997 Dataset. Accuracy (%), 40 to 80, vs. Training Set Size (no. of pixels): 256, 1024, 4096, 16384, 65536, 262144. Methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.]

Performance - Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate).

[Chart: 1998 Dataset. Accuracy (%), 20 to 65, vs. Training Set Size (no. of pixels): 256, 1024, 4096, 16384, 65536, 262144. Methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.]

Performance - Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest).

Hint: NEVER use a log scale to show a WIN!!!

[Chart: 1997 Dataset, both axes in logarithmic scale. Per-sample classification time (sec), 0.0001 to 1, vs. Training Set Size (no. of pixels): 256 to 262144. Methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.]

Performance - Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest).

Win-Win situation!! (almost never happens) P-tree CkNN and CkNN-H are more accurate and much faster. kNN-H is not recommended because it is slower and less accurate (because it doesn't use closed neighbor sets and it requires another step to get rid of ties - why do it?). Horizontal kNNs are not recommended because they are less accurate and slower!

[Chart: 1998 Dataset, both axes in logarithmic scale. Per-sample classification time (sec), 0.0001 to 1, vs. Training Set Size (no. of pixels): 256 to 262144. Methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-kNN-max, Closed-kNN using HOBbit distance.]

WALK THRU: 3NN CLASSIFICATION of an unclassified sample, a = (a5,a6,a11,a12,a13,a14) = (0,0,0,0,0,0). HORIZONTAL APPROACH (the relevant attributes are a5, a6, a11, a12, a13, a14).

Note: only 1 of the many training tuples at distance 2 from the sample got to vote. We didn't know that distance 2 was going to be the vote cutoff until the end of the 1st scan. Finding the other distance-2 voters (the Closed 3NN set, or C3NN) requires another scan.

t12 0 0 1 0 1 1 0 2

t13 0 0 1 0 1 0 0 1

t15 0 0 1 0 1 0 1 2

t53 0 0 0 0 1 0 0 1

0 1

Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C  a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0  1      0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0  1      0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0  1      0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0  1      1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0  1      1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0  1      0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1  1      1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1  1      0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1  1      0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1  1      0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0  0      1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0  0      0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0  0      0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0  0      0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1  0      1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0  0      0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0  0      0   1   0   1   0   0   1   1   0   0

WALK THRU of required 2nd scan to find Closed 3NN

set. Does it change vote?

YES! C0 wins now!

Vote after 1st scan.

(The training table - Key, a1..a20 with a10 = C - from the previous slide is repeated here.)

WALK THRU: Closed 3NNC using P-trees

First let all training points at distance 0 vote, then distance 1, then distance 2, ... until ≥ 3 voters are found. For distance 0 (exact matches), construct the P-tree Ps, then AND it with PC and PC' to compute the vote (black denotes complement, red denotes uncomplemented).

[Slide figure: the attribute bit columns for a1..a20 (and their complements, color-coded) and the class column C, over tuples t12..t75. ANDing the columns that match the sample gives Ps = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0: no neighbors at distance 0.]

Construct the P-tree PD(s,1) = OR over i of Pi, where Pi is the mask of tuples t that differ from s in attribute i only (|si − ti| = 1 and sj − tj = 0 for all j ≠ i), i.e., Pi = P(si, complemented) ∧ ⋀(j≠i) P(sj, matched).

WALK THRU: C3NNC distance=1 nbrs, with i ∈ {5,6,11,12,13,14} and j ∈ {5,6,11,12,13,14} − {i}.

[Slide figure: the attribute bit columns for a1..a20 and the class column C over tuples t12..t75, with the relevant columns and complements highlighted. The resulting mask is PD(s,1) = 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0, i.e., t13 and t53 are the distance-1 neighbors.]

WALK THRU: C3NNC distance=2 nbrs.

We now have 3 nearest nbrs. We could quit and declare C1 the winner?

We now have the C3NN set and we can declare C0 the winner!

[Slide figure: the pairwise masks P5,6, P5,11, P5,12, P5,13, P5,14, P6,11, P6,12, P6,13, P6,14, P11,12, P11,13, P11,14, P12,13, P12,14, P13,14 (one for each pair of relevant attributes in which a tuple may differ from the sample), together with the bit columns for a5, a6, a11, a12, a13, a14 and the class column C over tuples t12..t75.]

In the previous example, there were no exact matches (distance=0, or similarity=6, neighbors) for the sample. Two neighbors were found at a distance of 1 (dis=1 or sim=5) and nine dis=2, sim=4 neighbors. All 11 neighbors got an equal vote even though the two sim=5 neighbors are much closer than the nine sim=4 neighbors. Also, processing for the 9 is costly. A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.) As long as we are weighting votes by similarity, we might as well also weight attributes by relevance (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be the correlation of that attribute to the class label). P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-cup competition in '02: http://www.biostat.wisc.edu/craven/kddcup/ ).

Association for Computing Machinery KDD-Cup-02, NDSU Team

Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity). The sample is (0,0,0,0,0,0); the attribute weights of the relevant attributes are their subscripts.

Black is attribute complement, red is uncomplemented. The vote is even simpler than the "equal" vote case. We just note that all tuples vote in accordance with their weighted similarity (if the ai value agrees with that of (0,0,0,0,0,0), then the vote contribution is the subscript of that attribute, else zero). Thus, we can just add up the root counts of each relevant attribute, weighted by their subscript.

[Slide figure: the bit columns for a5, a6, a11, a12, a13, a14 (and their complements) together with the class column C, for tuples t12..t75.]

Class 1 root counts:
rc(PC ∧ Pa5) = 4    rc(PC ∧ Pa6) = 8    rc(PC ∧ Pa11) = 7
rc(PC ∧ Pa12) = 4   rc(PC ∧ Pa13) = 4   rc(PC ∧ Pa14) = 7

C1 vote is 343 = 4·5 + 8·6 + 7·11 + 4·12 + 4·13 + 7·14

Similarly, C0 vote is 258 = 6·5 + 7·6 + 5·11 + 3·12 + 3·13 + 4·14
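In code form the weighted vote is just a dot product of the per-attribute root counts with the attribute weights (the subscripts). A sketch with the numbers above (the C0 root counts are read off the C0 vote line):

# Attribute weights are the subscripts of the relevant attributes (from the slide).
weights = {"a5": 5, "a6": 6, "a11": 11, "a12": 12, "a13": 13, "a14": 14}

# Root counts rc(PC ^ Pa_i) for each class, taken from the slide above.
root_counts = {
    "C1": {"a5": 4, "a6": 8, "a11": 7, "a12": 4, "a13": 4, "a14": 7},
    "C0": {"a5": 6, "a6": 7, "a11": 5, "a12": 3, "a13": 3, "a14": 4},
}

def weighted_vote(rc):
    return sum(weights[a] * rc[a] for a in weights)

votes = {c: weighted_vote(rc) for c, rc in root_counts.items()}
print(votes)                      # {'C1': 343, 'C0': 258}
print(max(votes, key=votes.get))  # 'C1' wins under this weighted-similarity vote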

We note that the Closed Manhattan NN Classifier uses an influence function which is pyramidal. It would be much better to use a Gaussian influence function, but it is much harder to implement.

One generalization of this method to the case of

integer values rather than Boolean, would be to

weight each bit position in a more Gaussian shape

(i.e., weight the bit positions, b, b-1, ..., 0

(high order to low order) using Gaussian weights.

By so doing, at least within each attribute,

influences are Gaussian. We can call this

method, Closed Manhattan Gaussian NN

Classification. Testing the performance of

either CM NNC or CMG NNC would make a great paper

for this course (thesis?). Improving it in some

way would make an even better paper (thesis).

Review of slide 2 (with additions): Database analysis can be broken down into 2 areas, Querying and Data Mining.

Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining.

Machine Learning can be broken down into 2 areas, Clustering and Classification.

Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based.

Classification can be broken down into 2 types, Model-based and Neighbor-based.

- Machine Learning is based on Near Neighbor Set(s), NNS.
- Clustering, even density based, identifies near neighbor cores 1st (round NNSs, ε-disks about a center).
- Classification is continuity based and Near Neighbor Sets (NNS) are the central concept in continuity:
- ∀ε>0 ∃δ>0 such that d(x,a)<δ ⟹ d(f(x),f(a))<ε, where f assigns a class to a feature vector, or
- ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 such that d(x,a)<δ ⟹ f(x)=f(a).

Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).

Finding NNS in a lower dimension may still be the 1st step. E.g., points 1,2,3,4,5,6,7,8 are all within ε of a (the unclassified sample); 1,2,3,4 are red-class and 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace, then taking ε/2, we see that the ε/2-neighborhood about a contains only blue-class (5,6) votes.

Using horizontal data, NNS derivation requires 1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2(n)) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)

Solution (next slide): Circumscribe the desired Euclidean ε-NNS with a few intersections of functional contours, i.e., f⁻¹([b,c]) sets, until the intersection is scannable, then scan it for Euclidean ε-neighborhood membership. Advantage: the intersection can be determined before scanning - create and AND functional contour P-trees.

Functional Contours: Given a function, f: R(A1..An) → Y, y = f(x),
- and any S ⊆ Y, contour(f,S) ≡ f⁻¹(S).

Equivalently, contour(Af,S) = SELECT A1..An FROM R WHERE x.Af ∈ S. Graphically: for any partition {Si} of Y, the contour sets f⁻¹(Si) form a partition of R (a clustering of R).

Examples: a weather map, with f = barometric pressure or temperature and Si = an equi-width partition of the Reals; f = local density (e.g., OPTICS: f = reachability distance, Sk = the partition produced by the intersection points of graph(f), plotted wrt some walk of R, with a horizontal threshold line). A grid is the intersection of dimension-projection contour partitions (next slide for more definitions). A Class is a contour under f: R → ClassAttr wrt the partition {Ci} of ClassAttr (where the Ci are the classes). An L∞ ε-disk about a is the intersection of all ε-dimension-projection contours containing a.

GRIDs

Given f: R → Y and a partition S = {Sk} of Y, {f⁻¹(Sk)} is the S,f-grid of R (grid cells = contours).

If Y = Reals, the j.lo f-grid is produced by agglomerating over the j lo bits of Y, for each fixed (b-j) hi bit pattern. The j lo bits walk the isobars of cells. The b-j hi bits identify cells. (lo = extension / hi = intention)

Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S(b-1,...,b-j)}, where S(b-1,...,b-j) = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1), a partition of Y = Reals. If F = {fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (intersection of partitions). The canonical j.lo grid is the j.lo π-grid, where π = {πd: R → R[Ad]} and πd is the dth coordinate projection.

j-hi gridding is similar (the b-j lo bits walk cell contents / the j hi bits identify cells).

Suppose the horizontal and vertical dimensions have bitwidths 3 and 2 respectively. j.lo and j.hi gridding continued: if horizontal_bitwidth = vertical_bitwidth = b, then the j.lo grid = the (b-j).hi grid; e.g., for hb = vb = b = 3 and j = 2, the 2.lo grid = the 1.hi grid.

[Slide figure: an 8×8 grid with cells labeled 000..111 on both axes, showing the 2.lo grid and the 1.hi grid.]

Similarity NearNeighborSets (SNNS): Given a similarity s: R×R → PartiallyOrderedSet (e.g., the Reals), i.e., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) ∀x,y ∈ R, and given any C ⊆ R:

The Ordinal disks, skins and rings are:
  disk(C,k) ⊇ C with |disk(C,k) − C| = k and s(x,C) ≥ s(y,C) ∀x ∈ disk(C,k), y ∉ disk(C,k)
  skin(C,k) = disk(C,k) − C ("skin" comes from "s k immediate neighbors"; it is a kNNS of C)
  ring(C,k) = cskin(C,k) − cskin(C,k−1)
  closeddisk(C,k) ≡ ∪ of all disk(C,k); closedskin(C,k) ≡ ∪ of all skin(C,k)

The Cardinal disks, skins and rings are (PartiallyOrderedSet = Reals):
  disk(C,r) ≡ {x ∈ R | s(x,C) ≥ r}; also the functional contour f⁻¹([r,∞)), where f(x) = sC(x) = s(x,C)
  skin(C,r) ≡ disk(C,r) − C
  ring(C,r2,r1) ≡ disk(C,r2) − disk(C,r1) ≡ skin(C,r2) − skin(C,r1); also the functional contour sC⁻¹((r1,r2])

Note: closeddisk(C,r) is redundant, since all r-disks are closed, and closeddisk(C,k) = disk(C,s(C,y)) where y = the kth NN of C.

L∞ skins: skin∞(a,k) = {x | ∀d, xd is one of the k-NNs of ad} (a local normalizer?)

Ptrees

Partition tree: R is partitioned into C1 ... Cn; each Ci is partitioned into Ci,1 ... Ci,ni; . . .

Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization - vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but network multicast overhead is huge. Use active networking? Clusters of Playstations? ...

Formally, P-trees are defined as any of the following:

Partition-tree: a tree of nested partitions (a partition P(R) = {C1..Cn}; each component is partitioned by P(Ci) = {Ci,1..Ci,ni}, i = 1..n; each of those components is partitioned by P(Ci,j) = {Ci,j,1..Ci,j,nij}; ...)

- Predicate-tree: for a predicate on the leaf-nodes of a partition-tree (also induces predicates on interior nodes using quantifiers)
- Predicate-tree nodes can be truth-values (Boolean P-tree); they can be quantified existentially (≥ 1, or a threshold %) or universally; Predicate-tree nodes can be counts of true leaf children of that component (Count P-tree)
- Purity-tree: universally quantified Boolean-Predicate-tree (e.g., if the predicate is <1>, a Pure1-tree or P1tree)
- A 1-bit at a node iff the corresponding component is pure1 (universally quantified)
- There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1trees.
- All P-trees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering) (e.g., used for 2-D images), 3-D (recursively eighth-ing), or based on purity runs or LZW-runs or ...

Further observations about P-trees: Partition-trees have set nodes. Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree). Purity-trees, being universally quantified Boolean-Predicate-trees, have Boolean nodes (since the count is always the full count of leaves, expressing Purity-trees as count-trees is redundant). A Partition-tree can be sliced at a level if each partition is labeled with the same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.

The partitions used to create P-trees can come from functional contours. (Note: there is a natural duality between partitions and functions, namely a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain.) In Functional Contour terms (i.e., f⁻¹(S) where f: R(A1..An) → Y, S ⊆ Y), the uncompressed P-tree or uncompressed Predicate-tree 0Pf,S = the bitmap of the set-containment predicate: 0Pf,S(x) = true iff x ∈ f⁻¹(S).

0Pf,S is, equivalently, the existential R-bit map of the predicate R.Af ∈ S.

The Compressed P-tree, sPf,S, is the compression of 0Pf,S with equi-width leaf size, s, as follows:
1. Choose a walk of R (converts 0Pf,S from a bit map to a bit vector).
2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short).
3. Eliminate, and mask to 0, all pure-zero segments (call the mask the NotPure0 Mask, or EM).
4. Eliminate, and mask to 1, all pure-one segments (call the mask the Pure1 Mask, or UM).
(EM = existential aggregation; UM = universal aggregation)

Compressing each leaf of sPf,S with leafsize = s2 gives s1,s2Pf,S. Recursively, s1,s2,s3Pf,S; s1,s2,s3,s4Pf,S; ... (this builds an EM tree and a UM tree).
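A small Python sketch of steps 1-4 for one level of this compression, assuming a fixed left-to-right walk of R (EM marks not-pure-zero segments, UM marks pure-one segments):

def compress_level(bitvector, leafsize):
    """One level of P-tree compression: returns (EM mask, UM mask, retained segments)."""
    em, um, leaves = [], [], []
    for start in range(0, len(bitvector), leafsize):
        seg = bitvector[start:start + leafsize]      # the last segment can be short
        if not any(seg):
            em.append(0); um.append(0)               # pure-zero: eliminated, EM bit 0
        elif all(seg):
            em.append(1); um.append(1)               # pure-one: eliminated, UM bit 1
        else:
            em.append(1); um.append(0)
            leaves.append(seg)                       # mixed segment is retained
    return em, um, leaves

bits = [0, 0, 0, 0, 1, 0, 1, 1]
em, um, leaves = compress_level(bits, leafsize=2)
print(em)      # [0, 0, 1, 1]  (existential aggregation)
print(um)      # [0, 0, 0, 1]  (universal aggregation)
print(leaves)  # [[1, 0]]      only the mixed segment is kept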

BASIC P-trees: If Ai is Real or Binary and fi,j(x) ≡ the jth bit of xi, then {(..)Pfi,j,1 ≡ (..)Pi,j : j = b..0} are the basic (..)P-trees of Ai (leafsizes s1..sk). If Ai is Categorical and fi,a(x) = 1 if xi = a, else 0, then {(..)Pfi,a,1 ≡ (..)Pi,a : a ∈ R[Ai]} are the basic (..)P-trees of Ai.

Notes: The UM masks (e.g., of 2k,...,20Pi,j, with k = roof(log2|R|)) form a (binary) tree. Whenever the EM bit is 1, that entire subtree can be eliminated (since it represents a pure0 segment); then a 0-node at level k (lowest level = level 0) with no sub-tree indicates a 2^k-run of zeros. In this construction, the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.

Example functionals: Total Variation (TV) functionals.

TV(a) = Σ(x∈R) (x−a)∘(x−a). Using d as an index variable over the dimensions, this is Σ(x∈R) Σ(d=1..n) (xd² − 2·ad·xd + ad²), with i, j, k as bit-slice indexes:

TV(a) = Σ(x,d,i,j) 2^(i+j)·xd,i·xd,j + |R|·( −2·Σd ad·μd + Σd ad·ad )

Note that the first term does not depend upon a. Thus, the derived attribute TV−TV(μ) (eliminate the 1st term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).

From equation 7, the Normalized Total Variation is
  NTV(a) ≡ TV(a)−TV(μ) = |R|·( −2·Σd(ad·μd − μd·μd) + Σd(ad·ad − μd·μd) ) = |R|·|a−μ|²

Thus there is a simpler function which gives us circular contours, the Log Normal TV functional:
  LNTV(a) = ln( NTV(a) ) = ln( TV(a)−TV(μ) ) = ln|R| + ln|a−μ|²

The value of LNTV(a) depends only on the length of a−μ, so its isobars are hyper-circles centered at μ.

The graph of LNTV is a log-shaped hyper-funnel. For an ε-contour ring (radius ε about a), go inward and outward along a−μ by ε to the points: inner point, b ≡ μ + (1 − ε/|a−μ|)(a−μ), and outer point, c ≡ μ + (1 + ε/|a−μ|)(a−μ).

Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval. Then we use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).

Use the circumscribing Aπd-contour (note: Aπd is not a derived attribute at all, but just Ad, so we already have its basic P-trees) if the LNTV circumscribing contour of a is still too populous.

- As pre-processing, calculate basic P-trees for the LNTV derived attribute
- (or another hyper-circular contour derived attribute).
- To classify a:
- 1. Calculate b and c (they depend on a and ε).
- 2. Form the mask P-tree for training points with LNTV-values in [LNTV(b), LNTV(c)].
- 3. Use that P-tree to prune out the candidate NNS.
- If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using dimension projections.

We can also note that LNTV can be further simplified (retaining the same contours) using h(a) = |a−μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., hb(a) = |a−b|.

[Figure: the graph of LNTV(x) over (x1, x2), with LNTV(b) and LNTV(c) marking the vertical interval, the inner and outer points b and c, and the contour of the dimension projection f(a) = a1.]

Graphs of functionals with hyper-circular contours.

Angular Variation functionals: e.g., AV(a) ≡ (1/|a|) Σ(x∈R) x∘a. With d as an index over the dimensions,
  = (1/|a|) Σ(x∈R) Σ(d=1..n) xd·ad = (1/|a|) Σd (Σx xd)·ad   (factor out ad)

COSμ(a) ≡ AV(a) / (|μ|·|R|) = μ∘a / (|μ|·|a|) = cos(μ∠a)

COSμ (and AV) have hyper-conic isobars centered on μ. COSμ and AV have ε-contour(a) = the space between two hyper-cones centered on μ which just circumscribes the Euclidean ε-hyperdisk at a. Intersection (shown in pink on the slide) with the LNTV ε-contour.

Graphs of functionals with hyper-conic contours: e.g., COSb(a) for any vector b.

fx(a) = (x−a)∘(x−a) = Σ(d=1..n) (xd² − 2·ad·xd + ad²), with d an index over the dims and i, j, k bit-slice indexes.

Adding up the Gaussian-weighted votes for class c: collect the diagonal terms inside the exp. For the Σ(i,j,d) inside the exp, the coefficients are multiplied by a 1/0-bit (which depends on x); for fixed i, j, d, the coefficient is either x-independent (if the bit is 1) or not (if the bit is 0).

Some additional formulas:
  Σ(d,i,j) 2^(i+j) ( xd,i·xd,j − 2·ad,i·xd,j + ad,i·ad,j ) = Σ(d,i,j) 2^(i+j) ( xd,i − ad,i )( xd,j − ad,j )
  Σi 2^i ( xd,i − ad,i ) = Σ(i: ad,i=0) 2^i·xd,i − Σ(i: ad,i=1) 2^i·x'd,i

fd(a)(x) = |xd − ad|.

Thus, for the derived attribute fd(a) = the numeric distance of xd from ad: if we remember that when ad,i = 1 we subtract those contributing powers of 2 (don't add), and that we use the complemented dimension-d basic P-trees, then it should work. The point is that we can get a set of near-basic or negative-basic P-trees, nbPtrees, for the derived attribute fd(a) directly from the basic P-trees for Ad, for free. Thus the near-basic P-trees for fd(a) are the basic Ad P-trees for those bit positions where ad,i = 0, and they are the complements of the basic Ad P-trees for those bit positions where ad,i = 1 (called fd(a)'s nbPtrees). Caution: subtract the contribution of the nbPtrees for positions where ad,i = 1. Note: nbPtrees are not predicate trees (are they? what's the predicate?). The EIN ring formulas are related to this - how? If we are simply after easy pruning contours containing a (so that we can scan to get the actual Euclidean epsilon nbrs and/or to get Gaussian weighted vote counts), we can use Hobbit-type contours (middle-earth contours of a?). See the next slide for a discussion of hobbit contours.

A principle: A job is not done until the Mathematics is completed. The Mathematics of a research job includes:
0. Getting to the frontiers of the area (researching, organizing, understanding and integrating everything others have done in the area up to the present moment, and what they are likely to do next).
1. Developing a killer idea for a better way to do something.
2. Proving claims (theorems, performance evaluation, simulation, etc.).
3. Simplification (everything is simple once fully understood).
4. Generalization (to the widest possible application scope), and
5. Insight (what are the main issues and underlying mega-truths (with full drill down)).

Therefore, we need to ask the following questions at this point:

Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the "point closest to the mean" definition is influenced by skewness, like the mean)? We will denote the vector of medians as μ̃. h(a) = |a−μ̃| is an important functional (better than h(a) = |a−μ|?). If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns μ and μ̃ coincide. (So if μ and μ̃ are far apart, that tells us there is high skew in the data, and the coordinates where they differ are the columns where the skew is found.)

Additional Mathematics to enjoy:

What about the vector of standard deviations, σ? (computable with P-trees!) Do we have an improvement of BIRCH here - generating similar comprehensive statistical measures, but much faster and more focused? We can do the same for any rank statistic (or order statistic), e.g., the vector of 1st or 3rd quartiles, Q1 or Q3, or the vector of kth rank values (kth ordinal values). If we preprocessed to get the basic P-trees of μ̃ and each mixed quartile vector (e.g., in 2-D add 5 new derived attributes: μ̃, Q1,1, Q1,2, Q2,1, Q2,2, where Qi,j is the ith quartile of the jth column), what does this tell us (e.g., what can we conclude about the location of core clusters)? Maybe all we need is the basic P-trees of the column quartiles, Q1..Qn?

- L∞ ordinal disks:
- disk∞(C,k) = {x | xd is one of the k-Nearest Neighbors of ad, ∀ d}.
- skin∞(C,k), closed skin∞(C,k) and ring∞(C,k) are defined as above.

Are they easy P-tree computations? Do they offer advantages? When? What? Why? E.g., do they automatically normalize for us?

The Middle Earth Contours of a are gotten by ANDing in the basic P-tree for bit positions where ad,i = 1 and ANDing in the complement where ad,i = 0 (down to some bit-position threshold in each dimension, bptd; bptd can be the same for each d, or not).

Caution: Hobbit contours of a are not symmetric about a. That becomes a problem (for knowing when you have a symmetric nbrhd in the contour), especially when many lowest-order bits of a are identical (e.g., if ad = 8 = 1000). If the low-order bits of ad are zeros, one should union (OR) in the Hobbit contour of ad − 1 (e.g., for 8 = 1000, also take 7 = 0111). If the low-order bits of ad are ones, one should union (OR) in the Hobbit contour of ad + 1 (e.g., for 7 = 0111, also take 8 = 1000).

Some needed research: Since we are looking for an easy prune to get our mask down to a scannable size (low root count), but not so much of a prune that we have t