Data Mining 2 — Transcript and Presenter's Notes
1
Data Mining 2
  • Data Mining is one aspect of Database Query Processing (on the "what if" or
    pattern-and-trend end of query processing, rather than the "please find" or
    straightforward end).
  • To say it another way, data mining queries are on the ad hoc or unstructured
    end of the query spectrum, rather than the standard report generation or
    "retrieve all records matching a criterion" (SQL) side.
  • Still, Data Mining queries ARE queries and are processed (or will eventually
    be processed) by a Database Management System the same way queries are
    processed today, namely:
  • 1. SCAN and PARSE (SCANNER-PARSER): a Scanner identifies the tokens or
    language elements of the DM query; the Parser checks for syntax or grammar
    validity.
  • 2. VALIDATE: the Validator checks for valid names and semantic correctness.
  • 3. CONVERT: the Converter converts the query to an internal representation.
  • 4. QUERY OPTIMIZE: the Optimizer devises a strategy for executing the DM
    query (chooses among alternative internal representations).
  • 5. CODE GENERATION: generates code to implement each operator in the
    selected DM query plan (the optimizer-selected internal representation).
  • 6. RUNTIME DATABASE PROCESSING: run the plan code.
  • Developing new, efficient and effective Data Mining Query (DMQ) processors
    is the central need and issue in DBMS research today (far and away!).
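To make the six phases concrete, here is a minimal, hypothetical Python sketch of the pipeline as chained functions; every name and body below is a placeholder standing in for real DBMS components, not anything from the slides.

```python
# A minimal, hypothetical sketch of the six-stage pipeline described above.

def scan_and_parse(dmq_text):        # 1. tokenize and check grammar
    return {"parse_tree": dmq_text.split()}

def validate(parse_tree):            # 2. check names / semantic correctness
    return parse_tree

def convert(parse_tree):             # 3. internal representation
    return {"logical_plan": parse_tree}

def optimize(logical_plan):          # 4. choose among alternative plans
    return {"physical_plan": logical_plan}

def generate_code(physical_plan):    # 5. code for each plan operator
    return lambda: f"executing {physical_plan}"

def run(plan_code):                  # 6. runtime database processing
    return plan_code()

if __name__ == "__main__":
    text = "FIND frequent itemsets WITH support >= 0.5"
    print(run(generate_code(optimize(convert(validate(scan_and_parse(text)))))))
```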

2
Database analysis can be broken down into 2 areas, Querying and Data Mining.
Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining.
Machine Learning can be broken down into 2 areas, Clustering and Classification.
Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based.
Classification can be broken down into 2 types, Model-based and Neighbor-based.
  • Machine Learning is almost always based on Near Neighbor Set(s), NNS.
  • Clustering, even density-based, identifies near neighbor cores 1st (round NNSs, ε-disks about a center).
  • Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity (restated more formally below):
  • ∀ε>0 ∃δ>0 such that d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector, or
  • ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 such that d(x,a)<δ ⇒ f(x)=f(a).
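For reference, the continuity condition the slides are appealing to, written out in standard ε-δ form (LaTeX):

```latex
% f assigns a class (or value) to each feature vector; continuity of f at a:
\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\text{ such that }\;
d(x,a) < \delta \;\Rightarrow\; d\big(f(x), f(a)\big) < \varepsilon .
% When f(\mathrm{Dom}) is categorical (a class label), this becomes:
\exists \delta > 0 \;\text{ such that }\; d(x,a) < \delta \;\Rightarrow\; f(x) = f(a).
```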

Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).
Finding the NNS in a lower dimension may still be the 1st step. E.g., points 1,2,...,8 are all within distance ε of the unclassified sample a; 1,2,3,4 are red-class, 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace and then taking ε/2, we see that the ε/2-neighborhood about a contains only blue-class (5,6) votes.
Using horizontal data, NNS derivation requires 1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2(n)) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)
3
Association Rule Mining (ARM)
  • Assume a relationship between two entities:
  • T (e.g., a set of Transactions an enterprise performs) and
  • I (e.g., a set of Items which are acted upon by those transactions).
  • In Market Basket Research (MBR), a transaction is a checkout transaction and an item is an item in that customer's market basket going through checkout.
  • An I-Association Rule, A⇒C, relates 2 disjoint subsets of I (I-itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent).

The support of an I-set, A, is the fraction of T-instances related to every I-instance in A, e.g., if A = {i1,i2} and C = {i4}, then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. (Note: | | means set size, or count of elements in the set.) I.e., t2 and t4 are the only transactions from the total transaction set, T = {t1,t2,t3,t4,t5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time).
The support of a rule, A⇒C, is defined as supp(A∪C) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5.
The confidence of a rule, A⇒C, is supp(A∪C) / supp(A) = (2/5) / (2/5) = 1.
DM queriers typically want STRONG RULES: supp ≥ minsupp, conf ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A⇒C) is also just the conditional probability of t being related to C, given that t is related to A.
There are also the dual concepts of T-association rules (just reverse the roles of T and I above).
Examples of Association Rules include: In MBR, the relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought by that customer during that cash-register transaction). In Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module i is part of aspect t). In Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene i expresses at a threshold level during experiment t). In ER diagramming, any part-of relationship in which i∈I is part of t∈T (t is related to i iff i is part of t), and any ISA relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . .
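To make the definitions concrete, here is a small Python sketch that computes support and confidence from a transaction-to-itemset mapping. The transaction contents not pinned down by the example above (t1, t3, t5) are made up for illustration.

```python
# Transactions related to items, consistent with the slide's example:
# A = {i1, i2}, C = {i4}; t2 and t4 are related to i1, i2 and i4.
transactions = {
    "t1": {"i1", "i3"},
    "t2": {"i1", "i2", "i4"},
    "t3": {"i3", "i5"},
    "t4": {"i1", "i2", "i4"},
    "t5": {"i2", "i5"},
}

def supp(itemset):
    """Fraction of transactions related to every item in itemset."""
    matches = [t for t, items in transactions.items() if itemset <= items]
    return len(matches) / len(transactions)

def conf(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return supp(antecedent | consequent) / supp(antecedent)

A, C = {"i1", "i2"}, {"i4"}
print(supp(A))      # 2/5 = 0.4
print(supp(A | C))  # 2/5 = 0.4
print(conf(A, C))   # 1.0
```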
4
Finding Strong Assoc Rules
  • The relationship between Transactions and Items can be expressed in a
    Transaction Table, where each transaction is a row containing its ID and the
    list of the items that are related to that transaction.

Or the Transaction Table items can be expressed using item bit vectors:

TID   A B C D E F
2000  1 1 1 0 0 0
1000  1 0 1 0 0 0
4000  1 0 0 1 0 0
5000  0 1 0 0 1 1

If minsupp is set by the querier at .5 and minconf at .75:
To find frequent or Large itemsets (support ≥ minsupp):
  • PseudoCode (assume the items in Lk-1 are ordered; a runnable sketch of both steps appears at the end of this slide):
  • Step 1: self-joining Lk-1
  •   insert into Ck
  •   select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  •   from Lk-1 p, Lk-1 q
  •   where p.item1=q.item1, ..., p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
  • Step 2: pruning
  •   forall itemsets c in Ck do
  •     forall (k-1)-subsets s of c do
  •       if (s is not in Lk-1) then delete c from Ck

FACT: Any subset of a large itemset is large. Why? (E.g., if {A,B} is large, {A} and {B} must be large.)
APRIORI METHOD: Iteratively find the large k-itemsets, k = 1, 2, .... Then find all association rules supported by each large itemset.
Ck denotes the candidate k-itemsets generated at each step; Lk denotes the Large k-itemsets.
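A minimal Python sketch of the two pseudocode steps above (self-join, then prune), assuming itemsets are represented as sorted tuples. This illustrates classical Apriori candidate generation, not the P-tree implementation discussed later.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets C_k from the large (k-1)-itemsets L_prev.
    Itemsets are sorted tuples; L_prev is a set of such tuples."""
    k_minus_1 = len(next(iter(L_prev)))
    # Step 1: self-join -- two (k-1)-itemsets that agree on their first k-2 items
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune -- drop any candidate with an infrequent (k-1)-subset
    pruned = set()
    for c in candidates:
        if all(s in L_prev for s in combinations(c, k_minus_1)):
            pruned.add(c)
    return pruned

# Example: L2 = {{1,3},{2,3},{2,5},{3,5}}  ->  C3 = {{2,3,5}}
L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
print(apriori_gen(L2))  # {(2, 3, 5)}
```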
5
Vertical basic binary Predicate-tree (P-tree): vertically partition the table, then compress each vertical bit slice into a basic binary P-tree as follows.
Ptree Review: A data table, R(A1..An), containing horizontal structures (records) is processed vertically (vertical scans), then processed using multi-operand logical ANDs.
R11 0 0 0 0 1 0 1 1
The basic binary P-tree, P1,1, for R11 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity is reached.
(If a half is already pure — here pure0 — that branch ends.)
6
R11 0 0 0 0 1 0 1 1
Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient.
Bottom-up construction of P11 is done using in-order tree traversal and the collapsing of pure siblings, as follows:
[Figure: bottom-up construction of P11 from the bit columns R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 of R; pure sibling nodes are collapsed as the traversal proceeds.]
7
Processing Efficiencies? (prefixed leaf-sizes have been removed)

R(A1 A2 A3 A4)
2 7 6 1
6 7 6 0
2 7 5 1
2 7 5 7
5 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

The 2^1 level has the only 1-bit, so the 1-count = 1·2^1 = 2.
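As an illustration of the basic idea (our own sketch, not the authors' implementation), here is a small Python routine that builds a pure1 P-tree bottom-up from one bit column by pairwise collapsing of pure siblings, and computes the 1-count from the resulting tree. It assumes the number of bits is a power of 2.

```python
# A node is either the string "0" / "1" (a pure subtree) or a tuple
# (left, right) of child nodes.

def build_p1tree(bits):
    """Bottom-up construction: pair siblings, collapsing pure-equal pairs."""
    level = [str(b) for b in bits]            # leaves
    while len(level) > 1:
        nxt = []
        for left, right in zip(level[0::2], level[1::2]):
            if left == right and left in ("0", "1"):
                nxt.append(left)              # pure siblings collapse
            else:
                nxt.append((left, right))
        level = nxt
    return level[0]

def one_count(node, width):
    """Number of 1-bits covered by this node (width = leaves under it)."""
    if node == "1":
        return width
    if node == "0":
        return 0
    left, right = node
    return one_count(left, width // 2) + one_count(right, width // 2)

# R11 from the slides: 0 0 0 0 1 0 1 1
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
tree = build_p1tree(r11)
print(tree)                 # ('0', (('1', '0'), '1'))
print(one_count(tree, 8))   # 3
```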
8
Database D

Example ARM using uncompressed P-trees (note I
have placed the 1-count at the root of each Ptree)

TID 1 2 3 4 5
100 1 0 1 1 0
200 0 1 1 0 1
300 1 1 1 0 1
400 0 1 0 0 1

9
L1, L2, L3
1-ItemSets don't support Association Rules (they would have no antecedent or no consequent).
2-ItemSets do support ARs.
Are there any Strong Rules supported by Large 2-ItemSets (at minconf = .75)?
{1,3}: conf(1⇒3) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG
       conf(3⇒1) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf(2⇒3) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf(3⇒2) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf(2⇒5) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG!
       conf(5⇒2) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG!
{3,5}: conf(3⇒5) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf(5⇒3) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Are there any Strong Rules supported by Large 3-ItemSets?
{2,3,5}: conf(2,3⇒5) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG!
         conf(2,5⇒3) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
         No subset antecedent can yield a strong rule either (i.e., no need to check conf(2⇒3,5) or conf(5⇒2,3), since both denominators will be at least as large and therefore both confidences will be at least as low).
         conf(3,5⇒2) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
         No need to check conf(3⇒2,5) or conf(5⇒2,3).
DONE!
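A small Python sketch of this rule-generation phase (helper names are ours, not the slides'): given the support counts from database D above, enumerate A ⇒ (S−A) for a large itemset S and keep the rules meeting minconf.

```python
from itertools import combinations

# Support counts taken from the example database D above.
supp = {(1,): 2, (2,): 3, (3,): 3, (5,): 3,
        (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2,
        (2, 3, 5): 2}

def strong_rules(large_itemset, minconf):
    """All rules A => (S - A) from one large itemset S with conf >= minconf."""
    s = tuple(sorted(large_itemset))
    rules = []
    for r in range(1, len(s)):
        for antecedent in combinations(s, r):
            conf = supp[s] / supp[antecedent]
            if conf >= minconf:
                consequent = tuple(x for x in s if x not in antecedent)
                rules.append((antecedent, consequent, conf))
    return rules

print(strong_rules((1, 3), 0.75))     # [((1,), (3,), 1.0)]
print(strong_rules((2, 5), 0.75))     # [((2,), (5,), 1.0), ((5,), (2,), 1.0)]
print(strong_rules((2, 3, 5), 0.75))  # [((2, 3), (5,), 1.0)]
```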
10
Ptree-ARM versus Apriori on aerial photo (RGB) data together with yield data
  • P-ARM compared to Horizontal Apriori (classical) and FP-growth (an improvement of it).
  • In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness).
  • Aerial TIFF images (R,G,B) with synchronized yield (Y).

[Charts: scalability with number of transactions; scalability with support threshold]
  • Identical results.
  • P-ARM is more scalable for lower support thresholds.
  • The P-ARM algorithm is more scalable to large spatial datasets.
  • 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).

11
P-ARM versus FP-growth (see the literature for its definition)
17,424,000 pixels (transactions)
[Charts: scalability with support threshold; scalability with number of transactions]
  • FP-growth: an efficient, tree-based frequent pattern mining method (details later).
  • For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance.
  • P-ARM achieves better performance in the case of low support thresholds.

12
Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail):
  • Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list).
  • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
  • Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness.
  • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
  • The core of the Apriori algorithm:
  • Use only large (k−1)-itemsets to generate candidate large k-itemsets.
  • Use database scans and pattern matching to collect counts for the candidate itemsets.
  • The bottleneck of Apriori: candidate generation.
  • 1. Huge candidate sets: 10^4 large 1-itemsets may generate ~10^7 candidate 2-itemsets. To discover a large pattern of size 100, e.g., {a1,...,a100}, we need to generate 2^100 ≈ 10^30 candidates.
  • 2. Multiple scans of the database (needs n+1 scans, where n = the length of the longest pattern).
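As a minimal illustration of the transaction-reduction idea above (our own sketch, assuming horizontal transaction sets), transactions that contain no frequent k-itemset are dropped before the next scan:

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k_itemsets, k):
    """Transaction reduction: a transaction containing no frequent k-itemset
    cannot contain any frequent (k+1)-itemset, so drop it from later scans."""
    return [t for t in transactions
            if any(c in frequent_k_itemsets
                   for c in combinations(sorted(t), k))]

# Transactions from the example database D, plus one hypothetical useless transaction {4}.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {4}]
L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
print(reduce_transactions(D, L2, 2))
# [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   ({4} is dropped)
```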

13
Classification
Using a Training Data Set (TDS) in which each feature tuple is already classified (has a class value attached to it in the class column, called its class label):
1. Build a model of the TDS (called the TRAINING PHASE).
2. Use that model to classify unclassified feature tuples (unclassified samples). E.g., TDS = last year's aerial image of a crop field (the feature columns are the R,G,B columns together with last year's crop yields attached in a class column, e.g., class values = Hi, Med, Lo yield). Unclassified samples are the RGB tuples from this year's aerial image.
3. Predict the class of each unclassified tuple (in the example, predict the yield for each point in the field).
  • 3 steps: Build a Model of the TDS feature-to-class relationship, Test that Model, Use the Model (to predict the most likely class of each unclassified sample). Note other names for this process: regression analysis, case-based reasoning, ...
  • Other Typical Applications:
  • Targeted Product Marketing (the so-called classical Business Intelligence problem)
  • Medical Diagnosis (the so-called Computer Aided Diagnosis, or CAD)
  • Nearest Neighbor Classifiers (NNCs) use a portion of the TDS as the model (neighboring tuples vote); finding the neighbor set is much faster than building other models, but it must be done anew for each unclassified sample. (NNC is called a lazy classifier because it gets lazy and doesn't take the time to build a concise model of the relationship between feature tuples and class labels ahead of time.)
  • Eager Classifiers (all other classifiers) build 1 concise model once and for all, then use it for all unclassified samples. The model building can be very costly, but that cost can be amortized over all the classifications of a large number of unclassified samples (e.g., all RGB points in a field).

14
Eager Classifiers
15
Test Process (2): Usually some of the Training Tuples are set aside as a Test Set and, after a model is constructed, the Test Tuples are run through the Model. The Model is acceptable if, e.g., the percentage correct is > 60%. If not, the Model is rejected (never used).

Correct classifications?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Associate Prof  5      yes
Joseph   Assistant Prof  7      no

Correct = 3, Incorrect = 1  →  75%

Since 75% is above the acceptability threshold, accept the model!
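A minimal sketch of this test step in Python. The model rule below is an assumption (the slide does not show the learned model); it is simply a plausible rule that reproduces the 3-correct / 1-incorrect = 75% outcome on these test tuples.

```python
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Associate Prof", 5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "no"),
]

def model(rank, years):
    """Hypothetical learned rule: Associate Profs are predicted tenured."""
    return "yes" if rank == "Associate Prof" else "no"

correct = sum(model(rank, years) == tenured
              for _, rank, years, tenured in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")   # 75% -> accept the model (threshold 60%)
```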
16
Classification by Decision Tree Induction
  • Decision tree (instead of a simple case statement
    of rules, the rules are prioritized into a tree)
  • Each Internal node denotes a test or rule on an
    attribute (test attribute for that node)
  • Each Branch represents an outcome of the test
    (value of the test attribute)
  • Leaf nodes represent class label decisions
    (plurality leaf class is predicted class)
  • Decision tree model development consists of two
    phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Decision tree use: classify unclassified samples by filtering them down the
    decision tree to their proper leaf, then predict the plurality class of that
    leaf (often there is only one class, depending upon the stopping condition
    of the construction phase).

17
Algorithm for Decision Tree Induction
  • Basic ID3 algorithm (a simple greedy top-down
    algorithm)
  • At start, the current node is the root and all
    the training tuples are at the root
  • Repeat, down each branch, until the stopping
    condition is true
  • At current node, choose a decision attribute
    (e.g., one with largest information gain).
  • Each value for that decision attribute is
    associated with a link to the next level down and
    that value is used as the selection criterion of
    that link.
  • Each new level produces a partition of the parent
    training subset based on the selection value
    assigned to its link.
  • stopping conditions
  • When all samples for a given node belong to the
    same class
  • When there are no remaining attributes for
    further partitioning majority voting is
    employed for classifying the leaf
  • When there are no samples left
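A compact Python sketch of this greedy induction loop (entropy-based information gain over categorical attributes); a simplified illustration under those assumptions, not the slides' own code.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from partitioning on attribute index attr."""
    total, n = entropy(labels), len(rows)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all samples in the same class
        return labels[0]
    if not attrs:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attrs if a != best])
    return tree

# Tiny made-up example: attribute 0 = rank, attribute 1 = years-band
rows = [("Assistant", "lo"), ("Associate", "hi"), ("Associate", "lo"), ("Assistant", "hi")]
labels = ["no", "yes", "yes", "no"]
print(id3(rows, labels, [0, 1]))   # e.g. {(0, 'Assistant'): 'no', (0, 'Associate'): 'yes'}
```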

18
Bayesian Classification (eager: the model is based on conditional probabilities; prediction is done by taking the most highly conditionally probable class)
  • A Bayesian classifier is a statistical classifier, which is based on the following theorem, known as Bayes' theorem:
  • Bayes' theorem:
  • Let X be a data sample whose class label is unknown.
  • Let H be the hypothesis that X belongs to class H.
  • P(H|X) is the conditional probability of H given X, and P(H) is the prior probability of H. Then
  • P(H|X) = P(X|H)·P(H) / P(X)

19
Naïve Bayesian Classification
  • Given a training set, R(f1..fn, C), where C = {C1..Cm} is the class label attribute.
  • A Naïve Bayesian Classifier will predict the class of an unknown data sample, X = (x1..xn), to be the class Cj having the highest conditional probability, conditioned on X.
  • That is, it will predict the class to be Cj iff P(Cj|X) ≥ P(Ci|X) for all i ≠ j (a tie-handling algorithm may be required).
  • From Bayes' theorem: P(Cj|X) = P(X|Cj)·P(Cj) / P(X).
  • P(X) is constant for all classes, so we need only maximize P(X|Cj)·P(Cj).
  • The P(Ci)'s are known.
  • To reduce the computational complexity of calculating all the P(X|Cj)'s, the naïve assumption is class conditional independence: P(X|Ci) is the product of the P(xi|Ci)'s.
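A minimal Python sketch of this classifier for categorical features (our illustration, with made-up data; smoothing omitted for brevity):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Cj) and P(x_i = v | Cj) from categorical training data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # (class, feature index) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(c, i)][v] += 1
    return prior, cond, len(labels)

def predict_nb(x, prior, cond, n):
    """Pick the class maximizing P(X|Cj)P(Cj) under the independence assumption."""
    best_class, best_score = None, -1.0
    for c, count_c in prior.items():
        score = count_c / n
        for i, v in enumerate(x):
            score *= cond[(c, i)][v] / count_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
prior, cond, n = train_nb(rows, labels)
print(predict_nb(("rain", "mild"), prior, cond, n))   # 'yes'
```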

20
Neural Network Classification
  • A Neural Network is trained to make the
    prediction
  • Advantages
  • prediction accuracy is generally high
  • it is generally robust (works when training
    examples contain errors)
  • output may be discrete, real-valued, or a vector
    of several discrete or real-valued attributes
  • It provides fast classification of unclassified
    samples.
  • Criticism:
  • it is difficult to understand the learned function (it involves complex and almost magic weight adjustments)
  • it is difficult to incorporate domain knowledge
  • long training time (for large training sets, it is prohibitive!)

21
A Neuron
  • The input feature vector x = (x0..xn) is mapped into the variable y by means
    of the scalar product with a weight vector, a nonlinear function mapping f
    (called the damping or activation function), and a bias, θ:
    y = f( Σ_i w_i x_i − θ ).

22
Neural Network Training
  • The ultimate objective of training:
  • obtain a set of weights that makes almost all the tuples in the training
    data classify correctly (usually using a time-consuming "back propagation"
    procedure which is based, ultimately, on Newton's method; see the literature
    or Other Materials - 10datamining.html for examples and alternate training
    techniques).
  • Steps
  • Initialize weights with random values
  • Feed the input tuples into the network
  • For each unit
  • Compute the net input to the unit as a linear
    combination of all the inputs to the unit
  • Compute the output value using the activation
    function
  • Compute the error
  • Update the weights and the bias
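To make the per-unit computation concrete, here is a tiny Python sketch of one training step for a single sigmoid unit (forward pass, error term, weight/bias update); a schematic illustration of the steps above, not the slides' network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, target, w, bias, lr=0.5):
    """One pass of the steps above for a single sigmoid unit."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + bias   # net input (linear combination)
    out = sigmoid(net)                                  # output via activation function
    err = (target - out) * out * (1 - out)              # error term (sigmoid derivative)
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]    # update weights
    bias = bias + lr * err                              # update bias
    return w, bias, out

w, b = [0.1, -0.2], 0.0                                 # initialize with small values
for _ in range(1000):                                   # feed the tuple repeatedly
    w, b, out = train_step([1.0, 0.0], target=1.0, w=w, bias=b)
print(round(out, 3))                                    # approaches 1.0
```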

23
Neural Multi-Layer Perceptron
[Figure: multi-layer perceptron — an input vector x_i feeds the input nodes, weights w_ij connect them to hidden nodes, and the output nodes produce the output vector.]
24
These next 3 slides treat the concept of Distance in great detail. You may feel you don't need this much detail; if so, skip what you feel you don't need. For Nearest Neighbor Classification, a distance is needed (to make sense of "nearest"); other classifiers also use distance.
A distance is a function, d, applied to two n-dimensional points X and Y, such that:
  d(X, Y) is positive definite:  if X ≠ Y, then d(X, Y) > 0;  if X = Y, then d(X, Y) = 0
  d(X, Y) is symmetric:  d(X, Y) = d(Y, X)
  d satisfies the triangle inequality:  d(X, Y) + d(Y, Z) ≥ d(X, Z)
25
An Example
A two-dimensional space; d1 ≥ d2 ≥ d∞ always (Manhattan ≥ Euclidean ≥ max).
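A quick Python check of that ordering on one sample pair of points (d1 = Manhattan, d2 = Euclidean, d∞ = max); a small illustration, not tied to the slides' figure.

```python
import math

def d1(x, y):     # Manhattan (L1)
    return sum(abs(a - b) for a, b in zip(x, y))

def d2(x, y):     # Euclidean (L2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dinf(x, y):   # max / Chebyshev (L-infinity)
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 7), (6, 4)
print(d1(X, Y), d2(X, Y), dinf(X, Y))   # 7  5.0  4  ->  d1 >= d2 >= d-infinity
```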
26
Neighborhoods of a Point
A (disk) Neighborhood of a point, T, is a set of points, S, such that X ∈ S iff d(T, X) ≤ r.
If X is a point on the boundary, then d(T, X) = r.
27
Classical k-Nearest Neighbor Classification
  • Select a suitable value for k (how many Training
    Data Set (TDS) neighbors do you want to vote as
    to the best predicted class for the unclassified
    feature sample? )
  • Determine a suitable distance metric (to give
    meaning to neighbor)
  • Find the k nearest training set points to the
    unclassified sample.
  • Let them vote (tally up the counts of TDS neighbors in each class).
  • Predict the class with the highest vote (the plurality class) from among the
    k-nearest neighbor set.

28
Closed-KNN
Example: assume 2 features (one in the x-direction and one in the y-direction). T is the unclassified sample. Using k = 3, find the three nearest neighbors. kNN arbitrarily selects one point from the boundary line shown; Closed-KNN includes all points on the boundary.
Closed-KNN yields higher classification accuracy than traditional KNN (thesis of Md. Maleq Khan, NDSU, 2001). The P-tree method always produces closed neighborhoods (and is faster!).
29
k-Nearest Neighbor (kNN) Classification
and Closed-k-Nearest Neighbor (CkNN)
Classification
1)  Select a suitable value for k.
2)  Determine a suitable distance or similarity notion.
3)  Find the (closed) k nearest neighbor set of the unclassified sample.
4)  Find the plurality class in the nearest neighbor set.
5)  Assign the plurality class as the predicted class of the sample.
T is the unclassified sample. Use Euclidean distance. k = 3: find the 3 closest neighbors. Move out from T until ≥ 3 neighbors are found ("That's 1!" ... "That's 2!" ... "That's more than 3!").
kNN arbitrarily selects one point from that boundary line as the 3rd nearest neighbor, whereas CkNN includes all points on that boundary line.
CkNN yields higher classification accuracy than traditional kNN. At what additional cost? Actually, at negative cost (it is faster and more accurate!!)
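A small Python sketch contrasting the two (our illustration, made-up points): plain kNN truncates at exactly k neighbors, while closed kNN keeps every point tied at the k-th distance.

```python
import math

def closed_knn(training, sample, k):
    """Return all training points whose distance to sample is <= the k-th
    smallest distance (so boundary ties are included, not arbitrarily cut)."""
    dists = sorted(math.dist(p, sample) for p, _ in training)
    cutoff = dists[k - 1]
    return [(p, c) for p, c in training if math.dist(p, sample) <= cutoff]

def plurality_class(neighbors):
    votes = {}
    for _, c in neighbors:
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

# Four training points; two of them tie for 3rd-nearest to T = (0, 0).
training = [((1, 0), "red"), ((0, 1), "red"), ((2, 0), "blue"), ((0, 2), "red")]
nbrs = closed_knn(training, (0, 0), k=3)
print(len(nbrs))                 # 4 -- both distance-2 boundary points are kept
print(plurality_class(nbrs))     # 'red'
```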
30
The slides numbered 28 through 93 give great detail on the relative performance of kNN and CkNN, on the use of other distance functions, and some examples, etc. There may be more detail on these issues than you want/need. If so, just scan for what you are most interested in, or skip ahead to slide 94 on CLUSTERING.
Experiments were run on two sets of (aerial) Remotely Sensed Images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (the class label). Band values range from 0 to 255 (8 bits). We consider 8 classes or levels of yield values.
31
Performance — Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate)
1997 Dataset
[Chart: accuracy (%) from 40 to 80 vs. training set size (256 to 262,144 pixels) for kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance]
32
Performance — Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate)
1998 Dataset
[Chart: accuracy (%) from 20 to 65 vs. training set size (256 to 262,144 pixels) for the same six methods: kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance]
33
Performance — Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest)
Hint: NEVER use a log scale to show a WIN!!!
1997 Dataset, both axes in logarithmic scale
[Chart: per-sample classification time (sec), 0.0001 to 1, vs. training set size (256 to 262,144 pixels) for kNN-Manhattan, kNN-Euclidean, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance]
34
Performance — Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest)
Win-Win situation!! (almost never happens) P-tree CkNN and CkNN-H are more accurate and much faster. kNN-H is not recommended because it is slower and less accurate (it doesn't use closed neighbor sets and it requires another step to get rid of ties — why do it?). Horizontal kNNs are not recommended because they are less accurate and slower!
1998 Dataset, both axes in logarithmic scale
[Chart: per-sample classification time (sec), 0.0001 to 1, vs. training set size (256 to 262,144 pixels) for the same six methods]
35
WALK-THRU: 3NN CLASSIFICATION of an unclassified sample, a = (a5,a6,a11,a12,a13,a14) = (0,0,0,0,0,0). HORIZONTAL APPROACH (the relevant attributes are a5, a6, a11, a12, a13, a14).
Note: only 1 of the many training tuples at distance 2 from the sample got to vote. We didn't know that distance 2 was going to be the vote cutoff until the end of the 1st scan. Finding the other distance-2 voters (the Closed 3NN set, or C3NN) requires another scan.
t12 0 0 1 0 1 1 0 2
t13 0 0 1 0 1 0 0 1
t15 0 0 1 0 1 0 1 2
t53 0 0 0 0 1 0 0 1
0 1
Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0   1    0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0   1    0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0   1    0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0   1    1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0   1    1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0   1    0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1   1    1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1   1    0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1   1    0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1   1    0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0   0    1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0   0    0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0   0    0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1   0    1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0   0    0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0
36
WALK-THRU of the required 2nd scan to find the Closed 3NN set. Does it change the vote?
YES! C=0 wins now!
Vote after 1st scan.
(The training table here is identical to the one on the previous slide.)
37
WALK-THRU: Closed 3NNC using P-trees.
First let all training points at distance 0 vote, then distance 1, then distance 2, ... until ≥ 3 votes have been cast. For distance 0 (exact matches), construct the P-tree Ps, then AND it with PC and PC' to compute the vote.
(Black denotes a complemented bit slice, red denotes an uncomplemented one.)
a14 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0
a13 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1
No neighbors at distance 0.
a12 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0
C 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
a11 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 1 1
C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
a6 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
a1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0
a4 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1
a5 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
a6 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
a7 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
a8 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
a9 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0
a11 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
a12 1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1
a13 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0
a14 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1
a15 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0
a16 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
a17 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1
a18 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1
a19 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
Ps 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a5 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
a20 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0
key t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t5
3 t55 t57 t61 t72 t75
a2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0
a3 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
38
WALK-THRU: C3NNC distance-1 neighbors.
Construct the P-tree of the distance-1 neighbors of s as
P_S(s,1) = OR over i ∈ {5,6,11,12,13,14} of P_i,
where P_i = P_{|s_i−t_i|=1 and |s_j−t_j|=0 for all j ∈ {5,6,11,12,13,14}−{i}}
         = P_S(s_i,1) AND ( AND over j ∈ {5,6,11,12,13,14}−{i} of P_S(s_j,0) ).
a14 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0
a13 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1
a12 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0
C 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
a11 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 1 1
C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
a10 C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
a6 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
PD(s,1) 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
a5 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
a20 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0
key t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t5
3 t55 t57 t61 t72 t75
a1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0
a2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0
a3 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
a4 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1
a5 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
a6 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
a7 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
a8 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
a9 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0
a11 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
a12 1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1
a13 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0
a14 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1
a15 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0
a16 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
a17 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1
a18 0 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1
a19 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
39
WALK-THRU: C3NNC distance-2 neighbors.
After the distance-1 step we already had 3 nearest neighbors — we could quit and declare C=1 the winner?
But after collecting all distance-2 neighbors we have the full C3NN set, and we can declare C=0 the winner!
P5,12
P5,13
P5,14
P6,11
P6,12
P6,13
P6,14
P11,12
P11,13
P11,14
P12,13
P12,14
P13,14
P5,6
P5,11
a10 C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
a5 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
a6 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
a11 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
a12 1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1
a13 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0
a14 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1
key t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t5
3 t55 t57 t61 t72 t75
40
In the previous example there were no exact matches (distance=0, i.e., similarity=6 neighbors) for the sample. Two neighbors were found at a distance of 1 (distance=1, similarity=5) and nine at distance=2, similarity=4. All 11 neighbors got an equal vote, even though the two similarity=5 points are much closer neighbors than the nine similarity=4 points. Also, processing for the 9 is costly. A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote-weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.) As long as we are weighting votes by similarity, we might as well also weight attributes by relevance (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be its correlation with the class label). P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-Cup competition in '02: http://www.biostat.wisc.edu/craven/kddcup/ ).
41
Association for Computing Machinery KDD-Cup-02, NDSU Team
42
Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity).
The sample is a = (0,0,0,0,0,0); the attribute weights of the relevant attributes are their subscripts.
Black is the attribute complement, red is uncomplemented. The vote is even simpler than in the "equal vote" case: every tuple votes its weighted similarity — for each relevant attribute whose value agrees with that of (0,0,0,0,0,0), the vote contribution is the subscript of that attribute, else zero. Thus we can just add up the root counts of each relevant (complemented) attribute ANDed with the class P-tree, weighted by the attribute's subscript.
a14 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0
a13 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1
Class1 root counts
a12 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0
a11 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 1 1
a6 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
C1 vote is 343 = 4·5 + 8·6 + 7·11 + 4·12 + 4·13 + 7·14
C 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
key t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t5
3 t55 t57 t61 t72 t75
a5 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 0
a6 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
a11 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0
a12 1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1
a13 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 1 0
a14 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1
a5 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1
rc(PC ∧ P′a12) = 4
rc(PC ∧ P′a13) = 4
rc(PC ∧ P′a14) = 7
rc(PC ∧ P′a6) = 8
rc(PC ∧ P′a11) = 7
rc(PC ∧ P′a5) = 4
(P′ denotes the complemented bit slice, shown in black on the slide.)
C1 vote is 343
Similarly, the C0 vote is 258 = 6·5 + 7·6 + 5·11 + 3·12 + 3·13 + 4·14
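A small Python check of that arithmetic (our sketch): the root counts are read off the P-tree ANDs above, and the weights are the attribute subscripts.

```python
# Attribute weights = subscripts of the relevant attributes.
weights = {"a5": 5, "a6": 6, "a11": 11, "a12": 12, "a13": 13, "a14": 14}

# Root counts rc(PC ^ P'a_i): class-1 (resp. class-0) training tuples whose
# a_i value agrees with the sample a = (0,0,0,0,0,0).
rc_c1 = {"a5": 4, "a6": 8, "a11": 7, "a12": 4, "a13": 4, "a14": 7}
rc_c0 = {"a5": 6, "a6": 7, "a11": 5, "a12": 3, "a13": 3, "a14": 4}

def weighted_vote(root_counts):
    return sum(weights[a] * root_counts[a] for a in weights)

print(weighted_vote(rc_c1))   # 343  (C=1 vote)
print(weighted_vote(rc_c0))   # 258  (C=0 vote)  -> C=1 wins this weighted vote
```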
43
We note that the Closed Manhattan NN Classifier
uses an influence function which is pyramidal
It would be much better to use a
Gaussian influence function but
it is much harder to implement.
One generalization of this method to the case of
integer values rather than Boolean, would be to
weight each bit position in a more Gaussian shape
(i.e., weight the bit positions, b, b-1, ..., 0
(high order to low order) using Gaussian weights.
By so doing, at least within each attribute,
influences are Gaussian. We can call this
method, Closed Manhattan Gaussian NN
Classification. Testing the performance of
either CM NNC or CMG NNC would make a great paper
for this course (thesis?). Improving it in some
way would make an even better paper (thesis).
44
Review of slide 2 (with additions): Database analysis can be broken down into 2 areas, Querying and Data Mining.
Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining.
Machine Learning can be broken down into 2 areas, Clustering and Classification.
Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based.
Classification can be broken down into 2 types, Model-based and Neighbor-based.
  • Machine Learning is based on Near Neighbor Set(s), NNS.
  • Clustering, even density-based, identifies near neighbor cores 1st (round NNSs, ε-disks about a center).
  • Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity:
  • ∀ε>0 ∃δ>0 such that d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector, or
  • ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 such that d(x,a)<δ ⇒ f(x)=f(a).

Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).
Finding the NNS in a lower dimension may still be the 1st step. E.g., points 1,2,...,8 are all within distance ε of the unclassified sample a; 1,2,3,4 are red-class, 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace and then taking ε/2, we see that the ε/2-neighborhood about a contains only blue-class (5,6) votes.
Using horizontal data, NNS derivation requires 1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2(n)) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)
Solution (next slide): Circumscribe the desired Euclidean ε-NNS with a few intersections of functional contours, f⁻¹([b,c]) sets, until the intersection is scannable, then scan it for Euclidean ε-neighborhood membership. Advantage: the intersection can be determined before scanning — create and AND functional-contour P-trees.
45
Functional Contours: for a function f: R(A1..An) → Y
  • and any S ⊆ Y, contour(f,S) ≡ f⁻¹(S). Equivalently, contour(Af,S) = SELECT A1..An FROM R WHERE x.Af ∈ S. Graphically:

For any partition {Si} of Y, the contour sets {f⁻¹(Si)} form a partition of R (a clustering of R):
A weather map: f = barometric pressure or temperature, {Si} = an equi-width partition of the Reals.
f = local density (e.g., OPTICS: f = reachability distance, {Sk} = the partition produced by the intersection points of graph(f), plotted with respect to some walk of R, and a horizontal threshold line).
A grid is the intersection of the dimension-projection contour partitions (see the next slide for more definitions).
A Class is a contour under f: R → ClassAttr with respect to the partition {Ci} of ClassAttr (where the Ci are the classes).
An L∞ ε-disk about a is the intersection of all ε-dimension-projection contours containing a.
46
GRIDs
f: R → Y; for any partition S = {Sk} of Y, {f⁻¹(Sk)} is the S,f-grid of R (grid cells = contours).
If Y = Reals, the j.lo f-grid is produced by agglomerating over the j low-order bits of Y, for each fixed (b−j) high-order bit pattern. The j low bits walk the isobars of cells; the b−j high bits identify cells. (lo = extension / hi = intention)
Let b−1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S_(b-1)...(b-j)}, where S_(b-1)...(b-j) = [(b−1)(b−2)...(b−j)0..0, (b−1)(b−2)...(b−j)1..1] is a partition of Y = Reals.
If F = {f_h}, the j.lo F-grid is the intersection partition of the j.lo f_h-grids (intersection of partitions).
The canonical j.lo grid is the j.lo π-grid, where π = {π_d: R → R[A_d]} and π_d is the d-th coordinate projection.
j-hi gridding is similar (the b−j low bits walk cell contents / the j high bits identify cells).
If the horizontal and vertical dimensions have bitwidths 3 and 2, respectively:
47
j.lo and j.hi gridding, continued: horizontal_bitwidth = vertical_bitwidth = b iff the j.lo grid = the (b−j).hi grid; e.g., for hb = vb = b = 3 and j = 2, the 2.lo grid = the 1.hi grid.
[Figure: the 2.lo grid and the 1.hi grid over a space with 3-bit coordinates 000–111 on each axis]
48
Similarity NearNeighborSets (SNNS): Given a similarity s: R×R → PartiallyOrderedSet (e.g., the Reals) (i.e., s(x,y)=s(y,x) and s(x,x)≥s(x,y) ∀x,y∈R) and given any C ⊆ R:
The Ordinal disks, skins and rings are:
disk(C,k) ⊆ R such that C ⊆ disk(C,k), |disk(C,k)−C| = k, and s(x,C) ≥ s(y,C) ∀x ∈ disk(C,k), y ∉ disk(C,k).
skin(C,k) = disk(C,k) − C ("skin" comes from "s k immediate neighbors"; it is a kNNS of C).
ring(C,k) = cskin(C,k) − cskin(C,k−1).
closeddisk(C,k) ≡ the union of all disk(C,k); closedskin(C,k) ≡ the union of all skin(C,k).
The Cardinal disks, skins and rings are (PartiallyOrderedSet = Reals):
disk(C,r) ≡ {x∈R | s(x,C) ≥ r}; also the functional contour f⁻¹([r,∞)), where f(x) = s_C(x) = s(x,C).
skin(C,r) ≡ disk(C,r) − C.
ring(C,r2,r1) ≡ disk(C,r2) − disk(C,r1) = skin(C,r2) − skin(C,r1); also the functional contour s_C⁻¹((r1,r2]).
Note: closeddisk(C,r) is redundant, since all r-disks are closed, and closeddisk(C,k) = disk(C, s(C,y)) where y is the kth NN of C.
L∞ skins: skin∞(a,k) = {x | ∀d, x_d is one of the k-NNs of a_d} (a local normalizer?)
49
Ptrees
Partition tree:
              R
            /   \
          C1 ... Cn
         /  \      /  \
    C11...C1,n1  Cn1...Cn,nn  . . .
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but network multicast overhead is huge. Use active networking? Clusters of Playstations? ...
Formally, P-trees are defined as any of the following:
Partition-tree: a tree of nested partitions (a partition P(R) = {C1,...,Cn}; each component is partitioned by P(Ci) = {Ci,1,...,Ci,ni}, i = 1..n; each of those components is partitioned by P(Ci,j) = {Ci,j,1,...,Ci,j,nij}; ...).
  • Predicate-tree: for a predicate on the leaf-nodes of a partition-tree (it also induces predicates on interior nodes using quantifiers).
  • Predicate-tree nodes can be truth-values (Boolean P-tree), quantified existentially (≥1, or a threshold %) or universally; or Predicate-tree nodes can hold the count of true leaf children of that component (Count P-tree).
  • Purity-tree: a universally quantified Boolean-Predicate-tree (e.g., if the predicate is <1>, a Pure1-tree or P1-tree).
  • A node holds a 1-bit iff the corresponding component is pure1 (universally quantified).
  • There are many other useful predicates, e.g., NonPure0-trees. But we will focus on P1-trees.
  • All Ptrees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering) (e.g., used for 2-D images), 3-D (recursively eighth-ing), or based on purity runs or LZW-runs or ...

Further observations about Ptrees: Partition-trees have set nodes. Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree). Purity-trees, being universally quantified Boolean-Predicate-trees, have Boolean nodes (since the count is always the full count of leaves, expressing Purity-trees as count-trees is redundant). A Partition-tree can be sliced at a level if each partition is labeled with the same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.
50
The partitions used to create P-trees can come from functional contours. (Note there is a natural duality between partitions and functions: a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain.) In Functional Contour terms (i.e., f⁻¹(S) where f: R(A1..An) → Y, S ⊆ Y), the uncompressed Ptree or uncompressed Predicate-tree 0P_f,S is the bitmap of the set-containment predicate: 0P_f,S(x) = true iff x ∈ f⁻¹(S).
0P_f,S is, equivalently, the existential R-bit map of the predicate R.Af ∈ S.
The Compressed Ptree, sP_f,S, is the compression of 0P_f,S with equi-width leaf size, s, as follows:
1. Choose a walk of R (converts 0P_f,S from a bit map to a bit vector).
2. Equi-width partition 0P_f,S with segment size s (s = leafsize; the last segment can be short).
3. Eliminate and mask to 0 all pure-zero segments (call the mask the NotPure0 Mask, or EM).
4. Eliminate and mask to 1 all pure-one segments (call the mask the Pure1 Mask, or UM).
(EM = existential aggregation, UM = universal aggregation.)
Compressing each leaf of sP_f,S with leafsize = s2 gives s1,s2P_f,S. Recursively: s1,s2,s3P_f,S; s1,s2,s3,s4P_f,S; ... (this builds an EM tree and a UM tree).
BASIC P-trees: If Ai is Real or Binary and fi,j(x) ≡ the jth bit of xi, then { ( )P_fi,j,{1} ≡ ( )P_i,j }, j = b..0, are the basic ( )P-trees of Ai (with prefix s1..sk). If Ai is Categorical and fi,a(x) = 1 if xi = a, else 0, then { ( )P_fi,a,{1} ≡ ( )P_i,a }, a ∈ R[Ai], are the basic ( )P-trees of Ai.
Notes: The UM masks (e.g., of 2^k,...,2^0 P_i,j, with k = roof(log2|R|)) form a (binary) tree. Whenever the EM bit is 0, that entire subtree can be eliminated (since it represents a pure0 segment); then a 0-node at level k (lowest level = level 0) with no sub-tree indicates a 2^k-run of zeros. In this construction, the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.
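A tiny Python sketch of steps 1–4 above (one compression level, our own illustration): given a bit vector and a leaf size s, produce the EM (NotPure0) and UM (Pure1) masks.

```python
def compress_level(bits, s):
    """One compression level: split the bit vector into segments of width s and
    record, per segment, the EM (NotPure0) and UM (Pure1) mask bits."""
    segments = [bits[i:i + s] for i in range(0, len(bits), s)]
    em = [1 if any(seg) else 0 for seg in segments]    # 0 marks a pure-zero segment
    um = [1 if all(seg) else 0 for seg in segments]    # 1 marks a pure-one segment
    return em, um

# R11 from the earlier slides, leaf size 2:
bits = [0, 0, 0, 0, 1, 0, 1, 1]
em, um = compress_level(bits, 2)
print(em)   # [0, 0, 1, 1]  -- the first two segments are pure zero
print(um)   # [0, 0, 0, 1]  -- only the last segment is pure one
```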
51
Example functionals: Total Variation (TV) functionals.
TV(a) = Σ_{x∈R} (x−a)∘(x−a). If we use d as an index variable over the dimensions, this is
TV(a) = Σ_{x∈R} Σ_{d=1..n} ( x_d² − 2 a_d x_d + a_d² )
(i, j, k are bit-slice indexes).
Note that the first term does not depend upon a. Thus the derived attribute TV − TV(μ) (which eliminates the 1st term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).
52
TV(a) = Σ_{x,d,i,j} 2^(i+j) x_{d,i} x_{d,j} + |R| ( −2 Σ_d a_d μ_d + Σ_d a_d a_d )
From equation 7, the Normalized Total Variation is
NTV(a) ≡ TV(a) − TV(μ) = |R| ( −2 Σ_d (a_d μ_d − μ_d μ_d) + Σ_d (a_d a_d − μ_d μ_d) ) = |R| |a−μ|²
Thus there is a simpler function which gives us circular contours, the Log Normal TV functional:
LNTV(a) = ln( NTV(a) ) = ln( TV(a) − TV(μ) ) = ln|R| + ln|a−μ|²
The value of LNTV(a) depends only on the length of a−μ, so its isobars are hyper-circles centered at μ.
The graph of LNTV is a log-shaped hyper-funnel:

For an ε-contour ring (radius ε about a): go inward and outward along a−μ by ε to the points
inner point b ≡ μ + (1 − ε/|a−μ|)(a−μ) and outer point c ≡ μ + (1 + ε/|a−μ|)(a−μ).
Then take g(b) and g(c) (the functional values at b and c) as the lower and upper endpoints of a vertical interval. Then we use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).
53


If the LNTV circumscribing contour of a is still too populous, use a circumscribing Ad ε-contour (note this is not a derived attribute at all, but just Ad, so we already have its basic P-trees).
  • As pre-processing, calculate basic P-trees for the LNTV derived attribute (or another hyper-circular contour derived attribute).
  • To classify a:
  • 1. Calculate b and c (they depend on a and ε).
  • 2. Form the mask P-tree for training points with LNTV-values ∈ [LNTV(b), LNTV(c)].
  • 3. Use that P-tree to prune out the candidate NNS.
  • If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using a dimension projection.

We can also note that LNTV can be further simplified (retaining the same contours) using h(a) = |a−μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., h_b(a) = |a−b|.
[Figure: graph of LNTV(x) with LNTV(b) and LNTV(c) marked, the contour of the dimension projection f(a) = a1, and the points b and c in the (x1, x2) plane]
54
Graphs of functionals with hyper-circular contours
55
Angular Variation functionals: e.g., AV(a) ≡ (1/|a|) Σ_{x∈R} x∘a. With d an index over the dimensions,
AV(a) = (1/|a|) Σ_{x∈R} Σ_{d=1..n} x_d a_d = (1/|a|) Σ_d ( Σ_x x_d ) a_d   (factor out a_d)
COSμ(a) ≡ AV(a) / (|μ| |R|) = μ∘a / (|μ| |a|) = cos(θ_aμ)
COSμ (and AV) have hyper-conic isobars centered on μ.
COSμ and AV have ε-contour(a) = the space between two hyper-cones centered on μ which just circumscribes the Euclidean ε-hyperdisk at a.
Intersection (in pink) with the LNTV ε-contour.
Graphs of functionals with hyper-conic contours: e.g., COSb(a) for any vector b.
56
f_a(x) = (x−a)∘(x−a) = Σ_{d=1..n} ( x_d² − 2 a_d x_d + a_d² ), where d indexes the dimensions and i, j, k index the bit slices.
Adding up the Gaussian-weighted votes for class c: collecting the diagonal terms inside the exp, for the Σ_{i,j,d} inside the exp the coefficients are multiplied by a 1/0-bit (which depends on x); for fixed i, j, d the coefficient is either x-independent (if the bit is 1) or not (if the bit is 0).
Some additional formulas:
f_a(x) = Σ_{d,i,j} 2^(i+j) ( x_{d,i} x_{d,j} − 2 a_{d,i} x_{d,j} + a_{d,i} a_{d,j} )
       = Σ_{d,i,j} 2^(i+j) ( x_{d,i} − a_{d,i} )( x_{d,j} − a_{d,j} )
57
f_{d,a}(x) = x_d − a_d = Σ_i 2^i ( x_{d,i} − a_{d,i} ) = Σ_{i: a_{d,i}=0} 2^i x_{d,i}  −  Σ_{i: a_{d,i}=1} 2^i x′_{d,i}
Thus, for the derived attribute f_{d,a} = the numeric distance of x_d from a_d: if we remember that when a_{d,i} = 1 we subtract those contributing powers of 2 (don't add), and that we use the complemented dimension-d basic Ptrees, then it should work. The point is that we can get a set of "near-basic" or "negative-basic" Ptrees (nbPtrees) for the derived attribute f_{d,a} directly from the basic Ptrees for Ad, for free. Thus, the near-basic Ptrees for f_{d,a} are the basic Ad Ptrees for those bit positions where a_{d,i} = 0, and they are the complements of the basic Ad Ptrees for those bit positions where a_{d,i} = 1 (called f_{d,a}'s nbPtrees). Caution: subtract the contribution of the nbPtrees for positions where a_{d,i} = 1. Note: nbPtrees are not predicate trees (are they? what's the predicate?). The EIN ring formulas are related to this — how? If we are simply after an easy pruning contour containing a (so that we can scan to get the actual Euclidean epsilon neighbors and/or to get Gaussian-weighted vote counts), we can use HOBbit-type contours ("middle earth" contours of a?). See the next slide for a discussion of HOBbit contours.
58
A principle: A job is not done until the Mathematics is completed. The Mathematics of a research job includes:
0. Getting to the frontiers of the area (researching, organizing, understanding and integrating everything others have done in the area up to the present moment, and what they are likely to do next).
1. Developing a killer idea for a better way to do something.
2. Proving claims (theorems, performance evaluation, simulation, etc.).
3. Simplification (everything is simple once fully understood).
4. Generalization (to the widest possible application scope), and
5. Insight (what are the main issues and underlying mega-truths (with full drill-down)).
Therefore, we need to ask the following questions at this point:
Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the "point closest to the mean" definition is influenced by skewness, like the mean)? We will denote the vector of medians as ν. h_ν(a) = |a−ν| is an important functional (better than h_μ(a) = |a−μ|?). If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns μ and ν coincide. (So if μ and ν are far apart, that tells us there is high skew in the data, and the coordinates where they differ are the columns where the skew is found.)
59
Additional Mathematics to enjoy:
What about the vector of standard deviations, σ? (computable with P-trees!) Do we have an improvement of BIRCH here — generating similar comprehensive statistical measures, but much faster and more focused? We can do the same for any rank statistic (or order statistic), e.g., the vector of 1st or 3rd quartiles, Q1 or Q3, or the vector of kth rank values (kth ordinal values). If we preprocessed to get the basic P-trees of ν and each mixed quartile vector (e.g., in 2-D add 5 new derived attributes: ν, Q1,1, Q1,2, Q2,1, Q2,2, where Qi,j is the ith quartile of the jth column), what does this tell us (e.g., what can we conclude about the location of core clusters)? Maybe all we need is the basic P-trees of the column quartiles, Q1..Qn?
  • L∞ ordinal disks:
  • disk∞(C,k) = {x | x_d is one of the k nearest neighbors of C_d, ∀ d}.
  • skin∞(C,k), closed skin∞(C,k) and ring∞(C,k) are defined as above.
Are they easy P-tree computations? Do they offer advantages? When? What? Why? E.g., do they automatically normalize for us?
60
The Middle Earth (HOBbit) Contours of a are gotten by ANDing in the basic Ptree for a_{d,i} = 1 and ANDing in the complement if a_{d,i} = 0 (down to some bit-position threshold in each dimension, bpt_d; bpt_d can be the same for each d or not).
Caution: HOBbit contours of a are not symmetric about a. That becomes a problem (for knowing when you have a symmetric neighborhood in the contour), especially when many lowest-order bits of a are identical (e.g., if a_d = 8 = 1000).
If the low-order bits of a_d are zeros, one should union (OR) in the HOBbit contour of a_d − 1 (e.g., for 8 = 1000 also take 7 = 0111).
If the low-order bits of a_d are ones, one should union (OR) in the HOBbit contour of a_d + 1 (e.g., for 7 = 0111 also take 8 = 1000).
Some needed research: since we are looking for an easy prune to get our mask down to a scannable size (low root count), but not so much of a prune that we have t