Why Not Store Everything in Main Memory? Why use disks?

Transcript and Presenter's Notes



Slide 1
2/2/13
Data Mining Big Data. Big data = up to trillions of rows (and more) and, possibly, thousands of columns. I structure data vertically (pTrees) and process it horizontally. Typically, processing thousands of columns is orders of magnitude faster than processing trillions of rows, so that processing can often be done in much less time on vertical data than on horizontal data. Data mining is roughly CLASSIFICATION (assigning a best-guess class label to a row based on a training set of already-classified rows). What about clustering and Association Rule Mining (ARM)? They are important and related! Roughly, clustering creates (or improves) a classification training set, and ARM is used to mine more complex data (e.g., relationship matrices).
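The vertical-structure claim above can be illustrated in miniature. This is a sketch only: `bit_slices` and the toy column are illustrative, not the actual pTree API, and plain Python ints stand in for the compressed tree structures.

```python
# Decompose an integer column into bit slices (the basic idea behind
# pTrees), so a predicate over all rows becomes a few bitwise operations
# instead of a horizontal row scan.

def bit_slices(column, width):
    """Return `width` bitmaps; slice j has bit i set iff bit j of column[i] is set."""
    slices = [0] * width
    for i, v in enumerate(column):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices

column = [5, 12, 3, 14, 7]            # 4-bit values, one per row
slices = bit_slices(column, 4)
# Rows with value >= 8 are exactly the rows whose top bit (bit 3) is set,
# so the answer is read straight off one vertical slice.
mask = slices[3]
hits = [i for i in range(len(column)) if (mask >> i) & 1]
print(hits)  # rows holding 12 and 14
```

On real data the slices would be compressed into trees; the bitwise logic stays the same.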
CLASSIFICATION = case-based reasoning. To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we made in those cases (we do what worked before, for us or for others). We let those near neighbors vote. "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology. It was published in 1956 by the cognitive psychologist George A. Miller of Princeton University's Department of Psychology in Psychological Review. It is often cited to argue that the number of objects or contexts an average human can hold in working memory is 7 ± 2 (called Miller's Law). We can think of classification as providing a better 7 (so it's decision support, not decision making). One can make a case that all classification methods (even model-based ones) are a form of near-neighbor classification. E.g., in Decision Tree Induction (DTI), the classes at the bottom of a decision branch ARE the near-neighbor set, because the sample arrived at that leaf.
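The near-neighbor voting described above can be sketched as a tiny k-nearest-neighbor classifier. The toy training set and k = 3 are illustrative.

```python
# Find the k most similar training rows and let them vote on the label.
import math
from collections import Counter

def nn_vote(train, query, k=3):
    """train: list of (row, label) pairs; returns the majority label of the k nearest rows."""
    nearest = sorted(train, key=lambda rl: math.dist(rl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B")]
print(nn_vote(train, (2, 2)))  # the three nearest cases all vote "A"
```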
Each row of an entity table (e.g., Iris(PL,PW,SL,SW) or Image(R,G,B)) describes an instance of an entity (irises or pixels) with columns of descriptive information on each instance (Petal Length, Petal Width, Sepal Length, Sepal Width; or Red, Green, Blue photon counts). If the table consists entirely of real numbers, then the row set can be viewed as a subset of a real vector space whose dimension is the number of columns. Then the notion of "near" or "similar" in classification and clustering can be defined using a dissimilarity (a distance) or a similarity (two rows are "near" if the distance between them is small, or if their similarity is high). Two columns are near if they are highly correlated with respect to some correlation measure (e.g., Pearson's, Spearman's, ...).
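A small sketch of these two notions of "near": rows are near when the distance between them is small; columns are near when they are highly correlated. The `pearson` helper and the sample values (in the style of the IRIS table) are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

row1, row2 = (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2)
print(math.dist(row1, row2))    # small distance -> the two rows are "near"

col_pl = [1.4, 1.4, 1.3, 1.5, 1.4]   # a petal-length column
col_pw = [0.2, 0.2, 0.2, 0.4, 0.3]   # a petal-width column
print(pearson(col_pl, col_pw))  # high r -> the two columns are "near"
```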
If the columns also describe instances of an entity, then the table is really a matrix, or a relationship between instances of the row entity and the column entity. Each cell of the matrix measures some aspect of that relationship (e.g., it may just be 1 iff that row is related to that column, else 0; or each cell entry may be an entire structure of data describing that row instance and that column instance). In Market Basket Research (MBR), the row entity is customers and the column entity is items; a cell = 1 iff that customer has that item in the market basket. In Netflix Cinematch, the row entity is customers, the column entity is movies, and each cell has the 5-star rating that customer gave to that movie. In bioinformatics, the row entity might be experiments and the column entity genes, with each cell holding the expression level of that gene under that experiment; or the row and column entities might both be proteins, with each cell holding a 1-bit iff the two proteins interact in some way. In Facebook, the rows might be people and the columns might also be people (a cell has a 1-bit iff the row and column persons are friends).
Even when the table appears to be a simple entity table with descriptive feature columns, it may be viewable as a relationship between two entities. E.g., Image(R,G,B) is a table of pixel instances with columns R, G, B. The R-values count the photons in a "red" frequency range detected at that pixel over an interval of time. That red frequency range is determined more by the camera technology than by any scientific definition. If we had separate CCD cameras that could count photons in each of a million very thin adjacent frequency intervals, we could view the column values of that image as instances of a frequency entity; then the image would be a relationship matrix between the pixel and frequency entities. So an entity table can often be usefully viewed as a relationship matrix. If so, it can also be rotated, so that the former column entity is viewed as the new row entity and the former row entity as the new set of descriptive columns. The bottom line is that we can often data mine a table in many ways: as an entity table (classification and clustering), as a relationship matrix (ARM), or, upon rotation of that matrix, as another entity table. For a rotated entity table, the concepts of nearness that can be used also rotate (e.g., the cosine correlation of two columns morphs into the cosine of the angle between two vectors as a row similarity measure).
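The rotation idea can be illustrated with a toy market-basket matrix (values are illustrative): after transposing, the cosine correlation of two columns is exactly the cosine row similarity of the corresponding rows.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# customers x items (1 iff the item is in that customer's basket)
M = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
MT = [list(col) for col in zip(*M)]   # rotate: items x customers

a = cosine([row[0] for row in M], [row[1] for row in M])  # two columns of M
b = cosine(MT[0], MT[1])                                  # same pair as rows of MT
print(a, b)  # identical values before and after rotation
```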
Slide 2
Where Does In-Memory Come into BI? Here is just a sampling of what in-memory computing technology can do for you:
- Enable mixed workloads of analytics, operations, and performance management in a single software landscape.
- Support smarter business decisions by providing increased visibility into very large volumes of business information.
- Enable users to react to business events more quickly through real-time analysis and reporting of operational data.
- Provide greater flexibility by delivering innovative real-time analysis and reporting.
- Support the deployment of innovative new business applications.
- Help streamline the IT landscape and reduce total cost of ownership (TCO).
In manufacturing enterprises, in-memory computing technology will connect the shop floor to the boardroom, and the shop-floor associate will have instant access to the same data as the board member (shop-floor daily transaction processing; boardroom data mining). The shop floor will then see the results of their actions reflected immediately in the relevant Key Performance Indicators (KPIs). SAP BusinessObjects Event Insight software is key. In what used to be called exception reporting, the software deals with huge amounts of real-time data to determine immediate and appropriate action for a real-time situation.
SAP In-Memory Computing Technology Executive Summary: Enabling Real-Time Computing. SAP In-Memory Computing technology enables real-time computing by bringing together online transaction processing (OLTP) applications and online analytical processing (OLAP) applications at a low total cost. Combining the advances in hardware technology with SAP In-Memory Computing empowers the entire business, from shop floor to boardroom, by giving real-time business processes instantaneous access to data. The alliance of these two technologies can eliminate today's information lag for your business. With the revolution of in-memory computing already under way, the question isn't if this revolution will impact businesses but when and, more importantly, how. In-memory computing won't be introduced because a company can afford the technology. It will be brought on board because a business cannot afford to allow its competitors to adopt the technology first. This paper details how in-memory computing can change the way you manage business intelligence and the value your business can derive from the technology. For business and IT executives, the paper furnishes substantial information and business examples about what changes they can look forward to and how those changes can catalyze their strategic initiatives.
Product managers will still look at inventory and point-of-sale data, but in the future they will also receive, for example, notifications when customers broadcast their dissatisfaction with a product to the masses over Twitter. Or they might be alerted to a negative product review released online that highlights some unpleasant product features requiring immediate action. From the other side, small businesses running real-time inventory reports will be able to announce to their Facebook and Twitter communities that a high-demand product is available, how to order it, and where to pick it up. Bad movies have been able to enjoy a great opening weekend before crashing in the second weekend, when negative word-of-mouth feedback cools enthusiasm. That week-long grace period is about to disappear for silver-screen flops. Consumer feedback won't take a week, a day, or an hour. The very second showing of a movie could suffer a noticeable falloff in attendance due to consumer criticism piped instantaneously through the new technologies. It will no longer be good enough to have the weekend numbers ready for executives on Monday morning. Executives will run their own reports on revenue, tweet their reviews over the weekend, and by Monday morning have acted on their decisions.
Slide 3
A final example is from the utilities industry. The most expensive energy that a utilities company provides is energy to meet unexpected demand during peak periods of consumption. If the company could analyze trends in electrical power consumption based on real-time meter readings, it could offer its consumers, in real time, extra-low rates for the week or month if they reduce their consumption during the following few hours. This advantage will become much more dramatic when we switch to electric cars: predictably, those cars are going to be recharged the minute the owners return home from work, which could be within a very short period of time. In-memory computing technology combines hardware and software innovations. Hardware innovations include blade servers and CPUs with multicore architecture and memory capacities measured in terabytes for massive parallel scaling. Software innovations include an in-memory database with highly compressible row and column storage specifically designed to maximize in-memory computing technology. Parallel processing takes place in the database layer rather than in the application layer, as we know it from the client-server architecture. Total cost is projected to be 30% lower than traditional relational database technology, due to:
- Leaner hardware and less system capacity required, as mixed workloads of analytics, operations, and performance management are handled within a single system, which also reduces redundant data storage.
- Reduced extract, transform, and load (ETL) processes between systems and fewer prebuilt reports, reducing the support effort required to run the software.
Replacing traditional databases in SAP applications with in-memory computing technology resulted in report runtime improvements of up to a factor of 1,000 and compression rates of up to a factor of 10. Performance improvements are expected to be even higher in SAP applications natively developed for in-memory databases, with initial results showing a reduction of computing time from several hours to a few seconds. However, in-memory computing will not eliminate the need for data warehousing. A real-time reporting function will solve old challenges and create new opportunities, but new challenges will arise. SAP HANA 1.0 software supports real-time database access to data from the SAP applications that support OLTP. Formerly, operational reporting functionality was transferred from OLTP applications to a data warehouse. With in-memory computing technology, this functionality is integrated back into the transaction system.
Adopting in-memory computing results in an uncluttered architecture based on a few, tightly aligned core systems enabled by service-oriented architecture (SOA) to provide harmonized, valid metadata and master data across business processes. Some of the most salient shifts and trends in future enterprise architectures will be:
- A shift to BI self-service apps, like data exploration, instead of static report solutions.
- Central metadata and master-data repositories that define the data architecture, allowing data stewards to work effectively across all business units and all platforms.
- Instantaneous analysis of real-time trending algorithms with direct impact on live execution of business processes.
Real-time in-memory computing technology will most probably cause a decline in the sheer number of Structured Query Language (SQL) satellite databases. The purpose of those databases, as flexible, ad hoc, more business-oriented, less IT-static tools, might still be required, but their offline status will be a disadvantage and will delay data updates. Some might argue that satellite systems with in-memory computing technology will take over from satellite SQL databases. SAP Business Explorer tools that use in-memory computing technology represent a paradigm shift. Instead of waiting for IT to work on a long queue of support tickets to create new reports, business users can access large data sets for data exploration and define reports on the fly.
Slide 4
Functional-Gap-based clustering methods (the unsupervised part of FAUST = Functional Analysis Unsupervised and Supervised Teaching).
This class of partitioning or clustering methods relies on choosing a functional (a mapping, F, of each row in a table to a real number) which is distance dominated (i.e., the difference between two functional values, F(x) and F(y), is always ≤ the distance between x and y). Distance dominance of F tells us: if we find a gap in the F-values, we know that points that map to opposite sides of that gap are at least as far apart as the gap width. Here are some of the functionals that we have already used productively (with each, we can also test the actual pairwise distances at the extreme ends of the F-values for outliers):
Coordinate Projection Functionals (ej): check gaps in ej(y) = yj.
Square Distance Functional (SD): check gaps in SDp(y) = (y-p)o(y-p) (parameterized over a p grid).
Square Dot Product Radius (SDPR): SDPRpq(y) = SDp(y) - DPPpq(y)^2 (easier to do the pTree processing).
DPP-KM: 1. Check DPPp,d(y) gaps (grids of p and d?). 1.1 Check distances at sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 1. Check DPPp,d(y) gaps (grids of p and d?) against the density of the subcluster. 1.1 Check distances at sparse extremes against subcluster density. 2. Apply other methods once DPP ceases to be effective.
DPP-SD: 1. Check DPPp,d(y) (over a p-grid and a d-grid) and SDp(y) (over a p-grid). 1.1 Check sparse-end distances against subcluster density. (DPPpd and SDp share construction steps!)
SD-DPP-SDPR (DPPpq, SDp, and SDPRpq): SDp(y) = (y-p)o(y-p) = yoy - 2 yop + pop. DPPpq(y) = (y-p)od = yod - pod = (1/|p-q|) yop - (1/|p-q|) yoq - pod. Calculate yoy, yop, yoq concurrently? Then constant-multiply 2yop and (1/|p-q|)yop concurrently. Then add/subtract. Calculate DPPpq(y)^2, then subtract it from SDp(y).
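A minimal sketch of the gap-checking step under these definitions: project every point onto the line through p and q (the DPP functional), sort the F-values, and cut wherever consecutive values differ by more than a gap threshold. The choice of p, q, the toy points, and the gap threshold are illustrative.

```python
import math

def dpp(y, p, q):
    """F(y) = (y - p) . (q - p) / |q - p|: projection onto the p -> q line."""
    norm = math.dist(p, q)
    return sum((yi - pi) * (qi - pi) for yi, pi, qi in zip(y, p, q)) / norm

def gap_split(points, p, q, gap=3.0):
    """Sort points by F-value and cut at every gap wider than `gap`.
    Distance dominance guarantees points on opposite sides of a cut
    are more than `gap` apart in the full space."""
    scored = sorted(points, key=lambda y: dpp(y, p, q))
    clusters, cur = [], [scored[0]]
    for a, b in zip(scored, scored[1:]):
        if dpp(b, p, q) - dpp(a, p, q) > gap:
            clusters.append(cur)
            cur = []
        cur.append(b)
    clusters.append(cur)
    return clusters

pts = [(0, 0), (1, 1), (2, 0), (10, 0), (11, 1)]
print(gap_split(pts, p=(0, 0), q=(1, 0)))  # split at the wide x-gap
```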
Slide 5
[Slide data, flattened in transcription: the 150-row IRIS table (SL, SW, PL, PW, values x10) rendered vertically, each measurement followed by its bit-slice (binary) encoding. Rows are labeled set (setosa), ver (versicolor), and vir (virginica), 50 of each; e.g., the first row is "set 51 35 14 2" followed by the bits of those four values.]
Slide 6
[Slide data, flattened in transcription: the same 150 IRIS samples with IDs s1-s50 (setosa), e1-e50 (versicolor), i1-i50 (virginica). Each row lists the four measurements (in SL SW PL PW order, despite the slide header "ID PL PW SL SW DPP"), the sample's DPP F-value, and that value's bits; e.g., s1 (51 35 14 2) has DPP = 60, and i19 (77 26 69 23) has DPP = 0.]
Slide 7
Dot Product Projection (DPP): 1. Check F(y) = (y-p)o(q-p)/|q-p| for gaps or thin intervals. 1.1 Check actual distances at sparse ends.
Apply DPP to the IRIS data set (UCI MLR): 150 iris samples (rows) and 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). We assume we don't know ahead of time that the first 50 are the Setosa class, the next 50 the Versicolor class, and the final 50 the Virginica class. We cluster with DPP and then see how close it comes to separating into the 3 known classes (s = setosa, e = versicolor, i = virginica).
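Step 1, histogramming the F-values to expose gaps and thin intervals, can be sketched as follows (the points, the choice of p and q, and the rounding are illustrative, not the pTree computation):

```python
from collections import Counter
import math

def f_histogram(points, p, q):
    """Counts of rounded DPP F-values; absent or low counts mark gaps/thin intervals."""
    norm = math.dist(p, q)
    fvals = [round(sum((y - a) * (b - a) for y, a, b in zip(pt, p, q)) / norm)
             for pt in points]
    return Counter(fvals)

pts = [(0, 0), (1, 0), (2, 1), (9, 0), (10, 1)]
hist = f_histogram(pts, p=(0, 0), q=(1, 0))
print(sorted(hist.items()))  # F-values 3..8 are absent: a gap
```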
CLUS3, outliers removed; p = aaax, q = aaan:
F:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Cnt: 4  2  5 13  8 12  4  2 11  5  4  5  2  7  3  2
No thinning. Sparse low end: check distances among those with F in [0,8]:
F      0    0    3    5    5    6    8    8
       i30  i35  i20  e34  i34  e23  e19  e27
i30    0    12   17   14   12   14   18   11
i35    12   0    7    6    6    7    12   11
i20    17   7    0    5    7    4    5    10
e34    14   6    5    0    3    4    8    9
i34    12   6    7    3    0    4    9    6
e23    14   7    4    4    4    0    5    6
e19    18   12   5    8    9    5    0    9
e27    11   11   10   9    6    6    9    0
i30, i35, i20 are outliers because F ≤ 3 and they are ≥ 4 from those with F in [5,8]; e34, i34 form a doubleton outlier set.
Sparse lower end: check distances among those with F in [0,4]:
F      0    1    2    3    3    3    4
       s14  s42  s45  s23  s16  s43  s3
s14    0    8    14   7    20   3    5
s42    8    0    17   13   24   9    9
s45    14   17   0    11   9    11   10
s23    7    13   11   0    15   5    5
s16    20   24   9    15   0    18   16
s43    3    9    11   5    18   0    3
s3     5    9    10   5    16   3    0
s42 is revealed as an outlier because F(s42) = 1 is ≥ 4 from 5, 6, ... and it is ≥ 4 from the others in [0,4].
gap > 4; p = nnnn, q = xxxx. (F, Count) pairs:
(0,1) (1,1) (2,1) (3,3) (4,1) (5,6) (6,4) (7,5) (8,7) (9,3) (10,8) (11,5) (12,1) (13,2) (14,1) (15,1) (19,1) (20,1) (21,3) (26,2) (28,1) (29,4) (30,2) (31,2) (32,2) (33,4) (34,3) (36,5) (37,2) (38,2) (39,2) (40,5) (41,6) (42,5) (43,7) (44,2) (45,1) (46,3) (47,2) (48,1) (49,5) (50,4) (51,1) (52,3) (53,2) (54,2) (55,3) (56,2) (57,1) (58,1) (59,1) (61,2) (64,2) (66,2) (68,1)
CLUS3.1; p = anxa, q = axna. (F, Cnt) pairs:
(0,2) (3,1) (5,2) (6,1) (8,2) (9,4) (10,3) (11,6) (12,6) (13,7) (14,7) (15,4) (16,3) (19,2)
Thinning at 6,7. CLUS3.1 (F < 6.5): 44 Versicolor, 4 Virginica. CLUS3.2 (F > 6.5): 2 Versicolor, 39 Virginica. No sparse ends.
Sparse upper end: check [16,19] distances.
F:   16  16  16  19  19
     e7 e32 e33 e30 e15
e7    0  17  12  16  14
e32  17   0   5   3   6
e33  12   5   0   5   4
e30  16   3   5   0   4
e15  14   6   4   4   0
e15 is an outlier. So CLUS3.1 = 42 Versicolor.
Gaps: [15,19] and [21,26]. Check distances in [12,28] to see if s16, i39, e49, e8, e11, e44 are outliers.
F:   12  13  13  14  15  19  20  21  21  21  26  26  28
    s34  s6 s45 s19 s16 i39 e49  e8 e11 e44 e32 e30 e31
s34   0   5   8   5   4  21  25  28  32  28  30  28  31
s6    5   0   4   3   6  18  21  23  27  24  26  23  27
s45   8   4   0   6   9  18  18  21  25  21  24  22  25
s19   5   3   6   0   6  17  21  24  27  24  25  23  27
s16   4   6   9   6   0  20  26  29  33  29  30  28  31
i39  21  18  18  17  20   0  17  21  24  21  22  19  23
e49  25  21  18  21  26  17   0   4   7   4   8   8   9
e8   28  23  21  24  29  21   4   0   5   1   7   8   8
e11  32  27  25  27  33  24   7   5   0   4   7   9   7
e44  28  24  21  24  29  21   4   1   4   0   6   8   7
e32  30  26  24  25  30  22   8   7   7   6   0   3   1
e30  28  23  22  23  28  19   8   8   9   8   3   0   4
e31  31  27  25  27  31  23   9   8   7   7   1   4   0
So s16, i39, e49, e11 are outliers; e8, e44 form a doubleton outlier set. Separate at 17 and 23, giving:
CLUS1 (F < 17): the 50 Setosa, with s16, s42 declared outliers.
CLUS2 (17 < F < 23): e8, e11, e44, e49, i39, all already declared outliers.
CLUS3 (F > 23): 46 Versicolor, 49 Virginica, with i6, i10, i18, i19, i23, i32 declared outliers.
CLUS3.2 39 virg, 2 vers (unable to separate
the 2 vers from the 39 virg)
Sparse upper end: check [57,68] distances.
F:   57  58  59  61  61  64  64  66  66  68
    i26 i31  i8 i10 i36  i6 i23 i19 i32 i18
i26   0   5   4   8   7   8  10  13  10  11
i31   5   0   3  10   5   6   7  10  12  12
i8    4   3   0  10   7   5   6   9  11  11
i10   8  10  10   0   8  10  12  14   9   9
i36   7   5   7   8   0   5   7   9   9  10
i6    8   6   5  10   5   0   3   5   9   8
i23  10   7   6  12   7   3   0   4  11  10
i19  13  10   9  14   9   5   4   0  13  12
i32  10  12  11   9   9   9  11  13   0   4
i18  11  12  11   9  10   8  10  12   4   0
i10, i36, i19, i32, i18 are singleton outliers because their F is >= 4 from 56 and they are >= 4 from each other. i6, i23 form a doubleton outlier set.
8
HILL CLIMBING GAP WIDTH
On CLUS2 union CLUS3; p = avg of the F < 16 side, q = avg of the F > 16 side. (F, Cnt) pairs:
(0,1) (1,1) (2,2) (3,1) (7,2) (9,2) (10,2) (11,3) (12,3) (13,2) (14,5) (15,1) (16,3) (17,3) (18,2) (19,2) (20,4) (21,5) (22,2) (23,5) (24,9) (25,1) (26,1) (27,3) (28,2) (29,1) (30,3) (31,5) (32,2) (33,3) (34,3) (35,1) (36,2) (37,4) (38,1) (39,1) (42,2) (44,1) (45,2) (47,2)
CL123; p = avg of the F <= 14 side, q = avg of the F >= 17 side. (F, Cnt) pairs:
(0,1) (2,3) (3,2) (4,4) (5,7) (6,4) (7,8) (8,2) (9,11) (10,4) (12,3) (13,1) (20,1) (21,1) (22,2) (23,1) (27,2) (28,1) (29,1) (30,2) (31,4) (32,2) (33,3) (34,4) (35,1) (36,3) (37,4) (38,2) (39,2) (40,5) (41,3) (42,3) (43,6) (44,8) (45,1) (46,2) (47,1) (48,3) (49,3) (51,7) (52,2) (53,2) (54,3) (55,1) (56,3) (57,3) (58,1) (61,2) (63,2) (64,1) (66,1) (67,1)
No conclusive gaps. Sparse low end: check [0,9] distances.
F:    0   1   2   2   3   7   7   9   9
    i39 e49  e8 e44 e11 e32 e30 e15 e31
i39   0  17  21  21  24  22  19  19  23
e49  17   0   4   4   7   8   8   9   9
e8   21   4   0   1   5   7   8  10   8
e44  21   4   1   0   4   6   8   9   7
e11  24   7   5   4   0   7   9  11   7
e32  22   8   7   6   7   0   3   6   1
e30  19   8   8   8   9   3   0   4   4
e15  19   9  10   9  11   6   4   0   6
e31  23   9   8   7   7   1   4   6   0
i39, e49, e11 are singleton outliers; e8, e44 form a doubleton outlier set.
Dot F; p = aaan, q = aaax. (F, Cnt) pairs:
(0,6) (1,28) (2,7) (3,7) (4,1) (5,1) (9,7) (10,3) (11,5) (12,13) (13,8) (14,12) (15,4) (16,2) (17,12) (18,5) (19,6) (20,6) (21,3) (22,8) (23,3) (24,3)
CLUS1: F < 7 (50 Setosa).
CLUS2: 7 < F < 16 (4 Virginica, 48 Versicolor).
Here, the gap between CLUS1 and CLUS2 is made more pronounced? (Why?) But the thinning between CLUS2 and CLUS3 seems even more obscure? There is a thinning at 22, and it is the same one, but it is not more prominent. Next we attempt to hill-climb the gap at 16 using the means of the half-space boundary (i.e., p is avg of F <= 14, q is avg of F >= 17).
CLUS3: F > 16 (46 Virginica, 2 Versicolor).
Sparse high end: check [38,47] distances.
F:   38  39  42  42  44  45  45  47  47
    i31  i8 i36 i10  i6 i23 i32 i18 i19
i31   0   3   5  10   6   7  12  12  10
i8    3   0   7  10   5   6  11  11   9
i36   5   7   0   8   5   7   9  10   9
i10  10  10   8   0  10  12   9   9  14
i6    6   5   5  10   0   3   9   8   5
i23   7   6   7  12   3   0  11  10   4
i32  12  11   9   9   9  11   0   4  13
i18  12  11  10   9   8  10   4   0  12
i19  10   9   9  14   5   4  13  12   0
i10, i18, i19, i32, i36 are singleton outliers; i6, i23 form a doubleton outlier set.
Next we attempt to hill-climb the gap at 16 using
the half-space averages.
9
"Gap Hill Climbing": mathematical analysis.
One way to increase the size of the functional gaps is to hill-climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher STDev would increase the likelihood of larger gaps, since more dispersion allows for more and/or larger gaps). This is very general. We are more interested in growing the one particular gap of interest (the largest gap or largest thinning). To do this we can proceed as follows.
F-slices are hyperplanes (assuming F = Dot_d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the (n-1)-dimensional F-slice hyperplanes that define the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place).
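The re-orientation step above can be sketched as follows; a minimal illustration on hand-made data, where the means of the two bounding F-slices (found within a hypothetical tolerance band) replace the half-space means:

```python
import numpy as np

def reorient(X, d, gap_lo, gap_hi, band=1.0):
    """One hill-climbing step: re-aim d at the means of the two
    F-slices that bound the gap [gap_lo, gap_hi]."""
    F = X @ d
    p = X[np.abs(F - gap_lo) <= band].mean(axis=0)  # mean of lower bounding F-slice
    q = X[np.abs(F - gap_hi) <= band].mean(axis=0)  # mean of upper bounding F-slice
    return (q - p) / np.linalg.norm(q - p)          # improved unit direction

def max_gap(F):
    v = np.sort(F)
    return np.max(np.diff(v))

# Two small clusters separated along x1; the initial d is deliberately off-axis.
X = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 2],
              [6, 0], [6, 1], [6, 2], [7, 0], [7, 2]], dtype=float)
d0 = np.array([1.0, 1.0]) / np.sqrt(2)
lo, hi = 2.12, 4.24          # the bounding F-values of the largest gap on d0
d1 = reorient(X, d0, lo, hi)
print(max_gap(X @ d0), max_gap(X @ d1))  # the gap widens after re-orientation
```

On this toy set the largest projection gap grows from about 2.1 to about 4.3 after one step.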
The d2-gap is much larger than the d1-gap. It is still not the optimal gap, though. Would it be better to use a weighted mean, weighted by the distance from the gap, that is, by the d-barrel radius (from the center of the gap) on which each point lies?
In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius^2; zero weighting after the first gap is identical to the previous). Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, that are closest together) as p and q (in this case 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.
10
Functional Gap Clustering using F_pq(x) = RND[(x-p)o(q-p)/|q-p|] - minF on a Spaeth image (p = mean).
[Scatter plot of the 15 Spaeth points in the (x1,x2) plane, with p and q marked: z1(1,1) z2(3,1) z3(2,2) z4(3,3) z5(6,2) z6(9,3) z7(15,1) z8(14,2) z9(15,3) za(13,4) zb(10,9) zc(11,10) zd(9,11) ze(11,11) zf(7,8).]
The 15 Value_Arrays (one for each q = z1, z2, z3, ...):
z1: 0 1 2 5 6 10 11 12 14
z2: 0 1 2 5 6 10 11 12 14
z3: 0 1 2 5 6 10 11 12 14
z4: 0 1 3 6 10 11 12 14
z5: 0 1 2 3 5 6 10 11 12 14
z6: 0 1 2 3 7 8 9 10
z7: 0 1 2 3 4 6 9 11 12
z8: 0 1 2 3 4 6 9 11 12
z9: 0 1 2 3 4 6 7 10 12 13
za: 0 1 2 3 4 5 7 11 12 13
zb: 0 1 2 3 4 6 8 10 11 12
zc: 0 1 2 3 5 6 7 8 9 11 12 13
zd: 0 1 2 3 7 8 9 10
ze: 0 1 2 3 5 7 9 11 12 13
zf: 0 1 3 5 6 7 8 9 10 11
The 15 Count_Arrays:
z1: 2 2 4 1 1 1 1 2 1
z2: 2 2 4 1 1 1 1 2 1
z3: 1 5 2 1 1 1 1 2 1
z4: 2 4 2 2 1 1 2 1
z5: 2 2 3 1 1 1 1 1 2 1
z6: 2 1 1 1 1 3 3 3
z7: 1 4 1 3 1 1 1 2 1
z8: 1 2 3 1 3 1 1 2 1
z9: 2 1 1 2 1 3 1 1 2 1
za: 2 1 1 1 1 1 4 1 1 2
zb: 1 2 1 1 3 2 1 1 1 2
zc: 1 1 1 2 2 1 1 1 1 1 1 2
zd: 3 3 3 1 1 1 1 2
ze: 1 1 2 1 3 2 1 1 2 1
zf: 1 2 1 1 2 1 2 2 2 1
gap: F = 6 to F = 10
gap: F = 2 to F = 5
pTree masks of the 3 z1_clusters (obtained by ORing):
z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
The FAUST algorithm:
1. Project onto each pq line using the dot product with the unit vector from p to q.
2. Generate the ValueArrays (along with the CountArrays and the mask pTrees).
3. Analyze all gaps and create sub-cluster pTree masks.
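These three steps can be sketched compactly; a minimal illustration on the 15 Spaeth points, with boolean arrays standing in for pTree masks and an illustrative choice p = z1, q = ze (the slides instead sweep q over every point, with p the mean):

```python
import numpy as np

def faust_round(X, p, q, min_gap=2):
    """One FAUST round: project on the p->q line, build the Value/Count
    arrays, then split the point set at every gap >= min_gap."""
    d = (q - p) / np.linalg.norm(q - p)
    F = np.rint((X - p) @ d).astype(int)               # rounded projections
    values, counts = np.unique(F, return_counts=True)  # ValueArray, CountArray
    cuts = [(values[i] + values[i + 1]) / 2            # midpoints of the gaps
            for i in range(len(values) - 1)
            if values[i + 1] - values[i] >= min_gap]
    edges = [-np.inf] + cuts + [np.inf]
    masks = [(F > lo) & (F < hi)                       # boolean "pTree mask" per sub-cluster
             for lo, hi in zip(edges, edges[1:])]
    return values, counts, masks

# The 15 Spaeth points from the slides, with p = z1 and q = ze.
X = np.array([[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],[15,3],
              [13,4],[10,9],[11,10],[9,11],[11,11],[7,8]], dtype=float)
vals, cnts, masks = faust_round(X, p=X[0], q=X[13])
print(len(masks))   # number of sub-clusters found on this projection line
```

Every point lands in exactly one mask, so the masks partition the point set for this round.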
11
(The 15 Value_Arrays, Count_Arrays and the Spaeth scatter plot are repeated from the previous slide.)
z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
gap 6-9
z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
In Step 3 of the algorithm we can analyze one of the gap arrays (e.g., as was done for z1; the subclusters are shown above) and then start over on each subcluster. Or we can analyze all gap arrays concurrently (in parallel, using the same F, saving the substantial(?) re-compute cost) and then intersect the subcluster partitions we get from each x ValueArray gap analysis for the final subclustering.
Here we use the second alternative, judiciously choosing only the x's that are likely to be productive (choosing z7 next). Many are likely to produce redundant partitions (e.g., z1, z2, z3, z4, z6), as their projection lines will be nearly coincident. How should we choose the sequence of "productive" strides? One way would be to always choose the remaining stride with the shortest ValueArray, so that the chance of decent-sized gaps is maximized. Other ways of choosing?
12
(The Value_Arrays, Count_Arrays and the Spaeth scatter plot are repeated; M marks the mean.)
z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
gap 3-7
z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
We choose zd = z13 next (should it have been first, since its ValueArray is shortest?). Note that the z8, z9, za projection lines will be nearly coincident with that of z7.
13
(The Value_Arrays, Count_Arrays and the Spaeth scatter plot are repeated; M marks the mean.)
z13 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z12 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z11 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
z71 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1
z72 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
zd1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
zd2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
AND each red mask with each blue mask with each green mask to get the subcluster masks (12 ANDs producing 5 sub-clusters).
14
F1(x,y) = L1-Distance(x,y) = sum_i |x_i - y_i|, on X x X = {(x,y) | x,y in X}. Cluster by splitting at all F1 gaps.
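For each fixed z, this round's functional is just the L1 distance to z, and its sorted distinct values with multiplicities give the Value/Count arrays. A small sketch (plain NumPy; the numbers need not exactly reproduce the slide's arrays, which appear to carry some transcription noise):

```python
import numpy as np

def l1_arrays(X, z):
    """ValueArray and CountArray of F1(x) = L1-distance(x, z) over all x in X."""
    F = np.abs(X - z).sum(axis=1)          # L1 distance of each point to z
    return np.unique(F, return_counts=True)

# The 15 Spaeth points; z = z1 = (1,1).
X = np.array([[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],[15,3],
              [13,4],[10,9],[11,10],[9,11],[11,11],[7,8]])
vals, cnts = l1_arrays(X, X[0])
print(vals)   # sorted distinct L1 distances to z1
```

Gap analysis on these value arrays then proceeds exactly as in the linear rounds.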
L1(x,y) Value Arrays:
z1:  0 2 4 5 10 13 14 15 16 17 18 19 20
z2:  0 2 3 8 11 12 13 14 15 16 17 18
z3:  0 2 3 8 11 12 13 14 15 16 17 18
z4:  0 2 3 4 6 9 11 12 13 14 15 16
z5:  0 3 5 8 9 10 11 12 13 14 15
z6:  0 5 6 7 8 9 10
z7:  0 2 5 8 11 12 13 14 15 16
z8:  0 2 3 6 9 11 12 13 14
z9:  0 2 3 6 11 12 13 14 16
z10: 0 3 5 8 9 10 11 13 15
z11: 0 2 3 4 7 8 11 12 13 15 17
z12: 0 1 2 3 6 8 9 11 13 14 15 17 19
z13: 0 2 3 5 8 11 13 14 16 18
z14: 0 1 2 3 7 9 10 12 14 15 16 18 20
z15: 0 4 5 6 7 8 9 10 11 13 15
L1(x,y) Count Arrays:
z1:  1 2 1 1 1 1 2 1 1 1 1 1 1
z2:  1 3 1 1 1 2 1 1 1 1 1 1
z3:  1 3 1 1 1 1 1 2 1 1 1 1
z4:  1 2 1 1 1 1 1 2 1 2 1 1
z5:  1 3 2 1 1 1 2 1 1 1 1
z6:  1 2 3 2 4 1 2
z7:  1 2 1 1 1 1 2 4 1 1
z8:  1 2 1 1 1 2 4 1 2
z9:  1 2 1 1 3 2 1 3 1
z10: 1 2 2 2 1 2 2 2 1
z11: 1 1 2 1 1 1 2 1 2 2 1
z12: 1 1 1 1 1 1 1 2 1 1 1 2 1
z13: 1 1 2 1 1 1 1 3 3 1
z14: 1 1 1 1 1 1 1 2 1 1 1 2 1
z15: 1 1 1 1 2 1 1 1 2 3 1
(Spaeth scatter plot repeated.)
gap [5,10] (redundant subclustering). There is a z1-gap, but it produces a subclustering that was already discovered by a previous round. Which z values will give new subclusterings?
15
(The L1(x,y) Value and Count Arrays are repeated from the previous slide.)
(Spaeth scatter plot repeated; M marks the mean.)
This re-confirms that z6 is an anomaly (outlier), since it was already declared so during the linear gap analysis. It also re-confirms that zf is an anomaly.
After having subclustered with linear gap analysis, which is best for determining the larger subclusters, we run this round-gap algorithm out only two steps, to determine whether there are any single-F-value gaps > 2 (the points in such a single-F-value gapped set are then declared anomalies). That is, we run it out two steps only, then find those points for which the one initial gap determined by the first two values is sufficient to declare outlierness. Doing that here, we reconfirm the outlierness of z6 and zf, while finding new outliers, z5 and za.
16
Gap Revealer
Width >= 2^4 = 16; that is, go down to bit-slices p4 and p4'.
[Scatter plot of z1-zf on a 0-f grid; M marks the mean.]
X = (x1, x2): z1(1,1) z2(3,1) z3(2,2) z4(3,3) z5(6,2) z6(9,3) z7(15,1) z8(14,2) z9(15,3) za(13,4) zb(10,9) zc(11,10) zd(9,11) ze(11,11) zf(7,8)
F = X o d: 11 27 23 34 53 80 118 114 125 114 110 121 109 125 83
p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0
p2 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0
p1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1
p0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1
p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1
p2' 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1
p1' 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0
p0' 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0
[011 0000, 011 1111] = [48,64): z5od = 53 is 19 from z4od = 34 (> 2^4) but 11 from 64. However, the next interval, [64,80), is empty, so z5 is >= 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.
[000 0000, 000 1111] = [0,16): has 1 point in it, z1. z1od = 11 is only 5 units from the right edge, so z1 is not declared an outlier (yet). Next, we check the min distance from the right edge of the next interval to see if z1's right-side gap is actually >= 2^4 (the calculation of the min is a pTree process, no x looping required!).
[001 0000, 001 1111] = [16,32): the minimum, z3od = 23, is 7 units from the left edge, 16, so z1 has only a 5+7 = 12 unit gap on its right (not a 2^4 gap). So z1 is not declared a 2^4 outlier (it is a 2^4 inlier).
[010 0000, 010 1111] = [32,48): z4od = 34 is within 2 of 32, so z4 is not declared an anomaly.
[111 0000, 111 1111] = [112,128): z7od = 118, z8od = 114, z9od = 125, zaod = 114, zcod = 121, zeod = 125. No 2^4 gaps. But we can consult SpS(d2(x,y)) for actual distances.
[110 0000, 110 1111] = [96,112): zbod = 110, zdod = 109.
[100 0000, 100 1111] = [64,80): this is clearly a 2^4 gap, but we have already declared the point to its left an outlier and made a subcluster cut.
[101 0000, 101 1111] = [80,96): z6od = 80, zfod = 83. So both z6 and zf are declared outliers (gap >= 16 on both sides), which reveals that there are no 2^4 gaps in this subcluster. And, incidentally, it reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e}, but that analysis is messy and the gap would be revealed by the next X o M round on this sub-cluster anyway.
X1-X2 distances:
z7-z8 1.4   z7-z9 2.0   z7-z10 3.6  z7-z11 9.4  z7-z12 9.8  z7-z13 11.7  z7-z14 10.8
z8-z9 1.4   z8-z10 2.2  z8-z11 8.1  z8-z12 8.5  z8-z13 10.3 z8-z14 9.5
z9-z10 2.2  z9-z11 7.8  z9-z12 8.1  z9-z13 10.0 z9-z14 8.9
z10-z11 5.8 z10-z12 6.3 z10-z13 8.1 z10-z14 7.3
z11-z12 1.4 z11-z13 2.2 z11-z14 2.2
z12-z13 2.2 z12-z14 1.0
z13-z14 2.0
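The interval walk above, binning the F values by their high-order bits and then checking edge distances, can be sketched as follows (the F values are taken from this slide; the loop prints each interval's members):

```python
import numpy as np

# F = X o d values for the 15 Spaeth points, taken from the slide.
F = np.array([11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83])
names = ['z1','z2','z3','z4','z5','z6','z7','z8','z9','za','zb','zc','zd','ze','zf']

W = 16                      # interval width 2^4: "go down to p4"
bins = F // W               # equivalently, the high-order bits p6 p5 p4 of each F
for b in sorted(set(bins.tolist())):
    lo, hi = b * W, (b + 1) * W
    members = [names[i] for i in range(len(F)) if bins[i] == b]
    print(f"[{lo},{hi}): {members}")
# Empty intervals (e.g., [64,80)) widen the gaps around their neighbors;
# points >= 2^4 from both neighbors become outlier candidates.
```

In a pTree setting the same binning falls out of ANDing the p6, p5, p4 bit-slices, with no loop over the points.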
17
APPENDIX: Barrel Clustering (this method attempts to build barrel-shaped gaps around clusters).
It allows for a better fit around convex clusters that are elongated in one direction (not round).
Exhaustive search for all barrel gaps takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width):
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (an n-direction, so (n-1)-dimensional: a grid on the surface of the sphere in R^n).
Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)), two functionals are used to enclose subclusters in barrel-shaped gaps:
a. SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)^2
b. BarrelLength functional, BL(y) = (y-p)od
Given a p, do we need a full grid of d's (directions)? No! d and -d give the same BL-gaps.
Given d, do we need a full grid of starting points p? No! All p' such that p' = p + cd give the same gaps. Hill-climb gap width from a good starting point and direction.
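The two barrel functionals can be sketched directly from their definitions; a minimal illustration, with the toy data and the choice of p and d purely illustrative:

```python
import numpy as np

def barrel(X, p, d):
    """SBR(y) = (y-p)o(y-p) - ((y-p)od)^2  (squared radial distance from the pd-axis)
       BL(y)  = (y-p)od                    (length along the pd-axis)"""
    Yp = X - p
    BL = Yp @ d                               # axial component
    SBR = (Yp * Yp).sum(axis=1) - BL ** 2     # squared distance to the axis
    return SBR, BL

# An elongated cluster along the x1-axis plus one off-axis point.
X = np.array([[0, 0], [1, 0.1], [2, -0.1], [3, 0.2], [4, 0], [2, 5.0]])
p = np.array([0.0, 0.0])
d = np.array([1.0, 0.0])                      # unit vector along x1
SBR, BL = barrel(X, p, d)
print(SBR)   # the off-axis point has a much larger squared barrel radius
```

Gap analysis on SBR finds the radial (barrel-wall) gaps, and gap analysis on BL finds the linear cap gaps at the barrel's ends.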
MATH: we need the dot-product projection length and the dot-product projection distance.
That is, we need to compute the constants and the dot-product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? Minimizing PTreeSet functional creations and PTreeSet operations.
18
4 functionals in the dot-product group of gap clusterers on a VectorSpace subset Y (y in Y):
1. SLp(y) = (y-p)o(y-p), p a fixed vector: the Square Length functional, primarily for outlier identification and densities.
2. Dotd(y) = yod (d a unit vector): the Dot-product functional. Using d = (q-p)/|q-p| and y-p: Dotp,q(y) = (y-p)o(q-p)/|q-p|.
Is it better to leave all the additions and subtractions for one mega-step at the end? Other efficiency thoughts?
We note that Dot(y) = yod shares many construction steps with SPD.
4. CAd(y) = yod/|y| (d a unit vector): the Cone Angle functional. Using d = (q-p)/|q-p| and y = x-p: CAp,q(y) = (y-p)od/|y-p|, and SCAp,q(y) = ((y-p)od)^2/|y-p|^2 = ((y-p)od)^2/((y-p)o(y-p)), the Squared Cone Angle functional.
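These functionals translate almost line-for-line into vectorized code; a minimal sketch on tiny illustrative data (item 3 of the list is absent from the transcript and is not reconstructed here):

```python
import numpy as np

def SL(Y, p):
    """Square Length: SLp(y) = (y-p)o(y-p)."""
    D = Y - p
    return (D * D).sum(axis=1)

def Dot(Y, p, q):
    """Dotp,q(y) = (y-p)o(q-p)/|q-p|."""
    d = (q - p) / np.linalg.norm(q - p)
    return (Y - p) @ d

def SCA(Y, p, q):
    """Squared Cone Angle: ((y-p)od)^2 / ((y-p)o(y-p))."""
    return Dot(Y, p, q) ** 2 / SL(Y, p)

Y = np.array([[2.0, 0.0], [2.0, 2.0], [0.0, 3.0]])
p = np.array([0.0, 0.0])
q = np.array([1.0, 0.0])
print(SL(Y, p), Dot(Y, p, q), SCA(Y, p, q))
```

SCA is 1 for points on the pq axis and falls toward 0 as the angle from the axis opens up, which is what makes it usable as a cone-shaped gap detector.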
19
SPD: p = (64,29,50,17), q = (61,29,45,14) = e14. (V, Ct) pairs:
(2,10) (3,12) (4,12) (5,12) (6,8) (7,11) (8,9) (9,5) (10,9) (11,4) (12,4) (13,2) (14,1) (17,2) (18,3) (19,10) (20,5) (21,6) (22,5) (23,6) (24,6) (25,3) (27,2) (29,2) (30,1)
SPD on CLUS1: p = (50,20,35,10) = e11, q = (58,31,37,12) = MN. (V, Ct) pairs:
(2,3) (3,4) (4,5) (5,7) (6,2) (7,2) (8,6) (9,6) (10,3) (11,4) (12,2) (13,4) (14,4) (15,3) (16,2) (17,1) (18,5) (19,1) (20,2) (22,2) (23,1) (24,1) (25,1) (26,1) (29,1)
SPD: p = (64,29,50,17), q = (61,29,45,14) = e14. (V, Ct) pairs:
(1,6) (2,4) (3,8) (4,4) (5,10) (6,2) (7,2) (8,2) (9,7) (10,2) (11,2) (12,2) (13,1) (15,2) (17,1) (18,4) (19,2) (20,4) (22,1) (24,1) (25,1) (26,1) (29,1) (31,2) (32,2) (33,3) (37,2: i15, i36) (92,1: i32)
SPD: p = (54,22,39,10), q = (70,34,51,18). (V, Ct) pairs:
(2,8) (3,10) (4,10) (5,10) (6,5) (7,10) (8,6) (9,8) (10,6) (11,1)
mask V < 8.5: CTs 50 0 SMs; CTe 50 50 SMe; CTi 50 24 SMi. CLUS1
mask V < 12.5: 5 SMe, 24 SMi. CLUS1.1
thin gap
mask 8.5 < V < 15.5: CTs 50 1 SMs; CTe 50 0 SMe; CTi 50 24 SMi. CLUS2
masking V > 6: Total_e 37, 2 Masked_e; Total_i 37, 29 Masked_i. However, I cheated a bit: I used p = MinVect(e) and q = MaxVect(e), which makes it somewhat supervised. START OVER WITH THE FULL 150 -->
mask V > 12.5: 45 SMe, 0 SMi. CLUS1.2
mask V > 15.5: CTs 50 49 SMs; CTe 50 0 SMe; CTi 50 2 SMi. This tube contains 49 Setosa, 2 Virginica. CLUS3
CLUS1.2 is pure Versicolor (45 of the 50). CLUS3 is almost pure Setosa (49 of the 50, plus 2 Virginica). CLUS2 is almost purely half of the Virginica (24 of 50, plus 1 Setosa). CLUS1.1 is the other 24 Virginicas, plus the other 5 Versicolors. So this method clusters IRIS quite well (albeit into 4 clusters, not three). Note that caps were not put on these tubes. Also, this was NOT unsupervised clustering! I took advantage of my knowledge of the classes to carefully choose the unit-vector points p and q, e.g., p = MinVector(Versicolor) and q = MaxVector(Versicolor). True, if one sequenced through a fine enough d-grid of all unit-vector directions, one would happen upon a unit vector closely aligned to d = (q-p)/|q-p|, but that would be a whole lot more work than I did here (it would take much longer). In the worst case, though, for totally unsupervised clustering, there would be no other way than to sequence through a grid of unit vectors. However, a good heuristic might be to try all unit vectors "corner-to-corner" and "middle-of-face-to-middle-of-opposite-face" first, etc. Another thought would be to introduce some sort of hill climbing to work our way toward a good combination of a radial gap plus two good linear cap gaps for that radial gap.
20
SPD on CLUS1: p = (60,34,60,25) = CLUS1axxx, q = (60,28,46,15) = CLUS1aaaa. (V, Ct) pairs:
(1,3) (2,5) (3,9) (4,13) (5,18) (6,12) (7,4) (8,1) (9,2) (11,3). No thinnings.
SPD on CLUS1: p = (69,28,60,25) = CLUS1xaxx, q = (60,28,46,15) = CLUS1aaaa. (V, Ct) pairs:
(1,4) (2,13) (3,7) (4,19) (5,9) (6,7) (7,9) (8,2)
SPD on CLUS1: p = (69,34,46,25) = CLUS1xxax, q = (60,28,46,15) = CLUS1aaaa. (V, Ct) pairs:
(1,1) (2,4) (3,3) (4,9) (5,9) (6,14) (7,9) (8,4) (9,6) (10,3) (11,3) (12,1) (14,2) (15,1) (16,1). No thinnings.
SPD on CLUS1: p = (69,34,60,15) = CLUS1xxxa, q = (60,28,46,15) = CLUS1aaaa. (V, Ct) pairs:
(1,1) (2,3) (3,10) (4,15) (5,16) (6, ...