Why Not Store Everything in Main Memory? Why use disks?


1
Applications: PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k-Nearest-Neighbor Classification.
2
From: msilverman@treeminer.com. Sent: Sat, Nov 12, 2011, 8:32. To: Perrizo, W. Subject: question
Bill, I'm working toward a generic framework where we can handle multiple data types easily. Some of the test datasets I am working with have scalar types; from a match perspective, it seems that such features either match or they don't match. The key issue for a classification method would be how much to weight the match, or lack of match, in a final classification determination. Is this correct?
Now, after having typed that, it seems that perhaps you could construct scalar-specific distance measures. Perhaps in a case where one feature is a text string, you could do a variety of string distance measures and use that as a proxy for distance in that dimension (even as you use other, more relevant distance measures for other features in the same dataset, for example HOB distance for a continuous feature). Similarly, a day-of-the-week scalar could return a distance reflecting that Wednesday is closer to Thursday than Monday is. In the end, for a heterogeneous dataset, with the goal of a commercially useful product, you could have a library of scalar distance functions, with a default of match / no match if there is no relevant distance metric (a cockpit switch is in position 1, 2, or 3, as an example). (Another project to park is how to leverage vertical methods for string distances, given the huge issues with unstructured data, if it could be done at all, but that's for later.) And then of course this highlights how critical I think it is in a product to have feature optimization working as part of what we do. This acts as a voting-normalization process in a way, properly weighting the relative importance of a scalar distance (if there is a way to calculate it) versus a HOB distance in a final class vote.
From: Perrizo, W. Sent: Mon, Nov 14, 2011, 10:05 AM. To: 'Mark Silverman'. Subject: RE: question
We can think of distance as a measure of dissimilarity, and focus on similarities (we weight by similarity anyway). A similarity sim(a,b) requires sim(a,a) > sim(a,b) (nothing is more similar to a than a itself) and sim(a,b) = sim(b,a) (symmetric). Any correlation is a similarity, and any distance produces a similarity (via some inversion or reciprocation). Similarities are in 1-1 correspondence with diagonally dominant upper-triangular matrices. E.g.,

      M   T   W   Th  F   Sa  Su
  M   7   6   5   4   4   5   6
  T       7   6   5   4   4   5
  W           7   6   5   4   4
  Th              7   6   5   4
  F                   7   6   5
  Sa                      7   6
  Su                          7

Of course, the matrix representation (and the corresponding table-lookup code implementation) is most useful for low-cardinality scalar attributes such as day or month. But Amal did create a matrix for the Pearson correlation of movies in the Netflix code (we didn't produce a table for renters since it would have been 500K x 500K). Did you get Amal's matrix as part of the software? If the scalar values just can't be coded usefully to numeric (like days can be coded to 1,2,3,4,5,6,7), then Hamming distance may be best (1 for a match, 0 for a mismatch).
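
Below is a minimal Python sketch (not part of any pTree library) of the table-lookup similarity just described for a low-cardinality scalar: the day-of-week matrix stored as its upper triangle with a symmetric lookup. The names DAYS, UPPER and day_sim are illustrative.

# Table-lookup similarity for a low-cardinality scalar (day of week),
# using the upper-triangular matrix above. Only the upper triangle is
# stored; sim(a,b) with a below the diagonal is read as sim(b,a).
DAYS = ["M", "T", "W", "Th", "F", "Sa", "Su"]

# Row i holds sim(DAYS[i], DAYS[j]) for j >= i (values from the matrix above).
UPPER = [
    [7, 6, 5, 4, 4, 5, 6],
    [7, 6, 5, 4, 4, 5],
    [7, 6, 5, 4, 4],
    [7, 6, 5, 4],
    [7, 6, 5],
    [7, 6],
    [7],
]

def day_sim(a: str, b: str) -> int:
    """Symmetric similarity lookup: sim(a,a)=7 is maximal, sim(a,b)=sim(b,a)."""
    i, j = DAYS.index(a), DAYS.index(b)
    if i > j:
        i, j = j, i            # fold into the stored upper triangle
    return UPPER[i][j - i]

# Example: Wednesday is closer to Thursday than to Monday, and lookup is symmetric.
assert day_sim("W", "Th") > day_sim("W", "M")
assert day_sim("Th", "W") == day_sim("W", "Th")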
3
Mark: Why is kNN not O(n)? Couldn't you just track the lowest 7 distances and loop once through the data set? If a distance is less than the current 7th lowest, substitute it in. I understand about ties and picking an arbitrary point affecting accuracy. We are exactly linear with training points.
WP: Yes, kNN and ckNN are O(n). But big-O analysis is not always revealing of true speed. If Nintendo, Sony and Microsoft had relied on big-O analysis, they might have concluded that video games are impossible. Let's say there are n records (so uncompressed pTrees are n bits deep) and d individual pTrees (d is constant). Then the vertical processing of the n horizontal records is O(n) (each individual process step being unaffected by the addition of new records). Horizontal processing of the d vertical pTrees would be O(d) = O(1), except that adding records makes each pTree longer and slower to process. But by how much? Analysis should consider current processors (GPUs used?, folding algorithms, etc.). Big-O analysis would make each of our steps O(n), but that doesn't reflect reality.
Mark: There's a minor fix in the ptreelib which dramatically improves performance; I'll send that separately. I am thinking that we are losing a big advantage by structuring ptrees the way we are: we gain no advantage from being able to prune an entire subtree off, which would give us at least sublinear behavior as the training set increases.
WP: The reason we have not yet used leveled or compressed pTrees (actual trees) is that it has not been necessary (ANDing is so fast, why compress?). Now the reason I (and Sony, MS, Nintendo) don't like big-O gets smoked out. Even if we have the data as actual compressed leveled pTrees, it's still O(n), because the worst case is that there are no runs (i.e., 010101010101010101010101), in which case the leaf level is size n and the accumulation of the other levels is also n, so we have O(2n) = O(n). Big-O analysis is too coarse to recognize the difference between retrieving d pTrees then ANDing/ORing (what we do) and retrieving n records then comparing (what they do). I suppose there is an argument for our claiming sublinear, even log n, but in big-O we're supposed to assume the worst case. So theoretically it's a tie at O(n). Practically we win big.
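
The single-pass kNN scan Mark describes (track the lowest k = 7 distances while looping once through the data) can be sketched as below. This is an illustrative Python sketch using a bounded heap, not the actual ckNN or pTree implementation; the function name and the default distance are assumptions.

# One loop over the training set, keeping only the k smallest distances seen so far.
# heapq gives a max-heap via negated distances.
import heapq

def knn_one_pass(training, query, k=7,
                 dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    """Return the k (distance, label) pairs closest to `query` in one O(n) pass."""
    heap = []                                    # max-heap of (-distance, index, label)
    for idx, (features, label) in enumerate(training):
        d = dist(features, query)
        entry = (-d, idx, label)                 # idx breaks ties so labels never compare
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif d < -heap[0][0]:                    # closer than the current k-th nearest
            heapq.heapreplace(heap, entry)
    return sorted(((-nd, lbl) for nd, _, lbl in heap), key=lambda t: t[0])

# Each training point is touched once, so the scan is linear in n; the heap
# operations are O(log k) with k fixed, which is the O(n) behavior noted above.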
Mark: Agreed. What I am seeing is about a 5-6x speed increase across the board. Because of the noncompressed implementation, adding to a ptree adds to each step precisely the amount of time it takes to AND/OR a 64-bit value. Thus, if we are 5x the speed with a 1M pTree, we are 5x the speed with a 2M pTree. From my analysis, the cost of doing a pass on a ptree is virtually completely dependent on the size; there doesn't seem to be much of a fixed-cost component. I'm looking to push the limit: 5x plus a couple of new CPUs gets them there; 10-20x gets attention. As well, nobody cares about what you claim, they only care about what they see when they push "go" (they will not bring up the 1010101010... case and argue, so I agree with your larger point about O(n)). Images should be highly compressible, and at larger sizes these are advantages we are giving away. Showing a real-world image getting more efficient as sizes increase is a huge psychological plus. Certainly a level-based approach is a bit more complex, but I think we can add to the library some methods to convert from non-compressed to compressed, and adjust the operator routines accordingly (and, to your point, O(n) at the lowest level is not the same as O(n) at the next level up, since n is smaller). Will play with this over the weekend.
WP: Some thoughts regarding multilevel pTree implementations. Greg has created almost all of the code we need for multi-level pTrees. We are not trying to save storage space any more, so we would create the current PTreeSet implementation for the uncompressed pTrees (e.g., for the 17,000 Netflix movies, the movie PTreeSet had 3x17,000 pTrees, each 500,000 bits deep since there were 500,000 users). To implement a new set of level-1 pTrees with bit-stride = 1000 (recall, the uncompressed PTreeSet is called level-0), we would just create another PTreeSet, which would have 3x17,000 pTrees, each 500 bits deep. Each of these 500 bits would be set according to the predicate. For the pure1 predicate, a bit would be set to 1 iff its stride of 1000 consecutive bits is entirely 1-bits, else set to 0. Another 3x17,000 by 500 level-1 PTreeSet could be created using the predicate "at least 50 1-bits". This predicate provides blindingly fast level-1-only FAUST processing with almost the same accuracy. Each of these new level-1 PTreeSets could be built using the existing code. They could be used on their own (for level-1-only processing) or used with the level-0 uncompressed PTreeSet (we would access a non-pure leaf stride by simply calculating the proper offset into the corresponding level-0 pTree). We could even create level-2 PTreeSets, e.g., having 3x17,000 pTrees, each 64 bits deep (stride = 16,000), so that each of these level-2 pTrees would fit in a 64-bit register. I predict that FAUST using the "at least 50" level-1 pTrees will be 1000x faster and very nearly as accurate.
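
A hedged sketch of the level-1 construction just described, assuming a level-0 pTree is a plain bit vector: one level-1 bit per stride, set by a predicate (pure1, or "at least 50 1-bits"). The function names (build_level1, pure1, at_least_50) are hypothetical and are not Greg's PTreeSet code.

def build_level1(level0_bits, stride, predicate):
    """One level-1 bit per stride of level-0 bits; the bit is 1 iff predicate holds."""
    level1 = []
    for start in range(0, len(level0_bits), stride):
        chunk = level0_bits[start:start + stride]
        level1.append(1 if predicate(chunk) else 0)
    return level1

def pure1(chunk):
    return all(chunk)               # the stride is entirely 1-bits

def at_least_50(chunk):
    return sum(chunk) >= 50         # the "at least 50 1-bits" predicate

# Example with stride = 1000, as in the email: a 500,000-bit level-0 pTree
# yields a 500-bit level-1 pTree under either predicate.
level0 = [1] * 250_000 + [0, 1] * 125_000
p1_pure = build_level1(level0, 1000, pure1)
p1_ge50 = build_level1(level0, 1000, at_least_50)
assert len(p1_pure) == 500 and len(p1_ge50) == 500

# A non-pure stride can still be examined at level 0 by offset arithmetic:
# level-0 bits for level-1 position i live at level0[i*1000 : (i+1)*1000].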
4
From: msilverman@treeminer.com. Sent: Sat, Nov 12, 2011, 8:32. To: Perrizo, W. Subject: question
Bill, I'm working toward a generic framework where we can handle multiple data types easily. Some of the test datasets I am working with have scalar types; from a match perspective, it seems that such features either match, or they don't match. The key issue for a classification method would be how much to weight the match or lack of match in a final classification determination. Is this correct?
Now, after having typed that, it seems that perhaps you could construct scalar-specific distance measures. Perhaps in a case where one feature is a text string, you could do a variety of string distance measures, and use that as a proxy for distance in that dimension (even as you use other more relevant distance measures for other features in the same dataset, for example, HOB distance for a continuous feature). Similarly, a day-of-the-week scalar could return a distance reflecting that Wednesday is closer to Thursday than Monday is. In the end, then, for a heterogeneous dataset, and with the goal of a commercially useful product, you could possibly have a library of scalar distance functions, with a default of match / no match if there is no relevant distance metric (a cockpit switch is in position 1, 2, or 3, as an example). (Another project to park is how to leverage vertical methods for string distances, given the huge issues with unstructured data, if it could be done at all, but that's for later.) And then of course this highlights how critical I think it is in a product to have feature optimization working as part of what we do. This acts as a voting-normalization process in a way, properly weighting the relative importance of a scalar distance (if there is a way to calculate it) versus a HOB distance in a final class vote.
5
From: Perrizo, W. Sent: Mon, Nov 14, 2011, 10:05 AM. To: 'Mark Silverman'. Subject: RE: question
We can think of distance as a measure of dissimilarity, and then focus on similarities, because we end up weighting by similarity in the end anyway. The only requirements of a similarity sim(a,b) are, for every a and b, that sim(a,a) > sim(a,b) (nothing is more similar to a than a itself) and that sim(a,b) = sim(b,a) (symmetric). Any correlation is a similarity, and any distance produces a similarity (via some inversion or reciprocation). Similarities are in 1-1 correspondence with diagonally dominant upper-triangular matrices. E.g.,

      M   T   W   Th  F   Sa  Su
  M   7   6   5   4   4   5   6
  T       7   6   5   4   4   5
  W           7   6   5   4   4
  Th              7   6   5   4
  F                   7   6   5
  Sa                      7   6
  Su                          7

Of course, the matrix representation (and the corresponding table-lookup code implementation) is most useful for low-cardinality scalar attributes such as day or month. On the other hand, Amal did create a matrix for the Pearson correlation of movies in the Netflix code (we didn't produce a table for renters since it would have been 500K x 500K). Did you get Amal's matrix as part of the software? If the scalar values just can't be coded usefully to numeric (like days can be coded to 1,2,3,4,5,6,7), then Hamming distance may be best (1 for a match, 0 for a mismatch).
6
Mark Silverman said: I'm working to complete the c-kNN as a test using my XML ptree description files, and have a question on HOB distance. What happens if I have a 16-bit attribute and an 8-bit attribute? Should I mask based upon the smallest field (e.g., in this case, pass 1 would be 0xFFFF and 0xFF; if not sufficient matches, pass 2 would be 0xFFFC and 0xFE, so that the HOB mask is proportionally the same for all attributes), or should I make the mask for pass 2 be 0xFFFE and 0xFE? Have you considered this before? Thanks as always! Mark
WP said back: Your question is a very good one, but tough to give a definitive answer to (for me anyway). It goes to the "attribute normalization issue", and that is an important but tough issue (few deal with it well, in my opinion). Consider Hobbit as a "quick look" L1 distance (using just one bit slice in each attribute, namely the highest-order bit-of-difference in that attribute). So, e.g., in 2-D, the question is how do we relatively scale the two attributes so that a "sphere" is correctly round (of course an L1 or Hobbit sphere is a rectangle, and not round at all, but the idea is to relatively scale the 2 dimensions so that it is a "square" in terms of the semantics of the 2 attributes). Take an image with Red and Green (e.g., for vegetation indexing). If R is 16-bit and G is 8-bit (not likely, but let's assume that just for the purposes of this discussion), and we use 0xFFFE for R (horizontal dimension) and 0xFE for G (vertical dimension), then we are really constructing tall skinny rectangles as our neighborhoods {x : distance(x, center) < epsilon}. In this case it would seem to make sense to me to use 0xFE00 for R and 0xFE for G, so that the neighborhood would look more like a square, since both dimensions count photons detected over the same interval of time; but maybe R counts in Fixed4.2 format and G counts in Fixed2.0? If R counts in Fixed4.0 and G counts in Fixed2.0, then 0xFFFE for R and 0xFE for G would give us a square neighborhood. What I'm trying to say is that the two deltas should be chosen based on the semantics of the two columns (what is being measured, and then what that suggests to us in terms of "moving out from the center" approximately the same amount in each ring). The more general question of normalizing attributes (columns) when the two measurements are not so clearly related is a harder, but extremely important, one (if done poorly, one dimension gets radically favored in terms of the chosen "neighboring points" that get to vote during the classification, for instance).
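
To make the masking question concrete, here is an illustrative Python sketch (not from the ptree library) comparing the two pass schedules discussed above for a 16-bit attribute R and an 8-bit attribute G: proportional masks (0xFFFF/0xFF, then 0xFFFC/0xFE, ...) versus bit-for-bit masks (0xFFFE/0xFE, ...). The pass schedules and sample values are assumptions for illustration.

def hob_match(a, b, mask):
    """Two values 'match' at this pass if they agree on the masked high-order bits."""
    return (a & mask) == (b & mask)

# Proportional masking: drop low-order bits of both attributes at the same
# relative rate, keeping the neighborhood "square" in the semantics of the columns.
proportional_passes = [(0xFFFF, 0xFF), (0xFFFC, 0xFE), (0xFFF0, 0xFC)]

# Bit-for-bit masking: drop one low-order bit of each attribute per pass, which
# makes the 16-bit dimension relatively much narrower (tall skinny rectangles).
bit_for_bit_passes = [(0xFFFF, 0xFF), (0xFFFE, 0xFE), (0xFFFC, 0xFC)]

r_center, g_center = 0x1234, 0x56      # hypothetical center pixel
r_x, g_x = 0x1237, 0x57                # hypothetical candidate neighbor

for mask_r, mask_g in proportional_passes:
    if hob_match(r_x, r_center, mask_r) and hob_match(g_x, g_center, mask_g):
        # With these values, the neighbor is found on pass 2 (0xFFFC / 0xFE).
        print("neighbor found at masks", hex(mask_r), hex(mask_g))
        break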
7
  • Mark S.: I realize this is variable depending upon data type. I've had some nibbles from people who are asking to view the solution from a financial perspective, e.g., what is the cost of ownership vs. existing methods.
  • Currently it seems everything is geared towards in-memory data mining. What happens if the pTree doesn't fit into memory (it seems to me that, in the case of classification for example, the training set not fitting into memory is the bigger issue)?
  • I'm working up a model (geared to "If I turbocharged your existing data warehouse with a vertical solution, how much would I save?"). Guesses? Viewing it holistically, only parts are quantifiable: how much is it worth to get an answer 10x faster (quantifiable in some cases)?
  • Has there been any work done on the effectiveness of pTree compression (how much will pTrees compress a large dataset)?
  • WP: That's a tough one, since we really don't have an implementation other than for uncompressed pTrees yet :-(
  • I can't say too much definitively, since we haven't yet produced an implementation with any compression (leveled pTrees). Images:
  • Urban images may not compress so much, since they are replete with edges (many adjacent bit changes).
  • Non-urban images (e.g., mother nature as creator) have massive patches of uniform color with a low percentage of boundary or edge pixels.
  • Therefore even raster orderings produce long strings of 0s or long strings of 1s (which compress well in multi-leveled pTrees).
  • With Z-ordering or Hilbert ordering, compression increases further in Mother Nature images (and even for man-made rural spaces, crops, the same can be said).
  • Aside: If Z or Hilbert helps, why not use run-length or LZW compression? Because we need to be able to process (at each level) without decompressing. How do you AND 2 run-length-compressed strings of bits? Lots of decisions in the code, not just AND/OR of massive strings.
  • A tangential thought: a quick predictor of, say, a terrorist training site might be a sudden increase in bit-change frequency (as you proceed down the pTree sequence, the compression disappears all of a sudden), because terrorist training sites are edge-busy (almost urban) and they are usually placed in (surrounded by) uniform terrain like desert or forest, not urban terrain.
  • As a decision-support machine, not a decision maker, the pTree classifier would point to a closer look at that area of sudden high bit-change.
  • And this pTree phenomenon could be quickly identified by looking through the pTree_segment ToC for inode subtrees with inordinately many pointers to non-compressed leaf segments (assuming no pure0 or pure1 segments are stored, since they do not need to be).
  • This would be yet another potential answer which could come directly from the schema (without even retrieving the pTree data itself). E.g., if the segment ToC gets suddenly longer (many more leaf segments to point to), that would signal an area for closer scrutiny.
  • The main pTree turbocharger is horizontal traversal across (pertinency-selective) vertical pTrees (AND and OR), with few comparisons/code decision points.

In this age of infinite storage (nearly infinite non-rotating memory possibilities), maybe the problem goes away to an extent? For pTree classification a critical step is attribute relevance (e.g., in Netflix, we applied pruning based on relevance = high correlation, from pre-computed correlation tables; in the 2006 KDDcup case, GA-based relevant-attribute selection was a central part of the win). When we have multi-level pTrees implemented (with various sets of rough pTrees pre-computed and available), we will be able to work down from the top level to get the predicted class, possibly without needing to see the lower levels of the pTrees (i.e., the leaf level), or at least be able to do significant attribute-relevance analysis at high levels in the pTrees so that we only access leaves of a select few pTrees. Pre-computed attribute correlation tables help in determining relevance before accessing the training dataset. Note that correlation calculations are two-columns-at-a-time calculations and therefore lend themselves to very efficient calculation via pTrees (or any vertical arrangement); a sketch follows below. Even if all the above fail or are difficult (which may be somewhat the case for pixel classification of large images?), we can code so that training pTrees are brought in in batches (and prefetched while ANDing/ORing the previous batch). Since we don't have to pull in entire rows (as in the horizontal data case) to process relevant column feature values, we get a distinct advantage over vertical processing of horizontal data (note that indexes help the VPHD people, but indexes ARE vertical partial replications of the dataset; or, said another way, pTrees ARE comprehensive indexes that contain all the information in the dataset and therefore eliminate the need for the dataset itself altogether :-) ). In many classifications, only the relevant training pTrees go into memory at a time. That often saves us from the problem you point to. Bottom line: needing data from disk in real time is usually a bottleneck, but pTrees are a serious palliative medicine to treat that pain.
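
As a sketch of why two-column correlations suit a vertical arrangement: the sums needed for Pearson correlation can be reconstructed from weighted 1-bit counts of ANDed bit slices. The Python below is an illustrative sketch under that formulation, not the authors' correlation-table code; all names are hypothetical.

def bit_slices(column, width):
    """Vertical decomposition: slice i holds bit i of every value in the column."""
    return [[(v >> i) & 1 for v in column] for i in range(width)]

def root_count(bits):
    return sum(bits)                      # number of 1-bits in a slice

def AND(p, q):
    return [a & b for a, b in zip(p, q)]

def pearson_from_slices(px, py, n):
    """Sums and cross-products reconstructed from weighted slice counts."""
    sx  = sum((1 << i) * root_count(px[i]) for i in range(len(px)))
    sy  = sum((1 << j) * root_count(py[j]) for j in range(len(py)))
    sxx = sum((1 << (i + k)) * root_count(AND(px[i], px[k]))
              for i in range(len(px)) for k in range(len(px)))
    syy = sum((1 << (j + k)) * root_count(AND(py[j], py[k]))
              for j in range(len(py)) for k in range(len(py)))
    sxy = sum((1 << (i + j)) * root_count(AND(px[i], py[j]))
              for i in range(len(px)) for j in range(len(py)))
    num = n * sxy - sx * sy
    den = ((n * sxx - sx * sx) * (n * syy - sy * sy)) ** 0.5
    return num / den if den else 0.0

# Tiny example: two 4-row columns, 3 bit slices each.
X, Y = [3, 5, 7, 2], [1, 4, 6, 1]
r = pearson_from_slices(bit_slices(X, 3), bit_slices(Y, 3), len(X))
print(round(r, 3))   # ~0.982, matching the direct Pearson formula on X and Y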
8
  • Discrete Sequential Data? I've been going back over the last NASA submission in preparation for the new one, and am trying to improve on areas where reviewers' comments indicated weakness. One is on the work plan, e.g., how are we going to solve the problem? Do you remember last year we were discussing discrete sequential data and anomaly detection? We had discussed back then turning the sequences into vertical columns, but had concerns (rightly so) that we could be growing a huge number of columns, eliminating our scalability advantage. The issue is as follows:

E.g., a set of 3 switches on a plane: flaps, landing gear, fuel pump. Assume that we have an example of a correct sequence:

          Flaps  Gear  Pump
  Time 0    0      0     0
  Time 1    1      0     0
  Time 2    1      1     0
  Time 3    1      1     0
  Time 4    1      1     1

"Vertical" is only relevant for a table-centric view (each row is an instance of the entity to be data mined). There are 2 entities here, switches and times; classification can focus on either (or both), since the data is an on/off setting for each pair of a switch-instance and a time-instance.
Aren't discrete time sequences already vertical datasets? Yes. (See above.) 2nd observation: assume an actual sequence from the pilot:

          Flaps  Gear  Pump
  Time 0    0      0     0
  Time 1    1      1     0
  Time 2    1      1     0
  Time 3    1      1     0
  Time 4    1      1     1

Call this S(i). Assume n of these to classify. Can you not effectively determine a few things rather simply with pTree math?
1. An anomalous point in time is one in which... create the anomaly ptree as follows: (training-set symbol 1 XOR data-point symbol 1) OR (training-set symbol 2 XOR data-point symbol 2) OR (training-set symbol 3 XOR data-point symbol 3). If this value is 1, there is an anomaly at that point (and maybe the count of 1s measures the extent of the anomaly?). The RC (root count) of the anomaly ptree is a count of the relative anomalousness of the sequence with respect to the training sequence. If RC = 0 there is no difference.
temporal norm means?   In Simulation Science
terms fixed increment time advance and next
event time advance. My guess is next event. Try
data mining both way combo the of 3 ways as we
did in Netflix.  So we would create (one time
cost) switch pTrees (as you have done thinking
of the data as rows of time-instances with
switch-value columns) and also time pTrees
(rotating the table so that each row is a switch
entity and each col is a time instance (each
successive col is time of the next switch
change).
VSSD outlierpt w no nbrs (0NNC or avg pairwise
dist in set, apds, outlier10apds or VSSD to
set of other pts is large. With respect to this
point FAUST should work well also, I agree. 
From Algs to Analytics Integrated Ag Bus/CS
Initiative to Explore Comp Facilitated Commodity
Trading Research.The strategy will be to
highlight the potential to integrate and focus
NDSU's leadership in vertical algorithm
development and implementation on problems of
interest to agricultural commodity trading.
 Highlight opp CTR provides to take CS into
practical application by students and
researchers. (CTR to serve as a crucible to focus
multi-disciplinary research and education efforts
across diverse Univ. strength areas.) Get are
copies/ptrs to research on stochastic based risk
modeling.  
9
  • From: M Silverman. Sent: 9/2. To: W Perrizo. Subject: need your guidance.... remember discrete sequential data?
  • I've been going back over the last NASA submission in preparation for the new one, and am trying to improve upon areas where reviewers' comments indicated weakness. One is on the work plan, e.g., how are we going to solve the problem?
  • Do you remember last year we were discussing discrete sequential data and anomaly detection? We had discussed back then turning the sequences into vertical columns, but had concerns (rightly so) that we could be growing a huge number of columns, eliminating our scalability advantage. The issue is as follows:

E.g., a set of 3 switches on a plane: flaps, landing gear, fuel pump. Assume that we have an example of a correct sequence:

          Flaps  Gear  Pump
  Time 0    0      0     0
  Time 1    1      0     0
  Time 2    1      1     0
  Time 3    1      1     0
  Time 4    1      1     1

My response: Yes, it looks good. This is the way I remember thinking of it. The word "vertical" is only relevant when one takes a table-centric view (or horizontal, row-centric view: each row is an instance of the entity to be studied or data mined). As in the Netflix case, where we had two entities, users and movies, in this case we have switches and times. We can focus the data mining (classification) on either, and typically we do both: data mining with respect to one entity at a time, but doing it both ways, rotating the table (swapping rows and columns) to get the other vertical view, and producing a whole new set of pTrees (recall, in Netflix, we had movie pTrees and user pTrees). Here we would have switch pTrees and time pTrees coming from the same data, which is an on/off setting for each pair of a switch-instance and a time-instance.
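
An illustrative sketch of the two vertical views just described: switch pTrees and time pTrees built from the same on/off table by rotating it, with the values taken from the table above. The names are hypothetical, not the actual PTreeSet implementation.

# Rows are time instances, columns are switches (Flaps, Gear, Pump), as above.
rows = [
    [0, 0, 0],   # Time 0
    [1, 0, 0],   # Time 1
    [1, 1, 0],   # Time 2
    [1, 1, 0],   # Time 3
    [1, 1, 1],   # Time 4
]
switch_names = ["Flaps", "Gear", "Pump"]

# Switch pTrees: one bit vector per switch, indexed by time.
switch_ptrees = {name: [row[i] for row in rows] for i, name in enumerate(switch_names)}

# Time pTrees: rotate (transpose) the table, one bit vector per time instance.
time_ptrees = {f"Time {t}": list(row) for t, row in enumerate(rows)}

print(switch_ptrees["Gear"])   # [0, 0, 1, 1, 1] -- Gear comes on at Time 2
print(time_ptrees["Time 2"])   # [1, 1, 0]       -- at Time 2, Flaps and Gear are on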
10
  • Mark again: 1st observation: Aren't discrete time sequences already vertical datasets? Me: Yes. (See comments above.)
  • 2nd observation: assuming for the moment that time intervals are normalized, and you have an actual sequence from the pilot:
  •            Flaps  Gear  Pump
  • Time 0      0      0     0
  • Time 1      1      1     0
  • Time 2      1      1     0
  • Time 3      1      1     0
  • Time 4      1      1     1
  • Call this S(i). Assume you have n of these to classify. Can you not effectively determine a few things rather simply with pTree math?
  • 1. An anomalous point in time is one in which...
  • Create the anomaly ptree as follows:
  • (training-set symbol 1 XOR data-point symbol 1) OR (training-set symbol 2 XOR data-point symbol 2) OR (training-set symbol 3 XOR data-point symbol 3)
  • If this value is 1, there is an anomaly at that point (and maybe the count of 1s measures the extent of the anomaly?). (A sketch of this anomaly ptree follows after this list.)
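
Here is the anomaly-pTree construction from the list above, sketched in Python under the assumption that each switch column is a time-ordered bit vector: XOR the training column against the observed column, OR the results, then read the root count (RC) and the indexes of the 1-bits. The helper names are illustrative.

def xor(p, q):  return [a ^ b for a, b in zip(p, q)]
def or_(p, q):  return [a | b for a, b in zip(p, q)]
def root_count(p):  return sum(p)
def one_indexes(p): return [t for t, bit in enumerate(p) if bit]  # "index of next 1-bit"

# Columns are time-ordered bit vectors (Time 0 .. Time 4), from the tables above.
train = {"Flaps": [0, 1, 1, 1, 1], "Gear": [0, 0, 1, 1, 1], "Pump": [0, 0, 0, 0, 1]}
pilot = {"Flaps": [0, 1, 1, 1, 1], "Gear": [0, 1, 1, 1, 1], "Pump": [0, 0, 0, 0, 1]}

anomaly = [0] * 5
for switch in train:
    anomaly = or_(anomaly, xor(train[switch], pilot[switch]))

print(anomaly)               # [0, 1, 0, 0, 0]: the pilot raised Gear one step early
print(root_count(anomaly))   # RC = 1, the extent of the anomaly (RC = 0 means identical)
print(one_indexes(anomaly))  # [1]: the anomalous point in time is Time 1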

11
  • 2. Further, you can find the exact point in time where the anomalies exist by the indexes of the 1s in the anomaly ptree. Wouldn't the point in time of the anomaly be indexed by the offset of the XOR 1-bit? (Greg has a function to return the index of the next 1-bit.)
  • Unless I'm missing something (we have a temporal normalization problem perhaps, but so does every such method), this seems like it should work, and be relatively fast. Current art is O(n²).
  • This should work even if we have more than binary switches (multibit).
  • Am I missing something?
  • My response: You're right, Mark.
  • Comments:
  • "Temporal normalization" means? I would understand it in Simulation Science terms, namely that we have essentially two choices: fixed-increment time advance and next-event time advance.
  • My guess is that the data comes to us as next-event; that is, a new time entity instance (time row) is produced whenever one or more of the switches changes, regardless of whether that change event occurs after an evenly spaced time increment (which it almost certainly would not).
  • I think we would want to try data mining both ways and, in fact, combining the two ways as we did in Netflix. So we would create (one-time cost) switch pTrees (as you have done, thinking of the data as rows of time-instances with switch-value columns) and also time pTrees (rotating the table so that each row is a switch entity and each column is a time instance; each successive column is probably the time of the next switch change).

12
  • From: Mark Silverman. Sent: Wednesday, September 07. To: Perrizo, William. Subject: VSSD
  • I am going through methods now to finish the NASA SBIR proposal, and was looking through some other work on outlier detection. I ran across the VSSD material. Was any work ever done in applying VSSD to outlier detection?
  • From: Perrizo, William. Sent: Wednesday, September 07. To: 'Mark Silverman'. Subject: RE: VSSD
  • The VSSD classifier method might be used for outlier detection, in the sense that an outlier might be defined as a sample point with no neighbors (so that outliers would be determined by applying 0NNC, or zero-nearest-neighbor classification). One might calculate the average pairwise distance separation in the set, apds, then decide upon a definition of outlier such as "farther than 10 x apds", and use the VSSD classifier.
  • But more pertinent to VSSD, one might define a point to be an outlier if its VSSD with respect to the set of other points is large. I actually like that because it is not so sensitive to a few other points being close but most points being far away (e.g., if a fumble-fingered pilot hits on/off/on by mistake several times, it would still show up as an anomaly even though it has several neighbors). (A hedged sketch of these outlier rules follows after this list.)
  • From: Perrizo, William. Sent: Wednesday, September 07. To: 'Mark Silverman'. Subject: RE: VSSD
  • With respect to this point, FAUST should work well also, I agree.
  • One could analyze the training set for potential outlier anomaly classes (tiny classes that, when scrutinized, are determined to be fumble-fingering). Then one could do FAUST on the training set with respect to that anomaly class to determine a good cutpoint with respect to each other good class. Then do the multi-class FAUST AND operation and voilà! Or, for real speed, just lump all good classes into one big multi-class and calculate a cutpoint between that multi-class and the anomaly class. Then apply the one inequality pTree function and voilà!
  • Lumping all good classes into one multiclass might result in there not being a good cutpoint (no good oblique cut-plane). The lumping approach could be tried first (very quickly) and, if it fails, go with the ANDing of binary half-spaces.
  • So if there is a hyperplane separating the anomaly class from the rest of the points, the lumping method will find it. If not, one looks for an intersection of hyperplanes that separates.
  • Yes, then 2 fumble-fingered pilots would show up as anomalies even though they fumble the same way. This would work, I think, pretty generally: not only for discrete-time sequential data but for all data types...
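
Finally, a loose sketch of the outlier rules discussed above. VSSD is not defined on these slides, so it is taken here, as an assumption, to mean the total squared distance from a point to the rest of the set; the 10 x apds rule and the no-neighbors idea are from the email. Everything below is illustrative Python, not the VSSD classifier code.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def apds(points):
    """Average pairwise distance separation of the set."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(sq_dist(points[i], points[j]) ** 0.5 for i, j in pairs) / len(pairs)

def vssd(p, others):
    """Assumed reading of VSSD: total squared distance from p to the other points."""
    return sum(sq_dist(p, q) for q in others)

def outliers(points, k=10.0):
    """Flag points whose nearest neighbor is farther than k * apds ('no neighbors')."""
    threshold = k * apds(points)
    flagged = []
    for i, p in enumerate(points):
        nearest = min(sq_dist(p, q) ** 0.5 for j, q in enumerate(points) if j != i)
        if nearest > threshold:
            flagged.append(i)
    return flagged

# A tight cluster of 24 points plus one far-away point.
cluster = [(x, y) for x in range(5) for y in range(5)][:24]
pts = cluster + [(1000, 1000)]
print(outliers(pts))                              # [24]: only the far point is flagged
print(vssd(pts[-1], pts[:-1]) > vssd(pts[0], pts[1:]))  # True: its VSSD is also far larger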