Transcript: (Rare) Category Detection Using Hierarchical Mean Shift
1
(Rare) Category Detection Using Hierarchical Mean
Shift
  • Pavan Vatturi (vatturi@eecs.oregonstate.edu)
  • Weng-Keen Wong (wong@eecs.oregonstate.edu)

2
1. Introduction
  • Applications for surveillance, monitoring,
    scientific discovery and data cleaning require
    anomaly detection
  • Anomalies often identified as statistically
    unusual data points
  • Many anomalies are simply irrelevant or
    correspond to known sources of noise

3
1. Introduction
Known objects (99.9% of the data)
Anomalies (0.1% of the data)
Uninteresting (99% of anomalies)
Interesting (1% of anomalies)
Pictures from the Sloan Digital Sky Survey
(http://www.sdss.org/iotw/archive.html); Pelleg,
D. (2004). Scalable and Practical Probability
Density Estimators for Scientific Anomaly
Detection. PhD Thesis, Carnegie Mellon
University.
4
1. Introduction
  • Category Detection [Pelleg and Moore 2004]:
    human-in-the-loop exploratory data analysis

[Figure: the category detection loop. Data Set →
Build Model → Spot Interesting Data Points → Ask
User to Label Categories of Interesting Data
Points → Update Model with Labels → repeat]
5
1. Introduction
  • The user can either
  • Label a query data point under an existing
    category
  • Or declare the data point to belong to a
    previously undeclared category

6
1. Introduction
  • Goal: present to the user a single instance from
    each category in as few queries as possible
  • Difficult to detect rare categories if class
    imbalance is severe
  • Interested in rare categories for anomaly
    detection

7
Outline
  1. Introduction
  2. Related Work
  3. Background
  4. Methodology
  5. Results
  6. Conclusion / Future Work

8
2. Related Work
  • Interleave [Pelleg and Moore 2004]
  • Nearest-neighbor-based active learning for
    rare-category detection over multiple classes [He
    and Carbonell 2008]
  • Multiple output identification [Fine and Mansour
    2006]

9
3. Background: Mean Shift [Fukunaga and Hostetler
1975]
[Figure: a query point, the reference data set, and
the mean shift vector with kernel k, which points
toward the local center of mass and follows the
density gradient]
10
3. Background: Mean Shift [Fukunaga and Hostetler
1975]
[Figure: repeated mean shift updates move the query
point through successive centers of mass until it
converges to the cluster center]
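The iteration illustrated above can be sketched in a few lines. This is a minimal version assuming a Gaussian kernel; the bandwidth `h`, tolerances, and toy data are illustrative, not the authors' implementation:

```python
# Mean shift: repeatedly move a query point x to the kernel-weighted
# center of mass of the reference data, i.e. uphill along the density
# gradient, until it converges to a local mode (cluster center).
import numpy as np

def mean_shift_point(x, data, h, tol=1e-5, max_iter=100):
    """Iterate x toward a local density mode of `data` at bandwidth h."""
    for _ in range(max_iter):
        # Gaussian kernel weights k(||(x - x_i) / h||^2)
        w = np.exp(-0.5 * np.sum(((data - x) / h) ** 2, axis=1))
        new_x = w @ data / w.sum()           # center of mass
        if np.linalg.norm(new_x - x) < tol:  # converged to a mode
            break
        x = new_x
    return x

# Two well-separated 1-D clusters: a query starting at 0.3 is pulled
# into the mode of the cluster near 0, not the one near 5.
data = np.array([[0.0], [0.1], [-0.1], [5.0], [5.1], [4.9]])
mode = mean_shift_point(np.array([0.3]), data, h=0.5)
```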
11
3. Background: Mean Shift Blurring
  • Blurring mean shift
  • The query points are the same points as the
    reference data set
  • Each pass progressively blurs the original data
    set

12
3. Background: Mean Shift
End result of applying mean shift to a synthetic
data set
13
4. Methodology: Overview
  1. Sphere the data
  2. Hierarchical Mean Shift
  3. Query user
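Step 1, sphering, can be sketched as follows. This assumes PCA whitening (zero mean, identity covariance); the slides do not specify the exact variant, so treat this as one reasonable reading:

```python
# Sphering (whitening): linearly transform the data to zero mean and
# identity covariance, so the single bandwidth used by mean shift
# treats all dimensions comparably.
import numpy as np

def sphere(X, eps=1e-12):
    Xc = X - X.mean(axis=0)               # center to zero mean
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)      # eigendecomposition of covariance
    W = vecs / np.sqrt(vals + eps)        # whitening transform
    return Xc @ W

# Correlated 3-D data; after sphering the covariance is ~identity.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0, 0], [1.0, 1, 0], [0, 0, 0.1]])
Z = sphere(X)
```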

14
4. Methodology: Hierarchical Mean Shift
Repeatedly blur the data using mean shift with
increasing bandwidth: h_new = k * h_old
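The loop above can be sketched as follows. This is a simplified version assuming a Gaussian kernel and a geometric bandwidth schedule h_new = k * h_old; the merge tolerance and iteration counts are illustrative choices, not the authors' code:

```python
# Hierarchical Mean Shift (blurring variant): at each level, blur the
# data set toward its modes at the current bandwidth, merge points that
# have collapsed together, then grow the bandwidth and repeat. Small
# bandwidths reveal fine clusters; large ones merge them.
import numpy as np

def blur_once(X, h):
    """One blurring pass: shift every point at bandwidth h."""
    Y = np.empty_like(X)
    for i, x in enumerate(X):
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
        Y[i] = w @ X / w.sum()
    return Y

def hierarchical_mean_shift(X, h0=0.1, k=2.0, levels=6, merge_tol=1e-2):
    """Return the number of distinct cluster centers at each bandwidth."""
    h, counts = h0, []
    for _ in range(levels):
        for _ in range(20):               # blur until (approximately) stable
            X = blur_once(X, h)
        # count distinct centers after merging near-coincident points
        centers = np.unique(np.round(X / merge_tol).astype(int), axis=0)
        counts.append(len(centers))
        h *= k                            # h_new = k * h_old
    return counts

# Two tight pairs plus one isolated point: 3 clusters at the finest
# bandwidth; clusters merge as the bandwidth grows.
X = np.array([[0.0], [0.05], [1.0], [1.05], [10.0]])
counts = hierarchical_mean_shift(X)
```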
15
4. Methodology: Hierarchical Mean Shift
  • Mean shift complexity is O(n^2 d m), where
  • n = number of data points
  • d = dimensionality of the data points
  • m = number of mean shift iterations
  • A single kd-tree optimization is used to speed up
    Hierarchical Mean Shift

16
4. Methodology: Querying the User
  • Rank cluster centers for querying the user.
  • Outlierness [Leung et al. 2000] for cluster Ci:

Lifetime of Ci = log(bandwidth at which cluster Ci
is merged with other clusters / bandwidth at which
cluster Ci is formed)
17
4. Methodology: Querying the User
  • Rank cluster centers for querying the user.
  • Compactness / Isolation [Leung et al. 2000] for
    cluster Ci

18
4. Methodology: Tiebreaker
  • Ties may occur in Outlierness or
    Compactness/Isolation values.
  • Highest-average-distance heuristic: choose the
    cluster center with the highest average distance
    from the user-labeled points.
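The heuristic can be sketched in a few lines. Euclidean distance is our assumption here; the intent is that tied queries are spent on regions far from anything the user has already labeled:

```python
# Tiebreaker sketch: among tied cluster centers, pick the one with the
# highest mean distance to the user-labeled points.
import numpy as np

def break_tie(tied_centers, labeled_points):
    """Return the tied center with the highest mean distance to labels."""
    labeled = np.asarray(labeled_points)
    avg = [np.mean(np.linalg.norm(labeled - c, axis=1)) for c in tied_centers]
    return tied_centers[int(np.argmax(avg))]

# Both centers have tied scores; the labels cluster near the origin,
# so the far-away center is chosen for the next query.
centers = [np.array([0.0, 0.0]), np.array([8.0, 8.0])]
labeled = [[0.5, 0.5], [1.0, 0.0]]
chosen = break_tie(centers, labeled)
```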

19
5. Results
Data sets used in experiments
Name        Dims  Records  Classes  Smallest Class (%)  Largest Class (%)
Abalone        7     4177       20                0.34              16
Shuttle        8     4000        7                0.02              64.2
OptDigits     64     1040       10                0.77              50
OptLetters    16     2128       26                0.37              24
Statlog       19      512        7                1.5               50
Yeast          8     1484       10                0.33              31.68
Shuttle, OptDigits, OptLetters, and Statlog were
subsampled to simulate class imbalance.
20
5. Results (Yeast)
Category detection metric: the number of queries
before the user has been presented with at least one
example from every category
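The metric can be computed directly from the sequence of class labels the user returns, one per query:

```python
# Category detection metric sketch: count the queries needed until
# every category has been seen at least once.
def queries_until_all_found(query_labels, n_classes):
    seen = set()
    for i, label in enumerate(query_labels, start=1):
        seen.add(label)
        if len(seen) == n_classes:
            return i
    return None  # not every category was discovered

# With 3 classes, the last class "c" first appears at query 5.
n = queries_until_all_found(["a", "a", "b", "a", "c"], 3)  # n == 5
```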
21
5. Results (Statlog)
22
5. Results (OptLetters)
23
5. Results (OptDigits)
24
5. Results (Shuttle)
25
5. Results (Abalone)
26
5. Results
Number of hints to discover all classes
Dataset     HMS-CI  HMS-CIHAD  HMS-Out  HMS-OutHAD  NNDM  Interleave
Abalone       1195         93      603         385   124         193
Shuttle         44         32       36          28   162          35
OptDigits      100        100      160         118   576         117
OptLetters     133        133      161         182   420         489
Statlog         18         20       34         124   228          54
Yeast           73         91      103          77    88         111
27
5. Results
Area under the category detection curve
Dataset     HMS-CI  HMS-CIHAD  HMS-Out   NNDM  Interleave
Abalone      0.835      0.873    0.837  0.846       0.840
Shuttle      0.925      0.929    0.917  0.480       0.905
OptDigits    0.855      0.855    0.840  0.199       0.808
OptLetters   0.936      0.936    0.917  0.573       0.765
Statlog      0.956      0.958    0.944  0.472       0.934
Yeast        0.821      0.805    0.793  0.838       0.778
28
6. Conclusion / Future Work
  • Conclusions
  • HMS-based methods consistently discover more
    categories in fewer queries than existing methods
  • They need no a priori knowledge of dataset
    properties, e.g. the total number of classes

29
6. Conclusion / Future Work
  • Future Work
  • Better use of user feedback
  • Presentation of an entire cluster to the user
    instead of a representative data point
  • Improved computational efficiency
  • Theoretical analysis