Rare Category Detection Using Hierarchical Mean Shift

Slides: 24

Provided by: velblodVid

Transcript and Presenter's Notes

1
(Rare) Category Detection Using Hierarchical Mean
Shift

Pavan Vatturi (vatturi_at_eecs.oregonstate.edu)
Weng-Keen Wong (wong_at_eecs.oregonstate.edu)

2
1. Introduction

Applications for surveillance, scientific discovery and data cleaning require anomaly detection
Anomalies are often identified as statistically unusual data points
Many detected anomalies are simply uninteresting or correspond to known sources of noise

3
1. Introduction
Known objects (99.9% of the data)
Anomalies (0.1% of the data)
Uninteresting (99% of anomalies)
Interesting (1% of anomalies)
Pictures from the Sloan Digital Sky Survey (http://www.sdss.org/iotw/archive.html). Pelleg, D. (2004). Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD Thesis, Carnegie Mellon University.
4
1. Introduction

Category Detection [Pelleg and Moore 2004]: human-in-the-loop exploratory data analysis

Diagram of the loop: Data Set, Build Model, Spot Interesting Data Points, Ask User to Label Categories of Interesting Data Points, Update Model with Labels, then repeat
5
1. Introduction

The user can:
Label a query data point under an existing category
Or declare the data point to belong to a previously undeclared category

(Same human-in-the-loop diagram as on the previous slide)
6
1. Introduction

Goal: present to the user a single instance from each category in as few queries as possible
Rare categories are difficult to detect if class imbalance is severe
We are interested in rare categories for anomaly detection

7
Outline

Introduction
Related Work
Background
Methodology
Results
Conclusion / Future Work

8
2. Related Work

Interleave [Pelleg and Moore 2004]
Nearest-neighbor-based active learning for rare-category detection for multiple classes [He and Carbonell 2008]
Multiple output identification [Fine and Mansour 2006]

9
3. Background: Mean Shift [Fukunaga and Hostetler 1975]
Figure labels: reference data set, query point, center of mass, mean shift vector (follows the density gradient)
Mean shift vector with kernel k
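The slide's own equation for the mean shift vector did not survive in this transcript. As a hedged reconstruction, a commonly used textbook form is given below. The symbols are assumptions rather than the authors' notation: x is the query point, x_1, ..., x_n are the reference data points, h is the bandwidth, and k is the kernel profile. The first term is the kernel-weighted center of mass, so the vector points from the query point toward that center.

```latex
m_h(x) \;=\;
\frac{\sum_{i=1}^{n} x_i \, k\!\left( \left\lVert \frac{x - x_i}{h} \right\rVert^{2} \right)}
     {\sum_{i=1}^{n} k\!\left( \left\lVert \frac{x - x_i}{h} \right\rVert^{2} \right)}
\;-\; x
```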
10
3. Background: Mean Shift [Fukunaga and Hostetler 1975]
Figure labels: reference data set, query point, center of mass
Repeated shifts make the query point converge to the cluster center
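The convergence shown on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the Gaussian kernel, the tolerance, the iteration cap, and the names `center_of_mass` and `mean_shift` are all assumptions made for the example.

```python
import math

def center_of_mass(query, reference, bandwidth):
    """Kernel-weighted center of mass of `reference` as seen from `query`."""
    total = 0.0
    center = [0.0] * len(query)
    for point in reference:
        sq_dist = sum((q - p) ** 2 for q, p in zip(query, point))
        weight = math.exp(-sq_dist / (2.0 * bandwidth ** 2))  # Gaussian kernel (assumed)
        total += weight
        center = [c + weight * p for c, p in zip(center, point)]
    if total == 0.0:
        return list(query)  # every weight underflowed to zero; stay put
    return [c / total for c in center]

def mean_shift(query, reference, bandwidth, tol=1e-6, max_iter=500):
    """Move the query point to its center of mass until the shift is below `tol`."""
    current = list(query)
    for _ in range(max_iter):
        shifted = center_of_mass(current, reference, bandwidth)
        step = math.dist(shifted, current)  # length of the mean shift vector
        current = shifted
        if step < tol:
            break
    return current

# Toy reference data set: one group near the origin, one group near (5, 5).
reference = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
mode = mean_shift((0.5, 0.5), reference, bandwidth=0.5)
```

Starting from (0.5, 0.5), the query point settles near the group around the origin, because the far group receives negligible kernel weight at this bandwidth.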
11
3. Background: Mean Shift Blurring
Figure labels: reference data set, query point, center of mass

Blurring:
Occurs when the query points are the same as the reference data set
Each iteration progressively blurs the original data set
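A minimal sketch of blurring, assuming a Gaussian kernel: every point acts as a query against the current data set, and the shifted points then replace the data set before the next pass. The function name `blur_once`, the toy data, and the fixed number of passes are illustrative choices, not taken from the presentation.

```python
import math

def blur_once(points, bandwidth):
    """One blurring pass: each point moves to its kernel-weighted center of mass."""
    blurred = []
    for query in points:
        total = 0.0
        center = [0.0] * len(query)
        for point in points:  # the reference set is the data set itself
            sq_dist = sum((q - p) ** 2 for q, p in zip(query, point))
            weight = math.exp(-sq_dist / (2.0 * bandwidth ** 2))  # Gaussian kernel (assumed)
            total += weight
            center = [c + weight * p for c, p in zip(center, point)]
        blurred.append(tuple(c / total for c in center))  # total >= 1 (self weight)
    return blurred

data = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (5.0, 5.0), (5.3, 5.0)]
for _ in range(20):
    data = blur_once(data, bandwidth=0.5)  # the blurred set becomes the new data set
```

After a few passes, the three points near the origin collapse onto a single location and the two points near (5, 5) collapse onto another, while the two groups stay far apart at this bandwidth.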

12
3. Background: Mean Shift
End result of applying mean shift to a synthetic data set
13
4. Methodology: Overview

Sphere the data
Hierarchical Mean Shift
Query user

14
4. Methodology: Hierarchical Mean Shift
Repeatedly blur the data using Mean Shift with increasing bandwidth: h_new = k * h_old
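The bandwidth schedule can be illustrated with a short sketch. It is written under several assumptions that are not in the slides: a Gaussian kernel, a movement tolerance to decide when blurring has stabilised, a distance threshold `merge_tol` to decide which points have collapsed into one cluster, and the helper names themselves. Each level blurs the output of the previous level, then multiplies the bandwidth by k.

```python
import math

def blur_once(points, h):
    """One blurring mean shift pass with a Gaussian kernel (assumed kernel)."""
    blurred = []
    for query in points:
        total, center = 0.0, [0.0] * len(query)
        for point in points:
            weight = math.exp(-math.dist(query, point) ** 2 / (2.0 * h ** 2))
            total += weight
            center = [c + weight * p for c, p in zip(center, point)]
        blurred.append(tuple(c / total for c in center))
    return blurred

def blur_until_stable(points, h, tol=1e-6, max_iter=200):
    """Repeat blurring until no point moves more than `tol` in one pass."""
    for _ in range(max_iter):
        new_points = blur_once(points, h)
        moved = max(math.dist(a, b) for a, b in zip(points, new_points))
        points = new_points
        if moved < tol:
            break
    return points

def cluster_labels(points, merge_tol=1e-3):
    """Give the same label to points that collapsed onto (almost) the same spot."""
    centers, labels = [], []
    for p in points:
        for idx, c in enumerate(centers):
            if math.dist(p, c) < merge_tol:
                labels.append(idx)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels

def hierarchical_mean_shift(data, h0, k, max_levels=50):
    """Return one (bandwidth, labels) pair per level of the hierarchy."""
    points, h, levels = [tuple(p) for p in data], h0, []
    for _ in range(max_levels):
        points = blur_until_stable(points, h)
        labels = cluster_labels(points)
        levels.append((h, labels))
        if len(set(labels)) == 1:  # everything has merged into one cluster
            break
        h *= k  # h_new = k * h_old
    return levels

data = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (5.0, 5.0), (5.2, 5.0), (20.0, 20.0)]
levels = hierarchical_mean_shift(data, h0=0.2, k=1.5)
```

On this toy data the first level separates the two small groups and the isolated point, and the final level contains a single cluster. The bandwidths at which clusters appear and disappear are what the ranking criteria on the following slides rely on.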
15
4. Methodology: Querying the User

The data point closest to the cluster center is the representative data point.
Rank representative data points for querying the user according to:
Outlierness [Leung et al. 2000] for cluster Ci

Lifetime of Ci = log(bandwidth when cluster Ci is merged with other clusters) - log(bandwidth when cluster Ci is formed)
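The two quantities on this slide that can be read from the transcript, the representative data point and the cluster lifetime, can be sketched as follows. The lifetime is taken here as a difference of log bandwidths. That reading is an assumption, because the operator between the two bandwidths did not survive in the transcript. The outlierness formula itself is not preserved either, so it is not reproduced. Function names and sample values are hypothetical.

```python
import math

def representative(cluster_points, center):
    """Return the cluster member closest to the cluster center."""
    return min(cluster_points, key=lambda p: math.dist(p, center))

def lifetime(h_formed, h_merged):
    """Log-scale lifetime of a cluster (assumed reading: log h_merged - log h_formed)."""
    return math.log(h_merged) - math.log(h_formed)

cluster = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1)]
rep = representative(cluster, center=(0.45, 0.05))

# A cluster formed at bandwidth 0.2 and merged at bandwidth 1.6 (made-up values).
life = lifetime(h_formed=0.2, h_merged=1.6)
```

Under the schedule h_new = k * h_old, a lifetime computed this way is a whole number of bandwidth steps times log k, so clusters that survive many bandwidth increases get large lifetimes.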
16
4. Methodology: Querying the User

Rank representative data points for querying the user according to:
Compactness/Isolation [Leung et al. 2000] for cluster Ci

17
4. Methodology: Tiebreaker

Ties may occur in Outlierness or Compactness/Isolation values.
Highest Average Distance heuristic: choose the representative data point with the highest average distance from user-labeled points.
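A minimal sketch of the Highest Average Distance heuristic, assuming Euclidean distance and a list of already labeled points. The function name `break_tie` and the fallback for the case where nothing has been labeled yet are illustrative assumptions.

```python
import math

def break_tie(candidates, labeled_points):
    """Among tied representatives, pick the one farthest on average from labeled points."""
    if not labeled_points:
        return candidates[0]  # assumption: with no labels yet, keep the first candidate

    def average_distance(candidate):
        total = sum(math.dist(candidate, p) for p in labeled_points)
        return total / len(labeled_points)

    return max(candidates, key=average_distance)

labeled = [(0.0, 0.0), (1.0, 0.0)]              # points the user has already labeled
tied = [(0.5, 0.5), (4.0, 4.0), (2.0, 0.0)]     # representatives with equal rank
choice = break_tie(tied, labeled)
```

Here the candidate at (4.0, 4.0) is chosen, since it lies farthest on average from the two labeled points. The heuristic therefore steers queries toward regions the user has not yet seen.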

18
5. Results
Data sets used in experiments
Shuttle, OptDigits, OptLetters, and Statlog were
subsampled to simulate class imbalance.
19
5. Results (Yeast)
Category detection metric: number of queries before the user has been presented with at least one example from every category
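The metric can be computed from the sequence of labels the user gives to successive queries. The sketch below is an illustration under assumed names (`queries_to_discover_all`, `discovery_curve`) and made-up labels. The second helper tracks how many distinct categories have been seen after each query, which is one plausible way to draw a category detection curve.

```python
def queries_to_discover_all(query_labels, all_categories):
    """Number of queries until every category has appeared once, or None if never."""
    remaining = set(all_categories)
    for count, label in enumerate(query_labels, start=1):
        remaining.discard(label)
        if not remaining:
            return count
    return None

def discovery_curve(query_labels):
    """Number of distinct categories seen after each query."""
    seen, curve = set(), []
    for label in query_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve

labels = ["A", "A", "B", "A", "C", "B"]   # made-up user answers, in query order
needed = queries_to_discover_all(labels, {"A", "B", "C"})
curve = discovery_curve(labels)
```

With these labels the third category first appears at the fifth query, so `needed` is 5 and the curve rises to 3 at that point.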
20
5. Results
Number of hints to discover all classes
21
5. Results
Area under the category detection curve
22
6. Conclusion / Future Work

Conclusions
HMS-based methods consistently discover more categories in fewer queries than existing methods
They do not need a priori knowledge of data set properties

23
6. Conclusion / Future Work

Future Work
Better use of user feedback
Presentation of an entire cluster to the user
instead of a representative data point
Improved computational efficiency
Theoretical analysis
