Rare Category Detection - PowerPoint PPT Presentation

1
Rare Category Detection
  • Jingrui He
  • Machine Learning Department
  • Carnegie Mellon University
  • Joint work with Jaime Carbonell

2
What's Rare Category Detection?
  • Start de novo
  • Very skewed classes
    • Majority classes
    • Minority classes
  • Labeling oracle
  • Goal
    • Discover minority classes with a few label
      requests

3
Comparison with Outlier Detection
  • Rare classes
    • A group of points
    • Clustered
    • Non-separable from the majority classes
  • Outliers
    • A single point
    • Scattered
    • Separable

4
Comparison with Active Learning
  • Rare category detection
    • Initial condition: NO labeled examples
    • Goal: discover the minority classes with the
      least label requests
  • Active learning
    • Initial condition: labeled examples from each
      class
    • Goal: improve the performance of the current
      classifier with the least label requests

5
Applications
  • Network intrusion detection
  • Fraud detection
  • Astronomy
  • Spam image detection

6
The Big Picture
[Diagram; labels: Raw Data, Spatial, Relational, Temporal, Feature Extraction, Unbalanced Unlabeled Data Set, Rare Category Detection, Learning in Unbalanced Settings, Classifier]

7
Outline
  • Problem definition
  • Related work
  • Rare category detection for spatial data
  • Prior-dependent rare category detection
  • Prior-free rare category detection
  • Conclusion

8
Related Work
  • Pelleg & Moore 2004
    • Mixture model
    • Different selection criteria
  • Fine & Mansour 2006
    • Generic consistency algorithm
    • Upper bounds and lower bounds
  • Papadimitriou et al. 2003
    • LOCI algorithm for groups of outliers

[Callout: Separable or Near-separable]

9
Outline
  • Problem definition
  • Related work
  • Rare category detection for spatial data
  • Prior-dependent rare category detection
  • Prior-free rare category detection
  • Conclusion

10
Notations
  • Unlabeled examples
  • m classes
    • m-1 rare classes
    • One majority class
  • Goal: find at least ONE example from each rare
    class by requesting a few labels
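
To make the setting concrete, here is a minimal sketch of the random sampling baseline that the results slides compare against: query labels in random order until every rare class has been seen, and count the label requests. The function name, the toy class sizes, and the way the oracle is simulated from a hidden label list are all assumptions made for illustration.

```python
import random

def random_sampling_detection(oracle, n_examples, rare_classes, seed=0):
    """Count label requests until every rare class has been seen once.

    `oracle(i)` returns the label of example i; each call is one request.
    """
    rng = random.Random(seed)
    order = list(range(n_examples))
    rng.shuffle(order)                   # random query order, no repeats
    remaining = set(rare_classes)
    for requests, i in enumerate(order, start=1):
        remaining.discard(oracle(i))     # one label request
        if not remaining:
            return requests
    return n_examples

# Hypothetical toy data: 980 majority examples (class 1),
# 15 examples of rare class 2 and 5 examples of rare class 3.
labels = [1] * 980 + [2] * 15 + [3] * 5
n_requests = random_sampling_detection(lambda i: labels[i], len(labels), {2, 3})
print(n_requests)
```

Any detection method for this problem can be judged against this count: the fewer requests it needs to reach every rare class, the better.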

11
Assumptions
  • The distribution of the majority class is
    sufficiently smooth
  • Examples from the minority classes form compact
    clusters in the feature space

12
Overview of the Algorithms
  • Nearest-neighbor-based methods
    • Methodology: local density differential sampling
    • Intuition: select examples according to the
      change in local density

13
Two Classes: NNDB
1. Calculate class-specific radius
2. Calculate nearest neighbors
3. Calculate the scores
4. Query
5. Rare class? If no, increase t by 1 and return to step 3
6. If yes, output

14
NNDB: Calculate Class-Specific Radius
  • Number of examples from the minority class: K
  • For each example, calculate the distance between
    it and its K-th nearest neighbor
  • The class-specific radius: the minimum of these
    distances
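
The radius step can be sketched as below. The formulas on this slide are not preserved in the transcript, so two things are assumptions here, based on my reading of He & Carbonell 2007: K is taken as n times the minority-class prior, and the radius is the smallest K-th-nearest-neighbor distance over all examples. Counting each point as its own first neighbor is a further convention chosen for this sketch.

```python
import math

def class_specific_radius(points, prior):
    """Smallest distance from any example to its K-th nearest neighbor.

    Assumptions: K = round(n * prior), and every point counts as its own
    first neighbor, so a tight cluster of K points yields a small radius.
    """
    n = len(points)
    k = min(n, max(2, round(n * prior)))   # at least 2 so the radius is non-zero
    kth_distances = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)  # dists[0] == 0.0 (self)
        kth_distances.append(dists[k - 1])
    return min(kth_distances)

# Hypothetical toy data: a sparse 5x5 grid (spacing 5) plus 3 clustered points.
grid = [(5.0 * x, 5.0 * y) for x in range(5) for y in range(5)]
cluster = [(12.0, 12.0), (13.0, 12.0), (12.0, 13.0)]
points = grid + cluster
r = class_specific_radius(points, prior=len(cluster) / len(points))
print(r)
```

On this toy set the radius is set by the tight cluster rather than by the sparse grid, which is the point of making it class-specific.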

15
NNDB: Calculate Nearest Neighbors

16
NNDB: Calculate the Scores
[Figure; label: Query]

17
NNDB: Pick the Next Candidate
[Figure; labels: Increase t by 1, Query]

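
The NNDB steps walked through on the preceding slides can be sketched end to end as follows. This is an illustrative reading, not the authors' implementation: the score used here (the largest drop in neighbor count between an example and any example within t times the radius) and the radius rule follow my understanding of He & Carbonell 2007, and the function name, toy data, and oracle are hypothetical.

```python
import math

def nndb(points, prior, oracle, rare_label):
    """Two-class NNDB-style search. Returns (index_found, n_label_requests)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 1: class-specific radius (assumed: min K-th neighbor distance,
    # with each point counted as its own first neighbor).
    k = min(n, max(2, round(n * prior)))
    radius = min(sorted(row)[k - 1] for row in dist)
    # Step 2: number of neighbors of each example within the radius.
    counts = [sum(d <= radius for d in row) for row in dist]
    queried = set()
    for t in range(1, n + 1):
        # Step 3: score = largest drop in neighbor count within t * radius.
        best, best_score = None, -math.inf
        for i in range(n):
            if i in queried:
                continue
            score = max(counts[i] - counts[j]
                        for j in range(n) if dist[i][j] <= t * radius)
            if score > best_score:
                best, best_score = i, score
        # Steps 4-5: query the top-scoring example; stop on a rare-class hit,
        # otherwise the loop increases t by 1 and rescores.
        queried.add(best)
        if oracle(best) == rare_label:
            return best, len(queried)
    return None, len(queried)

# Hypothetical 1-D toy data: majority points every 10 units, 4 rare points
# packed between two of them.
points = [(10.0 * i,) for i in range(20)] + [(94.0,), (95.0,), (96.0,), (97.0,)]
labels = [1] * 20 + [2] * 4
idx, n_queries = nndb(points, prior=4 / 24, oracle=lambda i: labels[i], rare_label=2)
print(idx, n_queries)
```

The growing factor t widens the neighborhood over which density changes are compared, so examples near a sharp density change are eventually ranked ahead of examples in smooth regions.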
18
Why NNDB Works
  • Theoretically
    • Theorem 1 (He & Carbonell 2007): under certain
      conditions, with high probability, after a few
      iteration steps, NNDB queries at least one
      example whose probability of coming from the
      minority class is at least 1/3
  • Intuitively
    • The score measures the change in local density

19
Multiple Classes: ALICE
  • m-1 rare classes
  • One majority class

1. For each rare class c:
2. Have we found examples from class c? If yes, move on to the next class
3. If no, run NNDB with the prior of class c
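
The outer loop above can be sketched as follows, with an NNDB-style inner search inlined so the block is self-contained. The inner scoring rule and radius rule are assumptions based on my reading of He & Carbonell 2007 and 2008, and the function name, the `priors` dictionary, and the toy data are hypothetical.

```python
import math

def alice(points, priors, oracle):
    """priors maps each rare-class label to its prior.
    Returns {rare label: index of the first example found}."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    found, queried = {}, set()
    for c, prior in priors.items():        # 1. for each rare class c
        if c in found:                     # 2. already found class c: skip
            continue
        # 3. run an NNDB-style search with the prior of class c
        k = min(n, max(2, round(n * prior)))
        radius = min(sorted(row)[k - 1] for row in dist)
        counts = [sum(d <= radius for d in row) for row in dist]
        for t in range(1, n + 1):
            candidates = [i for i in range(n) if i not in queried]
            if not candidates:
                break

            def score(i):
                # Largest drop in neighbor count within t * radius of example i.
                return max(counts[i] - counts[j]
                           for j in range(n) if dist[i][j] <= t * radius)

            best = max(candidates, key=score)
            queried.add(best)
            label = oracle(best)           # one label request
            if label in priors:
                found.setdefault(label, best)
            if label == c:
                break
    return found

# Hypothetical 1-D toy data: one majority class and two rare clusters.
points = ([(10.0 * i,) for i in range(30)]
          + [(94.0,), (95.0,), (96.0,), (97.0,)]
          + [(204.0,), (205.0,)])
labels = [1] * 30 + [2] * 4 + [3] * 2
found = alice(points, {2: 4 / 36, 3: 2 / 36}, lambda i: labels[i])
print(found)
```

Note that nothing in this sketch stops the search for one rare class from querying further examples of a class that has already been found.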
20
Why ALICE Works
  • Theoretically
    • Theorem 2 (He & Carbonell 2008): under certain
      conditions, with high probability, in each outer
      loop of ALICE, after a few iteration steps in
      NNDB, ALICE queries at least one example whose
      probability of coming from one minority class is
      at least 1/3

21
Implementation Issues
  • ALICE
    • Problem: repeatedly sampling from the same rare
      class
  • MALICE
    • Solution: relevance feedback

[Callout: Class-specific radius]

22
Results on Synthetic Data Sets
23
Summary of Real Data Sets
  • Abalone
    • 4177 examples
    • 7-dimensional features
    • 20 classes
    • Largest class: 16.50%
    • Smallest class: 0.34%
  • Shuttle
    • 4515 examples
    • 9-dimensional features
    • 7 classes
    • Largest class: 75.53%
    • Smallest class: 0.13%
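
A quick worked calculation shows why these data sets are hard for random sampling. Reading the class sizes above as percentages (an assumption, since the transcript drops the unit), the smallest class size can be estimated and plugged into the standard result that, when sampling without replacement from n items of which k are targets, the expected position of the first target is (n + 1) / (k + 1). The function name is hypothetical and the class sizes are inferred from rounded percentages.

```python
def expected_random_requests(n_examples, class_fraction):
    """Estimated class size and expected number of random label requests
    (without replacement) before the first example of that class appears."""
    k = max(1, round(n_examples * class_fraction))  # inferred from rounded %
    return k, (n_examples + 1) / (k + 1)

for name, n, frac in [("Abalone", 4177, 0.0034), ("Shuttle", 4515, 0.0013)]:
    k, expected = expected_random_requests(n, frac)
    print(f"{name}: about {k} examples, about {expected:.0f} requests expected")
```

Under these assumptions the smallest Shuttle class has about 6 examples, and random sampling would need roughly 645 label requests on average just to see one of them.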

24
Results on Real Data Sets
[Figure: two panels, Abalone and Shuttle; curves in each panel: MALICE, Interleave, Random sampling]

25
Imprecise Priors
[Figure: two panels, Abalone and Shuttle]

26
Outline
  • Problem definition
  • Related work
  • Rare category detection for spatial data
  • Prior-dependent rare category detection
  • Prior-free rare category detection
  • Conclusion

27
Overview of the Algorithm
  • Density-based method
    • Methodology: specially designed exponential
      families
    • Intuition: select examples according to the
      change in local density
    • Difference from NNDB (ALICE): NO prior
      information needed

28
Specially Designed Exponential Families (Efron &
Tibshirani 1996)
  • Favorable compromise between parametric and
    nonparametric density estimation
  • Estimated density: $g(x) = \exp(\beta_0 + \beta_1^T t(x))\, g_0(x)$
    • $g_0(x)$: carrier density
    • $\beta_1$: parameter vector
    • $\beta_0$: normalizing parameter
    • $t(x)$: vector of sufficient statistics
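
The four ingredients named on this slide can be put together in a small 1-D sketch. It assumes the usual form of a specially designed exponential family, g(x) = exp(beta0 + beta1 * t(x)) * g0(x), with a Gaussian kernel density estimator as the carrier and beta0 chosen numerically so that g integrates to 1. The choice t(x) = x^2, the value of beta1, the bandwidth, and the data are illustrative assumptions, and beta1 is fixed by hand rather than estimated.

```python
import math

def gaussian_kde(data, bandwidth):
    """Carrier density g0: a plain 1-D Gaussian kernel density estimator."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def g0(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return g0

def exponential_family_density(g0, beta1, t, grid):
    """Return (g, beta0) with g(x) = exp(beta0 + beta1 * t(x)) * g0(x).

    beta0 is the normalizing parameter, set so that g integrates to 1
    over `grid` under the trapezoid rule.
    """
    vals = [math.exp(beta1 * t(x)) * g0(x) for x in grid]
    integral = sum(0.5 * (vals[i] + vals[i + 1]) * (grid[i + 1] - grid[i])
                   for i in range(len(grid) - 1))
    beta0 = -math.log(integral)
    return (lambda x: math.exp(beta0 + beta1 * t(x)) * g0(x)), beta0

def t(x):
    """Illustrative sufficient statistic (an assumption): t(x) = x^2."""
    return x * x

data = [-1.0, -0.5, 0.0, 0.4, 1.2]            # hypothetical 1-D sample
grid = [-8.0 + 0.01 * i for i in range(1601)]
g0 = gaussian_kde(data, bandwidth=0.5)
g, beta0 = exponential_family_density(g0, beta1=-0.1, t=t, grid=grid)
print(beta0, g(0.0), g0(0.0))
```

With beta1 = 0 the estimate reduces to the carrier density itself, which is one way to see the "compromise between parametric and nonparametric" described above: the exponential factor reshapes a nonparametric baseline with a small number of parameters.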
29
SEDER Algorithm
  • Carrier density: kernel density estimator
  • To decouple the estimation of different
    parameters
  • Decompose
  • Relax the constraint such that

30
Parameter Estimation
  • Theorem 3 (to appear): the maximum likelihood
    estimates of the parameters satisfy the
    following conditions
  • where

31
Parameter Estimation (cont.)
  • Let
  • where

[Callouts: positive parameter; in most cases]

32
Scoring Function
  • The estimated density
  • Scoring function: norm of the gradient
  • where
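
A gradient-norm score can be illustrated with a stand-in density. The sketch below uses a plain Gaussian kernel density estimate, not the fitted exponential-family density from SEDER, because the fitted parameters are not available in this transcript; the bandwidth, function name, and toy data are assumptions. The gradient of the kernel estimate is computed in closed form and examples are ranked by its norm.

```python
import math

def kde_gradient_norm(x, data, bandwidth):
    """Norm of the gradient of a Gaussian KDE at point x (any dimension)."""
    d = len(x)
    norm = 1.0 / (len(data) * (bandwidth * math.sqrt(2.0 * math.pi)) ** d)
    grad = [0.0] * d
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2.0 * bandwidth ** 2))
        for a in range(d):
            # d/dx_a of the kernel centered at xi
            grad[a] += norm * w * (xi[a] - x[a]) / bandwidth ** 2
    return math.hypot(*grad)

# Hypothetical 2-D toy data: a sparse grid plus a small tight cluster.
data = [(float(x), float(y)) for x in range(5) for y in range(5)]
data += [(2.5, 2.5), (2.6, 2.5), (2.5, 2.6)]
scores = [kde_gradient_norm(p, data, bandwidth=0.3) for p in data]
ranking = sorted(range(len(data)), key=lambda i: -scores[i])
print(ranking[:5])   # indices of the examples that would be queried first
```

Examples where the estimated density changes fastest get the highest scores, matching the intuition of selecting examples according to the change in local density.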

33
Results on Synthetic Data Sets
34
Summary of Real Data Sets
[Table not preserved; groups: Moderately Skewed, Extremely Skewed]

35
Moderately Skewed Data Sets
[Figure: two panels, Ecoli and Glass; MALICE labeled in each panel]

36
Extremely Skewed Data Sets
[Figure: three panels, Page Blocks, Abalone, and Shuttle; MALICE labeled in each panel]

37
Conclusion
  • Rare category detection
    • Open challenge
    • Lack of effective methods
  • Nearest-neighbor-based methods
    • Prior-dependent
    • Local density differential sampling
  • Density-based method
    • Prior-free
    • Specially designed exponential families

38
Thank You!