Mining Distance-Based Outliers from Large Databases in Any Metric Space

1
Mining Distance-Based Outliers from Large Databases in Any Metric Space
  • Yufei Tao Chinese University of Hong Kong
  • Xiaokui Xiao Chinese University of Hong Kong
  • Shuigeng Zhou Fudan University

2
Outlier definition
  • Parameters: a set R of objects, a distance r, and an integer k ≥ 1.
  • An object o is an outlier if at most k objects in R (including o)
    have distances ≤ r from o.
  • E.g., k = 3.

3
Outlier definition (cont.)
  • k = 3.
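The definition above can be sketched in a few lines of Python. This is an illustration only, with Euclidean distance in 2D assumed; the definition itself works with any metric.

```python
def is_outlier(o, R, r, k):
    """o is an outlier if at most k objects in R (including o itself)
    lie within distance r of o."""
    # Euclidean distance, for illustration; any metric can be substituted
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return sum(1 for p in R if dist(o, p) <= r) <= k
```

Swapping `dist` for edit distance or road-network shortest-path distance gives exactly the metric-space generality this talk targets.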

4
Applications
  • Fraud detection
    • Catching fraudulent credit-card transactions
    • Euclidean distance (?)
  • Business location planning
    • Outlier restaurants? The vicinities of these restaurants may be
      good locations for new restaurants.
    • Shortest-path distance
  • Fire detection
    • Which sensor has a much higher temperature than its neighboring
      sensors?
    • Composite distance capturing both the proximity of two sensors
      and their temperature difference.

5
Why this definition?
  • Because it is compatible with rare events in
    statistics.
  • Example: rare events in a Gaussian distribution.
  • The definition was first introduced by Knorr and
    Ng, VLDB '98.

6
Goals
  • Support any distance metric satisfying the
    triangle inequality:
  • Road-network shortest-path distance
  • Edit distance
  • Time series similarity
  • Find all outliers with I/O cost linear in n / b.
  • n: the cardinality of R
  • b: the page size

7
Nested loop?
  • Worst-case CPU time: O(n²).
  • Average CPU time: ≈ O(n)
  • Shown by Bay and Schwabacher, SIGKDD '03.
  • Intuition: if an object o is not an outlier, its
    verification may require scanning only a fraction
    of the database.
  • See next.

8
Nested loop (cont.)
  • k = 3
  • # of points in the circle = 21
  • total # of points = 31
  • Hence, to verify the green point is not an
    outlier, we expect to scan
  • 4 / (21 / 31) ≈ 5.9 points
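The 5.9 follows from a standard expectation argument: each scanned point falls inside the circle with probability 21/31, and verification stops after finding k + 1 = 4 neighbors within r. A quick check of the arithmetic:

```python
k = 3
in_circle, total = 21, 31        # values from the slide
hit_prob = in_circle / total
# expected number of points scanned before k + 1 neighbors within r are found
expected_scans = (k + 1) / hit_prob
print(round(expected_scans, 1))  # 5.9
```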

9
Block nested loop?
  • m: the number of memory pages
  • Hold m − 1 pages of objects in memory, and use
    the remaining page to scan the database.
  • During the scan: for every object o kept in
    memory, count the number of points with distance
    ≤ r from o.
  • Stop the scan once the counters of all objects
    are above k.
  • I/O cost: O((n / b)²)
  • Why is it so different from the CPU cost?

10
Block nested loop (cont.)
  • I/O cost: O((n / b)²)
  • Why is it so different from the CPU cost?
  • Say p is the percentage of non-outliers.
  • Example: p = 99.95%
  • Probability of not scanning the whole dataset in
    a loop:
  • p^((m − 1) · b)
  • Practical values: m = 101, b = 100
  • p^((m − 1) · b) < 1%
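With the slide's numbers the probability p^((m − 1)·b) is indeed tiny. A one-line check, assuming p = 99.95% and reading the slide's final bullet as "less than 1%":

```python
p = 0.9995           # fraction of non-outliers
m, b = 101, 100      # memory pages and page size from the slide
# chance that all (m - 1) * b memory-resident objects terminate their
# counts early, so this loop avoids scanning the whole database
prob = p ** ((m - 1) * b)
print(prob)          # roughly 0.0067, i.e. under 1%
```

So almost every loop scans the full dataset, which is why the I/O cost stays quadratic even though the average CPU cost is near-linear.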

11
Our solution
  • Done in 2 scans.
  • Assumption: objects are stored in a random order.
  • Otherwise, a randomization process is needed.
  • Just like nested loop.
  • But in this talk, due to time limits, we will
    elaborate only a basic solution with 3 scans.
  • See paper for the 2-scan approach.

12
Data Summary
  • Randomly take s samples, called centroids.
  • Example: s = 4.
  • Perform object partitioning:
  • For each object o, select those centroids within
    distance r / 2 from o.
  • Among these centroids, assign o to the closest
    one.

13
Data Summary (cont.)
  • Collect statistics:
  • Associate every centroid with a counter, equal to
    the number of objects with distances ≤ r / 2 from it.
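The partitioning and statistics steps can be sketched together as follows. This is a hedged reading of the two slides: each centroid's counter tallies every object within r / 2 of it, and each object is assigned to the closest such centroid; the function and variable names are mine, not the paper's.

```python
def summarize(objects, centroids, r, dist):
    """Data-summary pass: count, for every centroid, the objects within
    r/2 of it, and assign each object to the closest centroid within
    r/2 (objects with no such centroid stay unassigned)."""
    assign = {}                        # object index -> centroid index
    counter = [0] * len(centroids)
    for i, o in enumerate(objects):
        near = [(dist(o, c), j) for j, c in enumerate(centroids)
                if dist(o, c) <= r / 2]
        for _, j in near:
            counter[j] += 1            # o is within r/2 of centroid j
        if near:
            assign[i] = min(near)[1]   # closest qualifying centroid
    return assign, counter
```

Because `dist` is a parameter, the same sketch works for any metric satisfying the triangle inequality.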

14
Pruning
  • Assume k = 3.
  • No object in this partition can be an outlier.
  • Remember that the circle has a radius of r / 2, so by
    the triangle inequality any two objects in it are
    within distance r of each other.
  • In general, all points assigned to a centroid
    with a counter of at least k are non-outliers.
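The pruning rule then reduces to a filter over the summary. A sketch, where `assign` and `counter` are the hypothetical outputs of the partitioning step above; pruning only when the counter exceeds k is the conservative reading of the definition, since a non-outlier needs more than k objects (itself included) within r.

```python
def candidates(n, assign, counter, k):
    """Objects whose centroid's counter exceeds k are guaranteed
    non-outliers (all members of one partition lie within r of each
    other).  Everything else, including objects with no centroid
    within r/2, must be re-checked by another scan."""
    return [i for i in range(n)
            if i not in assign or counter[assign[i]] <= k]
```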

15
Pruning (cont.)
  • From the triangle inequality, applied to two objects
    in the same partition of radius r / 2 (figure),

16
Pruning (cont.)
  • we get that their distance is at most
    r / 2 + r / 2 = r (figure).

17
Another scan
  • Keep these points in memory, and perform another
    scan to verify if they are outliers.

18
Memory requirement for achieving 3 scans?
  • How large is it in practice?
  • Answer: about 1% of the dataset!
  • See next.

19
3-scan memory requirement
  • CA: a spatial dataset released by the TIGER
    project
  • 62k two-dimensional points representing addresses
    in California.
  • Household: released by the US Census Bureau
  • 1 million three-dimensional points,
  • each of which represents the annual expenditure
    of an American family on electricity, gas, and
    water, respectively.
  • Server: KDD Cup 1999 data containing the
    statistics of 500k network connections.

20
3-scan memory requirement (cont.)
(Charts: memory requirement when 0.1% of the dataset are outliers
vs. when there is exactly 1 outlier.)
21
3-scan memory requirement (cont.)
22
3-scan memory requirement (cont.)
23
3-scan memory requirement (cont.)
  • Observations
  • The memory requirement decreases exponentially as
    r increases.
  • For most meaningful r, the memory just needs to
    hold 1% of the dataset!
  • In practice, the memory may be much larger than
    1% of the dataset.
  • So we can use the additional memory to improve
    efficiency.
  • This is the motivation of our 2-scan approach.

24
More in the paper
  • SNIF: a technique for extracting outliers in two
    scans.
  • Detailed theoretical analysis.
  • Analytical comparison of the proposed algorithm
    against the state-of-the-art CELL.
  • CELL is applicable to Euclidean data only.

25
Experiment 1 (I/O vs r)
  • k = 0.05% of the cardinality

26
Experiment 1 (I/O vs r)
  • k = 0.05% of the cardinality

27
Experiment 1 (I/O vs r)
  • k = 0.05% of the cardinality

28
Experiment 2 (I/O vs k)
  • r = median of the interesting range

29
Experiment 2 (I/O vs k)
  • r = median of the interesting range

30
Experiment 2 (I/O vs k)
  • r = median of the interesting range

31
Experiment 3 (I/O vs memory)
  • r = median of the interesting range, k = 0.05% of
    the cardinality.

32
Experiment 3 (I/O vs memory)
  • r = median of the interesting range, k = 0.05% of
    the cardinality.

33
Experiment 3 (I/O vs memory)
  • r = median of the interesting range, k = 0.05% of
    the cardinality.

34
Summary
  • An algorithm for finding distance-based outliers
    by scanning the database twice.
  • Applicable to any distance metric.