Mining Distance based Outliers from Large Databases in Any Metric Space - PowerPoint PPT Presentation

1
Mining Distance based Outliers from Large
Databases in Any Metric Space

Yufei Tao Chinese University of Hong Kong
Xiaokui Xiao Chinese University of Hong Kong
Shuigeng Zhou Fudan University

2
Outlier definition

Parameters: a set R of objects, a distance r, and an integer k ≥ 1.
An object o is an outlier if at most k objects
in R (including o) have distances ≤ r from o.
E.g., k = 3.
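A minimal brute-force sketch of this definition, reading "distances r" as "distance at most r". The function name `is_outlier`, the sample points, and the use of Euclidean distance (`math.dist`) are assumptions made for illustration; any metric can be passed as `distance`.

```python
from math import dist

def is_outlier(o, R, r, k, distance=dist):
    """Return True if at most k objects of R (including o) lie within distance r of o."""
    within = sum(1 for x in R if distance(o, x) <= r)
    return within <= k

# Hypothetical data: four clustered points and one isolated point.
R = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(is_outlier((5.0, 5.0), R, r=1.0, k=3))  # True: only the point itself is within r
print(is_outlier((0.0, 0.0), R, r=1.0, k=3))  # False: 4 objects are within r
```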

3
Outlier definition (cont.)

k = 3.

4
Applications

Fraud detection
Catching fraudulent credit card transactions
Euclidean distance (?)
Business location planning
Outlier restaurants? The vicinities of these
restaurants may be good locations for new
restaurants.
Shortest path distance
Fire detection
Which sensor has a much higher temperature than
its neighboring sensors?
Composite distance capturing both the proximity
of two sensors and their temperature difference.

5
Why this definition?

Because it is compatible with rare events in
statistics.
Example: rare events in a Gaussian distribution.
The definition was first introduced by Knorr and
Ng, VLDB 98.

6
Goals

Support any distance metric satisfying the
triangle inequality.
Road network shortest-path distance
Edit distance
Time series similarity
Find all outliers with I/O cost linear in n / b.
n: the cardinality of R
b: the page size

7
Nested loop?

Worst-case CPU time: O(n^2).
Average CPU time ≈ O(n)
Shown by Bay and Schwabacher, SIGKDD 03.
Intuition: if an object o is not an outlier, its
verification may require scanning only a fraction
of the database.
See next.

8
Nested loop (cont.)

k = 3
# of points in the circle = 21
total # of points = 31
Hence, to verify that the green point is not an
outlier, we need to scan on average
4 / (21 / 31) ≈ 5.9 points
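The arithmetic on this slide can be reproduced with a short sketch. It assumes objects are read in random order, so each object read falls inside the circle with probability 21 / 31, and that k + 1 = 4 hits are needed to rule out an outlier. The function name `expected_scan_length` is hypothetical.

```python
def expected_scan_length(k, points_within_r, total_points):
    """Approximate number of randomly ordered objects read before k + 1 fall within r."""
    fraction = points_within_r / total_points  # chance that one object read is a hit
    return (k + 1) / fraction

print(round(expected_scan_length(3, 21, 31), 1))  # 5.9
```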

9
Block nested loop?

m: the number of memory pages
Hold m − 1 pages of objects in memory, and use
the remaining page to scan the database.
During the scan: for every object o kept in
memory, count the number of points with distance
≤ r from o.
Stop the scan once the counters of all objects
are above k.
I/O cost: O((n / b)^2)
Why is it so different from the CPU cost?
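A rough in-memory sketch of the block nested loop described on this slide, assuming m ≥ 2 and Euclidean distance. The database is modeled as a list of pages (each a list of objects), and `pages_read` merely counts simulated page accesses. The function name and the sample data are hypothetical.

```python
from math import dist

def block_nested_loop(pages, m, r, k, distance=dist):
    """Return (outliers, pages_read) for a database given as a list of pages."""
    outliers, pages_read = [], 0
    block_size = m - 1                        # pages held in memory per outer iteration
    for start in range(0, len(pages), block_size):
        block_pages = pages[start:start + block_size]
        block = [o for page in block_pages for o in page]
        pages_read += len(block_pages)        # cost of loading the block
        counts = [0] * len(block)
        for page in pages:                    # the remaining page scans the database
            pages_read += 1
            for x in page:
                for i, o in enumerate(block):
                    if distance(o, x) <= r:
                        counts[i] += 1
            if all(c > k for c in counts):    # every in-memory object is a non-outlier
                break
        outliers += [o for o, c in zip(block, counts) if c <= k]
    return outliers, pages_read

pages = [[(0.0, 0.0), (0.1, 0.0)],
         [(0.0, 0.1), (0.1, 0.1)],
         [(5.0, 5.0), (0.05, 0.05)]]
print(block_nested_loop(pages, m=2, r=1.0, k=3))
```

In this toy run the first two blocks stop their scans early, while the block that contains the outlier forces a full scan of the database.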

10
Block nested loop (cont.)

I/O cost: O((n / b)^2)
Why is it so different from the CPU cost?
Say p is the percentage of non-outliers.
Example: p = 99.95%
Probability of not scanning the whole dataset in
a loop:
p^((m − 1) · b)
Practical values: m = 101, b = 100
p^((m − 1) · b) < 1%
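The estimate can be checked numerically. This sketch assumes each of the (m − 1) · b in-memory objects is independently a non-outlier with probability p. A scan can stop early only if all of them are non-outliers. The function name `prob_early_stop` is hypothetical.

```python
def prob_early_stop(p, m, b):
    """Probability that all (m - 1) * b in-memory objects are non-outliers."""
    return p ** ((m - 1) * b)

# Values from the slide: p = 99.95%, m = 101, b = 100.
print(f"{prob_early_stop(0.9995, 101, 100):.4f}")  # 0.0067, i.e. below 1%
```

So with realistic memory sizes almost every block contains at least one outlier, and almost every inner loop scans the whole database.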

11
Our solution

Done in 2 scans.
Assumption: objects are stored in a random order.
Otherwise, a randomization process is needed.
Just like nested loop.
But in this talk, limited by time, we will
elaborate only a basic solution with 3 scans.
See paper for the 2-scan approach.

12
Data Summary

Randomly take s samples, called centroids.
Example: s = 4.
Perform object partitioning.
For each object o, select those centroids within
distance r / 2 from o.
Among these centroids, assign o to the closest
one.
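The partitioning step might be sketched as follows, assuming Euclidean distance. The slide does not say what happens to an object that has no centroid within r / 2; marking it `None` here is an assumption of this sketch, as are the function name and the sample data.

```python
import random
from math import dist

def assign_to_centroids(R, centroids, r, distance=dist):
    """Map each object to the index of its closest centroid within r / 2, or None."""
    assignment = []
    for o in R:
        candidates = [(distance(o, c), j) for j, c in enumerate(centroids)]
        candidates = [(d, j) for d, j in candidates if d <= r / 2]
        assignment.append(min(candidates)[1] if candidates else None)
    return assignment

R = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.1), (9.0, 9.0)]
centroids = random.Random(0).sample(R, 2)   # s = 2 randomly sampled centroids
print(assign_to_centroids(R, centroids, r=1.0))
```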

13
Data Summary (cont.)

Collect statistics
Associate every centroid with a counter, equal to
the number of objects with distances ≤ r / 2.

14
Pruning

Assume k = 3.
No object in this partition can be an outlier.
Remember that the circle has a radius r / 2.
In general, all points assigned to a centroid
with a counter at least k are non-outliers.
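A sketch of the pruning rule under stated assumptions. Any two objects within r / 2 of the same centroid are within r of each other by the triangle inequality, so a well-populated partition contains no outliers. Here the counter is taken to be the number of objects assigned to the centroid, and the sketch prunes only when that counter exceeds k, which is a conservative reading that stays safe when an object counts itself. The function name `prune` and the data are hypothetical.

```python
from collections import Counter
from math import dist

def prune(R, centroids, r, k, distance=dist):
    """Return the objects that survive pruning (the candidate outliers)."""
    assigned = []
    for o in R:
        near = [(distance(o, c), j) for j, c in enumerate(centroids)
                if distance(o, c) <= r / 2]
        assigned.append(min(near)[1] if near else None)
    counter = Counter(j for j in assigned if j is not None)
    # Keep an object only if its partition is too small to prove it a non-outlier.
    return [o for o, j in zip(R, assigned) if j is None or counter[j] <= k]

R = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(prune(R, centroids=[(0.0, 0.0), (5.0, 5.0)], r=1.0, k=3))  # [(5.0, 5.0)]
```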

15
Pruning (cont.)

From

16
Pruning (cont.)

We get

17
Another scan

Keep these points in memory, and perform another
scan to verify if they are outliers.
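The final verification scan could look like the sketch below: the surviving candidates stay in memory while the database is read once, page by page. The function name `verify`, the page layout, and the Euclidean distance are assumptions for illustration.

```python
from math import dist

def verify(candidates, pages, r, k, distance=dist):
    """Scan the database once and return the candidates that are true outliers."""
    counts = [0] * len(candidates)
    for page in pages:
        for x in page:
            for i, o in enumerate(candidates):
                if distance(o, x) <= r:
                    counts[i] += 1
    return [o for o, c in zip(candidates, counts) if c <= k]

pages = [[(0.0, 0.0), (0.1, 0.0)],
         [(0.0, 0.1), (0.1, 0.1)],
         [(0.6, 0.0), (5.0, 5.0)]]
print(verify([(5.0, 5.0), (0.6, 0.0)], pages, r=1.0, k=3))  # [(5.0, 5.0)]
```

The candidate (0.6, 0.0) is rejected because five objects, itself included, lie within r of it.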

18
Memory requirement for achieving 3 scans?

How large is it in practice?
Answer: about 1% of the dataset!
See next.

19
3-scan memory requirement

CA: a spatial dataset released by the TIGER
project
62k two-dimensional points representing addresses
in California.
Household: released by the US Census Bureau
1 million three-dimensional points,
each of which represents the annual expenditure
of an American family on electricity, gas, and
water, respectively.
Server: KDD Cup 1999 data containing the
statistics of 500k network connections.

20
3-scan memory requirement (cont.)

when 0.1% of the dataset are outliers
when there is 1 outlier

21
3-scan memory requirement (cont.)

22
3-scan memory requirement (cont.)

23
3-scan memory requirement (cont.)

Observations
The memory requirement decreases exponentially as
r increases.
For most meaningful r, the memory just needs to
hold 1% of the dataset!
In practice, the memory may be much larger than
1% of the dataset.
So we can use the additional memory to improve
efficiency.
This is the motivation of our 2-scan approach.

24
More in the paper

A SNIF technique for extracting outliers in two
scans.
Detailed theoretical analysis.
Analytical comparison of the proposed algorithm
against the state-of-the-art CELL.
CELL is applicable to Euclidean data only.

25
Experiment 1 (I/O vs r)