Title: Mining Distance-based Outliers from Large Databases in Any Metric Space
1Mining Distance-based Outliers from Large Databases in Any Metric Space
- Yufei Tao Chinese University of Hong Kong
- Xiaokui Xiao Chinese University of Hong Kong
- Shuigeng Zhou Fudan University
2Outlier definition
- Parameters: a set R of objects, a distance threshold r, and an integer k ≥ 1.
- An object o is an outlier if at most k objects in R (including o) have distances ≤ r from o.
- E.g., k = 3.
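The definition above can be checked directly by brute force. The following is a minimal sketch, assuming Euclidean points and a hypothetical helper name `is_outlier`; any other metric can be passed in as `dist`.

```python
import math

def is_outlier(o, R, r, k, dist):
    """True if at most k objects of R (o itself included) lie within distance r of o."""
    neighbours = sum(1 for x in R if dist(o, x) <= r)
    return neighbours <= k

# Toy data (illustrative only): a tight cluster of four points and one far point.
R = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [o for o in R if is_outlier(o, R, r=1.0, k=3, dist=math.dist)]
print(outliers)
```

Each cluster point has four objects within r (itself included), which exceeds k = 3, so only the isolated point is reported.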
3Outlier definition (cont.)
4Applications
- Fraud detection
- Catching fraudulent credit card transactions
- Euclidean distance (?)
- Business location planning
- Outlier restaurants? The vicinities of these restaurants may be good locations for new restaurants.
- Shortest path distance
- Fire detection
- Which sensor has a much higher temperature than its neighboring sensors?
- Composite distance capturing both the proximity of two sensors and their temperature difference.
5Why this definition?
- Because it is compatible with rare events in statistics.
- Example: rare events in a Gaussian distribution.
- The definition was first introduced by Knorr and Ng, VLDB '98.
6Goals
- Support any distance metric satisfying the triangle inequality.
- Road network shortest-path distance
- Edit distance
- Time series similarity
-
- Find all outliers in I/O cost linear in n/b.
- n: the cardinality of R
- b: the page size
7Nested loop?
- Worst case CPU time: O(n^2).
- Average CPU time ≈ O(n)
- Shown by Bay and Schwabacher, SIGKDD '03.
- Intuition: if an object o is not an outlier, its verification may require scanning only a fraction of the database.
- See next.
8Nested loop (cont.)
- k = 3
- # of points in the circle = 21
- Total # of points = 31
- Hence, to verify that the green point is not an outlier, we need to scan about 4 / (21 / 31) ≈ 5.9 points.
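The early-termination idea and the arithmetic on this slide can be sketched as follows. The function name `verify_with_early_exit` and the toy data are hypothetical; the estimate (k + 1) / fraction treats each scanned object as falling in the circle independently, which is an approximation.

```python
import math

def verify_with_early_exit(o, R, r, k, dist):
    """Scan R in order and stop as soon as more than k objects within r are seen.

    Returns (is_outlier, number_of_objects_scanned).
    """
    close = 0
    for scanned, x in enumerate(R, start=1):
        if dist(o, x) <= r:
            close += 1
            if close > k:          # o is cleared; no need to read the rest
                return False, scanned
    return True, len(R)

# Slide arithmetic: k = 3, 21 of the 31 points fall inside the circle.
k = 3
expected_scans = (k + 1) / (21 / 31)
print(round(expected_scans, 1))

# Toy data (illustrative only).
R = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (9.0, 9.0)]
cleared = verify_with_early_exit((0.0, 0.0), R, 1.0, k, math.dist)
flagged = verify_with_early_exit((10.0, 10.0), R, 1.0, k, math.dist)
print(cleared, flagged)
```

A non-outlier stops the scan after a handful of objects, whereas an outlier always forces a pass over the whole input.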
9Block nested loop?
- m: the number of memory pages
- Hold m - 1 pages of objects in memory, and use the remaining page to scan the database.
- During the scan: for every object o kept in memory, count the number of points with distance ≤ r from o.
- Stop the scan once the counters of all objects are above k.
- I/O cost: O((n / b)^2)
- Why is it so different from the CPU cost?
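The procedure on this slide can be sketched in memory as below. This is an illustrative model, not the authors' code: `pages` is a hypothetical list of pages (each a list of objects), and page reads are counted in a simplified way (one read per page loaded into the block, one per page touched by the scan).

```python
import math

def block_nested_loop(pages, m, r, k, dist):
    """Return (outliers, page_reads) for a block nested loop with m memory pages."""
    outliers, page_reads = [], 0
    step = m - 1                                   # pages held in memory per block
    for start in range(0, len(pages), step):
        held = pages[start:start + step]
        page_reads += len(held)
        block = [o for page in held for o in page]
        counts = [0] * len(block)
        for page in pages:                         # scan using the remaining page
            page_reads += 1
            for x in page:
                for i, o in enumerate(block):
                    if counts[i] <= k and dist(o, x) <= r:
                        counts[i] += 1
            if all(c > k for c in counts):         # every held object is cleared
                break
        outliers.extend(o for o, c in zip(block, counts) if c <= k)
    return outliers, page_reads

# Toy data (illustrative only): three pages of two points each, m = 2.
pages = [[(0.0, 0.0), (0.1, 0.0)],
         [(0.0, 0.1), (0.1, 0.1)],
         [(5.0, 5.0), (0.2, 0.2)]]
found, reads = block_nested_loop(pages, m=2, r=1.0, k=3, dist=math.dist)
print(found, reads)
```

In this toy run the first two blocks stop after two scanned pages, but the block that contains the outlier cannot stop early and reads every page.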
10Block nested loop (cont.)
- I/O cost: O((n / b)^2)
- Why is it so different from the CPU cost?
- Say p is the percentage of non-outliers.
- Example: p = 99.95%
- Probability of not scanning the whole dataset in a loop: p^((m - 1) · b)
- Practical values: m = 101, b = 100
- p^((m - 1) · b) < 1%
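The figure on this slide can be reproduced with a few lines of arithmetic. The sketch assumes, as the slide's formula implicitly does, that the (m - 1) · b objects held in memory are non-outliers independently of one another; the variable names are illustrative.

```python
p = 0.9995            # fraction of non-outliers (99.95%)
m, b = 101, 100       # memory pages, objects per page

objects_in_memory = (m - 1) * b            # 10,000 objects per block
p_early_stop = p ** objects_in_memory      # all of them must be non-outliers
print(f"{p_early_stop:.4f}")
```

The result is roughly 0.0067, so almost every block contains at least one outlier and forces a full scan of the dataset.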
11Our solution
- Done in 2 scans.
- Assumption: objects are stored in a random order.
- Otherwise, a randomization process is needed.
- Just like nested loop.
- But in this talk, limited by time, we will elaborate only a basic solution with 3 scans.
- See paper for the 2-scan approach.
12Data Summary
- Randomly take s samples, called centroids.
- Example: s = 4.
- Perform object partitioning.
- For each object o, select those centroids within distance r / 2 from o.
- Among these centroids, assign o to the closest one.
13Data Summary (cont.)
- Collect statistics
- Associate every centroid with a counter, equal to the number of objects with distances ≤ r / 2 from it.
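The partitioning and counting steps can be sketched as a single pass over the data. This is an assumption-laden illustration: the helper name `summarize` is hypothetical, the centroids are passed in explicitly (in practice they could be drawn with `random.sample(R, s)`), and the counter is read as counting every object within r / 2 of the centroid, the centroid itself included.

```python
import math

def summarize(R, centroids, r, dist):
    """One pass over R: assign each object to a centroid and update the counters.

    Returns (assignment, counters); assignment[i] is a centroid index, or None
    if no centroid lies within r / 2 of R[i].
    """
    half = r / 2
    counters = [0] * len(centroids)
    assignment = []
    for o in R:
        near = [(dist(o, c), j) for j, c in enumerate(centroids) if dist(o, c) <= half]
        for _, j in near:
            counters[j] += 1                       # o lies within r / 2 of centroid j
        assignment.append(min(near)[1] if near else None)
    return assignment, counters

# Toy data (illustrative only).
R = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
assignment, counters = summarize(R, centroids, r=1.0, dist=math.dist)
print(assignment, counters)
```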
14Pruning
- Assume k = 3.
- No object in this partition can be an outlier.
- Remember that the circle has a radius r / 2.
- In general, all points assigned to a centroid
with a counter at least k are non-outliers.
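The pruning rule rests on the triangle inequality, which is why any metric suffices. As a short derivation (the symbols c and o' are introduced here for illustration): let c be a centroid, o an object assigned to c, and o' any object within r / 2 of c. Then

```latex
\[
d(o, o') \;\le\; d(o, c) + d(c, o') \;\le\; \frac{r}{2} + \frac{r}{2} \;=\; r .
\]
```

Hence every object counted by the centroid's counter lies within distance r of o, so a sufficiently large counter rules out every object in the partition at once.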
15Pruning (cont.)
16Pruning (cont.)
17Another scan
- Keep the points that survive pruning in memory, and perform another scan to verify whether they are outliers.
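Putting the steps together, a three-pass procedure might look like the sketch below. It is not the authors' implementation: the mapping of passes to the talk's three scans is an assumption, `find_outliers` is a hypothetical name, and the sketch prunes on counter > k (counting the centroid itself) so that it stays consistent with the outlier definition given earlier.

```python
import math

def find_outliers(R, centroids, r, k, dist):
    """Return (outliers, number_of_candidates_kept_in_memory)."""
    half = r / 2
    # Pass 1: per-centroid counters of objects within r / 2.
    counters = [0] * len(centroids)
    for o in R:
        for j, c in enumerate(centroids):
            if dist(o, c) <= half:
                counters[j] += 1
    # Pass 2: keep only the objects that cannot be pruned.
    candidates = []
    for o in R:
        near = [(dist(o, c), j) for j, c in enumerate(centroids) if dist(o, c) <= half]
        if not near or counters[min(near)[1]] <= k:
            candidates.append(o)
    # Pass 3: verify the in-memory candidates against the whole dataset.
    counts = [0] * len(candidates)
    for x in R:
        for i, o in enumerate(candidates):
            if counts[i] <= k and dist(o, x) <= r:
                counts[i] += 1
    return [o for o, c in zip(candidates, counts) if c <= k], len(candidates)

# Toy data (illustrative only).
R = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1), (5.0, 5.0), (9.0, 9.0)]
centroids = [(0.1, 0.1), (5.0, 5.0)]
result, kept = find_outliers(R, centroids, r=1.0, k=3, dist=math.dist)
print(result, kept)
```

Only two of the seven objects need to be held in memory for the final pass, which is the effect the memory-requirement slides quantify.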
18Memory requirement for achieving 3 scans?
- How large is it in practice?
- Answer: about 1% of the dataset!
- See next.
193-scan memory requirement
- CA: a spatial dataset released by the TIGER project
- 62k two-dimensional points representing addresses in California.
- Household: released by the US Census Bureau
- 1 million three-dimensional points, each of which represents the annual expenditure of an American family on electricity, gas, and water, respectively.
- Server: KDD Cup 1999 data containing the statistics of 500k network connections.
203-scan memory requirement (cont.)
- Plot annotations: "when 0.1% of the dataset are outliers" and "when there is 1 outlier".
213-scan memory requirement (cont.)
223-scan memory requirement (cont.)
233-scan memory requirement (cont.)
- Observations
- The memory requirement decreases exponentially as r increases.
- For most meaningful r, the memory just needs to hold 1% of the dataset!
- In practice, the memory may be much larger than 1% of the dataset.
- So we can use the additional memory to improve efficiency.
- This is the motivation of our 2-scan approach.
24More in the paper
- A SNIF technique for extracting outliers in two scans.
- Detailed theoretical analysis.
- Analytical comparison of the proposed algorithm against the state-of-the-art CELL.
- CELL is applicable to Euclidean data only.
25Experiment 1 (I/O vs r)
- k = 0.05% of the cardinality
26Experiment 1 (I/O vs r)
- k = 0.05% of the cardinality
27Experiment 1 (I/O vs r)
- k = 0.05% of the cardinality
28Experiment 2 (I/O vs k)
- r = median of the interesting range
29Experiment 2 (I/O vs k)
- r = median of the interesting range
30Experiment 2 (I/O vs k)
- r = median of the interesting range
31Experiment 3 (I/O vs memory)
- r = median of the interesting range, k = 0.05% of the cardinality.
32Experiment 3 (I/O vs memory)
- r = median of the interesting range, k = 0.05% of the cardinality.
33Experiment 3 (I/O vs memory)
- r = median of the interesting range, k = 0.05% of the cardinality.
34Summary
- An algorithm for finding distance-based outliers by scanning the database twice.
- Applicable to any distance metric.