Novelty Detection and Profile Tracking from Massive Data

About This Presentation

Description:

Title: PowerPoint Presentation
Author: eugene
Last modified by: Eugene Fink
Created Date: 6/25/2003 4:44:54 PM
Document presentation format: On-screen Show

Slides: 36

Provided by: euge1153

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Novelty Detection and Profile Tracking from Massive Data
Jaime Carbonell, Eugene Fink, Santosh Ananthraman
2
Motivation
Search for interesting patterns in large data sets
3
Motivation
Search for interesting patterns in large data sets

Current applications
Processing of intelligence data
Prediction of natural threats

Future applications
Scientific discoveries
Analysis of business data
and more

4
Outline

Main results of the ARGUS project
- Approximate matching
- Streaming data
- Novelty detection
More about approximate matching
- Records and queries
- Search for matches
- Experimental results

5
Large data sets
Large: From a million (10^6) to several billion (10^10) records
Data: Structured records with numbers, strings, and nominal values
Sets: Databases and streams of records

Specific sets
Hospital admissions (1.7 million records)
Network flow (5 trillion records)
Federal wire (simulated data)

6
Main results
We have developed a system that addresses three problems

Retrieval of approximate matches for known
patterns

Processing of streaming data

Identification of new patterns and gradual
changes in old patterns

7
Approximate matching
Fast identification of approximate matches in large sets of records

Examples
Misspelled names
Inexact numbers
Spatial proximity

8
Streaming data
Continuous search for matches in a stream of new records

Maintain a set of pending queries

Identify matches for these queries among
incoming records
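
The two steps above can be sketched as a loop over incoming records. This is a minimal illustration, not the ARGUS implementation: representing a query as a Python predicate, and the names `match_stream`, `pending_queries`, and `incoming`, are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]
Query = Callable[[Record], bool]  # assumption: a query is a predicate on one record


def match_stream(records: Iterable[Record],
                 pending: Dict[str, Query]) -> Iterator[Tuple[str, Record]]:
    """Yield (query name, record) for every pending query that a new record satisfies."""
    for record in records:
        for name, query in pending.items():
            if query(record):
                yield name, record


# Hypothetical pending queries and incoming records.
pending_queries = {
    "young_asthma": lambda r: r["dx"] == "asthma" and r["age"] < 40,
    "any_flu": lambda r: r["dx"] == "flu",
}
incoming = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 45, "dx": "asthma"},
]
hits = list(match_stream(incoming, pending_queries))
```

This naive loop tests every query against every record, so its cost grows with the number of pending queries.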

9
RETE network
Identify common parts of queries and arrange them into a RETE network, which significantly reduces the matching time

Hundreds to thousands of pending queries
Tens to hundreds of records per second
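
The benefit of sharing common parts of queries can be illustrated with a much simplified sketch. It is not a full RETE network: it only shows the idea that a condition used by several queries is evaluated once per record and its result reused. The condition names, the query names, and the representation of a query as a conjunction of named conditions are all assumptions.

```python
from typing import Callable, Dict, FrozenSet, List

Record = Dict[str, object]

# Hypothetical elementary conditions; each is evaluated once per record.
conditions: Dict[str, Callable[[Record], bool]] = {
    "is_male": lambda r: r["sex"] == "male",
    "age_20_40": lambda r: 20 <= r["age"] <= 40,
    "has_flu": lambda r: r["dx"] == "flu",
}

# Each query is a conjunction of named conditions; several queries share conditions.
queries: Dict[str, FrozenSet[str]] = {
    "q1": frozenset({"is_male", "age_20_40"}),
    "q2": frozenset({"age_20_40", "has_flu"}),
    "q3": frozenset({"has_flu"}),
}


def matching_queries(record: Record) -> List[str]:
    """Evaluate every shared condition once, then combine the cached results."""
    needed = set().union(*queries.values())
    truth = {name: conditions[name](record) for name in needed}
    return sorted(q for q, conds in queries.items() if all(truth[c] for c in conds))
```

With three queries over three distinct conditions, each record triggers three condition tests instead of five.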

10
Novelty detection
11
Example: Static event
12
Example: New event
[Plot with axes labeled density and distance]
13
Example: Hidden event
[Plot with axes labeled density and distance]
14
Example: Growing event
[Plot with axes labeled density and distance]
15
Visualization

Display of records, clusters, and queries in
two and three dimensions
Access to data tables and analysis results

16
Example: Data and clusters
17
Example: Density analysis
18
Information flow
19
Outline

Main results of the ARGUS project
- Approximate matching
- Streaming data
- Novelty detection
More about approximate matching
- Records and queries
- Search for matches
- Experimental results

20
Motivation
Retrieval of relevant records based on partially inaccurate information

Inaccurate records
Inaccurate queries
Incomplete knowledge

21
Table of records
We specify a table of records by a list of attributes
Example: We can describe patients in a hospital by their sex, age, and diagnosis
22
Records and queries
A record includes a specific value for each attribute
A query may include lists of values and numeric ranges
Query: Sex: male, female; Age: 20..40; Dx: asthma, flu
23
Query types
24
Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query
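
The definition of an exact match can be written down directly. The sketch below assumes a record is a dictionary of attribute values and a query maps each attribute to either a set of allowed values or a numeric range; the names `Range`, `satisfies`, and `is_exact_match` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


@dataclass(frozen=True)
class Range:
    """Closed numeric range, such as the age range 20..40."""
    low: float
    high: float


Constraint = Union[Set[str], Range]  # a list of values or a numeric range
Record = Dict[str, Union[str, int]]
Query = Dict[str, Constraint]


def satisfies(value, constraint: Constraint) -> bool:
    """True if one attribute value belongs to the respective query constraint."""
    if isinstance(constraint, Range):
        return constraint.low <= value <= constraint.high
    return value in constraint


def is_exact_match(record: Record, query: Query) -> bool:
    """True if every constrained attribute of the record satisfies the query."""
    return all(satisfies(record[attr], c) for attr, c in query.items())


# The example query from the slides: sex male or female, age 20..40, dx asthma or flu.
query = {"sex": {"male", "female"}, "age": Range(20, 40), "dx": {"asthma", "flu"}}
```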
25
Approximate matches
A record is an approximate match for a query if it is close to the query region
[Figure label: Record]
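
One way to make "close to the query region" concrete is to measure how far each attribute value falls outside its constraint and add up the gaps. The slides do not specify the distance function, so the choices here are assumptions: a numeric value contributes its gap to the range, a nominal mismatch contributes 1, and the contributions are summed without weights.

```python
from typing import Dict


def attribute_distance(value, constraint) -> float:
    """Gap between one value and its constraint (assumed distance, see lead-in)."""
    if isinstance(constraint, tuple):  # numeric range given as (low, high)
        low, high = constraint
        return float(max(low - value, 0, value - high))
    return 0.0 if value in constraint else 1.0  # nominal value set


def distance_to_region(record: Dict[str, object], query: Dict[str, object]) -> float:
    """Sum of per-attribute gaps; zero exactly when the record is an exact match."""
    return sum(attribute_distance(record[a], c) for a, c in query.items())


region = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
```

In practice the attributes would need scaling or weights, since a gap of one year and a diagnosis mismatch are not naturally comparable.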
26
Approximate queries
An approximate query includes

Point or region

Distance function

Number of matches

Distance limit
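
The four components of an approximate query map naturally onto a small data structure. The sketch below answers such a query with a plain linear scan, which is only a baseline for illustration and not the indexed search described later; the class and function names, and the age-only distance function, are assumptions.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class ApproximateQuery:
    region: Dict[str, object]                                # point or region
    distance: Callable[[Record, Dict[str, object]], float]   # distance function
    max_matches: int                                         # number of matches
    distance_limit: float                                    # distance limit


def retrieve(records: List[Record], q: ApproximateQuery) -> List[Record]:
    """Return up to max_matches records within distance_limit, nearest first."""
    scored = ((q.distance(r, q.region), i) for i, r in enumerate(records))
    within = [(d, i) for d, i in scored if d <= q.distance_limit]
    return [records[i] for _, i in heapq.nsmallest(q.max_matches, within)]


def age_gap(record: Record, region: Dict[str, object]) -> float:
    """Hypothetical distance: how far the age lies outside the (low, high) range."""
    low, high = region["age"]
    return float(max(low - record["age"], 0, record["age"] - high))


patients = [{"age": 25}, {"age": 43}, {"age": 70}, {"age": 41}]
query = ApproximateQuery(region={"age": (20, 40)}, distance=age_gap,
                         max_matches=2, distance_limit=5.0)
nearest = retrieve(patients, query)
```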

27
Indexing tree

Maintain a PATRICIA tree of records

[Figure: indexing tree; node labels: sex (male, female), age (30, 40, 50), diagnosis (ulcer, asthma, fracture, flu)]
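
A tree of this shape can be sketched as nested dictionaries, with one level per attribute. This is a simplified stand-in: it branches on whole attribute values in a fixed order and omits the path compression of a real PATRICIA tree. The attribute order and the sample records are assumptions chosen to resemble the sex, age, and diagnosis values on the slide.

```python
from typing import Dict, List

ATTRIBUTES = ("sex", "age", "dx")  # assumed order of levels in the tree


def insert(tree: dict, record: Dict[str, object]) -> None:
    """Walk one level per attribute; the last level maps a value to a list of records."""
    node = tree
    for attr in ATTRIBUTES[:-1]:
        node = node.setdefault(record[attr], {})
    node.setdefault(record[ATTRIBUTES[-1]], []).append(record)


# Illustrative records, not taken from the hospital database.
sample: List[Dict[str, object]] = [
    {"sex": "male", "age": 30, "dx": "ulcer"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 50, "dx": "fracture"},
    {"sex": "female", "age": 40, "dx": "asthma"},
    {"sex": "female", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "flu"},
]
tree: dict = {}
for rec in sample:
    insert(tree, rec)
```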
28
Search for matches

Depth-first search for exact matches

Best-first search for approximate matches

[Figure: indexing tree; node labels: sex (male, female), age (30, 40, 50), diagnosis (ulcer, asthma, fracture, flu)]
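
Both search strategies can be sketched over a nested-dictionary tree with one level per attribute. The sketch makes several assumptions: the tree layout, the attribute order, a distance that sums per-attribute gaps (numeric gap to a range, 1 for a nominal mismatch), and the sample records. Because every gap is non-negative, the distance accumulated along a path is a lower bound for all records below it, which is what lets best-first search return the nearest records first.

```python
import heapq
from itertools import count

ATTRIBUTES = ("sex", "age", "dx")  # assumed order of levels in the tree


def build(records):
    """Nested dictionaries, one level per attribute; leaves are lists of records."""
    tree = {}
    for rec in records:
        node = tree
        for attr in ATTRIBUTES[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[ATTRIBUTES[-1]], []).append(rec)
    return tree


def gap(value, constraint):
    """Numeric gap to a (low, high) range, or 0/1 for membership in a value set."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return float(max(low - value, 0, value - high))
    return 0.0 if value in constraint else 1.0


def exact_matches(node, query, depth=0):
    """Depth-first search: descend only into branches that satisfy the query."""
    if depth == len(ATTRIBUTES):
        return list(node)
    found = []
    for value, child in node.items():
        if gap(value, query[ATTRIBUTES[depth]]) == 0:
            found.extend(exact_matches(child, query, depth + 1))
    return found


def nearest_matches(tree, query, k):
    """Best-first search: always expand the path with the smallest distance so far."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    frontier = [(0.0, next(tie), 0, tree)]
    results = []
    while frontier and len(results) < k:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(ATTRIBUTES):
            results.extend((dist, rec) for rec in node)
            continue
        for value, child in node.items():
            step = gap(value, query[ATTRIBUTES[depth]])
            heapq.heappush(frontier, (dist + step, next(tie), depth + 1, child))
    return results[:k]


# Illustrative records and query, not taken from the hospital database.
sample = [
    {"sex": "male", "age": 30, "dx": "ulcer"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 50, "dx": "fracture"},
    {"sex": "female", "age": 40, "dx": "asthma"},
    {"sex": "female", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "flu"},
]
index = build(sample)
query = {"sex": {"female"}, "age": (35, 45), "dx": {"flu"}}
```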
29
Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002

Twenty-one attributes
1.7 million records

Use of a Pentium computer
2.4 GHz CPU
1 Gbyte memory
400 MHz bus

30
Variables

Control variables
Number of records
Memory size
Query type

Measurements
Retrieval time

31
Small memory

Number of records: 100 to 1,670,000
Memory size: 4 MByte

32
Large memory

Number of records: 1,670,000
Memory size: 64 to 1,024 MByte

33
Scalability
Retrieval time grows as a fractional power (about 0.5) of database size: time is proportional to n^0.5

Number of records (n)    Time (seconds)
1,000,000                0.05
100,000,000              0.50
10,000,000,000           5.00
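
The scaling rule can be checked with a short calculation. The sketch assumes the model time = c * n^0.5 and fits the constant c to the first row (0.05 seconds at one million records); the function name and the fitted constant are illustrative, not values reported in the slides.

```python
def predicted_time(n: float, exponent: float = 0.5,
                   coefficient: float = 0.05 / 1_000_000 ** 0.5) -> float:
    """Retrieval time in seconds under the assumed model time = c * n**exponent."""
    return coefficient * n ** exponent


for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,} records: {predicted_time(n):.2f} s")
```

Under this model, a hundredfold increase in the number of records raises retrieval time only tenfold.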
34
Distributed architecture
Indexing trees on multiple computers
35
Conclusions