ARGUS: A Prototype Stream Anomaly Monitoring System

Provided by: cjin

Learn more at: http://www.cs.cmu.edu

1
ARGUS: A Prototype Stream Anomaly Monitoring System

Thesis Proposal
Chun Jin

Thesis Committee: Jaime Carbonell (Chair), Christopher Olston, Jamie Callan, Phil Hayes (DYNAMiX Technologies)
2
Thesis Statement

Stream Anomaly Monitoring Systems (SAMS) are an
important sub-class of stream applications. The
difficulty arises from the very large data volumes
and the large number of queries such a system
must handle.
Propose an approach for SAMSs that implements
incremental evaluation schemes with an adapted Rete
algorithm on a traditional DBMS platform and
exploits SAMS characteristics to optimize query
evaluation.
Demonstrate how the approach and its improvements
lead to a simple and fast implementation of
an effective and efficient SAMS.

3
Outline

Motivation
My ARGUS Approach
Current Work Status
Current System
Preliminary Results
Proposed Work and Timeline

4
Stream Processing

Stream Processing Applications
Network Traffic Analysis and Router Configuration
Internet Services
Sensor Data Analysis
Anomaly Detection
Stream Processing Projects
STREAM, TelegraphCQ, Aurora
NiagaraCQ, OpenCQ, WebCQ
Gigascope, Tribeca
Tapestry, Alert, Tukwila, etc.

5
Stream Anomaly Monitoring Systems (SAMS)

A SAMS monitors structured data streams for
anomalies or potential hazards.
Continuous queries may number in the thousands or
tens of thousands.
Daily stream volumes may exceed millions of
records.
A SAMS query is rarely satisfied
(very high selectivity).

6
SAMS Dataflow
[Figure: SAMS dataflow. Labels: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Queries, Storage, Alerts, Analyst.]
7
Query Example 4

Suppose for every big transaction of type code
1000, the analyst wants to check if the money
stayed in the bank or left within ten days. An
additional sign of possible fraud is that
transactions involve at least one intermediate
bank. The query generates an alarm whenever the
receiver of a large transaction (over 1,000,000)
transfers at least half of the money further
within ten days of this transaction using an
intermediate bank.

8
SQL Query for Example 4

FROM transaction r1, transaction r2, transaction r3
WHERE r2.type_code = 1000 AND
      r3.type_code = 1000 AND
      r1.type_code = 1000 AND
      r1.amount > 1000000 AND
      r1.rbank_aba = r2.sbank_aba AND
      r1.benef_account = r2.orig_account AND
      r2.amount > 0.5 * r1.amount AND
      r1.tran_date < r2.tran_date AND
      r2.tran_date < r1.tran_date + 10 AND
      r2.rbank_aba = r3.sbank_aba AND
      r2.benef_account = r3.orig_account AND
      r2.amount = r3.amount AND
      r2.tran_date < r3.tran_date AND
      r3.tran_date < r2.tran_date + 10
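A minimal sketch of how the Example 4 query could be run, assuming an in-memory SQLite database. The table name `transfers`, the `tran_id` column, the integer day numbers used for `tran_date`, the SELECT list, and the four sample rows are all hypothetical; they are not taken from the slides.

```python
import sqlite3

# Hypothetical schema: only the columns the query touches, plus an id.
# tran_date is an integer day number, so "+ 10" means ten days later.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transfers (
        tran_id INTEGER PRIMARY KEY,
        type_code INTEGER, amount REAL, tran_date INTEGER,
        sbank_aba TEXT, orig_account TEXT,
        rbank_aba TEXT, benef_account TEXT)
""")

# Made-up rows: 1 -> 2 -> 3 forms the suspicious chain; 4 is too small.
rows = [
    (1, 1000, 2000000, 1, "A", "a1", "B", "b1"),
    (2, 1000, 1500000, 4, "B", "b1", "C", "c1"),
    (3, 1000, 1500000, 6, "C", "c1", "D", "d1"),
    (4, 1000, 50000, 5, "B", "b1", "E", "e1"),
]
conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

alerts = conn.execute("""
    SELECT r1.tran_id, r2.tran_id, r3.tran_id
    FROM transfers r1, transfers r2, transfers r3
    WHERE r1.type_code = 1000 AND r2.type_code = 1000 AND r3.type_code = 1000
      AND r1.amount > 1000000
      AND r1.rbank_aba = r2.sbank_aba
      AND r1.benef_account = r2.orig_account
      AND r2.amount > 0.5 * r1.amount
      AND r1.tran_date < r2.tran_date
      AND r2.tran_date < r1.tran_date + 10
      AND r2.rbank_aba = r3.sbank_aba
      AND r2.benef_account = r3.orig_account
      AND r2.amount = r3.amount
      AND r2.tran_date < r3.tran_date
      AND r3.tran_date < r2.tran_date + 10
""").fetchall()
print(alerts)
```

Each returned triple identifies the large transfer, the onward transfer, and the transfer through the intermediate bank.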

9
ARGUS as a Prototype SAMS

Implement the Adapted Rete Algorithm on a
traditional DBMS platform
Rete (Forgy 1982): incremental evaluation based
on materialized intermediate results.
The SAMS assumption of very-high-selectivity queries
over very-large-volume data justifies the use
of Rete and necessitates some unique
improvements:
Transitivity Inference
Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97
Predicate Set Evaluation and Materialization
Partial Rete (materialization skipping)
Complex Common Computation Identification for
Sharing
Intermingled Sharing and Optimization processing
10
ARGUS System Architecture
[Figure: ARGUS system architecture. Labels: Data Streams, Data Tables, Intermediate Tables, Query Table, Do_queries, Stream Anomaly Monitoring, Analyst, Query, Rete Network Generator, Scheduler, Rete Networks, Identified Threats.]
11
ReteGenerator Architecture
[Figure: ReteGenerator architecture. Labels: SQL Queries, ReteGen Manager, Query Rewriter, Transitivity Inference, Sharing, History-based Cost Estimating, History-based Rete Optimizer, Topology Checker (Check Topology), Topology Table, Counter Table, Update Tables, Register Rete Networks.]
12
Selected ARGUS Topics

Adapted Rete Algorithm
ReteGenerator translates a query into a Rete
network that is wrapped as a stored procedure.
The procedure implements the Adapted Rete
Algorithm, which performs the incremental
evaluation.
Transitivity Inference
Rete Optimization
Computation Sharing

13
Adapted Rete Algorithm (Selection)

n and m are the old data sets.
$\Delta n$ and $\Delta m$ are the new, much smaller incremental
data sets.
Selection $\sigma$:
$\sigma(n \cup \Delta n) = \sigma(n) \cup \sigma(\Delta n)$
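The selection case can be sketched as follows, assuming in-memory Python lists stand in for the data table and the materialized intermediate result. The function names and sample amounts are hypothetical. Only the new rows are tested against the predicate; the old matches are reused.

```python
def select(rows, predicate):
    """Full (non-incremental) selection over a list of rows."""
    return [row for row in rows if predicate(row)]


def incremental_select(materialized, delta, predicate):
    """Apply the predicate to the new rows only and append the matches
    to the materialized result of the old rows."""
    new_matches = select(delta, predicate)
    materialized.extend(new_matches)
    return new_matches


is_large = lambda row: row["amount"] > 1000000

n = [{"amount": 2500000}, {"amount": 300}]        # old data
delta_n = [{"amount": 40}, {"amount": 1200000}]   # new increment

materialized = select(n, is_large)                # computed once, then kept
new_matches = incremental_select(materialized, delta_n, is_large)
```

After the call, `materialized` holds the same rows a full selection over the old and new data together would return.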
14
Adapted Rete Algorithm (Join)

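Join $\bowtie$:
$(n \cup \Delta n) \bowtie (m \cup \Delta m) = (n \bowtie m) \cup (\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$
Old results: $n \bowtie m$
New incremental results: $(\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$
When $\Delta n$ and $\Delta m$ are very small compared to n and
m, the time complexity of the incremental join is O(n+m).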
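The join case can be sketched the same way, assuming a naive nested-loop join over Python lists. The function names, the `benef`/`orig` keys, and the sample rows are hypothetical. Only the three terms that involve new data are computed; the old result is left untouched.

```python
def join(left, right, on):
    """Naive nested-loop join; cost is len(left) * len(right)."""
    return [(l, r) for l in left for r in right if on(l, r)]


def incremental_join(n, m, delta_n, delta_m, on):
    """New results only: every pair that involves at least one new row."""
    return (join(delta_n, m, on)
            + join(n, delta_m, on)
            + join(delta_n, delta_m, on))


same_account = lambda l, r: l["benef"] == r["orig"]

n = [{"benef": "b1"}, {"benef": "b2"}]        # old left input
m = [{"orig": "b1"}]                          # old right input
delta_n = [{"benef": "b3"}]                   # new left rows
delta_m = [{"orig": "b2"}, {"orig": "b3"}]    # new right rows

old_results = join(n, m, same_account)        # already materialized
new_results = incremental_join(n, m, delta_n, delta_m, same_account)
```

With this nested-loop version the incremental work is proportional to the sizes of the three smaller joins, so it grows with the sizes of n and m rather than with their product when the increments stay small.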
15
Incremental Evaluation in Rete Example 4
[Figure: Rete network for Example 4. DataTable feeds three selection nodes: type_code = 1000 AND amount > 1000000; type_code = 1000; type_code = 1000. First join node: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date < r2.tran_date, r2.tran_date < r1.tran_date + 10. Second join node: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date < r3.tran_date, r3.tran_date < r2.tran_date + 10. Output: r1, r2, r3.]
16
Complex Queries

A continuous query may contain multiple SQL
statements, and a single SQL statement may
contain unions of multiple SQL terms.
Each SQL term is mapped to a sub-Rete network.
These sub-Rete networks are then connected to
form the statement-level sub-networks.
The statement-level sub-networks are then
connected through the view references to form
the final query-level Rete network.

17
Transitivity Inference

Explore the transitivity properties of comparison
operators
to derive hidden, highly selective selection
predicates.
Highly selective selection predicates can
significantly improve performance because they
produce very small intermediate results.
Subsequent joins can then be performed very fast on
the materialized intermediate results.

18
Transitivity Inference Example

Given
r1.amount > 1000000 and
r2.amount > r1.amount * 0.5 and
r3.amount = r2.amount
r1.amount > 1000000 is highly selective on r1.
We can infer the highly selective predicates
r2.amount > 500000
r3.amount > 500000
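The inference in this example can be sketched as lower-bound propagation. This is a simplified illustration, not the ARGUS implementation: it only handles predicates of the forms `a > c`, `a > k * b` (with k > 0), and `a = b`, and the function name and data layout are hypothetical.

```python
def infer_lower_bounds(bounds, scaled_gt, equalities):
    """bounds:     {attr: c}      meaning attr > c
    scaled_gt:  [(a, b, k)]    meaning a > k * b, with k > 0
    equalities: [(a, b)]       meaning a = b
    Returns the strongest lower bound found for each attribute."""
    inferred = dict(bounds)
    # One round per predicate (plus one) is enough for acyclic chains
    # and keeps the loop finite on cyclic input.
    for _ in range(len(scaled_gt) + len(equalities) + 1):
        candidates = [(a, k * inferred[b])
                      for a, b, k in scaled_gt if b in inferred]
        for a, b in equalities:
            if b in inferred:
                candidates.append((a, inferred[b]))
            if a in inferred:
                candidates.append((b, inferred[a]))
        changed = False
        for attr, bound in candidates:
            if bound > inferred.get(attr, float("-inf")):
                inferred[attr] = bound
                changed = True
        if not changed:
            break
    return inferred


inferred = infer_lower_bounds(
    bounds={"r1.amount": 1000000},
    scaled_gt=[("r2.amount", "r1.amount", 0.5)],
    equalities=[("r3.amount", "r2.amount")],
)
```

The derived bounds on `r2.amount` and `r3.amount` can then be pushed down as selection predicates ahead of the joins.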

19
Rete Optimization
[Figure: Rete optimization. Labels: SQL Query, Join Graph, Join Enumerator, Active List, History-based Cost Estimator, DB, History-based Rete Optimizer, StructureBuilder, Update Tables, Rete network.]
20
Join Graph Example
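[Figure: join graph example. Nodes: 1, 2, 3, 4 and the joined node {1,2}. Edges labeled with join predicates P(1,2), P(1,3), P(2,3), P(3,4).]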
21
History-based Cost Estimator

Run sub-plans on historical data
To estimate the costs of sub-plans on future data
Assume same data distribution in past and future
Apply heuristic functions to avoid estimating
extremely high cost sub-plans.
Justification for the History-based Cost Estimator
Queries are compiled and optimized once, then executed
many times
It is tolerable to spend more time on the one-time
optimization
Accurate cost estimates pay off as the queries run
more and more times
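The idea of costing sub-plans by running them on historical data can be sketched as below. This is an illustrative toy under stated assumptions: plans are left-deep join orders, cost is the number of intermediate tuples materialized, and the function names, table contents, and predicates are hypothetical. The heuristic pruning of extremely high-cost sub-plans mentioned above is not implemented here.

```python
from itertools import permutations


def plan_cost(order, tables, predicates):
    """Run a left-deep join order on historical rows and count the
    intermediate tuples it materializes."""
    bound = {order[0]}
    partial = [{order[0]: row} for row in tables[order[0]]]
    cost = 0
    for name in order[1:]:
        bound.add(name)
        # Predicates that become checkable once `name` is joined in.
        ready = [test for names, test in predicates
                 if name in names and names <= bound]
        partial = [{**binding, name: row}
                   for binding in partial for row in tables[name]
                   if all(test({**binding, name: row}) for test in ready)]
        cost += len(partial)
    return cost


def cheapest_plan(tables, predicates):
    """Pick the join order with the lowest measured historical cost."""
    return min(permutations(tables),
               key=lambda order: plan_cost(order, tables, predicates))


history = {
    "r1": [{"acct": i} for i in range(4)],
    "r2": [{"orig": i % 2, "benef": i} for i in range(4)],
    "r3": [{"orig": 3}],
}
predicates = [
    (frozenset({"r1", "r2"}), lambda b: b["r1"]["acct"] == b["r2"]["orig"]),
    (frozenset({"r2", "r3"}), lambda b: b["r2"]["benef"] == b["r3"]["orig"]),
]
best = cheapest_plan(history, predicates)
```

On this sample, joining the most selective pair first produces fewer intermediate tuples than starting with `r1` and `r2`, which is the kind of difference the estimator is meant to expose.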

22
Computation Sharing

Predicate Indexing
Extended predicate set operations
Sharing Algorithm

23
Predicate Indexing

Predicate Indexing Concepts
Equivalent Predicate:
$p_1 \equiv p_2$ iff $\forall D,\ p_1(D) = p_2(D)$
Equivalent Predicate Class
Canonical Predicate Form
Predicates are converted into their canonical forms
and stored as records in tables.
Searching for a predicate becomes data retrieval from
tables.
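A minimal sketch of canonical forms stored as table records, assuming an in-memory SQLite table. It covers only comparisons between one attribute and one numeric constant; the table name `predicate_index`, its columns, and the canonicalization rule (attribute on the left, lower-cased) are assumptions for illustration, not the ARGUS schema.

```python
import sqlite3

# Operator to use when the two sides of a comparison are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "=", "<>": "<>"}


def canonical(left, op, right):
    """Attribute on the left (lower-cased), numeric constant on the right."""
    if isinstance(left, (int, float)):
        left, op, right = right, FLIP[op], left
    return (left.lower(), op, float(right))


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predicate_index (
        pred_id INTEGER PRIMARY KEY,
        attr TEXT, op TEXT, const REAL,
        UNIQUE (attr, op, const))
""")


def register(left, op, right):
    """Return the id of an equivalent stored predicate, inserting if new."""
    key = canonical(left, op, right)
    row = conn.execute(
        "SELECT pred_id FROM predicate_index "
        "WHERE attr = ? AND op = ? AND const = ?", key).fetchone()
    if row:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO predicate_index (attr, op, const) VALUES (?, ?, ?)", key)
    return cursor.lastrowid


first = register("r1.amount", ">", 1000000)
second = register(1000000, "<", "R1.AMOUNT")   # same predicate, written differently
third = register("r1.amount", ">", 500000)     # a different predicate
```

Two syntactically different but equivalent predicates map to the same record, so a lookup is an ordinary indexed query.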

24
Relationship between Predicate Sets and Their
Result Tuple Sets

Predicate Set: a set of conjunctive predicates
Its Result Tuple Set: the set of database tuples
that satisfy all the predicates of the Predicate
Set.
For a fixed database status D, $S_D$ maps a predicate
set P to its result tuple set:
$S_D : P \mapsto S_D(P)$
Predicate sets and their result tuple sets are
complementary
Predicates are filters of data items
The more predicates, the fewer
result tuples

25
Extending Predicate Set Operations

Defined on predicate sets
Definitions are justified by the relationships
among corresponding result tuple sets
Important to common computation identification

26
Semantic Subset $\sqsubseteq$

Given two predicate sets P1 and P2, we say that
P1 is a semantic subset of P2, denoted
$P_1 \sqsubseteq P_2$, if for any database status D we have
$S_D(P_1) \supseteq S_D(P_2)$.

27
Semantic Subset Example

p1: t1.a > 1, p2: t1.a > 2
P1 = {p1}, P2 = {p2}
$S(P_1) \supseteq S(P_2)$,
so $P_1 \sqsubseteq P_2$.
Why?
$P_2 \equiv \{p_1, p_2\}$
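The example can be checked with a small sketch, restricted to predicates of the form `attr > c` (an assumption made here to keep the implication test trivial). P1 is treated as a semantic subset of P2 when every predicate in P1 is implied by some predicate in P2, so every tuple that satisfies P2 also satisfies P1. The function names and sample rows are hypothetical.

```python
def implies(q, p):
    """For predicates (attr, c) meaning attr > c: does q imply p?"""
    return q[0] == p[0] and q[1] >= p[1]


def is_semantic_subset(P1, P2):
    """Sufficient test: each predicate of P1 is implied by one in P2."""
    return all(any(implies(q, p) for q in P2) for p in P1)


def result_set(P, rows):
    """Tuples that satisfy every predicate in the predicate set P."""
    return [row for row in rows if all(row[attr] > c for attr, c in P)]


P1 = [("t1.a", 1)]   # p1: t1.a > 1
P2 = [("t1.a", 2)]   # p2: t1.a > 2

rows = [{"t1.a": value} for value in range(5)]
s1 = result_set(P1, rows)   # rows with t1.a in {2, 3, 4}
s2 = result_set(P2, rows)   # rows with t1.a in {3, 4}
```

Here `s2` is contained in `s1`, matching the claim that the more restrictive predicate set has the smaller result tuple set.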

28
Sharing Types
[Figure: sharing types. Panel titles: Add-only, Non-change, Reconstruction, Selection Add-only. Node labels: T1, T2, POT1, POT2, POJ, PFJ, POJ-PFJ, PNJ-POJ.]
29
Sharing Algorithm Overview

Non-change sharing.
Add-only sharing.
Optimizing the remaining query.
Reconstruction and selection sharing.
Constructing the remaining Rete network based on
the optimized plan with possible sharing.

30
Current Work Status

A preliminary system
Database
A preliminary ReteGenerator
With the Adapted Rete and Transitivity Inference
Will be expanded to incorporate optimization,
computation sharing, incremental aggregation,
etc.
A preliminary evaluation
A full evaluation of the complete system will be
conducted in the future

31
Preliminary Evaluation: Queries and Data

7 queries on a synthesized FedWire money transfer
database of 320006 records.
Two Data Conditions
Data1: Old = first 300000 records,
New = remaining 20006 records
(alerts generated)
Data2: Old = first 300000 records,
New = next 20000 records
(no alerts)

32
Preliminary Results
[Figure: bar chart of Execution Time (s), 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2. Annotation: Rete with Transitivity Inference.]
33
Transitivity Inference
[Figure: two bar charts of Execution Time (s) on Data1 and Data2, one panel for Q2 and one for Q4. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]
34
Partial Rete Generation
[Figure: bar chart of Execution Time (s), 0 to 50, on Data1 and Data2. Series: Partial Rete, Rete, SQL. Note: Q4, assuming Transitivity Inference is not applicable.]
35
Proposed Work

System Design and Implementation
System Evaluation

36
System Design and Implementation

Rete Optimization (in progress) (05-08/2004)
Computation Sharing (planned) (07-11/2004)
Incremental Aggregation (planned) (12/2004-02/2005)
Constraint Exploiting (optional) (04-05/2005)
Transitivity Inference Enhancements (optional) (06-08/2005)
Automatic Index Selection (optional) (09-12/2005)

37
System Evaluation

Data Collection (12/2004-01/2005)
Query Generation (12/2004-01/2005)
Simulation and Evaluation (02-05/2005)
Single SQL vs. Single Rete
Multiple SQL vs. Multiple Shared Optimized Rete
Single Non-optimized Rete vs. Single Optimized
Rete
Multiple Non-shared Optimized Rete vs. Multiple
Shared Optimized Rete
Non-incremental Aggregation vs. Incremental
Aggregation

38
Evaluation Data Collection

FedWire Money Transfer Transactions
Synthesized: 0.5M records.
Plan to generate 0.5M more.
23 attributes/record
Massachusetts Medical Data
Real: 1.6M records (sanitized)
70 attributes/record
In-patient admission and discharge records.
Plan to expand to 10M.

39
Evaluation Queries

Currently: 7 queries on FedWire, 3 queries on Medical.
Plan to extend to 20-40 queries for each domain.
Further extend the query sets with:
Similar predicates matching different constants
Join predicate sets with non-empty intersections
Same where_clauses but different groupby_clauses
Same where_clauses and groupby_clauses but
different aggregation operators

40
Timeline

System Design and Implementation (Required): 03/2004-02/2005
System Implementation (Optional): 04/2005-12/2005
Evaluation on Required Parts: 12/2004-05/2005
Thesis Writing and Defense: 06/2005-03/2006
Thesis Writing: 06-12/2005
Thesis Finalizing: 01-03/2006
Defense: 02 or 03/2006

41
ARGUS Summary

Implement the incremental evaluation schemes with
the Adapted Rete Algorithm upon a traditional
DBMS platform
To deal with very-large-volume data, exploit the
very-high-selectivity query property for
optimization
Transitivity Inference
Predicate Set Evaluation and Materialization
Partial Rete (Materialization skipping)
Complex Common Computation Identification for
Sharing
Intermingled Sharing and Optimization processing