Title: ARGUS: A Prototype Stream Anomaly Monitoring System
1. ARGUS: A Prototype Stream Anomaly Monitoring System
Thesis Committee: Jaime Carbonell (Chair), Christopher Olston, Jamie Callan, Phil Hayes (DYNAMiX Technologies)
2. Thesis Statement
- Stream Anomaly Monitoring Systems (SAMS) are an important sub-class of stream applications. The difficulty arises from the very large data volume and the large number of queries the system is expected to handle.
- Propose an approach for SAMSs that implements incremental evaluation schemes with an adapted Rete algorithm on a traditional DBMS platform and exploits SAMS characteristics for query evaluation optimization.
- Demonstrate how the approach and the improvements could lead to a simple and fast implementation of an effective and efficient SAMS.
3. Outline
- Motivation
- My ARGUS Approach
- Current Work Status
  - Current System
  - Preliminary Results
- Proposed Work and Timeline
4. Stream Processing
- Stream Processing Applications
  - Network Traffic Analysis and Router Configuration
  - Internet Services
  - Sensor Data Analysis
  - Anomaly Detection
- Stream Processing Projects
  - STREAM, TelegraphCQ, Aurora
  - NiagaraCQ, OpenCQ, WebCQ
  - Gigascope, Tribeca
  - Tapestry, Alert, Tukwila, etc.
5. Stream Anomaly Monitoring Systems (SAMS)
- A SAMS monitors structured data streams for anomalies or potential hazards.
- Continuous queries may number in the thousands or tens of thousands.
- Daily stream volumes may exceed millions of records.
- Satisfaction of a SAMS query is often rare (very high selectivity).
6. SAMS Dataflow
[Figure: SAMS dataflow diagram. Labels: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Queries, Storage, Alerts, Analyst.]
7. Query Example 4
- Suppose that for every large transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query raises an alarm whenever the receiver of a large transaction (over 1,000,000) transfers at least half of the money onward within ten days of that transaction through an intermediate bank.
8. SQL Query for Example 4
    FROM transaction r1, transaction r2, transaction r3
    WHERE r2.type_code = 1000 AND
          r3.type_code = 1000 AND
          r1.type_code = 1000 AND
          r1.amount > 1000000 AND
          r1.rbank_aba = r2.sbank_aba AND
          r1.benef_account = r2.orig_account AND
          r2.amount > 0.5 * r1.amount AND
          r1.tran_date < r2.tran_date AND
          r2.tran_date < r1.tran_date + 10 AND
          r2.rbank_aba = r3.sbank_aba AND
          r2.benef_account = r3.orig_account AND
          r2.amount = r3.amount AND
          r2.tran_date < r3.tran_date AND
          r3.tran_date < r2.tran_date + 10
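To make the query concrete, here is a minimal sketch that runs it on a handful of made-up rows using Python's built-in sqlite3 module. Several things are assumptions for illustration only and do not come from the slides: the `id` column and the SELECT list (the slide shows no SELECT clause), the use of integer day numbers for `tran_date` so that `+ 10` means ten days, the strict `<` and `>` comparisons, and all sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "transaction" is an SQL keyword, so the table name is quoted.
conn.execute("""
    CREATE TABLE "transaction" (
        id INTEGER PRIMARY KEY,   -- hypothetical key, used only to report matches
        type_code INTEGER,
        amount INTEGER,
        sbank_aba TEXT, rbank_aba TEXT,
        orig_account TEXT, benef_account TEXT,
        tran_date INTEGER         -- assumed: day number, so "+ 10" means ten days
    )
""")

# Made-up rows: transfers 1 -> 2 -> 3 form the suspicious chain.
rows = [
    (1, 1000, 2000000, "A", "B", "acc1", "acc2", 1),
    (2, 1000, 1200000, "B", "C", "acc2", "acc3", 4),
    (3, 1000, 1200000, "C", "D", "acc3", "acc4", 6),
    (4, 1000,  300000, "B", "C", "acc2", "acc3", 5),   # less than half of transfer 1
    (5, 1000, 1200000, "C", "D", "acc3", "acc4", 30),  # more than ten days later
]
conn.executemany('INSERT INTO "transaction" VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)

EXAMPLE_4 = """
    SELECT r1.id, r2.id, r3.id
    FROM "transaction" r1, "transaction" r2, "transaction" r3
    WHERE r2.type_code = 1000 AND
          r3.type_code = 1000 AND
          r1.type_code = 1000 AND
          r1.amount > 1000000 AND
          r1.rbank_aba = r2.sbank_aba AND
          r1.benef_account = r2.orig_account AND
          r2.amount > 0.5 * r1.amount AND
          r1.tran_date < r2.tran_date AND
          r2.tran_date < r1.tran_date + 10 AND
          r2.rbank_aba = r3.sbank_aba AND
          r2.benef_account = r3.orig_account AND
          r2.amount = r3.amount AND
          r2.tran_date < r3.tran_date AND
          r3.tran_date < r2.tran_date + 10
"""
alerts = conn.execute(EXAMPLE_4).fetchall()
print(alerts)
```

Only the chain of transfers 1, 2, 3 satisfies every predicate; the other rows fail the amount or date conditions.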
9. ARGUS as a Prototype SAMS
- Implement the Adapted Rete Algorithm on a traditional DBMS platform.
- Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
- The SAMS assumption of very-high-selectivity queries over very-large-volume data justifies employing Rete and necessitates some unique improvements:
  - Transitivity Inference (Ono/Lohman VLDB'90; Pirahesh/Leung/Hasan ICDE'97)
  - Predicate Set Evaluation and Materialization
  - Partial Rete (materialization skipping)
  - Complex Common Computation Identification for Sharing
  - Intermingled Sharing and Optimization processing
10. ARGUS System Architecture
[Figure: ARGUS system architecture diagram. Labels: Data Streams, Stream Anomaly Monitoring, Data Tables, Intermediate Tables, Query Table, Do_queries, Scheduler, Rete Network Generator, Rete Networks, Query, Analyst, Identified Threats.]
11. ReteGenerator Architecture
[Figure: ReteGenerator architecture diagram. Labels: SQL Queries, ReteGen Manager, Query Rewriter, Transitivity Inference, Sharing, History-based Cost Estimating, History-based Rete Optimizer, Topology Checker (Check Topology), Topology Table, Counter Table, Update Tables, Register Rete Networks.]
12. Selected ARGUS Topics
- Adapted Rete Algorithm
  - ReteGenerator translates a query into a Rete network that is wrapped as a stored procedure.
  - The procedure implements the Adapted Rete Algorithm, which performs the incremental evaluation.
- Transitivity Inference
- Rete Optimization
- Computation Sharing
13. Adapted Rete Algorithm (Selection)
- n and m are the old data sets.
- $\Delta n$ and $\Delta m$ are the new, much smaller incremental data sets.
- Selection $\sigma$:
  - $\sigma(n \cup \Delta n) = \sigma(n) \cup \sigma(\Delta n)$
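14. Adapted Rete Algorithm (Join)
- Join $\bowtie$:
  - $(n \cup \Delta n) \bowtie (m \cup \Delta m) = (n \bowtie m) \cup (\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$
  - Old results: $n \bowtie m$. New incremental results: $(\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$.
- When $\Delta n$ and $\Delta m$ are very small compared to n and m, the time complexity of the incremental join is $O(n+m)$.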
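The incremental identities for selection and join can be checked with a small sketch. This is an illustration on in-memory Python lists, not the stored-procedure implementation ARGUS generates; the function names and sample data are hypothetical.

```python
def select(rows, pred):
    """Selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def join(left, right, pred):
    """Nested-loop join: all pairs (l, r) that satisfy pred."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def incremental_join(n, m, dn, dm, pred):
    """Only the NEW join results caused by the deltas dn and dm.

    The old result join(n, m) is assumed to be materialized already,
    so it is never recomputed here.
    """
    return join(dn, m, pred) + join(n, dm, pred) + join(dn, dm, pred)


# Old data sets and small increments (made-up values).
n, dn = [1, 2, 3], [4]
m, dm = [2, 3, 4], [1, 5]


def big(x):
    return x > 2


def equal(l, r):
    return l == r


# Selection: only the increment needs to be filtered.
full_selection = select(n + dn, big)
incremental_selection = select(n, big) + select(dn, big)

# Join: old materialized result plus the three delta terms.
old_join = join(n, m, equal)
new_join = incremental_join(n, m, dn, dm, equal)
full_join = join(n + dn, m + dm, equal)

print(incremental_selection, old_join, new_join)
```

With a nested-loop join, each delta term scans one old data set once per new tuple, so the incremental work stays small as long as the increments are small.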
15. Incremental Evaluation in Rete: Example 4
[Figure: Rete network for Example 4 over DataTable, aliased as r1, r2, r3.
Selection nodes: type_code = 1000 and amount > 1000000; type_code = 1000; type_code = 1000.
First join node: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date < r2.tran_date, r2.tran_date < r1.tran_date + 10.
Second join node: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date < r3.tran_date, r3.tran_date < r2.tran_date + 10.]
16. Complex Queries
- A continuous query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
- Each SQL term is mapped to a sub-Rete network.
- These sub-Rete networks are then connected to form the statement-level sub-networks.
- The statement-level sub-networks are further connected, based on the view references, to form the final query-level Rete network.
17. Transitivity Inference
- Explores the transitivity properties of comparison operators
- To derive hidden, highly selective selection predicates
- Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results. Subsequent joins can then be performed very quickly on the materialized intermediate results.
18. Transitivity Inference Example
- Given
  - r1.amount > 1000000 and
  - r2.amount > r1.amount * 0.5 and
  - r3.amount = r2.amount
- r1.amount > 1000000 is highly selective on r1
- We can infer highly selective predicates
  - r2.amount > 500000
  - r3.amount > 500000
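The inference in this example can be sketched as a small fixed-point propagation of lower bounds. This is a minimal illustration that handles only three predicate shapes (variable > constant, variable > positive factor times variable, variable = variable); the function name, the input encoding, and the round limit are assumptions, not the ARGUS implementation.

```python
def infer_lower_bounds(const_bounds, scaled_gt, equalities, max_rounds=10):
    """Derive strict lower bounds implied by transitivity.

    const_bounds: {var: c}       meaning var > c
    scaled_gt:    [(x, k, y)]    meaning x > k * y, with k > 0
    equalities:   [(x, y)]       meaning x = y
    """
    bounds = dict(const_bounds)
    lowest = float("-inf")
    for _ in range(max_rounds):  # bounded, so cyclic inputs cannot loop forever
        changed = False
        # x > k*y and y > c  imply  x > k*c  (valid because k > 0).
        for x, k, y in scaled_gt:
            if y in bounds and bounds.get(x, lowest) < k * bounds[y]:
                bounds[x] = k * bounds[y]
                changed = True
        # x = y and y > c  imply  x > c  (applied in both directions).
        for x, y in equalities:
            for a, b in ((x, y), (y, x)):
                if b in bounds and bounds.get(a, lowest) < bounds[b]:
                    bounds[a] = bounds[b]
                    changed = True
        if not changed:
            break
    return bounds


bounds = infer_lower_bounds(
    const_bounds={"r1.amount": 1000000},
    scaled_gt=[("r2.amount", 0.5, "r1.amount")],
    equalities=[("r3.amount", "r2.amount")],
)
print(bounds)
```

The derived bounds on r2.amount and r3.amount can then be added to the query as extra selection predicates, which is what lets the selections on r2 and r3 filter aggressively before any join.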
19. Rete Optimization
[Figure: Rete optimization architecture diagram. Labels: SQL Query, Join Enumerator, Join Graph, Active List, History-based Cost Estimator, DB, History-based Rete Optimizer, StructureBuilder, Update Tables, Rete network.]
20. Join Graph Example
[Figure: join graph example. Node labels: 1, 2, 3, 4, and "1,2". Join predicate labels: P(1,2), P(1,3), P(2,3), P(3,4).]
21. History-based Cost Estimator
- Run sub-plans on historical data
  - To estimate the costs of sub-plans on future data
  - Assume the same data distribution in past and future
  - Apply heuristic functions to avoid estimating extremely high-cost sub-plans.
- Justification for the History-based Cost Estimator
  - Queries are compiled and optimized once, and executed many times
  - It is tolerable to spend more time on the one-time optimization
  - Accurate cost estimates pay off as queries are run more and more times
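The idea of costing plans by actually running them on historical data can be sketched as follows. This is a toy illustration under stated assumptions: plans are left-deep join orders, cost is the total number of intermediate tuples produced, and the only heuristic is to skip orders that would require a cross product. None of these choices, names, or data come from the slides.

```python
from itertools import permutations


def has_cross_product(order, preds):
    """Heuristic filter: True if some table has no join predicate to any earlier table."""
    for i in range(1, len(order)):
        linked = any(
            order[i] in pair and any(t in pair for t in order[:i])
            for pair in preds
        )
        if not linked:
            return True
    return False


def plan_cost(order, tables, preds):
    """Run a left-deep plan on historical data; cost = intermediate tuples produced."""
    current = [{order[0]: row} for row in tables[order[0]]]
    cost = 0
    for name in order[1:]:
        joined = []
        for partial in current:
            for row in tables[name]:
                cand = {**partial, name: row}
                # Apply every predicate that links the new table to one already joined.
                applicable = [
                    p for (a, b), p in preds.items()
                    if name in (a, b) and a in cand and b in cand
                ]
                if all(p(cand) for p in applicable):
                    joined.append(cand)
        current = joined
        cost += len(current)
    return cost


def best_plan(tables, preds):
    """Cheapest join order among those that avoid cross products."""
    candidates = [
        o for o in permutations(sorted(tables)) if not has_cross_product(o, preds)
    ]
    return min(candidates, key=lambda o: plan_cost(o, tables, preds))


# Made-up historical data: table "c" is tiny, so joining through it first is cheap.
history = {"a": list(range(100)), "b": list(range(100)), "c": [0, 1]}
preds = {
    ("a", "b"): lambda t: t["a"] == t["b"],
    ("b", "c"): lambda t: t["b"] == t["c"],
}
best = best_plan(history, preds)
print(best, plan_cost(best, history, preds))
```

Running every candidate is expensive, which is why this kind of estimator only makes sense when the plan is optimized once and then executed many times.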
22. Computation Sharing
- Predicate Indexing
- Extended predicate set operations
- Sharing Algorithm
23. Predicate Indexing
- Predicate Indexing Concepts
  - Equivalent Predicates:
    - $p_1 \equiv p_2$ iff $\forall D,\ p_1(D) = p_2(D)$
  - Equivalent Predicate Class
  - Canonical Predicate Form
- Predicates are converted into their canonical forms and stored as records in tables.
- Searching for a predicate becomes data retrieval from tables.
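A minimal sketch of this idea, using Python's built-in sqlite3 module: equivalent binary comparison predicates are rewritten to one canonical form and stored as rows, so finding a shareable predicate is a table lookup. The canonicalization rule (smaller operand string on the left), the table layout, and the function names are assumptions for illustration; the slides do not specify the canonical forms ARGUS uses.

```python
import sqlite3

# Operator to use when the two operands are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "=", "<>": "<>"}


def canonical(left, op, right):
    """Put the lexicographically smaller operand on the left, flipping op if needed."""
    if left > right:
        left, right, op = right, left, FLIP[op]
    return left, op, right


def register(conn, left, op, right):
    """Return (node_id, already_known) for the predicate's canonical form."""
    key = canonical(left, op, right)
    row = conn.execute(
        "SELECT node_id FROM predicate_index WHERE lhs = ? AND op = ? AND rhs = ?",
        key,
    ).fetchone()
    if row is not None:
        return row[0], True
    cur = conn.execute(
        "INSERT INTO predicate_index (lhs, op, rhs) VALUES (?, ?, ?)", key
    )
    return cur.lastrowid, False


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predicate_index (
        node_id INTEGER PRIMARY KEY AUTOINCREMENT,
        lhs TEXT, op TEXT, rhs TEXT,
        UNIQUE (lhs, op, rhs)
    )
""")

first = register(conn, "r1.tran_date", "<", "r2.tran_date")
second = register(conn, "r2.tran_date", ">", "r1.tran_date")  # same predicate, reversed
other = register(conn, "r1.rbank_aba", "=", "r2.sbank_aba")
print(first, second, other)
```

Here the second registration finds the row created by the first, which is the signal that the two queries could share that computation.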
24. Relationship between Predicate Sets and Their Result Tuple Sets
- Predicate Set: a set of conjunctive predicates
- Its Result Tuple Set: the set of database tuples that satisfy all the predicates of the Predicate Set.
- For a fixed database state D, there is a mapping from a predicate set P to its result tuple set $S_D(P)$:
  - $S_D : P \to S_D(P)$
- Predicate sets and their result tuple sets are complementary
  - Predicates are filters of data items
  - The more predicates, the fewer result tuples
25. Extending Predicate Set Operations
- Defined on predicate sets
- Definitions are justified by the relationships among the corresponding result tuple sets
- Important for common computation identification
26. Semantic Subset $\sqsubseteq$
- Given two predicate sets P1 and P2, we say that P1 is a semantic subset of P2, denoted $P_1 \sqsubseteq P_2$, if for any database state D we have $S_D(P_1) \supseteq S_D(P_2)$.
27. Semantic Subset Example
- $p_1$: t1.a > 1, $p_2$: t1.a > 2
- $P_1 = \{p_1\}$, $P_2 = \{p_2\}$
- $S(P_1) \supseteq S(P_2)$,
- so $P_1 \sqsubseteq P_2$.
- Why?
  - $P_2 \equiv \{p_1, p_2\}$
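The example can be checked on one concrete database state with a short sketch. The data and helper name are hypothetical, each tuple is reduced to its t1.a value, and a check on a single state only illustrates the relationship; the definition requires it to hold for every state.

```python
def result_set(data, preds):
    """Result tuple set: the tuples in data that satisfy every predicate in preds."""
    return {t for t in data if all(p(t) for p in preds)}


def p1(a):
    return a > 1  # t1.a > 1


def p2(a):
    return a > 2  # t1.a > 2


D = range(-5, 6)  # made-up values of t1.a in one database state

S_P1 = result_set(D, [p1])
S_P2 = result_set(D, [p2])
S_P1_P2 = result_set(D, [p1, p2])

print(sorted(S_P1), sorted(S_P2), sorted(S_P1_P2))
```

Because t1.a > 2 implies t1.a > 1, adding p1 to P2 changes nothing, so P2 behaves like the larger predicate set {p1, p2}, while its result tuple set is the smaller one.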
28. Sharing Types
[Figure: diagram of sharing types over tables T1 and T2. Panel labels: Add-only, Non-change, Reconstruction, Selection Add-only. Node labels: POT1, POT2, POJ, PFJ, POJ-PFJ, PNJ-POJ.]
29. Sharing Algorithm Overview
- Non-change sharing.
- Add-only sharing.
- Optimizing the remaining query.
- Reconstruction and selection sharing.
- Constructing the remaining Rete network based on
the optimized plan with possible sharing.
30. Current Work Status
- A preliminary system
  - Database
  - A preliminary ReteGenerator
    - With the Adapted Rete and Transitivity Inference
    - Will be expanded to incorporate optimization, computation sharing, incremental aggregation, etc.
- A preliminary evaluation
  - A full evaluation of the complete system will be conducted in the future
31. Preliminary Evaluation: Queries and Data
- 7 queries on a synthesized FedWire money transfer database of 320,006 records.
- Two data conditions:
  - Data1: Old = first 300,000 records; New = remaining 20,006 records. ALERT.
  - Data2: Old = first 300,000 records; New = next 20,000 records. NOT alert.
32. Preliminary Results
[Figure: bar chart of execution time (s), axis from 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2, Rete with Transitivity Inference.]
33. Transitivity Inference
[Figure: two bar charts of execution time (s) on Data1 and Data2, one for Q2 and one for Q4. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]
34. Partial Rete Generation
[Figure: bar chart of execution time (s), axis from 0 to 50, on Data1 and Data2. Series: Partial Rete, Rete, SQL.]
Q4, assuming Transitivity Inference is not applicable.
35. Proposed Work
- System Design and Implementation
- System Evaluation
36. System Design and Implementation
- Rete Optimization (in progress) (05-08/2004)
- Computation Sharing (planned) (07-11/2004)
- Incremental Aggregation (planned) (12/2004-02/2005)
- Constraint Exploiting (optional) (04-05/2005)
- Transitivity Inference Enhancements (optional) (06-08/2005)
- Automatic Index Selection (optional) (09-12/2005)
37. System Evaluation
- Data Collection (12/2004-01/2005)
- Query Generation (12/2004-01/2005)
- Simulation and Evaluation (02-05/2005)
  - Single SQL vs. Single Rete
  - Multiple SQL vs. Multiple Shared Optimized Rete
  - Single Non-optimized Rete vs. Single Optimized Rete
  - Multiple Non-shared Optimized Rete vs. Multiple Shared Optimized Rete
  - Non-incremental Aggregation vs. Incremental Aggregation
38. Evaluation Data Collection
- FedWire Money Transfer Transactions
  - Synthesized, 0.5M records.
  - Plan to generate 0.5M more.
  - 23 attributes/record
- Massachusetts Medical Data
  - Real, 1.6M records (sanitized)
  - 70 attributes/record
  - In-patient admission and discharge records.
  - Expand to 10M.
39. Evaluation Queries
- Now: 7 queries on FedWire, 3 queries on Medical.
- Plan to extend to 20-40 queries for each domain.
- Further extend the query sets with
  - Similar predicates matching different constants
  - Join predicate sets that have non-empty intersections
  - Same where_clauses but different groupby_clauses
  - Same where_clauses and groupby_clauses but different aggregation operators
40. Timeline
- System Design and Implementation (Required): 03/2004-02/2005
- System Implementation (Optional): 04/2005-12/2005
- Evaluation on Required Parts: 12/2004-05/2005
- Thesis Writing and Defense: 06/2005-03/2006
  - Thesis Writing: 06-12/2005
  - Thesis Finalizing: 01-03/2006
  - Defense: 02 or 03/2006
41. ARGUS Summary
- Implement incremental evaluation schemes with the Adapted Rete Algorithm on a traditional DBMS platform
- To deal with very-large-volume data, exploit the very-high-selectivity query property for optimization:
  - Transitivity Inference
  - Predicate Set Evaluation and Materialization
  - Partial Rete (materialization skipping)
  - Complex Common Computation Identification for Sharing
  - Intermingled Sharing and Optimization processing
42. Thank you!
- Questions and Comments?