ARGUS: A Prototype Stream Anomaly Monitoring System

Provided by: cjin
Learn more at: http://www.cs.cmu.edu
1
ARGUS: A Prototype Stream Anomaly Monitoring System
  • Thesis Proposal
  • Chun Jin

Thesis Committee: Jaime Carbonell (Chair), Christopher Olston, Jamie Callan, Phil Hayes (DYNAMiX Technologies)
2
Thesis Statement
  • Stream Anomaly Monitoring Systems (SAMS) are an
    important sub-class of stream applications. The
    difficulty arises from the very large data volume
    and the large number of queries such a system is
    expected to handle.
  • Propose an approach for SAMS that implements
    incremental evaluation schemes with an adapted
    Rete algorithm on top of a traditional DBMS
    platform and exploits SAMS characteristics for
    query evaluation optimization.
  • Demonstrate how the approach and the improvements
    can lead to a simple and fast implementation of
    an effective and efficient SAMS.

3
Outline
  • Motivation
  • My ARGUS Approach
  • Current Work Status
  • Current System
  • Preliminary Results
  • Proposed Work and Timeline

4
Stream Processing
  • Stream Processing Applications
      • Network Traffic Analysis and Router Configuration
      • Internet Services
      • Sensor Data Analysis
      • Anomaly Detection
  • Stream Processing Projects
      • STREAM, TelegraphCQ, Aurora
      • NiagaraCQ, OpenCQ, WebCQ
      • Gigascope, Tribeca
      • Tapestry, Alert, Tukwila, etc.

5
Stream Anomaly Monitoring Systems (SAMS)
  • SAMS monitors structured data streams for
    anomalies or potential hazards.
  • Continuous queries may number in thousands or
    tens of thousands.
  • Daily stream volumes may exceed millions of
    records.
  • Satisfaction of a SAMS query is often rare
    (very-high-selectivity).

6
SAMS Dataflow
[Figure: SAMS dataflow diagram. Components: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Queries, Storage, Alerts, Analyst.]
7
Query Example 4
  • Suppose for every big transaction of type code
    1000, the analyst wants to check if the money
    stayed in the bank or left within ten days. An
    additional sign of possible fraud is that
    transactions involve at least one intermediate
    bank. The query generates an alarm whenever the
    receiver of a large transaction (over 1,000,000)
    transfers at least half of the money further
    within ten days of this transaction using an
    intermediate bank.

8
SQL Query for Example 4
  • FROM transaction r1, transaction r2, transaction r3
  • WHERE r2.type_code = 1000 AND
  • r3.type_code = 1000 AND
  • r1.type_code = 1000 AND
  • r1.amount > 1000000 AND
  • r1.rbank_aba = r2.sbank_aba AND
  • r1.benef_account = r2.orig_account AND
  • r2.amount > 0.5 * r1.amount AND
  • r1.tran_date < r2.tran_date AND
  • r2.tran_date < r1.tran_date + 10 AND
  • r2.rbank_aba = r3.sbank_aba AND
  • r2.benef_account = r3.orig_account AND
  • r2.amount = r3.amount AND
  • r2.tran_date < r3.tran_date AND
  • r3.tran_date < r2.tran_date + 10
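To make the Example 4 query concrete, here is a minimal sketch that runs it on a toy in-memory SQLite table. Several things are assumptions for illustration and do not come from the slides: the SELECT list and the tran_id column (the slide shows only the FROM and WHERE clauses), the column types, the use of an integer day number for tran_date, and the sample rows. The operators in the WHERE clause follow the reading "=", ">", "<", "*", "+" of the slide text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "transaction" is quoted because it is also an SQL keyword.
# tran_date is an integer day number here (an assumption), so "+ 10" means ten days.
conn.execute("""
    CREATE TABLE "transaction" (
        tran_id INTEGER PRIMARY KEY,
        type_code INTEGER,
        amount REAL,
        sbank_aba TEXT,
        rbank_aba TEXT,
        orig_account TEXT,
        benef_account TEXT,
        tran_date INTEGER
    )
""")

# Hypothetical sample data: transfers 1 -> 2 -> 3 form a suspicious chain,
# transfer 4 is large but is never forwarded in a matching way.
rows = [
    (1, 1000, 2000000, "A", "B", "a1", "b1", 1),
    (2, 1000, 1500000, "B", "C", "b1", "c1", 4),
    (3, 1000, 1500000, "C", "D", "c1", "d1", 6),
    (4, 1000, 3000000, "A", "B", "a2", "b2", 1),
    (5, 1000, 100, "B", "C", "b2", "c2", 2),
]
conn.executemany('INSERT INTO "transaction" VALUES (?,?,?,?,?,?,?,?)', rows)

# The SELECT list is an assumption; the FROM/WHERE part mirrors the slide.
QUERY = """
SELECT r1.tran_id, r2.tran_id, r3.tran_id
FROM "transaction" r1, "transaction" r2, "transaction" r3
WHERE r2.type_code = 1000 AND
      r3.type_code = 1000 AND
      r1.type_code = 1000 AND
      r1.amount > 1000000 AND
      r1.rbank_aba = r2.sbank_aba AND
      r1.benef_account = r2.orig_account AND
      r2.amount > 0.5 * r1.amount AND
      r1.tran_date < r2.tran_date AND
      r2.tran_date < r1.tran_date + 10 AND
      r2.rbank_aba = r3.sbank_aba AND
      r2.benef_account = r3.orig_account AND
      r2.amount = r3.amount AND
      r2.tran_date < r3.tran_date AND
      r3.tran_date < r2.tran_date + 10
"""

alerts = conn.execute(QUERY).fetchall()
print(alerts)
```

On this toy data only the chain of transfers 1, 2 and 3 satisfies all predicates, which illustrates how rarely such a query is expected to fire.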

9
ARGUS as a Prototype SAMS
  • Implement the Adapted Rete Algorithm on top of a
    traditional DBMS platform
  • Rete (Forgy 1982): incremental evaluation based
    on materialized intermediate results.
  • The SAMS assumption of very-high-selectivity
    queries over very-large-volume data justifies the
    use of Rete and necessitates some unique
    improvements:
      • Transitivity Inference
        (Ono/Lohman VLDB'90; Pirahesh/Leung/Hasan ICDE'97)
      • Predicate Set Evaluation and Materialization
      • Partial Rete (materialization skipping)
      • Complex Common Computation Identification for
        Sharing
      • Intermingled Sharing and Optimization processing

10
ARGUS System Architecture
[Figure: ARGUS system architecture diagram. Components: Data Streams, Data Tables, Intermediate Tables, Stream Anomaly Monitoring, Query Table, Do_queries, Scheduler, Rete Network Generator, Rete Networks, Query, Analyst, Identified Threats.]
11
ReteGenerator Architecture
[Figure: ReteGenerator architecture diagram. Components: SQL Queries, ReteGen Manager, Query Rewriter, Transitivity Inference, Sharing, History-based Rete Optimizer, History-based Cost Estimating, Topology Checker (Check Topology), Topology Table, Counter Table, Update Tables, Register Rete Networks.]
12
Selected ARGUS Topics
  • Adapted Rete Algorithm
      • ReteGenerator translates a query into a Rete
        network that is wrapped as a stored procedure.
      • The procedure implements the Adapted Rete
        Algorithm, which performs the incremental
        evaluation.
  • Transitivity Inference
  • Rete Optimization
  • Computation Sharing

13
Adapted Rete Algorithm (Selection)
  • $n$ and $m$ are the old data sets.
  • $\Delta n$ and $\Delta m$ are the new, much smaller
    incremental data sets.
  • Selection $\sigma$:
  • $\sigma(n \cup \Delta n) = \sigma(n) \cup \sigma(\Delta n)$
14
Adapted Rete Algorithm (Join)
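  • Join $\bowtie$:
  • $(n \cup \Delta n) \bowtie (m \cup \Delta m) = (n \bowtie m) \cup (\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$
  • Old results: $n \bowtie m$. New incremental results:
    $(\Delta n \bowtie m) \cup (n \bowtie \Delta m) \cup (\Delta n \bowtie \Delta m)$.
  • When $\Delta n$ and $\Delta m$ are very small compared to
    $n$ and $m$, the time complexity of the incremental
    join is $O(n+m)$.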
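The incremental selection and join rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions: data sets are plain lists, the join is a nested loop, and the function names and sample values are hypothetical. ARGUS itself runs these steps as stored procedures inside the DBMS.

```python
def select(pred, rows):
    """Plain selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def join(pred, left, right):
    """Plain nested-loop join on an arbitrary join predicate."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def incremental_select(pred, delta_n):
    """New selection results depend only on the new data delta_n."""
    return select(pred, delta_n)


def incremental_join(pred, n, delta_n, m, delta_m):
    """New join results: every term except the old n-join-m."""
    return (join(pred, delta_n, m)
            + join(pred, n, delta_m)
            + join(pred, delta_n, delta_m))


# Hypothetical old data (n, m) and small increments (delta_n, delta_m).
n, delta_n = [1, 2, 3], [4]
m, delta_m = [2, 3], [4, 5]


def is_big(x):
    return x > 2


def same(l, r):
    return l == r


# Selection: old results plus incremental results equal the full re-evaluation.
old_sel = select(is_big, n)
new_sel = incremental_select(is_big, delta_n)
full_sel = select(is_big, n + delta_n)

# Join: the materialized old result is reused, only the three delta terms run.
old_join = join(same, n, m)
new_join = incremental_join(same, n, delta_n, m, delta_m)
full_join = join(same, n + delta_n, m + delta_m)

print(sorted(old_sel + new_sel) == sorted(full_sel))
print(sorted(old_join + new_join) == sorted(full_join))
```

Because each delta term touches at least one small increment, the work grows with the size of the old data rather than with its product, provided the old results are kept materialized.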
15
Incremental Evaluation in Rete Example 4
[Figure: Rete network for Example 4 over DataTable, with aliases r1, r2, r3.
Selection nodes: type_code = 1000 AND amount > 1000000 (r1); type_code = 1000 (r2); type_code = 1000 (r3).
Join node (r1, r2): r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date < r2.tran_date, r2.tran_date < r1.tran_date + 10.
Join node (r2, r3): r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date < r3.tran_date, r3.tran_date < r2.tran_date + 10.]
16
Complex Queries
  • A continuous query may contain multiple SQL
    statements, and a single SQL statement may
    contain unions of multiple SQL terms.
  • Each SQL term is mapped to a sub-Rete network.
  • These sub-Rete networks are then connected to
    form the statement-level sub-networks.
  • The statement-level sub-networks are in turn
    connected, based on the view references, to form
    the final query-level Rete network.

17
Transitivity Inference
  • Exploit the transitivity properties of comparison
    operators
  • to derive hidden, highly selective selection
    predicates.
  • Highly selective selection predicates can
    significantly improve performance because they
    produce very small intermediate results.
    Subsequent joins can then run very fast on the
    materialized intermediate results.

18
Transitivity Inference Example
  • Given
      • r1.amount > 1000000 and
      • r2.amount > r1.amount * 0.5 and
      • r3.amount = r2.amount
  • r1.amount > 1000000 is highly selective on r1
  • We can infer the highly selective predicates
      • r2.amount > 500000
      • r3.amount > 500000
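The inference in this example can be sketched as a small bound-propagation loop. This is not the ARGUS implementation: the data layout (a dictionary of strict lower bounds plus a list of relations), the function name, and the restriction to ">" and "=" with positive factors on an acyclic set of relations are all assumptions made for illustration.

```python
def infer_lower_bounds(known, relations):
    """Propagate strict lower bounds through 'lhs op factor * rhs' relations.

    known:     {column: c} meaning column > c
    relations: (lhs, op, factor, rhs) with op in {">", "="} and factor > 0
    Assumes the relations are acyclic, so the loop terminates.
    """
    bounds = dict(known)
    changed = True
    while changed:
        changed = False
        for lhs, op, factor, rhs in relations:
            if op not in (">", "=") or factor <= 0 or rhs not in bounds:
                continue
            # rhs > b and lhs (> or =) factor * rhs imply lhs > factor * b
            candidate = factor * bounds[rhs]
            if lhs not in bounds or candidate > bounds[lhs]:
                bounds[lhs] = candidate
                changed = True
    return bounds


known = {"r1.amount": 1000000}
relations = [
    ("r2.amount", ">", 0.5, "r1.amount"),
    ("r3.amount", "=", 1.0, "r2.amount"),
]

bounds = infer_lower_bounds(known, relations)
for column in sorted(bounds):
    print(column, ">", bounds[column])
```

The derived bounds on r2.amount and r3.amount are the two inferred selection predicates from the slide; they can be applied to r2 and r3 before any join runs.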

19
Rete Optimization
[Figure: Rete optimization diagram. Components: SQL Query, Join Graph, Join Enumerator, Active List, History-based Cost Estimator, DB, History-based Rete Optimizer, StructureBuilder, Update Tables, Rete network.]
20
Join Graph Example
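[Figure: join graph example. Nodes: 1, 2, 3, 4 and the joined node {1,2}. Edges labeled with join predicates P(1,2), P(1,3), P(2,3), P(3,4).]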
21
History-based Cost Estimator
  • Run sub-plans on historical data
      • to estimate the costs of sub-plans on future data
      • Assumes the same data distribution in past and future
      • Apply heuristic functions to avoid estimating
        sub-plans with extremely high cost.
  • Justification for the History-based Cost Estimator
      • Queries are compiled and optimized once, and
        executed many times
      • It is tolerable to spend more time on the one-time
        optimization
      • Accurate cost estimates pay off more and more as
        the queries keep running

22
Computation Sharing
  • Predicate Indexing
  • Extended predicate set operations
  • Sharing Algorithm

23
Predicate Indexing
  • Predicate Indexing Concepts
      • Equivalent Predicates:
        $p_1 \equiv p_2$ iff $\forall D,\ p_1(D) = p_2(D)$
      • Equivalent Predicate Class
      • Canonical Predicate Form
  • Predicates are converted into their canonical forms
    and stored as records in tables.
  • Searching for a predicate becomes data retrieval
    from tables.
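As a rough illustration of canonical forms, the sketch below normalizes simple comparison predicates so that syntactic variants of the same predicate map to one key. The normalization rules (constant on the right, column names in sorted order), the tuple representation, and the PredicateIndex class are assumptions for this example. The slides store canonical forms as records in database tables, whereas a Python dictionary stands in for the table here.

```python
# Operator to use when the two operands of a comparison are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "=", "!=": "!="}


def is_constant(term):
    return isinstance(term, (int, float))


def canonical(pred):
    """Return a canonical (left, op, right) form of a comparison predicate.

    Assumed rules: a constant always goes on the right; for two columns,
    the alphabetically smaller name goes on the left.
    """
    left, op, right = pred
    if is_constant(left) and not is_constant(right):
        swap = True
    elif not is_constant(left) and not is_constant(right):
        swap = left > right
    else:
        swap = False
    if swap:
        left, op, right = right, FLIP[op], left
    return (left, op, right)


class PredicateIndex:
    """Maps each canonical predicate to a stable integer id."""

    def __init__(self):
        self._ids = {}

    def register(self, pred):
        key = canonical(pred)
        return self._ids.setdefault(key, len(self._ids))


index = PredicateIndex()
id_a = index.register(("r1.amount", ">", 1000000))
id_b = index.register((1000000, "<", "r1.amount"))  # same predicate, reversed
id_c = index.register(("r2.sbank_aba", "=", "r1.rbank_aba"))
print(id_a, id_b, id_c)
```

This only catches predicates that differ in operand order. Recognizing a full equivalent predicate class in the sense defined above would need more normalization rules than this sketch has.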

24
Relationship between Predicate Sets and Their Result Tuple Sets
  • Predicate Set: a set of conjunctive predicates
  • Its Result Tuple Set: the set of database tuples
    that satisfy all the predicates of the Predicate
    Set.
  • For a fixed database status $D$, there is a mapping
    from a predicate set $P$ to its result tuple set
    $S_D(P)$:
  • $S_D : P \mapsto S_D(P)$
  • Predicate sets and their result tuple sets are
    complementary
      • Predicates are filters of data items
      • The more predicates, the fewer result tuples

25
Extending Predicate Set Operations
  • Defined on predicate sets
  • Definitions are justified by the relationships
    among corresponding result tuple sets
  • Important to common computation identification

26
Semantic Subset $\sqsubseteq$
  • Given two predicate sets $P_1$ and $P_2$, we say that
    $P_1$ is a semantic subset of $P_2$, denoted
    $P_1 \sqsubseteq P_2$, if for any database status $D$ we
    have $S_D(P_1) \supseteq S_D(P_2)$.

27
Semantic Subset Example
  • $p_1$: t1.a > 1, $p_2$: t1.a > 2
  • $P_1 = \{p_1\}$, $P_2 = \{p_2\}$
  • $S(P_1) \supseteq S(P_2)$,
  • so $P_1 \sqsubseteq P_2$.
  • Why?
  • $P_2 \equiv \{p_1, p_2\}$
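The example can be checked on a small sample database. This sketch assumes that a tuple is just the value of t1.a and that predicates are Python functions; all names are hypothetical. A check on one sample database is only a sanity test, since the definition quantifies over every database status.

```python
def result_set(predicates, database):
    """Tuples of the database that satisfy every predicate in the set."""
    return {t for t in database if all(p(t) for p in predicates)}


def p1(a):
    return a > 1  # t1.a > 1


def p2(a):
    return a > 2  # t1.a > 2


# Hypothetical database status: the values of t1.a.
D = {0, 1, 2, 3, 4}

S_P1 = result_set([p1], D)
S_P2 = result_set([p2], D)
S_both = result_set([p1, p2], D)

# More predicates means fewer tuples: S(P1) contains S(P2).
print(S_P1 >= S_P2)
# Adding p1 to {p2} filters nothing further, so {p2} behaves like {p1, p2}.
print(S_both == S_P2)
```

The second check is the reason behind the "Why?" bullet: the predicate p2 already implies p1, so the set containing only p2 selects the same tuples as the literal superset of P1.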

28
Sharing Types
[Figure: sharing types diagram. Panel labels: Non-change, Add-only, Reconstruction, Selection Add-only. Node labels: T1, T2, POT1, POT2, POJ, PFJ, POJ-PFJ, PNJ-POJ.]
29
Sharing Algorithm Overview
  • Non-change sharing.
  • Add-only sharing.
  • Optimizing the remaining query.
  • Reconstruction and selection sharing.
  • Constructing the remaining Rete network based on
    the optimized plan with possible sharing.

30
Current Work Status
  • A preliminary system
      • Database
      • A preliminary ReteGenerator
          • With the Adapted Rete and Transitivity Inference
          • Will be expanded to incorporate optimization,
            computation sharing, incremental aggregation,
            etc.
  • A preliminary evaluation
      • Will conduct a full evaluation on the complete
        system in the future

31
Preliminary Evaluation: Queries and Data
  • 7 queries on a synthesized FedWire money transfer
    database with 320006 records.
  • Two data conditions
      • Data1: old data = first 300000 records; new data =
        remaining 20006 records; the new data triggers
        alerts
      • Data2: old data = first 300000 records; new data =
        next 20000 records; the new data does not trigger
        alerts

32
Preliminary Results
[Figure: bar chart of execution time (s), axis 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2. Annotation: Rete with Transitivity Inference.]
33
Transitivity Inference
[Figure: two bar charts of execution time (s) on Data1 and Data2. Panel Q4: axis 0 to 50. Panel Q2: axis 0 to 25. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]
34
Partial Rete Generation
[Figure: bar chart of execution time (s), axis 0 to 50, on Data1 and Data2. Series: Partial Rete, Rete, SQL. Note: Q4, assuming Transitivity Inference is not applicable.]
35
Proposed Work
  • System Design and Implementation
  • System Evaluation

36
System Design and Implementation
  • Rete Optimization (in progress) (05-08/2004)
  • Computation Sharing (planned) (07-11/2004)
  • Incremental Aggregation (planned)
    (12/2004-02/2005)
  • Constraint Exploiting (optional) (04-05/2005)
  • Transitivity Inference Enhancements (optional)
    (06-08/2005)
  • Automatic Index Selection (optional) (09-12/2005)

37
System Evaluation
  • Data Collection (12/2004-01/2005)
  • Query Generation (12/2004-01/2005)
  • Simulation and Evaluation (02-05/2005)
      • Single SQL vs. Single Rete
      • Multiple SQL vs. Multiple Shared Optimized Rete
      • Single Non-optimized Rete vs. Single Optimized
        Rete
      • Multiple Non-shared Optimized Rete vs. Multiple
        Shared Optimized Rete
      • Non-incremental Aggregation vs. Incremental
        Aggregation

38
Evaluation Data Collection
  • FedWire Money Transfer Transactions
      • Synthesized: 0.5M records.
      • Plan to generate 0.5M more.
      • 23 attributes/record
  • Massachusetts Medical Data
      • Real: 1.6M records (sanitized)
      • 70 attributes/record
      • In-patient admission and discharge records.
      • Expand to 10M.

39
Evaluation Queries
  • Currently: 7 queries on FedWire, 3 queries on Medical.
  • Plan to extend to 20-40 queries for each domain.
  • Further extend the query sets with
      • Similar predicates matching different constants
      • Join predicate sets with non-empty intersections
      • Same where_clauses but different groupby_clauses
      • Same where_clauses and groupby_clauses but
        different aggregation operators

40
Timeline
  • System Design and Implementation (Required):
    03/2004-02/2005
  • System Implementation (Optional):
    04/2005-12/2005
  • Evaluation on Required Parts: 12/2004-05/2005
  • Thesis Writing and Defense: 06/2005-03/2006
      • Thesis Writing: 06-12/2005
      • Thesis Finalizing: 01-03/2006
      • Defense: 02 or 03/2006

41
ARGUS Summary
  • Implement incremental evaluation schemes with
    the Adapted Rete Algorithm on top of a traditional
    DBMS platform
  • To deal with very-large-volume data, exploit the
    very-high-selectivity query property for
    optimization:
      • Transitivity Inference
      • Predicate Set Evaluation and Materialization
      • Partial Rete (materialization skipping)
      • Complex Common Computation Identification for
        Sharing
      • Intermingled Sharing and Optimization processing

42
  • Thank you!
  • Questions and Comments?