Title: ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams
1. ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams
- Chun Jin
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
- cjin_at_cs.cmu.edu
2. Stream Processing Model
- Stream processing is becoming demanding and prevalent.
3. Stream Databases
- Stream Database Applications
  - Network Traffic Analysis and Router Configuration
  - Dynamic Internet Services
  - Sensor Data Analysis
  - Anomaly Detection
- Stream Database Projects
  - STREAM, TelegraphCQ, Aurora
  - NiagaraCQ, OpenCQ, WebCQ
  - Gigascope, Tribeca
  - Tapestry, Alert, Tukwila, etc.
  - ARGUS
4. Stream Anomaly Monitoring Systems (SAMS)
- A SAMS monitors structured data streams for anomalies or potential hazards.
- Matches of queries may be high-urgency alerts. Prompt detections are desirable.
- Satisfaction of a SAMS query is often rare (very-high-selectivity).
5. SAMS Dataflow
[Figure: SAMS dataflow diagram. Labels: Data Streams (FedWire Money Transfers, Patient Records), Stream Anomaly Monitoring System, Queries, Storage, Alerts, Analyst.]
6. Challenges to SAMS
- Persistent queries may number in the thousands or tens of thousands.
- Daily stream volumes may exceed millions of records.
- Prompt detections are desirable.
- Very-high-selectivity query property.
7. Proposed ARGUS Approach
- Basic Framework
  - Incremental evaluation schemes (Adapted Rete algorithm)
    - Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
  - Upon a traditional DBMS platform
- Exploiting Very-High-Selectivity Query Property
  - Transitivity Inference
  - Conditional Materialization
- Optimizing Join Order
- Computation Sharing
- Related to Other Applications
  - Stream Databases
  - Modern DBMS Query Optimization
8. Query Example 4
- Suppose for every big transaction of type code
1000, the analyst wants to check if the money
stayed in the bank or left within ten days. An
additional sign of possible fraud is that
transactions involve at least one intermediate
bank. The query generates an alarm whenever the
receiver of a large transaction (over 1,000,000)
transfers at least half of the money further
within ten days of this transaction using an
intermediate bank.
9. SQL Query for Example 4
- FROM transaction r1, transaction r2, transaction r3
- WHERE r2.type_code = 1000 AND
- r3.type_code = 1000 AND
- r1.type_code = 1000 AND
- r1.amount > 1000000 AND
- r1.rbank_aba = r2.sbank_aba AND
- r1.benef_account = r2.orig_account AND
- r2.amount > 0.5 * r1.amount AND
- r1.tran_date < r2.tran_date AND
- r2.tran_date < r1.tran_date + 10 AND
- r2.rbank_aba = r3.sbank_aba AND
- r2.benef_account = r3.orig_account AND
- r2.amount = r3.amount AND
- r2.tran_date < r3.tran_date AND
- r3.tran_date < r2.tran_date + 10
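The Example 4 query can be tried on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module, not the ARGUS implementation. The tran_id column, the SELECT list, the encoding of tran_date as an integer day number, and the sample rows are all assumptions made for illustration; only the FROM and WHERE clauses follow the slide.

```python
import sqlite3

# Hypothetical schema: tran_id and the integer day-number encoding of
# tran_date are assumptions; the other column names come from the query.
SCHEMA = """
CREATE TABLE "transaction" (
    tran_id       INTEGER PRIMARY KEY,
    type_code     INTEGER,
    amount        REAL,
    sbank_aba     TEXT,
    orig_account  TEXT,
    rbank_aba     TEXT,
    benef_account TEXT,
    tran_date     INTEGER
)
"""

# The SELECT list is an assumption; FROM and WHERE follow Example 4.
QUERY = """
SELECT r1.tran_id, r2.tran_id, r3.tran_id
FROM "transaction" r1, "transaction" r2, "transaction" r3
WHERE r2.type_code = 1000 AND
      r3.type_code = 1000 AND
      r1.type_code = 1000 AND
      r1.amount > 1000000 AND
      r1.rbank_aba = r2.sbank_aba AND
      r1.benef_account = r2.orig_account AND
      r2.amount > 0.5 * r1.amount AND
      r1.tran_date < r2.tran_date AND
      r2.tran_date < r1.tran_date + 10 AND
      r2.rbank_aba = r3.sbank_aba AND
      r2.benef_account = r3.orig_account AND
      r2.amount = r3.amount AND
      r2.tran_date < r3.tran_date AND
      r3.tran_date < r2.tran_date + 10
"""

# Made-up rows: (tran_id, type_code, amount, sbank_aba, orig_account,
#                rbank_aba, benef_account, tran_date)
ROWS = [
    (1, 1000, 2_000_000, "A", "acc1", "B", "acc2", 1),   # large transfer
    (2, 1000, 1_500_000, "B", "acc2", "C", "acc3", 4),   # forwarded via bank C
    (3, 1000, 1_500_000, "C", "acc3", "D", "acc4", 6),   # forwarded again
    (4, 1000,   900_000, "A", "acc1", "B", "acc2", 2),   # below the threshold
    (5, 1000, 1_500_000, "C", "acc3", "E", "acc5", 20),  # outside the time window
]


def find_alerts(rows):
    """Load the rows into an in-memory table and run the Example 4 query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        'INSERT INTO "transaction" VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows
    )
    alerts = conn.execute(QUERY).fetchall()
    conn.close()
    return alerts


print(find_alerts(ROWS))  # only the chain 1 -> 2 -> 3 satisfies every predicate
```

The table name is double-quoted because TRANSACTION is an SQLite keyword.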
10. ARGUS System Architecture
[Figure: ARGUS system architecture diagram. Labels: Data Streams, Data Tables, Intermediate Tables, Query Table, Do_queries, Scheduler, Rete Network Generator, Rete Networks, Stream Anomaly Monitoring, Analyst, Query, Identified Threats.]
11. ReteGenerator Architecture
[Figure: ReteGenerator architecture diagram. Labels: ReteGenerator, Optimizer, Join Order, Conditional Materialization, Transitivity Inference, Sharing Module, SQL Queries.]
- Common Computation Identification
- Predicate Indexing
- Extended Predicate Set Operations
- Choose what and how to share
- Recording and Manipulating Network Topology
- Estimating Sharing Costs
12. Adapted Rete Algorithm (Selection)
- $n$ and $m$ are the old data sets.
- $\Delta n$ and $\Delta m$ are the new, much smaller incremental data sets.
- Selection $\sigma$:
  - $\sigma(n + \Delta n) = \sigma(n) + \sigma(\Delta n)$
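As an illustration of the selection rule, here is a minimal Python sketch (not the ARGUS code, which runs on a DBMS). It assumes data sets are plain lists and uses a hypothetical predicate is_large; the point is only that the stored result for the old data is extended by filtering the new increment alone.

```python
def select(rows, predicate):
    """Plain selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def is_large(amount):
    """Hypothetical selection predicate, chosen for illustration."""
    return amount > 1_000_000


old_amounts = [500_000, 2_000_000, 750_000]  # n: data already processed
new_amounts = [3_000_000, 10_000]            # delta n: newly arrived data

materialized = select(old_amounts, is_large)  # sigma(n), computed earlier
delta_result = select(new_amounts, is_large)  # sigma(delta n), computed on arrival
materialized = materialized + delta_result    # sigma(n) + sigma(delta n)

# Same answer as re-running the selection over all the data.
assert materialized == select(old_amounts + new_amounts, is_large)
```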
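13. Adapted Rete Algorithm (Join)
- Join $\bowtie$:
  - $(n + \Delta n) \bowtie (m + \Delta m) = n \bowtie m + \Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$
  - Old results: $n \bowtie m$
  - New incremental results: $\Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$
- When $\Delta n$ and $\Delta m$ are very small compared to $n$ and $m$, the time complexity of the incremental join is $O(n + m)$, because only the three terms involving $\Delta n$ or $\Delta m$ are computed.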
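The join decomposition can be checked with a small Python sketch. This is an illustrative nested-loop version under simple assumptions (lists as data sets, an equality predicate, made-up values), not the ARGUS implementation: the old result is kept as materialized, and only the three terms that involve new data are computed.

```python
def join(left, right, predicate):
    """Nested-loop join: all pairs (a, b) that satisfy the predicate."""
    return [(a, b) for a in left for b in right if predicate(a, b)]


def incremental_join(n, delta_n, m, delta_m, predicate):
    """Compute only the new results; the old result (n join m) is not recomputed."""
    return (
        join(delta_n, m, predicate)          # delta n  join  m
        + join(n, delta_m, predicate)        # n  join  delta m
        + join(delta_n, delta_m, predicate)  # delta n  join  delta m
    )


def same_key(a, b):
    """Hypothetical join predicate, chosen for illustration."""
    return a == b


n, m = [1, 2, 3], [2, 3, 4]  # old data sets
delta_n, delta_m = [4], [1]  # small increments

old_results = join(n, m, same_key)  # materialized earlier
new_results = incremental_join(n, delta_n, m, delta_m, same_key)

# Old plus new results equal a full recomputation over all the data.
full = join(n + delta_n, m + delta_m, same_key)
assert sorted(old_results + new_results) == sorted(full)
```

With nested loops, the three incremental terms cost on the order of |Δn|·|m| + |n|·|Δm| + |Δn|·|Δm| comparisons, which is linear in the old data sizes when the increments are small.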
14. Incremental Evaluation in Rete Example 4
[Figure: Rete network for Example 4 built over DataTable, with aliases r1, r2, r3.]
- Selection on r1: type_code = 1000, amount > 1000000
- Selection on r2: type_code = 1000
- Selection on r3: type_code = 1000
- Join of r1 and r2: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date < r2.tran_date, r2.tran_date < r1.tran_date + 10
- Join of r2 and r3: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date < r3.tran_date, r3.tran_date < r2.tran_date + 10
15. Complex Queries
- A persistent query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
- Each SQL term is mapped to a sub-Rete network.
- These sub-Rete networks are then connected to form the statement-level sub-networks.
- The statement-level sub-networks are further connected, based on the view references, to form the final query-level Rete network.
16. Transitivity Inference
- Exploring transitivity properties of comparison operators
- To derive hidden, highly selective selection predicates
- Highly selective selection predicates can significantly improve performance, as they may produce very small intermediate results. Subsequent joins can then be performed very fast on the materialized intermediate results.
- Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97
17. Transitivity Inference Example
- Given
  - r1.amount > 1000000 and
  - r2.amount > r1.amount * 0.5 and
  - r3.amount = r2.amount
- r1.amount > 1000000 is very highly selective on r1
- We can infer highly selective predicates
  - r2.amount > 500000
  - r3.amount > 500000
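The inference in this example can be sketched as simple bound propagation. The Python below is a hypothetical illustration, not the ARGUS optimizer: the function name, the constraint encoding, and the restriction to strict lower bounds with positive scale factors are all assumptions made to keep the sketch short.

```python
def infer_lower_bounds(bounds, scaled_gt, equalities):
    """Propagate strict lower bounds through comparison predicates.

    bounds:     {column: c}       meaning  column > c
    scaled_gt:  [(y, k, x), ...]  meaning  y > k * x, with k > 0 (assumed)
    equalities: [(a, b), ...]     meaning  a = b
    """
    inferred = dict(bounds)
    # A fixed number of passes is enough for a short chain of predicates
    # and avoids looping forever on cyclic constraints.
    for _ in range(len(scaled_gt) + len(equalities) + 1):
        for y, k, x in scaled_gt:
            # x > c and y > k * x (k > 0) imply y > k * c.
            if x in inferred and k * inferred[x] > inferred.get(y, float("-inf")):
                inferred[y] = k * inferred[x]
        for a, b in equalities:
            # a = b lets a bound on either side carry over to the other.
            for src, dst in ((a, b), (b, a)):
                if src in inferred and inferred[src] > inferred.get(dst, float("-inf")):
                    inferred[dst] = inferred[src]
    return inferred


inferred = infer_lower_bounds(
    bounds={"r1.amount": 1_000_000},
    scaled_gt=[("r2.amount", 0.5, "r1.amount")],
    equalities=[("r3.amount", "r2.amount")],
)
for column, bound in sorted(inferred.items()):
    print(f"{column} > {bound:.0f}")
```

The derived bounds on r2.amount and r3.amount can then be applied as selection predicates before the joins, which is what keeps the intermediate results small.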
18. Conditional Materialization
[Figure: two diagrams over r1 and r2, labeled Unconditional Materialization and Conditional Materialization.]
- Conditional materialization: choose materialization or not based on cost estimates.
19. Preliminary Evaluation: Queries and Data
- 7 queries on a synthesized FedWire money transfer database of 320006 records.
- Two Data Conditions
  - Data1
    - Old: first 300000 records
    - New: remaining 20006 records
    - ALERT
  - Data2
    - Old: first 300000 records
    - New: next 20000 records
    - NOT alert
20. Preliminary Results
[Figure: bar chart of execution time (s) for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2. Annotation: Rete with Transitivity Inference.]
21. Transitivity Inference
[Figure: two bar charts of execution time (s), one for Q2 and one for Q4, each on Data1 and Data2. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]
22. Conditional Materialization
[Figure: bar chart of execution time (s) on Data1 and Data2. Series: Conditional, Rete, SQL.]
Q4 assumes Transitivity Inference not applicable
23. ARGUS Summary
- Adapted Rete Algorithm upon a traditional DBMS platform
- Exploit the very-high-selectivity query property for optimization
  - Transitivity Inference
  - Conditional Materialization
- Current and Future Work
  - Optimizing Join Order
  - Computation Sharing
24. Thank you!
- Questions and Comments?