ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams


1
ARGUS: Rete + DBMS = Efficient Persistent
Profile Matching on Large-Volume Data Streams
  • Chun Jin
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • cjin_at_cs.cmu.edu

2
Stream Processing Model
  • Stream processing is becoming increasingly
    demanding and prevalent.

3
Stream Databases
  • Stream Database Applications
  • Network Traffic Analysis and Router Configuration
  • Dynamic Internet Services
  • Sensor Data Analysis
  • Anomaly Detection
  • Stream Database Projects
  • STREAM, TelegraphCQ, Aurora
  • NiagaraCQ, OpenCQ, WebCQ
  • Gigascope, Tribeca
  • Tapestry, Alert, Tukwila, etc.
  • ARGUS

4
Stream Anomaly Monitoring Systems (SAMS)
  • SAMS monitors structured data streams for
    anomalies or potential hazards.
  • Query matches may be high-urgency alerts, so
    prompt detection is desirable.
  • A SAMS query is rarely satisfied
    (very-high-selectivity).

5
SAMS Dataflow
  • Diagram components: Data Streams (FedWire Money
    Transfers, Patient Records), Stream Anomaly
    Monitoring System, Queries, Storage, Alerts,
    Analyst

6
Challenges to SAMS
  • Persistent queries may number in thousands or
    tens of thousands.
  • Daily stream volumes may exceed millions of
    records.
  • Prompt detections are desirable.
  • Very-high-selectivity Query Property.

7
Proposed ARGUS Approach
  • Basic Framework
  • Incremental evaluation schemes (Adapted Rete
    algorithm)
  • Rete (Forgy 1982): incremental evaluation based
    on materialized intermediate results.
  • Upon a traditional DBMS platform
  • Exploiting Very-High-Selectivity Query Property
  • Transitivity Inference
  • Conditional Materialization
  • Optimizing Join Order
  • Computation Sharing
  • Related to Other Applications
  • Stream Databases
  • Modern DBMS Query Optimization

8
Query Example 4
  • Suppose for every big transaction of type code
    1000, the analyst wants to check if the money
    stayed in the bank or left within ten days. An
    additional sign of possible fraud is that
    transactions involve at least one intermediate
    bank. The query generates an alarm whenever the
    receiver of a large transaction (over 1,000,000)
    transfers at least half of the money further
    within ten days of this transaction using an
    intermediate bank.

9
SQL Query for Example 4
  • FROM transaction r1, transaction r2, transaction r3
  • WHERE r2.type_code = 1000 AND
  • r3.type_code = 1000 AND
  • r1.type_code = 1000 AND
  • r1.amount > 1000000 AND
  • r1.rbank_aba = r2.sbank_aba AND
  • r1.benef_account = r2.orig_account AND
  • r2.amount > 0.5 * r1.amount AND
  • r1.tran_date < r2.tran_date AND
  • r2.tran_date < r1.tran_date + 10 AND
  • r2.rbank_aba = r3.sbank_aba AND
  • r2.benef_account = r3.orig_account AND
  • r2.amount = r3.amount AND
  • r2.tran_date < r3.tran_date AND
  • r3.tran_date < r2.tran_date + 10
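To make the Example 4 query concrete, here is a minimal sketch that runs it on a toy table with Python's built-in sqlite3 module. Everything outside the FROM and WHERE clauses is an assumption made for illustration: the select list, the column types, the integer day numbers used for tran_date, and the four sample rows. It is not the ARGUS implementation, which runs on a traditional DBMS platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "transaction" is an SQL keyword in SQLite, so the table name is quoted.
conn.execute("""
    CREATE TABLE "transaction" (
        tran_id       INTEGER PRIMARY KEY,
        type_code     INTEGER,
        amount        INTEGER,
        sbank_aba     TEXT,
        orig_account  TEXT,
        rbank_aba     TEXT,
        benef_account TEXT,
        tran_date     INTEGER   -- assumed: a day number, so "+ 10" is ten days
    )
""")

# Hypothetical rows: 1 -> 2 -> 3 is a suspicious chain, 4 is a small transfer.
rows = [
    (1, 1000, 2000000, "A", "a1", "B", "b1", 1),
    (2, 1000, 1200000, "B", "b1", "C", "c1", 4),
    (3, 1000, 1200000, "C", "c1", "D", "d1", 6),
    (4, 1000,     500, "B", "b1", "E", "e1", 5),
]
conn.executemany(
    'INSERT INTO "transaction" VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows
)

QUERY = """
    SELECT r1.tran_id, r2.tran_id, r3.tran_id   -- assumed select list
    FROM "transaction" r1, "transaction" r2, "transaction" r3
    WHERE r2.type_code = 1000 AND
          r3.type_code = 1000 AND
          r1.type_code = 1000 AND
          r1.amount > 1000000 AND
          r1.rbank_aba = r2.sbank_aba AND
          r1.benef_account = r2.orig_account AND
          r2.amount > 0.5 * r1.amount AND
          r1.tran_date < r2.tran_date AND
          r2.tran_date < r1.tran_date + 10 AND
          r2.rbank_aba = r3.sbank_aba AND
          r2.benef_account = r3.orig_account AND
          r2.amount = r3.amount AND
          r2.tran_date < r3.tran_date AND
          r3.tran_date < r2.tran_date + 10
"""
alerts = conn.execute(QUERY).fetchall()
print(alerts)
```

On this toy data the only match is the chain of transactions 1, 2, and 3; transaction 4 fails the "at least half of the money" condition.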

10
ARGUS System Architecture
  • Diagram components: Data Streams, Data Tables,
    Intermediate Tables, Query Table, Do_queries,
    Scheduler, Rete Networks, Rete Network Generator,
    Query, Analyst, Identified Threats
  • Diagram label: Stream Anomaly Monitoring

11
ReteGenerator Architecture
  • Diagram components: SQL Queries, ReteGenerator,
    Optimizer (Join Order, Conditional
    Materialization, Transitivity Inference),
    Sharing Module
  • Common Computation Identification
  • Predicate Indexing
  • Extended Predicate Set Operations
  • Choose what and how to share
  • Recording and Manipulating Network Topology
  • Estimating Sharing Costs

12
Adapted Rete Algorithm (Selection)
  • $n$ and $m$ are the old data sets.
  • $\Delta n$ and $\Delta m$ are the new, much smaller
    incremental data sets.
  • Selection $\sigma$:
  • $\sigma(n + \Delta n) = \sigma(n) + \sigma(\Delta n)$
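The selection identity can be illustrated with a small sketch: the result on the old data is materialized once, and each new batch only requires evaluating the predicate on the batch itself. The class and function names below are hypothetical and the rows are plain Python values; this is an illustration of the idea, not the ARGUS code.

```python
def select(rows, predicate):
    """Plain selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


class IncrementalSelection:
    """Keeps the selection result over old data materialized."""

    def __init__(self, predicate, old_rows):
        self.predicate = predicate
        # Selection over the old data, computed once and stored.
        self.materialized = select(old_rows, predicate)

    def on_new_rows(self, delta_rows):
        # Only the small increment is scanned.
        new_matches = select(delta_rows, self.predicate)
        self.materialized.extend(new_matches)
        return new_matches


# Toy amounts; the predicate mirrors "amount > 1000000" from Example 4.
is_large = lambda amount: amount > 1000000
old_amounts = [500, 2000000, 10]
delta_amounts = [3000000, 7]

node = IncrementalSelection(is_large, old_amounts)
new_matches = node.on_new_rows(delta_amounts)

# The incrementally maintained result equals selection over all the data.
assert node.materialized == select(old_amounts + delta_amounts, is_large)
```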
13
Adapted Rete Algorithm (Join)
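  • Join $\bowtie$:
  • $(n + \Delta n) \bowtie (m + \Delta m) = n \bowtie m + \Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$
  • When $\Delta n$ and $\Delta m$ are very small compared to
    $n$ and $m$, the time complexity of the incremental
    join is $O(n + m)$.
  • Old results: $n \bowtie m$
  • New incremental results: $\Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$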
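The join decomposition can be checked on toy data with a short sketch. It uses nested-loop joins over Python lists with bag semantics; the function names and sample tuples are assumptions for illustration only. Each of the three incremental terms touches at least one small delta set, which is why the incremental work stays small when the deltas are small.

```python
def join(left, right, cond):
    """Nested-loop join: all pairs (l, r) that satisfy cond."""
    return [(l, r) for l in left for r in right if cond(l, r)]


def incremental_join(n, m, delta_n, delta_m, cond):
    """The three terms that involve new data; the old join is not recomputed."""
    return (join(delta_n, m, cond)
            + join(n, delta_m, cond)
            + join(delta_n, delta_m, cond))


# Toy data: (account, amount) pairs joined on equal account.
same_account = lambda l, r: l[0] == r[0]
n, m = [("a", 1), ("b", 2)], [("a", 10), ("c", 30)]
delta_n, delta_m = [("c", 3)], [("b", 20)]

old_results = join(n, m, same_account)             # materialized earlier
new_results = incremental_join(n, m, delta_n, delta_m, same_account)
full_results = join(n + delta_n, m + delta_m, same_account)

# Old results plus incremental results equal the join over all the data.
assert sorted(old_results + new_results) == sorted(full_results)
```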
14
Incremental Evaluation in Rete Example 4
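  • Diagram: Rete network for Example 4 over
    DataTable (aliases r1, r2, r3)
  • Selection node for r1: type_code = 1000,
    amount > 1000000
  • Selection node for r2: type_code = 1000
  • Selection node for r3: type_code = 1000
  • Join node for r1 and r2:
    r1.rbank_aba = r2.sbank_aba,
    r1.benef_account = r2.orig_account,
    r2.amount > r1.amount * 0.5,
    r1.tran_date < r2.tran_date,
    r2.tran_date < r1.tran_date + 10
  • Join node for r2 and r3:
    r2.rbank_aba = r3.sbank_aba,
    r2.benef_account = r3.orig_account,
    r2.amount = r3.amount,
    r2.tran_date < r3.tran_date,
    r3.tran_date < r2.tran_date + 10

15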
Complex Queries
  • A persistent query may contain multiple SQL
    statements, and a single SQL statement may
    contain unions of multiple SQL terms.
  • Each SQL term is mapped to a sub-Rete network.
  • These sub-Rete networks are then connected to
    form statement-level sub-networks.
  • The statement-level sub-networks are further
    connected, based on view references, to form
    the final query-level Rete network.

16
Transitivity Inference
  • Explores the transitivity properties of
    comparison operators to derive hidden, highly
    selective selection predicates.
  • Highly selective selection predicates can
    significantly improve performance because they
    may produce very small intermediate results.
    Subsequent joins can then be performed very
    quickly on the materialized intermediate results.
  • Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97

17
Transitivity Inference Example
  • Given:
  • r1.amount > 1000000 and
  • r2.amount > r1.amount * 0.5 and
  • r3.amount = r2.amount
  • r1.amount > 1000000 is highly selective on r1.
  • We can infer the highly selective predicates:
  • r2.amount > 500000
  • r3.amount > 500000
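The inference in this example can be sketched as a small propagation of lower bounds. The representation below is an assumption made for illustration (a dictionary of strict lower bounds plus relations of the form `x op k * y` with positive `k`); it is not how the ARGUS optimizer is described in the slides, and it handles only the two operators that appear in the example.

```python
def infer_lower_bounds(bounds, relations):
    """Propagate strict lower bounds through comparison predicates.

    bounds:    {column: c}, meaning column > c
    relations: list of (x, op, k, y), meaning `x op k * y`,
               with op in {">", "="} and k > 0 (assumed)
    """
    inferred = dict(bounds)
    # A bounded number of passes guarantees termination; it is enough
    # for acyclic chains of predicates such as the one in the example.
    for _ in range(len(relations) + 1):
        for x, op, k, y in relations:
            # x > k*y (or x = k*y) together with y > c gives x > k*c.
            if y in inferred:
                candidate = k * inferred[y]
                if candidate > inferred.get(x, float("-inf")):
                    inferred[x] = candidate
            # Equality also works in the other direction: y > c / k.
            if op == "=" and x in inferred:
                candidate = inferred[x] / k
                if candidate > inferred.get(y, float("-inf")):
                    inferred[y] = candidate
    return inferred


given_bounds = {"r1.amount": 1000000}
given_relations = [
    ("r2.amount", ">", 0.5, "r1.amount"),   # r2.amount > r1.amount * 0.5
    ("r3.amount", "=", 1, "r2.amount"),     # r3.amount = r2.amount
]
derived = infer_lower_bounds(given_bounds, given_relations)
print(derived)
```

The derived bounds for r2.amount and r3.amount are both 500000, matching the inferred predicates on the slide.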

18
Conditional Materialization
  • Unconditional Materialization (diagram over r1
    and r2)
  • Conditional Materialization (diagram over r1
    and r2): choose whether or not to materialize
    based on cost estimates
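As a rough illustration of a cost-based materialization choice, the sketch below compares two estimated costs for one selection node. The cost model is entirely hypothetical (the slides do not give one): materializing pays a read and a write per row that passes the selection, while not materializing pays a read of the whole base table when the join runs. The function names, parameters, and unit costs are all assumptions.

```python
def estimate_costs(table_rows, pass_fraction, read_cost=1.0, write_cost=2.0):
    """Return (materialized_cost, on_the_fly_cost) under a toy cost model.

    table_rows:    number of rows in the base table
    pass_fraction: estimated fraction of rows that pass the selection
    """
    kept_rows = table_rows * pass_fraction
    materialized_cost = kept_rows * (read_cost + write_cost)
    on_the_fly_cost = table_rows * read_cost
    return materialized_cost, on_the_fly_cost


def should_materialize(table_rows, pass_fraction):
    """Materialize only when the estimate says it is cheaper."""
    materialized_cost, on_the_fly_cost = estimate_costs(table_rows, pass_fraction)
    return materialized_cost < on_the_fly_cost


# A highly selective predicate keeps few rows, so materializing is cheap.
selective = should_materialize(300000, 0.001)
# A predicate that keeps most rows would nearly duplicate the table.
unselective = should_materialize(300000, 0.9)
print(selective, unselective)
```

Under these assumed unit costs, the highly selective node is materialized and the unselective one is not.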
19
Preliminary Evaluation: Queries and Data
  • 7 queries on a synthesized FedWire money transfer
    database of 320,006 records.
  • Two data conditions:
  • Data1: old data is the first 300,000 records;
    new data is the remaining 20,006 records.
    The new data triggers alerts.
  • Data2: old data is the first 300,000 records;
    new data is the next 20,000 records.
    The new data triggers no alerts.
20
Preliminary Results
  • Bar chart: execution time (s), axis from 0 to 50,
    for queries Q1 to Q7
  • Series: Rete Data1, SQL Data1, Rete Data2,
    SQL Data2
  • Annotation: Rete with Transitivity Inference

21
Transitivity Inference
  • Two bar charts, for Q4 and Q2: execution time (s)
    on Data1 and Data2
  • Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI

22
Conditional Materialization
  • Bar chart: execution time (s), axis from 0 to 50,
    on Data1 and Data2
  • Legend labels: Conditional, Rete, SQL
  • Q4 assumes Transitivity Inference not applicable

23
ARGUS Summary
  • Adapted Rete Algorithm upon a traditional DBMS
    platform
  • Exploit the very-high-selectivity query property
    for optimization
  • Transitivity Inference
  • Conditional Materialization
  • Current and Future Work
  • Optimizing Join Order
  • Computation Sharing

24
  • Thank you!
  • Questions and Comments?