Continuous Data Stream Processing - PowerPoint PPT Presentation

Slides: 24
Provided by: yihu4
Transcript and Presenter's Notes

1
Continuous Data Stream Processing
  • MAKE Lab

Post-Excellence Project Subproject 6
Date: 2006/03/07
2
Music Virtual Channel
[Architecture diagram. Components: clustering engine, interface, profile monitor, channel monitor, favorite channel, filtering engine, music metadata, music collections. V.C. players 1, 2, ..., N connect over the Internet.]
3
Research Directions
Sequence Query Matching
Temporal Query Processing
Episode Query Matching
Range Search
Filtering
Spatial Query Processing
KNN Search
Aggregate Query Processing
Streaming Data Management
Top-K Search
Frequent Tree Pattern Mining
Closed Tree Pattern Mining
Mining
Frequent Itemset Mining (sliding window)
Frequent Itemset Mining (landmark model)
4
Hash-based synopsis with memory consideration for
mining frequent itemsets over data streams
5
Landmark model
6
Lossy Counting
Step 1: Divide the stream into buckets
Bucket size = 1/ε, where ε = 10% of the support s
7
Lossy Counting in Action
[Figure: the frequency counts start out empty.]
8
Lossy Counting continued ...
Output: elements with counter values exceeding
sN - εN
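
A minimal Python sketch of the scheme summarized on the Lossy Counting slides, for single items rather than itemsets. The slides give only the bucket size and the output threshold; the (count, delta) bookkeeping and the pruning rule at bucket boundaries follow the standard description of Lossy Counting and are assumptions here, as are all function and variable names.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_counting(stream: Iterable[Hashable], support: float,
                   epsilon: float) -> List[Hashable]:
    """Return items whose stored count reaches (support - epsilon) * N."""
    width = math.ceil(1 / epsilon)  # bucket size 1/epsilon
    # item -> (count since insertion, maximum possible undercount)
    counts: Dict[Hashable, Tuple[int, int]] = {}
    n = 0
    for item in stream:
        n += 1
        bucket_id = math.ceil(n / width)
        count, delta = counts.get(item, (0, bucket_id - 1))
        counts[item] = (count + 1, delta)
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            counts = {k: (c, d) for k, (c, d) in counts.items()
                      if c + d > bucket_id}
    threshold = (support - epsilon) * n
    return [k for k, (c, _) in counts.items() if c >= threshold]


# Illustrative stream: 'a' is 50%, 'b' 30%, 'c' and 'd' 10% each.
stream = list("aaaaabbbcd") * 10
frequent = sorted(lossy_counting(stream, support=0.2, epsilon=0.02))
```

With these illustrative parameters, only `a` and `b` reach the threshold, while `c` and `d` stay below it.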
9
Drawbacks of Lossy Counting
Applied to mining frequent itemsets, the space may
increase exponentially
[Figure labels: 1, s, ε, 0, Lossy-Counting]
10
hCount
[Figure: item 9 arrives from the stream and is hashed by h1(9) mod m, h2(9) mod m, h3(9) mod m and h4(9) mod m into one counter in each row of a table with h rows and m counters per row.]
For each item, hash the item into buckets, choose
the minimum count, and return the item if its
minimum count ≥ sN
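
The hCount slide can be illustrated with a small Python sketch: a table of h rows by m counters, one hash function per row, and the minimum counter as the estimate. The particular hash family ((a·x + b) mod p) mod m, the class name and the method names are assumptions for illustration; the slide only shows h1 to h4 applied "mod m".

```python
import random
from typing import Hashable, Iterable, List


class HCountSketch:
    """Hash-based frequency sketch: h rows of m counters each."""

    def __init__(self, num_hashes: int, m: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.m = m
        self.p = 2_147_483_647  # a prime larger than m (assumed hash family)
        self.params = [(rng.randrange(1, self.p), rng.randrange(0, self.p))
                       for _ in range(num_hashes)]
        self.table = [[0] * m for _ in range(num_hashes)]
        self.n = 0  # number of items seen so far (N on the slides)

    def _buckets(self, item: Hashable) -> List[int]:
        x = hash(item) % self.p
        return [((a * x + b) % self.p) % self.m for a, b in self.params]

    def add(self, item: Hashable) -> None:
        """Increment one counter per row for this item."""
        self.n += 1
        for row, col in enumerate(self._buckets(item)):
            self.table[row][col] += 1

    def estimate(self, item: Hashable) -> int:
        """Minimum counter over all rows; never below the true count."""
        return min(self.table[row][col]
                   for row, col in enumerate(self._buckets(item)))

    def frequent(self, candidates: Iterable[Hashable], s: float) -> List[Hashable]:
        """Candidates whose minimum count is at least s * N."""
        return [x for x in candidates if self.estimate(x) >= s * self.n]


sketch = HCountSketch(num_hashes=4, m=64)
for item in [9] * 40 + list(range(100, 160)):
    sketch.add(item)
```

Because collisions can only add to a counter, the minimum over the rows overestimates but never underestimates the true count, so a truly frequent item such as 9 above is always returned.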
11
hash-based
[Figure: the transaction {1, 2, 3} and its subsets {1, 2}, {1, 3}, {2, 3} and {1, 2, 3} are hashed into buckets numbered 1 to N. The fields shown are Itemset, Surplus_Estimate, True_Count, Total_Access and Nlast_access.]
How to compute the Surplus_Estimate?
12
Compute the Surplus_Estimate for an Itemset
  • Two variables
  • n: the number of different itemsets in the bucket
    but not in the list
  • c: the sensible counts to be divided among the
    itemsets which are not in the list
  • If c is in [3, 5] and n = 3, then
    Surplus_Estimate = 3, from the split (3, 1, 1)
  • Decrement Surplus_Estimate until
    Surplus_Estimate / Nlast_access < minSup
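
A rough Python sketch of the two steps listed above. The closed form c - (n - 1) is an assumption inferred from the slides' examples, (3, 1, 1) giving 3 and (1, 1) giving 1, where every unlisted itemset holds at least one count; it is not stated on the slides. The function names are hypothetical.

```python
def worst_case_surplus(c: int, n: int) -> int:
    """Largest share one unlisted itemset can hold when c counts are divided
    among n unlisted itemsets that each hold at least one count.

    Assumed rule, inferred from the slide examples:
    c = 5, n = 3 -> split (3, 1, 1) -> 3;  c = 2, n = 2 -> split (1, 1) -> 1.
    """
    if n <= 0 or c < n:
        raise ValueError("need n >= 1 and c >= n")
    return c - (n - 1)


def cap_surplus(surplus: int, n_last_access: int, min_sup: float) -> int:
    """Decrement the surplus until surplus / n_last_access < min_sup."""
    if n_last_access <= 0:
        raise ValueError("n_last_access must be positive")
    while surplus > 0 and surplus / n_last_access >= min_sup:
        surplus -= 1
    return surplus


estimate = cap_surplus(worst_case_surplus(5, 3), n_last_access=4, min_sup=0.4)
```

The cap reflects the idea that an itemset absent from the list was not frequent at its last access, so its hidden count cannot have reached minSup at that point.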

13
Determine c and n
Example: 2, 3, 5; N = 4, minSup = 0.4; 2 is hashed into
the bucket
Boundary of c 4-(2SE) c 4-2 Boundary of n
c = 2, n = 2 → (1, 1)
→ Surplus_Estimate = 1
[Figure: bucket table with columns Itemset, Total_Access, Surplus_Estimate, Nlast_access and True_Count.]
14
Monitoring Constrained k-Nearest Neighbor over
Moving Objects with Different Values
15
Motivation (Cont.)
  • Example
  • Consider a user who wants to find the k places
    to buy new shoes where the costs are the lowest.
  • Cost = Price() + Traffic Cost()

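2-NN Query
[Figure: four places with prices 400, 200, 100 and 90 at distances 1, 2, 3 and 5; traffic cost is 100 per unit of distance.]
400 + 100 × 1 = 500
200 + 100 × 2 = 400
100 + 100 × 3 = 400
90 + 100 × 5 = 590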
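
The 2-NN example can be checked with a few lines of Python. This is a brute-force sketch, not the monitoring method of the talk: it scores every place with Cost = Price + traffic cost and keeps the k lowest. The place names and the fixed rate of 100 per unit of distance are assumptions read off the example figures.

```python
import heapq
from typing import List, NamedTuple


class Place(NamedTuple):
    name: str        # hypothetical label, not from the slides
    price: float
    distance: float  # distance from the query point


def cost(place: Place, unit_traffic_cost: float = 100.0) -> float:
    """Cost = Price + Traffic Cost, with traffic cost linear in distance."""
    return place.price + unit_traffic_cost * place.distance


def k_lowest_cost(places: List[Place], k: int,
                  unit_traffic_cost: float = 100.0) -> List[Place]:
    """Brute-force k-NN under the combined cost."""
    return heapq.nsmallest(k, places,
                           key=lambda p: cost(p, unit_traffic_cost))


places = [Place("p1", 400, 1), Place("p2", 200, 2),
          Place("p3", 100, 3), Place("p4", 90, 5)]
best_two = k_lowest_cost(places, k=2)
```

Here the nearest place (distance 1) and the cheapest place (price 90) both lose to the two places in between, which is why plain distance-based k-NN does not answer this kind of query.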
16
Motivation
  • Objects with different values in a spatial
    database.
  • Find the k places to buy something where the
    costs are the lowest.
  • Cost = Price() + Traffic Cost()
  • A taxi driver wants to find the k places to gain
    the most profits.
  • Profit = Gain() - Traffic Cost()
  • A taxi driver wants to find the k places to gain
    the most profits.
  • Profit = Gain() / Time
  • Virtual Channel
  • age profile distance
  • listen hours / profile distance
  • Market Survey
  • consumption (or income, age) / profile
    distance

17
Challenges
  • Efficiency
  • Search space reduction
  • Query processing enhancement
  • Effectiveness
  • Previous result reuse

18
Framework
  • Initialization
  • Step 1
  • Find k candidates to restrict the search
    region.
  • Step 2
  • Run Pruning Ring on the remaining candidates to
    determine the actual answer.

  • Handling updates
  • Incrementally update positions or values for
    objects and queries
  • Computation is necessary only for affected
    queries
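
A hedged Python sketch of the two initialization steps, using the shopping cost model (price plus traffic cost). Step 1 picks k candidates and derives a search radius from their worst cost. Step 2 here is a plain filter-and-rank stand-in, not the Pruning Ring procedure, whose details the slides do not give. All names, the tuple layout and the rate of 100 per unit of distance are assumptions.

```python
import heapq
from typing import List, Tuple

Obj = Tuple[float, float]  # (price, distance from the query point)


def cost(obj: Obj, unit: float = 100.0) -> float:
    """Cost = price + unit * distance (assumed cost model)."""
    return obj[0] + unit * obj[1]


def constrained_knn(objects: List[Obj], k: int, unit: float = 100.0) -> List[Obj]:
    if k <= 0 or not objects:
        return []
    # Step 1: the k nearest objects serve as candidates; their worst cost
    # bounds the cost of the true k-th answer.
    candidates = heapq.nsmallest(k, objects, key=lambda o: o[1])
    bound = max(cost(o, unit) for o in candidates)
    # With non-negative prices, cost >= unit * distance, so no object
    # farther away than bound / unit can be part of the answer.
    radius = bound / unit
    # Step 2 (stand-in for Pruning Ring): rank what is left in the region.
    remaining = [o for o in objects if o[1] <= radius]
    return heapq.nsmallest(k, remaining, key=lambda o: cost(o, unit))


objects = [(400, 1), (200, 2), (100, 3), (90, 5), (10, 9)]
answer = constrained_knn(objects, k=2)
```

In this run the candidates give a bound of 500 and a radius of 5, so the object at distance 9 is discarded without its cost being ranked.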

19
Querying Episodes over Event Stream
20
Motivation
  • Knowledge Discovery from Telecommunication
    Network Alarm Databases (ICDE '96)
  • If an alarm of type A occurs, then an alarm of
    type B occurs within 30 seconds with probability
    0.8
  • If alarms of types A and B occur within 5
    seconds, then an alarm of type C occurs within 60
    seconds with probability 0.7
  • If an alarm of type A precedes an alarm of type
    B, and C precedes D, all within 15 seconds, then
    E will follow within 4 minutes with probability
    0.6
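
To make the first rule concrete, the following Python sketch estimates how often an alarm of one type is followed by an alarm of another type within a time window. It is an offline illustration over a finished list of events, not the streaming query engine of the talk; the function name, the (timestamp, type) layout and the sample data are assumptions.

```python
import bisect
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, alarm type)


def rule_confidence(events: List[Event], antecedent: str,
                    consequent: str, window: float) -> float:
    """Fraction of antecedent alarms that are followed by a consequent
    alarm strictly later and at most `window` seconds afterwards."""
    consequent_times = sorted(t for t, kind in events if kind == consequent)
    hits = total = 0
    for t, kind in events:
        if kind != antecedent:
            continue
        total += 1
        # Index of the first consequent alarm strictly after time t.
        i = bisect.bisect_right(consequent_times, t)
        if i < len(consequent_times) and consequent_times[i] - t <= window:
            hits += 1
    return hits / total if total else 0.0


# Made-up alarm log for illustration.
events = [(0, "A"), (10, "B"), (40, "A"), (100, "A"), (120, "B"),
          (200, "A"), (215, "B"), (300, "A")]
confidence = rule_confidence(events, "A", "B", window=30)
```

In this made-up log, three of the five A alarms are followed by a B alarm within 30 seconds.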

[Figure: example event stream with alarms B, A, A, A, B, C, D.]
21
Challenges
  • Efficiency
  • Index impaction
  • Partial result sharing
  • Load shedding

22
Framework
Q1
Q2
Q3
  • Q3 is composed of p5 and p4

23
A E. I. 6
A EQueue
A TLink
B E. I. 5
B EQueue
B TLink
C E. I. 2
C EQueue
C TLink
D E. I. 4
D EQueue
D TLink