Approximate Frequency Counts over Data Streams - PowerPoint PPT Presentation

1
Approximate Frequency Counts over Data Streams
  • Loo Kin Kong
  • 4 October 2002

2
Plan
  • Motivation
  • Paper review: Approximate Frequency Counts over
    Data Streams
  • Finding frequent items
  • Finding frequent itemsets
  • Performance
  • Conclusion

3
Motivation
  • In some new applications, data come as a
    continuous stream
  • The sheer volume of a stream over its lifetime is
    huge
  • Queries require timely answers
  • Examples
  • Stock ticks
  • Network traffic measurements

4
Frequent itemset mining on offline databases vs
data streams
  • Often, level-wise algorithms are used to mine
    offline databases
  • E.g., the Apriori algorithm and its variants
  • At least 2 database scans are needed
  • Level-wise algorithms cannot be applied to data
    streams
  • A data stream cannot be scanned multiple times

5
Paper review: Approximate Frequency Counts over
Data Streams
  • By G. S. Manku and R. Motwani
  • Published in VLDB '02
  • Main contributions of the paper
  • Proposed 2 algorithms to find frequent items
    appearing in a data stream of items
  • Extended the algorithms to find frequent itemsets

6
Notations
  • Some notations
  • Let N denote the current length of the stream
  • Let s ∈ (0,1) denote the support threshold
  • Let ε ∈ (0,1) denote the error tolerance

7
Goals of the paper
  • The algorithms ensure that
  • All itemsets whose true frequency exceeds sN are
    reported
  • No itemset whose true frequency is less than
    (s-ε)N is output
  • Estimated frequencies are less than the true
    frequencies by at most εN
  • E.g., with s = 0.1% and ε = 0.01%, every itemset
    occurring in more than 0.1% of the stream is
    reported, none occurring in less than 0.09% is,
    and each reported count undercounts the truth by
    at most 0.01% of the stream length

8
The simple case: finding frequent items
  • Each transaction in the stream contains only 1
    item
  • 2 algorithms were proposed, namely
  • Sticky Sampling Algorithm
  • Lossy Counting Algorithm
  • Features of the algorithms
  • Sampling techniques are used
  • Frequency counts found are approximate but error
    is guaranteed not to exceed a user-specified
    tolerance level
  • For Lossy Counting, all frequent items are
    reported

9
Sticky Sampling Algorithm
  • User input includes 3 values, namely
  • Support threshold s
  • Error tolerance ε
  • Probability of failure δ
  • Counts are kept in a data structure S
  • Each entry in S is of the form (e,f), where
  • e is the item
  • f is the frequency of e in the stream since the
    entry was inserted into S
  • When queried about the frequent items, all entries
    (e,f) such that f ≥ (s - ε)N are returned

10
Sticky Sampling Algorithm (contd)
S: the set of all counts; e: transaction (item);
N: curr. len. of stream; r: sampling rate;
t = (1/ε) log(1/(sδ))
  1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
  2. e ← next transaction; N ← N + 1
  3. if (e,f) exists in S do
  4. increment the count f
  5. else if random(0,1) ≤ 1/r do
  6. insert (e,1) into S
  7. endif
  8. if N = 2t · 2^n for some integer n ≥ 0 do
  9. r ← 2r
  10. halfSampRate(S)
  11. endif
  12. Goto 2

11
Sticky Sampling Algorithm halfSampRate()
  1. function halfSampRate(S)
  2. for every entry (e,f) in S do
  3. while random(0,1) < 0.5 and f > 0 do
  4. f ← f - 1
  5. if f = 0 do
  6. remove the entry from S
  7. endif
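
A minimal runnable Python sketch of the two routines above (the names sticky_sampling and half_samp_rate, and the use of a plain dict for S, are illustrative choices, not from the paper):

  import math
  import random

  def half_samp_rate(S):
      # For each entry, toss an unbiased coin until it lands heads,
      # decrementing f once per tails; drop entries that reach 0.
      for e in list(S):
          while random.random() < 0.5 and S[e] > 0:
              S[e] -= 1
          if S[e] == 0:
              del S[e]

  def sticky_sampling(stream, s, eps, delta):
      S = {}                               # item -> estimated count f
      t = (1.0 / eps) * math.log(1.0 / (s * delta))
      r = 1                                # current sampling rate
      N = 0                                # stream length so far
      for e in stream:
          N += 1
          if e in S:
              S[e] += 1                    # existing entries always count
          elif random.random() <= 1.0 / r:
              S[e] = 1                     # sample new items with prob. 1/r
          if N >= 2 * t * r:               # boundary N = 2t·2^n reached:
              r *= 2                       # double the sampling rate and
              half_samp_rate(S)            # rescale the kept counts
      # answer a query: report items with f ≥ (s - ε)N
      return {e: f for e, f in S.items() if f >= (s - eps) * N}

The paper shows that the expected number of entries in S is at most (2/ε) log(1/(sδ)), independent of N.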

12
Lossy Counting Algorithm
  • The incoming data stream is conceptually divided
    into buckets of ⌈1/ε⌉ transactions
  • Counts are kept in a data structure D
  • Each entry in D is of the form (e, f, Δ), where
  • e is the item
  • f is the frequency of e in the stream since the
    entry was inserted into D
  • Δ is the maximum possible count of e in the stream
    before e was added to D

13
Lossy Counting Algorithm (contd)
D: the set of all counts; N: curr. len. of stream;
e: transaction (item); w: bucket width; b: current
bucket id
  1. D ← ∅; N ← 0
  2. w ← ⌈1/ε⌉; b ← 1
  3. e ← next transaction; N ← N + 1
  4. if (e,f,Δ) exists in D do
  5. f ← f + 1
  6. else do
  7. insert (e,1,b-1) into D
  8. endif
  9. if N mod w = 0 do
  10. prune(D, b); b ← b + 1
  11. endif
  12. Goto 3

14
Lossy Counting Algorithm prune()
  1. function prune(D, b)
  2. for each entry (e,f,Δ) in D do
  3. if f + Δ ≤ b do
  4. remove the entry from D
  5. endif
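
The same pair of routines as a minimal Python sketch (the names lossy_counting and prune, and storing (f, Δ) tuples in a dict, are illustrative choices, not from the paper):

  import math

  def prune(D, b):
      # Delete entries whose count plus maximum possible undercount
      # no longer exceeds the current bucket id (steps 2-5 above).
      for e in list(D):
          f, delta = D[e]
          if f + delta <= b:
              del D[e]

  def lossy_counting(stream, s, eps):
      D = {}                       # item -> (f, delta)
      w = math.ceil(1.0 / eps)     # bucket width
      b = 1                        # current bucket id
      N = 0                        # stream length so far
      for e in stream:
          N += 1
          if e in D:
              f, delta = D[e]
              D[e] = (f + 1, delta)
          else:
              D[e] = (1, b - 1)    # b-1 bounds the count of e before
                                   # this entry was created
          if N % w == 0:           # bucket boundary: prune, next bucket
              prune(D, b)
              b += 1
      # answer a query: report items with f ≥ (s - ε)N
      return {e: f for e, (f, delta) in D.items() if f >= (s - eps) * N}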

15
Lossy Counting
  • Lossy Counting guarantees that
  • When a deletion occurs, b ≤ εN
  • If an entry (e, f, Δ) is deleted, f_e ≤ b, where
    f_e is the true frequency count of e
  • Hence, if an entry (e, f, Δ) is deleted, f_e ≤ εN
  • Finally, f ≤ f_e ≤ f + εN
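
Restating the chain above compactly in LaTeX (deletions happen only at bucket boundaries, where N = b·w; f_e is the true frequency of e):

  \[ w = \lceil 1/\epsilon \rceil, \qquad b = N/w \;\le\; \epsilon N \]
  \[ \text{if } (e,f,\Delta) \text{ is deleted:} \quad f_e \;\le\; f + \Delta \;\le\; b \;\le\; \epsilon N \]
  \[ \text{if } (e,f,\Delta) \text{ survives:} \quad f \;\le\; f_e \;\le\; f + \Delta \;\le\; f + \epsilon N \]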

16
Sticky Sampling vs Lossy Counting
  • Sticky Sampling is non-deterministic, while Lossy
    Counting is deterministic
  • Experimental results show that Lossy Counting
    requires fewer entries than Sticky Sampling

17
The more complex case: finding frequent itemsets
  • The Lossy Counting algorithm is extended to find
    frequent itemsets
  • Transactions in the data stream contain any
    number of items

18
Overview of the algorithm
  • The incoming data stream is conceptually divided
    into buckets of ⌈1/ε⌉ transactions
  • Counts are kept in a data structure D
  • Multiple buckets (β of them, say) are processed
    in a batch
  • Each entry in D is of the form (set, f, Δ),
    where
  • set is the itemset
  • f is the frequency of set in the stream since the
    entry was inserted into D
  • Δ is the maximum possible count of set in the
    stream before set was added to D

19
Overview of the algorithm (contd)
  • D is updated by the operations UPDATE_SET and
    NEW_SET (a sketch of both follows this list)
  • UPDATE_SET updates and deletes entries in D
  • For each entry (set, f, Δ), count the occurrences
    of set in the batch and update the entry
  • If an updated entry satisfies f + Δ ≤ b_current,
    the entry is removed from D
  • NEW_SET inserts new entries into D
  • If a set set has frequency f ≥ β in the batch and
    set does not occur in D, create a new entry (set,
    f, b_current - β)
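
A minimal Python sketch of one batch step, assuming the per-batch counts are already available (in the actual implementation they are produced incrementally by SetGen, described on the next slides); the names update_set and new_set and the dict layout are illustrative:

  def update_set(D, batch_counts, b_current):
      # UPDATE_SET: add each existing entry's count in the batch,
      # then delete entries with f + Δ ≤ b_current.
      for s in list(D):
          f, delta = D[s]
          f += batch_counts.get(s, 0)
          if f + delta <= b_current:
              del D[s]
          else:
              D[s] = (f, delta)

  def new_set(D, batch_counts, b_current, beta):
      # NEW_SET: insert itemsets occurring at least β times in the
      # batch that have no entry yet, with Δ = b_current - β.
      for s, f in batch_counts.items():
          if s not in D and f >= beta:
              D[s] = (f, b_current - beta)

Here batch_counts maps each itemset (e.g., a frozenset of item-ids) to its count in the current batch of β buckets.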

20
Implementation
  • Challenges
  • Not to enumerate all subsets of a transaction
  • Data structure must be compact for better space
    efficiency
  • 3 major modules
  • Buffer
  • Trie
  • SetGen

21
Implementation (contd)
  • Buffer repeatedly reads in a batch of buckets of
    transactions, where each transaction is a set of
    item-ids, into available main memory
  • Trie maintains the data structure D
  • SetGen generates subsets of item-ids along with
    their frequency counts in the current batch
  • Not all possible subsets need to be generated
  • If a subset S is not inserted into D after
    application of both UPDATE_SET and NEW_SET, then
    no supersets of S need be considered (see the
    pruning sketch below)
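
A small Python sketch of this pruning rule: a k-item subset is counted only if all of its (k-1)-item subsets survived the previous level, which transitively ensures that no superset of a discarded set is ever considered. The keep predicate stands in for the combined UPDATE_SET/NEW_SET retention test, and all names are illustrative:

  from collections import Counter
  from itertools import combinations

  def setgen(batch, keep):
      # batch: list of transactions, each a set of item-ids
      # keep(s, f): whether subset s with batch count f stays in D
      result = {}
      alive = None                 # surviving (k-1)-subsets; None at k = 1
      k = 1
      while True:
          level = Counter()
          for t in batch:
              for combo in combinations(sorted(t), k):
                  # prune: skip unless every (k-1)-subset survived
                  if alive is not None and any(
                          frozenset(c) not in alive
                          for c in combinations(combo, k - 1)):
                      continue
                  level[frozenset(combo)] += 1
          alive = {s for s, f in level.items() if keep(s, f)}
          result.update((s, f) for s, f in level.items() if s in alive)
          if not alive:            # nothing survived; stop, since all
              return result        # supersets are pruned as well
          k += 1

The real SetGen streams subsets out of Buffer in lexicographic order rather than materializing per-level counters, so this sketch captures only the pruning logic, not the space behavior of the Trie.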

22
Performance
  • IBM synthetic dataset T10.I4.D1000K (10K items)

23
Performance (contd)
  • Compared with Apriori
  • IBM synthetic dataset T10.I4.D1000K (10K items)

24
Conclusion
  • Sticky Sampling and Lossy Counting are 2
    approximate algorithms that can find frequent
    items
  • Both algorithms produce frequency counts within
    a user-specified error tolerance level, though
    Sticky Sampling is non-deterministic
  • Lossy Counting can be extended to find frequent
    itemsets

25
Reference
  • G. S. Manku and R. Motwani. Approximate Frequency
    Counts over Data Streams. In Proceedings of the
    28th International Conference on Very Large Data
    Bases (VLDB '02), Hong Kong, China, 2002.

26
Q & A