Mining%20Decision%20Trees%20from%20Data%20Streams - PowerPoint PPT Presentation

View by Category
About This Presentation
Title:

Mining%20Decision%20Trees%20from%20Data%20Streams

Description:

Mining Decision Trees from Data Streams Thanks: Tong Suk Man Ivy HKU – PowerPoint PPT presentation

Number of Views:79
Avg rating:3.0/5.0
Slides: 57
Provided by: Unkno200
Learn more at: http://www.cs.ust.hk
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Mining%20Decision%20Trees%20from%20Data%20Streams


1
Mining Decision Trees from Data Streams
  • Thanks Tong Suk Man Ivy
  • HKU

2
Contents
  • Introduction problems in mining data streams
  • Classification of stream data
  • VFDT algorithm
  • Window approach
  • CVFDT algorithm
  • Experimental results
  • Conclusions
  • Future work

3
Data Streams
  • Characteristics
  • Large volume of ordered data points, possibly
    infinite
  • Arrive continuously
  • Fast changing
  • Appropriate model for many applications
  • Phone call records
  • Network and security monitoring
  • Financial applications (stock exchange)
  • Sensor networks

4
Problems in Mining Data Streams
  • Traditional data mining techniques usually
    require
  • Entire data set to be present
  • Random access (or multiple passes) to the data
  • Much time per data item
  • Challenges of stream mining
  • Impractical to store the whole data
  • Random access is expensive
  • Simple calculation per data due to time and space
    constraints

5
Processing Data Streams Motivation
  • A growing number of applications generate streams
    of data
  • Performance measurements in network monitoring
    and traffic management
  • Call detail records in telecommunications
  • Transactions in retail chains, ATM operations in
    banks
  • Log records generated by Web Servers
  • Sensor network data
  • Application characteristics
  • Massive volumes of data (several terabytes)
  • Records arrive at a rapid rate
  • Goal Mine patterns, process queries and compute
    statistics on data streams in real-time

(from VLDB02 Tutorial)
6
Data Streams Computation Model
Synopsis in Memory
Data Streams
Stream Processing Engine
(Approximate) Answer
  • A data stream is a (massive) sequence of
    elements Stream processing requirements
  • Single pass Each record is examined at most once
  • Bounded storage Limited Memory (M) for storing
    synopsis
  • Real-time Per record processing time (to
    maintain synopsis) must be low

7
Network Management Application
  • Network Management involves monitoring and
    configuring network hardware and software to
    ensure smooth operation
  • Monitor link bandwidth usage, estimate traffic
    demands
  • Quickly detect faults, congestion and isolate
    root cause
  • Load balancing, improve utilization of network
    resources

Network Operations Center
Measurements Alarms
Network
(from VLDB02 Tutorial)
8
IP Network Measurement Data
  • IP session data (collected using Cisco
    NetFlow)
  • ATT collects 100 GBs of NetFlow data each
    day!
  • ATT collects 100 GB of NetFlow data per day!

Source Destination Duration
Bytes Protocol 10.1.0.2
16.2.3.7 12 20K
http 18.6.7.1 12.4.0.3
16 24K http
13.9.4.3 11.6.8.2 15
20K http 15.2.2.9
17.1.2.1 19 40K
http 12.4.3.8 14.8.7.4
26 58K http
10.5.1.3 13.0.0.1 27
100K ftp 11.1.0.6
10.3.4.5 32 300K
ftp 19.7.1.2 16.5.5.8
18 80K ftp
(from VLDB02 Tutorial)
9
Network Data Processing
(from VLDB02 Tutorial)
  • Traffic estimation
  • How many bytes were sent between a pair of IP
    addresses?
  • What fraction network IP addresses are active?
  • List the top 100 IP addresses in terms of traffic
  • Traffic analysis
  • What is the average duration of an IP session?
  • What is the median of the number of bytes in each
    IP session?
  • Fraud detection
  • List all sessions that transmitted more than 1000
    bytes
  • Identify all sessions whose duration was more
    than twice the normal
  • Security/Denial of Service
  • List all IP addresses that have witnessed a
    sudden spike in traffic
  • Identify IP addresses involved in more than 1000
    sessions

10
Data Stream Processing Algorithms
  • Generally, algorithms compute approximate answers
  • Difficult to compute answers accurately with
    limited memory
  • Approximate answers - Deterministic bounds
  • Algorithms only compute an approximate answer,
    but bounds on error
  • Approximate answers - Probabilistic bounds
  • Algorithms compute an approximate answer with
    high probability
  • With probability at least , the computed
    answer is within a factor of the actual
    answer
  • Single-pass algorithms for processing streams
    also applicable to (massive) terabyte databases!

(from VLDB02 Tutorial)
11
Classification of Stream Data
  • VFDT algorithm
  • Mining High-Speed Data Streams, KDD 2000.
  • Pedro Domingos, Geoff Hulten
  • CVFDT algorithm (window approach)
  • Mining Time-Changing Data Streams, KDD 2001.
  • Geoff Hulten, Laurie Spencer, Pedro Domingos

12
Hoeffding Trees
13
Definitions
  • A classification problem is defined as
  • N is a set of training examples of the form (x,
    y)
  • x is a vector of d attributes
  • y is a discrete class label
  • Goal To produce from the examples a model yf(x)
    that predict the classes y for future examples x
    with high accuracy

14
Decision Tree Learning
  • One of the most effective and widely-used
    classification methods
  • Induce models in the form of decision trees
  • Each node contains a test on the attribute
  • Each branch from a node corresponds to a possible
    outcome of the test
  • Each leaf contains a class prediction
  • A decision tree is learned by recursively
    replacing leaves by test nodes, starting at the
    root

15
Challenges
  • Classic decision tree learners assume all
    training data can be simultaneously stored in
    main memory
  • Disk-based decision tree learners repeatedly read
    training data from disk sequentially
  • Prohibitively expensive when learning complex
    trees
  • Goal design decision tree learners that read
    each example at most once, and use a small
    constant time to process it

16
Key Observation
  • In order to find the best attribute at a node, it
    may be sufficient to consider only a small subset
    of the training examples that pass through that
    node.
  • Given a stream of examples, use the first ones to
    choose the root attribute.
  • Once the root attribute is chosen, the successive
    examples are passed down to the corresponding
    leaves, and used to choose the attribute there,
    and so on recursively.
  • Use Hoeffding bound to decide how many examples
    are enough at each node

17
Hoeffding Bound
  • Consider a random variable a whose range is R
  • Suppose we have n observations of a
  • Mean
  • Hoeffding bound states
  • With probability 1- ?, the true mean of a is at
    least
  • , where

18
How many examples are enough?
  • Let G(Xi) be the heuristic measure used to choose
    test attributes (e.g. Information Gain, Gini
    Index)
  • Xa the attribute with the highest attribute
    evaluation value after seeing n examples.
  • Xb the attribute with the second highest split
    evaluation function value after seeing n
    examples.
  • Given a desired ?, if
    after seeing n examples at a node,
  • Hoeffding bound guarantees the true
    , with probability 1-?.
  • This node can be split using Xa, the succeeding
    examples will be passed to the new leaves.

19
Algorithm
  • Calculate the information gain for the attributes
    and determines the best two attributes
  • Pre-pruning consider a null attribute that
    consists of not splitting the node
  • At each node, check for the condition
  • If condition satisfied, create child nodes based
    on the test at the node
  • If not, stream in more examples and perform
    calculations till condition satisfied

20
(No Transcript)
21
Performance Analysis
  • p probability that an example passed through DT
    to level i will fall into a leaf at that point
  • The expected disagreement between the tree
    produced by Hoeffding tree algorithm and that
    produced using infinite examples at each node is
    no greater than ? /p.
  • Required memory O(leaves attributes values
    classes)

22
VFDT
23
VFDT (Very Fast Decision Tree)
  • A decision-tree learning system based on the
    Hoeffding tree algorithm
  • Split on the current best attribute, if the
    difference is less than a user-specified
    threshold
  • Wasteful to decide between identical attributes
  • Compute G and check for split periodically
  • Memory management
  • Memory dominated by sufficient statistics
  • Deactivate or drop less promising leaves when
    needed
  • Bootstrap with traditional learner
  • Rescan old data when time available

24
VFDT(2)
  • Scales better than pure memory-based or pure
    disk-based learners
  • Access data sequentially
  • Use subsampling to potentially require much less
    than one scan
  • VFDT is incremental and anytime
  • New examples can be quickly incorporated as they
    arrive
  • A usable model is available after the first few
    examples and then progressively defined

25
Experiment Results (VFDT vs. C4.5)
  • Compared VFDT and C4.5 (Quinlan, 1993)
  • Same memory limit for both (40 MB)
  • 100k examples for C4.5
  • VFDT settings d 10-7, t 5, nmin200
  • Domains 2 classes, 100 binary attributes
  • Fifteen synthetic trees 2.2k 500k leaves
  • Noise from 0 to 30

26
Experiment Results
Accuracy as a function of the number of training
examples
27
Experiment Results
Tree size as a function of number of training
examples
28
Mining Time-Changing Data Stream
  • Most KDD systems, include VFDT, assume training
    data is a sample drawn from stationary
    distribution
  • Most large databases or data streams violate this
    assumption
  • Concept Drift data is generated by a
    time-changing concept function, e.g.
  • Seasonal effects
  • Economic cycles
  • Goal
  • Mining continuously changing data streams
  • Scale well

29
Window Approach
  • Common Approach when a new example arrives,
    reapply a traditional learner to a sliding window
    of w most recent examples
  • Sensitive to window size
  • If w is small relative to the concept shift rate,
    assure the availability of a model reflecting the
    current concept
  • Too small w may lead to insufficient examples to
    learn the concept
  • If examples arrive at a rapid rate or the concept
    changes quickly, the computational cost of
    reapplying a learner may be prohibitively high.

30
CVFDT
31
CVFDT
  • CVFDT (Concept-adapting Very Fast Decision Tree
    learner)
  • Extend VFDT
  • Maintain VFDTs speed and accuracy
  • Detect and respond to changes in the
    example-generating process

32
Observations
  • With a time-changing concept, the current
    splitting attribute of some nodes may not be the
    best any more.
  • An outdated subtree may still be better than the
    best single leaf, particularly if it is near the
    root.
  • Grow an alternative subtree with the new best
    attribute at its root, when the old attribute
    seems out-of-date.
  • Periodically use a bunch of samples to evaluate
    qualities of trees.
  • Replace the old subtree when the alternate one
    becomes more accurate.

33
CVFDT algorithm
  • Alternate trees for each node in HT start as
    empty.
  • Process examples from the stream indefinitely.
    For each example (x, y),
  • Pass (x, y) down to a set of leaves using HT and
    all alternate trees of the nodes (x, y) passes
    through.
  • Add (x, y) to the sliding window of examples.
  • Remove and forget the effect of the oldest
    examples, if the sliding window overflows.
  • CVFDTGrow
  • CheckSplitValidity if f examples seen since last
    checking of alternate trees.
  • Return HT.

34
CVFDT algorithm process each example
Yes
No
Read new example
35
CVFDT algorithm process each example
36
CVFDTGrow
  • For each node reached by the example in HT,
  • Increment the corresponding statistics at the
    node.
  • For each alternate tree Talt of the node,
  • CVFDTGrow
  • If enough examples seen at the leaf in HT which
    the example reaches,
  • Choose the attribute that has the highest average
    value of the attribute evaluation measure
    (information gain or gini index).
  • If the best attribute is not the null
    attribute, create a node for each possible value
    of this attribute

37
CVFDT algorithm process each example
38
Forget old example
  • Maintain the sufficient statistics at every node
    in HT to monitor the validity of its previous
    decisions.
  • VFDT only maintain such statistics at leaves.
  • HT might have grown or changed since the example
    was initially incorporated.
  • Assigned each node a unique, monotonically
    increasing ID as they are created.
  • forgetExample (HT, example, maxID)
  • For each node reached by the old example with
    node ID no larger than the max leave ID the
    example reaches,
  • Decrement the corresponding statistics at the
    node.
  • For each alternate tree Talt of the node,
    forget(Talt, example, maxID).

39
CVFDT algorithm process each example
Read new example
40
CheckSplitValidtiy
  • Periodically scans the internal nodes of HT.
  • Start a new alternate tree when a new winning
    attribute is found.
  • Tighter criteria to avoid excessive alternate
    tree creation.
  • Limit the total number of alternate trees.

41
Smoothly adjust to concept drift
  • Alternate trees are grown the same way HT is.
  • Periodically each node with non-empty alternate
    trees enter a testing mode.
  • M training examples to compare accuracy.
  • Prune alternate trees with non-increasing
    accuracy over time.
  • Replace if an alternate tree is more accurate.

42
Adjust to concept drift(2)
  • Dynamically change the window size
  • Shrink the window when many nodes gets
    questionable or data rate changes rapidly.
  • Increase the window size when few nodes are
    questionable.

43
Performance
  • Require memory O(nodes attributes attribute
    values classes).
  • Independent of the total number of examples.
  • Running time O(Lc attributes attribute values
    number of classes).
  • Lc the longest length an example passes through
    number of alternate trees.
  • Model learned by CVFDT vs. the one learned by
    VFDT-Window
  • Similar in accuracy
  • O(1) vs. O(window size) per new example.

44
Experiment Results
  • Compare CVFDT, VFDT, VFDT-Window
  • 5 million training examples
  • Concept changed at every 50k examples
  • Drift Level average percentage of the test
    points that changes label at each concept change.
  • About 8 of test points change label each drift
  • 100,000 examples in window
  • 5 noise
  • Test the model every 10k examples throughout the
    run, averaged these results.

45
Experiment Results (CVFDT vs. VFDT)
Error rate as a function of number of attributes
46
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of number of attributes
47
Experiment Results (CVFDT vs. VFDT)
Error rates of learners as a function of the
number of examples seen
48
Experiment Results (CVFDT vs. VFDT)
Error rates as a function of the amount of
concept drift
49
Experiment Results
CVFDTs drift characteristics
50
Experiment Results (CVFDT vs. VFDT vs.
VFDT-window)
  • Error Rate
  • VFDT 19.4
  • CVFDT 16.3
  • VFDT-Window 15.3
  • Running Time
  • VFDT 10 minutes
  • CVFDT 46 minutes
  • VFDT-Window expect 548 days

Error rates over time of CVFDT, VFDT, and
VFDT-window
51
Experiment Results
  • CVFDT not use too much RAM
  • D50, CVFDT never uses more than 70MB
  • Use as little as half the RAM of VFDT
  • VFDT often had twice as many leaves as the number
    of nodes in CVFDTs HT and alternate subtrees
    combined
  • Reason VFDT considers many more outdated
    examples and is forced to grow larger trees to
    make up for its earlier wrong decisions due to
    concept drift

52
Conclusions
  • CVFDT a decision-tree induction system capable
    of learning accurate models from high speed,
    concept-drifting data streams
  • Grow an alternative subtree whenever an old one
    becomes questionable
  • Replace the old subtree when the new more
    accurate
  • Similar in accuracy to applying VFDT to a moving
    window of examples

53
Future Work
  • Concepts changed periodically and removed
    subtrees may become useful again
  • Comparisons with related systems
  • Continuous attributes
  • Weighting examples

54
Reference List
  • P. Domingos and G. Hulten. Mining high-speed data
    streams. In Proceedings of the Sixth ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining, 2000.
  • G. Hulten, L. Spencer, and P. Domingos. Mining
    time-changing data streams. In Proceedings of the
    Seventh ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining, 2001
  • V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON
    Mining and monitoring evolving data. In
    Proceedings of the Sixteenth International
    Conference on Data Engineering, 2000.
  • J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L.
    Loh. BOAT optimistic decision tree construction.
    In Proceedings of the 1999 ACM SIGMOD
    International Conference on Management of Data,
    1999.

55
The end
  • Q A

56
Thank You!
About PowerShow.com