Fast Subsequence Matching in Time-Series Databases
  • Christos Faloutsos
  • M. Ranganathan
  • Yannis Manolopoulos
  • Department of Computer Science and ISR
  • University of Maryland at College Park

Presented by Rui Li
  • Goal: To find an efficient indexing method to
    locate time series in a database
  • Main Idea
  • Map each time series into a small set of
    multidimensional rectangles in feature space
  • Rectangles can be readily indexed using
    traditional spatial access methods, e.g., R-tree

Introduction
  • Hot Problem: Searching for similar patterns in
    time-series databases
  • Applications
  • financial, marketing and production time series,
    e.g. stock prices
  • scientific databases, e.g. weather, geological,
    environmental data

Introduction (cont.)
  • Similarity Queries
  • Whole Matching
  • Subsequence Matching
  • partial matching
  • report time series along with offset

Introduction (cont.)
  • Whole Matching (Previous Work)
  • Use a distance-preserving transform (e.g., DFT)
    to extract f features from time series (e.g., the
    first f DFT coefficients), and then map them into
    points in the f-dimensional feature space
  • A spatial access method (e.g., an R-tree) can
    then be used to answer approximate (range) queries

Introduction (cont.)
  • Subsequence Matching (Goal)
  • Map time series into rectangles in feature space
  • Use spatial access methods as the eventual
    indexing mechanism

  • To guarantee no false dismissals for range
    queries, the feature extraction function F()
    should satisfy the following formula
  • $D_{feature}(F(O_1), F(O_2)) \le D(O_1, O_2)$
  • Parseval's Theorem
  • The DFT preserves the Euclidean distance between
    two time series
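
A minimal sketch of why truncated DFT features cannot cause false dismissals. It assumes an orthonormal DFT (scaled by 1/sqrt(n)); the helper names `dft_features` and `euclid` and the two sample series are made up for illustration.

```python
import cmath
import math

def dft_features(x, f):
    """First f coefficients of the orthonormal DFT (1/sqrt(n) scaling)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        / math.sqrt(n)
        for k in range(f)
    ]

def euclid(a, b):
    """Euclidean distance; works for real and complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

# Two illustrative series of equal length (hypothetical data).
s1 = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 4.0, 2.0]
s2 = [1.5, 2.5, 3.0, 3.5, 4.0, 6.5, 5.0, 1.0]

d_time = euclid(s1, s2)                                   # distance in time domain
d_full = euclid(dft_features(s1, 8), dft_features(s2, 8))  # all coefficients
d_feat = euclid(dft_features(s1, 2), dft_features(s2, 2))  # first 2 only

print(d_time, d_full, d_feat)
```

With all coefficients kept the two distances coincide (Parseval); keeping only the first few drops non-negative terms, so the feature-space distance can only underestimate the true one.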

Proposed Method
  • Mapping each time series to a trail in feature
    space
  • Use a sliding window of size w and place it at
    every possible offset
  • For each such placement of the window, extract
    the features of the subsequence inside the window
  • A time series of length L is mapped to a trail in
    feature space, consisting of
  • L - w + 1 points, one point for each offset
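
The sliding-window mapping can be sketched as follows. This is an illustration under assumptions: features are the real and imaginary parts of the first f orthonormal DFT coefficients, and the names `window_features` and `trail` are hypothetical.

```python
import cmath
import math

def window_features(window, f):
    """Real/imaginary parts of the first f orthonormal DFT coefficients."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(
            window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        ) / math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

def trail(series, w, f):
    """One feature point per window offset: len(series) - w + 1 points."""
    return [window_features(series[i:i + w], f)
            for i in range(len(series) - w + 1)]

# Hypothetical series of length L = 10, window w = 4, f = 2 coefficients.
series = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 6.0, 5.0, 4.0, 6.0]
points = trail(series, w=4, f=2)
print(len(points), len(points[0]))
```

Each point has 2f coordinates, and consecutive points differ by only one sample entering and one leaving the window, which is why the trail tends to move smoothly.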

  • Example1

  • Example2
  • (a) a sample stock-price time series
  • (b) its trail in the feature space of the 0-th
    and 1-st DFT coefficients
  • (c) its trail in the feature space of the 1-st
    and 2-nd DFT coefficients

Proposed Method (cont.)
  • Indexing the trails
  • Simply storing the individual points of the trail
    in an R-tree is inefficient
  • Exploit the fact that successive points of the
    trail tend to be similar, i.e., the contents of
    the sliding window in nearby offsets tend to be
    similar
  • Divide the trail into sub-trails and represent
    each of them with its minimum bounding
    (hyper)-rectangle (MBR)
  • Store only a few MBRs

Proposed Method (cont.)
  • Indexing the trails (cont.)
  • Can guarantee no false dismissals: when a query
    arrives, all the MBRs that intersect the query
    region are retrieved, i.e., all the qualifying
    sub-trails are retrieved, plus some false alarms
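
A small sketch of the retrieval step, assuming a spherical query region of radius eps around a feature point q. A linear scan over sub-trails stands in for the R-tree here, and `mbr`, `min_dist_to_mbr`, and `candidate_subtrails` are illustrative names, not part of any library.

```python
import math

def mbr(points):
    """Minimum bounding rectangle (low corner, high corner) of a sub-trail."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))

def candidate_subtrails(subtrails, q, eps):
    """Indices of sub-trails whose MBR intersects the sphere (q, eps)."""
    hits = []
    for idx, pts in enumerate(subtrails):
        low, high = mbr(pts)
        if min_dist_to_mbr(q, low, high) <= eps:
            hits.append(idx)
    return hits

# Hypothetical 2-d sub-trails.
subtrails = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],
    [(5.0, 5.0), (6.0, 5.5)],
    [(2.0, 6.0), (1.0, 7.0)],
]
hits = candidate_subtrails(subtrails, q=(2.0, 2.0), eps=1.5)
false_alarm = candidate_subtrails(subtrails, q=(2.0, 0.0), eps=0.5)
print(hits, false_alarm)
```

Any trail point within eps of q lies inside its own MBR, so that MBR is always retrieved. The second query shows the price: the rectangle of sub-trail 0 contains q, yet none of its points is within 0.5 of q, so it is a false alarm to be discarded later.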

  • Return to example1

Proposed Method (cont.)
  • Indexing the trails (cont.)
  • Map a time series into a set of rectangles in
    feature space
  • Each MBR corresponds to a sub-trail

Proposed Method (cont.)
  • For each MBR we have to store:
  • $t_{start}$, $t_{end}$, which are the offsets of the first
    and last such positionings
  • A unique identifier for each time series
  • The extent of the MBR in each dimension, i.e.,
    $(F_{1,low}, F_{1,high}, F_{2,low}, F_{2,high}, \ldots)$
  • Store the MBRs in an R-tree
  • Recursively group the MBRs into parent MBRs,
    grandparent MBRs, etc.
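
One way to picture the stored record and the grouping into parent MBRs. This is a sketch only: the field names are assumptions, and grouping consecutive boxes by a fixed fan-out is a simplification of what an R-tree's insertion and split policy would actually do.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SubTrailMBR:
    series_id: int            # unique identifier of the time series
    t_start: int              # offset of the first window position covered
    t_end: int                # offset of the last window position covered
    low: Tuple[float, ...]    # lower extent in each feature dimension
    high: Tuple[float, ...]   # upper extent in each feature dimension

def parent_box(children):
    """Smallest box enclosing a list of (low, high) boxes."""
    dims = range(len(children[0][0]))
    low = tuple(min(c[0][d] for c in children) for d in dims)
    high = tuple(max(c[1][d] for c in children) for d in dims)
    return low, high

def group_level(boxes, fanout):
    """Group consecutive boxes into parents (illustrative, not an R-tree split)."""
    return [parent_box(boxes[i:i + fanout])
            for i in range(0, len(boxes), fanout)]

# Hypothetical leaf entries for two time series.
leaves = [
    SubTrailMBR(1, 0, 9, (0.0, 0.0), (1.0, 1.0)),
    SubTrailMBR(1, 10, 24, (0.5, 0.8), (2.0, 1.5)),
    SubTrailMBR(1, 25, 30, (1.8, 1.2), (2.5, 3.0)),
    SubTrailMBR(2, 0, 14, (4.0, 4.0), (5.0, 4.5)),
    SubTrailMBR(2, 15, 40, (4.5, 3.0), (6.0, 4.2)),
]
parents = group_level([(m.low, m.high) for m in leaves], fanout=4)
print(parents)
```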

  • Example1 (cont.)
  • assuming a fan-out of 4

Proposed Method (cont.)
  • The structure of a leaf node and a non-leaf node

Proposed Method (cont.)
  • Two questions
  • Insertions: when a new time series is inserted,
    what is a good way to divide its trail into
    sub-trails?
  • Queries: how to handle queries, especially the
    ones that are longer than the sliding window?

Proposed Method (cont.)
  • Insertion: Dividing trails into sub-trails
  • Goal: Optimal division so that the number of disk
    accesses is minimized

  • Example3: fixed heuristic vs. adaptive heuristic

Proposed Method (cont.)
  • Insertion (cont.)
  • Group trail-points into sub-trails by means of an
    adaptive heuristic
  • Based on a greedy algorithm, using a cost
    function to estimate the number of disk accesses
    for each of the options

Proposed Method (cont.)
  • Insertion (cont.)
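  • The cost function $DA(\vec{L}) = \prod_{i=1}^{n} (L_i + 0.5)$,
    where $\vec{L} = (L_1, L_2, \ldots, L_n)$ are the
    sides of the n-dimensional MBR of a node in an
    R-tree
  • The marginal cost of each point $mc = DA(\vec{L}) / k$,
    where k is the number of points in this MBR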

Proposed Method (cont.)
  • Insertion (cont.)
  • Algorithm
      Assign the first point of the trail to a sub-trail
      (would be a predefined small MBR)
      FOR each successive point
        IF it increases the marginal cost of the current sub-trail
        THEN start a new sub-trail
        ELSE include it into the current sub-trail
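
The greedy procedure can be sketched as below. Assumptions are labeled: the cost estimate is taken to be the product of (side + 0.5) over all dimensions divided by the number of points, which presumes features normalized to the unit hypercube; `marginal_cost` and `split_trail` are illustrative names.

```python
def marginal_cost(low, high, k):
    """Assumed cost estimate: prod(side_i + 0.5) spread over k points."""
    cost = 1.0
    for l, h in zip(low, high):
        cost *= (h - l) + 0.5
    return cost / k

def split_trail(points):
    """Greedily cut a trail into sub-trails whenever marginal cost would rise."""
    subtrails = []
    current = [points[0]]
    low, high = list(points[0]), list(points[0])
    for p in points[1:]:
        new_low = [min(l, x) for l, x in zip(low, p)]
        new_high = [max(h, x) for h, x in zip(high, p)]
        old_mc = marginal_cost(low, high, len(current))
        new_mc = marginal_cost(new_low, new_high, len(current) + 1)
        if new_mc > old_mc:
            # Adding p makes the MBR too large for what it buys: start anew.
            subtrails.append(current)
            current = [p]
            low, high = list(p), list(p)
        else:
            current.append(p)
            low, high = new_low, new_high
    subtrails.append(current)
    return subtrails

# Hypothetical 2-d trail with a jump between the third and fourth point.
trail_points = [(0.0, 0.0), (0.05, 0.02), (0.1, 0.05),
                (0.9, 0.9), (0.92, 0.95), (0.95, 0.97)]
parts = split_trail(trail_points)
print([len(s) for s in parts])
```

On this trail the jump to (0.9, 0.9) would inflate the bounding rectangle, so the heuristic closes the first sub-trail there and opens a second one.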

Proposed Method (cont.)
  • Insertion (cont.)
  • The algorithm may not work well under certain
    circumstances
  • The algorithm's goal is to minimize the size of
    each MBR; why don't we use clustering techniques?

Proposed Method (cont.)
  • Searching: Queries longer than w
  • If Len(Q) = w, the searching algorithm goes like
    this:
  • Map Q to a point q in the feature space; the
    query corresponds to a sphere with center q and
    radius $\epsilon$
  • Retrieve the sub-trails whose MBRs intersect the
    query region
  • Examine the corresponding time series, and
    discard the false alarms

Proposed Method (cont.)
  • Searching (cont.)
  • If Len(Q) > w, consider the following Lemma
  • Consider two sequences Q and S of the same length
  • Consider their p disjoint subsequences
  • $q_i = Q[iw+1 : (i+1)w]$ and $s_i = S[iw+1 : (i+1)w]$,
    where $0 \le i < p$ and $w = Len(Q)/p$
  • If Q and S agree within tolerance $\epsilon$, then at
    least one of the pairs of corresponding
    subsequences agrees within tolerance $\epsilon/\sqrt{p}$
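
The lemma follows from a short contradiction argument. A sketch, assuming Euclidean distance D and writing q_i, s_i (notation assumed here) for the p disjoint pieces of Q and S:

```latex
\begin{align*}
D^2(Q, S) &= \sum_{i=0}^{p-1} D^2(q_i, s_i) \\
D(q_i, s_i) > \frac{\epsilon}{\sqrt{p}} \ \text{for all } i
  &\;\Longrightarrow\; D^2(Q, S) > p \cdot \frac{\epsilon^2}{p} = \epsilon^2
\end{align*}
```

So if every pair of pieces were farther apart than the reduced tolerance, Q and S could not agree within the original tolerance; hence at least one pair must be close.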

Proposed Method (cont.)
  • Searching (cont.)
  • If Len(Q) > w, the searching algorithm goes like
    this:
  • The query time series Q is broken into p
    sub-queries which correspond to p spheres in the
    feature space with radius $\epsilon/\sqrt{p}$
  • Retrieve the sub-trails whose MBRs intersect at
    least one of the sub-query regions
  • Examine the corresponding subsequences of the
    time series, and discard the false alarms
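
The long-query search can be sketched as follows. Assumptions: Len(Q) is a multiple of w, each piece is searched with radius eps / sqrt(p), and a sequential scan over raw windows stands in for the R-tree lookup in feature space; `multipiece_search` is an illustrative name.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multipiece_search(series, query, w, eps):
    """Offsets where series matches query within eps, found piece by piece."""
    p = len(query) // w                 # assumes len(query) is a multiple of w
    radius = eps / math.sqrt(p)         # reduced tolerance per sub-query
    candidates = set()
    for i in range(p):
        piece = query[i * w:(i + 1) * w]
        for off in range(len(series) - w + 1):
            # Stand-in for "retrieve sub-trails whose MBRs intersect".
            if dist(series[off:off + w], piece) <= radius:
                start = off - i * w     # where the whole query would begin
                if 0 <= start <= len(series) - len(query):
                    candidates.add(start)
    # Post-processing: examine full subsequences and discard false alarms.
    return sorted(s for s in candidates
                  if dist(series[s:s + len(query)], query) <= eps)

# Hypothetical data: the query is a slightly perturbed copy of series[3:11].
series = [1.0, 3.0, 2.0, 5.0, 7.0, 6.0, 8.0, 9.0,
          7.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 3.0]
query = [v + 0.1 for v in series[3:11]]
matches = multipiece_search(series, query, w=4, eps=0.5)
print(matches)
```

A candidate start offset is recorded as soon as any single piece falls within the reduced radius, so every true match produces at least one candidate; the final filter then removes the false alarms.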

Experiments
  • Experiments were run on a stock-price database of
    329,000 points
  • Only the first 3 frequencies of the DFT are used;
    thus the feature space has 6 dimensions (real and
    imaginary parts of each retained DFT coefficient)
  • Sliding window size w = 512

Experiments (cont.)
  • Query time series were generated by taking random
    offsets into the time series and obtaining
    subsequences of length Len(Q) from those offsets

Experiments (cont.)
  • Four groups of experiments were carried out:
  • Comparison of the proposed method against the
    method that has sub-trails with only one point
  • Experiments to compare the response time
  • Experiments with queries longer than w
  • Experiments with larger databases

Related Works (citations)
  • Continuous queries over data streams
  • Similarity indexing with M-tree/SS-tree, etc.
  • Efficient time series matching by wavelets
  • Fast similarity search in the presence of noise,
    scaling, and translation in time-series databases

Thank you!