Fast Subsequence Matching in Timeseries Databases - PowerPoint PPT Presentation



Fast Subsequence Matching in Timeseries Databases


on Management of Data, pages 419--429, Minneapolis, May 1994. presented by ... find companies whose stock prices move similarly ... – PowerPoint PPT presentation

Provided by: kathy128


Transcript and Presenter's Notes

Title: Fast Subsequence Matching in Timeseries Databases

Fast Subsequence Matching in Time-series Databases
  • C. Faloutsos, M. Ranganathan, and Y. Manolopoulos
  • University of Maryland Department of Computer
    Science and Institute for Systems Research
  • In Proc. ACM SIGMOD Int. Conf. on Management of
    Data, pages 419--429, Minneapolis, May 1994.
  • presented by Kathy Gray, Barkha Raisoni

  • The Problem
  • Background and Related Work
  • Main Approach
  • Subsequence Matching
  • Performance Results
  • Summary

The Problem
  • Business, Financial, Stock
  • find companies whose stock prices move similarly
  • find other companies whose sales patterns are
    similar to those of our company's product
  • Scientific
  • find past days in which the solar magnetic wind
    showed patterns similar to today's (predictions
    of the Earth's magnetic field)

Problem Overview
  • Need fast searching methods that will search a
    database with time-series of real numbers to
    locate subsequences that are similar to a query
  • fast and correct
  • small space overhead
  • dynamic
  • handle varying length data sequences

Similarity Queries
  • Whole Matching
  • Given N data sequences of real numbers
    S1, S2, …, SN and a query sequence Q, we want to
    find those data sequences that are within a
    certain tolerance ε (distance from Q)
  • Use a distance preserving transform, such as
    Discrete Fourier Transform (DFT) to extract f
    features from sequences (i.e., the first f DFT
    coefficients) thus mapping them into points in
    the f-dimensional feature space
  • then index the points with any spatial access
    method (e.g., R-trees)
  • exploits assumption that data and query sequences
    have same length

Similarity Queries
  • Subsequence Matching
  • Given a collection of N sequences of real
    numbers of varying length, S1, S2, …, SN
  • User specifies a query subsequence Q of variable
    length Len(Q) and a tolerance ε (maximum
    acceptable dissimilarity, or distance)
  • Find all sequences Si (1 ≤ i ≤ N), along with
    the correct offsets k, such that
    Si[k : k + Len(Q) − 1] matches the query
    subsequence
  • D(Q, Si[k : k + Len(Q) − 1]) ≤ ε
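
The definition above can be made concrete with the sequential-scan baseline that the rest of the talk compares against. This is a minimal sketch, assuming Euclidean distance for D and 0-based offsets; the names `euclidean` and `sequential_scan` are illustrative and do not come from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(sequences, query, eps):
    """Return (sequence index, offset k) for every window within eps of query."""
    m = len(query)
    matches = []
    for i, s in enumerate(sequences):
        # Slide a window of length Len(Q) over every possible offset k.
        for k in range(len(s) - m + 1):
            if euclidean(s[k:k + m], query) <= eps:
                matches.append((i, k))
    return matches

data = [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 4.0, 3.0]]
hits = sequential_scan(data, [2.0, 3.1, 4.0], 0.2)
print(hits)
```

Every window of every sequence is compared against the query, which is the cost the index-based methods later in the talk try to avoid.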

Related Work
  • Indexing in text and DNA databases
  • can be viewed as 1-dimensional sequences
  • but consist of discrete symbols vs. continuous
    numbers
  • makes a difference in the feature extraction
  • Queries on time sequences, color images, or 3-D
    brain scans (whole matching)
  • F-index method
  • apply DFT
  • store first few numbers (DFT coefficients)
  • sequence mapped into a point in f-dimensional
    feature space
  • points are then organized in an R-tree

Related Work (contd)
  • F-index should not result in false dismissals for
    range queries
  • Condition to be satisfied
  • Dfeature(F(O), F(Q)) ≤ D(O, Q)
  • where
  • O is the qualifying object and Q is the query
  • Dfeature is the Euclidean distance in feature
    space
  • F(O) is the feature vector of O
  • Note: proof is given in the paper
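
The no-false-dismissal requirement amounts to the feature-space distance never exceeding the true distance. The sketch below checks this numerically for truncated DFT features. It assumes the orthonormal DFT (scaled by 1/√n) so that Parseval's theorem applies; `dft_features` and `dist` are illustrative names, not taken from the slides.

```python
import cmath
import math
import random

def dft_features(x, f):
    """First f coefficients of the orthonormal DFT (1/sqrt(n) scaling)."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * c * t / n) for t, v in enumerate(x))
        / math.sqrt(n)
        for c in range(f)
    ]

def dist(a, b):
    """Euclidean distance; works for real and complex coordinates."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

random.seed(0)
o1 = [random.gauss(0.0, 1.0) for _ in range(16)]
o2 = [random.gauss(0.0, 1.0) for _ in range(16)]

d_object = dist(o1, o2)                                     # distance in the time domain
d_feature = dist(dft_features(o1, 3), dft_features(o2, 3))  # distance on 3 features
d_all = dist(dft_features(o1, 16), dft_features(o2, 16))    # all 16 coefficients

print(d_feature <= d_object)
```

Keeping all n coefficients reproduces the original distance, so dropping coefficients can only shrink it; that is why a range query in feature space returns a superset of the true answers.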

Main Approach
  • Generalize the whole-matching problem - find
    approximate-match queries for subsequences of
    arbitrary lengths
  • Map each data sequence into a small set of
    multidimensional rectangles in feature space
  • Then these rectangles can be indexed using
    spatial access methods, such as R trees
  • Small space overhead and orders of magnitude
    savings over Sequential Scan

Sub-Trail (ST)-Index
  • Assume that queries have a minimum duration w
    (e.g., w = 7 days)
  • Divide data sequences into
  • sliding windows of width w
  • thus producing trails
  • i.e., a data sequence of length Len(S) is mapped
    to a trail of Len(S) − w + 1 points in feature
    space
  • Index these trails using I-naive method
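
The sliding-window step can be sketched as follows: each window of width w yields one feature point, and the points in offset order form the trail. This is an illustration under assumptions, with the first f orthonormal DFT coefficients as features; the names `dft_features` and `trail` are hypothetical.

```python
import cmath
import math

def dft_features(window, f):
    """First f coefficients of the orthonormal DFT of one window."""
    n = len(window)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * c * t / n) for t, v in enumerate(window))
        / math.sqrt(n)
        for c in range(f)
    ]

def trail(seq, w, f):
    """Map a sequence to its trail: one feature point per window offset."""
    return [dft_features(seq[k:k + w], f) for k in range(len(seq) - w + 1)]

seq = [float(v) for v in range(10)]
points = trail(seq, w=4, f=2)
print(len(points))  # number of trail points, Len(S) - w + 1
```

The coefficients are complex; a spatial access method would typically store their real and imaginary parts as separate coordinates.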

I-naive method
  • Given a query of length w and tolerance ε:
    extract the features of the query and search the
    spatial access method with a range query of
    radius ε
  • retrieved points correspond to promising
    subsequences
  • discard false alarms (subsequences whose actual
    distance from the query exceeds ε)
  • the remaining subsequences form the desired
    answer set

I-naive Inefficient
  • Twice as slow as Sequential Scan!
  • 1:f increase in storage requirements
  • R-tree becomes very tall and slow
  • Solution
  • exploit the fact that successive points of a
    trail are likely to be similar
  • divide the trail into sub-trails
  • represent each sub-trail with its minimum
    bounding rectangle (MBR)
  • storage of only a few MBRs required!

ST-Index Example
  • MBRs belonging to the same trail may overlap

ST-index features
  • Map data sequence into set of rectangles in
    feature space
  • Significant improvement with respect to space and
    response time
  • For each MBR we have to store
  • tstart, tend
  • Unique identifier for data sequence (sequence_id)
  • Extent of the MBR in each dimension
  • (F1low, F1high, F2low, F2high, …)

ST-index Node Structure
  • Level above leaves: F1_min, F1_max, F2_min,
    F2_max
  • Leaf level: Sequence_id, T_start, T_end, F1_min,
    F1_max, F2_min, F2_max

Now Barkha...
  • Questions
  • Insertions: how to divide a sequence's trail in
    feature space into sub-trails
  • Queries: how to handle queries, especially those
    that are longer than w
  • Performance Results
  • Summary

Sub-trail size
  • Aim
  • To find an optimal way to divide a trail in
    feature space into sub-trails
  • Simple solutions for the sub-trail size
  • Pack points into sub-trails according to a
    pre-determined fixed number. No optimal value!
  • Use a function of the length of the stored
    sequence for the sub-trail size, e.g., √Len(S)

Sub-trail size (contd….)
  • Both of these methods show poor results
  • I-fixed method
  • Uses an index with fixed-length sub-trails
  • I-naïve method is a special case of I-fixed with
    the sub-trail length set to 1
  • I-adaptive method
  • Groups points into sub-trails with a greedy
    algorithm
  • Uses a cost function that tries to estimate the
    number of disk accesses

Example: I-fixed method vs. I-adaptive method
  • (figure) I-fixed shown with sub-trails of fixed
    length 3

Algorithm Divide-to-Sub-trails
  • Definition of marginal cost
  • Consider a sub-trail of k points with an MBR of
    sides L = (L1, …, Ln)
  • Then the marginal cost of each point in this
    sub-trail is
  • mc = DA(L) / k, where DA(L) is the estimated
    number of disk accesses
  • Assign the first point of the trail in a
    (trivial) sub-trail
  • FOR each successive point
  • IF it increases the marginal cost of the
  • current sub-trail
  • THEN start another sub-trail
  • ELSE include it in the current sub-trail
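
The greedy loop above can be sketched in a few lines. The slides do not give the form of DA(L); the sketch assumes the estimate DA(L) = ∏(Li + 0.5) for feature coordinates normalized to the unit cube, which is an assumption here, and the function names are illustrative.

```python
import math

def disk_accesses(lows, highs):
    """Assumed cost estimate DA(L) = prod(L_i + 0.5) for an MBR with sides L_i."""
    return math.prod((hi - lo) + 0.5 for lo, hi in zip(lows, highs))

def divide_to_subtrails(points):
    """Greedily split a trail (list of feature points) into sub-trails."""
    if not points:
        return []
    subtrails = []
    current = [points[0]]                       # trivial first sub-trail
    lows, highs = list(points[0]), list(points[0])
    for p in points[1:]:
        # MBR of the current sub-trail if p were added to it.
        new_lows = [min(l, c) for l, c in zip(lows, p)]
        new_highs = [max(h, c) for h, c in zip(highs, p)]
        mc_now = disk_accesses(lows, highs) / len(current)
        mc_new = disk_accesses(new_lows, new_highs) / (len(current) + 1)
        if mc_new > mc_now:
            # Adding p raises the marginal cost: start another sub-trail.
            subtrails.append(current)
            current = [p]
            lows, highs = list(p), list(p)
        else:
            current.append(p)
            lows, highs = new_lows, new_highs
    subtrails.append(current)
    return subtrails

pts = [(0.0, 0.0), (0.01, 0.01), (0.02, 0.0), (0.9, 0.9), (0.91, 0.9)]
parts = divide_to_subtrails(pts)
print([len(s) for s in parts])
```

In this toy trail the jump from the first cluster to the second inflates the MBR enough to raise the marginal cost, so the split falls exactly at the jump.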

Query length
  • Query of length w
  • Algorithm Search_Short
  • Query sequence is mapped to a point qf in feature
    space; the query region is a sphere around qf
    with radius ε
  • Retrieve the sub-trails whose MBRs intersect the
    query region using the index
  • Examine corresponding subsequences of data
    sequences to discard the false alarms
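
A compact sketch of Search_Short follows. To stay self-contained it makes several assumptions: a plain list of MBR records stands in for the R-tree, sub-trails have a fixed length (I-fixed style), features are the real and imaginary parts of the first f orthonormal DFT coefficients, and all names (`SubTrail`, `build_index`, `search_short`) are hypothetical.

```python
import cmath
import math
from dataclasses import dataclass

def features(window, f):
    """Real and imaginary parts of the first f orthonormal DFT coefficients."""
    n = len(window)
    out = []
    for c in range(f):
        z = sum(x * cmath.exp(-2j * cmath.pi * c * t / n)
                for t, x in enumerate(window)) / math.sqrt(n)
        out.extend((z.real, z.imag))
    return out

@dataclass
class SubTrail:
    seq_id: int
    t_start: int   # first window offset covered by this MBR
    t_end: int     # last window offset covered by this MBR
    lows: list
    highs: list

def build_index(sequences, w, f, size):
    """Cut each trail into sub-trails of `size` points and store their MBRs."""
    index = []
    for sid, s in enumerate(sequences):
        pts = [features(s[k:k + w], f) for k in range(len(s) - w + 1)]
        for start in range(0, len(pts), size):
            chunk = pts[start:start + size]
            index.append(SubTrail(sid, start, start + len(chunk) - 1,
                                  [min(c) for c in zip(*chunk)],
                                  [max(c) for c in zip(*chunk)]))
    return index

def min_dist(point, lows, highs):
    """Smallest distance from a point to an axis-aligned rectangle."""
    return math.sqrt(sum(max(lo - p, 0.0, p - hi) ** 2
                         for p, lo, hi in zip(point, lows, highs)))

def search_short(sequences, index, query, f, eps):
    qf, w = features(query, f), len(query)
    answers = []
    for st in index:
        if min_dist(qf, st.lows, st.highs) <= eps:    # MBR meets query sphere
            s = sequences[st.seq_id]
            for k in range(st.t_start, st.t_end + 1):  # discard false alarms
                d = math.sqrt(sum((a - b) ** 2 for a, b in zip(s[k:k + w], query)))
                if d <= eps:
                    answers.append((st.seq_id, k))
    return answers

s0 = [math.sin(0.3 * t) for t in range(40)]
s1 = [0.1 * t for t in range(30)]
sequences = [s0, s1]
index = build_index(sequences, w=8, f=2, size=4)
answers = search_short(sequences, index, s0[5:13], f=2, eps=0.1)
print(answers)
```

Because the feature distance never exceeds the true distance, every qualifying window lies in an MBR that intersects the query sphere, so the filter step loses no answers.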

Query length (contd…)
  • Queries of length greater than w
  • Complicated as ST-index only knows subsequences
    of length w
  • Solution proposed - Prefix Search
  • Select a subsequence of Q of length w
  • (e.g. prefix)
  • Use ST-index to search for data subsequences that
    match the prefix
  • Returns superset of qualifying subsequences

Query length (contd….)
  • Lemma
  • If two sequences S and Q of the same length l
    agree within tolerance ε
  • Then any pair (S[i:j], Q[i:j]) of corresponding
    subsequences agrees within the same tolerance ε
  • D(S, Q) ≤ ε ⇒ D(S[i:j], Q[i:j]) ≤ ε
    (1 ≤ i ≤ j ≤ l)
  • Note: proof is given in the paper

Query length (contd….)
  • Algorithm Search_Long (MultiPiece method)
  • Query sequence Q is broken into p sub-queries,
    corresponding to p spheres in feature space with
    radius ε/√p
  • ST-index is used to retrieve the sub-trails whose
    MBRs intersect at least one sub-query region
  • Examine the corresponding subsequences of the
    data to discard the false alarms
  • Method based on lemma 3 of the paper

Performance Results
  • Stock price sequence and its trail of 0th and 1st

Performance Results (contd…..)
  • (figure) Index space vs. average sub-trail length
  • I-fixed gives varying results depending on the
    length of its sub-trails
  • I-naïve method: ≈24 MB, 2 times slower than the
    sequential scanning method!
  • I-adaptive method: ≈5Kb

Performance Results (contd….)
  • (figure) Relative response time of sequential
    scanning vs. proposed method
  • Analysis for query length same as w
  • Proposed method achieves 3 up to 100 times better
    response time for selectivities in the range
    from 10^-4 to 10
  • Len(Q) = w = 512

Performance Results (contd…..)
  • (figure) Relative wall clock time vs. selectivity
  • Analysis for query length greater than w
  • I-adaptive method outperforms sequential scanning
    by 2 to 40 times
  • Len(Q) = 512, w = 128

Performance Results (contd….)
  • (figure) Random walk data, log-log scale
  • Points generated with a starting value of 1.5,
    where the step increment is 0.001
  • Method outperforms sequential scanning by
    approximately 100 to 10 times for selectivities
    up to 10

Summary
  • Proposed idea maps data sequences into a set of
    boxes in feature space
  • Method efficiently handles approximate and exact
    queries for subsequence matching
  • Generalization of the whole-matching case
  • Achieves orders of magnitude savings over
    sequential scanning
  • Small space overhead, dynamic, provably correct
  • Future work: extend the method to
  • 2-dimensional gray-scale images and then, in
    general, n-dimensional vector fields