1
Fast Subsequence Matching in Time-Series Databases
  • Authors: Christos Faloutsos et al.
  • Speaker: Weijun He

2
What is the problem?
  • What is a time series? 1-dimensional data
  • e.g., daily stock market prices,
  • daily temperatures, etc.
  • Our goal
  • Design fast search methods that locate
    subsequences that match a query subsequence,
    exactly or approximately

3
Motivation/Application
  • Financial, marketing, production
  • Typical query
  • find companies whose stock prices move
    similarly
  • Scientific databases
  • Typical query
  • find past days in which the solar magnetic wind
    showed patterns similar to today's

4
Some notational conventions
  • If S and Q are two sequences, then
  • Len(S): the length of S
  • S[i:j]: the subsequence of S from entry i to entry j,
    inclusive
  • S[i]: the i-th entry of S
  • D(S, Q): the distance between two equal-length
    sequences S and Q (see the sketch below)
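A minimal illustration of this notation in Python (a reading aid, assuming sequences are stored as lists and D() is the Euclidean distance mentioned on the next slide):

import math

def D(S, Q):
    """Euclidean distance between two equal-length sequences."""
    assert len(S) == len(Q)
    return math.sqrt(sum((s - q) ** 2 for s, q in zip(S, Q)))

S = [1.0, 2.0, 3.0, 5.0]
print(len(S))       # Len(S) = 4
print(S[2])         # S[i]: the i-th entry (0-based here)
print(S[1:3])       # roughly S[i:j], except Python slices exclude j
print(D(S, [1.0, 2.0, 3.0, 4.0]))   # D(S, Q) = 1.0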

5
Queries
  • Two categories of queries
  • Whole Matching: len(data) = len(query)
  • Subsequence Matching:
  • len(data) > len(query)
  • Remark
  • The distance function D(S, Q) is given, e.g., D()
    can be the Euclidean distance
  • Matching means D(S, Q) ≤ ε, i.e., approximate
    matching within tolerance ε

6
Whole Matching
  • Apply a distance-preserving transform (e.g., the
    Discrete Fourier Transform, DFT) to extract f
    features from each sequence (e.g., the first f DFT
    coefficients), mapping it into an f-dimensional
    feature space
  • Any spatial access method (e.g., an R-tree) can
    then be used for range/approximate queries
    (see the sketch below)
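A minimal Python sketch of this pipeline, assuming Euclidean distance and f = 2 complex DFT coefficients kept per sequence; the helper names extract_features and whole_match are illustrative, and a real system would index the features with an R-tree rather than scan them linearly:

import numpy as np

def extract_features(seq, f=2):
    """First f coefficients of the unitary DFT, flattened to 2f real features."""
    coeffs = np.fft.fft(np.asarray(seq, dtype=float)) / np.sqrt(len(seq))
    return np.concatenate([coeffs[:f].real, coeffs[:f].imag])

def whole_match(data_seqs, query, eps, f=2):
    """Range query over equal-length sequences: filter in feature space,
    then discard false alarms using the true distance."""
    q_feat = extract_features(query, f)
    candidates = [i for i, s in enumerate(data_seqs)
                  if np.linalg.norm(extract_features(s, f) - q_feat) <= eps]
    return [i for i in candidates
            if np.linalg.norm(np.asarray(data_seqs[i], dtype=float)
                              - np.asarray(query, dtype=float)) <= eps]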

7
Mathematical Background
  • Lemma 1
  • To guarantee no false dismissals for range
    queries, the feature extraction function F()
    should satisfy the following formula:
  • D_feature(F(O1), F(O2)) ≤ D_object(O1, O2)
  • False dismissal: a qualifying sequence is
    discarded, BAD
  • False alarm: a non-qualifying sequence is not
    discarded, not so bad (it can be removed by a
    post-processing check)

8
Discrete Fourier Transform
  • Theorem (Parseval)
  • Σ_{i=0..n-1} x_i² = Σ_{f=0..n-1} |X_f|²  (distance
    preserving)
  • The DFT is a linear transform, so it can be proved
    that
  • the DFT satisfies Lemma 1
  • We keep the first few (2-3) coefficients as
    features
  • Properties: 1. Only false alarms, no false
    dismissals
  • 2. In practice, false alarms
    are few (see the check below)
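A quick numerical check of these properties, using numpy's FFT scaled by 1/sqrt(n) so that it is unitary (an illustrative sketch, not from the paper): the full-spectrum distance equals the time-domain distance, and the distance over the first few coefficients can only under-estimate it.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(64), rng.standard_normal(64)

X = np.fft.fft(x) / np.sqrt(len(x))   # unitary DFT of x
Y = np.fft.fft(y) / np.sqrt(len(y))   # unitary DFT of y

true_dist    = np.linalg.norm(x - y)
full_dist    = np.linalg.norm(X - Y)          # Parseval: equals true_dist
feature_dist = np.linalg.norm(X[:3] - Y[:3])  # first 3 coefficients only

print(np.isclose(true_dist, full_dist))   # True: distance preserving
print(feature_dist <= true_dist)          # True: only false alarms possible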

9
From Whole Matching to Subsequence Matching
  • Question
  • How can the method be generalized to
    approximate-match queries for subsequences of
    arbitrary length?

10
Subsequence Matching: Criteria
  • Desirable criteria
  • Fast: sequential scanning with a distance
    calculation at each and every possible offset is
    too slow for large databases
  • Correct: no false dismissals; false
    alarms are acceptable
  • Small space overhead
  • Dynamic
  • Varying lengths for data and query sequences

11
Proposed Method
  • Use a sliding window of length w, the minimum
    query length
  • A data sequence of length Len(S) is mapped to
    a trail in feature space, consisting of
    Len(S) - w + 1 points (see the sketch below)
  • Sub-Trail-index (ST-index)
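A minimal sketch of the trail construction (the helper name trail is an assumption; extract_features is the DFT feature function sketched on the Whole Matching slide):

def trail(seq, w, f=2):
    """Feature-space trail of a data sequence: one point per sliding-window
    position, Len(seq) - w + 1 points in total."""
    return [extract_features(seq[i:i + w], f)
            for i in range(len(seq) - w + 1)]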

12
I-naïve method
  • The straightforward way is to
  • keep track of the individual points of each
    trail, storing them in a spatial access method
  • Disadvantage
  • Inefficient, since almost every point in a data
    sequence will correspond to a point in the
    f-dimensional feature space

13
I-naïve method (cont'd)
  • How to improve?
  • Observation: the contents of the sliding window at
    nearby offsets will be similar
  • Solution: divide the trail into sub-trails and
    represent each of them with its Minimum Bounding
    Rectangle (MBR); only a few MBRs need to be stored,
    while still guaranteeing no false dismissals
    (see the sketch below)
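A rough sketch of representing a sub-trail by its MBR (illustrative, with numpy):

import numpy as np

def mbr(points):
    """(low, high) corners of the Minimum Bounding Rectangle of a sub-trail,
    i.e., per-dimension minima and maxima over its feature-space points."""
    pts = np.asarray(points)
    return pts.min(axis=0), pts.max(axis=0)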

14
Illustration
15
MBR Property
  • Each MBR corresponds to a whole sub-trail, i.e.,
    points in feature space that correspond to
    successive positions of the sliding window
  • Each leaf-level MBR has t_start and t_end, the
    offsets of the first and last such positions, and
    also a unique identifier for the data sequence
    (sequence_id)
  • The extent of the MBR in each dimension is
    denoted as
  • (F1low, F1high, F2low, F2high, ...)
  • MBRs are stored in an R-tree (see the sketch below)
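An illustrative leaf-entry record based on this description (the field names and exact layout are assumptions, not the paper's structure):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class LeafMBR:
    t_start: int               # offset of the first sliding-window position
    t_end: int                 # offset of the last sliding-window position
    sequence_id: str           # unique identifier of the data sequence
    low: Tuple[float, ...]     # (F1low, F2low, ...) per-dimension minima
    high: Tuple[float, ...]    # (F1high, F2high, ...) per-dimension maxima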

16
Figure 2: Structure of a leaf node and a non-leaf
node (index node layout for the last two levels)
17
ST-index
  • There are two questions for the ST-index
  • Insertion (the dynamic requirement)
  • When a new data sequence is inserted, what is a
    good way to divide its trail into sub-trails?
  • Queries longer than w
  • How to handle queries, especially the ones
    longer than w?

18
ST-index Insertion
19
Illustration
20
I-adaptive heuristic
  • Cost function
  • DA(L) = Π_{i=1..n} (L_i + 0.5), where L = (L1, L2, ..., Ln)
  • Marginal cost of a point
  • Consider a sub-trail of k points with an MBR of
    sides L1, ..., Ln; each point in this sub-trail has
  • mc = DA(L) / k (see the sketch below)
21
I-adaptive heuristic algorithm
  • /* Algorithm Divide-to-Subtrails */
  • Assign the first point of the trail to a
    (trivial) sub-trail
  • FOR each successive point
  • IF it increases the marginal cost of the current
    sub-trail
  • THEN start another sub-trail
  • ELSE include it in the current sub-trail
    (a Python rendering follows below)
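A hedged Python rendering of this heuristic, building on the mbr() and marginal_cost() sketches above (details are assumptions; the points are numpy feature vectors such as those produced by trail()):

import numpy as np

def divide_to_subtrails(trail_points):
    """Split a feature-space trail into sub-trails (lists of points)."""
    subtrails = [[trail_points[0]]]            # first point: trivial sub-trail
    for p in trail_points[1:]:
        current = subtrails[-1]
        low, high = mbr(current)
        cost_now = marginal_cost(high - low, len(current))
        low_p, high_p = mbr(current + [p])
        cost_with_p = marginal_cost(high_p - low_p, len(current) + 1)
        if cost_with_p > cost_now:             # marginal cost would increase,
            subtrails.append([p])              # so start another sub-trail
        else:
            current.append(p)                  # else grow the current sub-trail
    return subtrails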

22
Searching: Queries longer than w
  • Two methods
  • PrefixSearch
  • Select the prefix of Q of length w and match the
    prefix within tolerance ε
  • MultiPiece Search
  • Suppose the query sequence has length p·w
  • Break Q into p sub-queries, which correspond to p
    spheres in feature space with radius ε/sqrt(p)
  • Use the ST-index to retrieve the sub-trails whose
    MBRs intersect at least one of the sub-query
    regions (see the sketch below)
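An illustrative sketch of preparing the MultiPiece sub-queries (the helper name is an assumption); each retrieved sub-trail would still be checked against the original query to discard false alarms:

import math

def multipiece_subqueries(Q, w, eps):
    """Break a query of length p*w into p sub-queries of length w, each to
    be searched in feature space with the reduced tolerance eps / sqrt(p)."""
    p = len(Q) // w
    sub_eps = eps / math.sqrt(p)
    return [(Q[i * w:(i + 1) * w], sub_eps) for i in range(p)]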

23
Prefix vs. MultiPiece search
  • Volume required in feature space (K is a
    constant)
  • PrefixSearch: K·ε^f
  • MultiPiece: K·p·(ε/sqrt(p))^f = K·p^(1-f/2)·ε^f
  • MultiPiece is likely to produce fewer false
    alarms, since p^(1-f/2) < 1 for f > 2

24
Conclusions
  • The main contribution is the I-adaptive method,
  • which achieves orders-of-magnitude savings over
    sequential scanning
  • Small space overhead
  • It is dynamic
  • No false dismissals
  • Future work
  • Extend this method to 2-dimensional gray-scale
    images, and in general to n-dimensional
    vector fields (e.g., 3-d MRI brain scans)

25
The End
  • Thank you for your attention!