Loading...

PPT – Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects PowerPoint presentation | free to download - id: 53c328-OTFkO

The Adobe Flash plugin is needed to view this content

Fast Algorithms for Time Series with applications

to Finance, Physics, Music and other Suspects

Dennis Shasha Joint work with Yunyue Zhu,

Xiaojian Zhao, Zhihua Wang, and Alberto

Lerner shasha,yunyue, xiaojian, zhihua,

lerner_at_cs.nyu.edu Courant Institute, New

York University

Goal of this work

- Time series are important in so many applications

biology, medicine, finance, music, physics, - A few fundamental operations occur all the time

burst detection, correlation, pattern matching. - Do them fast to make data exploration faster,

real time, and more fun.

Sample Needs

- Pairs Trading in Finance find two stocks that

track one another closely. When they go out of

correlation, buy one and sell the other. - Match a persons humming against a database of

songs to help him/her buy a song. - Find bursts of activity even when you dont know

the window size over which to measure. - Query and manipulate ordered data.

Why Speed Is Important

- Person on the street As processors speed up,

algorithmic efficiency no longer matters - True if problem sizes stay same.
- They dont. As processors speed up, sensors

improve e.g. satellites spewing out a terabyte

a day, magnetic resonance imagers give higher

resolution images, etc. - Desire for real time response to queries.

Surprise, surprise

- More data, real-time response, increasing

importance of correlation - IMPLIES
- Efficient algorithms and data management more

important than ever!

Corollary

- Important area, lots of new problems.
- Small advertisement High Performance Discovery

in Time Series (Springer 2004). At this

conference.

Outline

- Correlation across thousands of time series
- Query by humming correlation shifting
- Burst detection when you dont know window size
- Aquery a query language for time series.

Real-time Correlation Across Thousands (and

scaling) of Time Series

Scalable Methods for Correlation

- Compress streaming data into moving synopses.
- Update the synopses in constant time.
- Compare synopses in near linear time with respect

to number of time series. - Use transforms simple data structures.
- (Avoid curse of dimensionality.)

GEMINI framework

Faloutsos, C., Ranganathan, M. Manolopoulos,

Y. (1994). Fast subsequence matching in

time-series databases. In proceedings of the ACM

SIGMOD Int'l Conference on Management of Data.

Minneapolis, MN, May 25-27. pp 419-429.

StatStream (VLDB,2002) Example

- Stock prices streams
- The New York Stock Exchange (NYSE)
- 50,000 securities (streams) 100,000 ticks (trade

and quote) - Pairs Trading, a.k.a. Correlation Trading
- Querywhich pairs of stocks were correlated with

a value of over 0.9 for the last three hours?

XYZ and ABC have been correlated with a

correlation of 0.95 for the last three hours. Now

XYZ and ABC become less correlated as XYZ goes up

and ABC goes down. They should converge back

later. I will sell XYZ and buy ABC

Online Detection of High Correlation

- Given tens of thousands of high speed time series

data streams, to detect high-value correlation,

including synchronized and time-lagged, over

sliding windows in real time. - Real time
- high update frequency of the data stream
- fixed response time, online

Online Detection of High Correlation

- Given tens of thousands of high speed time series

data streams, to detect high-value correlation,

including synchronized and time-lagged, over

sliding windows in real time. - Real time
- high update frequency of the data stream
- fixed response time, online

Online Detection of High Correlation

- Given tens of thousands of high speed time series

data streams, to detect high-value correlation,

including synchronized and time-lagged, over

sliding windows in real time. - Real time
- high update frequency of the data stream
- fixed response time, online

StatStream Naïve Approach

- Goal find most highly correlated stream pairs

over sliding windows - N number of streams
- w size of sliding window
- space O(N) and time O(N2w) .
- Suppose that the streams are updated every

second. - With a Pentium 4 PC, the exact computing method

can monitor only 700 streams, where each result

is produced with a separation of two minutes. - Note Punctuated result model not continuous,

but online.

StatStream Our Approach

- Use Discrete Fourier Transform to approximate

correlation as in Gemini approach. - Every two minutes (basic window size), update

the DFT for each time series over the last hour

(window size) - Use grid structure to filter out unlikely pairs
- Our approach can report highly correlated pairs

among 10,000 streams for the last hour with a

delay of 2 minutes. So, at 202, find highly

correlated pairs between 1 PM and 2 PM. At 204,

find highly correlated pairs between 102 and

202 PM etc.

StatStream Stream synoptic data structure

- Three level time interval hierarchy
- Time point, Basic window, Sliding window
- Basic window (the key to our technique)
- The computation for basic window i must finish by

the end of the basic window i1 - The basic window time is the system response

time. - Digests

StatStream Stream synoptic data structure

- Three level time interval hierarchy
- Time point, Basic window, Sliding window
- Basic window (the key to our technique)
- The computation for basic window i must finish by

the end of the basic window i1 - The basic window time is the system response

time. - Digests

Basic window digests sum DFT coefs

StatStream Stream synoptic data structure

- Three level time interval hierarchy
- Time point, Basic window, Sliding window
- Basic window (the key to our technique)
- The computation for basic window i must finish by

the end of the basic window i1 - The basic window time is the system response

time. - Digests

Basic window digests sum DFT coefs

Sliding window digests sum DFT coefs

StatStream Stream synoptic data structure

- Three level time interval hierarchy
- Time point, Basic window, Sliding window
- Basic window (the key to our technique)
- The computation for basic window i must finish by

the end of the basic window i1 - The basic window time is the system response

time. - Digests

Basic window digests sum DFT coefs

Sliding window digests sum DFT coefs

StatStream Stream synoptic data structure

- Three level time interval hierarchy
- Time point, Basic window, Sliding window
- Basic window (the key to our technique)
- The computation for basic window i must finish by

the end of the basic window i1 - The basic window time is the system response

time. - Digests

Basic window digests sum DFT coefs

Basic window digests sum DFT coefs

Basic window digests sum DFT coefs

Time point

Basic window

How general technique is applied

- Compress streaming data into moving synopses

Discrete Fourier Transform. - Update the synopses in time proportional to

number of coefficients basic window idea. - Compare synopses in real time compare DFTs.
- Use transforms simple data structures (grid

structure).

Synchronized Correlation Uses Basic Windows

- Inner-product of aligned basic windows

Stream x

Stream y

Basic window

Sliding window

- Inner-product within a sliding window is the sum

of the inner-products in all the basic windows in

the sliding window.

Approximate Synchronized Correlation

- Approximate with an orthogonal function family

(e.g. DFT)

x1 x2 x3 x4 x5

x6 x7 x8

Approximate Synchronized Correlation

- Approximate with an orthogonal function family

(e.g. DFT)

x1 x2 x3 x4 x5

x6 x7 x8

Approximate Synchronized Correlation

- Approximate with an orthogonal function family

(e.g. DFT)

x1 x2 x3 x4 x5

x6 x7 x8

Approximate Synchronized Correlation

- Approximate with an orthogonal function family

(e.g. DFT) - Inner product of the time series Inner

product of the digests - The time and space complexity is reduced from

O(b) to O(n). - b size of basic window
- n size of the digests (nltltb)
- e.g. 120 time points reduce to 4 digests

x1 x2 x3 x4 x5

x6 x7 x8

Approximate lagged Correlation

- Inner-product with unaligned windows

- The time complexity is reduced from O(b) to O(n2)

, as opposed to O(n) for synchronized

correlation. Reason terms for different

frequencies are non-zero in the lagged case.

Grid Structure(to avoid checking all pairs)

- The DFT coefficients yields a vector.
- High correlation gt closeness in the vector space
- We can use a grid structure and look in the

neighborhood, this will return a super set of

highly correlated pairs.

Empirical Study Speed

Our algorithm is parallelizable.

Empirical Study Accuracy

- Approximation errors
- Larger size of digests, larger size of sliding

window and smaller size of basic window give

better approximation - The approximation errors (mistake in correlation

coef) are small.

Sketches Random Projection

- Correlation between time series of the returns of

stock - Since most stock price time series are close to

random walks, their return time series are close

to white noise - DFT/DWT cant capture approximate white noise

series because the energy is distributed across

many frequency components. - Solution Sketches (a form of random landmark)
- Sketch pool list of random vectors drawn from

stable distribution - Sketch The list of inner products from a data

vector to the sketch pool. - The Euclidean distance (correlation) between time

series is approximated by the distance between

their sketches with a probabilistic guarantee.

- W.B.Johnson and J.Lindenstrauss. Extensions of

Lipshitz mapping into hilbert space. Contemp.

Math.,26189-206,1984 - D. Achlioptas. Database-friendly random

projections. In Proceedings of the twentieth ACM

SIGMOD-SIGACT-SIGART symposium on Principles of

database systems, ACM Press,2001

Sketches Intuition

- You are walking in a sparse forest and you are

lost. - You have an old-time cell phone without GPS.
- You want to know whether you are close to your

friend. - You identify yourself as 100 meters from the

pointy rock, 200 meters from the giant oak etc. - If your friend is at similar distances from

several of these landmarks, you might be close to

one another. - The sketch is just the set of distances.

Sketches Random Projection

inner product

sketches

random vector

raw time series

Sketches approximate distance well (Real

distance/sketch distance)

(Sliding window size256 and sketch size80)

Empirical Study Sketch on Price and Return Data

- DFT and DWT work well for prices (todays price

is a good predictor of tomorrows) - But badly for returns (todayprice

yesterdayprice)/todayprice. - Data length256 and the first 14 DFT coefficients

are used in the distance computation, db2 wavelet

is used here with coefficient size16 and sketch

size is 64

Empirical Comparison DFT, DWT and Sketch

Empirical Comparison DFT, DWT and Sketch

Sketch Guarantees

- Note Sketches do not provide approximations of

individual time series window but help make

comparisons.

- Johnson-Lindenstrauss Lemma
- For any and any integer n, let k be

a positive integer such that - Then for any set V of n points in , there

is a map such that for all - Further this map can be found in randomized

polynomial time

Overcoming curse of dimensionality

- May need many random projections.
- Can partition sketches into disjoint pairs or

triplets and perform comparisons on those. - Each such small group is placed into an index.
- Algorithm must adapt to give the best results.

Idea from P.Indyk,N.Koudas, and S.Muthukrishnan.

Identifying representative trends in massive

time series data sets using sketches. VLDB 2000.

Inner product with random vectors

r1,r2,r3,r4,r5,r6

(No Transcript)

Further Performance Improvements

- -- Suppose we have R random projections of window

size WS. - -- Might seem that we have to do RWS work for

each timepoint for each time series. - -- In ongoing work with colleague Richard Cole,

we show that we can cut this down by use of

convolution and an oxymoronic notion of

structured random vectors.

Idea from Dimitris Achlioptas,

Database-friendly Random Projections,

Proceedings of the twentieth ACM

SIGMOD-SIGACT-SIGART symposium on Principles of

database systems

Empirical Study Speed

- Sketch/DFTGrid structure
- Sliding Window Size3616, basic window size32
- Correlationgt0.9

Empirical Study Breakdown

Empirical Study Breakdown

Query by humming Correlation Shifting

Query By Humming

- You have a song in your head.
- You want to get it but dont know its title.
- If youre not too shy, you hum it to your friends

or to a salesperson and you find it. - They may grimace, but you get your CD?

With a Little Help From My Warped Correlation

- Karens humming Match
- Denniss humming Match
- What would you do if I sang out of tune?"
- Yunyues humming Match

Related Work in Query by Humming

- Traditional method String Matching

Ghias et. al. 95, McNab

et.al. 97,Uitdenbgerd and Zobel 99 - Music represented by string of pitch directions

U, D, S (degenerated interval) - Hum query is segmented to discrete notes, then

string of pitch directions - Edit Distance between hum query and music score
- Problem
- Very hard to segment the hum query
- Partial solution users are asked to hum

articulately - New Method matching directly from audio

Mazzoni and Dannenberg 00 - We use both.

Time Series Representation of Query

Segment this!

- An example hum query
- Note segmentation is hard!

How to deal with poor hummers?

- No absolute pitch
- Solution the average pitch is subtracted
- Inaccurate pitch intervals
- Solution return the k-nearest neighbors
- Incorrect overall tempo
- Solution Uniform Time Warping
- Local timing variations
- Solution Dynamic Time Warping
- Bottom line timing variations take us beyond

Euclidean distance.

Dynamic Time Warping

- Euclidean distance sum of point-by-point

distance - DTW distance allowing stretching or squeezing

the time axis locally

Envelope Transform using Piecewise Aggregate

Approximation(PAA) Keogh VLDB 02

Envelope Transform using Piecewise Aggregate

Approximation(PAA)

- Advantage of tighter envelopes
- Still no false negatives, and fewer false

positives

Container Invariant Envelope Transform

- Container-invariant A transformation T for

envelope such that - Theorem if a transformation is

Container-invariant and Lower-bounding, then the

distance between transformed times series x and

transformed envelope of y lower bound their DTW

distance.

Framework to Scale Up for Large Database

Improvement by Introducing Humming with ta

- Solve the problem of note segmentation
- Compare humming with la and ta

Idea from N. Kosugi et al A pratical

query-by-humming system for a large music

database ACM Multimedia 2000

Improvement by Introducing Humming with ta(2)

- Still use DTW distance to tolerate poor humming
- Decrease the size of time series by orders of

magnitude. - Thus reduce the computation of DTW distance

Statistics-Based Filters

- Low dimensional statistic feature comparison
- Low computation cost comparing to DTW distance
- Quickly filter out true negatives
- Example
- Filter out candidates whose note length is much

larger/smaller than the querys note length - More
- Standard Derivation of note value
- Zero crossing rate of note value
- Number of local minima of note value
- Number of local maxima of note value

Intuition from Erling Wold et al Content-based

classification, search and retrieval of audio

IEEE Multimedia 1996 http//www.musclefish.com

Boosting Statistics-Based Filters

- Characteristics of statistics-based filters
- Quick but weak classifier
- May have false negatives/false positives
- Ideal candidates for boosting
- Boosting
- An algorithm for constructing a strong

classifier using only a training set and a set of

weak classification algorithm - A particular linear combination of these weak

classifiers is used as the final classifier which

has a much smaller probability of

misclassification

Cynthia Rudin et al On the Dynamics of

Boosting In Advances in Neural Information

Processing Systems 2004

Verify Rhythm Alignment in the Query Result

- Nearest-N search only used melody information
- Will A. Arentz et al suggests combining rhythm

and melody - Results are generally better than using only

melody information - Not appropriate when the sum of several notes

duration in the query may be related to duration

of one note in the candidate - Our method
- First use melody information for DTW distance

computing - Merge durations appropriately based on the note

alignment - Reject candidates which have bad rhythm alignment

Will Archer Arentz Methods for retrieving

musical information based on rhythm and pitch

correlation CSGSC 2003

Experiments Setup

- Data Set
- 1049 songs Beatles, American Rock and Pop, one

Chinese song - 73,051 song segments
- Query Algorithms Comparison
- TS matching pitch-contour time series using DTW

distance and envelope filters - NS matching ta-based note-sequence using DTW

distance and envelope filters plus length and

standard-derivation based filters - NS2 NS plus boosted statistics-based filters and

alignment verifiers - Training Set Human humming query set
- 17 ta-style humming clips from 13 songs (half

are Beatles songs) - The number of notes varies from 8 to 25, average

15 - The matching song segments are labeled by

musicians - Test Set Simulated queries
- Queries are generated based on 1000 randomly

selected segments in database - The following error types are simulated

different keys/tempos, inaccurate pitch

intervals, varying tempos, missing/extra notes

and background noise

The error model is experimental and the error

parameters are based on the analysis of training

set

Experiments Result

- NS performs better than TS when the scale is

large - NS2 is roughly 610 times faster than NS with a

small loss of hit rate at about 1 - NS2 can achieve
- A high hit rate 97.0 (Top10)
- Quick responses 0.42 seconds on average , 2

seconds in a worst case scenario

Query by Humming Demo

- 1039 songs (73051 note/duration sequences)

Burst detection when window size is unknown

Burst Detection Applications

- Discovering intervals with unusually large

numbers of events. - In astrophysics, the sky is constantly observed

for high-energy particles. When a particular

astrophysical event happens, a shower of

high-energy particles arrives in addition to the

background noise. Might last milliseconds or

days - In telecommunications, if the number of packages

lost within a certain time period exceeds some

threshold, it might indicate some network

anomaly. Exact duration is unknown. - In finance, stocks with unusual high trading

volumes should attract the notice of traders (or

perhaps regulators).

Bursts across different window sizes in Gamma Rays

- Challenge to discover not only the time of the

burst, but also the duration of the burst.

Burst Detection Challenge

- Single stream problem.
- What makes it hard is we are looking at multiple

window sizes at the same time. - Naïve approach is to do this one window size at a

time.

Elastic Burst Detection Problem Statement

- Problem Given a time series of positive numbers

x1, x2,..., xn, and a threshold function f(w),

w1,2,...,n, find the subsequences of any size

such that their sums are above the thresholds - all 0ltwltn, 0ltmltn-w, such that xm xm1 xmw-1

f(w) - Brute force search O(n2) time
- Our shifted binary tree (SBT) O(nk) time.
- k is the size of the output, i.e. the number of

windows with bursts

Burst Detection Data Structure and Algorithm

- Define threshold for node for size 2k to be

threshold for window of size 1 2k-1

Burst Detection Example

Burst Detection Example

True Alarm

False Alarm

Burst Detection Algorithm

- In linear time, determine whether any node in SBT

indicates an alarm. - If so, do a detailed search to confirm (true

alarm) or deny (false alarm) a real burst. - In on-line version of the algorithm, need keep

only most recent node at each level.

False Alarms (requires work, but no errors)

Empirical Study Gamma Ray Burst

Case Study Burst Detection(1)

Background Motivation In astrophysics, the sky

is constantly observed for high-energy particles.

When a particular astrophysical event happens, a

shower of high-energy particles arrives in

addition to the background noise. An unusual

event burst may signal an event interesting to

physicists.

Technical Overview 1.The sky is partitioned into

1800900 buckets. 2.14 Sliding window lengths are

monitored from 0.1s to 39.81s 3.The original

code implements the naive algorithm.

Case Study Burst Detection(2)

- The challenges
- 1.Vast amount of data
- 1800900 time series, so any trivial overhead

may be accumulated to become a nontrivial

expense. - 2. Unavoidable overheads of data transformations

- Data pre-processing such as fetching and storage

requires much work. - SBT trees have to be built no matter how many

sliding windows to be investigated. - Thresholds are maintained over time due to the

different background noises. - Hit on one bucket will affect its neighbours as

shown in the previous figure

Case Study Burst Detection(3)

- Our solutions
- 1. Combine near buckets into one to save space

and processing time. If any alarms reported for

this large bucket, go down to see each small

components (two level detailed search). - 2. Special implementation of SBT tree
- Build the SBT tree only including those levels

covering the sliding windows - Maintain a threshold tree for the sliding

windows and update it over time. - Fringe benefits
- 1. Adding window sizes is easy.
- 2. More sliding windows monitored also benefit

physicists.

Case Study Burst Detection(4)

Experimental results 1. Benefits improve with

more sliding windows. 2. Results consistent

across different data files. 3. SBT algorithm

runs 7 times faster than current algorithm. 4.

More improvement possible if memory limitations

are removed.

Extension to other aggregates

- SBT can be used for any aggregate that is

monotonic - SUM, COUNT and MAX are monotonically increasing
- the alarm threshold is aggregateltthreshold
- MIN is monotonically decreasing
- the alarm threshold is aggregateltthreshold
- Spread MAX-MIN
- Application in Finance
- Stock with burst of trading or quote(bid/ask)

volume (Hammer!) - Stock prices with high spread

Empirical Study Stock Price Spread Burst

Extension to high dimensions

Elastic Burst in two dimensions

- Population Distribution in the US

Can discover numeric thresholds from probability

threshold.

- Suppose that the moving sum of a time series is a

random variable from a normal distribution. - Let the number of bursts in the time series

within sliding window size w be So(w) and its

expectation be Se(w). - Se(w) can be computed from the historical data.
- Given a threshold probability p, we set the

threshold of burst f(w) for window size w such

that PrSo(w) f(w) p.

Find threshold for Elastic Bursts

- F(x) is the normal cumulative dens funct, so

little prob at ends - Red and blue line touch at negative x value.

Symmetric above.

F(x)

x

p

F-1(p)

Summary of Burst Detection

- Able to detect bursts on many different window

sizes in essentially linear time. - Can be used both for time series and for spatial

searching. - Can specify thresholds either with absolute

numbers or with probability of hit. - Algorithm is simple to implement and has low

constants (code is available). - Ok, its embarrassingly simple.

AQuery A Database System for Order

Time Series and DBMSs

- Usual approach is to store time series as a User

Defined Datatype (UDT) and provide methods for

manipulating it - Advantages
- series can be stored and manipulated by DBMS
- Disadvantages
- UDTs are somewhat opaque to the optimizer (it

misses opportunities) - Operations that mix series and regular data may

be awkward to write

The Case for Order1

- Series are sequences! Support them.
- Add order to multiset-based DBMS2,3,4
- AQuery is a query language (conservative SQL

extension) and a data model that supports order

from the ground up

1 A Call to Order, David Maier and Bennet

Vance, PODS93 2 SRQL Sorted Relational Query

Language, Ramakrishnan et al., SSDBM98 3 SEQ

A Model for Sequence Databases, Seshadri et al.,

ICDE95 4 OLAP Amendment to SQL1999

An Arrable-based Data Model

- Arrable a collection of named arrays with same

cardinality - Arrays vector of basic types or vectors thereof

(no further nesting!) - Arrables order may be declared
- Two ways to accommodate series horizontally

and vertically

Series

ts 1 2 5 9 13

ID IBM MSFT IBM IBM MSFT

Ticks

price 12.02 43.23 12.04 12.05 43.22

ts 1 5 9 2 13

ID IBM MSFT

price 12.02 12.04.12.05 43.23 43.22

ID,ts time series order

ts time series order

AQuerys New Language Elements

- ASSUMING ORDER clause
- Order is explicitly declared on a per-query basis
- All other clauses can count on order (and

preserve it) - Column-Oriented Semantics
- AQuery R.col in a query binds to an array/vector

(an entire column) - SQL R.col binds successively to the scalar field

values in a column - Built-in functions for order-dependent aggs over

columns - Can group by without aggregation

Moving Average over Arrables

sold 140 120 140 100 130

month 4 2 3 1 5

SELECT month, sold, avgs(3,sold) FROM Sales ASSUM

ING ORDER month

- Each query defines the data ordering it wants to

work with - Order is enforced in the beginning of the

query, right after the FROM clause - All expressions and clauses are order preserving

sold 100 120 140 140 130

month 1 2 3 4 5

Moving Average over Arrables

sold 100 120 140 140 130

month 1 2 3 4 5

SELECT month, sold, avgs(3,sold) FROM Sales ASSUM

ING ORDER month

- Variables are bound to an entire column at once,

as opposed to a succession of its values - Expressions are mappings from a list of columns

(arrays) into a column

Moving Average over Arrables

sold 100 120 140 140 130

3-avg 100 110 120 133 136

month 1 2 3 4 5

SELECT month, sold, avgs(3,sold) FROM Sales ASSUM

ING ORDER month

- Several built in vector-to-vector functions
- Avgs, for instance, takes a window size and a

column and returns the resulting moving average

column - Other v2v functions prev, next, first, last,

sums, mins, ..., and UDFs

Best-Profit Query

Find the best profit one could make by buying a

stock and selling it later in the same day using

Ticks(ID, price, tradeDate, ts)

- price 15 19 16 17 15 13 5 8 7 13 11 14 10 5

2 5

Formulating the Best-Profit Query

Find the best profit one could make by buying a

stock and selling it later in the same day using

Ticks(ID, price, tradeDate, ts)

Sell at 14

Buy at 5

- price 15 19 16 17 15 13 5 8 7 13 11 14 10 5

2 5 - mins(price)15 15 15 15 15 13 5 5 5 5 5 5 5 5

2 2 - 0 4 1 2 0 0 0 3 2 8 6 9 0 0

0 3

Best-Profit Query Comparison

- AQuery
- SELECT max(pricemins(price))
- FROM ticks ASSUMING timestamp
- WHERE IDS
- AND tradeDate1/10/03'

- SQL1999
- SELECT max(rdif)
- FROM (SELECT ID,tradeDate,
- price - min(price)
- OVER
- (PARTITION BY ID,

tradeDate - ORDER BY timestamp
- ROWS UNBOUNDED
- PRECEDING) AS rdif
- FROM Ticks ) AS t1
- WHERE IDS
- AND tradeDate1/10/03'

Best-Profit Query Comparison

- AQuery
- SELECT max(pricemins(price))
- FROM ticks ASSUMING timestamp
- WHERE IDS
- AND tradeDate1/10/03'

- SQL1999
- SELECT max(rdif)
- FROM (SELECT ID,tradeDate,
- price - min(price)
- OVER
- (PARTITION BY ID,

tradeDate - ORDER BY timestamp
- ROWS UNBOUNDED
- PRECEDING) AS rdif
- FROM Ticks ) AS t1
- WHERE IDS
- AND tradeDate1/10/03'

AQuery optimizer can push down the selection

simply by testing whether selection is order

dependent or not SQL1999 optimizer has to

figure out if the selection would alter semantics

of the windows (partition), a much harder

question.

Best-Profit Query Performance

MicroArrays in a Nutshell

- Device to analyze thousands of gene expressions

at once - Base mechanism to acquire data in a number of

genomic experiments - Generates large matrices and/or series that then

have to be analyzed, quite often with statistical

methods

each spot corresponds to a cDNA

Entire chip is exposed to stimuli (light,

nutrients, etc) and each spot reacts by producing

more mRNA (expressing) or less (repressing) than

base quantity

Generating Series in a Query

geneId g1 g2 g3 g4 ...

value 1.89 -0.5 1.3 0.8 ...

exprId 1 1 1 1 ...

SELECT geneID, value FROM RawExpr ASSUMING ORDER

geneID, exprID GROUP BY geneID

- AQuery allows grouping without aggregation
- Non-grouped columns become vector-fields
- Queries can manipulate vector- fields in the

same way it does other attributes

geneId g1 g2 g3 g4 ...

value 1.89 2.32 0.32 ... -0.5

1.22 1.03 ... 1.3 -0.38 -1.43 ... 0.8

0.79 3.23 ... ...

Taking Advantage of Ordering

SELECT geneID, value FROM RawExpr ASSUMING ORDER

geneID, exprID GROUP BY geneID

? geneID, value

each

Gby geneID

- Group by and sort share prefix
- Sort may be eliminated if RawExpr is in a

convenient order1,2 - Sort and Group by are commutative
- either sort all and then slice the
- groups, or group by and perform
- one smaller sort per group

sort geneID, exprID

RawExpr

1 Bringing Order to Query Optimization,

Slivinskas et al., SIGMOD Record 31(2) 2

Fundamental Techniques for Order Optimization,

Simmen et al., SIGMOD96

Sort Cost Can Be Reduced

Manipulating Series

geneId g1 g2 g3 g4 ...

value 1.89 2.32 0.32 ... -0.5

1.22 1.03 ... 1.3 -0.38 -1.43 ... 0.8

0.79 3.23 ... ...

SELECT T1.geneID, T2.geneID FROM SeriesExpr T1,

SeriesExpr T2 WHERE corr(T1.value, T2.value)gt 0.9

- corr() implements Pearsons correlation
- It takes two columns of vector- fields as

arguments - It returns a vector of correlation factors
- AQuery can be easily extended with other UDFs

AQuery Summary

- AQuery is a natural evolution from the Relational

Model - AQuery is a concise language for querying order
- Optimization possibilities are vast
- Applications to Finance, Physics, Biology,

Network Management, ...

Overall Summary

- Improved technology implies better sensors and

more data. - Real-time response is very useful and profitable

in applications ranging from astrophysics to

finance. - These require better algorithms and better data

management. - This is a great field for research and teaching.