Efficient Processing of XML Update Streams - PowerPoint PPT Presentation

Provided by: leonidas7
Learn more at: https://lambda.uta.edu

1
Efficient Processing of XML Update Streams
  • Leonidas Fegaras
  • University of Texas at Arlington

2
Data Stream Processing
  • What is a data stream?
  • continuous, time-varying data arriving at
    unpredictable rates
  • continuous updates, long-running queries,
    continuous results
  • Sought characteristics of stream processing
    engines
  • real-time processing
  • high throughput, low latency, fast mean response
    time, low jitter
  • low memory footprint
  • Why bother?
  • many data are already available in stream form
  • sensor networks, network traffic monitoring,
    stock tickers
  • publisher-subscriber systems
  • data stream mining for fraud detection
  • data may be too volatile to index
  • continuous measurements

3
XML Stream Processing
  • Why XML?
  • There is no reason to normalize stream data
  • Various sources of XML streams
  • tokenized XML documents
  • RSS feeds
  • web service results
  • Granularity
  • XML tokens (events): <tag>, </tag>, text, etc.
  • region-encoded XML elements (e.g., based on
    pre-order numbering)
  • XML fragments (hole-filler model)
  • Push-based processing: SAX
  • Pull-based processing: StAX

4
Traditional Stream Processing
[Figure: input stream, output stream]
  • Works on streams that consist of numerical values
    or relational tuples
  • Focuses on a sliding window
  • fixed number of tuples, or
  • fixed time span
  • Calculates approximate results
  • Uses a small (bounded) state
  • Examples
  • top-k most frequent values
  • group-by SQL queries
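
A sliding-window computation of the kind listed above can be sketched in a few lines. This is a minimal Python sketch, assuming a count-based window (a fixed number of tuples); the function name `top_k_in_window` and its parameters are hypothetical and do not come from the slides.

```python
from collections import Counter, deque

def top_k_in_window(stream, window, k):
    """Yield the k most frequent values among the last `window` items."""
    recent = deque()      # the sliding window itself
    counts = Counter()    # bounded state: one counter per value in the window
    for value in stream:
        recent.append(value)
        counts[value] += 1
        if len(recent) > window:
            old = recent.popleft()       # the oldest item leaves the window
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        yield counts.most_common(k)

results = list(top_k_in_window("aabbbc", window=3, k=1))
```

The state never grows beyond the window size, which is why the answer describes only the recent past rather than the whole stream.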

[Figure: a stream engine with bounded state reads a sliding window of the stream, between past and future]

5
Our Goals
  • Handle continuous XQueries over continuous
    streamed XML data
  • Embedded updates in the streams
  • Exact rather than approximate answers
  • Produce continuous results, even when the results
    are not complete
  • Problem: most interesting operations are blocking
    and/or require unbounded state
  • grouping and aggregation
  • predicate evaluation
  • sorting
  • sequence concatenation
  • backward axis steps
  • We want to address the blocking problem
    differently
  • Display the current result of the blocking
    operation continuously in the form of an update
    stream
  • incoming vs. generated updates

6
Our View of XML Update Streams
  • A continuous (possibly infinite) sequence of XML
    tokens with embedded updates
  • Typically, a finite data stream followed by an
    infinite stream of updates
  • SAX-like events
  • three basic types of tokens: <tag>, </tag>,
    text
  • the target of an update is a stream subsequence
    that contains zero, one, or more complete XML
    elements
  • the source is also a token sequence that contains
    complete XML elements
  • updates are embedded in the data stream and can
    come at any time
  • update events can be interleaved with data events
    and with each other
  • each event must now have an ID to associate it
    with an update
  • updated regions can be updated too
  • to update a stream subsequence, you wrap it into
    a Mutable region
  • three types of updates
  • replace, insertBefore, insertAfter

7
Example
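  • id  Event
  • 1   (start tag)
  • 1   (start tag)
  • 2   startMutable(1)
  • 2   (start tag)
  • 2   X
  • 2   (end tag)
  • 2   endMutable(1)
  • 1   (end tag)
  • 3   startInsertBefore(2)
  • 3   (start tag)
  • 3   Y
  • 3   (end tag)
  • 3   endInsertBefore(2)
  • 1   (end tag)
  • equivalent to: the same stream with the element
    containing Y placed before the element containing X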
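
The update model above can be made concrete with a small interpreter that applies the embedded updates and returns the resulting token sequence. This is only a sketch under stated assumptions: events are encoded as `(id, kind, arg)` tuples, the id names the region an event belongs to, the argument of a start event names the region it refers to, and the tag names `<a>`, `<b>`, `<c>` are hypothetical stand-ins, since the original tag names are not recoverable from the transcript.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Region:
    items: list = field(default_factory=list)   # tokens (str) or nested Regions
    parent: "Region | None" = None

def materialize(events):
    """Apply the embedded updates and return the resulting tokens."""
    root = Region()
    regions = {1: root}                  # assumption: id 1 is the base stream
    for rid, kind, arg in events:
        if kind == "token":
            regions[rid].items.append(arg)
        elif kind == "startMutable":     # new region at the end of region `arg`
            parent = regions[arg]
            regions[rid] = Region(parent=parent)
            parent.items.append(regions[rid])
        elif kind in ("startInsertBefore", "startInsertAfter"):
            target = regions[arg]
            siblings = target.parent.items
            pos = next(i for i, x in enumerate(siblings) if x is target)
            regions[rid] = Region(parent=target.parent)
            offset = 0 if kind == "startInsertBefore" else 1
            siblings.insert(pos + offset, regions[rid])
        elif kind == "startReplace":     # the target's content becomes the new region
            target = regions[arg]
            regions[rid] = Region(parent=target)
            target.items = [regions[rid]]
        # end* events only close a region; nothing to do in this sketch

    def flatten(region):
        out = []
        for item in region.items:
            out.extend(flatten(item) if isinstance(item, Region) else [item])
        return out

    return flatten(root)

events = [
    (1, "token", "<a>"), (1, "token", "<b>"),
    (2, "startMutable", 1),
    (2, "token", "<c>"), (2, "token", "X"), (2, "token", "</c>"),
    (2, "endMutable", 1),
    (1, "token", "</b>"),
    (3, "startInsertBefore", 2),
    (3, "token", "<c>"), (3, "token", "Y"), (3, "token", "</c>"),
    (3, "endInsertBefore", 2),
    (1, "token", "</a>"),
]
snapshot = "".join(materialize(events))
```

Because every token carries the id of its region, events from different regions may be interleaved freely; the interpreter never needs a stack of open regions.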

8
Continuous Results
  • Our stream engine is implemented as a pipeline
  • each pipeline stage performs a very simple task
  • The final pipeline stage is the Result Display
    that displays the query results continuously
  • the display is an editable text window, where
    text can be inserted, deleted, and replaced at
    any point
  • when an update is coming in the input stream, it
    is propagated all the way to the result display,
    where it causes an update to the display text!

[Figure: input stream with updates, query pipeline, output stream with updates, result display]

9
Motivating Example
  • Group and order book titles by author
  • let $al := distinct-values(doc("bib.xml")//book/author)
  • return
  • for $a in $al order by $a
  • return

  • doc("bib.xml")//book[author = $a]/title
  • Multiple points of blocking
  • distinct-values
  • count
  • self-join
  • order-by

10
Motivating Example
  • Group and order book titles by author (same
    query as on slide 9)
  • The result display is refreshed continuously

display (currently, after the first book)
  name D: T1

input stream, one book per (author, title) pair
  (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6)
  (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...

11
Motivating Example
  • Group and order book titles by author (same
    query as on slide 9)
  • The result display is refreshed continuously

display (currently, after the second book)
  name A: T2
  name D: T1

input stream, one book per (author, title) pair
  (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6)
  (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...

12
Motivating Example
  • Group and order book titles by author (same
    query as on slide 9)
  • The result display is refreshed continuously

display (currently, after the third book)
  name A: T2, T3
  name D: T1

input stream, one book per (author, title) pair
  (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6)
  (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...

13
Motivating Example
  • Group and order book titles by author (same
    query as on slide 9)
  • The result display is refreshed continuously

display (currently, after the fourth book)
  name A: T2, T3
  name B: T4
  name D: T1

input stream, one book per (author, title) pair
  (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6)
  (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...

14
Motivating Example
  • Group and order book titles by author (same
    query as on slide 9)
  • The result display is refreshed continuously

display (currently, after the fifth book)
  name A: T2, T3
  name B: T4, T5
  name D: T1

input stream, one book per (author, title) pair
  (D, T1) (A, T2) (A, T3) (B, T4) (B, T5) (A, T6)
  (C, T7) (C, T8) (A, T9) (B, T10) (D, T11) ...

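
The sequence of display states on the preceding slides can be reproduced with a short simulation. The sketch below is not the pipeline from the talk: it simply rebuilds the grouped, ordered result after each book arrives, assuming books arrive as (author, title) pairs. The function name `display_snapshots` is hypothetical.

```python
from bisect import insort

def display_snapshots(books):
    """Yield the grouped, author-ordered result after each incoming book."""
    authors = []    # distinct authors, kept sorted
    titles = {}     # author -> titles in arrival order
    for author, title in books:
        if author not in titles:
            titles[author] = []
            insort(authors, author)      # place the new group in author order
        titles[author].append(title)
        yield [(a, list(titles[a])) for a in authors]

books = [("D", "T1"), ("A", "T2"), ("A", "T3"), ("B", "T4"), ("B", "T5")]
for snapshot in display_snapshots(books):
    print(snapshot)
```

An engine built on update streams would emit only the change between consecutive snapshots (an insertion before the D group, for example) instead of recomputing the whole result.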
15
Why?
  • Because this is what you really want to see as
    the result of a query
  • eg, in a stock ticker feed stream, where updates
    to ticker values come continuously
  • It leads to optimistic evaluation where
    results are displayed immediately, to be
    retracted or modified later when more information
    is available
  • addresses the blocking problem
  • we proceed without waiting, displaying the
    results so far, but later we may have to send
    updates
  • generalizes on-line aggregation

16
Optimistic Evaluation
  • Pessimistic evaluation: at all times, the query
    display must show the correct results up
    to that point
  • Optimistic evaluation: display any possible
    output without delay and later, if necessary,
    retract it or modify it to make it correct
  • complementary to, but different than, lazy
    evaluation
  • How?
  • Generated and incoming updates are propagated
    through the evaluation pipeline until they are
    processed by the display
  • They may cause changes to the states of the
    pipeline stages
  • Examples
  • Event counting: instead of waiting until we count
    all events, we generate updates that continuously
    display the counter so far
  • Predicate testing: assume the predicate is true,
    but when you later find that it is false, retract
    all output associated with this predicate
  • Sorting: wrap each element to be sorted in an
    update that inserts it into the correct place in
    the element sequence so far
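
The sorting example above can be illustrated as follows. This is a minimal sketch, assuming each incoming value becomes one insert-before update that names the region id of its successor in sorted order; the tuple layout and the names `sort_as_updates` and `apply_updates` are hypothetical.

```python
from bisect import bisect_right

def sort_as_updates(values):
    """Turn each incoming value into an update that places it in sorted order."""
    keys, ids = [], []      # parallel lists: sorted keys and their region ids
    for new_id, value in enumerate(values, start=1):
        pos = bisect_right(keys, value)
        target = ids[pos] if pos < len(ids) else None   # None: append at the end
        keys.insert(pos, value)
        ids.insert(pos, new_id)
        yield ("insertBefore", target, new_id, value)

def apply_updates(updates):
    """What a display would do: insert each region before its target."""
    shown = []              # (region id, value) pairs as currently displayed
    for _, target, new_id, value in updates:
        if target is None:
            pos = len(shown)
        else:
            pos = next(i for i, (rid, _v) in enumerate(shown) if rid == target)
        shown.insert(pos, (new_id, value))
    return [v for _, v in shown]

updates = list(sort_as_updates([3, 1, 2]))
final = apply_updates(updates)
```

Nothing is buffered until the end of the stream: every value is emitted immediately, and the ordering is achieved by where the update asks the display to put it.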

17
Contributions
  • Instead of eagerly performing the updates on
    cached portions of the stream, we propagate the
    updates through the pipeline
  • all the way to the query result display
  • the display prints the results continuously,
    replacing old results with new
  • Other approaches
  • display approximate answers continuously by
    focusing on a sliding window over the stream
  • Our approach
  • generates exact answers continuously in the form
    of an update stream
  • But the propagated updates may affect the state
    of the operators
  • we developed a uniform methodology to incorporate
    state change
  • Used this framework to unblock operations and
    reduce buffering
  • let the operations themselves embed new updates
    into the stream that retroactively perform the
    blocking parts of the operation
  • why? because later is often better than now

18
State Transformers
  • Each stage in the query evaluation pipeline is a
    state transformer
  • Input: a single event and a state S
  • Output: a sequence of events and a new state S'
  • Implemented as a function from an event to a
    sequence of events that destructively modifies
    the state
  • can be used in both pull- and push-based stream
    processing
  • The state transformers need only handle the basic
    SAX events <tag>, </tag>, text, and
    begin/end of stream
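
A state transformer as described above, a function from one event to a sequence of events that updates its state in place, might be sketched like this. The event encoding `(kind, value)` and the class names `TextOnly` and `Numbered` are assumptions made for illustration only.

```python
def push(event, stages):
    """Push one event through all stages, fanning out each stage's output."""
    batch = [event]
    for stage in stages:
        batch = [out for e in batch for out in stage(e)]
    return batch

class TextOnly:
    """Stateless stage: keep text events, drop everything else."""
    def __call__(self, event):
        return [event] if event[0] == "text" else []

class Numbered:
    """Stateful stage: prefix each text with a running number."""
    def __init__(self):
        self.seen = 0                     # the state S
    def __call__(self, event):
        self.seen += 1                    # destructive update: S becomes S'
        return [("text", f"{self.seen}:{event[1]}")]

def run(events, stages):
    out = []
    for e in events:
        out.extend(push(e, stages))
    return out

result = run(
    [("start", "a"), ("text", "x"), ("end", "a"), ("text", "y")],
    [TextOnly(), Numbered()],
)
```

The same stage objects would work in a pull-based loop; only the driver (`run` here) changes.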

[Figure: a state transformer and its state; stream labels: stream with regular and update events, stream with regular events]

19
Example: Element Counting
  • The state is an integer counter, count
  • A blocking state transformer, f(e)
  • if e is a text event
  • count = count + 1
  • return the empty sequence
  • else if e is end-of-stream
  • return the value of count
  • A non-blocking state transformer
  • if e is begin-of-stream
  • return startMutable(id), 0, endMutable(id)
  • else if e is a text event
  • count = count + 1
  • return startReplace(id), the value of count,
    endReplace(id)
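
The two counting transformers above translate almost directly into code. This is a sketch, assuming events are tuples whose first field is the kind (`"text"`, `"begin-of-stream"`, `"end-of-stream"`, and so on); that encoding and the region id `7` in the usage are hypothetical.

```python
class BlockingCount:
    """Emits nothing until the end of the stream."""
    def __init__(self):
        self.count = 0
    def __call__(self, event):
        if event[0] == "text":
            self.count += 1
            return []
        if event[0] == "end-of-stream":
            return [("text", str(self.count))]
        return []

class NonBlockingCount:
    """Shows 0 at once, then replaces it on every text event."""
    def __init__(self, region_id):
        self.count = 0
        self.id = region_id
    def __call__(self, event):
        if event[0] == "begin-of-stream":
            return [("startMutable", self.id), ("text", "0"), ("endMutable", self.id)]
        if event[0] == "text":
            self.count += 1
            return [("startReplace", self.id), ("text", str(self.count)),
                    ("endReplace", self.id)]
        return []

stream = [("begin-of-stream",), ("text", "a"), ("start", "b"),
          ("text", "c"), ("end-of-stream",)]
blocking, streaming = BlockingCount(), NonBlockingCount(region_id=7)
blocking_out = [o for e in stream for o in blocking(e)]
streaming_out = [o for e in stream for o in streaming(e)]
```

Both variants end with the same value; the non-blocking one reaches it through a series of replacements that a display can show as they arrive.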

20
XPath Steps
  • The state transformers of simple XPath forward
    steps are trivial to implement
  • Example: the Child step (/tag)
  • state
  • need a counter nest to keep track of the nesting
    depth, and
  • a flag pass to remember if we are currently
    passing through or discarding events
  • logic
  • when we see the event <tag> at nest = 1, we enter
    pass mode and stay there until we see </tag> at
    nest = 1
  • when in pass mode, we return the current event
  • otherwise, we return the empty sequence
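
The Child step logic above could look like the following. This is a sketch under assumptions: events are `(kind, value)` pairs with kinds `"start"`, `"end"`, and `"text"`, and `nest` counts the elements currently open, so the children of the context element are seen at `nest == 1`.

```python
class ChildStep:
    def __init__(self, tag):
        self.tag = tag
        self.nest = 0           # number of currently open elements
        self.passing = False    # the "pass" flag from the slide

    def __call__(self, event):
        kind, value = event
        if kind == "start":
            if self.nest == 1 and value == self.tag:
                self.passing = True          # entering a matching child
            self.nest += 1
            return [event] if self.passing else []
        if kind == "end":
            self.nest -= 1
            out = [event] if self.passing else []
            if self.nest == 1:
                self.passing = False         # the matching child is closed
            return out
        return [event] if self.passing else []

# <r><a>1</a><b><a>2</a></b><a>3</a></r>
doc = [("start", "r"), ("start", "a"), ("text", "1"), ("end", "a"),
       ("start", "b"), ("start", "a"), ("text", "2"), ("end", "a"), ("end", "b"),
       ("start", "a"), ("text", "3"), ("end", "a"), ("end", "r")]
step = ChildStep("a")
selected = [o for e in doc for o in step(e)]
```

The `a` element nested inside `b` is skipped because it starts at depth 2, not depth 1.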

21
Handling Updates
  • It would be cumbersome to modify each state
    transformer to handle incoming update events
  • Our solution: the update events are handled in
    the same way for all state transformers
  • Each state transformer is wrapped by a fixed
    function that handles update events by adjusting
    the states, while passing the regular events to
    the state transformer

[Figure: a fixed wrapper around the state transformer; update events drive state adjustment over the saved states and the current state, regular events go to the state transformer, and both input and output carry regular and update events]

22
Adjusting States
  • For each state transformer, we need to provide
    only one function
  • adjust(s1,s2,s3)
  • if state s2 is replaced by s3, adjust the
    succeeding state s1 accordingly
  • Example: the adjust function for element counting
    is
  • adjust(s1,s2,s3).count = s1.count + (s3.count - s2.count)
  • The adjust function for XPath steps is the
    identity
  • adjust(s1,s2,s3) = s1
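
The adjust functions above, together with a wrapper that calls them, can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes updates are properly nested (no interleaving), supports only replace, and uses hypothetical names (`CountState`, `TextCounter`, `UpdateWrapper`) and a hypothetical `(kind, arg)` event encoding.

```python
import copy
from dataclasses import dataclass

@dataclass
class CountState:
    count: int = 0

def adjust_count(s1, s2, s3):
    """State s2 was replaced by s3; shift the later state s1 by the difference."""
    return CountState(s1.count + (s3.count - s2.count))

def adjust_identity(s1, s2, s3):
    return s1                                # XPath steps: nothing to fix up

class TextCounter:
    def __init__(self):
        self.state = CountState()
    def __call__(self, event):
        if event[0] == "text":
            self.state.count += 1
        return []

class UpdateWrapper:
    """Handles update events itself and passes regular events to `t`."""
    def __init__(self, transformer, adjust):
        self.t, self.adjust = transformer, adjust
        self.saved = {}     # region id -> (state at region start, state at region end)
        self.open = []      # stack of (region id, saved state)
    def __call__(self, event):
        kind = event[0]
        if kind == "startMutable":
            self.open.append((event[1], copy.deepcopy(self.t.state)))
        elif kind == "endMutable":
            rid, start = self.open.pop()
            self.saved[rid] = (start, copy.deepcopy(self.t.state))
        elif kind == "startReplace":
            start, _ = self.saved[event[1]]
            self.open.append((event[1], self.t.state))   # suspend the current state
            self.t.state = copy.deepcopy(start)           # work with a state copy
        elif kind == "endReplace":
            rid, current = self.open.pop()
            start, old_end = self.saved[rid]
            new_end = self.t.state
            self.saved[rid] = (start, copy.deepcopy(new_end))
            self.t.state = self.adjust(current, old_end, new_end)
        else:
            return self.t(event)
        return [event]                                    # propagate the update

counter = TextCounter()
wrapped = UpdateWrapper(counter, adjust_count)
stream = ([("text", "a"), ("text", "b"), ("startMutable", 1), ("text", "c"),
           ("text", "d"), ("endMutable", 1), ("text", "e"), ("startReplace", 1)]
          + [("text", t) for t in "xyz"] + [("endReplace", 1)])
for e in stream:
    wrapped(e)
# 2 texts before the region + 3 in the replacement + 1 after it = 6
```

Here the region originally held two text events and is replaced by three, so the state that followed it (count 5) is adjusted by 5 - 4 = 1.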

23
Adjusting State for Element Counting
  • id  Event                  adjustment        element counting
  • 1   (start tag)
  • 1   (start tag)
  • 2   startMutable(1)        start state(2)    n, n+1
  • 2   (start tag)
  • 2   X
  • 2   (end tag)
  • 2   endMutable(1)          end state(2)      n+1, n+2
  • 1   (end tag)
  • 3   startInsertBefore(2)   start state(3)    n
  • 3   (start tag)
  • 3   Y
  • 3   (end tag)
  • 3   endInsertBefore(2)     end state(3)      n+1
  • 1   (end tag)

Events with id 2: work with the id 2 state copy
Events with id 3: work with the id 3 state copy

24
The Result Display
  • It is like any other state transformer, but it also
    performs side effects on the display screen
  • The state of an update id is its position on the
    screen
  • adjust(s1,s2,s3) = s1 + s3 - s2
  • Side effects
  • remove_text(start,end)
  • insert_text(position,text)
  • The state transformer is very simple
  • e.g., for an event, insert its text on the
    screen at the current position
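
The display's position bookkeeping can be tried out with a plain string standing in for the editable text window. This is a sketch: `remove_text` and `insert_text` follow the names on the slide, while the `Display` class, `adjust_position`, and `replace_region` are hypothetical helpers.

```python
class Display:
    """A string standing in for the editable text window."""
    def __init__(self):
        self.text = ""
    def insert_text(self, position, text):
        self.text = self.text[:position] + text + self.text[position:]
    def remove_text(self, start, end):
        self.text = self.text[:start] + self.text[end:]

def adjust_position(s1, s2, s3):
    """A region that ended at s2 now ends at s3; shift the later position s1."""
    return s1 + s3 - s2

def replace_region(display, start, old_end, new_text):
    """Replace the text in [start, old_end) and return the new end position."""
    display.remove_text(start, old_end)
    display.insert_text(start, new_text)
    return start + len(new_text)

screen = Display()
screen.insert_text(0, "count=7;done")
cursor = len(screen.text)                 # a later position, s1 = 12
new_end = replace_region(screen, 6, 7, "12")
cursor = adjust_position(cursor, 7, new_end)
```

After the replacement the region is one character longer, so every position recorded after it moves right by one.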

25
Problem
  • For each update id, each state transformer must
    keep a separate copy of the state
  • Will lead to space explosion for an infinite
    stream of updates
  • Not applicable to replacement updates, since our
    queries are snapshot, not historical
  • For incoming updates, we can ignore updates that
    are irrelevant to the query
  • Hard for content-based predicates
  • For generated updates, the scope of an update is
    usually limited
  • The scope is often known at run time
  • Allows the removal of out-of-scope states
  • eg, predicate testing

26
Unblocking XQuery Operations
  • We have used this technique for unblocking
  • concatenation
  • predicates
  • descendant
  • backward steps
  • sorting
  • In our preliminary results, many XQueries on
    large data sets had high throughput, required
    very little buffering, and (of course) had very
    fast first response time

27
Future Work
  • Plan to cover most XQuery features completely
  • Would like to handle historical queries
  • Same model for update streams
  • ... but now replacement updates may add a new
    version
  • Example
  • which stock increased its value by at least 10
    since the last update?
  • Need to extend the XQuery syntax with historical
    features
  • Need to cut off out-of-scope historical data