The Joy of SAX - PowerPoint PPT Presentation

About This Presentation
Title:

The Joy of SAX

Description:

The Joy of SAX Leonidas Fegaras University of Texas at Arlington fegaras_at_cse.uta.edu http://lambda.uta.edu/ Design Goals Want to build an XQuery engine based entirely ... – PowerPoint PPT presentation

Provided by: lambdaUta7
Learn more at: http://lambda.uta.edu
Tags: sax | joy | pull | push


Transcript and Presenter's Notes

Title: The Joy of SAX


1
The Joy of SAX
  • Leonidas Fegaras
  • University of Texas at Arlington
  • fegaras_at_cse.uta.edu
  • http://lambda.uta.edu/

2
Design Goals
  • Want to build an XQuery engine based entirely on
    SAX handlers
  • all the way from the point the input documents
    are read by the SAX parser up to the point the
    query results are printed
  • This engine should consist of operators that
  • naturally reflect the syntactic structures of
    XQuery and
  • can be composed into pipelines in the same way
    the corresponding XQuery structures are composed
    to form complex queries
  • The XQuery translation should be concise, clean,
    and completely compositional
  • Even though it cannot compete with transducers
    for simple XPaths, it should not sacrifice much
    on performance in terms of memory and
    computational overhead
  • But, ... it should be able to beat transducers
    for complex predicates and deeply nested queries

3
Pull-Based Approach
  • Based on iterators
  • class Iterator {
        Tuple current ();   // current tuple from stream
        void open ();       // open the stream iterator
        Tuple next ();      // get the next tuple from stream
        boolean eos ();     // is this the end of stream?
    }
  • An iterator reads data from the input stream(s)
    and delivers data to the output stream
  • Connected through pipelines
  • an iterator (the producer) delivers a stream
    element to the output only when requested by the
    next operator in pipeline (the consumer)
  • to deliver one stream element to the output, the
    producer becomes a consumer by requesting from
    the previous iterator as many elements as
    necessary to produce a single element, etc, until
    the end of stream
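
The pull-based pipeline described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names PullIterator, ListSource, Filter and PullDemo are hypothetical, the stream elements are plain integers rather than tuples, and next() is declared void here (the slide's next() returns the tuple).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical iterator interface modeled on the slide's open/next/eos/current.
interface PullIterator<T> {
    void open();        // (re)start the stream
    T current();        // element at the current position (valid only if !eos())
    void next();        // advance to the next element
    boolean eos();      // true once the stream is exhausted
}

// First producer in the pipeline: iterates over an in-memory list.
class ListSource<T> implements PullIterator<T> {
    private final List<T> items;
    private int pos = 0;
    ListSource(List<T> items) { this.items = items; }
    public void open() { pos = 0; }
    public T current() { return items.get(pos); }
    public void next() { pos++; }
    public boolean eos() { return pos >= items.size(); }
}

// Both consumer and producer: to deliver one element it pulls as many
// elements from its input as needed until one passes the test.
class Filter<T> implements PullIterator<T> {
    private final PullIterator<T> input;
    private final Predicate<T> test;
    Filter(PullIterator<T> input, Predicate<T> test) { this.input = input; this.test = test; }
    private void skip() {
        while (!input.eos() && !test.test(input.current())) input.next();
    }
    public void open() { input.open(); skip(); }
    public T current() { return input.current(); }
    public void next() { input.next(); skip(); }
    public boolean eos() { return input.eos(); }
}

class PullDemo {
    // The final consumer drives the whole pipeline by requesting elements.
    static List<Integer> run() {
        PullIterator<Integer> pipeline = new Filter<Integer>(
            new ListSource<Integer>(Arrays.asList(1, 2, 3, 4, 5, 6)), n -> n % 2 == 0);
        List<Integer> out = new ArrayList<>();
        for (pipeline.open(); !pipeline.eos(); pipeline.next()) out.add(pipeline.current());
        return out;
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

Nothing moves through the pipeline until the loop in run() asks for an element, which is the defining property of the pull style.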

4
What is a Tuple?
  • A vector of components
  • one component for each scoped for-variable
  • has a fixed size at each point in a pipeline
    (known at compile time)
  • doesn't need to include the variable names
  • A tuple component is the unit of communication
    between iterators
  • Passing fully constructed XML elements through
    iterators is a bad idea for a compositional
    translation
  • initially, we would have to pass the entire
    document as a tree!
  • The unit of communication should be
  • a single event or
  • a fragment (a reference to an XML element in a
    document)
  • this requires a structural index for fragments
  • A proposal for a pull parser: XML Pull Parser 3
  • www.xmlpull.org
  • BEA/XQRL token stream: token iterators

5
Event-Oriented Approach
  • A tuple in an event-oriented approach consists of
    a sequence of events, ending with an End-Of-Tuple
    (EOT) event
  • Single-node event sequence
  • depth-first unfolding of a single XML node
  • <start A>
  • <start B>
  • <text x>
  • <end B>
  • <start B>
  • <text y>
  • <end B>
  • <end A>
  • <text z>
  • <start A>
  • <start B>
  • <text w>
  • <end B>
  • <end A>
  • <EOT>

A tuple with 3 components
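
The event sequence above can be reproduced by a depth-first unfolding of each tuple component, followed by an EOT marker. The sketch below is a hedged illustration only: XNode, XElem, XText and TupleEvents are hypothetical names, and events are represented as plain strings rather than event objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory tree, used only to generate events.
abstract class XNode {
    abstract void unfold(List<String> out);
}

class XElem extends XNode {
    final String tag;
    final XNode[] children;
    XElem(String tag, XNode... children) { this.tag = tag; this.children = children; }
    // Depth-first: start event, then all children in order, then end event.
    void unfold(List<String> out) {
        out.add("<start " + tag + ">");
        for (XNode c : children) c.unfold(out);
        out.add("<end " + tag + ">");
    }
}

class XText extends XNode {
    final String text;
    XText(String text) { this.text = text; }
    void unfold(List<String> out) { out.add("<text " + text + ">"); }
}

class TupleEvents {
    // A tuple is the concatenation of its components' events, closed by EOT.
    static List<String> unfold(XNode... components) {
        List<String> out = new ArrayList<>();
        for (XNode c : components) c.unfold(out);
        out.add("<EOT>");
        return out;
    }

    // The three-component tuple from the slide: an A element, the text z, an A element.
    static List<String> example() {
        return unfold(
            new XElem("A", new XElem("B", new XText("x")), new XElem("B", new XText("y"))),
            new XText("z"),
            new XElem("A", new XElem("B", new XText("w"))));
    }

    public static void main(String[] args) {
        for (String e : example()) System.out.println(e);
    }
}
```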
6
Element vs Event Granularity
Stream unit is a single event:

    abstract class Event {}
    class Start extends Event { String tag; }
    class End extends Event { String tag; }
    class Text extends Event { String text; }
    class EOT extends Event {}

    class Child extends Iterator {
        Iterator input;
        String tagname;
        boolean keep = false;
        int nest = 0;
        Event next () {
            while (!input.eos()) {
                current = input.current();
                if (current instanceof Start) {
                    if (nest++ == 1)    // start of a child of the top-level element
                        keep = ((Start) current).tag.equals(tagname);
                } else if (current instanceof End)
                    if (nest-- == 1)    // end of the top-level element
                        keep = false;
                input.next();
                if (keep)
                    return current;
            }
        }
    }

Stream unit is a DOM-like element:

    abstract class Element {}
    class Node extends Element { String tag; Element[] sequence; }
    class Text extends Element { String text; }
    class Tuple { Element[] components; }

    class Child extends Iterator {
        Iterator input;
        String tagname;
        int index = 0;
        Tuple next () {
            while (!input.eos())
                if (input.current().get(0) instanceof Node) {
                    Node ce = (Node) input.current().get(0);
                    if (index < ce.sequence.length)
                        if (ce.sequence[index] instanceof Node
                            && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                            current = new Tuple(ce.sequence[index++]);
                            return current;
                        } else index++;
                    else { index = 0; input.next(); }
                } else { index = 0; input.next(); }
        }
    }
7
For-Loop using Iterators
  • Need a stepper for a for-loop

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
    }

    Tuple Loop.next () {
        if (!left.eos()) {
            while (right.eos()) {
                left.next();
                right_step.set(left.current());
                right.open();
            }
            current = left.current().append(right.current());
            right.next();
            return current;
        }
    }

Not a good idea if right reads a document!

[Figure: the Loop operator with its left input and its right pipeline; labels: Loop, left, right, right_step, right pipeline, set, Step]
8
Let-Bindings using Iterators
  • Let-bindings are harder to implement
  • the let-value may be a sequence
  • one producer -- many consumers
  • we do not want to materialize the let-value in
    memory

[Figure: a queue shared by several consumers; labels: queue, tail, head, fastest consumer, slowest consumer, backlog]
Some cases are hopeless: let $v := e return ($v,$v)
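
The one-producer, many-consumers queue on this slide can be sketched as follows. This is a minimal illustration under assumptions of my own: SharedQueue and LetDemo are hypothetical names, consumers are identified by small integers, and the buffer keeps only the elements the slowest consumer has not read yet (the backlog).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical buffer shared by all consumers of one let-bound stream.
class SharedQueue<T> {
    private final List<T> backlog = new ArrayList<>(); // not yet read by every consumer
    private long head = 0;                             // absolute position of backlog.get(0)
    private final long[] cursor;                       // absolute read position per consumer

    SharedQueue(int consumers) { cursor = new long[consumers]; }

    // Producer side: append at the tail.
    void push(T item) { backlog.add(item); }

    boolean hasNext(int consumer) { return cursor[consumer] < head + backlog.size(); }

    // Consumer side: read the next element, then drop whatever everyone has read.
    T next(int consumer) {
        if (!hasNext(consumer)) throw new NoSuchElementException();
        T item = backlog.get((int) (cursor[consumer] - head));
        cursor[consumer]++;
        trim();
        return item;
    }

    // The slowest consumer determines how much must stay in memory.
    private void trim() {
        long slowest = Long.MAX_VALUE;
        for (long c : cursor) slowest = Math.min(slowest, c);
        while (head < slowest) { backlog.remove(0); head++; }
    }

    int backlogSize() { return backlog.size(); }
}

class LetDemo {
    static int[] run() {
        SharedQueue<String> q = new SharedQueue<>(2);
        q.push("a"); q.push("b"); q.push("c");
        q.next(0); q.next(0); q.next(0);   // fastest consumer reads everything
        int before = q.backlogSize();      // nothing can be dropped yet
        q.next(1);                         // slowest consumer reads one element
        int after = q.backlogSize();       // one element has been dropped
        return new int[] { before, after };
    }
    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}
```

In the hopeless case above, the second use of $v cannot start before the first has finished, so the backlog grows to the whole sequence and the queue saves nothing.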
9
Push-based Pipelines
  • Unit of communication between pipelines
  • messages rather than events
  • Pipeline components are SAX-like event handlers
  • they are instances of Operator subclasses
  • abstract class Operator {
        void suspend ();
        void release ();
        void startDocument ( int node );
        void endDocument ( int node );
        Status endTuple ( int node );
        Status startElement ( int node, String tag );
        Status endElement ( int node, String tag );
        Status characters ( int node, String text );
    }
  • ('node' identifies a for-variable)

10
The Child Operator
  • class Child extends Operator {
        Operator next;
        String tagname;
        int nest = 0;
        boolean keep = false;
        Status startElement ( int node, String tag ) {
            if (nest++ == 1)
                keep = tagname.equals(tag);
            if (keep)
                return next.startElement(node,tag);
            else return invalid;
        }
    }
  • Example: document(...)/A/*//B

[Figure: pipeline Document -> Child A -> Any -> Descendant B -> Kick -> Print]
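
A push-based Child operator can be sketched end to end as below. The slide shows only startElement, so the endElement and characters handlers, the Status enum, the Collect sink and the PushDemo driver are assumptions added for illustration; suspend, release, document and tuple events are omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed result type; the slides only show the value "invalid".
enum Status { VALID, INVALID }

// Reduced version of the slide's Operator: element and text events only.
abstract class Operator {
    abstract Status startElement(int node, String tag);
    abstract Status endElement(int node, String tag);
    abstract Status characters(int node, String text);
}

// Forwards only the children of the top-level element that have the given tag.
class Child extends Operator {
    final Operator next;
    final String tagname;
    int nest = 0;
    boolean keep = false;
    Child(String tagname, Operator next) { this.tagname = tagname; this.next = next; }

    Status startElement(int node, String tag) {
        if (nest++ == 1) keep = tagname.equals(tag);   // a child of the top-level element starts
        return keep ? next.startElement(node, tag) : Status.INVALID;
    }
    Status endElement(int node, String tag) {
        Status s = keep ? next.endElement(node, tag) : Status.INVALID;
        if (--nest == 1) keep = false;                 // that child has just been closed
        return s;
    }
    Status characters(int node, String text) {
        return keep ? next.characters(node, text) : Status.INVALID;
    }
}

// Sink that records the events pushed into it.
class Collect extends Operator {
    final List<String> out = new ArrayList<>();
    Status startElement(int node, String tag) { out.add("<start " + tag + ">"); return Status.VALID; }
    Status endElement(int node, String tag) { out.add("<end " + tag + ">"); return Status.VALID; }
    Status characters(int node, String text) { out.add("<text " + text + ">"); return Status.VALID; }
}

class PushDemo {
    // Pushes <A><B>x</B><C>y</C><B>w</B></A> through Child("B").
    static List<String> run() {
        Collect sink = new Collect();
        Operator p = new Child("B", sink);
        p.startElement(0, "A");
        p.startElement(0, "B"); p.characters(0, "x"); p.endElement(0, "B");
        p.startElement(0, "C"); p.characters(0, "y"); p.endElement(0, "C");
        p.startElement(0, "B"); p.characters(0, "w"); p.endElement(0, "B");
        p.endElement(0, "A");
        return sink.out;
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

Here the producer drives the pipeline: each handler call either forwards the event to next or reports an invalid status to the caller.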
11
For-Loops
  • One thread per document reader
  • Need to queue one tuple from the outer stream
    each time
  • for $x in E1, $y in E2 return ...

    startElement, endElement, ...:
        if node = x, insert the event into Queue
        else emit the event to the output (next)
    endTuple:
        if node = x, suspend the outer stream and
            send all events in Queue to E2
        else emit all events in Queue to the output (next)
    endDocument:
        if node = y, clear Queue and release the outer stream

[Figure: for-loop pipeline; labels: E1, E2, For x, For y, outer, inner, Queue, Loop x, next]
  • Not a good idea if E2 reads a document
  • the document is read as many times as the tuples
    in E1
  • but we can cache the output of E2 and push the
    cached data instead

12
Other Issues
  • Let-bindings can be easily done using splitters
    (repeaters)
  • no caching is necessary
  • But, ... binary concatenation needs to cache the
    second stream
  • so, let $v := e return ($v,$v) is still
    hopeless
  • We don't need to cache path/FLWOR conditionals
  • the returned status of the condition events
    determines the predicate outcome (existential
    semantics)
  • initially, Predicate sends a suspend() event to
    the next stream and then the input events are
    propagated as is (to both pred and next)
  • if and when the predicate becomes true, the
    output is released

[Figure: the Predicate operator with its condition pipeline; labels: Predicate, condition, pred, next, Sink]
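
The suspend/release protocol for predicates can be illustrated with a much simplified sketch. Assumptions: events are plain strings, the condition is a Java predicate evaluated on each event instead of a separate pipeline whose returned status decides the outcome, and BufferingSink, PredicateOp and PredicateDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical output stage that can hold back events until released.
class BufferingSink {
    final List<String> output = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean suspended = false;

    void suspend() { suspended = true; }
    void release() { suspended = false; output.addAll(pending); pending.clear(); }
    void event(String e) { if (suspended) pending.add(e); else output.add(e); }
    // Events still held back at the end of a tuple are discarded:
    // the predicate never became true for that tuple.
    void endTuple() { pending.clear(); suspended = false; }
}

class PredicateOp {
    private final BufferingSink next;
    private final Predicate<String> condition;
    private boolean started = false;
    PredicateOp(Predicate<String> condition, BufferingSink next) {
        this.condition = condition; this.next = next;
    }
    void event(String e) {
        if (!started) { next.suspend(); started = true; }  // suspend before the first event
        next.event(e);                                     // propagate the event as is
        if (condition.test(e)) next.release();             // existential: one match is enough
    }
    void endTuple() { next.endTuple(); started = false; }
}

class PredicateDemo {
    static List<String> run() {
        BufferingSink sink = new BufferingSink();
        PredicateOp p = new PredicateOp(e -> e.equals("<text x>"), sink);
        for (String e : new String[] { "<start A>", "<text x>", "<end A>" }) p.event(e);
        p.endTuple();                                      // this tuple is released
        for (String e : new String[] { "<start A>", "<text y>", "<end A>" }) p.event(e);
        p.endTuple();                                      // this tuple is discarded
        return sink.output;
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

Only the events of the current tuple are held back, and only until the condition first succeeds, which is why no caching of the conditional itself is needed.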
13
So, to Pull or to Push?
  • For event streams, it doesn't really make a
    difference in terms of efficiency/storage
    requirements
  • a matter of programming style
  • push-based is a bit more difficult to program and
    harder to debug (threads)
  • But, ... if you want to use indexes, pulling is
    better
  • For indexing, fragments are a better alternative
    to events
  • fragment: a reference to an element in a
    document
  • a fragment corresponds to a tree node, and you
    need an index to access descendants
  • need to guarantee that indexes deliver fragments
    sorted, so that all stream operators can be
    implemented using merge joins
  • examples:
  • structural indexes based on region encoding or on
    preorder/postorder ranks
  • IR-style content-based inverted indexes
  • see my recent work on XQuery processing with
    relevance ranking
  • http://lambda.uta.edu/XQueryRank.pdf
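
To show why sorted fragments matter, here is a hedged sketch of a merge-style containment join over region-encoded fragments. It is not taken from the slides: Region, StructuralJoin and the (start, end) numbering are assumptions, and the join is simplified to return each descendant that lies inside at least one candidate ancestor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical fragment: an element identified by its region in the document.
final class Region {
    final String tag;
    final int start, end;   // an ancestor's region strictly encloses its descendants' regions
    Region(String tag, int start, int end) { this.tag = tag; this.start = start; this.end = end; }
}

class StructuralJoin {
    // Both inputs must be sorted by start. One forward pass over each list:
    // a descendant d qualifies if some ancestor with a smaller start has a larger end.
    static List<Region> containedIn(List<Region> ancestors, List<Region> descendants) {
        List<Region> out = new ArrayList<>();
        int i = 0;
        int maxEnd = Integer.MIN_VALUE;   // largest end among ancestors that start before d
        for (Region d : descendants) {
            while (i < ancestors.size() && ancestors.get(i).start < d.start) {
                maxEnd = Math.max(maxEnd, ancestors.get(i).end);
                i++;
            }
            if (maxEnd > d.end) out.add(d);   // output stays sorted by start
        }
        return out;
    }

    // Regions for <A><B/><C><B/></C></A><B/>: A(1,10) B(2,3) C(4,9) B(5,6) B(11,12).
    // The call below corresponds to //C//B.
    static List<Region> example() {
        List<Region> cs = Arrays.asList(new Region("C", 4, 9));
        List<Region> bs = Arrays.asList(
            new Region("B", 2, 3), new Region("B", 5, 6), new Region("B", 11, 12));
        return containedIn(cs, bs);
    }

    public static void main(String[] args) {
        for (Region r : example()) System.out.println(r.tag + "(" + r.start + "," + r.end + ")");
    }
}
```

Because the output is again sorted by start, the result can feed the next join in the pipeline without re-sorting.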

14
Related Work
  • Joost: XSLT transformation based on SAX
  • BEA/XQRL: pull-based XQuery processing
  • Apache Cocoon: user-constructed pipelines made
    out of SAX handlers
  • Many XQuery processors: Galax, Xalan, Qizx,
    Saxon, ...
  • Lots of work on XPath/XQuery processing based on
    transducers