Efficient Evaluation of Regular Path Expressions on Streaming XML Data - PowerPoint PPT Presentation



Slides: 83
Provided by: csHu
Learn more at: http://www.cs.huji.ac.il


Transcript and Presenter's Notes

Title: Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Efficient Evaluation of Regular Path Expressions
on Streaming XML Data
  • By - Zachary G. Ives, Alon Y. Levy and Daniel S.

Table of Contents
  • A bit about XML (yes, again)
  • Our goal, problem and solution
  • Our XML data model
  • How to ask questions ?

Table of Contents
  • X-scan operation and structure
  • Digging deep into x-scan
  • How good is it ? Performance Evaluation
  • Conclusion

A Bit About XML (yes, again)
  • XML: the eXtensible Markup Language
  • Has become a standard
  • Useful for the dissemination and exchange of

A Bit About XML (yes, again)
  • Advantages
  • Simple
  • Self-describing nature
  • Flexible
  • Represents both structured and semi-structured

XML Structure
  • Consists of
  • Elements: pairs of matching open and close tags.
  • Elements may enclose additional elements or data
  • Attributes: included in element tags.
  • Attributes are single-valued and describe the

XML Structure (Cont.)
  • ID is a special attribute that uniquely identifies
    the element.
  • IDREF forms links to other elements in the
  • Combining ID and IDREF forms a graph structure
    rather than just a tree structure.

XML Example
We will use this example throughout the rest of
the lecture

Our Goal
  • Our goal is to perform queries and search
    operations on the XML document.
  • Several query languages have been proposed.
  • They represent the XML document as a graph.

Our Goal (Cont.)
  • They represent the query as a regular path expression
    that should be matched against the XML source.
  • These regular path expressions describe
    traversals along edges in the XML graph.
  • The variables in the query are mapped to XML
    elements along these paths.

Our Problem
  • Most XML query processors work by
  • Loading the data into a local repository
  • Building indexes on the repository
  • Processing the query
  • The repository is either
  • A relational database
  • An object-oriented database
  • A repository of semi-structured data

Our Problem (Cont.)
  • The local storing and indexing is expensive.
  • Especially when the query is made over streams of
    incoming XML.
  • The streams can come from many sources, some fast
    and some slow.
  • Sometimes we want a partial answer, but as soon
    as possible.

Our Solution
  • The query can be performed while the data streams
  • The XML-Scan (x-scan) operator does exactly that.
  • It is used at the lowest level of the query plan and
    supplies data to other operators.

The X-Scan Operator
  • Input
  • An XML data stream.
  • A set of regular path expressions.
  • Output
  • A stream of bindings for the variables occurring in
    the expressions.
  • The bindings are produced incrementally, as the
    XML data is streaming in.

The X-Scan Operator (Cont.)
  • The entire graph can be constructed in a single
  • X-Scan simultaneously
  • Parses the XML data.
  • Indexes nodes by their IDs.
  • Resolves IDREFs.
  • Returns the nodes that match the path expressions
    of the query.

The X-Scan Operator (Cont.)
  • Some issues in the X-Scan operation are
  • Dealing with possibly cyclic data
  • Preserving the order of elements
  • Removing duplicate bindings that are generated due
    to multiple paths to the same elements

Data Model for XML
  • Naturally, the XML data model is a graph.
  • Each XML tag is an edge labeled with the tag
  • It is directed to a node whose label is the tag's
    ID (if it has no ID, it gets a number).
  • A given element node will have labeled edges
    directed to its attribute values, sub-elements,
    and any other elements referenced via IDREF.
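
The graph model described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the sample document, the attribute name "manager", the set IDREF_ATTRS, and the "@" and "#text" edge labels are all assumptions made for the example (a real system would learn which attributes are IDREFs from the DTD).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document and IDREF attribute set (assumptions).
SAMPLE = """
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
  </lab>
  <person ID="smith1"><name>Smith</name></person>
</db>
"""
IDREF_ATTRS = {"manager"}


def build_graph(xml_text):
    """Return (edges, root_id); edges maps a node id to (label, target) pairs."""
    edges = {}
    counter = [0]

    def node_id(elem):
        # A node is labeled with its ID; without an ID it gets a number.
        if "ID" in elem.attrib:
            return elem.attrib["ID"]
        counter[0] += 1
        return counter[0]

    def visit(elem):
        nid = node_id(elem)
        out = edges.setdefault(nid, [])
        for name, value in elem.attrib.items():
            if name in IDREF_ATTRS:
                out.append((name, value))        # edge to the referenced node
            elif name != "ID":
                out.append(("@" + name, value))  # edge to an attribute value
        if elem.text and elem.text.strip():
            out.append(("#text", elem.text.strip()))
        for child in elem:
            out.append((child.tag, visit(child)))  # edge to a sub-element
        return nid

    root = ET.fromstring(xml_text)
    edges["root"] = [(root.tag, visit(root))]
    return edges, "root"


edges, root_id = build_graph(SAMPLE)
```

Because the manager edge points at the node labeled smith1, the result is a graph rather than a tree, as the earlier slides note.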

Data Model for XML (Cont.)
  • Example is always the best way

How to ask questions ?
  • A variety of query languages have been proposed.
  • The key feature in all of these languages is the
    use of regular path expressions over the data.
  • Most of them also give the answer to the query as
    an XML document.
  • X-Scan uses XML-QL.

The XML-QL Syntax
  • The syntax of XML-QL is
    WHERE pattern1 IN source1, pattern2 IN source2, CONSTRUCT result
  • Each patternN template is matched against the XML data
    graph from the corresponding sourceN, and the resulting
    tuples are formatted as described in result.

The XML-QL Syntax (Cont.)
  • An XML-QL pattern is a set of nested tags with
    embedded variable names (prefixed by $) that
    specify bindings of graph nodes to variables.
  • The CONSTRUCT clause specifies a tree-structured
    set of edges and nodes to add to the output graph
    for each tuple of variable bindings.

The XML-QL Syntax (Cont.)
  • Again, an example is the best way
  • Let's look at

WHERE <db> <lab> <name>$n</> <_><city>$c</></> </> ELEMENT_AS $l </>
IN fig1.xml
CONSTRUCT <result> <center> <name>$n</> <location>$c</> </> </>

The XML-QL Syntax (Cont.)
  • As we can see, the result will be

<result>
  <center> <name>Seattle Bio Lab</name> <location>Seattle</location> </center>
  <center> <name>PMBL</name> <location>Philadelphia</location> </center>
</result>

The XML-QL Syntax (Cont.)
  • If the variable is bound to a node with
    sub-elements, the whole sub-graph is inserted
    into the resulting graph.
  • We will use dot-notation to describe the X-Scan
  • The previous example will be rewritten as
  • El = root.db.lab
  • En = El.name
  • Ec = El._.city
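
To make the dot-notation concrete, here is a minimal sketch that splits each expression into the variable it defines, the starting point it depends on, and the edge labels to follow. It assumes each expression is written as "variable = start.label.label"; the function name parse_path_expression is hypothetical.

```python
def parse_path_expression(text):
    """Split 'El = root.db.lab' into (variable, start, [edge labels])."""
    variable, path = (part.strip() for part in text.split("=", 1))
    start, *labels = path.split(".")
    return variable, start, labels


EXPRESSIONS = ["El = root.db.lab", "En = El.name", "Ec = El._.city"]
parsed = [parse_path_expression(e) for e in EXPRESSIONS]
```

The start field shows the dependency between variables: En and Ec both start from El, so they can only be evaluated relative to a binding of El.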

The X-Scan Place
  • The goal of the X-Scan operator is therefore to
    produce a set of bindings for each pattern in the
    WHERE clause.

So, What Does X-Scan Do ?
  • Given the XML stream and a set of regular path
    expressions, it outputs a stream of tuples assigning
    binding values to each variable in the set of
    regular path expressions.
  • The central mechanism is a set of state machines
    that traverse the XML graph, trying to satisfy
    the path expressions.

What is it made of ?
  • The data components of X-Scan are

Where the data flows?
  • As the data streams into the system, several
    structures are created
  • The data gets parsed and stored locally
  • A structural index of the XML graph is created
  • An ID index records the IDs of all elements and
    their location in the structural index
  • A list of references to not-yet-seen element IDs
    is maintained

Where the data flows?
  • In parallel to the creation of those structures,
    a set of finite state machines perform a DFS over
    the partial structural index.
  • When a machine reaches an accepting state, a new
    value is added to the binding-value table of that
  • Those values are later combined to form the
    complete image.

Example problems
  • It sounds easy, yet there are some problems to
    deal with, for example
  • The handling of cycles
  • How to prune duplicate bindings as they are
    created ? Remember, X-Scan is an online operator

The State Machines
  • As described earlier, we create one regular
    expression for every variable in the query in
    the dot-notation.
  • So, we build a finite-state machine for each
  • State transitions correspond to edge traversals
    in the XML data graph

The State Machines (Cont.)
  • The end of the path expression yields an accepting
    state, which outputs instances of the
    corresponding variable.
  • When one variable depends on another
    variable, the accepting state of the other
    variable's machine points to the state machine of
    the first one.
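
A minimal sketch of such a machine for a simple path of edge labels follows. The class name PathMachine is hypothetical, states are numbered from 0 (unlike the slides), and "_" is treated as a wildcard that matches any single edge label, which is an assumption of this sketch.

```python
class PathMachine:
    """One state per consumed edge label; the last state is accepting."""

    def __init__(self, variable, labels):
        self.variable = variable
        self.labels = labels
        self.accepting = len(labels)

    def step(self, state, edge_label):
        """Return the next state, or None if the edge does not match."""
        if state < self.accepting and self.labels[state] in (edge_label, "_"):
            return state + 1
        return None


ml = PathMachine("El", ["db", "lab"])      # El = root.db.lab
mc = PathMachine("Ec", ["_", "city"])      # Ec = El._.city
```

Reaching ml.accepting means a value for El has been found; at that point the machines that depend on El, such as mc, would be started from that value.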

The State Machines (Cont.)
  • And back to our example

Indexing the XML Graph
  • The structural index should allow x-scan to
    quickly traverse the XML data graph.
  • Each node in the index contains
  • The ID of the element and its offset in the
  • Pointers to all the sub-elements, attributes and
    IDREFs of the element.
  • Essentially it looks like the graph except for
    the leafs.

The Algorithm Step by Step
  • X-Scan proceeds by building the structural index
    and running a set of active state machines in
  • The core algorithm is in fact the way those state
    machines run; let's focus on that by running our

The Algorithm Step by Step
  • Initially, only the top level machine is active.
  • When a machine M reaches an accepting state, it
    produces a binding b for its variable, writes it
    and the parent value to its table and activates
    all of its dependent state machines.

The Algorithm Step by Step
  • Those machines remain active while x-scan is
    scanning b or any element accessible by a path
    from b.
  • The final output of x-scan is the equi-join of
    all the appropriate tables.
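
The final equi-join can be illustrated with a small sketch. The table layout is simplified for the example: the parent table lists bindings of the parent variable, and each dependent table holds (value, parent value) rows. The function name equi_join and the node numbers 7 and 9 for the second lab are hypothetical.

```python
def equi_join(parent_table, child_tables):
    """Join each child table to the parent bindings on the parent value."""
    rows = [(parent,) for parent in parent_table]
    for table in child_tables:
        rows = [row + (value,)
                for row in rows
                for value, parent in table
                if parent == row[0]]
    return rows


lab_table = ["baselab", "lab2"]               # bindings of El
name_table = [(2, "baselab"), (7, "lab2")]    # bindings of En with their parent
city_table = [(4, "baselab"), (9, "lab2")]    # bindings of Ec with their parent

tuples = equi_join(lab_table, [name_table, city_table])
```

Each resulting tuple carries one binding per variable, which is what the operators above x-scan in the query plan consume.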

The Algorithm By Example
  • Ml is initialized on state 1 as the only active

The Algorithm By Example
  • The root has a db edge, so the machine's state is
    pushed onto its stack and it moves to state 2 with
    value node 1

The Algorithm By Example
  • Next, x-scan follows the first outgoing edge, pushes
    the old state value, and sets Ml to state 3
    with value baselab

The Algorithm By Example
  • Since it is now in an accepting state
  • the baselab value is written to the Ml table
  • Ml is suspended
  • Mn and Mc are activated

The Algorithm By Example
  • The next edge takes Mn from state 4 to 5
  • And Mc runs on the loop back to state 6
  • Both machines have 2 as binding value

The Algorithm By Example
  • Since Mn is now in an accept state, x-scan writes
    <2, baselab> into Mn's table.
  • Since no edges remain for exploration, x-scan
    pops the stack and backs up the state machines,
    resetting Mn to state 4 and Mc to state 6

The Algorithm By Example
  • The next edge is labeled location, so
  • Mn stays in state 4
  • Mc also stays in state 6 but advances to node 3
  • Then Mc is advanced to state 7 on the city edge
    to node 4

The Algorithm By Example
  • At this point x-scan writes <4, baselab> into
    Mc's table.
  • It can also produce the first tuple of bindings

The Algorithm By Example
  • X-Scan keeps running Mc but no more cities are
  • It pops back up to baselab
  • Running Mc along the IDREF to smith1 gives no
    more cities

The Algorithm By Example
  • Now, Mn and Mc are deactivated and control
    returns to Ml
  • X-scan pops up to node 1 in state 2
  • The other lab edge yields another tuple

Where should we go ?
  • On occasion x-scan will encounter an IDREF to a
    node that has not yet been parsed.
  • An unknown node simply will not be in the ID index.

Where should we go ?
  • When X-Scan hits such an unseen reference
  • It pauses all the relevant state machines
  • Adds an entry to the list of unresolved IDREFs
  • <desired ID value, referrer's address>
  • Continues to parse and build the structural index

Where should we go ?
  • Once the target element is parsed, x-scan
  • Fills its address into each referring IDREF in
    the structural index
  • Removes the entry from the list of unresolved
  • Awakens the state machines and proceeds
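
The bookkeeping for forward references can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: addresses are plain integers, the class name IdIndex and its methods are hypothetical, and pausing or waking state machines is reduced to returning the list of referrers that may resume.

```python
class IdIndex:
    def __init__(self):
        self.by_id = {}           # ID value -> address in the structural index
        self.unresolved = {}      # desired ID value -> referrers' addresses
        self.resolved_links = {}  # referrer's address -> target address

    def reference(self, referrer, target_id):
        """Return the target address, or None if the target is not parsed yet."""
        if target_id in self.by_id:
            self.resolved_links[referrer] = self.by_id[target_id]
            return self.by_id[target_id]
        # Unseen reference: remember <desired ID value, referrer's address>.
        self.unresolved.setdefault(target_id, []).append(referrer)
        return None

    def define(self, element_id, address):
        """Register a parsed element; return the referrers that can resume."""
        self.by_id[element_id] = address
        waiting = self.unresolved.pop(element_id, [])
        for referrer in waiting:
            self.resolved_links[referrer] = address
        return waiting


index = IdIndex()
first_lookup = index.reference(10, "smith1")   # smith1 not parsed yet
woken = index.define("smith1", 42)             # smith1 arrives later in the stream
```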

Are We Going in Circles ?
  • Sometimes the input XML graph contains cycles.
  • X-Scan must not get trapped in an infinite loop.

Are We Going in Circles ?
  • Consider the following XML graph

Are We Going in Circles ?
  • If we refuse to move in circles, we will miss the
    answer to the query
  • Ex = root._.b.a
  • But if we allow moving in circles, we are going to
    get in trouble with this one
  • Ey = root._.z

Are We Going in Circles ?
  • What can we do ?
  • Now we are going to use the stack
  • The stack contains pairs of the form
  • (binding, state)
  • Describing which bindings have been associated
    with states of the machine along the current path.

Are We Going in Circles ?
  • Since x-scan uses deterministic finite state
    machines, returning to a previous state with the
    same binding will not add any new possible
  • So, when a machine enters a state, it should
    check that this state has not been bound
    to the same binding along the current path.
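
The check can be sketched as a depth-first traversal that keeps (binding, state) pairs on a stack. Everything concrete here is an assumption for illustration: the small cyclic graph, the function name traverse, and a deterministic machine that accepts any path ending in a z edge, with "*" standing for "any other label".

```python
def traverse(graph, transitions, accepting, node, state, stack, matches):
    # Refuse to re-enter a state with the same binding along the current path.
    if (node, state) in stack:
        return
    stack.append((node, state))
    if state in accepting:
        matches.append(node)
    for label, target in graph.get(node, []):
        # Deterministic lookup: an exact label wins, "*" is the default.
        nxt = transitions.get((state, label), transitions.get((state, "*")))
        if nxt is not None:
            traverse(graph, transitions, accepting, target, nxt, stack, matches)
    stack.pop()


# Hypothetical graph with a cycle: node 1 -> node 2 -> node 1.
GRAPH = {"root": [("a", 1)], 1: [("b", 2)], 2: [("a", 1), ("z", 3)]}
TRANSITIONS = {(0, "z"): 1, (0, "*"): 0, (1, "z"): 1, (1, "*"): 0}

matches = []
traverse(GRAPH, TRANSITIONS, {1}, "root", 0, [], matches)
```

The traversal still follows the a edge from node 2 back to node 1, but stops there because the pair (1, 0) is already on the stack, so it terminates and still finds the node reached by the z edge.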

Are We Going in Circles ?
  • Is it working for our example ?
  • Look at those state machines.

Some Enhancements
  • It is important to prevent the operator from
    spending time evaluating paths that are not
  • There are two enhancements.
  • Selection Push-Down.
  • Duplicate elimination.

Selection Push-Down
  • The query optimizer creates selection operators
    and pushes them down into the x-scan operation.
  • Works only on attributes since they are single
  • So X-Scan evaluates all of a node's attribute edges
    before its sub-element edges.

Selection Push-Down
  • Here also, the best way to explain is by example.

WHERE <db> <lab manager="smith1"> <name>$n</> <_><city>$c</></> </> ELEMENT_AS $l </>
IN fig1.xml
CONSTRUCT <result><center> <name>$n</> <location>$c</> </></>

Selection Push-Down
  • The query plan generator must create an
    additional temporary variable temp1 and a regular
    path expression
  • Etemp1 = El.@manager
  • It also adds a selection predicate
  • Etemp1 = "smith1"

Selection Push-Down
  • Now, for the second lab, since it has only an ID
    attribute, X-Scan iterates through all of lab2's
    attributes and finds no manager attribute.
  • So it can short-circuit on this sub-graph.
  • Discarding the value of l and ignoring its
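
The short-circuit can be sketched as follows. The dictionary layout for a lab, the function name scan_labs, and the counter of visited sub-elements are assumptions made only to show that a rejected lab's sub-graph is never explored.

```python
def scan_labs(labs, required_manager):
    """Return (name bindings, number of sub-element edges actually visited)."""
    bindings, visited_children = [], 0
    for lab in labs:
        # Attribute edges are evaluated before sub-element edges.
        if lab["attrs"].get("manager") != required_manager:
            continue  # short-circuit: discard this lab and skip its sub-graph
        for label, value in lab["children"]:
            visited_children += 1
            if label == "name":
                bindings.append((lab["attrs"]["ID"], value))
    return bindings, visited_children


LABS = [
    {"attrs": {"ID": "baselab", "manager": "smith1"},
     "children": [("name", "Seattle Bio Lab"), ("location", "Seattle")]},
    {"attrs": {"ID": "lab2"},
     "children": [("name", "PMBL"), ("location", "Philadelphia")]},
]

result = scan_labs(LABS, "smith1")
```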

Duplicate Elimination
  • Sometimes we can visit an element multiple times
    through different paths.
  • This can produce duplicate binding tuples.
  • The naive way is to do some post-processing
  • It can be done smartly so there is no need to
    save the entire history.

There is a big one on the way
  • Main memory may not be large enough to handle all
    of the index structures.
  • The way to handle this is by
  • Paging the XML source document
  • Paging the structural index
  • A conventional buffer manager using LRU or some
    other similar policy is sufficient.
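
For readers unfamiliar with LRU buffering, a minimal sketch is given below. It is a generic least-recently-used page buffer, not the buffer manager used for X-Scan; the class name LRUPageBuffer and the load_page callback are hypothetical.

```python
from collections import OrderedDict


class LRUPageBuffer:
    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page   # called on a miss to fetch the page
        self.pages = OrderedDict()   # least recently used page comes first

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)      # mark as most recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict the least recently used
            self.pages[page_no] = self.load_page(page_no)
        return self.pages[page_no]


loads = []
buffer = LRUPageBuffer(2, lambda n: loads.append(n) or f"page-{n}")
for page in (1, 2, 1, 3, 2):
    buffer.get(page)
```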

There is a big one on the way
  • But what about the ID lookup index, list of
    unresolved IDREFs and the state machine stack?
  • They use either a B-Tree or a multilevel
  • The size of stack is bounded by the product of
    number of variables and the longest non-repeating
  • Inactive state machine stack can be naturally

How good is it ?
  • They used the IBM XML4C parser version 3.0.1 with
    the SAX parser API to implement the X-Scan.
  • The SAX API provides callbacks to the code as
    elements are read, thus allowing X-Scan to
    evaluate streaming XML data
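
The callback style can be illustrated with Python's standard xml.sax module, used here only as a stand-in for the XML4C SAX API. The handler below reports the text of every element reached by a fixed path as soon as its close tag is read; the class name PathHandler and the sample document are assumptions for the example.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Emit the text of each element whose tag path equals `path`."""

    def __init__(self, path, emit):
        super().__init__()
        self.path, self.emit = path, emit
        self.stack, self.text = [], []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if self.stack == self.path:
            # The binding is produced while the rest of the input is unread.
            self.emit("".join(self.text).strip())
        self.stack.pop()
        self.text = []


found = []
xml.sax.parseString(
    b"<db><lab><name>Seattle Bio Lab</name></lab>"
    b"<lab><name>PMBL</name></lab></db>",
    PathHandler(["db", "lab", "name"], found.append),
)
```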

How good is it ?
  • They compared X-Scan against
  • Stanford's Lore semi-structured/XML database
  • A commercial OO-based XML repository
  • The experiments were performed with locally
    stored XML files
  • So X-Scan loses some of its advantages

How good is it ?
  • Also, this X-Scan implementation didn't include
    selection predicates.
  • All the queries were performed on a single
    processor 450MHz Pentium II with 256 MB of
  • X-Scan and the OO-based system ran on Windows NT
    and Lore ran on Linux.

How good is it ?
  • The queries were performed on the
    following documents.
  • Mondial and VLDB contain many references, whereas
    the rest are mostly tree structures.

How good is it ?
  • Conclusions.
  • Neither Lore nor the commercial system scales up
    well to queries across multi-megabyte data files.
  • They failed particularly on files that contain
    graph structure.
  • X-Scan scales better in all cases.

How good is it ?
  • Further experiments on synthetic XML data files
    were conducted.
  • Those XML files were randomly generated.
  • The aim was to check the scalability of X-Scan.
  • They averaged three different runs across each of
    the three different random graphs with the same
    generation parameters.

How good is it ?
  • The first experiments were conducted on
    tree-structured data.
  • Therefore they didn't have to build the
    structural index, ID and IDREF tables.
  • This was to check how good the state machines

How good is it ?
  • The results were

How good is it ?
  • Next, they wanted to check for the cost of the
    graph indexing and resolving references.
  • Without traversing the IDREFs.
  • They took the same graphs from the previous tests
    and changed back the DTD so it will be considered as
  • The results were.

How good is it ?
  • Next, they wanted to check the effectiveness of
    the structural index when called to evaluate such
    reference edges.
  • The results were.

Conclusion
  • X-Scan differs in three key ways from previous
  • The structural index allows more efficient
    traversal without splitting the data into tables or
    objects that must be recombined later
  • X-Scan's state machines are based on the query
    rather than on the data source: not reusable,
    but we always reread the data
  • X-Scan is pipelined and produces bindings as data
    is being streamed into the system

Conclusion (Cont.)
  • Other points regarding X-Scan are
  • It handles cycles well
  • It preserves the document order and structure
  • It eliminates duplicate tuples
  • The state machines are independent and so can run
    in parallel
  • X-Scan is very efficient, typically imposing 8%
    overhead on top of the time required to parse the
    XML document
