Title: Mixed Mode XML Query Processing Halverson, Burger, Galanis University of Wisconsin, Madison
1Mixed Mode XML Query Processing
Halverson, Burger, Galanis
University of Wisconsin, Madison
Emre Tapçi 2002701525
2Overview
- Introduction
- Purpose of the article
- System Architecture
- Mixed Mode Query Processing
- Experiments
- Related Work
- Conclusion and Future Work
3Introduction
- Mixed mode XML query processing employs
- inverted list filtering
- tree navigation
4Purpose of the article
- To show that systems which keep inverted list
filtering and tree navigation separate are
suboptimal. To build an optimal system, these two
types of processing must be integrated.
5Basic System Architecture
6System Architecture
- The Data Manager stores a tree representation of
the XML document
- The Index Manager stores a set of inverted lists,
mapping objects in the XML document to lists of
exact locations within the document
7Numbering Scheme
- The data manager and index manager must share a
common scheme for numbering the elements in an
XML document
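The numbering figure itself did not survive in this transcript. As a rough illustration only, the sketch below assumes a start/end/level region numbering, which is consistent with the start number, end number and level mentioned on the later posting slides. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def number_elements(xml_text):
    """Assign (tag, start, end, level) to every element, in document order."""
    root = ET.fromstring(xml_text)
    counter = 0
    numbered = []

    def visit(elem, level):
        nonlocal counter
        counter += 1                   # start number: assigned on entering the element
        start = counter
        slot = len(numbered)
        numbered.append(None)          # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1)
        counter += 1                   # end number: assigned on leaving the element
        numbered[slot] = (elem.tag, start, counter, level)

    visit(root, 1)
    return numbered

def contains(ancestor, descendant):
    """True if the ancestor's (start, end) region encloses the descendant's."""
    return ancestor[1] < descendant[1] and descendant[2] < ancestor[2]

numbered = number_elements("<A><B><C/></B><B/></A>")
for entry in numbered:
    print(entry)
```

With numbers like these, an ancestor/descendant test reduces to comparing two intervals, which is what lets the two managers exchange element references cheaply.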
8Numbering Scheme
9Data Manager
- Each XML document is stored in the Data Manager
using a B tree structure.
- The key of the B tree index is a (document_ID, element_ID)
pair that we refer to as an XKey.
10Data Manager
- Each leaf entry contains
- Term ID
- Record ID (RID)
11Data Manager
- The B tree corresponding to the previously given
numbered XML document example
12Data Manager
- Data Manager Tree Structure
13Data Manager
- Child Axis Cursor (CA)
- Descendant Axis Cursor (DA)
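The slides do not say how the two cursors are implemented. A minimal sketch, assuming records sorted by XKey where each record also carries a tag, an end number and a level, could look like this. The sorted Python list stands in for the B tree, and `RECORDS`, `lookup`, `descendant_axis` and `child_axis` are hypothetical names.

```python
from bisect import bisect_left

# Hypothetical store: records sorted by XKey = (doc_id, start number).
# Each record carries (tag, end number, level); this layout is an assumption.
RECORDS = [
    ((1, 1), ("A", 8, 1)),
    ((1, 2), ("B", 5, 2)),
    ((1, 3), ("C", 4, 3)),
    ((1, 6), ("B", 7, 2)),
]
KEYS = [key for key, _ in RECORDS]

def lookup(xkey):
    """Return the position of xkey in the sorted store (binary search)."""
    pos = bisect_left(KEYS, xkey)
    if pos == len(KEYS) or KEYS[pos] != xkey:
        raise KeyError(xkey)
    return pos

def descendant_axis(xkey):
    """Yield the XKeys of all descendants of xkey via a forward range scan."""
    pos = lookup(xkey)
    doc_id = xkey[0]
    end = RECORDS[pos][1][1]
    for key, _ in RECORDS[pos + 1:]:
        if key[0] != doc_id or key[1] > end:
            break                      # left the region of the context element
        yield key

def child_axis(xkey):
    """Yield only the descendants that sit exactly one level below xkey."""
    level = RECORDS[lookup(xkey)][1][2]
    for key in descendant_axis(xkey):
        if RECORDS[lookup(key)][1][2] == level + 1:
            yield key

print(list(child_axis((1, 1))))
print(list(descendant_axis((1, 1))))
```

The child cursor here simply filters the descendant scan by level; a real implementation would presumably skip whole subtrees instead of visiting them.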
14Index Manager
- Indexing information is stored in a two level
index structure
- B tree as top level
- Second level info is referred to as postings, where
postings make up a posting list
15Index Manager
- Index Manager Tree Structure
16Mixed Mode Query Processing
- Data Manager
- Navigation Based Algorithms
- Ex: Unnest Algorithm
- Index Manager
- Multi-predicate merge join algorithms
- Ex: ZigZag algorithm
17Unnest Algorithm
- Takes as input a path expression and a stream of
XKeys.
- Evaluates the path expression for each XKey in
the input, and outputs XKeys corresponding to the
satisfying elements.
- Ex: document()/A/B/C
18Unnest Algorithm
- Uses a Finite State Machine (FSM) to evaluate
path expressions
- Each state of the FSM represents having satisfied
some prefix of the path expression, while an
accepting state indicates a full match.
19Unnest Algorithm
- Each state is associated with a cursor that
corresponds to the next step to be applied for
the path expression.
- For each XKey obtained from the cursor, make the
appropriate transition in the FSM, and continue
with the next XKey in the next state.
- If the cursor terminates, return to the previous
state and continue by enumerating its cursor.
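The three bullets above can be sketched as a loop over a stack of cursors, one cursor per active FSM state. This is an illustration under assumptions, not the authors' implementation: the in-memory `TREE`, the integer XKeys and the `(axis, tag)` step encoding are all hypothetical.

```python
# Hypothetical in-memory tree: XKey -> (tag, list of child XKeys).
TREE = {
    1: ("A", [2, 6]),
    2: ("B", [3]),
    3: ("C", []),
    6: ("B", [7]),
    7: ("C", []),
}

def child_cursor(xkey):
    """Enumerate the children of xkey."""
    return iter(TREE[xkey][1])

def descendant_cursor(xkey):
    """Enumerate all descendants of xkey in document order."""
    for child in TREE[xkey][1]:
        yield child
        yield from descendant_cursor(child)

CURSORS = {"child": child_cursor, "descendant": descendant_cursor}

def unnest(steps, input_xkeys):
    """Evaluate a non-empty list of (axis, tag) steps for each input XKey."""
    results = []
    for xkey in input_xkeys:
        # stack[i] is the cursor of state i, i.e. i steps already satisfied.
        stack = [CURSORS[steps[0][0]](xkey)]
        while stack:
            state = len(stack) - 1
            candidate = next(stack[-1], None)
            if candidate is None:
                stack.pop()            # cursor exhausted: back to the previous state
            elif TREE[candidate][0] == steps[state][1]:
                if state + 1 == len(steps):
                    results.append(candidate)   # accepting state: full match
                else:
                    next_axis = steps[state + 1][0]
                    stack.append(CURSORS[next_axis](candidate))
    return results

# B/C evaluated relative to the A element with XKey 1.
print(unnest([("child", "B"), ("child", "C")], [1]))
```

Popping the stack corresponds to returning to the previous state, and pushing a new cursor corresponds to an FSM transition after a successful step.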
20Unnest FSM for A/B
21Unnest DA-FSM for A/B
22Unnest CA-FSM for A//B
23Cost Model For Unnest Algorithm
- There are two relevant cost formulas
- Cost of a child axis unnest
- Cost of a descendant axis unnest
24The Cost of Unnest
25ZigZag Join Algorithm
- Uses the indices present on the posting lists.
- These algorithms assume that the posting lists
are sorted by (document ID, start number)
26ZigZag Join Algorithm
- The point of the algorithm is to skip forward
over parts of a posting list that are guaranteed
not to have any matching postings on the other
list.
27ZigZag Join Algorithm
28ZigZag Join Algorithm
- Check the containment of the first B within the
first A, and output the pair.
- Increment the B posting list pointer.
- Find that the second B is not contained by the
first A.
- Increment the A posting list pointer.
- If the second A is beyond the second B, then
increment the B posting list pointer.
29ZigZag Join Algorithm
- Since the current B posting has no matching A
posting, use the second level index to seek
forward using the current A posting's start
number.
- The algorithm then skips over to the fifth B posting.
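The walkthrough above can be approximated with a short sketch. It is a simplification under stated assumptions: a single document, postings reduced to (start, end) pairs, A postings that do not nest inside each other, and distinct start numbers. Binary search over the sorted lists stands in for the second level index, and the data is invented so that the run ends by skipping to the fifth B posting.

```python
from bisect import bisect_right

def zigzag_join(a_list, b_list):
    """Return (A, B) pairs where the A region contains the B region.

    Assumes both lists are sorted by start number and that A postings
    do not nest, so their end numbers are sorted as well.
    """
    a_ends = [end for _, end in a_list]
    b_starts = [start for start, _ in b_list]
    pairs = []
    i = j = 0
    while i < len(a_list) and j < len(b_list):
        a_start, a_end = a_list[i]
        b_start, _ = b_list[j]
        if b_start < a_start:
            # B lies before A: seek B forward past A's start number.
            j = bisect_right(b_starts, a_start, lo=j)
        elif b_start > a_end:
            # B lies after A: seek A forward to the first A ending after B starts.
            i = bisect_right(a_ends, b_start, lo=i)
        else:
            pairs.append((a_list[i], b_list[j]))   # B starts inside A: a match
            j += 1
    return pairs

A = [(1, 10), (50, 60)]
B = [(2, 3), (12, 13), (20, 21), (30, 31), (52, 53)]
print(zigzag_join(A, B))
```

In this run the second, third and fourth B postings are never compared against the second A posting; the seek jumps straight from index 1 to index 4 of the B list. Handling nested A postings would need the backtracking mentioned in the cost model slide, which this sketch leaves out.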
30Cost Model of ZigZag Join Algorithm
- The CPU cost can be quite dependent on actual
document structure, because the algorithm can
skip over sections of either input posting list
and can backtrack in a complex fashion
31Cost Model of ZigZag Join Algorithm
32Enabling Mixed Mode Execution
- Unnest operator takes a list of XKeys as input
- ZigZag Join operator takes posting lists as input
- To enable query plans that use a mixture of these
two operators, we must provide efficient
mechanisms for switching between the two formats.
33Enabling Mixed Mode Execution
- To convert postings into XKeys
- Remove the end number and level
34Enabling Mixed Mode Execution
- To convert the XKeys into postings
- We need to look up an end number and level
- To support this operation we store the end number
and level in the information record for each
element
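A minimal sketch of the two conversions, assuming a posting has the form (document ID, start, end, level) and that the element ID in an XKey is the start number, which is what dropping the end number and level from a posting implies. `INFO_RECORDS` is a hypothetical stand-in for the per-element information records.

```python
from typing import NamedTuple

class Posting(NamedTuple):
    doc_id: int
    start: int
    end: int
    level: int

class XKey(NamedTuple):
    doc_id: int
    element_id: int

# Hypothetical information records: XKey -> (end number, level).
INFO_RECORDS = {
    XKey(1, 1): (8, 1),
    XKey(1, 2): (5, 2),
}

def posting_to_xkey(posting):
    """Drop the end number and level; no lookup is needed."""
    return XKey(posting.doc_id, posting.start)

def xkey_to_posting(xkey):
    """Look up the end number and level in the element's information record."""
    end, level = INFO_RECORDS[xkey]
    return Posting(xkey.doc_id, xkey.element_id, end, level)

posting = Posting(1, 2, 5, 2)
print(posting_to_xkey(posting))
print(xkey_to_posting(XKey(1, 2)))
```

The asymmetry is visible in the code: one direction is a pure projection, while the other costs a record lookup per element.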
35Selecting A Plan
- We heuristically limit our search space to
include only left deep evaluation plans for
structural joins.
- To choose the best plan, we use a dynamic
programming approach
36Selecting A Plan
37Selecting A Plan
- For a path expression query, the cost can be
expressed as the sum of the cost of the last
operation and the minimum cost for the rest of
the path expression
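The recurrence can be illustrated with a small dynamic program. The cost formulas from the slides are not available in this transcript, so every number below is invented, and the single `switch_cost` for converting between XKeys and postings is an assumption made to keep the example short. This is a sketch of the idea, not the authors' optimizer.

```python
def best_plan(step_costs, switch_cost):
    """Pick the cheapest operator sequence for a path expression.

    step_costs: one dict per path step, mapping operator name -> cost
                (hypothetical numbers, not the paper's formulas).
    switch_cost: assumed cost of converting between XKeys and postings.
    """
    # best[op] = (cost, plan) of the cheapest plan for the prefix ending in op.
    best = {op: (cost, [op]) for op, cost in step_costs[0].items()}
    for costs in step_costs[1:]:
        new_best = {}
        for op, cost in costs.items():
            candidates = []
            for prev_op, (prev_cost, prev_plan) in best.items():
                extra = 0 if prev_op == op else switch_cost
                # cost of the last operation + minimum cost of the rest
                candidates.append((prev_cost + extra + cost, prev_plan + [op]))
            new_best[op] = min(candidates)
        best = new_best
    return min(best.values())

step_costs = [
    {"unnest": 10, "zigzag": 4},
    {"unnest": 3, "zigzag": 9},
    {"unnest": 2, "zigzag": 8},
]
print(best_plan(step_costs, switch_cost=1))
```

With these made-up numbers the cheapest plan starts with a ZigZag join and then switches to Unnest for the remaining steps, which is the kind of mixed plan the approach is meant to find.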
38Experiments
- Experimental results of Mixed Mode Query
Processing Approach
- Carried out on a dual processor 550 MHz P3 PC
running Red Hat Linux 6.2, with 1GB memory and SCSI
disks.
39Experiments
- The XML Schema used in the experiments
40Experiments
- Test queries with predicted optimal plans
41Experiment Results
- Execution times in milliseconds for Query1
- Execution times in milliseconds for Query2
42Experiment Results
- Execution times in milliseconds for Query3
- Execution times in milliseconds for Query4
43Related Work
- There has been a lot of work on developing
efficient algorithms for structural joins that
identify occurrences of structural relationships
- Using pre-order and post-order numbers
- Stack-merge algorithms to make use of B-tree
indices on the inverted lists
44Related Work
- There has also been some work on converting path
expression queries into state machines
- Several algorithms were proposed for optimizing
branching path expressions, but only for
navigational access methods.
45Related Work
- Some recent research studies have considered the
problem of maintaining summary structures of XML
documents to provide statistical information
- XML management systems have also been built on
top of either relational or object-oriented
systems.
46Conclusion and Future Work
- Mixed mode XML query processing performs better
than either query processing technique used
separately.
- Only single axis paths were considered in the ZigZag
Join algorithm; more complex algorithms could be
integrated.
47Conclusion and Future Work
- Parallel execution of operators is supported in
the system, so we can benefit from branched
execution plans
- This work uses only simple models and examples;
it could be extended to include more complex
queries and cost models
48Thank You
- We thank you all for listening
- can.tamyaman_at_arcelik.com
- tapci_at_alumni.bilkent.edu.tr