Title: XMLTK: An XML Toolkit for Scalable XML Stream Processing
1XMLTK: An XML Toolkit for Scalable XML Stream Processing
- I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu
2Motivation
- Lots of data sits in large text files
- ad hoc data formats
- Queried with Unix command line tools
- grep, sort, tail, etc
- It would be nice to XML-ize it...
- ...but then the Unix command line tools won't work any more.
3Example
Text file
score decision paperID title
- accept P054 Theory of XML parsing
- reject P021 Experience with an XML optimizer
- accept P069 Towards a unified theory of data models
- . . .
- Find the top ten rejected papers (in score order)
grep reject papers.txt | sort | tail -10
4Example (cont'd)
<submissions>
  <paper>
    <score> 6 </score>
    <decision> accept </decision>
    <paperID> P054 </paperID>
    <title> Theory of XML parsing </title>
  </paper>
  <paper>
    <score> 3 </score>
    <decision> reject </decision>
    <paperID> P021 </paperID>
    <title> Experience with an XML optimizer </title>
  </paper>
  . . .
Can't use those tools anymore.
5Example (cont'd)
- Doing it with the XML Toolkit
- Finds the top ten rejected <paper>s, in <score> order
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
6Goals of the XML Toolkit
- Simple, scalable tools for XML processing
- Provides a service: there are people who need this
- Provides a research platform for XML stream
processing
7Outline
- The tools
- The XPath processing engine
- Conclusions
8The Tools
- Current tools
- xsort
- xagg
- xnest
- xflatten
- xdelete
- xpair
- xhead
- xtail
- file2xml
- xmill
This talk covers only xsort.
The list may look long, but it is still incomplete...
9XSort Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
- -c: the context, i.e. where to sort
- -e: the item, i.e. what to sort
- -k: the key, i.e. what to sort on
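The roles of context, item, and key can be illustrated with a small in-memory sketch. This is not how XMLTK works internally (the toolkit streams its input and sorts externally); the function name `xsort_like`, the `items` argument layout, and the use of ElementTree's limited path syntax are all assumptions made for illustration. The context is given relative to the root element, so "." plays the role of /bib.

```python
import xml.etree.ElementTree as ET

def xsort_like(root, context, items):
    """Hypothetical in-memory analogue of: xsort -c context (-e item (-k key)*)*

    items is a list of (item_path, [key_path, ...]) pairs, in output order.
    Children of a context that match no item path are dropped.
    """
    for ctx in root.findall(context):
        output, taken = [], set()
        for item_path, key_paths in items:
            # Each element is emitted once, under the first item path it matches,
            # so a trailing "*" means "all the rest".
            group = [el for el in ctx.findall(item_path) if id(el) not in taken]
            taken.update(id(el) for el in group)
            # Python's sort is stable: with no keys, document order is kept.
            group.sort(key=lambda el: [(el.findtext(k) or "").strip()
                                       for k in key_paths])
            output.extend(group)
        ctx[:] = output  # replace the context's children with the sorted items
    return root
```

For example, `xsort_like(bib, ".", [("paper", ["title"]), ("*", [])])` would place the papers first, sorted by title, followed by every other child of the root in its original order.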
10XSort Definition
11XSort Examples
Examples illustrated on data like this:
<bib>
  <book>
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
    <title>XML in a Nutshell</title>
    <publisher>O'Reilly</publisher>
    <year>2001</year>
    <isbn>0-596-00058-8</isbn>
  </book>
  <paper>
    <author>Sylvain Devillers</author>
    <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
    <year>2001</year>
    <booktitle>WWW Posters</booktitle>
    <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
    <url>db/conf/www/www2001p.html#Devillers01</url>
  </paper>
  . . .
12XSort Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>. The <book>s are dropped from the output:
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . </bib>
Compare to:
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()
13XSort Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then <firstName>:
<bib> <author> . . . </author> <author> . . . </author> . . . </bib>
14XSort Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest:
<bib>
  <paper> . . . </paper> <paper> . . . </paper> . . .
  <article> . . . </article> . . .
  <book> . . . </book> . . .
</bib>
15XSort Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s, list the <author>s first; in <book>s, list the <title> first. Leave other entries unchanged.
16XSort Implementation
- Sorts one context at a time, copies the rest
- For each context
- Create a global key for each item
- Sort items, with a two-pass, multiway merge sort
- Quote from Databases 101 (news from the trenches):
- With disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
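The 4TB figure follows from the usual two-pass argument: pass one writes sorted runs of roughly memory size, and pass two can merge as many runs as there are block-sized buffers in memory. The sketch below checks that arithmetic and shows the two passes on an in-memory list. It is a simplified illustration, not the toolkit's code; the function names and the `memory_items` parameter are assumptions, and real runs would live on disk.

```python
import heapq

def max_two_pass_bytes(memory_bytes, block_bytes):
    """Largest input sortable in two passes under the textbook model."""
    fan_in = memory_bytes // block_bytes  # runs mergeable at once, one block each
    return fan_in * memory_bytes          # each run is about one memory load

def two_pass_sort(items, memory_items, key=None):
    """Pass 1: sort memory-sized runs. Pass 2: one multiway merge of all runs."""
    runs = [sorted(items[i:i + memory_items], key=key)
            for i in range(0, len(items), memory_items)]
    return list(heapq.merge(*runs, key=key))

# 128MB of memory and 4KB blocks: 32768 runs of 128MB each, i.e. 4TB.
limit = max_two_pass_bytes(128 * 2**20, 4 * 2**10)
```

In xsort's setting the sorted items would be the elements selected by -e, compared on the global key built from their -k values.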
17XSort Performance
xsort -c /dblp -e * -k title/text()
Size (KB)     Xalan (sec)   xsort (sec)
0.41          0.08          0.00
4.91          0.09          0.00
76.22         0.27          0.02
991.79        2.52          0.26
9671.79       27.45         2.85
100964.43     -             43.97
1009643.71    -             461.36
The largest input is about 1GB; xsort sorts it in about 8 minutes.
18Outline
- The tools
- The XPath processing engine
- Conclusions
19The XPath Processor
- Common to all tools is the following problem
- Given
- Set of correlated XPath expressions
- Stream of SAX events
- Decide
- When are the expressions true? (variable events)
20Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
[Figure: tree pattern for the command above; the variable events r, c, e2 and k2 are marked on the XML data below]
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
21The XPath Processor
- How we did it
- All XPath expressions are compiled into a Deterministic Finite Automaton
- Restriction: no predicates yet (current work...)
- Does this scale to many, many XPath expressions?
- Yes, if we compute the DFA lazily (upcoming ICDT 2003 paper)
- Evaluation time is parsing time
- Can do even better with a Stream IndeX (next)
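The lazy construction mentioned above can be sketched as follows. This is an illustrative reading of the idea, not the toolkit's implementation: the class and function names are hypothetical, the event format (pairs of "start"/"end" and a tag name) stands in for SAX callbacks, and support for `*` and `//` steps is an assumption of the sketch. Predicates are left out, matching the restriction stated on the slide. A DFA state is a set of (expression, position) pairs, and a transition is computed only the first time the stream asks for it.

```python
def parse_path(path):
    """'/bib//author' -> [('child', 'bib'), ('desc', 'author')]"""
    steps, axis = [], "child"
    for part in path.split("/")[1:]:
        if part == "":          # an empty part means we just saw '//'
            axis = "desc"
            continue
        steps.append((axis, part))
        axis = "child"
    return steps

class LazyDFA:
    def __init__(self, paths):
        self.steps = [parse_path(p) for p in paths]
        self.start = frozenset((i, 0) for i in range(len(paths)))
        self.table = {}         # (state, tag) -> state, filled on demand

    def step(self, state, tag):
        if (state, tag) not in self.table:
            nxt = set()
            for i, j in state:
                if j == len(self.steps[i]):
                    continue    # expression i already fully matched above
                axis, name = self.steps[i][j]
                if name in (tag, "*"):
                    nxt.add((i, j + 1))
                if axis == "desc":
                    nxt.add((i, j))   # '//' may skip over this element
            self.table[(state, tag)] = frozenset(nxt)
        return self.table[(state, tag)]

    def matches(self, state):
        return sorted(i for i, j in state if j == len(self.steps[i]))

def run(dfa, events):
    """Yield (expression index, tag) whenever an expression becomes true."""
    stack = [dfa.start]
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            for i in dfa.matches(stack[-1]):
                yield (i, tag)
        else:
            stack.pop()
```

Only the states and transitions actually reached by the data are ever materialized, which is why the table can stay small even when the set of expressions is large.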
22Stream IndeX (SIX)
- Solution: index the XML stream, parse only partially
- Definition: the SIX is a table of (start, end) offsets
23Stream IndeX (SIX) Construction
SIX:
element     start   end
bib         0       1490124
book        3       409023
publisher   12      423
author      426     879
author      978     . . .
. . .
XML:
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
24Stream IndeX (SIX) Skip Parsing
XPath: /bib/paper/title. . .
XML:
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
  <paper> . . .
</bib>
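The idea of skipping with a SIX can be sketched in a few lines. This is a toy model under stated assumptions, not the toolkit's format or code: `build_six` and `skip_select` are hypothetical names, the index is built with a regular expression that only understands plain <name> and </name> tags (no attributes, comments, or CDATA), and offsets are character positions in a string rather than byte positions in a stream. When an element does not match the next step of the path, its whole (start, end) range is skipped without looking inside it.

```python
import re

# Matches <name> or </name> only; a deliberately tiny subset of XML.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")

def build_six(xml):
    """Return [(name, start, end)] in document order; xml[start:end] is the element."""
    six, open_entries = [], []
    for m in TAG.finditer(xml):
        if m.group(1) == "":                  # start tag: open a new entry
            open_entries.append(len(six))
            six.append([m.group(2), m.start(), None])
        else:                                 # end tag: close the innermost entry
            six[open_entries.pop()][2] = m.end()
    return [tuple(entry) for entry in six]

def skip_select(xml, six, path):
    """Return the text of elements matching a child-only path such as /bib/paper/title."""
    steps = path.strip("/").split("/")
    results, ancestor_ends, skip_until = [], [], 0
    for name, start, end in six:
        if start < skip_until:
            continue                          # inside a subtree we decided to skip
        while ancestor_ends and start >= ancestor_ends[-1]:
            ancestor_ends.pop()               # we have left that matched ancestor
        depth = len(ancestor_ends)
        if name == steps[depth]:
            if depth == len(steps) - 1:
                results.append(xml[start:end])
                skip_until = end              # full match: nothing more needed inside
            else:
                ancestor_ends.append(end)     # partial match: descend
        else:
            skip_until = end                  # mismatch: jump past the whole element
    return results
```

In this toy version the loop still walks every index entry; a real implementation would presumably jump within the index as well, so that neither the skipped XML nor its index entries are read.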
25Stream IndeX (SIX) in XML Stream Processing
[Figure: a SIX (e.g. carried via DIME) paired with the XML stream <bib> <book> ... </bib>; SIX entries shown as (start, end) offsets: 0 205, 30 66, 72 188, 90 110, 95 98]
The SIX stream is about 6% of the data stream, and can be made much smaller.
28Outline
- The tools
- The XPath processing engine
- Conclusions
29Conclusions
- The toolkit is already available
- http://www.cs.washington.edu/homes/suciu/XMLTK
- http://xmltk.sourceforge.net
- What it does so far, it does very well
- Sorting, aggregation, nest/unnest
- But it doesn't do too much
- Restricted selections, no projections, no restructurings yet
- Volunteers welcome!
- Can one process XML data without parsing it completely?
- SIX