XMLTK: An XML Toolkit for Scalable XML Stream Processing

Slides: 29
Provided by: csWash
Transcript and Presenter's Notes


1
XMLTK: An XML Toolkit for Scalable XML Stream Processing
  • I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

2
Motivation
  • Lots of data sits in large text files
  • ad hoc data formats
  • Queried with Unix command line tools
  • grep, sort, tail, etc.
  • Would be nice to XML-ize it...
  • ...but then the Unix command line tools won't work any more.

3
Example
  • In the old Unix world

Text file (papers.txt)
score  decision  paperID  title
6      accept    P054     Theory of XML parsing
3      reject    P021     Experience with an XML optimizer
       accept    P069     Towards a unified theory of data models
. . . . . .

  • Find the top ten rejected papers (in score order)

grep reject papers.txt | sort | tail -10
4
Example (cont'd)
  • In the new XML world

<submissions>
  <paper>
    <score> 6 </score>
    <decision> accept </decision>
    <paperID> P054 </paperID>
    <title> Theory of XML parsing </title>
  </paper>
  <paper>
    <score> 3 </score>
    <decision> reject </decision>
    <paperID> P021 </paperID>
    <title> Experience with an XML optimizer </title>
  </paper>
  . . . . .
Can't use those tools anymore.
5
Example (cont'd)
  • Doing it with the XML Toolkit
  • Finds top ten rejected <paper>s, in <score> order

xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml |
  xtail -c /submissions -e paper -n 10
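
For comparison, here is a minimal in-memory sketch of what the pipeline above computes, written with Python's standard ElementTree. It is not how the toolkit works internally (the toolkit processes a stream). The function name `top_rejected`, the numeric comparison of scores, and the third sample entry (P100) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data modeled on the slide; the P100 entry is hypothetical.
SAMPLE = """<submissions>
  <paper><score>6</score><decision>accept</decision>
         <paperID>P054</paperID><title>Theory of XML parsing</title></paper>
  <paper><score>3</score><decision>reject</decision>
         <paperID>P021</paperID><title>Experience with an XML optimizer</title></paper>
  <paper><score>5</score><decision>reject</decision>
         <paperID>P100</paperID><title>Hypothetical entry</title></paper>
</submissions>"""

def top_rejected(xml_text, n=10):
    """Return the last n rejected <paper> elements in ascending score order."""
    root = ET.fromstring(xml_text)                      # context: /submissions
    rejected = [p for p in root.findall("paper")        # item: paper[decision = reject]
                if (p.findtext("decision") or "").strip() == "reject"]
    # Key: score/text(). Comparing as integers is an assumption of this sketch.
    rejected.sort(key=lambda p: int(p.findtext("score")))
    return rejected[-n:]                                # like tail: keep the last n

for paper in top_rejected(SAMPLE):
    print(paper.findtext("score"), paper.findtext("paperID"))
```

Unlike this sketch, which loads the whole document, the toolkit is meant for files too large to hold in memory.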
6
Goals of the XML Toolkit
  • Simple, scalable tools for XML processing
  • Provides a service: there are people who need this
  • Provides a research platform for XML stream
    processing

7
Outline
  • The tools
  • The XPath processing engine
  • Conclusions

8
The Tools
  • Current tools
  • xsort
  • xagg
  • xnest
  • xflatten
  • xdelete
  • xpair
  • xhead
  • xtail
  • file2xml
  • xmill

Will talk only about xsort.
May look like plenty, but actually still incomplete...
9
XSort Definition
General form
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
  • -c the context, i.e. where to sort
  • -e the item, i.e. what to sort
  • -k the key, i.e. what to sort on

10
XSort Definition
11
XSort Examples
Examples illustrated on data like this
<bib>
  <book>
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
    <title>XML in a Nutshell</title>
    <publisher>O'Reilly</publisher>
    <year>2001</year>
    <isbn>0-596-00058-8</isbn>
  </book>
  <paper>
    <author>Sylvain Devillers</author>
    <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
    <year>2001</year>
    <booktitle>WWW Posters</booktitle>
    <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
    <url>db/conf/www/www2001p.html#Devillers01</url>
  </paper>
  . . . . .
12
XSort Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>. The <book>s are dropped from the output.
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . </bib>
Compare to
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()
13
XSort Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then <firstName>
<bib> <author> . . . </author> <author> . . . </author> . . . . . </bib>
14
XSort Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
<bib>
  <paper> . . . </paper> <paper> . . . </paper> . . . . .
  <article> . . . </article> . . . . .
  <book> . . . </book> . . . . .
</bib>
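
The ordering rules from the last few examples can be sketched in memory as follows. This is an illustration under assumptions, not the toolkit's code: the helper `xsort_children`, its `item_specs` argument (a list of tag-or-`*` and optional key function, in `-e` order), and the sample data are hypothetical. Items are grouped by the first `-e` pattern they match, sorted by key within each group, and children matching no pattern are dropped.

```python
import xml.etree.ElementTree as ET

def xsort_children(context, item_specs):
    """Reorder the children of `context` in place.

    item_specs is a list of (tag_or_star, key_function_or_None) pairs,
    given in the same order as the -e options on the command line.
    """
    def rank(child):
        # Index of the first -e pattern this child matches, or None.
        for i, (tag, _) in enumerate(item_specs):
            if tag in ("*", child.tag):
                return i
        return None

    kept = [(rank(c), c) for c in context if rank(c) is not None]

    def sort_key(pair):
        r, child = pair
        key = item_specs[r][1]
        return (r, key(child) if key else "")

    kept.sort(key=sort_key)          # stable: unkeyed items keep document order
    context[:] = [child for _, child in kept]

def by_title(elem):
    return elem.findtext("title", "")

# Hypothetical sample data.
root = ET.fromstring(
    "<bib><book><title>Z</title></book><paper><title>B</title></paper><misc/>"
    "<article><title>M</title></article><paper><title>A</title></paper></bib>")
xsort_children(root, [("paper", by_title), ("article", None),
                      ("book", None), ("*", None)])
print([child.tag for child in root])
```

Leaving out the final `("*", None)` entry would drop the `<misc>` element, mirroring the earlier example in which the `<book>`s disappear from the output.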
15
XSort Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first. Leave other entries unchanged.
16
XSort Implementation
  • Sorts one context at a time, copies the rest
  • For each context
  • Create a global key for each item
  • Sort items, with a two-pass, multiway merge sort
  • Quote from Databases 101 (news from the trenches):
  • With disk blocks of 4 KB and 128 MB of main memory,
    one can sort files up to 4 TB in two passes!
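
The 4 TB figure follows from the textbook estimate for two-pass multiway merge sort: pass one writes sorted runs the size of main memory, and pass two merges as many runs as there are block-sized buffers. The sketch below checks the arithmetic and shows the two passes on an in-memory list. The function names are illustrative, real runs would live on disk, and the output buffer is ignored in the estimate.

```python
import heapq

KB, MB, TB = 2**10, 2**20, 2**40

def two_pass_capacity(memory_bytes, block_bytes):
    """Approximate largest input sortable in two passes."""
    buffers = memory_bytes // block_bytes   # number of runs mergeable at once
    return buffers * memory_bytes           # each run is one memory-load long

def two_pass_sort(items, run_size):
    """Pass 1: sort runs of run_size items. Pass 2: merge all runs at once."""
    runs = [sorted(items[i:i + run_size])
            for i in range(0, len(items), run_size)]
    return list(heapq.merge(*runs))

# 128 MB / 4 KB = 32768 buffers; 32768 * 128 MB = 4 TB
print(two_pass_capacity(128 * MB, 4 * KB) // TB, "TB")
print(two_pass_sort([5, 3, 9, 1, 4], run_size=2))
```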

17
XSort Performance
xsort -c /dblp -e * -k title/text()

Size (KB)     Xalan (sec)   Xsort (sec)
0.41          0.08          0.00
4.91          0.09          0.00
76.22         0.27          0.02
991.79        2.52          0.26
9671.79       27.45         2.85
100964.43     -             43.97
1009643.71    -             461.36

1 GB sorted in about 8 minutes!
18
Outline
  • The tools
  • The XPath processing engine
  • Conclusions

19
The XPath Processor
  • Common to all tools is the following problem
  • Given
  • A set of correlated XPath expressions
  • A stream of SAX events
  • Decide
  • When do the expressions become true? (variable events)

20
Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
Tree pattern and variable events (figure; labels: r, c, e2, k2)
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
21
The XPath Processor
  • How we did it
  • All XPath expressions are compiled into a Deterministic Finite
    Automaton (DFA)
  • Restriction: no predicates yet (current work...)
  • Does this scale to many, many XPath expressions?
  • Yes, if we compute the DFA lazily (upcoming
    ICDT 2003 paper)
  • Evaluation time is parsing time
  • Can do even better with a Stream IndeX (next)
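
A minimal sketch of the lazy-DFA idea, assuming a small XPath subset (child steps `/`, descendant steps `//`, names and `*`, no predicates). It is not the toolkit's implementation: the class `LazyDFA` and its fields are hypothetical. A DFA state is a set of (expression, position) pairs, and a transition is computed only the first time a (state, tag) pair is seen in the SAX stream.

```python
import re
import xml.sax

def parse_path(path):
    """Split '/a//b' into steps: [('child', 'a'), ('desc', 'b')]."""
    return [("desc" if sep == "//" else "child", name)
            for sep, name in re.findall(r"(//?)([^/]+)", path)]

class LazyDFA(xml.sax.ContentHandler):
    def __init__(self, paths):
        super().__init__()
        self.exprs = [parse_path(p) for p in paths]
        start = frozenset((i, 0) for i in range(len(self.exprs)))
        self.stack = [start]   # one DFA state per open element
        self.table = {}        # (state, tag) -> state, filled on demand
        self.matches = []      # (expression index, tag), in document order

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:              # lazy: build the transition now
            nxt = set()
            for i, pos in state:
                steps = self.exprs[i]
                if pos == len(steps):
                    continue                   # already fully matched
                axis, name = steps[pos]
                if axis == "desc":
                    nxt.add((i, pos))          # '//' may skip this element
                if name in ("*", tag):
                    nxt.add((i, pos + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def startElement(self, tag, attrs):
        state = self.step(self.stack[-1], tag)
        self.stack.append(state)
        for i, pos in sorted(state):
            if pos == len(self.exprs[i]):
                self.matches.append((i, tag))

    def endElement(self, tag):
        self.stack.pop()

DOC = (b"<bib><book><title>A</title></book><paper><title>B</title></paper>"
       b"<book><title>C</title></book></bib>")
dfa = LazyDFA(["/bib/paper/title", "//book"])
xml.sax.parseString(DOC, dfa)
print(dfa.matches, len(dfa.table))
```

The second `<book>` reuses transitions already in the table, so the work per SAX event is a dictionary lookup once the states that actually occur in the data have been built.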

22
Stream IndeX (SIX)
  • Solution: index the XML stream, parse only
    partially
  • Definition: the SIX is a table of (start, end)
    offsets
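
One possible way to build such a table, sketched with Python's standard expat binding. The slides do not say exactly which byte the end offset refers to; this sketch assumes it is the offset where the end tag begins. The function name `build_six` and the sample data are illustrative.

```python
import xml.parsers.expat

def build_six(data):
    """Return [(tag, start, end)] in document order; offsets are in bytes.

    start is the offset of the element's start tag. end is the offset of
    its end tag (an assumption of this sketch).
    """
    parser = xml.parsers.expat.ParserCreate()
    entries, open_stack = [], []

    def on_start(tag, attrs):
        open_stack.append(len(entries))
        entries.append([tag, parser.CurrentByteIndex, None])

    def on_end(tag):
        entries[open_stack.pop()][2] = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return [tuple(entry) for entry in entries]

DATA = b"<bib><book><title>A</title></book></bib>"
SIX = build_six(DATA)
for tag, start, end in SIX:
    print(tag, start, end)
```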

23
Stream IndeX (SIX) Construction
SIX
            start    end
bib         0        1490124
book        3        409023
publisher   12       423
author      426      879
author      978      . . .
. . .

XML
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
24
Stream IndeX (SIX) Skip Parsing
XPath
/bib/paper/title . . .

XML
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
  <paper> . . . . . .
</bib>
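
A rough sketch of skip parsing for a path such as /bib/paper/title: when an element cannot lie on the path, the scanner jumps straight to the offset recorded for its end instead of tokenizing its content. Everything here is illustrative. The index is built by a full scan only to keep the example self-contained (in the talk's setting it would already accompany the stream), the regex tokenizer assumes no self-closing tags, comments, or CDATA, and the `<paper>` entry is hypothetical.

```python
import re

TAG = re.compile(rb"<(/?)([\w-]+)[^>]*>")

def build_index(data):
    """Map each start-tag offset to (start, end) offsets of its end tag."""
    index, stack = {}, []
    for m in TAG.finditer(data):
        if m.group(1):
            index[stack.pop()] = (m.start(), m.end())
        else:
            stack.append(m.start())
    return index

def skip_match(data, index, path):
    """Return (text of elements matching path, number of tags scanned)."""
    steps = path.strip("/").encode().split(b"/")
    depth, pos, scanned, hits = 0, 0, 0, []
    while True:
        m = TAG.search(data, pos)
        if m is None:
            return hits, scanned
        scanned += 1
        pos = m.end()
        if m.group(1):                          # end tag of a path element
            depth -= 1
        elif m.group(2) != steps[depth]:        # off the path: skip the subtree
            pos = index[m.start()][1]
        elif depth + 1 == len(steps):           # full match: take its content
            close_start, close_end = index[m.start()]
            hits.append(data[m.end():close_start].decode().strip())
            pos = close_end
        else:
            depth += 1                          # on the path: descend

DATA = (b"<bib><book><publisher>Freeman</publisher>"
        b"<author>Jeffrey D. Ullman</author><title>Principles</title></book>"
        b"<paper><author>X</author><title>Hypothetical paper</title></paper></bib>")
hits, scanned = skip_match(DATA, build_index(DATA), "/bib/paper/title")
print(hits, scanned, "of", len(TAG.findall(DATA)), "tags scanned")
```

The whole `<book>` subtree is passed over after looking at a single tag, which is where the savings come from on large documents.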
25
Stream IndeX (SIX) in XML Stream Processing
Figure: the SIX (e.g. carried via DIME) travels alongside the XML stream <bib> <book> ... </bib>
Sample SIX entries (start end): 0 205, 30 66, 72 188, 90 110, 95 98
The SIX stream is about 6% of the data stream, and
can be made MUCH smaller
26
(No Transcript)
27
(No Transcript)
28
Outline
  • The tools
  • The XPath processing engine
  • Conclusions

29
Conclusions
  • The toolkit is already available
  • http://www.cs.washington.edu/homes/suciu/XMLTK
  • http://xmltk.sourceforge.net
  • What it does so far, it does very well
  • Sorting, aggregation, nest/unnest
  • But it doesn't do too much
  • Restricted selections, no projections, no
    restructurings yet
  • Volunteers welcome!
  • Can one process XML data without parsing it
    completely?
  • SIX