1
XML Evolution: Two-phase XML Processing Model Using XML Prefiltering Techniques
Demonstrated at VLDB 2006.
  • Emory University
  • November 17, 2006

Chia-Hsin Huang, IIS, Academia Sinica
Tyng-Ruey Chuang, IIS, Academia Sinica
James J. Lu, MathCS, Emory
Hahn-Ming Lee, CSIE, NTUST
2
Agenda
  • Issues of Conventional XML Processing Models (DOM
    and SAX)
  • Motivation
  • Two-phase XML Processing Models and Prefiltering
    Techniques
  • Experiments and Analysis
  • GIS Applications (Optional)
  • Conclusion and Future Work
  • Related Work (Optional)

3
Issues of Conventional XML Processing Model: DOM (1/4)
XPath expression: /html//p/text
child axis
descendant axis
Source: http://www.cee.hw.ac.uk/alison/netapp/dom/sld006.htm
4
Issues of Conventional XML Processing Model: DOM (2/4)
  • Pros
  • Provide flexible tree-traversal ability
  • Suitable for supporting XPath axes
  • Random access to the document
  • Cons
  • Need a lot of resources to build a DOM-tree
  • CPU time
  • Memory space
  • (Size of the XML doc.) : (Size of the DOM tree) ≈ 1 : 5

5
Issues of Conventional XML Processing Model: SAX (3/4)
XPath expression: //entry[@id="a2"]
Backward-reference XPE: //foo[text()="baz"]/ancestor::entry
Source: http://www.informatik.hu-berlin.de/obecker/Lehre/SS2002/XML/images/sax.t.gif
6
Issues of Conventional XML Processing Model: SAX (4/4)
  • Pros (compared with the DOM model)
  • Consumes far fewer resources
  • Uses a small, constant amount of memory
  • Supports stream processing
  • Cons
  • Must parse the whole document
  • No backtracking mechanism (forward-only parsing)
  • Lacks interactive mechanisms

7
Problems in Standard DOM and SAX Processing Models
  • Both DOM and SAX processing models waste a large
    amount of computational resources by processing
    uninteresting fragments.

8
Motivation
  • If programs require only small parts of the
    document, why do we need to process the entire
    document in order to find those fragments?
  • Is there any way to prevent DOM or SAX parsers
    from processing the entire document? And how?
  • Can we improve the standard DOM/SAX processing
    models without modifying (or just a little mod.)
    them?
  • What is the benefit?
  • What cost will we pay?

9
XML Prefiltering Technique: Our Solution
XPath expression (issued by users' apps)
Prefiltering techniques (a tiny search engine)
Candidate-set XML document
XML parsers (DOM/SAX)
XML document
10
Two-phase XML Processing Model: Enhanced User Applications
11
Two-phase XML Processing Model: Enhanced XPath Processors
12
Two-phase XML Processing Model: Enhanced User Applications
13
Two-phase XML Processing Model: Enhanced Stream-based XPath Processors
14
A Source Code Fragment of an XPath Processor with the XML Prefilter

15
An Example
  • XPath expression: //A//E
  • Answers: subtrees rooted at (E8,E15) and (E9,E14)

XPath Processors
XML Prefilter
The candidate-set XML Document
The XML Document
Exact Answers
16
Prefiltering Technique: Requirements and Limitations
  • Requirements
  • 100% recall rate (correctness)
  • Transparency (easy to use)
  • Non-intrusive (easy to integrate into XML processors)
  • Lightweight (XML DBs are expensive)
  • Efficient
  • Limitations
  • Needs a user query (we do not take multiple queries at a time because the candidate-set XML doc. may still be very large)
  • Needs to preprocess the XML document (best suited to large, infrequently updated documents)

17
System Architecture of the Prefiltering
Technique (DOM)
18
System Architecture of the Prefiltering
Technique (SAX)
19
Prefiltering Technique: Indexer
  • Position list: (start-tag position, end-tag position). We use the preorder number to express the tag offset.
  • Random access to the document by moving the file pointer to tag positions
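
A minimal Python sketch of such a position list follows. It is an illustration only, not the authors' indexer: it records character offsets rather than preorder numbers, the function name build_position_list is hypothetical, and the regular expression handles only simple tags (no comments, CDATA, or ">" inside attribute values).

```python
import re
from collections import defaultdict

# Matches start tags, end tags, and self-closing tags in simple XML.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")

def build_position_list(xml):
    """Map each element name to a sorted list of (start, end) character offsets."""
    index = defaultdict(list)
    stack = []  # open elements as (name, start offset)
    for m in TAG.finditer(xml):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if closing:
            open_name, start = stack.pop()
            if open_name != name:
                raise ValueError("mismatched tags: %s / %s" % (open_name, name))
            index[name].append((start, m.end()))
        elif self_closing:
            index[name].append((m.start(), m.end()))
        else:
            stack.append((name, m.start()))
    for positions in index.values():
        positions.sort()
    return dict(index)

# Hypothetical toy document.
doc = "<A><B>x</B><E><E>y</E></E></A>"
positions = build_position_list(doc)
start, end = positions["E"][1]   # the inner E element
fragment = doc[start:end]        # random access by offset, no re-parsing
```

With a file instead of a string, the slice would become a seek to the start offset followed by a read of end - start characters (or bytes, if byte offsets are indexed).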

20
Prefiltering Technique: Query Simplifier (QS)
  • Goal: reduce the cost of query evaluation
  • Simplification rules
  • SR1: omitting internal steps (b)
  • SR2: omitting branch steps (c1 and c2)
  • SR3: omitting wildcard steps (d)
  • SR4: replacing the parent/child axes with the ancestor/descendant axes (e)
  • Always apply SR4, because our query evaluation algorithm can determine ancestor/descendant relationships more efficiently than parent/child relationships

21
Prefiltering Technique: Query Simplifier (QS)
  • SR1: omitting internal steps (b)
  • SR2: omitting branch steps (c1 and c2)
  • SR3: omitting wildcard steps (d)
  • SR4: replacing the parent/child axes with the ancestor/descendant axes (e)

22
Query Simplifier: SR5, Omitting Uninformative Steps
Skipped intermediate nodes
XPE: /A/B/C/E/F
simXPE: //C//F (returns the same results). The prefilter runs more efficiently.
Intermediate nodes
A, B, and E are uninformative steps
Matched nodes
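
The rewriting from XPE to simXPE can be sketched in a few lines of Python. This is a hedged illustration, not the authors' Query Simplifier: the function name simplify is hypothetical, it handles only plain name steps separated by "/" or "//" (no predicates, wildcards, or branches), and the set of uninformative steps is supplied by the caller because the transcript does not say how it is derived.

```python
def simplify(xpath, uninformative):
    """Drop uninformative steps (SR5) and join the rest with the
    descendant axis (SR4). Only simple name steps are supported."""
    steps = [s for s in xpath.split("/") if s]          # "" entries come from "/" and "//"
    kept = [s for s in steps if s not in uninformative]
    return "".join("//" + s for s in kept)

# The example from the slide: A, B, and E are uninformative steps.
sim_xpe = simplify("/A/B/C/E/F", {"A", "B", "E"})
```

Because the simplified expression is less selective than the original, it can only return a superset of the original answers, which is consistent with the 100% recall requirement; the second phase then evaluates the exact expression on the candidate set.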
23
Prefiltering Technique: Fast Lightweight Steps-Axes Analyzer (FLISA)
  • Determines the candidate fragments in the XML
    document by evaluating the simplified XPath
    expression

The equations for evaluating u/axis::v
The answer to //A//E is (E8,E15). Note that the
subtree rooted at (E9,E14) is removed because it is
nested inside (E8,E15).
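
The equations themselves are not preserved in this transcript. The Python sketch below shows one common way to test the descendant axis with (start, end) positions, namely interval containment, and then drops nested answers. It is an assumption about the general technique, not a reconstruction of FLISA; the function names and the position of the A element are hypothetical.

```python
def descendants(ancestors, candidates):
    """Keep each candidate (start, end) that lies strictly inside some ancestor."""
    return [v for v in candidates
            if any(u[0] < v[0] and v[1] < u[1] for u in ancestors)]

def outermost(fragments):
    """Drop fragments nested inside an already kept fragment."""
    result = []
    for f in sorted(fragments):  # an enclosing fragment always starts first
        if not any(r[0] <= f[0] and f[1] <= r[1] for r in result):
            result.append(f)
    return result

# Hypothetical position lists: one A element and three E elements.
a_positions = [(1, 20)]
e_positions = [(8, 15), (9, 14), (25, 26)]

matches = descendants(a_positions, e_positions)   # evaluates //A//E
candidate_fragments = outermost(matches)
```

Keeping only the outermost fragment loses nothing, since the nested subtree (9, 14) is already contained in the text of (8, 15).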
24
Prefiltering Technique: Fragment Gatherer (FG)
  • Generate a candidate-set XML document
  • Generate fragments (simple outputs of FLISA)
  • Generate path information (only deal with the
    descendant axis)
  • Parse XML document from the root
  • When a start-tag is recognized, use its position
    to look up the corresponding end-tag position in
    the inverted index table
  • Check whether the parsed node N contains any
    candidate fragment as its descendant or N itself
    is a candidate fragment
  • If yes, output N.
  • If not, directly move the file pointer to its
    end-tag position (skip the frag.)
  • Note that we currently have no efficient way to
    generate the path information if the user's XPath
    expression contains the preceding, following, or
    sibling axes.
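
The traversal described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' Fragment Gatherer: nodes are given as an in-memory list of (name, start, end) tuples in document order instead of being read from a file, skipping a subtree is modeled by advancing the list index instead of moving a file pointer, and the name gather_fragments is hypothetical.

```python
def gather_fragments(nodes, candidates):
    """nodes: (name, start, end) tuples in document order.
    candidates: (start, end) positions of the candidate fragments.
    Returns the kept nodes and the number of nodes actually examined."""
    kept, examined = [], 0
    i = 0
    while i < len(nodes):
        name, start, end = nodes[i]
        examined += 1
        i += 1
        # N is on the path to a candidate (or is the candidate itself) ...
        contains = any(start <= cs and ce <= end for cs, ce in candidates)
        # ... or N lies inside a candidate fragment, which is output whole.
        inside = any(cs <= start and end <= ce for cs, ce in candidates)
        if contains or inside:
            kept.append((name, start, end))
        else:
            # Skip the fragment: jump past everything before N's end tag.
            while i < len(nodes) and nodes[i][1] < end:
                i += 1
    return kept, examined

# Hypothetical document R(A(B(C), D(E(E(F, G))))) with tag positions 1..18.
nodes = [("R", 1, 18), ("A", 2, 17), ("B", 3, 6), ("C", 4, 5), ("D", 7, 16),
         ("E", 8, 15), ("E", 9, 14), ("F", 10, 11), ("G", 12, 13)]
kept, examined = gather_fragments(nodes, [(8, 15)])
```

Here B contains no candidate, so its child C is never examined; on a real document the same jump skips the whole byte range of the fragment.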

25
Prefiltering Technique: Micro XML Streaming Parser (MXSP)
  • Transforms the candidate fragments into
    SAX-events
  • The procedure is similar to that of Fragment
    Gatherer
  • Provides interactive mechanisms by using the
    following additional flow-control operators
  • close-the-current-fragment (CCF)
  • jump-to-the-next-fragment (JNF)
  • terminate-the-parsing-process (TPP)
  • parse-next-node (PNN)
  • reparse-previous-fragment (RPF)
  • reparse-current-fragment (RCF)
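
A rough Python sketch of how a handler might steer such a parser is given below. It models only three of the six operators (PNN, JNF, TPP), represents each candidate fragment as a plain list of event names, and uses hypothetical names throughout (Op, stream_fragments); the actual MXSP interface is not shown in this transcript.

```python
from enum import Enum

class Op(Enum):
    PNN = "parse-next-node"
    JNF = "jump-to-the-next-fragment"
    TPP = "terminate-the-parsing-process"

def stream_fragments(fragments, handler):
    """Deliver events fragment by fragment; the handler's return value
    decides whether to continue, skip the rest of the fragment, or stop."""
    delivered = []
    for fragment in fragments:
        for event in fragment:
            delivered.append(event)
            op = handler(event)
            if op is Op.JNF:
                break             # skip the remaining events of this fragment
            if op is Op.TPP:
                return delivered  # stop parsing altogether
    return delivered

# Hypothetical handler: the first fragment is rejected at its first event,
# and the answer is found at event "b2".
def handler(event):
    if event == "a1":
        return Op.JNF
    if event == "b2":
        return Op.TPP
    return Op.PNN

events = stream_fragments([["a1", "a2", "a3"], ["b1", "b2"], ["c1"]], handler)
```

Plain SAX offers no comparable return channel, which is the lack of interactive mechanisms noted for the SAX model.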

26
Experiments and Analysis
27
Experiment and Analysis: Testing of Attributes-Testing Nodes
Path of the query: /site/regions/namerica/item[@id="item20748"]/name
Dataset: XMark Benchmark. Source: http://www-rocq.inria.fr/gemo/Gemo/Projects/SUMMARY/DTD-xmark.jpg
28
Querying Large XML Docs
Query: /site/regions/item[@id="item1"]/name
(matching one node)
N/A means that the method ran out of memory and
did not finish.
29
Querying Large XML Docs
Query: /child::site/child::regions/child::asia
(matching 4.5% of the nodes of the source document)
N/A means that the method ran out of memory and
did not finish.
30
Chinese Treebank
  • Semantically annotated corpus
  • Help parse and study Chinese sentences
  • Applications
  • Machine translation processing
  • Building example-based parsers
  • Comparing and integrating grammars
  • Developing and enlarging Treebank
  • ...
  • About 20,000 sentences in the CKIP Treebank V1.0
  • VP(HeadVK1??goalNP(HeadNdabe???))

(http://godel.iis.sinica.edu.tw/CKIP/trees1000.txt)
31
Experiment and Analysis: Sample Queries (1/2)
32
Experiment and Analysis: Sample Queries (2/2)
33
Experiment and Analysis: Treebank Search Engine
Oversimplifying a query
  • StreamPCRI is a stream-based structural pattern
    matching algorithm.

Our setup is an Intel Pentium-4 PC running at
2.53 GHz with 1 GB of DDR-RAM. All programs were
coded in ActivePerl 5.6.1.629 with the XML-SAX module
(v0.12) and XML-SAX-Expat (v0.37) [Huang et
al., 2005].
34
Experiment and Analysis: Testing Flow-Control Operators
  • Dataset: GML document (162 MB)
  • The XPath expression was to find all buildings
    within a range of 20,000 square meters, from
    (305500, 2767060) to (305600, 2767100).

35
Bounding Box and Query
The Bounding Box (BBox) of the Geo-obj.
Query 1 (mismatch)
Query 3 (unmatch)
  • Matching Process
  • Check BBox
  • Check boundary

Query 2 (match)
36
Skipping the Parsing of Uninteresting Fragments Using the JNF Flow-Control Operator (in MXSP)
Source XML Document
Candidate Frag. 2 (Matched)
Candidate Frag. 1 (Matched)
Unmatched
jump
jump-to-the-next-fragment (JNF)
Candidate Frag. 3
Candidate Frag. n
37
Experiment and Analysis: Testing of Flow-Control Operators
  • Lowers the cost: parses fewer nodes and performs
    less disk I/O
  • However, consumes a lot of memory

38
GIS Applications (presented at ACM-GIS06)
39
Snapshots of the GML-based Web GIS
Query by BBoxes
Query by Layers
Query by ID
Scalable Vector Graphics (SVG) Map Navigator
(powered by www.carto.net)
40
A GML Fragment
Geospatial Data (Coordinates)
XML/GML Tags
41
System Architecture of the GML-based Web GIS
  • GeoXQuery: a GML query engine [Boucelma and
    Colonna, 2004]
  • Extends the Saxon Java XQuery processor by
    calling the spatial function libraries of JTS (Java
    Topology Suite).
  • GeoSAX: a GML streaming parser
  • Extends Sun's SAX parser to support the
    spatial functions.

42
Problems in the GML Solution

GML
Web Server CGI
WebBrowser (SVG Nav.)
BIG
XQuery Expressions.
Query (BBox, Layers, Obj ID)
SVG Elements
SVG Elements
GeoXQuery or GeoSAX
  • If the GML documents are large:
  • GeoXQuery may not work (DOM data model consumes a
    huge amount of main memory.)
  • GeoSAX needs a stream-based query algorithm.

43
Integrating with an XML Pre-filter
  • Use an XML pre-filter technique [Huang et al.,
    2006] to cut off uninteresting XML/GML fragments
    by approximately executing the user query.
  • However, the prefilter does not support
    prefiltering of geospatial data.
  • That is, it cannot handle the BBox query constraint.

44
Bounding-Box Indexing Plug-in Module (BIPM) for
the XML Pre-filter
  • Bounding-box Indexing Plug-in Module (BIPM) is
    developed for the XML pre-filtering technique to
    perform geospatial filtering functionality.
  • BIPM can index the boundary of each geographical
    feature in the documents and provides an
    intersection operation to query indexed features.
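
The intersection operation over indexed bounding boxes can be sketched in Python as below. This is a linear-scan illustration under stated assumptions, not the BIPM implementation: boxes are (min_x, min_y, max_x, max_y) tuples, the feature IDs and their boxes are made up, and the names intersects and query_bbox are hypothetical. Only the query box is taken from the experiment described earlier.

```python
def intersects(a, b):
    """True if two axis-aligned boxes (min_x, min_y, max_x, max_y) overlap or touch."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def query_bbox(index, box):
    """Return the IDs of all indexed features whose bounding box meets the query box."""
    return sorted(fid for fid, fbox in index.items() if intersects(fbox, box))

# Hypothetical index: feature ID -> bounding box of the geographical feature.
bbox_index = {
    "b1": (305480, 2767050, 305520, 2767070),
    "b2": (305700, 2767000, 305800, 2767050),
    "b3": (305590, 2767090, 305650, 2767150),
}
hits = query_bbox(bbox_index, (305500, 2767060, 305600, 2767100))
```

A bounding-box hit is only a candidate: the exact boundary of the feature still has to be checked in the second phase, which matches the two-step matching process (check BBox, then check boundary) described for the flow-control experiment.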

45
Indexing Bounding Boxes
Indexing the Bounding Boxes (BBox) for all
Geo-objects.
46
Prefiltering with the Bounding-Box Indexing
Plug-in Module
//Rivers//FootPrint
XML Prefilter
Intersection
BIPM
BBox(xx,yy,xx,yy)
Final Pre-filtering Results
47
Environment and Datasets
  • Two datasets
  • 1.1 GB GML document (the Taipei city)
  • 152 MB GML document (the Xinyi area)
  • Six GML processors
  • GeoXQuery
  • GPXQuery with BIPM
  • GPXQuery without BIPM
  • GeoSAX
  • GPSAX with BIPM
  • GPSAX without BIPM
  • Setup
  • an Intel Pentium-4 PC running at 2.53 GHz with 1
    GB DDR-RAM,
  • a 120 GB EIDE hard disk,
  • the MS Windows 2000 server.
  • Java 2 (Standard Edition V.1.4.2).

48
Query Constraints
49
Datasets
Large dataset: Taipei, 1.1 GB
Small dataset: Xinyi, 152 MB
V2
V4
50
Querying by a Feature ID: XQuery-based Processors
The query returns a geo-feature.
N/A means that the processor ran out of memory
and did not finish.
The pre-filtering technique lowers resource
consumption.
51
Querying by a Layer and a BBox: XQuery-based Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
The pre-filtering technique lowers resource
consumption.
52
Querying by a BBox: XQuery-based Processors
The query returns geo-features in V4.
The query returns geo-features in V2.
BIPM can efficiently filter out uninteresting
geographic features.
53
Querying by a Feature ID: SAX-based Processors
The query returns a geo-feature.
The pre-filtering technique lowers the run time
but increases memory consumption.
54
Querying by a Layer and a BBox: SAX-based Processors
The query returns the Energy Supply Utility layer
in V4.
The query returns the Energy Supply Utility in V2.
55
Querying by a BBox: SAX-based Processors
The query returns geo-features in V4.
The Cost of pre-filtering GML docs.
The query returns geo-features in V2.
56
Conclusion
  • If programs require only small parts of the document, do we need to process the entire document in order to find those fragments? No, it is unnecessary.
  • Is there any way to prevent DOM or SAX parsers from processing the entire document, and how? Yes: prefilter the XML documents.
  • Can we improve the standard DOM/SAX processing models without modifying them (or with only minor modifications)? One instruction is enough (using the two-phase processing model).
  • What is the benefit? More efficient XML document processing.
  • What cost will we pay? Memory, storage, and the cost of indexing.

57
Future Work
  • Lowering memory consumption
  • Developing index management subsystems
  • Investigating more efficient ways to prune the XML
    doc. and generate the path information of the
    candidate-set document
  • Integrating the prefiltering technique into DOM-
    and stream-based XPath processors and XQuery
    processors (already done, see http://www.iis.sinica.edu.tw/jashing/prefiltering/)

58
Thank you for your attention
Questions and Comments
All the software packages of the XML Prefilter
are available at http://www.iis.sinica.edu.tw/jashing/prefiltering/
59
XML Processing Enhancements
XML Applications
  • Unchangeable?
  • or a few modifications!

Requirements?
XML Standards
  • Unchangeable!?

60
Issues in Existing XML Processing Enhancements
  • Consume large amount of disk/memory space and CPU
    time (Cost )
  • Large-scale (Cost )
  • Integrate with relational database (Cost )
  • Complicated index/query algorithms (Cost )
  • Intrusive (considerable modifications) (Cost
    )
  • Non-transparent (apps. need to be aware of the
    mechanics) (Cost )

61
Experiment and Analysis: Datasets
Note: The CKIP Chinese Treebank corpus and the
GML file are encoded in UTF-8. Although we
initialized the Expat SAX parser and the primitive
SAX parser with the parameter (ProtocolEncoding
=> 'UTF-8'), it still did not work and showed the
warning message "Wide character in print".
62
Experiment and Analysis: Treebank Search Engine
Element freq. of the Chinese Treebank
Removing half of steps with consideration to
element frequencies
63
Related Work
  • Lazy XML processing [Noga et al., 2002] and
    Apache Xerces's lazy processing [The Apache
    Xerces2 parser 2.8.1 Release]
  • These approaches avoid parsing an entire document into
    memory by incrementally building a DOM tree as
    different parts of the document are requested by
    the user.

64
Related Work
  • Projecting XML documents [Marian and Simeon,
    2003]
  • Prunes the uninteresting fragments of
    the target XML document by considering the user's
    XPath expression when loading the document.

65
Related Work
  • Type-based XML projection [Benzaken et al.,
    2006]
  • Prunes an XML document more precisely in the
    presence of the document type definition (DTD) or
    the schema of the document.

Projector
66
Related Work
  • Accelerating queries by pruning XML documents
    [Bressan et al., 2005]

67
Issues of Manipulating GML Docs.
  • GML provides a rich vocabulary and a flexible
    document structure for expressing complicated
    geospatial and non-geospatial data.
  • Although GML is a kind of XML, the existing XML
    processors (DOM, SAX, XPath, and XQuery) are not
    suitable for processing GML.

DOM, SAX, XPath, XQuery
?
XML
GML
68
Solutions
  • GIS databases,
  • Open source software, e.g. PostgreSQL/PostGIS.
  • Many people choose this approach.
  • Extending the existing XML processors
  • This is the approach discussed here.

DOM, SAX, XPath, XQuery
GeoSAX, GeoXQuery (GeoXPath)
XML
GML
69
Contributions
  • Proposing two efficient GML-native processors.
  • Enabling the GML processors to query large GML
    docs.
  • Building a GML-based Web GIS using the GML
    processors.

Bounding-box Indexing Plug-in Module
Indexing
XML Pre-filtering Technique
Spatial Extension
GML Query Engines
XQuery
SAX
XML/GML
Data storage
Streaming
DOM
70
XQuery Expression with Geospatial Extension Libraries
Geospatial extension for XQuery

    declare namespace my="http://www.sinica.edu.tw/";
    declare namespace gml="java:GML.XQGeoExtensions";
    declare namespace svg="java:GML.XQSVGExtensions";
    declare function my:get_geo1() as element() {
      for $var1 in doc("lanyu.xml")//Rivers//FootPrint[@id = "21001000000-11"]
      return <result1>{$var1}</result1>
    };
    declare function my:get_geo2() as element() {
      for $var1 in doc("lanyu.xml")//Roadways//FootPrint[@id = "4230904000-31"]
      return <result1>{$var1}</result1>
    };
    svg:GML2SVG(gml:Buffer(
      gml:Intersection(my:get_geo1(), my:get_geo2()), 50))

Geospatial Operations
Calculating the buffer of the intersection of a
road and a river
71
Query Results
(a) A road.
(b) A river.
(c) The results of buffering the intersection of
the road and the river
(d) Combine and recolor (a), (b), and (c) in an
SVG map.
72
GML-native Processors
Bounding-box Indexing Plug-in Module
GPXQuery
GPSAX
XML Pre-filtering Technique
Spatial Extension
GeoSAX
GeoXQuery
XQuery
SAX
XML/GML
Streaming
DOM