Title: Document Value Model: Value-oriented XML processing for the internet
1Document Value Model: Value-oriented XML
processing for the internet
- Fritz Henglein
- DIKU, University of Copenhagen
- henglein_at_diku.dk
2Abstract
- XML is all the rage. How do we store and process
XML documents, however? In this talk we present
XML Value Store, a persistent distributed
(peer-to-peer) storage manager with a
value-oriented interface, the Document Value
Model (DVM), for XML documents whose parts may be
distributed around the net and even moving around
(such as on a cell phone at 140 km/h on a
motorway). We compare DVM with existing XML
processing languages and specifically the
W3-consortium based Document Object Model (DOM).
We argue that, apart from a series of technical
advantages, the central benefit of DVM is a
simplified programming model that lets the
programmer focus on application logic, and the
XML middleware on persistence management,
caching, replication, coalescing, encryption,
distribution, lookup, routing and internet data
transport. We finally sketch a simple extension
of XML Value Store with remote execution.
Together with storing code in the XML Value Store
this lets users send queries to remote XML Value
Stores for execution and promises highly scalable
grid computing functionality with a simple,
problem-oriented programming model.
3Abstract (long)
- XML (eXtensible Markup Language) is emerging as the
universal language for representing
semi-structured data for distributed storage and
information interchange on the internet and as
such is destined to be the universal tissue --
the lingua franca -- for interoperable web
services and databases interconnecting the
internet. This makes XML processing an
undisputed growth industry. But how is it done?
We give examples of processing XML documents
using domain-specific languages XSLT and XQuery,
and general purpose interfaces SAX and DOM for
manipulating structure and contents of XML
documents. The latter, Document Object Model, is
based on object-oriented programming principles
in which tree nodes are mutable objects, with
associated methods for imperatively updating
their state. Furthermore, each tree node in DOM
is equipped with a (single) parent reference and
a (single) document root reference, which means
DOM-nodes cannot be shared and cannot be moved to
other documents. Furthermore, in practice a DOM
program starts by parsing an XML document
completely into memory before performing any
processing on it, however small a part of the
document is actually required for that, and it
finally pretty-prints its tree structure and
writes the whole document out on disk, however
little of it is actually changed since it was
read. In this talk we present DVM (Document Value
Model), a value-oriented interface for processing
XML documents, and XML Value Store, a distributed
(peer-to-peer) storage manager for storing XML
documents in parsed form. We illustrate XML
documents and document nodes, based on treating
nodes as values -- immutable objects. The
immutability of nodes in DVM allows aggressive
and safe use of sharing through value references
-- universal pointers to immutable objects stored
anywhere -- in the XML Value Store. This has a
number of technical advantages over DOM: document
nodes are sharable, also across multiple
documents; loading and saving of nodes into/from
memory is done by need, that is, only those nodes
needed by a computation are loaded and only those
not already saved on disk are actually saved;
node pointers point to nodes wherever they are
stored, even if they move around frequently; parsing
and unparsing for persisting (storing on disk)
are eliminated since the XML tree, not its
linearized form, is stored; nodes can be cached
and replicated aggressively for performance
without concern for (cache) incoherence;
identical nodes (document parts) are only saved
once on disk as opposed to multiple times, even
if different users accidentally store the same
data; no parent and root references are
stored, yet navigation to parent and root is
still possible. The main advantage, we argue, of
this is that these 'generic' (computer sciency)
data management concerns can be and are handled
in the XML Value Store, not in the programmer's
application logic. A planned extension of XML
Value Store is the addition of a 'higher-order'
interface, which allows remote execution. This
allows sending scripts (queries) to an XML Store
for remote execution and promises to provide
scalable grid computing functionality with a
simple, problem-oriented programming model.
4Abstract (for functional programmers)
- This talk is basically about programming with
(disk and network) I/O in a functional,
high-level fashion.
5Overview (buzzy version)
6Overview
- OOP: object-oriented programming and distribution
and mobility
- VOP: value-oriented programming
- XML: XML processing models and languages
- DOM: Document Object Model
- DVM: Document Value Model
- X: The Unknown (future work)
7Overview
General theme: programming with values
(immutable objects)
... and objects (and carefully distinguishing
between them).
8OOP or rather imperative programming
- Basic model of programming
- primitive in-place update operations: obj.field = obj2ref
- compound update operations: controlled sequential
execution of updates, e.g. for (int i = 0; i < arr.size; i++)
arr[i] = newVal(i);
9Imperative programming theme
- Goal: global state transition from State_0 to
State_n; State_0 is destroyed.
- Implementation (ephemeral state updates): a sequence State_0
-> ... -> State_i -> ... -> State_n of primitive state
transitions, where
- each primitive update destroys the previous state
10Consequence 1
- software component interfaces are state-oriented
and stateful
- which operations are available depends on the history
of operations executed
- responses from components depend on the history of
operations executed
- Example: Unix file I/O
- NB: Operations on such components are not
necessarily atomic (or even recoverable)
11Copy-and-update programming
- Note
- data get copied
- they are not always coherent
- they get copied again
input(f)
process(s)
output(f)
12Why (and when) it works (well)
- no concurrent access to file
- sequential and synchronous programming (control
over sequence of state changes)
- no partial failures: atomic abort due to single
point of failure (single-process execution on
single processor)
- no replication of stateful data
- random access to location of data (rapid access
no matter where they are stored)
13Consequence 2
- Software/hardware component APIs are
copy-oriented: data referenced by a pointer get
copied before being manipulated to ensure
integrity
- Example: Modern operating systems are based on
separation of address spaces, which requires copying of
data or delegation of tasks (ask the other
process to do something for me)
14Imperative programming: Problem areas
- caching and replication require heavy coherence
protocols, or different states are observable by
clients and users
- e.g. file save under NFS (wait for 30 seconds!!)
- atomic (commit of compound) update is difficult
to achieve in the presence of partial failures
- rollback is not naturally supported, but
normally required in situations where (atomic)
updates can fail
- coalescing identical data (storing data only
once) cannot be done (easily)
15Imperative programming: Problem areas...
- programming is mostly synchronous to control
degree of nondeterminism due to concurrency
- access to storage locations is not random (no
modern file system does what's shown before)
- access to updatable objects is typically
location-based; mobile objects are not
naturally supported
- lots of data stored multiple times
16...but, of course
- Updatable objects are excellent for propagating
information to an arbitrary number of clients (to
any caller of the object; one needn't even know or
keep track of the number or identity of callers)
17Properties of distributed (mobile) systems
- Partial failures
- can't even distinguish network failures from
computing node failures
- Concurrency
- Difficult (exact) synchronization of processes
- Widely varying access latency
- RPC may block for an arbitrarily long time
18Techniques for battling these problems
- Caching, replication, memoization
- (buffered) asynchronous message passing
- relaxed or indeterminate semantics
- time-outs
- observational differences between processes
running on same machine or on different machines
Not good for mobile code!
19Central problem
- ...not reading (loading)
- ...not writing (saving, allocating)
- but updating (overwriting)
Breaks commuting
Note: the more updating, the fewer operations
commute and the more their execution needs to be
controlled (synchronized).
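A tiny sketch of the point about commuting, under simplifying assumptions: saving is modelled as adding a value to an immutable set, and updating as replacing the content of a single cell. The function names are hypothetical and only serve the illustration.

```python
def save(store, value):
    """Saving adds a new immutable value; nothing that exists is changed."""
    return store | {value}


def overwrite(cell, value):
    """Updating replaces whatever the cell held before."""
    return value


empty = frozenset()

# Saves commute: the order in which two values are saved is unobservable.
saves_commute = save(save(empty, "a"), "b") == save(save(empty, "b"), "a")

# Overwrites do not: the final cell content depends on the order.
overwrites_commute = (
    overwrite(overwrite(None, "a"), "b") == overwrite(overwrite(None, "b"), "a")
)
```

Because the two saves can be reordered freely, they need no synchronization; the two overwrites do.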
20VOP: Value-oriented programming
- Programming with
- arbitrarily large values (immutable objects),
stored not only in RAM, but also on disk and on
the net
- location-independent value references (short,
probabilistically unique identifiers of values,
wherever they are stored); can be thought of as
light-weight proxies for actual (big) values
- plus small stateful cells (mutable objects) and
cell references, incl.
- wait-free registers with consensus number
infinity (e.g., compare-and-swap registers)
21Benefits/goals
- Value references
- efficient sharing of immutable data
- efficient message passing
- Arbitrarily large values
- programmatic support of efficient atomic update:
build new (global) state as value, then perform
update atomically by assigning value reference of
new value to register holding present state.
- Small registers
- guaranteeing atomic update, with no (or minimal)
locking
- wait-freeness: ensures progress (doesn't get
blocked forever or for too long) of each client,
even in the face of partial failures elsewhere
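The "build new state as value, then swap the reference" idea can be sketched as follows. This is a minimal in-process illustration, not the talk's implementation: `Register`, `compare_and_swap` and `update` are hypothetical names, the phone-book entries are made up, and the lock merely emulates an atomic compare-and-swap (so this emulation is not itself wait-free).

```python
import threading


class Register:
    """Small mutable cell holding a reference to the current immutable state."""

    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()  # stands in for a real atomic CAS primitive

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Install `new` only if the register still holds `expected`."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


def update(register, transform):
    """Build the new state as a value, then publish it in one atomic step."""
    while True:
        old = register.get()
        new = transform(old)  # the old state is never modified
        if register.compare_and_swap(old, new):
            return new
        # Someone else published first: retry against the newer state.


# State is an immutable tuple of (name, number) pairs (illustrative data).
phone_book = Register(())
update(phone_book, lambda s: s + (("Alice", "555-0100"),))
update(phone_book, lambda s: s + (("Bob", "555-0101"),))
```

A failed update leaves the published state untouched, so no rollback logic is needed: the half-built new value is simply dropped.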
22XML
- XML info set ((Minimal) XML tree): labeled
ordered tree, with
- character data at the leaves
- key/value pairs (attributes) at the internal
nodes
- XML document
- linearized representation of XML tree based on
pre/post-order traversal of XML tree
23XML example
<?xml ...?>
<book>
  <author> Susanne Staun </author>
  <title> Mit smukke lig </title>
</book>
book
  author
    Susanne Staun
  title
    Mit smukke lig
24Document Object Model (DOM)
book
  author
    Susanne Staun
  title
    Mit smukke lig
25DOM characteristics
- object-oriented: nodes are objects, have methods
that, amongst others, update their properties
(children, attributes, parent pointer)
- purely tree oriented: each node has at most one
predecessor, no node sharing
- cloning is used to copy a node into another
place of a document
26DOM specification
- Specified by W3C, see www.w3.org/DOM
- Specification has 3 levels (specifying more and
more functionality for document objects)
27Programming with DOM
- Typical scenario
- Read linearized XML document from file or network
pipe (socket).
- Parse XML document into an in-memory tree data
structure corresponding to DOM
- Traverse and manipulate in-memory structure
- Unparse in-memory structure to linearized XML
document
- Write out XML document through file or network
pipe interface.
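For concreteness, the scenario can be sketched with Python's standard `xml.dom.minidom`, used here only as a convenient stand-in for a DOM implementation. The in-memory string replaces the file or socket, and upper-casing the title is an arbitrary example manipulation.

```python
from xml.dom import minidom

source = "<book><author>Susanne Staun</author><title>Mit smukke lig</title></book>"

# Read and parse: the whole document becomes an in-memory tree.
doc = minidom.parseString(source)

# Traverse and manipulate: nodes are mutable objects, updated in place.
title = doc.getElementsByTagName("title")[0]
title.firstChild.data = title.firstChild.data.upper()

# Each node carries a parent reference, so it belongs to exactly one place.
parent_tag = title.parentNode.tagName

# Unparse: the whole tree is linearized again, although one leaf changed.
result = doc.documentElement.toxml()
```

Even this one-leaf change goes through a full parse and a full unparse of the document.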
28Document Value Model (DVM)
Sharing!!
book
  author
    Susanne Staun
  title
    Mit smukke lig
Isn't that just a picture of the XML tree model?
29Navigation
- How do we navigate in an XML tree without parent
and root pointers?
- DOM: current node contains complete navigation
state, including parent and root pointers
- DVM: navigation state characterized by n0, ...,
nk where n0 is root and nk is current node
- allows navigation to parent and root, just as in
DOM
- does not require any storage in nodes, unlike DOM
- works also for shared nodes (bread crumbs
method for finding one's way back in a labyrinth
DAG)
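A minimal sketch of path-based navigation, assuming a simplified node type without attributes. `Element` and `Cursor` are hypothetical names, and the second document (`anthology`) is invented to show a shared node.

```python
from typing import NamedTuple, Tuple, Union


class Element(NamedTuple):
    tag: str
    children: Tuple["Node", ...] = ()


Node = Union[Element, str]  # a node is an element or character data


class Cursor(NamedTuple):
    """Navigation state: the path n0, ..., nk from the root to the current node."""

    path: Tuple[Node, ...]

    @property
    def current(self):
        return self.path[-1]

    @property
    def root(self):
        return self.path[0]

    def child(self, i):
        return Cursor(self.path + (self.current.children[i],))

    def parent(self):
        if len(self.path) < 2:
            raise ValueError("the root has no parent")
        return Cursor(self.path[:-1])


author = Element("author", ("Susanne Staun",))
book1 = Element("book", (author, Element("title", ("Mit smukke lig",))))
book2 = Element("anthology", (author,))  # hypothetical document sharing `author`

c1 = Cursor((book1,)).child(0)
c2 = Cursor((book2,)).child(0)
```

Both cursors point at the same `author` value, yet each finds its own parent and root, because that information lives in the cursor rather than in the node.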
30DVM basic interface
- The type of XML trees is an inductive datatype
- Basic constructors (factory methods)
- Combine attributes, child list, tag into new
element node
- Make chardata node from string
- Basic deconstructors (projections)
- Get attributes, child list, tag, chardata
- Cells (updatable nodes)
- setState, getState: atomic operations
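One way to picture this interface is as a pair of immutable record types plus a separate cell type. The sketch below is an assumption about shape only; the class and method names are hypothetical and the `lang` attribute is invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class CharData:
    text: str


@dataclass(frozen=True)
class Element:
    tag: str
    attributes: Tuple[Tuple[str, str], ...]  # key/value pairs
    children: Tuple["Node", ...]


Node = Union[CharData, Element]


class Cell:
    """Updatable node: the only place where state changes."""

    def __init__(self, state):
        self._state = state

    def get_state(self):
        return self._state

    def set_state(self, state):
        self._state = state  # a single reference assignment


# Constructors build values; field access plays the role of the projections.
book = Element(
    "book",
    (("lang", "da"),),  # hypothetical attribute
    (
        Element("author", (), (CharData("Susanne Staun"),)),
        Element("title", (), (CharData("Mit smukke lig"),)),
    ),
)

cell = Cell(book)
cell.set_state(Element("book", book.attributes, book.children[:1]))
```

The cell now refers to a one-child book, while the original `book` value is unchanged and still usable by anyone holding a reference to it.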
31DVM general interface
- Equip nodes with the ability to receive and apply
any function to themselves, or a function that is
applied to each of their subnodes
- Called Visitor pattern in OO design
- Corresponds to unique homomorphism/type
elimination rule (fold) known from algebraic
datatypes/type theory
- Lets nodes not only receive single commands for
execution, but whole programs.
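A fold over a simplified tree type (no attributes) might look like the sketch below. The names `Element` and `fold` are hypothetical, and the serializer does no escaping.

```python
from typing import NamedTuple, Tuple, Union


class Element(NamedTuple):
    tag: str
    children: Tuple["Node", ...] = ()


Node = Union[Element, str]  # a node is an element or character data


def fold(node, on_chardata, on_element):
    """Replace each constructor of the tree by a function, bottom-up."""
    if isinstance(node, str):
        return on_chardata(node)
    results = tuple(fold(c, on_chardata, on_element) for c in node.children)
    return on_element(node.tag, results)


book = Element("book", (
    Element("author", ("Susanne Staun",)),
    Element("title", ("Mit smukke lig",)),
))

# Two "programs" handed to the same tree: total text length, and unparsing.
text_length = fold(book, len, lambda tag, rs: sum(rs))
serialized = fold(
    book,
    lambda s: s,
    lambda tag, rs: f"<{tag}>{''.join(rs)}</{tag}>",
)
```

The caller supplies only the two functions; the traversal itself stays with the data.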
32Share-and-create style updating
book
  author
    Susanne Staun
  title
    Mit smukke lig
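Share-and-create updating can be sketched as path copying: only the nodes on the path from the root to the change are created anew, and everything else is shared with the old tree. The helper `replace_at` and the replacement text are hypothetical.

```python
from typing import NamedTuple, Tuple, Union


class Element(NamedTuple):
    tag: str
    children: Tuple["Node", ...] = ()


Node = Union[Element, str]


def replace_at(node, path, new_subtree):
    """Return a new tree equal to `node` except at `path` (a tuple of child indexes)."""
    if not path:
        return new_subtree
    i, rest = path[0], path[1:]
    updated = replace_at(node.children[i], rest, new_subtree)
    children = node.children[:i] + (updated,) + node.children[i + 1:]
    return Element(node.tag, children)


old = Element("book", (
    Element("author", ("Susanne Staun",)),
    Element("title", ("Mit smukke lig",)),
))

# "New title" is placeholder text for the illustration.
new = replace_at(old, (1, 0), "New title")
```

After the update, `old` is still intact and `new.children[0]` is the very same `author` node as `old.children[0]`.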
33Universal references
[Figure: the book tree (book; author, title; Susanne Staun, Mit smukke lig) spread over RAM and disk storage, with the annotation "Never loaded from disk!"]
34Universal references
- Value references are location independent
- always designate value, not where value is stored
- require routing service to be resolved!
- Value references can point from any place to any
place
- from RAM to disk, from disk to disk, from disk to
network, from disk to RAM (!)...
35XML Value Store
- Distributed persistence manager for XML elements
- Peer-to-peer architecture
- Global name server for binding and rebinding
value references to human-readable names
- Rebinding: bindings can be updated atomically.
36XML Store: Basic interface
- Load value: Value load(ValueRef vr)
- Save value: ValueRef save(Value v)
- (That's it)
- Security/authentication not addressed yet
- extended access control based interface
- encrypted storage
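A toy in-memory store with this two-operation interface is sketched below. Deriving the reference from a SHA-256 hash of the content is an assumption made for the example, one plausible way to obtain short, probabilistically unique identifiers; `ValueStore` and its JSON encoding of nodes are likewise hypothetical.

```python
import hashlib
import json


class ValueStore:
    """Toy store: save returns a reference, load resolves it."""

    def __init__(self):
        self._blobs = {}

    def save(self, value):
        data = json.dumps(value, sort_keys=True).encode("utf-8")
        ref = hashlib.sha256(data).hexdigest()  # assumed reference scheme
        self._blobs.setdefault(ref, data)       # identical values stored once
        return ref

    def load(self, ref):
        return json.loads(self._blobs[ref])

    def size(self):
        return len(self._blobs)


store = ValueStore()

# Nodes are encoded as [tag, children]; children hold text or references.
author = store.save(["author", ["Susanne Staun"]])
book = store.save(["book", [author]])
again = store.save(["author", ["Susanne Staun"]])  # same value, saved twice
```

With content-derived references, saving the same value twice yields the same reference and occupies storage only once.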
37XML Store: General interface
- The visitor interface allows nodes to receive any
function and apply it to their state.
- Let's do the same with the XML value store
interface: extending it with a visitor interface
allows XML value stores to receive arbitrary code
and execute it.
- Allows implementation of
- query languages
- general remote processing (e.g. for grid
computing)
38Code as values
- Program code = value: code can be stored in the
XML store.
- Remote execution then involves passing a value
reference to the code to the receiver. If the
receiver already has the corresponding value
(code), e.g. due to caching in the XML value
store, no further communication is necessary;
otherwise the value is requested (pulled in) by
the receiver.
39XML Value Store architecture
- Base configuration: each peer is a single
component made up of
- raw disk manager
- network proxy for group of remote XML-store peers
- group communication presently based on
- IP-multicast (Pedersen/Tejlgaard 2002), or
- Chord-routing protocol (Baumann/Fennestad/Thorn
2002)
40Configurable XML Stores
- Goal: clients can construct XML Stores by
composing them from
- primitive XML stores (disk manager, in-RAM
manager, adapters to databases, file managers
etc.), and
- XML store constructors (decorators)
- caching reads and writes
- asynchronous load/save
- buffered load/save requests
- encryption/decryption
- Target date August 2003
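The decorator idea can be sketched with a primitive in-RAM store wrapped by a caching store that offers the same load/save interface. Both classes are hypothetical, and the counter-based references are a simplification for the example.

```python
class DictStore:
    """Primitive in-RAM store; references are plain counters here (an assumption)."""

    def __init__(self):
        self._values = {}
        self.loads = 0  # counts how often the primitive store is actually read

    def save(self, value):
        ref = len(self._values)
        self._values[ref] = value
        return ref

    def load(self, ref):
        self.loads += 1
        return self._values[ref]


class CachingStore:
    """Decorator: same load/save interface, adds a read cache."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def save(self, value):
        ref = self._inner.save(value)
        self._cache[ref] = value
        return ref

    def load(self, ref):
        if ref not in self._cache:
            self._cache[ref] = self._inner.load(ref)
        return self._cache[ref]


base = DictStore()
ref = base.save("<book/>")

store = CachingStore(base)
first = store.load(ref)   # goes to the primitive store
second = store.load(ref)  # served from the cache
```

Since stored values never change, the cache never needs invalidation, which is what keeps such decorators simple to stack.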
41A simple challenge
- Write a little program that implements a
dictionary, e.g. for looking up phone numbers,
and inserting and updating records.
- It should work on the net (concurrent access).
- It should work for a while (also after the
machine has been taken down and restarted).
- Surprisingly more complex to program than the
routines you learned in algorithm class...
42Summary
- Value-oriented model for manipulating
semistructured data
- supports light-weight caching, replication,
asynchronous computing in the XML middleware
- Configurable XML middleware (client can order the
properties one wants from the XML store)
- Separation of program logic (in the client code)
from generic deal
- Encourages clients to write transaction safe code
programmatically
43More info
- Website: www.plan-x.org
- Presently contains material from seminar on
distributed and mobile data and software
(including lots of references not mentioned here)
- Email: henglein_at_diku.dk