Document Value Model: Value-oriented XML processing for the internet

1
Document Value Model: Value-oriented XML
processing for the internet
  • Fritz Henglein
  • DIKU, University of Copenhagen
  • henglein_at_diku.dk

2
Abstract
  • XML is all the rage. How do we store and process
    XML documents, however? In this talk we present
    XML Value Store, a persistent distributed
    (peer-to-peer) storage manager with a
    value-oriented interface, the Document Value
    Model (DVM), for XML documents whose parts may be
    distributed around the net and even moving around
    (such as on a cell phone at 140 km/h on a
    motorway). We compare DVM with existing XML
    processing languages and specifically with the
    W3C's Document Object Model (DOM).
    We argue that, apart from a series of technical
    advantages, the central benefit of DVM is a
    simplified programming model that lets the
    programmer focus on application logic, and the
    XML middleware on persistence management,
    caching, replication, coalescing, encryption,
    distribution, lookup, routing and internet data
    transport. We finally sketch a simple extension
    of XML Value Store with remote execution.
    Together with storing code in the XML Value Store,
    this lets users send queries to remote XML Value
    Stores for execution and promises highly scalable
    grid computing functionality with a simple,
    problem-oriented programming model.

3
Abstract (long)
  • XML (eXtensible Markup Language) is emerging as the
    universal language for representing
    semi-structured data for distributed storage and
    information interchange on the internet and as
    such is destined to be the universal tissue --
    the lingua franca -- for interoperable web
    services and databases interconnecting the
    internet. This makes XML processing an
    undisputed growth industry. But how is it done?
    We give examples of processing XML documents
    using the domain-specific languages XSLT and XQuery,
    and the general-purpose interfaces SAX and DOM for
    manipulating structure and contents of XML
    documents. The latter, Document Object Model, is
    based on object-oriented programming principles
    in which tree nodes are mutable objects, with
    associated methods for imperatively updating
    their state. Furthermore, each tree node in DOM
    is equipped with a (single) parent reference and
    a (single) document root reference, which means
    DOM-nodes cannot be shared and cannot be moved to
    other documents. Moreover, in practice a DOM
    program starts by parsing an XML document
    completely into memory before performing any
    processing on it, however small a part of the
    document is actually required for that, and it
    finally pretty-prints its tree structure and
    writes the whole document out to disk, however
    little of it has actually changed since it was
    read. In this talk we present DVM (Document Value
    Model), a value-oriented interface for processing
    XML documents, and XML Value Store, a distributed
    (peer-to-peer) storage manager for storing XML
    documents in parsed form. We illustrate XML
    documents and document nodes, based on treating
    nodes as values -- immutable objects. The
    immutability of nodes in DVM allows aggressive
    and safe use of sharing through value references
    -- universal pointers to immutable objects stored
    anywhere -- in the XML Value Store. This has a
    number of technical advantages over DOM:
  • document nodes are sharable, also across multiple
    documents
  • loading and saving of nodes into/from memory is
    done by need, that is, only those nodes needed by
    a computation are loaded and only those not
    already saved on disk are actually saved
  • node pointers point to nodes wherever they are
    stored, even if they move around frequently
  • parsing and unparsing for persisting (storing on
    disk) are eliminated, since the XML tree, not its
    linearized form, is stored
  • nodes can be cached and replicated aggressively
    for performance without concern for (cache)
    incoherence
  • identical nodes (document parts) are saved only
    once on disk as opposed to multiple times, even
    if different users happen to store the same data
  • no parent and root pointers are stored in nodes,
    yet navigation to parent and root is still
    possible
  • The main advantage, we argue, is that these
    'generic' (computer sciency) data management
    concerns can be and are handled in the XML Value
    Store, not in the programmer's application logic.
  • A planned extension of XML Value Store is the
    addition of a 'higher-order' interface, which
    allows remote execution. This allows sending
    scripts (queries) to an XML Store for remote
    execution and promises to provide scalable grid
    computing functionality with a simple,
    problem-oriented programming model.

4
Abstract (for functional programmers)
  • This talk is basically about programming with
    (disk and network) I/O in a functional,
    high-level fashion.

5
Overview (buzzy version)
  • OOP
  • VOP
  • XML
  • DOM
  • DVM
  • X

6
Overview
  • OOP: object-oriented programming and distribution
    and mobility
  • VOP: value-oriented programming
  • XML: XML processing models and languages
  • DOM: Document Object Model
  • DVM: Document Value Model
  • X: The Unknown (future work)

7
Overview
  • OOP
  • VOP
  • XML
  • DOM
  • DVM
  • X

General theme: programming with values
(immutable objects)
... and objects (and carefully distinguishing
between them).

8
OOP, or rather: imperative programming
  • Basic model of programming
  • primitive in-place update operations:
    obj.field = obj2ref
  • compound update operations: controlled sequential
    execution of updates, e.g.
    for (int i = 0; i < arr.size; i++) arr[i] = newVal(i)

9
Imperative programming theme
  • Goal: global state transition from State_0 to
    State_n; State_0 is destroyed.
  • Implementation (ephemeral state updates): a sequence
    State_0 -> ... -> State_i -> ... -> State_n of
    primitive state transitions, where
  • each primitive update destroys the previous state

10
Consequence 1
  • software component interfaces are state-oriented
    and stateful
  • which operations are available depends on history
    of operations executed
  • responses from components depend on history of
    operations executed
  • Example: Unix file I/O
  • NB: Operations on such components are not
    necessarily atomic (or even recoverable)

11
Copy-and-update programming
  • Note:
  • data get copied
  • they are not always coherent
  • they get copied again

[Figure: input(f), process(s), output(f)]

12
Why (and when) it works (well)
  • no concurrent access to file
  • sequential and synchronous programming (control
    over sequence of state changes)
  • no partial failures: atomic abort due to single
    point of failure (single-process execution on
    single processor)
  • no replication of stateful data
  • random access to location of data (rapid access
    no matter where they are stored)

13
Consequence 2
  • Software/hardware component APIs are
    copy-oriented: data referenced by a pointer get
    copied before being manipulated to ensure
    integrity
  • Example: modern operating systems are based on
    separation of address spaces, which requires
    copying of data or delegation of tasks (ask the
    other process to do something for me)

14
Imperative programming: Problem areas
  • caching and replication require heavy coherence
    protocols; otherwise different states are
    observable by clients and users
  • e.g. file save under NFS (wait for 30 seconds!!)
  • atomic (commit of compound) update is difficult
    to achieve in the presence of partial failures
  • rollback is not naturally supported, but
    normally required in situations where (atomic)
    updates can fail
  • coalescing identical data (storing data only
    once) cannot be done (easily)

15
Imperative programming: Problem areas...
  • programming is mostly synchronous to control
    degree of nondeterminism due to concurrency
  • access to storage locations is not random (no
    modern file system does what's shown before)
  • access to updatable objects is typically
    location-based; mobile objects are not
    naturally supported
  • lots of data stored multiple times

16
...but, of course
  • Updatable objects are excellent for propagating
    information to an arbitrary number of clients (to
    any caller of the object; needn't even know or
    keep track of the number or identity of callers)

17
Properties of distributed (mobile) systems
  • Partial failures
  • can't even distinguish network failures from
    computing node failures
  • Concurrency
  • Difficult (exact) synchronization of processes
  • Widely varying access latency
  • RPC may block for an arbitrarily long time

18
Techniques for battling these problems
  • Caching, replication, memoization
  • (buffered) asynchronous message passing
  • relaxed or indeterminate semantics
  • time-outs
  • observational differences between processes
    running on same machine or on different machines

Not good for mobile code!

19
Central problem
  • ...not reading (loading)
  • ...not writing (saving, allocating)
  • but updating (overwriting)
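A small sketch of why updating is the odd one out among the three operations listed above. It uses a plain Python dict as a stand-in for a store; the dict and the helper names `save` and `update` are illustrative assumptions, not part of the talk. Two saves of fresh values give the same final state in either order, while two overwrites of the same location do not.

```python
def save(store, key, value):
    """Allocate: bind a fresh key to a value; existing bindings are never changed."""
    assert key not in store, "save never overwrites"
    return {**store, key: value}


def update(store, key, value):
    """Overwrite: replace whatever is currently bound to key."""
    return {**store, key: value}


empty = {}

# Two saves commute: either order yields the same store.
saves_ab = save(save(empty, "a", 1), "b", 2)
saves_ba = save(save(empty, "b", 2), "a", 1)

# Two updates of the same location do not commute: the last writer wins.
start = {"x": 0}
updates_12 = update(update(start, "x", 1), "x", 2)
updates_21 = update(update(start, "x", 2), "x", 1)

print(saves_ab == saves_ba)      # True
print(updates_12 == updates_21)  # False
```

Operations that commute can run in any order, on any replica, without coordination; the overwrites are the ones whose order has to be controlled.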

Updating breaks commuting.
Note: The more updating, the fewer operations
commute and the more their execution needs to be
controlled (synchronized).

20
VOP: Value-oriented programming
  • Programming with
  • arbitrarily large values (immutable objects),
    stored not only in RAM, but also on disk and on
    the net
  • location-independent value references (short,
    probabilistically unique identifiers of values,
    wherever they are stored); can be thought of as
    light-weight proxies for actual (big) values
  • plus small stateful cells (mutable objects) and
    cell references, incl.
  • wait-free registers with consensus number
    infinity (e.g., compare-and-swap registers)
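The slides do not say how value references are computed. One common way to obtain short, probabilistically unique, location-independent identifiers is to hash the value's content, and the sketch below assumes that approach (SHA-256). The names `value_ref`, `save`, `load`, `ram` and `disk` are hypothetical.

```python
import hashlib


def value_ref(data: bytes) -> str:
    """Hypothetical value reference: the SHA-256 digest of the value's bytes."""
    return hashlib.sha256(data).hexdigest()


# Two "locations" holding values; a reference is valid against either one.
ram = {}
disk = {}


def save(location: dict, data: bytes) -> str:
    ref = value_ref(data)
    location[ref] = data          # storing the same value twice is harmless
    return ref


def load(ref: str) -> bytes:
    for location in (ram, disk):  # stand-in for a lookup/routing service
        if ref in location:
            return location[ref]
    raise KeyError(ref)


big_value = b"<book>...</book>" * 1000
ref = save(disk, big_value)       # a 64-character proxy for a 16 kB value
ram[ref] = big_value              # replicate freely: the value is immutable
```

Because the reference names the value rather than a place, the copy in `ram` can be dropped or moved at any time without invalidating `ref`.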

21
Benefits/goals
  • Value references
  • efficient sharing of immutable data
  • efficient message passing
  • Arbitrarily large values
  • programmatic support of efficient atomic update:
    build new (global) state as value, then perform
    update atomically by assigning value reference of
    new value to register holding present state.
  • Small registers
  • guaranteeing atomic update, with no (or minimal)
    locking
  • wait-freeness: ensures progress (doesn't get
    blocked forever or for too long) of each client,
    even in the face of partial failures elsewhere
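The "build the new state as a value, then swap one reference" recipe can be sketched as follows. The `Cell` class and `atomic_update` helper are hypothetical, and Python object identity stands in for a value reference. The compare-and-swap is emulated with a lock, so this sketch is atomic but not wait-free.

```python
import threading


class Cell:
    """Small mutable register holding a reference to an immutable state."""

    def __init__(self, ref):
        self._ref = ref
        self._lock = threading.Lock()   # emulates a CAS instruction

    def get(self):
        return self._ref

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._ref is expected:   # compare references, not contents
                self._ref = new
                return True
            return False


def atomic_update(cell, build_new_state):
    """Build the new state off to the side, publish it with one CAS, retry on conflict."""
    while True:
        old = cell.get()
        new = build_new_state(old)      # the old state is never modified
        if cell.compare_and_swap(old, new):
            return new


state = Cell(())                        # global state: an immutable tuple


def worker(n):
    atomic_update(state, lambda old: old + (n,))


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A reader calling `state.get()` at any moment sees either the complete old state or the complete new one, never a half-applied update.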

22
XML
  • XML info set ((minimal) XML tree): labeled
    ordered tree, with
  • character data at the leaves
  • key/value pairs (attributes) at the internal
    nodes
  • XML document:
  • linearized representation of XML tree based on
    pre/post-order traversal of XML tree
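A minimal sketch of the two notions above: an XML tree as a labeled ordered tree, and an XML document as its linearization. The `Element` class and `linearize` function are illustrative names, not taken from the talk. The open tag is emitted on the way down (pre-order) and the close tag on the way back up (post-order).

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Element:
    tag: str
    attributes: tuple = ()   # ordered (key, value) pairs
    children: tuple = ()     # ordered children: Element, or str for character data


def linearize(node) -> str:
    """Turn a tree into its textual XML form by a pre/post-order traversal."""
    if isinstance(node, str):
        return escape(node)
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attributes)
    inner = "".join(linearize(child) for child in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


book = Element("book", (), (
    Element("author", (), ("Susanne Staun",)),
    Element("title", (), ("Mit smukke lig",)),
))
print(linearize(book))
# <book><author>Susanne Staun</author><title>Mit smukke lig</title></book>
```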

23
XML example
<?xml ...?>
<book>
  <author>Susanne Staun</author>
  <title>Mit smukke lig</title>
</book>

[Figure: the corresponding tree: book with children
author (Susanne Staun) and title (Mit smukke lig)]

24
Document Object Model (DOM)
[Figure: DOM tree for the book example: book with
children author (Susanne Staun) and title (Mit
smukke lig)]

25
DOM characteristics
  • object-oriented: nodes are objects with methods
    that, amongst others, update their properties
    (children, attributes, parent pointer)
  • purely tree-oriented: each node has at most one
    predecessor, no node sharing
  • cloning is used to copy a node into another
    place of a document
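The single-parent property can be observed with Python's `xml.dom.minidom`, used here only as a convenient stand-in for a DOM implementation. The `lib` wrapper element is made up for the example. Appending a node that already has a parent moves it, and sharing has to be simulated by cloning.

```python
from xml.dom.minidom import parseString

doc = parseString("<lib><book><author>Susanne Staun</author></book><book/></lib>")
first, second = doc.documentElement.childNodes   # the two book elements
author = first.firstChild

second.appendChild(author)             # moved, not shared: a node has one parent
moved_away = not first.hasChildNodes()

copy = author.cloneNode(True)          # deep copy: a second, independent node
first.appendChild(copy)

print(moved_away)                      # True
print(copy is author)                  # False
```

After the clone, the two author subtrees are separate mutable objects, so a later change to one is not seen through the other.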

26
DOM specification
  • Specified by W3C, see www.w3.org/DOM
  • Specification has 3 levels (specifying more and
    more functionality for document objects)

27
Programming with DOM
  • Typical scenario
  • Read linearized XML document from file or network
    pipe (socket).
  • Parse XML document into an in-memory tree data
    structure corresponding to DOM
  • Traverse and manipulate in-memory structure
  • Unparse in-memory structure to linearized XML
    document
  • Write out XML document through file or network
    pipe interface.
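The five steps above, sketched with Python's `xml.dom.minidom` as a stand-in DOM implementation. The temporary file name and the upper-casing edit are arbitrary choices for the illustration. Note that the whole document is parsed and the whole document is written back, although only one text node changes.

```python
import os
import tempfile
from xml.dom.minidom import parse

path = os.path.join(tempfile.mkdtemp(), "book.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("<book><author>Susanne Staun</author><title>Mit smukke lig</title></book>")

# Steps 1-2: read the linearized document and parse all of it into memory.
doc = parse(path)

# Step 3: traverse and manipulate the in-memory structure in place.
title = doc.getElementsByTagName("title")[0]
title.firstChild.data = title.firstChild.data.upper()

# Steps 4-5: unparse the complete tree and write the complete document back.
with open(path, "w", encoding="utf-8") as f:
    f.write(doc.toxml())

with open(path, encoding="utf-8") as f:
    result = f.read()
```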

28
Document Value Model (DVM)
[Figure: DVM version of the book example (book,
author, title, Susanne Staun, Mit smukke lig),
annotated "Sharing!!"]
Isn't that just a picture of the XML tree model?

29
Navigation
  • How do we navigate in an XML tree without parent
    and root pointers?
  • DOM: current node contains complete navigation
    state, including parent and root pointers
  • DVM: navigation state is characterized by the path
    n0, ..., nk, where n0 is the root and nk is the
    current node
  • allows navigation to parent and root, just as in
    DOM
  • does not require any storage in nodes, unlike in
    DOM
  • works also for shared nodes ("bread crumbs"
    method for finding one's way back in a labyrinth
    or DAG)
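The path-based navigation state can be sketched as a small cursor type. `Node` and `Cursor` are hypothetical names, and the `library` and `author-index` labels are made up for the example. The nodes carry no parent or root pointers; the cursor remembers the way it came.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


@dataclass(frozen=True)
class Cursor:
    path: tuple                      # (n0, ..., nk): n0 is the root, nk the current node

    @property
    def current(self):
        return self.path[-1]

    def child(self, i):
        return Cursor(self.path + (self.current.children[i],))

    def parent(self):
        if len(self.path) == 1:
            raise ValueError("the root has no parent")
        return Cursor(self.path[:-1])

    def root(self):
        return Cursor(self.path[:1])


shared = Node("Susanne Staun")       # one node, reachable from two places
tree = Node("library", (Node("book", (shared,)), Node("author-index", (shared,))))

via_book = Cursor((tree,)).child(0).child(0)
via_index = Cursor((tree,)).child(1).child(0)

print(via_book.parent().current.label)    # book
print(via_index.parent().current.label)   # author-index
```

Both cursors sit on the same shared node, yet each finds its own way back, which a single parent pointer stored in the node could not provide.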

30
DVM basic interface
  • The type of XML trees is an inductive datatype
  • Basic constructors (factory methods)
  • Combine attributes, child list, tag into new
    element node
  • Make chardata node from string
  • Basic deconstructors (projections)
  • Get attributes, child list, tag, chardata
  • Cells (updatable nodes)
  • setState, getState: atomic operations
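One possible rendering of this interface, as a sketch: the names `make_element`, `make_chardata` and `Cell` are assumptions, and only `setState`/`getState` come from the slide (spelled `set_state`/`get_state` here). Nodes are immutable values; the cell is the only place where state changes.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class CharData:
    text: str


@dataclass(frozen=True)
class Element:
    tag: str
    attributes: tuple      # (key, value) pairs
    children: tuple        # Element or CharData nodes


# Basic constructors (factory functions).
def make_element(tag, attributes, children):
    return Element(tag, tuple(attributes), tuple(children))


def make_chardata(text):
    return CharData(text)


# Basic deconstructors are the field projections: node.tag, node.attributes,
# node.children, node.text.

class Cell:
    """Updatable node holding one immutable value at a time."""

    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def get_state(self):
        with self._lock:
            return self._state

    def set_state(self, state):
        with self._lock:
            self._state = state


author = make_element("author", [], [make_chardata("Susanne Staun")])
current = Cell(make_element("book", [], [author]))
current.set_state(make_element("book", [("lang", "da")], [author]))
```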

31
DVM general interface
  • Equip nodes with the ability to receive any
    function and apply it to themselves, or to
    receive a function that is applied to each of
    their subnodes
  • Called Visitor pattern in OO design
  • Corresponds to unique homomorphism/type
    elimination rule (fold) known from algebraic
    datatypes/type theory
  • Lets nodes not only receive single commands for
    execution, but whole programs.
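The fold mentioned above, sketched for a two-constructor tree type. The class and function names are illustrative. A caller supplies one function per constructor, and the tree applies them bottom-up, so a whole computation is handed to the tree in one go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharData:
    text: str


@dataclass(frozen=True)
class Element:
    tag: str
    children: tuple = ()


def fold(node, on_element, on_chardata):
    """Replace each constructor by the matching function, from the leaves up."""
    if isinstance(node, CharData):
        return on_chardata(node.text)
    results = [fold(child, on_element, on_chardata) for child in node.children]
    return on_element(node.tag, results)


book = Element("book", (
    Element("author", (CharData("Susanne Staun"),)),
    Element("title", (CharData("Mit smukke lig"),)),
))

# Two different "programs" sent to the same tree.
text = fold(book, lambda tag, rs: " ".join(rs), lambda s: s)
elements = fold(book, lambda tag, rs: 1 + sum(rs), lambda s: 0)

print(text)       # Susanne Staun Mit smukke lig
print(elements)   # 3
```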

32
Share-and-create style updating
[Figure: book example tree (book, author, title,
Susanne Staun, Mit smukke lig) illustrating
share-and-create updating]
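A sketch of what share-and-create updating could look like in code. The `with_child` helper and the new title text are made up for the illustration. An "update" builds a new parent that shares every unchanged subtree with the old one, and the old tree stays intact.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Element:
    tag: str
    children: tuple = ()     # Element nodes, or str for character data


def with_child(parent, index, new_child):
    """Return a new parent that differs in one child and shares all the others."""
    children = parent.children[:index] + (new_child,) + parent.children[index + 1:]
    return replace(parent, children=children)


author = Element("author", ("Susanne Staun",))
old_book = Element("book", (author, Element("title", ("Mit smukke lig",))))

new_title = Element("title", ("MIT SMUKKE LIG",))
new_book = with_child(old_book, 1, new_title)

print(new_book.children[0] is old_book.children[0])   # True: the author node is shared
print(old_book.children[1].children)                  # ('Mit smukke lig',)
```

Only the nodes on the path from the root to the change are created anew, so the cost of an update is proportional to the depth of the change, not to the size of the document.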
33
Universal references
[Figure: book example tree (book, author, title,
Susanne Staun, Mit smukke lig) spread over RAM and
disk storage, annotated "Never loaded from disk!"]

34
Universal references
  • Value references are location independent
  • always designate value, not where value is stored
  • require routing service to be resolved!
  • Value references can point from any place to any
    place
  • from RAM to disk, from disk to disk, from disk to
    network, from disk to RAM (!)...

35
XML Value Store
  • Distributed persistence manager for XML elements
  • Peer-to-peer architecture
  • Global name server for binding and rebinding
    value references to human-readable names
  • Rebinding: bindings can be updated atomically.

36
XML Store: Basic interface
  • Load value: Value load(ValueRef vr)
  • Save value: ValueRef save(Value v)
  • (That's it)
  • Security/authentication not addressed yet
  • extended access control based interface
  • encrypted storage
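A minimal sketch of the two-operation interface backed by a directory on disk. It assumes, as one plausible design the slides do not confirm, that a ValueRef is the SHA-256 digest of the value's bytes; `DiskStore` is a hypothetical name. Under that assumption, saving an identical value a second time stores nothing new.

```python
import hashlib
import os
import tempfile


class DiskStore:
    """load/save only; a value reference is the hash of the value (assumption)."""

    def __init__(self, directory):
        self.directory = directory

    def save(self, value: bytes) -> str:
        ref = hashlib.sha256(value).hexdigest()
        path = os.path.join(self.directory, ref)
        if not os.path.exists(path):          # identical values are stored only once
            with open(path, "wb") as f:
                f.write(value)
        return ref

    def load(self, ref: str) -> bytes:
        with open(os.path.join(self.directory, ref), "rb") as f:
            return f.read()


store = DiskStore(tempfile.mkdtemp())
ref1 = store.save(b"<title>Mit smukke lig</title>")
ref2 = store.save(b"<title>Mit smukke lig</title>")   # e.g. a second user, same value
```

There is no update or delete operation: once saved, the value behind a reference never changes, which is what makes caching and replication safe.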

37
XML Store: General interface
  • The visitor interface allows nodes to receive any
    function and apply it to their state.
  • Let's do the same with the XML value store
    interface: extending it with a visitor interface
    allows XML value stores to receive arbitrary code
    and execute it.
  • Allows implementation of
  • query languages
  • general remote processing (e.g. for grid
    computing)

38
Code as values
  • Program code is a value: code can be stored in the
    XML store.
  • Remote execution then involves passing a value
    reference to the code to the receiver. If the
    receiver already has the corresponding value
    (code), e.g. due to caching in the XML value
    store, no further communication is necessary;
    otherwise the value is requested (pulled in) by
    the receiver.

39
XML Value Store architecture
  • Base configuration: each peer is a single
    component made up of
  • raw disk manager
  • network proxy for group of remote XML-store peers
  • group communication presently based on
  • IP-multicast (Pedersen/Tejlgaard 2002), or
  • Chord-routing protocol (Baumann/Fennestad/Thorn
    2002)

40
Configurable XML Stores
  • Goal: clients can build XML Stores by
    composing them from
  • primitive XML stores (disk manager, in-RAM
    manager, adapters to databases, file managers
    etc.), and
  • XML store constructors (decorators)
  • caching reads and writes
  • asynchronous load/save
  • buffered load/save requests
  • encryption/decryption
  • Target date: August 2003
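How such composition might look, as a sketch under stated assumptions: all class names are hypothetical, references are assumed to be content hashes, and the XOR "encryption" is a toy placeholder that is not secure. Each decorator exposes the same load/save interface as the primitive store it wraps.

```python
import hashlib


class MemoryStore:
    """Primitive store: keeps values in a dict (in-RAM manager)."""

    def __init__(self):
        self.values = {}
        self.loads = 0

    def save(self, value: bytes) -> str:
        ref = hashlib.sha256(value).hexdigest()
        self.values[ref] = value
        return ref

    def load(self, ref: str) -> bytes:
        self.loads += 1
        return self.values[ref]


class CachingStore:
    """Decorator: caches reads and writes in front of another store."""

    def __init__(self, inner):
        self.inner = inner
        self.cache = {}

    def save(self, value: bytes) -> str:
        ref = self.inner.save(value)
        self.cache[ref] = value
        return ref

    def load(self, ref: str) -> bytes:
        if ref not in self.cache:      # values are immutable: entries never go stale
            self.cache[ref] = self.inner.load(ref)
        return self.cache[ref]


class XorStore:
    """Decorator: toy encryption/decryption layer (XOR with one byte, NOT secure)."""

    def __init__(self, inner, key: int):
        self.inner = inner
        self.key = key

    def _xor(self, data: bytes) -> bytes:
        return bytes(b ^ self.key for b in data)

    def save(self, value: bytes) -> str:
        return self.inner.save(self._xor(value))

    def load(self, ref: str) -> bytes:
        return self._xor(self.inner.load(ref))


backend = MemoryStore()
store = CachingStore(XorStore(backend, key=0x5A))
ref = store.save(b"Susanne Staun")
```

The client picks the properties it wants by choosing which decorators to stack and in which order; here the cache sits outside the encryption layer, so cached values are held in the clear.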

41
A simple challenge
  • Write a little program that implements a
    dictionary, e.g. for looking up phone numbers,
    and inserting and updating records.
  • It should work on the net (concurrent access).
  • It should work for a while (also after the
    machine has been taken down and restarted).
  • Surprisingly more complex to program than the
    routines you learned in algorithm class...

42
Summary
  • Value-oriented model for manipulating
    semistructured data
  • supports light-weight caching, replication,
    asynchronous computing in the XML middleware
  • Configurable XML middleware (client can order the
    properties one wants from the XML store)
  • Separation of program logic (in the client code)
    from generic data management concerns (in the
    XML middleware)
  • Encourages clients to write transaction-safe code
    programmatically

43
More info
  • Website: www.plan-x.org
  • Presently contains material from seminar on
    distributed and mobile data and software
    (including lots of references not mentioned here)
  • Email: henglein_at_diku.dk