XML, XPath, and XQuery - PowerPoint PPT Presentation

1
XML, XPath, and XQuery
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database & Information Systems
  • October 18, 2005

Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
2
Administrivia
  • Upcoming recitation section on the RSS feeder
  • Project plans due 11/1
  • For projects other than RSS feeder: description of project goals
  • In all cases: describe how you will be dividing up the work
  • List of project milestones (remember to leave
    time for integration!)
  • Homework 4 (XML) will be due 11/3

3
Why XML?
  • XML is the confluence of several factors
  • The Web needed a more declarative format for data
  • Documents needed a mechanism for extended tags
  • Database people needed a more flexible
    interchange format
  • "Lingua franca" of data
  • It's parsable even if we don't know what it means!
  • Original expectation:
  • The whole web would go to XML instead of HTML
  • Today's reality:
  • Not so... But XML is used all over "under the covers"

4
Why DB People Like XML
  • Can get data from all sorts of sources
  • Allows us to touch data we don't own!
  • This was actually a huge change in the DB
    community
  • Interesting relationships with DB techniques
  • Useful to do relational-style operations
  • Leverages ideas from object-oriented,
    semistructured data
  • Blends schema and data into one format
  • Unlike relational model, where we need schema
    first
  • But too little schema can be a drawback, too!

5
XML Anatomy
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <dblp>
      <mastersthesis mdate="2002-01-03" key="ms/Brown92">
        <author>Kurt P. Brown</author>
        <title>PRPL: A Database Workload Specification Language</title>
        <year>1992</year>
        <school>Univ. of Wisconsin-Madison</school>
      </mastersthesis>
      <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
        <editor>Paul R. McJones</editor>
        <title>The 1995 SQL Reunion</title>
        <journal>Digital System Research Center Report</journal>
        <volume>SRC1997-018</volume>
        <year>1997</year>
        <ee>db/labs/dec/SRC1997-018.html</ee>
        <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
      </article>

[Figure callouts on the listing: Processing Instr., Open-tag, Element, Attribute, Close-tag]
6
Well-Formed XML
  • A legal XML document: fully parsable by an XML parser
  • All open-tags have matching close-tags (unlike so many HTML documents!), or a special <tag/> shortcut for empty tags (equivalent to <tag></tag>)
  • Attributes (which are unordered, in contrast to elements) only appear once in an element
  • There's a single root element
  • XML is case-sensitive
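
The well-formedness rules above can be tried out with a small sketch. It assumes Python's standard `xml.etree.ElementTree` parser as a stand-in for "an XML parser"; the helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text`, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per rule on the slide.
samples = {
    "matched tags": "<a><b>hi</b></a>",
    "empty-tag shortcut": "<a><b/></a>",
    "unclosed tag": "<a><b>hi</a>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
    "case mismatch": "<a></A>",
}

results = {name: is_well_formed(doc) for name, doc in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only the first two samples should parse; the other four each break one of the listed rules.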

7
XML as a Data Model
  • XML information set includes 7 types of nodes
  • Document (root)
  • Element
  • Attribute
  • Processing instruction
  • Text (content)
  • Namespace
  • Comment
  • XML data model includes this, plus typing info,
    plus order info and a few other things

8
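XML Data Model Visualized (and simplified!)
[Figure: the sample DBLP document drawn as a tree. Legend: root, p-i, element, attribute, text. Root has a ?xml processing instruction and a dblp element; dblp has mastersthesis and article children, each with mdate and key attributes; beneath them are the author, title, year, school elements (mastersthesis) and the editor, title, year, journal, volume, ee, ee elements (article), each leading to its text value.]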
9
What Does XML Do?
  • Serves as a document format (super-HTML)
  • Allows custom tags (e.g., used by MS Word, OpenOffice)
  • Supplement it with stylesheets (XSL) to define
    formatting
  • Data exchange format (must agree on terminology)
  • Marshalling and unmarshalling data in SOAP and
    Web Services

10
XML as a Super-HTML (MS Word)
    <h1 class="Section1"><a name="_top" />CIS 550 Database and Information Systems</h1>
    <h2 class="Section1">Fall 2004</h2>
    <p class="MsoNormal">
      <place>311 Towne</place>, Tuesday/Thursday
      <time Hour="13" Minute="30">1:30PM - 3:00PM</time>
    </p>

11
XML Easily Encodes Relations
Student-course-grade
    sid   serno    exp-grade
    1     570103   B
    23    550103   A

    <student-course-grade>
      <tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple>
      <tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple>
    </student-course-grade>
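
The encoding above is mechanical, which a short sketch can make concrete. This is one possible way to do it in Python with the standard library; the function name `relation_to_xml` is hypothetical, and the column names and rows are the ones from the slide.

```python
import xml.etree.ElementTree as ET

columns = ["sid", "serno", "exp-grade"]
rows = [("1", "570103", "B"), ("23", "550103", "A")]


def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = value
    return root


xml_text = ET.tostring(
    relation_to_xml("student-course-grade", columns, rows), encoding="unicode"
)
print(xml_text)
```

Each row becomes one `tuple` element and each column one child element, so the relation's schema is repeated inside every tuple.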

12
But XML is More Flexible: Non-First-Normal-Form (NF2)
    <parents>
      <parent name="Jean">
        <son>John</son>
        <daughter>Joan</daughter>
        <daughter>Jill</daughter>
      </parent>
      <parent name="Feng">
        <daughter>Felicity</daughter>
      </parent>

Coincides with semi-structured data, invented
by DB people at Penn and Stanford
13
Integrating XML: What If We Have Multiple Sources with the Same Tags?
  • Namespaces allow us to specify a context for different tags
  • Two parts:
  • Binding of namespace to URI
  • Qualified names
    <root xmlns="http://www.first.com/aspace" xmlns:otherns="…">
      <tag xmlns:myns="http://www.fictitious.com/mypath">
        <thistag>is in the default namespace (aspace)</thistag>
        <myns:thistag>is in myns</myns:thistag>
        <otherns:thistag>is a different tag in otherns</otherns:thistag>
      </tag>
    </root>
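
To see what the namespace bindings do to tag names, here is a minimal sketch using Python's `xml.etree.ElementTree`, which expands every qualified name to `{URI}localname`. The slide elides the URI bound to `otherns`; the value `http://example.org/otherspace` below is a placeholder assumed only so the document parses.

```python
import xml.etree.ElementTree as ET

# NOTE: the otherns URI is a made-up placeholder; the slide does not give it.
doc = """
<root xmlns="http://www.first.com/aspace"
      xmlns:otherns="http://example.org/otherspace">
  <tag xmlns:myns="http://www.fictitious.com/mypath">
    <thistag>is in the default namespace (aspace)</thistag>
    <myns:thistag>is in myns</myns:thistag>
    <otherns:thistag>is a different tag in otherns</otherns:thistag>
  </tag>
</root>
"""

root = ET.fromstring(doc)
tag = root[0]

# Each child is named "thistag", but the expanded names differ by namespace URI.
expanded = [child.tag for child in tag]
for name in expanded:
    print(name)

# Queries must say which namespace they mean, via a prefix-to-URI map.
ns = {"myns": "http://www.fictitious.com/mypath"}
myns_only = root.findall(".//myns:thistag", ns)
```

The three `thistag` elements end up with three distinct expanded names, which is what lets data from different sources share a local tag name without colliding.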

14
XML Isn't Enough on Its Own
  • It's too unconstrained for many cases!
  • How will we know when we're getting garbage?
  • How will we query?
  • How will we understand what we got?
  • We also need:
  • Some idea of the structure
  • Our focus next
  • Presentation, in some cases: XSL(T)
  • We'll talk about this soon
  • Some way of interpreting the tags?
  • We'll talk about this later in the semester

15
Structural Constraints: Document Type Definitions (DTDs)
  • The DTD is an EBNF grammar defining XML structure
  • XML document specifies an associated DTD, plus the root element
  • DTD specifies children of the root (and so on)
  • DTD defines special significance for attributes
  • IDs: special attributes that are analogous to keys for elements
  • IDREFs: references to IDs
  • IDREFS: a nasty hack that represents a list of IDREFs

16
An Example DTD
  • Example DTD:
    <!ELEMENT dblp ((mastersthesis | article)*)>
    <!ELEMENT mastersthesis (author,title,year,school,committeemember*)>
    <!ATTLIST mastersthesis mdate CDATA #REQUIRED
                            key ID #REQUIRED
                            advisor CDATA #IMPLIED>
    <!ELEMENT author (#PCDATA)>
  • Example use of DTD in XML file:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE dblp SYSTEM "my.dtd">
    <dblp>

17
Representing Graphs and Links in XML
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE graph SYSTEM "special.dtd">
    <graph>
      <author id="author1">
        <name>John Smith</name>
      </author>
      <article>
        <author ref="author1" /> <title>Paper1</title>
      </article>
      <article>
        <author ref="author1" /> <title>Paper2</title>
      </article>
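
The id/ref attributes only become graph edges once something follows them. A minimal sketch of that dereferencing step in Python is below; it assumes the fragment is closed with `</graph>` (the slide stops before the end tag), skips DTD processing entirely, and treats `id` and `ref` as ID and IDREF purely by convention.

```python
import xml.etree.ElementTree as ET

# Assumption: the document is completed with </graph>; the DOCTYPE is omitted
# because ElementTree does not validate or use the DTD.
graph = ET.fromstring("""
<graph>
  <author id="author1"><name>John Smith</name></author>
  <article><author ref="author1"/><title>Paper1</title></article>
  <article><author ref="author1"/><title>Paper2</title></article>
</graph>
""")

# Index every element that carries an id, like a key lookup table.
by_id = {el.get("id"): el for el in graph.iter() if el.get("id") is not None}

# Follow each ref from an article to the author element it points at.
papers_by_author = [
    (by_id[a.get("ref")].findtext("name"), article.findtext("title"))
    for article in graph.findall("article")
    for a in article.findall("author")
]
print(papers_by_author)
```

Both articles resolve to the same author element, which is the sharing that a pure tree cannot express.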

18
Graph Data Model
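[Figure: the graph document drawn as a tree. Root has ?xml, !DOCTYPE and graph children; graph contains one author element (id attribute with value author1, name "John Smith") and two article elements (titles Paper1 and Paper2), each with an author child whose ref attribute holds the value author1.]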
19
Graph Data Model
[Figure: the same tree as on the previous slide, except that the two ref attributes no longer show the text value author1; they appear to be drawn as links to the author element whose id is author1.]
20
DTDs Aren't Expressive Enough
  • DTDs capture grammatical structure, but have some drawbacks
  • Not themselves in XML: inconvenient to build tools for them
  • Don't capture database datatypes' domains
  • IDs aren't a good implementation of keys
  • Why not?
  • No way of defining OO-like inheritance

21
XML Schema
  • Aims to address the shortcomings of DTDs
  • XML syntax
  • Can define keys using XPaths
  • Type subclassing that's more complex than in a programming language
  • Programming languages don't consider order of member variables!
  • Subclassing by extension and by restriction
  • And, of course, domains and built-in datatypes

22
Basics of XML Schema
  • Need to use the XML Schema namespace (generally
    named xsd)
  • simpleTypes are a way of restricting domains on
    scalars
  • Can define a simpleType based on integer, with
    values within a particular range
  • complexTypes are a way of defining
    element/attribute structures
  • Basically equivalent to !ELEMENT, but more
    powerful
  • Specify sequence, choice between child elements
  • Specify minOccurs and maxOccurs (default 1)
  • Must associate an element/attribute with a
    simpleType, or an element with a complexType

23
Simple Schema Example
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="mastersthesis" type="ThesisType"/>
      <xsd:complexType name="ThesisType">
        <xsd:sequence>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="year" type="xsd:integer"/>
          <xsd:element name="school" type="xsd:string"/>
          <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="mdate" type="xsd:date"/>
        <xsd:attribute name="key" type="xsd:string"/>
        <xsd:attribute name="advisor" type="xsd:string"/>
      </xsd:complexType>
    </xsd:schema>
  • Inside a complexType, attribute declarations come after the content model (here, the sequence)

24
Designing an XML Schema/DTD
  • Not as formalized as relational data design
  • We can still use ER diagrams to break into
    entity, relationship sets
  • ER diagrams have extensions for aggregation (treating smaller diagrams as entities) and for composite attributes
  • Note that often we already have our data in
    relations and need to design the XML schema to
    export them!
  • Generally orient the XML tree around the
    central objects
  • Big decision: element vs. attribute
  • Element if it has its own properties, or if you might have more than one of them
  • Attribute if it is a single property (or perhaps not!)

25
Recap: XML as a Data Model
  • XML is a non-first-normal-form (NF2)
    representation
  • Can represent documents, data
  • Standard data exchange format
  • Several competing schema formats (esp. DTD and XML Schema) provide typing information

26
Querying XML
  • How do you query a directed graph? a tree?
  • The standard approach used by many XML,
    semistructured-data, and object query languages
  • Define some sort of a template describing
    traversals from the root of the directed graph
  • In XML, the basis of this template is called an
    XPath

27
XPaths
  • In its simplest form, an XPath is like a path in
    a file system
  • /mypath/subpath//morepath
  • The XPath returns a node set representing the XML
    nodes (and their subtrees) at the end of the path
  • XPaths can have node tests at the end, returning
    only particular node types, e.g., text(),
    processing-instruction(), comment(), element(),
    attribute()
  • XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order

28
Sample XML
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <dblp>
      <mastersthesis mdate="2002-01-03" key="ms/Brown92">
        <author>Kurt P. Brown</author>
        <title>PRPL: A Database Workload Specification Language</title>
        <year>1992</year>
        <school>Univ. of Wisconsin-Madison</school>
      </mastersthesis>
      <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
        <editor>Paul R. McJones</editor>
        <title>The 1995 SQL Reunion</title>
        <journal>Digital System Research Center Report</journal>
        <volume>SRC1997-018</volume>
        <year>1997</year>
        <ee>db/labs/dec/SRC1997-018.html</ee>
        <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
      </article>

29
XML Data Model Visualized
[Figure: the sample DBLP document drawn as a tree, as on slide 8. Legend: root, p-i, element, attribute, text. Root has a ?xml processing instruction and a dblp element; dblp has mastersthesis and article children with their mdate and key attributes, child elements, and text values.]
30
Some Example XPath Queries
  • /dblp/mastersthesis/title
  • /dblp//editor
  • //title
  • //title/text()
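
The four queries above can be approximated with the XPath subset in Python's `xml.etree.ElementTree`. This is a sketch under two assumptions: the sample document is closed with `</dblp>` and trimmed of its `ee` elements, and, because ElementTree evaluates paths relative to an element and has no `text()` step, each absolute path is rewritten relative to the `dblp` root and text is read from `.text`.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample document from the previous slide, closed with </dblp>.
dblp = ET.fromstring("""
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
  </article>
</dblp>
""")

# `dblp` is the root element, so each path below is relative to it.
thesis_titles = dblp.findall("mastersthesis/title")  # /dblp/mastersthesis/title
editors = dblp.findall(".//editor")                  # /dblp//editor
all_titles = dblp.findall(".//title")                # //title
title_text = [t.text for t in all_titles]            # //title/text()

print(title_text)
```

Results come back in document order, which matches the ordered semantics described for XPath.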

31
Context Nodes and Relative Paths
  • XPath has a notion of a context node: it's analogous to a current directory
  • . represents this context node
  • .. represents the parent node
  • We can express relative paths
  • subpath/sub-subpath/../.. gets us back to the
    context node
  • By default, the document root is the context node

32
Predicates: Selection Operations
  • A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths
    /dblp/article[title = "Paper1"]
  • which is equivalent to
    /dblp/article[./title/text() = "Paper1"]
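
A predicate behaves like a relational selection over the node set. The sketch below shows that with Python's `xml.etree.ElementTree`, whose XPath subset accepts the `[tag='text']` predicate form; the two-article document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two articles, only one titled "Paper1".
dblp = ET.fromstring("""
<dblp>
  <article><title>Paper1</title><year>1997</year></article>
  <article><title>Paper2</title><year>1999</year></article>
</dblp>
""")

# Predicate form, relative to the dblp root: /dblp/article[title = "Paper1"]
matches = dblp.findall("article[title='Paper1']")

# The same selection written as an explicit filter over the child node set.
manual = [a for a in dblp.findall("article") if a.findtext("title") == "Paper1"]

print([a.findtext("year") for a in matches])
```

Both forms select the same article element; the predicate keeps the article, not the title it tested.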

33
Axes: More Complex Traversals
  • Thus far, we've seen XPath expressions that go down the tree (and up one step)
  • But we might want to go up, left, right, etc.
  • These are expressed with so-called axes:
    self::path-step
    child::path-step                parent::path-step
    descendant::path-step           ancestor::path-step
    descendant-or-self::path-step   ancestor-or-self::path-step
    preceding-sibling::path-step    following-sibling::path-step
    preceding::path-step            following::path-step
  • The previous XPaths we saw were in "abbreviated form"
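
To make a few of these axes concrete, here is a hand-rolled sketch in Python. The standard `xml.etree.ElementTree` module does not accept axis syntax, so each axis is computed directly from a parent map; the helper names and the tiny document are assumptions made for illustration, and results are listed in document order except `ancestors`, which goes nearest-first.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/><d/><e/></b><f/></a>")

# ElementTree elements have no parent pointer, so build child -> parent once.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def ancestors(node):
    """ancestor::* of node, nearest ancestor first."""
    out = []
    while node in parent_of:
        node = parent_of[node]
        out.append(node)
    return out


def descendants(node):
    """descendant::* of node, in document order."""
    return list(node.iter())[1:]  # iter() yields the node itself first


def following_siblings(node):
    """following-sibling::* of node."""
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def preceding_siblings(node):
    """preceding-sibling::* of node."""
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]


d = doc.find("b/d")
print([n.tag for n in ancestors(d)])
print([n.tag for n in following_siblings(d)])
```

With `d` as the context node, the abbreviated steps seen so far could only reach its children and its parent `b`; the axes reach `a`, `c`, and `e` as well.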

34
Querying Order
  • We saw in the previous slide that we could query
    for preceding or following siblings or nodes
  • We can also query a node for its position
    according to some index
  • fn:last() returns the index of the last element matching the last step; positions count from 1, so the first match is at [1] (standard XPath has no fn:first())
  • fn:position() gives the relative count of the current node
    child::article[fn:position() = fn:last()]
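
Positional predicates can be tried with the XPath subset in Python's `xml.etree.ElementTree`, which supports integer positions, `last()`, and offsets such as `last()-1`. The three-article document below is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: three articles in a fixed document order.
dblp = ET.fromstring(
    "<dblp>"
    "<article><title>A</title></article>"
    "<article><title>B</title></article>"
    "<article><title>C</title></article>"
    "</dblp>"
)

first = dblp.findall("article[1]")                 # positions start at 1
last = dblp.findall("article[last()]")             # same node as position() = last()
second_to_last = dblp.findall("article[last()-1]")

print(first[0].findtext("title"), last[0].findtext("title"))
```

The position is counted among the nodes matched by that step (here, the `article` children of `dblp`), not among all nodes in the document.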

35
Users of XPath
  • XML Schema uses simple XPaths in defining keys
    and uniqueness constraints
  • XQuery
  • XSLT
  • XLink and XPointer, hyperlinks for XML

36
XQuery
  • A strongly-typed, Turing-complete XML
    manipulation language
  • Attempts to do static typechecking against XML
    Schema
  • Based on an object model derived from Schema
  • Unlike SQL, fully compositional, highly
    orthogonal
  • Inputs & outputs: collections (sequences or bags) of XML nodes
  • Anywhere a particular type of object may be used,
    may use the results of a query of the same type
  • Designed mostly by DB and functional language
    people
  • Attempts to satisfy the needs of data management
    and document management
  • The database-style core is mostly complete (even
    has support for NULLs in XML!!)
  • The document keyword querying features are still in the works: shows in the order-preserving default model

37
XQuery's Basic Form
  • Has an analogous form to SQL's SELECT..FROM..WHERE..GROUP BY..ORDER BY
  • The model: bind nodes (or node sets) to variables; operate over each legal combination of bindings; produce a set of nodes
  • "FLWOR" statement:
    for {iterators that bind variables}
    let {collections}
    where {conditions}
    order by {order-conditions} (the handout uses old SORTBY)
    return {output constructor}
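
The evaluation model behind FLWOR can be sketched in ordinary Python: iterate to produce bindings, compute derived values, filter, sort, then construct output. This is only an analogy to illustrate the order of the clauses, not an XQuery implementation; the function name `flwor`, the element names, and the data are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
dblp = ET.fromstring("""
<dblp>
  <article><title>Zebra Joins</title><year>1999</year></article>
  <article><title>Query Plans</title><year>1997</year></article>
  <article><title>Adaptive Indexes</title><year>1999</year></article>
</dblp>
""")


def flwor(root):
    bindings = []
    for art in root.findall("article"):       # for $a in /dblp/article
        year = art.findtext("year")           # let $y := $a/year/text()
        if year == "1999":                    # where $y = "1999"
            bindings.append(art)
    # order by $a/title
    bindings.sort(key=lambda a: a.findtext("title"))
    out = ET.Element("result")
    for art in bindings:                      # return <paper>{ title text }</paper>
        ET.SubElement(out, "paper").text = art.findtext("title")
    return out


result_text = ET.tostring(flwor(dblp), encoding="unicode")
print(result_text)
```

Each surviving binding yields one output fragment, in the order fixed by the sort, which mirrors how the return clause is evaluated once per binding tuple.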

38
Iterations in XQuery
  • A series of (possibly nested) FOR statements assigning the results of XPaths to variables
    for $root in document("http://my.org/my.xml")
    for $sub in $root/rootElement,
        $sub2 in $sub/subElement,
  • Something like a template that pattern-matches, produces a "binding tuple"
  • For each of these, we evaluate the WHERE and possibly output the RETURN template
  • document() or doc() function specifies an input file as a URI
  • Old version was "document"; now "doc", but it depends on your XQuery implementation

39
Two XQuery Examples
    <root-tag> {
      for $p in document("dblp.xml")/dblp/proceedings,
          $yr in $p/yr
      where $yr = "1999"
      return <proc> {$p} </proc>
    } </root-tag>

    for $i in document("dblp.xml")/dblp/inproceedings[author/text() = "John Smith"]
    return <smith-paper>
             <title>{ $i/title/text() }</title>
             <key>{ $i/@key }</key>
             { $i/crossref }
           </smith-paper>

40
Nesting in XQuery
  • Nesting XML trees is perhaps the most common
    operation
  • In XQuery, it's easy: put a subquery in the return clause where you want things to repeat!
    for $u in document("dblp.xml")/universities
    where $u/country = "USA"
    return <ms-theses-99>
             { $u/title } {
               for $mt in $u/../mastersthesis
               where $mt/year/text() = "1999" and ____________
               return $mt/title }
           </ms-theses-99>

41
Collections & Aggregation in XQuery
  • In XQuery, many operations return collections
  • XPaths, sub-XQueries, functions over these, …
  • The let clause assigns the results to a variable
  • Aggregation simply applies a function over a collection, where the function returns a value (very elegant!)
    let $allpapers := document("dblp.xml")/dblp/article
    return <article-authors>
             <count> { fn:count(fn:distinct-values($allpapers/author)) } </count>
             { for $paper in doc("dblp.xml")/dblp/article
               let $pauth := $paper/author
               return <paper> {$paper/title}
                        <count> { fn:count($pauth) } </count>
                      </paper>
             }
           </article-authors>
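
The same two aggregations, a count of distinct author values overall and a count of authors per paper, can be sketched in Python to show that aggregation is just a function applied to a collection. The document below is hypothetical, and the variable names mirror the query only loosely.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: "Ann" appears on both papers.
dblp = ET.fromstring("""
<dblp>
  <article><title>P1</title><author>Ann</author><author>Bob</author></article>
  <article><title>P2</title><author>Ann</author></article>
</dblp>
""")

all_papers = dblp.findall("article")  # let $allpapers := .../dblp/article

# fn:count(fn:distinct-values($allpapers/author))
distinct_author_count = len(
    {a.text for p in all_papers for a in p.findall("author")}
)

# for $paper ... let $pauth := $paper/author ... fn:count($pauth)
per_paper = [(p.findtext("title"), len(p.findall("author"))) for p in all_papers]

print(distinct_author_count, per_paper)
```

The `let` bindings hold whole collections, so the counting functions need no GROUP BY machinery.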

42
Collections, Ctd.
  • Unlike in SQL, we can compose aggregations and
    create new collections from old
    <result> {
      let $avgItemsSold := fn:avg(
        for $order in document("my.xml")/orders/order
        let $totalSold := fn:sum($order/item/quantity)
        return $totalSold)
      return $avgItemsSold
    } </result>
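
Composing aggregations means feeding the collection produced by one aggregate into another. A minimal Python sketch of the average-of-sums query follows; the order data is invented, and quantities are assumed to be integers.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical orders: the first sells 2 + 3 items, the second sells 7.
orders = ET.fromstring("""
<orders>
  <order><item><quantity>2</quantity></item><item><quantity>3</quantity></item></order>
  <order><item><quantity>7</quantity></item></order>
</orders>
""")

# Inner query: one fn:sum($order/item/quantity) per order, giving a new collection.
totals = [
    sum(int(q.text) for q in order.findall("item/quantity"))
    for order in orders.findall("order")
]

# Outer aggregate: fn:avg over that derived collection.
avg_items_sold = mean(totals)
print(totals, avg_items_sold)
```

The inner comprehension plays the role of the nested FLWOR expression, and its result is an ordinary collection that the outer aggregate consumes.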

43
Sorting in XQuery
  • SQL actually allows you to sort its output, with a special ORDER BY clause (which we haven't discussed, but which specifies a sort key list)
  • XQuery borrows this idea
  • In XQuery, what we order is the sequence of "result tuples" output by the return clause
    for $x in document("dblp.xml")/proceedings
    order by $x/title/text()
    return $x

44
What If Order Doesn't Matter?
  • By default:
  • SQL is unordered
  • XQuery is ordered everywhere!
  • But unordered queries are much faster to answer
  • XQuery has a way of telling the DBMS to avoid preserving order:
    unordered { for $x in (mypath) … }

45
Distinct-ness
  • In XQuery, DISTINCT-ness happens as a function
    over a collection
  • But since we have nodes, we can do duplicate
    removal according to value or node
  • Can do fn:distinct-values(collection) to remove duplicate values, or fn:distinct-nodes(collection) to remove duplicate nodes
    for $years in fn:distinct-values(doc("dblp.xml")//year/text())
    return $years
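
The difference between duplicate removal by value and by node can be sketched in Python: two distinct `year` nodes may hold the same value, while the same node can only be duplicated if it is reached twice. The document and variable names below are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: three year nodes, two of which share the value "1999".
dblp = ET.fromstring(
    "<dblp>"
    "<a><year>1999</year></a>"
    "<b><year>1999</year></b>"
    "<c><year>1997</year></c>"
    "</dblp>"
)

year_nodes = dblp.findall(".//year")

# By value: keep the first occurrence of each text value, in document order.
distinct_values = list(dict.fromkeys(y.text for y in year_nodes))

# By node: a collection that reaches every node twice still has three nodes.
visited_twice = year_nodes + year_nodes
distinct_nodes = list({id(y): y for y in visited_twice}.values())

print(distinct_values, len(distinct_nodes))
```

Value-based removal collapses the two 1999 entries, while identity-based removal keeps them because they are different nodes in the tree.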

46
Querying & Defining Metadata: Can't Do This in SQL
  • Can get a node's name by querying node-name():
    for $x in document("dblp.xml")/dblp/*
    return node-name($x)
  • Can construct elements and attributes using computed names:
    for $x in document("dblp.xml")/dblp/*,
        $year in $x/year,
        $title in $x/title/text()
    return element { node-name($x) } {
             attribute { fn:concat("year-", $year) } { $title }
           }
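
Both ideas, reading a node's name as data and building new elements whose names are computed, can be sketched in Python with `xml.etree.ElementTree`. The two-entry document is hypothetical; the attribute name follows the slide's "year-" prefix pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: one child of each kind under dblp.
dblp = ET.fromstring("""
<dblp>
  <mastersthesis><title>T1</title><year>1992</year></mastersthesis>
  <article><title>T2</title><year>1997</year></article>
</dblp>
""")

# node-name($x) for each $x in /dblp/*
names = [x.tag for x in dblp]

# Computed constructors: element name taken from $x, attribute name from its year.
built = []
for x in dblp:
    for year in x.findall("year"):
        for title in x.findall("title"):
            el = ET.Element(x.tag)
            el.set("year-" + year.text, title.text)
            built.append(el)

print(names)
print([el.attrib for el in built])
```

Tag and attribute names are ordinary strings here, which is the capability the slide contrasts with SQL, where table and column names are fixed in the query text.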

47
XQuery Summary
  • Very flexible and powerful language for XML
  • Clean and orthogonal: can always replace a collection with an expression that creates collections
  • DB and document-oriented (we hope)
  • The core is relatively clean and easy to
    understand
  • Turing-complete: we'll talk more about XQuery functions soon