Processing XML with Java - PowerPoint PPT Presentation

About This Presentation
Slides: 75
Provided by: csHu
Transcript and Presenter's Notes


1
Processing XML with Java
  • Representation and Management of Data on the
    Internet

2
XML: eXtensible Markup Language
  • XML is a metalanguage
  • A language used to describe other languages using
    markup tags that describe properties of the
    data
  • Designed to be structured
  • Strict rules about how data can be formatted
  • Designed to be extensible
  • Can define own terms and markup
  • When will we use XML?

3
XML Family
  • XML is an official recommendation of the W3C
  • Aims to accomplish what HTML cannot and be
    simpler to use and implement than SGML

(Figure labels: HTML, XML, SGML)
4
The Essence of XML
  • Syntax: the permitted arrangement or structure of
    letters and words in a language, as defined by a
    grammar (XML)
  • Semantics: the meaning of letters or words in a
    language
  • XML uses Syntax to add Semantics to the documents

5
Using XML
  • In XML there is a separation of the content from
    the display
  • XML can be used for
  • Data representation
  • Data exchange

6
HTML vs. XML
(Figure labels: HTML, XML)
7
HTML vs. XML
(Figure labels: HTML, XML)
HTML was designed to ease the work of authors.
XML was designed to ease the work of ????
8
Parsing XML: The Idea
  • Two approaches for parsing XML
  • the SAX approach and
  • the DOM approach

9
Parsing XML
  • What is a parser?
  • Software for analyzing language: a computer
    program that breaks natural-language or
    programming-language statements or instructions
    into smaller, more easily interpreted units
    that the computer can understand. The parser
    determines how a sentence can be constructed from
    the grammar of the language, producing a parse
    tree of the statement as its output.
  • How should an XML parser work?

10
Sample Document
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>

11
DOM Parser
  • DOM: Document Object Model
  • Parser creates a tree object out of the document
  • User accesses data by traversing the tree
  • The API allows for constructing, accessing and
    manipulating the structure and content of XML
    documents

12
Document as Tree
Methods like getRoot, getChildren, getAttributes, etc.
(Figure: the sample document drawn as a tree)
    transaction
        account: 89-344
        buy (shares = 100)
            ticker (exch = NASDAQ): WEBM
        sell (shares = 30)
            ticker (exch = NYSE): GE
13
Advantages and Disadvantages
  • Advantages
  • Natural and relatively easy to use
  • Can repeatedly traverse tree
  • Disadvantages
  • High memory requirements: the whole document is
    kept in memory
  • Must parse the whole document and construct many
    objects before use

14
SAX Parser
  • SAX: Simple API for XML
  • Parser creates events while traversing tree
  • Parser calls methods (that you write) to deal
    with the events
  • Similar to an I/O-Stream, goes in one direction

15
Document as Events
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
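
To make the "document as events" view concrete, here is a minimal sketch that records the callbacks a SAX parser fires for the sample document. The class name EventTrace and its trace helper are hypothetical, and the sketch assumes the JDK's built-in JAXP parser rather than any particular implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records each SAX callback as a line of text.
class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Whitespace-only text (indentation between tags) is skipped.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("text " + text);
    }

    // Parses the given XML string and returns the recorded events in order.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<transaction><account>89-344</account>"
                + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
                + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
                + "</transaction>";
        for (String event : trace(xml)) System.out.println(event);
    }
}
```

Each element produces a start event and an end event, with text events in between, in exactly the order the tags appear in the file.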

16
Advantages and Disadvantages
  • Advantages
  • Requires little memory
  • Fast
  • Disadvantages
  • Cannot read backwards
  • Does not support transformation of the document
    such as cut and paste of fragments
  • Difficult to program

17
Programming using SAX is Difficult
  • In some cases, programming with SAX is difficult
  • How can we find, using a SAX parser, an element
    e1 with ancestor e2?
  • How can we find, using a SAX parser, elements e1
    that have a descendant element e2?
  • What about cases that are even more complex?
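
One possible answer to the first question, as a sketch only: because SAX never hands us the tree, the handler has to remember the currently open elements itself, for example on a stack. The class AncestorFinder and its count helper are hypothetical names, and the JDK's built-in parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Counts elements named `target` that have an ancestor named `ancestor`.
class AncestorFinder extends DefaultHandler {
    private final String target;
    private final String ancestor;
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();
    private int matches = 0;

    AncestorFinder(String target, String ancestor) {
        this.target = target;
        this.ancestor = ancestor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Check the stack before pushing, so an element is not its own ancestor.
        if (qName.equals(target) && open.contains(ancestor)) matches++;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    static int count(String xml, String target, String ancestor) {
        AncestorFinder finder = new AncestorFinder(target, ancestor);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), finder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return finder.matches;
    }

    public static void main(String[] args) {
        String xml = "<transaction><buy><ticker/></buy><sell><ticker/></sell></transaction>";
        System.out.println(count(xml, "ticker", "buy"));
    }
}
```

The second question (elements e1 that have a descendant e2) is harder in the same style, because the answer for an e1 is only known once its end tag arrives; the handler would have to keep a flag per open e1 and set it when an e2 starts.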

18
Which should we use? DOM vs. SAX
  • If your document is very large and you only need
    a few elements, use SAX
  • If you need to manipulate (i.e., change) the XML,
    use DOM
  • If you need to access the XML many times, use
    DOM (assuming the file is not too large)

19
XML Parsers
20
XML Parsers
  • There are several different ways to categorise
    parsers
  • Validating versus non-validating parsers
  • DOM parsers versus SAX parsers
  • Parsers written in a particular language (Java,
    C, Perl, etc.)

21
Validating Parsers
  • A validating parser makes sure that the document
    conforms to the specified DTD
  • This is time consuming, so a non-validating
    parser is faster

22
Using an XML Parser
  • Three basic steps
  • Create a parser object
  • Pass the XML document to the parser
  • Process the results
  • Generally, writing out XML is not in the scope of
    parsers (though some may implement proprietary
    mechanisms)

23
SAX Simple API for XML
24
SAX Parsers
(Figure: instructions given to the SAX Parser)
When you see the start of the document, do ...
When you see the start of an element, do ...
When you see the end of an element, do ...
25
The SAX Parser
  • SAX parser is an event-driven API
  • An XML document is sent to the SAX parser
  • The XML file is read sequentially
  • The parser notifies the class when events happen,
    including errors
  • The events are handled by the API methods that
    the programmer implemented

26
Handles document events: start tag, end tag, etc.
Used to create a SAX Parser
Handles Parser Errors
Handles DTDs and Entities
27
Problem
  • The SAX interface is an accepted standard
  • There are many implementations
  • We would like to be able to change the
    implementation used without changing any code in
    the program
  • How is this done?

28
Factory Design Pattern
  • Have a Factory class that creates the actual
    Parsers
  • The Factory checks the value of a system property
    that states which implementation should be used
  • In order to change the implementation, simply
    change the system property
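
The idea can be sketched in a few lines. Everything named here (TinyParser, the two implementations, TinyParserFactory and the property name demo.parser.impl) is hypothetical and only illustrates the pattern; it is not the actual SAX factory code.

```java
// Hypothetical interface standing in for "the accepted standard".
interface TinyParser {
    String name();
}

// Two hypothetical implementations of that interface.
class FastTinyParser implements TinyParser {
    public String name() { return "fast"; }
}

class StrictTinyParser implements TinyParser {
    public String name() { return "strict"; }
}

// The factory: client code asks it for a parser and never names a concrete class.
class TinyParserFactory {
    // Hypothetical system property that selects the implementation class.
    static final String PROPERTY = "demo.parser.impl";

    static TinyParser create() {
        String className = System.getProperty(PROPERTY, FastTinyParser.class.getName());
        try {
            // Load the chosen class by name and call its no-argument constructor.
            return (TinyParser) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create().name());
        System.setProperty(PROPERTY, StrictTinyParser.class.getName());
        System.out.println(create().name());
    }
}
```

The calling code only ever mentions TinyParser and TinyParserFactory, so switching implementations is a configuration change rather than a code change.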

29
Creating a SAX Parser
  • Import the following packages
  • org.xml.sax.*
  • org.xml.sax.helpers.*
  • Set the following system property
  • System.setProperty("org.xml.sax.driver",
    "org.apache.xerces.parsers.SAXParser");
  • Create the instance from the Factory
  • XMLReader reader = XMLReaderFactory.createXMLReader();

30
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public static void main(String[] args) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
            ContentHandler handler =
                new SomethingThatExtendsDefaultHandler();
            parser.setContentHandler(handler);
            parser.parse(args[0]);
        } catch (Exception e) {
            System.err.println(e);
        }
    } // end

31
Receiving Parsing Information
  • A SAX Parser calls methods such as
    startDocument, startElement, etc., as it runs
  • In order to react to such events we must
  • implement the ContentHandler interface
  • set the parser's content handler to an instance
    of our ContentHandler implementation

32
ContentHandler
  • // Methods (partial list)
  • public void startDocument()
  • public void endDocument()
  • public void characters(char[] ch, int start, int
    length)
  • public void startElement(String namespaceURI,
    String localName, String qName,
    Attributes atts)
  • public void endElement(String namespaceURI,
    String localName, String qName)

What to implement in a ContentHandler
33
Namespaces and Element Names
    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
        xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI: </xhtml:em>
          The Course I Wish I never Took
        </title>
        <comment> My <xhtml:b> favorite </xhtml:b>
          book!
        </comment>
      </book>
    </forsale>

34
Namespaces and Element Names
For the book element: namespaceURI = "", localName = book, qName = book
    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
        xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI: </xhtml:em>
          The Course I Wish I never Took
        </title>
        <comment> My <xhtml:b> favorite </xhtml:b>
          book!
        </comment>
      </book>
    </forsale>
For the xhtml:em element: namespaceURI = urn:http://www.w3.org/1999/xhtml,
localName = em, qName = xhtml:em
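
The three names can be observed directly with a small handler. This is a sketch under assumptions: NameReporter is a hypothetical class, the JDK's built-in parser is used, and namespace processing is switched on explicitly through setNamespaceAware because JAXP factories leave it off by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reports the three names passed to startElement.
class NameReporter extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        lines.add("namespaceURI=\"" + namespaceURI + "\" localName=" + localName
                + " qName=" + qName);
    }

    static List<String> report(String xml) {
        NameReporter handler = new NameReporter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespaceURI and localName
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        String xml = "<forsale xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
                + "<book><xhtml:em>DBI:</xhtml:em></book></forsale>";
        for (String line : report(xml)) System.out.println(line);
    }
}
```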
35
Receiving Parsing Information (cont.)
  • An easy way to implement the ContentHandler
    interface is to extend DefaultHandler, which
    implements this interface (and a few others) with
    empty methods
  • To actually parse a document, create an
    InputSource from the document and supply the
    input source to the parse method of the XMLReader

36
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class InfoWithSax extends DefaultHandler {

    public static void main(String[] args) {
        System.setProperty("org.xml.sax.driver",
                           "org.apache.xerces.parsers.SAXParser");
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new InfoWithSax());
            reader.parse(new InputSource(new FileReader(args[0])));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
37
    public void startDocument() throws SAXException {
        System.out.println("START DOCUMENT");
    }

    public void endDocument() throws SAXException {
        System.out.println("END DOCUMENT");
    }

    // Initial values are not visible in the transcript; these are assumed.
    int depth = 0;
    String indent = "  ";

    private void println(String header, String value) {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
        // The ": " separator is assumed; the transcript shows only " ".
        System.out.println(header + ": " + value);
    }
38
    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        String s = (new String(buf, offset, len)).trim();
        if (!"".equals(s))
            println("CHARACTERS", s);
    }

    public void endElement(String namespaceURI, String localName,
                           String name) throws SAXException {
        depth--;
        String elementName = name;
        // The ":" separator is assumed; the transcript shows only "".
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("END ELEMENT", elementName);
    }
39
    public void startElement(String namespaceURI, String localName,
                             String name, Attributes attrs)
            throws SAXException {
        String elementName = name;
        if (!"".equals(namespaceURI) && !"".equals(localName))
            elementName = namespaceURI + ":" + localName;
        println("START ELEMENT", elementName);
        if (attrs != null && attrs.getLength() > 0) {
            for (int i = 0; i < attrs.getLength(); i++)
                // The "=" separator is assumed; it is missing in the transcript.
                println("ATTRIBUTE",
                        attrs.getLocalName(i) + "=" + attrs.getValue(i));
        }
        depth++;
    }
} // end of class InfoWithSax
40
Bachelor Tags
  • What do you think happens when the parser parses
    a bachelor tag?
  • <rating stars="five" />
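
A quick way to check is to record the events for such a tag. The sketch below uses a hypothetical BachelorTagDemo class and the JDK's built-in parser; it shows that an empty-element tag is reported as a start event followed immediately by an end event, just like an explicit open and close pair with nothing in between.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records element start and end events.
class BachelorTagDemo extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    static List<String> events(String xml) {
        BachelorTagDemo handler = new BachelorTagDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(events("<rating stars=\"five\" />"));
        System.out.println(events("<rating stars=\"five\"></rating>"));
    }
}
```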

41
Attributes Interface
  • Elements may have attributes
  • There is no distinction between attributes that
    are specified explicitly in the document and those
    that are supplied by the DTD (with a default value)

42
Attributes Interface (cont.)
  • int getLength()
  • String getQName(int i)
  • String getType(int i)
  • String getValue(int i)
  • String getType(String qname)
  • String getValue(String qname)
  • etc.
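
As an illustration of these methods, the following sketch lists each attribute of an element with its type and value. AttributeDump is a hypothetical class and the JDK's built-in parser is assumed. The sample document declares its attributes in an internal DTD subset, so one attribute is given explicitly and the other comes from a DTD default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that lists every attribute the parser reports.
class AttributeDump extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add(atts.getQName(i) + " type=" + atts.getType(i)
                    + " value=" + atts.getValue(i));
        }
    }

    static List<String> dump(String xml) {
        AttributeDump handler = new AttributeDump();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        // "id" is written in the document; "stars" only has a DTD default.
        String xml = "<!DOCTYPE rating [<!ATTLIST rating id ID #IMPLIED "
                + "stars CDATA \"five\">]><rating id=\"r1\"/>";
        for (String line : dump(xml)) System.out.println(line);
    }
}
```

Both attributes arrive through the same Attributes object, and nothing in this interface says which one was defaulted.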

43
Attributes Types
  • The following are possible types for attributes
  • "CDATA",
  • "ID",
  • "IDREF", "IDREFS",
  • "NMTOKEN", "NMTOKENS",
  • "ENTITY", "ENTITIES",
  • "NOTATION"

44
Setting Features
  • It is possible to set the features of a parser
    using the setFeature method.
  • Examples
  • reader.setFeature("http://xml.org/sax/features/namespaces",
    true)
  • reader.setFeature("http://xml.org/sax/features/validation",
    false)
  • For a full list, see
    http://www.saxproject.org/?selected=get-set
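
A minimal sketch of setting and reading back these two features follows. FeatureDemo and its helper methods are hypothetical, and the XMLReader is obtained from the JDK's JAXP factory rather than through the org.xml.sax.driver property.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper showing setFeature and getFeature on an XMLReader.
class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Returns a reader with namespace processing on and validation off.
    static XMLReader configuredReader() {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, false);
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads a feature flag; an unknown feature name raises a SAXException.
    static boolean isOn(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + isOn(reader, NAMESPACES));
        System.out.println("validation: " + isOn(reader, VALIDATION));
    }
}
```

Features should be set before parse is called; a parser may refuse to change them while a parse is in progress.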

45
ErrorHandler Interface
  • We implement ErrorHandler to receive error events
    (similar to implementing ContentHandler)
  • DefaultHandler implements ErrorHandler in an
    empty fashion, so we can extend it (as before)
  • An ErrorHandler is registered with
  • reader.setErrorHandler(handler)
  • Three methods
  • void error(SAXParseException ex)
  • void fatalError(SAXParseException ex)
  • void warning(SAXParseException ex)

46
Extending the InfoWithSax Program
public void warning(SAXParseException err)
        throws SAXException {
    System.out.println("Warning in line " + err.getLineNumber()
            + " and column " + err.getColumnNumber());
}

public void error(SAXParseException err)
        throws SAXException {
    System.out.println("Oy vaavoi, an error!");
}

public void fatalError(SAXParseException err)
        throws SAXException {
    System.out.println("OY VAAVOI, a fatal error!");
}
47
Which to Call
  • Which callback should be used to report the
    violation of a validity constraint?
  • Which callback should be used to report the
    violation of a well-formedness constraint?

(Options shown: warning, error, fatal error)
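
The SAX ErrorHandler documentation assigns validity violations to error and well-formedness violations to fatalError. The sketch below lets you observe this; CallbackRecorder is a hypothetical class, the JDK's built-in parser is assumed, and validation is switched on through the JAXP factory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records which ErrorHandler callbacks fire.
class CallbackRecorder extends DefaultHandler {
    final List<String> calls = new ArrayList<>();

    @Override public void warning(SAXParseException e) { calls.add("warning"); }
    @Override public void error(SAXParseException e) { calls.add("error"); }
    @Override public void fatalError(SAXParseException e) { calls.add("fatalError"); }

    static List<String> check(String xml) {
        CallbackRecorder handler = new CallbackRecorder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // validate against the document's DTD
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // After a fatal error the parser stops and throws; the callback
            // has already been recorded, so the exception is ignored here.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.calls;
    }

    public static void main(String[] args) {
        // Well-formed but invalid: the DTD requires a <b> child inside <a>.
        String invalid = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]><a/>";
        // Not well-formed: <b> is never closed.
        String malformed = "<a><b></a>";
        System.out.println(check(invalid));
        System.out.println(check(malformed));
    }
}
```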
48
Lexical Events
  • Lexical events have to do with the way that a
    document was written and not with its content
  • Examples
  • A comment is a lexical event (<!-- comment -->)
  • The use of an entity is a lexical event (&gt;)
  • These can be dealt with by implementing the
    LexicalHandler interface, and set on a parser by
  • reader.setProperty("http://xml.org/sax/properties/lexical-handler",
    mylexicalhandler)

49
LexicalHandler
  • // Methods (partial list)
  • public void startEntity(String name)
  • public void endEntity(String name)
  • public void comment(char[] ch, int start,
    int length)
  • public void startCDATA()
  • public void endCDATA()
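
Here is a minimal sketch of receiving comment events. CommentCollector is a hypothetical class; it extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods, and it assumes the JDK's built-in parser supports the lexical-handler property.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that collects the text of every comment.
class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    static List<String> collect(String xml) {
        CommentCollector handler = new CommentCollector();
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Lexical events are delivered only if this property is set.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.comments;
    }

    public static void main(String[] args) {
        System.out.println(collect("<a><!-- comment --></a>"));
    }
}
```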

50
DOM Document Object Model
51
(No Transcript)
52
(No Transcript)
53
Creating a DOM Tree
  • How can we create a DOM Tree independently of the
    implementation chosen?
  • Creating a DOM Tree using the Apache Xerces
    package
  • Import org.apache.xerces.parsers.DOMParser
  • Import org.w3c.dom.*
  • Use the following lines of code
  • DOMParser dom = new DOMParser();
    dom.parse(fileName);
  • Document doc = dom.getDocument();

54
Using a DOM Tree
55
Nodes in a DOM Tree
(Figure, as it appears in The XML Companion by Neil Bradley. Node types
shown: Node, Document, DocumentFragment, DocumentType, Element, Attr,
CharacterData, Text, CDATASection, Comment, Entity, EntityReference,
Notation, ProcessingInstruction)
56
DOM Tree
Document
57
Normalizing a Tree
  • Normalizing a DOM Tree has two effects
  • Combine adjacent textual nodes
  • Eliminate empty textual nodes
  • To normalize, apply the normalize() method to the
    document element
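
Both effects can be seen on a small hand-built tree. In this sketch NormalizeDemo and messyTitle are hypothetical names, and the document is created with the JDK's JAXP DocumentBuilder instead of being parsed from a file.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo of Node.normalize().
class NormalizeDemo {
    // Builds a <title> element whose text is split over three text nodes,
    // one of which is empty.
    static Element messyTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element title = doc.createElement("title");
            doc.appendChild(title);
            title.appendChild(doc.createTextNode("DBI: "));
            title.appendChild(doc.createTextNode(""));
            title.appendChild(doc.createTextNode("The Course"));
            return title;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element title = messyTitle();
        System.out.println("before: " + title.getChildNodes().getLength());
        title.normalize(); // merge adjacent text nodes, drop empty ones
        System.out.println("after: " + title.getChildNodes().getLength());
        System.out.println(title.getFirstChild().getNodeValue());
    }
}
```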

58
Node Methods
  • Three categories of methods
  • Node characteristics: name, type, value
  • Contextual location and access to relatives:
    parents, siblings, children, ancestors,
    descendants
  • Node modification: edit, delete, re-arrange child
    nodes

59
Node Methods (2)
  • short getNodeType()
  • String getNodeName()
  • String getNodeValue() throws DOMException
  • void setNodeValue(String value) throws
    DOMException
  • boolean hasChildNodes()
  • NamedNodeMap getAttributes()
  • Document getOwnerDocument()

60
Node Types - getNodeType()
ELEMENT_NODE = 1
ATTRIBUTE_NODE = 2
TEXT_NODE = 3
CDATA_SECTION_NODE = 4
ENTITY_REFERENCE_NODE = 5
ENTITY_NODE = 6
PROCESSING_INSTRUCTION_NODE = 7
COMMENT_NODE = 8
DOCUMENT_NODE = 9
DOCUMENT_TYPE_NODE = 10
DOCUMENT_FRAGMENT_NODE = 11
NOTATION_NODE = 12

if (myNode.getNodeType() == Node.ELEMENT_NODE) {
    // process node
}
61
(No Transcript)
62
Node Navigation
  • Every node has a specific location in tree
  • Node interface specifies methods to find
    surrounding nodes
  • Node getFirstChild()
  • Node getLastChild()
  • Node getNextSibling()
  • Node getPreviousSibling()
  • Node getParentNode()
  • NodeList getChildNodes()
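
These methods are enough to visit a whole tree. The following sketch walks a document with getFirstChild and getNextSibling and collects element names in document order; TreeWalk is a hypothetical class and the JDK's JAXP DocumentBuilder is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical depth-first walk using only the Node navigation methods.
class TreeWalk {
    // Adds the name of n (if it is an element) and then visits its children.
    static void collect(Node n, List<String> names) {
        if (n.getNodeType() == Node.ELEMENT_NODE) names.add(n.getNodeName());
        for (Node child = n.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    static List<String> elementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc, names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<transaction><account>89-344</account>"
                + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
                + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>"
                + "</transaction>";
        System.out.println(elementNames(xml));
    }
}
```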

63
Node Navigation (2)
(Figure, from The XML Companion by Neil Bradley: a node and its relatives,
labeled getParentNode(), getPreviousSibling(), getNextSibling(),
getFirstChild(), getLastChild(), getChildNodes())
64
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;

public class InfoWithDom {

    public static void main(String[] args) {
        try {
            DOMParser dom = new DOMParser();
            dom.parse(args[0]);
            Document doc = dom.getDocument();
            new InfoWithDom().echo(doc);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

65
    private int depth = 0;
    private final String indent = " ";
    private String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT",
        "CDATA", "ENTITY_REF", "ENTITY",
        "PROCESSING_INST", "COMMENT", "DOCUMENT",
        "DOCUMENT_TYPE", "DOCUMENT_FRAG",
        "NOTATION" };

    private void outputIndentation() {
        for (int i = 0; i < depth; i++)
            System.out.print(indent);
    }
66
    // The ":" and "=" separators below are assumed; the transcript lost them.
    private void printlnCommon(Node n) {
        System.out.print(NODE_TYPES[n.getNodeType()] + ":");
        System.out.print(" nodeName=" + n.getNodeName());
        String val;
        if ((val = n.getNamespaceURI()) != null)
            System.out.print(" uri=" + val);
        if ((val = n.getPrefix()) != null)
            System.out.print(" pre=" + val);
        if ((val = n.getLocalName()) != null)
            System.out.print(" local=" + val);
        if ((val = n.getNodeValue()) != null && !val.trim().equals(""))
            System.out.print(" nodeValue=" + val);
        System.out.println();
    }
67
    private void echo(Node n) {
        outputIndentation();
        printlnCommon(n);
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            NamedNodeMap atts = n.getAttributes();
            // The slide changes "indent" here, but indent is a final String;
            // the nesting counter is depth.
            depth += 2;
            for (int i = 0; i < atts.getLength(); i++)
                echo(atts.item(i));
            depth -= 2;
        }
        depth++;
        for (Node child = n.getFirstChild(); child != null;
                child = child.getNextSibling())
            echo(child);
        depth--;
    }
} // end of class InfoWithDom
68
Node Manipulation
  • Children of a node in a DOM tree can be
    manipulated - added, edited, deleted, moved,
    copied, etc.

Node removeChild(Node old) throws DOMException
Node insertBefore(Node newChild, Node ref) throws DOMException
Node appendChild(Node newChild) throws DOMException
Node replaceChild(Node newChild, Node old) throws DOMException
Node cloneNode(boolean deep)
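
The methods above can be tried on a small tree. In this sketch EditDemo and its helpers are hypothetical names, the element names are borrowed from the sample document (plus an invented "hold" element), and the document is created with the JDK's JAXP DocumentBuilder.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo of the Node manipulation methods.
class EditDemo {
    // Joins the names of the children of parent with commas.
    static String names(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    static String editedChildNames() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("transaction");
        doc.appendChild(root);
        Element buy = doc.createElement("buy");
        Element sell = doc.createElement("sell");
        root.appendChild(buy);                                 // buy
        root.appendChild(sell);                                // buy, sell
        root.insertBefore(doc.createElement("account"), buy);  // account, buy, sell
        root.replaceChild(doc.createElement("hold"), sell);    // account, buy, hold
        root.appendChild(buy.cloneNode(true));                 // account, buy, hold, buy
        root.removeChild(buy);                                 // account, hold, buy
        return names(root);
    }

    public static void main(String[] args) {
        System.out.println(editedChildNames());
    }
}
```

Note that new nodes are always created through the owning Document, and that cloneNode(true) copies the node together with its whole subtree.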
69
Node Manipulation (2)
Figure as appears in The XML Companion - Neil
Bradley
70
Other Interfaces
  • We have discussed methods of the Node interface
  • Each of the "specific types of nodes" has
    additional methods
  • See API for details

71
Note about DOM Objects
  • DOM object ? compiled XML
  • Can save time and effort if we send and receive
    DOM objects instead of XML source
  • Saves having to parse XML files into DOM at
    sender and receiver
  • But, DOM object may be larger than XML source

72
JAXP
  • Java API for XML Parsing
  • Includes the DOM and SAX API as part of the Java
    API
  • See the package
  • javax.xml.parsers

73
Getting a SAX Parser with JAXP
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    DefaultHandler handler = new MyDefaultHandlerImpl();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser saxParser = factory.newSAXParser();
        // SAXParser.parse takes the handler as its second argument
        saxParser.parse(new File("example.xml"), handler);
    } catch (Exception e) { /* do something */ }

74
Getting a DOM Parser with JAXP
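    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.w3c.dom.*;

    DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new File("example.xml"));
    } catch (SAXException se) { /* do something */ }
      // newDocumentBuilder and parse also declare these checked exceptions
      catch (ParserConfigurationException | IOException e) { /* do something */ }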