Parsing XML into programming languages

Provided by: peopleCs

1
Parsing XML into programming languages
  • JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB

2
Parsing XML
  • Goal: read XML files into data structures in
    programming languages
  • Possible strategies
  • Parse by hand with some reusable libraries
  • Parse into generic tree structure
  • Parse as sequence of events
  • Automagically parse to language-specific objects

3
Parsing by-hand
  • Advantages
  • Complete control
  • Good for simple needs; can be built on a regex
    package
  • Disadvantages
  • Must write the initial code yourself, even if it
    becomes generalized
  • Pretty tedious and error prone.
  • Gets very hard when using schema or DTD to
    validate

4
Parsing into generic tree structure
  • Advantages
  • Industry-wide, language neutral standard exists
    called DOM (Document Object Model)
  • Learning DOM for one language makes it easy to
    learn for any other
  • As of JAXP 1.2, support for Schema
  • Have to write much less code to get XML to
    something you want to manipulate in your program
  • Disadvantages
  • Non-intuitive API, doesn't take full advantage of
    Java
  • Still quite a bit of work

5
What is JAXP?
  • JAXP: Java API for XML Processing
  • In Java, the definitions of these standard APIs
    (together with the XSLT API) make up a set of
    interfaces known as JAXP
  • Java also provides standard implementations,
    together with a vendor pluggability layer
  • Some of these come standard with the J2SDK,
    others are only available with the Java Web
    Services Developer Pack
  • We will study these shortly

6
Another alternative
  • JDOM: a published, native Java API for
    representing XML as a tree
  • Like DOM but much more Java-specific and object
    oriented
  • However, not supported by other languages
  • Also, no support for schema
  • dom4j is another alternative

7
JAXB
  • JAXB: Java Architecture for XML Binding
  • Defines an API for automagically representing an
    XML schema as a collection of Java classes.
  • Most convenient for application programming
  • Will cover next class

8
DOM
9
About DOM
  • Stands for Document Object Model
  • A World Wide Web Consortium (w3c) standard
  • The standard is constantly adding new features;
    Level 3 Core was just released this month
  • We'll cover most of the basics. There's always
    more, and it's always changing.

10
DOM abstraction layer in Java -- architecture
  • Emphasis is on allowing vendors to supply their
    own DOM implementation without requiring changes
    to source code
  • Diagram labels: the factory returns a specific
    parser implementation; the parser produces an
    org.w3c.dom.Document
11
Sample Code
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  /* set some factory options here */
  DocumentBuilder builder = factory.newDocumentBuilder();
  Document doc = builder.parse(xmlFile);
  • The factory instance determines the parser
    implementation. It can be changed with a runtime
    system property. The JDK has a default; Xerces is
    much better.
  • From the factory one obtains an instance of the
    parser.
  • xmlFile can be a java.io.File, an InputStream,
    etc.
  • For reference, the classes used are
    javax.xml.parsers.DocumentBuilderFactory,
    javax.xml.parsers.DocumentBuilder, and
    org.w3c.dom.Document. Notice that the Document
    class comes from the W3C-specified bindings.
12
Validation
  • Note that by default the parser will not validate
    against a schema or DTD
  • As of JAXP 1.2, Java provides a default parser
    that can handle most schema features
  • See next slide for details on how to set it up

13
Important: Schema validation
  String JAXP_SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
  String W3C_XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
Next, you need to configure DocumentBuilderFactory to
generate a namespace-aware, validating parser
that uses XML Schema:
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  factory.setNamespaceAware(true);
  factory.setValidating(true);
  try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
  } catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
  }
14
Associating document with schema
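  • An XML file can be associated with a schema in
    two ways
  • Directly in the XML file, in the usual way (an
    xsi:schemaLocation or
    xsi:noNamespaceSchemaLocation attribute)
  • Programmatically from Java
  • The latter is done with
  • factory.setAttribute(JAXP_SCHEMA_SOURCE,
    new File(schemaSource));
  • Here JAXP_SCHEMA_SOURCE is the string
    "http://java.sun.com/xml/jaxp/properties/schemaSource"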
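
A minimal sketch of the programmatic route, under some assumptions: the schema and the documents are tiny in-memory strings invented for illustration, the schema is passed as an InputStream rather than a File so the example is self-contained, and the class name DomSchemaValidation is hypothetical. Validation problems are collected through an error handler, because by default a validating parser only reports them and keeps going.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DomSchemaValidation {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: a <student> must contain exactly one <name>.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='student'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Parses xml with schema validation on and returns the validation errors.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE,
            new ByteArrayInputStream(SCHEMA.getBytes(StandardCharsets.UTF_8)));

        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // recoverable validation error
            }
        });
        builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<student><name>Ann</name></student>"));
        System.out.println(validate("<student><age>3</age></student>"));
    }
}
```

The first call should report no errors and the second at least one; the exact error text depends on the parser implementation.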

15
A few notes
  • Factory allows ease of switching parser
    implementations
  • Java provides simple DOM implementation, but much
    better to use vendor-supplied when doing serious
    work
  • Xerces, part of the Apache project, is installed
    on the cluster as an Eclipse plugin. We'll use it
    next week.
  • Note that some properties are not supported by
    all parser implementations.

16
Document object
  • Once a Document object is obtained, there is a
    rich API to manipulate it.
  • First call is usually
  • Element root = doc.getDocumentElement();
  • This gets the root element of the Document as an
    instance of the Element interface
  • Note that Element extends Node and so has the
    methods getNodeType(), getNodeName(),
    getNodeValue(), and getChildNodes()
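
As a small illustration of these first calls, the sketch below parses an XML string (made up for the example) and inspects the root element. The class and method names RootElementDemo and parseRoot are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RootElementDemo {
    // Parses an XML string into a DOM tree and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot(
            "<students><student>Ann</student><student>Bob</student></students>");
        System.out.println(root.getNodeName());               // students
        System.out.println(root.getChildNodes().getLength()); // 2
    }
}
```

Note that whitespace between tags would show up as extra Text children, so the child count of an indented document is usually larger than the number of child elements.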

17
Types of Nodes
  • Note that there are many types of Nodes (i.e.
    subinterfaces of Node)
  • Attr, CDATASection, Comment, Document,
    DocumentFragment, DocumentType, Element, Entity,
    EntityReference, Notation, ProcessingInstruction,
    Text
  • Each of these has a special and non-obvious
    associated type, value, and name.
  • Standards are language-neutral and are specified
    in the chart on the following slide
  • Important: keep this chart nearby when using DOM

18
Node | getNodeName() | getNodeValue() | getAttributes() | getNodeType()
Attr | attribute name | value of attribute | null | 2
CDATASection | #cdata-section | CDATA content | null | 4
Comment | #comment | comment content | null | 8
Document | #document | null | null | 9
DocumentFragment | #document-fragment | null | null | 11
DocumentType | doc type name | null | null | 10
Element | tag name | null | NamedNodeMap | 1
Entity | entity name | null | null | 6
EntityReference | name of entity referenced | null | null | 5
Notation | notation name | null | null | 12
ProcessingInstruction | target | entire content excluding the target | null | 7
Text | #text | actual text | null | 3
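
To make a few rows of the chart concrete, here is a short sketch that prints type, name, and value for an element, a comment, a text node, and an attribute. The input document and the names NodeChartDemo and describe are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeChartDemo {
    // Formats a node as "type name value", the three columns of the chart.
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    static Element parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parse("<note id=\"7\"><!--hi-->text</note>");
        System.out.println(describe(root));                        // 1 note null
        System.out.println(describe(root.getFirstChild()));        // 8 #comment hi
        System.out.println(describe(root.getLastChild()));         // 3 #text text
        System.out.println(describe(root.getAttributeNode("id"))); // 2 id 7
    }
}
```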
19
Transforming XML
20
The JAXP Transformation Packages
  • JAXP Transformation APIs
  • javax.xml.transform
  • This package defines the factory class you use to
    get a Transformer object. You then configure the
    transformer with input (Source) and output
    (Result) objects, and invoke its transform()
    method to make the transformation happen. The
    source and result objects are created using
    classes from one of the other three packages.
  • javax.xml.transform.dom
  • Defines the DOMSource and DOMResult classes that
    let you use a DOM as an input to or output from a
    transformation.
  • javax.xml.transform.sax
  • Defines the SAXSource and SAXResult classes that
    let you use a SAX event generator as input to a
    transformation, or deliver SAX events as output
    to a SAX event processor.
  • javax.xml.transform.stream
  • Defines the StreamSource and StreamResult classes
    that let you use an I/O stream as an input to or
    output from a transformation.

21
Transformer Architecture
22
Writing DOM to XML
  public class WriteDOM {
    public static void main(String[] argv) throws Exception {
      File f = new File(argv[0]);
      // Parse the file into a DOM tree
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document document = builder.parse(f);
      // An identity transformer copies the source to the result
      TransformerFactory tFactory = TransformerFactory.newInstance();
      Transformer transformer = tFactory.newTransformer();
      DOMSource source = new DOMSource(document);
      StreamResult result = new StreamResult(System.out);
      transformer.transform(source, result);
    }
  }

23
Creating a DOM from scratch
  • Sometimes you may want to create a DOM tree
    directly in memory. This is done with
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = factory.newDocumentBuilder();
  Document document = builder.newDocument();
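
A possible continuation, sketched under the assumption that you then want to populate the empty document and write it out with the identity transformer from the previous slide. The element names and the class BuildDomDemo are made up for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDomDemo {
    // Builds a small DOM tree in memory and serializes it to a string.
    static String build() throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();

        // Nodes are always created by the owning Document, then attached.
        Element root = document.createElement("students");
        document.appendChild(root);
        Element student = document.createElement("student");
        student.setAttribute("id", "1");
        student.appendChild(document.createTextNode("Ann"));
        root.appendChild(student);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```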

24
Manipulating Nodes
  • Once the root node is obtained, typical tree
    methods exist to manipulate other elements
  • boolean node.hasChildNodes()
  • NodeList node.getChildNodes()
  • Node node.getNextSibling()
  • Node node.getParentNode()
  • String node.getNodeValue()
  • String node.getNodeName()
  • String node.getTextContent()
  • void node.setNodeValue(String nodeValue)
  • Node node.insertBefore(Node newChild, Node
    refChild)
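
The following sketch exercises a few of these methods: it walks the children of a node with getFirstChild() and getNextSibling(), and inserts a new element with insertBefore(). The document content and the names NodeManipulationDemo, childNames, and demo are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeManipulationDemo {
    // Returns the names of the direct children of parent, comma-separated.
    static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.getNodeName());
        }
        return sb.toString();
    }

    // Parses <list><b/><c/></list> and inserts a new <a/> before <b/>.
    static String demo() throws Exception {
        String xml = "<list><b/><c/></list>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        Node b = root.getFirstChild();
        root.insertBefore(doc.createElement("a"), b);
        return childNames(root);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // a,b,c
    }
}
```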

25
SAX
  • Simple API for XML

26
About SAX
  • SAX in Java is hosted on SourceForge
  • SAX is not a w3c standard
  • Originated purely in Java
  • Other languages have chosen to implement in their
    own ways based on this prototype

27
SAX vs.
  • Please don't compare unrelated things
  • SAX is an alternative to DOM, but realize that
    DOM is often built on top of SAX
  • SAX and DOM do not compete with JAXP
  • They do both compete with JAXB implementations

28
How a SAX parser works
  • SAX parser scans an xml stream on the fly and
    responds to certain parsing events as it
    encounters them.
  • This is very different than digesting an entire
    XML document into memory.
  • Much faster, requires less memory.
  • However, need to reparse if you need to revisit
    data.

29
Obtaining a SAX parser
  • Important classes
  • javax.xml.parsers.SAXParserFactory
  • javax.xml.parsers.SAXParser
  • javax.xml.parsers.ParserConfigurationException
  // get the parser
  SAXParserFactory factory = SAXParserFactory.newInstance();
  SAXParser saxParser = factory.newSAXParser();
  // parse the document
  saxParser.parse(new File(argv[0]), handler);

30
DefaultHandler
  • Note that an event handler has to be passed to
    the SAX parser.
  • This must implement the interface
  • org.xml.sax.ContentHandler
  • Easier to extend the adapter
  • org.xml.sax.helpers.DefaultHandler

31
Overriding Handler methods
  • Most important methods to override
  • void startDocument()
  • Called once when document parsing begins
  • void endDocument()
  • Called once when parsing ends
  • void startElement(...)
  • Called each time an element begin tag is
    encountered
  • void endElement(...)
  • Called each time an element end tag is
    encountered
  • void characters(...)
  • Called zero or more times between startElement
    and endElement calls to deliver the character
    data accumulated so far
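
A minimal sketch of such a handler, assuming all we want is a log of the events in the order the parser fires them. The class name SaxEventLog and the sample input are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLog {
    // Parses xml and returns the sequence of SAX events that were fired.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                log.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
        };
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        saxParser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        // [startDocument, start:a, start:b, end:b, end:a, endDocument]
        System.out.println(events("<a><b/></a>"));
    }
}
```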

32
startElement
  public void startElement(
      String namespaceURI,  // namespace URI, if one is associated
      String sName,         // simple (non-qualified) name
      String qName,         // qualified name
      Attributes attrs)     // list of attributes
  • Attribute info is obtained by querying the
    Attributes object.

33
Characters
  public void characters(
      char[] buf,   // buffer of chars accumulated
      int offset,   // index of the first char in buf
      int len)      // number of chars
  • Note, characters may be called more than once
    between begin tag / end tag
  • Also, mixed-content elements require careful
    handling
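
Because characters() may fire several times for one run of text, a common pattern is to append into a buffer and only read it in endElement(). The sketch below assumes simple leaf elements with no mixed content; TextCollector and its field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    final Map<String, String> textByElement = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        current.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        current.append(buf, offset, len); // may be called more than once
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the text of the element known to be complete.
        textByElement.put(qName, current.toString().trim());
        current.setLength(0);
    }

    // Convenience wrapper: parse xml and return the text seen per element name.
    static Map<String, String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.textByElement;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            collect("<student><name>Ann</name><year>2</year></student>"));
    }
}
```

This simple buffer reset is exactly what breaks down for mixed content, where text and child elements are interleaved; a stack of buffers is one way to handle that case.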

34
Entity references
  • Recall that entity references are special
    character sequences for referring to characters
    that have special meaning in XML syntax
  • &lt; is <
  • &gt; is >
  • In SAX these are automatically converted and
    passed to the characters stream unless they are
    part of a CDATA section
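
A quick way to see this behaviour, as a sketch: collect everything passed to characters() for a document that uses &lt; both outside and inside a CDATA section. The class name EntityDemo and the input are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Returns all character data the parser reports for the document.
    static String text(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void characters(char[] buf, int offset, int len) {
                    sb.append(buf, offset, len);
                }
            });
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Outside CDATA the reference is expanded; inside it stays literal.
        System.out.println(text("<m>1 &lt; 2 <![CDATA[&lt;]]></m>")); // 1 < 2 &lt;
    }
}
```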

35
Choosing a Parser
  • Choosing your Parser Implementation
  • If no other factory class is specified, the
    default SAXParserFactory class is used. To use a
    different manufacturer's parser, you can change
    the value of the system property that points to
    it. You can do that from the command line, like
    this
  • java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...
  • The factory name you specify must be a fully
    qualified class name (all package prefixes
    included). For more information, see the
    documentation of the newInstance() method of the
    SAXParserFactory class.
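
To check which implementation was actually picked up, one option is to print the class of the factory that newInstance() returns. This is only a sketch; the class name WhichParser is hypothetical, and the output depends on the JDK and on any -D setting in effect.

```java
import javax.xml.parsers.SAXParserFactory;

class WhichParser {
    // Name of the concrete factory class selected by the lookup procedure.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null unless the property was set, e.g. with -D on the command line
        System.out.println(System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println(factoryClassName());
    }
}
```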

36
Validating SAX Parsers
  String JAXP_SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
  String W3C_XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
Next, you need to configure SAXParserFactory to
generate a namespace-aware, validating parser.
SAXParserFactory has no setAttribute() method, so
the schema language is set on the parser itself
with setProperty():
  SAXParserFactory factory = SAXParserFactory.newInstance();
  factory.setNamespaceAware(true);
  factory.setValidating(true);
  SAXParser saxParser = factory.newSAXParser();
  try {
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
  } catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
  }
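
A self-contained sketch of schema validation with SAX, under assumptions similar to the DOM case: the schema is a tiny in-memory string invented for illustration and is supplied through the JAXP schemaSource property, and validation errors are collected by overriding error() in the handler. The class name SaxSchemaValidation is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SaxSchemaValidation {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: <year> must hold an integer.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='year' type='xs:int'/>"
      + "</xs:schema>";

    // Parses xml with schema validation on and returns the validation errors.
    static List<String> validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        SAXParser saxParser = factory.newSAXParser();
        // The schema language must be set before the schema source.
        saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        saxParser.setProperty(JAXP_SCHEMA_SOURCE,
            new ByteArrayInputStream(SCHEMA.getBytes(StandardCharsets.UTF_8)));

        final List<String> errors = new ArrayList<>();
        saxParser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // recoverable validation error
                }
            });
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<year>2</year>"));
        System.out.println(validate("<year>two</year>"));
    }
}
```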
37
Transforming arbitrary data structures using SAX
and Transformer
38
Goal
  • Now that we know SAX and a little about
    Transformations, there are some cool things we
    can do.
  • One immediate thing is to create xml files from
    plain text files using the help of a faux SAX
    parser
  • Turns out to be more robust than doing by hand

39
Transformers
  • Recall that transformers easily let us go between
    any source and result by arbitrary wirings of
  • StreamSource / StreamResult
  • SAXSource / SAXResult
  • DOMSource / DOMResult
  • We used this to write a DOM tree to an XML file
  • Now we will use a SAXSource together with a
    StreamResult to convert our text file

40
Strategy
  • We construct our own SAX parser, i.e. a class
    that implements the XMLReader interface
  • This class must have a parse method (among
    others)
  • We use parse to read our input file and fire the
    appropriate SAX events.

41
What?
  • What are we really doing here?
  • We're having the SAX parser pretend that it has
    encountered certain SAX XML events when it reads
    the text file.
  • Exactly where we pretend these things occur is
    where the appropriate XML will get written by the
    transformer

42
Main snippet
  public static void main(String[] argv) throws Exception {
    // Create SAX parser
    StudentReader parser = new StudentReader();
    // Create transformer
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();
    // Use text file as Transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);
    // Write the result as text to System.out
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
  }
43
XMLReader implementation
  • To have a valid SAXSource we need a class that
    implements
  • XMLReader interface
  • public void parse(InputSource input)
  • public void setContentHandler(ContentHandler
    handler)
  • public ContentHandler getContentHandler()
  • ...
  • Shown are the important methods for a simple app
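
The slides do not show StudentReader itself, so the following is only a sketch of what such a class might look like. It assumes the input has one student name per line and emits a students element containing one student element per line; the element names, the toXml helper, and the decision to leave features and properties as no-ops are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

class StudentReader implements XMLReader {
    private ContentHandler handler;
    private DTDHandler dtdHandler;
    private EntityResolver resolver;
    private ErrorHandler errorHandler;

    // Reads plain text lines and fires the SAX events a real parser would
    // fire for <students><student>line</student>...</students>.
    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        BufferedReader br = new BufferedReader(input.getCharacterStream());
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "students", "students", noAttrs);
        String line;
        while ((line = br.readLine()) != null) {
            char[] chars = line.trim().toCharArray();
            if (chars.length == 0) continue; // skip blank lines
            handler.startElement("", "student", "student", noAttrs);
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    @Override
    public void parse(String systemId) throws SAXException {
        throw new SAXException("this sketch only reads from an InputSource");
    }

    @Override public void setContentHandler(ContentHandler h) { handler = h; }
    @Override public ContentHandler getContentHandler() { return handler; }

    // The rest of the interface is stored or ignored in this sketch.
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setFeature(String name, boolean value) { }
    @Override public Object getProperty(String name) { return null; }
    @Override public void setProperty(String name, Object value) { }
    @Override public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    @Override public DTDHandler getDTDHandler() { return dtdHandler; }
    @Override public void setEntityResolver(EntityResolver r) { resolver = r; }
    @Override public EntityResolver getEntityResolver() { return resolver; }
    @Override public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    @Override public ErrorHandler getErrorHandler() { return errorHandler; }

    // Hypothetical helper: runs the text through an identity transformer.
    static String toXml(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(
            new SAXSource(new StudentReader(), new InputSource(new StringReader(text))),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Ann\nB&B\n"));
    }
}
```

One reason this route is more robust than printing tags by hand: the serializer behind the transformer takes care of escaping, so a name such as B&B is written as B&amp;amp;B in the source sense, that is, with the ampersand escaped in the output XML.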

44
Extra Credit?
  • Volunteer to present this next class?

45
End