Title: XML Tutorial
1. XML Tutorial
2. Outline
- Today's web: created by hand, for eyes only
- Can HTML become smarter?
- SGML -> XML
- The next-generation web: XML and component-based commerce
- Prologue: XML and EDI
3. A Web Created by Hand for Eyes
- Much of the web is hand-crafted
- HTML is often exploited and extended to achieve specific layout and formatting
- HTML has too low an "Information IQ" to enable many desirable applications
4. The Limits of Hand-crafting
Time to Convert a Word Processing Document and Apply HTML Markup

Number of Pages   1 minute/page   10 minutes/page   60 minutes/page
10                10 minutes      100 minutes       10 hours
100               100 minutes     16.67 hours       12.5 days
1000              16.67 hours     20.83 days        4.17 months
10000             20.83 days      6.94 months       3.47 years
100000            6.94 months     5.79 years        34.72 years
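The figures in this table are consistent with 8-hour working days, 30-day months, and 12-month years. Those conversion factors are inferred from the numbers, not stated on the slide. A minimal Python sketch under that assumption follows; the function name and the cut-offs for switching units are illustrative choices.

```python
# Assumed conversion factors, inferred from the table's figures.
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30
MONTHS_PER_YEAR = 12


def conversion_time(pages, minutes_per_page):
    """Return (amount, unit) for marking up `pages` pages by hand."""
    minutes = pages * minutes_per_page
    if minutes <= 100:
        return (round(minutes, 2), "minutes")
    hours = minutes / 60
    if hours < 24:
        return (round(hours, 2), "hours")
    days = hours / HOURS_PER_DAY
    if days < DAYS_PER_MONTH:
        return (round(days, 2), "days")
    months = days / DAYS_PER_MONTH
    if months < MONTHS_PER_YEAR:
        return (round(months, 2), "months")
    return (round(months / MONTHS_PER_YEAR, 2), "years")


# Print one row per page count, one entry per markup rate.
for pages in (10, 100, 1000, 10000, 100000):
    print(pages, [conversion_time(pages, rate) for rate in (1, 10, 60)])
```

The point of the slide survives any reasonable choice of factors: hand markup time grows linearly with page count and quickly becomes impractical.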
5. Low vs. High IQ Encoding
- What information can be encoded?
- How adaptable or flexible is the format for encoding style, structure, or markup?
- Can the format tell you what it encodes?
- ASCII is very low IQ: only character info
- SGML is highest IQ: encodes anything and completely specifies the encoding rules
- PDF? HTML?
6. HTML is too low in IQ
- HTML was designed as a simple markup language
- simple structures: headings, lists, links
- strong emphasis on formatting
- weak for encoding content
- HTML wasn't designed to encode the structure and semantics needed for complex applications
7. Web Applications That Need Smarter Data
- Data interchange between Web clients
- Moving processing from server to client
- Multiple client-side views w/o new data
- Information push from personalized applications
8. Can HTML be made smarter?
- Create new tags used by your application, or use <META>, DIV, and CLASS (and hope they don't interfere elsewhere)
- Use a standard metadata model (but which one? Dublin Core, PICS, P3, OPS, ...)
- Hide applet code in comments (platform dependent?)
- Hack, hack, hack...
9. Inherent Limitations of HTML
- Not extensible
- Limited capability to encode structure
- No validation
- Lossy interchange
10. XML
- Extensible Markup Language
- a standard way of creating markup languages for the Web
- a file format for data representation
- a schema for describing data or message structures
- a mechanism for extending and annotating HTML with semantic information
- XML is a simplification of SGML, the Standard Generalized Markup Language
- easier to understand and implement
11. HTML Apartment Listing
<HTML>
<HEAD>
<TITLE>An Apartment For Rent</TITLE>
</HEAD>
<BODY>
<H1>Apartment</H1>
<P>1800 square feet, 3 bedrooms, 7 baths.
<H2>No pets, smoking forbidden!</H2>
<H3>Amenities</H3>
<P>
Sunny location, good view, has air-conditioner.
<H3>Location</H3>
<P>2008 South E. Avenue, Eureka, CA
<H3>Cost, Etc.</H3>
<P>Price 3600 a month
<P>Contact (415) 123-4567
<P>Available immediately
<P>This offer posted 1 August 1997 in the Eureka Daily Times
</BODY>
12. An XML Apartment Listing
<?xml version="1.0"?>
<!DOCTYPE LISTING SYSTEM "APTLISTING.DTD">
<LISTING>
  <ADINFO>
    <POSTED>March 26, 1997</POSTED>
    <WHERE_POSTED>Belmont Courier</WHERE_POSTED>
    <CONTACT>(650) 111-2222</CONTACT>
  </ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
    <COMMENT>Small cottage in a big forest</COMMENT>
  </DESCRIPTION>
  <POLICIES>
    <PETS>Not allowed</PETS>
    <BOZOS>Not allowed</BOZOS>
  </POLICIES>
  <COST>875</COST>
</LISTING>
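Because every part of the XML listing is named, a program can pull out individual fields without guessing from layout. A minimal sketch using Python's standard `xml.etree.ElementTree` module follows. The DOCTYPE line is left out so that the snippet does not depend on an external DTD file, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's listing, without the DOCTYPE line.
LISTING_XML = """\
<LISTING>
  <ADINFO>
    <POSTED>March 26, 1997</POSTED>
    <WHERE_POSTED>Belmont Courier</WHERE_POSTED>
    <CONTACT>(650) 111-2222</CONTACT>
  </ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
    <COMMENT>Small cottage in a big forest</COMMENT>
  </DESCRIPTION>
  <POLICIES>
    <PETS>Not allowed</PETS>
    <BOZOS>Not allowed</BOZOS>
  </POLICIES>
  <COST>875</COST>
</LISTING>
"""

listing = ET.fromstring(LISTING_XML)

# Each field is addressed by element name, not by position on the page.
cost = int(listing.findtext("COST"))
contact = listing.findtext("ADINFO/CONTACT")
pets = listing.findtext("POLICIES/PETS")

print(f"Cost {cost}, contact {contact}, pets: {pets}")
```

Extracting the same three values from the HTML version would mean scraping text such as "Price 3600 a month" out of generic paragraph tags.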
13. But First: One-Minute SGML
- Standard Generalized Markup Language, ISO 8879
- SGML defines the markup language that specifies the logical rules for a given type of document
- Markup transforms a flat stream of text into a set of objects or elements that can be manipulated by other applications
- Since there is no universal tag set that can describe all documents, SGML provides the means for defining the tag set that meets your needs
14. SGML's Big Idea: Document Types
- The idea of a document type is easy to understand
- The Document Type Definition or DTD defines
- the class of documents that shares a common information model
- permissible elements and attributes, their contents, the order in which they occur
- The DTD is the document schema that makes an instance self-describing
- From a DTD a parser can be generated to test any document for conformance
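The conformance test can be illustrated with a deliberately simplified stand-in for a DTD. In the sketch below, `CONTENT_MODEL` is a hypothetical table that maps each element name to the exact sequence of child elements it must contain; a real DTD content model is richer (optional and repeated elements, attributes). Python's `xml.etree.ElementTree` checks well-formedness only, so the conformance check itself is hand-written here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified content model: element -> required child sequence.
# An empty list means the element holds text only.
CONTENT_MODEL = {
    "ADINFO": ["POSTED", "WHERE_POSTED", "CONTACT"],
    "POSTED": [],
    "WHERE_POSTED": [],
    "CONTACT": [],
}


def conformance_errors(element):
    """Return a list of messages describing where `element` breaks the model."""
    if element.tag not in CONTENT_MODEL:
        return [f"undeclared element <{element.tag}>"]
    errors = []
    expected = CONTENT_MODEL[element.tag]
    actual = [child.tag for child in element]
    if actual != expected:
        errors.append(f"<{element.tag}> contains {actual}, expected {expected}")
    for child in element:
        errors.extend(conformance_errors(child))
    return errors


valid = ET.fromstring(
    "<ADINFO><POSTED>March 26, 1997</POSTED>"
    "<WHERE_POSTED>Belmont Courier</WHERE_POSTED>"
    "<CONTACT>(650) 111-2222</CONTACT></ADINFO>"
)
# Same kind of data, but one element is missing and the order is wrong.
invalid = ET.fromstring(
    "<ADINFO><CONTACT>(650) 111-2222</CONTACT>"
    "<POSTED>March 26, 1997</POSTED></ADINFO>"
)

print(conformance_errors(valid))
print(conformance_errors(invalid))
```

Both documents are well-formed; only the first matches the declared structure, which is the distinction a validating parser draws.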
15. Examples of Document Types
- User manuals
- Reference manuals
- Directories
- Newsletters
- Brochures
- Catalogs
- Datasheets
- Proposals
- Dictionaries
- Technical reports
- Contracts
- Regulations
- Policies and procedures
- Journal Articles
- Textbooks
- Purchase Orders
- Invoices
- Recipes
16. HTML as a Document Type
- HTML can be described as an application of SGML: the HTML document type
- Simple structures: headings, lists, links
- Strong emphasis on formatting, weak for encoding content
- Not designed to encode the content distinctions for any particular industry or application
- But most HTML doesn't conform to the HTML DTD
17. Designing a DTD
- Determine information requirements, purposes, uses (and their priorities)
- deliver in one or more print and online formats
- create new information products
- interchange with other authors or publishers
- integrate information into equipment
- meet company, industry, customer standards
18. Designing a DTD
- Determine process, tool, external constraints or standards
- Identify and name information components and component containers
- Create categories to organize the components
- Determine when, where, how often components appear
19. Designing a DTD
- Identify meta-information to augment the information components
- bibliographic information
- process and workflow-related information
- Describe the component hierarchy in a graphic notation to visualize it
- Transcribe the graphic notation into formal syntax
- Test the analysis on sample documents
- Document the process and the results
20. SGML: Close, but no Cigar
- SGML has been successful in niches, but hasn't been adopted by rank-and-file Web publishers
- the quiet revolution
- the million dollar secret
- Perceived as too complex (because of features dating from keystroke-minimizing origins)
- Small vendors didn't have the clout to legitimize SGML in the mass market (but some of them cleverly dumbed-down their tools for HTML)
21. XML: Right Place, Right Time
- Looks like HTML, but acts like SGML--
- Backed by
- World Wide Web Consortium (W3C)
- Sun - give Java something to do
- Microsoft - with great enthusiasm
- Netscape - with less enthusiasm
- SGML tool vendors and consultants
- Innovators in EDI community
22. Specific XML Proposals to Simplify SGML
- All elements have start and end tags
- All attributes are name="value"
- Changed syntax for EMPTY elements
- <toc> => <toc/>
- <graphic file="x.gif"> => <graphic file="x.gif"/>
- No & connector in content models
- No inclusions and exclusions
- DTD not necessary because it can be inferred if instance is well-formed
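These rules are what make a document checkable without a DTD: a parser can decide well-formedness from the text alone. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` and the `doc` wrapper element are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# SGML-style empty element: no end tag, so the XML parser rejects it.
print(is_well_formed("<doc><toc></doc>"))
# XML empty-element syntax.
print(is_well_formed("<doc><toc/></doc>"))
# Attribute values must be quoted.
print(is_well_formed("<graphic file=x.gif/>"))
print(is_well_formed('<graphic file="x.gif"/>'))
```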
23. XML Adoption Scenarios
- The transition from the "Web for eyes" to the automated Web
- 1st generation: XML leaves HTML alone
- 2nd generation: HTML as output format created from XML instance
- 3rd generation: XML repositories
24. 1st Generation XML
- No disruption of existing HTML production processes
- XML production process may have nothing to do with HTML production process
- XML for processes, HTML for eyes, but XML and HTML can be linked together
25. 1st Generation XML Leaves HTML as Is
[Diagram, from CREATION to DELIVERY: a data source feeds two separate conversions, one to XML and one to HTML for eyes]
26. 2nd Generation XML
- Creation of XML is primary process
- Replace hand-crafted HTML with automated down-translation
- Alternatively, use XML style sheet to create HTML-like presentation(s)
- instance-at-a-time retargeting
27. Up and Down Translation
Levels of text, from most to least structured:
- Content/structure-based text objects: SGML, XML, databases
- Formatted electronic text: HTML, word processing files
- Unstructured electronic text: ASCII
- Printed text
[Diagram axis labels: "More structure (energy)" and "Easier to translate to"]
28. 2nd Generation XML Restores Order
[Diagram: a data source is converted to an XML source, which is down-translated to XML, HTML, and HDML, or presented HTML-like through XML style sheet(s)]
29. HTML as an Output Format
- Treating HTML as an output format generated from an SGML source repository insulates you from ongoing changes to HTML and the latest proprietary extensions
- HTML created by down translation can be richer in structure and more consistent than HTML created by hand at many times the cost
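Down translation can be sketched as a small program that reads the structured source and emits HTML. The example below is an illustration, not a production converter: it uses an abbreviated form of the apartment listing shown earlier, and the function name `down_translate` and the chosen page layout are assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abbreviated version of the apartment listing, used as the structured source.
SOURCE = """\
<LISTING>
  <ADINFO><CONTACT>(650) 111-2222</CONTACT></ADINFO>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
  </DESCRIPTION>
  <COST>875</COST>
</LISTING>
"""


def down_translate(xml_text):
    """Generate an HTML page from one LISTING instance."""
    listing = ET.fromstring(xml_text)

    def field(path):
        # Escape text so characters such as & and < stay valid in HTML.
        return escape(listing.findtext(path, default=""))

    return "\n".join([
        "<HTML>",
        "<HEAD><TITLE>An Apartment For Rent</TITLE></HEAD>",
        "<BODY>",
        "<H1>Apartment</H1>",
        f"<P>{field('DESCRIPTION/AREA')}, {field('DESCRIPTION/AMENITIES')}</P>",
        f"<P>Price {field('COST')}</P>",
        f"<P>Contact {field('ADINFO/CONTACT')}</P>",
        "</BODY>",
        "</HTML>",
    ])


page = down_translate(SOURCE)
print(page)
```

If the target HTML dialect changes, only `down_translate` needs editing; the source listings stay untouched.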
30. 3rd Generation XML
- reuse, not just retargeting
- XML a first-class citizen from the start
- content-oriented DTD
- native authoring, or enhanced markup by editorial or production staff
- no longer file at a time, create db and work on it
- support for custom applications
31. 3rd Generation XML Repository
[Diagram: Inputs 1-4 enter a central XML repository by up-translation or decomposition; Outputs 1-4 leave it by down-translation or assembly]
32. Retargeting and Reuse Requirements
- different delivery channels
- Web
- CD-ROM, CD-ROM Web hybrids
- Braille, large print, voice synthesis (ICADD)
- different dialects of HTML for different browsers or bandwidths or as HTML changes
- different applications (slice and dice)
- reference manual vs help vs tutorial
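A small sketch of "slice and dice" over a repository: the same structured source yields a sorted price list for one application and plain sentences suitable for large print or voice output for another. The `LISTINGS` wrapper element, the second listing, and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository; the wrapper and the second listing are sample data.
REPOSITORY = """\
<LISTINGS>
  <LISTING>
    <ADINFO><CONTACT>(650) 111-2222</CONTACT></ADINFO>
    <DESCRIPTION>
      <AREA>1400 SQUARE FEET</AREA>
      <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
    </DESCRIPTION>
    <COST>875</COST>
  </LISTING>
  <LISTING>
    <ADINFO><CONTACT>(650) 555-0100</CONTACT></ADINFO>
    <DESCRIPTION>
      <AREA>900 SQUARE FEET</AREA>
      <AMENITIES>Studio</AMENITIES>
    </DESCRIPTION>
    <COST>700</COST>
  </LISTING>
</LISTINGS>
"""


def price_list(root):
    """Reuse: (cost, contact) pairs across all listings, cheapest first."""
    rows = [
        (int(item.findtext("COST")), item.findtext("ADINFO/CONTACT"))
        for item in root.iter("LISTING")
    ]
    return sorted(rows)


def spoken_lines(root):
    """Retarget: one plain sentence per listing, with no markup at all."""
    return [
        "{}, {}, cost {}.".format(
            item.findtext("DESCRIPTION/AMENITIES"),
            item.findtext("DESCRIPTION/AREA").lower(),
            item.findtext("COST"),
        )
        for item in root.iter("LISTING")
    ]


root = ET.fromstring(REPOSITORY)
print(price_list(root))
print(spoken_lines(root))
```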
33. XML for the Web's Little Languages
- CDF -- channel definition format, eliminates need for proprietary push plug-in
- OSD -- open software description, for describing configurations for automated distribution of software
- PICS -- for content ratings
- RDF -- resource description framework, merging Netscape and Microsoft metadata initiatives
- CBL -- common business language in eCo framework
34. The Next-Generation Web
PROBLEMS                    SOLUTIONS
The Web is eyeballs-only    Metadata and Object APIs -- self-describing smart Web
No content encoding         Web catalogs and documents in their native schema
Things can't be found       Distributed registries and structure-based retrieval
No automation of tasks      Agent-based run-time environment
35. Infrastructure Requirements
- A means of transforming legacy Internet services into components
- Today's services are accessed through browsers or ad hoc APIs
- An extensible semantic framework for component integration
- Heterogeneity and lack of standards
- A scalable, distributed indexing structure and registry services for components
- Things can't be found systematically
- An agent-based execution environment
- No run-time integration or automation of tasks
36. The Internet Today
[Diagram: separate databases, applications, an FTP server, and Web servers holding documents]
37. A Commerce Type Definition (CTD)
<!Doctype Taxonomy public "-//CommerceNet//DTD Taxonomy V1.0//EN">
<Taxonomy>
  <Head>
    <Label>United Airlines</Label>
    <Version>1.0</Version>
    <Base>World Airline Registry1.12.3.7</Base>
    <Registry>toe.commerce.net2111</Registry>
  </Head>
  <Body>
    <Services>
      <Passenger_Flight_Information>
        <Flight_Number>UA 200</Flight_Number>
        <Flight_Price US>168.50</Flight_Price US>
        <Flight_Dest>Honolulu, Hawaii</Flight_Dest>
      </Passenger_Flight_Information>
      <Cargo_Flight_Information>
      </Cargo_Flight_Information>
    </Services>
  </Body>
</Taxonomy>
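A sketch of how a client might read services out of a CTD like this one. The slide's `Flight_Price US` tag is not well-formed XML as written, so this example assumes a hypothetical `currency` attribute instead; it also omits the document type declaration and the `Head` section. The function name `passenger_flights` is illustrative.

```python
import xml.etree.ElementTree as ET

# Well-formed adaptation of the slide's CTD (currency attribute is assumed).
CTD = """\
<Taxonomy>
  <Body>
    <Services>
      <Passenger_Flight_Information>
        <Flight_Number>UA 200</Flight_Number>
        <Flight_Price currency="US">168.50</Flight_Price>
        <Flight_Dest>Honolulu, Hawaii</Flight_Dest>
      </Passenger_Flight_Information>
      <Cargo_Flight_Information>
      </Cargo_Flight_Information>
    </Services>
  </Body>
</Taxonomy>
"""


def passenger_flights(ctd_text):
    """Return one dict per Passenger_Flight_Information element."""
    root = ET.fromstring(ctd_text)
    flights = []
    for info in root.iter("Passenger_Flight_Information"):
        price = info.find("Flight_Price")
        flights.append({
            "number": info.findtext("Flight_Number"),
            "destination": info.findtext("Flight_Dest"),
            "price": float(price.text),
            "currency": price.get("currency"),
        })
    return flights


flights = passenger_flights(CTD)
print(flights)
```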
38. Step 1: XML Metadata
[Diagram: the databases, FTP server, applications, and Web servers from "The Internet Today", each now accompanied by a CTD]
39. Step 2: Registries
[Diagram: the databases, FTP server, applications, and Web servers with their CTDs, now joined by registries]
40. Common Business Language (CBL)
- Who am I?
- Company name, contact, public key certificates
- What am I?
- Agent/object (API), document (DTD), database (schema)
- Available data
- Product list, price list, terms and conditions, catalog, order form
- Available services
- Buy, sell, RFQ, search catalog
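The four questions above can be read as the fields of a self-description that a component publishes to a registry. The sketch below is purely illustrative and is not the CBL format: the class names, field names, and company names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentDescription:
    company: str                                         # Who am I?
    kind: str                                            # What am I? agent, document, database
    data: List[str] = field(default_factory=list)        # Available data
    services: List[str] = field(default_factory=list)    # Available services


class Registry:
    """In-memory stand-in for a registry of component descriptions."""

    def __init__(self):
        self._components = []

    def register(self, component):
        self._components.append(component)

    def find_by_service(self, service):
        """Return every registered component that offers `service`."""
        return [c for c in self._components if service in c.services]


registry = Registry()
registry.register(ComponentDescription(
    company="Example Parts Co.", kind="database",
    data=["price list", "catalog"], services=["sell", "RFQ", "search catalog"],
))
registry.register(ComponentDescription(
    company="Example Buyer Inc.", kind="agent",
    data=["order form"], services=["buy"],
))

matches = [c.company for c in registry.find_by_service("RFQ")]
print(matches)
```

An agent looking for a supplier would query the registry by service, then use the component's declared data and kind to decide how to talk to it.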
41. Step 3: CBL Components
[Diagram: the Step 2 picture of databases, FTP server, applications, and Web servers with their CTDs and registries, presented as CBL components]
42. Step 4: Agents
[Diagram: the databases, FTP server, applications, and Web servers with their CTDs and registries, now with agents added]
43. Step 5: Business Services
[Diagram: the databases, FTP server, applications, and Web servers with their CTDs, registries, and agents, now with Matchmaking Services and Trust Intermediaries added]
44. Wrapping Up
- HTML will continue to exist, but most serious publishers will produce HTML and XML versions of their content from the same smarter source
- XML unifies document and database perspectives and tools for Web publishing and lets them be automated in the same way
45. Prologue: XML and EDI
- XML appeals to the EDI community because
- it reinforces the move to Internet EDI
- it suggests a way to make transaction sets easier to define and self-describing
- But which kind of XML/EDI?
- incremental strategy of wrapping existing EDI transactions in XML syntax
- radical re-thinking of EDI to create XML fragments for transaction components that are dynamically combined as needed
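The incremental strategy can be sketched as a mechanical wrapper that keeps the existing transaction content and only changes its syntax. In the example below, the message, the `*` and `~` delimiters, and the `Transaction`/`Segment`/`Element` names are all assumptions made for illustration; the message is not taken from any real EDI standard.

```python
import xml.etree.ElementTree as ET


def wrap_segments(message, segment_sep="~", element_sep="*"):
    """Wrap a delimited, EDI-style message in XML without interpreting it."""
    root = ET.Element("Transaction")
    for raw in message.split(segment_sep):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty piece after the final separator
        segment_id, *values = raw.split(element_sep)
        segment = ET.SubElement(root, "Segment", {"id": segment_id})
        for position, value in enumerate(values, start=1):
            element = ET.SubElement(segment, "Element", {"pos": str(position)})
            element.text = value
    return root


# Invented sample message: an order header and one line item.
wrapped = wrap_segments("ORD*1001*19970801~ITM*widget*12~")
print(ET.tostring(wrapped, encoding="unicode"))
```

The wrapper makes the message parseable by XML tools, but the elements are still identified only by position. Naming each component is the step the "radical re-thinking" option takes.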
46. Learning More
- The mother of all information about XML is the SGML Home Page (www.sil.org/sgml/xml.html)
- Best overall book for managers to get started with SGML and XML is ABCD... SGML by Liora Alschuler
- Best overall book for HTML-savvy types is SGML on the Web by Yuri Rubinsky and Murray Maloney