Unicode in Distributed Systems

Author: Mike McKenna. Keywords: International Unicode Conference #9. Last modified by: Mary McKenna.

1
Unicode in Distributed Systems
Michael G. McKenna, mgm@globalisation.org
Globalisation Strategist, Haddon Hill International
2
Distributed Systems
[Diagram labels: End user device; End User interface standards; Application; Data Access; Data]
Unicode can be implemented in any of the functional areas
3
Distributed Systems Status Quo
  • Heterogeneous
  • Large Investments
  • Mixed Proprietary and International Standards
  • Often Under Parochial Control
  • Work Group to Global Organizational Size

4
The Enterprise in the Real World
[Diagram labels: Java; Internet Clients; SYBASE; ApplServer; DB2; Oracle; Application Development; Data Collection; Distributed Enterprise Information; Multiple SQL Database Access; Embedded Training; Flat Files; Web Servers; Legacy Non-Relational Data; Real-Time; Mainframe Data; Distributed Systems; Plug & Play Users; CORBA; J2EE]
5

The Enterprise in the Real World
[Diagram labels: Java; Internet Clients; SYBASE; ApplServer; DB2; Oracle; Application Development; Data Collection; Distributed Enterprise Information; Multiple SQL Database Access; Embedded Training; Flat Files; Web Servers; Legacy Non-Relational Data; Real-Time; Mainframe Data; LAN-Based Systems; Plug & Play Users; TCP/IP; CORBA; J2EE]
6

Enterprise Client/Server Requirements
[Diagram labels: Java; WWW Clients; SYBASE; ApplServer; DB2; Oracle; Application Development; Data Entry; Multiple SQL Database Access; Distributed Enterprise Information; Embedded Training; Flat Files; Remote Backup; Legacy Non-Relational Data; Real-Time; Mainframe Data; LAN-Based Systems; Plug & Play Users; TCP/IP; CORBA; J2EE]
[Requirements shown: Multilingual; Localisable; Developer & End-User Productivity; Transparency; Seamless Interoperability; System Management; System Configuration; Performance Monitoring; Auditing; Security]
7
Any Component Can Affect Globalization
Network
Database Server
Client Application
Database Design
Server API
Client API
Non-RDBMS Data
OS API
GUI API
8
Globalisation Spans all Areas
Data
O/S
Network API
Server Comm API
Application
9
Rating Distributed Systems
  • A system for rating levels of Internationalisation
  • 3: Global Ready / Local Cultural Authenticity
  • 2: Global Ready
  • 1: Single-Locale Ready (Europe or Asia)
  • 0: Locale-Specific Early Adopter
  • -1: 8-bit Clean
  • -2: 7-bit Dirty
  • -3: Don't Care

10
Level (-3) Don't Care
  • Ethnocentric attitude of organization
  • Lack of understanding
  • No desire
  • I18N thought of as another feature
  • Fear, uncertainty and doubt

11
Level (-2) 7-bit dirty
  • 7-bit ASCII support
  • U.S. only
  • ASCII sort only
  • U.S. platforms/environments only
  • U.S.-specific UI
  • U.S. keyboards, terminals, printers

12
Level (-1) 8-bit clean
  • 8-bit data integrity (the 8th bit is not
    stripped)
  • Support for 8-bit object names
  • 16-bit data integrity for pass through

13
Level (0) Minimum I18N
  • 8-bit and multibyte codeset support
  • 8-bit and multibyte lexical support
  • 8-bit and multibyte object names
  • European sort orders
  • Localizable
  • European and Asian platform/HW support
  • Documentation on I18N
  • Multibyte input and display
  • Application development in target language
  • European and Asian keyboards, terminals, printers

14
Level (1) Minimum Heterogeneous I18N
  • Unicode support
  • Can add European sort orders
  • Distributed locale management
  • All messages localizable
  • Language-sensitive string operations
  • Locale-sensitive cultural string formatting
  • Transparent Connectivity
  • Imperial calendars
  • Codeset conversion
  • Localizable user interface
  • European multilingual application development

15
Level (2) Global Ready
  • Can add new character set conversions in the
    field
  • Bi-directional support
  • Robust codeset conversion
  • Support world-wide multilingual application
    development
  • Multiscript heterogeneous distributed processing

16
Level (3) Cultural Authenticity
  • Full Unicode support
  • Keisen tables, radar charts in Japan
  • Non-Gregorian calendars
  • Composite characters
  • Vertical input and display

17
Evolution of Client/Server/Intranet
Departmental | Enterprise
10-100 Users | 100s to 1000s of Users
Centralized Server(s) | Distributed Servers
Mainframe Extracts | Mainframe Integration
Single Function | Corporate-Wide
Stand Alone | Integrated
Simple Administration | Complex Management
Single Vendor | Many Vendors
10s of Gigabytes | Gigabytes to Terabytes
Internet: Any User, Any Machine, Anywhere; Systems, Applications, Databases; World-wide; HTTP/HTML; Remote Mgmt

18
Evolution of Client/Server/Intranet
Departmental | Enterprise
10-100 Users | 100s to 1000s of Users
Centralized Server(s) | Distributed Servers
Mainframe Extracts | Mainframe Integration
Single Function | Corporate-Wide
Stand Alone | Integrated
Simple Administration | Complex Management
Single Vendor | Many Vendors
10s of Gigabytes | Gigabytes to Terabytes
Intranet: Any User, Any Machine, Anywhere; Systems, Applications, Databases; World-wide; HTTP/HTML; Remote Mgmt

20
Legacy Systems
  • Communication through Gateways
  • Proprietary Character Sets
  • Many Asian Implementations
  • Lots of Data, Lots of Money

[Diagram labels: Gateway; Gateway; Bridge]
21
Three-Tier I18N System Normalisation
LANGUAGE
VIEW
DATA
22
Phased Approach for Distributed Unicode
  • Phase I - encapsulated Unicode
      • Used internally, with conversion filters to the operating
        system environment and external distributed APIs
  • Phase II - Unicode on the wire
      • Unicode for transmission to distributed applications
      • Requires application control on both sides of the wire
  • Phase III - Unicode end-to-end
      • Unicode-enabled user I/O with appropriate software
      • Competitive advantage for multiplatform portability
  • Finally - Unicode everywhere
      • Operating environments and standards catch up. Change the
        conversion filters and the distributed applications
        continue to work
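The Phase I idea can be sketched in Java roughly as follows. This is a minimal illustration, not the presenter's implementation: the class name, the `process` step, and the choice of ISO-8859-1 as the external code set are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ConversionEnvelope {
    // Code set spoken by the surrounding environment (an assumption of this sketch).
    private final Charset external;

    ConversionEnvelope(Charset external) {
        this.external = external;
    }

    // Inbound filter: external bytes become a Unicode String.
    String decode(byte[] data) {
        return new String(data, external);
    }

    // Outbound filter: the Unicode String is converted back for the environment.
    byte[] encode(String text) {
        return text.getBytes(external);
    }

    // Hypothetical application logic; it only ever sees Unicode text.
    static String process(String text) {
        return text.toUpperCase(Locale.ROOT);
    }

    // Runs one request through the envelope: decode, process, encode.
    static byte[] handle(byte[] request, Charset external) {
        ConversionEnvelope envelope = new ConversionEnvelope(external);
        return envelope.encode(process(envelope.decode(request)));
    }

    public static void main(String[] args) {
        byte[] request = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" in ISO-8859-1
        byte[] reply = handle(request, StandardCharsets.ISO_8859_1);
        System.out.println(reply.length + " bytes returned to the legacy side");
    }
}
```

When the environment catches up, only the `Charset` passed to the envelope changes (for example to `StandardCharsets.UTF_8`); `process` stays untouched.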

23
Phase I - Encapsulated Unicode
  • Unicode Enabled application inside a conversion
    envelope.

24
Phase II - Unicode on the Wire
non-Unicode App
  • Conversion filters to operating environment and
    distributed non-Unicode APIs

25
Phase III - Unicode End-to-End
non-Unicode App
  • If needed, use proprietary software to enable
    Unicode technology for user interfaces.

26
Final Phase - Unicode Everywhere
  • Distributed Environment vendors and standards
    bodies support Unicode
  • Unicode used everywhere for communication, data
    representation, and user interfacing

27
Legacy System Integration
PCs
3270s
Local Data Servers
AS/400
AS/400
  • Rightsizing a Large Legacy System
  • 3270 terminals in 16 countries connected to
    AS/400 MIS system
  • Integration with new Client/Server
  • Microsoft Windows clients (1st tier)
  • Sun Sparc Solaris Unix servers (2nd tier)
  • IBM AS/400 backend (3rd tier)

28
Example B2B Technology to Enable Global eCommerce
  • All data in Unicode
  • UTF-8 in XML
  • All resource and message files stored in Unicode
    for portability
  • Support for Unicode internally
  • Java and XML

29
Convertibility
CS0
  • Mapping Tables
  • National standards
  • International standards
  • Vendor standards
  • Always a mapping
  • Unicode base standards
  • Replacement characters

CS0
30
CORBA and Code Set Conversion
  • Use Unicode for Inter-ORB global communications
  • OMG Common Object Services (COS)
  • Inter-ORB Bridge Support
  • General Inter-ORB Protocol (GIOP)
  • Internet Inter-ORB Protocol (IIOP)
  • Code Set Negotiation uses CONV_FRAME IDL

Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
31
CORBA IOP/IOR
  • IOP: Inter-ORB Protocol
  • IOR: Interoperable Object Reference (like a URL,
    with attributes)

Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
32
Transmission Code Set
  • Character Set: the characters, independent of
    encoding
  • Code Set: the encoded values of a Character Set
  • OSF Character and Code Set Registry
  • ftp.opengroup.org/pub/code_set_registry
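To make the distinction concrete, here is a small Java sketch (class and method names are illustrative only) that takes one character, U+00E9, and prints its encoded value in three code sets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodeSetDemo {
    // Encodes the text in the given code set and renders the bytes as hex.
    static String encodedHex(String text, Charset codeSet) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(codeSet)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // one character: LATIN SMALL LETTER E WITH ACUTE
        System.out.println(encodedHex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(encodedHex(eAcute, StandardCharsets.UTF_8));      // C3A9
        System.out.println(encodedHex(eAcute, StandardCharsets.UTF_16BE));   // 00E9
    }
}
```

The character is the same in every case; only the code set, and therefore the bytes on the wire, differs.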

Diagram from The Common Object Request Broker
Architecture and Specification, Rev. 2.2,
Chapter 11 ORB Interoperability Architecture
33
Character Set Conversion
  • User definable
  • Table-driven
  • API for user-defined routines
  • Robust
  • Negotiated conversion policy with Server
  • CMR - Client Makes Right
  • SMR - Server Makes Right
  • UNR - Universal Network Representation
  • Unicode based conversions

34
Character Set Conversion
  • Configurable error results depending on
    data-integrity needs
  • Exact Match
  • Best Guess
  • Error plus replacement character
  • Multiple character sets supportable with ICU as a
    Conversion Envelope
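Two of the error policies above can be approximated with the standard `java.nio.charset` API. This is a sketch under the assumption that "Exact Match" maps to `CodingErrorAction.REPORT` and "Error plus replacement character" to `CodingErrorAction.REPLACE`; the class and method names are hypothetical. A "Best Guess" policy needs fallback mapping tables (for example from ICU) and is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ConversionPolicy {
    // Shared helper: encode with one action for malformed and unmappable input.
    private static byte[] encode(String text, Charset target, CodingErrorAction action)
            throws CharacterCodingException {
        ByteBuffer buffer = target.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Exact match: fails if any character has no mapping in the target code set.
    static byte[] encodeExact(String text, Charset target) throws CharacterCodingException {
        return encode(text, target, CodingErrorAction.REPORT);
    }

    // Replacement: unmappable characters become the encoder's replacement bytes.
    static byte[] encodeWithReplacement(String text, Charset target) {
        try {
            return encode(text, target, CodingErrorAction.REPLACE);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("unexpected failure with REPLACE", e);
        }
    }

    public static void main(String[] args) {
        String price = "\u20AC5"; // euro sign followed by '5'; no euro sign in ISO-8859-1
        try {
            encodeExact(price, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            System.out.println("exact match failed");
        }
        byte[] lossy = encodeWithReplacement(price, StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ?5
    }
}
```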

35
Character Set and Sort Order Definitions
  • International and commercial standards supplied
    by ICU
  • User definable for others
  • 8-bit
  • Multi-byte
  • Unicode reference set
  • Utilities for creating character sets
  • Sort order issues
  • Multilingual sorting
  • Multiple sort orders and indexing
  • Default vs expected sorting
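The gap between default and expected sorting shows up even with plain English data. The following sketch uses the JDK's `java.text.Collator` (ICU offers a comparable collation API); the class name and word list are made up for the example.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortOrderDemo {
    // Default sort: compares UTF-16 code unit values.
    static String[] binarySort(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Expected sort: uses the collation rules for the given locale.
    static String[] localeSort(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Zebra", "apple"};
        System.out.println(Arrays.toString(binarySort(words)));                 // [Zebra, apple, banana]
        System.out.println(Arrays.toString(localeSort(words, Locale.ENGLISH))); // [apple, banana, Zebra]
    }
}
```

Supporting multiple sort orders then amounts to choosing a different `Locale` (or a custom rule set) per index or per query.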

36
Unicode SQL Database
  • Virtually every written business language
    supported
  • Allows world-wide solutions

37
Unicode in Databases
38
IETF
  • RFC 2277 - All Protocols
  • IETF Policy on Character Sets and Languages
  • The Internet is international
  • Must identify charset
  • Must support UTF-8
  • Must identify language
  • Multilanguage support required
  • RFC 2130 - new Protocols and Formats
  • Unicode default (UTF-8)

39
MIME
  • MIME charset Parameters
  • Used for Character encoding identification
  • HTTP
  • HTML
  • XML
  • CSS

40
HTTP encoding negotiation

Client sends Accept-Charset HTTP header
    Accept-Charset: UTF-8, ISO-8859-1;q=0.9, *;q=0.1

Server knows the encoding and sends the charset parameter in the HTTP header
    Content-Type: text/html; charset=UTF-8

HTML clues
document header
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
links
    <a href="..." charset="UTF-8"> ... </a>
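A server-side view of this negotiation might look like the sketch below. It is a simplified illustration with hypothetical names: it picks, from the encodings the server can produce, the one with the highest q value in the client's Accept-Charset header (written here in standard HTTP syntax). It assumes a well-formed header and treats a charset that is neither listed nor covered by `*` as unacceptable, which ignores the special default HTTP/1.1 gives ISO-8859-1.

```java
import java.util.Arrays;
import java.util.List;

class AcceptCharset {
    // Returns the q value the header assigns to the charset (0.0 if not acceptable).
    static double qualityFor(String header, String charset) {
        double wildcard = 0.0;
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim();
            double q = 1.0; // q defaults to 1 when omitted
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            if (name.equalsIgnoreCase(charset)) {
                return q;
            }
            if (name.equals("*")) {
                wildcard = q;
            }
        }
        return wildcard;
    }

    // Picks the supported charset the client prefers most, or null if none is acceptable.
    static String choose(String header, List<String> supported) {
        String best = null;
        double bestQ = 0.0;
        for (String candidate : supported) {
            double q = qualityFor(header, candidate);
            if (q > bestQ) {
                bestQ = q;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String header = "UTF-8, ISO-8859-1;q=0.9, *;q=0.1";
        System.out.println(choose(header, Arrays.asList("Shift_JIS", "ISO-8859-1"))); // ISO-8859-1
    }
}
```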
Unicode in Dist Sys, (c) 2002 M. McKenna, IUC22
41
Determining Internet Encodings
  • In priority order
  • 1. User override
  • 2. HTTP header or protocol information
  • 3. Self-identification
      • <meta> for HTML
      • encoding for XML
      • @charset for CSS
  • 4. charset parameters on links
  • 5. User preferences/heuristics
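The priority order above can be expressed as a simple first-match lookup. In this sketch every name is hypothetical, and each argument stands for a value the caller has already extracted from the corresponding source (or null when that source says nothing).

```java
class EncodingResolver {
    // Pulls the charset parameter out of a Content-Type value, or returns null.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = param.substring(8).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // drop optional quotes
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Returns the first available answer, checked in the priority order of the slide.
    static String resolve(String userOverride, String httpCharset, String selfIdentified,
                          String linkCharset, String userPreference) {
        String[] byPriority = {userOverride, httpCharset, selfIdentified, linkCharset, userPreference};
        for (String candidate : byPriority) {
            if (candidate != null && !candidate.isEmpty()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String fromHeader = charsetFromContentType("text/html; charset=UTF-8");
        // No user override, so the HTTP header wins over the document's own declaration.
        System.out.println(resolve(null, fromHeader, "ISO-8859-1", null, "windows-1252")); // UTF-8
    }
}
```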

42
LDAP version 3.0
LDAP strings are UTF-8


Directory entries can be in any language
RFC 2251 to RFC 2256

43
XML and Java
  • XML - Portable Data
  • Java - Portable Code
  • XML tag structures map to Java Classes
  • Default encoding for XML is Unicode
  • Encoding for Java Strings is Unicode

44
XML
  • Conforming parsers must support
  • UTF-16
  • UTF-8
  • UTF-8 is the default encoding
  • <?xml version="1.0" encoding="UTF-8" ?>
  • Character entities are Unicode values
  • &#dddd; (decimal)
  • &#xUUUU; (hexadecimal)
  • CSS: @charset "UTF-8";
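A quick way to see that both numeric forms denote Unicode values is to parse a tiny document with the JDK's built-in DOM parser. The class name and sample document below are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlCharRefDemo {
    // Parses a UTF-8 encoded document and returns the text of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        // Decimal reference, hexadecimal reference, and the literal character U+00E9.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
                + "<word>caf&#233; caf&#xE9; caf\u00E9</word>";
        String text = rootText(xml);
        System.out.println(text.equals("caf\u00E9 caf\u00E9 caf\u00E9")); // true
    }
}
```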

45
Java
java.lang.String - Unicode
InputStreamReader converts sourceCharset to Unicode
OutputStreamWriter converts Unicode to targetCharset
Different list of charsets supported per vendor
Java 1.1 vs Java 2 and Unicode
Java 2 Swing set has better display support
http://www.javasoft.com (search on internationalization)
46
GUI in Java
  • Portable consistent interface
  • Use Java 1.2
  • Use Java Foundation Classes (e.g. JTextArea)
  • Use Java Locale class
  • Link to O/S through JNI
  • Java runs in native Unicode

47
Java and I18n

java.io
  • Character set conversion
  • InputStreamReader, OutputStreamWriter
java.util
  • Locale
  • Date, Calendar
  • ResourceBundle
java.text
  • String handling, formatting
  • Collation
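As an illustration of the java.io conversion classes listed above, the sketch below transcodes a byte stream from one charset to another through Unicode characters. The class and method names are hypothetical, and in-memory streams stand in for real files or sockets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcoder {
    // Reads bytes in the source charset and rewrites them in the target charset.
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), source);
             ByteArrayOutputStream sink = new ByteArrayOutputStream();
             Writer writer = new OutputStreamWriter(sink, target)) {
            char[] buffer = new char[1024]; // Unicode (UTF-16) characters in between
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
            writer.flush(); // push any buffered characters into the sink
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // U+00E9 in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2
    }
}
```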
48
Java Methods for JDBC
  • Connection
  • Data Binding
  • Formatting Output
  • Date and Time
  • Collation
  • Translated Messages

49
Connection

What language?
  • System default?
        Locale defLocale = Locale.getDefault();
        properties.put("LANGUAGE",
            defLocale.getDisplayLanguage(Locale.US).toLowerCase());
  • User choice?
      • Server choices list (us_english)
            select name from master..syslanguages
  • Server default?
  • Java Application/Applet can be different

What character set?
        sp_server_info server_csname
        properties.put("CHARSET", server_csname);
50
Data Binding

Static locale
  • User-driven
  • System default
  • Menu pull-down
Dynamic locale
  • Data-driven
  • per-column
  • per-row
  • generated by business rules
Format using java.text
51
Formatting Output

Numeric
  • java.text: DecimalFormat, NumberFormat, ChoiceFormat
Date and Time
  • java.text: DateFormat, SimpleDateFormat
  • java.util: Calendar, GregorianCalendar, Date, TimeZone, SimpleTimeZone
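A short sketch of locale-sensitive output with the classes listed above follows. The class name and sample values are illustrative. The exact date text depends on the locale data shipped with the JDK, so no output is shown for it.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocaleFormatDemo {
    // Formats a number with the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Formats a date in the long style of the locale.
    static String formatDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.89, Locale.US));      // 1,234,567.89
        System.out.println(formatNumber(1234567.89, Locale.GERMANY)); // 1.234.567,89

        Date date = new GregorianCalendar(2002, Calendar.NOVEMBER, 4).getTime();
        System.out.println(formatDate(date, Locale.US));
        System.out.println(formatDate(date, Locale.GERMANY));
    }
}
```

For data-driven (dynamic) locales, the `Locale` argument would come from the row or column being formatted rather than from the user's session.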
52
Example www.3m.com
Flag
Banner
Meta Data
Content
53
Generic Datetime for input to remote systems
  • Use YYYYMMDD format (ISO 8601 basic format)
  • insert into table1 (date_col) values ('20021104')
  • /* 4 November 2002 */
  • avoids language and format confusions
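In current Java the YYYYMMDD form is available as a predefined formatter. Note that `java.time` is newer than this presentation, so using it here is an assumption of the sketch, as are the class and method names.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class GenericDate {
    // Renders a date in the locale-neutral YYYYMMDD form for a remote system.
    static String toWire(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Parses a YYYYMMDD string received from a remote system.
    static LocalDate fromWire(String text) {
        return LocalDate.parse(text, DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(toWire(LocalDate.of(2002, 11, 4)));  // 20021104
        System.out.println(fromWire("20021104").getMonth());    // NOVEMBER
    }
}
```

Locale-specific formatting is then applied only at the point of display, never on the wire.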

54
E-Marketplace Technology
XML Facilitates eCommerce.
55
Example Message (DTD)
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE Book [
    <!ELEMENT Book (BookDesc*) >
    <!ELEMENT BookDesc (Title, Author, Publisher,
        ISBN, Price, CoverImage, Desc) >
    <!ATTLIST BookDesc xml:lang CDATA #REQUIRED >
    <!ELEMENT Title (#PCDATA) >
    <!ELEMENT Author (#PCDATA) >
    <!ELEMENT Publisher (#PCDATA) >
    <!ELEMENT ISBN (#PCDATA) >
    <!ELEMENT Price (Currency, Amount) >
    <!ELEMENT Currency (#PCDATA) >
    <!ELEMENT Amount (#PCDATA) >
    <!ELEMENT CoverImage (#PCDATA) >
    <!ELEMENT Desc (#PCDATA) >
    <!ATTLIST CoverImage type (bmp|gif|jpg|other) "gif" >
    <!NOTATION gif SYSTEM "gwswin/gws.exe" >
    <!NOTATION bmp SYSTEM "gwswin/gws.exe" >
    <!NOTATION jpg SYSTEM "gwswin/gws.exe" >
    <!NOTATION other SYSTEM "gwswin/gws.exe" >
    ]>

56
Example Message (XML)
    <Book>
      <BookDesc xml:lang="EN">
        <Title>Java in a Nutshell</Title>
        <Author>David Flanagan</Author>
        <Publisher>O'Reilly &amp; Associates</Publisher>
        <ISBN>156592262X</ISBN>
        <Price>
          <Currency>USD</Currency>
          <Amount>24.95</Amount>
        </Price>
        <CoverImage>jnut_us.gif</CoverImage>
        <Desc>The bestselling Java in a Nutshell
          has been updated to cover Java 1.1. If you're a
          Java programmer who is migrating to 1.1, this
          second ... </Desc>
      </BookDesc>

57
Example Message (XML)
      <BookDesc xml:lang="DE">
        <Title>Java in a Nutshell</Title>
        <Author>David Flanagan</Author>
        <Publisher>O'Reilly/VVA</Publisher>
        <ISBN>3897211009</ISBN>
        <Price>
          <Currency>EUR</Currency>
          <Amount>25.95</Amount>
        </Price>
        <CoverImage>jnut_de.gif</CoverImage>
        <Desc>Dieses Handbuch ist eine unentbehrliche
          Kurzreferenz, die dazu gedacht ist,
          aufgeschlagen neben der Tastatur jedes
          Java-Programmierers zu liegen. Es enthält eine
          ... </Desc>
      </BookDesc>

58
Example Message (XML)
      <BookDesc xml:lang="ja">
        <Title> </Title>
        <Author>David Flanagan</Author>
        <Publisher> </Publisher>
        <ISBN>4-900900-08-7</ISBN>
        <Price>
          <Currency>JPY</Currency>
          <Amount>3900.00</Amount>
        </Price>
        <CoverImage>jnut_jp.gif</CoverImage>
        <Desc>
          ... </Desc>
      </BookDesc>
    </Book>
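A consumer of such a message can select the description for its own language through the xml:lang attribute. The sketch below uses the JDK's DOM parser on a shortened, hand-written copy of the message; the class name, method name, and trimmed sample are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class BookSelector {
    // Shortened version of the example message, kept to the fields used here.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
            + "<Book>"
            + "<BookDesc xml:lang=\"EN\"><Price><Currency>USD</Currency>"
            + "<Amount>24.95</Amount></Price></BookDesc>"
            + "<BookDesc xml:lang=\"DE\"><Price><Currency>EUR</Currency>"
            + "<Amount>25.95</Amount></Price></BookDesc>"
            + "</Book>";

    // Returns "currency amount" for the BookDesc in the given language, or null.
    static String priceFor(String xml, String lang) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList descs = doc.getElementsByTagName("BookDesc");
            for (int i = 0; i < descs.getLength(); i++) {
                Element desc = (Element) descs.item(i);
                // Language tags are compared without regard to case.
                if (lang.equalsIgnoreCase(desc.getAttribute("xml:lang"))) {
                    String currency = desc.getElementsByTagName("Currency").item(0).getTextContent();
                    String amount = desc.getElementsByTagName("Amount").item(0).getTextContent();
                    return currency + " " + amount;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse message", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(priceFor(SAMPLE, "DE")); // EUR 25.95
    }
}
```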

59
Example Order (XML)
    <Order>
      <OrderNum>20193786</OrderNum>
      <UserIdNum>A47US37892</UserIdNum>
      <ItemsOrdered>
        <Item>
          <ProductId>156592262X</ProductId>
          <Qty>1</Qty>
        </Item>
        <Item>
          <ProductId>3897211009</ProductId>
          <Qty>12</Qty>
        </Item>
        <Item>
          <ProductId>4900900087</ProductId>
          <Qty>2</Qty>
        </Item>
      </ItemsOrdered>
    </Order>

60
Business Rules
  • Reports
  • Taxes
  • Currency
  • Dual Currency Display
  • Currency Conversion
  • Payment and Settlement
  • Import/Export
  • Business Process
  • Workflow

61
Communication Protocols
  • CORBA 3.0
  • string, char supports UTF-8
  • wstring, wchar supports UTF-16
  • COM, DCOM
  • Allows Unicode
  • ActiveX
  • Unicode interface

62
Fonts
True Type
  • Bitstream Cyberbit
  • Monotype
BDF, Java
  • cobble together from many sources
Dynamic
  • Composed of multiple fonts
  • Bitstream Truedoc, www.truedoc.com
63
Web Services
HTTP
64
E-Marketplace Technology
Service Provider layers in their Services to
seamlessly add value to all trading partners.
65
UDDI
  • Describes
  • What is it?
  • Where is it?
  • How do I get it?

66
UDDI - I18n
  • Need to track time zone usage
  • Useful to have alternate names
  • Specify normalized formats to use

67
WSDL
Web Services Description Language
  • Services are defined using six major elements
  • types: describes the messages exchanged
  • message: abstract definition of the data being
    transmitted
  • portType: set of abstract operations, each referring to
    an input message or output messages
  • binding: protocol and data format specifications
  • port: address for a binding, a single
    communication endpoint
  • service: aggregates a set of related ports

68
WSDL - I18n
Web Services Description Language
  • Pure XML
  • Use xml:lang and locale attributes
  • Export to UDDI
  • Service provider localizes to supported locales

69
SOAP
Simple Object Access Protocol
SOAP
SOAP
70
SOAP
  • X

71
SOAP - I18n
Simple Object Access Protocol
Locale B
Locale A
SOAP
I18N Info
SOAP
I18N Info
72
  • E-Marketplace Technology

73
System Architecture
Text Handling
Character Handling
Application Software
Cultural Profiler
Message System
Collations
String Formatting
Portability Layer
Conversions
en fr de jp
Operating System
Resource Store
Resource Files
Resource Files
Resource Files
74
Business Globalization To Enable Global eCommerce
  • Features
  • Fully internationalized
  • Currency, International Taxes
  • Unicode support
  • Business Services Framework
  • Logistics
  • International Payment
  • Currency Exchange
  • Import/Export Compliance
  • Landed Cost
  • Translation Services
  • Regional Early Adopter Process
  • Thailand
  • Malaysia
  • Israel
  • Middle East
  • Greece
  • India

Tier-2: English-UK, Danish, Dutch, Finnish, Norwegian, Swedish, Portuguese-PT, Czech-SAP, Hungarian-SAP, Polish-SAP, Russian-SAP
Tier-1: English, German, French, Italian, Japanese, Portuguese-BR, Spanish-Intl, Chinese-S, Chinese-T, Korean
APAC
EMEA
Americas
Includes enterprise, content auction
applications and global services
75
Summary
  • Unicode is a powerful portability and
    interoperability solution for distributed
    environments
  • Global, distributed computing (Level 2 of I18N)
    requires Unicode to be effective
  • Unicode can be acquired in a phased approach
  • Unicode is now required to use new technologies
    (RFC 2277)
  • XML, Java

76
Global Vision
  • Think Globally, Act Locally
  • The trade relationships of the World make for a
    very small planet economically, but a complex one
    culturally
  • The World Needs Unicode Today!