Unicode support status in various platforms (Microsoft Windows)

Provided by: luq

1
Unicode support status in various platforms
(Microsoft Windows)
  • Windows 9x / ME
    • Does not support Unicode internally
    • Only a limited set of Unicode APIs is supported
    • Unicode applications compiled with the Microsoft
      Layer for Unicode can run on Win9x
    • Uses code pages to support different encodings
  • Windows NT / 2000 / XP
    • Supports Unicode
    • Uses wide characters (fixed 2 bytes)
    • Uses UCS-2

2
Unicode support status in various platforms
(Linux & Mac OS)
  • Linux
    • Newer kernels support Unicode
    • Requires glibc 2.2.2 and XFree86 4.0.3 or newer
    • Uses UTF-8 in most cases, e.g. for the filesystem
    • Set the locale to <language>_<territory>.<codeset>, e.g.
      zh_TW.utf8
    • Enable UTF-8 support in the console by executing
      unicode_start
  • Mac OS
    • Mac OS 9.1 and Mac OS X support Unicode
    • Uses 16 bits per Unicode character

3
What is a code page?
  • There are many different encodings, e.g.
    EUC-TW, Big5, Latin-1, etc.
  • A code page (code page identifier) is a number that
    identifies a codeset.
    • e.g. 950: Traditional Chinese (Big5)
    • e.g. 1252: Windows Latin-1
  • Other code page identifiers can be found at
    http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp
  • In Windows NT/2000/XP, code page conversion tables
    provide the information needed to convert between
    different encodings.
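
As a rough illustration of why the codeset has to be identified, the Java sketch below decodes the same byte under two codesets. The class name CodePageDemo and the helper decode are made up for this example, and "windows-1252" is assumed to be the Java charset name for code page 1252.

```java
import java.nio.charset.Charset;

class CodePageDemo {
    // Decode a single byte using the named codeset.
    static String decode(int b, String charsetName) {
        return new String(new byte[] {(byte) b}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Byte 0x80 is the euro sign in code page 1252,
        // but a control character in ISO Latin-1.
        System.out.printf("windows-1252: U+%04X%n",
                (int) decode(0x80, "windows-1252").charAt(0));
        System.out.printf("ISO-8859-1:   U+%04X%n",
                (int) decode(0x80, "ISO-8859-1").charAt(0));
    }
}
```

The bytes alone do not say which character is meant; the code page identifier supplies that information.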

4
Java
  • Java uses Unicode internally. The supported
    encoding sets are provided by the Java library
    packages rt.jar and i18n.jar.
  • The supported encoding sets for the java.io.*,
    java.lang.* and java.nio.* APIs can be found at
    http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html
  • User input/output is automatically converted
    between Unicode and the system code page.
  • Specify the encoding of the source files when
    compiling.
    • javac -encoding <encoding>
  • Convert to another supported encoding
    • e.g. byte[] utf8Bytes = str.getBytes("UTF-8");
  • Convert from another supported encoding
    • e.g. String str = new String(utf8Bytes, "UTF-8");
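
The two conversions above can be sketched as a small program. This is only a minimal example: the class and method names are invented here, and it uses StandardCharsets.UTF_8 (available since Java 7) instead of the charset name "UTF-8" so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    // Unicode string -> UTF-8 bytes
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> Unicode string
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "\u4E2D\u6587";            // two CJK characters
        byte[] utf8Bytes = toUtf8(str);
        System.out.println(utf8Bytes.length);   // 6: three bytes per character
        System.out.println(fromUtf8(utf8Bytes).equals(str)); // true
    }
}
```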

5
Code Conversion
  • Generally, codeset conversion cannot provide a
    one-to-one mapping (unless the two character sets
    are exactly the same).
  • Unicode is a superset of every existing national
    standard, so it guarantees round-trip conversion.
  • Round-trip conversion: suppose a file file1 in
    codeset A is converted to a file file2 in codeset
    B, which is then converted back to codeset A as
    a file named file3.
  • If file3 = file1, we say that codeset B guarantees
    round-trip conversion for A.
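
The round-trip test can be written down directly. The sketch below is an illustration only: byte arrays stand in for the files, the class and method names are hypothetical, and Java itself passes through Unicode (a String) for each conversion step.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // file1 (codeset a) -> file2 (codeset b) -> file3 (codeset a); is file3 == file1?
    static boolean roundTrips(byte[] file1, Charset a, Charset b) {
        byte[] file2 = new String(file1, a).getBytes(b);
        byte[] file3 = new String(file2, b).getBytes(a);
        return Arrays.equals(file1, file3);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Latin-1 text survives a trip through a Unicode encoding.
        byte[] latin1File = "caf\u00E9".getBytes(latin1);
        System.out.println(roundTrips(latin1File, latin1, utf8)); // true

        // A CJK character does not survive a trip through Latin-1:
        // it has no Latin-1 code, so the encoder substitutes '?'.
        byte[] utf8File = "\u4E00".getBytes(utf8);
        System.out.println(roundTrips(utf8File, utf8, latin1));   // false
    }
}
```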

6
Java Code conversion
  • Conversion from multibyte to Unicode
    • byte[] my_data = {(byte) 0xA4, 0x40};
    • String my_unicode_data = new String(my_data, "Big5");
    • where "Big5" is the name of the multibyte
      codeset. Java needs this name to carry out the
      code conversion to Unicode.
  • Conversion from Unicode to multibyte
    • String my_unicode_data = "\u4E00"; (一)
    • byte[] my_b5_data = my_unicode_data.getBytes("Big5");
    • my_b5_data will have the value 0xA440
    • byte[] my_gb_data = my_unicode_data.getBytes("GBK");
    • my_gb_data will have the value 0xD2BB
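
A runnable version of the conversions on this slide might look as follows. It is a sketch under the assumption that the JDK in use ships the Big5 and GBK charsets (standard JDK builds do); the class name and the hex helper are made up for the example.

```java
import java.nio.charset.Charset;

class MultibyteDemo {
    // Format bytes as upper-case hex, e.g. {0xA4, 0x40} -> "A440".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Unicode -> multibyte: encode text in the named codeset.
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // Multibyte -> Unicode
        byte[] myData = {(byte) 0xA4, 0x40};
        String unicode = new String(myData, Charset.forName("Big5"));
        System.out.printf("U+%04X%n", (int) unicode.charAt(0)); // U+4E00

        // Unicode -> multibyte
        System.out.println(encodeHex("\u4E00", "Big5")); // A440
        System.out.println(encodeHex("\u4E00", "GBK"));  // D2BB
    }
}
```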

7
  • Text stream import
    • File i = new File("input");
    • FileInputStream tmpin = new FileInputStream(i);
    • BufferedReader in = new BufferedReader(new
      InputStreamReader(tmpin, "Big5"));
  • Once the BufferedReader in is established, data
    can be read using the readLine() method.
    • inputStr = in.readLine();
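
The import pattern can be tried without a real file. In the sketch below a ByteArrayInputStream stands in for the FileInputStream (an assumption made only to keep the example self-contained); the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class Big5ReadDemo {
    // Read all lines from a byte stream, decoding it with the named codeset.
    static List<String> readLines(InputStream raw, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(raw, Charset.forName(charsetName)))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // One Big5 character (0xA440) on the first line, ASCII on the second.
        byte[] data = {(byte) 0xA4, 0x40, '\n', 'o', 'k', '\n'};
        for (String line : readLines(new ByteArrayInputStream(data), "Big5")) {
            System.out.println(line.length() + " char(s)");
        }
    }
}
```

To read an actual file, pass a FileInputStream as the first argument instead.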

8
  • Text stream export
    • File o = new File("output.big5");
    • FileOutputStream tmpout = new FileOutputStream(o);
    • BufferedWriter out = new BufferedWriter(new
      OutputStreamWriter(tmpout, "Big5"));
    • ...
    • out.write("\u6CB3\u8C5A"); (河豚)
    • out.close();
  • The file then contains the Big5 bytes 0xAA65 0xB362
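
The export pattern can be checked in the same way. This sketch writes into a ByteArrayOutputStream rather than a FileOutputStream so the resulting bytes can be inspected directly; that substitution, and the names used, are assumptions of the example.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class Big5WriteDemo {
    // Write text through a BufferedWriter that encodes into the named codeset.
    static byte[] encodeText(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(sink, Charset.forName(charsetName)))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and flushed) by try-with-resources.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Print the Big5 bytes produced for the two characters on the slide.
        for (byte b : encodeText("\u6CB3\u8C5A", "Big5")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Closing the writer matters: bytes still held in the buffer are only written out on flush() or close().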

9
Multilingual applications
  • Software teaching Chinese to English speakers
  • Software teaching English to Chinese speakers
  • Conceptually, separate two types of data in a
    multilingual application:
    • Data related to the display of menus/instructions
    • Data related to the processing in the program
  • Multilingual applications vs. I18N applications
    • I18N: data related to display and data related
      to processing are the same and are for the same
      language/convention
    • Multilingual applications: data related to
      display is for one language (and can be
      internationalized). Data related to
      processing can be multilingual and is not
      necessarily related to the display language.
  • Unicode is the most convenient encoding for
    multilingual applications, but it is not strictly
    necessary

10
The Ideographic Composition Scheme Used in ISO
10646
  • Introduction to Ideograph Description
    Characters (IDCs)
  • The ideographic composition scheme
  • Applications using IDCs

11
What are Ideograph Description Characters?
  • 12 structure symbols used to describe the
    formation of characters from smaller
    ideograph functional units such as character
    components: ⿰ ⿱ ⿲ ⿳ ⿴ ⿵ ⿶ ⿷ ⿸ ⿹ ⿺ ⿻

12
Characteristics of Ideographs
  • Ideograph characters are often formed from smaller
    ideographic elements such as radicals, ideographs
    proper, and other ideographic components, which we
    collectively call ideograph components
  • This is natural in the formation of characters
  • Examples: 2 components
  • Chinese has long used
    components to describe characters, especially
    characters with the same pronunciation

13
Problems with Ideograph Character Encoding
  • Each character is treated as a different symbol,
    and thus given its own codepoint
  • Codepoint assignment within a block does try to
    follow radical order, but it
    does not consider the substructures (components).
    Thus such information is not revealed.
  • When a new character is created, its codepoint
    must be allocated in a new block, so radical
    order cannot be maintained globally.
  • There is also a potentially endless
    standardization process
  • Encoding rarely used ideograph characters is a
    waste of resources, in terms of both code space and
    standardization effort

14
Introduction of IDCs
  • Work was started in 1995 by ISO/IEC SC2/WG2/IRG
  • Objective of the original proposal: use coded
    ideographs and structure symbols to describe
    not-yet-coded ideographs.
  • The original proposal had 15 Ideograph Structure
    Symbols based on a study of Han characters; three
    of them did not make it into ISO 10646/Unicode:
    • Ideograph_Proper (?): every coded character is
      considered an ideograph proper, thus not needed
    • Left_Up_Encompass: no un-encoded example
    • Mirror_Symmetry (?): the left is mirrored to the
      right, but this can be described by Left_to_Right
  • The remaining 12 symbols were renamed Ideograph
    Description Characters

15
Ideographic Composition Scheme
  • An Ideographic Description Sequence (IDS) describes
    a character using its components
    and indicating the relative positions of the
    components.
  • IDCs are considered operators on the components.
  • IDSs can be expressed by a context-free grammar
    in Backus-Naur Form (BNF). The grammar G
    has four components.
  • Let G = (Σ, N, P, S), where
    • Σ: the set of terminal symbols: coded radicals,
      coded ideographs, and the 12 IDCs
    • N: the set of 5 non-terminal symbols,
      N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol,
      Ideograph_Component}
    • S = IDS, which is the start symbol of the
      grammar
    • P: a set of rewrite rules

16
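  • The following is the set of rewriting rules P:
    • IDS ::= Binary_Symbol IDS1 IDS1
      | Ternary_Symbol IDS1 IDS1 IDS1
    • IDS1 ::= IDS | Ideograph_Component
    • Ideograph_Component ::= coded_ideograph |
      coded_radical | coded_component
    • Binary_Symbol ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ |
      ⿺ | ⿻
    • Ternary_Symbol ::= ⿲ | ⿳
  • Note that even though the IDCs are terminal
    symbols, they are not part of the ideograph
    components.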
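
A grammar of this kind is easy to check with a recursive-descent parser. The sketch below is one possible reading, not the scheme's reference implementation: it assumes an IDS is a binary IDC followed by two operands or a ternary IDC followed by three, that each operand is either a nested IDS or a single component, and that any code point other than the 12 IDCs (U+2FF0 to U+2FFB) counts as a component. The class and method names are invented for the example.

```java
class IdsValidator {
    // Binary IDCs: U+2FF0, U+2FF1 and U+2FF4..U+2FFB.
    static boolean isBinary(int cp) {
        return cp == 0x2FF0 || cp == 0x2FF1 || (cp >= 0x2FF4 && cp <= 0x2FFB);
    }

    // Ternary IDCs: U+2FF2 and U+2FF3.
    static boolean isTernary(int cp) {
        return cp == 0x2FF2 || cp == 0x2FF3;
    }

    // Parse one operand (component or nested IDS) starting at pos.
    // Returns the index just past it, or -1 if the input is malformed.
    private static int parse(int[] cps, int pos) {
        if (pos >= cps.length) {
            return -1;
        }
        int operands = isBinary(cps[pos]) ? 2 : isTernary(cps[pos]) ? 3 : 0;
        int next = pos + 1;
        for (int i = 0; i < operands; i++) {
            next = parse(cps, next);
            if (next < 0) {
                return -1;
            }
        }
        return next;
    }

    // A legal IDS starts with an IDC and consumes the whole input.
    static boolean isValidIds(String s) {
        int[] cps = s.codePoints().toArray();
        if (cps.length == 0 || !(isBinary(cps[0]) || isTernary(cps[0]))) {
            return false;
        }
        return parse(cps, 0) == cps.length;
    }

    public static void main(String[] args) {
        System.out.println(isValidIds("\u2FF0\u6C35\u53EF"));             // true
        System.out.println(isValidIds("\u2FF1\u8279\u2FF0\u6C35\u53EF")); // true (nested)
        System.out.println(isValidIds("\u2FF0\u6C35"));                   // false (missing operand)
    }
}
```

The check is purely structural; it does not verify that the sequence describes a real character.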

17
Examples

18
  • IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID
    (IDC-OLD, ⿻)
    • The IDS introduced by IDC-OLD describes the
      abstract form of the ideograph with D1 and D2
      overlaying each other.
    • ??? is an example of an IDS which represents the
      abstract form of ?
  • IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM
    UPPER RIGHT (IDC-SUR, ⿹)
    • The IDS introduced by IDC-SUR describes the
      abstract form of the ideograph with D1 on the
      top right corner of D2, and D2 is encompassed
      by D1.
    • ? is an example of an IDS which
      represents the abstract form of

19
  • IDS allows a character to be described by
    different sequences
  • One IDS should describe only one character, yet
    one character can be described by different IDSs.

20
  • IDS describes ideographic character composition
    at the abstract level. It indicates the relative
    positions of the components, but does not
    indicate the proportions.
  • Not intended for rendering.
  • Nesting is natural in ideographs and is
    reflected in the IDS scheme

21
Components
  • Ideographic components (IRG definition):
    units which can be used to represent ideographs.
    These components consist of ideographs proper
    coded in ISO 10646 (BMP) and some basic elements
    used to form ideographs.
  • Radicals (IRG definition): those ideographic
    components listed in the index pages of KX1, DKW,
    DJW, HYD.
  • ISO extensions
  • Radicals
  • Components

22
  • 28 from GBK and more from IRG
  • ISO IRG component sample

23
Extending the Objectives of IDCs
  • Using coded characters to describe not-yet-coded
    ideographs, both for representation and exchange
  • Limit standardization to only modern characters,
    and not some rarely used characters
  • Learning of character composition(education)
  • Revealing substructures of ideograph characters
  • Description of ideograph variants

24
Examples
  • Given characters, what are their IDSs?
  • ? ? ? ? ?
  • Given an IDS, what are the characters?
  • ??
  • ???
  • Is the following a legal IDS?
  • ?? ??

25
Conclusion
  • IDCs were introduced in Unicode 3.0
  • Their use is going beyond the original objective
  • Applications based on the IDCs have already been
    developed, such as the Hong Kong Glyph
    Specification.
  • IDCs should also be useful in ideograph variant
    specifications
  • Additional search site:
    http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW