Unicode (and Java) - PowerPoint PPT Presentation

Slides: 35
Provided by: bricegie

1
Unicode (and Java)
  • Brice Giesbrecht

2
Objective of Presentation
  • The need for Unicode
  • How it works
  • Differentiate between encodings
  • How to get your browser to work
  • See how Java consumes and produces data

3
Overview of Presentation
  • Character Sets
  • Unicode
  • Encodings
  • Unicode Support in Java
  • Unicode Support in Databases (?)
  • Demonstration (web app)
  • Resources
  • Door Prizes (for those still awake)

4
Character Sets
  • What is a character set?
  • Code page: a mapping in which a sequence of bits,
    usually a single octet representing integer
    values 0 through 255, is associated with a
    specific character (Wikipedia)
  • Most character sets are a direct mapping of a
    character to a number (7 bit / 8 bit)
  • Character sets are NOT fonts!
  • Encoding is usually a lookup in a table
  • Most IBM and Microsoft code pages use ASCII as
    their base set of characters
  • The English bias (compare to Indic languages)

5
Character Sets
  • Issues Within a single Language
  • Selectors to overcome 8 bit limitations
    (especially for CJK sets)
  • Historical importance of platforms and hardware
  • Compatibility (or more likely, lack thereof)
  • ISCII as an example
  • Issues outside a single Language
  • How do you produce content using multiple
    languages? (Or the characters from those
    languages?)
  • http://en.wikipedia.org/wiki/Code_page_437

6
Character Sets
  • Enter the standards
  • ISO-646 (ASCII, still 7 bit)
  • 12 whole code points to play with!
  • C0 Control Set (0x00-0x1F)
  • ISO-8859-n
  • 0x00-0x7F: ISO-646 IRV
  • 0x80-0xFF: Different for each set (or part)
  • ISO 8859-1 (Latin1)
  • C1 Control Set (0x80-0x9F)
  • ISO-2022
  • Designed for transmission
  • Non-Latin based multi-byte sets

7
Character Sets
  • Enter Microsoft!
  • Windows code pages
  • http://www.microsoft.com/globaldev/reference/wincp.mspx
  • Cp1252
  • Based on ISO 8859-1
  • C1 code points used for printable characters
  • Often mislabeled as ISO-8859-1 due to their
    similarities
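
The difference between Cp1252 and ISO 8859-1 in the C1 range can be checked directly. The following is a minimal sketch, assuming the Java runtime provides a charset named "windows-1252" (it is listed among the supported sets later in this deck, but it is not one of the charsets every JVM must ship). The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class C1Demo {
    // Decode a single byte value with the given charset.
    static String decode(int byteValue, Charset cs) {
        return new String(new byte[] {(byte) byteValue}, cs);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x41 is plain ASCII; 0x80 and 0x93 fall in the C1 range.
        for (int b : new int[] {0x41, 0x80, 0x93}) {
            System.out.printf("0x%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                    b,
                    decode(b, StandardCharsets.ISO_8859_1).codePointAt(0),
                    decode(b, cp1252).codePointAt(0));
        }
    }
}
```

For 0x80, ISO 8859-1 yields the control code U+0080 while Cp1252 yields U+20AC (the euro sign), which is why mislabeled text shows up with broken punctuation and currency symbols.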

8
Unicode
  • What is Unicode?
  • Unicode provides a unique number for every
    character,
  • no matter what the platform,
  • no matter what the program,
  • no matter what the language.

9
Unicode
  • ISO 10646 (1990)
  • Merged with the Unicode Consortium
  • Ties a character, a name, and a code point together
  • BMP: Basic Multilingual Plane (the first 65,536
    code points)
  • The ISO and UC character repertoires are synchronized
  • UCS (Universal Character Set)

10
Unicode
  • Q: So are they the same thing?
  • A: No. Although the character codes and encoding
    forms are synchronized between Unicode and
    ISO/IEC 10646, the Unicode Standard imposes
    additional constraints on implementations to
    ensure that they treat characters uniformly
    across platforms and applications. To this end,
    it supplies an extensive set of functional
    character specifications, character data,
    algorithms and substantial background material
    that is not in ISO/IEC 10646.
    (http://unicode.org/faq/unicode_iso.html)

11
Unicode
  • The Unicode Standard includes a set of
    characters, names, and coded representations that
    are identical with those in ISO/IEC 10646:2003.
    It additionally provides details of
    character properties, processing algorithms, and
    definitions that are useful to implementers. It
    strengthens Unicode support for worldwide
    communication, software availability, and
    publishing. (http://www.iso.org)

12
Unicode
  • UCS code space (0x0-0x7FFFFFFF)
  • 128 x 256 x 256 x 256 (groups x planes x rows x
    cells, GPRC)
  • 2,147,483,648 possible code points
  • The Unicode Character Database
  • http://unicode.org/Public/UNIDATA/UCD.html
  • Main definition (UnicodeData.txt)
  • Available online
  • http://www.unicode.org/Public/UNIDATA/
  • Unicode code space (0x0-0x10FFFF)
  • 17 x 256 x 256 = 1,114,112 code points
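
The size of the Unicode code space, and the plane a code point belongs to, can be checked with a few lines of Java. This is a small sketch; the class name and the planeOf helper are made up for illustration, while Character.MAX_CODE_POINT and Character.isValidCodePoint are standard library members.

```java
final class CodeSpace {
    // The plane is everything above the low 16 bits of the code point.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside 0x0-0x10FFFF");
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int size = Character.MAX_CODE_POINT + 1;      // 0x10FFFF + 1
        System.out.println(size);                     // 1114112
        System.out.println(size == 17 * 256 * 256);   // true
        System.out.println(planeOf(0x4E8C));          // 0
        System.out.println(planeOf(0x10302));         // 1
        System.out.println(planeOf(0x10FFFF));        // 16
    }
}
```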

13
Unicode
  • As of Unicode 5.0.0, 101,063 (9.1%) of these
    code points are assigned, with another 137,468
    (12.3%) reserved for private use, leaving 875,441
    (78.6%) unassigned. The number of assigned code
    points is made up as follows: 98,884
    graphemes, 140 formatting characters, 65 control
    characters, 2,048 surrogate characters

14
Unicode
  • Plane 0 (0000-FFFF)
  • Basic Multilingual Plane (BMP)
  • Used for most of the alphabets
  • Not all code points are used
  • Allocated in areas/blocks

15
Unicode
  • Plane 1 (10000-1FFFF)
  • Supplementary Multilingual Plane (SMP)
  • Historic scripts such as Linear B, but is also
    used for musical and mathematical symbols.

16
Unicode
  • Plane 2 (20000-2FFFF)
  • Supplementary Ideographic Plane (SIP)
  • Used for about 40,000 rare Chinese characters
    that are mostly historic

17
Unicode
  • Planes 3 to 13 (30000-DFFFF)
  • Unassigned

18
Unicode
  • Plane 14 (E0000-EFFFF)
  • Supplementary Special-purpose Plane (SSP)
  • Glyph (font) selection
  • code point + variation selector = variation
    sequence
  • http://www.unicode.org/reports/tr37/tr37-3.html
    (Ideographic Variation Database)
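
In Java a variation sequence is simply a base code point followed by a variation selector in the same string. The sketch below assumes illustrative values (base U+845B, selector U+E0100, the first Plane 14 variation selector); whether a given pair renders differently depends on the font and on the sequence being registered, which this code does not check. The names are hypothetical.

```java
final class VariationSequenceDemo {
    // Build a string holding a base code point followed by a selector.
    static String variationSequence(int base, int selector) {
        return new StringBuilder()
                .appendCodePoint(base)
                .appendCodePoint(selector)
                .toString();
    }

    public static void main(String[] args) {
        String s = variationSequence(0x845B, 0xE0100);
        // Two code points, but three UTF-16 code units, because the
        // Plane 14 selector needs a surrogate pair.
        System.out.println(s.codePointCount(0, s.length())); // 2
        System.out.println(s.length());                      // 3
    }
}
```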

19
Unicode
  • Plane 15 (F0000-FFFFF)
  • Plane 16 (100000-10FFFF)
  • Plane 0 (E000-F8FF)
  • Private Use Area (PUA)
  • The use of the PUA was a concept inherited from
    certain Asian encoding systems. These systems had
    private use areas to encode Japanese Gaiji (rare
    personal name characters) in application-specific
    ways.
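
Java exposes the private use category through Character.getType, so a code point can be tested for membership in any of the three private use ranges. A minimal sketch, with a hypothetical helper name:

```java
final class PrivateUseDemo {
    // True when the code point has general category Co (private use).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(isPrivateUse(0xE000));   // BMP private use area
        System.out.println(isPrivateUse(0xF0000));  // Plane 15
        System.out.println(isPrivateUse(0x10FFFD)); // Plane 16
        System.out.println(isPrivateUse(0x0041));   // 'A', not private use
    }
}
```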

20
Unicode
  • ConScript Unicode Registry
  • The purpose of the ConScript Unicode Registry
    (CSUR) is to coordinate the assignment of blocks
    out of the Unicode Private Use Area (E000-F8FF
    and 000F0000-0010FFFF) to constructed/artificial
    scripts, including scripts for
    constructed/artificial languages.
  • Cirth, Klingon, Tengwar, etc.

21
Encodings
  • The purpose of the following encodings is to get the
    Unicode value to you. Depending on the storage or
    transmission protocols, different encodings will
    need to be used. These are not different
    character sets; they are ways of representing the
    characters in Unicode.

22
Encodings
  • Endianness
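  • 0x1234
  • LE: 34 12
  • BE: 12 34
  • Byte Order Mark (BOM): 0xFEFF
  • Helps determine endianness
  • Since Unicode 3.2, U+2060 (WORD JOINER) takes over
    the zero-width no-break space role of 0xFEFF
  • 0xFFFE is reserved, so a byte-swapped BOM is
    recognizable
  • 0xFEFF set aside for the BOM
  • Also used to declare encoding (UTF-8)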
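
The byte orders above can be reproduced with the standard UTF-16 charsets in Java. This is a small sketch; the hex helper and the class name are illustrative. Note that Java's plain "UTF-16" charset writes a big-endian BOM when encoding, while the BE and LE variants write none.

```java
import java.nio.charset.StandardCharsets;

final class EndianDemo {
    // Format bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 12 34
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 34 12
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 12 34
    }
}
```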

23
Encodings
  • UTF-8
  • Variable-length character encoding
  • Can address all characters in the UCS but was
    limited by RFC 3629 to just address the Unicode
    code space.
  • BOM: EF BB BF
  • Format
  • 000000-00007F: 0zzzzzzz
  • 000080-0007FF: 110yyyyy 10zzzzzz
  • 000800-00FFFF: 1110xxxx 10yyyyyy 10zzzzzz
  • 010000-10FFFF: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
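
The four bit patterns translate almost directly into code. The following is a hand-written sketch of a UTF-8 encoder for a single code point, meant only to illustrate the format table; real code should use the library encoder, which main uses here as a cross-check. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    // Encode one code point following the 1- to 4-byte patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (cp <= 0x7F) {
            out.write(cp);                          // 0zzzzzzz
        } else if (cp <= 0x7FF) {
            out.write(0xC0 | (cp >> 6));            // 110yyyyy
            out.write(0x80 | (cp & 0x3F));          // 10zzzzzz
        } else if (cp <= 0xFFFF) {
            out.write(0xE0 | (cp >> 12));           // 1110xxxx
            out.write(0x80 | ((cp >> 6) & 0x3F));   // 10yyyyyy
            out.write(0x80 | (cp & 0x3F));          // 10zzzzzz
        } else {
            out.write(0xF0 | (cp >> 18));           // 11110www
            out.write(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
            out.write(0x80 | ((cp >> 6) & 0x3F));   // 10yyyyyy
            out.write(0x80 | (cp & 0x3F));          // 10zzzzzz
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0xE9, 0x4E8C, 0x10302}) {
            byte[] expected =
                    new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X matches library encoder: %b%n",
                    cp, Arrays.equals(encode(cp), expected));
        }
    }
}
```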

24
Encodings
  • UTF-32/UCS-4
  • Fixed-length character encoding
  • Uses 31 bits
  • UCS-4 capable of addressing entire UCS, but was
    restricted to only cover the Unicode code space
  • UTF-32 only covers the Unicode code space
  • 4E8C, 10302 -> 00004E8C, 00010302
  • BE BOM: 00 00 FE FF
  • LE BOM: FF FE 00 00
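
Because UTF-32 stores each code point as one 32-bit value, an encoder is little more than a loop over code points. The sketch below writes the values by hand with ByteBuffer rather than relying on a "UTF-32" charset, which is commonly present in JVMs but is not one of the guaranteed standard charsets. It writes no BOM, and the names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Utf32Sketch {
    // Write every code point of the string as a 4-byte integer.
    static byte[] encode(String s, ByteOrder order) {
        int count = s.codePointCount(0, s.length());
        ByteBuffer buf = ByteBuffer.allocate(4 * count).order(order);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            buf.putInt(cp);
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return buf.array();
    }

    public static void main(String[] args) {
        String s = "\u4E8C" + new String(Character.toChars(0x10302));
        for (byte b : encode(s, ByteOrder.BIG_ENDIAN)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // 00 00 4E 8C 00 01 03 02
    }
}
```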

25
Encodings
  • UCS-2
  • Fixed-length encoding
  • Two-octet
  • It is NOT UTF-16!
  • Only addresses BMP
  • UCS-2BE, UCS-2LE
  • Obsoleted by UTF-16

26
Encodings
  • UTF-16
  • Variable-length encoding
  • UTF-16BE, UTF-16LE
  • BE BOM: FE FF
  • LE BOM: FF FE
  • Surrogates are used to address code points
    outside the BMP. (We will cover this later)

27
Encodings
  • UTF-16 Surrogate Pairs
  • Needed for code points > 0xFFFF
  • High surrogate: 0xD800-0xDBFF (first code unit)
  • Low surrogate: 0xDC00-0xDFFF (second code unit)
  • Algorithm
  • High: ((cp - 0x10000) high 10 bits) + 0xD800
  • Low: ((cp - 0x10000) low 10 bits) + 0xDC00
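
The surrogate algorithm is two shifts and two additions. The following sketch implements it by hand and compares the result with Character.toChars, which is the library call to use in real code. The class and method names are made up for this example.

```java
import java.util.Arrays;

final class SurrogateSketch {
    // Split a supplementary code point into its two UTF-16 code units.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int cp = 0x10302;
        char[] pair = toSurrogates(cp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D800 DF02
        System.out.println(Arrays.equals(pair, Character.toChars(cp))); // true
    }
}
```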

28
Encodings
  • Which Encoding should you use?
  • If dealing with CJK or Hindi (>= 0x0800), UTF-8
    requires 3 bytes whereas UTF-16 needs only 2
  • UTF-8 is great for ASCII whereas UTF-16 needs 2
    bytes for it
  • Java uses UTF-16
  • Windows uses UTF-16LE internally
  • UTF-32 not really used that much
  • UTF-8 and UTF-16 are the most common
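
The size trade-off can be measured directly. This sketch compares encoded lengths for an ASCII letter, a Devanagari letter, a CJK character, and a supplementary character; UTF-16BE is used so that no BOM is counted. The helper names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // The BE variant writes no BOM, so only the text is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                   // U+0041, ASCII
            "\u0939",                              // Devanagari letter HA
            "\u4E8C",                              // CJK ideograph
            new String(Character.toChars(0x10302)) // supplementary character
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples come out as 1 vs 2, 3 vs 2, 3 vs 2, and 4 vs 4 bytes respectively.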

29
Java
  • J2SE 1.5: Unicode version 4.0
  • J2SE 1.4: Unicode version 3.0
  • J2SE 1.3: Unicode version 2.1
  • Supplementary characters were part of Unicode 3.1
  • Addressed in JSR 204
    (http://jcp.org/en/jsr/detail?id=204)

30
Java
  • Unicode characters are specified using \u such as
    \u0039
  • Unicode can be used in source files
  • file.encoding=Cp1252 on my machine
  • You can change this, but beware
  • Java reads and writes using this encoding by
    default
  • You can specify the character set to use for
    reading or writing
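
Passing the charset explicitly avoids depending on the platform default. The sketch below writes and reads text through in-memory streams so it needs no files; with real files the same Reader and Writer classes wrap a FileInputStream or FileOutputStream. The class and helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Encode text with a named charset instead of the default.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes with a named charset instead of the default.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        String text = "caf\u00E9 \u0039";
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        // Same charset on both sides round-trips; a mismatch does not.
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text));      // true
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```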

31
Java
Big5 Big5-HKSCS EUC-JP EUC-KR GB18030 GB2312 GBK IBM-Thai IBM00858 IBM01140 IBM01141 IBM01142 IBM01143 IBM01144 IBM01145 IBM01146 IBM01147 IBM01148 IBM01149 IBM037 IBM1026 IBM1047 IBM273 IBM277 IBM278 IBM280 IBM284 IBM285 IBM297 IBM420 IBM424 IBM437 IBM500 IBM775 IBM850 IBM852 IBM855 IBM857 IBM860 IBM861 IBM862 IBM863 IBM864 IBM865 IBM866 IBM868 IBM869 IBM870 IBM871 IBM918 ISO-2022-CN ISO-2022-JP ISO-2022-KR ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 ISO-8859-7 ISO-8859-8 ISO-8859-9 JIS_X0201 JIS_X0212-1990 KOI8-R Shift_JIS TIS-620 US-ASCII UTF-16 UTF-16BE UTF-16LE UTF-8 windows-1250 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 windows-31j x-Big5-Solaris x-euc-jp-linux x-EUC-TW x-eucJP-Open x-IBM1006 x-IBM1025 x-IBM1046 x-IBM1097 x-IBM1098 x-IBM1112 x-IBM1122 x-IBM1123 x-IBM1124 x-IBM1381 x-IBM1383 x-IBM33722 x-IBM737 x-IBM856 x-IBM874 x-IBM875 x-IBM921 x-IBM922 x-IBM930 x-IBM933 x-IBM935 x-IBM937 x-IBM939 x-IBM942 x-IBM942C x-IBM943 x-IBM943C x-IBM948 x-IBM949 x-IBM949C x-IBM950 x-IBM964 x-IBM970 x-ISCII91 x-ISO-2022-CN-CNS x-ISO-2022-CN-GB x-iso-8859-11 x-JIS0208 x-JISAutoDetect x-Johab x-MacArabic x-MacCentralEurope x-MacCroatian x-MacCyrillic x-MacDingbat x-MacGreek x-MacHebrew x-MacIceland x-MacRoman x-MacRomania x-MacSymbol x-MacThai x-MacTurkish x-MacUkraine x-MS950-HKSCS x-mswin-936 x-PCK x-windows-874 x-windows-949 x-windows-950
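
A list like the one above can be produced on any JVM with Charset.availableCharsets(). This is a minimal sketch with illustrative names; the exact set printed depends on the Java version and installed modules, so it will not necessarily match the slide.

```java
import java.nio.charset.Charset;
import java.util.Set;

final class ListCharsets {
    // Canonical names of every charset this JVM supports, in sorted order.
    static Set<String> names() {
        return Charset.availableCharsets().keySet();
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```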

32
Databases (Maybe)
  • SQL 92 NATIONAL CHARACTER
  • The <key word>s NATIONAL CHARACTER are used to
    specify a character string data type with a
    particular implementation-defined character
    repertoire. Special syntax (N'string') is
    provided for representing literals in that
    character repertoire.
  • Collation
  • Database Support
  • MySQL
  • Oracle
  • SQL Server
  • PostgreSQL

33
Demonstration
  • Read/Write/Examine UTF-8/UTF-16/UTF-16LE encoded
    text (with Hex editor)
  • Show encoding settings in Eclipse and Java
  • Show how Windows (and the Eclipse console) can/can't
    display some characters
  • Web browser settings
  • Chinese article on cracking of SHA-1
  • Martin Fowler article on Dependency Injection

34
Resources
  • The big ones
  • http://www.unicode.org/Public/UNIDATA/
  • http://en.wikipedia.org/wiki/Unicode
  • http://www.evertype.com/standards/csur
  • The rest
  • http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
  • http://en.wikibooks.org/wiki/Unicode/Character_reference
  • http://www.joelonsoftware.com/articles/Unicode.html
  • http://www.cl.cam.ac.uk/~mgk25/unicode.html
  • http://czyborra.com/charsets/iso646.html
  • http://www.fileformat.info/ (GREAT resource)
  • For fun
  • http://www.omniglot.com/
  • http://en.wikipedia.org/wiki/Constructed_language
  • http://talideon.com/concultures/wiki/