Compact Encodings of Unicode

Description:

New encoding; supported by ICU. San Jose, California, September 2002 ... UTF-8 and/or UTF-16 work in most cases. Size of text often not critical ...

Provided by: markusw7
Learn more at: https://icu-project.org

Transcript and Presenter's Notes


1
Compact Encodings of Unicode
  • Markus W. Scherer
  • Unicode/G11N Software Engineer
  • IBM Globalization Center of Competency

2
Agenda
  • Encodings in files and protocols
    • Not: processing encoding forms
  • Unicode is too big
    • Issues and non-issues
  • How to reduce size of Unicode text
    • Choice of encoding
    • Optional compression
  • Examples and comparisons

3
What is ICU?
  • Internationalization libraries for C, C++, Java
  • Open source, non-viral
  • Sponsored by IBM
  • Sun's Java licenses an earlier ICU version; ICU4J
    updates it
  • Unicode standard compliant
    • Full supplementary character support
  • Cross-platform, extensible, and customizable
  • High performance and thread-safe
    • Multiple locales in same thread simultaneously
  • Converters for all Unicode charsets and hundreds of
    legacy codepages
  • http://oss.software.ibm.com/icu/

4
Encodings of Unicode
  • Common Unicode character set
  • External encodings
    • Files and protocols
    • Almost always byte-serialized
    • Character Encoding Schemes/charsets
  • Processing encodings
    • Character Encoding Forms, often 16/32-bit
    • Different requirements
    • Topic for a different presentation

5
Unicode is too big?
  • Perceived large size of Unicode text
    • Compared with legacy codepages
  • Size matters
    • Low-speed connections (dial-up, mobile)
    • Little memory (PDA, cell phone, embedded)
  • Size does not matter when
    • Images and other binaries swamp text size
    • High-speed network
    • Temporary documents
    • Large amounts of memory

6
How big is it?
  • Size depends on language/script
  • Bytes/char for some language groups
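
The table for this slide did not survive in the transcript. As a rough illustration of how bytes/char depends on the script, the sketch below measures a few sample words in UTF-8 and UTF-16. The sample words and the helper name `bytes_per_char` are assumptions made for this example, not data from the presentation.

```python
# Sample words are illustrative assumptions, not the presentation's data.
SAMPLES = {
    "English": "Unicode",
    "Greek": "Ελληνικά",
    "Russian": "Русский",
    "Hindi": "हिन्दी",
    "Japanese": "日本語",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point of text."""
    return len(text.encode(encoding)) / len(text)


for language, word in SAMPLES.items():
    utf8 = bytes_per_char(word, "utf-8")
    utf16 = bytes_per_char(word, "utf-16-be")  # BE form, no byte order mark
    print(f"{language:9} UTF-8 {utf8:.2f}  UTF-16 {utf16:.2f}")
```

Real documents mix scripts with ASCII spaces, digits, and markup, so averages over whole texts will differ from these single-word figures.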

7
Legacy codepages
  • Compact because
    • Designed for single/few languages
    • Few characters compared with Unicode
  • Conversion problems
    • Fallback/substitution of unmappable chars
    • Mapping table differences
    • Loss of parts of the text is common
    • Large number/size of mapping tables
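
The conversion problems above can be sketched with Python's built-in codecs. The helper `lossy_roundtrip` and the sample text are hypothetical, chosen only to show substitution of unmappable characters and one mapping table difference between two similar codepages.

```python
def lossy_roundtrip(text, codepage):
    """Encode to a legacy codepage, substituting '?' for unmappable chars."""
    return text.encode(codepage, errors="replace").decode(codepage)


original = "Grüße, Ελλάδα, 日本"
# Windows-1252 covers Western European letters only, so the Greek and
# Japanese parts of the text come back as question marks.
print(lossy_roundtrip(original, "cp1252"))

# Mapping table differences: the same byte maps to different characters.
euro_byte = b"\x80"
print(repr(euro_byte.decode("cp1252")))   # euro sign in Windows-1252
print(repr(euro_byte.decode("latin-1")))  # C1 control code in ISO 8859-1
```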

8
Reduce Unicode text size
  • Choice of encoding
    • Encodings designed for different purposes
    • Compactness vs. direct applicability vs. software
      support etc.
  • General-purpose compression
    • Best on top of compact encoding
    • Not available in all applications

9
UTF-8/16
  • Designed for processing but all-purpose
  • UTF-8
    • Byte-based, ASCII-compatible
    • BMP: up to 3 bytes/char
  • UTF-16 (BE/LE)
    • Byte-serialization of 16-bit form, not
      ASCII-compatible
    • BE/LE forms or Byte Order Mark
    • BMP: always 2 bytes/char
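
A minimal sketch of these byte serializations, using Python's standard codecs. The helper name `serialize` is an assumption for this example. Python's plain `utf-16` codec writes a Byte Order Mark in the platform's byte order, while `utf-16-be` and `utf-16-le` fix the order and write no mark.

```python
def serialize(text):
    """Return the byte serialization of text in several Unicode encodings."""
    encodings = ("utf-8", "utf-16-be", "utf-16-le", "utf-16")
    return {name: text.encode(name) for name in encodings}


for sample in ("A", "\u20ac"):  # ASCII letter, then U+20AC EURO SIGN (BMP)
    for name, data in serialize(sample).items():
        print(f"U+{ord(sample):04X} {name:9} {data.hex(' ')}")
```

The ASCII letter stays a single, unchanged byte only in UTF-8; in every UTF-16 form it is accompanied by a zero byte, which is why UTF-16 is not ASCII-compatible.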

10
UTF-7
  • 7-bit encoding designed for email
  • Obsolete: email is now 8-bit-safe
  • Partially ASCII-compatible
  • BMP: 2.67 bytes/char plus overhead
  • Base64-encoded UTF-16BE
  • Stateful
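
The 2.67 figure follows from base64 carrying 6 bits per output byte, so one 16-bit UTF-16 code unit needs 16/6 ≈ 2.67 bytes. The sketch below checks this with Python's `base64` module and its `utf-7` codec; the sample text is an assumption chosen for illustration.

```python
import base64

text = "日本語"  # three BMP characters; sample chosen for illustration

# Base64 turns every 3 input bytes into 4 output bytes (6 bits per byte).
utf16be = text.encode("utf-16-be")
payload = base64.b64encode(utf16be)
print(len(payload) / len(text))  # base64 bytes per character, before overhead

# The utf-7 codec adds shift characters around the base64 run (the overhead).
utf7 = text.encode("utf-7")
print(utf7, len(utf7))
assert utf7.decode("utf-7") == text
```

The shift characters are also what makes the encoding stateful: a decoder has to track whether it is inside a base64 run.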

11
SCSU & BOCU-1
  • About as compact as legacy codepages
    • 1 byte/char for small scripts, 2 for CJK;
      stateful
  • Compress short strings better than LZW (zip) etc.
  • SCSU
    • Limited ASCII compatibility (initial state)
    • Complex state, many encoding choices
    • Non-deterministic output, arbitrary byte values
    • Established encoding, supported in various tools:
      editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)

12
BOCU-1
  • BOCU-1
    • Delta-encoding, avoids control codes
    • MIME text-compatible but not ASCII-compatible
    • Deterministic
    • Preserves binary order (for sorting, databases)
    • New encoding; supported by ICU
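
To illustrate why delta-encoding gets small scripts down to about 1 byte/char, here is a toy difference encoder. It is not BOCU-1: the block size, the escape byte, and the function names are assumptions made up for this sketch, and unlike BOCU-1 it neither avoids control codes nor preserves binary order. It only shows the core idea of encoding each character as a small difference from a state derived from the previous character.

```python
BLOCK = 0x80   # toy block size (assumption, not BOCU-1's actual rules)
ESCAPE = 0xFF  # toy marker: the next 3 bytes hold a full code point


def _middle_of_block(code_point):
    """State after a character: the middle of its 128-code-point block."""
    return (code_point & ~(BLOCK - 1)) + BLOCK // 2


def toy_delta_encode(text):
    out = bytearray()
    prev = BLOCK // 2  # initial state: middle of the ASCII block
    for ch in text:
        cp = ord(ch)
        delta = cp - prev
        if -64 <= delta < 64:
            out.append(delta + 64)        # small difference: one byte
        else:
            out.append(ESCAPE)            # large jump: escape + code point
            out += cp.to_bytes(3, "big")
        prev = _middle_of_block(cp)
    return bytes(out)


def toy_delta_decode(data):
    chars = []
    prev = BLOCK // 2
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            cp = int.from_bytes(data[i + 1:i + 4], "big")
            i += 4
        else:
            cp = prev + data[i] - 64
            i += 1
        chars.append(chr(cp))
        prev = _middle_of_block(cp)
    return "".join(chars)


word = "Ελληνικά"
encoded = toy_delta_encode(word)
print(len(encoded), "bytes vs", len(word.encode("utf-8")), "in UTF-8")
assert toy_delta_decode(encoded) == word
```

After the first character pays for the jump into the Greek block, every following Greek letter costs one byte. A space would force another jump in this toy; the real encoding is designed to handle such cases far better.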

13
SCSU & BOCU-1 text sizes
  • Average bytes/char relative to UTF-8

14
Encoding vs. compression
  • For example BOCU-1 with WinZip
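
The chart for this slide is missing from the transcript. As a stand-in, the sketch below compares raw and compressed sizes with Python's `zlib` (Deflate) applied to UTF-8 and UTF-16 text. Both substitutions are assumptions of this example: the standard library has no BOCU-1 codec, and `zlib` is used in place of WinZip. The sample strings are also made up.

```python
import zlib


def sizes(text, encoding="utf-8"):
    """Return (encoded size, zlib-compressed size) in bytes."""
    raw = text.encode(encoding)
    return len(raw), len(zlib.compress(raw, 9))


short = "Γειά σου"       # a short string: compression overhead dominates
long_text = short * 500  # a long, repetitive text: compression pays off

for label, text in (("short", short), ("long", long_text)):
    for encoding in ("utf-8", "utf-16-le"):
        raw_size, zipped_size = sizes(text, encoding)
        print(f"{label:5} {encoding:9} raw {raw_size:5}  zlib {zipped_size:5}")
```

For the short string the compressed form is larger than the input, which is the case where a compact encoding alone does better than general-purpose compression.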

15
Performance
  • Converter performance
    • Roundtrip to/from UTF-16 with ICU
    • SCSU: 45..125% of UTF-8 roundtrip time
    • BOCU-1: 40..160% of UTF-8 roundtrip time
    • Depends on encoding ratio
    • Fast for small scripts, 1 byte/char
  • Separate compression adds to I/O time
  • Conversion time typically swamped by
    • Transmission (low-bandwidth connections)
    • Shorter texts transmit faster!
    • Parsing/processing

16
Further considerations
  • In-document encoding declarations require ASCII
    readability (XML, HTML)
  • Protocol may limit byte values (SMTP)
  • Transfer encoding syntax (TES) required for some encodings
    • base64 for SCSU or UTF-16 in emails
    • Increases text size
  • Compression removes ASCII readability and uses
    arbitrary byte values
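
Two of these points can be checked with a short sketch: whether an ASCII declaration stays readable in the encoded bytes, and how much base64 adds as a transfer encoding. The helper names `ascii_readable` and `base64_size` and the sample inputs are assumptions for illustration.

```python
import base64


def ascii_readable(text, encoding):
    """True if the ASCII bytes of text appear unchanged in the encoded form."""
    return text.encode("ascii") in text.encode(encoding)


def base64_size(raw):
    """Size in bytes of raw after base64 transfer encoding."""
    return len(base64.b64encode(raw))


marker = "<?xml"  # start of an XML declaration
for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    print(encoding, ascii_readable(marker, encoding))

body = "Ελληνικά".encode("utf-16-be")
print(len(body), "bytes raw,", base64_size(body), "bytes as base64")
```

Base64 emits 4 bytes for every 3 input bytes, so a transfer encoding costs roughly a third more on top of whatever the character encoding produced.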

17
Conclusion
  • UTF-8 and/or UTF-16 work in most cases
    • Size of text often not critical
  • When small text size is needed
    • Use SCSU or BOCU-1
    • Consider compression
    • Make sure receiver can handle it

18
References
  • Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
  • Character Encoding Model (UTR 17): http://www.unicode.org/reports/tr17/
  • SCSU (UTS 6): http://www.unicode.org/reports/tr6/
  • BOCU-1: http://oss.software.ibm.com/cvs/icu/checkout/icuhtml/design/conversion/bocu1/bocu1.html
  • ICU homepage: http://oss.software.ibm.com/icu/
  • Unicode Consortium: http://www.unicode.org/
  • IBM developerWorks: http://www.ibm.com/developerworks/unicode/