Surrogate Support in Microsoft Products, IUC 18 (Hong Kong)

Slides: 22
Provided by: unic9
Learn more at: http://www.unicode.org

Transcript and Presenter's Notes



1
Surrogate Support in Microsoft Products
  • Michael S. Kaplan
  • Software Design Engineer
  • Trigeminal Software, Inc.

2
What are surrogates?
  • "a coded character representation for a single
    abstract character that consists of a sequence of
    two code units, where the first unit of the pair
    is a high surrogate and the second is a low
    surrogate"

3
High/low surrogate?
  • High: U+D800 - U+DBFF
  • Low: U+DC00 - U+DFFF
  • Terminology
  • "surrogate pair" preferred over "surrogate
    character"
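
The two ranges above can be tested with simple comparisons. The following is a minimal sketch; the function names are hypothetical and are not taken from any Microsoft API.

```python
def is_high_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is in the high surrogate range U+D800-U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is in the low surrogate range U+DC00-U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def is_surrogate_pair(first: int, second: int) -> bool:
    """A pair is well formed only as a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair.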

4
Conversion example 1
  • Example 1
  • The first surrogate pair (D800, DC00) converted
    to UTF-32:
  • 1. D800 = binary 1101100000000000 (lower ten
    bits: 0000000000)
  • 2. DC00 = binary 1101110000000000 (lower ten
    bits: 0000000000)
  • 3. Concatenate: 00000000000000000000 = 0x00000
  • 4. Add 0x10000
  • Result: U+10000. This makes sense, since the
    first character encoded with a surrogate pair
    follows immediately after the last character in
    the 16-bit Unicode range (U+FFFF).
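
The four steps of this example can be sketched in a few lines of Python. The function name is an assumption made for illustration.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    high_bits = high & 0x3FF  # step 1: lower ten bits of the high surrogate
    low_bits = low & 0x3FF    # step 2: lower ten bits of the low surrogate
    combined = (high_bits << 10) | low_bits  # step 3: concatenate to 20 bits
    return combined + 0x10000                # step 4: add 0x10000


print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
```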

5
Conversion example 2
  • Example 2
  • You have a Unicode character such as U+2040A (a
    CJK character in Plane 2) and wish to encode it in
    UTF-16:
  • 1. Subtract 0x10000. Result: 0x1040A
  • 2. Split into two ten-bit pieces: 0001000001 and
    0000001010
  • 3. Add 1101100000000000 (D800) to the high ten-bit
    piece (0001000001). Result:
    1101100001000001 (D841)
  • 4. Add 1101110000000000 (DC00) to the low ten-bit
    piece (0000001010). Result: 1101110000001010
    (DC0A)
  • Your surrogate pair: D841, DC0A
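
The reverse direction follows the same steps. This is a minimal sketch with a hypothetical function name.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # step 1: subtract 0x10000
    high_bits = offset >> 10             # step 2: upper ten bits
    low_bits = offset & 0x3FF            #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = encode_surrogate_pair(0x2040A)
print(f"{high:04X} {low:04X}")  # D841 DC0A
```

As a cross-check, Python's built-in codec produces the same code units: "\U0002040A".encode("utf-16-be") yields the bytes D8 41 DC 0A.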

6
UTF-8 conversions
  • Illegal conversion: six-byte UTF-8 (two
    surrogate code points of UTF-16, converted
    separately)
  • Legal conversion: four-byte UTF-8 (one UTF-32
    code point)
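
Python's standard codecs can illustrate the difference. This is only a demonstration in Python, not a statement about how any Microsoft product behaves. The strict "utf-8" codec refuses surrogate code points, and the "surrogatepass" error handler is used here solely to reproduce the illegal six-byte form.

```python
# One supplementary code point encodes to four bytes.
legal = "\U00010000".encode("utf-8")
print(legal.hex(), len(legal))  # f0908080 4

# Two surrogate code points held separately in a Python string.
pair = "\ud800\udc00"

rejected = False
try:
    pair.encode("utf-8")  # the strict encoder does not accept surrogates
except UnicodeEncodeError:
    rejected = True
print(rejected)  # True

# Forcing each surrogate through on its own gives the six-byte form.
six_bytes = pair.encode("utf-8", "surrogatepass")
print(six_bytes.hex(), len(six_bytes))  # eda080edb080 6
```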

7
UTF-8 example
  • Unicode surrogate pair
  • aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
  • becomes incorrect UTF-8 totaling 6 bytes
  • 1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy
    10xxxxxx
  • Instead, you should take a Unicode surrogate
    pair
  • 110110wwwwzzzzyy, 110111yyyyxxxxxx
  • and convert it to UTF-8 totaling 4 bytes (below,
    uuuuu is defined as wwww + 1)
  • 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
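
The bit layout on this slide can be followed literally. The sketch below assumes the field names w, z, y, x and u from the slide; the function name is hypothetical.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Convert a surrogate pair directly to the four-byte UTF-8 form."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    # high = 110110wwwwzzzzyy
    wwww = (high >> 6) & 0xF
    zzzz = (high >> 2) & 0xF
    yy = high & 0x3
    # low = 110111yyyyxxxxxx
    yyyy = (low >> 6) & 0xF
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1             # adding 1 here is the "add 0x10000" step
    yyyyyy = (yy << 4) | yyyy    # y bits come from both halves of the pair
    return bytes([
        0xF0 | (uuuuu >> 2),                 # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,  # 10uuzzzz
        0x80 | yyyyyy,                       # 10yyyyyy
        0x80 | xxxxxx,                       # 10xxxxxx
    ])


print(surrogate_pair_to_utf8(0xD841, 0xDC0A).hex())  # f0a0908a
```

The result matches what Python's own encoder produces for U+2040A.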

8
Encoding choices for MS
  • UTF-16, mostly
  • Occasionally UTF-8
  • Even more occasionally, UTF-32
  • Reasons:
  • There was already an existing, well-tested set
    of APIs that support UCS-2, which is a
    subset of UTF-16.
  • A completely new API set was not required.
  • A move to UTF-32 would require twice as much
    space for all characters.
  • A move to UTF-8 would require even more than
    twice as much space in many cases.
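
Storage cost depends heavily on the text being stored. The sketch below measures a few sample strings chosen purely for illustration (they are not from the presentation), using Python's standard codecs without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le variant adds no BOM
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII word": "surrogate",
    "two BMP ideographs": "\u4e2d\u6587",
    "one Plane 2 ideograph": "\U0002040A",
}
for label, text in samples.items():
    print(label, encoded_sizes(text))
```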

9
The products...
  • Mostly the new generation of products
  • Windows 2000/XP
  • Office XP (some support in Office 2000)
  • Most of these products supported Unicode already
  • a little bit of extra work needed for surrogate
    pairs
  • usually just UTF-8 support needed

10
Windows 2000/XP
  • Uniscribe/GDI support for rendering
  • Each surrogate pair is a single grapheme
  • APIs like CharPrev/CharNext not changed
  • Extensions to fallback fonts in XP
  • Font CMAP extensions in XP
  • Lots of UTF-8 issues fixed in XP
  • No specific surrogate font/IME (yet)
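
Where an API steps through text one 16-bit code unit at a time, calling code may need to keep surrogate pairs together itself. The following is a minimal sketch of pair-aware stepping over a sequence of UTF-16 code units. The helper names next_index and prev_index are hypothetical and are not the Windows CharNext/CharPrev functions.

```python
from typing import Sequence


def next_index(units: Sequence[int], i: int) -> int:
    """Index just past the code point that starts at position i."""
    if i >= len(units):
        return len(units)
    if (0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        return i + 2  # skip both halves of the pair
    return i + 1


def prev_index(units: Sequence[int], i: int) -> int:
    """Index of the start of the code point that ends just before position i."""
    if i <= 0:
        return 0
    if (i >= 2
            and 0xDC00 <= units[i - 1] <= 0xDFFF
            and 0xD800 <= units[i - 2] <= 0xDBFF):
        return i - 2  # step back over both halves of the pair
    return i - 1


units = [0x0041, 0xD841, 0xDC0A, 0x0042]  # "A", U+2040A, "B"
print(next_index(units, 1), prev_index(units, 3))  # 3 1
```

An unpaired surrogate is treated as a single unit, so the stepping never runs past the end of the buffer.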

11
Collation for Supplementary characters
  • All Plane 1 (non-ideographic) characters sort
    after all the other non-ideographic scripts but
    before the ideographs.
  • All Plane 2 (ideographic) characters will be
    sorted after all the ideographs on the BMP.
  • All Plane 3-14 characters (currently not assigned)
    will be treated like any other unassigned
    characters (this includes the Plane 14 language
    tags).
  • All characters encoded in Plane 15-16 (private
    use) will be sorted after all other characters.
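
The plane-based grouping above can be sketched as a lookup on the plane number, which is the code point shifted right by 16 bits. This only labels the group a code point falls into. It is not a sort key and not the actual Windows collation implementation, and the function name is hypothetical.

```python
def supplementary_collation_group(code_point: int) -> str:
    """Label the collation group described on the slide for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    plane = code_point >> 16
    if plane == 0:
        return "BMP (not a supplementary character)"
    if plane == 1:
        return "after other non-ideographic scripts, before ideographs"
    if plane == 2:
        return "after all BMP ideographs"
    if plane <= 14:
        return "treated as unassigned"
    return "private use, after all other characters"  # planes 15 and 16


print(supplementary_collation_group(0x2040A))  # after all BMP ideographs
```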

12
Other system components
  • MLang
  • Internet Explorer
  • IIS 5.0/6.0

13
The downlevel story
  • No good support for Unicode, let alone
    supplementary characters
  • Uniscribe/RichEdit does improve the downlevel
    story for display purposes, at least
  • Officially, no surrogate support on Win9x

14
The Office suite
  • Word
  • FrontPage
  • Excel/Access
  • Outlook
  • RichEdit 4.0

15
Specific Features
  • Insertion/Deletion of text - All
  • Cursor movement - All
  • Font linking/fallback - All (Word's is best)
  • UTF-8 issues fixed - All
  • Enhanced word breaking - All (Word/RichEdit)
  • Vertical text - Word/PowerPoint/Publisher/RichEdit
  • Direct entry (Alt+nnnnnn, hhhhh Alt+X) -
    Word/RichEdit

16
CHS/CHT/CHP Office
  • The product and the langpacks support an extended
    Unicode IME that handles supplementary
    characters
  • An Extension B font is also included

17
Visual Studio .NET
  • String class and globalization namespace
  • StringInfo
  • GetTextElementEnumerator
  • Handles supplementary characters
  • Also handles composite characters
  • GDI
  • IDE support
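
The idea of enumerating text elements rather than code units can be sketched in Python. This is a rough analogue only, not the .NET StringInfo algorithm: it assumes a text element is a base character followed by any characters in the general categories Mn, Mc or Me. Python strings are indexed by code point, so a supplementary character is already a single item here.

```python
import unicodedata


def text_elements(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if elements and is_mark:
            elements[-1] += ch  # attach the mark to the preceding element
        else:
            elements.append(ch)
    return elements


# "e" + combining acute accent, one Plane 2 ideograph, then "x"
sample = "e\u0301\U0002040Ax"
print(len(sample), len(text_elements(sample)))  # 4 3
```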

18
SQL Server
  • Past - no support
  • Present - surrogate "safe" (neutral)
  • Future - surrogate aware

19
Items not supported
  • Character Map
  • Graph 10
  • Outlook 10 mail headers
  • Collations for supplementary characters
  • Fonts/IMEs

20
Questions?