Title: Globalization Gotchas


1
Globalization Gotchas
  • Mark Davis

2
Unicode Basics
  • Unicode encodes characters, not glyphs
  • U+0067 → g g g g g ... (one code point, many
    glyph shapes)
  • Unicode does not encode characters by language
  • French, German, and English j have the same code
    point even though all have different
    pronunciations
  • Chinese 大 (dà) has the same code point as
    Japanese 大 (dai).
  • UTF-8, UTF-16, and UTF-32 are all Unicode.
  • The word "character" means different things to
    different people: make clear which one you mean.
  • glyphs, code points, bytes, code units,
    user-perceived characters (grapheme clusters), ...
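
To make the different meanings of "character" concrete, here is a minimal Python sketch that counts one short string in several units. The sample string is an assumption chosen for illustration: a user would perceive it as two characters.

```python
# "e" + U+0301 COMBINING ACUTE ACCENT, followed by U+1F600 (outside the BMP).
# A user perceives two characters here.
text = "e\u0301\U0001F600"

code_points = len(text)                            # Python str counts code points
utf8_bytes = len(text.encode("utf-8"))             # 8-bit code units (bytes)
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(text.encode("utf-32-le")) // 4   # 32-bit code units

print(code_points, utf8_bytes, utf16_units, utf32_units)
```

None of these counts equals the number of user-perceived characters. Counting those requires grapheme cluster segmentation, which the Python standard library does not provide.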

3
Unicode in APIs
  • U+0000 to U+10FFFF: be prepared to handle (or at
    least not corrupt!) any incoming code points
  • A back-level system may get unassigned code
    points from later versions.
  • Watch for "UCS-2" implementations. They use
    UTF-16 text, but don't support characters above
    U+FFFF; they may also accidentally produce
    isolated surrogates.
  • Some APIs/protocols count lengths in code
    points, and others in bytes (or other code
    units).
  • Make sure you don't mix them up.
  • Don't limit API parameters to a single character
    (and definitely not to a single code unit!).
  • What users think of as a single character (e.g.
    x, ch) may be a sequence in Unicode.
  • Use the latest version of Unicode: it supports
    new characters, corrections, and more stability
    guarantees.
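
One place where byte lengths and code point lengths get mixed up is truncation to a byte budget. The sketch below is one possible approach, not a definitive implementation; the function name `truncate_utf8` is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave an incomplete sequence at the end;
    # errors="ignore" drops it instead of emitting a corrupt code point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(len(truncate_utf8("a\u00e9\u20ac", 4)))
```

This keeps code points intact, but it can still split a user-perceived character (for example a base letter from its combining mark).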

4
Choice of Characters
  • Character and block names may be misleading, e.g.
  • U+034F COMBINING GRAPHEME JOINER doesn't join
    graphemes. See http://www.unicode.org/faq/
  • Use U+2060 (word joiner) instead of U+FEFF
    (zero-width no-break space) for everything but
    the BOM function.
  • Never use unassigned code points: those will be
    used in future versions of Unicode.
  • If you need code points of your own, use only
    private use (PUA) code points or noncharacters
    (and only if necessary)
  • If you do, minimize the opportunity for collision
    by picking an unusual range.

5
Character Conversion
  • Always use "shortest form" UTF-8.
  • It's the Law.
  • And if that isn't enough, consider security
    attacks.
  • If a protocol allows a choice of charsets, always
    tag correctly
  • Not all text is correctly tagged: character
    detection may be necessary. But remember, it's
    always a guess!
  • Converting a database of mixed, untagged data is
    extremely painful.
  • Bad assumptions:
  • Length in bytes = N × length in code points
  • 1 character in charset X = 1 character in Unicode
  • The ordering may also be different.
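
As an illustration of the "shortest form" rule, the following sketch checks that a strict UTF-8 decoder rejects a non-shortest encoding. The helper name `decodes_strictly` is hypothetical; the byte values are standard UTF-8 facts.

```python
# 0x2F is "/" in shortest-form UTF-8. The two-byte sequence C0 AF is a
# non-shortest ("overlong") spelling of the same code point, which a filter
# looking only for the byte 0x2F would miss.
shortest = b"\x2f"
overlong = b"\xc0\xaf"


def decodes_strictly(data: bytes) -> bool:
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


print(decodes_strictly(shortest), decodes_strictly(overlong))
```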

6
Character Conversion II
  • IANA / MIME charset names are ill-defined:
    vendors often convert the same charset in
    different ways.
  • Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)
  • Don't simply omit unconvertible data: to reduce
    security problems, at least substitute
  • U+FFFD (when converting to Unicode) or
  • 0x1A (when converting to bytes).
  • See http://www.w3.org/TR/japanese-xml/
  • See http://icu.sourceforge.net/charts/charset/
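
The substitution advice above can be sketched in Python as follows. The handler name "sub" and the helper functions are assumptions made for this example; Python's built-in "replace" handler emits "?" rather than 0x1A when encoding, so a small custom handler is registered.

```python
import codecs


def sub_handler(error):
    """Encode-side handler: emit one 0x1A (SUB) per unconvertible character."""
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error


# "sub" is a hypothetical handler name chosen for this sketch.
codecs.register_error("sub", sub_handler)


def to_unicode(data: bytes, charset: str) -> str:
    # The built-in "replace" handler substitutes U+FFFD when decoding.
    return data.decode(charset, errors="replace")


def to_bytes(text: str, charset: str) -> bytes:
    return text.encode(charset, errors="sub")


print(ascii(to_unicode(b"a\xffb", "utf-8")), to_bytes("a\u20acb", "latin-1"))
```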

7
Properties
  • Use properties such as Alphabetic, not hard-coded
    lists
  • isAlphabetic(x); regex \p{Alphabetic} or
    [:Alphabetic:]
  • Not (A ≤ x ≤ Z OR a ≤ x ≤ z)
  • Some properties aren't what you think; use
  • White_Space, not General_Category=Zs
  • Alphabetic, not General_Category=L
  • Lowercase, not General_Category=Ll
  • Script=Greek, not Block=Greek
  • Characters may change property values between
    versions of Unicode
  • See http://unicode.org/standard/stability_policy.html
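
A rough Python illustration of why hard-coded ranges and General_Category shortcuts fall short. Note the assumptions: the standard library does not expose the Alphabetic or White_Space properties directly, so `str.isalpha()` (based on general category) and `str.isspace()` stand in here as the closest built-ins; a library such as ICU would be needed for the true properties.

```python
import unicodedata


def is_ascii_letter(ch: str) -> bool:
    """The hard-coded range test that the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


# Letters that the range test misses but a property-based test accepts.
samples = ["a", "\u00e9", "\u044f", "\u4e2d"]   # a, é, я, 中
results = {ch: (is_ascii_letter(ch), ch.isalpha()) for ch in samples}

# TAB is white space, yet its General_Category is Cc, not Zs.
tab_category = unicodedata.category("\t")
tab_is_space = "\t".isspace()

print(tab_category, tab_is_space)
```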

8
Identifiers & Tokens
  • When designing syntax, use as a base:
  • Pattern_Syntax for operators / relations
  • Pattern_White_Space for gaps
  • XID_Start and XID_Continue for identifiers.
  • All backwards compatible across versions
  • Profiles may expand or narrow from the base
  • Watch out for security attacks
  • paypal.com with a Cyrillic а
  • See "Unicode Security" at this conference
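
The spoofing risk can be shown with a small sketch. The script check below is a deliberately crude assumption (it reads the first word of each character's Unicode name); real mixed-script detection should use the Script property, which the Python standard library does not expose.

```python
import unicodedata

genuine = "paypal"
spoofed = "p\u0430ypal"   # second letter is U+0430 CYRILLIC SMALL LETTER A


def scripts_by_name(label: str) -> set:
    """Crude script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}


# Both strings pass an XID-based identifier check, so that check alone
# does not protect against look-alike characters.
both_valid = genuine.isidentifier() and spoofed.isidentifier()

print(both_valid, sorted(scripts_by_name(spoofed)))
```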

9
Comparison (Collation): Searching, Sorting, Matching
  • There are two binary orders:
  • code point order = UTF-8 order = UTF-32 order
  • ≠ UTF-16 order
  • Don't present users with binary order!
  • No users expect A < Z < a < z < Ç < ä.
  • Apply normalization to get a unique form, so that
    different encodings of Å compare equal (Å = Å).
  • Security issues: protocols must precisely define
    the comparison operations
  • E.g. LDAP doesn't, so lookup may fail (or falsely
    succeed!)
  • Aside from wrong results, this is an opening for
    security attacks.
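
The two binary orders can be observed directly. This sketch sorts one BMP character and one supplementary character under each encoding; the sample characters are arbitrary choices for illustration.

```python
bmp_char = "\uff21"            # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
supplementary = "\U00010000"   # U+10000, the first code point above the BMP
items = [supplementary, bmp_char]

by_code_point = sorted(items)  # Python compares str values by code point
by_utf8 = sorted(items, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(items, key=lambda s: s.encode("utf-32-be"))
# In UTF-16 the supplementary character starts with a surrogate (0xD800),
# which is numerically below 0xFF21, so the order flips.
by_utf16 = sorted(items, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8 == by_utf32, by_utf16 == by_code_point)
```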

10
Language-Sensitive Comparison
  • Use UCA order as a base to meet
    user expectations
  • a < A < ä < Ç = C + ◌̧ (combining cedilla) < z < Z
  • Real language-sensitive order requires tailoring
    on top of UCA: ordering depends on context and
    language
  • china < China < chinas < danish
  • ae < æ < af
  • z < æ (Danish)
  • c < d < ... h < ch < i (Slovak)
  • Follow UCA for substring match offsets: there are
    some gotchas here.
  • Don't mix up "stable" and "deterministic"
    sorting: they are very different.
  • See http://unicode.org/reports/tr10/
  • See http://unicode.org/cldr
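
The difference between "stable" and "deterministic" sorting can be sketched as follows. As an assumption for this example, `str.casefold` stands in for a real collation key; it is not the UCA, and both function names are hypothetical.

```python
def stable_sort(words):
    # sorted() is stable: items with equal keys keep their input order,
    # so the result depends on how the input happened to be ordered.
    return sorted(words, key=str.casefold)


def deterministic_sort(words):
    # A tie-breaker on the raw string makes equal keys come out the same
    # way regardless of the input order.
    return sorted(words, key=lambda w: (w.casefold(), w))


print(stable_sort(["b", "A", "a"]), stable_sort(["b", "a", "A"]))
print(deterministic_sort(["b", "A", "a"]), deterministic_sort(["b", "a", "A"]))
```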

11
Normalization (NFC, ...)
  • Standardized normalized forms defined by Unicode.
  • The ordering of accents in a normalization form
    may not be the typical type-in order.
  • Fonts should handle both orders.
  • Normalization is context independent
  • Don't assume NFC(x + y) = NFC(x) + NFC(y)
  • People assume that NFC always composes, but some
    characters decompose in NFC.
  • Trivia: In Unicode 4.1 there are exactly 3
    characters that are different in all 4
    normalization forms: ϓ (U+03D3), ϔ (U+03D4),
    ẛ (U+1E9B)
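
These points can be checked with Python's `unicodedata` module. The sample characters below are assumptions picked for illustration, and the results depend on the Unicode data shipped with the interpreter (the mappings used here are long-standing).

```python
from unicodedata import normalize

x, y = "e", "\u0301"   # y is U+0301 COMBINING ACUTE ACCENT
joined = normalize("NFC", x + y)                      # composes to U+00E9
pieces = normalize("NFC", x) + normalize("NFC", y)    # stays two code points

# NFC does not always compose: U+0958 DEVANAGARI LETTER QA is excluded from
# composition, so NFC turns it into U+0915 U+093C.
qa = normalize("NFC", "\u0958")

# U+1E9B is one of the characters that differ in all four forms.
forms = {f: normalize(f, "\u1e9b") for f in ("NFC", "NFD", "NFKC", "NFKD")}

print(len(joined), len(pieces), len(qa), len(set(forms.values())))
```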

12
Maximum Expansion (U4.1)
Operation     UTF      Factor  Sample
NFC           8        3X      𝅘𝅥𝅯  U+1D160
NFC           16, 32   3X      שּׁ  U+FB2C
NFD           8        3X      ΐ  U+0390
NFD           16, 32   4X      ᾂ  U+1F82
NFKC / NFKD   8        11X     ﷺ  U+FDFA
NFKC / NFKD   16, 32   18X     ﷺ  U+FDFA
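
The factors in the table can be spot-checked with a short sketch. The helper `expansion` is hypothetical, and the results depend on the Unicode data bundled with the Python interpreter.

```python
from unicodedata import normalize


def expansion(form: str, text: str, encoding: str) -> float:
    """Encoded size after normalization divided by encoded size before."""
    before = len(text.encode(encoding))
    after = len(normalize(form, text).encode(encoding))
    return after / before


print(expansion("NFKC", "\ufdfa", "utf-8"))
print(expansion("NFKC", "\ufdfa", "utf-16-le"))
print(expansion("NFD", "\u1f82", "utf-32-le"))
```

When sizing buffers, the worst-case factor for the chosen form and encoding is the one that matters, not the typical case.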
13
Case Conversion
  • Not a simple 1:1 mapping
  • Title case: ǆ ↔ ǅ ↔ Ǆ
  • Expansion: heiß → HEISS → heiss
  • Context-dependent: ΌΣΟΣ → όσος
  • Language-dependent: istanbul → İSTANBUL (Turkish)
  • Warning: never use language-dependent casing for
    language-independent structures, like file-system
    B-trees.
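
Python's built-in casing shows several of these effects. This is a sketch of language-independent (default) casing only; Python's `str` methods do not take a locale.

```python
word = "hei\u00df"             # "heiß"
upper = word.upper()           # one code point becomes two
round_trip = upper.lower()     # the ß is not recovered

# Context-dependent: Σ lowercases to ς at the end of a word, σ elsewhere.
greek_lower = "\u038c\u03a3\u039f\u03a3".lower()   # ΌΣΟΣ

# Default casing is language-independent, so no dotted İ is produced.
default_upper = "istanbul".upper()

print(upper, round_trip, default_upper)
```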

14
Casing Maximum Expansion
Operation             UTF         Factor  Sample
Lower                 8           1.5X    Ⱥ  U+023A
Lower                 16, 32      1X      A  U+0041
Upper / Title / Fold  8, 16, 32   3X      ΐ  U+0390
15
Case Conversion II
  • Case folding was not stable.
  • Different results from toCaseFold(S) between two
    versions
  • Stability is now guaranteed as of Unicode 5.0
  • Don't use the Lowercase_Letter (Ll) or
    Uppercase_Letter (Lu) values of General_Category
  • These were constrained to be in a partition.
  • Use the separate binary properties Lowercase and
    Uppercase instead.

16
Lowercase / Uppercase: Form vs Function
  • Lowercase, the binary property
  • The character is lowercase in form, but not
    necessarily in function.
  • Functionally lowercase
  • isCased(x) and isLowercase(x).
  • See Section 3.13 of TUS.

17
Lowercase Form vs Function
LC  F. LC  Ll  Count (U4.1)  Examples
Y   N      N   114           ˠ  U+02E0 MODIFIER LETTER SMALL GAMMA
Y   N      Y   705           ª  U+00AA FEMININE ORDINAL INDICATOR
Y   Y      N   43            ⅰ  U+2170 SMALL ROMAN NUMERAL ONE
Y   Y      Y   903           a  U+0061 LATIN SMALL LETTER A
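
The distinction between form and function can be approximated in Python. Two assumptions apply: `str.islower()` on a single character is used as a stand-in for the Lowercase binary property, and `functionally_lowercase` is a simplified reading of the Section 3.13 definitions that ignores title case. General categories reflect the interpreter's Unicode version, not Unicode 4.1.

```python
import unicodedata


def functionally_lowercase(ch: str) -> bool:
    """Simplified isCased(x) and isLowercase(x), ignoring title case."""
    d = unicodedata.normalize("NFD", ch)
    is_cased = d.lower() != d or d.upper() != d
    is_lowercase = d.lower() == d
    return is_cased and is_lowercase


# U+02E0, U+00AA, U+2170, U+0061: the examples from the table.
samples = ["\u02e0", "\u00aa", "\u2170", "a"]
table = {
    ch: (ch.islower(), functionally_lowercase(ch), unicodedata.category(ch))
    for ch in samples
}

for ch, row in table.items():
    print("U+%04X" % ord(ch), row)
```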
18
Segmentation
  • What a user thinks of as a character is often a
    sequence.
  • Words are not just sequences of letters.
  • Lines don't just break at spaces.
  • All may be language-dependent.
  • See http://www.unicode.org/reports/tr14/
  • See http://www.unicode.org/reports/tr29/

19
Transliteration
  • Transliteration: Ελληνικά → Elleniká
  • Translation: Ελληνικά → Greek
  • Transliteration may vary by language
  • Путин → Putin, Poutine, ...
  • Горбачёв → Gorbachev, Gorbacev, Gorbatchev,
    Gorbacëv, Gorbachov, Gorbatsov, Gorbatschow, ...
  • Watch for terminology: lossy vs lossless
  • Lossy transliteration: Ελληνικά → Ellinika, which
    does not round-trip to the original Greek
  • In ISO terms: transliteration = lossless
    transliteration; transcription = lossy
    transliteration.
  • See http://unicode.org/draft/reports/tr35/tr35.html

20
Rendering is Contextual
Processing character-by-character gives the wrong
results!
  • Glyphs may change shape
  • Multiple characters → 1 glyph
  • One character → multiple glyphs

21
Rendering II
  • Good rendering systems will handle customary
    type-in order for text plus canonical order.
  • Excellent ones will do any canonically-equivalent
    order, but those are rare.
  • There may be differences in the customary glyphs
    for different languages: specify the font or the
    language where they have to be distinguished
  • Security issues
  • Never render a missing glyph as "?".
  • Don't simply overlay diacritics: it can cause
    security problems.
  • See http://www.unicode.org/notes/tn2/
  • See http://unicode.org/reports/tr14/

22
Globalization
  • Unicode ≠ Globalization (aka Internationalization,
    Localizability)
  • Unicode provides the basis for software
    globalization, but there's more work to be
    done...
  • Use globalization APIs: formatting and parsing of
    dates, times, numbers, and currencies; comparison
    of text; calendar systems; ... are all
    locale-dependent.
  • Where OS facilities are not adequate or
    cross-platform solutions are needed, use ICU (C,
    C++, Java)
  • Don't put any translatable strings into your
    code: separate them into resource files.
  • Provide context to translators: is "Mark" a noun,
    a verb, or a name?
  • Don't use the same string in different contexts
    unless the meaning is identical (including
    references).
  • Note: user-interface language (menus, dialogs,
    help system, ...) ≠ data language (body text,
    spreadsheet cells).
  • Programs need to handle, as data, more languages
    than in the localized UI

23
Common Globalization Mistakes
  • Never compile Windows apps as ANSI (the
    default!).
  • Don't simply concatenate strings to make
    messages
  • Order of components differs by language: use Java
    MessageFormat, or structure the UI as separate
    fields.
  • Don't assume icons and symbols mean the same
    around the world. Don't assume everyone can read
    the Latin alphabet.
  • Allocate space flexibly: OK in English →
    Aceptar in Spanish
  • English is a relatively compact language; others
    may require more characters (e.g. in database
    fields) and more screen real estate (in UIs).
  • Beware of discrepancies in fallback behavior
  • Java ResourceBundle (J2SE), JavaServer Pages
    Standard Tag Library (JSTL), JavaServer Faces
    (JSF), Apache HTTP, ...
  • See http://unicode.org/cldr/
  • See http://ibm.com/software/globalization/icu/
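
The advice against concatenation can be sketched with whole-sentence templates and named placeholders. The catalog below is hypothetical (the translations are illustrative, and plural handling is ignored); in a real application the templates would live in resource files and be formatted by a facility such as MessageFormat.

```python
# Hypothetical message catalog: one complete template per locale, so that
# translators can reorder the placeholders freely.
MESSAGES = {
    "en": "{user} copied {count} files to {folder}.",
    "de": "{user} hat {count} Dateien nach {folder} kopiert.",
    "ja": "{user}さんが{folder}に{count}個のファイルをコピーしました。",
}


def render(locale: str, **values) -> str:
    return MESSAGES[locale].format(**values)


english = render("en", user="Ana", count=3, folder="Docs")
japanese = render("ja", user="Ana", count=3, folder="Docs")
print(english)
```

In the Japanese template the folder comes before the count, which concatenating fragments in English order could not express.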

24
Neutral Formats
  • Store and transmit neutral-format data wherever
    possible. Convert that data to the user's
    preferred formats as "close" to the user as
    possible.
Type             Example              Rec. Standard
Language/Locale  en-US (en_US)        RFC 3066 bis / CLDR
Territory        AU                   RFC 3066 bis
Currency         EUR                  ISO 4217
Timezone         Australia/Melbourne  TZDB
Calendar         islamic-civil        CLDR Calendar ID
Custom Date      yyyy-MMM-dd          CLDR Pattern Format
Binary Time      8C80E9E3967A4B0      Windows File Time
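
As a sketch of working with a neutral binary time, the following converts a Windows file time (100-nanosecond ticks since 1601-01-01 UTC) to a `datetime` and then to text. Both function names are hypothetical, and the ISO 8601 text form in UTC is an assumption of this example, not a format named in the table.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert 100 ns ticks since 1601-01-01 UTC to an aware datetime."""
    # Precision below one microsecond is discarded.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def to_neutral(dt: datetime) -> str:
    """ISO 8601 text in UTC, for storage or transmission."""
    return dt.astimezone(timezone.utc).isoformat()


# 116444736000000000 ticks is the Unix epoch expressed as a file time.
unix_epoch = filetime_to_datetime(116444736000000000)
print(to_neutral(unix_epoch))
```

Conversion to the user's time zone, calendar, and date pattern then happens only at display time.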

25
Identification
  • Locale IDs are extensions of language IDs: use
    CLDR. See http://unicode.org/cldr/
  • Don't assume that everyone in a country always
    uses that country's currency. Always use an
    explicit currency ID (ISO 4217).
  • <RUR, 1.23457×10³> → 1 234,57р. in Russian,
  • but Rub 1,234.57 in English.
  • Don't assume the timezone ID is implied by the
    user's locale. For the best timezone information,
    use the TZ database; use CLDR for timezone
    names. See http://www.twinsun.com/tz/tz-link.htm
  • If you heuristically compute territory IDs,
    timezone IDs, currency IDs, etc. (e.g. from
    browser settings), make sure the user can
    override that and pick an explicit value.

26
Unicode Guide
  • Authoritative but lightweight
  • Introduction, overview, and quick reference
  • Main principles of the Unicode Standard
  • Best practices in Software Globalization

27
Other Resources
  • Unicode Site
  • http://unicode.org
  • An Overview of ICU
  • http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
  • Globalizing Software
  • http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
  • W3C Internationalization
  • http://www.w3.org/International/
  • Microsoft Global Software Development
  • http://www.microsoft.com/globaldev/default.asp

28
Q&A
29
Backup Slides
30
User Input 
  • If you develop your own text editor, use the OS
    APIs to handle IMEs (Input Method Engines) for
    Chinese, Japanese, Korean,...
  • If you are using "type-ahead" to get to a
    position in a list (e.g. typing "Jo" gets to the
    first element starting with those characters),
    allow arbitrary input. This is often easiest with
    visible fields.
  • If your password field can contain characters
    that require an IME, a screen pop-up box may
    reveal the password to onlookers.

31
Dotted and Dotless I
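Source            Normal lowercase   Normal uppercase  Turkic lowercase  Turkic uppercase
I  U+0049         i  U+0069          I  U+0049         ı  U+0131         I  U+0049
i  U+0069         i  U+0069          I  U+0049         i  U+0069         İ  U+0130
İ  U+0130         i̇  U+0069 U+0307   İ  U+0130         i  U+0069         İ  U+0130
ı  U+0131         ı  U+0131          I  U+0049         ı  U+0131         I  U+0049
İ  U+0049 U+0307  i̇  U+0069 U+0307   İ  U+0049 U+0307  i  U+0069         İ  U+0049 U+0307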
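
Python's `str.upper()` and `str.lower()` implement only the default (normal) mappings, so Turkic casing has to be layered on top. The two helpers below are hypothetical and cover only the dotted and dotless I; a production system would use a locale-aware library such as ICU.

```python
def turkic_upper(text: str) -> str:
    # Map the Turkic-specific letters first, then apply default casing.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkic_lower(text: str) -> str:
    return (text.replace("I\u0307", "i")     # I + combining dot above
                .replace("\u0130", "i")      # dotted capital I
                .replace("I", "\u0131")      # plain I becomes dotless i
                .lower())


default_lower_of_dotted = "\u0130".lower()   # default mapping: i + U+0307
print(ascii(turkic_upper("istanbul")), ascii(turkic_lower("ISTANBUL")))
```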
32
Java
  • In MessageFormat, watch for words like can't,
    since the ASCII ' has syntactic meaning. Use a
    real apostrophe (U+2019) where possible: can’t.
  • In Date and Calendar, the months are numbered
    from 0 (February is month number 1!). However,
    weeks and days are numbered from 1.
  • Java serialized text isn't UTF-8, though it's
    close. U+0000 and supplementary code points are
    encoded differently.
  • Java globalization support is pretty outdated:
    use ICU to supplement it.
  • Java ResourceBundle (J2SE), JavaServer Pages
    Standard Tag Library (JSTL), JavaServer Faces
    (JSF), Apache HTTP server, etc. all provide some
    locale determination mechanism and facility, but
    they all differ in details.
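
To show how Java's serialized text differs from UTF-8, here is a Python sketch of the "modified UTF-8" rules documented for java.io.DataInput: U+0000 is written as C0 80, and a supplementary code point is written as two surrogates of three bytes each. The function name is hypothetical and the sketch does not produce the length prefix that writeUTF adds.

```python
def java_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 encodes its characters."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                 # two-byte form of U+0000
        elif cp < 0x10000:
            out += ch.encode("utf-8", "surrogatepass")
        else:
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):           # each surrogate takes 3 bytes
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(java_modified_utf8("\x00"), java_modified_utf8("\U00010000"))
```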

33
JavaScript
  • Always encode characters above U+007F with
    escapes (\uxxxx).
  • There is an HTML mechanism to specify the charset
    of the JavaScript source, but it is not widely
    implemented.
  • The JDK tool native2ascii can be used to convert
    the files to use escapes.
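
What such a conversion does can be sketched in a few lines of Python. The function name `to_js_escapes` is hypothetical; it mimics the general idea (one \uxxxx escape per UTF-16 code unit) and is not a reimplementation of native2ascii.

```python
def to_js_escapes(text: str) -> str:
    """Replace every character above U+007F with \\uxxxx escapes."""
    out = []
    for ch in text:
        if ord(ch) <= 0x7F:
            out.append(ch)
            continue
        # Supplementary characters become a surrogate pair: two escapes.
        units = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(to_js_escapes("caf\u00e9"))
```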