Unicode 4.0 - PowerPoint PPT Presentation

1
Unicode 4.0

Mark Davis
President, The Unicode Consortium
Note: slides differ from proceedings

2
Overview

New Characters
Conformance
UAX: Unicode Standard Annexes
UCD: Unicode Character Database
UTS: Unicode Technical Standards
Not part of the Standard, but can claim conformance

3
Properties and Behavior

Unicode is not just a list of characters
Properties and behavior are crucial
With them, new characters can work out of the box
Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)

4
New Characters: 1,228

Modern Scripts
(additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
(minority scripts) Limbu, Tai Le, Osmanya
Historic Scripts
Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
Symbols
Monograms, digrams, tetragrams, other symbols
modifier and combining characters

5
New Characters (cont.)

Special Characters
additional variation selectors (for future CJK variants), double diacritics for dictionary use
For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
Character repertoire corresponds to ISO/IEC 10646:2003.

6
Conformance

Substantially improved specification of conformance requirements
Incorporated UTR #17, Character Encoding Model, clearly separating encoding forms and encoding schemes
Tightened definitions of UTF-8, UTF-16, UTF-32
Separate definition of Unicode String
Clarified conformance status of Unicode Standard Annexes
Formal definitions of properties and algorithms
Provisional properties

7
UTF vs. Unicode String

Important Distinction
UTF:
Unique representation for each code point
All else illegal, e.g.
C0 80 (UTF-8: non-shortest form)
D800 0061 (UTF-16: unpaired surrogate)
Unicode String:
Sequence of code units
Internal processing, not interchange
Not necessarily valid UTF, e.g.
C0 A0
D800 0061
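
The distinction can be checked with a strict decoder. The following is a minimal sketch, assuming Python's built-in codecs as the validator; the helper names `is_valid_utf8` and `is_valid_utf16` are made up for illustration and are not from the slides.

```python
from typing import Sequence


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling by default
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(units: Sequence[int]) -> bool:
    """Return True if the 16-bit code units form well-formed UTF-16."""
    # Serialize the code units big-endian, then let the strict codec judge.
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    try:
        raw.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# The slide's examples: both are rejected as UTF.
print(is_valid_utf8(bytes([0xC0, 0x80])))   # non-shortest form
print(is_valid_utf16([0xD800, 0x0061]))     # unpaired high surrogate
```

A Unicode string used for internal processing may still hold such sequences; the check only matters at the point where the data has to be a valid UTF.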

8
Conformance (cont.)

Formalized policies for stability of the standard
Clarification of semantics of important characters, including BOM
Revised scope of enclosing combining marks
Revised semantics of ZWJ for cursive scripts
Normalization Corrections
U+2F868, U+2F874, U+2F91F, U+2F95F, U+2F9BF
All corrections subject to strict stability constraints
For 3.2 repertoire, NFC3.2(X) NFC4.0(X)

9
Textual Clarifications

Major changes to Chapters 2, 3, 6, 14 and 15
Definitive terminology for code points
graphic, format, control, private-use: assigned characters
surrogate, noncharacter, reserved: not characters
Substantial improvements to many character block descriptions, especially Indic

10
Programming language identifiers

Now backwards-compatible
Once a Unicode identifier, always a Unicode identifier
Alternate definition for complete stability
Fix set of allowed characters
Allow all reserved code points
Complete stability
- Odd characters
Also see new UTR on Syntax Characters
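
As a rough illustration of property-based identifiers, Python's own identifier syntax is defined in terms of the XID_Start and XID_Continue properties. The sketch below only assumes `str.isidentifier()` from the standard library; the helper `classify` and the sample names are hypothetical, and Python tracks a later Unicode version than 4.0.

```python
def classify(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}


samples = ["total", "café", "变量", "2fast", "a-b"]
for name, ok in classify(samples).items():
    print(f"{name!r}: {'identifier' if ok else 'not an identifier'}")
```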

11
Case mappings now normative (but tailorable)

Clearer definition of string functions
isUpper(), isLower(), isTitle(), isFold()
toUpper(), toLower(), toTitle(), toFold()
Definition of titlecase uses word boundaries
Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
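
The string functions on this slide have close counterparts in Python's `str` methods, which apply the default (untailored) full case mappings. This is a sketch under that assumption; the wrapper names mirror the slide and are not a real API, and Python's `title()` uses a simpler word-boundary rule than the one the standard defines.

```python
def to_upper(s: str) -> str:
    return s.upper()


def to_lower(s: str) -> str:
    return s.lower()


def to_title(s: str) -> str:
    # Note: str.title() treats any non-letter as a word break,
    # which is cruder than word boundaries per the standard.
    return s.title()


def to_fold(s: str) -> str:
    return s.casefold()


print(to_upper("Straße"))            # full mapping: one character becomes two
print(to_fold("Straße"))             # case folding for caseless matching
print(to_title("\u01C6") == "\u01C5")  # digraph has a distinct titlecase form
# Untailored lowercase of U+0130 keeps the dot as a combining mark;
# a Turkic tailoring would map it to plain "i" instead.
print([hex(ord(c)) for c in to_lower("\u0130")])
```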

12
UAX #9: BIDI

BIDI: Arabic/Hebrew display
HTML, all modern word processors, OSs, ...
New
canonical equivalence now preserved
data change, not algorithm
shaping is done after reordering
but not across directional boundaries
clarifications of
ZWJ, ZWNJ
intermediate level processing
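
The algorithm is driven by per-character bidirectional classes from the UCD. The sketch below only looks those classes up with the standard `unicodedata` module; it does not implement reordering, and the helper name `bidi_classes` is an assumption for illustration.

```python
import unicodedata


def bidi_classes(text: str):
    """Return (character, Bidi_Class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Latin letter, Hebrew alef, Arabic beh, European digit, ZWJ
for ch, cls in bidi_classes("a\u05D0\u06281\u200D"):
    print(f"U+{ord(ch):04X} -> {cls}")
```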

13
UAX #15: Normalization

Unique form for text comparison
W3C Character Model, International Domain Names, Network File System, ...
New
Description of Stable Code Points.
Notation NFC(x) and isNFC(x), in Notation.
Added pointer to UTN #5, Canonical Equivalences in Applications
Rewrote Annex 12, Corrigenda, for clarity, and to describe the use of Normalization Corrections.
Added Annex 13, Canonical Equivalence.
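
The NFC(x) and isNFC(x) notation maps directly onto a small amount of code. A minimal sketch, assuming Python's `unicodedata.normalize`; the function names `nfc` and `is_nfc` are chosen here to match the slide's notation.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): the canonical composed form of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC."""
    return nfc(x) == x


decomposed = "e\u0301"   # e + combining acute accent
composed = "\u00E9"      # precomposed e with acute

# Canonically equivalent strings compare equal after normalization.
print(nfc(decomposed) == composed)
print(is_nfc(composed), is_nfc(decomposed))
```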

14
UAX #14: Line Breaking

Line-break (word-wrap) all Unicode text
Customizable for different languages
New
Negative numbers and dates with hyphens will not break across lines
Word Joiner will link any characters (except hard line breaks)
Behavior of soft hyphen clarified
marks opportunity for breaking, not specific graphic appearance.
Rules for GL relaxed: SP and ZW override
New property values: NL, WJ

15
UAX #29: Text Boundaries

Default user-character, word, and sentence boundaries
Customizable for different languages
Word, sentence tailoring expected
New
Extracted from 3.0, but significantly revised
Grapheme cluster (user character)
Hangul syllable or other base
plus (optionally) any number of NSMs (nonspacing marks)
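
The "base plus any number of nonspacing marks" rule can be sketched in a few lines. This is a deliberately simplified illustration, not the full default algorithm: it treats General_Category Mn and Me as the marks, handles only precomposed Hangul syllables (no conjoining jamo sequences), and ignores cases such as CR LF. The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text: str):
    """Group each base character with the nonspacing/enclosing marks after it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if clusters and is_mark:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


text = "e\u0301a\uD55C"  # e + acute, a, one precomposed Hangul syllable
print(len(text), len(simple_clusters(text)))
```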

16
No Substantive Changes

UAX #11: East Asian Width
Guidelines for choosing character width
UAX #24: Script Names
Default script assignment
Used in regular expressions
Now a UAX

17
Superseded UAXes

Incorporated into and thus superseded by Unicode Version 4.0
UAX #13: Unicode Newline Guidelines
UAX #19: UTF-32
UAX #21: Case Mappings
UAX #27: Unicode 3.1
UAX #28: Unicode 3.2

18
Unicode Character Database

Crucial Component of Unicode
Documentation coalesced into UCD.html.
New properties and values
Hangul_Syllable_Type, Unicode_Radical_Stroke
CJK numeric values added.
PropertyValueAliases adds block names
UCD fallback properties more precisely defined for code points not explicitly in data files
New Characters
Appropriate properties assigned

19
UCD 4.0 (cont.)

Modifier letters
The general category of U+02B9..U+02BA and U+02C6..U+02CF changed to Lm.
Khmer
Two Khmer characters are deprecated; four others are strongly discouraged.
Decimal Digits
Numeric_Type=decimal digit now aligned with General_Category=Nd
Braille
Added script value
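
Some of these property values can be spot-checked from code. The sketch below assumes Python's `unicodedata` module, which ships a later UCD version than 4.0, so it shows the current data rather than the 4.0 files; the helper `describe` is made up for illustration.

```python
import unicodedata


def describe(ch: str):
    """Return (General_Category, decimal digit value or None) for a character."""
    return unicodedata.category(ch), unicodedata.decimal(ch, None)


print(describe("\u02B9"))  # MODIFIER LETTER PRIME
print(describe("\u02C6"))  # MODIFIER LETTER CIRCUMFLEX ACCENT
print(describe("\u0663"))  # ARABIC-INDIC DIGIT THREE: Nd with a decimal value
print(describe("\u00B2"))  # SUPERSCRIPT TWO: a digit, but not a decimal digit
```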

20
UCD 4.0 (cont. 2)

Case Mapping
Fixed for Turkish, Lithuanian
Default Ignorables
Hangul Filler characters
Soft Hyphen, CGJ, ZWS
Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
Grapheme_Extend
removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

21
Unicode Technical Standard

UTS: separate standard
independent conformance requirements
UTR: information and guidelines
Documents may move from UTR status to UTS

22
UTS #10: Unicode Collation

Significance
String comparison, matching, searching
Compares all Unicode characters
Handles linguistic features
Accents, Case, Punctuation, Contextual weighting, ...
Tailor for different languages
Version 4.0.0 due Sept. 2003
From now on, to be synchronized in repertoire and version with the Unicode Standard.

23
UTS #18: Regular Expressions

Significance
Crucial to many applications: web, XML, ...
Unicode adds significant requirements
Level 1: Basic Support
Perl
Level 2: Extended Support
Level 3: Tailored Support
New
Recently approved as UTS (was UTR)
Adds clearer conformance requirements
Flexible list of features
Partial conformance claims

24
UTS #6: SCSU

Simple Unicode Compression
Added suitability for XML
See also Technical Note on BOCU
Main difference: preserves binary order
x < y => BOCU(x) < BOCU(y)
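
The order-preservation property can be illustrated without a BOCU implementation, which Python's standard library does not provide. The sketch below uses UTF-8 as a stand-in encoding that also preserves code point order, and UTF-16 as a counterexample; it says nothing about BOCU's actual byte layout, and the helper `preserves_order` is hypothetical.

```python
def preserves_order(x: str, y: str, encode) -> bool:
    """Check that comparing encoded bytes agrees with comparing code points."""
    return (x < y) == (encode(x) < encode(y))


def utf8(s: str) -> bytes:
    return s.encode("utf-8")


def utf16(s: str) -> bytes:
    return s.encode("utf-16-be")


x = "\uFF5E"       # a BMP character above the surrogate range
y = "\U00010000"   # first supplementary code point, so x < y

print(preserves_order(x, y, utf8))   # byte order agrees with code point order
print(preserves_order(x, y, utf16))  # surrogates sort below U+FF5E, so it does not
```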

25
New UTRs

Draft UTR #23: Character Properties
Draft Character Property Model
Character Folding
Hiragana-Katakana, Case, ...
Programming Language IDs, Syntax characters

26
Q & A

Other talks here
Common Locale Data
interchange of language-specific data for sorting, dates, times, currencies
ICU
premier Unicode enablement library
full-featured, cross-platform
C, C++, Java

27
Background Slides

28
Unicode 3.2 (March 2002)

New Characters: 1,016
Symbols
Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
Special Characters
combining grapheme joiner, word joiner, invisible operators for math, variation selectors
Modern Scripts
minority scripts of the Philippines

29
Conformance

Eliminates irregular UTF-8
Defines variation sequences
Replaces ZWNBSP with Word Joiner
Clarifies scope of combining marks (further revised in 4.0)
Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables, ...

30
Textual Clarifications

Combined vowels in Khmer, characters discouraged in Khmer
Use of dingbats

31
Unicode Standard Annexes

UAX #21: Case Mappings (was UTR)

32
Unicode Character Database

New properties
IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph
Default_Ignorable_Code_Point, Deprecated
Soft_Dotted, Logical_Order_Exception
Grapheme_Base, Grapheme_Extend, Grapheme_Link
DerivedAge
Normalization Corrections
Added Property and Property Value Aliases
Adds StandardizedVariants.html

33
Related Items

UTS #10: Unicode Collation Algorithm
Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters
Note: base version still Unicode 3.1
UTR #26: CESU-8
Unicode Technical Notes
Updated Character Encoding Stability Policy
Added Public Review process
Updated Glossary

34
Unicode 3.1 (March 2001)

New Characters: 44,946
First supplementaries encoded!
Modern scripts
CJK Ideographs (now totaling 71,039)
Historic scripts
Old Italic, Gothic, Deseret, Byzantine Musical Symbols
Symbols
Mathematical Alphanumeric Symbols, (Western) Musical Symbols

35
Conformance

Non-shortest-form UTF-8 excluded
Clarification of the stability of the standard, code units vs. code points, noncharacters, normative properties, informative properties, normative references
Revisions of guidelines
wchar_t, unassigned code points, identifiers
Major revision of Georgian
Use of ZWNJ and ZWJ for ligatures
Language tag characters encoded but discouraged

36
Unicode Standard Annexes

UAX #19: UTF-32

37
Unicode Character Database

Major revision of PropList properties
White_Space, Bidi_Control, Join_Control, Hex_Digit
Alphabetic, Ideographic, Lowercase, Uppercase
ID_Start, ID_Continue, XID_Start, XID_Continue
Noncharacter_Code_Point
Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
New properties: Case folding, Scripts
Added DerivedProperties, NormalizationTest

38
Related Items