Optimizing the Usage of Normalization

Dublin, Ireland, May 2002.

Provided by: icupr
Learn more at: https://icu-project.org

1
Optimizing the Usage of Normalization
  • Vladimir Weinstein
  • vweinste_at_us.ibm.com

Globalization Center of Competency, San Jose, CA
2
Introduction
  1. The Unicode Standard has multiple ways to encode
    equivalent strings
  2. Accents that don't interact are put into a unique
    order
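
Both points can be illustrated with a short sketch. It uses Python's standard `unicodedata` module rather than ICU, and the sample strings are chosen here for illustration only.

```python
import unicodedata

# 1. Two encodings of the same text.
precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

different_encoding = precomposed != decomposed
same_after_nfd = (unicodedata.normalize("NFD", precomposed)
                  == unicodedata.normalize("NFD", decomposed))

# 2. Dot above (combining class 230) and dot below (class 220) do not
#    interact, so normalization puts them into one fixed order:
#    the lower combining class comes first.
above_then_below = "a\u0307\u0323"
below_then_above = "a\u0323\u0307"
reordered = unicodedata.normalize("NFD", above_then_below)
```

Here `different_encoding` and `same_after_nfd` are both true, and `reordered` equals `below_then_above`.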

3
Introduction (contd.)
  • Normalization provides a way to transform a
    string to a unique form (NFD, NFC)
  • Strings that can be transformed to the same form
    are called canonically equivalent
  • Time-critical applications need to minimize the
    number of passes over the text
  • ICU gives a number of tools to deal with this
    problem
  • We will use collation (language-sensitive string
    comparison) as an example

4
Avoiding Normalization
  • Force users to provide already normalized data
  • The performance problem does not go away
  • When the strings are processed many times, it
    could be beneficial to normalize them beforehand
  • Forcing users to provide a specific form can be
    unpopular

5
Check for Normalized Text
  • Most strings are already in normalized form
  • Quick Check is significantly faster than the full
    normalization
  • Needs canonical class data and additional data
    for checking the relation between a code point
    and a normalization form
  • Algorithm in UAX #15, Annex 8
    (http://www.unicode.org/unicode/reports/tr15/#Annex8)
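
A minimal sketch of a quick check for NFD only, in Python with the standard `unicodedata` module (not ICU). NFD is used because its quick check only answers yes or no; the NFC check can also answer "maybe" and needs property data that `unicodedata` does not expose. The function names are hypothetical.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables


def has_canonical_decomposition(ch):
    """True if ch on its own changes under NFD."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return True  # Hangul syllables decompose algorithmically
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; skip those.
    return bool(mapping) and not mapping.startswith("<")


def quick_check_nfd(text):
    """One pass over text, with no normalization and no allocation."""
    last_class = 0
    for ch in text:
        if has_canonical_decomposition(ch):
            return False
        cls = unicodedata.combining(ch)  # canonical combining class
        if cls != 0 and cls < last_class:
            return False  # combining marks are out of canonical order
        last_class = cls
    return True
```

The two lookups per code point correspond to the two kinds of data named above: the canonical class and the relation between a code point and the normalization form.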

6
Normalize Incrementally
  • Instead of normalizing the whole string at once,
    normalize one piece at a time
  • This technique is usually combined with an
    incremental Quick Check
  • Useful for procedures with early exit, such as
    string comparison or scanning
  • Normalizes up to the next safe point

7
Incremental Normalization Example
  • Initial string: résumé
  • Non-incremental normalization: the whole string
    is processed by normalization
  • Incremental normalization: normalize just the
    parts that fail the quick check
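
The résumé example can be sketched in Python (3.8 or later, for `unicodedata.is_normalized`). This is an illustration for NFD only and not ICU's implementation: `is_safe_point` and `nfd_pieces` are hypothetical names, and a real engine would find safe points from lookup tables instead of normalizing single characters.

```python
import unicodedata


def is_safe_point(ch):
    """A piece may start at ch if its decomposition begins with a starter."""
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0


def nfd_pieces(text):
    """Lazily yield (piece, was_rewritten); the pieces join to NFD(text)."""
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or is_safe_point(text[i]):
            piece = text[start:i]
            if unicodedata.is_normalized("NFD", piece):
                yield piece, False  # passed the check, returned untouched
            else:
                yield unicodedata.normalize("NFD", piece), True
            start = i


word = "r\u00e9sum\u00e9"  # résumé with precomposed é
pieces = list(nfd_pieces(word))
rewritten = sum(1 for _, changed in pieces if changed)
```

Only the two é pieces are rewritten. Because `nfd_pieces` is a generator, a caller that exits early never touches the rest of the string.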
8
Optimized Concatenation
  • Simple concatenation of two normalized strings
    can yield a string that is not normalized
  • One option is to normalize the result
  • Unnecessarily duplicates normalization

9
Optimized Concatenation Example
  • It is enough to normalize the boundary parts
  • Incremental normalization is used
  • Much faster than redoing the whole resulting
    string
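
A sketch of boundary-only concatenation, assuming both inputs are already in NFD. It is written in Python with the standard `unicodedata` module; `concat_nfd` is a hypothetical helper, not an ICU function. NFC would need additional boundary data, so it is left out here.

```python
import unicodedata


def concat_nfd(left, right):
    """Join two NFD strings, renormalizing only around the seam."""
    # Last starter in left: nothing before it can be affected by right.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break
    # First starter in right: nothing from there on can be affected by left.
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    seam = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + seam + right[j:]


left = "a\u0307"    # a + dot above, already NFD
right = "\u0323b"   # dot below + b, already NFD
joined = concat_nfd(left, right)
```

Plain `left + right` leaves the two dots in the wrong order, while `joined` matches what full normalization of the concatenated string would produce.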

10
Accepting the FCD Form
  • Fast Composed or Decomposed form is a partially
    normalized form
  • Not unique
  • More lenient than NFD or NFC form
  • It requires that the procedure has support for
    all the canonically equivalent strings on input
  • It is possible to quick check the FCD form

11
FCD Form Examples
SEQUENCE              FCD  NFC  NFD
A-ring                 Y    Y    -
Angstrom               Y    -    -
A + ring               Y    -    Y
A + grave              Y    -    Y
A-ring + grave         Y    -    -
A + cedilla + ring     Y    -    Y
A + ring + cedilla     -    -    -
A-ring + cedilla       -    Y    -
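
The FCD column can be checked with a short sketch. It assumes the usual formulation of the FCD test: for each pair of adjacent code points, the last combining class of the first code point's decomposition must not exceed the first combining class of the second's, unless the latter is zero. The code is Python with the standard `unicodedata` module, and `is_fcd` is a hypothetical name.

```python
import unicodedata


def is_fcd(text):
    """Check the FCD condition on each pair of adjacent code points."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])    # leading class
        trail = unicodedata.combining(decomposed[-1])  # trailing class
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


A_RING, ANGSTROM = "\u00c5", "\u212b"
RING, GRAVE, CEDILLA = "\u030a", "\u0300", "\u0327"

fcd_column = {
    "A-ring": is_fcd(A_RING),
    "Angstrom": is_fcd(ANGSTROM),
    "A + ring": is_fcd("A" + RING),
    "A + grave": is_fcd("A" + GRAVE),
    "A-ring + grave": is_fcd(A_RING + GRAVE),
    "A + cedilla + ring": is_fcd("A" + CEDILLA + RING),
    "A + ring + cedilla": is_fcd("A" + RING + CEDILLA),
    "A-ring + cedilla": is_fcd(A_RING + CEDILLA),
}
```

The last two rows fail because the ring (class 230) comes before the cedilla (class 202), whether the ring is written separately or is part of A-ring.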
12
Canonical Closure
  • Preprocessing data to support the FCD form
  • Ensures that if data is assigned to a sequence
    (or a code point) it will also be assigned to all
    canonically equivalent FCD sequences

13
Collation
  • Locale specific sorting of strings
  • Relation between code points and collation
    elements
  • Context sensitive
  • Contractions: H < Z, but CZ < CH
  • Expansions: OE < Œ < OF
  • Both: ?? < ?? or ?? > ??
  • See Collation in ICU by Mark Davis
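
The contraction and expansion examples can be reproduced with a toy two-level sort key in Python. The weights and tables below are invented for illustration and are not ICU's collation data or API.

```python
import string

# Hypothetical primary weights: a=10, b=20, ..., z=260.
PRIMARY = {c: (i + 1) * 10 for i, c in enumerate(string.ascii_lowercase)}
# Contraction: two letters map to ONE element, sorted just after "h".
CONTRACTIONS = {"ch": PRIMARY["h"] + 5}
# Expansion: one letter maps to TWO elements.
EXPANSIONS = {"\u0153": ("o", "e")}  # œ


def toy_sort_key(text):
    """Return (primary weights, secondary weights) for ASCII letters and œ."""
    primary, secondary = [], []
    text = text.lower()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            primary.append(CONTRACTIONS[pair])
            secondary.append(0)
            i += 2
        elif text[i] in EXPANSIONS:
            for letter in EXPANSIONS[text[i]]:
                primary.append(PRIMARY[letter])
                secondary.append(1)  # breaks the tie against plain "oe"
            i += 1
        else:
            primary.append(PRIMARY[text[i]])
            secondary.append(0)
            i += 1
    return primary, secondary
```

With these tables, `toy_sort_key` orders H before Z but CZ before CH, and OE before Œ before OF. The mapping from code points to collation elements depends on context, which is why canonically equivalent spellings need the same treatment.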

14
Collation Implementation in ICU
  • Two modes of operation:
  • Normalization OFF: expects the users to pass in
    FCD strings
  • Normalization ON: accepts any strings
  • Some locales require normalization to be turned
    on
  • Canonical closure done for contractions and
    regular mappings
  • Two important services:
  • Sort key generation
  • String compare function
  • More about ICU at the end of presentation

15
FCD Support in Collation
  • Much higher performance
  • Values assigned to a code point or a contraction
    are equal to those for its FCD canonically
    equivalent sequences
  • This process is time consuming, but it is done at
    build time
  • May increase data set

16
Sort Key Generation
  • Whole strings are processed
  • Sort keys tend to get reused, so the emphasis is
    on producing sort keys that are as short as
    possible
  • Two modes of operation:
  • Normalization ON: strings are quick checked and
    normalization is performed if required
  • Normalization OFF: depends on strings being in
    FCD form. The performance increases by 20 to 50%
17
String Compare
  • Very time critical
  • Result is usually determined before fully
    processing both strings
  • First step is binary comparison for equality
  • When it fails, comparison continues from a safe
    spot
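
The two steps can be sketched in Python. Assumptions: `toy_key` (the NFD code point sequence) stands in for real collation elements, and a "safe spot" here only accounts for combining marks, whereas a real collator must also back up past possible contractions. All names are hypothetical.

```python
import unicodedata


def toy_key(text):
    # Stand-in for collation elements: the NFD code point sequence.
    return unicodedata.normalize("NFD", text)


def is_unsafe(text, pos):
    """True if text[pos] could interact with the characters before it."""
    if pos >= len(text):
        return False
    first = unicodedata.normalize("NFD", text[pos])[0]
    return unicodedata.combining(first) != 0


def compare(a, b):
    if a == b:  # step 1: binary comparison for equality
        return 0
    # Skip the identical prefix...
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    # ...then back up to a safe spot in both strings.
    while n > 0 and (is_unsafe(a, n) or is_unsafe(b, n)):
        n -= 1
    # Step 2: only the remainders go through the expensive path.
    ka, kb = toy_key(a[n:]), toy_key(b[n:])
    return (ka > kb) - (ka < kb)
```

For "caf" + é against "cafe" + combining acute, the binary check fails, the shared prefix "caf" is skipped, and only the final character of each string is processed before the strings are found equal.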

18
String Compare Continued
  • Normalization ON: incremental FCD check and
    incremental FCD normalization if required
  • Normalization OFF: assumes that the source
    strings are FCD
  • Most locales don't require normalization on and
    thus are 20% faster by using FCD
19
International Components for Unicode
  • International Components for Unicode (ICU) is a
    library that provides robust and full-featured
    Unicode support
  • The ICU normalization engine supports the
    optimizations mentioned here
  • Library services accept FCD strings as input
  • Wide variety of supported platforms
  • Open source (X license, non-viral)
  • C/C++ and Java versions
  • http://oss.software.ibm.com/icu/

20
Conclusion
  • The presented techniques allow much faster string
    processing
  • In the case of collation, sort key generation
    gets up to 50% faster than if normalizing
    beforehand
  • String compare function becomes up to 3 times
    faster!
  • May increase data size
  • Canonical closure preprocessing takes more time
    to build, but pays off at runtime

21
Q & A

22
Summary
  • Introduction
  • Avoiding normalization
  • Check for normalized text
  • Normalize incrementally
  • Concatenation of normalized strings
  • Accepting the FCD form
  • Implementation of collation in ICU