The Ideographic Composition Scheme and Its Applications in Chinese Text Processing

1
The Ideographic Composition Scheme and Its
Applications in Chinese Text Processing
  • Qin LU
  • Department of Computing, The Hong Kong
    Polytechnic University
  • Introduction to Ideograph Description Characters
  • The ideographic composition scheme
  • The Hong Kong Glyph Specification Project

2
What Are Ideograph Description Characters (IDCs)?
  • 12 structure symbols used to describe the
    formation of characters using some smaller
    ideograph functional units
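
As a small illustration, the 12 structure symbols can be listed from the Unicode character database. This sketch assumes the block U+2FF0 to U+2FFB, where the 12 IDCs were encoded; the variable name IDCS is chosen here only for illustration.

```python
import unicodedata

# The 12 Ideographic Description Characters occupy U+2FF0..U+2FFB.
IDCS = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]

for idc in IDCS:
    # Print the code point, the symbol itself, and its Unicode name.
    print(f"U+{ord(idc):04X} {idc} {unicodedata.name(idc)}")
```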

3
Characteristics of Ideographs
  • Ideograph characters are often formed from smaller
    ideographic elements such as radicals, ideographs
    proper, and other ideographic components
  • Composition is natural in the formation of characters
  • Examples: 2 components
  • Chinese speakers have long used components to
    describe characters, especially to distinguish
    characters with the same pronunciation

4
Problems with Ideograph Character Encoding
  • Each character is treated as a different symbol,
    and thus given its own code point
  • Code point assignment in a block does try to
    follow radical order, but it does not consider
    the substructures (components), so that
    information is not revealed.
  • When a new character is created, a code point
    must be allocated, leading to a potentially
    endless standardization process
  • Encoding rarely used ideograph characters wastes
    resources, both in code space and in
    standardization effort

5
Introduction of IDCs
  • Work started in 1995 by ISO/IEC SC2/WG2/IRG
  • Objective of the original proposal: use coded
    ideographs and structure symbols to describe
    not-yet-coded ideographs.
  • The original proposal had 15 Ideograph Structure
    Symbols, based on a study of Han characters; three
    of them did not make it into ISO 10646/Unicode:
  • Ideograph_Proper: every coded character is
    considered an ideograph proper, thus not needed
  • Left_Up_Encompass: no unencoded example
  • Mirror_Symmetry: the left is mirrored to the
    right, but this can be described by Left_to_Right
  • The remaining 12 symbols were renamed Ideograph
    Description Characters

6
Ideographic Composition Scheme
  • An Ideographic Description Sequence (IDS) describes
    a character using its components and indicates
    the relative positions of the components.
  • IDCs are considered operators on the components.
  • IDSs can be expressed by a context-free grammar
    in Backus-Naur Form. The grammar G has
    four components
  • Let G = (Σ, N, P, S), where
  • Σ: the set of terminal symbols: coded radicals,
    coded ideographs, and the 12 IDCs.
  • N: the set of 5 non-terminal symbols
  • N = {IDS, IDS1, Binary_Symbol, Ternary_Symbol,
    Ideograph_Component}
  • S = IDS, which is the start symbol of the
    grammar
  • P: a set of rewrite rules

7
  • <IDS> ::= <Binary_Symbol><IDS1><IDS1>
    | <Ternary_Symbol><IDS1><IDS1><IDS1>
  • <IDS1> ::= <IDS> | <Ideograph_Component>
  • <Ideograph_Component> ::= coded_ideograph
    | coded_radical | coded_component
  • <Binary_Symbol> ::= ⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻
  • <Ternary_Symbol> ::= ⿲ | ⿳
  • Note that even though the IDCs are terminal
    symbols, they are not part of the ideograph
    components.
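
The grammar above can be sketched as a small recursive-descent parser. This is a minimal illustration, not the project's implementation: the function names and the tuple-based tree are assumptions made here, and any character that is not an IDC is simply treated as an ideograph component without checking that it is a coded ideograph or radical.

```python
# Binary IDCs take two operands; U+2FF2 and U+2FF3 take three.
BINARY = {chr(c) for c in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
TERNARY = {chr(0x2FF2), chr(0x2FF3)}


def _parse_ids1(seq, pos):
    """Parse <IDS1> at seq[pos]; return (tree, next position)."""
    if pos >= len(seq):
        raise ValueError("unexpected end of sequence")
    ch = seq[pos]
    if ch in BINARY:
        arity = 2
    elif ch in TERNARY:
        arity = 3
    else:
        # Anything else is treated as an <Ideograph_Component>.
        return ch, pos + 1
    pos += 1
    operands = []
    for _ in range(arity):
        operand, pos = _parse_ids1(seq, pos)
        operands.append(operand)
    return (ch, *operands), pos


def parse_ids(seq):
    """Parse a full <IDS>, which must begin with an IDC."""
    if not seq or seq[0] not in BINARY | TERNARY:
        raise ValueError("an IDS must start with an IDC")
    tree, pos = _parse_ids1(seq, 0)
    if pos != len(seq):
        raise ValueError("trailing characters after a complete IDS")
    return tree
```

For instance, parse_ids("⿰木木") yields a left-to-right node with two 木 operands, and a sequence such as "⿰木" is rejected because an operand is missing.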

8
Examples
9
  • IDS allows a character to be described by
    different sequences
  • That is, the composition scheme allows a character
    to be formed from different component characters
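
One way to compare two different sequences for the same character is to expand every component that itself has a decomposition. The sketch below assumes a hypothetical one-level decomposition table (a plain dict) with no cycles; because an IDS is written in prefix form, substituting a component by its own IDS still gives a valid IDS. It does not reconcile sequences that split a character along different boundaries.

```python
def expand(ids, table):
    """Recursively replace each component that has an entry in `table`
    (a hypothetical one-level decomposition dict) by its own IDS."""
    parts = []
    for ch in ids:
        if ch in table:
            parts.append(expand(table[ch], table))
        else:
            parts.append(ch)
    return "".join(parts)


# Illustrative table only: 林 decomposes into two 木 side by side.
table = {"林": "⿰木木"}

# Two descriptions of 霖: one uses 林 as a unit, one spells it out.
print(expand("⿱雨林", table) == expand("⿱雨⿰木木", table))
```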

10
  • IDS describes ideographic character composition
    at the abstract level. It indicates the relative
    positions of the components, but does not
    indicate their proportions.
  • Not intended for rendering.
  • Nesting is natural in ideographs and is
    reflected in the IDS scheme
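
Nesting can be measured directly on the sequence. The following sketch, with names chosen here for illustration, makes a single pass over a well-formed IDS and tracks how many operators are still waiting for operands; it assumes the 12 IDCs at U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands.

```python
# Operand count for each of the 12 IDCs.
ARITY = {chr(c): 2 for c in range(0x2FF0, 0x2FFC)}
ARITY[chr(0x2FF2)] = 3
ARITY[chr(0x2FF3)] = 3


def nesting_depth(ids):
    """Return the maximum operator nesting depth of a well-formed IDS."""
    pending = []  # operands still needed by each open operator
    depth = 0
    for ch in ids:
        if ch in ARITY:
            pending.append(ARITY[ch])
            depth = max(depth, len(pending))
            continue
        # A component fills one operand slot; a completed operator
        # in turn fills one slot of its parent.
        while pending:
            pending[-1] -= 1
            if pending[-1] > 0:
                break
            pending.pop()
    return depth
```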

11
Extending the Objectives of IDCs
  • Using coded characters to describe not-yet-coded
    ideographs, both for representation and exchange
  • Limiting standardization to modern characters
    only, excluding rarely used characters
  • Learning of character composition(education)
  • Revealing substructures of ideograph characters
  • Description of ideograph variants

12
The Hong Kong Glyph Specification Project
  • Objectives of the project
  • Provide computer (font) vendors with a set of
    glyph specifications for all ISO 10646 characters
    and the Hong Kong Supplementary Character Set
    that adhere to Hong Kong's common writing style,
    so as to facilitate publishing in HK.
  • Effectively an H column as a horizontal extension
    of ISO 10646 (horizontal extension is a confusing
    concept to many)
  • Different styles are due to Chinese character
    variants

13
Major References Leading to This Project
  • The Hong Kong Institute of Education's book The
    Common Character Glyph Set (常用字字形表), published
    in 1997 and revised in 2000 for elementary school
    education
  • Number of characters: 4,751
  • Hand-written, with some inconsistencies and
    variants
  • Hong Kong Supplementary Character Set (HKSCS,
    4,702 characters), published in Sept. 1999; some
    GCCS characters were unified with Big5 characters,
    even though they are variants
  • Extension to HKSCS: 97 characters
  • 69 in the BMP (including 10 in Extension A), 22 in
    Ext. B and perhaps 6 in Ext. C
  • State Language Commission of China, Chinese
    Character Component Standard of the GB 13000.1
    Character Set for Information Processing
    (GF 3001-1997)
  • Industrial Support Fund: support for the Hong
    Kong glyph specification, HKD 3.67M

14
Problems and Scope of Work
  • The CCS gives only 4,751 characters, but ISO
    10646 has 27,484 characters, plus over 1,000 HKSCS
    characters in Ext. B
  • Avoid listing every character in ISO 10646
    by specifying components instead.
  • The rationale: if "bone" (骨) should be written in
    a certain style, any character with bone as a
    component should follow the same style
  • Characters in ISO 10646 that are out of scope:
  • Simplified characters follow the mainland glyphs
  • Characters from a single country/region source
    (no unification; the Chinese GE source is not
    considered an independent source) follow the
    glyphs provided in ISO 10646
  • Special working group was set up in October 2000
  • www.comp.polyu.edu.hk/glyphwg

15
  • Component Table
  • Based on the components defined in GF3001-1997,
    with a set of decomposition rules
  • Total of 620 components
  • Some components are not coded; for these we use
    internally created codes
  • Character Decomposition Table
  • Has one entry for each character with its
    decomposition sequence (using minimum
    decomposition, one level only)
  • Characters that are commonly recognized as radicals
    or components are not decomposed further
  • Structure symbols are kept to facilitate
    both upward search and downward search

16
(No Transcript)
17
  • Upward search: find all characters that contain
    a given component
  • Downward search: find all components of a given
    character
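
The two searches can be sketched over a one-level decomposition table like the one described earlier. The table contents, the dict layout, and the function names below are illustrative assumptions, not the project's actual data or code; structure symbols are kept in each entry and skipped when collecting components.

```python
# The 12 IDCs, skipped when collecting components from an entry.
IDCS = {chr(c) for c in range(0x2FF0, 0x2FFC)}

# Hypothetical one-level decomposition table (character -> IDS).
TABLE = {
    "林": "⿰木木",
    "森": "⿱木林",
    "霖": "⿱雨林",
}


def direct_components(ch):
    """Components listed in the single table entry for `ch`, if any."""
    return [c for c in TABLE.get(ch, "") if c not in IDCS]


def downward(ch):
    """All components of `ch`, following entries level by level."""
    found = set()
    todo = direct_components(ch)
    while todo:
        c = todo.pop()
        if c not in found:
            found.add(c)
            todo.extend(direct_components(c))
    return found


def upward(component):
    """All table characters containing `component` at any level."""
    return {ch for ch in TABLE if component in downward(ch)}
```

With this toy table, a downward search on 霖 reaches 雨, 林 and 木, while an upward search on 木 returns all three table entries. A real table would likely precompute an inverted index rather than rescanning every entry for each upward query.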

18
Conclusion
  • IDCs were introduced in Unicode 3.0
  • Their use is going beyond the original objective
  • We have already created an application using
    these symbols in the Hong Kong Glyph
    Specification, which is due out this year
  • IDCs should also be useful in ideograph variant
    specifications

19
  • Appendix
  • Components in Big5 and HKSCS
    not yet in Unicode

20
(No Transcript)