Mapping multiple South Asian 8-bit character sets to the Unicode Standard


Title: Mapping multiple South Asian 8-bit character sets to the Unicode Standard


1
Mapping multiple South Asian 8-bit character sets
to the Unicode Standard
  • Paul Baker, Tony McEnery, Mark Leisher, Hamish
    Cunningham and Robert Gaizauskas

2
Introduction
  • What happens when standardization fails?
  • Representing Indic language data
  • Converting South Asian Writing Systems from 8 bit
    to Unicode
  • Converting Indic languages
  • A few notes: (i) this talk is from a corpus linguist's
    perspective; (ii) the technical problems presented
    actually impede my work. These are problems to
    which I need solutions!

3
Unicode
  • The obvious standard for a web based architecture
    (see Constable) - though harmonising to one 8/16
    bit representation per writing system is a
    possibility
  • For languages with an 8 bit standard which is
    widely adhered to this may not seem so necessary
  • But for a wide range of languages where 8 bit
    standardization has not been established or
    successful, it is much more useful

4
What happens when standardization fails?
  • South Asian languages are good examples of the
    failure of standardization
  • There ARE standards; they are simply not adhered
    to
  • The standards came too late, and now compete with
    well established rival commercial/shareware
    standards
  • These standards and rivals are mutually
    incompatible to varying degrees

5
For example, Panjabi (k, g, t)
kgt (Anandpur Sahib, Maboli Systems Inc.)
kgt (Gurbani, Gurbani Foundation)
kgt (Panjabi, Hardip Singh Pannu)
kgt (WCGurumukhi, Duke University)

6
  • <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
    nMU zulm dy bMDn qOVn leI pRyr rhy sn </s>
    (Gurbani)
  • <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
    nMU zulm dy bMDn qOVn leI pRyr rhy sn </s>
    (Panjabi)
  • <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
    nMU zulm dy bMDn qOVn leI pRyr rhy sn </s>
    (Anandpur)
  • <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
    nMU zulm dy bMDn qOVn leI pRyr rhy sn </s>
    (ASCII)

7
Representing Indic language data on the web
  • Used
  • Graphics (jpegs, gifs)
  • 8 bit fonts
  • Dynamic fonts
  • Not Used
  • Unicode
  • ISCII

8
Some software to achieve this
Legacy Material Here
Graphics
8 bit solutions
Legacy Material Here
Some software to achieve this (UniEdit, NCST/Lancs)
Unicode
No software to achieve this
No material here

9
Converting South Asian Writing Systems from 8 bit
to Unicode
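  • Certain co-occurring consonants may form what is
    known as a conjunct: a single new glyph
  • Consonant TTA (Unicode 091F)
  • Consonant TTHA (Unicode 0920)
  • Consonants TTA + virama + TTHA (Unicode 091F 094D
    0920): the conjunct, encoded as the single
    character Ø in the 8-bit font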

10
  • Some 8-bit fonts encode each conjunct as a separate
    character, while Unicode does not store the
    conjunct as a single character at all, expecting
    it to be rendered from its component characters.
    So any mapping may include one-to-many mappings,
    where one 8-bit character must be converted to
    two or more Unicode characters (an 8-bit Ø
    character would be mapped to Unicode 091F 094D
    0920, because Unicode has no single character for
    this conjunct).
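
A one-to-many mapping of this kind can be sketched as a small lookup table. This is only an illustration, not the authors' software: the table name FONT_TO_UNICODE and the function to_unicode are hypothetical, the byte 0xD8 assumes that the 8-bit Ø sits at its Latin-1 position, and the byte values given for TTA and TTHA are invented.

```python
# Hypothetical one-to-many table for a single 8-bit Devanagari font.
# Only the Ø -> 091F 094D 0920 entry comes from the slide; 0xD8 assumes
# Ø is at its Latin-1 position, and the other byte values are invented.
FONT_TO_UNICODE = {
    0x3E: "\u091F",              # assumed byte for consonant TTA
    0x3F: "\u0920",              # assumed byte for consonant TTHA
    0xD8: "\u091F\u094D\u0920",  # conjunct: TTA + virama + TTHA
}


def to_unicode(data: bytes) -> str:
    """Map each 8-bit code to one or more Unicode characters.

    Bytes missing from the table are passed through as Latin-1.
    """
    return "".join(FONT_TO_UNICODE.get(b, chr(b)) for b in data)


converted = to_unicode(bytes([0xD8]))
print([f"{ord(c):04X}" for c in converted])
```

One input byte yields three code points here, so the converted text can be longer than the source even though nothing has been added to it.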

11
  • Another difference that occurs between 8-bit and
    Unicode encodings of South Asian text relates to
    the ordering of characters which cause a
    rendering rule to become active. To render ?
    onscreen with Unicode, the sequence must be
    encoded as ? j O . With an 8-bit font such
    as WCDevanagari_08 this is encoded as ? followed
    by . So resequencing of characters -
    many-to-many mappings - is also needed.
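
Resequencing can be sketched with one well-known case, assumed here for illustration and not necessarily the case shown on this slide: the Devanagari vowel sign I (Unicode 093F) is drawn to the left of its consonant, so a visually ordered 8-bit font stores it before the consonant, whereas Unicode stores it after. The byte values and the names PREBASE, BASE and visual_to_logical are all hypothetical.

```python
# Hypothetical byte assignments for a visual-order 8-bit font.
PREBASE = {0x69: "\u093F"}                # vowel sign I, stored before its consonant
BASE = {0x6B: "\u0915", 0x67: "\u0917"}   # consonants KA and GA


def visual_to_logical(data: bytes) -> str:
    """Convert visual-order bytes to Unicode logical order.

    A pre-base sign is held back and emitted after the next consonant.
    """
    out = []
    pending = ""  # pre-base sign(s) still waiting for a consonant
    for b in data:
        if b in PREBASE:
            pending += PREBASE[b]
        elif b in BASE:
            out.append(BASE[b] + pending)
            pending = ""
        else:
            # Unmapped byte: flush any stranded sign, then pass it through.
            out.append(pending + chr(b))
            pending = ""
    out.append(pending)
    return "".join(out)


print([f"{ord(c):04X}" for c in visual_to_logical(bytes([0x69, 0x6B]))])
```

This sketch attaches the sign to the very next consonant only. A real mapping would also have to carry it past a whole conjunct cluster, which is part of why such mappings take effort to author.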

12
Converting Indic languages
  • Need one - a standardised format for encoding
    mappings
  • Need two - software to implement encoded mappings
  • Need three - a library of mappings

13
A standardised format for encoding mappings
  • Albright
  • TEI WSDs (McEnery, Baker Burnard 2000)
  • Unicode technical committee

14
Software to implement encoded mappings
  • In essence fairly trivial software - simply
    implementing mappings
  • No general software that can achieve the task
    that I am aware of
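
A minimal sketch of what such mapping software might look like, assuming the mappings are supplied as a table from byte sequences to Unicode strings. The function convert, the table TABLE and its entries are hypothetical; this is not UniEdit or any existing tool. Matching the longest byte sequence first lets a single table hold one-to-one, one-to-many and many-to-many entries.

```python
def convert(data: bytes, table: dict[bytes, str]) -> str:
    """Apply a mapping table to 8-bit text, longest match first.

    Bytes with no matching entry are passed through as Latin-1.
    """
    longest = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(data):
        for size in range(min(longest, len(data) - i), 0, -1):
            chunk = data[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


# Hypothetical table: the byte values are invented for illustration.
TABLE = {
    b"\xd8": "\u091F\u094D\u0920",  # one-to-many: conjunct TTA + virama + TTHA
    b"ik": "\u0915\u093F",          # many-to-many: resequenced vowel sign
    b"k": "\u0915",                 # one-to-one: consonant KA
}

print([f"{ord(c):04X}" for c in convert(b"\xd8ik", TABLE)])
```

The code itself is small; the effort lies in filling a table like TABLE correctly for every font.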

15
A Library of Mappings
  • This is where the hard work is
  • These mappings are not simply required for South
    Asian languages
  • Yet even the task for South Asian languages is
    daunting

16
Conclusion
  • Currently working towards needs one, two and
    three with the National Centre for Software
    Technology, Mumbai
  • While harmonisation is technically feasible, the
    principal stumbling block, in my view, is
    the substantial effort that will be required to
    author mappings
  • This is a real issue for users, creators and
    archivists; it impacts upon User Requirements cell
    6, Creator Requirements cell 4, Archivist
    Requirements cells 1, 3 and 8