Title: Mapping multiple South Asian 8-bit character sets to the Unicode Standard
1 Mapping multiple South Asian 8-bit character sets
to the Unicode Standard
- Paul Baker, Tony McEnery, Mark Leisher, Hamish
Cunningham and Robert Gaizauskas
2 Introduction
- What happens when standardization fails?
- Representing Indic language data
- Converting South Asian Writing Systems from 8 bit
to Unicode
- Converting Indic languages
- A few notes: (i) I talk from a corpus linguist's
perspective; (ii) the technical problems presented
actually impede my work. These are problems to
which I need solutions!
3 Unicode
- The obvious standard for a web based architecture
(see Constable), though harmonising to one 8/16
bit representation per writing system is a
possibility
- For languages with an 8 bit standard which is
widely adhered to this may not seem so necessary
- But for a wide range of languages where 8 bit
standardization has not been established/successful
it is much more useful
4 What happens when standardization fails?
- South Asian languages are good examples of the
failure of standardization
- There ARE standards; they are simply not adhered
to
- The standards came too late, and now compete with
well established rival commercial/shareware
standards
- These standards and rivals are mutually
incompatible to varying degrees
5 For example, Panjabi (k, g, t)
kgt (Anandpur Sahib, Maboli Systems Inc.)
kgt (Gurbani, Gurbani Foundation)
kgt (Panjabi, Hardip Singh Pannu)
kgt (WCGurumukhi, Duke University)
6
- <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
nMU zulm dy bMDn qOVn leI pRyr rhy sn</s>
(Gurbani)
- <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
nMU zulm dy bMDn qOVn leI pRyr rhy sn</s>
(Panjabi)
- <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
nMU zulm dy bMDn qOVn leI pRyr rhy sn</s>
(Anandpur)
- <s>auh Awpny prcwr rwhIN dbI qy sihmI hoeI jnqw
nMU zulm dy bMDn qOVn leI pRyr rhy sn</s>
(ASCII)
7 Representing Indic language data on the web
- Used
  - Graphics (jpegs, gifs)
  - 8 bit fonts
  - Dynamic fonts
- Not Used
  - Unicode
  - ISCII
8
- Graphics: legacy material here; some software to
achieve this
- 8 bit solutions: legacy material here; some
software to achieve this (UniEdit, NCST/Lancs)
- Unicode: no material here; no software to achieve
this
9 Converting South Asian Writing Systems from 8 bit
to Unicode
- Certain co-occurring consonants may form what is
known as a conjunct: a single new glyph
- Consonant TTA: ट (U+091F)
- Consonant TTHA: ठ (U+0920)
- Consonants TTA + virama + TTHA: the conjunct ट्ठ
(the single character Ø in the 8-bit font)
10
- Some 8-bit fonts encode conjuncts as separate
characters, whereas Unicode does not store a
conjunct as a single character at all; it expects
the conjunct to be rendered from its component
characters. So any mapping may include one-to-many
mappings, where one 8-bit character is mapped to
two or more Unicode characters. For example, the
8-bit Ø character would be mapped to Unicode
091F 094D 0920, because there is no single
character for Ø in Unicode.
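The one-to-many case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table is a hypothetical fragment, not the real layout of any font. The Ø entry follows the example in the text, and its byte value 0xD8 assumes the 8-bit data is being viewed as Latin-1; the other two entries are invented to show ordinary one-to-one rows alongside it.

```python
# Hypothetical fragment of an 8-bit font table: byte value -> Unicode string.
# Only the 0xD8 (Ø) entry follows the example in the text, and its byte value
# assumes the font data is viewed as Latin-1. The other entries are invented.
FONT_TO_UNICODE = {
    0xD8: "\u091F\u094D\u0920",  # conjunct: TTA + VIRAMA + TTHA (one-to-many)
    0x41: "\u091F",              # hypothetical: TTA alone (one-to-one)
    0x42: "\u0920",              # hypothetical: TTHA alone (one-to-one)
}


def to_unicode(data: bytes) -> str:
    """Map each 8-bit code to its Unicode sequence.

    Bytes with no entry in the table become U+FFFD (replacement character)
    so that gaps in the mapping stay visible in the output.
    """
    return "".join(FONT_TO_UNICODE.get(b, "\uFFFD") for b in data)


converted = to_unicode(b"\xd8")
print([f"U+{ord(c):04X}" for c in converted])
```

One input byte yields three Unicode characters, so the output can be longer than the input and byte offsets into the original text no longer line up with character offsets in the result.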
11
- Another difference between 8-bit and Unicode
encodings of South Asian text relates to the
ordering of the characters that cause a rendering
rule to become active. To render ? onscreen with
Unicode, the sequence must be encoded as ? j O .
With an 8-bit font such as WCDevanagari_08 this
is encoded as ? followed by . So resequencing of
characters (many-to-many mappings) is also needed.
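A minimal sketch of resequencing, using an example that is an assumption rather than the one on the slide: the Devanagari vowel sign I (U+093F), which is drawn to the left of its consonant. Unicode stores it after the consonant cluster (logical order), while a font that stores text in visual order would place it first. The function name and the assumption that the input has already been mapped code-for-code into Unicode characters are both hypothetical.

```python
VOWEL_SIGN_I = "\u093F"  # drawn to the left of its consonant
VIRAMA = "\u094D"


def is_consonant(ch: str) -> bool:
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def visual_to_logical(text: str) -> str:
    """Move each vowel sign I from before its consonant cluster to after it.

    A cluster here is: consonant, then zero or more (virama, consonant) pairs.
    Everything else is copied through unchanged.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] == VOWEL_SIGN_I and i + 1 < n and is_consonant(text[i + 1]):
            j = i + 2  # skip the sign and the first consonant
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2  # extend over each virama + consonant pair
            out.append(text[i + 1:j])  # the cluster first
            out.append(VOWEL_SIGN_I)   # then the sign, in logical order
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


reordered = visual_to_logical("\u093F\u0915")  # sign I, then KA
print([f"U+{ord(c):04X}" for c in reordered])
```

A real converter would need one such rule for every reordering behaviour in the script, which is part of why a table of single-character substitutions is not enough on its own.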
12 Converting Indic languages
- Need one - a standardised format for encoding
mappings
- Need two - software to implement encoded mappings
- Need three - a library of mappings
13 A standardised format for encoding mappings
- Albright
- TEI WSDs (McEnery, Baker and Burnard 2000)
- Unicode technical committee
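To make the idea of an encoded mapping concrete, here is a sketch of a plain-text table and a parser for it. The line syntax is invented for illustration and is not the format of any of the proposals listed above. Both sides of a rule may hold several codes, so one-to-many and many-to-many rules fit the same syntax. Apart from the Ø row taken from the earlier example, the rows are hypothetical.

```python
# Invented syntax: "<source bytes in hex> -> <Unicode code points in hex>".
# '#' starts a comment. Only the D8 row follows the example in the text.
MAPPING_TEXT = """\
# 8-bit source codes -> Unicode code points
D8 -> 091F 094D 0920   # conjunct TTA + VIRAMA + TTHA (one-to-many)
41 -> 091F             # hypothetical one-to-one row
69 6B -> 0915 093F     # hypothetical many-to-many row (reordering)
"""


def parse_mapping(text: str) -> dict:
    """Parse the table into {source bytes: Unicode string}."""
    table = {}
    for line_no, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        src, sep, dst = line.partition("->")
        if not sep:
            raise ValueError(f"line {line_no}: missing '->'")
        key = bytes(int(tok, 16) for tok in src.split())
        value = "".join(chr(int(tok, 16)) for tok in dst.split())
        if not key or not value:
            raise ValueError(f"line {line_no}: empty side in rule")
        table[key] = value
    return table


table = parse_mapping(MAPPING_TEXT)
print(len(table), "rules loaded")
```

Keeping the rules in a data file rather than in code is what would let the same software serve every font in a library of mappings.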
14 Software to implement encoded mappings
- In essence fairly trivial software - simply
implementing mappings
- No general software that can achieve the task
that I am aware of
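As a rough indication of how small such software can be, the sketch below applies a mapping table to 8-bit data using greedy longest-match lookup, which is one possible strategy and not a description of any existing tool. The table contents are hypothetical apart from the Ø entry taken from the earlier example, and the function name is an assumption.

```python
# Hypothetical table: source byte sequences -> Unicode strings.
TABLE = {
    b"\xd8": "\u091F\u094D\u0920",   # one-to-many (the conjunct example)
    b"\x6b": "\u0915",               # hypothetical one-to-one: KA
    b"\x69\x6b": "\u0915\u093F",     # hypothetical many-to-many: reordered pair
}


def convert(data: bytes, table: dict) -> str:
    """Convert 8-bit data to Unicode, preferring the longest matching rule.

    Bytes that match no rule are replaced with U+FFFD so gaps are visible.
    """
    max_len = max((len(key) for key in table), default=0)
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append("\uFFFD")
            i += 1
    return "".join(out)


result = convert(b"\x69\x6b\xd8", TABLE)
print([f"U+{ord(c):04X}" for c in result])
```

Longest-match lookup lets a two-byte reordering rule take priority over the single-byte rule for the same consonant, so one pass handles both kinds of mapping.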
15 A Library of Mappings
- This is where the hard work is
- These mappings are not simply required for South
Asian languages
- Yet even the task for South Asian languages is
daunting
16 Conclusion
- Currently working towards needs one, two and
three with the National Centre for Software
Technology, Mumbai
- While harmonisation is technically feasible, the
principal stumbling block, in my view, is the
substantial effort that will be required to
author mappings
- This is a real issue for users, creators and
archivists: it impacts upon User Requirements cell
6, Creator Requirements cell 4, Archivist
Requirements cells 1, 3 and 8