Title: Using the Unicode Standard for Linguistic Data: Preliminary Guidelines Deborah Anderson Researcher Dept. of Linguistics, UC Berkeley
1 Using the Unicode Standard for Linguistic Data: Preliminary Guidelines
Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley
2 Using Unicode for Linguistic Data
- Introduction
- E-MELD and its mission
- What is the situation for character encoding?
- The role of this presentation
3 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Core Concepts
- Practical Issues: How do I get Unicode to work?
- Organization of the Unicode Standard
- Finding the character you need
- Other practical issues
- Further recommendations
4 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Unicode is the international character encoding standard.
- It assigns a unique number to every character, and this number stays the same no matter what the platform, no matter what the program, no matter what the language.
5 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Example:
- The Unicode character code for Latin capital letter A is U+0041.
- Unicode format: U+xxxx (xxxx is in hex)
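As a small illustration of the U+xxxx notation, the Python sketch below converts between a character and its code point. The helper name `code_point_label` is made up for this example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for one character (illustrative helper)."""
    # ord() gives the code point as an integer; format it as at least
    # four uppercase hex digits, which is the usual Unicode notation.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))  # U+0041
print(chr(0x0041))            # A
```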
6 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Used for plain text representation
- (e.g., 0045 002D 004D 0045 004C 0044 = E-MELD)
- Different from rich text, which is plain text with additional information (including formatting information, such as font size, styles, etc.)
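The code point sequence above can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
# The six code points given above, written as hex integers.
code_points = [0x0045, 0x002D, 0x004D, 0x0045, 0x004C, 0x0044]

# Plain text is just this sequence of characters, with no formatting attached.
text = "".join(chr(cp) for cp in code_points)
print(text)  # E-MELD

# Going the other way recovers the same four-digit hex values.
hex_values = [f"{ord(ch):04X}" for ch in text]
print(" ".join(hex_values))  # 0045 002D 004D 0045 004C 0044
```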
7 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Example: Superscripts
- (a) Plain text: use Unicode characters
- e.g., use U+02B0 for superscript h (ʰ)
- (b) Rich text: apply superscript style to a base character to get the superscript h
- e.g., <sup>h</sup> (This can be done in MS Word by selecting the superscript formatting feature on the Font menu.)
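The difference between the two approaches shows up as soon as the text is processed. In the sketch below, the base letter "p" is an arbitrary choice for illustration; only U+02B0 and the `<sup>` markup come from the example above.

```python
import unicodedata

# (a) Plain text: the superscript is its own character, U+02B0.
plain = "p\u02B0"

# (b) Rich text: an ordinary "h" wrapped in markup. The superscript exists
# only as formatting and is lost if the markup is stripped.
rich = "p<sup>h</sup>"

print(unicodedata.name(plain[1]))  # MODIFIER LETTER SMALL H
print(len(plain))                  # 2
print("\u02B0" in rich)            # False
```

A search for U+02B0 therefore finds the plain-text form but not the rich-text one.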
8 Using Unicode for Linguistic Data
- Background: What is Unicode?
- Widely supported by computer companies and national bodies; many current fonts, keyboards, and software are based on Unicode
- But the process to get characters incorporated can be lengthy (2 years), so there can be lag time before they appear in fonts, etc.
9 Using Unicode for Linguistic Data
- Core Concepts
- 1. Characters, not glyphs.
- Characters are the smallest components of written language that have semantic value (TUS, p. 13)
- Glyphs: the surface representation of abstract characters; what appears on the page or on your monitor
10 Using Unicode for Linguistic Data
- Core Concepts
- 1. Characters, not glyphs.
- Example
- Abstract character: a (small Latin letter a) → Unicode's domain
- Glyphs: a, a, a, a (the same letter in different typefaces) → the font's domain
11 Using Unicode for Linguistic Data
- Core Concepts
- 1. Characters, not glyphs.
- Don't take glyphs in the Unicode Standard charts as definitive
12 Using Unicode for Linguistic Data
- Core Concepts
- 1. Characters, not glyphs.
- Characters aren't necessarily the same as graphemes
- Spanish: ch
- Unicode: c + h
13 Using Unicode for Linguistic Data
- Core Concepts
- 1. Characters, not glyphs.
- There is not always a 1-1 relationship between a character and a glyph
- (a) Arabic: one character can have different glyphs depending upon position in a word
- (b) Devanagari: the glyph for ksha is made up of 3 characters: ka + virama + sha
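Both the Spanish and the Devanagari cases can be inspected by listing the stored characters. The sketch below uses Python's standard `unicodedata` module; the helper `describe` is illustrative. It assumes that the "sha" of the conjunct is U+0937, which Unicode names DEVANAGARI LETTER SSA.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each character of text as 'U+XXXX NAME' (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Spanish "ch" is one unit for the reader but two characters in Unicode.
print(describe("ch"))

# Devanagari "ksha" displays as one conjunct glyph but is stored as three
# characters (assumption: the third one is U+0937 SSA).
ksha = "\u0915\u094D\u0937"
for line in describe(ksha):
    print(line)
```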
14 Using Unicode for Linguistic Data
- Core Concepts
- 2. No new precomposed forms or digraphs
- Example
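One consequence of this principle is that a letter-plus-diacritic combination without an existing precomposed character is written as a base character followed by a combining mark. The sketch below shows how Python's `unicodedata.normalize` treats both situations. The characters used (e with acute, open o with acute) are chosen here for illustration and do not come from the presentation.

```python
import unicodedata

# An existing precomposed character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00E9"
# The same text as a base letter plus U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A combination with no precomposed form stays a two-character sequence
# even after composition (U+0254 open o + combining acute).
open_o_acute = "\u0254\u0301"
print(len(unicodedata.normalize("NFC", open_o_acute)))  # 2
```

Normalizing data to one form before comparing or searching avoids mismatches between the two spellings of the same text.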
15 Using Unicode for Linguistic Data
- Core Concepts
- 3. No variants
- 4. No idiosyncratic characters
16 Using Unicode for Linguistic Data
- Core Concepts
- 5. Unify, wherever possible
- Greek letter beta is unified with IPA beta (voiced bilabial fricative)
17 Using Unicode for Linguistic Data
- Core Concepts
- 5. Unify, wherever possible
- 0283 LATIN SMALL LETTER ESH (voiceless post-alveolar fricative)
- 222B INTEGRAL symbol
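The names and general categories recorded for these characters can be looked up directly. A minimal Python sketch using the standard `unicodedata` module: beta has a single code point, named as a Greek letter, while esh (a letter) and the integral sign (a math symbol) have separate code points even though they can look alike.

```python
import unicodedata

for cp in (0x03B2, 0x0283, 0x222B):
    ch = chr(cp)
    name = unicodedata.name(ch)
    category = unicodedata.category(ch)  # "Ll" = lowercase letter, "Sm" = math symbol
    print(f"U+{cp:04X}  {name}  category={category}")
```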
18 Using Unicode for Linguistic Data
- Practical Issues: Getting Unicode to Work
19 Using Unicode for Linguistic Data: Practical Issues (Getting Unicode to Work)
- A recent operating system (Mac OS 9.2, X, Windows CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2)
- A recent browser (IE, Safari, OmniWeb, Mozilla/Netscape)
- A Unicode text editor (Word 2000, 2002, Unipad, Apple TextEdit)
- An input mechanism (insert symbol, keyboard, Keyman)
20 Using Unicode for Linguistic Data: Getting Unicode to Work
- A Unicode-enabled font (Code2000, Lucida Sans Unicode, SIL's Doulos, Gentium, Arial Unicode MS)
- Note: Be wary of "Unicode" fonts; they may only be partially Unicode-compliant.
21 Organization of the Unicode Standard
22 Organization of the Unicode Standard: Unicode Code Charts
23 Using Unicode for Linguistic Data: Unicode Code Charts
24 Using Unicode for Linguistic Data: Code Chart (Phonetic Extensions block)
25 Using Unicode for Linguistic Data: Code Chart (Phonetic Extensions block)
26 Using Unicode for Linguistic Data: Unicode Code Charts
27 Using Unicode for Linguistic Data: Unicode Code Charts
28 Using Unicode for Linguistic Data: Steps to using Unicode
- Finding the character you need
- 1. See if it is in Unicode
- Check the IPA blocks (etc.) on the Unicode website
- Check Appendix 2 of the IPA Handbook or a Web version of the IPA symbols
29 Using Unicode for Linguistic Data: Steps to using Unicode
- Finding the character you need
- Note: In looking through Unicode and using Insert Symbol/font charts, be careful of "spoof buddies" (characters that look alike but are distinct)
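One way to guard against look-alikes is to check the name of the character actually entered rather than trusting its appearance. The pairs in this sketch are an illustrative selection of phonetic characters and their ASCII look-alikes, not a list from the presentation.

```python
import unicodedata

# Each pair: a phonetic character and an ordinary character that can look
# similar in a font chart (illustrative selection only).
lookalikes = [
    ("\u0261", "g"),  # script g vs. ASCII g
    ("\u02D0", ":"),  # length mark vs. colon
    ("\u01C3", "!"),  # retroflex click vs. exclamation mark
]

for phonetic, ordinary in lookalikes:
    for ch in (phonetic, ordinary):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print("same character?", phonetic == ordinary)
```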
30 Using Unicode for Linguistic Data: Steps to using Unicode
- Finding the character you need
- 2. See if it is in the process of being proposed
- Check Unicode's Proposed New Characters page
- Ask on the Transcription email list
- Ask on the Unicode email list
- Verify the character you need is a true character, and not a variant
31 Using Unicode for Linguistic Data: Steps to using Unicode
- If you find a character that is missing
- Work with Peter Constable to get it proposed.
- A proposal is composed of:
- the character's name
- a representative glyph
- information on the character's properties
- a representative sample of the character in context
- a short bibliography with references
32 Using Unicode for Linguistic Data: Steps to using Unicode
- How can I use a character not yet in Unicode?
- Use FontLab or work with a font foundry to create a font in the interim, using the Private Use Area (PUA); fully document PUA characters.
- Use markup / entities
- Use Scalable Vector Graphics.
- TEI is preparing guidelines, but nothing has yet
been finalized.
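Documenting PUA characters is easier if they can be located automatically. The sketch below lists every private-use character in a string by testing for the general category "Co". The function name and the sample word are hypothetical.

```python
import unicodedata


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every private-use character in text."""
    # General category "Co" marks private-use code points, including the
    # main PUA range U+E000..U+F8FF.
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# U+E000 stands in for a hypothetical interim character assigned by a project.
sample = "ta\uE000ki"
print(find_private_use(sample))  # [(2, 'U+E000')]
```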
33 Using Unicode for Linguistic Data: Steps to using Unicode
- For those languages without an orthography
- Use Unicode characters if possible
- Verify character properties are similar
- Stay away from certain characters
- Presentation forms
- Letterlike symbols
- Number forms
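A draft orthography can be screened for characters from these blocks. The sketch below is one possible check. The block ranges follow the Unicode code charts, but which presentation-form blocks to include is an assumption made here, and the function name is illustrative.

```python
# (start, end, block name); ranges as given in the Unicode code charts.
DISCOURAGED_BLOCKS = [
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x2150, 0x218F, "Number Forms"),
    (0xFB00, 0xFB4F, "Alphabetic Presentation Forms"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]


def flag_discouraged(text: str) -> list[tuple[str, str]]:
    """Return ('U+XXXX', block name) for each character in a listed block."""
    hits = []
    for ch in text:
        cp = ord(ch)
        for start, end, block in DISCOURAGED_BLOCKS:
            if start <= cp <= end:
                hits.append((f"U+{cp:04X}", block))
    return hits


# fi ligature, angstrom sign, Roman numeral four
print(flag_discouraged("\uFB01n \u212B \u2163"))
```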
34 Using Unicode for Linguistic Data: Steps to using Unicode
- How do I tell if my font is Unicode-compliant?
- Set your font as the default for your browser, then look at a test page, such as Alan Wood's IPA Extensions page.
- Use font utilities to check the fonts on your system (see Alan Wood's website)
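If no ready-made test page is at hand, a simple one can be generated. The sketch below builds an HTML page showing the IPA Extensions block (U+0250 to U+02AF) in a named font. The function name and the font name "Gentium" are only examples. Browsers may silently substitute glyphs from another font, so the result is a rough visual check rather than proof of coverage.

```python
import html


def ipa_test_page(font_name: str) -> str:
    """Build a small HTML page showing the IPA Extensions block in one font."""
    # IPA Extensions occupies U+0250..U+02AF (96 code points).
    cells = "".join(
        f"<span title='U+{cp:04X}'>{chr(cp)}</span> "
        for cp in range(0x0250, 0x02B0)
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'></head>"
        f"<body style=\"font-family: '{html.escape(font_name)}'; font-size: 24pt\">"
        f"{cells}</body></html>"
    )


page = ipa_test_page("Gentium")
print(len(page) > 0)
```

Saving the returned string as a UTF-8 file and opening it in a browser shows which characters appear as empty boxes or in a visibly different typeface.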
35 Using Unicode for Linguistic Data: Steps to using Unicode
- What about my data that is in a non-Unicode font?
- If possible, upgrade your documents to Unicode, converting to a Unicode font.
- Use a converter
- If the font you use isn't included, create a converter and have it hosted on a publicly available website
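At its simplest, a converter is a table that maps each code used by the legacy font to the Unicode character it was displaying. The mapping below is entirely hypothetical and describes no real font; an actual converter needs the real table for the font in question.

```python
# Hypothetical mapping for an imaginary legacy font that showed phonetic
# symbols in place of some ASCII capitals.
LEGACY_TO_UNICODE = {
    "S": "\u0283",  # esh
    "N": "\u014B",  # eng
    "H": "\u02B0",  # superscript h
}


def convert_legacy(text: str) -> str:
    """Replace legacy-font codes with the Unicode characters they stood for."""
    return text.translate(str.maketrans(LEGACY_TO_UNICODE))


print(convert_legacy("SiN"))  # ʃiŋ
```

Publishing such a table alongside the converter lets others verify and reuse the mapping.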
36 Using Unicode for Linguistic Data: Steps to using Unicode
- Encoding Forms
- Different ways to represent the code point (the hex-based integer) as a series of bytes
- A series of one to four 8-bit values (UTF-8)
- One or two 16-bit values (UTF-16)
- A 32-bit value (UTF-32)
37 Using Unicode for Linguistic Data: Steps to using Unicode
- Encoding Forms
- Reason for different forms: different implementation needs
- Some tradeoffs for storage/processing
- Suggestion: Use UTF-8 or UTF-16
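The storage tradeoff can be seen by encoding the same characters in each form. In this sketch the big-endian variants are used so that no byte order mark is added. U+10400 is an arbitrary character outside the Basic Multilingual Plane, chosen here to show a case where UTF-16 needs two 16-bit units.

```python
samples = {
    "A": "\u0041",
    "superscript h": "\u02B0",
    "U+10400": "\U00010400",
}


def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as spaced hex pairs."""
    return ch.encode(encoding).hex(" ").upper()


for label, ch in samples.items():
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        hex_bytes = encoded_hex(ch, encoding)
        size = len(ch.encode(encoding))
        print(f"{label:14} {encoding:10} {hex_bytes}  ({size} bytes)")
```

ASCII letters take one byte in UTF-8 but four in UTF-32, while the character outside the BMP takes four bytes in every form.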
38 Using Unicode for Linguistic Data: Steps to using Unicode
- Further recommendations
- Groups of users (e.g., Athabaskanists) should publicly document Unicode values for the orthography and give font recommendations.
- Provide feedback on missing characters to Peter Constable.
39 Appendices
- 1. Linguistic letters and symbols in Unicode
- 2. Characters known to be missing
- 3. Normalization
40 End