Using the Unicode Standard for Linguistic Data: Preliminary Guidelines. Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley

Provided by: DeborahW152

Learn more at: https://linguistlist.org

Transcript and Presenter's Notes

1
Using the Unicode Standard for Linguistic Data
Preliminary Guidelines
Deborah Anderson
Researcher, Dept. of Linguistics, UC Berkeley
2
Using Unicode for Linguistic Data

Introduction
E-MELD and its mission
What is the situation for character encoding?
The role of this presentation

3
Using Unicode for Linguistic Data

Background: What is Unicode?
Core Concepts
Practical Issues: How do I get Unicode to work?
Organization of the Unicode Standard
Finding the character you need
Other practical issues
Further recommendations

4
Using Unicode for Linguistic Data

Background: What is Unicode?
Unicode is the international character encoding standard.
It assigns a unique number to every character, and this number stays the same no matter what the platform, no matter what the program, no matter what the language.

5
Using Unicode for Linguistic Data

Background: What is Unicode?
Example:
The Unicode character code for Latin capital letter A is U+0041.
Unicode format: U+xxxx (xxxx is in hex)
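
As a minimal sketch of the U+xxxx notation (Python is assumed here only for illustration, and the helper name `u_notation` is hypothetical), a character's number can be printed in the standard form and turned back into the character:

```python
def u_notation(char: str) -> str:
    """Return the U+xxxx form of a single character's code point."""
    # ord() gives the character's number; 04X pads it to at least 4 hex digits.
    return f"U+{ord(char):04X}"


print(u_notation("A"))  # U+0041
print(chr(0x0041))      # A
```

The same number identifies the character regardless of platform or program, which is the point of the slide.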

6
Using Unicode for Linguistic Data

Background: What is Unicode?
Used for plain text representation
(e.g., 0045 002D 004D 0045 004C 0044 = E-MELD)
Different from rich text, which is plain text with additional information (including formatting information, such as font size, styles, etc.)
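
The hex sequence on this slide can be checked with a short sketch (Python assumed; the variable names are illustrative only). It converts each hex value to its character and back again:

```python
hex_values = "0045 002D 004D 0045 004C 0044"

# Each 4-digit hex value is one code point; chr() turns it into a character.
text = "".join(chr(int(value, 16)) for value in hex_values.split())
print(text)  # E-MELD

# The reverse direction: list the code points of a plain-text string.
round_trip = " ".join(f"{ord(char):04X}" for char in text)
print(round_trip)
```

Nothing but the character numbers is stored, which is what distinguishes plain text from rich text.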

7
Using Unicode for Linguistic Data

Background: What is Unicode?
Example: Superscripts
(a) Plain text: use Unicode characters
e.g., for ʰ use 02B0 (MODIFIER LETTER SMALL H) for superscript h
(b) Rich text: apply superscript style to a base character to get the superscript h
e.g., <sup>h</sup> (This can be done in MS Word by selecting the superscript formatting feature on the Font menu.)
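
A small sketch of why the distinction matters (Python assumed; the sample strings and the crude tag-stripping pattern are illustrative, not a recommended HTML parser). When markup is stripped from rich text, the superscript is lost, while the plain-text character keeps it:

```python
import re
import unicodedata

plain = "p\u02b0"        # plain text: p followed by U+02B0
rich = "p<sup>h</sup>"   # rich text: ordinary h with superscript markup

print(unicodedata.name("\u02b0"))  # MODIFIER LETTER SMALL H

# Removing the markup leaves an ordinary "h": aspiration is no longer marked.
stripped = re.sub(r"<[^>]+>", "", rich)
print(stripped)            # ph
print(plain == stripped)   # False
```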

8
Using Unicode for Linguistic Data

Background: What is Unicode?
Widely supported by computer companies and national bodies; many current fonts, keyboards, and software are based on Unicode.
But the process to get characters incorporated can be lengthy (2 years), so there can be lag-time before they appear in fonts, etc.

9
Using Unicode for Linguistic Data

Core Concepts
1. Characters, not glyphs.
Characters are the smallest components of written language that have semantic value (TUS, p. 13)
Glyphs: the surface representation of abstract characters; what appears on the page or on your monitor

10
Using Unicode for Linguistic Data

Core Concepts
1. Characters, not glyphs.
Example:
Abstract character: a (Latin small letter a) → Unicode's domain
Glyphs: a, a, a, a (the same letter in different typefaces) → fonts' domain

11
Using Unicode for Linguistic Data

Core Concepts
1. Characters, not glyphs.
Don't take glyphs in the Unicode Standard charts as definitive.

12
Using Unicode for Linguistic Data

Core Concepts
1. Characters, not glyphs.
Characters aren't necessarily the same as graphemes.
Spanish: ch
Unicode: c + h

13
Using Unicode for Linguistic Data

Core Concepts
1. Characters, not glyphs.
There is not always a 1-1 relationship between a
character and glyph
(a) Arabic: one character can have different glyphs depending upon position in a word
(b) Devanagari: the glyph for ksha is made up of 3 characters: ka + virama + sha
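
The Devanagari case can be inspected with a short sketch (Python assumed). Note that the third letter, written "sha" on the slide, has the formal Unicode name DEVANAGARI LETTER SSA:

```python
import unicodedata

# One conjunct glyph on screen, three characters in the text.
ksha = "\u0915\u094d\u0937"

for char in ksha:
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")

names = [unicodedata.name(char) for char in ksha]
print(len(ksha))  # 3
```

How the three characters are drawn as a single glyph is left to the font and rendering engine, not to the encoding.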

14
Using Unicode for Linguistic Data

Core Concepts
2. No new precomposed forms or digraphs
Example
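
One way to see what this principle means in practice is the following illustrative sketch (Python assumed; the characters chosen are this sketch's own, not taken from the slide). A letter with a diacritic that has no precomposed form is represented as a base letter plus a combining mark, and normalization leaves it as a sequence:

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (already encoded)
sequence = "e\u0301"     # e + COMBINING ACUTE ACCENT

# NFC composes the sequence into the existing precomposed character.
composed = unicodedata.normalize("NFC", sequence)
print(composed == precomposed)  # True

# q with tilde has no precomposed form, so it stays base + combining mark.
q_tilde = "q\u0303"
q_after_nfc = unicodedata.normalize("NFC", q_tilde)
print(len(q_after_nfc))  # 2
```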

15
Using Unicode for Linguistic Data

Core Concepts
3. No variants
4. No idiosyncratic characters

16
Using Unicode for Linguistic Data

Core Concepts
4. Unify, wherever possible
Greek letter beta is unified with IPA beta
(voiced bilabial fricative)

17
Using Unicode for Linguistic Data

Core Concepts
4. Unify, wherever possible
0283 LATIN SMALL LETTER ESH (voiceless
post-alveolar fricative)
222B INTEGRAL symbol
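
The characters named on these two slides can be inspected with a brief sketch (Python assumed; the helper `describe` is hypothetical). It prints each character's formal name and general category, showing that esh is a lowercase letter while the similar-looking integral is a math symbol:

```python
import unicodedata


def describe(char: str) -> tuple[str, str, str]:
    """Return (code point, formal name, general category) for one character."""
    return (f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))


for char in ("\u0283", "\u222b", "\u03b2"):
    print(describe(char))
```

Checking names and properties this way is more reliable than judging a character by its appearance in a font.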

18
Using Unicode for Linguistic Data

Practical Issues: Getting Unicode to Work

19
Using Unicode for Linguistic Data
Practical Issues: Getting Unicode to Work

A recent operating system (Mac OS 9.2, X, Windows
CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2)
A recent browser (IE, Safari, OmniWeb,
Mozilla/Netscape)
A Unicode text editor (Word 2000, 2002, Unipad,
Apple TextEdit)
An input mechanism (insert symbol, keyboard,
Keyman)

20
Using Unicode for Linguistic Data: Getting Unicode to Work

A Unicode-enabled font (Code2000, Lucida Sans Unicode, SIL's Doulos, Gentium, Arial Unicode MS)
Note: Be wary of "Unicode" fonts; they may be only partially Unicode-compliant.

21
Organization of the Unicode Standard
22
Organization of the Unicode Standard: Unicode Code Charts
23
Using Unicode for Linguistic Data: Unicode Code Charts
24
Using Unicode for Linguistic Data: Code Chart (Phonetic Extensions block)
25
Using Unicode for Linguistic Data: Code Chart (Phonetic Extensions block)
26
Using Unicode for Linguistic Data: Unicode Code Charts

27
Using Unicode for Linguistic Data: Unicode Code Charts

28
Using Unicode for Linguistic Data: Steps to using Unicode

Finding the character you need
1. See if it is in Unicode
Check the IPA blocks (etc.) on the Unicode
website
Check Appendix 2 of the IPA Handbook or a Web
version of the IPA symbols

29
Using Unicode for Linguistic Data: Steps to using Unicode

Finding the character you need
Note: In looking through Unicode and using Insert Symbol/font charts, be careful of "spoof buddies" (characters that look alike but are encoded separately).

30
Using Unicode for Linguistic Data: Steps to using Unicode

Finding the character you need
2. See if it is in the process of being proposed
Check Unicode's Proposed New Characters page
Ask on the Transcription email list
Ask on the Unicode email list
Verify the character you need is a true character, and not a variant

31
Using Unicode for Linguistic Data: Steps to using Unicode

If you find a character that is missing
Work with Peter Constable to get it proposed.
A proposal is composed of:
the character's name
a representative glyph
information on the character's properties
a representative sample of the character in context
a short bibliography with references

32
Using Unicode for Linguistic Data: Steps to using Unicode

How can I use a character not yet in Unicode?
Use FontLab or work with a font foundry to create a font in the interim, using the Private Use Area (PUA); fully document PUA characters.
Use markup / entities.
Use Scalable Vector Graphics.
TEI is preparing guidelines, but nothing has yet been finalized.
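
To support the advice to document PUA characters fully, here is a minimal sketch (Python assumed; the function name and the sample string are hypothetical) that lists the Private Use characters in a text so each one can be documented:

```python
import unicodedata


def private_use_chars(text: str) -> list[str]:
    """Return the distinct Private Use code points in text, in U+xxxx form."""
    # General category "Co" marks Private Use characters.
    found = {f"U+{ord(char):04X}" for char in text if unicodedata.category(char) == "Co"}
    return sorted(found)


sample = "ba\ue000ta"  # hypothetical data using one PUA character
print(private_use_chars(sample))  # ['U+E000']
```

A PUA code point has no agreed meaning outside the font that defines it, so a list like this is only useful alongside written documentation of each assignment.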

33
Using Unicode for Linguistic Data: Steps to using Unicode

For those languages without an orthography
Use Unicode characters if possible
Verify character properties are similar
Stay away from certain characters:
Presentation forms
Letterlike symbols
Number forms
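
One practical reason to avoid these characters can be sketched as follows (Python assumed; the three sample characters are this sketch's own picks, one from each group named above). Many of them are compatibility characters, which are replaced by ordinary letters when text is normalized with NFKC:

```python
import unicodedata

samples = {
    "\ufb01": "presentation form (LATIN SMALL LIGATURE FI)",
    "\u212b": "letterlike symbol (ANGSTROM SIGN)",
    "\u2160": "number form (ROMAN NUMERAL ONE)",
}

for char, label in samples.items():
    folded = unicodedata.normalize("NFKC", char)
    # The original character does not survive normalization.
    print(f"U+{ord(char):04X} {label} -> {folded!r}")
```

An orthography built on such characters could therefore change silently as data passes through software that normalizes text.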

34
Using Unicode for Linguistic Data: Steps to using Unicode

How do I tell if my font is Unicode-compliant?
Set your font as the default for your browser, then look at a test page, such as Alan Wood's IPA Extensions page.
Use font utilities to check the fonts on your system (see Alan Wood's website)

35
Using Unicode for Linguistic Data: Steps to using Unicode

What about my data that is in a non-Unicode font?
If possible, upgrade your documents to Unicode, converting to a Unicode font.
Use a converter.
If the font you use isn't included, create a converter and have it hosted on a publicly available website.
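
At its simplest, a converter is a documented mapping table. The sketch below (Python assumed) uses an entirely hypothetical legacy font in which the keys S, N and T were drawn as phonetic symbols; a real converter needs the actual mapping for the font in question and must make sure ordinary capitals in the data are not converted by mistake:

```python
# HYPOTHETICAL legacy-font mapping, for illustration only.
LEGACY_TO_UNICODE = {
    "S": "\u0283",  # drawn as esh in the imagined font
    "N": "\u014b",  # drawn as eng
    "T": "\u03b8",  # drawn as theta
}


def convert(text: str, mapping: dict[str, str] = LEGACY_TO_UNICODE) -> str:
    """Replace each legacy character with its Unicode equivalent."""
    return text.translate(str.maketrans(mapping))


print(convert("siN"))  # siŋ
print(convert("Sip"))  # ʃip
```

Publishing the mapping table itself, not only the converted files, lets others check and reuse the conversion.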

36
Using Unicode for Linguistic Data: Steps to using Unicode

Encoding Forms
Different ways to represent a character's number (its code point, usually written in hex) as a series of bytes:
A series of 8-bit values (UTF-8)
One or two 16-bit values (UTF-16)
A single 32-bit value (UTF-32)
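
The three encoding forms can be compared with a short sketch (Python assumed; big-endian variants are chosen here only so that the bytes read in the same order as the code point). It encodes esh (U+0283) and one character beyond U+FFFF:

```python
esh = "\u0283"          # LATIN SMALL LETTER ESH
beyond = "\U00010000"   # first code point above U+FFFF

for label, char in (("U+0283", esh), ("U+10000", beyond)):
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        # bytes.hex() shows the stored bytes as hexadecimal digits.
        print(label, encoding, char.encode(encoding).hex())
```

For U+0283 this prints ca83, 0283 and 00000283. For U+10000, UTF-16 needs two 16-bit values (d800 dc00), which is why the form cannot be described as a single 16-bit value per character.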

37
Using Unicode for Linguistic Data: Steps to using Unicode

Encoding Forms
Reason for different forms: different implementation needs
Some tradeoffs for storage/processing
Suggestion: Use UTF-8 or UTF-16
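
Following that suggestion, a minimal sketch of saving and reloading data with an explicitly named encoding (Python assumed; the file name and sample string are hypothetical). It also shows the storage tradeoff for one short string:

```python
import os
import tempfile

text = "t\u0283\u02b0"  # t + esh + modifier letter small h

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    # Name the encoding explicitly instead of relying on a platform default.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, "rb") as handle:
        raw = handle.read()
    with open(path, encoding="utf-8") as handle:
        restored = handle.read()

print(restored == text)                # True
print(len(raw))                        # 5 bytes in UTF-8
print(len(text.encode("utf-16-le")))   # 6 bytes in UTF-16
```

Which form is smaller depends on the characters in the data, so neither choice is best for every corpus.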

38
Using Unicode for Linguistic Data: Steps to using Unicode

Further recommendations
Groups of users (e.g., Athabaskanists) should publicly document Unicode values for the orthography and give font recommendations.
Provide feedback on missing characters to Peter Constable.

39
Appendices
1. Linguistic Letters and Symbols in Unicode
2. Characters known to be missing
3. Normalization
40
end