Using the Unicode Standard for Linguistic Data: Preliminary Guidelines
Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley

Provided by: DeborahW152
Learn more at: https://linguistlist.org
1
Using the Unicode Standard for Linguistic Data: Preliminary Guidelines
Deborah Anderson, Researcher
Dept. of Linguistics, UC Berkeley
2
Using Unicode for Linguistic Data
  • Introduction
  • E-MELD and its mission
  • What is the situation for character encoding?
  • The role of this presentation

3
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Core Concepts
  • Practical Issues: How do I get Unicode to work?
  • Organization of the Unicode Standard
  • Finding the character you need
  • Other practical issues
  • Further recommendations

4
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Unicode is the international character encoding
    standard.
  • It assigns a unique number to every character, and
    this number stays the same no matter what the
    platform, no matter what the program, no
    matter what the language.

5
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Example:
  • The Unicode character code for Latin capital
    letter A is U+0041
  • Unicode format: U+xxxx (xxxx is in hex)
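The U+xxxx notation can be checked against any character directly. The following is a minimal sketch using only Python built-ins; the helper names `to_u_notation` and `from_u_notation` are made up for illustration and are not part of the presentation.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+xxxx label (hex, at least four digits) for one character."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Return the character named by a label such as 'U+0041'."""
    # Drop the leading 'U+' and read the rest as a hexadecimal number.
    return chr(int(label[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```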

6
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Used for plain text representation
  • (e.g., 0045 002D 004D 0045 004C 0044 = E-MELD)
  • Different from rich text, which is plain text
    with additional information (including formatting
    information, such as font size, styles, etc.)
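As a small illustration of the plain-text example on this slide, the sketch below turns the six hexadecimal values back into the string they stand for. The variable names are arbitrary.

```python
# The six code points from the slide, written in hexadecimal.
code_points = "0045 002D 004D 0045 004C 0044"

# Plain text is just this sequence of numbers; no font or style is stored.
text = "".join(chr(int(cp, 16)) for cp in code_points.split())

print(text)  # E-MELD
```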

7
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Example: Superscripts
  • (a) Plain text: use Unicode characters
  • e.g., use U+02B0 (ʰ) for superscript h
  • (b) Rich text: apply superscript style to a base
    character to get the superscript h
  • e.g., <sup>h</sup> (This can be done in MS
    Word by selecting the superscript formatting
    feature on the font menu.)
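One practical consequence of the plain text / rich text distinction can be sketched as follows: when markup is stripped (for example on export to a plain-text file), a superscript that exists only as formatting is lost, while the Unicode character survives. The `strip_markup` helper is hypothetical and handles only the `<sup>` tag used on the slide.

```python
import re
import unicodedata

plain = "p\u02B0"       # plain text: p + U+02B0 MODIFIER LETTER SMALL H
rich = "p<sup>h</sup>"  # rich text: p + an ordinary h carrying superscript markup


def strip_markup(text: str) -> str:
    """Remove <sup> and </sup> tags, roughly as a plain-text export might."""
    return re.sub(r"</?sup>", "", text)


print(unicodedata.name(plain[1]))  # MODIFIER LETTER SMALL H
print(strip_markup(plain))         # aspiration mark is still there
print(strip_markup(rich))          # only "ph" remains
```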

8
Using Unicode for Linguistic Data
  • Background: What is Unicode?
  • Widely supported by computer companies and
    national bodies; many current fonts, keyboards,
    and software are based on Unicode
  • But the process to get characters incorporated
    can be lengthy (2 years), so there can be a
    lag time before they appear in fonts, etc.

9
Using Unicode for Linguistic Data
  • Core Concepts
  • 1. Characters, not glyphs.
  • Characters are "the smallest components of
    written language that have semantic value" (TUS,
    p. 13)
  • Glyphs are the surface representation of abstract
    characters: what appears on the page or on your
    monitor

10
Using Unicode for Linguistic Data
  • Core Concepts
  • 1. Characters, not glyphs.
  • Example
  • Abstract character: a (Latin small letter a)
    → Unicode's domain
  • Glyphs: a, a, a, a (various shapes)
    → the font's domain

11
Using Unicode for Linguistic Data
  • Core Concepts
  • 1. Characters, not glyphs.
  • Don't take the glyphs in the Unicode
    Standard charts as definitive

12
Using Unicode for Linguistic Data
  • Core Concepts
  • 1. Characters, not glyphs.
  • Characters aren't necessarily the same as
    graphemes
  • Spanish: ch (one grapheme)
  • Unicode: c + h (two characters)

13
Using Unicode for Linguistic Data
  • Core Concepts
  • 1. Characters, not glyphs.
  • There is not always a 1-1 relationship between a
    character and glyph
  • (a) Arabic: one character can have different
    glyphs depending upon its position in a word
  • (b) Devanagari: the glyph for ksha is made up of
    3 characters: ka + virama + sha
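The Devanagari case can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; note that the formal Unicode name of the third character is DEVANAGARI LETTER SSA, which corresponds to the "sha" of the slide's transliteration.

```python
import unicodedata

# What is displayed as a single conjunct glyph is stored as three characters.
ksha = "\u0915\u094D\u0937"

for ch in ksha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(ksha))  # 3
```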

14
Using Unicode for Linguistic Data
  • Core Concepts
  • 2. No new precomposed forms or digraphs
  • Example
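The slide's own example did not survive in this transcript. As a stand-in (an assumption, not the presenter's example), the sketch below shows what the principle means in practice: a letter with a diacritic that has no precomposed form is written as a base character followed by a combining mark. It uses Python's `unicodedata.normalize`. Older precomposed characters such as é still exist, but an open e with acute accent has none and stays a two-character sequence.

```python
import unicodedata

e_acute = "e\u0301"            # e + COMBINING ACUTE ACCENT
open_e_acute = "\u025B\u0301"  # LATIN SMALL LETTER OPEN E + COMBINING ACUTE ACCENT

# NFC composes a sequence only where a precomposed character already exists.
nfc_e = unicodedata.normalize("NFC", e_acute)
nfc_open_e = unicodedata.normalize("NFC", open_e_acute)

print(len(nfc_e))       # 1: composed to U+00E9
print(len(nfc_open_e))  # 2: stays base + combining mark
```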

15
Using Unicode for Linguistic Data
  • Core Concepts
  • 3. No variants
  • 4. No idiosyncratic characters

16
Using Unicode for Linguistic Data
  • Core Concepts
  • 4. Unify, wherever possible
  • Greek letter beta is unified with IPA beta
    (voiced bilabial fricative)

17
Using Unicode for Linguistic Data
  • Core Concepts
  • 4. Unify, wherever possible
  • 0283 LATIN SMALL LETTER ESH (voiceless
    post-alveolar fricative)
  • 222B INTEGRAL symbol
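The characters named on these two slides can be looked up in the Unicode Character Database that ships with Python. In this sketch the `describe` helper is hypothetical. It prints each character's formal name and general category: the Greek beta is a lowercase letter that also serves as the IPA symbol, while esh and the integral sign have separate code points with different properties (letter versus math symbol).

```python
import unicodedata


def describe(cp: int) -> tuple:
    """Return (U+ label, formal name, general category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


for cp in (0x03B2, 0x0283, 0x222B):
    print(*describe(cp))
```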

18
Using Unicode for Linguistic Data
  • Practical Issues: Getting Unicode to Work

19
Using Unicode for Linguistic Data
Practical Issues: Getting Unicode to Work
  • A recent operating system (Mac OS 9.2, X, Windows
    CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2)
  • A recent browser (IE, Safari, OmniWeb,
    Mozilla/Netscape)
  • A Unicode text editor (Word 2000, 2002, Unipad,
    Apple TextEdit)
  • An input mechanism (insert symbol, keyboard,
    Keyman)

20
Using Unicode for Linguistic Data
Getting Unicode to Work
  • A Unicode-enabled font (Code2000, Lucida Sans
    Unicode, SIL's Doulos, Gentium, Arial Unicode MS)
  • Note: Be wary of "Unicode" fonts; they may be only
    partially Unicode-compliant.

21
Organization of the Unicode Standard
22
Organization of the Unicode Standard
Unicode Code Charts
23
Using Unicode for Linguistic Data
Unicode Code Charts
24
Using Unicode for Linguistic Data
Code Chart (Phonetic Extensions block)
25
Using Unicode for Linguistic Data
Code Chart (Phonetic Extensions block)
26
Using Unicode for Linguistic Data
Unicode Code Charts

27
Using Unicode for Linguistic Data
Unicode Code Charts

28
Using Unicode for Linguistic Data
Steps to using Unicode
  • Finding the character you need
  • 1. See if it is in Unicode
  • Check the IPA blocks (etc.) in the code charts on
    the Unicode website
  • Check Appendix 2 of the IPA Handbook or a Web
    version of the IPA symbols
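Besides browsing the charts, a quick programmatic check is possible. The sketch below uses Python's `unicodedata` module; the helper names `find_character` and `is_assigned` are made up for illustration. The answer depends on the Unicode version bundled with the Python build (`unicodedata.unidata_version`), so a recently added character may be reported as missing on an older installation.

```python
import unicodedata


def find_character(name: str):
    """Return the character with this formal Unicode name, or None."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


def is_assigned(cp: int) -> bool:
    """True if the code point is assigned in this Python's Unicode data."""
    # General category 'Cn' means unassigned (or a noncharacter).
    return unicodedata.category(chr(cp)) != "Cn"


print(unicodedata.unidata_version)
print(find_character("LATIN SMALL LETTER ESH"))
print(is_assigned(0x0283))
```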

29
Using Unicode for Linguistic Data
Steps to using Unicode
  • Finding the character you need
  • Note: In looking through Unicode and using
    Insert Symbol/font charts, be careful of "spoof
    buddies" (characters that look alike but have
    different code points)


30
Using Unicode for Linguistic Data
Steps to using Unicode
  • Finding the character you need
  • 2. See if it is in the process of being proposed
  • Check Unicode's Proposed New Characters page
  • Ask on the Transcription email list
  • Ask on the Unicode email list
  • Verify that the character you need is a true
    character, and not a variant

31
Using Unicode for Linguistic Data
Steps to using Unicode
  • If you find a character that is missing:
  • Work with Peter Constable to get it
    proposed.
  • A proposal is composed of:
  • the character's name
  • a representative glyph
  • information on the character's properties
  • a representative sample of the character in
    context
  • a short bibliography with references

32
Using Unicode for Linguistic Data
Steps to using Unicode
  • How can I use a character not yet in Unicode?
  • Use FontLab or work with a font foundry to create
    a font in the interim, using the Private Use Area
    (PUA); fully document PUA characters.
  • Use markup / entities.
  • Use Scalable Vector Graphics.
  • TEI is preparing guidelines, but nothing has yet
    been finalized.
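To help with documenting Private Use Area characters, a text can be scanned for them. This is a minimal sketch relying on the general category `Co` (private use) reported by Python's `unicodedata`; the function name `private_use_chars` is hypothetical.

```python
import unicodedata


def private_use_chars(text: str) -> list:
    """Return the distinct private-use code points in text as U+ labels."""
    found = {ord(ch) for ch in text if unicodedata.category(ch) == "Co"}
    # Sort numerically so the list is stable and easy to document.
    return [f"U+{cp:04X}" for cp in sorted(found)]


sample = "a\uE000b\uE001\uE000"
print(private_use_chars(sample))  # ['U+E000', 'U+E001']
```

Each code point reported this way needs an entry in the project's documentation saying what it stands for, since the PUA has no standard meaning.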

33
Using Unicode for Linguistic Data
Steps to using Unicode
  • For those languages without an orthography:
  • Use Unicode characters if possible
  • Verify that character properties are similar
  • Stay away from certain characters:
  • Presentation forms
  • Letterlike symbols
  • Number forms
34
Using Unicode for Linguistic Data
Steps to using Unicode
  • How do I tell if my font is Unicode-compliant?
  • Set your font as the default for your browser,
    then look at a test page, such as Alan Wood's IPA
    Extensions page.
  • Use font utilities to check the fonts on your
    system (see Alan Wood's website)

35
Using Unicode for Linguistic Data
Steps to using Unicode
  • What about my data that is in a non-Unicode font?
  • If possible, upgrade your documents to Unicode,
    converting to a Unicode font.
  • Use a converter.
  • If the font you use isn't included, create a
    converter and have it hosted on a publicly
    available website.
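At its simplest, a converter is a documented mapping table from the values a legacy font misuses to the intended Unicode characters. The sketch below is illustrative only: the three-entry mapping describes an imaginary legacy font, not any real one, and a real converter would also have to handle formatting and multi-character sequences.

```python
# Hypothetical legacy font: typing S, N and ? displayed esh, eng and
# glottal stop. A real mapping must come from the font's documentation.
LEGACY_TO_UNICODE = {
    "S": "\u0283",  # LATIN SMALL LETTER ESH
    "N": "\u014B",  # LATIN SMALL LETTER ENG
    "?": "\u0294",  # LATIN LETTER GLOTTAL STOP
}

TABLE = str.maketrans(LEGACY_TO_UNICODE)


def convert(text: str) -> str:
    """Replace legacy-font values with the intended Unicode characters."""
    return text.translate(TABLE)


print(convert("SiN"))
```

Such a table should be applied only to text that was actually set in the legacy font; otherwise ordinary capital S, N and question marks would be converted too.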

36
Using Unicode for Linguistic Data
Steps to using Unicode
  • Encoding Forms
  • Different ways to represent a character's code
    point as a sequence of code units (and bytes)
  • One to four 8-bit values (UTF-8)
  • One or two 16-bit values (UTF-16)
  • A single 32-bit value (UTF-32)
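The difference between the encoding forms shows up in how many bytes the same character occupies. This is a minimal sketch using Python's built-in codecs; the helper name `encoded_sizes` is made up, and the big-endian variants are chosen only so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


print(encoded_sizes("A"))           # ASCII letter
print(encoded_sizes("\u02B0"))      # MODIFIER LETTER SMALL H
print(encoded_sizes("\U0001D11E"))  # a character beyond U+FFFF
```

The code point is the same in every case; only its serialization differs, which is why a file's encoding needs to be declared or documented along with the data.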

37
Using Unicode for Linguistic Data
Steps to using Unicode
  • Encoding Forms
  • Reason for different forms: different
    implementation needs
  • Some tradeoffs for storage/processing
  • Suggestion: Use UTF-8 or UTF-16

38
Using Unicode for Linguistic Data
Steps to using Unicode
  • Further recommendations
  • Groups of users (e.g., Athabaskanists) should
    publicly document Unicode values for the
    orthography and give font recommendations.
  • Provide feedback on missing characters to Peter
    Constable.

39
Appendices
  • 1. Linguistic Letters and Symbols in Unicode
  • 2. Characters known to be missing
  • 3. Normalization
40
end