Writing - PowerPoint PPT Presentation

Slides: 26
Provided by: harold

Transcript and Presenter's Notes


1
Writing
  • Character sets
  • Unicode
  • Input methods

2
Character sets
  • What's the problem?
  • The computer should handle your language's writing
    system in a natural way
  • "Handle" means input and output (and some other
    things, e.g. sorting)
  • "Natural" means the way you are used to
  • Input method
  • Output (it should look right)
  • English is straightforward (why?), but not other
    languages
  • Distinguish storage and handling of text within
    the computer vs. input/output
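
As a rough illustration of why sorting counts as part of "handling" a writing system, the sketch below compares Python's default sort (by code point) with a simplified key that ignores case and accents. The word list and the helper name simple_key are assumptions made for this example; real collation rules are language-specific and more involved than this.

```python
import unicodedata

words = ["zebra", "Zürich", "apple", "Äpfel", "Apple"]

# Default sort compares code points: upper-case ASCII letters come before
# lower-case ones, and accented letters come after "z".
by_code_point = sorted(words)

def simple_key(word):
    """Hypothetical helper: drop accents and ignore case before comparing."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

by_simple_key = sorted(words, key=simple_key)

print(by_code_point)   # ['Apple', 'Zürich', 'apple', 'zebra', 'Äpfel']
print(by_simple_key)   # ['Äpfel', 'apple', 'Apple', 'zebra', 'Zürich']
```

The second ordering is closer to what an English or German reader would expect, but it is only a simplification: some languages sort accented letters as separate letters at the end of the alphabet.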

3
Why the fuss?
  • Typing characters on a computer may appear
    deceptively simple: you press a key labelled A,
    and the character A appears on the screen.
    Well, you actually get uppercase A or lowercase
    a depending on whether you used the shift key
    or not, but that's common knowledge. You also
    expect A to be included in a disk file when
    you save what you are typing, you expect A to
    appear on paper if you print your text, and you
    expect A to be sent if you send your product by
    e-mail or something like that. And you expect the
    recipient to see an A.
  • No big deal, but does the same happen for Ä? Or
  • Depends on keyboard settings, display settings,
    and degree of standardization

Adapted from http://www.cs.tut.fi/~jkorpela/chars.html
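
A minimal sketch of what can go wrong with Ä, assuming the sender stores the text as UTF-8 and the recipient reads it as Latin-1 (the choice of these two encodings is an assumption for illustration, not something the slide specifies):

```python
text = "\u00c4"  # the character Ä

# The same character is stored as different bytes in different encodings.
utf8_bytes = text.encode("utf-8")       # b'\xc3\x84' (two bytes)
latin1_bytes = text.encode("latin-1")   # b'\xc4'     (one byte)

# Reading the UTF-8 bytes as if they were Latin-1 yields two wrong characters.
misread = utf8_bytes.decode("latin-1")

# ASCII has no code for Ä at all, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, latin1_bytes, len(misread), ascii_ok)
```

By contrast, "A" encodes to the same single byte in all three encodings, which is why the problem is easy to overlook in English text.
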
4
Character sets
  • Size of character set has to do with storage as
    bits and bytes
  • Early computers had only 32 characters: upper
    case English plus numerals and a few other
    symbols
  • ASCII (a 7-bit code) has space for 128 characters
  • Most alphabetic writing systems can be covered by
    128 characters
  • Internal storage is independent of I/O
  • Leads to need for standardization of encoding
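
The link between character-set size and storage can be sketched as follows. The helper name bits_needed is hypothetical, and the figure of 100,000 is the approximate Unicode repertoire size quoted later in these slides.

```python
def bits_needed(repertoire_size):
    """Smallest number of bits that gives each character its own code."""
    return max(1, (repertoire_size - 1).bit_length())

# Character-set sizes mentioned in the slides, plus 256 (one full byte).
table = {size: bits_needed(size) for size in (32, 64, 128, 256, 100_000)}
print(table)  # {32: 5, 64: 6, 128: 7, 256: 8, 100000: 17}

# The same character can occupy a different number of bytes
# depending on which standard encoding is used to store it.
sample = "\u00c4"  # Ä
bytes_used = {enc: len(sample.encode(enc))
              for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")}
print(bytes_used)  # {'latin-1': 1, 'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
```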

5
Writing systems
  • Alphabetic
  • Many languages use Roman alphabet
  • Often with diacritics (accents),
  • many are common to lots of languages
  • but some are quite unusual
  • and some languages use multiple diacritics
  • There are other alphabetic writing systems
  • Conventionally, a range of other symbols
    (numerals, currency signs, fractions, math
    symbols) are included
  • Syllabic
  • Ideographic

6
Accented characters
  • Input method
  • Individual key
  • Key combination
  • Menu
  • Must be available in all fonts

7
Characters and glyphs
  • A single character might have a variety of
    appearances (glyphs) depending on size, font,
    etc.
  • a a a a a a a a a a a (the same character in a variety of fonts)
  • A a à å a are all different characters
  • Appearance is a matter of rendering
  • In some writing systems, the same character is
    rendered differently depending on its context
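
One way to see the character/glyph distinction is that distinct characters have distinct code points, whatever font is later used to draw them. A small sketch using Python's standard unicodedata module (the four sample characters are chosen for illustration):

```python
import unicodedata

samples = "Aa\u00e0\u00e5"  # A, a, à, å

# Each character has its own code point and name, independent of any font.
code_points = [f"U+{ord(ch):04X}" for ch in samples]
names = [unicodedata.name(ch) for ch in samples]

for cp, name in zip(code_points, names):
    print(cp, name)
# U+0041 LATIN CAPITAL LETTER A
# U+0061 LATIN SMALL LETTER A
# U+00E0 LATIN SMALL LETTER A WITH GRAVE
# U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
```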

8
Output text direction
Note mixed left-to-right and right-to-left text in Arabic, and orientation
of Roman script in Chinese

9
Unicode
  • Problem of many (competing) standards, especially
    for Arabic, CJK and Indian scripts
  • Industry-agreed standard aiming to cover all
    the world's writing systems
  • Unicode consists of a repertoire of about
    100,000 characters, a set of code charts for
    visual reference, an encoding methodology and set
    of standard character encodings, an enumeration
    of character properties such as upper and lower
    case, a set of reference data computer files, and
    a number of related items, such as character
    properties, rules for text normalization,
    decomposition, collation, rendering and
    bidirectional display order (Wikipedia)
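
Some of the character properties listed above can be inspected with Python's standard unicodedata module. This is only a sketch: the helper name describe and the choice of sample characters are assumptions for the example.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode character properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),        # e.g. Ll = lower-case letter
        "direction": unicodedata.bidirectional(ch),  # e.g. L, R, AL
        "upper": ch.upper(),
        "lower": ch.lower(),
    }

latin_a = describe("a")
euro = describe("\u20ac")         # currency symbol
hebrew_alef = describe("\u05d0")  # right-to-left letter
arabic_alef = describe("\u0627")  # right-to-left Arabic letter

for info in (latin_a, euro, hebrew_alef, arabic_alef):
    print(info["name"], info["category"], info["direction"])
```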

10
Unicode: some issues
  • 30 writing systems encoded, but many more still
    to do
  • Non-alphabetic symbols should be included (e.g.
    music notation, currency symbols)
  • Should invented alphabets (e.g. Klingon, Tolkien)
    and/or ancient systems (hieroglyphics, Mayan) be
    included?

11
Unicode: some issues
  • Ready-made vs composite characters, e.g. é = e + ´
  • Hangul and Chinese/Japanese characters are made up of
    identifiable components
  • Ligatures: many writing systems have special
    forms for character combinations
  • Is this a matter of representation or rendering?
  • Some disputed characters: ligature or separate
    character? (e.g. Dutch ij)
  • Unicode also defines ordering conventions, not
    always uncontroversial
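
The ready-made vs composite distinction shows up directly in Unicode normalization. The sketch below uses Python's standard unicodedata.normalize; the sample characters (é, the Hangul syllable 한, and the Dutch ij ligature character U+0133) are chosen for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é as one ready-made character
composite = "e\u0301"    # e followed by a combining acute accent

# The two look the same when rendered but are different code point sequences.
same_before = precomposed == composite                             # False
same_after = unicodedata.normalize("NFC", composite) == precomposed  # True

# A Hangul syllable decomposes into its component letters (jamo).
hangul_parts = unicodedata.normalize("NFD", "\ud55c")  # 한 -> 3 code points

# The ij ligature character survives canonical normalization (NFC),
# but compatibility normalization (NFKC) turns it into plain "ij".
ij = "\u0133"
ij_nfc = unicodedata.normalize("NFC", ij)
ij_nfkc = unicodedata.normalize("NFKC", ij)

print(same_before, same_after, len(hangul_parts), ij_nfkc)
```

Software that compares or searches text usually normalizes both sides first, so that the two spellings of é are treated as equal.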

12
Input methods
  • Typing
  • Keyboard layout
  • Key combinations
  • Inputting ideographs
  • Handwriting pad
  • OCR

13
Typing
  • We are used to a conventional keyboard, which has
    (roughly) one key-stroke per character
  • We quickly learn key-stroke combinations (e.g. for
    capitals, accented characters)
  • Fluent typists rely on the key layout being
    familiar

14
(No Transcript)
15
Typing
  • Recent emergence of SMS text messaging on telephones
    has required input using just ten keys
  • Shows that software can map key-stroke
    combinations to the appropriate character sequence
  • For some users, bilingual keyboards are
    commonplace
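
As a sketch of mapping key-stroke combinations to characters, here is a simplified "multi-tap" decoder for a telephone keypad. The letter layout is the common one printed on phone keys; the function name decode_multitap and the convention that a space separates groups of presses (standing in for a pause) are assumptions for this example.

```python
# Letters printed on the keys of a typical telephone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
    "0": " ",
}

def decode_multitap(presses):
    """Decode groups of repeated key presses, e.g. '44 33' -> 'he'.

    Each space-separated group is one key pressed one or more times;
    the number of presses selects the letter on that key.
    """
    out = []
    for group in presses.split():
        letters = KEYPAD[group[0]]
        # Pressing past the last letter wraps around to the first.
        out.append(letters[(len(group) - 1) % len(letters)])
    return "".join(out)

print(decode_multitap("44 33 555 555 666"))  # hello
```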

16
Non-alphabetic writing systems
  • Syllabic system may require multiple key-strokes
    per character
  • Ideographic system (Chinese, Japanese) typically
    has input based on pronunciation, plus conversion
    to character, which may require contextual
    analysis
  • Alternative method: composition by radical and stroke
    count
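
Pronunciation-based input can be sketched as a lookup from a typed pronunciation to candidate characters, from which one is selected. The tiny table below is a toy assumption for illustration (real input methods use large dictionaries and rank candidates by context); the names CANDIDATES, candidates and convert are hypothetical.

```python
# Toy table (illustrative only): pinyin typed without tone marks ->
# characters pronounced that way, listed in tone order.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "zhongguo": ["中国"],
}

def candidates(pronunciation):
    """Return the characters that match a typed pronunciation."""
    return CANDIDATES.get(pronunciation, [])

def convert(pronunciation, choice=0):
    """Replace the typed pronunciation with the chosen candidate.

    If nothing matches, the typed letters are left unchanged.
    """
    options = candidates(pronunciation)
    if not options:
        return pronunciation
    return options[choice]

print(candidates("ma"))     # five different characters share this spelling
print(convert("ma", 2))     # the user picks the third candidate
print(convert("zhongguo"))  # a whole word has fewer candidates
```

Typing a whole word rather than a single syllable narrows the candidates, which is one simple form of the contextual analysis mentioned above.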

17
Graphic input
  • Using stylus, e.g. on PDA
  • Also using finger on touchpad on laptop
  • Depends on recognizing stroke direction and order
  • Shorthand method invented
  • Recent systems recognize conventional letter
    shapes ...
  • ... in all their varieties

18
Graphic input
  • Also found for Chinese/Japanese
  • Important to get stroke order correct

19
OCR
  • Optical character recognition
  • Scanning
  • Essentially a pattern recognition task: how
    similar is a given image to the expected image
  • Divide image into regions
  • Measure blackness of each region
  • Compare resulting matrix with template
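
The three steps above (divide into regions, measure blackness, compare with template) can be sketched in a few lines. This is a toy version under stated assumptions: images are lists of strings with "#" for black pixels, the grid is 3 by 3, similarity is a sum of squared differences, and the three templates and all function names are invented for the example.

```python
def blackness_grid(image, rows=3, cols=3):
    """Divide the image into rows x cols regions; return the black fraction of each."""
    height, width = len(image), len(image[0])
    grid = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cells = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(ch == "#" for ch in cells) / len(cells))
        grid.append(row)
    return grid

def distance(a, b):
    """Sum of squared differences between two blackness matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def recognize(image, templates):
    """Return the name of the template whose matrix is closest to the image's."""
    grid = blackness_grid(image)
    return min(templates, key=lambda name: distance(grid, blackness_grid(templates[name])))

TEMPLATES = {
    "T": ["######", "######", "..##..", "..##..", "..##..", "..##.."],
    "L": ["##....", "##....", "##....", "##....", "######", "######"],
    "I": ["..##..", "..##..", "..##..", "..##..", "..##..", "..##.."],
}

# A scanned "T" with two pixels missing.
noisy_t = ["######", "#####.", "..##..", "..##..", "..#...", "..##.."]
print(recognize(noisy_t, TEMPLATES))  # T
```

Because whole regions are compared rather than individual pixels, a few missing or extra pixels change the matrix only slightly, so the closest template is still the right one.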

20
OCR
  • Originally developed with special OCR font which
    maximized the differences between characters
  • For Latin scripts, works very well with almost
    any font
  • Can include orientation detection
  • Errors are predictable and could be eradicated
    with more sophisticated (linguistic) processing,
    but is it worth it?

21
OCR for handwriting
  • Neat printed handwriting not much harder than
    some fonts
  • Joined-up cursive handwriting still a research
    problem
  • Related problem of handwriting recognition: a
    bit like speech understanding and voice
    recognition

22
OCR for other scripts
  • Correspondingly more difficult, depending on:
  • Complexity of writing system in general
  • Complexity and similarity of individual characters

23
  • Not always easy
  • Handwriting is even harder

24
Need for OCR
  • Input of (all sorts of) texts for various
    purposes
  • Rapid input to save (re)typing
  • For further processing
  • For study
  • Two typical (hard) cases
  • Study of ancient manuscripts
  • Intelligence gathered in Iraq

25
(No Transcript)