COMP323 Foundations of Chinese Computing - PowerPoint PPT Presentation

Provided by: LQ5

Transcript and Presenter's Notes

1
  • COMP323 Foundations of Chinese Computing

2
Course Introduction
  • Lecturer
  • Qin LU
  • csluqin_at_comp.polyu.edu.hk
  • Room PQ814, Tel. 27667247
  • Teaching Assistant (Responsible for some Labs and
    Project Assignments)
  • Chen Yirong
  • csyrchen_at_comp.polyu.edu.hk
  • Room QT416, Tel. 2766 7326

3
Course Introduction
  • COMP323 Reference Books
  • CJKV Information Processing: Chinese, Japanese,
    Korean and Vietnamese Computing (PL1074.5 .L86)
  • An Introduction to Chinese, Japanese and Korean
    Computing (QA76.H7795)
  • ????????? (PL1074.5.C42) and others
  • Tutorials and labs: PQ604A
  • Tuesday Group: 9:30-10:30, Tuesdays
  • Thursday Group: 9:30-10:30, Thursdays
  • Try to finish the labs and the online
    assignment/QA during lab hours

4
Course Introduction
  • COMP323 Website
  • WebCT
  • Lecture notes available Wed. by 5pm
  • Print as NotePage
  • Method of Assessment
  • Course Work: 55%
  • 2 Programming Assignments: 20%
  • 2 online quizzes: 20%
  • 1 online homework: 5%
  • 4 online QA (labs): 8%
  • Class attendance (punctuality): 2%
  • Final Examination: 45%

5
Course Introduction
  • Introduction to Chinese Computing
  • Computer processing of data related to Chinese,
    including any human-computer interaction in
    which communication takes place in the Chinese
    language.

About one-fifth of the people in the world speak
some form of Chinese as their native language,
making it the language with the most native
speakers.
6
Course Introduction
  • Fundamental Problems with Chinese Computing
  • At Chinese Character Level
  • Large and not Closed Character Set
  • Computer Representation, Input and Output
  • At Chinese Language Level
  • Lack of Morphological Variation
  • Lack of Grammar
  • Very Arbitrary and Flexible
  • Superimposed Grammar
  • Text Runs Together (No Delimiters between Words)
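
Because text runs together, a program has to find the word boundaries itself before it can process words. The sketch below shows one simple way to do this, forward maximum matching against a word list. It is only an illustration, not necessarily the method taught in this course; the function name, the toy lexicon and the max_len limit are all assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation against a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would load a large dictionary.
TOY_LEXICON = {"我们", "学习", "中文", "计算", "计算机"}

print(forward_max_match("我们学习中文计算机", TOY_LEXICON))
# ['我们', '学习', '中文', '计算机']
```

Greedy matching picks the longest dictionary word at each position, so it chooses 计算机 over 计算 here. It can still segment ambiguous strings incorrectly, which is one reason word segmentation is treated as a topic of its own later in the course.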



7
Course Introduction
  • Fundamental Problems with Chinese Computing

8
Course Introduction
  • Fundamental Problems with Chinese Language
  • Bi-lingual, Tri-lingual and Multi-lingual
    Computing
  • Question: Is Hong Kong a multi-lingual society?
  • How can a system be designed so that it can be
    used by different languages with minimal changes?
  • How can a system be designed so that it can be
    used for multiple languages?
  • Distinguish Chinese and English Characters
  • Chinese Text, English Text or Chinese Text Mixed
    Together with English Text?



9
Course Introduction
  • Fundamental Problems with Chinese Language
  • Bi-lingual, Tri-lingual and Multi-lingual
    Computing
  • Example: Count the Number of (Chinese and/or
    English) Characters or Words

10
Tentative Teaching Content
  • Characteristics of Chinese Language
  • Reading System (Pronunciation)
  • Writing System (Look)
  • Computer Representation of Chinese Characters
  • Character Set Standards (GB, Big5 and Unicode
    ...)
  • Encoding Schemes (ISO and UTF)
  • Chinese Character Input
  • Chinese Input Processing by (Pen, Image, Speech
    and) Key Stroke
  • Shape-based Keystroke Input Method
  • Phonetic-based Keystroke Input Method
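
To make the difference between character sets and encoding schemes concrete, the sketch below prints the bytes that represent one character under several encodings. It relies only on codecs that ship with Python ("gb2312", "big5", "utf-8", "utf-16-be"); the helper name show_encodings is an assumption made for this example.

```python
def show_encodings(ch):
    """Return the hex bytes of one character under several encodings."""
    result = {"code point": f"U+{ord(ch):04X}"}
    for name in ("gb2312", "big5", "utf-8", "utf-16-be"):
        result[name] = ch.encode(name).hex().upper()
    return result


for key, value in show_encodings("中").items():
    print(f"{key:>10}: {value}")
# code point: U+4E2D
#     gb2312: D6D0
#       big5: A4A4
#      utf-8: E4B8AD
#  utf-16-be: 4E2D
```

The same character is stored as different bytes under each standard, so a program must know which encoding a text uses before it can interpret the bytes.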

11
Tentative Teaching Content
  • Chinese Character Output
  • Bitmap and Outline Font Representation
  • Compression
  • Scaling Problem
  • Software Development for Chinese
  • Text Processing, such as Character Searching,
    Editing, and Deletion
  • Software Localization and Internationalization

12
Tentative Teaching Content
  • Chinese Language Processing
  • Word Segmentation
  • Part-of-Speech (POS) Tagging
  • Syntactic Analysis (Grammatical Analysis)
  • Chinese Information (Document) Retrieval
  • Document Retrieval Models
  • Language-Related Issues
  • Advanced Topics (possibly)
  • Information Extraction
  • Text Summarization

13
Lecture 1: Characteristics of Chinese
14
The Chinese Language
  • General Characteristics
  • The official language of China is Mandarin (???),
    but there are many spoken dialects (50).
  • Different Pronunciation across Different Dialects
  • Relatively Unified Writing System
  • Dialect-specific Characters and Variant Character
    Writing
  • Different words express the same meaning, e.g. ?
    and ? (to be)
  • Word order reversal, e.g. ?? and ?? (look for)

15
The Chinese Language
16
The Chinese Language
  • Characteristics of Chinese Characters
  • Each Chinese character is associated with three
    features: its look (called graphemics), its
    pronunciation (called phonetics), and its
    meaning (called semantics).




Graphemics (The Look)
Phonetics (The Sound)
Semantics (The Meaning)
17
Chinese Writing System
  • Radicals (部首)
  • Chinese characters are composed of smaller
    units, called radicals.
  • 214 radicals are used for indexing Chinese
    characters.
  • The advantage of a radical is that one does not
    have to know the pronunciation of a character
    to look it up in a dictionary.
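
The 214 indexing radicals also appear in Unicode as the Kangxi Radicals block (U+2F00 to U+2FD5). The short sketch below uses Python's standard unicodedata module to enumerate them and to map a radical code point to the ordinary character it corresponds to. The helper name radical_to_character is an assumption made for this example.

```python
import unicodedata

# The Kangxi Radicals block of Unicode spans U+2F00..U+2FD5.
KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5

radicals = [chr(cp) for cp in range(KANGXI_FIRST, KANGXI_LAST + 1)]
print(len(radicals))  # 214


def radical_to_character(radical):
    """Map a Kangxi radical code point to its unified ideograph."""
    # NFKC normalization applies the compatibility mapping that
    # Unicode defines for each Kangxi radical.
    return unicodedata.normalize("NFKC", radical)


first = radicals[0]
print(unicodedata.name(first))                        # KANGXI RADICAL ONE
print(f"U+{ord(radical_to_character(first)):04X}")    # U+4E00
```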

18
Chinese Writing System
  • Radicals
  • Remark: Several radicals can stand alone as
    single, meaningful Chinese characters.

Radical   Standalone   Examples
?         Yes          ????????????
?         Yes          ????????????
?         Yes          ????????????
?         Yes          ????????????
19
Chinese Writing System
  • Strokes (??)
  • Radicals in turn are composed of smaller units,
    called strokes.
  • 30 strokes are the most basic elements of a
    character.
  • 5 basic strokes are ? (?, a horizontal stroke),
    ? (?, a vertical stroke), ? (?, dot), ? (?,
    a stroke curved to the left) and ? (?, a bend
    stroke).

20
Chinese Writing System
  • Strokes
  • Stroke Order (??)
  • The strokes for each Chinese character are to be
    drawn in a certain defined order.
  • Basic principles are from left to right, top to
    bottom, outside to inside, horizontal before
    vertical, left slant before right slant, center
    before two sides, etc.
  • See animations here:
    http://www.chinawestexchange.com/Chinese/characters.htm

21
Chinese Writing System
  • Tree Structure of Chinese Characters

22
Chinese Writing System
  • Character Classifications and Formation
  • Type 1 Pictographs (Picture Characters) (象形)
  • They look like the things they represent.
  • Examples are 日 (sun), 山 (mountain), 水 (water),
    鳥 (bird), 火 (fire), 木 (tree), 車 (car, cart),
    and ? (month, opening), etc.

Does this character 月 really look like a moon to
you? Centuries ago, it was written like this
23
Chinese Writing System
  • Evolution of Chinese Characters

24
Chinese Writing System
  • Character Classifications and Formation
  • Type 2 (Simple) Ideographs (?? or ??)
  • They represent abstract concepts or ideas, such
    as numbers and directions, e.g. 一 (one), 二 (two),
    三 (three), 中 (center, middle), 上 (above), and
    下 (below).

25
Chinese Writing System
  • Character Classifications and Formation
  • Type 3 Compound Ideographs (??)
  • Pictographs and ideographs can be combined to
    form more complex characters, which usually
    reflect the combined meaning of their parts.
  • Examples
  • More interesting animations from the Internet:
    http://www.language.berkeley.edu/fanjian/compound_ideographs.html

sun 日 + moon 月 = bright 明
person 人 + person 人 = agree/follow 从
sun 日 + tree 木 = east 東 (sun rising above the
trees in the east)
tree 木 + tree 木 = forest 林
forest 林 + one more tree 木 = full of trees 森
26
Chinese Writing System
  • Character Classifications and Formation
  • Type 3 Compound Ideographs

27
Chinese Writing System
  • Character Classifications and Formation
  • Type 3 Compound Ideographs

28
Chinese Writing System
  • Character Classifications and Formation
  • Type 4 Phonetic Ideographs (??)
  • They usually have at least two component
    characters: one hints at the sound and the
    other hints at the meaning.
  • For example,
  • They account for more than 90% of all Chinese
    characters in use today.



For the character 跳 (jump), the left part 足
means foot. The meanings of characters that
contain 足 are related to the foot in some
way. The right part 兆 indicates the sound.
The two share the same vowel.
29
Chinese Writing System
Thought to be the oldest types of characters,
pictographs were originally pictures of things.
During the past 5,000 years or so they have
become simplified and stylised.
Ideographs are graphical representations of
abstract ideas.
Compound pictographs and ideographs combine two
or more pictographs or ideographs to form new
characters. Each component contributes to the
meaning of the compound character.
30
Chinese Writing System
Semantic-phonetic compounds represent around 90%
of all existing characters and consist of two
parts: a semantic component, or radical, which
hints at the meaning of the character, and a
phonetic component, which gives a clue to the
pronunciation of the character. Characters
containing the same phonetic component may have
the same sound and the same tone, the same sound
but a different tone, the same initial or final
sound, or a different sound and a different tone.
Phonetic components are generally a more
reliable indication of pronunciation than
semantic components are of meaning.
31
Chinese Writing System
  • Traditional and Simplified Characters
  • Over time, frequently used and complex Chinese
    characters tend to be simplified.
  • More about Pitfalls and Complexities of Chinese
    to Chinese Conversion:
    http://www.cjk.org/cjk/c2c/c2cbasis.htm

retain only one part from the traditional
character
32
Chinese Writing System
  • Chinese Language (Chinese Text)
  • Chinese characters are in turn combined with
    other characters into words, which express more
    complex ideas and concepts.
  • Question: How many Chinese characters are there?

The Chinese writing system is open-ended, meaning
that there is no upper limit to the number of
characters. The largest Chinese dictionaries
include about 56,000 characters, but most of them
are archaic, obscure or rare variant forms.
Knowledge of about 3,000 characters enables you
to read about 99% of the characters in Chinese
newspapers and magazines. To read Chinese
literature, technical writings or classical
Chinese, though, you need to be familiar with
about 6,000 characters.
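
A coverage figure like the one above can be measured on any text by counting how often each character occurs. The sketch below is a minimal illustration: the function name coverage and the toy sample string are assumptions made for the example, and a meaningful measurement would need a large corpus rather than one short string.

```python
from collections import Counter


def coverage(text, top_n):
    """Share of Chinese character occurrences in text that are
    accounted for by the top_n most frequent characters."""
    # Assumption: only code points U+4E00..U+9FFF are counted.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Toy sample: 4 + 3 + 2 + 1 = 10 character occurrences.
sample = "的的的的一一一是是不"
for n in (1, 2, 4):
    print(n, coverage(sample, n))
# 1 0.4
# 2 0.7
# 4 1.0
```

Because a small number of characters occur very frequently, coverage rises quickly at first and then flattens, which is why a few thousand characters are enough for most everyday text.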
33
Chinese Reading System
  • Pronunciation
  • The phonetic information is not explicit.
  • Sometimes, you can guess the pronunciation
    through the component characters.
  • Sometimes, the pronunciation has no relation to
    its components at all.
  • It makes the learning of Chinese difficult
    without a phonetic transcription system.
  • Phonetic transcription: writing down
    pronunciation with symbols
  • Sufficiency: there are symbols for all sounds
    in the language
  • Uniqueness: each sound is denoted by only one
    symbol

34
Chinese Reading System
  • Pronunciation
  • Pinyin: transcribing Mandarin Chinese
  • Consonant (??, initial) and Vowel (??, final)
  • More about Pronunciation:
    http://www.chinese-outpost.com/language/pronunciation/mandarin-chinese-initials-and-finals-table-1.asp

For example, consider Beijing. In bei, b is an
initial and ei is a final. In jing, j is an
initial and ing is a final. In speech, Chinese
words are created using just 21 beginning sounds
called initials, and 37 ending sounds called
finals. Initials and finals combine to create
the basic sounds of Chinese.
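
The split into initial and final can be sketched in a few lines. This is a simplified illustration: it works on pinyin spelling without tone marks, the function name split_syllable is an assumption, and syllables spelled with y or w (such as yi or wu) are treated as having no initial, which is a convention chosen for this example.

```python
# The 21 initials of Mandarin. The two-letter initials come first so
# that "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    s = syllable.lower()
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    # No initial matched: the whole syllable is the final.
    return "", s


for syllable in ("bei", "jing", "zhong", "an"):
    print(syllable, split_syllable(syllable))
# bei ('b', 'ei')
# jing ('j', 'ing')
# zhong ('zh', 'ong')
# an ('', 'an')
```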
35
Chinese Reading System
  • Pronunciation
  • Pinyin

36
Chinese Reading System
  • Pronunciation
  • Tones of Chinese
  • Chinese is a tonal Language.
  • Mandarin has 4 tones (5 counting the neutral
    tone) and Cantonese has 6 (9 counting the
    entering tones), which makes Cantonese much
    harder to learn than Mandarin.

37
Chinese Reading System
  • Pronunciation
  • Tones differentiate meanings.

Everyone seems to know this one. Just by saying
"ma" in different tones, you can ask, "Did
mother scold the horse?" ????? (mā mà mǎ ma?)
?? (Gǒng Lì, with third and fourth tones) is the
name of the star of Raise the Red Lantern and
other contemporary Chinese films. However, 公里
(gōng lǐ, with first and third tones) means
kilometer.
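
Since tones differentiate meanings, software that handles pinyin usually has to keep the tone with the syllable. The sketch below converts a syllable written with a tone mark into a plain syllable plus a tone number, using Unicode decomposition from Python's standard unicodedata module. The function name tone_number and the use of 5 for the unmarked (neutral) tone are assumptions made for this example.

```python
import unicodedata

# Combining marks used by pinyin for tones 1-4.
TONE_MARKS = {
    "\u0304": 1,  # macron, as in mā
    "\u0301": 2,  # acute accent, as in má
    "\u030c": 3,  # caron, as in mǎ
    "\u0300": 4,  # grave accent, as in mà
}


def tone_number(syllable):
    """Return (syllable without tone mark, tone number)."""
    # NFD separates each accented letter into base letter + marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = 5  # Assumption: an unmarked syllable is labelled tone 5.
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recombines what is left, so that ü stays a single letter.
    return unicodedata.normalize("NFC", "".join(kept)), tone


for syllable in ("mā", "má", "mǎ", "mà", "ma"):
    print(tone_number(syllable))
# ('ma', 1)
# ('ma', 2)
# ('ma', 3)
# ('ma', 4)
# ('ma', 5)
```

With the tone kept as a number, gong3 li4 (the actress) and gong1 li3 (kilometer) remain distinct even in plain ASCII text.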