1
The current status of Chinese-English EBMT: where
are we now?
  • Joy (Ying Zhang)
  • Ralf Brown, Robert Frederking, Erik Peterson
  • Aug 2001

2
Topics
  • Overview of this project
  • Rapid-deployment machine translation system
    between Chinese and English
  • For HLT 2001 (Jun 00-Jan 01)
  • Augment the segmenter with new words found in the
    corpus
  • For MT-Summit VIII Paper (Jan 01- May 01)
  • Two-threshold method used in tokenization code to
    find new words in corpus
  • For PI meeting (Jun 01- Jul 01)
  • Accurate ablation experiments
  • Named-entities added to the training
  • Multi-corpora experiment
  • After PI meeting (Aug 01)
  • Study of results reported for PI meeting
  • Review of evaluation methods
  • Type-token relations
  • Plan for future research

3
Overview of Ch-En EBMT
  • Adapting EBMT to Chinese
  • Corpus used
  • Hong Kong legal code (from LDC)
  • Hong Kong news articles (from LDC)
  • In this project
  • Robert Frederking, Ralf Brown, Joy, Erik
    Peterson, Stephan Vogel, Alon Lavie, Lori Levin,

4
Corpus Cleaning
  • Convert from Big5 to GB
  • Divided into training set (90%), dev-test set (5%)
    and test set (5%)
  • Sentence-level alignment, using the Gale & Church
    method (by ISI)
  • Cleaned
  • Convert two-byte Chinese characters to their
    cognates

5
Corpus Statistics
  • Hong Kong Legal Code
  • Chinese: 23 MB
  • English: 37.8 MB
  • Hong Kong News (after cleaning)
  • 7,622 documents
  • Dev-test: 1,331,915 bytes, 4,992 sentence pairs
  • Final-test: 1,329,764 bytes, 4,866 sentence pairs
  • Training: 25,720,755 bytes, 95,752 sentence pairs
  • Vocabulary size under the LDC segmenter
  • Dev-test: 8,529 types, 134,749 tokens
  • Final-test: 8,511 types, 135,372 tokens
  • Training: 20,451 types, 2,600,095 tokens
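
The type and token counts above can be compared through their type/token ratios. The following is a small sketch that uses only the figures on this slide; the names `stats` and `type_token_ratio` are illustrative and not part of the project.

```python
# Type and token counts copied from the slide (LDC segmenter).
stats = {
    "dev-test": (8529, 134749),
    "final-test": (8511, 135372),
    "training": (20451, 2600095),
}


def type_token_ratio(types, tokens):
    """Return the number of distinct word types per running token."""
    return types / tokens


for name, (types, tokens) in stats.items():
    print(f"{name}: {type_token_ratio(types, tokens):.4f}")
```

The training set has a much lower ratio than the two test sets, which is expected: as a corpus grows, new types appear more and more slowly.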

6
Chinese Segmentation
  • There are no spaces between Chinese words in
    written Chinese.
  • The segmentation problem: given a sentence with
    no spaces, break it into words

7
Vague Definition of Words
  • In English, a word might be defined as a group
    of letters that has meaning and is separated by
    spaces in the sentence. This doesn't work for
    Chinese.
  • Is the word a single Chinese character? Not
    necessarily.
  • Is the word the smallest set of characters that
    can have meaning by themselves? Maybe.
  • Is the word the longest set of characters that
    can have meaning by themselves? Perhaps.

8
Our Definition of Words/Phrases/Terms
  • Chinese Characters
  • The smallest unit in written Chinese is a
    character, which is represented by 2 bytes in
    GB-2312 code.
  • Chinese Words
  • A word in natural language is the smallest
    reusable unit which can be used in isolation.
  • Chinese Phrases
  • We define a Chinese phrase as a sequence of
    Chinese words. For each word in the phrase, the
    meaning of this word is the same as the meaning
    when the word appears by itself.
  • Terms
  • A term is a meaningful constituent. It can be
    either a word or a phrase.
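
The two-bytes-per-character claim can be checked with Python's built-in `gb2312` codec. This is only an illustration; the helper name `gb_byte_length` is hypothetical.

```python
def gb_byte_length(text):
    """Return the number of bytes needed to store text in GB-2312."""
    return len(text.encode("gb2312"))


sample = "中文"  # two Chinese characters
print(len(sample), gb_byte_length(sample))

# ASCII characters still take a single byte in this encoding.
print(gb_byte_length("a"))
```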

9
Complicated Constructions
  • There are some constructions that can cause
    problems for segmentation
  • Transliterated foreign words and names: Chinese
    characters are used for the sound of English
    names. The meaning of each character is
    irrelevant and cannot be relied on. Each
    Chinese-speaking region will often transliterate
    the same name differently.

10
Complicated Constructions (2)
  • Abbreviations: in Chinese, abbreviations are
    formed by taking a character from each word in
    the phrase being abbreviated.
  • Virtually any phrase can be abbreviated by taking
    one character from each component, and these
    characters usually have no independent relation
    to each other.

11
Complicated Constructions (3)
  • Chinese Names
  • Name = surname (generally one character) + given
    name (one or two characters)
  • About 100 common surnames, but the number of
    given names is huge
  • The complication for NLP: the same characters
    used in names also appear in regular words, just
    as "Bill" and "Brown" are ordinary English words
    that also form the name Bill Brown.

12
Complicated Constructions (4)
  • Chinese Numbers
  • Similar to English, there are several ways to
    write numbers in Chinese

13
Segmenter
  • Approaches
  • Statistical approaches
  • Idea: build collocation models for Chinese
    characters, such as a first-order HMM, and place
    a space where two characters rarely co-occur.
  • Cons
  • Data sparseness
  • Cross boundary
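
The idea of placing a space where two characters rarely co-occur can be sketched with plain bigram counts and a threshold. This is a deliberately simplified stand-in, not a first-order HMM and not the project's code; the function names, the toy corpus and the `min_count` parameter are all assumptions made for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count adjacent character pairs in unsegmented text."""
    counts = Counter()
    for sentence in sentences:
        counts.update(zip(sentence, sentence[1:]))
    return counts


def segment(sentence, counts, min_count=2):
    """Split between two characters whose pair count is below min_count."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if counts[(prev, cur)] >= min_count:
            words[-1] += cur      # frequent pair: keep in the same word
        else:
            words.append(cur)     # rare pair: start a new word
    return words


# Toy unsegmented corpus (made up for this example).
corpus = ["香港法律", "香港新闻", "法律新闻"]
counts = train_bigrams(corpus)
print(segment("香港新闻", counts))
```

With such a tiny corpus most pairs are unseen, which is the data sparseness problem listed under Cons.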

14
Segmenter (2)
  • Dictionary-based approaches
  • Idea: use a dictionary to find the words in the
    sentence
  • Forward maximum match, backward maximum match,
    or both directions
  • Cons
  • The size and quality of the dictionary used are
    of great importance (new words, named entities)
  • Maximum (greedy) match may cause mis-segmentations
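
Forward and backward maximum match can be sketched as follows. This is a minimal illustration rather than the segmenter used in the project; the toy dictionary, the `max_len` limit and the fallback to single characters for unknown text are assumptions.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word starting at each position."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + n]
            # A single character is always accepted as a fallback.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


def backward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word ending at each position."""
    words, j = [], len(sentence)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            candidate = sentence[j - n:j]
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                j -= n
                break
    return words[::-1]


toy_dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
print(forward_max_match(sentence, toy_dictionary))
print(backward_max_match(sentence, toy_dictionary))
```

On this sentence the forward pass greedily takes 研究生 and strands 命, while the backward pass recovers 研究 / 生命 / 起源, which shows how greedy matching can mis-segment and why the two directions are often compared.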

15
Segmenter (3)
  • A combination of dictionary and linguistic
    knowledge
  • Idea: use morphology, POS, grammar and
    heuristics to aid disambiguation
  • Pros: high accuracy (possible)
  • Cons
  • Requires a dictionary with POS and word-frequency
  • Computationally expensive

16
Segmenter (4)
  • We first used LDC's segmenter
  • Currently we are using a forward/backward maximum
    match segmenter as the baseline. The word
    frequency dictionary is from LDC
  • Word frequency dictionary from LDC: 43,959 entries

17
For HLT 2001
  • Ying Zhang, Ralf D. Brown, and Robert E.
    Frederking. "Adapting an Example-Based
    Translation System to Chinese". To appear in
    Proceedings of Human Language Technology
    Conference 2001 (HLT-2001).

18
For MT-Summit VIII
  • Ying Zhang, Ralf D. Brown, Robert E. Frederking
    and Alon Lavie. "Pre-processing of Bilingual
    Corpora for Mandarin-English EBMT". Accepted in
    MT Summit VIII (Santiago de Compostela, Spain,
    Sep. 2001)
  • Two-threshold for tokenization

19
For MT-Summit VIII (2)

20
For PI Meeting (1)
  • Baseline System
  • Full System
  • Baseline + Named-Entity
  • Multi-corpora System

21
For PI Meeting (2)
  • Baseline System

22
For PI Meeting (3)
  • Full System

23
For PI Meeting (4)
  • Named-Entity

24
For PI Meeting (5)
  • Multi-Corpora Experiment
  • Motivation
  • Corpus Clustering
  • Experiment

25
Evaluation Issues
  • Automatic Measures
  • EBMT Source Match
  • EBMT Source Coverage
  • EBMT Target Coverage
  • MEMT (EBMT+DICT) Unigram Coverage
  • MEMT (EBMT+DICT) PER
  • Human Evaluations
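
PER is not defined on the slide. Assuming it stands for position-independent error rate, one common formulation compares hypothesis and reference as bags of words, ignoring word order. The sketch below follows that assumption; the function name and example sentences are illustrative only and may differ from the measure actually used in the project.

```python
from collections import Counter


def position_independent_error_rate(hypothesis, reference):
    """Bag-of-words error rate: word order is ignored.

    Assumed formulation: (max(|hyp|, |ref|) - matches) / |ref|,
    where matches is the size of the multiset intersection.
    """
    matches = sum((Counter(hypothesis) & Counter(reference)).values())
    longest = max(len(hypothesis), len(reference))
    return (longest - matches) / len(reference)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on mat".split()
print(position_independent_error_rate(hypothesis, reference))
```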

26
Evaluation Issues (2)
  • Human Evaluations
  • 4-5 graders each time
  • 6 categories

27
Evaluation Issues (3)

28
After PI Meeting (0)
  • Study of results reported in PI meeting
  • (http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm)
  • The quality of Named-Entity (Cleaned by Erik)
  • Performance difference of EBMT while changing the
    average length of Chinese word token (by changing
    segmentation)
  • How to evaluate the performance of the system
  • Experiment of G-EBMT
  • Word clustering

29
After PI Meeting (1)
  • Changing the average length of Chinese token
  • No bracket on English
  • Use a subset of LDC's frequency dictionary for
    segmentation
  • Study the performance of EBMT system on different
    average Chinese token length
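
The effect of dictionary size on average token length can be sketched as below. The slide does not say how the subset of the frequency dictionary was chosen; this example assumes the k most frequent entries are kept and uses a simple forward maximum match. All names, frequencies and sentences are made up for illustration.

```python
def max_match(sentence, dictionary, max_len=4):
    """Forward maximum match with single-character fallback."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + n]
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


def top_k(freq_dict, k):
    """Keep the k most frequent entries (assumed subset strategy)."""
    ranked = sorted(freq_dict.items(), key=lambda item: -item[1])
    return {word for word, _ in ranked[:k]}


def average_token_length(sentences, dictionary):
    """Average number of characters per token after segmentation."""
    tokens = [t for s in sentences for t in max_match(s, dictionary)]
    return sum(len(t) for t in tokens) / len(tokens)


# Hypothetical frequency dictionary and corpus.
freq_dict = {"香港": 50, "法律": 30, "新闻": 20, "香港法律": 5}
corpus = ["香港法律", "香港新闻"]
for k in (0, 2, 4):
    print(k, average_token_length(corpus, top_k(freq_dict, k)))
```

A larger dictionary subset yields longer tokens on average, which is the variable plotted against the EBMT measures on the following slides.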

30
After PI Meeting (2)

31
After PI Meeting (3)
  • Avg. Token Len. vs. StatDict Recall

32
After PI Meeting (4)
  • Avg. Token Len. vs. Source word match

33
After PI Meeting (5)
  • Avg. Token Len vs. Source Coverage

34
After PI Meeting (6)
  • Avg. Token Len. Vs.

35
After PI Meeting (7)
  • Avg. Token Len. Vs. Src/Tgt Coverage of EBMT in
    MEMT

36
After PI Meeting (8)
  • Avg. Token Len. Vs. Translation Unigram Coverage

37
After PI Meeting (9)
  • Avg. Token Len. vs. Hypothesis Len (length of
    the translation). The reference translation's
    length is 1,163 words.

38
After PI Meeting (10)
  • Avg. Token Len. Vs. PER

39
After PI Meeting (11)
  • Type-Token curve for Chinese

40
After PI Meeting (12)
  • Type-Token curve of Chinese and English
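
A type-token curve records how many distinct types have been seen after each running token. The sketch below shows one way to compute such a curve from a token list; the function name, the `step` parameter and the sample tokens are assumptions, and the curves on the slides may have been produced differently.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) points, sampled every `step` tokens."""
    seen, curve = set(), []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


sample_tokens = ["香港", "法律", "香港", "新闻"]
print(type_token_curve(sample_tokens))
```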

41
Future Research Plan
  • Generalized EBMT
  • Word-clustering
  • Grammar Induction
  • Using Machine Learning to optimize the parameters
    used in MEMT
  • Better alignment model: integrating segmentation,
    bracketing and alignment

42
New Alignment Model (1)
  • Using both monolingual and bilingual collocation
    information to segment and align corpus

43
New Alignment Model (2)

44
New Alignment Model (3)

45
New Alignment Model (4)

46
References
  • Tom Emerson. "Segmentation of Chinese Text".
    MultiLingual Computing & Technology, #38,
    Volume 12, Issue 2. MultiLingual Computing, Inc.