Chinese Word Segmentation Method for Domain-Special Machine Translation

1
Chinese Word Segmentation Method for
Domain-Special Machine Translation
  • Su Chen, Zhang Yujie, Guo Zhen, Xu Jinan
  • Beijing Jiaotong University

2
Outline
Motivation
Method of combining multiple segmentation results
Experiment Evaluation
Conclusion
3
Motivation 1/2
Segmenter trained on CTB training data
Training data   Test data     F-measure
News domain     News domain   97.62
News domain     Science       83.89
  • CTB test data: OOV 3.47
  • Science annotated data: OOV 22.4
4
Motivation 2/2
  • Background: Development of a domain-specific
    Chinese-English machine translation system.
  • Problem: The accuracy of Chinese Word Segmentation
    (CWS) often decreases on the large amounts of
    domain-specific training text.
  • This causes many errors in translation knowledge
    extraction.
  • It therefore seriously affects translation quality.

5
Our solution
  • Related work
    • Domain-adapted Chinese word segmentation based on
      statistical features
      • Previous work generally adopts only the 1-best
        result and ignores the lower-ranking results.
    • Bilingually motivated domain-adapted word
      segmentation
      • Many characters are aligned to NULL, which
        decreases the accuracy of Chinese segmentation.
  • Our goal: Extend these methods to augment domain
    adaptation of CWS.

6
Our approach
  • We propose a linear model that combines multiple
    Chinese word segmentation results from two
    segmenters to augment domain adaptation:
    • a segmenter based on n-gram features of a Chinese
      raw corpus;
    • a segmenter based on bilingually motivated
      features.

7
Framework
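[Figure: framework diagram. The Chinese raw corpus and the annotated corpus are used for training the CRF model; the CRF segmenter produces a segmentation result for the Chinese sentences; the bilingual segmenter produces a result from the word alignment between the Chinese and English sentences; a linear model combines the multiple results.]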
8
[Figure: CRF segmenter. N-gram statistical features come from the raw corpus; statistical features are extracted for the annotated corpus (training the CRF model) and for the test data (CRF decoding); the output is the segmentation result.]
9
CRF segmenter
  • Exploring statistical features of a large-scale
    domain-specific Chinese raw corpus
    • N-gram frequency feature
    • N-gram AV (Accessor Variety) feature
  • Output of the CRF model
    • N-best list of segmentation results
    • Corresponding probability scores
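
The two statistical features can be sketched as follows. This is a minimal illustration, not the authors' feature extractor: it assumes the common definition of Accessor Variety as the smaller of the numbers of distinct characters seen immediately before and after an n-gram, and it simplifies by treating every sentence boundary as a single neighbour symbol. The function name `ngram_features` and the Latin stand-in characters are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_features(corpus, max_n=3):
    """Return (frequency, accessor variety) for all n-grams up to max_n.

    Assumption: AV(s) = min(#distinct left neighbours, #distinct right neighbours),
    with "^" and "$" standing in for sentence boundaries (a simplification).
    """
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for sentence in corpus:
        padded = "^" + sentence + "$"
        for n in range(1, max_n + 1):
            # start runs over positions where the n-gram lies inside the sentence
            for start in range(1, len(padded) - n):
                gram = padded[start:start + n]
                freq[gram] += 1
                left[gram].add(padded[start - 1])
                right[gram].add(padded[start + n])
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return freq, av


# Latin letters stand in for Chinese characters in this toy corpus.
freq, av = ngram_features(["abc", "xbc", "abd"])
print(freq["b"], av["b"], freq["bc"], av["bc"])
```

An n-gram that occurs in many different contexts on both sides gets a high AV, which is the signal a segmenter can use as evidence that the string is a word.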

10
Observation
  • Some segmentations that are wrong in the 1-best result
    are correct in the lower-ranking results.
  • We intend to utilize the correct parts within the
    10-best results and the corresponding probability
    scores.

11
Bilingual segmenter
  • The boundaries of Chinese words can be inferred from
    a parallel corpus.
  • Word boundaries are marked in the English sentences.
  • Alignments link each English word to Chinese words.

12
Inference step
  • Conduct word alignment using GIZA++, regarding
    each character of the Chinese sentence as one word.
  • For each alignment a_i = <e_i, C>, if the characters
    in C are consecutive in the sentence:
    • Take C as a word
    • Calculate its confidence score (refer to paper)
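
The consecutiveness check in the inference step can be sketched as below. This is an illustrative sketch only: the input format (a dict from English word index to a set of Chinese character positions) is a hypothetical in-memory representation, not the aligner's output format, and the confidence score is left out because the slides defer it to the paper.

```python
def candidate_words(chinese_chars, alignment):
    """Collect Chinese words implied by a character-level alignment.

    alignment: {english_index: set of Chinese character positions}
    (hypothetical format). An aligned set C is taken as a word only
    when its positions form one unbroken run in the sentence.
    """
    words = []
    for e_idx, positions in sorted(alignment.items()):
        if not positions:
            continue  # this English word is aligned to NULL
        lo, hi = min(positions), max(positions)
        if hi - lo + 1 == len(positions):  # consecutive characters
            words.append((lo, hi, "".join(chinese_chars[lo:hi + 1])))
    return words


# Latin letters stand in for Chinese characters.
chars = list("ABCDE")
alignment = {0: {0, 1}, 1: {2, 4}, 2: {3}, 3: set()}
print(candidate_words(chars, alignment))
```

Here the second English word is aligned to positions 2 and 4, which are not adjacent, so it yields no word.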

13
Linear model
  • Calculate the score of C_ij being a word by combining
    multiple segmentation results.

λ_k (1 ≤ k ≤ K) are the weights of the K segmentation results.
F(i, j) denotes the score of the characters from i
to j being a word.
Conf_k(i, j) (1 ≤ k ≤ K) is the confidence score of the
k-th segmentation result.
seg_k(i, j) (1 ≤ k ≤ K) is a two-valued function.
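
The combining equation itself is not reproduced in the text of this slide. One form consistent with the definitions given here, offered as an assumption rather than as the authors' exact equation, is a weighted sum in which the weights are written λ_k and seg_k(i, j) is assumed to be 1 when the k-th segmentation result contains the characters from i to j as a word and 0 otherwise:

```latex
F(i, j) = \sum_{k=1}^{K} \lambda_k \cdot \mathrm{Conf}_k(i, j) \cdot \mathrm{seg}_k(i, j),
\qquad
\mathrm{seg}_k(i, j) \in \{0, 1\}
```

Under this reading, a span proposed by several segmentation results with high confidence accumulates a high score, while a span proposed by none of them scores zero.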
14
Decoding
  • C_ij and F(i, j) are represented in a lattice.
  • The best sequence is found by a dynamic programming
    algorithm.
  • It searches for the sequence of words with the maximum
    product of their scores.
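
The lattice search can be sketched as a standard dynamic program over character positions. This is a minimal sketch under assumptions: the lattice is given as a hypothetical dict `score` mapping an inclusive span (i, j) to a positive score F(i, j), and spans missing from the dict are treated as not being in the lattice.

```python
def decode(n, score):
    """Find the word sequence covering characters 0..n-1 with the maximum
    product of span scores.

    score: {(i, j): F(i, j)} for inclusive spans, F > 0 (hypothetical format).
    Returns (list of spans, product), or None if no path covers the sentence.
    """
    best = [0.0] * (n + 1)   # best[p]: max product over the first p characters
    back = [None] * (n + 1)  # back[p]: start position of the last word
    best[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            cand = best[start] * score.get((start, end - 1), 0.0)
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    if n > 0 and back[n] is None:
        return None
    spans, p = [], n
    while p > 0:  # follow back-pointers from the end of the sentence
        spans.append((back[p], p - 1))
        p = back[p]
    return spans[::-1], best[n]


lattice = {(0, 0): 0.5, (1, 1): 0.5, (2, 2): 0.5,
           (0, 1): 0.9, (1, 2): 0.6, (0, 2): 0.4}
print(decode(3, lattice))
```

For long sentences, summing log scores instead of multiplying raw scores gives the same best path and avoids floating-point underflow.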

15
Training parameter λ
  • Initial point λ_l (1 ≤ l ≤ K): A point in the
    K-dimensional parameter space is randomly
    selected.
  • The parameters λ_l are optimized through an iterative
    process.
  • In each step, only one parameter is optimized,
    while all other parameters are kept fixed.
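
The one-parameter-at-a-time procedure can be sketched as a coordinate search. This is an illustration under assumptions, not the authors' training code: the per-parameter optimization is replaced by a simple search over a fixed grid of candidate values, and `objective` is a hypothetical callable that scores a weight vector (in the setting of these slides it would be a measure such as segmentation F-measure on the training sentences).

```python
import random


def coordinate_search(objective, k, candidates, rounds=10, seed=0):
    """Optimize k weights one at a time, keeping the others fixed.

    objective: callable taking a list of k weights, higher is better
    (hypothetical). candidates: grid of values tried for each weight
    (a simplification of the per-parameter optimization step).
    """
    rng = random.Random(seed)
    weights = [rng.choice(candidates) for _ in range(k)]  # random initial point
    best = objective(weights)
    for _ in range(rounds):
        improved = False
        for l in range(k):  # optimize parameter l only
            for value in candidates:
                trial = weights[:l] + [value] + weights[l + 1:]
                s = objective(trial)
                if s > best:
                    best, weights, improved = s, trial, True
        if not improved:
            break  # no single-parameter change helps any more
    return weights, best


# Toy objective with a known optimum at (0.3, 0.7).
grid = [i / 10 for i in range(11)]
toy = lambda w: -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2)
print(coordinate_search(toy, 2, grid))
```

Because only one coordinate moves at a time, the search can stop at a local optimum; restarting from several random initial points is a common remedy.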

16
Experiment setting
  • Experimental data: NTCIR-10 Chinese-English
    parallel patent description sentences
  • Annotation set: 300 randomly selected sentence
    pairs
    • 150 sentences used for training the lattice
      parameters
    • 150 sentences used for evaluation

17
Evaluation
  • We conduct evaluations from two aspects:
    • Evaluation (1): accuracy of Chinese word
      segmentation (F-measure)
    • Evaluation (2): translation quality of the MT system
      (BLEU)
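
Segmentation precision, recall and F-measure are conventionally computed by comparing word spans between the gold and predicted segmentations of the same character string. The sketch below assumes that convention; the helper names are hypothetical and the slides do not specify the scoring script used.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def prf(gold_words, pred_words):
    """Precision, recall and F-measure of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    if not gold or not pred:
        return 0.0, 0.0, 0.0
    correct = len(gold & pred)  # words with both boundaries right
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Latin letters stand in for Chinese characters.
print(prf(["ab", "c", "de"], ["a", "b", "c", "de"]))
```

A word counts as correct only when both of its boundaries match the gold segmentation, so splitting one gold word into two predicted words costs both precision and recall.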

18
Evaluation (1)
Accuracy of Chinese word segmentation
Method                               Precision   Recall    F-measure
Bilingually motivated segmenter      73.1312     61.4480   66.7825
1-best of CRF segmenter (baseline)   90.2439     90.7710   90.5067
Linear model (our approach)          91.6650     91.8614   91.7631
19
Evaluation (2)
  • We develop a phrase-based SMT system with Moses, using
    different Chinese segmenters:
    • 1-best of CRF segmenter (baseline)
    • Linear model (our approach)
    • Stanford Chinese segmenter
    • NLPIR Chinese segmenter

20
Evaluation (2) result
SMT using different Chinese segmenters   BLEU
1-best of CRF segmenter (baseline)       30.53
Linear model (our approach)              31.15
Stanford Chinese segmenter               30.98
NLPIR Chinese segmenter                  30.56
  • Our approach improves BLEU by 0.62 over the
    baseline.
  • Our approach also outperforms the two popular
    segmenters (Stanford and NLPIR).

21
Result Analysis
CRF 1-best result   Corresponding English word   Linear-model result
? ??                Glycine                      ???
?? ?                Polymer                      ???
???                 Carbon atoms                 ? ??
????                Iodine complex               ? ???
? ???               Antimicrobial                ????
22
Conclusion
  • We propose a linear model that combines multiple
    segmentation results from two segmenters to
    augment domain adaptation:
    • one based on n-gram statistical features of a large
      Chinese raw corpus;
    • the other based on bilingually motivated
      features of a parallel corpus.
  • The experimental results show that both the F-measure
    of CWS and the BLEU score of SMT are
    improved.

23
Thanks! Q&A