Title: Chinese Word Segmentation Method for Domain-Special Machine Translation
1Chinese Word Segmentation Method for
Domain-Special Machine Translation
- Su Chen, Zhang Yujie, Guo Zhen, Xu Jinan
- Beijing Jiaotong University
2Outline
Motivation
Method of combining multiple segmentation results
Experiment Evaluation
Conclusion
3Motivation 1/2
- CTB training data vs. CTB test data: OOV rate 3.47
- CTB training data vs. Science annotated data: OOV rate 22.4
Training data    Test data      F-measure
News domain      News domain    97.62
News domain      Science        83.89
4Motivation 2/2
- Background: Development of a domain-specific Chinese-English machine translation system.
- Problem: The accuracy of Chinese Word Segmentation (CWS) often decreases on large amounts of training text.
- This causes many errors in translation knowledge extraction.
- It therefore seriously affects translation quality.
5Our resolution
- Related work
- Domain-Adapted Chinese Word Segmentation Based on Statistical Features
- Previous work generally adopts only the 1-best result and ignores the lower-ranking results.
- Bilingually motivated domain-adapted word segmentation
- Many characters are aligned to NULL, which decreases the accuracy of Chinese segmentation.
- Our goal: Extend these methods to augment domain adaptation of CWS
6Our approach
- We propose a linear model to combine multiple Chinese word segmentation results of the two segmenters to augment domain adaptation.
- Segmenter based on n-gram features of Chinese raw corpus.
- Segmenter based on bilingually motivated features.
7Framework
[Figure: framework diagram. Components: Chinese raw corpus; Annotated corpus; Training CRF model; CRF segmenter; Segmentation result; Chinese sentences; English Sentences; Word alignment result; Bilingual segmenter; Result; Linear-model for combining multiple results; Results]
8[Figure: CRF segmenter pipeline. Components: Raw corpus; Annotated corpus; Test data; Extracting statistical features; N-gram statistical features; Training CRF model; CRF Decoding; CRF segmenter; Segmentation result]
9CRF segmenter
- Exploring statistical features of large-scale domain-specific Chinese raw corpus
- N-gram frequency feature
- N-gram AV (Accessor Variety) feature
- Output of CRF models
- N-best list of segmentation results
- Corresponding probability scores
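The two statistical features named above can be illustrated with a minimal Python sketch. It assumes the common definition of accessor variety, AV(s) = min(number of distinct left neighbours of s, number of distinct right neighbours of s); the function name `ngram_stats`, the boundary markers `^` and `$`, and the choice to treat every sentence boundary as the same neighbour are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter, defaultdict


def ngram_stats(sentences, max_n=3):
    """Return (freq, av) for every character n-gram of length 1..max_n.

    freq[g] is the raw count of n-gram g in the corpus.
    av[g] is min(#distinct left neighbours, #distinct right neighbours).
    Sentence boundaries count as a neighbour (assumption for this sketch).
    """
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in sentences:
        # Hypothetical boundary markers; they must not occur in the corpus.
        padded = "^" + sent + "$"
        for n in range(1, max_n + 1):
            # i runs over start positions of n-grams inside the real sentence.
            for i in range(1, len(padded) - n):
                gram = padded[i:i + n]
                freq[gram] += 1
                left[gram].add(padded[i - 1])
                right[gram].add(padded[i + n])
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return freq, av


# Toy corpus of three "sentences" (Latin letters stand in for characters).
freq, av = ngram_stats(["abc", "abd", "xbc"], max_n=2)
```

In a CRF feature template these raw values would typically be discretized into a few bins before use; how the slides' system does this is not stated.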
10Observation
- Some erroneous segmentations in the 1-best result are segmented correctly in the lower-ranking results.
- We intend to utilize the correct parts within the 10-best results and the corresponding probability scores.
11Bilingual segmenter
- The boundaries of Chinese words can be inferred from a parallel corpus.
- Marked word boundaries in English sentences.
- Alignment from English word to Chinese word.
12Inference step
- Conduct word alignments using GIZA++, regarding each character of the Chinese sentence as one word.
- For each alignment a_i = <e_i, C>, if the characters in C are consecutive in the sentence:
- Take C as a word
- Calculate its confidence score (refer to paper)
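The inference step above can be sketched in Python as follows. The input format (a list of `(english_index, chinese_char_index)` links) and the function name `infer_words` are assumptions for illustration; the confidence score is left out because the slides defer its definition to the paper.

```python
from collections import defaultdict


def infer_words(chinese_chars, alignment):
    """Infer candidate Chinese words from character-level alignments.

    chinese_chars: list of single characters of one Chinese sentence.
    alignment: iterable of (english_index, chinese_char_index) links
               (hypothetical format for this sketch).
    Returns a list of (start, end, word) with half-open span [start, end).
    """
    linked = defaultdict(set)
    for e_idx, c_idx in alignment:
        linked[e_idx].add(c_idx)

    words = []
    for e_idx in sorted(linked):
        positions = sorted(linked[e_idx])
        start, end = positions[0], positions[-1] + 1
        # Keep C only if its characters are consecutive in the sentence.
        if end - start == len(positions):
            words.append((start, end, "".join(chinese_chars[start:end])))
    return words


# English word 0 links to characters 0 and 1 (consecutive, kept);
# English word 1 links to characters 2 and 4 (not consecutive, dropped);
# English word 2 links to character 3 (kept).
candidates = infer_words(list("ABCDE"), [(0, 0), (0, 1), (1, 2), (1, 4), (2, 3)])
```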
13Linear model
- Calculate the score of C_ij being a word by combining multiple segmentation results
λ_k (1 ≤ k ≤ K) are the weights of the K segmentation results.
F(i, j) denotes the score of the characters from i to j being a word.
Conf_k(i, j) (1 ≤ k ≤ K) is the confidence score of the kth segmentation result.
seg_k(i, j) (1 ≤ k ≤ K) is a two-valued function.
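The formula itself is not preserved on this slide. One form consistent with the definitions above would be F(i, j) = Σ_k λ_k · Conf_k(i, j) · seg_k(i, j), where seg_k(i, j) is assumed to be 1 when the kth result contains the span (i, j) as a word and 0 otherwise. The Python sketch below implements that assumed form; the data layout (one dict of spans per segmentation result) and the function names are hypothetical.

```python
def combined_score(i, j, results, weights):
    """Assumed linear model: sum over k of weight_k * Conf_k(i, j) * seg_k(i, j).

    results: list of K dicts, each mapping a span (i, j) to its confidence
             score in that segmentation result (hypothetical layout).
    weights: list of K floats.
    """
    total = 0.0
    for weight, spans in zip(weights, results):
        if (i, j) in spans:          # seg_k(i, j) = 1
            total += weight * spans[(i, j)]
    return total                      # spans absent from a result add 0


def build_lattice(results, weights):
    """Score every span that appears in at least one segmentation result."""
    all_spans = set().union(*results)
    return {span: combined_score(span[0], span[1], results, weights)
            for span in all_spans}


# Two toy results over a 3-character sentence, spans are half-open [i, j).
results = [{(0, 1): 0.9, (1, 3): 0.6}, {(0, 3): 0.8}]
lattice = build_lattice(results, weights=[0.5, 2.0])
```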
14Decoding
- C_ij and F(i, j) are represented in a lattice.
- The best sequence is found by a dynamic programming algorithm.
- Search for the sequence of words with the maximum product of their scores.
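The decoding step can be sketched as a standard dynamic program over the lattice. The sketch below assumes the lattice is a dict from half-open spans (i, j) to non-negative scores F(i, j) and returns the span sequence covering the whole sentence with the largest product of scores; the function name `decode` and the lattice layout are assumptions for illustration.

```python
def decode(n, lattice):
    """Find the word sequence with the maximum product of scores.

    n: number of characters in the sentence.
    lattice: dict mapping half-open spans (i, j) to scores F(i, j).
    Returns a list of spans covering [0, n), or None if no path exists.
    """
    best = [None] * (n + 1)   # best[j]: best product for the prefix [0, j)
    back = [None] * (n + 1)   # back[j]: start of the last word on that path
    best[0] = 1.0
    for j in range(1, n + 1):
        for i in range(j):
            score = lattice.get((i, j))
            if score is None or best[i] is None:
                continue      # no such edge, or prefix [0, i) is unreachable
            candidate = best[i] * score
            if best[j] is None or candidate > best[j]:
                best[j] = candidate
                back[j] = i
    if best[n] is None:
        return None
    # Follow back-pointers from the end of the sentence to the start.
    spans, j = [], n
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]


path = decode(3, {(0, 1): 0.9, (1, 3): 0.8, (0, 3): 0.5})
```

For long sentences a sum of log scores is a common substitute for the raw product, since it avoids floating-point underflow while selecting the same path when all scores are positive.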
15Training parameter λ
- Initial point λ_l (1 ≤ l ≤ K): a point in the K-dimensional parameter space is randomly selected.
- The parameters λ_l are optimized through an iterative process.
- In each step, only one parameter is optimized, while keeping all other parameters fixed.
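The training procedure above is a form of coordinate-wise search. The Python sketch below shows one way to realize it, assuming each parameter is optimized by trying a fixed grid of candidate values and that the objective is some quality measure supplied by the caller (for example, F-measure on the training sentences). The grid, the stopping rule, and the names `coordinate_search` and `objective` are assumptions; the slides do not specify them.

```python
import random


def coordinate_search(objective, k, candidates, sweeps=5, seed=0):
    """Optimize k weights one at a time, keeping the others fixed.

    objective: function taking a list of k weights, returning a score
               to maximize (hypothetical; supplied by the caller).
    candidates: values tried for each weight (assumed grid search).
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(k)]   # random initial point
    best = objective(weights)
    for _ in range(sweeps):
        improved = False
        for l in range(k):                        # optimize only weight l
            for value in candidates:
                trial = weights[:l] + [value] + weights[l + 1:]
                score = objective(trial)
                if score > best:
                    weights, best = trial, score
                    improved = True
        if not improved:                          # converged: no weight moved
            break
    return weights, best


# Toy objective with a known maximum at (0.3, 0.7).
grid = [i / 10 for i in range(11)]
weights, score = coordinate_search(
    lambda w: -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2), k=2, candidates=grid)
```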
16Experiment setting
- Experimental data: NTCIR-10 Chinese-English parallel patent description sentences
- Annotation set: 300 randomly selected sentence pairs.
- 150 sentences used for training the lattice parameters.
- 150 sentences used for evaluation.
17Evaluation
- We conduct evaluations from two aspects
- Evaluation (1): accuracy of Chinese word segmentation (F-measure)
- Evaluation (2): translation quality of MT system (BLEU)
18Evaluation(1)
Method                                Precision   Recall    F-measure
Bilingually motivated segmenter       73.1312     61.4480   66.7825
1-best of CRF segmenter (baseline)    90.2439     90.7710   90.5067
Linear-model (our approach)           91.6650     91.8614   91.7631
Accuracy of Chinese word segmentation
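For reference, word-level precision, recall, and F-measure for segmentation are usually computed by comparing word spans between the gold and predicted segmentations. The Python sketch below assumes that standard span-matching definition; the function names are hypothetical and the slides do not state which scoring script was used.

```python
def to_spans(words):
    """Convert a list of words into a set of half-open character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_sentences, pred_sentences):
    """Word-level precision, recall and F-measure over a set of sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = to_spans(gold), to_spans(pred)
        correct += len(g & p)      # words with identical start and end
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# One toy sentence: only the first word matches in both segmentations.
p, r, f = prf([["ab", "c", "de"]], [["ab", "cd", "e"]])
```

The table rows are consistent with F = 2PR / (P + R); for the baseline, 2 × 90.2439 × 90.7710 / (90.2439 + 90.7710) ≈ 90.5067.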
19Evaluation(2)
- We develop a phrase-based SMT system with Moses, using different Chinese segmenters
- 1-best of CRF segmenter (baseline)
- Linear model (our approach)
- Stanford Chinese segmenter
- NLPIR Chinese segmenter
20Evaluation (2) result
SMT using different Chinese segmenter    BLEU
1-best of CRF segmenter (baseline)       30.53
Linear model (our approach)              31.15
Stanford Chinese segmenter               30.98
NLPIR Chinese segmenter                  30.56
- Our approach increased BLEU by 0.62 compared to the baseline.
- Our approach also performs better than the two popular segmenters.
21Result Analysis
CRF 1-best result    Corresponding English word    Linear-model result
? ??                 Glycine                       ???
?? ?                 Polymer                       ???
???                  Carbon atoms                  ? ??
????                 Iodine complex                ? ???
? ???                Antimicrobial                 ????
22Conclusion
- We propose a linear model to combine multiple segmentation results from two segmenters to augment domain adaptation.
- One is based on n-gram statistical features of a large Chinese raw corpus.
- The other is based on bilingually motivated features of a parallel corpus.
- The experimental results show that both the F-measure of the CWS result and the BLEU score of SMT are improved.
23Thanks! Q&A