Chinese Word Segmentation Method for Domain-Special Machine Translation

1
Chinese Word Segmentation Method for
Domain-Special Machine Translation

Su Chen, Zhang Yujie, Guo Zhen, Xu Jinan
Beijing Jiaotong University

2
Outline
Motivation
Method of combining multiple segmentation results
Experiment Evaluation
Conclusion
3
Motivation 1/2
Training data   Test data     F-measure
News domain     News domain   97.62
News domain     Science       83.89

Training data: CTB training data
News-domain test data: CTB test data, OOV 3.47
Science test data: science annotated data, OOV 22.4
4
Motivation 2/2

Background: Development of a domain-specific
Chinese-English machine translation system.
Problem: The accuracy of Chinese Word Segmentation
(CWS) often decreases on the large amounts of
training text.
This causes many errors in translation knowledge extraction
and therefore seriously affects translation quality.

5
Our resolution

Related work
Domain-adapted Chinese word segmentation based on
statistical features:
Previous work generally adopts only the 1-best result
and ignores the lower-ranking results.
Bilingually motivated domain-adapted word
segmentation:
Many characters are aligned to NULL, which
decreases the accuracy of Chinese segmentation.
Our goal: Extend these methods to augment domain
adaptation of CWS.

6
Our approach

We propose a linear model that combines multiple
Chinese word segmentation results from two
segmenters to augment domain adaptation:
A segmenter based on n-gram features of a Chinese raw
corpus.
A segmenter based on bilingually motivated
features.

7
Framework
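[Framework diagram. Recoverable labels: the Chinese raw corpus and the annotated corpus feed the training of the CRF model, and the CRF segmenter produces a segmentation result for the Chinese sentences; the English sentences and the word alignment result feed the bilingual segmenter, which produces its own result; a linear model for combining multiple results yields the final results.]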
8
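[CRF segmenter diagram. Recoverable labels: n-gram statistical features come from the raw corpus; statistical features are extracted for the annotated corpus and for the test data; the annotated corpus is used for training the CRF model; the test data goes through CRF decoding; the CRF segmenter outputs the segmentation result.]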
9
CRF segmenter

Exploring statistical features of large-scale
domain-specific Chinese raw corpus
N-gram frequency feature
N-gram AV (Accessor Variety) feature
Output of CRF models
N-best list of segmentation results
Corresponding probability scores
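
The two statistics named on this slide can be sketched in Python. This is a minimal illustration, not the authors' feature extractor: Accessor Variety is taken here in its common form, the smaller of the number of distinct characters seen immediately before and immediately after an n-gram. The function name `ngram_statistics`, the `max_n` limit, and the single `<s>` / `</s>` boundary markers are assumptions made for the example; how the values are turned into CRF features is not shown in the slides.

```python
from collections import defaultdict


def ngram_statistics(sentences, max_n=3):
    """Return (frequency, accessor variety) dicts for all n-grams up to max_n.

    Assumption: sentence boundaries are represented by one "<s>" and one
    "</s>" marker, which is a simplification chosen for this sketch.
    """
    freq = defaultdict(int)
    left = defaultdict(set)   # distinct characters seen just before each n-gram
    right = defaultdict(set)  # distinct characters seen just after each n-gram
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                left[gram].add(sent[i - 1] if i > 0 else "<s>")
                right[gram].add(sent[i + n] if i + n < len(sent) else "</s>")
    # AV is the smaller of the left and right context counts.
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return dict(freq), av


freq, av = ngram_statistics(["abcab", "cabd"], max_n=2)
print(freq["ab"], av["ab"])
```

An n-gram that is frequent and appears in many different contexts (high AV) is more likely to be a word, which is why both values are useful signals on unannotated domain text.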

10
Observation

Some erroneous segmentations in 1-best result are
segmented correctly in the low-ranking results.
We intend to utilize correct parts within the
10-best results and the corresponding probability
scores.

11
Bilingual segmenter

Chinese word boundaries can be inferred from a
parallel corpus:
Word boundaries are already marked in the English sentences.
Alignments link each English word to a Chinese word.

12
Inference step

Conduct word alignment using GIZA++, treating
each character of the Chinese sentence as one word.
For each alignment a_i = <e_i, C>, if the characters
in C are consecutive in the sentence:
Take C as a word.
Calculate its confidence score (refer to paper).
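
The consecutiveness check in the inference step can be sketched as follows. This is an illustration under stated assumptions: the alignments are assumed to be already parsed into a dict from English word index to aligned Chinese character positions (no aligner output format is parsed here), the function name `candidate_words` is hypothetical, and the confidence score is left out because its definition is only given in the paper. The example sentence is invented for the demonstration.

```python
def candidate_words(chinese_chars, alignments):
    """Return (start, end, word) for every aligned character set that is consecutive.

    chinese_chars: list of single characters of one Chinese sentence.
    alignments: dict mapping an English word index to the list of Chinese
                character positions aligned to it (assumed input format).
    Spans are half-open: characters start .. end-1.
    """
    words = []
    for _, positions in sorted(alignments.items()):
        if not positions:
            continue  # nothing aligned to this English word
        pos = sorted(set(positions))
        # Consecutive means the positions cover a gap-free range.
        if pos[-1] - pos[0] + 1 == len(pos):
            start, end = pos[0], pos[-1] + 1
            words.append((start, end, "".join(chinese_chars[start:end])))
    return words


chars = list("甘氨酸溶液")  # hypothetical sentence: "glycine solution"
print(candidate_words(chars, {0: [0, 1, 2], 1: [3, 4]}))
```

Character sets with gaps are skipped, so an English word aligned to positions 0 and 2 only would produce no candidate.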

13
Linear model

Calculate the score of C_ij being a word by combining
multiple segmentation results.

λ_k (1 ≤ k ≤ K) are the weights of the K segmentation results.
F(i, j) denotes the score of the characters from i
to j being a word.
Conf_k(i, j) (1 ≤ k ≤ K) is the confidence score of the
kth segmentation result.
seg_k(i, j) (1 ≤ k ≤ K) is a two-valued function.
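
The equation itself did not survive in this transcript. One plausible form that is consistent with the definitions above is a weighted sum over the K segmentation results. It is written here as an assumption, not as the authors' exact formula: the weight symbol λ_k and the reading of seg_k(i, j) as a 0/1 indicator of whether C_ij appears as a word in the kth result are both assumed.

```latex
% Assumed form, reconstructed from the symbol definitions on this slide.
F(i, j) = \sum_{k=1}^{K} \lambda_k \, \mathrm{seg}_k(i, j) \, \mathrm{Conf}_k(i, j),
\qquad
\mathrm{seg}_k(i, j) =
\begin{cases}
1 & \text{if } C_{ij} \text{ is a word in the } k\text{th segmentation result} \\
0 & \text{otherwise}
\end{cases}
```

Under this reading, a span supported by several results, or by one result with a high confidence score and a large weight, receives a high score F(i, j).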
14
Decoding

C_ij and F(i, j) are represented in a lattice.
The best sequence is found by a dynamic programming
algorithm.
It searches for the sequence of words with the maximum product
of their scores.
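
The lattice search described above can be sketched as a standard dynamic program. This is a minimal illustration, not the authors' decoder: spans are written as half-open pairs (i, j) covering characters i .. j-1, scores are assumed to be positive, and the product is maximized through a sum of logarithms. The function name `best_segmentation` and the example scores are made up for the demonstration.

```python
import math


def best_segmentation(n, scores):
    """Find the word sequence covering characters 0..n-1 with maximum score product.

    scores: dict mapping a half-open span (i, j) to its score F(i, j) > 0.
    Returns a list of spans, or None if no path covers the whole sentence.
    """
    best = [-math.inf] * (n + 1)  # best[j]: best log-score of any path ending at j
    back = [None] * (n + 1)       # back[j]: start of the last word on that path
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            f = scores.get((i, j), 0.0)
            if f <= 0.0 or best[i] == -math.inf:
                continue  # span not in the lattice, or its start is unreachable
            cand = best[i] + math.log(f)
            if cand > best[j]:
                best[j], back[j] = cand, i
    if n > 0 and back[n] is None:
        return None
    spans, j = [], n
    while j > 0:  # follow back-pointers from the end of the sentence
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]


lattice = {(0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5,
           (0, 2): 0.9, (1, 3): 0.8, (0, 3): 0.3}
print(best_segmentation(3, lattice))
```

In this toy lattice the path (0, 2), (2, 3) has product 0.45, which beats 0.4 for (0, 1), (1, 3) and 0.3 for the single span (0, 3).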

15
Training parameter λ

Initial point λ_l (1 ≤ l ≤ K): a point in the
K-dimensional parameter space is randomly
selected.
The parameters λ_l are optimized through an iterative
process.
In each step, only one parameter is optimized,
while all other parameters are kept fixed.
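
The one-parameter-at-a-time procedure can be sketched as a coordinate search. This is an illustration under assumptions, not the authors' training code: the objective is passed in as a function (in the setting of these slides it would presumably be segmentation accuracy on the training sentences), each weight is chosen from a fixed candidate grid rather than by a line search, and the names `coordinate_search`, `candidates`, and `rounds` are hypothetical.

```python
import random


def coordinate_search(objective, k, candidates, rounds=5, seed=0):
    """Maximize objective(weights) by changing one weight at a time.

    objective: function taking a list of k weights and returning a score.
    candidates: values tried for each weight (an assumed discretization).
    """
    rng = random.Random(seed)
    weights = [rng.choice(candidates) for _ in range(k)]  # random initial point
    best = objective(weights)
    for _ in range(rounds):
        improved = False
        for l in range(k):            # optimize parameter l, others fixed
            for value in candidates:
                trial = weights[:l] + [value] + weights[l + 1:]
                score = objective(trial)
                if score > best:
                    weights, best = trial, score
                    improved = True
        if not improved:              # stop when a full pass changes nothing
            break
    return weights, best


# Toy objective with a known optimum at (0.3, 0.7), for demonstration only.
toy = lambda w: -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2)
grid = [i / 10 for i in range(11)]
print(coordinate_search(toy, 2, grid))
```

Because the search only ever moves along one axis, it can stop at a local optimum; restarting from several random initial points is a common remedy.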

16
Experiment setting

Experimental data: NTCIR-10 Chinese-English
parallel patent description sentences.
Annotation set: 300 randomly selected sentence
pairs.
150 sentences are used for training the lattice
parameters.
150 sentences are used for evaluation.

17
Evaluation

We conduct evaluations from two aspects:
Evaluation (1): accuracy of Chinese word
segmentation (F-measure)
Evaluation (2): translation quality of the MT system
(BLEU)

18
Evaluation(1)
Method                              Precision   Recall    F-measure
Bilingually motivated segmenter     73.1312     61.4480   66.7825
1-best of CRF segmenter (baseline)  90.2439     90.7710   90.5067
Linear model (our approach)         91.6650     91.8614   91.7631
Accuracy of Chinese word segmentation (%)
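
For reference, word-level precision, recall, and F-measure for segmentation are usually computed by comparing word spans between the gold and the predicted segmentation. The sketch below assumes that standard definition; the slides do not state the scoring script used, and the function names and the example sentence are illustrative only.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(gold_words, pred_words):
    """Precision, recall, and F-measure of a predicted segmentation."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries on both sides
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)


# Hypothetical example: the prediction wrongly splits the first word.
print(prf(["甘氨酸", "溶液"], ["甘", "氨酸", "溶液"]))
```

The F-measure column of the table is consistent with this harmonic mean: for the baseline row, `f_measure(90.2439, 90.7710)` agrees with the reported 90.5067 to within rounding.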
19
Evaluation(2)

We build a phrase-based SMT system with Moses, using
different Chinese segmenters:
1-best of CRF segmenter (baseline)
Linear model (our approach)
Stanford Chinese segmenter
NLPIR Chinese segmenter

20
Evaluation (2) result
SMT using different Chinese segmenters   BLEU
1-best of CRF segmenter (baseline)       30.53
Linear model (our approach)              31.15
Stanford Chinese segmenter               30.98
NLPIR Chinese segmenter                  30.56

Our approach improves BLEU by 0.62 over the
baseline.
Our approach also outperforms the
two popular segmenters (Stanford and NLPIR).

21
Result Analysis
CRF 1-best result   Corresponding English word   Linear-model result
? ??                Glycine                      ???
?? ?                Polymer                      ???
???                 Carbon atoms                 ? ??
????                Iodine complex               ? ???
? ???               Antimicrobial                ????
22
Conclusion

We propose a linear model that combines multiple
segmentation results from two segmenters to
augment domain adaptation:
One is based on n-gram statistical features of a large
Chinese raw corpus.
The other is based on bilingually motivated
features of a parallel corpus.
The experimental results show that both the F-measure
of the CWS result and the BLEU score of SMT are
improved.

23
Thanks! Q&A
