Title: Where do we stand? MT development, research, and deployment in Asia
1 Where do we stand? MT development, research, and deployment in Asia
- Key-Sun Choi (KAIST)
- AAMT
- http://www.asianlp.org/
- http://www.afnlp.org/
- http://korterm.org/
2 Contents
- China
- Japan
- India
- Malaysia
- Thailand
- Taiwan
- Korea
- UNL
- Associations related to MT
3 MT in China: 1980-1990s
- To translate scientific documents
- From Russian and Western languages
- Supported by the government
- No private companies in the early stage
- TRANS-STAR
- 30,000 words/hour on a 386 PC
- Basic dictionary includes 40,000 entries
- 10 specialized technical dictionaries, containing 350,000 entries
- Subject fields: computer, economics, telecommunication, ceramics, thermal power industry, printing machine industry, automobile/tractor industry, petroleum prospecting, geology, chemical industry
4 MT in China, Present: English-to-Chinese
- GAOLI
- Jointly developed by Beijing GAOLI Computer Co. Ltd. and the Linguistics Institute of CASS
- Basic lexical dictionary: 60,000 entries, in which the usage and grammatical function of every word are described in detail
- Translation accuracy: 80%
- Readability of translated text: 80-90%
- 863-IMT/EC
- By the Institute of Computer Technology, Academia Sinica
- Commercialized, with very good economic returns
5 MT in China, Present: Chinese-to-English
- SINO-TRANS
- By CSS (China National Software Technology Service Co.) in 1993
- Basic dictionary: 40,000 entries
- Two special-subject technical dictionaries: naval ships and boats (9,312 entries), rocket-gun (33,773 entries)
- Linguistic rules: 1,000 rules
6 MT in China, Present: English-to-Chinese terminology
- TONGYI system
- By the Tianjin DATONG computer software company
- Windows platform
- Different special-subject dictionaries
- a. Commonly used scientific terms: 200,000 entries
- b. Terms covering 22 different subjects (e.g. machine building, telecommunication, aviation, medicine): 3,000,000 entries
- Good market strategy and service
- Cooperation with enterprises
7 MT in China, Present: English-to-Chinese, Internet browsing, more user interface
- YIWANG
- By the SUNSHINE company of Shenzhen
- Highest translation speed: 100 sentences per second
- Internet browsing
- YIBA
- By the YAXINCHENG software technical company
- Three kinds of translation: online, automatic, interface
- Open to users to revise the dictionary and rules
- Rich special-subject dictionaries: 30 subjects (e.g. computer, telecommunication, medicine)
8 MT in China, Present: English-to-Japanese
- E-to-J
- By the JEC company in Beijing
- Technique of transformation from phrase tree (P-tree) to dependency tree (D-tree)
- Closely integrated with a word processor
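The P-tree to D-tree transformation mentioned for E-to-J can be illustrated with a minimal sketch. The slides do not say how the system performs the transformation, so everything below is an assumption for illustration: the head-rule approach, the PNode class, the HEAD_RULES table, and the toy sentence are hypothetical and are not the E-to-J method.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNode:
    """A phrase-tree (P-tree) node; leaves carry a word, inner nodes carry children."""
    label: str
    children: List["PNode"] = field(default_factory=list)
    word: Optional[str] = None


# Hypothetical head rules: for each phrase label, the child labels that may be its head.
HEAD_RULES = {"S": ["VP"], "VP": ["V", "VP"], "NP": ["N", "NP"], "PP": ["P"]}


def head_child(node: PNode) -> PNode:
    """Pick the head child using HEAD_RULES, falling back to the rightmost child."""
    for label in HEAD_RULES.get(node.label, []):
        for child in node.children:
            if child.label == label:
                return child
    return node.children[-1]


def head_word(node: PNode) -> str:
    """Follow head children down to a leaf and return its word."""
    if node.word is not None:
        return node.word
    return head_word(head_child(node))


def to_dependencies(node: PNode, deps: Optional[List[Tuple[str, str]]] = None):
    """Return (head, dependent) word pairs: each non-head child depends on the phrase head."""
    if deps is None:
        deps = []
    if node.word is not None:
        return deps
    head = head_child(node)
    for child in node.children:
        if child is not head:
            deps.append((head_word(node), head_word(child)))
        to_dependencies(child, deps)
    return deps


# Toy P-tree for "the dog chased a cat" (assumes no repeated words in the sentence).
tree = PNode("S", [
    PNode("NP", [PNode("Det", word="the"), PNode("N", word="dog")]),
    PNode("VP", [
        PNode("V", word="chased"),
        PNode("NP", [PNode("Det", word="a"), PNode("N", word="cat")]),
    ]),
])
deps = to_dependencies(tree)
```

Here deps holds the D-tree as head-dependent pairs, with the verb "chased" as the root governing "dog" and "cat". A real system would key dependencies on word positions rather than word strings, so that repeated words do not collide.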
9 MT in China, Present: Example-based MT experimental systems
- Japanese-Chinese EBMT
- Computer department of Qinghua University, 1996
- Corpus of aligned Japanese and Chinese sentences
- The example unit is the sentence
- The similarity rate is calculated on a word basis
- DAYA EBMT
- Harbin Polytechnic University
- Machine-aided translation system; the human factor is very important
- Corpus is aligned at the sentence level
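The idea of sentence-level examples with a word-based similarity rate can be sketched as follows. The slides do not give the actual similarity formula, so the Dice coefficient over word sets used here is an assumption. The function names and the toy example base are hypothetical, and the sentences are assumed to be already segmented into space-separated words.

```python
def word_similarity(a: str, b: str) -> float:
    """Dice coefficient over the word sets of two pre-segmented sentences."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 0.0  # two empty sentences: define similarity as 0 to avoid dividing by zero
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def best_example(query: str, examples):
    """Return the (source, target) pair whose source sentence is most similar to the query."""
    return max(examples, key=lambda pair: word_similarity(query, pair[0]))


# Toy example base: (source sentence, placeholder for its aligned translation).
examples = [
    ("I like tea", "<translation of example 1>"),
    ("She drinks coffee", "<translation of example 2>"),
]
match = best_example("I like green tea", examples)
```

The retrieved pair is only the starting point. An EBMT system, or a human translator in a machine-aided setting, still has to adapt the stored translation to the words that differ.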
10 MT in China: Government Funding, 1990s
- Hi-Tech 863 funding
- 863-IMT/EC system (English-Chinese)
- SUNSHINE YIWANG system.
- 905 Chinese Language Processing Project
- completed in 1998.
11 MT in China: Users' English Level
- Proportion of TONGYI MT software users by English level
- Higher level: 16.5%
- Middle level: 49.5%
- Lower level: 34.1%
- So MT software must be oriented to common people
12 MT in China: Potential Users
- Proportion of enterprise users of TONGYI MT software
- Small enterprises: 31.3%
- Medium-scale and large-scale enterprises: 68.7%
- So MT software must be oriented to large-scale and medium-scale enterprises,
- but the small enterprises that also have translation demand should not be ignored.
13 MT in China: Regional Distribution
- Regional distribution of MT software users
- Translation demand is concentrated in the big cities and developing regions
- Beijing 18.7%
- Liaoning 7.9%, Jiangsu 7.5%
- Zhejiang 6.5%, Hubei 6.5%, Shanghai 6.1%
- Sichuan 4.7%, Guangdong 4.7%
- Henan 3.3%, Heilongjiang 3.3%
- Hebei 2.8%, Shanxi 2.3%, Jilin 2.3%
- Yunnan 1.9%, Neimeng 1.5%, Gansu 1.4%
- Guizhou 0.5%, Anhui 0.5%
14 MT in China: Future and Strategies (1) Terminology Data Bank
- MT software combined with terminology data banks
- 1990: the sub-committee on computer-aided terminology of China was set up
- This sub-committee is attached to the State Language Commission (SLC) of China
- A series of national standards for terminology data banks
- Terminology data bank creation
- Chinese-English: since 1995, by ISTIC (Institute of Scientific and Technical Information of China)
- Remarkable data banks
15 MT in China: Future and Strategies (2) Language Corpus Processing
- Corpus construction
- Scale of 25 million Chinese characters (1999)
- Automatic segmentation of Chinese written text in the corpus (97.68%, closed test)
- Automatic phrase bracketing and syntactic annotation for the Chinese corpus
16 MT in China: Future and Strategies (3) Speech-to-speech translation
- Chinese speech into Chinese text
- The "SIDA-863A" system can recognize 398 basic Chinese syllables
- Recognition rate can reach 93%
- Response time is less than 0.1 second
- Input speed can reach 80 Chinese characters per minute
17 MT in China: Future and Strategies (4) Combined with OCR and Internet
- Internet MT
- SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc.
- The advantages of MT software on the Internet are:
- Higher translation speed, real-time translation
- Low price
- Large machine dictionary
- Possibility of adding new words
18 MT in China: New National Project
- 973 project from 2001
- Supported by the Chinese government
- For creative research in natural language processing, including machine translation
- Automatic speech-to-speech translation system (English-Chinese)
- Under development at the Institute of Automation, Academia Sinica
19 MT in China: Survey Source
- Prof. Feng, Zhiwei
- Secretary-general and deputy chairman of the sub-committee on computer-aided terminology of China, under the State Language Commission (SLC) of China
- Invited professor, KAIST (Sep/2001 - Aug/2002)
- Dr. Liu, Qun
- Institute of Computer Technology, Academia Sinica, Beijing
20 MT in Japan - 1
- More than 10 companies
- For English, Chinese, Korean
- Waiting for the new breakthrough
- Internet
- eLearning
- Co-work with special-domain related companies
- Technology transfer
- Collaboration tools are ready for the market
- For translators: collaboration workbench through the network
- User interface: well organized
21 MT in Japan - 2
- Leading Systems
- Cross-lingual patent retrieval
- Prime
- NTT/ALT
- Japanese-to-English
- Japanese-to-Malay
- Japanese-to-Chinese
- Speech Translation
- ATR C-Star
22 UNL in UN University
- Through Universal Networking Language
- With Hindi, Japanese, Persian, Indonesian-Malay, Thai, Chinese, Mongolian, Korean in the Asian region
- Other regions: major European languages and English
- Possible users
- ITU mail translation
23 MT in Malaysia
- No commercial product yet.
- But in academic sectors
- For application to
- Internet
- eLearning
- eCommerce
- Universiti Sains Malaysia
- Computer Aided Translation Unit
- Prof. Tang Enya Kong and Prof. Yusoff Zaharin
24 MT in India
- 18 constitutional languages with 10 different scripts
- Their script grammar and language grammars are quite similar
- They have 40 to 80 percent of their vocabulary in common
- Less than 5 percent of people can work in English
25 MT in India, 1990-2001: Government effort for IT
- TDIL (Technology Development for Indian Languages)
- 1990-1991
- Development of corpora, OCR, text-to-speech, and machine translation; standards for keyboard and internal code for information interchange
- 2000-2001
- Seven major initiatives
- Knowledge Resources, Knowledge Tools, Translation Support Systems, Human Machine Interface Systems, Localisation, Standardization, and Language Technology Human Resource Development
- Thirteen Resource Centres for Indian Language Technology Solutions (RC-ILTS) were supported, covering all 18 Indian languages
26 MT in India, Future: "Digital Unite and Knowledge for All"
- Indian Language Technology Vision 2010 has been prepared
- With the vision statement "Digital Unite and Knowledge for All"
- Growing popularity of the Internet
- Content creation, localisation, on-line gisting and summarisation, e-learning, and cross-lingual information retrieval are being promoted to ensure information access in cyberspace in Indian languages
- Source: Dr. Om Vikas
- Senior Director and Head, Computer Development Division, Ministry of Information Technology
27 MT in Thailand: Government, 1996
- IT-2000
- To build a national information infrastructure (NII)
- To invest in people, concentrating on transferring IT knowledge to their children
- To build a Government Information Network (GINET)
- Internet users in Thailand (2000): 2.3M/66M
Age        <10  10-14  15-19  20-29  30-39  40-49  50-59  60-69    70+  Total
Freq        18    124    261  1,238    572    187     32     27      2  2,461
Percent    0.7      5   10.6   50.3   23.2    7.6    1.3    1.1    0.1    100
- Most Thai Internet users know English and other Internet languages at a basic or low intermediate level
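The percentages in the age table follow directly from the frequencies, and the 2.3M/66M figure gives the overall share of Internet users. A short check, using only the numbers on the slide (the variable names are illustrative, and the last age label is assumed to mean 70 and over):

```python
# Age groups and frequencies as read from the slide.
ages = ["<10", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
freq = [18, 124, 261, 1238, 572, 187, 32, 27, 2]

total = sum(freq)

# Share of each age group, rounded to one decimal place as on the slide.
percent = [round(100 * f / total, 1) for f in freq]

# Internet users as a share of the population: 2.3 million out of 66 million.
penetration = round(100 * 2.3 / 66, 1)

for age, f, p in zip(ages, freq, percent):
    print(f"{age:>6} {f:>6,} {p:>6}")
print(f"Total respondents: {total:,}; Internet users in population: {penetration}%")
```

The recomputed shares agree with the slide, and the 20-29 group alone accounts for about half of the respondents.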
28 MT in Thailand: PARSIT
- Web-based Thai-English machine translation
- Since 1998, in cooperation with NEC (Japan)
- Very popular among Thai users
- Translates English to Thai with an accuracy of 60%
- 20 percent mistranslation might be due to differences in expressions, slang, and sentence structures
- http://www.suparsit.com/
- 300,000 hits/month
- 25,000 users/month
29 MT in Thailand: Dictionary
- A web-based dictionary: Lexitron
- Thai-English and English-Thai dictionary
30 MT in Thailand: Future
- To develop the PARSIT translation system
- Thai-to-English
- and to other target languages
- Other language programs, such as OCR research, speech research, and language research
- Thai full-text search engine
31 MT in Thailand: eASEAN
- eASEAN Plan
- Multilingual Machine Translation Proposal
- Thailand, Cambodia, Laos, Vietnam, Japan, Korea, English
- Source
- Dr. Virach Sornlertlamvanich, virach_at_nectec.or.th
- Dr. Prayong THITITHANANON (Rajabhat Institute Ubon Ratchathani, Thailand)
32 MT in Taiwan
- Prof. Su, Keh-Yih
- Machine translation
- localization
33 MT in Korea: Commercial Products
- English-to-Korean (Korean-to-English)
- Enguide: LNI Soft
- E-Tran2001: NLP Lab (Seoul National University)
- EZ Reader: Language and Computer
- ClickWorld: ClickQ
- Transmate: IBM Korea
- Japanese-to/from-Korean
- Unisoft
- Changmyung
- Translation Memory
- Localization companies develop for their own use
- ITI
34 MT in Korea: Test suite for E-to-K
- KAIST (http://korterm.kaist.ac.kr/ksurimal)
- Supported by the Ministry of Science and Technology
- Exhaustive evaluation
- A variety of sentences (5,000 from high school textbooks, 10,000 from Internet e-business sites)
- To identify the R&D direction
35 Problematic Part of System A
[Chart: problematic parts of System A, rated "serious" or "average". Group labels: Part of Speech, Partial Structure, Structural Part, Sentence Structure, Semantic Part. Category labels: Article, Pronoun, Noun, Adverb, Adjective, Verb, Preposition, Relatives, Conjunction, Mark, Infinitive, Tense, Gerund, Participle, Idioms, Number, Sentence type, Comparative, Subjunctive mood, Special Construction, Negation, Speech, Ellipsis, Lists, Insertion, Inversion, Multiple part of speech, Relation and Scope of modification, Phrase, Collocation (VN, VPrep., NV, NN, NPrep., Adv.N, Adv.Prep, Etc.), Ambiguous word (N, V), Idioms (NP, VP, PP, AP (adjective phrase), Sentence), Natural Expression, Different meaning between singular and plural.]
36 MT in Korea
- Caption/EK and KE - ETRI
- Real-time translation of captions in TV news
- CNN for English-Korean
- KBS for Korean-English
- Chinese-Korean MT
- Pohang University of Science & Tech.
- KAIST
- ETRI (Korean-to-Chinese)
- Companies: Konan tech.
- Japanese-Korean MT (technology transfer)
- Pohang University of Science & Tech.
37 Online language populations (June 2001)
- English 45%, Japanese 9.8%, Chinese 8.4%
- German 6.2%, Korean 4.7%, Spanish 4.5%
- Italian 3.6%, French 3.4%, Portuguese 2.5%
- Dutch 2%, Russian 1.9%
- GlobalReach. Global Internet Statistics (by Language).
- http://www.glreach.com/globstats/index.php3
38 Organizations in Asia
- AAMT
- AFNLP (Asian Federation of NLP Associations)
- http://asianlp.org/
- http://afnlp.org/
- Eafterm (East Asia Terminology Forum)
- http://eafterm.org/
- Language Resource Sharing and Management
- Jan/2001 workshop in Tokyo, invited by Japan
- Prof. Tanaka, Hozumi (Chair, GSK)
- Nov/2001 workshop at NLPRS-2001, Tokyo
- ISO TC37/SC4 (Language Resource Management), under organization
39 MT Status in Asia