Infrastructures for the Korean Language

1
Infrastructures for the Korean Language
  • Key-Sun Choi

2
Academic Society
  • SIG-Korean Language Computing under Korea
    Information Science Society
  • 300 members
  • Korea Information Society
  • linguistics oriented

3
KIBS: Korea Information Base and Systems
  • Purpose
  • To improve Korean Language Processing Technology
  • To promote Korean Software Industry
  • Planning phase (1993): targeted at Hangul word
    processors, machine translation and Korean
    linguistic research
  • 1995 - 1997 (Phase 1): word
  • Joint project of two ministries and industry
  • Ministry of Science & Technology, Ministry of
    Culture
  • 1998 - 2000 (Phase 2): sentence
  • Ministry of Science & Technology and industry
    only
  • To be evaluated in October 2000
  • 2001 - 2003 (Phase 3): discourse (not yet decided)
  • http://kibs.kaist.ac.kr/

4
King Sejong Project
  • Purpose
  • To promote Korean language research on the
    linguistics side
  • To prepare for language planning
  • for unification of South and North Korea
  • for international use of Korean
  • Sponsor: Ministry of Culture
  • Period: 1998 - 2007 (10 years)
  • Items
  • corpus, dictionary, internationalization,
    terminology, education, font, old Korean, old
    Chinese characters
  • http://www.sejong.or.kr/

5
KIBS Architecture
[Architecture diagram; recoverable labels: Terminology DB, User (Dictionary)]

6
KIBS Introduction
  • Title of Project
  • KIBS I: Integrated Korean Information Base
  • KIBS II: On Development of Deep-Level Processing
    and Quality Management Technology for Very Large
    Korean Information Base
  • Outline
  • Term: 1994.12.4 - 2004.9.30 (10 years)
  • Sponsor: Ministry of Science and Technology
  • Staff: 50 person/year

7
The Goal of the First Step
Development of an Integrated Environment
and Support Management System
  • Standard Module Interface
  • Corpus and Electronic Dictionary Development and
    Management System
  • Korean Part-of-Speech Tagging System
  • Korean Syntactic Tagging System
  • Korean/English Alignment System

Standardization of the Specification for the
Korean Information Base
  • Terminological Data Base Development and
    Management System
  • Standard Korean Input/Output Environment
  • Standardized Methodology for the Construction of
    a Balanced Corpus
  • Part-Of-Speech Transfer Dictionary Rules and an
    Example Package

Construction of the Korean Information Base
  • Tree-Tagged Corpus
  • Word-Level Narrative Speech Data Base
  • High-frequency handwritten Hangul scripts

8
The Goal of the Second Step
Development/Management System of Electronic
Dictionary for Sentence Analysis/Generation
(100,000 entries)
  • Syntactic Information Base for Syntactic
    Analysis/Generation
  • Semantic Information Base for Semantic
    Analysis/Generation
  • Additional Information on Language and GUI for
    Developing Applications

Terminology Dictionary and Development/Management
System
  • Terminology Entries
  • Domain-specific Corpus for Terminology Building
  • Sublanguage Analysis and Extraction of Terminology

Quality Management System for Language
Information Processing
  • Development/Management System for Information
    Base
  • Development of Integrated Management System for
    Distributed Resources

9
Development Tools
  • Korean Concordance Program (KCP)
  • Compound Noun Browser
  • Corpus Browser
  • Corpus Browser by Category
  • Automatic English-to-Korean Transliteration
    System (TLEK)
  • KAIST Ontology Browser
  • Korean Morphological Analyser
  • Korean Tagger
  • Korean Syntactic Analyser
  • Editing Support Tools for Electronic Dictionaries
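
To illustrate what a concordance tool such as the Korean Concordance Program listed above produces, here is a minimal keyword-in-context (KWIC) sketch. It is not the KCP implementation: the function name `kwic`, the `width` parameter and the toy pre-segmented token list are all assumptions made for illustration, and a real tool would run on morphologically analysed corpus text.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every match.

    `tokens` is an already segmented list of words; `width` is the
    number of tokens shown on each side (a hypothetical parameter).
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Toy, hand-segmented example ("Korean corpus and Korean dictionary").
tokens = ["한국어", "말뭉치", "와", "한국어", "사전"]
for left, key, right in kwic(tokens, "한국어", width=2):
    print(f"{left:>12} [{key}] {right}")
```

Each output row centres one occurrence of the keyword, which is the layout a concordance browser typically sorts and displays.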

10
Results Distribution
  • Major Results
  • The first (KIBS I): 1997.6 - present (80 sites)
  • Text corpus: 10 million word phrases
  • POS-tagged corpus: 1 million word phrases
  • Syntactic-structure-tagged corpus: 10 thousand
    sentences
  • TDMS, Speech DB samples, Handwritten character
    DB samples
  • The second (KIBS II): 1998.12 - present (140
    sites)
  • Raw corpus: 10 million word phrases; POS-tagged
    corpus: 200 thousand word phrases
  • The third (KIBS III): 2000 (pending)
  • Proper nouns: 10 thousand entries; compound nouns:
    20 thousand entries; verb sentence pattern
    dictionary: 3 thousand entries; ...
  • Plan to maintain and distribute ...

11
Integration of Electronic Dictionaries
  • Dictionaries: 420K entries in total (current
    estimate)
  • Machine Readable Dictionary (Hangul Society):
    200K entries
  • Compound Noun, Proper Noun Classification,
    Internal Semantic Structure: 50K entries
  • Searched Compound Noun, Proper Noun: open
  • Verb Subcategorization: 10K frames (K-J
    comparison)
  • Thesaurus (Korean-Japanese-Chinese-English), not
    very good quality: 150K entries
  • Usage from corpus for each sense
  • Functional words
  • Problems
  • Standardization of sense classification
  • Character codes for Korean, Japanese and Chinese
    (most important problem); conversion to Unicode
    now under way
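
The character-code problem above can be made concrete with a small sketch of converting legacy-encoded dictionary entries to Unicode. The slides do not say which legacy encodings were involved; EUC-KR, Shift_JIS and GB2312 are assumed here only as typical examples, and the helper name `to_unicode` is hypothetical.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes stored in a legacy national encoding into a str."""
    return raw.decode(source_encoding)


# Assumed legacy encodings, one per language (not named in the slides).
legacy_entries = {
    "euc_kr": "한글".encode("euc_kr"),
    "shift_jis": "日本語".encode("shift_jis"),
    "gb2312": "中文".encode("gb2312"),
}

# After decoding, entries from all three sources share one code space
# and can be stored together, for example as UTF-8.
merged = [to_unicode(raw, enc) for enc, raw in legacy_entries.items()]
utf8_blob = "\n".join(merged).encode("utf-8")
print(merged)
```

Before conversion, the same byte values can denote different characters depending on the national encoding, which is why a multilingual thesaurus cannot safely mix the raw bytes.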

12
Open through web
  • Corpus KWIC for Korean and Japanese
  • http://morph.kaist.ac.kr/kcp/
  • Korean morphological analysis service
  • http://morph.kaist.ac.kr/
  • By email: send a text file and receive its POS
    tagging in reply
  • Graphic editor/debugger for Korean morphology
  • Project Status
  • http://kibs.kaist.ac.kr/

13
KORTERM
Korea Terminology Center for Language and
Knowledge Engineering
http://korterm.org/ (English)
http://korterm.or.kr/ (Korean)
http://eafterm.org/ (East Asian Terminology)

14
Goals of KORTERM
  • Through World-Wide Terminology Collection and Its
    Standardization and Harmonization in Local
    Society,
  • Distribution, Publication and Application in
    Language and Knowledge Engineering are promoted.
  • Through Education and Consultation on Terminology
    R&D Methodology for Each Subject Field,
  • High-Quality, High-Reliability Terminology and Its
    Infrastructure and System are achieved.

Center of Terminology and Knowledge Engineering
15
Phases and Subjects of KORTERM
Phase 4 (2008 - ): Maintenance and Extension
Phase 3 (2004-2007): Operation
  • Continuous Extension and Management
  • Terminology Study Promotion
  • Distribution of Terminology Information Base
  • Continuous Terminology Extension and Management
  • Multi-lingual Terminology Integration
  • Terminology Collection (Humanity and Social
    Science)
  • Maintenance and Extension
  • Large-Scale Knowledge Base for Terminology
  • Terminology Education Curriculum Development
  • Application Product Development

Phase 2 (2001-2003): Value-Added Working System
  • Value-Added Terminology Integration
  • Terminology Collection (Extended S&T)
  • Extension & Maintenance (Industry Standards)
  • High-Quality Terminology
  • Application in Language Industry
  • Verification for High-Reliability and Distribution

Phase 1 (1998-2000): R&D Environment and Basic
Data Collection
  • Integration of Working Terminology
  • Terminology Collection (Basic S&T, Industry
    Standard, Economics)
  • Electronic Terminology (Publication)
  • R&D Environment (System Standardization)
  • Terminology Theory and Education Infrastructure

16
R&D (1)
  • Basic Data (Corpus)
  • Corpus for Each Subject Domain
  • Electronic Dictionary for Basic Vocabulary
  • Everyday Vocabulary consists of General
    Vocabulary and Everyday Terminology
  • Internationalization of Korean Language
  • South-North Korean Terminology Standardization,
    Korean language Input Methods
  • Korean Language Engineering
  • Standardized Term Use for Information Retrieval,
    Machine Translation and Document Classification

17
R&D (2)
  • Language Engineering
  • Information Retrieval
  • Effective Internet Information Creation and
    Information/Knowledge Acquisition
  • Multi-lingualism
  • Machine Translation
  • Efficient Information Generation through
    Terminology and Vocabulary Collection and
    Standardization
  • Wordprocessor
  • High Productivity by Spelling Correction,
    Summarization and Efficient Use.

18
R&D (3)
  • Language, Information and Terminology
  • Language Education
  • Technical Thinking and Technical Communication
  • Terminology-based Education
  • Language Study
  • Domain-specific Language Study

19
Terminology Sponsors
  • Support from Government, Organization and
    Industry according to each specialty
  • Ministry of Culture and Tourism (KORTERM Center
    Operation)
  • Ministry of Science and Technology (R&D Fund)
  • Ministry of Information and Telecommunication
    (R&D Fund)
  • Ministry of Diplomacy and Trade
  • Ministry of Industry and Resource
  • Ministry of Education
  • Korea Science and Technology Foundation (Event
    Support)

20
Task Configuration
[Diagram; labels in extracted order: R&D, Industry, Living, Communication; Use; Terminology Information Environment; Application; Application-Specific Dictionary; Language Education Adaptable to Student; Language Education Environment; Language Knowledge Product; Terminology Symbolization; Grid Size Controller; Terminology Access Standard Channel; R&D Environment; International Term Standard; Terminology Standard; Terminological Conceptual Space; Standardization, Harmonization; Terminology Base (Collection); Non-standards]

21
Large-Scale Speech/Language/Image DB Construction
and Evaluation
Supported by Ministry of Science and Technology
Two-Year Project (1999.10-2001.10)

22
Goals
Final Goal: Speech/Language/Image Evaluation
Standardization

Organization / Specification Standardization

Language
  • Working Group Organization
  • Survey and Planning
  • IR Test Suite and Evaluation Model Recommendation
  • MT Test Suite and Evaluation Model Recommendation

Speech
  • Sentence-unit Speech DB
  • Prosody for Speech Synthesis

Image
  • Image Attribute Format
  • Color-Lexical Entry
  • MPEG-7 Specification

Test Suite

Language
  • IR/QA: 90 queries / 200K documents; MT: 5,000
    sentences

Speech
  • Word-unit telephone speech DB 100 token 500

Image
  • Images: 300 kinds, with metadata

23
Question-Answering IR Test Suites
  • Test Suites for IR/QA
  • Documents
  • 207,067 records (370 MB)
  • Newspapers
  • Query Generation
  • 90 queries (derived from analysis of 300 quiz
    queries)
  • Queries for WH-questions and various other types
    of answers
  • for NLP problem solving
  • Relevant document sets that include the answer
  • built by using four commercial IR systems with
    16 methods
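
One common way to assemble a relevant document set from several retrieval systems is pooling: take the top-ranked documents from every system and method, then judge only that union. The slides do not say this is the procedure that was used, so the sketch below is an assumption; the run names, document ids and the `depth` parameter are hypothetical.

```python
def pool_candidates(runs, depth):
    """Union of the top-`depth` documents from each ranked run."""
    pooled = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return pooled


def precision_recall(retrieved, relevant):
    """Precision and recall of one ranked list against judged documents."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical runs: one ranked list per (system, method) pair.
runs = {
    "systemA/method1": ["d1", "d2", "d3"],
    "systemB/method1": ["d2", "d4", "d5"],
}
pool = pool_candidates(runs, depth=2)   # documents sent to human judges
relevant = {"d2", "d4"}                 # judged to contain the answer
print(sorted(pool), precision_recall(runs["systemA/method1"], relevant))
```

With four systems and 16 methods, the same loop would simply iterate over more entries in `runs`.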

24
English-Korean MT Test Suites
  • Type Classification: about 300 kinds
  • Test Sentences and Test Queries: 5,000 records
  • Extracted from textbooks and grammar books
    (1999-2000)
  • To be extracted from real usage such as the web
    and newspapers (2000-2001)
  • Evaluation by Yes/No Questions
  • Tested on 4 commercial English-Korean MT
    systems
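
Evaluation by yes/no questions lends itself to a simple tally: for each system and each sentence type, count the share of test sentences whose check was answered "yes". The sketch below assumes one boolean judgement per (system, type, sentence); the system names and type labels are invented placeholders, not the actual classification.

```python
from collections import defaultdict


def score_by_type(judgements):
    """Map (system, sentence_type) to the fraction of 'yes' judgements.

    `judgements` is an iterable of (system, sentence_type, passed).
    """
    totals = defaultdict(lambda: [0, 0])  # [yes count, total count]
    for system, sentence_type, passed in judgements:
        cell = totals[(system, sentence_type)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: yes / n for key, (yes, n) in totals.items()}


# Hypothetical judgements for two systems and two sentence types.
judgements = [
    ("MT-A", "relative clause", True),
    ("MT-A", "relative clause", False),
    ("MT-B", "relative clause", True),
    ("MT-A", "passive", True),
]
scores = score_by_type(judgements)
for (system, sentence_type), rate in sorted(scores.items()):
    print(system, sentence_type, f"{rate:.2f}")
```

Per-type scores make it possible to see which of the roughly 300 types a given system handles poorly, rather than reporting one overall number.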

25
MT Evaluation Workbench
26
Image Metadata Editor
Metadata input workbench using XML
27
Image Retrieval by Metadata
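
As a rough picture of XML-based image metadata and retrieval over it, the sketch below builds small XML records and filters them by a field value. The element names (`image`, `title`, `color`, `keyword`) are hypothetical; they are not taken from the project's editor or from the MPEG-7 schema.

```python
import xml.etree.ElementTree as ET


def make_record(image_id, title, colors, keywords):
    """Build one metadata record as an XML element (assumed layout)."""
    record = ET.Element("image", id=image_id)
    ET.SubElement(record, "title").text = title
    for name in colors:
        ET.SubElement(record, "color").text = name
    for word in keywords:
        ET.SubElement(record, "keyword").text = word
    return record


def search(records, tag, value):
    """Return ids of records having a <tag> element equal to `value`."""
    return [
        rec.get("id")
        for rec in records
        if any(elem.text == value for elem in rec.findall(tag))
    ]


records = [
    make_record("img001", "Sunset over the sea", ["red", "orange"], ["sea"]),
    make_record("img002", "Pine forest", ["green"], ["tree", "forest"]),
]
print(ET.tostring(records[0], encoding="unicode"))
print(search(records, "color", "green"))
```

Retrieval here is an exact match on one field; a real system would likely add controlled vocabularies, for instance for color terms.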
28
http://korterm.kaist.ac.kr/ksurimal/
     